Conversion of Text to Hypertext

McKNIGHT, Cliff: Hypertext in Context Chapter 5 - Creating Hypertext

Pros and Cons

The possible reasons why a text may be considered suitable for conversion to hypertext format include all those which apply to the creation of basic electronic texts. The advantages of electronic formats are most clearly seen in the improved access that they offer to texts. Thus, for example:

many readers can access the same text immediately and simultaneously via a network
lengthy texts can be readily searched, edited and incorporated into new documents if desired
version control can be managed with greater efficiency so that all readers can be confident that they are reading the most recent version of the text.

A hypertext implementation not only enjoys all the above advantages but also offers the increased convenience afforded by the dynamic linking of the constituent elements and a greatly increased flexibility of design.

For the publishing community, some significant problems will need to be addressed before electronic products are commonplace. For the publisher, the potential market is still relatively small, and there are problems in terms of:

incompatible hardware and software systems and, as yet,
no proven techniques for protecting electronic texts from unauthorised reproduction.

A major criticism of the Nelson vision has concerned the cost of the effort required to create hypertext versions of existing texts on a major scale. This view assumes that each text would need to be individually transformed, with each hypertext link uniquely specified. Although the effort required to convert texts into hypertext on any scale should not be underestimated, there are a number of reasons why the task may not be as great as it first appears. These include:

the differing frequencies with which texts are accessed by readers,
the rôle of machine-readable texts in the printing process,
the nature of the transformation that will be appropriate, and
the increasing use of generic text mark-up languages.

Text Access Frequencies

While the inclusion of all existing texts in the Docuverse might appear attractive, there is good reason to suggest that a limited subset might satisfy the requirements of many readers in certain disciplines and subject areas. This follows from the fact that much scientific and technical information has a limited ‘shelf life’, after which its importance gradually declines. The demand for scientific journal information clearly demonstrates this factor. The ADONIS project was a full-scale evaluation study of the parallel publishing of bio-medical journals on paper and CD-ROM (see Campbell and Stern, 1987). A pilot study described by Clarke (1981) showed that, in the chosen subject area, readers were primarily interested in material less than three years old.

Thus, for certain areas there may be little point in actually capturing archive material and this could effectively remove 100 years' production of books and journals from the ‘electronic queue’ for well-established disciplines. However, this is not to say that electronic bibliographic data should not be available.

Electronic Versions

It is now standard practice for printed text to be processed electronically at some stage, although there is enormous variation in the precise form which such processing takes. Many authors create documents on word processors or microcomputers. An increasing number of publishers are prepared to accept electronic versions of texts or camera-ready copy instead of manuscripts or typed drafts, and the majority of publishers/printers produce a final electronic version as input for a typesetting machine. Thus, for the majority of texts published today, an electronic version of some kind will have been created, from which a hypertext could be fashioned.

Many publishers claim to have tried accepting electronic versions of manuscripts from authors in the past but have abandoned the practice due to the creation of an increased rather than decreased handling requirement. Frequently, the technical incompetence of authors and the difficulties in catering for a wide variety of disk formats and word processor file types are given as major problems. However, there has been a considerable degree of standardisation in the personal computer market in recent years with a few operating systems (i.e., MS-DOS and Apple Macintosh), disk formats (5¼" and 3½") and word processor packages (MS-Word, Word Perfect) in a dominant position. In addition, virtually every current word processor is capable of generating files in the industry standard ASCII format.

Nature of the Transformation

There are sound reasons for suggesting that the content and structure of many documents may be largely maintained following conversion from text to hypertext, and preservation of these aspects would certainly reduce the labour costs of the conversion. There are many types of text which have a strongly regulated content, and conversion to electronic format would be no reason to make amendments.

Examples include:

industrial standards
guidelines
codes of practice
legal documents
technical documentation
historical records
religious documents

In terms of a text's structure, alterations can obviously vary from merely rearranging the sequence of the original macro-units (sections/chapters) to completely reorganising the material into a new structure (hierarchy, flat alphabetical sequence or net). Again, there are grounds for suggesting that some texts may be converted with relatively little restructuring. Some electronic texts such as computer operating system documentation (e.g., Symbolics' Genera) are published in parallel forms. There is an obvious need for both forms to have equivalent contents, but there is also a considerable advantage in maintaining a consistency of structure for the reader. Such readers may have gained considerable knowledge of the structure of the pre-existing printed version and may be confused by a radically different electronic structure. Users may also need to use both versions as the situation demands, and this could be under conditions of extreme stress. Consider, for example, operating procedures for an industrial plant which are normally accessed electronically but which also exist as a printed document in case of a total power failure. Many recent technical texts have benefitted from the increased importance of document design as an area of professional activity and are consequently well structured with regard to the users' requirements.

Text Mark-Up

Recent advances in electronic text processing, and in particular the use of text mark-up, represent another form of assistance to the creation of hypertexts. In its broadest sense 'mark-up' refers to any method used to distinguish equivalent units of text such as words, sentences or paragraphs, or to indicate the various structural features of text such as headings, quotations, references or abstracts. Thus, the use of inter-word spacing, punctuation, indentation and contrasting typefaces are also examples of mark-up. However, mark-up is conventionally divided into two classes depending on whether it is procedural or descriptive.

Procedural mark-up, such as the Unix 'nroff' and 'troff' systems, refers to the special control characters that are inserted into electronic text files prior to their submission and subsequent interpretation by output devices such as photo-typesetting machines. Different codes are attached to section headings, paragraphs of body text, references and even individual characters and words so that each is set in an appropriate type style, size and line spacing. For example, to achieve the following emboldening:

"Answer question two or three." (with the word "or" set in bold)

the following troff mark-up would be necessary:

"Answer question two \fBor\fR three."

The first command (\fB) instructs the typesetting machine to print the following characters in Times Bold. The second instruction (\fR) tells the output device to revert to the default style – Times Roman.
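
The way an output device acts on such codes can be illustrated with a minimal Python sketch. It is not troff itself: the function name split_font_runs is hypothetical, and the sketch assumes that only the three font escapes \fB, \fI and \fR occur in the input. It simply walks through a line and records which font is in force for each stretch of text.

```python
import re

# Font escapes handled by this sketch (an assumption; real troff has more).
FONT_ESCAPE = re.compile(r"\\f([BIR])")
FONT_NAMES = {"B": "bold", "I": "italic", "R": "roman"}


def split_font_runs(line, default="roman"):
    """Split a line into (font, text) runs, switching font at each escape."""
    runs = []
    font = default
    pos = 0
    for match in FONT_ESCAPE.finditer(line):
        text = line[pos:match.start()]
        if text:
            runs.append((font, text))
        # The escape changes the font for everything that follows it.
        font = FONT_NAMES[match.group(1)]
        pos = match.end()
    tail = line[pos:]
    if tail:
        runs.append((font, tail))
    return runs


runs = split_font_runs(r"Answer question two \fBor\fR three.")
for font, text in runs:
    print(f"{font:>6}: {text!r}")
```

The point of the sketch is that the formatting instruction travels inside the text: the file says how a word should look, not what kind of unit it is.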

Descriptive mark-up further separates the description of the document from the interpretation by any particular output system since an item of descriptive mark-up is simply a label or 'tag' which is attached to a paragraph of body text or a chapter heading. Since no directions about formatting are included, the interpretation of the mark-up tags occurs entirely within the output system. This approach allows for greater flexibility in terms of moving text files between different mark-up and output systems. A generic descriptive mark-up system called Standard Generalised Mark-up Language (SGML) has been accepted as an ISO standard (ISO 8879) and is likely to become even more widely used in the future. (For an introduction to SGML, see Holloway, 1987.)
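
The separation of description from interpretation can be sketched in a few lines of Python. This is an illustration only and does not use SGML syntax: the tag names, the two rule tables (PLAIN_RULES and HTML_RULES) and the render function are all assumptions made for the example. The same tagged document is passed unchanged to two hypothetical output systems, each of which decides for itself how a heading or quotation should appear.

```python
import html

# A document as (tag, text) pairs: each tag says what the unit is,
# and carries no instruction about how it should look.
document = [
    ("heading", "Text Mark-Up"),
    ("paragraph", "Mark-up distinguishes the units of a text."),
    ("quotation", "Answer question two or three."),
]

# Hypothetical output system 1: a plain-text display.
PLAIN_RULES = {
    "heading": lambda text: text.upper(),
    "paragraph": lambda text: text,
    "quotation": lambda text: "    " + text,
}

# Hypothetical output system 2: HTML elements.
HTML_RULES = {
    "heading": lambda text: f"<h2>{html.escape(text)}</h2>",
    "paragraph": lambda text: f"<p>{html.escape(text)}</p>",
    "quotation": lambda text: f"<blockquote>{html.escape(text)}</blockquote>",
}


def render(doc, rules):
    """Interpret each tag using the rules of one particular output system."""
    return "\n".join(rules[tag](text) for tag, text in doc)


plain_output = render(document, PLAIN_RULES)
html_output = render(document, HTML_RULES)
print(plain_output)
print(html_output)
```

Adding a third output system would mean writing a third rule table; the document itself would not need to be touched.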

The generic coding of the structural units of documents via SGML, or some similar system, is likely to be of considerable significance to the future development of hypertext. It would enable the automatic generation of basic hypertexts which are based on document structure (i.e., the creation of nested hierarchies and the direct linkage of text elements) with a minimum of human involvement. Niblett and van Hoff (1989) describe a program (TOLK) that allows the user to convert SGML documents into a variety of hypertext and text forms for display or printing. Rahtz, Carr and Hall (1990) describe a hypertext interface (LACE) for electronic documents prepared using LaTeX.
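
A rough idea of how a nested hierarchy of linked nodes might be derived from generic coding is given by the following Python sketch. It is not a description of TOLK or LACE. It assumes a small XML fragment as a stand-in for an SGML document (Python's standard library has no SGML parser), and the element names, the title attribute and the build_nodes function are all hypothetical. Each structural unit becomes one node, with links to its parent and children taken directly from the nesting of the tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical, XML-flavoured stand-in for a generically coded document.
SOURCE = """
<chapter title="Creating Hypertext">
  <section title="Conversion of Text to Hypertext">
    <section title="Pros and Cons"/>
    <section title="Text Mark-Up"/>
  </section>
  <section title="Creation of Original Hypertext"/>
</chapter>
"""


def build_nodes(element, parent_id=None, depth=0, nodes=None):
    """Walk the tagged structure depth-first, making one node per unit."""
    if nodes is None:
        nodes = []
    node = {
        "id": len(nodes),
        "title": element.get("title"),
        "parent": parent_id,   # link up the hierarchy
        "children": [],        # links down the hierarchy
        "depth": depth,
    }
    nodes.append(node)
    for child in element:
        # The child will receive the next free id when it is created.
        node["children"].append(len(nodes))
        build_nodes(child, node["id"], depth + 1, nodes)
    return nodes


nodes = build_nodes(ET.fromstring(SOURCE.strip()))
for node in nodes:
    indent = "  " * node["depth"]
    print(f"{indent}[{node['id']}] {node['title']} -> children {node['children']}")
```

No link is specified by hand here: every connection follows from the mark-up, which is the sense in which such conversion could proceed with a minimum of human involvement.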

Perhaps of greater significance is the US Department of Defence Computer-aided Acquisition and Logistic Support (CALS) programme. CALS has the aim of converting all the significant documentation supporting defence systems from paper to electronic forms via internationally agreed standards, including SGML. Although CALS will initially concern only the armed forces and their contractors, the size of the defence 'industry' in America means the programme will soon have a major impact far beyond this sector.
