Menota handbook Ch. 4 (v. 2.0): Document structure
Chapter 4. Document structure
4.1 Introduction: The structure of the manuscript vs. the structure of the work
4.2 Main divisions of a TEI document
4.3 Chapters: <div>
4.4 Paragraph text: <p>
4.5 Metrical text: <lg> and <l>
4.6 Headings: <head>
4.7 Page, column and line breaks: <pb/>, <cb/>, <lb/>
4.8 Punctuation and hyphenation
4.9 Initials and highlighted characters
4.10 Overlapping structures
Version 2.0 (16 May 2008). Links updated 12 July 2016.
4.1 Introduction: The structure of the manuscript vs. the structure of the work
Viewed as physical objects, rather than as vehicles for texts, manuscripts have a certain structural hierarchy. What is regarded as a single manuscript may in fact comprise more than one volume; Flateyjarbók, for example, is bound in two volumes, and the large rímur codex Acc. 22 in three. A manuscript book is made up of quires or gatherings, each of which contains a number of leaves, normally eight. Each leaf has a recto side and a verso side, and each side may be further divided into columns. The text is then written in lines across the page or column. In order to be able to locate a word quickly and easily, all, or at least most, of these structural divisions must be registered. We need to know that a given word appears in the fifth line of the right-hand or b column on the recto side of folio 34. As it is customary to foliate manuscripts without regard to their quire division, the quires will not normally need to be included in the hierarchical structure. Since the quiring can have implications for the text itself, however, this division should be indicated, and it will also generally form part of the <msDesc> element, found in the document header.
At the same time, manuscripts do of course contain texts, which is the reason why most of us are interested in them in the first place. A single manuscript will often contain more than one work, each of which may, in the case of lengthy prose works such as sagas, be divided into chapters or sections. In the case of poetry, rímur for example, a single work (rímnaflokkur) will usually consist of several cantos or fits, each containing a number of stanzas, made up of a number of lines. It may be necessary to group these lines in other ways as well. The stanzas comprising the mansöngur should be distinguished from the main body of the fit, for example, while to facilitate certain types of metrical analysis it might be desirable to divide the individual stanzas into couplets. Some types of poetry, such as the vikivakakvæði, will have a refrain or burden, which should ideally also be distinguished from the narrative section(s) of the stanza.
XML has at its foundation the notion of a text as a single hierarchical structure, which means that it does not work well where there are several concurrent hierarchies, as is obviously the case when one wishes for example to indicate the line divisions both in a poem and in the manuscript in which the poem is contained. The TEI Guidelines offer various solutions to this problem, enabling both the structure of the document and the structure of the text to be encoded.
4.1.1 Hierarchical divisions
The principal means of representing hierarchy is the <div> (i.e. “division”) element. The complex structure of a work such as a set of rímur could be represented by using four levels of <div> elements, for the cantos or fits, for the parts (for example the mansöngvar), for the stanzas, and for the lines.
This type of markup focusses on the hierarchical structure of the text. The actual physical realisation of the text is considered of secondary importance, if of importance at all, when dealing with modern printed literary works: little significance is attached to the page and line breaks in the various editions of, say, Orwell's Nineteen Eighty-Four. In some cases, however, such as the early editions of Joyce's works, which were supervised by the author himself, the physical make-up of the text can be of great consequence. It may also be necessary to maintain the pagination and lineation of standard editions of major works, as these are frequently used in citations in scholarly works. In the case of chirographically transmitted material, the physical organisation of the text is more likely to be recognised as being of importance and in need of encoding. This can be done hierarchically, as above, using <div> elements, which are then given the appropriate @type attributes, e.g. 'page', 'column' or 'line', but it seems more appropriate to reserve these elements for structural divisions in the text, while indicating the physical structure of the document through the use of so-called “milestone” tags, i.e. <pb/>, <cb/> and <lb/>. These tags make up a separate hierarchy in the file and help to overcome the problem of overlapping structures in the mark-up; see also the discussion in ch. 4.10 below.
The rest of this chapter presents how the text may be encoded at higher structural levels than characters and words. Important elements here are the larger divisions of the text, like chapters, paragraphs (with headings), and stanzas. This chapter also presents how pagination and foliation, together with column-breaks and line-breaks, may be encoded. The following TEI elements are presented:
Elements | Contents |
---|---|
<text>, <body> | main divisions of the text, |
<div> | division into chapters (multiple levels are encoded by nesting elements), |
<p> | prose paragraphs, |
<lg>, <l> | line groups and lines, |
<head> | headings, |
<pb/>, <cb/>, <lb/> | page-, column- and line-breaks. |
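As a minimal sketch of the approach described in 4.1.1, the fragment below uses <div> and <p> for the structure of the text and the empty milestone elements for the structure of the manuscript. The @type and @n values and the placeholder text are assumptions chosen for illustration; they are not taken from an actual transcription.

```xml
<!-- Illustrative fragment only: attribute values and text are hypothetical -->
<div type="chapter" n="1">
  <p>
    <pb n="34r"/><cb n="A"/>
    <lb n="1"/>first manuscript line of the chapter
    <lb n="2"/>second manuscript line, which runs on
    <cb n="B"/>
    <lb n="1"/>into the next column without closing the paragraph
  </p>
</div>
```

Because the milestones are empty, the paragraph can continue across the column break without any element having to close.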
4.2 Main divisions of a TEI document
The following presentation is based on ch. 4 “Default Text Structure” of the TEI P5 Guidelines.
A TEI document is always at its highest level enclosed by the start tag <TEI> and the end tag </TEI>. Within the <TEI> element, two other elements appear in a fixed order, namely the <teiHeader> and the <text> elements. Within the <text> element, the body text may appear, enclosed in the element <body>. If the text has front matter, there will be an element <front>, placed before <body> and containing the front matter. Similarly, there may be an element <back>, placed after <body> and containing back matter. The elements <teiHeader>, <text> and <body> are required in any TEI-conformant document, while <front> and <back> are optional. This, then, is the basic structure of a TEI document:
Elements | Contents |
---|---|
<TEI> | The TEI document begins here, |
<teiHeader> ... </teiHeader> | the header goes here, |
<text> | the text itself begins here, |
<front> ... </front> | any front matter goes here, |
<body> ... </body> | the main body of the text goes here, |
<back> ... </back> | any back matter goes here, |
</text> | the text ends here, |
</TEI> | the TEI document ends here. |
4.2.1 Another possible first division of the text: More than one <text> element
The transcriber may want to divide a document into more than one text. This can be done with the <group> element, which should be contained in the top-level <text> element, taking the place of <body> in the simpler scheme illustrated above. The following structure appears:
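<TEI>
<teiHeader> ... </teiHeader>
<text>
<front> ... </front>
<group>
<text> <front> ... </front> <body> ... </body> <back> ... </back> </text>
<text> <front> ... </front> <body> ... </body> <back> ... </back> </text>
</group>
</text>
</TEI>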
The main structure of the text, at the levels of work, first main division, second main division, first chapter of first main division, second chapter of first main division and so on, can be encoded in different ways. If the electronic document consists of more than one work, the structure illustrated above is the natural choice. In that case, one would get multiple sets of further structural divisions, one set within each of the <text> elements. If the electronic document is considered as a single work, and placed in one <text> element, there will only be a single <body> element that needs further divisions.
4.3 Chapters: <div>
Further division of the <body> block is achieved through <div> elements, with one level nesting inside the other as the transcriber moves down through the hierarchical structure of the text.
4.3.1 Type- and level-specified <div> elements
In a complex document, <div> elements may be specified by @type and @n attributes. In this example, the first three chapters of a work have been contained in <div> elements specified in this way:
Elements | Contents |
---|---|
... |
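A minimal sketch of type- and level-specified divisions for three chapters. The @type value 'chapter' and the numbering in @n are assumptions made for illustration, not values prescribed by this handbook.

```xml
<!-- Hypothetical values: type="chapter" and n="1".."3" are illustrative -->
<body>
  <div type="chapter" n="1">
    <p>...</p>
  </div>
  <div type="chapter" n="2">
    <p>...</p>
  </div>
  <div type="chapter" n="3">
    <p>...</p>
  </div>
</body>
```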
4.3.2 Unspecified <div> elements
It is also possible to use <div> elements without any attributes:
Elements | Contents |
---|---|
... |
4.3.3 Nesting <div> elements
Note that <div> elements may nest inside each other. For example, the levels of work, chapter and then paragraph can be encoded in the following manner:
Elements | Contents |
---|---|
<div> | The whole work starts here, |
<div> | the first subdivision starts here (nested), |
<p> ... </p> | one paragraph of the subdivision goes here, |
</div> | end of the subdivision, |
</div> | end of the work. |
While <div> elements may nest, <p> elements may not.
4.4 Paragraph text: <p>
The basic-level element for prose text is the paragraph, <p>. Typically, the deepest level of <div> elements contains the paragraphs:
Elements | Contents |
---|---|
<div> | A new chapter starts here, |
<head> ... </head> | this contains the heading, |
<p> ... </p> | first paragraph, |
<p> ... </p> | second paragraph, |
<p> ... </p> | third paragraph, |
The <p> element may appear in other contexts, such as in the <teiHeader> element. It may also contain a number of other elements but, as underlined above, it may not contain other <p> elements, i.e. it is not allowed to nest.
4.5 Metrical text: <lg> and <l>
The elements discussed here are defined and explained in ch. 6 “Verse” of the TEI P5 Guidelines.
Texts in verse should be encoded using <lg> (line group), which in turn contains one or more <l> elements (lines). As with <div>, <lg> elements can nest. According to the TEI Guidelines, <lg> is a sibling of, i.e. at the same level as, <p>, and cannot be contained within it (unless it appears within an intervening element such as <q>). Example:
Elements | Contents |
---|---|
... </p> | A paragraph ends here, |
<lg> | a line group starts here, |
<l> ... </l> | first line, |
<l> ... </l> | second line, |
<l> ... </l> | third line, |
</lg> | the line group ends here, |
<p> ... | and a new paragraph starts here. |
Nesting of <lg> elements is useful for marking up longer poems. When a poem consists of two levels of line groups one may encode its structure as shown here:
Elements | Contents |
---|---|
<lg> | Here a line group on level one begins, a stanza, |
<lg> | here a subgroup starts, a couplet, |
<l> ... </l> | the first line, |
<l> ... </l> | second line, |
</lg> | and here the subgroup ends, the first of the couplets. |
<lg> | Here a new subgroup starts, |
<l> ... </l> | line, |
<l> ... </l> | line, |
</lg> | here the second subgroup ends, |
</lg> | and here the level one line group ends. |
The <lg> and <l> elements may have several attributes, among other things for encoding information about rhyme or other metrical phenomena. See ch. 9.2 of this handbook for a more detailed presentation of metrical encoding.
Having <p> and <lg> as siblings can create problems for the encoding of prosimetrum texts, where lines of verse or even whole poems can appear within prose text, often as part of direct speech. Rather than including <lg> directly within the <p> element, we recommend inserting the <p> and <lg> elements within <div> elements, using one <div> for each of them:
Elements | Contents |
---|---|
<div> | A chapter opens here, |
<div> | beginning with some prose text, indicated by a <div> element. |
<p> ... </p> | The text goes here, |
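A minimal sketch of the recommended prosimetrum layout, with one <div> for each block of prose and one for the verse. The @type values 'chapter', 'prose' and 'verse' and the placeholder text are assumptions made for illustration.

```xml
<!-- Hypothetical @type values; the recommendation only requires one <div> per block -->
<div type="chapter" n="1">
  <div type="prose">
    <p>Prose leading up to the stanza ...</p>
  </div>
  <div type="verse">
    <lg>
      <l>first line of the stanza</l>
      <l>second line of the stanza</l>
    </lg>
  </div>
  <div type="prose">
    <p>The prose continues after the stanza ...</p>
  </div>
</div>
```

In this layout <lg> never has to sit inside <p>, so the sibling restriction is respected.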
4.6 Headings: <head>
The <head> element is used for containing headings on all levels of the document. If <head> is placed at the start of a <div> element, it typically contains a chapter heading:
Elements | Contents |
---|---|
<div> | Here a chapter begins, |
<head> ... </head> | its heading, |
<p> ... </p> | the first paragraph of the chapter, |
<p> ... </p> | the second paragraph, |
</div> | and here the chapter ends. |
The level of a heading follows from the enclosing element. A <head> element within a level three <div> element is a heading for a level three partition of the text.
An overlap problem may occur when, as is common in Old Norse manuscripts, headings for chapters are placed on the same text line as the last words of the preceding chapter. Graphically, the heading of a following chapter is in fact placed inside the text block of the preceding chapter. As we would like to place headings at the beginning of the textual divisions to which they logically belong, we must override the structure of the layout. One way to do that is to ignore the heading of the following chapter when transcribing the last lines of the preceding chapter. When that chapter is closed with an end tag </div>, we open the next chapter with its start tag <div> and then move back up to transcribe its heading.
It is generally recommended (ch. 4.7 below) that line break elements are inserted while transcribing the manuscript. Following that rule, it is obvious that one cannot keep a single series of line break elements through the intersection between the chapters in the case of a heading overlap. However, it is not invalid according to TEI for <lb/> elements carrying the same number to occur twice. Our recommendation is to use that possibility: when moving up again to encode the heading of the following chapter, assign the actual number of that graphic line to its <lb/> element.
Consider the following column (line numbers in left margin):
05 ...............................
06 .... these are the last
07 Header for words of
08 chapter two chapter 1.
09 Here begins the text
10 of chapter two .........
11 ..............................
The example would be encoded this way (word tags omitted):
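<div>
<p> ... <lb n="5"/>.......
<lb n="6"/>... these are the last
<lb n="7"/>words of
<lb n="8"/>chapter 1.</p>
</div>
<div>
<head rend="..."><seg><lb n="7"/>Header for</seg> <seg><lb n="8"/>chapter two</seg></head>
<p><lb n="9"/>Here begins the text
<lb n="10"/>of chapter two ...
<lb n="11"/>......</p>
</div>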
In this case it is important for the processing of the XML document that the @rend attribute in the <head> element gives the information that this headline is 'inline', and that it is located on the left side of the column. The <seg> element is used to encapsulate the <lb/> with the words that are on that particular line in the header. It is possible to make XSLT stylesheets to process this kind of encoding, but it is not simple.
When double numbering of line breaks is used in a transcription, one should make sure that any automatic numbering program that is run on the <lb/> elements is set up not to override manually given numbers.
4.7 Page, column and line breaks: <pb/>, <cb/>, <lb/>
4.7.1 Page breaks and column breaks
TEI uses the empty element <pb/> to indicate page breaks. This element has an attribute @n which can be used for the page numbers. As it is customary to refer to the manuscript leaves, rather than pages, the value of the @n attribute should indicate front or back pages (recto, verso). Column breaks, <cb/>, should also be indicated in manuscripts with two or more columns. Recommended values for the @n attribute of the <cb/> element are “A”, “B” and so on. Example:
Elements | Contents |
---|---|
<pb n="1r"/> | Folio one, recto page, begins here, |
<cb n="A"/> | the first column begins here, |
<cb n="B"/> | and the second column begins here. |
... | |
<pb n="1v"/> | Folio one, verso page, begins here, |
<cb n="A"/> | the first column of the verso page begins here, |
<cb n="B"/> | and the second column begins here. |
Page break information from another source, for example a printed standard edition, can be encoded in addition to the tagging that refers to the manuscript itself. For this purpose we recommend using the @ed attribute to identify the edition.
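A minimal sketch of page breaks from two sources encoded side by side. The siglum 'ed1', the page number 12 and the placeholder text are hypothetical; only the use of @ed to name the edition comes from the recommendation above.

```xml
<!-- <pb n="3v"/> refers to the manuscript; <pb ed="ed1" n="12"/> refers to a
     printed edition identified by the hypothetical siglum "ed1" -->
<p>
  <pb n="3v"/><lb n="1"/>text at the top of the manuscript page
  <lb n="2"/>more text <pb ed="ed1" n="12"/>where page 12 of the edition begins
</p>
```

A stylesheet can then select either series of page breaks by testing for the presence and value of @ed.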
4.7.2 Line breaks
Line breaks are also indicated with an empty element, <lb/>, which is placed at the beginning of a new line and may be numbered by using the @n attribute:
<lb n="1"/>Line number one begins here.
We recommend that each page, column and line be identified with an element at the very beginning. So for a manuscript with two columns, the first three lines in the first column on the back of the third leaf (folio) would be encoded in this manner:
<pb n="3v"/><cb n="A"/><lb n="1"/>This is the first line. <lb n="2"/>This is the second line. <lb n="3"/>This is the third line. etc.
In other words, there should be as many <pb/> elements as there are pages, as many <cb/> elements as there are columns, and as many <lb/> elements as there are lines. We strongly discourage the use of the <lb/> element in the same way as the <br> element in HTML, in which there typically is one <br> element less than the number of lines (as the <br> element is inserted between the lines).
We recommend that <lb/> is used consistently for indicating the line breaks of the manuscript itself. One may include more than one layer of line break encoding, distinguishing them from one another with the @ed attribute, as shown in ch. 4.7.1 above.
4.8 Punctuation and hyphenation
4.8.1 Punctuation
If a text has been encoded with each word within a <w> element, we recommend that punctuation is encoded within me:punct elements. This element permits the same levels of text representation as the <w> element, i.e. me:facs, me:dipl and me:norm. While punctuation on the me:facs and me:dipl levels in most cases will be identical, it is often radically different on the me:norm level. Here, many dots in the manuscript will simply be suppressed, while other punctuation marks will be added, including modern punctuation marks like quotation marks and exclamation marks. Suppressing a punctuation mark is simply done by leaving the element empty, while any supplied marks are encoded by adding a new me:punct element in which the me:facs and possibly also the me:dipl element will be empty.
A text transcribed as
ok nu sagdi hann. þat er eigi sva. sem þu segir
on the me:dipl level would probably be rendered as
“Ok nú,” sagði hann, “Þat er eigi svá sem þú segir.”
on the me:norm level, allowing for some variation in the type of quotation marks and the order of comma or full stop and quotation mark. In a fully marked-up text, the dot after “sva” would probably be suppressed on the me:norm level, while quotation marks would be added, and also a comma after “nu”. Finally, the dot after “hann” would be changed into a comma:
ok Ok nu nú sagdi sagði hann hann þat þat er er eigi eigi sva svá sem sem þu þú segir segir
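A minimal sketch of the three punctuation cases just described, limited to the words “nu”, “hann” and “sva” on the me:dipl and me:norm levels. The nesting follows the description above (the level elements inside both <w> and me:punct); the namespace URI bound to the me prefix and the wrapping <p> are assumptions.

```xml
<!-- Sketch only: the me namespace URI is assumed, not taken from this chapter -->
<p xmlns="http://www.tei-c.org/ns/1.0" xmlns:me="http://www.menota.org/ns/1.0">
  <!-- comma supplied after "nu": empty on the dipl level -->
  <w><me:dipl>nu</me:dipl><me:norm>nú</me:norm></w>
  <me:punct><me:dipl/><me:norm>,</me:norm></me:punct>
  <!-- the dot after "hann" is changed into a comma -->
  <w><me:dipl>hann</me:dipl><me:norm>hann</me:norm></w>
  <me:punct><me:dipl>.</me:dipl><me:norm>,</me:norm></me:punct>
  <!-- the dot after "sva" is suppressed on the norm level -->
  <w><me:dipl>sva</me:dipl><me:norm>svá</me:norm></w>
  <me:punct><me:dipl>.</me:dipl><me:norm/></me:punct>
</p>
```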
In many cases, a dot should be interpreted as an abbreviation mark rather than a punctuation mark. In such cases, we recommend that the dot is encoded using the ordinary full stop in Basic Latin, but that it is placed within the <am> element. A text transcribed as
nu fann kgr. engan mann þar
on the me:facs level would probably be rendered as
nu fann k_onon_gr engan mann þar
on the me:dipl level. In a fully marked-up text, the abbreviation in “kgr.” would be encoded within an <am> element, while it would be expanded into “onon” (or “onun”) on the me:dipl level:
nu nu fann fann kgr. konongr engan engan mann mann þar þar
In some cases, a word abbreviated with a dot may occur at the end of a sentence, e.g.
nu fann hann eigi kgr.
This dot would be interpreted as an abbreviation mark and possibly also as a punctuation mark. On the me:facs level it would be encoded as no more than a dot, while on the me:dipl level it would be suppressed when “kgr.” had been expanded to “konongr”. The encoder might, however, add a dot as a punctuation mark within a me:punct element. That would certainly be the case on the me:norm level, possibly also on the me:dipl level:
nu nu Nú fann fann fann hann hann hann eigi eigi eigi kgr. konongr konungr
On all three levels, a dot will be displayed after the word “konungr”, but the dot on the me:facs level is classified as an abbreviation mark (since it occurs within the <am> element), while the dot on the me:dipl and the me:norm levels is classified as a punctuation mark (since it occurs within the me:punct element).
The dot is by far the most common punctuation mark in Medieval Nordic sources. A question mark was sometimes used, while quotation marks and exclamation marks are post-medieval and only seen in normalised editions. There are a few additional punctuation marks, e.g. the punctus elevatus and the virgula. These marks can be encoded using entities, but should otherwise be kept within the me:punct element. See also ch. 6.3.8 below.
4.8.2 Hyphenation
In medieval manuscripts, hyphens are frequently used at the end of a line to indicate that the word continues on the next line. In such cases, we recommend that the hyphen is entered immediately before the <lb/> element. This is what it would look like in a single-level transcription (cf. ch. 3.3):
This is an example of how hyphen-<lb/>ation can be encoded.
If the hyphen is missing in the manuscript, we suggest that the <supplied> element is used to contain the hyphen added by the transcriber:
This is an example of how hyphen<supplied>-</supplied><lb/>ation can be encoded.
If the editor wants to display supplied hyphens differently from those found in the manuscript, that can easily be done by a stylesheet.
In a multi-level transcription, hyphenation would be contained in the me:punct element. Taking the word “hæ-góma” as an example (from fig. 4.1 below, divided between line 3 and 4), the me:punct element would be placed within each textual level - facsimile, diplomatic and normalised.
hæ-góma hæ-góma hæ-góma
In a display of the facsimile level, hyphens will always be rendered, while they may be suppressed on the diplomatic level, and they will always be suppressed on the normalised level.
If the hyphen does not occur in the manuscript but is supplied by the transcriber or editor, we recommend adding a @type attribute with the value 'supplied' :
hæ- góma hæ- góma hæ- góma
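A minimal sketch of a supplied hyphen in a multi-level transcription, assuming that me:punct sits inside each level element as described above. The word forms are kept identical on all three levels for simplicity, n="4" follows from the word being divided between lines 3 and 4, and the namespace URI bound to the me prefix is an assumption.

```xml
<!-- Sketch only: level-specific spellings are not distinguished here -->
<w xmlns="http://www.tei-c.org/ns/1.0" xmlns:me="http://www.menota.org/ns/1.0">
  <me:facs>hæ<me:punct type="supplied">-</me:punct><lb n="4"/>góma</me:facs>
  <me:dipl>hæ<me:punct type="supplied">-</me:punct><lb n="4"/>góma</me:dipl>
  <me:norm>hæ<me:punct type="supplied">-</me:punct><lb n="4"/>góma</me:norm>
</w>
```

The same line break appears once per level here, which is why automatic renumbering of line breaks needs care in multi-level files.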
Note that a single line break will appear several times in a multi-level transcription if it occurs within a word. Great caution must therefore be taken with automatic numbering of <lb/> elements.
4.9 Initials and highlighted characters
Medieval manuscripts often have initials, sometimes quite large and often decorated in various ways. It is also quite common to find a highlighted capital at the beginning of a section in the text, a littera notabilior. Some transcribers would simply transcribe an initial and a littera notabilior with capitals and refer to a facsimile for the way they have been drawn. Other transcribers would like to encode these traits of the manuscript. For this purpose, we recommend using the <c> element with a @type and a @rend attribute.
Fig. 4.1. AM 619 4to, fol. 47r. Note the decorated initial “S” and the littera notabilior, beginning with a capital eth, “Ð”, in the last word of line 2.
Elements / attributes | Contents |
---|---|
<c> | contains a character |
@type | specifies the type of character, e.g. 'initial', 'littNot' |
@rend | specifies how the character has been rendered in the source |
In fig. 4.1, the last word of line 2 can be encoded as
<c type="littNot">Ð</c>es
while the first word of line 16 can be encoded as
<c type="initial">S</c>alomon
This type of encoding is more relevant for the facsimile and possibly the diplomatic level, but not for the normalised level of text representation.
4.10 Overlapping structures
There are no simple ways of encoding overlapping structures in XML, since XML is a strict tree structure in which every element must be part of a single 'parent' element. For example, a word or sentence may be written over two manuscript pages. If we represent the manuscript page as an element, the words will not belong to a single page and a parser error will occur.
This problem is dealt with in the current chapter by using empty elements to represent page breaks in the manuscript, rather than a page of text (cf. ch. 4.7 above). The same is true for columns and lines, where words, sentences and paragraphs routinely overlap with the physical features of the manuscript. These elements, <pb/>, <cb/> and <lb/>, are empty in the sense that they are inserted at a specific point in the structure without any extension. For this reason, they are often referred to as milestones. Note the position of the slash in these elements.
In ch. 11 “Representation of Primary Sources” in the TEI P5 Guidelines the elements <addSpan/>, <delSpan/> and <damageSpan/> are defined. These elements are counterparts to the elements <add>, <del> and <damage>, but are all empty, and should be used when the feature to be encoded crosses structural divisions. There are in fact many more elements which can cross structural divisions, e.g. <supplied> and <unclear>, but these have no corresponding span elements. Rather than adding such elements one by one, we recommend using one generic empty element to cover all cases of overlapping structures. We have called this new element <me:textSpan/> and given it attributes from the classes “att.spanning”, “att.transcriptional”, “att.typed” and “att.global”, and the attribute @category:
Elements / attributes | Contents |
---|---|
<me:textSpan/> | A generic element to handle overlapping text structures |
@category | Specifies the type of span, restricted to this list of values: |
'add' | for contents that would otherwise be contained by the <add> element, cf. ch. 7.2.1 |
'corr' | for contents that would otherwise be contained by the <corr> element, cf. ch. 7.4.3 |
'del' | for contents that would otherwise be contained by the <del> element |
'damage' | for contents that would otherwise be contained by the <damage> element, cf. ch. 7.5.1 |
'gap' | for contents that would otherwise be contained by the <gap> element, cf. ch. 7.3.1 |
'me:expunged' | for contents that would otherwise be contained by the me:expunged element, cf. ch. 7.4.2 |
'sic' | for contents that would otherwise be contained by the <sic> element, cf. ch. 7.4.3 |
'supplied' | for contents that would otherwise be contained by the <supplied> element, cf. ch. 7.4.1 |
'unclear' | for contents that would otherwise be contained by the <unclear> element, cf. ch. 7.3.2 |
'other' | for any other contents |
@spanTo | Specifies the end point of the text span, using values like: |
'an1' | anchor 1 |
'an2' | anchor 2, etc. |
<anchor/> | An empty element (milestone) which attaches an identifier to a point within a text |
@xml:id | Specifies the identifier corresponding to the one used in the @spanTo attribute of the preceding <me:textSpan/> element, using values like: |
'an1' | anchor 1 |
'an2' | anchor 2, etc. |
We will discuss an example of an overlapping structure in AM 673 b 4to (Plácitusdrápa 1):
Fig. 4.2. AM 673 b 4to, fol. 1r, ll. 1-4
The first three lines read approximately:
genget fiornes ualdr [quaþ........fr]egr nu | mun er lægiasc miuks scalldu manra[un sli] | ca morlin_s_ boþe fi_n_na uestu i frægre f[rest]
The letters in brackets were read by earlier editors, especially Finnur Jónsson in 1889. For this section, we will discuss the text at the end of the second line and at the start of the third. It is clear that part of each word is missing, but the damaged manuscript forms a single feature. Text can be supplied from Finnur Jónsson’s transcription, but we want to represent both the damage and the supplied text as a single feature, which overlaps with the middle of the two words. The simple encoding, without the unclear text marked or the supplied text, would be:
manra ca
With the supplied text encoded in the conventional way, the following would produce an error:
manraaun
slica
The <unclear> and <supplied> elements, if used in their conventional way, would overlap with the <w> elements, meaning that the word tag would close before an element inside it had closed. That would stop an XML processor from proceeding any further with the document.
In these guidelines, we offer two solutions to the problem of overlapping structures. The first is more complex, but more robust. The second is simpler, but is less machine-readable and may affect the validation of the document structure in other respects. Even so, we recommend the latter solution.
4.10.1 Linked segments
The following approach is more sound from the point of view of an XML document, but creates extra tagging. The feature is encoded in a series of separate elements, linked together.
In order to encode linked segments, the encoder should break the overlapping feature into parts which fit within the XML structure (usually within the word or dipl/facs/norm elements). Each part is identified using the @xml:id attribute, and they are linked together using the following attributes:
Elements / attributes | Contents |
---|---|
@xml:id | provides a unique identifier for the element bearing the attribute |
@next | used at the start and in the middle: an IDREF pointing to the element which marks the next tag of the same feature |
@prev | used in the middle and at the end: an IDREF pointing to the element which marks the previous tag of the same feature |
The two-word example above is encoded thus:
manraun &slong;li ca
Adding all three textual levels, including the unclear text encoded at the facs level, we would have:
man manraun manraun &slong;li ca &slong;li ca slíka
It is recommended that the additional information for the feature (such as the editor responsible, type, etc.) be only included in the first element, but editors may wish to include the attributes in all elements.
For the purposes of display, the start of a feature can be marked by selecting the element with the 'next' attribute set, but not the 'prev'; and the end can be marked by selecting the element with the 'prev' attribute set but not the 'next'.
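A minimal sketch of linked segments for the two-word example, with the supplied text split into one <supplied> element per word. The identifiers s1 and s2 are hypothetical, the hash-prefixed values follow TEI P5 pointer syntax (an assumption about how the IDREFs mentioned above are written), and the character ſ stands where the transcription uses the entity &slong;.

```xml
<!-- Sketch only: hypothetical identifiers; word-level markup without textual levels -->
<p xmlns="http://www.tei-c.org/ns/1.0">
  <w>man<supplied xml:id="s1" next="#s2">raun</supplied></w>
  <lb n="3"/>
  <w><supplied xml:id="s2" prev="#s1">ſli</supplied>ca</w>
</p>
```

Each <supplied> element now closes inside its own <w>, so the document stays well-formed, while @next and @prev record that the two parts belong to one feature.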
4.10.2 Boundary marking with empty elements
Another solution is to encode the beginning and end of a text span with empty elements. This method has been described in ch. 20 “Non-hierarchical Structures” of the TEI P5 Guidelines and will be applied here in a slightly modified version. As outlined above, we have introduced a generic element <me:textSpan/> which is specified by way of a @category attribute. If, for example, the overlapping structure to be encoded is a piece of supplied text, this fact is expressed through the value of the @category attribute:
<me:textSpan category="supplied"/>
Thus, all instances of supplied text in the file will either be contained in <supplied> elements (in non-overlapping contexts) or be marked by <me:textSpan category="supplied"/> elements (in overlapping contexts).
In addition to inserting the empty <me:textSpan/> element at the beginning of the textual span, an attribute @spanTo is added with a suitable index, e.g.
<me:textSpan category="supplied" spanTo="an1"/>
It now remains to mark the end of the span, i.e. the extent of the supplied text, with another empty element, the TEI <anchor/> element. This must be specified with an @xml:id attribute having the same index as the @spanTo attribute at the beginning of the span:
<anchor xml:id="an1"/>
The full encoding will be like this:
<w>man<me:textSpan category="supplied" spanTo="an1"/>raun</w> <lb/> <w>&slong;li<anchor xml:id="an1"/>ca</w>
Note that the value of the @xml:id attribute must be unique within the whole document.
There is no simple answer to the problem of non-hierarchical structures in XML encoding. However, we believe that using empty elements as boundary markers may prove to be the simplest and most general encoding, and it is therefore the solution we recommend. With either technique, only one method should be used in each document.