Errata in REC-xml-20001006

W3C

XML 1.0 Second Edition Specification Errata

Abstract

This document records all known errors in the Second Edition of the Extensible Markup Language (XML) 1.0 Specification; for updates see the latest version.

The errata are numbered, classified as Substantive, Editorial or Clarification and listed in reverse chronological order of their date of publication. Changes to the text of the spec are indicated thus: deleted text, new text, modified text.

Please email error reports to xml-editor@w3.org.

Known Errors

Errata as of 2004-01-28

E62 Clarification

Section 2.4

Change the fourth paragraph so that it reads:

In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".

Rationale

Production [14] clearly forbids "]]>" in the content of elements. The clarification fixes what is believed to be an oversight in the fourth paragraph.
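
As an illustration only, the constraint described above for character data in element content can be checked with a few lines of Python. The function name is_content_char_data is hypothetical, and the sketch assumes that the start-delimiters of markup are '<' and '&', as in production [14].

```python
def is_content_char_data(text: str) -> bool:
    """Return True if text could appear as character data in element content."""
    # '<' and '&' start markup, so neither may appear literally.
    if "<" in text or "&" in text:
        return False
    # The CDATA-section-close delimiter is excluded as well.
    return "]]>" not in text
```

A lone '>' or a lone "]]" passes the check; only the complete three-character sequence is rejected.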

Errata as of 2003-09-10

E61 Substantive

Section 4.3.3

In the first sentence of the sixth paragraph (starting "In the absence of information provided..."), change "error" to "fatal error".

Rationale

There was a contradiction between the sixth and eighth paragraphs. Since it doesn't make sense to parse an XML entity when the character encoding is not properly determined, the contradiction was resolved by favoring a fatal error.

Errata as of 2003-08-06

E60 Clarification

Section 2.8

Augment the last sentence of the last paragraph before production [30] so that it reads:

However, portions of the contents of the external subset or of these external parameter entities MAY conditionally be ignored by using the conditional section construct; this is not allowed in the internal subset but is allowed in external parameter entities referenced in the internal subset.

Section 3.4

Change the first paragraph to read:

[Definition: **Conditional sections** are portions of the document type declaration external subset or of external parameter entities which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.]

Rationale

It was not totally clear that conditional sections are allowed in external parameter entities referenced from the internal subset.

Errata as of 2003-07-30

E59 Clarification

Section 2.3

Add the following note immediately after production [3]:

Note: The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.

Rationale

This was requested for XML 1.1, and was retrofitted to XML 1.0 since it applies to both versions. It is a motherhood note that documents what is already the case.

E58 Clarification

Section 5.1

Add the following paragraph at the end of the section:

Note that when processing invalid documents with a non-validating processor the application may not be presented with consistent information. For example, several requirements for uniqueness within the document may not be met, including more than one element with the same id, duplicate declarations of elements or notations with the same name, etc. In these cases the behavior of the parser with respect to reporting such information to the application is undefined.

Rationale

Further clarify the behaviour of non-validating parsers.

Errata as of 2003-07-02

E57 Substantive

Section 2.10

Amend the first paragraph after the example declarations so that it reads:

The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless [E13] overridden with another instance of the xml:space attribute. This specification does not give meaning to any value of xml:space other than "default" and "preserve". It is an error for other values to be specified; the XML processor MAY report the error or MAY recover by ignoring the attribute specification or by reporting the (erroneous) value to the application. Applications may ignore or reject erroneous values.

Rationale

Although the required behavior was clear when validating, it was under-specified when not validating. Making it an "error" makes it clear that "default" and "preserve" are the only blessed values, but that processors and applications are not obligated to react drastically.

Errata as of 2003-06-25

E56 Editorial

Section 2.3

Modify the last sentence of the paragraph immediately before the first note, so that it reads:

Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Rationale

The sentence was ambiguous: it was not clear whether XmLxxx is reserved or not (it is).
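
A minimal Python sketch of the reserved-name test follows; is_reserved_name is a hypothetical helper, not part of any XML library. It only checks the first three characters and does not verify that the string is otherwise a legal Name.

```python
import re

# The first three characters are 'x', 'm', 'l' in any mix of upper and lower case.
_RESERVED_PREFIX = re.compile(r"[Xx][Mm][Ll]")


def is_reserved_name(name: str) -> bool:
    """Return True if name begins with a case-insensitive 'xml' prefix."""
    # re.match anchors at the start of the string only.
    return _RESERVED_PREFIX.match(name) is not None
```

Under this reading, "XmLxxx" and "xml:lang" are reserved, while "myxml" is not because the match is anchored at the start of the name.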

Errata as of 2003-06-04

E55 Substantive

Section 4.4

In the table of required processor behavior, change the entry for "Reference in EntityValue" to an "Unparsed" entity from "Forbidden" to "Error".

Augment the first item in the bullet list in section 4.4.4 to read:

Add a new subsection as follows:

4.4.9 Error

It is an error for a reference to an unparsed entity to appear in the EntityValue in an entity declaration.

Rationale

This relaxes the definition of well-formedness slightly, taking into account the numerous parsers that actually implement "Bypassed" instead of "Forbidden". Making it an error instead of a fatal error, the definition of error applies: "Processors may detect and report the error and may recover from it". Consequently, processors that do not report an error become conformant while those that do report remain conformant.

Errata as of 2003-04-23

E54 Editorial

Appendix A.1

Augment the "Unicode3" entry so that it reads:

Unicode3

The Unicode Consortium. The Unicode Standard, Version 3.2, defined by: The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27) and the Unicode Standard Annex #28: Unicode 3.2 (http://www.unicode.org/reports/tr28).

E53 Editorial

Appendix A.2

Remove the "Berners-Lee et al." entry.

Rationale

Obsoleted by RFC 2396, which is already in A.1. There are no references to "Berners-Lee et al." left (the only one was removed in the 2nd edition).

Errata as of 2003-04-16

E52 Editorial

Appendix A.1

Expand the [ISO/IEC 10646] bibliographic entry so that it reads:

ISO/IEC 10646

ISO (International Organization for Standardization). ISO/IEC 10646-1:2000, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane and ISO/IEC 10646-2:2001, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2: Supplementary Planes, as, from time to time, amended, replaced by a new edition or expanded by the addition of new parts. [Geneva]: International Organization for Standardization. (See http://www.iso.ch for the latest version.)

Rationale

Account for the publication of Part 2 in 2001.

Errata as of 2003-04-02

E51 Editorial

Appendix A.1

Change the URL for the [IANA-CHARSETS] entry to http://www.iana.org/assignments/character-sets.

Appendix A.2

Change the URL for the [IANA-LANGCODES] entry to http://www.iana.org/assignments/language-tags.

Rationale

The old URLs are obsolete and point to pages saying that the registries have moved to the new URLs.

E50 Clarification

Section 2.11

Change the second paragraph to read:

To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

Rationale

This clarifies that whitespace normalization applies to all the text, not only to what is ultimately passed to the application.
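
The translation described in this erratum can be sketched in Python as follows. The function name normalize_line_breaks is hypothetical, and the sketch operates on an already decoded string rather than on raw bytes.

```python
def normalize_line_breaks(text: str) -> str:
    """Translate #xD #xA pairs and lone #xD characters to a single #xA."""
    # Handle the two-character sequence first so that it yields one #xA,
    # then translate any #xD that remains (one not followed by #xA).
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

The order of the two replacements matters: translating lone #xD characters first would turn each #xD #xA pair into two line feeds.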

E49 Editorial

Section 2.12

Add a note after the first paragraph following the first example:

Note:

Language information may also be provided by external transport protocols (e.g. HTTP or MIME). When available, this information may be used by XML applications, but the more local information provided by xml:lang should be considered to override it.

Rationale

This just recognizes a fact (the info may be available) and advises behaviour that matches HTML 4.0 (cf. http://www.w3.org/TR/html401/struct/dirlang.html#h-8.1.2).

Errata as of 2003-04-02

E48 Clarification

Section 3

Change item #4 in the numbered list at the end of the section to read:

The declaration matches ANY, and the content (after replacing any entity references with their replacement text) consists of character data and child elements whose types have been declared.

Note

The added text (within parentheses) comes from E15.

Rationale

It wasn't clear that ANY allows character data.

Errata as of 2003-03-26

E47 Clarification

Section 5.1

Augment the last sentence of the last paragraph so that it reads:

Except when standalone="yes", they must not process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations; when standalone="yes", processors must process these declarations.

Rationale

It was not clear what processors may/should/must do when standalone="yes".

E46 Editorial

Section 2.2

Delete the last sentence of the first paragraph (the sentence starting 'The use of "compatibility characters",...').

At the end of the section, add the following:

Note:

Document authors are encouraged to avoid "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]). The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
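
For illustration, the ranges listed in this note can be tested with a small Python function. The name is_discouraged is hypothetical, and the sketch covers only the explicit ranges above, not the "compatibility characters" that the note also mentions.

```python
def is_discouraged(ch: str) -> bool:
    """Return True if ch falls in one of the explicitly listed ranges."""
    cp = ord(ch)
    # Control-character ranges and the block starting at #xFDD0.
    if 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F or 0xFDD0 <= cp <= 0xFDDF:
        return True
    # The last two code points of each of planes 1 to 16.
    return cp >= 0x1FFFE and (cp & 0xFFFF) >= 0xFFFE
```

Note that #x85 is deliberately outside both control-character ranges, which is why they are listed as two separate intervals.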

E45 Substantive

Obsoletes E10

Section 3.3.3

Modify the paragraph introduced by E10 so that it reads:

It is an error if an attribute value contains a reference to an entity for which no declaration has been read. This can happen only when a non-validating processor is being used.

Errata as of 2003-03-19

E44 Editorial

Section 3.3.1

Change the last sentence of the first paragraph so that it reads:

The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in 3.3.3 Attribute-Value Normalization.

E43 Editorial

Obsoletes E3, E4, E18 and E26

Section 4.2.2

Rewrite the paragraph beginning "[Definition: The SystemLiteral is called the entity's system identifier.", the following paragraph and the following numbered list, so that they read:

[Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity. Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.

System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it must be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it should trigger escaping. When escaping does occur, it must be performed as follows:

  1. Each disallowed character to be escaped is represented in UTF-8 [Unicode3] as one or more bytes.
  2. The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
  3. The original character is replaced by the resulting character sequence.

Rationale

Erratum E26 unintentionally did not take into account E18 and E3. This erratum doesn't change or add anything not already in E26, E18, E4 or E3; it just consolidates them.
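
The three escaping steps in this erratum can be sketched in Python as follows. The function name escape_system_identifier is hypothetical, and the use of uppercase hexadecimal digits is a choice made for this sketch; the text only requires the %HH form. As the erratum stresses, a real processor should apply such a step only when necessary and as late as possible.

```python
# Delimiters and "unwise" characters named in the erratum. Control characters,
# space, and everything above #x7F are handled by the numeric tests below.
_ESCAPE_ASCII = frozenset('<>"{}|\\^`')


def escape_system_identifier(uri: str) -> str:
    """Escape the characters that the erratum lists, leaving all others as is."""
    out = []
    for ch in uri:
        cp = ord(ch)
        if cp <= 0x20 or cp >= 0x7F or ch in _ESCAPE_ASCII:
            # Step 1: represent the character in UTF-8.
            # Step 2: convert each byte to %HH.
            # Step 3: substitute the result for the original character.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)
```

Characters such as '%', '#' and '/' are not in the list and pass through unchanged, so a string that is already escaped is not escaped a second time.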

E42 Clarification

Section 3.3.2

In the first paragraph, change "...should react..." to "...is to react...".

Change the last sentence of the second paragraph to read:

When an XML processor encounters an element without a specification for an attribute for which it has read a default value declaration, it must report the attribute with the declared default value to the application.

This replaces the former wording: "If a default value is declared, when an XML processor encounters an omitted attribute, it is to behave as though the attribute were present with the declared default value."

Rationale

Clarify whether processors are obligated to use attribute value default declarations.

Errata as of 2002-09-18

E41 Substantive

Section 2.12

Modify the last sentence of the first paragraph so that it reads:

The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string is allowed.

Append the following to the paragraph immediately following the first example:

In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors.

Change the sample declaration of xml:lang to:

xml:lang CDATA #IMPLIED

Change the last set of examples to read:

Rationale

When embedding an XML fragment within a document (such as wrapping a payload inside a SOAP envelope), it is necessary to be able to specify that language information specified higher up in the element tree doesn't apply in the fragment, i.e. to break the inheritance chain without specifying a new language. Note that the empty string is different from the RFC 3066 tag "und" (undetermined). The latter is used "if the language associated with an item cannot be determined" or "for works having textual content consisting of arbitrary syllables, humming or other human-produced sounds for which a language cannot be specified." (from MARC Code List for Languages). The former (empty string) may be used whenever language codes are not applicable, such as for "instrumental or electronic music; sound recordings consisting of nonverbal sounds; audiovisual materials with no narration, printed titles, or subtitles; machine-readable data files consisting of machine languages or character codes" (from MARC Code List for Languages) or whenever it is desired to break the inheritance chain and effectively say "no language information".
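
As a rough illustration of the inheritance rule, the following Python sketch resolves the language in effect on an element using the standard xml.etree.ElementTree module. The helper name effective_lang and the sample element names in the usage note are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(root, target):
    """Return the language in effect on target, or None if there is none."""

    def walk(elem, inherited):
        # A specification on this element, even an empty one, replaces
        # whatever was inherited from its ancestors.
        lang = elem.get(XML_LANG, inherited)
        if elem is target:
            # The empty string means "no language information available".
            return (lang or None), True
        for child in elem:
            result, found = walk(child, lang)
            if found:
                return result, True
        return None, False

    return walk(root, None)[0]
```

For a document such as `<A xml:lang="en"><B xml:lang=""><C/></B><D/></A>`, this sketch reports "en" for A and D and no language for B and C.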

Errata as of 2002-08-21

E40 Editorial

Section 5.2

In each of the two list items of the bulleted list, change the instance of "may not" to "may fail to".

Rationale

Despite the strictures of RFC 2119, the phrase "may not" is dangerous, for it can be read as "must not".

Errata as of 2002-07-10

E39 Clarification

Section 5.2

Amend the last sentence of the last paragraph to read:

Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities) that are or may be specified in external entities should use validating XML processors.

Rationale

It was not clear whether the relative clause "which are declared in external entities" applied to both the attributes and the entities, or just to the entities.

Errata as of 2002-06-19

E38 Substantive

Section 2.8

Remove the whole paragraph after the second example. This paragraph reads:

The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" if it does not conform to this version of this specification. It is the intent of the XML working group to give later versions of this specification numbers other than "1.0", but this intent does not indicate a commitment to produce any future versions of XML, nor if any are produced, to use any particular numbering scheme. Since future versions are not ruled out, this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary. Processors may signal an error if they receive documents labeled with versions they do not support.

Change production [26] VersionNum to read:

[26] VersionNum ::= '1.0'

Rationale

With the advent of XML 1.1, this clarifies that 1.0 documents may not refer to entities of versions other than 1.0.

Errata as of 2002-03-20

E37 Clarification

Section 6

Change the definition for "#xN" to read:

where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.

Rationale

It was found that the phrase "canonical (UCS-4) code value" was misleading. The phrase "the number of leading zeros in the corresponding code value is governed by the character encoding in use" doesn't really make sense.
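
A small Python sketch shows how the #xN notation maps to a character and why leading zeros do not matter. The function name char_from_hex_notation is hypothetical.

```python
def char_from_hex_notation(expr: str) -> str:
    """Return the character denoted by an expression of the form #xN."""
    if not expr.startswith("#x"):
        raise ValueError("expected an expression of the form #xN")
    # int() ignores leading zeros, so "#x41" and "#x0041" give the same result.
    return chr(int(expr[2:], 16))
```

Both char_from_hex_notation("#x41") and char_from_hex_notation("#x0041") return the letter "A".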

E36 Substantive

Section 2.9

Change the third item of the bullet list of conditions for the "Standalone Document Declaration" VC to:

Rationale

The original condition required standalone="no" whenever normalization affected some white space (e.g. a TAB turned into a SPACE) or expanded some entities, even if external declarations had no effect.

E35 Editorial

Section 2.8

Change the first sentence of the 4th paragraph to read:

The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute name-value pairs with its logical structures.

Rationale

There was a concern that "attribute-value pairs" could be interpreted such that attribute values come in pairs, that values are paired with attributes, or that an attribute is just an attribute name rather than combination of name and value. None of these interpretations are accurate.

E34 Clarification

Section 3.2.1

Change the next to last sentence of the paragraph immediately preceding the "Proper Group/PE Nesting" VC to read:

For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model.

Rationale

The original text could be interpreted to mean that the check for a non-deterministic content model for an element had to be performed only if that element actually occurred in the instance being processed, with a child matched ambiguously. The modified text clarifies that having a non-deterministic content model is a property of a DTD, not of a particular instance document using that DTD.

E33 Editorial

Section 2.8

Restore linebreaks in the first and next-to-last examples that were lost between the 1st and 2nd edition:

Hello, world!

Hello, world!

Errata as of 2002-03-06

E32 Editorial

Section 4.3.3

In the last paragraph, change "octet sequences" to "byte sequences".

Rationale

For consistency, "byte" everywhere.

E31 Editorial

Appendix H

Change the title to "W3C XML Core Working Group (Non-Normative)".

E30 Editorial

Appendix A.2

Change the URI for the WEBSGML entry to http://www.sgmlsource.com/8879/n0029.htm

Rationale

The original one was stale.

Errata as of 2002-02-20

E29 Substantive

Section 2.12

Remove the last 5 words from the last sentence of the first paragraph, so that it reads:

The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor.

Remove the entire Note following the first paragraph (already amended by E11):

Note:

[IETF RFC 3066] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES].

Rationale

RFC 3066 is not Standard Track but BCP (Best Current Practice) in the IETF. The deleted note was incomplete, potentially misleading and otiose; it was misinterpreted by some to forbid 3-letter codes.

Errata as of 2001-10-31

E28 Substantive

Section 2.2

Paragraph 2: for "ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000])" read simply "ISO/IEC 10646:2000 [ISO/IEC 10646]".

Section 4.3.3

Paragraph 2: remove the words "Annex F of [ISO/IEC 10646],".

Appendix A.1

Remove the entire reference to ISO 10646, leaving only the anchor. Change "10646-2000" in the next entry to "10646:2000".

Rationale

It has become pointless to refer to ISO/IEC 10646:1993 as amended, which is now obsolete and unavailable.

E27 Substantive

Section 2.2

Second paragraph: for "...the UTF-8 and UTF-16 encodings of 10646" read "...the UTF-8 and UTF-16 encodings of Unicode 3.1", with a link to the Unicode3 entry.

Section 4.2.2

Numbered paragraph 1: change the reference "[IETF RFC 2279]" to "[Unicode3]".

Section 4.3.3

Last paragraph: add a new 3rd sentence:

"Specifically, it is a fatal error if an entity encoded in UTF-8 contains any irregular code unit sequences, as defined in Unicode 3.1."

with a reference to Unicode 3.1.

Appendix A.1

Change the [Unicode3] entry (leaving the anchor name unchanged) to read:

The Unicode Consortium. The Unicode Standard, Version 3.1, defined by: The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27).

Appendix A.2

Remove RFC 2279 as a non-normative reference, since it is now superseded. Also, for "IETF RFC2141" read "IETF RFC 2141".

Rationale

There was no normative reference for UTF-8, unless the phrase "UTF-8 and UTF-16 encodings of 10646" in 2.2 is to be interpreted so, and if it is, it refers to an obsolete edition. The new sentence in 4.3.3 makes interpretation of UTF-8 well-defined in a case where Unicode allows a looser interpretation (that potentially creates security concerns).

Errata as of 2001-10-17

E26 Clarification

Obsoleted by E43
Obsoletes E4

Section 4.2.2

Rewrite the paragraph beginning "[Definition: The SystemLiteral is called the entity's system identifier.", the following paragraph and the following numbered list, so that they read:

[Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.

System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it must be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it should trigger escaping. When escaping does occur, it must be performed as follows:

  1. Each disallowed character to be escaped is represented in UTF-8 [IETF RFC 2279] as one or more bytes.
  2. The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
  3. The original character is replaced by the resulting character sequence.

Rationale

It was still unclear exactly when escaping was to be done and by whom.

Errata as of 2001-10-03

E25 Clarification

Section 4.2.2

Amend the second sentence of the next-to-last paragraph to read:

An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference.

Rationale

It was felt that a too literal reading of the original text would prohibit using the system identifier or other information in attempting to generate an alternate URI reference, which was never the intention.

Errata as of 2001-09-23

E24 Clarification

Section 2.4

Change the last sentence of the third paragraph to read:

The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

Rationale

The original sentence was somewhat ambiguous; the change clarifies which interpretation is correct.
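
A minimal Python sketch of the escaping this sentence calls for is given below. The helper escape_content is hypothetical; it escapes '&' and '<' as well, since those cannot appear literally in content either, and it leaves every other '>' untouched.

```python
def escape_content(text: str) -> str:
    """Escape text for use as element content outside a CDATA section."""
    # '&' must be handled first so the references added below are not re-escaped.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    # Only the '>' that would complete the sequence "]]>" has to be escaped.
    return text.replace("]]>", "]]&gt;")
```

Escaping every '>' would also be acceptable; this sketch simply shows the minimum required for compatibility.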

E23 Substantive

Section 4.3.3

Amend the last sentence of the next-to-last paragraph to read:

Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Rationale

It was always the intent of the XML 1.0 spec to allow the character encoding to be determined externally. The sentence corrected here was introduced in the second edition.

Errata as of 2001-07-25

E22 Substantive

Section 4.3.3

Amend the second paragraph to read:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.

Rationale

The BOM in UTF-8 is already mentioned in Appendix F. It's happening anyway: Windows 2000's Notepad puts a BOM when one saves as UTF-8, and it's not an option. Since it makes some sense for a general-purpose text editor to do that, it's likely to spread to other editors.
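
As a sketch of how a processor might use the Byte Order Mark to tell UTF-8 from UTF-16, the following Python function inspects the first bytes of an entity. The name sniff_bom is hypothetical, and the sketch considers only these two encodings; it does not implement the full detection procedure of Appendix F.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) based on a leading Byte Order Mark."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    # No signature present: the caller has to fall back on other information.
    return None, 0
```

Because the mark is an encoding signature rather than part of the document, a caller would skip bom_length bytes before decoding the rest.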

Errata as of 2001-06-13

E21 Substantive

Section 2.8

Add a new production [28b] and modify production [28] to refer to it:

[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' [VC: Root Element Type]
[WFC: External Subset]
[28a] DeclSep ::= PEReference | S [WFC: PE Between Declarations]
[28b] intSubset ::= (markupdecl | DeclSep)*
[29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI
[WFC: PEs in Internal Subset]

Rationale

Clarify what internal subset means, in particular that it doesn't include the enclosing square brackets "[...]".

Errata as of 2001-05-24

E20 Substantive

Obsoletes erratum E108 to first edition

Section 2.3

Change productions [6] Names and [8] Nmtokens to use #x20 (a single space character) instead of S:

[6] Names ::= Name (#x20 Name)*
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*

Add a note after production 8:

Note: The Names and Nmtokens productions are used to define the validity of tokenized attribute values after normalization (see 3.3.1 Attribute Types).

Rationale

This restores first edition erratum E62, which was rescinded by E108. It seems likely that when E108 was adopted the productions were incorrectly thought to apply to unnormalized attribute values, which would have prevented the use of non-#x20 whitespace (tabs and newlines) as separators in tokenized attribute values. In fact, it only prohibits the use of character references to these characters.

This change restores SGML compatibility (cf. the "name list" and "name token list" productions in SGML).
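
The effect of using #x20 instead of S can be illustrated with a deliberately simplified Python check. The function name is_single_space_separated is hypothetical, and the sketch tests only the separators: it does not verify that each token consists of legal Name or Nmtoken characters.

```python
_XML_WHITESPACE = frozenset(" \t\n\r")


def is_single_space_separated(normalized: str) -> bool:
    """Check that tokens are separated by exactly one #x20 and nothing else."""
    tokens = normalized.split(" ")
    # An empty token means a leading, trailing, or doubled space; any other
    # white space inside a token means a separator other than #x20 was used.
    return all(
        token != "" and not any(ch in _XML_WHITESPACE for ch in token)
        for token in tokens
    )
```

A literal tab in the source document is turned into a space by attribute-value normalization before this check applies, whereas a tab introduced by a character reference survives normalization and would make the check fail.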


E19 Clarification

Section 4.5

Modify the third sentence of the second paragraph, so that it reads:

The actual replacement text that is included (or included in literal) as described above must contain the replacement text of any parameter entities referred to, and must contain the character referred to, in place of any character references in the literal entity value; however, general-entity references must be left as-is, unexpanded.

Errata as of 2001-04-24

E18 Clarification

Obsoleted by E43

Section 4.2.2

To the sentence:

Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs.

(inside the paragraph following the Notation declared VC), append the following:

This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration.

Rationale

This clarifies exactly where a declaration occurs, for purposes of determining the base for relative URIs. Given the example:

example.xml:

%pe; %intpe; ]> &ent;

subdir1/pe:

subdir2/extpe

Though the characters making up the declaration of ent appear in subdir2/extpe, they are not parsed as a declaration there. They are just treated as characters making up the replacement text of intpe. They are not parsed as a declaration until intpe is parsed, at which point the containing external entity is the document entity, so the relevant base URI is that of example.xml.

The fact that it is the containing external entity that is used may be summed up by saying that internal entities do not carry any base URI with them; indeed, they consist only of their replacement text.

If example.xml contained %extpe; instead of %intpe; the situation would be different: the contents of subdir2/extpe would be parsed as a declaration, and the relevant base URI would be that of subdir2/extpe.
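
The difference between the two cases can be sketched with ordinary relative-URI resolution. The absolute locations and the system identifier "ent.xml" below are hypothetical values chosen for illustration; only the file names example.xml and subdir2/extpe come from the example above.

```python
from urllib.parse import urljoin

# Hypothetical absolute locations for the two external entities discussed.
DOCUMENT_ENTITY = "http://example.org/docs/example.xml"
EXTPE_ENTITY = urljoin(DOCUMENT_ENTITY, "subdir2/extpe")

# Hypothetical relative system identifier used in the declaration of ent.
SYSTEM_ID = "ent.xml"

# %intpe; case: the declaration is parsed while the containing external
# entity is the document entity, so example.xml supplies the base URI.
via_intpe = urljoin(DOCUMENT_ENTITY, SYSTEM_ID)

# %extpe; case: the declaration is parsed inside subdir2/extpe itself,
# so that entity supplies the base URI.
via_extpe = urljoin(EXTPE_ENTITY, SYSTEM_ID)

print(via_intpe)
print(via_extpe)
```

Under these assumptions the same relative system identifier resolves to two different resources, depending only on which external entity contains the '<' at the point where the declaration is parsed.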

Errata as of 2001-04-11

E17 Editorial

Section 6

From the definition for "A | B", delete "but not both":

A | B

matches A or B.

Rationale

"but not both" was found misleading by some and was in fact useless.


E16 Substantive

Appendix A

Move the entries for [IETF RFC 2396] and [IETF RFC 2732] from A.2 (informative) to A.1 (normative).

Rationale

In 4.2.2, immediately after the Notation Declared VC, there is a definition of system identifier which clearly depends normatively on those RFCs.

Errata as of 2001-03-27

E15 Clarification

Section 3

Rewrite the Element valid VC as follows:

Validity constraint: Element Valid

An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:

  1. The declaration matches EMPTY and the element has no content (not even entity references, comments, PIs or white space).
  2. The declaration matches children and the sequence of child elements (after replacing any entity references with their replacement text) belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S), comments and PIs (i.e. markup matching production [27] Misc) between the start-tag and the first child element, between child elements, or between the last child element and the end-tag. Note that a CDATA section containing only white space or a reference to an entity whose replacement text is character references expanding to white space do not match the nonterminal S, and hence cannot appear in these positions; however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references.
  3. The declaration matches Mixed and the content (after replacing any entity references with their replacement text) consists of character data, comments, PIs and child elements whose types match names in the content model.
  4. The declaration matches ANY, and the types of any child elements (after replacing any entity references with their replacement text) have been declared.
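
Case 2 above can be illustrated with a minimal sketch that treats a children content model as a regular expression over the sequence of child element types. The helper names and the sample content model are hypothetical, the name pattern is a simplification of production [5] Name, and white space, comments and PIs are assumed to have been set aside already, as the constraint permits.

```python
import re


def content_model_to_regex(model):
    """Translate a `children` content model into a regex over child names."""
    # Each element name becomes a group matching that name plus a ';'
    # terminator, so that names cannot run together once commas are removed.
    pattern = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda m: "(?:" + re.escape(m.group()) + ";)",
        model,
    )
    # Sequence is plain concatenation in a regex: drop commas and white space.
    # The operators |, ?, * and + already mean the same thing in both notations.
    return re.compile(re.sub(r"[,\s]", "", pattern))


def children_valid(model, child_names):
    """True if the child element types belong to the model's language."""
    sequence = "".join(name + ";" for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


MODEL = "(head, (p | list)*, foot?)"
print(children_valid(MODEL, ["head", "p", "list", "p"]))
print(children_valid(MODEL, ["head", "foot", "p"]))
```

This sketch covers only declarations matching children; EMPTY, Mixed and ANY need the separate checks listed in cases 1, 3 and 4.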

Section 3.1

In the paragraph just after production [43] content, amend the definition of empty element so that the word "content" within the definition is a link to production [43].

Errata as of 2001-03-07

E14 Clarification

Section 4.3.2

Amend the last paragraph so that it reads:

A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.

Rationale

"General" is added because:

This clarifies that the following from the OASIS test suite:

xmltest/invalid/001.xml:

with 001.ent: %e; -->

is well-formed but violates a validity constraint.

Errata as of 2001-03-05

E13 Editorial

Section 2.10

In the first paragraph after the example, replace "overriden" with "overridden" (two d's) in the sentence "This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute."

Errata as of 2001-02-22

E12 Substantive

Appendix F.2

Change the [IETF RFC 2376] reference to [IETF RFC 3023] (keeping the same #RFC2376 fragment identifier in order not to break existing links).

Appendix A.2

Change the IETF RFC 2376 entry to:

IETF RFC 3023

IETF (Internet Engineering Task Force). RFC 3023: XML Media Types. eds. M. Murata, S. St.Laurent, D. Kohn. 2001. (See http://www.ietf.org/rfc/rfc3023.txt.)

Rationale

RFC 3023 updates and obsoletes RFC 2376.


E11 Substantive

Section 1.1

Amend the next to last paragraph so that it reads:

This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

[The only change is that "RFC 1766" becomes "RFC 3066".]

Everywhere

Change all [IETF RFC 1766] references to [IETF RFC 3066] (keeping the same #RFC1766 fragment identifier in order not to break existing links).

Section 2.12

Remove the last sentence of the Note: "It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639]."

Appendix A.1

Change the IETF RFC 1766 entry to:

IETF RFC 3066

IETF (Internet Engineering Task Force). RFC 3066: Tags for the Identification of Languages, ed. H. Alvestrand. 2001. (See http://www.ietf.org/rfc/rfc3066.txt.)

Rationale

RFC 3066 updates and obsoletes RFC 1766.


E10 Substantive

Obsoleted by E45

Section 3.3.3

Just after the paragraph beginning "All attributes for which no declaration has been read..." (just before the examples), append the following paragraph:

It is an error if an attribute refers to an entity when there is a declaration for that entity which the processor has not read. This can happen only when a non-validating processor is being used.

Errata as of 2001-01-25

E9 Clarification

Section 3.3.2

Change the title and the text of Attribute Default Legal Validity Constraint to:

Validity Constraint: Attribute Default Value Syntactically Correct

The declared default value must meet the syntactic constraints of the declared attribute type.

Note that only the syntactic constraints of the type are required here; other constraints (e.g. that the value be the name of a declared unparsed entity, for an attribute of type ENTITY) may come into play if the declared default value is actually used (an element without a specification for this attribute occurs).

Rationale

This clarification was prompted by the "sun/invalid/attr11.xml" test file in the OASIS test suite. The interpretation is that the default value of an attribute only needs to be syntactically correct unless it is actually used (i.e. an element occurs without a specification for that attribute), in which case the default value must also meet the constraints bearing on this use. This is believed to be required for SGML compatibility and to be what the XML 1.0 spec currently says.


E8 Clarification

Section 4.1

Change the first sentence of the second paragraph of the Entity Declared WFC (not the VC of the same name) to read:

Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset.

Rationale

The note was inconsistent with the normative text, as it read "external parameter entities" whereas internal parameter entities are also not necessarily processed.


E7 Clarification

Section 4.5

Remove the word "internal" from the title of the section.

Change the first paragraph, in particular removing the word "internal", so that it reads:

In discussing the treatment of entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding whitespace) if there is one but without any replacement of character references or parameter-entity references.]

Rationale

The concept of an entity's replacement text is used throughout the spec, but was defined nowhere for external entities. Also, it was not clear whether the replacement text of an external entity is the content after replacement of character references and parameter-entity references, as for internal entities.

Errata as of 2000-12-06

E6 Editorial

Section 3.3.3

Modify the second example in the table at the end of the section to read as follows (add &#x20; in the middle):

Attribute specification: a="&d;&d;A&a;&#x20;&a;B&da;"
a is NMTOKENS: A #x20 B
a is CDATA: #x20 #x20 A #x20 #x20 #x20 B #x20 #x20

Rationale

Illustrate how space characters (#x20) get normalized no matter whether they come from a character reference or not.
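
The normalization behind this example can be traced with a minimal sketch of the attribute-value normalization algorithm of section 3.3.3. The function name and the tokenizing pattern are hypothetical simplifications. The entity table assumes the declarations that accompany the table in the specification, in which the replacement texts of d, a and da are #xD, #xA and #xD #xA respectively.

```python
import re

# One token per match: hex char ref, decimal char ref, entity ref, or any char.
_TOKEN = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([^;&\s]+);|(.)", re.DOTALL)


def normalize_attribute(raw, entities, is_cdata=True):
    """Normalize an attribute value; `entities` maps names to replacement text."""

    def expand(text):
        out = []
        for hex_ref, dec_ref, name, char in _TOKEN.findall(text):
            if hex_ref:
                out.append(chr(int(hex_ref, 16)))    # referenced character, as-is
            elif dec_ref:
                out.append(chr(int(dec_ref)))        # referenced character, as-is
            elif name:
                out.append(expand(entities[name]))   # recurse into replacement text
            elif char in "\x20\x0D\x0A\x09":
                out.append("\x20")                   # literal white space -> #x20
            else:
                out.append(char)
        return "".join(out)

    value = expand(raw)
    if not is_cdata:
        # Tokenized types: trim leading/trailing #x20 and collapse runs of #x20.
        value = re.sub("\x20+", "\x20", value.strip("\x20"))
    return value


# Assumed replacement texts of the entities d, a and da.
ENTITIES = {"d": "\r", "a": "\n", "da": "\r\n"}
RAW = "&d;&d;A&a;&#x20;&a;B&da;"
print(repr(normalize_attribute(RAW, ENTITIES, is_cdata=True)))
print(repr(normalize_attribute(RAW, ENTITIES, is_cdata=False)))
```

With these assumptions the CDATA result has two, three and two spaces around A and B, and the tokenized result is "A B", in line with the table row above. A #x20 produced by the character reference is treated like any other #x20 in the final collapsing step.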

Errata as of 2000-12-01

E5 Editorial

Section 4.2.2

In the numbered list explaining the escaping of disallowed characters in URI references, change "octets" to "bytes".

Rationale

For consistency. We had "octets" and "bytes" meaning the same thing, but apparently suggesting that they were different. "bytes" won by majority rule.

Errata as of 2000-11-22

E4 Clarification

Obsoleted by E26

Section 4.2.2

Replace the last sentence of the paragraph beginning with "URI references require encoding and escaping of certain characters." with the following:

The XML processor must escape disallowed characters as follows:

Rationale

The fact that the XML processor is responsible for escaping disallowed characters when resolving URI references was lost in the modifications of the 2nd edition.


E3 Clarification

Obsoleted by E43

Section 4.2.2

After the sentence reading "A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.", which follows the definition of SystemLiteral, add the following:

Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.

Errata as of 2000-11-16

E2 Substantive

Section 3.3.1

Add a validity constraint applying to productions [58] NotationType and [59] Enumeration as follows:

Validity constraint: No duplicate tokens
The notation names in a single NotationType attribute declaration, as well as the NmTokens in a single Enumeration attribute declaration, must all be distinct.

Rationale

Necessary to maintain compatibility with SGML.
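
A check for this constraint might look like the following sketch. Both helper names are hypothetical, and the parsing of the enumeration is deliberately naive: it assumes a well-formed "(a | b | c)" string.

```python
from collections import Counter


def parse_enumeration(decl):
    """Split an enumeration such as '(a | b | c)' into its tokens."""
    return [token.strip() for token in decl.strip().strip("()").split("|")]


def duplicate_tokens(tokens):
    """Return the tokens that occur more than once, in first-seen order."""
    counts = Counter(tokens)
    return [token for token in counts if counts[token] > 1]


print(duplicate_tokens(parse_enumeration("(left | right | center)")))
print(duplicate_tokens(parse_enumeration("(left | right | left)")))
```

An empty result means the declaration satisfies the constraint; any token returned would make the declaration invalid.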

Errata as of 2000-11-02

E1 Editorial

Section 3.3.3

In the set of examples at the end of the section, change the last character of the 3rd column of the last example from "#xD" to "#xA". The change makes the third column identical to the second column (for that third example).

Rationale

"#xD" was a typo.


Last updated 2004/11/25 11:26:48 by ht
xml-editor