XML Normalization

W3C Editor's Draft 15 March 2013

This version:

http://www.w3.org/2008/xmlsec/Drafts/xml-norm/

Latest published version:

http://www.w3.org/TR/xml-norm/

Latest editor's draft:

http://www.w3.org/2008/xmlsec/Drafts/xml-norm/

Editors:

John Boyer, IBM (formerly PureEdge Solutions Inc.) (Canonical XML Version 1.0)

Glenn Marcy, IBM (Canonical XML Version 1.1)

Pratik Datta, Oracle (Canonical XML Version 2.0)

Frederick Hirsch, Nokia (Canonical XML Version 2.0)

Jim Dovey, Rakuten

Abstract

XML Normalization defines a means by which XML parsers can produce normalized output of any parsed document. This normalized form is similar to that produced by Canonical XML 1.1 [XML-C14N11], though the two are not interchangeable. Its intent also differs from that of Canonical XML 1.1: it exists primarily to help clients of XML parser APIs such as SAX [SAX] ensure that they are provided XML data in a predefined representation, whether as events or DOM nodes.

Any XML document is part of a set of XML documents that are logically equivalent within an application context, but which vary in physical representation based on syntactic changes permitted by XML 1.0 [XML10] and Namespaces in XML 1.0 [XML-NAMES]. This specification describes a method by which parsers can generate XML events or DOM nodes according to a normalized form that accounts for the permissible changes. It also allows for external specification of certain attributes of this normalized form.

The aim of this standard is to define a means by which a low-overhead streaming XML parser can output events in a manner which can be anticipated by a client of the parser, thus reducing that client's need for additional logic to handle variations in representation. It also provides a supplemental guide to implementing the same algorithm for DOM parsers. It is not intended to provide a canonicalized form of a document as defined by Canonical XML 1.1 [XML-C14N11], and has some incompatibilities with that standard, though its output is frequently similar. However, two semantically equivalent documents will produce similar output when processed using the same normalization parameters and algorithm.

Normalization for Streaming XML Parsers is applicable to XML 1.0. It is not defined for XML 1.1.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document was published by the XML Security Working Group as an Editor's Draft. If you wish to make comments regarding this document, please send them to public-xmlsec@w3.org (subscribe, archives). All comments are welcome.

Publication as an Editor's Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1. Introduction
2. XML Normalization
3. Algorithm for DOM Normalization
- 3.1 The HashMap type
  * 3.1.1 Attributes
  * 3.1.2 Methods
- 3.2 The DOMNormalizer Interface
  * 3.2.1 Attributes
  * 3.2.2 Methods
4. Algorithm for Streaming Normalization
5. Output rules
A. References
- A.1 Normative references
- A.2 Informative references

1. Introduction

1.1 Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [RFC2119].

1.2 Terminology

See [XML-NAMES] for the definition of QName.

document subset

A document subset is a portion of an XML document that may not include all of the nodes in the document.

normalized form

The normalized form of an XML document is a physical representation of the document produced by the method described in this specification.

normalized XML

The term normalized XML refers to XML that is in normalized form. The XML normalization method is the algorithm defined by this specification that generates the normalized form of a given XML document or document subset. The term XML normalization refers to the process of applying the XML normalization method to an XML document or document subset.

subtree

Subtree refers to one XML element node, and all that it contains. In XPath terminology it is an element node and all its descendant nodes.

DOM

DOM or Document Object Model is a model of representing an XML document in tree structure. The W3C DOM standard [DOM-LEVEL-2-CORE] is one such DOM, but this specification does not require this particular set of DOM APIs; any similar model can be used as long as it has a tree representation of the XML document, whose root is a document node, and the document node's descendants are element nodes, attribute nodes, text nodes etc.

DOM parser

A software module that reads an XML document and constructs a DOM tree.

Event parser

A software module that reads an XML document and posts parsing events to an API client. SAX [SAX] is an example of an event parser.

Stream parser

A software module that reads an XML document and constructs a stream of XML events like "beginElement", "text", "endElement", and exposes an iterator-based API allowing clients to 'pull' these events. StAX [XML-PARSER-STAX] is an example of a stream parser.

1.3 Applications

Since the XML 1.0 Recommendation [XML10] and the Namespaces in XML 1.0 Recommendation [XML-NAMES] define multiple syntactic methods for expressing the same information, XML applications tend to take liberties with changes that have no impact on the information content of the document. XML normalization is designed to be useful to applications that wish to process an XML document with regard to a predetermined semantic representation, allowing clients of a stream or event parser to delegate the handling of differing representations of semantically identical XML documents to the parser itself.

For example, a representation may make use of a well-known XML namespace prefix or it may use one of its own devising. The algorithm defined in this specification can be used to translate those prefixes while parsing, such that the client API need not anticipate multiple prefixes, nor need to manually compare potentially long namespace URIs at every step. This also applies to any XPath or QName values contained within the document.

Another example allows a client to instruct the parser to ignore certain subtrees, or to only return certain subtrees, and whether to report them as DOM elements or as raw text. For example, an XML-RPC request might consist of a document fragment containing protocol information and a document fragment containing response data. This specification allows a stream or event parser client to request that only one of these fragments is parsed and reported; it may also request that the raw text content of the other fragment be reported as a single block of text which can then be fed into a less-able parser further back in the chain. This can provide a performant alternative to the use of XPath expressions in some simple use cases.

Note

Although not stated as a requirement on implementations, nor formally proved to be the case, it is the intent of this specification that if the output generated by normalizing a document according to this specification is itself parsed using the same normalization rules, the output generated by the second normalization will be the same as that generated by the first normalization.

1.4 Limitations

Two XML documents may have differing information content that is nonetheless logically equivalent within a given application context. Although two XML documents are equivalent (aside from limitations given in this section) if their normalized forms are identical, it is not a goal of this work to establish a method such that two XML documents are equivalent if and only if their normalized forms are identical. Such a method is unachievable, in part due to application-specific rules such as those governing unimportant whitespace and equivalent data (e.g. <color>black</color> versus <color>rgb(0,0,0)</color>). There are also equivalencies established by other W3C Recommendations and Working Drafts. Accounting for these additional equivalence rules is beyond the scope of this work. They can be applied by the application or become the subject of future specifications.

The normalized form of an XML document may not be completely operational within the application context, though the circumstances under which this occurs are unusual.

The difficulties arise due to the loss of the following information not available in the data model:

notations and external unparsed entity references
attribute types in the document type declaration

In the first case, the loss of external unparsed entity references and the notations that bind them to applications means that normalized forms cannot properly distinguish among XML documents that incorporate unparsed data via this mechanism. This is an unusual case precisely because most XML processors currently discard the document type declaration, which discards the notation, the entity's binding to a URI, and the attribute type that binds the attribute value to an entity name. For documents that must be subjected to more than one XML processor, the XML design typically indicates a reference to unparsed data using a URI in the attribute value.

In the second case, the loss of attribute types can affect the normalized form in different ways depending on the type. Attributes of type ID cease to be ID attributes. Hence, any XPath expressions that refer to the normalized form using the id() function cease to operate. The attribute types ENTITY and ENTITIES are not part of this case; they are covered in the first case above. Attributes of enumerated type and of type ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, and NOTATION fail to be appropriately constrained during future attempts to change the attribute value if the normalized form replaces the original document during application processing. Applications can avoid the difficulties of this case by ensuring that an appropriate document type declaration is prepended prior to using the normalized form in further XML processing. This is likely to be an easy task since attribute lists are usually acquired from a standard external DTD subset, and any entity and notation declarations not also in the external DTD subset are typically constructed from application configuration information and added to the internal DTD subset.

1.5 Requirements

Normalization for Streaming XML Parsers solves many of the major issues that have been identified by implementers with Canonical XML 1.0 [XML-C14N] and 1.1 [XML-C14N11]. It thus provides a better alternative to the use of canonicalization algorithms for the purposes outlined in this specification.

1.5.1 Performance

Canonicalization will be slow if the implementation uses the Canonical XML 1.1 specification as a formula without any attempt at optimization. This specification rectifies this problem by incorporating lessons learned from the implementation of that specification. Most mature canonicalization implementations solve the performance problem by inspecting the signature first, to see if it can be canonicalized using a simple tree walk algorithm whose performance is similar to regular XML serialization. If not they fall back to the expensive nodeset-based algorithm.

The use cases that cannot be addressed by the simple tree walk algorithm are mostly edge cases. This specification restricts the input to the normalization algorithm so that implementations can always use the simple tree walk algorithm. This restriction is what makes this specification suitable for direct use as part of a stream or event parser.

C14N 1.x uses an "XPath 1.0 Nodeset" to describe a document subset. This is the root cause of the performance problem and can be solved by not using a nodeset. This specification does not use a nodeset, visits each node exactly once, and only visits the nodes that are being normalized.

1.5.2 Streaming

A streaming implementation is required to be able to process very large documents without holding them all in memory; it should be able to process documents one chunk at a time.

1.5.3 Robustness

Whitespace handling in parser clients frequently means trimming all node contents. This specification provides a means for a parser to perform this duty internally depending on input from the parser client, and for such processing to be done in an intelligent manner with regard to QNames and XPaths in content. Specifically, it uses three techniques for normalizing text content:

Optionally remove leading and trailing whitespace from text nodes,
Allow for QNames in content, particularly in the xsi:type attribute,
Optionally rewrite prefixes

1.5.4 Portability

It should be possible to normalize a sub-document in such a way that it may be moved into a completely different XML document while retaining its semantic meaning. This is the goal of Exclusive canonicalization [XML-EXC-C14N] that mostly satisfies this requirement except for the case of namespace prefixes embedded in content. This specification builds on exclusive canonicalization and solves the problem of namespaces in content, allowing parser clients to re-serialize sub-documents into larger documents without knowledge of the larger document's content or structure.

1.5.5 Simplicity

C14N 1.x algorithms are complex and depend on a full XPath library. This increases the work required for scripting languages to make use of them as an XML document pre-processing tool. This specification addresses this issue by not using the complex nodeset model, and therefore not relying completely on XPath.

1.6 Test Cases for Canonical XML 2.0

Test cases for Canonical XML 2.0 are documented in "Test Cases for Canonical XML 2.0" [C14N2-TestCases].

2. XML Normalization

2.1 Data Model

The input to the normalization algorithm consists of an XML document subset, and set of options. The XML document subset can be expressed in two ways, with a DOM model or a Stream model.

2.1.1 Data Model for DOM Parsers

In the DOM model the XML subset is expressed as:

Inclusion List: Either the document Node D or a list of one or more element nodes E1, E2, … En.
(If out of this list, one element node Ei is a descendant of another Ej, then that element node Ei is ignored.)
Exclusion List (optional): A list of zero or more element nodes E1, E2, … Em and a list of zero or more attribute nodes A1, A2, … AM.
These attribute nodes should not be namespace declarations or attributes in the xml namespace.

The XML subset consists of all the nodes in the Inclusion list and their descendants, minus all the nodes that are in the Exclusion list and their descendants.

The element nodes in the Inclusion list are also referred to as apex nodes.

Note: This input model is a very limited form of the generic XPath Nodeset that was the input model for Canonical XML 1.x. It is designed to be simple and allow for a high performance algorithm, while still supporting the most essential use cases. Specifically:

This model does not support re-inclusion; i.e. all the exclusions are applied after all the inclusions. It is effectively a simplified form of the XPath Filter 2 model [XMLDSIG-XPATH-FILTER2] with one intersect followed by one optional subtract operation. Re-inclusion complicates the normalization algorithm, especially in the areas of namespace and XML attribute inheritance.
Exclusion is limited to complete subtrees and attribute nodes. Other kinds of nodes (text, comment, PI) cannot be excluded.
Attribute exclusion is also limited, such that namespace declarations and attributes from the XML namespace cannot be excluded.
Some examples of subsets that were permitted in Canonical XML 1.x, but not in this new version:
- A subset consisting of a single attribute all by itself.
- A subset consisting of an attribute without its owner element.
- A subset consisting of a text node all by itself.
- A subset consisting of a text node without its parent node.
- A subset consisting of an element without some of its text node children.

Note

The DOM model of XML Normalization does not support direct input of an octet stream; the Stream model exists for that purpose. The transformation of such a stream into the input model required for DOM processing by this specification is application-specific and should be defined in specifications that reference or make use of this one.

2.1.2 Data Model for Stream and Event Parsers

In the Stream model, the XML subset is again expressed as an Inclusion List and an Exclusion List. For streaming, however, nodes are identified using a set of simple XPath paths. An empty XPath in the Inclusion list SHALL be interpreted as referring to the document's root element as though its value were /. An empty XPath in the Exclusion list SHALL be ignored.

Specifically, only absolute XPaths are allowed, and only if they are comprised of element names and QNames. In addition, the following special characters and wildcards are permitted:

// to allow for selection of deeply-nested elements.
* to allow for any single unnamed element.

The parser MUST treat the inclusion of any other XPath components as an error, including:

Axes.
Context-node (.) and parent-node (..) references.
Expressions.
Functions.

The purpose of this is to limit the description of included/excluded nodes such that they can be easily compared against a stack of node names or QNames assembled by the parser to keep track of its current location in the document.
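The restricted path syntax and the stack comparison described above can be sketched as follows. This is an illustration only, not part of the specification: the function names parse_path, is_supported, selects and in_subtree are hypothetical, the NCName pattern is a simplification of the real production, and treating an empty step list as "everything is included" is one reading of the rule for empty Inclusion paths.

```python
import re

# Simplified NCName (assumption: the real production allows more characters).
_NCNAME = r"[A-Za-z_][\w.\-]*"
# One step: '/' or '//' followed by '*' or a (possibly prefixed) element name.
_STEP = re.compile(rf"(//?)(\*|{_NCNAME}(?::{_NCNAME})?)")

def parse_path(path):
    """Parse a restricted absolute path into (deep, name) steps.

    Raises ValueError for anything else (axes, '.', '..', predicates,
    functions, relative paths).
    """
    if path in ("", "/"):
        return []                      # empty inclusion path is read as '/'
    steps, pos = [], 0
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            raise ValueError(f"unsupported XPath component at offset {pos}: {path!r}")
        steps.append((m.group(1) == "//", m.group(2)))
        pos = m.end()
    return steps

def is_supported(path):
    """True if the path uses only the permitted constructs."""
    try:
        parse_path(path)
        return True
    except ValueError:
        return False

def selects(steps, stack):
    """True if the element whose root-to-self names are `stack` is selected."""
    def go(si, ki):
        if si == len(steps):
            return ki == len(stack)
        deep, name = steps[si]
        # '//' may match at any remaining depth, '/' only at the next level.
        if deep:
            candidates = range(ki, len(stack))
        else:
            candidates = [ki] if ki < len(stack) else []
        return any((name == "*" or name == stack[k]) and go(si + 1, k + 1)
                   for k in candidates)
    return go(0, 0)

def in_subtree(steps, stack):
    """True if the current element is a selected element or lies beneath one."""
    return any(selects(steps, stack[:n]) for n in range(len(stack) + 1))
```

A parser would call in_subtree once per start-element token with its current name stack, once for each Inclusion path and once for each Exclusion path.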

Note

Since XPath 1.0 [XPATH] requires that any namespaced elements be identified by QName, and since the canonicalization algorithm provides a means to rewrite namespace prefixes, the XPaths used as input MUST use the rewritten prefix values.

2.2 Parameters

Instead of separate algorithms for each variant of normalization, this specification takes the approach of a single algorithm subject to a variety of parameters that change its behavior to address specific use cases.

The following dictionaries define the logical parameters supported by this algorithm. The actual serialization that expresses the parameters in use may be defined as appropriate to specific applications of this specification (e.g., the <ds:CanonicalizationMethod> element in [XMLDSIG-CORE2]).

dictionary QNameAware { DOMString Name; };

2.2.1 Dictionary QNameAware Members

Name of type DOMString

The NCName name of an element or attribute.

dictionary Element : QNameAware { DOMString NS; };

2.2.2 Dictionary Element Members

NS of type DOMString

The URI of the namespace to which this element belongs.

dictionary QualifiedAttribute : QNameAware { DOMString NS; };

2.2.3 Dictionary QualifiedAttribute Members

NS of type DOMString

The URI of the namespace to which this attribute belongs.

dictionary UnqualifiedAttribute : QNameAware { DOMString ParentName; DOMString ParentNS; };

2.2.4 Dictionary UnqualifiedAttribute Members

ParentNS of type DOMString

The URI of the namespace of this attribute's parent element.

ParentName of type DOMString

The NCName of this attribute's parent element.

dictionary XPath : QNameAware { DOMString NS; };

2.2.5 Dictionary XPath Members

NS of type DOMString

The URI of the namespace to which this element belongs.

dictionary Parameters { boolean IgnoreComments = true; boolean TrimTextNodes = true; object PrefixRewrite = "none"; QNameAware[] QNameAware = []; array[QNameAware] ReturnCharacters = []; };

2.2.6 Dictionary Parameters Members

IgnoreComments of type boolean, defaulting to true

Whether to ignore comments during normalization.

PrefixRewrite of type object, defaulting to "none"

With a string value of "none", prefixes are left unchanged. With a string value of "sequential", prefixes are changed to "n0", "n1", "n2" … except the special prefixes xml and xmlns which are left unchanged. With a value of type HashMap, prefixes are rewritten only for namespaces whose URIs are defined in the enumeration, except for xml and xmlns as described above.

QNameAware of type array of QNameAware, defaulting to []

A set of nodes whose entire content must be processed as QName-valued for the purposes of normalization, including prefix rewriting and recognition of prefix "visible utilization"

ReturnCharacters of type array[QNameAware], defaulting to []

A set of nodes whose contents should be returned as raw UTF-8 characters, not parsed.

TrimTextNodes of type boolean, defaulting to true

Whether to trim (i.e. remove leading and trailing whitespace) all text nodes while normalizing. Adjacent text nodes must be coalesced prior to trimming. If an element has an xml:space="preserve" attribute, then text node descendants of that element are not trimmed regardless of the value of this parameter.

All of these parameters MUST be implemented.
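As an illustration of the TrimTextNodes behaviour, the following sketch coalesces a run of adjacent text chunks before trimming and honours xml:space. The function names and the list-based inputs are hypothetical. Letting the nearest xml:space declaration win (so that a nested xml:space="default" re-enables trimming) is an assumption based on XML 1.0 scoping rather than something this section states.

```python
# Whitespace as defined by XML 1.0: space, tab, carriage return, line feed.
XML_WHITESPACE = " \t\r\n"

def space_preserved(xml_space_stack):
    """Decide whether xml:space="preserve" is in effect.

    `xml_space_stack` holds the xml:space value of each open element,
    outermost first, with None where the attribute is absent.
    Assumption: the nearest declaration wins.
    """
    for value in reversed(xml_space_stack):
        if value is not None:
            return value == "preserve"
    return False

def trim_text_run(chunks, trim_text_nodes=True, preserve=False):
    """Coalesce adjacent text chunks, then trim if the parameters allow it."""
    text = "".join(chunks)              # coalesce before trimming
    if trim_text_nodes and not preserve:
        # Pass the character set explicitly: a bare strip() would also
        # remove Unicode whitespace that XML does not treat as whitespace.
        text = text.strip(XML_WHITESPACE)
    return text
```

For example, a parser that reports the chunks "  this is", " text " and "" for one element would hand the client the single string "this is text".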

In the XML Canonicalization space there were two separate canonicalization algorithms: Inclusive Canonicalization [XML-C14N11] and Exclusive Canonicalization [XML-EXC-C14N]. The major difference between these two algorithms is the treatment of namespace declarations and inherited attributes in the xml: namespace. But in the current version of Canonical XML 2.0, Inclusive canonicalization has been removed completely.

Exclusive canonicalization has been far more popular than inclusive, because of its "portability" property: if a subdocument is signed with exclusive canonicalization, and then this subdocument is moved to a different XML context, the signature on that subdocument remains valid. Inclusive canonicalization doesn't have this portability property; however, it has an advantage over exclusive canonicalization 1.0 when it comes to QNames in content.

Exclusive canonicalization 1.0 only emits namespace declarations that it considers visibly utilized, so if there is a QName embedded in a text node or an attribute node, it doesn't recognize it. For example, in the attribute xsi:type="xsd:string", the "xsd" prefix is embedded in the content, so Exclusive canonicalization 1.0 will not consider the "xsd" prefix to be visibly utilized and hence will not emit the xsd namespace declaration. Not emitting the declaration makes it susceptible to certain wrapping attacks. Exclusive canonicalization 1.0 offers the "InclusiveNamespaces" mechanism to deal with these kinds of prefixes. Any prefixes mentioned in this list will be treated inclusively, i.e. their namespace declarations will be emitted even if they are not used.

XML Normalization addresses the shortcomings of Exclusive Canonicalization 1.0 with the QNameAware parameter. This parameter can be used to list element or attribute nodes that are expected to have QNames. XML Normalization will scan for prefixes in these elements and attributes and consider them to be visibly utilized too. Since this is a superior approach, no equivalent to Inclusive canonicalization is defined in this specification.

Note

The algorithm for prefix scanning doesn't cover all kinds of prefix embedding. For example if a text node's value is a space separated list of QNames, this algorithm will not detect the prefixes of these QNames. It will only detect two kinds of embedding:

When the entire text node or attribute is a QName.
When a text node is an XPath expression containing prefixes.
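The two kinds of embedding can be illustrated with a small sketch. The names qname_prefix and xpath_prefixes are hypothetical and the NCName pattern is simplified. The XPath scan is a rough regular-expression heuristic (it drops string literals and skips axis specifiers such as child::), not a real XPath tokenizer.

```python
import re

_NCNAME = r"[A-Za-z_][\w.\-]*"          # simplified NCName (assumption)
_QNAME = re.compile(rf"(?:({_NCNAME}):)?{_NCNAME}")
_LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"")
_XPATH_PREFIX = re.compile(rf"(?<![\w.\-])({_NCNAME}):(?!:)")

def qname_prefix(value):
    """Return the prefix if the ENTIRE value is a QName.

    Returns "" for an unprefixed name (default namespace) and None when the
    value is not a single QName, e.g. a space-separated list of QNames.
    """
    m = _QNAME.fullmatch(value)
    if m is None:
        return None
    return m.group(1) or ""

def xpath_prefixes(expression):
    """Heuristically collect namespace prefixes used in an XPath expression."""
    without_literals = _LITERAL.sub("", expression)
    return set(_XPATH_PREFIX.findall(without_literals))
```

With these definitions, qname_prefix("xsd:string") yields "xsd", while qname_prefix("xsd:string xsd:int") yields None, which matches the limitation described in the note.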

Inclusive canonicalization also preserves the values of xml: attributes in context; it looks at the ancestors of the subdocument being processed, and collects the value of any inheritable xml attributes, specifically xml:lang, xml:space and xml:base, from these ancestor elements and emits them at the root of the subdocument. Exclusive canonicalization does not do this as this violates the portability requirement. Likewise, XML Normalization ignores these attributes as well.

2.3 Processing Model

The basic normalization process consists of traversing the tree and outputting octets for each node. In DOM mode, this is literally an ordered tree traversal, while in Stream mode the traversal involves the parsing and posting of events for each element and node as it is encountered in the input stream.

Input: The XML subset consisting of an Inclusion list and an Exclusion list.

Processing for DOM mode

Sort inclusion list by document order: If the inclusion list only has the document node D there is nothing to sort. Otherwise remove all element nodes Ei that are descendants of some other element node in the inclusion list. Then sort the remaining element nodes E1, E2, … En by document order.
Normalize each subtree: For each element node Ei or document node D in the sorted list, do a depth first traversal to visit all the descendant nodes in the Ei subtree, and normalize each one of them in-place. While traversing, if the current node is an element and that element is in the exclusion list, prune the traversal, i.e. skip over that element and all its descendants.
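The two DOM-mode steps can be sketched with Python's xml.dom.minidom. This is an illustration under assumptions: visit_subset and its helpers are hypothetical names, document order is derived from a pre-order walk, and the per-node normalization is left to a caller-supplied visit callback rather than performed in place.

```python
from xml.dom import minidom

def is_descendant(node, ancestor):
    """True if `ancestor` appears on the parent chain of `node`."""
    parent = node.parentNode
    while parent is not None:
        if parent is ancestor:
            return True
        parent = parent.parentNode
    return False

def document_order(doc):
    """Map id(node) to its position in a pre-order walk of the document."""
    order = {}
    def walk(node):
        order[id(node)] = len(order)
        for child in node.childNodes:
            walk(child)
    walk(doc)
    return order

def visit_subset(doc, inclusion, exclusion, visit):
    """Visit every node of the subset once, in document order."""
    if any(node is doc for node in inclusion):
        roots = [doc]                  # nothing to sort
    else:
        # Drop elements that are descendants of another included element.
        roots = [e for e in inclusion
                 if not any(is_descendant(e, other) for other in inclusion)]
        order = document_order(doc)
        roots.sort(key=lambda e: order[id(e)])

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE and any(node is x for x in exclusion):
            return                     # prune: skip element and descendants
        visit(node)
        for child in node.childNodes:
            walk(child)

    for root in roots:
        walk(root)
```

Attribute nodes are not children in minidom, so attribute exclusion would need separate handling inside the visit callback.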

Processing for Stream mode

Prepare a stack for storing element names: As each start-element token is encountered, add its QName to the stack. As each end-element token is encountered, it is removed from the top of the stack.
Parse the input octet-stream: Create events according to whether the current QName stack matches an element in the Inclusion list. If it also matches an element in the Exclusion list then the parser MUST NOT post an event. All attributes of an element must be collected prior to posting events for any attribute, so that namespace processing can correctly determine the utilization state of a given namespace.
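A minimal sketch of the Stream-mode bookkeeping, using Python's xml.sax, is shown below. The class name SubsetHandler and its events list are hypothetical. As a simplification, the Inclusion and Exclusion lists are given as tuples of element names from the root rather than as XPath strings, and no normalization of names or values is performed.

```python
import xml.sax

class SubsetHandler(xml.sax.ContentHandler):
    """Record events only for elements inside the included subset."""

    def __init__(self, include, exclude):
        super().__init__()
        self.include = [tuple(p) for p in include]
        self.exclude = [tuple(p) for p in exclude]
        self.stack = []                # QNames of the currently open elements
        self.events = []               # stand-in for events posted to a client

    def _reportable(self):
        path = tuple(self.stack)
        def inside(roots):
            return any(path[:len(r)] == r for r in roots)
        return inside(self.include) and not inside(self.exclude)

    def startElement(self, name, attrs):
        self.stack.append(name)        # push on start-element
        if self._reportable():
            # SAX delivers all attributes together, so namespace processing
            # could inspect them before any event is posted.
            self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        if self._reportable():
            self.events.append(("end", name))
        self.stack.pop()               # pop on end-element

    def characters(self, content):
        if self._reportable():
            self.events.append(("text", content))

handler = SubsetHandler(include=[("r", "a")], exclude=[("r", "a", "skip")])
xml.sax.parseString(b"<r><a x='1'>hi<skip>zz</skip></a><b>no</b></r>", handler)
```

After parsing, handler.events holds the start and end of element a and its text, with nothing from the skip or b elements.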

During traversal of each node (or upon encountering each token type), normalize the value depending on its type as follows:

Root Node— Ignore the byte order mark, XML declaration, and anything from within the document type declaration. Continue traversal.
Element Nodes— Normalize the element's QName as appropriate, and process its child nodes, including attributes and namespaces. If the PrefixRewrite parameter is sequential or predefined, the element's QName will be written with the changed prefix.
If the element is identified by the ReturnCharacters parameter, then the source octet-stream for this element is used to replace the element node with a CDATA node. In Stream mode, all text encountered from the start of the start-element token to the end of the corresponding end-element token is reported as a CDATA block. In neither case is any normalization applied to the identified element or its content.
Attribute Nodes- Normalize the node's QName, and modify its string value. The string value of the node is modified by replacing all ampersands (&) with &amp;, all open angle brackets (<) with &lt;, all quotation mark characters with &quot;, and the whitespace characters #x9, #xA, and #xD, with character references. The character references are written in uppercase hexadecimal with no leading zeroes (for example, #xD is represented by the character reference &#xD;).
If parameter PrefixRewrite is sequential or predefined and the attribute name has a namespace prefix, the prefix is changed to the rewritten prefix. Also with prefix rewriting enabled, the attribute content is treated specially if the attribute is among those enumerated for the QNameAware parameter. If so, the QName value of the attribute is rewritten with the new prefix.
Namespace Nodes- Process according to the namespace processing rules and include if the namespace is considered visibly utilized at this point. Regardless of utilization, the namespace's details should be recorded as 'in-scope' until the end of the current element.
Text Nodes- the string value, except all ampersands are replaced by &amp;, all open angle brackets (<) are replaced by &lt;, all closing angle brackets (>) are replaced by &gt;, and all #xD characters are replaced by &#xD;.
If parameter TrimTextNodes is true and there is no xml:space="preserve" declaration in context, trim the leading and trailing whitespace. E.g. trim <A> <B/> to <A><B/> and trim <A> this is text </A> to <A>this is text</A>. Whitespace is as defined in [XML10] i.e. it consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.
Note
A DOM parser might split up a long text node into multiple adjacent text nodes, and a Stream parser might report multiple consecutive text tokens, some of which may be empty. Be aware when trimming whitespace in such cases; the net result should be equivalent to doing so as if the adjacent text nodes were concatenated.
Note
When any element is treated as character data due to the effects of the ReturnCharacters parameter, the resulting text node/event SHALL NOT be normalized according to these rules.
If parameter PrefixRewrite is sequential or predefined and if the parent element node is among those enumerated for the QNameAware parameter, then the QName value of the text node is rewritten with the new prefix.
Processing Instruction (PI) Nodes- these are not altered during normalization.
Comment Nodes- Deleted (or not reported) if generating normalized XML without comments. For normalized XML with comments, the comment is unchanged by the normalization algorithm.
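The character replacements for attribute and text values can be sketched as below. The replacement strings follow the escaping used by Canonical XML 1.1, which these rules appear to mirror; the function names are hypothetical.

```python
def escape_attribute_value(value):
    """Escape an attribute value: & < " and the whitespace #x9, #xA, #xD."""
    out = value.replace("&", "&amp;")   # must run first to avoid double escaping
    out = out.replace("<", "&lt;").replace('"', "&quot;")
    for ch in "\t\n\r":
        # Uppercase hexadecimal, no leading zeroes: &#x9; &#xA; &#xD;
        out = out.replace(ch, f"&#x{ord(ch):X};")
    return out

def escape_text(value):
    """Escape a text node value: & < > and #xD."""
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace(">", "&gt;")
                 .replace("\r", "&#xD;"))
```

Note that a closing angle bracket is escaped in text but left alone in attribute values, and that tabs and line feeds are escaped only in attribute values.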

Note

Although some XML models such as DOM don't distinguish namespace declarations from attributes, Normalization needs to treat them separately. In this document, attribute nodes that are actually namespace declarations are referred to as "namespace nodes"; other attributes are called "attribute nodes".

2.4 Namespace Processing

As part of the normalization process, while traversing the subtree, use the following algorithm to look at all the namespace declarations in an element, and decide which ones to output.

2.4.1 Namespace concepts

The following concepts are used in Namespace processing:

Explicit and Implicit namespace declarations

In DOM, there is no special node for namespace declarations, they are just present as regular attribute nodes. An "explicit" namespace declaration is an attribute node whose prefix is "xmlns" and whose localName is the prefix being declared.

DOM also allows declaring a namespace "implicitly", i.e. if a new DOM element or attribute is constructed using the createElementNS and createAttributeNS methods, then DOM adds a namespace declaration automatically when serializing the document.

Special namespaces

The "xml" and "xmlns" prefixes are reserved and have special behavior. See [XML-NAMES].

Apex nodes

An apex node is an element node in a document subset having no element node ancestor in the document subset.

Default namespace

The default namespace is declared by xmlns="...". To make the algorithm simpler this will be treated as a namespace declaration whose prefix value is "" i.e. an empty string.

Visibly utilized

This concept is required for exclusive normalization. An element E in the document subset visibly utilizes a namespace declaration, i.e. a namespace prefix P and bound value V, if any of the following conditions are true:

The element E itself has a qualified name that uses the prefix P. (Note if an element does not have a prefix, that means it visibly utilizes the default namespace.)
OR The element E is among those enumerated for the QNameAware parameter, and the QName value of the element uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace)
OR The element E is among those enumerated for the QNameAware parameter, and is listed as an XPathElement. This value of the element is to be interpreted as an XPath 1.0 expression and any prefixes used in this XPath expression are considered to be visibility utilized.
OR An attribute A of that element has a qualified name that uses the prefix P, and that attribute is not in the exclusion list. (Note that unlike elements, if an attribute doesn't have a prefix, that means it is a locally scoped attribute. It does NOT mean that the attribute visibly utilizes the default namespace.)
OR An attribute A of that element is among those enumerated for the QNameAware parameter, and the QName value of the attribute uses the prefix P (or, lacking a prefix, it visibly utilizes the default namespace)
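As an illustration only, the first and fourth conditions above (the element's own name and its qualified attributes) might be sketched in Python as follows. The function names and the representation of names as plain QName strings are assumptions made for this example; QNameAware handling is deliberately omitted, and namespace declarations are assumed to have been separated from the attribute list already.

```python
def split_qname(qname):
    """Split 'p:local' into ('p', 'local'); an unprefixed name yields ('', name)."""
    prefix, sep, local = qname.partition(":")
    return (prefix, local) if sep else ("", qname)


def visibly_utilized(element_qname, attribute_qnames, excluded=frozenset()):
    """Return the set of prefixes visibly utilized by one element.

    Hypothetical helper: `excluded` holds the QNames of attributes that are
    in the exclusion list.
    """
    # An unprefixed element visibly utilizes the default namespace,
    # which is represented here by the empty string.
    prefixes = {split_qname(element_qname)[0]}
    for name in attribute_qnames:
        if name in excluded:
            continue
        prefix, _ = split_qname(name)
        # An unprefixed attribute is locally scoped; it does NOT
        # visibly utilize the default namespace.
        if prefix:
            prefixes.add(prefix)
    return prefixes
```

For instance, an element named wsse:UserName carrying the attributes wsu:Id and type would yield the prefixes "wsse" and "wsu" under this sketch.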

2.4.2 Namespace Prefix Rewriting

When the parameter PrefixRewrite="sequential" or PrefixRewrite="predefined" is set, all the prefixes except "xml" are rewritten to new prefixes. In the normalized output there is a one to one mapping between namespace URIs and rewritten prefixes. E.g. if in the input document fragment, a particular prefix is declared to many different namespace URIs at different parts of the document, during normalization this prefix will get rewritten to different prefixes, one rewritten prefix for each different namespace URI. Similarly if in the input document, many prefixes are declared to the same namespace URI, all of these prefixes will be normalized to the same rewritten prefix.

With PrefixRewrite="sequential" the prefixes are rewritten to "n0", "n1", "n2", … etc.

With PrefixRewrite="predefined" the prefix for any namespace in the predefined set is replaced using the value provided by the input set.

Prefixes are considered for rewriting only when they are visibly utilized, not when they are declared.
Once a namespace URI has been assigned a prefix, it always gets that prefix everywhere in the document.
Element nodes are visited in document order.
At each element node, all the visibly utilized prefixes are considered. The namespace URIs for these visibly utilized prefixes are sorted in lexical order, duplicate namespace URIs are removed, those namespace URIs that have already been assigned prefixes are removed, and then the remaining namespace URIs are assigned prefixes sequentially.

Prefix Rewriting also considers QNames in content, and during normalization the prefixes in these QNames are also rewritten.

Note

With PrefixRewrite="sequential", the normalized output will never have a default namespace, as that is also rewritten into a "nN" style prefix. With PrefixRewrite="predefined" the default namespace is rewritten with an explicit prefix only if one has been specified in the input set. Note that when using predefined it is not possible to promote a namespace to the default by supplying a prefix of "" (the empty string); this is an error.

2.4.3 Namespace processing algorithm

Initialization: For sequential prefix rewriting maintain a counter N. This counter should be set to 0 at the beginning of the normalization process. Also maintain a map of namespace URI to rewritten prefixes; this map should be initialized to empty.

The following steps need to be executed at every Element node E.

Step 1: Create a list of visibly utilized prefixes.

If E itself has a qualified name that uses the prefix P, then P is visibly utilized. Note that if E does not have a prefix, it visibly utilizes the default namespace.
If an attribute A of the element E has a qualified name that uses the prefix P, and that attribute is not in the exclusion list, then P is visibly utilized. Note that, unlike elements, if an attribute doesn't have a prefix, that means it is a locally scoped attribute. It does NOT mean that the attribute visibly utilizes the default namespace.
If there is a QNameAware parameter, check whether E or any of its attributes is enumerated in it as follows:
- If there is an Element subchild whose Name and NS attributes match E's localname and namespace respectively, then E is expected to have a single text node child containing a QName. Extract the prefix from this QName, and consider this prefix as visibly utilized.
- If there is a QualifiedAttr subchild whose Name and NS attributes match the localname and namespace of one of E's qualified attributes, then that attribute is expected to contain a QName. Extract the prefix from the QName and consider this prefix as visibly utilized.
- If there is an UnqualifiedAttr subchild whose Name attribute matches the name of one of E's unqualified attributes, and whose ParentName and ParentNS attributes match E's localname and namespace respectively, then that attribute is expected to contain a QName. Extract the prefix from the QName and consider this prefix as visibly utilized.
- If there is an XPathElement subchild whose Name and NS attributes match E's localname and namespace respectively, then E is expected to have a single text node child containing an XPath 1.0 expression. Extract the prefixes from this XPath by using the following algorithm. All of these extracted prefixes should be considered as visibly utilized.
  * Search for single colons : in the XPath expression, but do not consider single colons inside quoted strings. Double colons are used for axes, e.g. in self::node(), "self" is not a prefix but an axis name.
  * The prefix will be present just before the single colon. Go backwards from the colon, skip whitespace, and extract the prefix by collecting characters until the first character that does not match NCName, e.g. in /soap : Body, extract "soap". The NCName production is defined in [XML-NAMES].
This can be evaluated using Perl-style regular expressions as follows. Note that the regular expressions here are provided as an example only; they are not normative.
1. First remove all single quoted and double quoted strings from the XPath, because prefixes cannot be present there, i.e. do a substitute of s/"[^"]*"//g and s/'[^']*'//g. Removing the quoted strings eliminates false positives in the next step.
2. In the resultant string search for single colons and get the word just before each colon, i.e. search for matches of m/([\w-_.]+)?\s*:(?!:)/. Note that prefixes follow the NCName production, i.e. they consist of alphanumeric characters, hyphens, underscores or dots, but cannot start with a digit, hyphen or dot. In an NCName, the allowed alphanumeric characters are not just ASCII but any Unicode alphanumeric characters; the regular expression provided here is a very simplified form of the NCName production.
- If the PrefixRewrite parameter is set to sequential, each of the prefixes found in the above steps needs to be replaced by a new prefix. For efficiency, consider combining this search for prefixes with the subsequent step of replacing prefixes.
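The XPath prefix extraction described above could be sketched in Python roughly as follows. This is a non-normative illustration under stated assumptions: the function name is hypothetical, the first character of a prefix is approximated with ASCII letters and underscore only, and the pattern is a variant of the Perl expression above that requires a non-empty prefix so that the second colon of an axis separator (::) is not reported as an empty match.

```python
import re

# Quoted string literals cannot contain prefixes, so they are removed first.
_QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')

# Simplified NCName (assumption: ASCII letter or underscore as first character),
# optional whitespace, then a single colon that is not followed by another colon.
_PREFIX = re.compile(r'(?<![\w.\-])([A-Za-z_][\w.\-]*)\s*:(?!:)')


def xpath_prefixes(xpath):
    """Return the set of namespace prefixes that appear in an XPath 1.0 expression."""
    stripped = _QUOTED.sub("", xpath)
    return set(_PREFIX.findall(stripped))
```

With this sketch, the expression /soap : Body yields the prefix "soap", while self::node() yields no prefixes because "self" is followed by a double colon.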

Create a list containing the namespace declarations for these visibly utilized prefixes. Remove the "xml" prefix from this list if present.

Note

XML Normalization never emits the declaration for the xml or xmlns prefixes. As mentioned in [XML-NAMES] a valid XML document should never have the declaration for xmlns, so XML Normalization should never encounter this declaration. Also a valid XML document can optionally declare the xml prefix, but if present it MUST be bound to http://www.w3.org/XML/1998/namespace. XML Normalization SHOULD ignore this declaration.

Step 2: If prefix rewriting is enabled (PrefixRewrite="sequential" or PrefixRewrite="predefined"), then compute new prefixes for all the namespace declarations in the list from Step 1, as follows:

Ignore the prefix value in the namespace declaration, and only take the namespace URI. Put all these namespace URIs in a list.
Sort this list of namespace URIs in lexicographic (ascending) order.
Remove duplicates from this list.
Create a list of rewritten namespace declarations as follows:
Iterate through the namespace URI list - if a namespace URI has already been assigned a prefix, use that. Otherwise:
- If PrefixRewrite="sequential", assign a new prefix value "nN" to the namespace URI, where N is the current value of the counter, and then increment the counter. The counter should be set to 0 at the beginning of the normalization process. (e.g. if the value of this counter was 5 when the traversal reached this element, and this element had 3 prefixes to be output, then use the prefixes "n5", "n6", "n7" and set the counter to 8 after that).
- If PrefixRewrite="predefined", then look in the input set for the namespace's URI. If a match is found, assign the prefix from the match. Otherwise, the prefix remains unchanged.
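Step 2 can be illustrated with a short Python sketch. The function names are hypothetical, the map of rewritten prefixes is modelled as a plain dict from namespace URI to prefix, and Python's default string ordering (by code point) is assumed to be an acceptable stand-in for the lexicographic order required above.

```python
def assign_sequential(namespace_uris, rewritten, counter):
    """Assign "nN" prefixes to URIs not yet present in `rewritten`.

    `rewritten` maps namespace URI -> rewritten prefix and is updated in place.
    Returns the new counter value.
    """
    # sorted(set(...)) sorts the URIs and removes duplicates in one step.
    for uri in sorted(set(namespace_uris)):
        if uri not in rewritten:
            rewritten[uri] = f"n{counter}"
            counter += 1
    return counter


def assign_predefined(namespace_uris, rewritten, predefined):
    """Record prefixes from the predefined input set; other URIs stay unmapped,
    meaning their original prefix remains unchanged."""
    for uri in sorted(set(namespace_uris)):
        if uri not in rewritten and uri in predefined:
            rewritten[uri] = predefined[uri]
```

Because the map persists across elements, a namespace URI that received a prefix at an earlier element keeps that prefix for the rest of the document.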

Step 3: Filter the list to remove prefixes that have already been output.

Take the list of visibly utilized prefix declarations from Step 1, or if Prefix Rewriting is enabled then the modified list from Step 2.
If any of the namespace declarations in this list has already been output during the normalization of one of the element E's ancestors, say Ej, and has not been redeclared to a different value since then, i.e. has not been redeclared by an element between Ej and E, then remove it from this list.

Step 4: Sort this list of namespace declarations in lexicographic (ascending) order of prefixes. In case of prefix rewriting, sort by rewritten prefixes, not original prefixes.
Note that the default namespace declaration has no prefix, so it is considered lexicographically least.

Step 5: Output each of these namespace nodes, as specified in the Processing model.
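Steps 3 and 4 above amount to a filter followed by a sort. A minimal Python sketch, assuming the declarations for the current element and the declarations already output by ancestors are both available as dicts mapping prefix to namespace URI (these names and representations are illustrative, not part of this specification):

```python
def declarations_to_output(declarations, in_scope_output):
    """Return the (prefix, URI) pairs that still need to be output, sorted by prefix.

    declarations:    prefix -> URI for the visibly utilized (possibly rewritten)
                     declarations of the current element.
    in_scope_output: prefix -> URI for declarations already output by an
                     ancestor and not redeclared since.
    """
    pending = [
        (prefix, uri)
        for prefix, uri in declarations.items()
        # Keep the declaration unless the same prefix is already bound
        # to the same URI in the output.
        if in_scope_output.get(prefix) != uri
    ]
    # The default namespace uses the empty prefix "", which sorts first.
    return sorted(pending)
```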

2.4.4 Example of normalization with prefix rewriting

The following XML snippet is used to illustrate the various PrefixRewrite options.

Example 1

<wsse:Security
    xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"
    xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd">
  <wsse:UserName wsu:Id="i1"> ... </wsse:UserName>
  <wsse:Timestamp wsu:Id="i2"> ... </wsse:Timestamp>
</wsse:Security>

2.4.4.1 With `PrefixRewrite="none"`

Example 2

<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
  <wsse:UserName xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="i1"> ... </wsse:UserName>
  <wsse:Timestamp xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="i2"> ... </wsse:Timestamp>
</wsse:Security>

Note how the "wsu" prefix declaration is present in wsse:Security, but is not utilized. Normalization will "push the declaration down" into <UserName> and <Timestamp> where it is really used, i.e. the wsu declaration will be output twice, once in <UserName> and once in <Timestamp>, as shown above.

2.4.4.2 With `PrefixRewrite="sequential"`

Example 3

<n0:Security xmlns:n0="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
  <n0:UserName xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" n1:Id="i1"> ... </n0:UserName>
  <n0:Timestamp xmlns:n1="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" n1:Id="i2"> ... </n0:Timestamp>
</n0:Security>

With sequential prefix rewriting, the "wsse" prefix is rewritten to "n0" and the "wsu" prefix is rewritten to "n1".

2.4.4.3 With `PrefixRewrite="predefined"`

Using the following predefined namespace prefixes:

http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd = "secutil"

Example 4

<wsse:Security xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">
  <wsse:UserName xmlns:secutil="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" secutil:Id="i1"> ... </wsse:UserName>
  <wsse:Timestamp xmlns:secutil="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" secutil:Id="i2"> ... </wsse:Timestamp>
</wsse:Security>

Note that the "wsu" prefix was rewritten to "secutil" while the "wsse" prefix remained unchanged.

2.5 Attribute processing

Note

Namespace declarations are not considered as attributes, they are processed separately as namespace nodes.

Processing the attributes of an element E consists of the following steps:

Ignore any attributes that are present in the exclusion list. However note that namespace nodes cannot be excluded.
Sort all the attributes in increasing lexicographic order with namespace URI as the primary key and local name as the secondary key (an empty namespace URI is lexicographically least).
If it is a qualified attribute and the PrefixRewrite parameter is sequential, modify the QName of the attribute name to use the new prefix, i.e. one of n0, n1, n2, etc. Do not do this for the xml prefix, as this is not changed during prefix rewriting.
If the attribute is among those enumerated by the QNameAware parameter, then change the QName in that attribute value to use the new prefix.
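The exclusion, sorting and renaming steps above might look as follows in Python. This is a sketch under assumptions: attributes are modelled as (namespace URI, local name, value) tuples, the exclusion list as a set of (namespace URI, local name) pairs, and the rewritten-prefix map as a dict that already contains an entry for every namespace URI in use; QNameAware attribute values are not handled here.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"


def process_attributes(attributes, excluded, rewritten):
    """Return a list of (qualified name, value) pairs in normalized order."""
    # Drop excluded attributes, then sort by namespace URI (primary key)
    # and local name (secondary key). The empty URI sorts first.
    kept = [a for a in attributes if (a[0], a[1]) not in excluded]
    kept.sort(key=lambda a: (a[0], a[1]))

    result = []
    for uri, local, value in kept:
        if not uri:
            name = local                  # unqualified attribute: no prefix
        elif uri == XML_NS:
            name = "xml:" + local         # the xml prefix is never rewritten
        else:
            name = rewritten[uri] + ":" + local
        result.append((name, value))
    return result
```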

3. Algorithm for DOM Normalization

This section is non-normative.

This section presents an IDL representation of the normalization algorithm for DOM parsers, with function descriptions in the form of pseudocode.

The DOM normalization algorithm consists of two components: a HashMap, which is a simple dictionary mapping namespace URIs to prefixes; and an interface representing the normalizer functionality itself.

3.1 The `HashMap` type

This section is non-normative.

[Constructor]
interface HashMap {
    readonly attribute unsigned long count;
    getter DOMString valueForKey ([TreatNullAs = EmptyString] DOMString key);
    setter void setValueForKey ([TreatNullAs = EmptyString] DOMString key, DOMString? value);
    void removeAll ();
};

3.1.1 Attributes

This section is non-normative.

count of type unsigned long, readonly

The number of items in the map.

3.1.2 Methods

removeAll

Removes all values in the map.

No parameters.

Return type: void

setValueForKey

Assigns a value for a key. A null value removes the entry from the map.

Parameter	Type	Nullable	Optional	Description
key	DOMString	✘	✘
value	DOMString	✔	✘

valueForKey

Fetches a value stored by the given key.

Parameter	Type	Nullable	Optional	Description
key	DOMString	✘	✘

3.2 The `DOMNormalizer` Interface

[Constructor]
interface DOMNormalizer {
    readonly attribute unsigned int prefixCounter;
    readonly attribute HashMap rewrittenPrefixes;
    attribute Parameters properties;
    attribute DOMString[] outputPrefixes;
    void normalize (object<> inclusionList, object<> exclusionList);
    void normalizeSubtree (object node);
    void processNode (object node, HashMap namespaceContext);
    void processDocument (object documentNode, HashMap namespaceContext);
    void processElement (object elementNode, HashMap namespaceContext);
    void processText (object textNode, HashMap namespaceContext);
    void processComment (object commentNode, HashMap namespaceContext);
    void addNamespaces (object elementNode, HashMap namespaceContext);
    DOMString[] processNamespaces (object elementNode, HashMap namespaceContext);
};

3.2.1 Attributes

outputPrefixes of type array of DOMString,

An array of prefixes which have been output and are thus 'in scope' for the current element.

prefixCounter of type unsigned int, readonly

This is a counter only for prefix rewriting in sequential mode. It is initialized to zero.

properties of type Parameters,

The parameters to the normalization process.

rewrittenPrefixes of type HashMap, readonly

A hash table of uri -> rewrittenPrefix. It is initialized to empty. Finding out the rewritten prefix for an original prefix is a two step lookup: first look up the URI for the original prefix in the namespaceContext hash table, then look up the rewritten prefix for the URI in the rewrittenPrefixes hash table.

3.2.2 Methods

addNamespaces

Add namespaces from this element to the namespace context. This function is called for every ancestor element, and also at every element of the subtrees (minus the exclusion set and any subtrees of elements identified by the properties.ReturnCharacters array).

Pseudocode:

addNamespaces(element, namespaceContext)
{
    for each explicit and implicit namespace declaration in the element
    {
        if namespaceContext already has this prefix with the same URI
        {
            do nothing
        }
        else if namespaceContext already has this prefix with a different URI
        {
            update the namespaceContext hash table with the new prefix -> URI mapping

            if this prefix exists in outputPrefixes
                remove it
        }
        else if namespaceContext doesn't have this prefix
        {
            add the new prefix -> URI mapping to the namespaceContext
        }
    }
}

Parameter	Type	Nullable	Optional	Description
elementNode	object	✘	✘
namespaceContext	HashMap	✘	✘

Return type: void
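For readers who prefer executable code, the addNamespaces logic might be rendered in Python as below. The signature is an assumption made for this sketch: the element's declarations are passed as a dict of prefix to URI, the namespace context is a dict, and outputPrefixes is modelled as a set that is passed in explicitly rather than held on the normalizer.

```python
def add_namespaces(declarations, namespace_context, output_prefixes):
    """Merge one element's namespace declarations into the namespace context.

    declarations:      prefix -> URI declared on the element ("" = default namespace)
    namespace_context: prefix -> URI currently in scope (updated in place)
    output_prefixes:   set of prefixes already output (updated in place)
    """
    for prefix, uri in declarations.items():
        if namespace_context.get(prefix) == uri:
            continue  # same binding already in scope: nothing to do
        if prefix in namespace_context:
            # The prefix is being rebound to a different URI, so any earlier
            # output of it no longer applies to descendants.
            output_prefixes.discard(prefix)
        namespace_context[prefix] = uri
```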

normalize

The top-level normalization function.

Pseudocode:

normalize(list of subtrees, list of exclusion elements and attributes)
{
    put the exclusion elements and attributes in a hash table for easier lookup

    sort the multiple subtrees by document order

    for each subtree
    {
        normalizeSubtree(subtree)
    }
}

Parameter	Type	Nullable	Optional	Description
inclusionList	object<>	✘	✘
exclusionList	object<>	✘	✘

Return type: void

normalizeSubtree

Normalize an individual subtree.

Pseudocode:

normalizeSubtree(node)
{
    if (node is the document node or a document root element)
    {
        // (whole document is being processed, no ancestors to worry about)
        processNode(node)
    }
    else
    {
        starting from the element, walk up the tree to collect a list of ancestors

        for each of this node's ancestor elements starting with the document
        root, but not including the element itself
            addNamespaces(element)

        processNode(node)
    }
}

Parameter	Type	Nullable	Optional	Description
node	object	✘	✘

Return type: void

processComment

Process a Comment node.

Pseudocode:

processComment(commentNode, namespaceContext)
{
    if properties.IgnoreComments
        remove the node from the DOM
}

Parameter	Type	Nullable	Optional	Description
commentNode	object	✘	✘
namespaceContext	HashMap	✘	✘

Return type: void

processDocument

Process the Document node.

Pseudocode:

processDocument(document, namespaceContext)
{
    for (each child node)
    {
        processNode(child, namespaceContext)
    }
}

Parameter	Type	Nullable	Optional	Description
documentNode	object	✘	✘
namespaceContext	HashMap	✘	✘

Return type: void

processElement

Process an Element node.

Pseudocode:

processElement(elementNode, namespaceContext)
{
    if elementNode exists in the exclusion hash table
        return

    if elementNode is listed in properties.ReturnCharacters
    {
        serialize elementNode as UTF-8 text
        replace elementNode with a text node containing that text
        return
    }

    make copies of namespaceContext and outputPrefixes in the stack
    // (by copying, any changes made can be undone when this function returns)

    nsToBeOutputList = processNamespaces(elementNode, namespaceContext)
    attributeList = []

    if (properties.PrefixRewrite != "none")
    {
        determine the namespace for the element and update its prefix according to
            namespaceContext and rewrittenPrefixes
        elementNode.namespace.prefix = new prefix value
    }

    for each of the namespaces in the nsToBeOutputList
        add appropriate "xmlns" attribute to attributeList

    for each non-namespace attribute in the element
    {
        replace/apply namespace prefix according to properties.PrefixRewrite
        if the element is in properties.QNameAware
            adjust prefixes within its content as appropriate

        add attribute to attributeList
    }

    elementNode.attributes = attributeList

    for each child node
        processNode(child, copy(namespaceContext))

    remove namespace prefixes in nsToBeOutputList from outputPrefixes
}

Parameter	Type	Nullable	Optional	Description
elementNode	object	✘	✘
namespaceContext	HashMap	✘	✘

Return type: void

processNamespaces

Process the list of namespaces for this element.

Pseudocode:

processNamespaces(element) {
addNamespaces(element)

create a list of visibly utilized prefixes - visiblePrefixes, which includes
    a) the prefix used by the element itself
    b) the prefix used by all the qualified attributes of the element
    c) the prefix embedded in the attribute value of any QName aware attributes
    d) the prefix embedded in the any text node child, if QName aware

if properties.PrefixRewrite != "none"
{
    newNamespaceURIs = []    // empty List

    for each prefix in visiblePrefixes
        get the URI for this prefix from the namespaceContext hash table
        check if the URI already exists in the rewrittenPrefixes hash table
        if it does not add the URI to newNamespaceURIs

    sort the newNamespaceURIs list in lexical order

    if properties.PrefixRewrite = "sequential"
    {
        for each URI in the newNamespaceURIs list
            assign a prefix "nN" where N is value of prefixCounter
            increment prefixCounter by 1
            add the mapping URI -> nN  into the rewrittenPrefixes hash table
    }
    else if properties.PrefixRewrite is a HashMap
    {
        for each URI in the newNamespaceURIs list
            lookup the prefix for this URI in properties.PrefixRewrite
            if there is a prefix
                add the mapping URI -> prefix into rewrittenPrefixes
    }
}

nsToBeOutputList = [] // empty list of prefix -> URI mappings

for each prefix in visiblePrefixes
{
    find the URI that this prefix maps to in the namespaceContext hash table

    if PrefixRewrite != "none"
        convert this prefix to rewrittenPrefix, by using the URI to
        look up the rewrittenPrefix in the rewrittenPrefixes hash table

    if this prefix (original or rewritten) does not exist in outputPrefixes
        add this prefix to outputPrefixes
        add the prefix -> URI mapping to nsToBeOutputList
}

sort the nsToBeOutputList by the prefix

return nsToBeOutputList

}

Parameter	Type	Nullable	Optional	Description
elementNode	object	✘	✘
namespaceContext	HashMap	✘	✘

processNode

Redirects to the appropriate node processing function.

Pseudocode:

processNode(node, namespaceContext)
{
    call the appropriate function - processDocument, processElement,
    processText, ... depending on the node type.
}

Parameter	Type	Nullable	Optional	Description
node	object	✘	✘
namespaceContext	HashMap	✘	✘

Return type: void

processText

Process a Text node.

Pseudocode:

processText(textNode) {
if this text node is outside document root
    return

in the text replace 
   all ampersands by &amp;, 
   all open angle brackets (<) by &lt;, 
   all closing angle brackets (>) by &gt;, 
   and all #xD characters by &#xD;.

if properties.TrimTextNodes is true and there is no xml:space="preserve"
        declaration in scope
{
    if previous node was not a text node
        trim leading whitespace
    if next node is not a text node
        trim trailing whitespace
}

if properties.PrefixRewrite != "none" and this text node is a child of
        a QName aware element
{
    search for embedded prefixes, and replace with rewritten prefixes
}

replace the text content of the node with the modified text

}

Parameter	Type	Nullable	Optional	Description
textNode	object	✘	✘
namespaceContext	HashMap	✘	✘

Return type: void
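The escaping and trimming parts of processText can be sketched in Python as follows. The function name and its boolean flags are assumptions for illustration; the caller is expected to decide whether trimming applies (TrimTextNodes set, no xml:space="preserve" in scope, and whether the neighbouring nodes are text nodes). The sketch follows the order of the pseudocode above, so carriage returns are escaped before any trimming takes place.

```python
def normalize_text(text, trim_leading=False, trim_trailing=False):
    """Escape special characters in a text node and optionally trim whitespace."""
    # Ampersands are escaped first so that the entities introduced by the
    # later replacements are not themselves escaped again.
    text = (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace("\r", "&#xD;")
    )
    # Carriage returns have already been escaped above, so only space,
    # tab and line feed remain as trimmable whitespace.
    if trim_leading:
        text = text.lstrip(" \t\n")
    if trim_trailing:
        text = text.rstrip(" \t\n")
    return text
```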

4. Algorithm for Streaming Normalization

This section is non-normative.

Unlike DOM parsers, which represent an XML document as a tree of nodes, streaming parsers represent an XML document as a stream of events such as "start-element", "end-element", "text" etc. A document subset can also be represented as a stream of events. This stream of events is in exactly the same order as a tree walk, so the same approach can also be used to normalize an event stream. Below is a description of the SAX2 [SAX] event-handler interface with comments on the application of normalization to the generated events.

Since this algorithm and that employed for StAX [XML-PARSER-STAX] rely on much the same parsing events, we leave the application of this algorithm to a 'pull' parser up to the reader.

4.1 The `ElementStack` Type

This section is non-normative.

The ElementContext dictionary is used to store information about a single element. One of these is pushed onto the stack during processing of a startElement() event, and it is removed while processing the corresponding endElement() event.

dictionary ElementContext {
    HashMap namespaceContext = [];
    DOMString[] outputPrefixes = [];
    DOMString elementQName = "";
    DOMString localName = "";
    DOMString prefix = "";
    boolean isQNameAware = false;
};

4.1.1 Dictionary ElementContext Members

This section is non-normative.

elementQName of type DOMString, defaulting to ""

The QName of the current element.

isQNameAware of type boolean, defaulting to false

Whether the element is QName aware and must have its contents scanned for mapped prefixes.

localName of type DOMString, defaulting to ""

The element's unqualified name. This may be derived from the elementQName property.

namespaceContext of type HashMap, defaulting to []

The current namespace mapping context, for use with prefix rewriting.

outputPrefixes of type array of DOMString, defaulting to []

The list of namespace prefixes currently output, i.e. those that are considered visibly utilized at present.

prefix of type DOMString, defaulting to ""

The element's namespace prefix. This may be derived from the elementQName property.

The ElementStack interface implements a basic stack of ElementContext dictionaries. Its push() operation duplicates some of the properties of the current top-of-stack object for you.

[Constructor]
interface ElementStack {
    unsigned int count ();
    ElementContext push (DOMString QName);
    ElementContext top ();
    void pop (DOMString QName);
};

4.1.2 Methods

count

Returns the number of items on the stack.

No parameters.

pop

Checks the topmost ElementContext to ensure that it matches the given QName, and removes it from the stack if it matches. If it does not match, a DOMException is raised.

Parameter	Type	Nullable	Optional	Description
QName	DOMString	✘	✘

Return type: void

push

Creates a duplicate of the ElementContext on the top of the stack and replaces its elementQName, localName, and prefix properties based on the provided QName parameter. The new object is placed on top of the stack and returned.

Parameter	Type	Nullable	Optional	Description
QName	DOMString	✘	✘

Return type: ElementContext

top

Returns the topmost ElementContext without modifying the stack.

No parameters.

Return type: ElementContext
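A minimal Python sketch of the ElementContext dictionary and ElementStack interface described above is given below. It is illustrative only: the snake_case names are Python renderings of the IDL members, a plain dict stands in for HashMap, and ValueError stands in for the DOMException mentioned for pop(), since Python has no such type.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ElementContext:
    namespace_context: dict = field(default_factory=dict)
    output_prefixes: list = field(default_factory=list)
    element_qname: str = ""
    local_name: str = ""
    prefix: str = ""
    is_qname_aware: bool = False


class ElementStack:
    def __init__(self):
        self._items = []

    def count(self):
        return len(self._items)

    def top(self):
        return self._items[-1]

    def push(self, qname):
        # Duplicate the current top so the child starts from its parent's
        # namespace state; a deep copy keeps later changes local to the child.
        ctx = copy.deepcopy(self._items[-1]) if self._items else ElementContext()
        prefix, sep, local = qname.partition(":")
        ctx.element_qname = qname
        ctx.prefix, ctx.local_name = (prefix, local) if sep else ("", qname)
        self._items.append(ctx)
        return ctx

    def pop(self, qname):
        if not self._items or self._items[-1].element_qname != qname:
            raise ValueError(f"end tag {qname!r} does not match the open element")
        self._items.pop()
```

Copying on push means that popping an element automatically discards any namespace bindings it introduced, which mirrors the "make copies ... in the stack" step of the DOM algorithm.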

4.2 SAX2 Events

The following interface describes some events defined by the SAX2 parser specification. Any events not enumerated below are unchanged by this algorithm.

interface SAXEvents {
    const readonly int StartDocument = 1;
    const readonly int EndDocument = 2;
    const readonly int StartElement = 3;
    const readonly int EndElement = 4;
    const readonly int Characters = 5;
    const readonly int IgnorableWhitespace = 6;
    const readonly int ProcessingInstruction = 7;
    const readonly int Comment = 8;
    const readonly int CDATABlock = 9;
    const readonly int StartPrefixMapping = 10;
    const readonly int EndPrefixMapping = 11;
};

4.2.1 Constants

CDATABlock of type readonly int

The event contains some of the raw character data from within an XML <![CDATA[...]]> block.

Characters of type readonly int

Characters from a text node (but not a CDATA node) will be posted using this event. According to [SAX], this may contain raw entity codes; for normalization, entity replacement MUST be enabled. Thus any occurrences of &amp; will be replaced by the resulting & character, and so on.

Comment of type readonly int

An XML comment of the form <!-- A Comment --> was parsed. The event contains the text content of the comment, i.e. A Comment.

EndDocument of type readonly int

The document's outermost element has been closed. This is preceded by the EndElement event for that element.

EndElement of type readonly int

A closing element tag has been parsed. In the case of a self-closing or 'empty' element, this event will follow directly from the StartElement event for this same element.

EndPrefixMapping of type readonly int

The element containing an xmlns attribute has been closed.

IgnorableWhitespace of type readonly int

The parser encountered some whitespace characters which may be safely ignored; they are present for formatting purposes only and have no semantic or lexical meaning.

ProcessingInstruction of type readonly int

An XML processing instruction of the form <?name param1="1" param2="2"?> has been parsed. The event provides the name component along with the remaining characters as a single character string (i.e. param1="1" param2="2").

StartDocument of type readonly int

The start of the document was encountered. The next event will be a StartElement for the document's outermost element.

StartElement of type readonly int

An element's opening tag has been parsed. Information on the element's namespace and all attached attributes is included with this event.

StartPrefixMapping of type readonly int

The parser has encountered an xmlns attribute and has mapped a prefix to a URI.

4.3 SAX2 Normalization Algorithm

Below is a partial definition of a SAX2 event handler interface. The documentation for each event defines how the parser should normalize the parameters for that event.

Note

Note that handling of characters when TrimTextNodes is true involves buffering each Characters event until the next event arrives. If the next event is not also Characters, then the buffered text has trailing whitespace trimmed and its event is posted to the client. If TrimTextNodes is false, then no buffering occurs.

[Constructor]
interface SAX2Normalizer {
    attribute ElementStack elementStack;
    attribute Parameters normalizationParameters;
    attribute char[] currentCharacters;
    attribute HashMap pendingNamespaces;
    attribute int rewriteCounter;
    attribute HashMap rewrittenPrefixes;
    void postStartPrefixMappingEvent (DOMString prefix, DOMString uri);
    void postStartElementEvent (DOMString uri, DOMString localName, DOMString qName, object[] attrList);
    void postEndElementEvent (DOMString uri, DOMString localName, DOMString qName);
    void postIgnorableWhitespace (char[] text);
    void postComment (char[] comment);
    void postCDATA (char[] data);
    void postCharacters (char[] text);
};

4.3.1 Attributes

currentCharacters of type array of char,

When normalizationParameters.TrimTextNodes is true, the text for a Characters event is first placed into this variable. The event function is passed these characters once the following event has been received. In this way, the parser can determine whether to trim whitespace from the end of the string without accumulating the entire text block in memory.

elementStack of type ElementStack,

A stack of element information representing the current path into the XML document's tree. A new ElementContext is pushed upon each SAXEvents.StartElement event, and is popped upon the corresponding SAXEvents.EndElement event.

normalizationParameters of type Parameters,

All normalization parameters are stored here.

pendingNamespaces of type HashMap,

Records all namespace prefix to URI mappings reported by StartPrefixMapping events.

rewriteCounter of type int,

When normalizationParameters.PrefixRewrite is "sequential", this attribute is used to generate the new, numbered prefixes. It is initialized to zero.

rewrittenPrefixes of type HashMap,

A map of namespace URIs to prefixes, containing only those which have been reassigned in accordance with normalizationParameters.PrefixRewrite.

4.3.2 Methods

postCDATA

As per XML Canonicalization, CDATA sections are replaced with their character content. This method instead posts a Characters event.

Parameter	Type	Nullable	Optional	Description
data	char[]	✘	✘

Return type: void

postCharacters

Certain characters are replaced with character entities and the characters are either posted directly or, if TrimTextNodes is enabled, they are buffered in case of needing to trim trailing whitespace based on the type of the next event.

Pseudocode:

void postCharacters(text)
{
    if normalizationParameters.TrimTextNodes is true
    {
        if currentCharacters is empty
        // better: if previous event was not EndElement, Characters, or CDATA
        {
            // start of a text node
            trim leading whitespace
        }
        else
        {
            output any buffered characters (no trimming)
            currentCharacters := []
        }
    }

    replace all instances of "&" with "&amp;"
    replace all instances of "<" with "&lt;"
    replace all instances of ">" with "&gt;"
    replace all carriage returns ('\r') with "&#xD;"
    replace all tabs ('\t') with "&#x9;"

    if normalizationParameters.TrimTextNodes is true
    {
        currentCharacters := text
    }
    else
    {
        post the event immediately: characters(text)
    }
}

Parameter	Type	Nullable	Optional	Description
text	char[]	✘	✘

Return type: void
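The buffering behaviour of postCharacters can be sketched as follows. This is an illustration under stated assumptions rather than the normative algorithm: the `CharacterBuffer` class, its `sink` callback and the `flush` method are hypothetical names, ">" is escaped as the predefined entity "&gt;", and the sketch buffers the unescaped text and escapes it only on output so that trailing tabs and carriage returns can still be recognized as whitespace. It also assumes that a text node that is empty after trimming is not posted.

```python
XML_WHITESPACE = " \t\r\n"


def escape_text(text):
    # "&" is replaced first so the entities added afterwards are not re-escaped.
    for raw, entity in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
                        ("\r", "&#xD;"), ("\t", "&#x9;")):
        text = text.replace(raw, entity)
    return text


class CharacterBuffer:
    """Hypothetical helper that defers the last chunk of a text node."""

    def __init__(self, trim_text_nodes, sink):
        self.trim = trim_text_nodes
        self.sink = sink        # callable that receives each posted payload
        self.pending = None     # plays the role of currentCharacters

    def post_characters(self, text):
        if not self.trim:
            self.sink(escape_text(text))
            return
        if not self.pending:
            # Start of a text node (nothing meaningful buffered yet).
            text = text.lstrip(XML_WHITESPACE)
        else:
            # More text follows, so the buffered chunk is not the end: no trimming.
            self.sink(escape_text(self.pending))
        self.pending = text

    def flush(self):
        """Call before posting any non-character event."""
        if self.pending:
            tail = self.pending.rstrip(XML_WHITESPACE)
            if tail:
                self.sink(escape_text(tail))
        self.pending = None
```

Only one chunk is ever held back, which is what allows trailing whitespace to be trimmed without accumulating the whole text node in memory.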

If IgnoreComments is true, does not post the event.

Parameter	Type	Nullable	Optional	Description
comment	char[]	✘	✘

Return type: void

postEndElementEvent

End element events only require prefix rewriting for the qName parameter, if appropriate.

Pseudocode:

void postEndElementEvent(uri, localName, qName)
{
trim and post any buffered characters

context := elementStack.top()
elementStack.pop(qName)    // throws an exception if qNames do not match

if qName has a prefix and normalizationParameters.PrefixRewrite is not "none"
{
    if rewrittenPrefixes(uri) is not null
    {
        prefix := rewrittenPrefixes(uri)
        qName := prefix + ":" + localName
    }
}

post event: endElement(uri, localName, qName)

}

Parameter	Type	Nullable	Optional
uri	DOMString	✘	✘
localName	DOMString	✘	✘
qName	DOMString	✘	✘

Return type: void
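The qName rewriting step for end elements can be sketched as a small function. The function name and argument layout are hypothetical, and the sketch assumes that an unprefixed qName, or a namespace URI with no entry in the rewrittenPrefixes map, leaves the qName unchanged so that it matches what was posted for the corresponding start element.

```python
def rewrite_end_qname(uri, local_name, qname, prefix_rewrite, rewritten_prefixes):
    """Return the qName to report in the EndElement event."""
    if prefix_rewrite == "none" or ":" not in qname:
        return qname
    prefix = rewritten_prefixes.get(uri)
    if prefix is None:
        # Assumption: no rewritten prefix was recorded for this namespace.
        return qname
    return prefix + ":" + local_name
```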

postIgnorableWhitespace

If TrimTextNodes is true, does not post the event.

Parameter	Type	Nullable	Optional	Description
text	char[]	✘	✘

Return type: void

postStartElementEvent

When a start element event is to be sent, the following additional processing occurs to modify the parameters of that event. Note that attribute values are also normalized according to section 3.3.3 of [XML10].

Pseudocode:

void postStartElementEvent(uri, localName, qName, attrList)
{
trim and post any buffered characters

if normalizationParameters.ReturnCharacters references this element
{
    postEvent(CDATABlock, element outer XML)
    skip processing of element subtree and EndElement event
    return
}

context := elementStack.push(qName)

for each [prefix, nsUri] pair in pendingNamespaces
{
    if context.namespaceContext(prefix) does not match nsUri
    {
        context.namespaceContext(prefix) := nsUri
        context.outputPrefixes(prefix) := null  // remove from outputPrefixes
    }
}

pendingNamespaces.removeAll()

for each xmlns or xmlns:prefix attribute in attrList
{
    remove attribute from attrList
}

if element is QName aware
    context.isQNameAware = true

// get a HashMap of prefix -> uri
// this also rewrites contents of QNameAware attributes
usedNamespaces := visiblyUsedNamespaces(context, attrList)

if qName has a prefix and normalizationParameters.PrefixRewrite is not "none"
{
    prefix := element prefix
    if rewrittenPrefixes(uri) is not null
    {
        prefix := rewrittenPrefixes(uri)
    }
    else if normalizationParameters.PrefixRewrite is "sequential"
    {
        prefix := "nN" where N is the value of rewriteCounter
        increment rewriteCounter
        rewrittenPrefixes(uri) := prefix
    }
    else if normalizationParameters.PrefixRewrite is a HashMap and it contains a value for the uri
    {
        prefix := normalizationParameters.PrefixRewrite(uri)
        rewrittenPrefixes(uri) := prefix
    }
    
    qName := prefix + ":" + localName
}

append any default attributes for the element to attrList

for each [name, value] in attrList
{
    if name has a prefix other than 'xml' and normalizationParameters.PrefixRewrite is not "none"
    {
        // all prefixes have been enumerated by now
        split name into prefix and local
        attrUri := context.namespaceContext(prefix)
        if rewrittenPrefixes(attrUri) is not null
        {
            prefix := rewrittenPrefixes(attrUri)
            name := prefix + ":" + local     // replace name in attrList
        }
    }
    
    normalize attribute value
}

for each [prefix, uri] pair in usedNamespaces
{
    if prefix is an empty string
    {
        insert new attribute with name "xmlns" and value uri at start of attributes
    }
    else
    {
        insert new attribute with name "xmlns:" + prefix and value uri at start of attributes
    }
}

post event: startElement(uri, localName, qName, attrList)

}

Parameter	Type	Nullable	Optional
uri	DOMString	✘	✘
localName	DOMString	✘	✘
qName	DOMString	✘	✘
attrList	object[]	✘	✘

Return type: void
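The attribute-handling part of postStartElementEvent (dropping the original namespace declarations, renaming prefixed attributes, and re-emitting declarations for the namespaces in use) can be sketched as below. The function name and its arguments are hypothetical. The sketch assumes that `used_namespaces` already maps output prefixes to URIs and that declarations are emitted in dictionary order ahead of the other attributes; value normalization and default attributes are left out.

```python
def rewrite_attributes(attrs, namespace_context, rewritten_prefixes, used_namespaces):
    """Return the attribute list to pass to the StartElement event."""
    result = []
    for name, value in attrs:
        if name == "xmlns" or name.startswith("xmlns:"):
            continue  # original declarations are dropped and re-emitted below
        prefix, sep, local = name.partition(":")
        if sep and prefix != "xml":
            # Look up the attribute's namespace, then any prefix rewritten for it.
            new_prefix = rewritten_prefixes.get(namespace_context.get(prefix))
            if new_prefix is not None:
                name = new_prefix + ":" + local
        result.append((name, value))

    declarations = []
    for prefix, uri in used_namespaces.items():
        # An empty prefix denotes the default namespace.
        declarations.append(("xmlns:" + prefix if prefix else "xmlns", uri))
    return declarations + result
```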

postStartPrefixMappingEvent

Stores the mapping in pendingNamespaces; all pending mappings are placed into an element's context during the next StartElement event.

Parameter	Type	Nullable	Optional	Description
prefix	DOMString	✘	✘
uri	DOMString	✘	✘

Return type: void
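The hand-off from pendingNamespaces to an element's context can be sketched as follows. The class, the method names and the dictionary-based context are assumptions made for illustration; in particular, outputPrefixes is modelled here as a set of prefixes that have already been declared in the output.

```python
class NamespaceTracker:
    """Hypothetical holder for mappings awaiting the next StartElement."""

    def __init__(self):
        self.pending = {}  # pendingNamespaces: prefix -> URI

    def start_prefix_mapping(self, prefix, uri):
        self.pending[prefix] = uri

    def apply_to(self, context):
        # context is a dict with "namespaces" (prefix -> URI) and
        # "output_prefixes" (set of prefixes already declared in the output).
        for prefix, uri in self.pending.items():
            if context["namespaces"].get(prefix) != uri:
                context["namespaces"][prefix] = uri
                # The binding changed, so the prefix must be declared again.
                context["output_prefixes"].discard(prefix)
        self.pending.clear()
```

A mapping that repeats a binding already in scope leaves the context untouched, so no redundant declaration is produced for it.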

5. Output rules

All text is encoded in UTF-8.
Line breaks are normalized to #xA on input (automatically done by a DOM parser).
Attribute values are normalized according to XML 1.0 [XML10] section 3.3.3.
Whitespace outside of the document element and within start and end tags is normalized.
Special characters in attribute values and character content are replaced by character references.
Default attributes are added to each element.
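Two of the rules above, line-break normalization and attribute-value normalization, can be sketched as follows. The function names are hypothetical. The sketch covers only the whitespace handling of XML 1.0 section 3.3.3 and assumes that entity and character references have already been dealt with by the parser.

```python
import re


def normalize_line_breaks(text):
    # XML 1.0 section 2.11: "\r\n" and a lone "\r" both become "\n" (#xA).
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_attribute_value(value, is_cdata=True):
    # Line ends are normalized first, so "\r\n" yields a single space below.
    value = normalize_line_breaks(value)
    # Each remaining literal whitespace character becomes one space (#x20).
    value = re.sub(r"[ \t\n]", " ", value)
    if not is_cdata:
        # Non-CDATA attribute types also collapse runs of spaces and
        # drop leading and trailing spaces.
        value = re.sub(r" +", " ", value).strip(" ")
    return value
```

For example, a CDATA attribute keeps one space per original whitespace character, while a tokenized type such as NMTOKENS ends up with single spaces between tokens.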

A. References

Dated references below are to the latest known or appropriate edition of the referenced work. The referenced works may be subject to revision, and conformant implementations may follow, and are encouraged to investigate the appropriateness of following, some or all more recent editions or replacements of the works cited. It is in each case implementation-defined which editions are supported.

A.1 Normative references

[RFC2119]

S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Internet RFC 2119. URL: http://www.ietf.org/rfc/rfc2119.txt

[XML-C14N11]

John Boyer; Glenn Marcy. Canonical XML Version 1.1. 2 May 2008. W3C Recommendation. URL: http://www.w3.org/TR/2008/REC-xml-c14n11-20080502/

[XML-NAMES]

Richard Tobin et al. Namespaces in XML 1.0 (Third Edition). 8 December 2009. W3C Recommendation. URL: http://www.w3.org/TR/2009/REC-xml-names-20091208/

[XML10]

C. M. Sperberg-McQueen et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/2008/REC-xml-20081126/

[XMLDSIG-CORE2]

Mark Bartel; John Boyer; Barb Fox et al. XML Signature Syntax and Processing Version 2.0. 24 January 2012. W3C Candidate Recommendation. (Work in progress.) URL: http://www.w3.org/TR/2012/CR-xmldsig-core2-20120124/

[XPATH]

James Clark; Steven DeRose. XML Path Language (XPath) Version 1.0. 16 November 1999. W3C Recommendation. URL: http://www.w3.org/TR/1999/REC-xpath-19991116/

A.2 Informative references

[C14N2-TestCases]

Pratik Datta; Frederick Hirsch. Test Cases for Canonical XML 2.0. 5 January 2012. W3C First Public Working Draft. URL: http://www.w3.org/2008/xmlsec/Drafts/c14n-20/test-cases/

[DOM-LEVEL-2-CORE]

Arnaud Le Hors et al. Document Object Model (DOM) Level 2 Core Specification. 13 November 2000. W3C Recommendation. URL: http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/

[SAX]

D. Megginson, et al. SAX: The Simple API for XML. May 1998. URL: http://www.megginson.com/downloads/SAX/

[XML-C14N]

John Boyer. Canonical XML Version 1.0. 15 March 2001. W3C Recommendation. URL: http://www.w3.org/TR/2001/REC-xml-c14n-20010315

[XML-EXC-C14N]

Donald E. Eastlake 3rd; Joseph Reagle; John Boyer. Exclusive XML Canonicalization Version 1.0. 18 July 2002. W3C Recommendation. URL: http://www.w3.org/TR/2002/REC-xml-exc-c14n-20020718/

[XML-PARSER-STAX]

Christopher Fry. JSR 173: Streaming API for XML for Java Specification. 8 October 2003. v1.0. URL: http://jcp.org/en/jsr/detail?id=173

[XMLDSIG-XPATH-FILTER2]

Merlin Hughes; John Boyer; Joseph Reagle. XML-Signature XPath Filter 2.0. 8 November 2002. W3C Recommendation. URL: http://www.w3.org/TR/2002/REC-xmldsig-filter2-20021108/

