Polyglot Markup: A robust profile of the HTML5 vocabulary
Abstract
A document that uses polyglot markup is a document that is a stream of bytes that parses into identical document trees (with some exceptions, as noted in the Introduction) when processed either as HTML or when processed as XML. Polyglot markup that meets a well-defined set of constraints is interpreted as compatible, regardless of whether it is processed as HTML or as XHTML, per the HTML5 specification. Polyglot markup uses a specific DOCTYPE, namespace declarations, and a specific case—normally lower case but occasionally camel case—for element and attribute names. Polyglot markup uses lower case for certain attribute values. Further constraints include those on void elements, named entity references, and the use of scripts and style.
Status of This Document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
Beware. This specification is no longer in active maintenance and the HTML Working Group does not intend to maintain it further.
This specification summarizes design guidelines for authors who wish their XHTML or HTML documents to be conforming whether parsed as HTML or as XML. The document is intended to be useful to web authors, in particular those who want to serve receivers without concern for whether they have XML or HTML parsers available. Such concerns may, for instance, arise in content syndication or when receivers are on legacy systems. HTML polyglots facilitate migration to and from XHTML, including transition from XML 1.x to HTML5, and this document serves to accurately specify the requirements of a UTF-8 based profile for such documents.
No recommendation is made in this document or by the W3C regarding whether or not to publish polyglot content. In general, authors are encouraged to publish HTML content using HTML5 syntax and media types (either HTML syntax and text/html, or XHTML syntax and application/xhtml+xml).
This document is not a specification for user agents and creates no obligations on user agents. Note that this document does not define how HTML5-conforming user agents should process HTML documents. Nor does it define the meaning of the Internet Media Type text/html. For user agent guidance and for these definitions, see [HTML5] and [RFC2854].
Please submit bugs for this document by using the W3C's public bug database (http://www.w3.org/Bugs/Public/) with the product set to HTML WG and the component set to HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff). If you cannot access the bug database, submit comments by email to the mailing list noted below.
This document was published by the HTML working group as a Working Group Note. If you wish to make comments regarding this document, please send them to public-html@w3.org (subscribe, archives).
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 September 2015 W3C Process Document.
Table of Contents
- 1. Conformance
- 2. Introduction
- 3. Syntax
- 4. Writing HTML documents
- 4.1 Processing instructions and the XML declaration
- 4.2 Specifying a document’s character encoding
- 4.3 The DOCTYPE
- 4.4 Namespaces
* 4.4.1 Element-level namespaces
* 4.4.2 Attribute-level namespaces
- 4.5 Element syntax
* 4.5.1 Required elements and tags
* 4.5.1.1 A minimal HTML document
* 4.5.1.2 Required element examples
* 4.5.2 Excluded elements
* 4.5.3 Case-sensitivity
* 4.5.3.1 Element names
* 4.5.3.2 Attribute names
* 4.5.3.3 Attribute values
- 4.6 Element content
* 4.6.1 Void elements
* 4.6.2 Raw text elements (script and style)
* 4.6.2.1 Options for delivering safe text content
* 4.6.2.2 Safe CDATA content
* 4.6.2.2.1 Safe rules for CDATA use
* 4.6.2.2.2 Comment syntax in script
* 4.6.3 Escapable raw text elements
* 4.6.4 Foreign elements
* 4.6.5 Special elements
- 4.7 Text
* 4.7.1 Newlines in textarea and pre elements
- 4.8 Attributes
* 4.8.1 Disallowed attributes
* 4.8.2 Language attributes
* 4.8.3 Attributes with special considerations
* 4.8.3.1 The id attribute
- 4.9 Named entity references
- 4.10 Comments
- 4.11 Scripting and styling polyglot markup
* 4.11.1 JavaScript: innerHTML vs document.write()
* 4.11.2 CSS: Attribute selectors that require a namespace prefix
- 5. Example document
- A. Acknowledgements
- B. References
1. Conformance
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MUST, MUST NOT, REQUIRED, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [RFC2119].
2. Introduction
This section is non-normative.
It is sometimes valuable to be able to serve HTML5 documents that are also well-formed XML documents. An author may, for example, use XML tools to generate a document, and they and others may process the document using XML tools. The language used to create documents that can be parsed by both HTML and XML parsers is called polyglot markup. Polyglot markup is the overlap language of documents that are both HTML5 documents and XML documents. It is recommended that these documents be served as either text/html (if the content is transmitted to an HTML-aware user agent) or application/xhtml+xml (if the content is transmitted to an XHTML-aware user agent). Other permissible MIME types are text/xml, application/xml, and any MIME type whose subtype ends with the four characters "+xml". [XML-MT]
2.1 Scope
Polyglot markup is a robust, but entirely optional, profile of the HTML vocabulary. Not all web content needs to be authored in polyglot markup; it is primarily an option for authors wanting increased robustness of their documents. Polyglot markup works best, and can be a beneficial option, in controlled environments and for authoring tools.
Polyglot markup is ideal for publishing when there's a strong desire to serve both HTML and XML tool chains without simultaneously having to maintain dual copies of the content: one in HTML and a second in XHTML. In addition, a single polyglot markup output requires less infrastructure to produce than separate HTML and XHTML outputs for the same content. Polyglot markup is also beneficial when lightweight processes (such as quick testing or even hand-authoring) are applied to content intended to be published both as HTML and XHTML, especially if that content is not sent through a tool chain.
Note
XML-based HTML tools or systems intended for the most general contexts of use cannot depend on polyglot input: for maximum flexibility, such tools should use the technique of using an HTML parser that produces an XML-compatible DOM or event stream.
2.2 Robustness
The goal of polyglot markup is a syntax that is robust the way the Web Content Accessibility Guidelines (WCAG) 2.0 describes it: “Maximize compatibility with current and future user agents, including assistive technologies.” [WCAG20]
Authors need not understand the benefits of robustness in order to benefit from the syntax of polyglot markup. However, in order to promote its benefits, it is necessary to understand that polyglot markup does not add semantics, and as such is not any more or less semantic than other flavors of HTML. Polyglot markup does, however, work to preserve semantics, including during the authoring process. Polyglot markup also does not ensure accessibility, as it does not add any accessibility requirements that other relevant specifications have not already added. But polyglot markup can work to preserve accessibility through adherence to required practices.
Polyglot markup approaches robustness by defining constraints on the serialization of a DOM tree in a manner that is likely to retain semantics when that serialization is reparsed using a variety of parsers, be they full-featured and bug-free HTML5 parsers, somewhat HTML-aware parsers, or even XML parsers.
For the most part, polyglot markup is just a pure deduction of the validity constraints and syntax requirements that HTML and XHTML each dictate, many of which took "polyglotness" into consideration when they were added to HTML5. However, for reasons of robustness, this specification sometimes goes further than the principle of the lowest common denominator would have required.
For instance, included in the set of constraints on the serialization is the requirement to use the UTF-8 encoding. While not the only theoretical possibility, the choice of UTF-8 as the sole option is justified by the underlying principle of robustness. For example, if someone opted to use the KOI8-R encoding, then, as a side-effect of HTML-conformance and XML well-formedness requirements, the author would be forced to rely on a higher protocol (such as MIME Content-Type) in order to support XML parsers. By requiring UTF-8, that side-effect is avoided.
Using robust syntax can enable documents to be parsed more reliably in less capable parsers. But even if the document can be expected to be parsed and validated by tools that fully conform to HTML5, polyglot markup adds robustness. As an example, when serialized as HTML, the closing tag for the p element is entirely optional and will be inferred if not present. Including closing tags, as required by XML and thus by polyglot markup, causes no harm beyond a minor increase in transfer size (an increase often mitigated by compression), and it allows validators to detect situations where the implicit closing rules don't match what the author intended.
Note
Note that XML-based polyglot markup syntax is not the only way to increase robustness. For instance, an HTML validator or an authoring tool could require all tags to be closed even if this is not required by the HTML syntax.
3. Syntax
3.1 Principles
Polyglot markup results in:
- a valid HTML document. [HTML5]
- a well-formed XML document. [XML10]
- identical DOMs when processed as HTML and when processed as XML, with some notable exceptions: HTML and XML parsers generate different DOMs for some xml (xml:lang, xml:space, and xml:base), xmlns (xmlns="" and xmlns:xlink=""), and xlink (such as xlink:href) attributes. XML requires and HTML5 permits these attributes in certain locations and the attributes are preserved by HTML parsers. The exception must not break the requirement to be a valid HTML document.
Polyglot Markup specifies a Robust Syntax, by which it is meant a syntax that maximizes support and minimizes authoring choice.
Support is maximized:
- by supporting both HTML and XML parsing;
- by utilizing code that, as far as possible, results in DOM equivalent parsing in generic as well as specialized parsers, including challenged parsers of various kinds;
- because the code is ready to be reused, repurposed, re-edited, or reparsed in any authoring tool or parser.
Authoring choices are minimized:
- through strict syntax requirements partly dictated by the polyglot approach and partly motivated by the robust approach.
Polyglot markup is not constrained:
Polyglot markup is scripted according to the rules of XML (does not use document.write, for example) and excludes HTML elements that are impossible to replicate in an XML parser (does not use the noscript element, for example).
Polyglot markup triggers non-quirks mode in HTML parsers, as non-quirks mode is closest to XML-mode rendering, in regard to both DOM and CSS.
Polyglot markup results in the same encoding and the same language in both HTML-mode and XML-mode.
Polyglot markup, itself being valid HTML5, supports extensibility as it is defined in Section 2.2.3 Extensibility of HTML5, so long as the extension does not violate the rules of polyglot markup. [HTML5] In addition, being well-formed XML, polyglot markup can be extended when it is served as application/xhtml+xml.
4. Writing HTML documents
4.1 Processing instructions and the XML declaration
Processing instructions and the XML declaration are both forbidden in polyglot markup.
4.2 Specifying a document’s character encoding
Polyglot markup uses the UTF-8 character encoding, the only character encoding for which both HTML and XML require support. HTML requires UTF-8 to be explicitly declared to avoid fallback to a legacy encoding. [HTML5]
For XML, UTF-8 is an encoding default. Documents served with an XML content type therefore do not need to use any of the HTML encoding declaration methods, although if the document might be interpreted as text/html it SHOULD do so.
Polyglot markup declares the UTF-8 character encoding in the following ways, which may be used separately or in combination (but note that there can only be a single HTML encoding declaration):
- Within the document
  - By using the Byte Order Mark (BOM) character
  - By using the HTML encoding declaration
    * either in its charset attribute form: <meta charset="UTF-8"/>
    * or in its alternative form: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
- Outside the document
  - By adding "charset=utf-8" to the MIME/HTTP Content-Type header [HTTP11], as the following examples show in HTML and XML, respectively:
Example 1
Content-type: text/html; charset=utf-8
Example 2
Content-type: application/xhtml+xml; charset=utf-8
Note that, when serving polyglot documents as XML, charset=UTF-8 can safely be omitted, due to the UTF-8 encoding default of XML:
Example 3
Content-type: application/xhtml+xml
Note
Both XML and HTML parsers are required to support the byte order mark. The HTML encoding declaration has no effect in XML. When the HTML encoding declaration is the only encoding declaration, the encoding default from XML makes XML parsers treat content as UTF-8.
The W3C Internationalization (i18n) Group recommends that one always include a visible encoding declaration in an HTML document, because it helps developers, testers, or translation production managers to check the encoding of a document visually.
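The in-document declaration methods listed above can be checked mechanically. The following is a minimal Python sketch, not a conforming encoding sniffer: the function name `in_document_utf8_declarations` and both regular expressions are illustrative assumptions, and the 1024-byte window reflects the HTML requirement that the encoding declaration be serialized within the first 1024 bytes of the document.

```python
import codecs
import re

# Hypothetical patterns for the two HTML encoding declaration forms.
# They accept only the self-closing syntax that polyglot markup uses.
_META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']utf-8["\']\s*/>', re.IGNORECASE)
_META_HTTP_EQUIV = re.compile(
    rb'<meta\s+http-equiv\s*=\s*["\']Content-Type["\']\s+'
    rb'content\s*=\s*["\']text/html;\s*charset=utf-8["\']\s*/>',
    re.IGNORECASE)


def in_document_utf8_declarations(data: bytes) -> list:
    """Return the in-document UTF-8 declarations found in a byte stream."""
    found = []
    if data.startswith(codecs.BOM_UTF8):
        found.append("bom")
    head = data[:1024]  # only the start of the document is inspected
    if _META_CHARSET.search(head):
        found.append("meta-charset")
    if _META_HTTP_EQUIV.search(head):
        found.append("meta-http-equiv")
    return found
```

An empty result does not mean the document is wrong: the encoding may still be declared outside the document, in the Content-Type header.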
4.3 The DOCTYPE
Polyglot markup uses a document type declaration (DOCTYPE) specified by section 8.1.1 of [HTML5]. In addition, the DOCTYPE conforms to the following rules:
- The string DOCTYPE is in uppercase letters.
- The string SYSTEM, if present, is in uppercase letters.
- The string PUBLIC, if present, is in uppercase letters.
- A Formal Public Identifier (FPI), if present, is a case-sensitive match of the registered FPI to which it points.
- A URI, if present in the document type declaration, is a case-sensitive match of the URI to which it points.
  - If the URI is the string about:legacy-compat, polyglot markup includes the string in lowercase letters, as required by HTML5.
  - If the URI is an http URL, the URI points to the correct resource, using case-sensitive letters.
Note
For valid XML the document element named in the document type declaration must exactly match the top-level element of the document, including in case. This rule is relaxed for well-formed, rather than valid, XML documents. Because XHTML requires a lower-case html element, polyglot documents SHOULD use lower-case html for the element named in the DOCTYPE declaration. Bear in mind that a customized XHTML DTD with element and entity declarations inside the document type definition subset within the document, or one that points to an alternate DTD, may have special case requirements.
Note that using about:legacy-compat in XML may yield unpredictable parsing results, depending on the XML processing pipeline.
Polyglot markup does not use document type declarations for HTML4, HTML3, or HTML2, regardless of whether they contain a URI or not and regardless of their effect in HTML5 parsers, as these document type declarations are not compatible with XHTML.
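The case rules above lend themselves to a simple case-sensitive match. The sketch below is a hedged illustration only: the helper name `has_simple_polyglot_doctype` is hypothetical, and it recognizes just the two shortest DOCTYPE forms (with and without about:legacy-compat). DOCTYPEs that carry a PUBLIC identifier are deliberately out of scope here.

```python
import re

# Case-sensitive on purpose: DOCTYPE and SYSTEM in uppercase, html and
# about:legacy-compat in lowercase. PUBLIC forms are not handled.
_SIMPLE_DOCTYPE = re.compile(
    r'<!DOCTYPE\s+html(?:\s+SYSTEM\s+(["\'])about:legacy-compat\1)?\s*>')


def has_simple_polyglot_doctype(source: str) -> bool:
    """True if the text starts with a DOCTYPE written in the required case."""
    # A leading BOM and whitespace are skipped before matching.
    return _SIMPLE_DOCTYPE.match(source.lstrip("\ufeff \t\r\n")) is not None
```

A lowercase `<!doctype html>` is acceptable to an HTML parser but is rejected by this check, because an XML parser requires the keyword in uppercase.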
4.4 Namespaces
The following rules apply to namespaces used in polyglot markup.
4.4.1 Element-level namespaces
[HTML5] introduces undeclared (native) default namespaces for the root HTML element, html, the root SVG element, svg, and the root MathML element, math. Polyglot markup declares the following default namespaces, when the markup languages are included in the document, to maintain XML compatibility [XML10]:
<html xmlns="http://www.w3.org/1999/xhtml">
<math xmlns="http://www.w3.org/1998/Math/MathML">
<svg xmlns="http://www.w3.org/2000/svg">
Polyglot markup declares the default namespaces on the root HTML element, html, the root SVG element, svg, and the root MathML element, math, and on any HTML elements used as children of SVG or MathML elements. Polyglot markup does not declare any other default or prefixed element namespace, because [HTML5] does not natively support the declaring of any other default or prefixed element namespace.
4.4.2 Attribute-level namespaces
[HTML5] introduces undeclared (native) support for attributes in the XLink namespace and with the prefix xlink:. To maintain XML compatibility, polyglot markup explicitly declares the XLink namespace (xmlns:xlink="http://www.w3.org/1999/xlink"). [XML10]
For conformance with the HTML specification’s conformance rules, the declaration has to take place in each foreign content section where it is used, typically on such a section’s root element (e.g. on the svg start tag for an SVG section and on the math start tag for a MathML section), since the declaration must occur before using any of the xlink: prefixed attributes:
xlink:actuate
xlink:arcrole
xlink:href
xlink:role
xlink:show
xlink:title
xlink:type
The xml: namespace prefix used in xml:base, xml:lang, xml:space, and xml:id does not need to be declared in XML documents, and therefore polyglot markup does not declare this prefix via xmlns. The prefix is implicitly declared in XML and is automatically applied to the appropriate attributes in HTML. See CSS namespaces [CSS3NAMESPACE] for how to use CSS selectors with these attributes.
For more about the issues related to attribute selectors and namespaces, with and without prefixes, see the section on Scripting and styling polyglot markup.
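The effect of the namespace declarations described above can be observed with an XML parser. The following Python sketch uses the standard library's `xml.etree.ElementTree`; the sample markup, the constant names, and the helper functions are assumptions made for illustration. It shows that the xlink: prefix resolves only when xmlns:xlink is declared, and that omitting the declaration makes the document not well-formed.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Illustrative document: XLink is declared on the svg start tag.
DECLARED = (
    f'<html xmlns="{XHTML_NS}"><head><title>t</title></head><body>'
    f'<svg xmlns="{SVG_NS}" xmlns:xlink="{XLINK_NS}">'
    '<a xlink:href="#target"><text>link</text></a></svg>'
    '</body></html>'
)
# Same document with the XLink declaration removed.
UNDECLARED = DECLARED.replace(f' xmlns:xlink="{XLINK_NS}"', "")


def is_well_formed(source: str) -> bool:
    """True if an XML parser accepts the text."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


def first_svg_link_href(source: str):
    """Return the xlink:href value of the first SVG a element."""
    root = ET.fromstring(source)
    link = root.find(f".//{{{SVG_NS}}}a")
    return link.get(f"{{{XLINK_NS}}}href")
```

An HTML parser would accept both variants, which is why the missing declaration tends to go unnoticed until the document is processed as XML.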
4.5 Element syntax
Polyglot markup conforms to the following rules regarding elements.
4.5.1 Required elements and tags
Polyglot markup does not employ optional tags. HTML5’s concept of optional tags – missing start tags and/or end tags – covers elements that the HTML parser itself automatically adds to the DOM if the code doesn’t contain the tags for them. Because XML does not have such a feature that adds missing start and/or end tags to the DOM, omitting a tag in polyglot markup is equivalent to producing a document that is not well-formed or, if both tags are omitted, equivalent to not adding the element at all.
The fact that polyglot markup doesn’t operate with optional tags may create surprises for an author not used to adding the tbody tags in their markup, for example, or to someone accustomed to omitting the end tag of the p element. However, the requirement to be well-formed with regard to tags is a key feature of polyglot markup that makes the code robust against subpar parsers and authoring surprises.
4.5.1.1 A minimal HTML document
Every polyglot markup document therefore contains an html, head, title, and body element. The html element is the root element. The head and body elements are children of the html element. The title element is a child of the head element. Therefore, the following is the most basic polyglot markup document.
Example 4
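As a hedged illustration of the structure just described, the Python sketch below embeds one possible minimal document and reads it with both an XML parser and the standard library's HTML tokenizer. The document text (including the lang value and the title) and the helper names are assumptions, not text taken from this specification, and `html.parser` is a plain tokenizer rather than a conforming HTML5 tree builder, so this only compares the sequence of start tags.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumed minimal document: html root, head with title, and body.
MINIMAL_POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n'
    '<head>\n<title>Minimal polyglot document</title>\n</head>\n'
    '<body>\n</body>\n'
    '</html>\n'
)

XHTML_PREFIX = "{http://www.w3.org/1999/xhtml}"


def xml_element_names(source: str) -> list:
    """Element names in document order, as seen by an XML parser."""
    root = ET.fromstring(source)
    return [el.tag.replace(XHTML_PREFIX, "") for el in root.iter()]


class _StartTagCollector(HTMLParser):
    """Records every start tag reported by the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def html_element_names(source: str) -> list:
    """Start-tag names in source order, as seen by the HTML tokenizer."""
    collector = _StartTagCollector()
    collector.feed(source)
    collector.close()
    return collector.names
```

Because no tag is omitted, both functions report the same four elements in the same order.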
4.5.1.2 Required element examples
Whenever it uses a tr element, polyglot markup always wraps the tr element inside a tbody, thead, or tfoot element. In HTML, if a group of one or more adjacent tr elements are not explicitly wrapped inside a tbody, thead, or tfoot element, the HTML parser creates and wraps a new tbody element around the tr elements. XML parsers do not create the tbody element, thus offering the potential for creating different DOMs.
Correct:
Example 5
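The tr wrapping rule can be verified on the XML side, where no tbody is ever created implicitly. The Python sketch below is an assumption-laden illustration: the function name `unwrapped_rows` and the sample tables used with it are hypothetical, and it only inspects elements in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
ROW_GROUPS = {XHTML + "tbody", XHTML + "thead", XHTML + "tfoot"}


def unwrapped_rows(source: str) -> int:
    """Count tr elements whose parent is not tbody, thead, or tfoot."""
    root = ET.fromstring(source)
    count = 0
    # ElementTree has no parent pointers, so walk parents and their children.
    for parent in root.iter():
        for child in parent:
            if child.tag == XHTML + "tr" and parent.tag not in ROW_GROUPS:
                count += 1
    return count
```

A result of zero means the XML tree already contains the row-group elements that an HTML parser would otherwise have to insert.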
Ambiguous string | Info | HTML interpretation | XML interpretation if inside <![CDATA[section]]> | XML interpretation if outside <![CDATA[section]]> |
---|---|---|---|---|
< | LESS-THAN SIGN | uninterpreted (but see the </script and </style rows) | uninterpreted | interpreted (commences tags, comments, CDATA) |
& | AMPERSAND | uninterpreted | uninterpreted | interpreted (commences character reference or entity) |
<!-- | start of comment | partly uninterpreted | uninterpreted | interpreted |
--> | end of comment | partly uninterpreted | uninterpreted | interpreted |
<![CDATA[ | start of CDATA declaration | uninterpreted | uninterpreted | interpreted (begins CDATA block) |
]]> | end of CDATA declaration | uninterpreted | uninterpreted | interpreted (ends CDATA block) |
cdata content | the content of CDATA sections | uninterpreted | — | |
</script | if occurring inside script element and followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F) | terminates parent | uninterpreted | terminates parent |
</style | if occurring inside style element and followed by one of "tab" (U+0009), "LF" (U+000A), "FF" (U+000C), "CR" (U+000D), U+0020 SPACE, ">" (U+003E), or "/" (U+002F) | terminates parent | uninterpreted | terminates parent |
all other tags, well-formed or not | | uninterpreted | uninterpreted | interpreted subject to normal parsing rules |
&#foo; | character references | uninterpreted | uninterpreted | interpreted subject to normal parsing rules |
none of the above strings | Any other string | uninterpreted | uninterpreted | uninterpreted |
Syntactically, the polyglot subset is found by
- either limiting the content to safe text content, that is, text that gets interpreted the same way in HTML and in XML.
- or trying to even out the differences in constraints by wrapping the contents in a CDATA section. The CDATA code is then seen as text by the HTML parser (and can thus interfere with the scripting or styling language!), while the XML parser sees the content as text without markup semantics.
Limiting the contents to safe text content requires more planning and control over the code, but can be said to be more robust than the CDATA option as it requires no extra, potentially breakable code to make the scripting or styling language work. The CDATA option, on the other hand, gives more freedom and robustness against various errors that can happen because the author isn’t aware of the safe text content limitations or because the code is inserted by a tool that is unable to guarantee that the content is safe.
4.6.2.1 Options for delivering safe text content
Polyglot markup can deliver safe text content both externally and internally.
- External safe text content. Polyglot markup can include scripts or stylesheets by linking to external files rather than including the code in-line. External files are parsed as the respective script or stylesheet and are thus not limited by the same restrictions as safe text content.
  Fig. 2 Examples of linking to external scripts or stylesheets
  Example 9
- Inline safe text content. Polyglot markup does not use characters or constructs that are interpreted differently in HTML and XML. This means not using the characters < and & as well as the CDATA end mark string, ]]>. Polyglot markup is agnostic as to whether one uses character entities or numeric character references, so long as they are valid. That is, for polyglot markup, a named reference and a numeric reference to the same character are equivalent.
  Fig. 3 Examples of content that is not safe text content
  Example 10
  For CSS, the inline safe text content option would work very well most of the time, as < and & are not key parts of CSS and not very often used. But when it comes to JavaScript, the & and the < are key operators of the language, and thus one soon runs into trouble; it is better to use external safe text content.
  Fig. 4 Inline content containing no ambiguous strings
  Example 11
Note
A workaround for using ambiguous strings is to include the properly escaped characters inside the src attribute of style or script tags.
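The inline safe text content rule above reduces to a check for three strings. A minimal Python sketch follows; the names `AMBIGUOUS_STRINGS` and `is_safe_text_content` are hypothetical, and the check covers only the strings named in this section.

```python
# The characters and the CDATA end mark that HTML and XML interpret differently.
AMBIGUOUS_STRINGS = ("<", "&", "]]>")


def is_safe_text_content(text: str) -> bool:
    """True if inline script or style text avoids every ambiguous string."""
    return not any(marker in text for marker in AMBIGUOUS_STRINGS)
```

Typical CSS passes this check, while ordinary JavaScript comparisons such as `a < b` or `a && b` do not, which matches the advice to keep scripts in external files.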
4.6.2.2 Safe CDATA content
Polyglot markup accepts raw text content wrapped in a CDATA section; however, instead of permitting any content (except the CDATA end mark string, ]]>), only the subset that corresponds to the particular raw text element’s HTML constraints is permitted. See the “HTML interpretation” column in the parsing differences table above: all the cells with the text “uninterpreted” are also uninterpreted as CDATA and thus constitute the safe subset of CDATA.
Wrapping raw text in a CDATA section introduces a new problem: when consumed as HTML, the start and end marks of the CDATA section are seen by the script or stylesheet interpreter and can thus cause syntax errors or even halt the script and stylesheet execution. A solution is to comment out the CDATA start and end marks by using the comment methods of the script or stylesheet language. Additionally, such as when script is used as a coding block container, it may be necessary to even comment out the scripting/styling comments by hiding them inside an XML comment.
4.6.2.2.1 Safe rules for CDATA use
These rules assume that CDATA is of limited use for CSS.
General rules:
- The CDATA section is subject to HTML’s restrictions on <script> and <style>.
- There can be only one CDATA section per raw text element.
- A CDATA section must appear at the start of its containing element, and hence be the first child of that element.
- Before the CDATA section there can only be content that creates one node - preferably only one line of code - which may consist of whitespace, an XML comment, or a construct of the scripting/styling language (usually a comment of the scripting/styling language).
- After the CDATA section there can only be content that creates one node - preferably only one line of code - which may consist of whitespace, an XML comment, or a construct of the scripting/styling language (usually a comment of the scripting/styling language).
Note
The statement that a "CDATA section must appear at the start of its containing element, and hence be the first child of that element," is due to how parsers may create DOM nodes based on characters and whitespace. The following script element, because it contains no whitespace outside the CDATA node, has one node, whether parsed as HTML or as XML:
<script><![CDATA[foo]]></script>
Because an author may need to comment out the CDATA "start tag" and "end tag," polyglot markup allows for one node before and after the CDATA section. The following example has three nodes: one text node before the CDATA section, one for the CDATA section itself, and one after the CDATA section:
The ]]> string:
- is always commented out if <![CDATA[ is commented out.
- is never commented out if <![CDATA[ is not commented out.
Example 13
The <![CDATA[ string can be handled in 3 ways:
1. <![CDATA[ (without commenting it out).
   Example 14
   Note
   Using the <![CDATA[ block without commenting it out is not conforming as type="text/css" or type="text/javascript" content when parsed as HTML.
2. //<![CDATA[ (using scripting language comments for the entire block).
   Example 15
   Note that the comment starts in the node before the CDATA section.
3. <!--//--><![CDATA[ (same as 2, but the scripting comment is hidden inside an XML comment).
   Example 16
   Note that the scripting language must accept <!-- as syntactically legal. JavaScript does, but other scripting languages may not.
This approach is compatible with CSS; however, rule 2 above prevents validity.
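To make the second option concrete, the Python sketch below feeds one script element, with its CDATA marks hidden behind script comments, to an XML parser and to the standard library's HTML tokenizer. The sample script and the helper names are assumptions for illustration, and `html.parser` is only a stand-in for a real HTML parser. The point is that the XML side never sees the CDATA marks as text, while the HTML side passes them through to the script, where the `//` comments neutralize them.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical inline script using the //<![CDATA[ ... //]]> pattern.
SCRIPT = "<script>//<![CDATA[\nvar ok = a < b && c;\n//]]></script>"


class _DataCollector(HTMLParser):
    """Collects the text that the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def script_text_as_xml(source: str) -> str:
    """Script text as an XML parser exposes it (CDATA marks consumed)."""
    return ET.fromstring(source).text


def script_text_as_html(source: str) -> str:
    """Script text as the HTML tokenizer exposes it (CDATA marks kept)."""
    collector = _DataCollector()
    collector.feed(source)
    collector.close()
    return "".join(collector.chunks)
```

In both readings the script engine receives the same statement on the middle line, surrounded by lines that begin with `//`.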
4.6.3 Escapable raw text elements
Escapable raw text elements are elements in which character references are permitted but where the HTML parser treats elements as text rather than as markup. For polyglot markup, escapable raw text elements are:
title
textarea
Polyglot markup uses the same rules of safe text content for escapable raw text elements, except that character entities are permitted for escapable raw text elements.
4.6.4 Foreign elements
The exact rules for foreign content elements are defined by the respective specifications.
4.6.5 Special elements
Unless otherwise specified, elements have no special restrictions other than those that apply to all polyglot markup.
The iframe element has restrictions in polyglot markup, because the HTML specification sets special constraints on iframe in XML documents. [HTML5]
4.7 Text
4.7.1 Newlines in textarea and pre elements
When polyglot markup uses either a textarea or pre element, the text within the element should not begin with a newline. This is because HTML and SGML-based systems delete the initial newline on parsing, while XML parsers do not.
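Since an XML parser keeps the leading newline, the problem described above is easy to detect on the XML side. The following Python sketch is illustrative only; the function name `elements_starting_with_newline` and the sample markup used with it are assumptions, and only elements in the XHTML namespace are examined.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
NEWLINE_SENSITIVE = {XHTML + "pre", XHTML + "textarea"}


def elements_starting_with_newline(source: str) -> list:
    """Names of pre and textarea elements whose text begins with a newline."""
    root = ET.fromstring(source)
    return [
        el.tag.replace(XHTML, "")
        for el in root.iter()
        # el.text is None when the element starts with a child element.
        if el.tag in NEWLINE_SENSITIVE and (el.text or "").startswith("\n")
    ]
```

Any element reported here would lose its first newline when the same bytes are parsed as HTML.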
4.8 Attributes
Polyglot markup surrounds all attribute values with quotation marks, either single or double.
Polyglot markup does not use directly typed newline characters within an attribute.
Within an attribute's value, polyglot markup represents tabs, line feeds, and carriage returns as numeric character references rather than by using literal characters. For example, within an attribute's value, polyglot markup uses &#x9; for a tab rather than a literal tab character (\t). This is because of attribute normalization in XML [XML10]. Note, too, that JavaScript and CSS in attribute values are affected by attribute value normalization, because a comment ends up commenting out not to the end of the source line but to the end of the entire attribute value.
The following example uses numeric character references (escaped characters) for the line feed, tab, and less-than characters within a srcdoc attribute.
Example 17
Note
Because of attribute-value normalization in XML [XML10], polyglot markup does not use newline characters within an attribute. Practically speaking, for source code with newlines within attributes, DOMs generated via XML and HTML will be different; however, whitespace differences have no behavioral impact on the page unless:
- explicitly examined by JavaScript, rendering the differences of small consequence.
- used in attributes whose content is rendered visually, such as the content of @alt.
Note that directly typed newline characters are explicitly not allowed in any attribute containing a URI.
See also Attribute values.
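Attribute-value normalization is easy to observe with an XML parser. The short Python sketch below is an illustration under assumed sample markup: a literal tab inside an attribute is replaced by a space when parsed as XML, while the numeric character reference survives as a real tab.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: the same attribute with a literal and an escaped tab.
LITERAL_TAB = '<p title="a\tb"/>'
ESCAPED_TAB = '<p title="a&#x9;b"/>'


def title_as_xml(source: str) -> str:
    """Return the title attribute as an XML parser reports it."""
    return ET.fromstring(source).get("title")
```

An HTML parser keeps the literal tab, so only the escaped form yields the same attribute value in both parsing modes.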
4.8.1 Disallowed attributes
The following attributes are not allowed in HTML or XHTML within polyglot markup. These attributes have effects in documents parsed as XML but do not have effects in documents parsed as text/html. The HTML5 spec therefore defines them as invalid in text/html documents. [HTML5]
xml:space
xml:base
Note that the xml:space and xml:base attributes are allowed on SVG and MathML elements. The attributes may therefore appear in polyglot markup when they appear within SVG or MathML as foreign content.
4.8.2 Language attributes
When specifying the language mapping of an element, polyglot markup uses both the lang
and the xml:lang
attributes. Neither attribute is to be used without the other, and polyglot markup maintains identical values for both lang
and xml:lang
.
The root element SHOULD always specify the language. Otherwise, HTML’s fallback language mechanism may step in and cause the language to vary depending on whether the document is consumed as XML (where the fallback language is not required to work) or consumed via a file URI (where a fallback language supplied by the external HTTP Content-Language header would not work). Note that the internal http-equiv="Content-Language" meta element is non-conforming in HTML5. For more, see e.g. HTML5’s language determination rules.
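The pairing rule for lang and xml:lang can be sketched as a check over an XML parse of the document. This is an illustrative helper (the name lang_mismatches is hypothetical) built on Python's xml.etree.ElementTree, and it compares values for exact equality, following the "identical values" wording above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_mismatches(markup):
    """Return (tag, lang, xml:lang) for elements where the two attributes differ.

    An element with neither attribute is fine; an element with only one of
    them, or with two different values, is reported.
    """
    problems = []
    for element in ET.fromstring(markup).iter():
        lang = element.get("lang")
        xml_lang = element.get(XML_LANG)
        if lang != xml_lang:
            problems.append((element.tag, lang, xml_lang))
    return problems
```

An element carrying only lang="fr" is reported with None in the xml:lang position, which makes the missing counterpart easy to spot.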
4.8.3 Attributes with special considerations
The following attributes or their considerations require exceptions to the general rules for polyglot markup.
4.8.3.1 The id attribute
Polyglot markup does not contain any space characters within the value of an id attribute. This is because values for the id attribute may not contain space characters in HTML5. [HTML5]
Polyglot markup ensures that every id attribute value is unique within the document and is a legal XML name, starting with a letter. [XML10]
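The id constraints above lend themselves to a simple check. The sketch below is an approximation: it uses an ASCII-only pattern (letter first, then letters, digits, ".", "_" or "-") rather than the full XML Name production, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a simplified, ASCII-only stand-in for "legal XML name,
# starting with a letter". The real XML Name production is wider.
_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._-]*\Z")


def id_problems(markup):
    """Return ("invalid", value) and ("duplicate", value) findings in document order."""
    seen = set()
    problems = []
    for element in ET.fromstring(markup).iter():
        value = element.get("id")
        if value is None:
            continue
        if not _ID_PATTERN.match(value):
            problems.append(("invalid", value))  # space, leading digit, etc.
        if value in seen:
            problems.append(("duplicate", value))
        seen.add(value)
    return problems
```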
4.9 Named entity references
Polyglot markup uses only the following named entity references:
amp
lt
gt
apos
quot
For entities beyond the previous list, polyglot markup uses character references. For example, polyglot markup uses &#160; instead of &nbsp;. Note that polyglot markup may use decimal values for escape characters (such as in the previous example); however, the Character Model for the World Wide Web recommends that content SHOULD use the hexadecimal form of character escapes rather than the decimal form when both are available. [CHARMOD]
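As an illustration of the substitution, the following sketch rewrites named entity references other than the five listed above into hexadecimal character references. It relies on Python's html.entities.name2codepoint table, which covers only the HTML 4 entity names, so newer names are left untouched; the function name is hypothetical.

```python
import re
from html.entities import name2codepoint

# The five named references that polyglot markup keeps.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
_NAMED_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_polyglot_references(text):
    """Replace known named references (except the predefined five) with &#xHEX;."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep allowed or unrecognized references as-is
        return "&#x{:X};".format(name2codepoint[name])

    return _NAMED_REFERENCE.sub(replace, text)
```

The hexadecimal form is used here in line with the [CHARMOD] recommendation quoted above; the decimal form would be equally valid polyglot markup.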
Polyglot markup always uses character references for the less-than sign (<) and ampersand (&) when they are used as characters. However, for CDATA inside foreign content, strings within comments, and safe CDATA, the following rules apply:
- for script and style elements that contain safe CDATA, they may be used as defined by the rules for safe CDATA;
- for CDATA sections in a foreign content section (SVG, MathML), the XML rules for CDATA apply;
polyglot markup
4.11 Scripting and styling
When applying JavaScript and CSS to polyglot markup, the goal is to get the same result whether consumed as HTML or as XML. It is therefore important to be aware of scripting and styling features that give different results in HTML vs XML. These issues come in addition to the polyglot usage rules for raw text elements.
4.11.1 JavaScript: innerHTML vs document.write()
Although document.write() and document.writeln() work in HTML, neither function works in XHTML. The polyglot alternative is the innerHTML property, which works for both HTML and XHTML.
Note
The innerHTML property takes a string. However, XML parsers will parse that string as XML in XHTML, while HTML parsers will parse that string as HTML in HTML. Because of this difference in parsing, the code that innerHTML inserts must follow the guidelines for polyglot markup so that the resulting DOM generated by the XML parser does not differ from the DOM generated by the HTML parser.
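One way to pre-screen a string before assigning it to innerHTML is to test whether it is well-formed XML at all, since a string that fails in an XML parser cannot be polyglot. The sketch below is only a stand-in: it wraps the fragment in a hypothetical div element and uses Python's xml.etree.ElementTree instead of a browser's XML parser. Passing this test is necessary but not sufficient for the fragment to be polyglot.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def parses_as_xml(fragment):
    """Return True if the fragment is well-formed XML inside an XHTML wrapper."""
    wrapped = '<div xmlns="%s">%s</div>' % (XHTML_NS, fragment)
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True
```

Fragments such as an unclosed br tag or an &nbsp; reference are rejected, whereas the polyglot forms (a self-closed br and a numeric character reference) are accepted.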
4.11.2 CSS: Attribute selectors that require a namespace prefix
CSS enables authors to select elements by referencing their attributes using attribute selectors: [attr]{property:value}. Generally speaking, attribute selectors can be used freely since polyglot markup relies on default namespaces, which do not affect attributes.
However, some of the attributes required by polyglot markup are namespaced. Some are namespaced by default, such as the xmlns attribute. Some attributes are namespaced by a prefix that is namespaced by default, such as xml:, xmlns:, and xlink:. In addition, extension specs may allow namespaced attributes other than those defined by the HTML specification. As a result, a selector such as [xmlns]{rule:foo} will not work in XHTML, where the attribute has an associated namespace. The same is true for prefixed attributes. Even if one escapes the colon ([xml\:lang]{rule:foo}), such selectors will only work in HTML. The exception is the namespace declaration for the xlink: prefix, which is namespaced in XML and in HTML and must thus be selected in a namespaced way in both syntaxes.
To be able to select namespaced attributes in XML, the attribute selector must include a namespace prefix. [SELECT]
For the unprefixed, namespaced attribute xmlns, a polyglot selector that works in both HTML and XML can be created by using the asterisk (*) for the namespace prefix, indicating that the selector is to match all attribute names without regard to the attribute's namespace:
Example 18
[*|xmlns]{color:lime}
For the prefixed xml:lang attribute, the rules of polyglot markup as well as the HTML specification itself dictate that the presence of xml:lang="foo" must be accompanied by a corresponding lang="foo" attribute. In a conforming polyglot document, one can therefore use the same approach as for the xmlns attribute.
Example 19
[*|lang]{color:lime}
Note
However, the requirement of polyglot markup to use both xml:lang="foo" and lang="foo" means that even [lang]{color:lime} would work, in both XML parsers and HTML parsers.
The xmlns:xlink attribute, which is required for polyglot svg elements, is a different case. In contrast to xml:lang, it belongs to a foreign content element in HTML/XHTML, so it is namespaced even in HTML. Hence, the only way to use this attribute as a selector, in HTML as well as in XML, is by declaring the namespace of the xmlns: prefix in CSS:
Example 20
@namespace xmlns "http://www.w3.org/2000/xmlns/";
[xmlns|xlink]{border:dashed lime 3px}
In cases where the user agent does not support namespaces in CSS and/or in markup, it is necessary to use more than one selector. This could happen if the author declares namespaces, default or prefixed, that an extension specification permits, or if the user agent does not support attribute selectors with a CSS namespace prefix.
Example 21
/*Selector for legacy user agents without support for namespace prefixed attribute selector:*/
[xmlns],
/*Selector for user agents with support for namespace prefixed attribute selector:*/
[*|xmlns]
{color:lime}
5. Example document
The following example code acts as polyglot markup and validates as either XHTML or as HTML. You can view the page live served as HTML, at http://dev.w3.org/html5/html-polyglot/SamplePage.html and the same bytes served as XHTML, at http://dev.w3.org/html5/html-polyglot/SamplePage.xhtml.
Note
The example document is served as 'text/html'. Some legacy user agents do not support SVG when it is served as 'text/html', as it is in this example. The example page could also be served as 'application/xhtml+xml' instead, with the file extension .html, maintaining adherence to polyglot markup and enabling the rendering of the SVG.
Example 22
<title>A Sample Page Using Polyglot Markup</title>
<h1>Sample Page Using Polyglot Markup</h1>
<p>
The source code for <a href="#SampleDoc">this document</a> uses <dfn id="sampleDef">polyglot markup</dfn>,
a document that is a stream of bytes that parses into identical document trees
(with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML.
The source code for this document also contains additional comments about the use of
<a href="#sampleDef">polyglot markup</a>.
</p>
<h2>Foreign Elements</h2>
<p>
The following shapes use SVG elements.
<a href="#sampleDef">Polyglot markup</a> introduces undeclared (native) default namespaces
for the root SVG element (<code>svg</code>) and respects the mixed-case element names and values
when appropriate, as described in the section on <a href="#element-level-namespaces">Element-Level Namespaces</a>, the section on <a href="#element-names">Element Names</a>
and the section on <a href="#attribute-values">Attribute values</a>.
</p>
<!-- <a href="#sampleDef">Polyglot markup</a> declares the xlink: namespace on the <svg> element to maintain XML-compatibility -->
<svg width="350" height="250" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g>
<title>Three SVG shapes</title>
<desc>
This SVG image contains an ellipse filled with a gradient that goes from white to blue as it moves outward from the center.
A yellow rectangle with a black border overlaps the ellipse in the upper-left quadrant,
and a red spiral on a white background overlaps the ellipse in the bottom-right quadrant.
The red spiral is also a link to the example code for that SVG shape.
</desc>
<defs>
<!-- Note that "radialGradient" and "myGradient" respect mixed-case values. -->
<radialGradient id="myGradient" cx="50%" cy="50%" r="50%" fx="50%" fy="50%">
<stop offset="0%" style="stop-color:rgb(200,200,200); stop-opacity:0"/>
<stop offset="100%" style="stop-color:rgb(0,0,255); stop-opacity:1"/>
</radialGradient>
</defs>
<ellipse cx="50%" cy="50%" rx="50%" ry="42%" style="fill:url(#myGradient)"/>
<rect x="0" y="0" width="100" height="100" style="fill: yellow; stroke: black;"/>
<a xlink:href="http://www.example.org/foo">
<!--
Note that the following attribute contains newlines which will produce a different DOM,
but will not affect the way in which SVG functions in the least.
-->
<path transform="translate(60, -175)"
d="M153 334 C153 334 151 334 151 334 C151 339 153 344 156 344 C164 344 171 339 171 334
C171 322 164 314 156 314 C142 314 131 322 131 334 C131 350 142 364 156 364
C175 364 191 350 191 334 C191 311 175 294 156 294 C131 294 111 311 111 334
C111 361 131 384 156 384 C186 384 211 361 211 334 C211 300 186 274 156 274"
style="fill:white;stroke:red;stroke-width:2"/>
</a>
</g>
</svg>
<h2>Void Elements</h2>
<!-- Given an empty instance of an element whose content model is not EMPTY (in this case, an empty paragraph)
<a href="#sampleDef">polyglot markup</a> does not use the minimized form, as described in Section 6.4 Void Elements -->
<p></p>
<p>
There is an empty <code>p</code> element before this paragraph.
<a href="#sampleDef">Polyglot markup</a> uses <code>&lt;p&gt;&lt;/p&gt;</code> and not <code>&lt;p/&gt;</code>.
</p>
<p>
<a href="#sampleDef">Polyglot markup</a> treats certain elements as self-closing,
void elements, such as the following <code>img</code> element.
</p>
<img height="48" width="72" alt="W3C" src="http://www.w3.org/Icons/w3c_home"/>
<p>
For more information, see the <a href="#empty-elements">Void Elements</a> section.
</p>
<h2>Required Elements</h2>
<p>
The following table uses the required <code>tbody</code> element, as described in the
<a href="#required-elements">Required elements and tags</a> section.
</p>
<table>
<tbody>
<tr>
<th>Column One</th>
<th>Column Two</th>
</tr>
<tr>
<td>Row 1, Column 1</td>
<td>Row 1, Column 2</td>
</tr>
<tr>
<td>Row 2, Column 1</td>
<td>Row 2, Column 2</td>
</tr>
<tr>
<td>Row 3, Column 1</td>
<td>Row 3, Column 2</td>
</tr>
</tbody>
</table>
<p>
The following table makes use of the <code>col</code> element and therefore uses the
required <code>colgroup</code> element as a wrapper for the <code>col</code> elements,
as described in the <a href="#required-elements">Required elements and tags</a> section.
</p>
<table>
<colgroup>
<col style="background-color:silver"/>
<col style="background-color:gray"/>
<col style="background-color:yellow"/>
</colgroup>
<tbody>
<tr>
<th>ISBN</th>
<th>Title</th>
<th>Price</th>
</tr>
<tr>
<td>3476896</td>
<td>My first HTML</td>
<td>$53</td>
</tr>
<tr>
<td>1234567</td>
<td>Intermediate Polyglot</td>
<td>$49</td>
</tr>
</tbody>
</table>
<h2>Named Entity References</h2>
<p>
The paragraph you now read, uses the string <code>&amp;amp;</code> for ampersands (“&amp;”) and uses,
as described in the section on <a href="#named-entity-references">Named entity references</a>, the string <code>&amp;#xA0;</code>
for a non-breaking space between the following two words: <i>“<a href="#sampleDef">polyglot&#xA0;markup</a>”</i>.
</p>
A. Acknowledgements
Many thanks to Robin Berjon, David Carlisle, Daniel Glazman, Richard Ishida, Tony Ross, Sam Ruby, Jonas Sicking, Henri Sivonen, Manu Sporny, and Philip Taylor. Special thanks to the W3C TAG and the W3C Internationalization (i18n) Core Working Group.
B. References
B.1 Normative references
[CHARMOD]
Martin Dürst; François Yergeau; Richard Ishida; Misha Wolf; Tex Texin et al. Character Model for the World Wide Web 1.0: Fundamentals. 15 February 2005. W3C Recommendation. URL: http://www.w3.org/TR/charmod/
[CSS3NAMESPACE]
Elika Etemad; Anne van Kesteren. CSS Namespaces Module. 29 September 2011. W3C Recommendation. URL: http://www.w3.org/TR/css3-namespace/
[HTML5]
Ian Hickson; Robin Berjon; Steve Faulkner; Travis Leithead; Erika Doyle Navara; Theresa O'Connor; Silvia Pfeiffer. HTML5. October 2014. W3C Recommendation. URL: http://www.w3.org/TR/html5/
[HTTP11]
R. Fielding, Ed.; J. Reschke, Ed. Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. June 2014. Proposed Standard. URL: http://www.ietf.org/rfc/rfc7231.txt
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: http://www.ietf.org/rfc/rfc2119.txt
[RFC2854]
D. Connolly; L. Masinter. The 'text/html' Media Type. June 2000. Informational. URL: http://www.ietf.org/rfc/rfc2854.txt
[SELECT]
Tantek Çelik; Elika Etemad; Daniel Glazman; Ian Hickson; Peter Linss; John Williams et al. Selectors Level 3. 29 September 2011. W3C Recommendation. URL: http://www.w3.org/TR/css3-selectors/
[XML-MT]
M. Murata; S. St.Laurent; D. Kohn. XML Media Types. IETF RFC 3023. URL: http://www.ietf.org/rfc/rfc3023.txt
[XML10]
Tim Bray; Jean Paoli; Michael Sperberg-McQueen; Eve Maler; François Yergeau et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/xml
B.2 Informative references
[WCAG20]
Ben Caldwell; Michael Cooper; Loretta Guarino Reid; Gregg Vanderheiden et al. Web Content Accessibility Guidelines (WCAG) 2.0. 11 December 2008. W3C Recommendation. URL: http://www.w3.org/TR/WCAG20/