XML 2.0? No, seriously.

Issue 23; 20 Feb 2008; last modified 08 Oct 2010

Maybe it's madness to consider XML 2.0 seriously. The cost of deployment would be significant. Simultaneously convincing a critical mass of users to switch without turning the design process into a farce would be very difficult. And yet, the alternatives look a little like madness too.

Design and programming are human activities; forget that and all is lost.

I found three topics on my desk simultaneously last week.

The proposal to amend the character set of XML 1.0 identifiers by erratum.
The proposal to deploy CURIEs, an awkward, confusing extension of the QName concept.
A thread of discussion suggesting that we consider allowing prefix undeclaration in _Namespaces in XML 1.0_. That's right, 1.0.

We're in an odd place.

XML has been more successful, and in more and more different arenas, than could have been imagined. But…

XML 1.0 is seriously broken in the area of internationalization, one of its key strengths, because it hasn't kept pace with changes to Unicode.

QNames, originally designed as a way of creating qualified element and attribute names, have also been used in more and more different arenas than could have been imagined. Unfortunately, the constraints that make sense for XML element and attribute names don't make sense, and are in fact unacceptable, in many of the other arenas.

And in XML, we learned that it is sometimes useful to be able to take a namespace binding out of scope.

XML 1.1 addressed some of these concerns, but also introduced backwards incompatibilities. Those incompatibilities seemed justified at the time, although they seem so obviously unnecessary and foolish now. In short, we botched our opportunity to fix the problem “right”.

What to do?

I think I could just about accept (have accepted, even) any one of the items on that list above. Fixing the Unicode problem in XML 1.0 by erratum is stretching the definition of erratum to the breaking point, but by itself is probably an acceptable compromise. Adding pseudo-QName identifiers to the world is confusing and ugly, but by itself probably not the worst thing that could be done. And allowing XML 1.0 documents to undeclare namespace prefixes, by itself, seems sensible in retrospect.

But all three? Really?

Perhaps, dare I say it, it is time to consider XML 2.0 instead. Trouble is, if XML 2.0 gets spun up as an open-ended design exercise, it'll be crushed by the second-system effect. And if XML 2.0 gets spun up as “only” a simplification of XML 1.0, it won't get any traction. If XML 2.0 is to be a success, it has to offer enough in the way of new functionality to convince people with successful XML 1.0 deployments (that's everyone, right?) that it's worth switching. At the same time, it has to be about the same size and shape as XML 1.0 when it's done or it'll be perceived as too big, too complicated, too much work.

With that in mind, here are some candidate requirements for XML 2.0.

All well-formed XML 1.0 documents that do not include an internal or external subset shall be well-formed XML 2.0 documents.
In other words, backwards compatibility for well-formed XML documents! But it's time to move all that DTD stuff off into another specification. Maybe we can even add <!NAMESPACE in XML 2.0 DTDs. If that spec ever gets written.
The XML 2.0 specification shall be no longer than the XML 1.0 specification.
In other words, you can't add seventy-three new whiz-bang features. You can't do anything that will require more prose to explain than you can remove by taking out DTD syntax.
All XML 2.0 documents shall support XML Namespaces.
In other words, what most of the XML world already requires. The experiment is over, namespaces won. Like it or not.
XML 2.0 shall define a mapping from QNames to URIs.
In other words, db:para ≡ (http://docbook.org/ns/docbook, para) ≡ http://docbook.org/ns/docbook#para, by definition. (For xmlns:db="http://docbook.org/ns/docbook"; and we can argue about the precise mapping rules later.)
XML 2.0 shall allow QNames to represent a broader range of values.
In other words, isbn:1234 is too useful to forbid. But we're still not allowing it as the name of an element or attribute.
XML 2.0 shall provide an unambiguous, context-_in_sensitive lexical form for QNames.
In other words, it will be possible to represent any XML 2.0 document without any namespace declarations at all. I've given some thought to how I think this might be done.
XML 2.0 shall do away with the requirement that documents can have only a single root element.
In other words, make document = extParsedEnt. Perhaps this is only a plausible requirement, but the fact is that many tools, like XSLT, are already comfortable with such instances and I'm going to take advantage of it in the next item.
XML 2.0 shall address the problem of named character references.
In other words, making it possible to write &nbsp; or &Exists; even in documents that don't have any entity declarations. The actual notation wouldn't have to use “&” but it might as well.
I have in mind a proposal for this:

<xml:entity name="nbsp" text="&#160;"/>  
<xml:entity name="Exists" text="∃"/>  
<xml:entity href="myentities.xml"/>  
<document>...</document>

As a matter of simplicity, I'm pretty confident I want to treat these new entities like the old ones, and like CDATA sections, and say that they are purely an authoring convenience; they don't survive parsing. In fact, I'm not even sure the parser has to report those elements; it can consume them as it goes.
That means you have to have a facility like XSLT 2.0's character maps to put them back at serialization time, if you want them back. Yes, I know this is still an inconvenience for some, but the alternative would require that all XML tools grow support for entity reference objects and that seems inconvenient for far more people.
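
To make the entity proposal a little more concrete, here is a minimal Python sketch of a pre-pass that consumes leading xml:entity declarations and expands references to them before an ordinary XML 1.0 parser sees the document. Everything in it is an assumption made for illustration: the function names (`expand_entities`, `decode_numeric`), the regex-based matching, and the fixed attribute order are hypothetical, not part of any specification. The href form is not handled, CDATA sections are not treated specially, and replacement text is assumed to contain no markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical syntax taken from the proposal: <xml:entity name="..." text="..."/>
# Assumes name comes before text and both use double quotes.
ENTITY_DECL = re.compile(r'\s*<xml:entity\s+name="([^"]+)"\s+text="([^"]*)"\s*/>\s*')
NUMERIC_REF = re.compile(r'&#(x[0-9A-Fa-f]+|[0-9]+);')
NAMED_REF = re.compile(r'&([A-Za-z_][\w.-]*);')
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}


def decode_numeric(value):
    """Turn numeric character references such as &#160; into characters."""
    def repl(match):
        ref = match.group(1)
        return chr(int(ref[1:], 16) if ref.startswith("x") else int(ref))
    return NUMERIC_REF.sub(repl, value)


def expand_entities(source):
    """Consume leading xml:entity declarations, then expand their references.

    The declarations do not survive: only the remaining text is returned.
    """
    entities = {}
    pos = 0
    while True:
        match = ENTITY_DECL.match(source, pos)
        if match is None:
            break
        entities[match.group(1)] = decode_numeric(match.group(2))
        pos = match.end()
    body = source[pos:]

    def repl(match):
        name = match.group(1)
        if name in PREDEFINED or name not in entities:
            # Leave it alone; the real parser will accept or reject it.
            return match.group(0)
        return entities[name]

    return NAMED_REF.sub(repl, body)


sample = (
    '<xml:entity name="nbsp" text="&#160;"/>\n'
    '<xml:entity name="Exists" text="\u2203"/>\n'
    '<document>a&nbsp;b &Exists; &amp;</document>'
)
root = ET.fromstring(expand_entities(sample))
print(repr(root.text))
```

Because the declarations are consumed up front, downstream tools never see entity reference objects, which matches the "purely an authoring convenience" position above. Putting the names back at serialization time would need a separate step, along the lines of a character map.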

I think it is possible to address the requirements I've outlined without doing undue violence to existing applications. From an API perspective, I think the worst part will be dealing with QNames as first-class objects. It will mean, for example, that attribute values become lists. In the simple case, a list of one text node, but for attributes that contain QNames (in their context-insensitive format), a list of (text|QName)*.
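
As a rough illustration of what "QNames as first-class objects" might look like in an API, the Python sketch below models a QName as a (namespace, local name) pair, maps it to a URI using the db:para example given earlier, and splits an attribute value into a list of text and QName items. The post does not specify the context-insensitive lexical form, so this sketch borrows Clark notation, `{namespace-uri}local`, purely as a stand-in. The names `QName`, `resolve`, and `parse_attribute_value` are hypothetical, and the `#` join simply follows the example in the post, whose precise mapping rules are explicitly left open.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class QName:
    namespace: str
    local: str

    def to_uri(self) -> str:
        # Follows the post's example: (ns, local) -> ns#local.
        # The real mapping rules are undecided; this is an assumption.
        return f"{self.namespace}#{self.local}"


def resolve(lexical: str, bindings: Dict[str, str]) -> QName:
    """Resolve a prefixed name such as 'db:para' against in-scope bindings."""
    prefix, _, local = lexical.partition(":")
    return QName(bindings[prefix], local)


# Stand-in for a context-insensitive form: Clark notation, {uri}local.
CLARK = re.compile(r"\{([^}]+)\}([^\s{}]+)")

AttrValue = List[Union[str, QName]]


def parse_attribute_value(value: str) -> AttrValue:
    """Split an attribute value into a list of (text | QName) items."""
    items: AttrValue = []
    pos = 0
    for match in CLARK.finditer(value):
        if match.start() > pos:
            items.append(value[pos:match.start()])
        items.append(QName(match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(value):
        items.append(value[pos:])
    return items


bindings = {"db": "http://docbook.org/ns/docbook"}
para = resolve("db:para", bindings)
print(para.to_uri())
print(parse_attribute_value("plain text"))
print(parse_attribute_value("see {http://docbook.org/ns/docbook}para here"))
```

In the simple case the list holds a single text item; only values that embed a QName in the context-insensitive form produce mixed lists, so existing code that expects a string could keep working by joining the text items.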

In my optimistic moments, I imagine that XML 2.0 could thread the needle between insufficient value to motivate transition and so much complexity that it can't possibly succeed. Though whether a committee could thread this particular needle (with this particular camel) is an open question.