XML 2.0

This page is an archive, possibly an out-of-date one. You’ll find newer weblog postings at So…and more about norm on his homepage.

Issue 198; 10 Nov 2004; last modified 08 Oct 2010

I think the goal for XML 2.0, if there ever is one, should be to simplify XML in the same way that the goal for XML was to simplify SGML.

XML 2.0. In any of several flavors, it's been the subject of hundreds of messages on xml-dev. Lots of folks have written about it; I've kept track of at least six essays on the topic, going all the way back to 2000:

Edd Dumbill summarized recent xml-dev discussion, taking place even as this essay was knocking around in my “ideas” bucket, in How Do I Hate Thee? on 03 Nov 2004. (I've been thinking about this for a while. My thoughts aren't really any more coherent today, but it occurs to me that there's really no better time to publish this than the week before XML 2004, where we'll all be hanging out in the bar looking for things to chat about anyway.)
Derek Denny-Brown proposed Where XML goes astray… on 12 Oct 2004.
Liam Quin was thinking about the future of XML on 26 Sep 2004.
Kendall Grant Clark asked Can We Get There From Here? on 20 Feb 2002.
Tim Bray drafted XML-SW on 10 Feb 2002.
Simon St. Laurent pointed out XML's Interoperability Problems in Jun 2000.

There are big gaps in that list; surely someone wrote about it in 2001 and 2003. I don't pay that much attention because I'm not convinced that XML 2.0 is a good idea. The complete failure of XML 1.1 doesn't leave me very optimistic, but maybe a big change would be more palatable than an incremental one. Certainly the potential payoff is larger.

But what is that payoff? I mean, what's wrong with XML 1.x?

Depending on your perspective, the answer to that question is probably somewhere between almost nothing and almost everything. I fall more towards the former end of the spectrum, but a lot has changed since 1998.

Change is a big part of the problem. XML 1.0 has some oddities, many the result of SGML legacy, but taken by itself it isn't too bad. For better or worse, though, we don't take it by itself anymore; we take it with namespaces and inclusions and a choice of schema languages, a little bit of querying and some transformation, all sometimes wrapped up in a fancy web service. We've built up a big stack:

| WS-* | | |
| ----------------- | ----------- | -------- |
| XSLT | XML Query | |
| XPath | RDF/XML | |
| RELAX NG | XML Schema | |
| XML Base | xml:id | XInclude |
| XML Namespaces | | |
| XML Infoset | | |
| XML | | |

That sure is an awful lot of…stuff heaped on top of those three little letters. I think the goal for XML 2.0, if there ever is one, should be to simplify XML in the same way that the goal for XML was to simplify SGML.

So, what do I think that would look like?

One simplification we would make is editorial: an XML 2.0 specification would unify XML, XML Namespaces, XML Infoset, XML Base, and xml:id into a single document.

Next, we'd tackle a significant bit of SGML legacy: removing the syntactic privileges afforded DTDs. In XML 2.0, there would be no “<!DOCTYPE>” declaration, no entities (except the built-in entities and their close cousins, numeric character references), no attribute or element types of any kind, and no fixed or default values for attributes. In XML 2.0, documents would be either well-formed, or they wouldn't be XML.
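
As a rough illustration of what that rule would exclude, here is a minimal Python sketch that uses the standard library's expat bindings to report whether an XML 1.x document carries a “<!DOCTYPE>” declaration. The helper name `has_doctype` is hypothetical, and this only approximates the proposal with an XML 1.x parser; it is not an XML 2.0 parser.

```python
import xml.parsers.expat

def has_doctype(document: str) -> bool:
    """Report whether an XML 1.x document carries a <!DOCTYPE> declaration."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # expat calls this handler when it meets a DOCTYPE declaration.
    parser.StartDoctypeDeclHandler = (
        lambda name, sysid, pubid, has_internal_subset: seen.append(name)
    )
    parser.Parse(document, True)
    return bool(seen)

# A document that leans on a DTD-declared entity, and one that does not.
print(has_doctype('<!DOCTYPE doc [<!ENTITY e "x">]><doc>&e;</doc>'))
print(has_doctype("<doc/>"))
```

Under the proposal, the first document would simply not be XML 2.0; the second would carry over unchanged.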

I'd like to be clear: I've got nothing against DTDs. I'd be happy to work on a DTD V2.0 specification that described DTD validation of XML 2.0 documents. You just wouldn't have a <!DOCTYPE> declaration, so you'd have to associate the DTDs with documents in some other way, just like you associate RELAX NG Grammars and W3C XML Schemas in some other way.

Now, I've just screwed all the mathematicians (and other folks) by taking away their named character entities and I can see David Carlisle wincing out there in the audience. Bear with me, I have an answer for that problem this time (unlike last time).

My proposal for solving the entity problem is going to involve namespaces, so let's make some simplifications there, too. A radical simplification would be to simply throw them all out, declare defeat and try to invent something new to solve the naming problems. Or maybe try to convince the world that the naming problem doesn't exist, that the fact that <p> is sometimes TEI and sometimes HTML isn't a problem in practice. I'm not going to start out that radical. I'm just going to try to round off some of namespace's sharper corners.

In XML 2.0, all documents would be namespace aware. Furthermore, the “null namespace,” the namespace in which elements appear if there is no namespace declaration, would have an explicit URI (and could, consequently, be associated with a prefix). This reduces all of the magic of the “null namespace” to simply a question of a default declaration. We could go a step further and simply outlaw the null namespace, but that seems a bit extreme to me.

Ignoring <!DOCTYPE> declarations and a few wrinkles between XML 1.0 and XML 1.1, so far, all well-formed, namespace-aware, XML 1.x documents would be XML 2.0 documents, simply by changing the version in the XML declaration. If the null namespace were outlawed, you'd have to add a namespace declaration to the top of all the documents. That seems cumbersome. On the other hand, the Web Architecture document says that all elements should be in a namespace.

Anyway, for the moment, I'm not going that far.

So that means:

<?xml version='2.0'?>
<doc/>

and

<?xml version='2.0'?>
<doc xmlns="http://the-uri-for-the-default-namespace/"/>

and

<?xml version='2.0'?>
<x:doc xmlns:x="http://the-uri-for-the-default-namespace/"/>

are all logically the same document.
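
One way to see the equivalence is to compare expanded names. The following is a minimal Python sketch, not a definitive implementation: `expanded_name` and `NULL_NS` are hypothetical names, the URI is the placeholder from the examples above, and an XML 1.x parser is used only to read the elements (the XML declarations are left off because no existing parser accepts version 2.0).

```python
import xml.etree.ElementTree as ET

# Assumption: the explicit URI that XML 2.0 would give the "null namespace".
NULL_NS = "http://the-uri-for-the-default-namespace/"

def expanded_name(tag: str) -> str:
    """Return the {uri}local form, filling in NULL_NS for un-namespaced names."""
    if tag.startswith("{"):
        return tag  # ElementTree already reports namespaced names as {uri}local
    return "{" + NULL_NS + "}" + tag

docs = [
    "<doc/>",
    '<doc xmlns="http://the-uri-for-the-default-namespace/"/>',
    '<x:doc xmlns:x="http://the-uri-for-the-default-namespace/"/>',
]
names = [expanded_name(ET.fromstring(d).tag) for d in docs]
print(len(set(names)) == 1)  # True: all three root elements share one expanded name
```

The only step that is not ordinary XML 1.x namespace processing is the fallback to `NULL_NS`, which is exactly the “magic” that the proposal turns into a default declaration.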

That's a bunch of simplification. Now let's tackle a real technical challenge: QNames in content. I think the right answer here is to raise the stature of QNames so that they're first-class objects in XML 2.0. XML 2.0 would have Document, Element, Attribute, Processing Instruction, Character, Comment, Namespace, and QName Information Items.

For legacy (and authoring!) convenience, we'd keep the existing QName forms for element and attribute names, but we'd also introduce unambiguous lexical forms for QNames: in XML 2.0, <{uri}name> would be a well-formed serialization of a QName with the namespace name “uri” and the local name “name”.

What does this really mean? The big problem with QNames in content is that the parser can't tell where the QNames are. Consider the following example, where the intent is that “a:localname” is a QName:

<?xml version="1.0"?>
<doc xmlns:a="http://example.com/xmlns/a">
What about the QName a:localname?
</doc>

An XML 1.0 parser can't actually determine that “a:localname” is a QName. In XML 2.0, we would fix that:

<?xml version="2.0"?>
<doc xmlns:a="http://example.com/xmlns/a">
What about the QName <{http://example.com/xmlns/a}localname>?
</doc>

The Infoset for this document consists of a Document Information Item containing a single Element Information Item containing 22 Character Information Items followed by a QName Information Item followed by 2 more Character Information Items.
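
That item count can be checked with a small sketch. The Python below is only an illustration under stated assumptions: `split_items` and the regular expression are hypothetical, and the pattern is a guess at the lexical form “<{uri}name>”, not a grammar taken from any specification.

```python
import re
from typing import List, Tuple, Union

# Hypothetical pattern for the proposed lexical form <{uri}local>.
QNAME = re.compile(r"<\{([^}]*)\}([^<>{}\s]+)>")

Item = Union[str, Tuple[str, str]]  # a run of characters, or a (uri, local) QName

def split_items(content: str) -> List[Item]:
    """Split element content into character runs and QName items."""
    items: List[Item] = []
    pos = 0
    for m in QNAME.finditer(content):
        if m.start() > pos:
            items.append(content[pos:m.start()])
        items.append((m.group(1), m.group(2)))
        pos = m.end()
    if pos < len(content):
        items.append(content[pos:])
    return items

# The content of <doc> in the example, including the line breaks.
content = "\nWhat about the QName <{http://example.com/xmlns/a}localname>?\n"
items = split_items(content)
print([len(i) if isinstance(i, str) else i for i in items])
# [22, ('http://example.com/xmlns/a', 'localname'), 2]
```

The 22 characters are the leading line break plus “What about the QName ”, and the final 2 are the question mark and the trailing line break.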

The “<{uri}name>” form is unambiguous, but it's awfully tedious for the author, so we'd provide a prefix form as well. As a convenience, <:p:name> would be a well-formed serialization of a QName with the namespace name currently bound to the prefix “p” and the local name “name”. So this would be equivalent:

<?xml version="2.0"?>
<doc xmlns:a="http://example.com/xmlns/a">
What about the QName <:a:localname>?
</doc>

These forms are allowed in element content and attribute values. This means that attribute values don't consist only of Character Information Items, they consist of Character and QName Information Items.
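
Resolving the prefix form is just a lookup against the in-scope namespace bindings. A minimal Python sketch, assuming a hypothetical `resolve` helper and a guessed pattern for “<:p:name>”:

```python
import re
from typing import Dict, Tuple

# Hypothetical pattern for the proposed lexical form <:prefix:local>.
PREFIXED = re.compile(r"<:([^:<>\s]+):([^:<>\s]+)>")

def resolve(token: str, in_scope: Dict[str, str]) -> Tuple[str, str]:
    """Map a <:p:name> token to (namespace name, local name)."""
    m = PREFIXED.fullmatch(token)
    if m is None:
        raise ValueError(f"not a prefixed QName form: {token!r}")
    prefix, local = m.groups()
    if prefix not in in_scope:
        # An unbound prefix is treated as an error in this sketch.
        raise KeyError(f"prefix {prefix!r} is not bound")
    return in_scope[prefix], local

bindings = {"a": "http://example.com/xmlns/a"}
print(resolve("<:a:localname>", bindings))
# ('http://example.com/xmlns/a', 'localname')
```

Because the parser does this resolution itself, a general tool can see which namespace declarations are actually in use without knowing anything about the vocabulary.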

What's gained here is that the QNames in content can be recognized by the parser, so we aren't “hiding” QName values, making general tools blind to which namespace declarations are actually used.

It's this syntactic form that provides an answer to the character entity problem. Now we can define a namespace with the semantics that QNames in that namespace represent characters. For example, http://www.w3.org/2003/entities/iso8879/isonum for the ISO Numeric and Special Graphic characters.

To write an “·” (middle dot) where I don't have a glyph for it, or a convenient way to insert that glyph, I can write <:num:middot> (or <{http://www.w3.org/2003/entities/iso8879/isonum}middot> if I don't have a prefix bound). And because these lexical forms are recognized in both element and attribute values, I can put them anywhere I want. I concede that “<:num:middot>” isn't quite as easy to type as “&middot;”, but it's not a lot harder and I don't think it's more difficult to read.
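
An application that understands such a character namespace could expand these QNames with a simple table lookup. The Python below is a sketch under stated assumptions: `expand_chars`, `CHARS`, and the token pattern are hypothetical, and the table holds a single illustrative entry rather than the full set.

```python
import re

ISONUM = "http://www.w3.org/2003/entities/iso8879/isonum"

# Assumption: a one-entry table for illustration; a real one would cover the set.
CHARS = {(ISONUM, "middot"): "\u00b7"}

# Hypothetical pattern matching either <{uri}local> or <:prefix:local>.
TOKEN = re.compile(r"<\{([^}]*)\}([^<>{}\s]+)>|<:([^:<>\s]+):([^:<>\s]+)>")

def expand_chars(text, in_scope):
    """Replace QNames from a character namespace with the character they name."""
    def repl(m):
        if m.group(1) is not None:
            key = (m.group(1), m.group(2))                # <{uri}local>
        else:
            key = (in_scope.get(m.group(3)), m.group(4))  # <:prefix:local>
        return CHARS.get(key, m.group(0))  # leave all other QNames untouched
    return TOKEN.sub(repl, text)

print(expand_chars("a <:num:middot> b", {"num": ISONUM}))
```

Both lexical forms resolve to the same (namespace name, local name) pair before the lookup, so the table never needs to know which prefix the author chose.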

We could take this even further and allow these QName forms not only in attribute values and element content, but also in “Names”. In other words, this document:

<?xml version="2.0"?>
<doc xmlns="http://example.com/xmlns/doc"
     xmlns:a="http://example.com/xmlns/a"
     xmlns:b="http://example.com/xmlns/b">
  <p a:att="value" b:att="value"/>
</doc>

could be serialized like this:

<?xml version="2.0"?>
<<{http://example.com/xmlns/doc}doc>
  xmlns:a="http://example.com/xmlns/a">
  <<{http://example.com/xmlns/doc}p> a:att="value"
   <{http://example.com/xmlns/b}att>="value"/>
</<{http://example.com/xmlns/doc}doc>>

I wouldn't recommend that serialization and I certainly wouldn't want to author in it, but it would allow applications to serialize any document or document fragment.
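
To make the idea concrete, here is a minimal Python sketch of such a serializer. It is an illustration, not a definitive implementation: `brace` and `serialize` are hypothetical names, an XML 1.x parser reads the input, and unlike the hand-written example above it expands every namespaced name and emits no namespace declarations at all.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def brace(name: str) -> str:
    """Wrap an ElementTree {uri}local name in the proposed <{uri}local> form."""
    return "<" + name + ">" if name.startswith("{") else name

def serialize(elem: ET.Element) -> str:
    """Serialize an element tree using only fully expanded names."""
    name = brace(elem.tag)
    attrs = "".join(f" {brace(k)}={quoteattr(v)}" for k, v in elem.attrib.items())
    body = escape(elem.text or "")
    for child in elem:
        body += serialize(child) + escape(child.tail or "")
    if not body:
        return f"<{name}{attrs}/>"
    return f"<{name}{attrs}>{body}</{name}>"

source = """<doc xmlns="http://example.com/xmlns/doc"
     xmlns:a="http://example.com/xmlns/a"
     xmlns:b="http://example.com/xmlns/b">
  <p a:att="value" b:att="value"/>
</doc>"""
out = serialize(ET.fromstring(source))
print(out)
```

Because no name in the output depends on a prefix binding, any subtree can be cut out and serialized on its own, which is the property the paragraph above is after.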

Michael Sperberg-McQueen pointed out that a slight syntactic extension would allow you to specify the prefix as well. This would be handy, for example, to deal with the way the XQuery 1.0 and XPath 2.0 Data Model has implemented QNames as triples. I'm not sure this is necessary, but it might be a good thing.

On the whole, I think these proposals are a net simplification. I have some reservations about adding QName Information Items, and particularly about allowing them in attribute values, but I haven't thought of a better solution to the QName mess. And if XML 2.0 is worth doing at all, I think it's only worth doing if it is simpler than XML 1.0 _and_ solves the QName mess.

There's some more work we can do around the margins: clarify the semantics of the xml:lang and xml:space attributes, perhaps allow documents to have multiple top-level elements, remove the distinction between documents and external parsed entities (which don't exist anymore), and maybe do something about a binary format, depending on how that work plays out.

If you're an XML grease monkey, you can probably think of a few more things, but let your mantra be “simplify”. Repeat after me: no new features.