URLs and internationalization from Martin J. Duerst on 1996-12-20 (uri@w3.org from December 1996)

Hello everybody,

This is hopefully the last of the series of mails regarding the URL syntax draft (of course discussion may follow). This series of mails may have given some of you the impression that I think there is nothing good in the current draft. If this should be the case, I apologize from my heart. I think that it is a very good draft, and I would like it to become even better. And I think that this can be accomplished without unnecessary delays.

So let me get to the point of internationalization (i18n) of URLs. Currently, URLs are not in a very good state re. i18n, and many people doubt whether that can be improved. I think it can. If you look at the discussion in ftp-wg, the URN syntax draft, the IAB charset workshop report (draft-weider-iab-char-wrkshop-00.txt), my draft on domain name internationalization (draft-duerst-dns-i18n-00.txt), and in particular http://www.alis.com:8085/~yergeau/url-00.html, you will see that there is one direction we should go, namely UTF-8.

There are also some people that think that URL i18n should never happen. I have addressed some of their concerns in my mail about transcribability.

In very many places the draft currently does the right thing: if not to further URL i18n, then at least to avoid making it more difficult in the future, and to avoid creating too many legacy cases that make the transition harder. Below, I will mention both these cases and those parts where I think change is needed to keep the doors open for the future.

As there seems to be strong interest in finishing the draft soon, it would probably be too time-consuming to include a full i18n solution, including transitory provisions, in it. I therefore propose to write (myself) a separate document on URL i18n. I hope the newly forming working group will adopt it as one of their documents, and will integrate the relevant portions of it into the "URL schemes requirements" document that is currently the main focus of the new group. I also volunteer to participate as author/editor of that document, to take care of i18n and related issues.

After these preliminaries, let's have a look at the current syntax draft:

1.4. Syntax Notation and Common Elements

Unlike many specifications which use a BNF-like grammar to define the bytes (octets) allowed by a protocol, the URL grammar is defined in terms of characters. Each literal in the grammar corresponds to the character it represents, rather than to the octet encoding of that character in any particular coded character set. How a URL is represented in terms of bits and bytes on the wire is dependent upon the character encoding of the protocol used to transport it, or the charset of the document which contains it.

Good! If URLs might ever be extended beyond their canonical form, and decently internationalized, that will not have to be changed at all.

2.3.1. Escaped Encoding

An escaped character is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing the character's octet code in an 8-bit coded character set. For example, "%20" is the escaped encoding for the space character.
  escaped     = "%" hex hex
  hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                        "a" | "b" | "c" | "d" | "e" | "f"
The 8-bit coded character set of the octet must be a superset of the US-ASCII coded character set, such that the US-ASCII characters have the same escaped encoding regardless of the larger octet character set.
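
To make the triplet rule concrete, here is a minimal Python sketch of the "escaped" production quoted above. The names escape_octet and unescape_triplets are hypothetical, chosen only for illustration; they do not appear in the draft. The sketch works purely on octets and makes no claim about which characters those octets stand for.

```python
import re

# Matches the grammar: escaped = "%" hex hex (upper- or lowercase hex digits).
HEX_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def escape_octet(octet: int) -> str:
    """Return the "%" hex hex triplet for a single octet (0..255)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return "%{:02X}".format(octet)


def unescape_triplets(text: str) -> bytes:
    """Turn a URL component into octets.

    Each triplet becomes the octet it names; every other character is
    taken as its US-ASCII octet (non-ASCII input raises an error).
    """
    out = bytearray()
    pos = 0
    for match in HEX_TRIPLET.finditer(text):
        out += text[pos:match.start()].encode("ascii")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += text[pos:].encode("ascii")
    return bytes(out)
```

With this sketch, escape_octet(0x20) gives "%20", matching the space example in the quoted text.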

I commented on this in terms of protocol autonomy. There are also some important concerns for i18n. It is nice to see that people think URLs make more sense if the characters they represent can be identified. But it is extremely presumptuous and unfair to require that, without exception, for ASCII, whereas there is no guarantee whatsoever for the rest of the world.

I therefore propose that the paragraph:

The 8-bit coded character set of the octet must be a superset of the US-ASCII coded character set, such that the US-ASCII characters have the same escaped encoding regardless of the larger octet character set.

be dropped/eliminated/removed. I also strongly suggest that the draft be reverted to the "octet"->"character" model as in the previous RFC. I suggest that the language from that RFC be taken.

The coded character set chosen must correspond to the character set of the mechanism that will interpret the URL component in which the escaped character is used. A sequence of escape triplets are used if the character is coded as a sequence of octets.

It makes ample sense that the mapping from the URL to the octets used in the mechanism is deterministic and well specified, without any external information. But there is no need for the %HH-decoded octets to correspond exactly to what is used by the mechanism. For a good example of why this is so, please see my draft-duerst-dns-i18n-00.txt.

I therefore propose that the above paragraph be removed, and be replaced by:

The definition of individual URL schemes must assure that the
mapping from the resource identification to an URL and from
the URL to the mechanisms and protocols required to access
the resource are defined unambiguously.

Any character, from any character set, can be included in a URL via the escaped encoding, provided that the mechanism which will interpret the URL has an octet encoding for that character. However, only that mechanism (the originator of the URL) can determine which character is represented by the octet. A client without knowledge of the origination mechanism cannot unescape the character for display.
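
The ambiguity described in this paragraph can be illustrated with a small Python sketch. The two charsets below are picked arbitrarily as examples; nothing in the draft singles them out. The point is only that the same escaped octet yields different characters depending on which charset the reader assumes.

```python
from urllib.parse import unquote_to_bytes

# "%E9" unescapes to exactly one octet, 0xE9.
octets = unquote_to_bytes("%E9")

# Which character that octet represents depends on the assumed charset.
as_latin1 = octets.decode("iso-8859-1")  # U+00E9, small e with acute accent
as_koi8r = octets.decode("koi8_r")       # a Cyrillic letter instead

# A client that does not know the originating mechanism cannot tell
# which of the two readings (or some third one) was intended.
readings_differ = as_latin1 != as_koi8r
```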

This is the current, deplorable state. It is not satisfactory at all. It can be changed, not overnight, but step by step. As a preparation, I propose replacement by the following (assuming that, generally speaking, the draft is changed to "octet"->"character"):

The octets encoded in the URL will in many cases in turn encode
characters. In current practice, various encodings are used,
which means that only the originator of the URL can determine
which character is represented by which octets.
It can be expected that in the future, UTF-8 [RFC 2044], which
is fully compatible with US-ASCII, will
be the encoding of choice for URL components. Schemes and
mechanisms and the underlying protocols are suggested to
start using UTF-8 directly (for new schemes, similar to [URN]),
to make a gradual transition to UTF-8 (see draft-ietf-ftpext-intl-ftp-0?.txt
for an example), or to define a mapping from their representation
of characters to UTF-8 if UTF-8 cannot be used directly
(see draft-duerst-dns-i18n-0?.txt for an example).
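
As a rough illustration of what using UTF-8 for URL components would look like, the following Python sketch encodes a component as UTF-8 and then escapes the resulting octets. The helper names escape_utf8 and unescape_utf8 and the sample strings are assumptions made for this example only; the proposed text itself does not prescribe any particular procedure.

```python
from urllib.parse import quote, unquote


def escape_utf8(component: str) -> str:
    """Encode the component as UTF-8, then escape every octet that is
    not an unreserved US-ASCII character."""
    return quote(component, safe="", encoding="utf-8")


def unescape_utf8(component: str) -> str:
    """Reverse of escape_utf8: unescape the octets, then read them as UTF-8."""
    return unquote(component, encoding="utf-8", errors="strict")


# Pure US-ASCII input is unchanged, which is the compatibility the
# proposed text refers to.
ascii_only = escape_utf8("index")

# U+00FC (u with diaeresis) is two octets in UTF-8: 0xC3 0xBC.
non_ascii = escape_utf8("Z\u00fcrich")
```

Here non_ascii is "Z%C3%BCrich", and unescape_utf8 recovers the original string without any knowledge of the originating mechanism.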

This proposal may seem quite daring to many of you. But it is in nice accordance with a well-known previous case: the specification, in RFC 1866, of ISO 10646 as the "future document character set" for HTML. And it is less strict; no interpretation of octets in terms of UTF-8 is required, and no encoding of represented characters in terms of UTF-8 is required (whereas RFC 1866 requires interpretation of numeric character references in terms of ISO 10646). The big advantage of this proposal is also that the many readers of this document will be alerted to the issue and will be able to judge for themselves.

2.3.3. Excluded Characters

Excluded characters must be escaped in order to be properly represented within a URL. However, there do exist some systems that allow characters from the "unwise" and "national" sets to be used in URL references; a robust implementation should be prepared to handle those characters when it is possible to do so.

This is very dangerous. It sounds as if some systems could deal with such cases, but with "charset" labeling of document content and increased use of transcoding, this will work less and less!

This paragraph should therefore be changed as follows:

Excluded octets must be escaped in all cases in order to be
properly represented, transmitted, and transcoded within an URL.
There exist some systems that allow the unescaped use of such
octets [and the characters they represent]. As long as and
for those components where there is no uniform solution
(see [the last proposed text]), the consistency of the
URLs over various transports and transcodings cannot be
guaranteed in any way.
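
The transcoding problem can be sketched in Python as follows. The URL, the host example.org, and the choice of charsets are all made up for illustration. The sketch compares an unescaped "national" character with its escaped form when the containing document is stored in two different charsets.

```python
# Hypothetical URL with an unescaped national character (U+00E9).
raw_url = "http://example.org/caf\u00e9"
# The same URL with the octet 0xE9 escaped; the string is pure US-ASCII.
escaped_url = "http://example.org/caf%E9"

# The document containing the URL is stored in, or transcoded to,
# two different charsets.
raw_latin1 = raw_url.encode("iso-8859-1")   # ends in the octet 0xE9
raw_utf8 = raw_url.encode("utf-8")          # ends in the octets 0xC3 0xA9

escaped_latin1 = escaped_url.encode("iso-8859-1")
escaped_utf8 = escaped_url.encode("utf-8")

# The unescaped URL no longer carries the same octets after transcoding;
# the escaped URL is octet-for-octet identical in both charsets.
raw_survives = raw_latin1 == raw_utf8
escaped_survives = escaped_latin1 == escaped_utf8
```

In this sketch raw_survives is False and escaped_survives is True, which is the consistency argument the proposed text makes.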

Enjoy the holidays, Martin.

Received on Friday, 20 December 1996 17:33:34 UTC