[Python-Dev] urllib.quote and unicode bug resuscitation attempt

Mike Brown mike at skew.org
Thu Jul 13 10:26:43 CEST 2006


Stefan Rank wrote:

on 12.07.2006 07:53 Martin v. Löwis said the following:
> Anthony Baxter wrote:
>>> The right thing to do is IRIs.
>> For 2.5, should we at least detect that it's unicode and raise a
>> useful error?
>
> That can certainly be done, sure.
>
> Martin

That would be great. And I agree that updating urllib.quote for unicode should be part of a grand plan that updates all of urllib[2] and introduces an irilib / urischemes / uriparse module in 2.6 as Martin and John J Lee suggested. =)

cheers,
stefan

Put me down as +1 on raising a useful error instead of a KeyError or whatever, and +1 on having an irilib, but -1 on working toward accepting unicode in the URI-oriented urllib.quote(), because (a.) user expectations for strings that contain non-ASCII-range characters will vary, and (b.) percent-encoding is supposed to only operate on a byte-encoded view of non-URI information, not the information itself.

Longer explanation:

I, too, initially thought that quote() was outdated since it choked on unicode input, but eventually I came to realize that it's wise to reject such input, because to attempt to percent-encode characters, rather than bytes, reflects a fundamental misunderstanding of the level at which percent-encoding is intended to operate.

This is one of the hardest aspects of URI processing to grok, and I'm not very good at explaining it, even though I've tried my best in the Wikipedia articles. It's basically these 3 points:

  1. A URI can only consist of characters from a restricted set: the 'unreserved' characters, plus the 'reserved' characters and '%', as I'm sure you know. It's a specific set that has varied slightly over the years, and is a subset of printable ASCII.

  2. A URI scheme is essentially a mapping of non-URI information to a sequence of URI characters. That is, it is a method of producing a URI from non-URI information within a particular information domain ...and vice-versa.

  3. A URI scheme should (though may not do so very clearly, especially the older it is!) tell you that the way to represent a particular bit of non-URI information, 'info', in a URI is to convert_to_bytes(info), and then, as per STD 66, make the bytes that correspond, in ASCII, to unreserved characters manifest as those characters, and all others manifest as their percent-encoded equivalents. In urllib parlance, this step is 'quoting' the bytes.

  3.1. [This isn't crucial to my argument, but has to be mentioned to complete the explanation of percent-encoding.] In addition, those bytes corresponding, in ASCII, to some 'reserved' characters are exempt from needing to be percent-encoded, so long as they're not being used for their reserved purpose (if any) in whatever URI component they're going into. Semantically, there's no difference between such a byte expressed in the URI as a literal reserved character and the same byte expressed in percent-encoded form. URI scheme specs vary greatly in how they deal with this nuance. In any case, urllib.quote() has the 'safe' argument, which can be used to specify the exempt reserved characters.
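As a rough illustration of point 3 (and of the 'safe' exemption in 3.1), here is a minimal sketch of the byte-level step. It is written for Python 3, where bytes and text are distinct types, rather than for the 2.x urllib under discussion. The helper name percent_encode and the UNRESERVED constant are hypothetical, and the unreserved set used is the one from RFC 3986, which is not quite the same as the default set urllib.quote() leaves alone.

```python
# Unreserved set per RFC 3986 (STD 66): ALPHA / DIGIT / "-" / "." / "_" / "~".
# Iterating a bytes object in Python 3 yields ints, so this is a set of ints.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(data, safe=b""):
    """Hypothetical helper: map a byte sequence to a string of URI characters.

    `safe` lists the bytes (reserved characters, in ASCII) that the caller
    wants to leave unencoded, in the spirit of urllib's 'safe' argument.
    """
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("percent_encode() expects bytes, got %s"
                        % type(data).__name__)
    keep = UNRESERVED | frozenset(safe)
    out = []
    for byte in data:
        if byte in keep:
            out.append(chr(byte))        # shown as the ASCII character itself
        else:
            out.append("%%%02X" % byte)  # shown as a percent-encoded byte
    return "".join(out)


print(percent_encode(b"abc 123%"))         # abc%20123%25
print(percent_encode(b"/a b", safe=b"/"))  # /a%20b
```

Note that the function never sees the original non-URI information, only the bytes that some scheme-specific convert_to_bytes() step produced from it.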

In the days when the specs that urllib was based on were relevant, 99% of the time, the bytes being 'quoted' were ASCII-encoded strings representing ASCII character-based non-URI information, so quite a few of us, including many URI scheme authors, were tempted to think that what was being 'quoted'/percent-encoded was the original non-URI information, rather than a bytewise view of it mandated by a URI scheme. That's what I was doing when I thought that quote(some_unicode_path) should 'work', especially in light of Python's "treat all strings alike" guideline. But if you accept all of the above, which is what I believe the standard requires, then unicode input is a very different situation from str input; it's unclear whether and how the caller wants the input to be converted to bytes, if they even understand what they're doing at all.

See, right now, quote('abc 123%') returns 'abc%20123%25', as you would expect. Similarly, everyone would probably expect quote(u'abc 123%') to return u'abc%20123%25', and if we were to implement that, there'd probably be no harm done.

But look at quote('\xb7'), which, assuming you accept that everything I've said above is correct, rightfully returns '%B7'. What would someone expect quote(u'\xb7') to return? Some might want u'%B7' because they want the same result type as the input they gave, with no other changes from how it would normally be handled. Some might want u'%C2%B7' because they're conflating the levels of abstraction and expect, say, UTF-8 conversion to be done on their input. Some (like me) might want a TypeError or ValueError because we shouldn't be handing such ambiguous data to quote() in the first place. And then there's the u'\u0100'-and-up input to worry about; what does a user expect to be done with that?
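The ambiguity can be made concrete with a short sketch. It assumes Python 3, where the function lives at urllib.parse.quote and accepts a bytes argument; the variable names are only illustrative. The same character yields different URI text depending on which byte encoding the caller picks first, and some encodings cannot represent u'\u0100' at all.

```python
from urllib.parse import quote

middle_dot = "\u00b7"  # the character written as u'\xb7' in the text above

# The caller, not quote(), decides how the character becomes bytes.
as_latin1 = quote(middle_dot.encode("latin-1"))
as_utf8 = quote(middle_dot.encode("utf-8"))
print(as_latin1)  # %B7
print(as_utf8)    # %C2%B7

# U+0100 and up has no Latin-1 byte form, so that choice is not even available.
try:
    "\u0100".encode("latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False
print(latin1_can_encode)  # False
```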

I would prefer to see quote() always reject unicode input with a TypeError. Alternatively, if it accepts unicode, it should produce unicode, and since it can only reasonably assume what the user wants done with ASCII-range characters, it should only accept input < u'\x80'.
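Both options could look roughly like the following sketch. It assumes Python 3's urllib.parse.quote_from_bytes as the underlying byte-level quoter; the wrapper names strict_quote and ascii_only_quote are hypothetical and are not part of urllib. Because Python 3 always returns text from its quoting functions, the "unicode in, unicode out" aspect of the second option is automatic here.

```python
from urllib.parse import quote_from_bytes


def strict_quote(data, safe="/"):
    """Hypothetical wrapper for the first option: bytes only, text is rejected."""
    if isinstance(data, str):
        raise TypeError(
            "strict_quote() requires bytes; encode the text the way "
            "the URI scheme dictates first"
        )
    return quote_from_bytes(bytes(data), safe=safe)


def ascii_only_quote(text, safe="/"):
    """Hypothetical wrapper for the second option: text below U+0080 only."""
    try:
        data = text.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("only characters below U+0080 are accepted") from None
    return quote_from_bytes(data, safe=safe)


print(strict_quote(b"abc 123%"))      # abc%20123%25
print(ascii_only_quote("abc 123%"))   # abc%20123%25
```

With either wrapper, a caller holding u'\xb7' is forced to choose an encoding explicitly before quoting, which is the point of the proposal.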

In any case, quote() should be better documented to explain what it accepts (a byte sequence), why (it is intended to be used at the stage of URI production where non-URI info, such as a unicode filesystem path, has already been converted to bytes according to the requirements of a URI scheme, and now needs to be represented as a URI-safe character sequence), and exactly what it produces (a str representing URI characters).

Mike


