[Python-Dev] Re: URL processing conformance and principles (was Re: urllib.urlopen...)
Mike Brown mike at skew.org
Fri Sep 17 07:39:11 CEST 2004
"Martin v. Löwis" wrote:
> > Are we in agreement on these points?
>
> I think I have to answer "no". The % notation is not a quirk of the BNF.
That's not what I said at all. The quirk of the BNF is a completely separate issue, and is this: BNF mandates that its terminals are integers, e.g. character ":" in a particular BNF-based grammar represents the value 58 (in decimal). RFC 2396 makes use of the grammar to define the generic syntax, but stipulates (well, rfc2396bis clarifies that the intent was to stipulate) that the intent is to actually define the syntax in terms of characters, so the ":" in the grammar really does mean the colon character, in that spec.
So there is no disagreement there, really.
> > - A URL/URI consists of a finite sequence of Unicode characters;
>
> No. A URI consists of a finite sequence of characters.
You are correct. This is stated in RFC 2396, and Martin Duerst and I pushed for rfc2396bis to settle upon a definition of character just to make it extra clear, so I should have known better.
> > - If given unicode, each character in the string directly represents
> > a character in the URL/URI and needs no interpretation;
>
> No. Only ASCII characters in the string need no interpretation. For non-ASCII characters, urllib needs to assume some escaping mechanism.
Err, no. Let me start over. The question is: what do we do with a unicode object given as the 'url' argument in urllib.urlopen(), etc.?
Assumption 1: Resolution to absolute form and subsequent dereferencing of a character sequence that is intended to identify a resource, in order to be performed in a manner that is conformant with [pick one: RFC 1630, RFC 1738, RFC 1808, RFC 2396, the RFC that rfc2396bis will likely become, or the RFC that the IRIs draft will likely become], requires that the character sequence actually be [depending on which spec you chose] a URL, a URI reference, or an IRI reference. Those standards do not define how to resolve & dereference other types of resource identifiers, be they character sequences or otherwise.
Assumption 2: The aforementioned standards unambiguously define the syntax to which a resource-identifying character sequence must conform in order to be considered a URL, a URI reference, or an IRI reference. The standards do not define how character sequences that do not conform to the syntax can be processed (but they do not forbid such processing; they just say that they aren't applicable to those situations).
Assumption 3: When an argument is given to an RFC 1808-era URL resolution function that is documented as requiring that the argument be [an object that represents] a 'URL', then the caller implicitly asserts that whatever object passed indeed represents a URL.
Assumption 4: The object passed into the function, of course, is going to manifest relatively concretely, as, say, a Python str or unicode object, so the function, if it intends to perform standards-conformant resolution, must behave as if it has interpreted the object as a resource-identifying sequence of abstract characters, and must verify somehow that the sequence adheres to the syntax requirements of a URL / URI ref / IRI ref. This verification can either be an explicit syntax check, or can fall out of the way the object is interpreted as resource-identifying characters.
In either case, we need to define the mechanics of that conversion. This is what I am attempting to unambiguously do for str and unicode arguments by saying how each item in a str or unicode object maps to the characters that are going to be treated as a URL/URI ref.
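To make the mechanics concrete, here is a rough sketch of the kind of strict interpretation I have in mind. (The helper name and the byte-to-character mapping for str are my own invention for illustration; this is not what urllib does today.)

    import re

    # Character repertoire of an RFC 2396 URI reference: unreserved
    # (alphanum + mark), reserved, '%' for escaped octets, and '#' as
    # the fragment delimiter. This checks only the repertoire, not the
    # full grammar, so it is the weaker of the two verification styles.
    _URI_CHARS = re.compile(r"[A-Za-z0-9;/?:@&=+$,\-_.!~*'()%#]*\Z")

    def as_uri_characters(url):
        # Map the argument to abstract characters: a unicode object
        # maps one-to-one by code point; a str maps each byte to the
        # character with the same code point.
        if isinstance(url, str):
            url = url.decode('iso-8859-1')
        if not _URI_CHARS.match(url):
            raise ValueError('not a URI reference: %r' % (url,))
        return url.encode('ascii')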
It is true that we are under no obligation in our API to assume a one-to-one mapping between the characters in a unicode argument and the characters in the resource-identifying string (which, in turn, may or may not be a URL), but to do otherwise seems a bit unintuitive to me. You seem to be suggesting that a one-to-one mapping be assumed until a syntax error is found. Then, if the syntax error is of a certain type (like a character above U+007F), you seem to be saying that you want some kind of cleanup to be performed in order to ensure that the resulting string conforms to the URL syntax.
I feel that since urllib is under no obligation to assume anything about what the syntax-violating characters are intended to mean, it would be within its rights to reject the argument altogether, and I would rather see it do that than try to guess what the user intended -- especially in this domain, where such guesses, if wrong, only lead developers to be even more confused about topics that are already barely understood as it is.
For example, some specs (HTML, XHTML, XSLT) suggest that processors of those types of documents perform UTF-8 based percent-encoding of any non-ASCII characters that mistakenly appear in attribute values that are normally supposed to contain URI references (hrefs and the like). Users who rely on this then wonder why many widely-deployed HTTP servers/CGI/PHP apps, etc. -- the ones that assume %-encoded octets in the Request-URI are iso-8859-1 based -- misinterpret the characters. To me, convenience afforded by the automatic percent-encoding is outweighed by the harm introduced by the wrong guesses and the reinforcement of the belief in the document author or developer that a URI reference is whatever string of characters they want it to be.
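For concreteness, the suggested escaping amounts to something like this (my illustration of what those document specs describe, not anything urllib currently does):

    import urllib

    # Encode non-ASCII characters as UTF-8, then %-escape each octet,
    # leaving the URI's reserved and unreserved characters alone.
    def html_style_escape(chars):
        return urllib.quote(chars.encode('utf-8'),
                            safe="/:@&=+$,;?#%~!*'()-_.")

    print html_style_escape(u'http://example.org/l\xf6wis')
    # -> http://example.org/l%C3%B6wis
    # A server that assumes the %-escaped octets are iso-8859-1 will
    # decode %C3%B6 as two characters, not the single character the
    # document author intended.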
I have a feeling this is a matter of personal philosophy. I've never been a huge fan of the "be lenient in what you accept, strict in what you produce" mantra. URLs/URIs have a strict syntax, and IMHO we should enforce it so that developers can learn about and code to standards, rather than becoming reliant upon the crutch of lenient-yet-convenient APIs.
But if we are going to accept arbitrary strings and then attempt to make 'em fit the URL syntax, then we should, IMHO, acknowledge (in API documentation) that this is behavior provided for the sake of having a convenient API, and is not within the scope of the standards. Hopefully the marginal percentage of developers who actually read the API docs can then learn that u'http://m.v.l\xd6wis/' is not a URL, even if urllib happens to convert it to one, and in my perfect fantasy-world, they'd be less inclined to give us any reason to make lenient APIs. Actually, in a perfect world I probably would not be inclined to obsess over such things :)
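(For what it's worth, the two hypothetical helpers sketched above would treat that string exactly the way I'd want the docs to describe:

    as_uri_characters(u'http://m.v.l\xd6wis/')   # raises ValueError
    html_style_escape(u'http://m.v.l\xd6wis/')   # 'http://m.v.l%C3%96wis/'

i.e., the strict path rejects the non-URL outright, while the lenient path quietly converts it.)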
-Mike