On Thu, Jun 24, 2010 at 3:59 PM, Guido van Rossum <guido@python.org> wrote:

The protocol specs typically go out of their way to specify what byte values they use for syntactically significant positions (e.g. ':' in headers, or '/' in URLs), while hand-waving about the meaning of "what goes in between" since it is all typically treated as "not of syntactic significance". So you can write a parser that looks at bytes exclusively, and looks for a bunch of ASCII punctuation characters (e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks "inside" stretches of characters between the special characters and just copies them. (Sometimes there may be *some* sections that are required to be ASCII, and there the equivalence of a-z and A-Z is well defined.)


Yes, these are the specific characters that I think we can handle specially. For instance, here is the complete list of string literals used by urlsplit and urlunsplit:

'//'
'/'
':'
'?'
'#'
'' (the empty string)
'http'
A list of all valid scheme characters (a-z, etc.)
Some lists for scheme-specific parsing (which all contain only valid scheme characters)

All of these are, and must be, constrained to ASCII; everything else in a URL is treated as essentially opaque.


So if we turned these characters into byte-or-str objects I think we'd basically be true to the intent of the specs, and in a practical sense we'd be able to make these functions polymorphic. I suspect this same pattern will be present most places where people want polymorphic behavior.
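One way such a byte-or-str literal could behave is sketched below. This is purely hypothetical: the class name `special` and the ASCII-only encoding policy are my assumptions, not an agreed design. The idea is that the literal acts as a str by default, but follows a bytes operand into the bytes domain:

```python
class special(str):
    """Hypothetical ASCII literal: str with str operands, but it
    coerces itself to ASCII bytes when combined with bytes."""

    def __add__(self, other):
        if isinstance(other, bytes):
            # bytes on the right: join in the bytes domain
            return self.encode('ascii') + other
        return str(self) + other

    def __radd__(self, other):
        if isinstance(other, bytes):
            # bytes on the left: bytes.__add__ gave up, so we finish
            # the concatenation as bytes
            return other + self.encode('ascii')
        return other + str(self)


q = special('?')
print(q + 'bar=baz')   # stays str: '?bar=baz'
print(b'/foo' + q)     # follows the bytes operand: b'/foo?'
```

Note that there is no transitivity here: the result of `b'/foo' + q` is plain bytes, so adding a str to it still fails, which is exactly the point.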


For now we could do something incomplete and just avoid using operators we can't overload. (Is it possible to at least make them produce a readable exception?)

I think we'll avoid a lot of the confusion that was present with Python 2 by not making the coercions transitive. For instance, here's something that would work in Python 2:


urlunsplit(('http', 'example.com', '/foo', u'bar=baz', ''))

And you'd get out a unicode string, except that it would break the first time the query string (u'bar=baz') was not ASCII (but not until then!).
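That latent failure can be modeled in a few lines of Python 3. The helper `py2_concat` is made up for illustration; it imitates what Python 2's str + unicode did, namely implicitly decoding the byte side as ASCII:

```python
def py2_concat(byte_str, uni_str):
    # Hypothetical model of Python 2's str + unicode coercion:
    # the bytes side was implicitly ASCII-decoded before joining.
    return byte_str.decode('ascii') + uni_str


# Works as long as the bytes happen to be ASCII ...
print(py2_concat(b'http://example.com/foo?', 'bar=baz'))

# ... but blows up the first time they are not:
try:
    py2_concat(b'caf\xe9?', 'bar=baz')
except UnicodeDecodeError:
    print('non-ASCII bytes broke the implicit coercion')
```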


Here's the urlunsplit code:

def urlunsplit(components):
    scheme, netloc, url, query, fragment = components
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url
        url = '//' + (netloc or '') + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return url
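For reference, the all-str case already works and round-trips today; this uses the real urllib.parse on Python 3, with the same component tuple as the examples here:

```python
from urllib.parse import urlsplit, urlunsplit

parts = ('http', 'example.com', '/foo', 'bar=baz', '')
url = urlunsplit(parts)
print(url)                      # http://example.com/foo?bar=baz
assert urlsplit(url) == parts   # splits back into the same components
```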

If all those literals were this new special kind of string, then calling:

urlunsplit((b'http', b'example.com', b'/foo', 'bar=baz', b''))


You'd end up constructing the URL b'http://example.com/foo' and then running:

url = url + special('?') + query

And that would fail, because b'http://example.com/foo' + special('?') would be b'http://example.com/foo?', and you cannot add that to the str 'bar=baz'. So we'd be avoiding the Python 2 craziness.
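In plain Python 3 terms, that failing step is simply:

```python
url = b'http://example.com/foo?'   # bytes accumulated so far
query = 'bar=baz'                  # the stray str component
try:
    url + query
except TypeError as exc:
    # Mixing bytes and str is rejected immediately, at the first
    # mixed concatenation, rather than failing later on non-ASCII data.
    print(exc)
```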


--
Ian Bicking | http://blog.ianbicking.org