[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices) (original) (raw)

Neil Hodgson nyamatongwe at gmail.com
Wed Sep 22 01:15:13 CEST 2010


Ian Bicking:

I think the use case everyone has in mind here is where you get a URL from one of these sources, and you want to handle it.  I have a hard time imagining the sequence of events that would lead to mojibake. Naive parsing of a document in bytes couldn't do it, because if you have a non-ASCII-compatible document your ASCII-based parsing will also fail (e.g., looking for b'href="(.*?)"').

It depends on what the particular ASCII-based parsing is doing. For example, the set of trail bytes in Shift-JIS includes the same bytes as some of the punctuation characters in ASCII as well as all the letters. A search or split on '@' or '|' may find the trail byte in a two-byte character rather than a true occurrence of that character so the operation 'succeeds' but produces an incorrect result.

Over time, the set of trail bytes used has expanded - in GB18030 digits are possible although many of the most important characters for parsing such as ''' "#%&.?/''' are still safe as they may not be trail bytes in the common double-byte character sets.

Neil



More information about the Python-Dev mailing list