[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)
Neil Hodgson nyamatongwe at gmail.com
Wed Sep 22 01:15:13 CEST 2010
Ian Bicking:
I think the use case everyone has in mind here is where you get a URL from one of these sources, and you want to handle it. I have a hard time imagining the sequence of events that would lead to mojibake. Naive parsing of a document in bytes couldn't do it, because if you have a non-ASCII-compatible document your ASCII-based parsing will also fail (e.g., looking for b'href="(.*?)"').
It depends on what the particular ASCII-based parsing is doing. For example, the set of trail bytes in Shift-JIS overlaps with all of the ASCII letters and with some of the ASCII punctuation characters. A search or split on '@' or '|' may find the trail byte of a two-byte character rather than a true occurrence of that character, so the operation 'succeeds' but produces an incorrect result.
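As a rough illustration of that failure mode, here is a minimal sketch. The sample string is made up for the example; it relies on KATAKANA LETTER SMALL A encoding to a lead byte followed by 0x40, the same byte as ASCII '@', under Python's shift_jis codec.

```python
# Illustrative sample only: a two-byte Shift-JIS character followed by a
# genuine '@' separator and some ASCII text.
text = "\u30a1@host"            # KATAKANA LETTER SMALL A, '@', "host"
raw = text.encode("shift_jis")  # the trail byte of the first character is 0x40

# Naive byte-level split: also cuts inside the two-byte character.
naive = raw.split(b"@")

# Splitting the decoded text and re-encoding each part gives the intended result.
correct = [part.encode("shift_jis") for part in text.split("@")]

print(raw)      # the first two bytes form one character
print(naive)    # three pieces, the first is an orphaned lead byte
print(correct)  # two pieces, the first is the intact character
```

The naive split does not raise an error; it just returns one piece too many, and the first piece is no longer valid Shift-JIS.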
Over time, the set of trail bytes used has expanded: in GB18030 even digits are possible. Many of the most important characters for parsing, such as ''' "#%&.?/''', are still safe, as they cannot be trail bytes in the common double-byte character sets.
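One way to check this empirically is sketched below. The helper name can_be_trail_byte and the choice of codecs (Python's shift_jis, gbk and big5) are assumptions made for the example, and the result reflects what those codecs accept rather than the standards themselves.

```python
def can_be_trail_byte(char, codec):
    """Return True if some lead byte plus this ASCII byte decodes to ONE character.

    Hypothetical helper: brute-forces every candidate lead byte with the
    given Python codec.
    """
    trail = ord(char)
    for lead in range(0x81, 0xFF):
        try:
            decoded = bytes([lead, trail]).decode(codec)
        except UnicodeDecodeError:
            continue
        # A single-byte character followed by plain ASCII decodes to two
        # characters, so only a one-character result counts as a real pair.
        if len(decoded) == 1:
            return True
    return False

SAFE = ' "#%&.?/'
CODECS = ("shift_jis", "gbk", "big5")

for codec in CODECS:
    unsafe = [c for c in SAFE if can_be_trail_byte(c, codec)]
    print(codec, "unsafe among the 'safe' set:", unsafe)

print("'@' in shift_jis:", can_be_trail_byte("@", "shift_jis"))
print("'|' in shift_jis:", can_be_trail_byte("|", "shift_jis"))

# GB18030 four-byte sequences use ASCII digits in the 2nd and 4th positions.
four_byte = "\u0080".encode("gb18030")
print(four_byte, len(four_byte))
```

With these codecs the 'safe' characters should never show up as trail bytes, while '@' and '|' do in Shift-JIS, and the GB18030 encoding of U+0080 contains digit bytes.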
Neil