[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)
Nick Coghlan ncoghlan at gmail.com
Wed Sep 22 14:07:47 CEST 2010
On Wed, Sep 22, 2010 at 12:59 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Neil Hodgson writes:
>
>  > Over time, the set of trail bytes used has expanded - in GB18030
>  > digits are possible although many of the most important characters
>  > for parsing such as ''' "#%&.?/''' are still safe as they may not
>  > be trail bytes in the common double-byte character sets.
>
> That's just not true. Many double-byte character sets in use are
> based on ISO-2022, which allows the whole GL repertoire to be used.
>
> Perhaps you're thinking about variable-width encodings like Shift JIS
> and Big5, where I believe that restriction on trailing bytes for
> double-byte characters holds. However, 7-bit encodings with control
> sequences remain common in several contexts, at least in Japan and
> Korea. In particular, I can't say how frequent it is, especially
> nowadays, but I have seen ISO-2022-JP in URLs "on the wire".
Notably, utf-16 and utf-32 make no promises regarding avoidance of ASCII character codes in trail bytes - only utf-8 is guaranteed to be compatible with parsing as if it were ASCII (and even then, you need to be careful only to split the string at known ASCII characters rather than at arbitrary points).
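As a rough illustration of that difference, here is a minimal sketch that encodes one non-ASCII character both ways and then splits on b"/". The sample character (U+012F) is chosen only because its UTF-16 code unit happens to contain the byte 0x2F; it is not taken from the thread.

```python
# Sketch: compare how UTF-16 and UTF-8 encode a single non-ASCII character.
# The sample character is an arbitrary choice for illustration.
ch = "\u012f"  # LATIN SMALL LETTER I WITH OGONEK

utf16 = ch.encode("utf-16-le")  # low byte of the code unit is 0x2F, i.e. ASCII '/'
utf8 = ch.encode("utf-8")       # every byte of a non-ASCII character is >= 0x80

slash_in_utf16 = b"/" in utf16
utf8_all_high = all(byte >= 0x80 for byte in utf8)

# Splitting UTF-16 bytes on b"/" cuts the character in half,
# even though the text contains no '/' at all.
utf16_parts = ("a" + ch + "b").encode("utf-16-le").split(b"/")

# Splitting UTF-8 bytes on b"/" only ever splits at a real '/'.
utf8_parts = ("a/" + ch + "b").encode("utf-8").split(b"/")

print(slash_in_utf16, utf8_all_high)
print(utf16_parts)
print(utf8_parts)
```

The UTF-8 split is only safe because the delimiter is a known ASCII byte; slicing at an arbitrary byte offset can still land in the middle of a multibyte sequence.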
The known-ASCII-incompatible multibyte encodings I came up with when I reviewed the list in the codecs module docs the other day were:

- CP932 (the example posted here that prompted me to embark on this check in the first place)
- UTF-7
- UTF-16
- UTF-32
- shift-JIS
- big5
- iso-2022-*
- EUC-CN/KR/TW
The only known-ASCII-compatible multibyte encodings I found were UTF-8 and EUC-JP (all of the non-EBCDIC single-byte encodings appeared to be ASCII compatible, though).
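One way to spot-check a codec is sketched below. The helper name `ascii_bytes_in` and the sample character are hypothetical, introduced only for this example. A non-empty result shows that the codec can emit ASCII-range bytes for a non-ASCII character; an empty result for a single sample proves nothing about the codec as a whole.

```python
def ascii_bytes_in(char, encoding):
    """Return the ASCII-range bytes (< 0x80) produced when encoding
    a single non-ASCII character with the given codec.

    Hypothetical helper for illustration only.
    """
    if ord(char) < 0x80:
        raise ValueError("expected a non-ASCII character")
    return bytes(b for b in char.encode(encoding) if b < 0x80)


sample = "\u30bd"  # KATAKANA LETTER SO, a well-known Shift JIS problem character

for name in ("utf-8", "euc_jp", "shift_jis", "cp932", "iso2022_jp", "utf-16-le"):
    # Print the codec name and whichever ASCII-range bytes leaked through.
    print(name, ascii_bytes_in(sample, name))
```

For this sample, the Shift JIS and CP932 trail byte is 0x5C (a backslash), and the ISO-2022-JP output is entirely 7-bit, escape sequences included.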
I didn't check any of the other CP* encodings though, since I already had plenty of examples to show that the assumption of ASCII compatibility isn't likely to be valid in general unless there is some other constraint (such as the RFCs for safely encoding URLs to an octet-sequence).
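To make the URL constraint concrete, here is a small sketch using `urllib.parse.quote` and `unquote_to_bytes`. It assumes, purely for illustration, that both ends have agreed on Shift JIS out of band; the percent-encoded form itself carries no encoding information.

```python
from urllib.parse import quote, unquote_to_bytes

text = "\u30bd"  # KATAKANA LETTER SO; sample character chosen for illustration
raw = text.encode("shift_jis")  # two bytes, the second being ASCII backslash (0x5C)

# Percent-encoding escapes every byte outside the unreserved set, so the
# result is pure ASCII regardless of the underlying character encoding.
quoted = quote(text, encoding="shift_jis")

# Decoding the octets is lossless; interpreting them as text still
# requires knowing the original encoding (assumed here to be Shift JIS).
round_trip = unquote_to_bytes(quoted)

print(quoted)
print(round_trip == raw, round_trip.decode("shift_jis") == text)
```

Once the data is in this form, splitting on ASCII delimiters is safe, because any problematic trail byte has already been escaped.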
Cheers,
Nick.

--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia