[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Stephen J. Turnbull stephen at xemacs.org
Wed Sep 22 04:59:23 CEST 2010


Neil Hodgson writes:

> Over time, the set of trail bytes used has expanded - in GB18030 digits are possible although many of the most important characters for parsing such as ''' "#%&.?/''' are still safe as they may not be trail bytes in the common double-byte character sets.

That's just not true. Many double-byte character sets in use are based on ISO-2022, which allows the whole GL repertoire to be used.

Perhaps you're thinking about variable-width encodings like Shift JIS and Big5, where I believe that restriction on trailing bytes for double-byte characters holds. However, 7-bit encodings with control sequences remain common in several contexts, at least in Japan and Korea. In particular, I can't say how frequent it is, especially nowadays, but I have seen ISO-2022-JP in URLs "on the wire".
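To make the point about ISO-2022 concrete, here is a minimal sketch using only Python's built-in codecs. The helper name `delimiter_bytes` and the sample character are my own choices for illustration; it simply reports which of the "important for parsing" bytes show up in the encoded form of a string.

```python
# Bytes that are syntactically significant when parsing URLs
# (the set quoted from the earlier message).
DELIMITERS = b"\"#%&.?/'"

def delimiter_bytes(text, encoding):
    """Return the delimiter characters that occur as raw bytes
    once `text` is encoded with `encoding`."""
    encoded = text.encode(encoding)
    return sorted({chr(b) for b in encoded if b in DELIMITERS})

# U+304F HIRAGANA LETTER KU is 0x24 0x2F in JIS X 0208,
# so in ISO-2022-JP its second byte is an ASCII "/".
print(delimiter_bytes("\u304f", "iso-2022-jp"))  # ['/']

# In Shift JIS the same character is 0x82 0xAD: no delimiter bytes.
print(delimiter_bytes("\u304f", "shift_jis"))    # []
```

This checks a single character only, so it illustrates the difference rather than proving anything about either encoding as a whole.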

What really saves the day here is not that "common encodings just don't do that". It's that even in the case where only syntactically significant bytes in the representation are URL-encoded, they are URL-encoded. As long as the parsing library restricts itself to treating only wire-format input, you're OK.[1] But once you start doing things that involve decoding URL-encoding, you can run into trouble.
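A rough sketch of the difference, assuming a made-up path (`/docs/.../index.html`) whose middle segment is an ISO-2022-JP string that happens to contain a raw "/" byte. Only `urllib.parse.quote` and `urllib.parse.unquote_to_bytes` are used; the variable names are illustrative.

```python
from urllib.parse import quote, unquote_to_bytes

# One path segment whose ISO-2022-JP encoding contains a literal "/" byte.
raw_segment = "\u304f".encode("iso-2022-jp")   # b'\x1b$B$/\x1b(B'

# Conforming wire format: the significant byte is percent-encoded.
wire_path = "/docs/" + quote(raw_segment, safe="") + "/index.html"

# Parsing the wire format directly sees the intended four components.
wire_parts = wire_path.split("/")

# Undoing the URL-encoding first exposes the raw "/" and yields five.
decoded_first_parts = unquote_to_bytes(wire_path).split(b"/")

# Splitting first and decoding each piece afterwards recovers the text.
segments = [unquote_to_bytes(p).decode("iso-2022-jp") for p in wire_parts]

print(len(wire_parts), len(decoded_first_parts))  # 4 5
print(segments)  # ['', 'docs', 'く', 'index.html']
```

The order of operations is the whole point: split on the delimiters while the data is still in wire format, and only then undo the URL-encoding of each component.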

Footnotes:
[1] With conforming input. I assume that the libraries know how to defend themselves from non-conforming input, which could be any kind of bug or attack, not just mojibake.


