[Python-Dev] Python-3.0, unicode, and os.environ

Stephen J. Turnbull stephen at xemacs.org
Mon Dec 8 09:57:19 CET 2008


Glenn Linderman writes:

I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding?

I think you're thinking of XML, where validation can take significant resources over and above syntax checking. For Unicode, not unless you're seriously CPU-bound. Unicode validation is a matter of a few range checks and a couple of flags to handle things like lone surrogates.
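
To make "a few range checks and a couple of flags" concrete, here is a minimal sketch of what per-code-point validation amounts to; the function names are invented for illustration, not taken from Python's codec machinery:

    def is_unicode_scalar(cp):
        # A Unicode scalar value is any in-range code point outside the surrogate block.
        if cp < 0 or cp > 0x10FFFF:          # outside the Unicode code space entirely
            return False
        if 0xD800 <= cp <= 0xDFFF:           # lone surrogates are not scalar values
            return False
        return True

    def validate(code_points):
        for i, cp in enumerate(code_points):
            if not is_unicode_scalar(cp):
                raise ValueError("invalid code point U+%04X at index %d" % (cp, i))

That's the whole cost: two comparisons against constants per code point, which is noise next to the decode loop itself.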

In the case of "excess length" (overlong sequences) in UTF-8, you can often detect it in zero time if you use a table to analyze the leading byte (e.g., 0xC0 and 0xC1 are invalid UTF-8 leading bytes because they could only decode to U+0000 through U+007F, i.e., the ASCII range), because you have to check for 0xFE and 0xFF anyway, which can't be UTF-8 leading bytes either. (I'm not sure this generalizes to longer UTF-8 sequences, but it would reject the use of 0xC0 0xAF to sneak in a "/" in zero time!)
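
For the curious, here is a sketch of that leading-byte table; the table name is invented for this example and isn't from any real decoder:

    # LEAD[b] is the sequence length implied by leading byte b, or 0 if b can
    # never start a well-formed UTF-8 sequence.
    LEAD = [0] * 256
    for b in range(0x00, 0x80): LEAD[b] = 1   # ASCII
    for b in range(0xC2, 0xE0): LEAD[b] = 2   # 0xC0/0xC1 excluded: only overlong forms
    for b in range(0xE0, 0xF0): LEAD[b] = 3
    for b in range(0xF0, 0xF5): LEAD[b] = 4   # 0xF5..0xFF can never appear
    # Continuation bytes 0x80..0xBF stay 0: they can't lead a sequence either.

    assert LEAD[0xC0] == 0    # the smuggled "/" (0xC0 0xAF) dies on its first byte

Since a decoder has to look up the leading byte anyway to learn the sequence length, rejecting 0xC0, 0xC1, 0xFE, and 0xFF through the same table costs nothing extra. (Overlong three- and four-byte forms additionally need a range check on the first continuation byte, which is presumably why it doesn't generalize quite as neatly.)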

So I think it should be logically decoupled... do validation when/where it is needed for security reasons,

Security is an important application, but the real issue is that naively decoded text is a bomb with a sensitive impact fuse. Pass it around long enough, and it will blow up eventually.

The whole point of the fairly complex rules about Unicode formats, and the requirement that broken coding be a fatal error in a conforming Unicode process, is to ensure that Unicode exceptions[1] only ever occur on input (or on memory corruption and the like, which is actually a form of I/O, of course). That's where the efficiency comes from.

I think Python 3 should aspire to (eventually) be a conforming process by default, with lax behavior an option.

and allow internal [de]coding to be faster.

"Internal decoding" is (or should be) an oxymoron. Why would your software be passing around text in any format other than internal? So decoding will happen (a) on I/O, which is itself almost certainly slower than making a few checks for Unicode hygiene, or (b) on receipt of data from other software that whose sanitation you shouldn't trust more than you trust the Internet.

Encoding isn't a problem, AFAICS.

You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors?

Because as long as you're decoding anyway, it costs no more to do it right, except in rare cases. Why do you think Python should aspire to "quick and dirty" in a context where dirty is known to be unhealthy and there is no known need for speed? Why impose "doing it right" on the application programmer when there's a well-defined spec for it that we could implement in the standard library?

It's the errors themselves that people are objecting to. See Guido's posts for concisely stated arguments for a "don't ask, don't tell" policy toward Unicode breakage. I agree that Python should implement that policy as an option, but I think that the user should have to request it either with a runtime option or (in the case of user == app programmer) by deliberately specifying a lax codec. The default Unicode codecs should definitely aspire to full Unicode conformance within their sphere of responsibility.
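
To make the default-vs-option split concrete, here is what the two policies look like with the error handlers the Python 3 codecs already expose (the byte string reuses the overlong-"/" example from above):

    data = b"abc\xc0\xafdef"     # contains the overlong encoding of "/"

    # Conforming / strict: broken input is an error at the I/O boundary.
    try:
        data.decode("utf-8")                      # "strict" is the default
    except UnicodeDecodeError as e:
        print("rejected:", e)

    # Lax, by explicit request: damage is marked with U+FFFD instead of raised.
    print(data.decode("utf-8", errors="replace"))

The lax behavior stays available, but the programmer has to ask for it; silence means strict.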

Footnotes:
[1] A character outside the repertoire that the app can handle is not a "Unicode exception", unless the reason the app can't handle it is that the Unicode handler blew up.


