[Python-Dev] Python-3.0, unicode, and os.environ

Glenn Linderman v+python at g.nevcal.com
Mon Dec 8 07:04:08 CET 2008


On approximately 12/7/2008 9:11 PM, came the following characters from the keyboard of Adam Olsen:

On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman <v+python at g.nevcal.com> wrote:

On approximately 12/7/2008 8:13 PM, came the following characters from the keyboard of Stephen J. Turnbull:

Glenn Linderman writes:

> But if you are interested in checking for security issues, shouldn't you
> first decode into some canonical form,

Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want.

I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding? So I think it should be logically decoupled... do validation when/where it is needed for security reasons, and allow internal [de]coding to be faster.

I'd like to see benchmarks of such a claim.

"significantly" seems to be the only word at question; it seems that there are a fair number of validation checks that could be performed; the numeric part of UTF-8 decoding is just a sequence of shifts, masks, and ORs, so can be coded pretty tightly in C or assembly language.

Anything extra would be slower; how much slower is hard to predict prior to an implementation. My "significantly" was just the expectation that the larger code, with the extra conditional branches required for validation, is less likely to stay in cache, takes longer to load into cache, and takes longer to execute. This also seems to be supported by Stephen's comment "That's a lot to ask, as it turns out."
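For comparison, here is a sketch of the extra conditional branches a strict decoder needs: continuation-byte checks, rejection of overlong forms and surrogates, and the U+10FFFF ceiling. Again illustrative Python with a hypothetical name, not CPython's actual codec:

    def decode_utf8_strict(data):
        """The same shifts/masks/ORs as above, plus the extra branches a
        validating decoder needs.  Illustrative only."""
        out = []
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                                    # ASCII
                cp, size, lowest = b, 1, 0
            elif 0xC2 <= b <= 0xDF:                         # 2-byte lead (0xC0/0xC1 are always overlong)
                cp, size, lowest = b & 0x1F, 2, 0x80
            elif 0xE0 <= b <= 0xEF:                         # 3-byte lead
                cp, size, lowest = b & 0x0F, 3, 0x800
            elif 0xF0 <= b <= 0xF4:                         # 4-byte lead
                cp, size, lowest = b & 0x07, 4, 0x10000
            else:                                           # stray continuation or invalid lead byte
                raise ValueError('invalid start byte at %d' % i)
            if i + size > n:
                raise ValueError('truncated sequence at %d' % i)
            for j in range(i + 1, i + size):
                if not 0x80 <= data[j] <= 0xBF:             # continuation must be 10xxxxxx
                    raise ValueError('invalid continuation byte at %d' % j)
                cp = (cp << 6) | (data[j] & 0x3F)
            if cp < lowest:                                 # overlong encoding
                raise ValueError('overlong sequence at %d' % i)
            if 0xD800 <= cp <= 0xDFFF:                      # surrogates are not valid scalar values
                raise ValueError('surrogate at %d' % i)
            if cp > 0x10FFFF:                               # beyond the Unicode range
                raise ValueError('code point out of range at %d' % i)
            out.append(chr(cp))
            i += size
        return ''.join(out)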

Once upon a time I did write a non-validating UTF-8 encoder/decoder in C; I wonder if I could find that code? Can you supply a validating decoder? Then we could run some benchmarks, eh?
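A crude timeit harness along those lines is easy to set up, though this one is only a proxy: it compares CPython's strict UTF-8 decoder against the latin-1 decoder (which does no multi-byte work and no validation at all), not a validating versus non-validating UTF-8 pair:

    import timeit

    # Roughly 0.8 MB of mostly ASCII text with some 2-byte sequences mixed in.
    data = ('pyth\u00f6n ' * 100000).encode('utf-8')

    for codec in ('utf-8', 'latin-1'):
        seconds = timeit.timeit(lambda: data.decode(codec), number=100)
        print('%-8s %.3fs for 100 decodes of %d bytes' % (codec, seconds, len(data)))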

I'm mostly indifferent about which should be the default... maybe there shouldn't be a default! Use the "vUTF-8" decoder for strict validation, and the "fUTF-8" decoder for the faster, non-validating version. Or something like that. With appropriate documentation. Of course, "UTF-8" already exists... as "fUTF-8", so for compatibility, I guess it shouldn't change... but it could be deprecated.
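A sketch of how two separately named codecs could be wired up with codecs.register, using the hypothetical "vUTF-8"/"fUTF-8" names above; the "fast" variant here merely substitutes a lenient error handler, since a genuinely non-validating decoder would need its own C-level implementation:

    import codecs

    def _search(name):
        # Search functions receive lower-cased names.
        if name == 'vutf-8':
            return codecs.lookup('utf-8')       # strict validation, the existing codec
        if name == 'futf-8':
            info = codecs.lookup('utf-8')
            return codecs.CodecInfo(
                name='futf-8',
                encode=info.encode,
                decode=lambda data, errors='replace': info.decode(data, 'replace'),
            )
        return None

    codecs.register(_search)

    print(b'caf\xc3\xa9'.decode('vUTF-8'))      # 'café'
    print(b'caf\xff'.decode('fUTF-8'))          # the stray 0xff becomes U+FFFD instead of an error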

You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors?

Unicode is intended to allow interaction between various bits of software. It may be that a library checked it in UTF-8, then passed it to Python. It would be nice if the library validated too, but a major advantage of UTF-8 is that older libraries (or protocols!) intended for ASCII need only be 8-bit clean to be repurposed for UTF-8. Their security checks continue to work, so long as nobody downstream introduces problems with a non-validating decoder.
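For concreteness, this is presumably the kind of scenario meant by "introduces problems with a non-validating decoder": a hypothetical 8-bit-clean path filter (not code from the thread) that keeps working on well-formed UTF-8 but is defeated when an overlong sequence reaches a decoder that does not validate:

    def byte_level_filter(raw):
        # The old ASCII-era check, applied to raw bytes before decoding.
        if raw.startswith(b'/') or b'..' in raw:
            raise ValueError('path rejected')
        return raw

    payload = b'\xc0\xaf' + b'etc/passwd'        # 0xC0 0xAF is an overlong (invalid) encoding of '/'
    filtered = byte_level_filter(payload)        # passes: no literal b'/' at the start

    try:
        filtered.decode('utf-8')                 # a strict decoder refuses the overlong form
    except UnicodeDecodeError as exc:
        print('strict decoder rejected it:', exc)
    # A non-validating decoder would produce '/etc/passwd' here, defeating the byte-level check.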

So I don't understand how this is responsive to the "decoding removes many insecurities" issue?

Yes, you might use libraries. Either they have insecurities, or not. Either they validate, or not. Either they decode, or not. They may be immune to certain attacks, because of their structure and code, or not.

So when you examine a library for potential use, you have documentation or code to help you set your expectations about what it does: whether it may have vulnerabilities, how likely those vulnerabilities are, whether you can reduce or prevent them by wrapping the API, and so on. And so you choose to use the library, or not.

This whole discussion about libraries seems somewhat irrelevant to the question at hand, although it is certainly true that understanding how a library handles Unicode is an important issue for the potential user of a library.

So how does a non-validating decoder introduce problems? I can see that it might not solve all problems, but how does it introduce them? Wouldn't the problems be introduced by something else, with the non-validating decoder merely failing to catch them rather than causing them?

And then, if you would like to address the original issue, that would be fine too.

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


