[Python-Dev] Re: Regression in unicodestr.encode()?

Tim Peters tim.one@comcast.net
Tue, 09 Apr 2002 21:13:37 -0400


[Guido]

I knew all that, but I thought I'd read about a hack to encode NUL using c0 80, specifically to get around the limitation on encoded strings containing a NUL.

Ah, that violates the "shortest encoding" rule, so is invalid UTF-8. I'm sure people have done it, though, and that many UTF-8 decoders accept it. Python's doesn't:

>>> unicode('\xc0\x80', 'utf-8')
Traceback (most recent call last):
  File "", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
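For readers on Python 3, where the unicode() built-in no longer exists, the same check can be sketched roughly as follows. This is an illustration only, assuming a current CPython's strict "utf-8" codec; the names OVERLONG_NUL and try_decode are hypothetical helpers, not anything from the original session.

```python
# Python 3 sketch of the session above; the original post used Python 2's unicode().
# OVERLONG_NUL and try_decode are illustrative names, not part of any library.

OVERLONG_NUL = b"\xc0\x80"  # two-byte form of U+0000, which is not the shortest encoding


def try_decode(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


rejected = try_decode(OVERLONG_NUL)  # strict decoder refuses the overlong form
accepted = try_decode(b"\x00")       # the one-byte form decodes to U+0000
shortest = "\x00".encode("utf-8")    # the encoder emits the one-byte form
```

Under those assumptions, rejected is None, while accepted and shortest show that the single byte 00 round-trips as U+0000.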

Believe it or not, accepting non-shortest encodings is considered to be "a security hole"(!). That's a sad story of its own ...