[Python-Dev] Re: Regression in unicodestr.encode()?

Guido van Rossum guido@python.org
Tue, 09 Apr 2002 20:50:23 -0400


[Guido van Rossum]
> Hm, but isn't there a way to encode a NUL that doesn't produce a NUL?
> In some variant?

[François]

There is also a rule about the shortest coding. It is invalid UTF-8 to use more bytes than required, and a given UCS character has a unique UTF-8 representation. Moreover, decoders should raise an exception on non-minimal UTF-8 codings, and I do not know how Python behaves with this. The Gambit author once told me he found a way to implement the test very efficiently.

One could use multi-byte sequences, that is, a sequence having no NULs, that would fool a lazy UTF-8 decoder into producing a NUL. But for this, one has to break the shortest coding rule, and start from invalid UTF-8.
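To make the point above concrete, here is a minimal sketch in present-day Python 3 (not the interpreter under discussion in this thread). The two-byte sequence c0 80 is the non-shortest form of NUL: it contains no zero byte, yet a decoder that skips the shortest-form check turns it into U+0000. The function `lazy_decode` is a hypothetical, deliberately lenient decoder written only for illustration; it handles one- and two-byte sequences and is not part of any library. `strict_rejects` is likewise just a helper for this example.

```python
def lazy_decode(data: bytes) -> str:
    """Hypothetical lenient decoder (illustration only).

    Handles 1- and 2-byte sequences and does NOT enforce the
    shortest-form rule, so overlong encodings slip through.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) sequence.
            out.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and (data[i + 1] & 0xC0) == 0x80):
            # Two-byte sequence: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
            # A strict decoder would also reject results below 0x80.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed sequence at offset %d" % i)
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """Return True if Python 3's built-in UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


overlong_nul = b"\xc0\x80"  # non-shortest form of U+0000, contains no NUL byte

shortest = "\x00".encode("utf-8")        # the unique valid encoding: one zero byte
fooled = lazy_decode(overlong_nul)       # the lenient decoder yields U+0000
rejected = strict_rejects(overlong_nul)  # the built-in strict decoder raises
```

The lenient path is only a sketch of what a "lazy" decoder might do; the built-in codec in Python 3 treats the same bytes as an error.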

I knew all that, but I thought I'd read about a hack to encode NUL using c0 80, specifically to get around the limitation on encoded strings containing a NUL. But I can't find the reference so I'll shut up.

--Guido van Rossum (home page: http://www.python.org/~guido/)