[Python-Dev] len(chr(i)) = 2?

Stephen J. Turnbull stephen at xemacs.org
Thu Nov 25 03:17:44 CET 2010


Alexander Belopolsky writes:

> Any non-trivial text processing is likely to be broken in the presence of surrogates.

If you're worried about this, write a UCS-2-producing codec that rejects surrogates or stuffs them into the private zone of the BMP. Maybe such a codec should be the default, but so far nobody seems to want one enough; they want UTF-16 even though they know it's wrong.

One of the things that makes the 16-bit code unit attractive to me is that the options for working around the variable-width nature of UTF-16 (without actually implementing conformance to UTF-16 in internal operations!) are many. If you use octets as code units, you don't have such options: you have to do it right.
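For concreteness, here's a rough sketch of the strict check such a codec might apply to decoded text (check_ucs2 is an illustrative name only, not a real codec or anything in the stdlib):

    def check_ucs2(s):
        """Reject any surrogate code unit in s (strict UCS-2)."""
        for i, ch in enumerate(s):
            if 0xD800 <= ord(ch) <= 0xDFFF:
                raise ValueError("surrogate code unit U+%04X at index %d"
                                 % (ord(ch), i))
        return s

The stuffing variant would instead translate each surrogate into the BMP private use area (U+E000..U+F8FF has plenty of room for the 2048 surrogate values) rather than raising.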

> Processing surrogate pairs in Python code is hard.

Sure, but as James Knight and MAL point out, so is processing composing characters, and those errors will go undetected in your proposals, even with a strict UCS-2 definition. What can you do? Banning composing characters isn't going to fly!
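The parallel is easy to demonstrate with nothing but standard Python; even strictly UCS-2 text can contain multi-code-point "characters":

    import unicodedata

    s1 = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
    s2 = "\u00e9"    # precomposed LATIN SMALL LETTER E WITH ACUTE
    len(s1), len(s2)                          # (2, 1)
    s1 == s2                                  # False
    unicodedata.normalize("NFC", s1) == s2    # True

Naive len() and comparison get these "wrong" in exactly the sense people complain about for surrogate pairs, and no amount of UCS-2 strictness helps.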

> Yes, allowing non-trusted users to specify a fill character is unlikely, but it is quite likely that naive slicing or iteration over string units would result in

> Traceback (most recent call last):

Naive slicing yes, but naive iteration (i.e., iteration that consumes the whole string, or up to a known character, rather than up to a specified position) is highly unlikely to result in such a traceback. It is precisely that property (non-BMP characters get passed through unchanged, or ignored) that makes extension to non-BMP code points attractive.
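To make the slicing failure concrete (this assumes a narrow build, where len(chr(0x10400)) == 2; on a wide build the slice below is the whole character and nothing breaks):

    s = "\U00010400"       # DESERET CAPITAL LETTER LONG I, non-BMP
    half = s[:1]           # narrow build: a lone high surrogate, '\ud801'
    half.encode("utf-8")
    # Traceback (most recent call last):
    #   ...
    # UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801'

Naive iteration (for ch in s: ...), by contrast, just hands the two surrogates through one at a time, and a consumer that reassembles or ignores them never sees a traceback.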

> I agree again, but I feel that exposing code units rather than code points at the Python string level takes us back to the 2.x days of mixing bytes and strings.

It does, but there's a difference. With bytes as UTF-8, only ASCII values have defined semantics in Unicode. The rest have context-dependent semantics, and they are frequent in any non-English processing and in many English use cases (math symbols, correctly oriented punctuation). With 16-bit code units, all values have well-defined semantics in Unicode, and non-characters are going to be extremely rare in the vast majority of use cases. IOW, you can think of Python as a UCS-2 device processing characters, and let surrounding UTF-16 processors deal with the errors.
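The asymmetry is easy to see at the interpreter prompt (ordinary Python 3, no assumptions beyond a non-ASCII string):

    text = "naïve"                 # one non-ASCII character
    data = text.encode("utf-8")
    list(data)                     # [110, 97, 195, 175, 118, 101]
    [ord(c) for c in text]         # [110, 97, 239, 118, 101]

The bytes 195 and 175 mean nothing on their own; they are fragments of the two-byte encoding of 'ï'. Every value in the second list, by contrast, is a complete, well-defined Unicode character.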

> Let me quote Guido circa 2001 again:

""" ... if we had wanted to use a variable-lenth internal representation, we should have picked UTF-8 way back, like Perl did. Moving to a UTF-16-based internal representation now will give us all the problems of the Perl choice without any of the benefits. """

> I don't understand what changed since 2001 that made this argument invalid.

Nothing. The internal representation of Python is UCS-2, not UTF-16. People who want to think otherwise are kidding themselves. The presence of surrogates is not sufficient to call something UTF-16. Preserving the Unicode code points through any builtin operations is a necessary condition, and Python doesn't do that. However, in my opinion, it's not a big deal to allow surrogates in UCS-2 a la ISO 10646-1:1996. That lets people who want a quick and dirty way to handle BMP text that might (but usually won't) contain some non-BMP characters go a long way fast. "Although practicality beats purity."
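A one-line illustration of that non-conformance (again assuming a narrow build; on a wide build reversed() walks whole code points and this is fine):

    s = "a\U00010400b"
    "".join(reversed(s))   # narrow build: 'b\udc00\ud801a'

reversed() operates on 16-bit code units, so it swaps the halves of the surrogate pair; the result no longer contains U+10400 at all. A conforming UTF-16 implementation would never do that.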

> I note that an opinion has been raised on this thread that if we want a compressed internal representation for strings, we should use UTF-8. I tend to agree, but UTF-8 has been repeatedly rejected as too hard to implement. What makes UTF-16 easier than UTF-8? Only the fact that you can ignore bugs longer, in my view.

That's mostly true. My guess is that we can probably ignore those bugs for as long as it takes someone to write the higher-level libraries that James suggests and MAL has actually proposed and started a PEP for.


