[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Fri Sep 14 08:02:56 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

> > This means that a way of handling such points is very useful, and as long as there's enough PUA space, the approach I suggested can handle all of these various issues.

> PUA already has a representation in UTF-8, so this is more incompatible with UTF-8 than needed,

Hm? It's not incompatible at all, and we're not interested in a representation in UTF-8, but rather in UTF-16 (i.e., the Python internal encoding). And it is needed, because these characters are, by assumption, not present in Unicode at all. (More precisely, they may be present, but the tables we happen to have don't have mappings for them.)
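To make that concrete, here is a minimal sketch, in present-day Python 3 terms, of what such a codec error handler could look like. The handler name 'pua-escape' and the base point U+F700 are arbitrary choices for illustration, not part of the actual proposal:

    import codecs

    PUA_BASE = 0xF700   # arbitrary private-use base, for illustration

    def pua_escape(exc):
        # Map each byte the codec could not decode to one code point
        # in the private use area, one character per byte.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
        raise exc

    codecs.register_error('pua-escape', pua_escape)

    s = b'caf\xe9'.decode('utf-8', 'pua-escape')   # -> 'caf\uf7e9'
    assert len(s) == 4   # still one character per byte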

> and hijacks characters

No, it doesn't. As I responded to Greg Ewing, there is an issue for things like pickling, which use Python's internal representation, but not for anything that normally communicates with Python through codecs.

> which might be used (for example, I'm using some PUA ranges for encoding my script; they are transported between processes, and I would be upset if some language mangled them into something else).

Your escaping proposal guarantees mangling because it turns characters into tuples of code units; it does not preserve character set information. It only works for you because you only have one private script you care about, so you know what those code units mean.

If we don't have character set information, then of course that's the best you can do, and my proposal will do something equivalent. But if we do have character set information, then my proposal is far more powerful. It allows us to process PUA characters as characters (i.e., put them in strings, slice and dice, merge and meld) with some hope of recovering the character's semantics after many transformations of the containing string.
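A sketch of the inverse direction, continuing from the pua_escape sketch above: an encode error handler that restores the original bytes from the private-use characters. The example uses ASCII rather than UTF-8 as the external codec, because UTF-8 would happily encode the PUA character as itself, which is exactly the compatibility point raised above:

    # PUA_BASE and 'pua-escape' as defined in the earlier sketch.
    def pua_restore(exc):
        # Inverse of pua_escape: turn private-use characters back
        # into the original bytes when encoding.
        if isinstance(exc, UnicodeEncodeError):
            chunk = exc.object[exc.start:exc.end]
            return bytes(ord(c) - PUA_BASE for c in chunk), exc.end
        raise exc

    codecs.register_error('pua-restore', pua_restore)

    s = b'caf\xe9'.decode('ascii', 'pua-escape')
    assert s[3] == '\uf7e9'                  # indexing sees one character
    assert s.encode('ascii', 'pua-restore') == b'caf\xe9'   # round trip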

In any case, it would not be hard to create an API allowing a Python program to "reserve" a block in a PUA; see the sketch below. You still have the issue of collision among multiple applications wanting the same block, of course. You may be able to guarantee that will never happen in your application, but there are examples of OSes that assigned characters in the PUA (Mac OS and Microsoft Windows both did so at one time or another, although they may not be doing so currently; I haven't checked).
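Such an API could be as simple as the following sketch. It is hypothetical (nothing like it exists in Python), and, as noted, it does nothing about collisions between separate applications, only within one process:

    # Hypothetical in-process registry for BMP private-use blocks
    # (U+E000..U+F8FF).  Purely illustrative.
    _PUA_START, _PUA_END = 0xE000, 0xF8FF
    _next_free = _PUA_START
    _blocks = {}

    def reserve_pua_block(name, size):
        """Reserve `size` consecutive PUA code points under `name`."""
        global _next_free
        if name in _blocks:
            return _blocks[name]
        if _next_free + size - 1 > _PUA_END:
            raise ValueError('private use area exhausted')
        block = range(_next_free, _next_free + size)
        _blocks[name] = block
        _next_free += size
        return block

    raw_bytes = reserve_pua_block('undecodable-bytes', 256)
    assert raw_bytes[0] == 0xE000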

> While U+0000 is also representable in UTF-8, it cannot occur in filenames, program arguments, environment variables, etc., so in many contexts it is free.

In your experience, and mine, but is that guaranteed by POSIX? If not, I'd rather not add the restriction, no matter how harmless it seems in practice. (Of course practicality beats purity, but your proposal has many other defects, too.)

I'm also very bothered by the fact that the interpretation of U+0000 differs across contexts in your proposal. As I'm sure you know, mixing codecs that assign different semantics to particular code units gets very hairy. Once a string is inside Python, you normally no longer know where it came from, yet now whether it came from a program argument, the environment, or a stdio stream changes the meaning of U+0000. For me personally, that is reason enough to object to your proposal.
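To illustrate the objection (the exact escape format below is my guess at the scheme, not a quotation of it):

    # Hypothetical rendering of the escaping scheme: an undecodable
    # byte 0xE9 in os.environ becomes the pair U+0000 U+00E9, while
    # the same two code points read from a text stream are a literal
    # NUL followed by U+00E9.  Nothing in the value records its origin.
    from_environ = '\x00\xe9'   # means: the raw byte 0xE9
    from_stream  = '\x00\xe9'   # means: NUL, then the character 'é'
    assert from_environ == from_stream   # indistinguishable in Python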

> Of course my escaping scheme can preserve \0 too, by escaping it to U+0000 U+0000, but here it's incompatible with the real UTF-8.

No. It's never compatible with UTF-8, because it assigns U+0000 a meaning different from ASCII NUL.

Your scheme also suffers from the practical problem that strings containing escapes are no longer arrays of characters. One effect of my scheme is to extend the "string is array" model to any application that doesn't need to treat more non-BMP characters than there is space available in the PUA. Once implemented, it could easily be adapted to handle characters in Planes 1-16, thus avoiding any use of surrogates in the vast majority of cases.
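A short illustration of the difference, again with the hypothetical escape format and the PUA base used above:

    escaped = 'a\x00\xe9b'   # escape scheme: 4 code units, 3 characters
    pua     = 'a\uf7e9b'     # PUA scheme:    3 code units, 3 characters

    assert len(pua) == 3          # len() counts characters
    assert pua[1] == '\uf7e9'     # indexing reaches the escaped byte
    assert len(escaped) == 4      # len() over-counts by one
    assert escaped[2] == '\xe9'   # looks like a real U+00E9, but is
                                  # actually half of an escape pair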


