[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Thu Sep 13 23:12:04 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

Of course, if the input data already contains PUA characters, there would be an ambiguity. We can rule this out for most codecs, as they don't support PUA characters. The major exception would be UTF-8,

Most codecs other than UTF-8 don't have this problem.

All Japanese codecs do. Corporate variants of JIS remain alive and well. They are not limited to Microsoft and Apple: IBM, Fujitsu/Sun, Hitachi, and NEC software also allows entry of characters not in the JIS sets.
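As a small illustration of how PUA code points can turn up legitimately in decoded Japanese text, the sketch below decodes one byte pair from the user-defined (gaiji) region of Microsoft's Shift JIS variant. It assumes CPython's built-in `cp932` and `shift_jis` codecs. The byte pair and the helper name `is_private_use` are chosen only for illustration and are not part of any proposal in this thread.

```python
def is_private_use(ch):
    """Return True if ch is in the BMP private use area or in planes 15/16."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD)

# A byte pair from the user-defined character region of cp932
# (lead bytes 0xF0..0xF9); picked purely as an example.
gaiji = b"\xf0\x40"

# The vendor variant decodes it; check where the result lands.
decoded = gaiji.decode("cp932")
print([hex(ord(c)) for c in decoded])
print("all private use:", all(is_private_use(c) for c in decoded))

# The plain JIS-based codec has no mapping for the same bytes.
try:
    gaiji.decode("shift_jis")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("shift_jis accepts it:", strict_ok)
```

If the vendor codec already hands back private use characters for valid input, an escaping scheme that reuses the same code points cannot tell the two cases apart on output.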

Unicode people are generally allergic to any non-standard variants of Unicode specifications, and feel that this is a heresy. I experimentally and optionally use U+0000 escaping, but I'm not convinced that anything like this is a good idea, and it should probably not be enabled by default.

-1

Heresy, no. That doesn't make it anything like a good idea. There are plenty of character sets, even ISO 2022 compatible ones, with undefined code points. Such code points regularly appear in text where the coded character set is either incorrectly specified or ambiguous. A way of handling them is therefore very useful, and as long as there is enough PUA space, the approach I suggested can handle all of these cases. Any application that runs out of PUA space is very special: it either demands more than two planes' worth of private space (planes 15 and 16) or demands very high efficiency (everything must fit in the BMP private space). The approach I suggest has the advantage that applications with small PUA usage (IIRC more than 4000 PUA code points are available in the BMP) will have string length == character count.

the contexts we are talking about don't allow U+0000 anyway.

zsh at least allows you to type ^V^SPC to enter an ASCII NUL character on the command line, and to assign a string containing NULs to an environment variable.
