[Python-3000] Unicode and OS strings

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Fri Sep 14 09:49:33 CEST 2007

Previous message: [Python-3000] Unicode and OS strings
Next message: [Python-3000] Unicode and OS strings
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 14-09-2007 at 15:02 +0900, Stephen J. Turnbull wrote:

> > PUA already has a representation in UTF-8, so this is more incompatible
> > with UTF-8 than needed,
>
> Hm? It's not incompatible at all, and we're not interested in a representation in UTF-8, but rather in UTF-16

PUA is representable in both. When the command line contains a UTF-8 encoding of U+E650 (a PUA character), the script had better receive a UTF-16 or UTF-32 encoding of U+E650 in the appropriate place; otherwise we are corrupting user data.
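
As a rough illustration in present-day Python (not code from this thread), the UTF-8 bytes of U+E650 decode to the same code point that the UTF-16 and UTF-32 forms carry. The variable names are illustrative only.

```python
# U+E650 is a Private Use Area code point; its UTF-8 form is three bytes.
utf8_bytes = b"\xee\x99\x90"

# Decoding gives a one-character string holding U+E650 itself.
text = utf8_bytes.decode("utf-8")
assert text == "\ue650"

# The same code point re-encoded in the other two encoding forms
# (big-endian, without a byte order mark).
utf16_bytes = text.encode("utf-16-be")
utf32_bytes = text.encode("utf-32-be")

# All three byte strings decode back to the same character.
assert utf16_bytes.decode("utf-16-be") == text
assert utf32_bytes.decode("utf-32-be") == text
```

Nothing in the encoding forms treats the PUA block differently from any other assigned range, which is the point being made above.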

> (ie, the Python internal encoding).

(Python also uses UTF-32 alternatively to UTF-16.)

> And it is needed, because these characters by assumption are not present in Unicode at all. (More precisely, they may be present, but the tables we happen to have don't have mappings for them.)

They are present! For UTF-8, UTF-16 and UTF-32 PUA is not special in any way. It's just a block of characters which will never be officially assigned by the Unicode Consortium, so they can be used privately among parties who agree about their meaning.

> Your escaping proposal guarantees mangling because it turns characters into tuples of code units; it does not preserve character set information.

Huh? What do you mean by preserving character set information?

It preserves the byte string contents, which is all that is needed. It has the same result as UTF-8 for all valid UTF-8 sequences not containing NUL.
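
A minimal sketch of such a scheme in present-day Python may make the round-trip property concrete. The exact escape format is not spelled out in this message, so the following is an assumption: each byte that is not part of a valid UTF-8 sequence becomes U+0000 followed by the character whose code point equals the byte value, and a NUL byte becomes U+0000 U+0000. The function names `decode_escaped` and `encode_escaped` are hypothetical.

```python
def decode_escaped(data: bytes) -> str:
    """Decode bytes as UTF-8, escaping undecodable bytes with U+0000.

    Assumed format: a bad byte b becomes "\\0" + chr(b); NUL becomes "\\0\\0".
    """
    parts = []
    pos = 0
    while pos < len(data):
        try:
            # The whole remainder is valid UTF-8: take it and stop.
            chunk, bad = data[pos:].decode("utf-8"), None
            pos = len(data)
        except UnicodeDecodeError as err:
            # err.start is relative to the slice data[pos:].
            cut = pos + err.start
            chunk, bad = data[pos:cut].decode("utf-8"), data[cut]
            pos = cut + 1
        # A genuine NUL byte is escaped as U+0000 U+0000.
        parts.append(chunk.replace("\0", "\0\0"))
        if bad is not None:
            parts.append("\0" + chr(bad))
    return "".join(parts)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped: turn escapes back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\0":
            # U+0000 introduces an escape; the next character holds the byte.
            if i + 1 >= len(text) or ord(text[i + 1]) > 0xFF:
                raise ValueError("malformed escape after U+0000")
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)
```

Under these assumptions, valid UTF-8 input without NUL decodes exactly as ordinary UTF-8 would, and `encode_escaped(decode_escaped(b))` returns `b` for any byte string `b`.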

> > While U+0000 is also representable in UTF-8, it cannot occur in
> > filenames, program arguments, environment variables etc., in many
> > contexts it was free.
>
> In your experience, and mine, but is it in POSIX?

Yes. Both as specified and in reality (e.g. POSIX offers the second parameter of main(), of type char **, as the only way to receive command line arguments, and they are NUL-terminated).
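
The same restriction is visible from present-day Python, which refuses to pass a path containing NUL to the operating system. The sketch below assumes a recent CPython, where `os.stat` raises `ValueError` for an embedded NUL; the helper name `rejects_nul` is hypothetical.

```python
import os


def rejects_nul(path: str) -> bool:
    """Return True if the path is refused because it contains NUL."""
    try:
        os.stat(path)
    except ValueError:
        # Assumed behaviour of recent CPython: an embedded NUL cannot be
        # represented in a NUL-terminated C string, so the call is refused.
        return True
    except OSError:
        # Ordinary failures such as a missing file are a different matter.
        return False
    return False
```

For example, `rejects_nul("a\0b")` is expected to be true, while `rejects_nul(".")` is false.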

> I'm also very bothered by the fact that the interpretation of U+0000 differs in different contexts in your proposal.

Well, in any scheme which modifies UTF-8 so that it accepts arbitrary byte strings, something must be interpreted differently than in real UTF-8.

> Once you get a string into Python, you normally no longer know where it came from, but now whether something came from the program argument or environment or from a stdio stream changes the semantics of U+0000. For me personally, that's a very good reason to object to your proposal.

This can be said about any modification of UTF-8.

Of course you can use such an encoding on a standard stream too. In this case only U+0000 cannot be used normally, and the resulting stream will contain whatever bytes were present in filenames and other strings being output to it.

> > Of course my escaping scheme can preserve \0 too, by escaping it to
> > U+0000 U+0000, but here it's incompatible with the real UTF-8.
>
> No. It's never compatible with UTF-8 because it assigns a different meaning to U+0000 from ASCII NUL.

It is compatible with UTF-8 except for U+0000, and a true U+0000 cannot occur anyway in these contexts, so this incompatibility is mostly harmless.

> Your scheme also suffers from the practical problem that strings containing escapes are no longer arrays of characters.

They are no less arrays of characters than strings containing combining marks.
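
The comparison with combining marks can be checked in present-day Python: a string holding a base letter plus a combining accent already has more code points than user-perceived characters. This is only an illustration using the standard `unicodedata` module.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one glyph.
decomposed = "e\u0301"
assert len(decomposed) == 2

# The second code point is a combining mark (non-zero combining class).
assert unicodedata.combining(decomposed[1]) != 0

# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9"
assert len(composed) == 1
```

Indexing or slicing `decomposed` can split the accent from its base letter, much as slicing could split an escape pair.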

[And now I'm gone for 4 days.]

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
