[Python-3000] Unicode and OS strings

Stephen J. Turnbull stephen at xemacs.org
Sat Sep 15 02:13:31 CEST 2007


"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> writes:

> > And it is needed, because these characters by assumption are not present in Unicode at all. (More precisely, they may be present, but the tables we happen to have don't have mappings for them.)

> They are present! For UTF-8, UTF-16, and UTF-32, the PUA is not special in any way.

The characters I am referring to are the unstandardized so-called "corporate characters" that are very common in Japanese text. My solution handles your problem, slightly less efficiently than yours does, perhaps, but in a Unicode-conforming way. Yours doesn't help with mine at all.
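Concretely, here is a minimal sketch of the kind of PUA fallback I mean, written against Python's codec error-handler hook; the handler name and the PUA base are illustrative choices of mine, not anything a standard blesses. (As I understand it, Microsoft's cp932 already maps the Shift JIS user-defined area into the PUA in essentially this spirit.)

    import codecs

    PUA_BASE = 0xE000   # illustrative: low end of the BMP Private Use Area

    def pua_escape(exc):
        # On a decode error, map each offending byte to a PUA code point
        # instead of raising, so no input byte is ever lost.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
        raise exc

    codecs.register_error('pua-escape', pua_escape)

    # 0xF040 lies in the Shift JIS user-defined ("corporate") area, for
    # which the strict shift_jis codec has no mapping.
    raw = b'\x93\xfa\x96\x7b\xf0\x40'   # "Nihon" plus one gaiji code
    text = raw.decode('shift_jis', 'pua-escape')
    # the unmapped bytes survive as PUA code points (exactly which ones
    # depends on how the codec reports the error span)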

> It preserves the byte string contents, which is all that is needed.

That is not true in any environment where the encoding is not known with certainty.

> It has the same result as UTF-8 for all valid UTF-8 sequences not containing NUL.

Sorry, I'm talking about real Japanese and other situations where there is no corresponding Unicode code point, and about a solution which not only handles that but also handles corrupt UTF-8. Valid UTF-8 is not a problem; it's the solution. But a robust language should handle text that is not valid UTF-8 in a way that allows the programmer or user to implement error correction at a finer-grained level than dumping core.
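With that same illustrative handler registered, corrupt UTF-8 decodes without losing a thing, and the damage stays visible and repairable:

    data = b'good \xff\xfe bad'                  # not valid UTF-8
    text = data.decode('utf-8', 'pua-escape')
    # text == 'good \ue0ff\ue0fe bad'; the broken bytes are still there,
    # distinguishable, and correctable at whatever granularity the
    # programmer chooses, instead of an unavoidable UnicodeDecodeError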

> > I'm also very bothered by the fact that the interpretation of U+0000 differs in different contexts in your proposal.

> Well, if any scheme which attempts to modify UTF-8 by accepting arbitrary byte strings is used, something must be interpreted differently than in real UTF-8.

Wrong. In my scheme everything ends up in the PUA, on which real UTF-8 imposes no interpretation by definition.

I haven't gone back to check yet, but it's possible that a "real UTF-8 conforming process" is required to stop processing and issue an error or something like that in the cases we're trying to handle. But your extension and James Knight's extension both fall afoul of any such provision, too.

> > Once you get a string into Python, you normally no longer know where it came from, but now whether something came from the program arguments or the environment or from a stdio stream changes the semantics of U+0000. For me personally, that's a very good reason to object to your proposal.

> This can be said about any modification of UTF-8.

It's not true of James Knight's proposal, because the same modification can be used for both program arguments and file streams.

And my proposal doesn't modify UTF-8 at all; it takes advantage of the farsighted wisdom of the designers of Unicode and puts all the non-standard "characters", including broken encoding, in the PUA.
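A sketch of the inverse direction, under the same assumptions as before; genuine PUA characters already present in the input are the one known ambiguity:

    def pua_unescape(text):
        # Fold the PUA escape code points back to the raw bytes they
        # stood for; everything else encodes as ordinary UTF-8.
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
                out.append(cp - PUA_BASE)
            else:
                out.extend(ch.encode('utf-8'))
        return bytes(out)

    assert pua_unescape('good \ue0ff\ue0fe bad') == b'good \xff\xfe bad'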

> Of course you can use such an encoding on a standard stream too. In this case only U+0000 cannot be used normally, and the resulting stream will contain whatever bytes were present in filenames and other strings being output to it.

A programmer can use it, but his users will curse his name every time a binary stream gets corrupted because they forgot that little detail.

> > > Of course my escaping scheme can preserve \0 too, by escaping it to U+0000 U+0000, but here it's incompatible with the real UTF-8.

> > No. It's never compatible with UTF-8, because it assigns U+0000 a different meaning from ASCII NUL.

> It is compatible with UTF-8 except for U+0000, and a true U+0000 cannot occur anyway in these contexts, so this incompatibility is mostly harmless.

Forcing users to use codecs of subtly different semantics simply because they're getting I/O from different sources is a substantial harm.
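To make the trade-off concrete, here is a minimal sketch of my reading of such a scheme; the exact escape mapping is my assumption (an undecodable byte b becomes U+0000 followed by chr(b), and a literal NUL becomes U+0000 U+0000), not necessarily what Marcin implements:

    def nul_escape_decode(data):
        out = []
        i = 0
        while i < len(data):
            if data[i] == 0:
                out.append('\x00\x00')          # escaped literal NUL
                i += 1
                continue
            # the shortest decodable prefix is exactly one UTF-8 sequence
            for j in (1, 2, 3, 4):
                chunk = data[i:i + j]
                try:
                    out.append(chunk.decode('utf-8'))
                    i += len(chunk)
                    break
                except UnicodeDecodeError:
                    pass
            else:
                out.append('\x00' + chr(data[i]))   # escaped raw byte
                i += 1
        return ''.join(out)

    assert nul_escape_decode(b'ok\x00\xff') == 'ok\x00\x00\x00\xff'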

> > Your scheme also suffers from the practical problem that strings containing escapes are no longer arrays of characters.

> They are no less arrays of characters than strings containing combining marks.

Those marks are characters in their own right. Your escapes are not, nor are surrogates.
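The practical consequence is easy to demonstrate with the sketch above: indexing can land in the middle of an escape:

    s = nul_escape_decode(b'a\xffb')    # 'a\x00\xffb'
    len(s)    # 4, for what the user sees as three characters of input
    s[2]      # '\xff' -- the tail half of an escape, not a character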

It's true that users will be surprised by the count of characters in many cases with unnormalized Unicode, but these can be reduced to a very few by normalizing to NFC.
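For example, with the stdlib's unicodedata module:

    import unicodedata

    decomposed = 'e\u0301'    # 'e' + COMBINING ACUTE ACCENT, len() == 2
    composed = unicodedata.normalize('NFC', decomposed)
    assert composed == '\xe9' and len(composed) == 1   # one code point, U+00E9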


