[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

"Martin v. Löwis" martin at v.loewis.de
Sat Apr 25 17:00:17 CEST 2009

Previous message: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Next message: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> OK, looks like my analysis matches yours, except that I wasn't sure if the third case (a string that "likely wasn't intended") could result in exceptions. From what you're saying, it sounds like it would actually be similar to the second case - I'm not clear on how surrogates work, though.

On decoding, there is a guarantee that it decodes successfully. There is also a guarantee that the result will re-encode successfully, and yield the same byte string.
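A minimal sketch of these two guarantees, assuming Python 3.1 or later, where the handler called python-escape in this thread is available under the name `surrogateescape`. The byte string is made up for illustration:

```python
# Hypothetical file name bytes: 0xFF can never occur in valid UTF-8.
raw = b"caf\xff.txt"

# Guarantee 1: decoding does not fail. The undecodable byte is
# turned into a lone low surrogate instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")

# Guarantee 2: re-encoding with the same handler gives back
# exactly the bytes we started with.
roundtrip = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))        # show the escaped code point in ASCII form
print(roundtrip == raw)
```

The round trip only holds when the same encoding and the same error handler are used in both directions.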

If you pass a different string into encoding, you still may get exceptions. For example, if the filesystem encoding is latin-1, passing u"\u20ac" will continue to raise exceptions, even under the python-escape error handler - that error handler will only handle surrogates.
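As a rough illustration in Python 3 syntax (again assuming the handler's eventual name `surrogateescape`), the euro sign still fails under latin-1, while a lone surrogate in the escape range is turned back into a byte:

```python
euro = "\u20ac"  # EURO SIGN, which latin-1 cannot represent

# The handler does not help with ordinary unencodable characters.
try:
    euro.encode("latin-1", errors="surrogateescape")
    euro_raised = False
except UnicodeEncodeError:
    euro_raised = True

# A lone low surrogate in the escape range is handled:
# it is mapped back to the single byte it stands for.
escaped_byte = "\udce9".encode("latin-1", errors="surrogateescape")

print(euro_raised)
print(escaped_byte)
```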

There isn't really that much trickery to surrogates. They have to come in pairs to be meaningful, with the first one in the range D800..DBFF (high surrogate), and the second in the range DC00..DFFF (low surrogate). Having a lone low surrogate is not meaningful; this is how the escaping works: each non-decodable byte is represented by a lone low surrogate.
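The ranges can be checked with a short sketch. The helper names `is_high_surrogate` and `is_low_surrogate` are made up for this example. It assumes the handler as it ended up in Python 3.1 (`surrogateescape`), which maps an undecodable byte 0xNN to the code point U+DCNN:

```python
def is_high_surrogate(cp):
    """True if cp lies in D800..DBFF (first half of a pair)."""
    return 0xD800 <= cp <= 0xDBFF


def is_low_surrogate(cp):
    """True if cp lies in DC00..DFFF (second half of a pair)."""
    return 0xDC00 <= cp <= 0xDFFF


# Byte 0x80 is not valid ASCII, so the handler escapes it.
escaped = b"\x80".decode("ascii", errors="surrogateescape")
cp = ord(escaped)

print(hex(cp))
print(is_low_surrogate(cp), is_high_surrogate(cp))
```

Because the escaped code point is a low surrogate with no high surrogate in front of it, it cannot be confused with part of a properly decoded character.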

Proper surrogate pairs encode characters outside the BMP, for use with UTF-16: each code contributes 10 bits (just count how many codes there are in D800..DBFF); together, a pair encodes 20 bits, allowing for 2**20 characters, starting at U+10000.
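The arithmetic can be written out as a small sketch. The function names `combine` and `split` are chosen for this example only. The high surrogate (D800..DBFF) carries the upper 10 bits, the low surrogate (DC00..DFFF) the lower 10 bits, and the result is offset by 0x10000:

```python
def combine(high, low):
    """Combine a high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split(cp):
    """Split a code point outside the BMP into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000          # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(combine(0xD83D, 0xDE00)))
print([hex(c) for c in split(0x1F600)])
```

The largest pair, DBFF followed by DFFF, gives 0x10000 + 2**20 - 1 = U+10FFFF, the last code point Unicode defines.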

>> When they find that the files they created are inaccessible to others, they will often stop using funny characters.
>
> Which sounds fairly practical - and the irony of someone with a "funny character" in his surname telling me this hasn't escaped me :-)

Sure: my Unix account name was always "loewis", and even on Windows, our admins didn't dare to put the umlaut into the account name - it would be difficult to log in with a US keyboard, for example. People who use non-ASCII characters in filenames around here are primarily non-IT people who aren't aware that these characters are different from the rest.

I recognize that for other languages (without trivial transliterations) the problem is more severe, and people are more likely to create files with Cyrillic or Japanese names (say) if the system accepts them at all.

Regards, Martin
