[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 22:34:21 CEST 2009
On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico:
> 2009/4/28 Glenn Linderman <v+python at g.nevcal.com>:
>> The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names.
>
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'.
Wrong.
An 8859-1 locale allows any byte sequence to be placed into a POSIX filename.
And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF.
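The encoder behaviour in question can be checked with a few lines. This is only a sketch, and it assumes a later CPython 3 release than the one tested above: on such releases the strict UTF-8 encoder rejects lone surrogates, and the lenient three-byte output is available only through the `surrogatepass` error handler. The variable names are illustrative.

```python
# Assumption: CPython 3.1 or later, where "surrogatepass" exists and the
# strict UTF-8 encoder refuses lone surrogates.
surrogate = "\udcff"  # a lone low surrogate, legal in a Python str

try:
    surrogate.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The lenient path produces the three-byte sequence discussed in the thread.
lenient = surrogate.encode("utf-8", "surrogatepass")

print("strict encoder accepted it:", strict_ok)
print("lenient bytes:", lenient)
```

If the strict call succeeds on a given interpreter, that interpreter has the lenient encoder described in this message.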
> Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (not '\udcff').
>
> Martin: maybe the PEP should say this explicitly?
>
> Note that the round-trip works without ambiguities between '\udcff' in the filename:
>
> b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'
>
> and b'\xff' in the filename, decoded by Python to '\udcff':
>
> b'\xff' -> '\udcff' -> b'\xff'
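The two round trips proposed here can be written out as a small sketch. It assumes an interpreter that ships the `surrogateescape` error handler (the name the handler eventually received in CPython 3.1 and later) and a UTF-8 decoder that treats encoded surrogates as invalid; the helper name `round_trip` is made up for illustration.

```python
def round_trip(raw: bytes) -> tuple[str, bytes]:
    """Decode file-name bytes the way the proposal describes, then encode back."""
    text = raw.decode("utf-8", "surrogateescape")
    back = text.encode("utf-8", "surrogateescape")
    return text, back

# File name containing the UTF-8-style encoding of U+DCFF.
text_a, back_a = round_trip(b"\xed\xb3\xbf")

# File name containing the single undecodable byte 0xFF.
text_b, back_b = round_trip(b"\xff")

print(ascii(text_a), back_a)
print(ascii(text_b), back_b)
```

Under those assumptions the two file names decode to different strings, so the mapping stays unambiguous in both directions.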
Others have made this suggestion, and it is helpful to the PEP, but not sufficient. Because the scheme is implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the handler at all if the UTF-8 decoder is happy with it. Which, in my testing, it is.
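Whether the decoder ever calls an error handler for this sequence can be tested directly by registering a handler that records what it is given. This is a sketch only: the handler name `record_undecodable` is hypothetical, and the outcome depends on how strict the interpreter's UTF-8 decoder is.

```python
import codecs

calls = []  # byte slices the decoder reported as undecodable

def recording_handler(exc):
    # Record the offending bytes, substitute U+FFFD, and resume after them.
    calls.append(exc.object[exc.start:exc.end])
    return ("\ufffd", exc.end)

# Hypothetical handler name, registered only for this experiment.
codecs.register_error("record_undecodable", recording_handler)

decoded = b"\xed\xb3\xbf".decode("utf-8", "record_undecodable")
handler_fired = len(calls) > 0

print("handler fired:", handler_fired)
print("bytes reported:", calls)
```

An empty `calls` list means the decoder accepted the sequence and the handler never ran, which is the case this message is worried about. A non-empty list means the decoder is strict about encoded surrogates.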
-- Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking