[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight foom at fuhm.net
Tue Apr 28 07:19:22 CEST 2009
On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote:
> No. You seem to assume that all bytes < 128 decode successfully always.
> I believe this assumption is wrong, in general:
>
> py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'iso2022jp' codec can't decode bytes in position 3-4: illegal multibyte sequence
>
> All bytes are below 128, yet it fails to decode.
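The quoted session uses Python 2.x syntax. A rough Python 3 equivalent is sketched below for readers who want to rerun it; `try_decode` is a hypothetical helper introduced only for this illustration, and the byte string is the one from the quoted example.

```python
def try_decode(data: bytes, encoding: str):
    """Return (True, text) on success, or (False, error message) on failure."""
    try:
        return True, data.decode(encoding)
    except UnicodeDecodeError as exc:
        return False, str(exc)


# Same bytes as in the quoted session, written as a Python 3 bytes literal.
data = b"\x1b$B' \x1b(B"

# Every byte is in the 7-bit range...
all_below_128 = all(b < 0x80 for b in data)

# ...yet the codec is expected to reject the sequence.
ok, detail = try_decode(data, "iso-2022-jp")
print(all_below_128, ok, detail)
```

The point being illustrated is only that "all bytes below 128" does not imply "decodable" for a stateful encoding such as ISO-2022-JP.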
Surely nobody uses iso2022 as an LC_CTYPE encoding. That's expressly
forbidden by POSIX, if I'm not mistaken... and I can't see how it would
work, considering that it uses all the bytes from 0x20-0x7f, including
0x2f ("/"), to represent non-ASCII characters.
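To make the 0x2f concern concrete, here is a small sketch (not part of the original message) that encodes a single hiragana character with Python's `iso-2022-jp` codec and checks whether the byte for "/" shows up inside the encoded form. The choice of U+304F is only an example.

```python
# U+304F HIRAGANA LETTER KU, chosen only as an illustration.
text = "\u304f"
encoded = text.encode("iso-2022-jp")

# The encoded form consists solely of 7-bit bytes...
all_below_128 = all(b < 0x80 for b in encoded)

# ...and one of them is 0x2F, the same byte value as the path separator "/".
has_slash_byte = 0x2F in encoded

print(encoded, all_below_128, has_slash_byte)
```

A program that splits such a byte string on "/" would cut a multibyte character in half, which is why an encoding like this is a poor fit for file names.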
Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX...
I'm a bit scared at the prospect that U+DC2F could turn into "/"; that
just screams security vulnerability to me. So I'd like to propose
that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be
encoded/decoded via the error handler.
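For illustration, the restriction proposed above matches how the `surrogateescape` error handler behaves in Python 3 as later released: bytes 0x80-0xFF round-trip through U+DC80-U+DCFF, while a lone surrogate below U+DC80 is refused on encoding. The sample byte string below is made up for this sketch.

```python
# Hypothetical file name containing a byte (0xE9) that is not valid UTF-8 here.
raw = b"caf\xe9/menu"

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
decoded = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = decoded.encode("utf-8", "surrogateescape")

# A surrogate outside U+DC80-U+DCFF, such as U+DC2F, is not turned into
# the ASCII byte 0x2F ("/"); the encoder raises instead.
try:
    "\udc2f".encode("utf-8", "surrogateescape")
    low_surrogate_rejected = False
except UnicodeEncodeError:
    low_surrogate_rejected = True

print(repr(decoded), round_tripped == raw, low_surrogate_rejected)
```

Because U+DC2F is rejected rather than mapped to 0x2F, a decoded string cannot smuggle a "/" past a check that was performed on the text form.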
James