[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman v+python at g.nevcal.com
Tue Apr 28 21:07:54 CEST 2009
- Previous message: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
- Next message: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight:
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:
James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX...
Can you please point to the part of the POSIX spec that says that such overlapping is forbidden?

I can't find it... I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system.
It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents).
A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent.
From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code."
Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
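The relationship described in the Wikipedia quote above can be checked with a short sketch. It assumes Python 3 and its bundled `iso2022_jp` and `euc_jp` codecs; the helper name `set_high_bit` is made up for this illustration. It also shows why ISO-2022 is awkward as a locale encoding: inside the shifted state, the payload bytes are plain 7-bit values that collide with ASCII.

```python
# Sketch only: derive the EUC-JP bytes of one character from its
# ISO-2022-JP form by setting the high bit of each 7-bit payload byte.

def set_high_bit(seven_bit: bytes) -> bytes:
    """Hypothetical helper: add 128 to every byte (set bit 7)."""
    return bytes(b | 0x80 for b in seven_bit)

text = "\u3042"  # HIRAGANA LETTER A

iso = text.encode("iso2022_jp")
# The encoder wraps the payload in escape sequences:
# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
assert iso.startswith(b"\x1b$B") and iso.endswith(b"\x1b(B")
payload = iso[3:-3]

# Every payload byte is below 0x80, so it overlaps the ASCII range.
payload_is_seven_bit = all(b < 0x80 for b in payload)

euc_from_iso = set_high_bit(payload)
euc_direct = text.encode("euc_jp")
print(payload, euc_from_iso, euc_direct)
```

In EUC-JP the same two bytes have the high bit set, so a byte below 0x80 can only ever be an ASCII character, with no shift state to track.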
I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler.

It would be actually U+DC2F that would turn into /.

Yes, I meant to say DC2F, sorry for the confusion.

I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII.

I think it has to be excluded from mapping in order to not introduce security issues. However...

There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290

So, I'd like to propose this:

The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them.

This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".

The character sets this doesn't work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix),
Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify.
Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to:
If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified.
iso2022-* (covered above), and shift_jisx0213 (because it has replaced \ with yen, and ~ with overline).
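To make the "overlaps in trailing bytes" remark about Shift-JIS concrete, here is a small check, assuming Python 3 and its bundled `shift_jis` codec. It only inspects encoded bytes; the variable names are illustrative.

```python
# Sketch only: a Shift-JIS trail byte can fall inside the ASCII range.
katakana_so = "\u30bd"  # KATAKANA LETTER SO

encoded = katakana_so.encode("shift_jis")
lead, trail = encoded[0], encoded[1]

# The lead byte is above 0x7F, but the trail byte is 0x5C, the same
# value ASCII uses for the backslash.
trail_is_ascii_range = trail < 0x80
print(hex(lead), hex(trail))
```

Code that scans such a byte string for 0x5C without tracking lead bytes will see a backslash that is not there, which is the kind of overlap discussed in this thread.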
If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately.
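The first form of the proposal quoted above can be sketched as a custom codec error handler. This is an illustration under stated assumptions, not the handler from the PEP: it is written for Python 3 (the thread itself uses Python 2 syntax), the name "python-escape-sketch" and the helper `can_encode` are invented here, and UTF-8 is used as the example codec. The handler that PEP 383 eventually shipped is called `surrogateescape`.

```python
import codecs

# Hypothetical handler name, registered only for this sketch.
HANDLER = "python-escape-sketch"

def python_escape_sketch(exc):
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # 0x80-0xFF -> U+DC80-U+DCFF; 0x00-0x7F -> U+0000-U+007F.
        out = "".join(chr(0xDC00 + b) if b >= 0x80 else chr(b) for b in bad)
        return out, exc.end
    if isinstance(exc, UnicodeEncodeError):
        chars = exc.object[exc.start:exc.end]
        # Only U+DC80-U+DCFF map back to bytes. In particular U+DC2F is
        # never turned into "/"; it stays an encoding error.
        if not all(0xDC80 <= ord(c) <= 0xDCFF for c in chars):
            raise exc
        return bytes(ord(c) - 0xDC00 for c in chars), exc.end
    raise exc

codecs.register_error(HANDLER, python_escape_sketch)

def can_encode(s: str) -> bool:
    """Hypothetical helper: True if s encodes to UTF-8 with the sketch handler."""
    try:
        s.encode("utf-8", HANDLER)
        return True
    except UnicodeEncodeError:
        return False

raw = b"foo\x81\xfd"                       # two bytes that are invalid UTF-8
text = raw.decode("utf-8", HANDLER)
round_trip = text.encode("utf-8", HANDLER)
print(ascii(text), round_trip, can_encode("\udc2f"))
```

Restricting the encode side to U+DC80-U+DCFF is what addresses the security concern raised earlier in the thread: a string that contains no "/" cannot produce a byte string that does.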
-- Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking