[Python-Dev] PEP 383 update: utf8b is now the error handler
Stephen J. Turnbull stephen at xemacs.org
Tue May 5 19:31:28 CEST 2009
MRAB writes:

 > > I don't think "people shouldn't be using non-ASCII-compatible encodings for locale encodings" is a sufficient rationale for a hard error here. I mean, of course they should be using UTF-8. Maybe Python 3.1 should just go ahead and error on any other encoding on POSIX platforms?

 > I don't see why the error handler couldn't in principle be used with encodings other than UTF-8, although in that case all of the low surrogates should be open to use.
I should have been more clear here, I guess. The error handler can, and in the PEP will be by default, used with all "sane" locale encodings on POSIX. It occurs to me that the PEP maybe should say that it is an error to have your POSIX locale set to UTF-16 or something like that.
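If the PEP did say that, the check could be a few lines at startup. A minimal sketch, assuming a hard-coded blacklist and a RuntimeError (both my choices, not anything in the PEP):

    import codecs
    import locale

    # Canonical name of the locale's preferred encoding.
    enc = codecs.lookup(locale.getpreferredencoding()).name

    # Wide-character encodings embed NUL bytes, so they can never be a
    # POSIX filesystem encoding; refuse to run under such a locale.
    if enc in ('utf-16', 'utf-16-le', 'utf-16-be',
               'utf-32', 'utf-32-le', 'utf-32-be'):
        raise RuntimeError('%s is not a usable POSIX locale encoding' % enc)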
What "sane" means in this context is
ASCII NUL is the bytearray terminator, and can't be used as a byte in a file name. This rules out UTF-16, UTF-32, and widechar EUC encodings, as well as some very rare ones.
An ASCII character always translates to the Unicode character with the same code (ie, "to itself"). It is not a part of other sequences (control sequences, or a trailing byte). This rules out EBCDIC, ISO-2022-, Shift JIS, and Big5, among the encodings I'm familiar with. EBCDIC because only by accident will an EBCDIC character map to the same ASCII character with the same code. The ISO-2022- encodings are out because ASCII characters are used in escape sequences. Shift JIS and Big5 because in those encodings, a high-bit-set octet signals the start of a multibyte sequence, and some of the trailing bytes may be in the ASCII range.
What's left? Well, UTF-8, all of the ISO-8859 sets, several national standards (such as the KOI8 family for Cyrillic), IBM and Microsoft "code pages", and the "packed" EUC encodings used for Japanese, Chinese, and Korean. These all share the property that ASCII is ASCII, and all non-ASCII characters are encoded using only high-bit-set octets. In practice, these are invariably what you encounter on Unix.
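For concreteness, here is a rough sketch of how one might probe a codec for those two properties from Python. The function name and the probing strategy are mine, and it is a heuristic, not a proof:

    def ascii_compatible(encoding):
        # Property 1: every ASCII byte, NUL included, decodes to the
        # character with the same code.  (Wide encodings like UTF-16
        # fail here, since a single byte is a truncated code unit.)
        for i in range(128):
            try:
                if bytes([i]).decode(encoding) != chr(i):
                    return False
            except UnicodeDecodeError:
                return False
        # Property 2: an ASCII byte must never be swallowed as the
        # trailing byte of a multibyte sequence.  Probe each high byte
        # followed by 'A': a strict decode failure is fine, but if the
        # 'A' silently disappears into a multibyte character, the
        # encoding is "insane" in the sense above.
        for hi in range(128, 256):
            try:
                if 'A' not in (bytes([hi]) + b'A').decode(encoding):
                    return False
            except UnicodeDecodeError:
                pass
        return True

With this, 'utf-8', 'latin-1', 'koi8-r', and 'euc-jp' come out sane, while 'utf-16', 'iso2022_jp', 'shift_jis', and 'big5' do not.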
So what's the problem? Backward compatibility for Microsoft OSes, which not only used to use MBCS national character sets, but "cleverly" packed more characters into the encoding by using ASCII codes as trailing bytes. I.e., the aforementioned "insane" Shift JIS (which is mandated by the leading Japanese cellphone service provider even today) and Big5 (the leading encoding for Chinese until very recently). These are very commonly found on archival media, and even on USB keys and so on, which tend to be FAT-formatted. This doesn't prevent usage of the Unicode APIs, but up to Windows 2000 most Japanese vendors' OEM versions of Windows used the FAT format and Shift JIS as the file system encoding, and I know of Japanese offices where Windows 98 systems were in use as recently as early 2007.
It's the removable media which are the problem, because on Windows you just use the Unicode APIs. But they're not available on Unix, so you need the byte-oriented APIs.
Is this a real problem? I don't know. I don't do Windows, I don't do computing with my cellphone, and I don't need to get Japanese filenames (which might be mixed with Russian ones!!) off of ancient media or CIFS fileshares using Shift JIS. I guess it's possible that cellphones do everything in Shift JIS except adding filenames to directories, where the filenames are in UTF-16.
OTOH, it seems to me that an optional extension of the error handler to escape ASCII bytes as well is technically feasible and would be nearly trivial to add to the PEP. The biggest cost would be adding the errors argument to various functions (as Zooko requested) so that surrogate-replace-extended could be specified if needed.
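To illustrate the mechanism only: such a handler could be registered through the existing codecs.register_error machinery. The behavior below is my guess at what "extended" would mean, not anything the PEP specifies, and whether a codec reports an ASCII trailing byte as part of the invalid span is codec-dependent:

    import codecs

    def _surrogate_replace_extended(exc):
        # Escape every byte of the undecodable sequence, ASCII bytes
        # included, as the low surrogate U+DC00 + byte.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(0xDC00 + b) for b in bad), exc.end
        raise exc

    codecs.register_error('surrogate-replace-extended',
                          _surrogate_replace_extended)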
Footnotes:

[1] Unicode 5.0 uses the terms "high-half" and "low-half" at least once, in section 16.6, but the context is such that I take it to refer to "half of the surrogate area". Section 3.8 doesn't use these, instead noting that "leading" and "trailing" are sometimes used instead of "high" and "low". Better to avoid the word "half" in PEP 383, I think.
"Leading" and "trailing" simply state the order, not the set ("high" or "low"), so are not good terms to use.
But it's the order that's important. If you've just finished reading a character and encounter a trailing surrogate, then it was produced by the 'utf8b' error handler; nothing else in a Python codec can do that. If you've just finished reading a character, are in a UTF-16 Python, and encounter a leading surrogate, then you immediately gobble the following code unit, which must be a trailing surrogate, and combine them to produce a character. The remaining case is that you encounter a valid character. Anything else is an error, and (assuming no bugs) no Python codec will produce anything else.
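That scanning logic, written out as a sketch (the function is illustrative; the U+DC00 + byte arithmetic is the PEP's mapping, recovering the escaped byte from a lone trailing surrogate):

    def classify(s):
        # Assumes a UTF-16 ("narrow") Python, where a string is a
        # sequence of code units; on a UCS-4 build the pair branch
        # never fires for well-formed text.
        i = 0
        while i < len(s):
            cp = ord(s[i])
            if 0xD800 <= cp <= 0xDBFF:      # leading surrogate
                if i + 1 >= len(s) or not (0xDC00 <= ord(s[i + 1]) <= 0xDFFF):
                    raise ValueError('lone leading surrogate at %d' % i)
                lo = ord(s[i + 1])          # gobble the trailing surrogate
                yield ('char', chr(0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
            elif 0xDC00 <= cp <= 0xDFFF:    # lone trailing surrogate:
                # only the error handler produces these
                yield ('escaped byte', cp - 0xDC00)
                i += 1
            else:                           # a valid character
                yield ('char', s[i])
                i += 1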
This does imply that programs that take advantage of the error handler specified in this PEP are on their own if they accept data from any sources that are not known to be Unicode-conforming. OTOH, as far as I can see, if the other sources are known to be Unicode-conforming, it's reasonably (but not perfectly) safe to combine them with strings from this PEP (and of course to use either 'utf8b' or 'strict', as appropriate, when passing data out of Python).
 > Should there be a function or method to check for conformance and lone surrogates?
string.encode('utf-8', errors='strict') will do for now.
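Wrapped up as the sort of function asked about (trivially; the name is mine):

    def is_conformant(s):
        # A strict UTF-8 encode rejects lone surrogates, so success
        # means the string contains none.
        try:
            s.encode('utf-8', errors='strict')
            return True
        except UnicodeEncodeError:
            return False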