[Python-Dev] PEP 383 update: utf8b is now the error handler

Stephen J. Turnbull stephen at xemacs.org
Tue May 5 16:57:36 CEST 2009


"Martin v. Löwis" writes:

> I've updated the PEP accordingly.

I have three substantive comments. First, although the consequences for Python 3 byte interfaces (i.e., "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Either way, I think you should clarify that point.

Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b". (Elsewhere I've suggested others, but I think this is the best of the bunch.)
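For concreteness, here is a minimal sketch of the behavior being named. It assumes Python 3.1 or later, where the handler shipped under the name 'surrogateescape'; the 'utf8b' spelling in this draft is not a registered name in released Python. The file name is made up for illustration.

```python
# Assumption: 'surrogateescape' is the released name of the handler
# that this thread calls 'utf8b'.
raw = b"caf\xe9.txt"  # hypothetical Latin-1 file name, not valid UTF-8

# Decoding maps the undecodable byte 0xE9 to the lone low surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original byte.
roundtrip = text.encode("utf-8", "surrogateescape")

# A strict encoder refuses the lone surrogate.
try:
    text.encode("utf-8", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Whatever the handler ends up being called, the property that matters is that decoding followed by encoding gives back the same bytes.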

Third, it is not clear to me why non-decodable ASCII should be an error. There are plenty of low surrogates for the purpose. Is there another technical reason? Stupid or not, Shift-JIS- and Big5-encoded file systems are quite common in Asia still (including non-rewritable media). I think surrogate-replacement of ASCII should at least be an option.

I don't think "people shouldn't be using non-ASCII-compatible encodings for locale encodings" is a sufficient rationale for a hard error here. I mean, of course they should be using UTF-8. Maybe Python 3.1 should just go ahead and error on any other encoding on POSIX platforms?

I have a number of nitpicking comments and technical clarifications on the PEP. Rationale is in footnotes. There were also a few typos I noticed.

  1. There is no such thing as a "half-surrogate" in Unicode. "Lone surrogate" is clear enough. Or for somewhat fancier English, "isolated surrogate" or "non-syntactic surrogate". To emphasize that Python codecs will only produce them in contexts where a Unicode character or high surrogate (for UTF-16 Python) is syntactically required, "isolated low surrogate" or "isolated trailing surrogate" might be good.[1]

  2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement must not be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2]

    Rather than saying that "dealing with such conflicts is out of scope of this PEP", I would say

    """Dealing with such conflicts is the responsibility of the application. Since this PEP's mechanism produces valid Unicode where possible, and produces invalid code points only via the error handler, one strategy is for the application to validate all other sources of strings as Unicode conforming. There may be other useful application-specific strategies, as well."""

  3. In the discussion, the transition from the example of alternative use of 'python-escape' to discussion of the error handler interface extension is a bit abrupt. I suggest rewriting as:

    """The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'utf8b' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the 'utf8b' proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculate Unicode from which the encoder would create the desired bytes."""
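The interface extension described in item 3 can be sketched with codecs.register_error. This is an illustration rather than the PEP's implementation: the handler name 'demo-utf8b' and the function escape_to_bytes are hypothetical, and it assumes Python 3.1 or later, where an encode error handler may return bytes instead of str.

```python
import codecs

def escape_to_bytes(exc):
    """Hypothetical encode error handler returning pre-encoded bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        # Only lone low surrogates U+DC80..U+DCFF stand for raw bytes.
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)
        else:
            raise exc
    # Returning bytes lets the encoder copy them to the output unchanged,
    # instead of encoding a replacement string.
    return bytes(out), exc.end

codecs.register_error("demo-utf8b", escape_to_bytes)

encoded_utf8 = "caf\udce9.txt".encode("utf-8", "demo-utf8b")
encoded_ascii = "ok\udc80\udcff".encode("ascii", "demo-utf8b")
```

No replacement str could make the UTF-8 encoder emit the single byte 0xE9, which is why the handler has to be allowed to hand back bytes directly.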

Typos (line references are to pep-0383.txt svn r72332):

l. 86: "Byte-orientied" -> "Byte-oriented"
l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
l. 130: "provide" -> "provided"
l. 134: "calculating" -> "calculate"

Footnotes:

[1] Unicode 5.0 uses the terms "high-half" and "low-half" at least once, in section 16.6, but the context is such that I take it to refer to "half of the surrogate area". Section 3.8 doesn't use these, instead noting that "leading" and "trailing" are sometimes used instead of "high" and "low". Better to avoid the word "half" in PEP 383, I think.

[2] Since this error handler is going to be the default for POSIX I/O, of course people are going to mostly ignore that restriction. The point is, passing such strings to systems that don't expect them is a bug, and the PEP should make it clear that it's the app's bug, not the other system's. On the other hand, using those strings in a context of consenting adults (and I do mean double-opt-in here) is perfectly acceptable. I'm specifically thinking of use in the Tahoe protocol discussed by Zooko O'Whielacronx; it may not be usable there for backward compatibility reasons, but "Unicode conformance" is not an issue in principle.

This does imply that programs that take advantage of the error
handler specified in this PEP are on their own if they accept data
from any sources that are not known to be Unicode-conforming.
OTOH, as far as I can see if other sources are known to be Unicode
conformant, it's reasonably (but not perfectly) safe to combine
them with strings from this PEP (and of course use either 'utf8b'
or 'strict', as appropriate, when passing data out of Python).
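One way an application might validate strings from other sources as Unicode conforming is to check that they encode under a strict codec. This is a rough sketch: the function name is_unicode_conforming is hypothetical, and it only detects surrogate code points, not other kinds of non-conformance such as noncharacters.

```python
def is_unicode_conforming(s):
    """Return True if s contains no lone surrogates (hypothetical helper)."""
    try:
        # A strict UTF-8 encoder rejects lone surrogates in Python 3.
        s.encode("utf-8", "strict")
    except UnicodeEncodeError:
        return False
    return True

clean = is_unicode_conforming("caf\u00e9.txt")
escaped = is_unicode_conforming("caf\udce9.txt")
```

A string that fails this check either came through the error handler or arrived from a source that was not conforming to begin with; telling the two apart is up to the application.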

