[Python-Dev] PEP 383 update: utf8b is now the error handler

"Martin v. Löwis" martin at v.loewis.de
Tue May 5 23:01:49 CEST 2009


I have three substantive comments. First, although consequences for Python 3 byte interfaces (i.e., "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Either way, I think you should clarify that point.

Done: the Python-Version header already clarifies that point.

Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b".

I think this is bike-shedding.

Third, it is not clear to me why non-decodable ASCII should be an error. There are plenty of low surrogates for the purpose. Is there another technical reason? Stupid or not, Shift-JIS- and Big5-encoded file systems are quite common in Asia still (including non-rewritable media). I think surrogate-replacement of ASCII should at least be an option.

It's a security risk. If U+DCXX mapped to \xXX for ASCII bytes as well, then somebody could embed U+DC2E U+DC2E U+DC2F into a character string; even if the string gets sanitized, nobody would expect that it will actually access ../
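To illustrate the concern, here is a minimal Python 3 sketch. It assumes the handler under the name it eventually shipped with, `surrogateescape` (this message still calls it utf8b). The helpers `permissive_encode` and `naive_sanitize` are hypothetical, invented only for this illustration, and are not part of any library.

```python
def permissive_encode(text: str) -> bytes:
    """Hypothetical encoder mapping *every* U+DCxx to byte xx, ASCII included."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)  # U+DC2E becomes b'.', U+DC2F becomes b'/'
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


def naive_sanitize(name: str) -> bool:
    """Hypothetical check: accept names containing neither '..' nor '/'."""
    return ".." not in name and "/" not in name


smuggled = "\udc2e\udc2e\udc2fetc"

# The string contains no literal '.' or '/', so the check lets it through,
# yet the permissive mapping turns it into a parent-directory path.
passes_check = naive_sanitize(smuggled)
permissive_bytes = permissive_encode(smuggled)

# The shipped handler only maps U+DC80..U+DCFF back to bytes, so the
# ASCII-range surrogates are refused at encode time.
try:
    smuggled.encode("utf-8", "surrogateescape")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# Non-ASCII undecodable bytes still round-trip.
round_trip = b"\xff".decode("utf-8", "surrogateescape")
```

Restricting the mapping to bytes 0x80 and above is what keeps characters such as `.` and `/` from being produced behind a sanitizer's back.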

1. There is no such thing as a "half-surrogate" in Unicode. "Lone surrogate" is clear enough. Or for somewhat fancier English, "isolated surrogate" or "non-syntactic surrogate". To emphasize that Python codecs will only produce them in contexts where a Unicode character or high surrogate (for UTF-16 Python) is syntactically required, "isolated low surrogate" or "isolated trailing surrogate" might be good.[1]

Fixed. I removed the word "half" everywhere. It really doesn't mean anything to me (it could have been called sunnygate instead, making no difference).

I tried to understand "surrogate", and it was explained to me that a "surrogate" is something that stands for something else. But then I would argue that the two subsequent codes together form a surrogate: they stand for something else. The individual surrogate code (in Unicode terminology) doesn't stand for anything. So don't you agree that it is the Unicode terminology that is in error, not the PEP?

2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement must not be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2]

No. The specification puts no requirements on applications whatsoever. So if you propose to use MUST NOT in the RFC 2119 sense, I strongly disagree.

Applications that desire mojibake are free to produce it; we are consenting adults; and all that.

3. In the discussion, the transition from the example of alternative use of 'python-escape' to discussion of the error handler interface extension is a bit abrupt. I suggest rewriting as:

"""The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'utf8b' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the 'utf8b' proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculate Unicode from which the encoder would create the desired bytes."""

Unfortunately, I failed to understand where you want this text to go. What paragraphs should I remove, or (if none), after which paragraph should I insert this text?
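For reference, the interface extension under discussion (an encode error handler returning bytes rather than replacement Unicode) can be sketched as follows in Python 3, where `codecs.register_error` accepts handlers that return either `str` or `bytes`. The handler name `demo-bytes-escape` and the function `demo_bytes_handler` are hypothetical names chosen for this sketch; it is not the PEP's implementation.

```python
import codecs


def demo_bytes_handler(exc):
    """Hypothetical encode error handler that returns pre-encoded bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    # exc.object[exc.start:exc.end] is the run of characters the codec rejected.
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if not 0xDC80 <= cp <= 0xDCFF:
            raise exc  # not an escaped byte: leave it an error
        out.append(cp - 0xDC00)
    # Returning bytes lets the encoder copy them to the output unchanged,
    # instead of encoding a replacement string a second time.
    return bytes(out), exc.end


codecs.register_error("demo-bytes-escape", demo_bytes_handler)

# b'\xe9' is not ASCII, so decoding escapes it as the lone surrogate U+DCE9.
text = b"caf\xe9".decode("ascii", "surrogateescape")
restored = text.encode("ascii", "demo-bytes-escape")
```

No replacement string could make the ASCII encoder emit the byte 0xE9, which is why the handler has to hand back bytes directly.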

Regards, Martin


