[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

Victor Stinner victor.stinner at gmail.com
Wed Dec 6 05:34:59 EST 2017

Hi Naoki,

2017-12-06 5:07 GMT+01:00 INADA Naoki <songofacandy at gmail.com>:

Oh, revised version is really short!

And I have one worrying point. With UTF-8 mode, open()'s default encoding/error handler is UTF-8/surrogateescape.

The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

In the very first version of my PEP/idea, I wanted to use UTF-8/strict. But then I started to play with the implementation and I ran into many "practical" issues. Using UTF-8/strict, you quickly get encoding errors. For example, you become unable to read undecodable bytes from stdin. stdin.read() only gives you an error, without letting you decide how to handle this "invalid" data. Same issue with stdout.
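To make the difference concrete, here is a minimal sketch (not taken from the PEP; the sample bytes and the helper names are only illustrative assumptions) of how the same undecodable input behaves with the strict and the surrogateescape error handlers:

```python
# Illustrative input: 0xE9 is "é" in Latin-1, but it is not valid UTF-8 here.
raw = b"caf\xe9.txt"


def decode_strict(data: bytes) -> str:
    """Decode as UTF-8 and fail on the first invalid byte."""
    return data.decode("utf-8", "strict")


def decode_escape(data: bytes) -> str:
    """Decode as UTF-8, mapping each invalid byte to a lone surrogate."""
    return data.decode("utf-8", "surrogateescape")


try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError:
    # With strict, the caller only gets an error and no data at all.
    strict_failed = True

# With surrogateescape, byte 0xE9 becomes U+DCE9 and decoding continues.
text = decode_escape(raw)

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", "surrogateescape")

print(strict_failed, ascii(text), roundtrip == raw)
```

The round trip is the point: the application gets a str it can pass around, and the original bytes can still be written back unchanged.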

Compare encodings of the UTF-8 mode and the Strict UTF-8 Mode: https://www.python.org/dev/peps/pep-0540/#encoding-and-error-handler

I tried to summarize all these kinds of issues in the second short subsection of the rationale: https://www.python.org/dev/peps/pep-0540/#passthough-undecodable-bytes-surrogateescape

In the old long version of the PEP, I tried to explain UTF-8/strict issues with very concrete examples, the removed "Use Cases" section: https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490

Tell me if I should rephrase the rationale of PEP 540 to better justify the use of surrogateescape.

Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with surrogateescape, or backslashreplace for stderr, or surrogatepass for fsencode/fsdecode on Windows, or strict for Strict UTF-8 Mode"... But the PEP title would be too long, no? :-)
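As a rough illustration of why stderr gets a different handler, the sketch below encodes a string that carries an escaped byte using strict and then backslashreplace. The in-memory stream named fake_stderr is a stand-in I made up for the example; it is not how Python configures sys.stderr internally.

```python
import io

# A str holding one escaped undecodable byte (0xE9 -> U+DCE9).
text = b"caf\xe9".decode("utf-8", "surrogateescape")

# strict refuses to encode a lone surrogate.
try:
    text.encode("utf-8", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# backslashreplace never fails: the surrogate is written as an escape sequence.
shown = text.encode("utf-8", "backslashreplace")

# Hypothetical stand-in for stderr: a text stream using backslashreplace.
buf = io.BytesIO()
fake_stderr = io.TextIOWrapper(
    buf, encoding="utf-8", errors="backslashreplace", newline="\n"
)
fake_stderr.write(text + "\n")
fake_stderr.flush()
written = buf.getvalue()

print(strict_failed, shown, written)
```

The design choice is that an error message should always be displayable, even if it is not byte-exact, whereas stdin and stdout try to preserve the original bytes.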

And opening a binary file without the "b" option is a very common mistake of new developers. If the default error handler is surrogateescape, they lose a chance to notice their bug.

When open() is used in text mode to read "binary data", usually the developer would only notice when running with the POSIX locale (ASCII encoding). But PEP 538 already changed that by using the C.UTF-8 locale (and so the UTF-8 encoding, instead of the ASCII encoding).

I'm not sure that locales are the best way to detect such a class of bugs. I suggest using the -b or -bb option to detect such bugs without having to care about the locale.
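The concern raised above can be reproduced with a small sketch. The file name and payload are made up for the example, and the error handlers are passed explicitly to open() so that the result does not depend on the locale or on any UTF-8 mode setting:

```python
import os
import tempfile

# Illustrative binary payload: a PNG-like signature followed by two bytes
# that are invalid in UTF-8.
payload = b"\x89PNG\r\n\x1a\n\xff\xfe"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.bin")
    with open(path, "wb") as f:
        f.write(payload)

    # Text mode with strict: the mistake shows up as an exception on read.
    try:
        with open(path, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_error = None
    except UnicodeDecodeError as exc:
        strict_error = exc

    # Text mode with surrogateescape: the read succeeds silently and
    # returns a str containing lone surrogates.
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        silently_read = f.read()

print(type(strict_error).__name__, ascii(silently_read))
```

So with surrogateescape the forgotten "b" is not reported at read time; the bug only becomes visible later, if at all, when the str is used where real text is required.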

On the other hand, it helps some use cases when users want byte-transparent behavior, without modifying code to use "surrogateescape" explicitly.

Which is the more important scenario? Does anyone have an opinion about it? Are there any rationales and use cases I'm missing?

Usually users expect that Python 3 "just works" and doesn't bother them with the locale (that nobody understands).

The old version of the PEP contains a long list of issues: https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986

I already replaced the strict error handler with surrogateescape for sys.stdin and sys.stdout on the POSIX locale in Python 3.5: https://bugs.python.org/issue19977

For the rationale, read for example these comments:

https://bugs.python.org/issue19846#msg205727 "As I would state it, the problem is that python's boundary with the OS is not yet uniform. (...) Note that currently, input() and sys.stdin.read() won't read undecodable data so this is somewhat symmetrical but it seems to me that saying "everything that interfaces with the OS except the standard streams will use surrogateescape on undecodable bytes" is drawing a line in an unintuitive location."
https://bugs.python.org/issue19977#msg206141 "My impression was that python3 was supposed to help get rid of UnicodeError tracebacks, not mojibake. If mojibake was the problem then we should never have gone down the surrogateescape path for input."
https://bugs.python.org/issue19846#msg205646 "For example I'm using [LANG=C] for testcases to set the language uncomplicated to english."

In bug reports, to get the user expectations, just ignore all core developers' comments :-)

Users set the locale to C to get messages in English and still expect "Unicode" to work properly.

Only Python 3 is so strict about encodings. Most other programming languages, like Python 2, "just work", since they process data as bytes.

Victor
