[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

INADA Naoki songofacandy at gmail.com
Wed Dec 6 09:02:16 EST 2017


And I have one worrying point. With UTF-8 mode, open()'s default encoding/error handler is UTF-8/surrogateescape.

The Strict UTF-8 Mode is for you if you prioritize correctness over usability.

Yes, but as I said, I care about inexperienced developers who don't know what UTF-8 mode is.

In the very first version of my PEP/idea, I wanted to use UTF-8/strict. But then I started to play with the implementation and I got many "practical" issues. Using UTF-8/strict, you quickly get encoding errors. For example, you become unable to read undecodable bytes from stdin. stdin.read() only gives you an error, without letting you decide how to handle this "invalid" data. Same issue with stdout.
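The difference between the two error handlers can be sketched on a plain byte string instead of a real stdin. The sample bytes below are made up for illustration; the point is only that strict raises, while surrogateescape keeps the undecodable byte and lets the caller round-trip it.

```python
# Hypothetical input: valid ASCII followed by a byte that is not valid UTF-8.
raw = b"abc\xff"

# UTF-8/strict: decoding fails and the caller never sees the data.
try:
    raw.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# UTF-8/surrogateescape: the bad byte 0xFF becomes the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", "surrogateescape")

print(strict_failed, ascii(text), roundtrip)
```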

I don't care about stdio, because PEP 538 uses surrogateescape for stdio/error https://www.python.org/dev/peps/pep-0538/#changes-to-the-default-error-handling-on-the-standard-streams

I care only about the behavior of the builtin open(). PEP 538 doesn't change the default error handler of open().

I think PEP 538 and PEP 540 should behave almost identically, apart from whether the locale is changed. So I need a very strong reason if PEP 540 changes the default error handler of open().

In the old long version of the PEP, I tried to explain UTF-8/strict issues with very concrete examples, the removed "Use Cases" section: https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L490

Tell me if I should rephrase the rationale of PEP 540 to better justify the usage of surrogateescape.

OK, the "List a directory into a text file" example demonstrates why surrogateescape is used for open(). If os.listdir() returns surrogateescaped data, file.write() will fail. All other examples are about stdio.

But we should achieve a good balance between correctness and usability of the default behavior.
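A minimal sketch of that failure, assuming a hypothetical file name containing the undecodable byte 0xE9 (built by hand here rather than taken from a real os.listdir() call, so it runs on any platform):

```python
import os
import tempfile

# Hypothetical name, as os.listdir() could return it on POSIX for a file
# whose name contains a byte that is not valid UTF-8.
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "listing.txt")

    # UTF-8/strict: the lone surrogate in the name cannot be encoded.
    try:
        with open(path, "w", encoding="utf-8", errors="strict", newline="\n") as f:
            f.write(name + "\n")
        strict_write_failed = False
    except UnicodeEncodeError:
        strict_write_failed = True

    # UTF-8/surrogateescape: the original byte is written back unchanged.
    with open(path, "w", encoding="utf-8", errors="surrogateescape", newline="\n") as f:
        f.write(name + "\n")

    with open(path, "rb") as f:
        written = f.read()

print(strict_write_failed, written)
```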

Maybe the "UTF-8 Mode" should be renamed to "UTF-8 with surrogateescape, or backslashreplace for stderr, or surrogatepass for fsdecode/fsencode on Windows, or strict for Strict UTF-8 Mode"... But the PEP title would be too long, no? :-)

I feel the short name is enough.

And opening a binary file without the "b" option is a very common mistake of new developers. If the default error handler is surrogateescape, they lose a chance to notice their bug.

When open() is used in text mode to read "binary data", usually the developer would only notice when getting the POSIX locale (ASCII encoding). But PEP 538 already changed that by using the C.UTF-8 locale (and so the UTF-8 encoding, instead of the ASCII encoding).

With PEP 538 (C.UTF-8 locale), open() uses UTF-8/strict, not UTF-8/surrogateescape.

For example, this code raises UnicodeDecodeError with PEP 538 if the file is a JPEG file.

with open(fn) as f:
    f.read()
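The same effect can be reproduced without depending on the locale by passing the encoding and error handler explicitly. This is only a sketch: the temporary file and its name are hypothetical, and the bytes are just the usual start of a JPEG/JFIF file, not a complete image.

```python
import os
import tempfile

# Typical first bytes of a JPEG/JFIF file; 0xFF is never valid in UTF-8.
jpeg_head = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, "photo.jpg")  # hypothetical file name
    with open(fn, "wb") as f:
        f.write(jpeg_head)

    # UTF-8/strict: the missing "b" is reported right away.
    try:
        with open(fn, encoding="utf-8", errors="strict") as f:
            f.read()
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # UTF-8/surrogateescape: the read succeeds and returns a str full of
    # lone surrogates, so the mistake goes unnoticed at this point.
    with open(fn, encoding="utf-8", errors="surrogateescape") as f:
        data = f.read()

print(strict_raised, ascii(data))
```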

I'm not sure that locales are the best way to detect such a class of bugs. I suggest using the -b or -bb option to detect such bugs without having to care about the locale.

But many new developers don't use or know about the -b or -bb option.

On the other hand, it helps some use cases where users want byte-transparent behavior, without modifying code to use "surrogateescape" explicitly.

Which is the more important scenario? Does anyone have an opinion about it? Are there any rationales or use cases I am missing?

Usually users expect that Python 3 "just works" and doesn't bother them with the locale (that nobody understands). The old version of the PEP contains a long list of issues: https://github.com/python/peps/blob/f92b5fbdc2bcd9b182c1541da5a0f4ce32195fb6/pep-0540.txt#L924-L986

I already replaced the strict error handler with surrogateescape for sys.stdin and sys.stdout on the POSIX locale in Python 3.5: https://bugs.python.org/issue19977

For the rationale, read for example these comments: [snip]

OK, I'll read them and think again about open()'s default behavior. But I still hope open()'s behavior is consistent with PEP 538 and PEP 540.

Regards,


