[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)
Victor Stinner victor.stinner at gmail.com
Thu Dec 7 19:48:25 EST 2017
2017-12-08 0:26 GMT+01:00 Guido van Rossum <guido at python.org>:
> You will quickly get decoding errors, and that is INADA's point. (Unless you use encoding='Latin-1'.) His worry is that the surrogateescape error handler makes it so that you won't get decoding errors, and then the failure mode is much harder to debug.
Hum, my question was more to know whether Python fails because of an operation that received str where bytes were expected, or because of a decoding error... But now I'm not sure anymore that this level of detail really matters.
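To make the distinction concrete anyway, here is a minimal sketch of the two failure modes (nothing specific to the UTF-8 mode):

    data = b"caf\xff"    # bytes that are not valid UTF-8

    # Failure mode 1: a str/bytes type confusion, independent of decoding.
    try:
        "prefix " + data
    except TypeError as exc:
        print(exc)       # can only concatenate str (not "bytes") to str

    # Failure mode 2: an explicit decoding error.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(exc)       # 'utf-8' codec can't decode byte 0xff ...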
Let me think out loud. To explain Unicode issues, I like to use filenames, since they are something that users commonly see, handle directly and can modify (and so can contain many non-ASCII characters like diacritics and emojis ;-)).
Filenames can be found on the command line, in environment variables (PYTHONSTARTUP), on stdin (read a list of files from stdin), on stdout (write the list of files to stdout), but also in text files (the Mercurial "makefile problem").
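For instance, here is a quick sketch of the filename round trip (the "demo" directory is made up; a POSIX system with a UTF-8 locale is assumed):

    import os

    os.mkdir("demo")
    # Create a file whose name is Latin-1 bytes, not valid UTF-8.
    open(os.path.join(b"demo", b"caf\xe9.txt"), "w").close()

    name = os.listdir("demo")[0]  # decoded with surrogateescape
    print(repr(name))             # 'caf\udce9.txt' -- lone surrogate, no error
    print(os.fsencode(name))      # b'caf\xe9.txt' -- original bytes recovered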
I consider that the command line and environment variables should "just work" and so use surrogateescape. It would be too annoying not to even be able to start Python because of a Unicode error. For example, it wouldn't be easy to identify which environment variable causes the issue. Fortunately, the UTF-8 mode doesn't change anything here: surrogateescape has already been used since Python 3.3 for the command line and environment variables.
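A short sketch of what already happens for environment variables (POSIX only; DEMO_VAR is a made-up name):

    import os

    # Set a value that is not valid UTF-8 through the bytes environment.
    os.environb[b"DEMO_VAR"] = b"caf\xff"

    value = os.environ["DEMO_VAR"]  # decoded with surrogateescape, no error
    print(repr(value))              # 'caf\udcff'
    print(os.fsencode(value))       # b'caf\xff' -- the original bytes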
For stdin/stdout, I think that the main motivation here is writing Unix command line tools in Python 3: passing undecodable bytes through without bothering the user with Unicode errors. Users don't use stdin and stdout as regular files; they are used more as pipes to pass data between programs, as in a shell pipeline like "producer | consumer". Sometimes stdout is redirected to a file, but I consider that it is then expected to behave like a pipe or the regular TTY stdout. IMHO we are still in the safe surrogateescape area (for the specific case of the UTF-8 mode).
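The kind of tool I have in mind is a trivial pass-through filter like this one; with surrogateescape on sys.stdin and sys.stdout (as the UTF-8 mode proposes), undecodable bytes survive the str round trip instead of killing the pipeline:

    import sys

    # Copy stdin to stdout unchanged, as in
    # "producer | python3 filter.py | consumer".
    for line in sys.stdin:
        sys.stdout.write(line)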
Ok, now comes the real question: open().
For open(), I used the example of a code snippet writing the content of a directory (os.listdir) into a text file. Another example is reading filenames from a text file while passing undecodable bytes through thanks to surrogateescape.
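A sketch of that example ("listing.txt" is a made-up name):

    import os

    # Write the directory content to a text file; surrogateescape lets
    # undecodable filename bytes pass through unchanged.
    with open("listing.txt", "w", encoding="utf-8",
              errors="surrogateescape") as f:
        for name in os.listdir("."):
            f.write(name + "\n")

    # Read the filenames back and recover the original bytes.
    with open("listing.txt", encoding="utf-8",
              errors="surrogateescape") as f:
        for line in f:
            original = os.fsencode(line.rstrip("\n"))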
But Naoki explained that open() is commonly misused to open binary files, and that Python should somehow fail loudly to notify the developer of their mistake.
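With the default "strict" handler the mistake surfaces immediately, as in this sketch ("photo.jpg" stands for any binary file):

    # Misuse: opening a binary file in text mode. With errors="strict"
    # this usually raises UnicodeDecodeError on the first read; with
    # surrogateescape it would silently return garbage str data and
    # fail much later, somewhere harder to debug.
    with open("photo.jpg", encoding="utf-8") as f:
        f.read()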
If I have to choose between the two categories of open() usage, "read undecodable bytes in UTF-8 from a text file" versus "misuse open() on a binary file", I expect the latter to be more common, and so open() shouldn't use surrogateescape by default.
While stdin and stdout are usually associated with Unix pipes and Unix tools working on bytes, files are more commonly associated with important data that must not be lost or corrupted. Python is expected to "help" the developer use the proper options to read content from a file and to write content into a file. So I understand that open() should use the "strict" error handler in the UTF-8 mode, rather than "surrogateescape".
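Note that with "strict" as the default, code which really wants the pass-through behaviour keeps a simple explicit opt-in (file name made up):

    f = open("names.txt", encoding="utf-8", errors="surrogateescape")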
I can live with this "tiny" change to my PEP. I just posted a third version of my PEP where open()'s error handler remains "strict" (it is no longer changed by the PEP).
Victor