[Python-Dev] PEP 540: Add a new UTF-8 mode (v2)

Nick Coghlan ncoghlan at gmail.com
Wed Dec 6 00:46:17 EST 2017


Something I've just noticed that needs to be clarified: on Linux, "C" locale and "POSIX" locale are aliases, but this isn't true in general (e.g. it's not the case on *BSD systems, including Mac OS X).

To handle that in PEP 538, I made it clear that everything is keyed specifically off the "C" locale, since that's what you actually get by default.

So if PEP 540 is going to implicitly trigger switching encodings, it needs to specify whether it's going to look for the C locale or the POSIX locale (I'd suggest C locale, since that's the actual default that causes problems).
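To make the distinction concrete, here is a minimal sketch of such a check. It is illustrative only, not the PEP's implementation: the helper name `should_implicitly_enable_utf8` and its `include_posix` flag are hypothetical, and the function works on a plain locale-name string rather than asking the C library.

```python
import locale


def should_implicitly_enable_utf8(lc_ctype: str, include_posix: bool = False) -> bool:
    """Hypothetical helper: does this LC_CTYPE name trigger implicit UTF-8 mode?

    By default only the literal "C" locale counts, matching the suggestion
    above. Pass include_posix=True to model keying off "POSIX" as well.
    """
    triggers = {"C"}
    if include_posix:
        triggers.add("POSIX")
    return lc_ctype in triggers


# Query the locale the running interpreter actually has for LC_CTYPE.
# Passing None asks for the current setting without changing it.
current = locale.setlocale(locale.LC_CTYPE, None)
print(current, should_implicitly_enable_utf8(current))
```

On a platform where "POSIX" is not an alias for "C", the two settings of `include_posix` give different answers for a process started with LC_CTYPE=POSIX, which is exactly the case the PEP would need to pin down.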

The precedence relationship with locale coercion also needs to be spelled out: successful locale coercion should skip implicitly enabling UTF-8 mode (for opt-in UTF-8 mode, we'd still try to coerce the locale setting as appropriate, so extension modules are more likely to behave themselves).

On 6 December 2017 at 14:07, INADA Naoki <songofacandy at gmail.com> wrote:

Oh, revised version is really short!

And I have one concern. With UTF-8 mode, open()'s default encoding/error handler is UTF-8/surrogateescape. Container usage is really growing. PyCharm supports Docker, and many new Python developers use Docker instead of installing Python directly on their system, especially on Windows. Opening a binary file without the "b" option is a very common mistake among new developers. If the default error handler is surrogateescape, they lose the chance to notice their bug. On the other hand, it helps some use cases where the user wants byte-transparent behavior without modifying code to use "surrogateescape" explicitly. Which is the more important scenario? Does anyone have an opinion on this? Are there any rationales and use cases I'm missing?

For platforms that offer a C.UTF-8 locale, I'd like "LC_CTYPE=C.UTF-8 python" and "PYTHONCOERCECLOCALE=0 LC_CTYPE=C PYTHONUTF8=1 python" to be equivalent (aside from the known limitation that extension modules may not do the right thing in the latter case).

For the locale coercion case, the default error handler for open remains as "strict", which means I'd be in favour of keeping it as "strict" by default in UTF-8 mode as well. That would flip the toggle in the PEP: "strict UTF-8" would be the default selection for "PYTHONUTF8=1", and you'd choose the more relaxed option via "PYTHONUTF8=permissive".
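The practical difference between the two error handlers can be seen with plain bytes decoding, independent of any UTF-8 mode setting. This is a small sketch using the first bytes of a PNG signature as stand-in binary data; the helper name `decode_strict` is hypothetical.

```python
# Bytes that are not valid UTF-8: 0x89 cannot start a UTF-8 sequence.
data = b"\x89PNG\r\n\x1a\n"


def decode_strict(raw: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


# "strict" surfaces the mistake immediately.
strict_result = decode_strict(data)

# "surrogateescape" silently maps each undecodable byte to a lone surrogate,
# so the decode succeeds and the original bytes can be recovered on encode.
relaxed_result = data.decode("utf-8", errors="surrogateescape")
round_tripped = relaxed_result.encode("utf-8", errors="surrogateescape")
```

With "strict" the binary data is rejected up front, which is the signal a new developer needs when they forget the "b" flag. With "surrogateescape" the decode succeeds and the bytes round-trip, which is what byte-transparent tools want.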

That way, the combination of PEPs 538 and 540 would give us the following situation in the C locale:

  1. Our preferred approach is to coerce LC_CTYPE in the C locale to a UTF-8 based equivalent
  2. Only if that fails (e.g. as it will on CentOS 7) do we resort to implicitly enabling CPython's internal UTF-8 mode (which should behave like C.UTF-8, except for the fact extension modules won't respect it)
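The two steps above, together with the precedence rule for opt-in UTF-8 mode, can be modelled as a small pure function. This is a sketch under stated assumptions, not CPython's startup code: the function name `resolve_text_mode` and its parameters are hypothetical, the list of coercion targets is the one given in PEP 538, and treating "coercion disabled" the same as "coercion failed" is my assumption.

```python
from typing import Collection, Tuple

# Coercion targets tried in order, as listed in PEP 538.
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def resolve_text_mode(
    lc_ctype: str,
    available_locales: Collection[str],
    coercion_enabled: bool = True,
    utf8_opt_in: bool = False,
) -> Tuple[str, bool]:
    """Hypothetical model: return (effective LC_CTYPE, UTF-8 mode enabled)."""
    effective = lc_ctype
    coerced = False

    # Step 1: in the C locale, prefer coercing LC_CTYPE to a UTF-8 locale.
    if lc_ctype == "C" and coercion_enabled:
        for target in COERCION_TARGETS:
            if target in available_locales:
                effective = target
                coerced = True
                break

    if utf8_opt_in:
        # Explicit opt-in: UTF-8 mode is on, and coercion was still attempted.
        utf8_mode = True
    else:
        # Step 2: implicit UTF-8 mode only when the C locale could not be coerced.
        utf8_mode = lc_ctype == "C" and not coerced

    return effective, utf8_mode


print(resolve_text_mode("C", {"C.UTF-8"}))  # coercion succeeds
print(resolve_text_mode("C", set()))        # no target locale available
```

The second call models a platform with no suitable target locale, where the fallback to the internal UTF-8 mode applies.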

That way, the ideal outcome is that a UTF-8 based locale exists, and we use it automatically when needed. UTF-8 mode then lets us cope with older platforms where neither C.UTF-8 nor an equivalent exists.

Cheers, Nick.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia


