[Python-Dev] PEP 538: Coercing the legacy C locale to a UTF-8 based locale
Nick Coghlan ncoghlan at gmail.com
Sun Mar 5 02:50:38 EST 2017
Hi folks,
Late last year I started working on a change to the CPython CLI (not the shared library) to get it to coerce the legacy C locale to something based on UTF-8 when a suitable locale is available.
After a couple of rounds of iteration on linux-sig and python-ideas, I'm now bringing it to python-dev as a concrete proposal for Python 3.7.
For most folks, reading the Abstract plus the draft docs updates in the reference implementation will tell you everything you need to know (if the C.UTF-8, C.utf8 or UTF-8 locales are available, the CLI will automatically attempt to coerce the legacy C locale to one of those rather than persisting with the latter's default assumption of ASCII as the preferred text encoding).
However, the full PEP goes into a lot more detail on:
- exactly what's broken about CPython's behaviour in the legacy C locale
- why I'm in favour of this particular approach to fixing it (i.e. it integrates better with other C/C++ components, as well as being amenable to redistributor backports for 3.6, and environment based configuration for 3.5 and earlier)
- why I think implementing both this change and Victor's more comprehensive "PYTHONUTF8 mode" proposal in PEP 540 will be better than implementing just one or the other (in some situations, ignoring the platform locale subsystem entirely really is the right approach, and that's the aspect PEP 540 tackles, while this PEP tackles the situations where the C locale behaviour is broken, but you still need to be consistent with the platform settings).
Cheers, Nick.
==================================
PEP: 538
Title: Coercing the legacy C locale to a UTF-8 based locale
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan at gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2016
Python-Version: 3.7
Post-History: 03-Jan-2017 (linux-sig), 07-Jan-2017 (python-ideas), 05-Mar-2017 (python-dev)
Abstract
An ongoing challenge with Python 3 on *nix systems is the conflict between needing to use the configured locale encoding by default for consistency with other C/C++ components in the same process and those invoked in subprocesses, and the fact that the standard C locale (as defined in POSIX:2001) typically implies a default text encoding of ASCII, which is entirely inadequate for the development of networked services and client applications in a multilingual world.
PEP 540 proposes a change to CPython's handling of the legacy C locale such that CPython will assume the use of UTF-8 in such environments, rather than persisting with the demonstrably problematic assumption of ASCII as an appropriate encoding for communicating with operating system interfaces. This is a good approach for cases where network encoding interoperability is a more important concern than local encoding interoperability.
However, it comes at the cost of making CPython's encoding assumptions diverge from those of other C and C++ components in the same process, as well as those of components running in subprocesses that share the same environment.
It also requires changes to the internals of how CPython itself works, rather than using existing configuration settings that are supported by Python versions prior to Python 3.7.
Accordingly, this PEP proposes that independently of the UTF-8 mode proposed in PEP 540, the way the CPython implementation handles the default C locale be changed such that:
- unless the new PYTHONCOERCECLOCALE environment variable is set to 0, the standalone CPython binary will automatically attempt to coerce the C locale to the first available locale out of C.UTF-8, C.utf8, or UTF-8
- if the locale is successfully coerced, and PEP 540 is not accepted, then PYTHONIOENCODING (if not otherwise set) will be set to utf-8:surrogateescape
- if the locale is successfully coerced, and PEP 540 is accepted, then PYTHONUTF8 (if not otherwise set) will be set to 1
- if the subsequent runtime initialization process detects that the legacy C locale remains active (e.g. none of C.UTF-8, C.utf8 or UTF-8 are available, locale coercion is disabled, or the runtime is embedded in an application other than the main CPython binary), and the PYTHONUTF8 feature defined in PEP 540 is also disabled (or not implemented), it will emit a warning on stderr that use of the legacy C locale's default ASCII text encoding may cause various Unicode compatibility issues
With this change, any *nix platform that does not offer at least one of the C.UTF-8, C.utf8 or UTF-8 locales as part of its standard configuration would only be considered a fully supported platform for CPython 3.7+ deployments when either the new PYTHONUTF8 mode defined in PEP 540 is used, or else a suitable locale other than the default C locale is configured explicitly (e.g. en_AU.UTF-8, zh_CN.gb18030).
Redistributors (such as Linux distributions) with a narrower target audience than the upstream CPython development team may also choose to opt in to this locale coercion behaviour for the Python 3.6.x series by applying the necessary changes as a downstream patch when first introducing Python 3.6.0.
Background
While the CPython interpreter is starting up, it may need to convert from the char * format to the wchar_t * format, or from one of those formats to PyUnicodeObject *, in a way that's consistent with the locale settings of the overall system. It handles these cases by relying on the operating system to do the conversion and then ensuring that the text encoding name reported by sys.getfilesystemencoding() matches the encoding used during this early bootstrapping process.
On Apple platforms (including both Mac OS X and iOS), this is straightforward, as Apple guarantees that these operations will always use UTF-8 to do the conversion.
On Windows, the limitations of the mbcs format used by default in these conversions proved sufficiently problematic that PEP 528 and PEP 529 were implemented to bypass the operating system supplied interfaces for binary data handling and force the use of UTF-8 instead.
On Android, many components, including CPython, already assume the use of UTF-8 as the system encoding, regardless of the locale setting. However, this isn't the case for all components, and the discrepancy can cause problems in some situations (for example, when using the GNU readline module [16_]).
On non-Apple and non-Android *nix systems, these operations are handled using the C locale system in glibc, which has the following characteristics [4_]:
- by default, all processes start in the C locale, which uses ASCII for these conversions. This is almost never what anyone doing multilingual text processing actually wants (including CPython and C/C++ GUI frameworks).
- calling setlocale(LC_ALL, "") reconfigures the active locale based on the locale categories configured in the current process environment
- if the locale requested by the current environment is unknown, or no specific locale is configured, then the default C locale will remain active
The specific locale category that covers the APIs that CPython depends on is LC_CTYPE, which applies to "classification and conversion of characters, and to multibyte and wide characters" [5_]. Accordingly, CPython includes the following key calls to setlocale:
- in the main python binary, CPython calls setlocale(LC_ALL, "") to configure the entire C locale subsystem according to the process environment. It does this prior to making any calls into the shared CPython library
- in Py_Initialize, CPython calls setlocale(LC_CTYPE, ""), such that the configured locale settings for that category always match those set in the environment. It does this unconditionally, and it doesn't revert the process state change in Py_Finalize
(This summary of the locale handling omits several technical details related to exactly where and when the text encoding declared as part of the locale settings is used - see PEP 540 for further discussion, as these particular details matter more when decoupling CPython from the declared C locale than they do when overriding the locale with one based on UTF-8)
These calls are usually sufficient to provide sensible behaviour, but they can still fail in the following cases:
- SSH environment forwarding means that SSH clients may sometimes forward client locale settings to servers that don't have that locale installed. This leads to CPython running in the default ASCII-based C locale
- some process environments (such as Linux containers) may not have any explicit locale configured at all. As with unknown locales, this leads to CPython running in the default ASCII-based C locale
The simplest way to deal with this problem for currently released versions of CPython is to explicitly set a more sensible locale when launching the application. For example::
LC_ALL=C.UTF-8 LANG=C.UTF-8 python3 ...
The C.UTF-8 locale is a full locale definition that uses UTF-8 for the LC_CTYPE category, and the same settings as the C locale for all other categories (including LC_COLLATE). It is offered by a number of Linux distributions (including Debian, Ubuntu, Fedora, Alpine and Android) as an alternative to the ASCII-based C locale.
Mac OS X and other *BSD systems have taken a different approach: instead of offering a C.UTF-8 locale, they offer a partial UTF-8 locale that only defines the LC_CTYPE category. On such systems, the preferred environmental locale adjustment is to set LC_CTYPE=UTF-8 rather than to set LC_ALL or LANG. [17_]
In the specific case of Docker containers and similar technologies, the appropriate locale setting can be specified directly in the container image definition.
Another common failure case is developers specifying LANG=C in order to see otherwise translated user interface messages in English, rather than the more narrowly scoped LC_MESSAGES=C.
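To illustrate why LANG=C is broader than LC_MESSAGES=C, here is a minimal sketch of the usual POSIX lookup order for a locale category (LC_ALL first, then the category's own variable, then LANG). The function name effective_locale and the sample environments are hypothetical, and the sketch ignores whether the named locale is actually installed:

```python
def effective_locale(env, category):
    """Return the locale name the environment requests for `category`.

    Lookup order: LC_ALL, then the category variable, then LANG.
    Empty values are treated as unset; the fallback is the C locale.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# Hypothetical environments for illustration only
broad = {"LANG": "C"}
narrow = {"LANG": "en_AU.UTF-8", "LC_MESSAGES": "C"}

print(effective_locale(broad, "LC_CTYPE"))      # C
print(effective_locale(narrow, "LC_CTYPE"))     # en_AU.UTF-8
print(effective_locale(narrow, "LC_MESSAGES"))  # C
```

With LANG=C the text encoding category falls back to the C locale as well, whereas LC_MESSAGES=C only affects the message catalogue.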
Relationship with other PEPs
This PEP shares a common problem statement with PEP 540 (improving Python 3's behaviour in the default C locale), but diverges markedly in the proposed solution:
- PEP 540 proposes to entirely decouple CPython's default text encoding from the C locale system in that case, allowing text handling inconsistencies to arise between CPython and other C/C++ components running in the same process and in subprocesses. This approach aims to make CPython behave less like a locale-aware C/C++ application, and more like C/C++ independent language runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
- this PEP proposes to override the legacy C locale with a more recently defined locale that uses UTF-8 as its default text encoding. This means that the text encoding override will apply not only to CPython, but also to any locale aware extension modules loaded into the current process, as well as to locale aware C/C++ applications invoked in subprocesses that inherit their environment from the parent process. This approach aims to retain CPython's traditional strong support for integration with other components written in C and C++, while actively helping to push forward the adoption and standardisation of the C.UTF-8 locale as a Unicode-aware replacement for the legacy C locale in the wider C/C++ ecosystem
After reviewing both PEPs, it became clear that they didn't actually conflict at a technical level, and the proposal in PEP 540 offered a superior option in cases where no suitable locale was available, as well as offering a better reference behaviour for platforms where the notion of a "locale encoding" doesn't make sense (for example, embedded systems running MicroPython rather than the CPython reference interpreter).
Meanwhile, this PEP offered improved compatibility with other C/C++ components, and an approach more amenable to being backported to Python 3.6 by downstream redistributors.
As a result, this PEP was amended to refer to PEP 540 as a complementary solution that offered improved behaviour both when locale coercion triggered, as well as when none of the standard UTF-8 based locales were available.
The availability of PEP 540 also meant that the LC_CTYPE=en_US.UTF-8 legacy fallback was removed from the list of UTF-8 locales tried as a coercion target, with CPython instead relying solely on the proposed PYTHONUTF8 mode in such cases.
Motivation
While Linux container technologies like Docker, Kubernetes, and OpenShift are best known for their use in web service development, the related container formats and execution models are also being adopted for Linux command line application development. Technologies like Gnome Flatpak [7_] and Ubuntu Snappy [8_] further aim to bring these same techniques to Linux GUI application development.
When using Python 3 for application development in these contexts, it isn't uncommon to see text encoding related errors akin to the following::
$ docker run --rm fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
$ docker run --rm ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 7: surrogates not allowed
Even though the same command is likely to work fine when run locally::
$ python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
The source of the problem can be seen by instead running the locale command in the three environments::
$ locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=en_AU.UTF-8
LC_CTYPE="en_AU.UTF-8"
LC_ALL=
$ docker run --rm fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=
LC_CTYPE="POSIX"
LC_ALL=
$ docker run --rm ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_ALL=
In this particular example, we can see that the host system locale is set to "en_AU.UTF-8", so CPython uses UTF-8 as the default text encoding. By contrast, the base Docker images for Fedora and Debian don't have any specific locale set, so they use the POSIX locale by default, which is an alias for the ASCII-based default C locale.
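The error message itself can be reproduced without Docker. The following is a rough sketch of the mechanism, under the assumption that in the ASCII-based C locale the command line bytes are decoded as ASCII with the surrogateescape error handler, and the result is later encoded as strict UTF-8. The variable names are illustrative only:

```python
# The command as the shell passes it: UTF-8 encoded bytes
command = 'print("ℙƴ☂ℌøἤ")'
raw = command.encode("utf-8")

# Assumed behaviour in the C locale: ASCII decoding, with each non-ASCII
# byte smuggled through as a lone surrogate in the range U+DC80..U+DCFF
decoded = raw.decode("ascii", errors="surrogateescape")
print(ascii(decoded[7]))

# Strict UTF-8 refuses to encode lone surrogates
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)
```

Position 7 is the first byte of the first non-ASCII character, immediately after the seven ASCII characters of print(" at the start of the command.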
The simplest way to get Python 3 (regardless of the exact version) to behave sensibly in Fedora and Debian based containers is to run it in the C.UTF-8 locale that both distros provide::
$ docker run --rm -e LANG=C.UTF-8 fedora:25 python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm -e LANG=C.UTF-8 fedora:25 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
LC_CTYPE="C.UTF-8"
LC_ALL=
$ docker run --rm -e LANG=C.UTF-8 ncoghlan/debian-python locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_ALL=
The Alpine Linux based Python images provided by Docker, Inc, already use the C.UTF-8 locale by default::
$ docker run --rm python:3 python3 -c 'print("ℙƴ☂ℌøἤ")'
ℙƴ☂ℌøἤ
$ docker run --rm python:3 locale | grep -E 'LC_ALL|LC_CTYPE|LANG'
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_ALL=
Similarly, for custom container images (i.e. those adding additional content on top of a base distro image), a more suitable locale can be set in the image definition so everything just works by default. However, it would provide a much nicer and more consistent user experience if CPython were able to just deal with this problem automatically rather than relying on redistributors or end users to handle it through system configuration changes.
While the glibc developers are working towards making the C.UTF-8 locale universally available for use by glibc based applications like CPython [6_], this unfortunately doesn't help on platforms that ship older versions of glibc without that feature, and also don't provide C.UTF-8 as an on-disk locale the way Debian and Fedora do. For these platforms, the mechanism proposed in PEP 540 at least allows CPython itself to behave sensibly, albeit without any mechanism to get other C/C++ components that decode binary streams as text to do the same.
Design Principles
The above motivation leads to the following core design principles for the proposed solution:
- if a locale other than the default C locale is explicitly configured, we'll continue to respect it
- if we're changing the locale setting without an explicit config option, we'll emit a warning on stderr that we're doing so rather than silently changing the process configuration. This will alert application and system integrators to the change, even if they don't closely follow the PEP process or Python release announcements. However, to minimize the chance of introducing new problems for end users, we'll do this without using the warnings system, so even running with -Werror won't turn it into a runtime exception
- any changes made will use existing configuration options
The aim of minimizing the negative impact on systems currently correctly configured to use GB-18030 or another partially ASCII compatible universal encoding leads to an additional design principle:
- if a UTF-8 based Linux container is run on a host that is explicitly configured to use a non-UTF-8 encoding, and tries to exchange locally encoded data with that host rather than exchanging explicitly UTF-8 encoded data, CPython will endeavour to correctly round-trip host provided data that is concatenated or split solely at common ASCII compatible code points, but may otherwise emit nonsensical results.
Specification
To better handle the cases where CPython would otherwise end up attempting to operate in the C locale, this PEP proposes that CPython automatically attempt to coerce the legacy C locale to a UTF-8 based locale when it is run as a standalone command line application.
It further proposes to emit a warning on stderr if the legacy C locale is in effect at the point where the language runtime itself is initialized, and the PEP 540 UTF-8 encoding override is also disabled, in order to warn system and application integrators that they're running CPython in an unsupported configuration.
Legacy C locale coercion in the standalone Python interpreter binary
When run as a standalone application, CPython has the opportunity to reconfigure the C locale before any locale dependent operations are executed in the process.
This means that it can change the locale settings not only for the CPython runtime, but also for any other C/C++ components running in the current process (e.g. as part of extension modules), as well as in subprocesses that inherit their environment from the current process.
After calling setlocale(LC_ALL, "") to initialize the locale settings in the current process, the main interpreter binary will be updated to include the following call::
const char *ctype_loc = setlocale(LC_CTYPE, NULL);
This cryptic invocation is the API that C provides to query the current locale setting without changing it. Given that query, it is possible to check for exactly the C locale with strcmp::
ctype_loc != NULL && strcmp(ctype_loc, "C") == 0  /* true only in the C locale */
This call also returns "C" when either no particular locale is set, or the nominal locale is set to an alias for the C locale (such as POSIX).
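For readers more familiar with Python than C, the same query is exposed through the locale module, where passing None as the locale returns the current setting without changing it. This is only an illustrative analogue, not the proposed implementation (the actual check has to run in C before the runtime starts), and the helper names are hypothetical. Note that by the time Python code runs, the interpreter has already called setlocale(LC_CTYPE, "") itself:

```python
import locale


def current_ctype_locale():
    """Query the active LC_CTYPE locale without modifying it."""
    return locale.setlocale(locale.LC_CTYPE, None)


def is_legacy_c_locale(name):
    """Mirror the strcmp(ctype_loc, "C") == 0 check from the C snippet."""
    return name is not None and name == "C"


active = current_ctype_locale()
print(active, is_legacy_c_locale(active))
```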
Given this information, CPython can then attempt to coerce the locale to one that uses UTF-8 rather than ASCII as the default encoding.
Three such locales will be tried:
- C.UTF-8 (available at least in Debian, Ubuntu, and Fedora 25+, and expected to be available by default in a future version of glibc)
- C.utf8 (available at least in HP-UX)
- UTF-8 (available in at least some *BSD variants)
For C.UTF-8 and C.utf8, the coercion will be implemented by actually setting the LANG and LC_ALL environment variables to the candidate locale name, such that future calls to setlocale() will see them, as will other components looking for those settings (such as GUI development frameworks).
For the platforms where it is defined, UTF-8 is a partial locale that only defines the LC_CTYPE category. Accordingly, only the LC_CTYPE environment variable would be set when using this fallback option.
To adjust automatically to future changes in locale availability, these checks will be implemented at runtime on all platforms other than Mac OS X and Windows, rather than attempting to determine which locales to try at compile time.
If the locale settings are changed successfully, and the PYTHONIOENCODING environment variable is currently unset, then it will be forced to PYTHONIOENCODING=utf-8:surrogateescape.
When this locale coercion is activated, the following warning will be printed on stderr, with the warning containing whichever locale was successfully configured::
Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
When falling back to the UTF-8 locale, the message would be slightly different::
Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).
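The coercion steps described above can be summarised in a short Python sketch. This is only a model of the decision logic, not the reference implementation (which is written in C and calls setlocale() to test each candidate). The names CANDIDATES and coerce_c_locale, and the is_available callback standing in for the real availability check, are all hypothetical:

```python
# Candidate locales in the order they are tried, paired with the
# environment variables that are set when the candidate is chosen
CANDIDATES = (
    ("C.UTF-8", ("LC_ALL", "LANG")),
    ("C.utf8", ("LC_ALL", "LANG")),
    ("UTF-8", ("LC_CTYPE",)),
)

WARNING_TEMPLATE = (
    "Python detected LC_CTYPE=C, {variables} set to {target} "
    "(set another locale or PYTHONCOERCECLOCALE=0 to disable "
    "this locale coercion behaviour)."
)


def coerce_c_locale(env, ctype_loc, is_available):
    """Return (new_env, warning); warning is None if nothing was coerced.

    `is_available` is a stand-in for attempting setlocale() on a candidate.
    """
    if ctype_loc != "C" or env.get("PYTHONCOERCECLOCALE") == "0":
        return dict(env), None
    for target, variables in CANDIDATES:
        if not is_available(target):
            continue
        new_env = dict(env)
        for variable in variables:
            new_env[variable] = target
        # Only set PYTHONIOENCODING when it is not already present
        new_env.setdefault("PYTHONIOENCODING", "utf-8:surrogateescape")
        warning = WARNING_TEMPLATE.format(
            variables=" & ".join(variables), target=target
        )
        return new_env, warning
    return dict(env), None


# Example: a platform that only offers the partial UTF-8 locale
new_env, warning = coerce_c_locale({}, "C", lambda name: name == "UTF-8")
print(new_env)
print(warning)
```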
In combination with PEP 540, this locale coercion will mean that the standard Python binary and locale aware C/C++ extensions should once again "just work" in the three main failure cases we're aware of (missing locale settings, SSH forwarding of unknown locales, and developers explicitly requesting LANG=C), as long as the target platform provides at least one of the candidate UTF-8 based environments.
If PYTHONCOERCECLOCALE=0 is set, or none of the candidate locales is successfully configured, then initialization will continue as usual in the C locale and the Unicode compatibility warning described in the next section will be emitted just as it would for any other application.
The interpreter will always check for the PYTHONCOERCECLOCALE environment variable (even when running under the -E or -I switches), as the locale coercion check necessarily takes place before any command line argument processing.
Changes to the runtime initialization process
By the time that Py_Initialize is called, arbitrary locale-dependent operations may have taken place in the current process. This means that by the time it is called, it is too late to switch to a different locale - doing so would introduce inconsistencies in decoded text, even in the context of the standalone Python interpreter binary.
Accordingly, when Py_Initialize is called and CPython detects that the configured locale is still the default C locale and the PYTHONUTF8 feature from PEP 540 is disabled, the following warning will be issued::
Python runtime initialized with LC_CTYPE=C (a locale with default ASCII encoding), which may cause Unicode compatibility problems. Using C.UTF-8, C.utf8, or UTF-8 (if available) as alternative Unicode-compatible locales is recommended.
In this case, no actual change will be made to the locale settings.
Instead, the warning informs both system and application integrators that they're running Python 3 in a configuration that we don't expect to work properly.
The second sentence providing recommendations would be conditionally compiled based on the operating system (e.g. recommending LC_CTYPE=UTF-8 on *BSD systems).
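As a rough model of the runtime check described in this section, the following Python sketch writes the warning directly to a stream rather than going through the warnings module, so options like -Werror cannot turn it into an exception. The function name, its parameters, and the use of a single fixed message (rather than a per-platform one) are assumptions made for illustration:

```python
import io
import sys

LEGACY_C_LOCALE_WARNING = (
    "Python runtime initialized with LC_CTYPE=C (a locale with default "
    "ASCII encoding), which may cause Unicode compatibility problems. "
    "Using C.UTF-8, C.utf8, or UTF-8 (if available) as alternative "
    "Unicode-compatible locales is recommended."
)


def warn_if_legacy_c_locale(ctype_loc, utf8_mode_enabled=False, stream=None):
    """Emit the warning if the legacy C locale is still active.

    Returns True if the warning was written. No locale settings are changed.
    """
    if stream is None:
        stream = sys.stderr
    if ctype_loc == "C" and not utf8_mode_enabled:
        stream.write(LEGACY_C_LOCALE_WARNING + "\n")
        return True
    return False


# Example: capture the output instead of writing to the real stderr
captured = io.StringIO()
warn_if_legacy_c_locale("C", stream=captured)
print(captured.getvalue(), end="")
```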
New build-time configuration options
While both of the above behaviours would be enabled by default, they would also have new associated configuration options and preprocessor definitions for the benefit of redistributors that want to override those default settings.
The locale coercion behaviour would be controlled by the flag --with[out]-c-locale-coercion, which would set the PY_COERCE_C_LOCALE preprocessor definition.
The locale warning behaviour would be controlled by the flag --with[out]-c-locale-warning, which would set the PY_WARN_ON_C_LOCALE preprocessor definition.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android, Windows) these preprocessor variables would always be undefined.
Platform Support Changes
A new "Legacy C Locale" section will be added to PEP 11 that states:
- as of CPython 3.7, the legacy C locale is only supported when operating in "UTF-8" mode. Any Unicode handling issues that occur only in that locale and cannot be reproduced in an appropriately configured non-ASCII locale will be closed as "won't fix"
- as of CPython 3.7, *nix platforms are expected to provide at least one of C.UTF-8 (full locale), C.utf8 (full locale) or UTF-8 (LC_CTYPE-only locale) as an alternative to the legacy C locale. Any Unicode related integration problems with C/C++ extensions that occur only in that locale and cannot be reproduced in an appropriately configured non-ASCII locale will be closed as "won't fix".
Rationale
Improving the handling of the C locale
It has been clear for some time that the C locale's default encoding of ASCII is entirely the wrong choice for development of modern networked services. Newer languages like Rust and Go have eschewed that default entirely, and instead made it a deployment requirement that systems be configured to use UTF-8 as the text encoding for operating system interfaces. Similarly, Node.js assumes UTF-8 by default (a behaviour inherited from the V8 JavaScript engine) and requires custom build settings to indicate it should use the system locale settings for locale-aware operations. Both the JVM and the .NET CLR use UTF-16-LE as their primary encoding for passing text between applications and the underlying platform.
The challenge for CPython has been the fact that in addition to being used for network service development, it is also extensively used as an embedded scripting language in larger applications, and as a desktop application development language, where it is more important to be consistent with other C/C++ components sharing the same process, as well as with the user's desktop locale settings, than it is with the emergent conventions of modern network service development.
The core premise of this PEP is that for all of these use cases, the assumption of ASCII implied by the default "C" locale is the wrong choice, and furthermore that the following assumptions are valid:
- in desktop application use cases, the process locale will already be configured appropriately, and if it isn't, then that is an operating system or embedding application level problem that needs to be reported to and resolved by the operating system provider or application developer
- in network service development use cases (especially those based on Linux containers), the process locale may not be configured at all, and if it isn't, then the expectation is that components will impose their own default encoding the way Rust, Go and Node.js do, rather than trusting the legacy C default encoding of ASCII the way CPython currently does
Defaulting to "surrogateescape" error handling on the standard IO streams
By coercing the locale away from the legacy C default and its assumption of ASCII as the preferred text encoding, this PEP also disables the implicit use of the "surrogateescape" error handler on the standard IO streams that was introduced in Python 3.5 ([15_]), as well as the automatic use of surrogateescape when operating in PEP 540's UTF-8 mode.
Rather than introducing yet another configuration option to address that, this PEP proposes to use the existing PYTHONIOENCODING setting to ensure that the surrogateescape handler is enabled when the interpreter is required to make assumptions regarding the expected filesystem encoding.
The aim of this behaviour is to attempt to ensure that operating system provided text values are typically able to be transparently passed through a Python 3 application even if it is incorrect in assuming that that text has been encoded as UTF-8.
In particular, GB 18030 [12_] is a Chinese national text encoding standard that handles all Unicode code points, that is formally incompatible with both ASCII and UTF-8, but will nevertheless often tolerate processing as surrogate escaped data - the points where GB 18030 reuses ASCII byte values in an incompatible way are likely to be invalid in UTF-8, and will therefore be escaped and opaque to string processing operations that split on or search for the relevant ASCII code points. Operations that don't involve splitting on or searching for particular ASCII or Unicode code point values are almost certain to work correctly.
Similarly, Shift-JIS [13_] and ISO-2022-JP [14_] remain in widespread use in Japan, and are incompatible with both ASCII and UTF-8, but will tolerate text processing operations that don't involve splitting on or searching for particular ASCII or Unicode code point values.
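The round-trip property that this rationale relies on can be checked directly. The sketch below decodes GB 18030 bytes as if they were UTF-8 with the surrogateescape handler, splits the result on an ASCII newline, and confirms that re-encoding reproduces the original bytes. The sample data and variable names are illustrative only:

```python
text = "ℙƴ☂ℌøἤ"
gb_bytes = text.encode("gb18030")
data = gb_bytes + b"\n" + gb_bytes

# Wrongly assume UTF-8, but escape undecodable bytes instead of failing
decoded = data.decode("utf-8", errors="surrogateescape")

# Splitting at an ASCII code point leaves the escaped bytes intact
lines = decoded.split("\n")
round_tripped = [
    line.encode("utf-8", errors="surrogateescape") for line in lines
]
print(round_tripped == [gb_bytes, gb_bytes])

# Without surrogateescape, the same data is rejected outright
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```

The decoded string is not meaningful text while inside Python (it contains lone surrogates), but it passes through unchanged as long as it is written back out with the same error handler.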
As an example, consider two files, one encoded with UTF-8 (the default encoding for en_AU.UTF-8), and one encoded with GB-18030 (the default encoding for zh_CN.gb18030)::
$ python3 -c 'open("utf8.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("utf-8"))'
$ python3 -c 'open("gb18030.txt", "wb").write("ℙƴ☂ℌøἤ\n".encode("gb18030"))'
On disk, we can see that these are two very different files::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "rb").read().strip());
print("GB18030:", open("gb18030.txt", "rb").read().strip())'
UTF-8: b'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4\n'
GB18030: b'\x816\xbd6\x810\x9d0\x817\xa29\x816\xbc4\x810\x8b3\x816\x8d6\n'
Nevertheless, both can be rendered correctly to the terminal as long as they're decoded prior to printing::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r",
encoding="utf-8").read().strip());
print("GB18030:", open("gb18030.txt", "r",
encoding="gb18030").read().strip())'
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
By contrast, if we just pass along the raw bytes, as cat and similar C/C++ utilities will tend to do::
$ LANG=en_AU.UTF-8 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
Even setting a specifically Chinese locale won't help in getting the GB-18030 encoded file rendered correctly::
$ LANG=zh_CN.gb18030 cat utf8.txt gb18030.txt
ℙƴ☂ℌøἤ
�6�6�0�0�7�9�6�4�0�3�6�6
The problem is that the terminal encoding setting remains UTF-8, regardless of the nominal locale. A GB18030 terminal can be emulated using the iconv utility::
$ cat utf8.txt gb18030.txt | iconv -f GB18030 -t UTF-8
鈩櫰粹槀鈩屆羔激
ℙƴ☂ℌøἤ
This reverses the problem, such that the GB18030 file is rendered correctly, but the UTF-8 file has been converted to unrelated hanzi characters, rather than the expected rendering of "Python" as non-ASCII characters.
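The same misinterpretation can be reproduced in Python without iconv, by decoding the UTF-8 bytes with the gb18030 codec. This is only an illustrative sketch: errors="replace" is used defensively in case a byte pair has no mapping, and the exact characters produced are left to the codec rather than asserted here:

```python
text = "ℙƴ☂ℌøἤ"
utf8_bytes = text.encode("utf-8")

# Interpret UTF-8 encoded bytes as though they were GB 18030
misread = utf8_bytes.decode("gb18030", errors="replace")
print(misread)

# The correctly matched codec recovers the original text
correct = text.encode("gb18030").decode("gb18030")
print(correct)
```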
With the emulated GB18030 terminal encoding, assuming UTF-8 in Python results in both files being displayed incorrectly::
$ python3 -c 'print("UTF-8: ", open("utf8.txt", "r",
encoding="utf-8").read().strip());
print("GB18030:", open("gb18030.txt", "r",
encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: 鈩櫰粹槀鈩屆羔激
GB18030: 鈩櫰粹槀鈩屆羔激
However, setting the locale correctly means that the emulated GB18030 terminal now displays both files as originally intended::
$ LANG=zh_CN.gb18030 \
python3 -c 'print("UTF-8: ", open("utf8.txt", "r",
encoding="utf-8").read().strip());
print("GB18030:", open("gb18030.txt", "r",
encoding="gb18030").read().strip())' \
| iconv -f GB18030 -t UTF-8
UTF-8: ℙƴ☂ℌøἤ
GB18030: ℙƴ☂ℌøἤ
The rationale for retaining surrogateescape as the default IO encoding is that it will preserve the following helpful behaviour in the C locale::
$ cat gb18030.txt \
| LANG=C python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Rather than reverting to the exception seen when a UTF-8 based locale is explicitly configured::
$ cat gb18030.txt \
| python3 -c "import sys; print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib64/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
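The difference between the two behaviours comes down to the error handler applied when decoding. As a rough sketch (the variable names are illustrative, and the in-memory bytes stand in for data arriving on stdin), ASCII with surrogateescape passes undecodable bytes through unchanged, while strict UTF-8 decoding rejects them:

```python
sample = "ℙƴ☂ℌøἤ"
raw = sample.encode("gb18030")  # bytes as they would arrive on stdin

# C locale behaviour: ASCII with surrogateescape maps each non-ASCII byte
# to a lone surrogate code point on input and restores it on output.
text = raw.decode("ascii", errors="surrogateescape")
round_tripped = text.encode("ascii", errors="surrogateescape")

# Strict UTF-8 decoding refuses the same bytes outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print("bytes preserved:", round_tripped == raw)
print("strict UTF-8 failed:", strict_failed)
```

The intermediate `text` value is not meaningful as text; it only preserves the original bytes so that they can be written back out intact.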
Note: an alternative to setting PYTHONIOENCODING as the PEP currently proposes would be to instead always default to surrogateescape on the standard streams, and require the use of PYTHONIOENCODING=:strict to request text encoding validation during stream processing. Adopting such an approach would bring Python 3 more into line with typical C/C++ tools that pass along the raw bytes without checking them for conformance to their nominal encoding, and would hence also make the last example display the desired output::
$ cat gb18030.txt \
| PYTHONIOENCODING=:surrogateescape python3 -c "import sys;
print(sys.stdin.read())" \
| iconv -f GB18030 -t UTF-8
ℙƴ☂ℌøἤ
Dropping official support for ASCII based text handling in the legacy C locale
We've been trying to get strict bytes/text separation to work reliably in the legacy C locale for over a decade at this point. Not only haven't we been able to get it to work, neither has anyone else - the only viable alternatives identified have been to pass the bytes along verbatim without eagerly decoding them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540, Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).
While this PEP ensures that developers who need to do so can still opt in to running their Python code in the legacy C locale, it also makes clear that we don't expect Python 3's Unicode handling to be reliable in that configuration, and the recommended alternative is to use a more appropriate locale setting.
Providing implicit locale coercion only when running standalone
Over the course of Python 3.x development, multiple attempts have been made to improve the handling of incorrect locale settings at the point where the Python interpreter is initialised. The problem that emerged is that this is ultimately too late in the interpreter startup process - data such as command line arguments and the contents of environment variables may have already been retrieved from the operating system and processed under the incorrect ASCII text encoding assumption well before Py_Initialize is called.
The problems created by those inconsistencies were then even harder to diagnose and debug than those created by believing the operating system's claim that ASCII was a suitable encoding to use for operating system interfaces. This was the case even for the default CPython binary, let alone larger C/C++ applications that embed CPython as a scripting engine.
The approach proposed in this PEP handles that problem by moving the locale coercion as early as possible in the interpreter startup sequence when running standalone: it takes place directly in the C-level main() function, even before calling in to the Py_Main() library function that implements the features of the CPython interpreter CLI.
The Py_Initialize API then only gains an explicit warning (emitted on stderr) when it detects use of the C locale, and relies on the embedding application to specify something more reasonable.
Querying LC_CTYPE for C locale detection
LC_CTYPE is the actual locale category that CPython relies on to drive the implicit decoding of environment variables, command line arguments, and other text values received from the operating system.
As such, it makes sense to check it specifically when attempting to determine whether or not the current locale configuration is likely to cause Unicode handling problems.
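The check itself happens at the C level, but the idea can be sketched from Python using the locale module. The helper names below are hypothetical, and treating "POSIX" as equivalent to "C" is an assumption of this sketch rather than something this section specifies:

```python
import locale

# Assumption: the legacy locale reports itself as "C" (or "POSIX").
LEGACY_LOCALE_NAMES = ("C", "POSIX")


def is_legacy_c_locale(ctype_name):
    """Return True if the given LC_CTYPE setting names the legacy C locale."""
    return ctype_name in LEGACY_LOCALE_NAMES


def current_ctype_locale():
    """Query (without changing) the LC_CTYPE setting of this process."""
    return locale.setlocale(locale.LC_CTYPE, None)


ctype = current_ctype_locale()
print("LC_CTYPE is", ctype, "- legacy C locale:", is_legacy_c_locale(ctype))
```

Note that the value seen from Python code reflects whatever the interpreter configured during startup, so it may differ from the setting the process was launched with.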
Setting both LANG & LC_ALL for C.UTF-8 locale coercion
Python is often used as a glue language, integrating other C/C++ ABI compatible components in the current process, and components written in arbitrary languages in subprocesses.
Setting LC_ALL to C.UTF-8 imposes a locale setting override on all C/C++ components in the current process and in any subprocesses that inherit the current environment. This is important to handle cases where the problem has arisen from a setting like LC_CTYPE=UTF-8 being provided on a system where no UTF-8 locale is defined (e.g. when a Mac OS X ssh client is configured to forward locale settings, and the user logs into a Linux server).
Setting LANG to C.UTF-8 ensures that even components that only check the LANG fallback for their locale settings will still use C.UTF-8.
Together, these should ensure that when the locale coercion is activated, the switch to the C.UTF-8 locale will be applied consistently across the current process and any subprocesses that inherit the current environment.
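The effect on inherited environments can be illustrated at the Python level. This is only a sketch of the environment changes described above, not the C implementation in the CLI's main() function; the coerced_environment helper is hypothetical, and it does not check whether the C.UTF-8 locale actually exists on the system:

```python
import os

TARGET_LOCALE = "C.UTF-8"


def coerced_environment(base_env=None):
    """Return a copy of the environment with LC_ALL and LANG overridden."""
    env = dict(os.environ if base_env is None else base_env)
    env["LC_ALL"] = TARGET_LOCALE  # overrides every LC_* category
    env["LANG"] = TARGET_LOCALE    # covers components that only check LANG
    return env


# Example: a forwarded LC_CTYPE=UTF-8 setting on a system without that locale.
child_env = coerced_environment({"LANG": "C", "LC_CTYPE": "UTF-8"})
print(child_env)
```

The resulting mapping could be passed as the env argument to subprocess.run(). The stale LC_CTYPE entry is left in place because LC_ALL takes precedence over it.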
Allowing restoration of the legacy behaviour
The CPython command line interpreter is often used to investigate faults that occur in other applications that embed CPython, and those applications may still be using the C locale even after this PEP is implemented.
Providing a simple on/off switch for the locale coercion behaviour makes it much easier to reproduce the behaviour of such applications for debugging purposes, as well as making it easier to reproduce the behaviour of older 3.x runtimes even when running a version with this change applied.
Implementation
A draft implementation of the change (including test cases and documentation) is linked from issue 28180 [1_], which is an end user request that sys.getfilesystemencoding() default to utf-8 rather than ascii.
This patch is now being maintained as the pep538-coerce-c-locale feature branch [18_] in Nick Coghlan's fork of the CPython repository on GitHub.
NOTE: As discussed in [1_], the currently posted draft implementation has some known issues on Android.
Backporting to earlier Python 3 releases
Backporting to Python 3.6.0
If this PEP is accepted for Python 3.7, redistributors backporting the change specifically to their initial Python 3.6.0 release will be both allowed and encouraged. However, such backports should only be undertaken either in conjunction with the changes needed to also provide a suitable locale by default, or else specifically for platforms where such a locale is already consistently available.
Backporting to other 3.x releases
While the proposed behavioural change is seen primarily as a bug fix addressing Python 3's current misbehaviour in the default ASCII-based C locale, it still represents a reasonably significant change in the way CPython interacts with the C locale system. As such, while some redistributors may still choose to backport it to even earlier Python 3.x releases based on the needs and interests of their particular user base, this wouldn't be encouraged as a general practice.
However, configuring Python 3 environments (such as base container images) to use these configuration settings by default is both allowed and recommended.
Acknowledgements
The locale coercion approach proposed in this PEP is inspired directly by Armin Ronacher's handling of this problem in the click command line utility development framework [2_]::
$ LANG=C python3 -c 'import click; cli = click.command()(lambda:None); cli()'
Traceback (most recent call last):
...
RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. Either run this under Python 2 or consult http://click.pocoo.org/python3/ for mitigation steps.
This system supports the C.UTF-8 locale which is recommended.
You might be able to resolve your issue by exporting the
following environment variables:
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
The change was originally proposed as a downstream patch for Fedora's system Python 3.6 package [3_], and then reformulated as a PEP for Python 3.7 with a section allowing for backports to earlier versions by redistributors.
The initial draft was posted to the Python Linux SIG for discussion [10_] and then amended based on both that discussion and Victor Stinner's work in PEP 540 [11_].
The "ℙƴ☂ℌøἤ" string used in the Unicode handling examples throughout this PEP is taken from Ned Batchelder's excellent "Pragmatic Unicode" presentation [9_].
Stephen Turnbull has long provided valuable insight into the text encoding handling challenges he regularly encounters at the University of Tsukuba (筑波大学).
References
.. [1] CPython: sys.getfilesystemencoding() should default to utf-8 (http://bugs.python.org/issue28180)
.. [2] Locale configuration required for click applications under Python 3 (http://click.pocoo.org/5/python3/#python-3-surrogate-handling)
.. [3] Fedora: force C.UTF-8 when Python 3 is run under the C locale (https://bugzilla.redhat.com/show_bug.cgi?id=1404918)
.. [4] GNU C: How Programs Set the Locale (https://www.gnu.org/software/libc/manual/html_node/Setting-the-Locale.html)
.. [5] GNU C: Locale Categories (https://www.gnu.org/software/libc/manual/html_node/Locale-Categories.html)
.. [6] glibc C.UTF-8 locale proposal (https://sourceware.org/glibc/wiki/Proposals/C.UTF-8)
.. [7] GNOME Flatpak (http://flatpak.org/)
.. [8] Ubuntu Snappy (https://www.ubuntu.com/desktop/snappy)
.. [9] Pragmatic Unicode (http://nedbatchelder.com/text/unipain.html)
.. [10] linux-sig discussion of initial PEP draft (https://mail.python.org/pipermail/linux-sig/2017-January/000014.html)
.. [11] Feedback notes from linux-sig discussion and PEP 540 (https://github.com/python/peps/issues/171)
.. [12] GB 18030 (https://en.wikipedia.org/wiki/GB_18030)
.. [13] Shift-JIS (https://en.wikipedia.org/wiki/Shift_JIS)
.. [14] ISO-2022 (https://en.wikipedia.org/wiki/ISO/IEC_2022)
.. [15] Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale (https://bugs.python.org/issue19977)
.. [16] test_readline.test_nonascii fails on Android (http://bugs.python.org/issue28997)
.. [17] UTF-8 locale discussion on "locale.getdefaultlocale() fails on Mac OS X with default language set to English" (http://bugs.python.org/issue18378#msg215215)
.. [18] GitHub branch diff for ncoghlan:pep538-coerce-c-locale (https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)
Copyright
This document has been placed in the public domain under the terms of the CC0 1.0 license: https://creativecommons.org/publicdomain/zero/1.0/
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia