[Python-Dev] PEP 538: Coercing the legacy C locale to a UTF-8 based locale
Nick Coghlan ncoghlan at gmail.com
Thu May 4 11:01:38 EDT 2017
On 4 May 2017 at 12:24, INADA Naoki <songofacandy at gmail.com> wrote:
[PEP 538]
* PEP 540 proposes to entirely decouple CPython's default text encoding from the C locale system in that case, allowing text handling inconsistencies to arise between CPython and other locale-aware components running in the same process and in subprocesses. This approach aims to make CPython behave less like a locale-aware application, and more like locale-independent language runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html says:
Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
I don't know much about the .NET runtime on Unix (Mono and .NET Core). "Go, Node.js, and Rust" seem like enough examples.
I'll push an update to drop the JVM and .NET from the list of examples.
New build-time configuration options
------------------------------------
[snip]
In case (b), the warning about the C locale is not shown, but the warning about coercion is still shown. So when people don't want to see a warning under the C locale and none of the (C.UTF-8, C.utf8, UTF-8) locales is available, there are three ways:
* Set PYTHONUTF8=1 (if PEP 540 is accepted)
* Set PYTHONCOERCECLOCALE=0
* Use both the --without-c-locale-coercion and --without-c-locale-warning configure options
Is my understanding right?
Yes, that sounds right.
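For the environment-variable route, a minimal sketch of preparing a child process environment with coercion switched off could look like the following. The helper name `env_without_coercion` is hypothetical; `PYTHONCOERCECLOCALE` is the variable named above, and it only has an effect on interpreters that implement PEP 538.

```python
import os


def env_without_coercion(base_env=None):
    """Return a copy of the environment with C locale coercion disabled.

    Hypothetical helper: it only sets the PEP 538 variable and leaves the
    caller's mapping untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONCOERCECLOCALE"] = "0"
    return env


child_env = env_without_coercion({"LC_ALL": "C"})
```

The resulting mapping can be passed as the `env=` argument of `subprocess.run()` when launching the child interpreter.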
BTW, I would prefer that PEP 540 provide a --with-utf8mode option which enables UTF-8 mode by default. If that is added, there are too few use cases for --without-c-locale-warning.
There are some use cases where people want to use UTF-8 by default system wide (e.g. containers, a web server on CentOS, etc.). On the other hand, most C locale usage is on a "per application" basis rather than "system wide". A configure option is not suitable for such per-application settings, of course.
Yeah, in addition to Barry requesting such an option in one of the earlier linux-sig reviews, my main rationale for including it is that providing both config options offers a quick compatibility fix for any distro where emitting the coercion and/or C locale warning on stderr causes problems.
The only one of those that Fedora encountered in the F26 alpha was deemed a bug in the affected application (something in autotools was checking for "no output on stderr" instead of "subprocess exit code is 0", and the fix was to switch it to check the subprocess exit code), but there are enough Linux distros and BSD variants out there that I'm a lot more comfortable shipping the change with straightforward "off" switches for the stderr output.
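The fix described for that application (judge success by exit status rather than by an empty stderr) can be sketched as below. This is an illustration only, not the autotools change itself; `command_succeeded` and the stand-in child command are hypothetical.

```python
import subprocess
import sys


def command_succeeded(cmd):
    """Judge success by exit status, ignoring anything written to stderr."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode == 0


# Stand-in child process that warns on stderr but still exits with status 0.
noisy_but_ok = [sys.executable, "-c", "import sys; sys.stderr.write('warning\\n')"]
print(command_succeeded(noisy_but_ok))
```

A check written this way keeps passing when the interpreter starts emitting a locale warning on stderr.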
But I don't propose removing the option from PEP 538. We can discuss reducing the configure options later.
+1.
On platforms where they would have no effect (e.g. Mac OS X, iOS, Android, Windows) these preprocessor variables would always be undefined.
Why does --with[out]-c-locale-coercion have no effect on macOS, iOS and Android?
On these three, we know the system encoding is UTF-8, so we never interpreted the C locale as meaning "ascii" in the first place.
On Android, locale coercion fixes readline. Do you mean that locale coercion always happens, regardless of this configuration option?
Right, the change for Android is that we switch to calling 'setlocale(LC_ALL, "C.UTF-8")' during interpreter startup instead of 'setlocale(LC_ALL, "")'. That change is guarded by "#ifdef __ANDROID__", rather than either of the new conditionals.
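The interpreter makes this call in C during startup. As a rough Python-level illustration of asking whether a UTF-8 based target locale is available at all, a probe might look like this. The names `UTF8_CANDIDATES` and `first_available_locale` are hypothetical, and the sketch uses LC_CTYPE for the probe rather than LC_ALL.

```python
import locale

# Candidate target locales, in the order listed earlier in this thread.
UTF8_CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")


def first_available_locale(candidates=UTF8_CANDIDATES):
    """Return the first candidate that setlocale() accepts, or None.

    The current LC_CTYPE setting is restored before returning, so this is a
    probe only; it does not leave the process in the new locale.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query without changing
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue
            return name
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


print(first_available_locale())
```

Which name is reported, if any, depends on the locales installed on the host.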
On macOS, doesn't LC_ALL=C python switch Python's stdio to ascii:surrogateescape?
Similar to Android, CPython itself is hardcoded to assume UTF-8 on Mac OS X, since that's a platform API guarantee that users can't change.
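One way to check what a given platform and locale combination actually produces is to print the stream settings directly. A small sketch (the helper name `describe_stream` is hypothetical):

```python
import sys


def describe_stream(stream):
    """Format a text stream's settings as 'encoding:errors'."""
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return "{}:{}".format(encoding, errors)


for name in ("stdin", "stdout", "stderr"):
    print(name, describe_stream(getattr(sys, name)))
```

Running the script once normally and once with LC_ALL=C set in the environment shows whether the C locale changes the reported values on that platform.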
Even so, locale coercion may fix libraries like readline and curses. While the C locale is less common on macOS, I don't see any reason to disable coercion there.
My understanding is that other libraries and applications also automatically use UTF-8 for system interfaces on Mac OS X and iOS. It could be that that understanding is wrong, and locale coercion would provide a benefit there as well.
(Checking the draft implementation, it turns out I haven't actually implemented the configure logic to make those config settings platform dependent yet - they're currently only undefined on Windows by default, since that doesn't use the autotools based build system)
I know almost nothing about iOS, but I expect it to be similar to Android or macOS.
Improving the handling of the C locale
--------------------------------------
... locale settings for locale-aware operations. Both the JVM and the .NET CLR use UTF-16-LE as their primary encoding for passing text between applications and the underlying platform.
The JVM and .NET examples are misleading again. They just use UTF-16-LE for syscalls on Windows, like Python. I don't know much about them, but I believe they don't use UTF-16 as the system encoding on Linux.
Sorry, this was ambiguous - it's meant to refer to applications calling in to the JVM or CLR app runtime, not to the JVM or CLR calling out to the host operating system. I'll try to make it clearer in the next update.
Defaulting to "surrogateescape" error handling on the standard IO streams
-------------------------------------------------------------------------
By coercing the locale away from the legacy C default and its assumption of ASCII as the preferred text encoding, this PEP also disables the implicit use of the "surrogateescape" error handler on the standard IO streams that was introduced in Python 3.5 ([15]), as well as the automatic use of surrogateescape when operating in PEP 540's UTF-8 mode.
I agree that this PEP shouldn't break byte-transparent behavior in the C locale by coercing. But I feel the behavior difference between a coerced C.UTF-8 locale and a regular C.UTF-8 locale can be a pitfall. I read the following part of the section, and I agree that there is no way to solve every issue. But how about using the surrogateescape handler in C.* locales, as in the C locale?
That would be entirely possible, as the code responsible for that adjustment is the lines:
    char *loc = setlocale(LC_CTYPE, NULL);
    if (loc != NULL && strcmp(loc, "C") == 0)
        errors = "surrogateescape";
Changing that to include "C.UTF-8" as a second locale that also implies the use of surrogateescape would be low risk, and means we wouldn't need to call Py_SetStandardStreamEncoding.
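To make the proposed rule concrete, here is a Python rendering of the adjusted check. It is a sketch of the logic only, not the reference implementation: the names `SURROGATEESCAPE_LOCALES` and `stdio_error_handler` are hypothetical, and the "strict" fallback is an assumption about the default for stdin and stdout.

```python
import locale

# Locales whose standard streams would get "surrogateescape": "C" is what the
# existing check covers, "C.UTF-8" is the proposed addition.
SURROGATEESCAPE_LOCALES = ("C", "C.UTF-8")


def stdio_error_handler(ctype_locale):
    """Pick the stdio error handler for a given LC_CTYPE locale name."""
    if ctype_locale in SURROGATEESCAPE_LOCALES:
        return "surrogateescape"
    return "strict"  # assumed default, see the note above


# Query only, the Python counterpart of setlocale(LC_CTYPE, NULL).
current = locale.setlocale(locale.LC_CTYPE)
print(current, stdio_error_handler(current))
```

Because the decision depends only on the locale name, it gives the same answer whether C.UTF-8 was set explicitly or reached through coercion.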
As a result, non UTF-8 data (such as latin-1 or GB-18030) would automatically round-trip, regardless of whether C.UTF-8 was explicitly set as the locale, or reached as the result of locale coercion.
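The round-trip property can be demonstrated with the codec machinery alone, independent of any locale setting. A minimal sketch (the helper name `round_trip` is hypothetical):

```python
def round_trip(raw, encoding="utf-8"):
    """Decode bytes with surrogateescape, then encode them back."""
    text = raw.decode(encoding, errors="surrogateescape")
    return text.encode(encoding, errors="surrogateescape")


# Sample data in two non UTF-8 encodings.
latin1_bytes = "caf\u00e9".encode("latin-1")
gb18030_bytes = "\u4e2d\u6587".encode("gb18030")

assert round_trip(latin1_bytes) == latin1_bytes
assert round_trip(gb18030_bytes) == gb18030_bytes
```

Undecodable bytes are carried through the text layer as lone surrogates, so the bytes that come out are the bytes that went in, even though the intermediate string is not meaningful text.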
At the least, it solves the case of a Python 3.7 subprocess running under Python 3.7 with a coerced C.UTF-8 locale.
It will also extend host/container encoding mismatch compatibility to containers that explicitly set the C.UTF-8 locale.
That makes me more confident in making that change, as it would be rather counterproductive if our changes gave base image developers an incentive not to set C.UTF-8 as their default locale.
Anyway, I think https://bugs.python.org/issue15216 should be fixed in Python 3.7 too. Python applications which require byte-transparent stdio can use setencoding(errors="surrogateescape") explicitly.
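The API for that issue is not settled in this thread, so the following sketch uses `io.TextIOWrapper`, which already exists, to show what an explicit byte-transparent stream looks like. The helper name `byte_transparent_writer` is hypothetical, and an in-memory buffer stands in for `sys.stdout.buffer`.

```python
import io


def byte_transparent_writer(binary_stream, encoding="utf-8"):
    """Wrap a binary stream so that escaped bytes are written back unchanged."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors="surrogateescape")


# In-memory stand-in for sys.stdout.buffer.
sink = io.BytesIO()
out = byte_transparent_writer(sink)

# Text that was decoded from non UTF-8 input carries an escaped byte.
out.write(b"caf\xe9".decode("utf-8", errors="surrogateescape"))
out.flush()
```

After the flush, the buffer holds the original bytes rather than raising an encoding error.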
Agreed.
Cheers, Nick.
P.S. I've pushed the JVM/CLR related clarifications, but the standard stream changes will require a bit more thought and corresponding updates to the reference implementation - I'll aim to get to that this weekend.
-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia