bpo-28180: Implementation for PEP 538 by ncoghlan · Pull Request #659 · python/cpython
It still raises a good question though, as that setting affects Python 2 differently from the way it affects Python 3: it changes the implicit encoding step on stdout, but stdin still passes the raw bytes through without interpretation:
$ LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37)
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>>
$ PYTHONIOENCODING=utf-8:surrogateescape LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37)
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
こんにちは
>>> print("こんにちは")
こんにちは
>>>
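The "position 0-14" range in the first traceback is consistent with the u"..." literal having been read as 15 separate code points (one per UTF-8 byte) before the implicit ASCII encoding step on stdout rejected all of them. A rough Python 3 sketch of that reading follows; modelling the Python 2 prompt's byte-per-character behaviour as a latin-1 decode is my assumption for illustration, not something taken from the Python 2 source:

```python
# Python 3 sketch of the failing stdout case shown above.
text = "こんにちは"
utf8_bytes = text.encode("utf-8")  # 5 characters, 3 bytes each

# Assumption: model the C-locale Python 2 prompt as turning each input byte
# of the u"..." literal into one code point (equivalent to a latin-1 decode).
byte_per_char = utf8_bytes.decode("latin-1")

error_span = None
try:
    # Stand-in for the implicit stdout encoding step under LANG=C.
    byte_per_char.encode("ascii")
except UnicodeEncodeError as exc:
    error_span = (exc.start, exc.end)  # exc.end is exclusive
    print(exc)

# With a UTF-8 stream encoding, the correctly decoded text encodes cleanly.
encoded_ok = text.encode("utf-8")
print(len(utf8_bytes), error_span)
```

The plain print("こんにちは") case never reaches an encoding step in Python 2, since the str is already bytes and goes to stdout unchanged.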
It's also a potential problem that 'surrogateescape' doesn't exist in Python 2, so it may be better to just use Py_SetStandardStreamEncoding in PEP 538, and leave enabling surrogateescape in subprocesses as well to PEP 540 (via PYTHONUTF8=1 in the parent environment).
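For anyone following along who hasn't used the handler directly, here's a minimal Python 3 sketch of what 'surrogateescape' buys you on a UTF-8 stream: undecodable bytes survive a decode/encode round trip instead of raising. The sample bytes are made up for illustration:

```python
# A stray latin-1 byte (0xE9) followed by valid UTF-8 for U+3053.
raw = b"caf\xe9 \xe3\x81\x93"

# Strict decoding rejects the stray byte outright.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape smuggles the bad byte through as a lone surrogate (U+DCE9)...
text = raw.decode("utf-8", errors="surrogateescape")

# ...and turns it back into the original byte on the way out.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(strict_failed, ascii(text), round_tripped == raw)
```

The handler name is only resolved when an error is actually hit, which is why a misconfigured handler can go unnoticed as long as the data happens to be valid.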
It also turns out that LANG=C python2 is an easy way to demonstrate that GNU readline just plain doesn't handle UTF-8 properly in the C locale - attempting to edit the print(u"こんにちは") line at the interactive prompt to remove the u prefix or add it back results in nonsense:
>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print(�こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> print(u��にちは") ")
こ�u��にちは
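The replacement characters above look like what you'd get if line editing works on individual bytes rather than whole characters, so a cursor move or deletion can land in the middle of a multi-byte UTF-8 sequence. The Python 3 sketch below is only a rough model of that failure mode under that assumption, not a description of readline's internals:

```python
line = 'print(u"こんにちは")'.encode("utf-8")

# Assumption: a byte-oriented editor deletes exactly one byte per keypress.
# Here the deleted byte is the lead byte of the first multi-byte character.
idx = line.index("こ".encode("utf-8"))
damaged = line[:idx] + line[idx + 1:]

# The orphaned continuation bytes no longer form valid UTF-8, so a terminal
# (modelled here with errors="replace") renders replacement characters.
shown = damaged.decode("utf-8", errors="replace")

# A character-aware edit of the same line stays well-formed.
fixed = line.decode("utf-8").replace('u"', '"', 1)

print(ascii(shown))
print(ascii(fixed))
```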