bpo-28180: Implementation for PEP 538 by ncoghlan · Pull Request #659 · python/cpython

It still raises a good question though, since that setting affects Python 2 differently from the way it affects Python 3. It changes the implicit encoding step on stdout, but stdin still passes the raw bytes through without interpretation:

$ LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> 
$ PYTHONIOENCODING=utf-8:surrogateescape LANG=C python2
Python 2.7.13 (default, Jan 12 2017, 17:59:37) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print(u"こんにちは")
こんにちは
>>> print("こんにちは")
こんにちは
>>> 
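
The asymmetry can be reproduced without touching the process locale. The following is a minimal Python 3 sketch, not the Python 2 code path itself: it wraps an in-memory byte buffer in `io.TextIOWrapper` to stand in for stdout, and the `write_text` helper is a hypothetical name introduced purely for illustration.

```python
import io


def write_text(text, encoding, errors="strict"):
    """Push text through a text stream, mimicking an implicit stdout encoding step."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


greeting = "こんにちは"

# With an ASCII stream (what LANG=C selects), the encoding step fails.
try:
    write_text(greeting, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# With the stream encoding overridden to UTF-8, the same text is written out.
utf8_bytes = write_text(greeting, "utf-8", "surrogateescape")

# Raw bytes need no encoding step at all, which is why printing a Python 2
# byte string succeeds even in the C locale.
raw_bytes = greeting.encode("utf-8")
```

Here the override plays the role that PYTHONIOENCODING plays in the second session: only the text-to-bytes step changes, while data that is already bytes is unaffected.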

Another potential problem is that the 'surrogateescape' error handler doesn't exist in Python 2, so it may be better for PEP 538 to just use Py_SetStandardStreamEncoding, and leave enabling surrogateescape in subprocesses as well to PEP 540 (via PYTHONUTF8=1 in the parent environment).
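
For readers unfamiliar with the handler, here is a short Python 3 sketch of what 'surrogateescape' does. The sample bytes are made up for illustration. Undecodable bytes are mapped to lone surrogates on decoding and restored unchanged on encoding, which is what lets arbitrary byte data round-trip through text streams.

```python
# Valid UTF-8 for "こ" followed by a stray byte that is not valid UTF-8.
data = "こ".encode("utf-8") + b"\xff"

# Strict decoding rejects the stray byte.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps the bad byte 0xFF to the lone surrogate U+DCFF.
text = data.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")
```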

It also turns out that LANG=C python2 is an easy way to demonstrate that GNU readline just plain doesn't handle UTF-8 properly in the C locale - attempting to edit the print(u"こんにちは") line at the interactive prompt to remove the u prefix or add it back results in nonsense:

>>> print(u"こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
>>> print(�こんにちは")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
>>> print("こんにちは")
こんにちは
>>> print(u��にちは") ")
こ�u��にちは
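
One plausible reading of the garbage above, offered as an assumption rather than a diagnosis of readline itself, is that line editing in the C locale works one byte at a time, while each of these characters takes three bytes in UTF-8. A small Python 3 sketch of what a single-byte deletion does to the encoded text:

```python
encoded = "こんにちは".encode("utf-8")

# Each character takes three bytes, which is consistent with the
# "position 0-14" reported in the tracebacks.
byte_count = len(encoded)

# Simulate a byte-oriented editor deleting one "character" (really one byte)
# from the front of the text.
damaged = encoded[1:]

# The two leftover continuation bytes of "こ" are no longer valid UTF-8, so a
# decoder using errors="replace" substitutes U+FFFD for each of them.
shown = damaged.decode("utf-8", errors="replace")
```

The result starts with two replacement characters followed by the intact remainder, much like the mangled lines in the session above.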