Issue 27048: distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable (original) (raw)

Issue27048

process

Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: steve.dower Nosy List: abarry, dstufft, eric.araujo, eryksun, ezio.melotti, martin.panter, paul.moore, python-dev, r.david.murray, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords: patch

Created on 2016-05-17 20:09 by abarry, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
subprocess_errors_simple_1.patch abarry,2016-05-17 20:09 review
subprocess_distutils_errors_1.patch abarry,2016-05-17 20:09 review
subprocess_errors_simple_2.patch abarry,2016-05-17 20:38 review
subprocess_distutils_errors_2.patch abarry,2016-05-17 20:38 review
subprocess_distutils_errors_3.patch abarry,2016-05-17 21:27 review
distutils_windows_non_ascii_1.patch abarry,2016-05-17 23:21 review
Messages (18)
msg265774 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-05-17 20:09
I got a random UnicodeDecodeError while trying to install a module with distutils. Traced it back and it being my name having a non-ascii character floating around in my environment. I'm including two patches: subprocess_errors_simple_1.patch simply adds 'errors="replace"' to the stdin, stdout and stderr, while subprocess_distutils_errors_1.patch fixes this a bit more thoroughly. I haven't added any tests yet, I'll add them to the proper patch when a concensus on which one to use is reached.
msg265775 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-05-17 20:33
I would not think errors=replace would be the right answer...that would lose information, wouldn't it? What is the data being used for that causes the problem? Normally in situations like this we'd use surrogateescape at the system boundary, but I don't know enough about distutils to know if that is the right answer.
msg265776 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-05-17 20:38
The data that causes the issue is my name (Émanuel), which is present in a couple of places in my environment. I also completely forgot about surrogateescape, my bad. Here are two new patches with the surrogateescape error handler instead of the replace one.
msg265777 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 21:17
Hi, > I got a random UnicodeDecodeError while trying to install a module with distutils. Please provide more context to your issue: * What is your operating system? * What is your locale? (locale encoding on UNIX, ANSI and OEM code pages on Windows) * Where does the data come from? * etc. > subprocess_errors_simple_1.patch I don't think that it's ok to change the *default* error handler in the subprocess module. Adding encoding/errors parameters to subprocess.Popen is an old topic: see for example the issue #6135.
msg265778 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-05-17 21:27
Oops, I thought I had written that - must have accidentally deleted it. I'm on Windows 7, my locale is fr-CA / cp1252, and the non-ascii data (my name =) comes from my environment. I discussed a bit with R. David Murray off-issue, and in the end it seems that it would be better to have subprocess accept parameters, but not do anything by default, and then have distutils pass these parameters. Here is a new patch.
msg265781 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:30
Copy of https://gist.github.com/Vgr255/67fc0245056454d6a68608790a9d417b E:\LGP_Test\LGP-0.1>py setup.py install running install running build running build_ext building 'lgp' extension Traceback (most recent call last): File "setup.py", line 10, in Extension("lgp", ["src/lgpmodule.c"]), File "E:\Python\3.5.0\lib\distutils\core.py", line 148, in setup dist.run_commands() File "E:\Python\3.5.0\lib\distutils\dist.py", line 955, in run_commands self.run_command(cmd) File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command cmd_obj.run() File "E:\Python\3.5.0\lib\distutils\command\install.py", line 539, in run self.run_command('build') File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command self.distribution.run_command(command) File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command cmd_obj.run() File "E:\Python\3.5.0\lib\distutils\command\build.py", line 135, in run self.run_command(cmd_name) File "E:\Python\3.5.0\lib\distutils\cmd.py", line 313, in run_command self.distribution.run_command(command) File "E:\Python\3.5.0\lib\distutils\dist.py", line 974, in run_command cmd_obj.run() File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 338, in run self.build_extensions() File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 447, in build_ extensions self._build_extensions_serial() File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 472, in _build _extensions_serial self.build_extension(ext) File "E:\Python\3.5.0\lib\distutils\command\build_ext.py", line 532, in build_ extension depends=ext.depends) File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 317, in compile self.initialize() File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 210, in initialize vc_env = _get_vc_env(plat_spec) File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env universal_newlines=True, File "E:\Python\3.5.0\lib\subprocess.py", line 629, in check_output **kwargs).stdout File "E:\Python\3.5.0\lib\subprocess.py", line 698, in run stdout, stderr = process.communicate(input, timeout=timeout) File "E:\Python\3.5.0\lib\subprocess.py", line 1055, in communicate stdout = self.stdout.read() File "E:\Python\3.5.0\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 49: chara cter maps to
msg265783 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:37
> File "E:\Python\3.5.0\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env > universal_newlines=True, This function was added in Python 3.5 by the issue #23970: "Adds distutils._msvccompiler for new Visual Studio versions." (VS 2015, I guess). The module distutils._msvccompiler doesn't exist in Python 3.5.
msg265784 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:41
> The module distutils._msvccompiler doesn't exist in Python 3.5. Oops, I mean: it *only* exists in Python 3.5 (and newer, it doesn't exist in Python 3.4).
msg265786 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 22:51
Python 3.4 has a similar function: query_vcvarsall() in Lib/distutils/msvc9compiler.py.
msg265788 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-05-17 23:07
IMO adding a Popen(errors=...) parameter would be a new feature for 3.6+ only. Also, the patch does not take the error handler into account when encoding and decoding multiple PIPEs in communicate(). I think it would be better to fix this bug in Lib/distutils/_msvccompiler.py only. The equivalent bug fix would look like: out = subprocess.check_output( ..., stderr=subprocess.STDOUT, # No universal_newlines! ) ... out = io.TextIOWrapper(io.BytesIO(out), errors="surrogateescape").read() Or maybe a simpler version is sufficient: (I’m not familiar enough with the use case or Windows to say.) out = out.decode("ascii", "surrogateescape")
msg265790 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-05-17 23:21
You are right Martin, this also fixes it, and it seems much less controversial :) Here you go.
msg265791 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-17 23:27
Ok, I now understood the issue: the distutils._msvccompiler._get_vc_env() functions runs the "C:\Program Files (x86)\Microsoft visual studio 14.0\VC\vcvarsall.bat" program in a shell and then run the "set" command in the same shell to retrieve environment variables set by the batch script. Problem: The "set" program encodes Unicode environment variables to bytes to write them into stdout, but at least one environment variable is not encodable. (It's unclear to me which encoding is used: ANSI code page? OEM code page? It doesn't matter much.) Windows provides a nice Unicode API to retrieve environment varaibles. Using the "set" program is the wrong approach to retrieve environment varaibles. I suggest to replace the "set" program with a script Python script which dumps all environment variables as UTF-8. The script would use os.environ which uses the Windows Unicode API to retrieve environment variables directly as Unicode. -- Using an error handler is the wrong option here. The root problem is that the "set" program replaces unencodable characters with junk. Trying to replace or ignore junk doesn't solve the problem. The Visual Studio environment variables can contains unencodable characters, we should be able to retrieve them as Unicode without using a very limited encoding like cp1252 codepage or any OEM code page. -- Another option is to still use the set program, but completly skip variables which contain undecodable bytes. I dislike this option :-/
msg266062 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-05-22 09:18
When writing to a file or pipe, cmd's internal commands default to either the current codepage of the attached console or, for a detached process, the system locale codepage. It uses a best-fit encoding that tries to map characters to similar characters in the codepage. For example: >>> win_unicode_console.enable() >>> os.environ['TEST©'] = '®' >>> out = subprocess.check_output('cmd /c set test') >>> out.decode(sys.__stdout__.encoding).strip() 'TESTc=r' You can force it to use Unicode (UTF-16LE) with the /u option: >>> out = subprocess.check_output('cmd /u /c set test') >>> out.decode('utf-16le').strip() 'TEST©=®' subprocess could use "/u /c" instead of "/c". This would only affect the output of internal shell commands such as "dir" and "set".
msg266071 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-22 12:18
2016-05-22 11:18 GMT+02:00 Eryk Sun <report@bugs.python.org>: > subprocess could use "/u /c" instead of "/c". This would only affect the output of internal shell commands such as "dir" and "set". Oh, I didn't know the interesting "cmd /u" command. It looks like a short and working fix for the "set" issue. As a side project, we can also evaluate using "cmd /u" by default when shell=True is used with subprocess on Windows.
msg268732 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-06-17 16:33
New changeset 15900c612ca7 by Steve Dower in branch '3.5': Issue #27048: Prevents distutils failing on Windows when environment variables contain non-ASCII characters https://hg.python.org/cpython/rev/15900c612ca7 New changeset bb22ae1e1bcd by Steve Dower in branch 'default': Issue #27048: Prevents distutils failing on Windows when environment variables contain non-ASCII characters https://hg.python.org/cpython/rev/bb22ae1e1bcd
msg268733 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-06-17 16:36
I went ahead and updated the subprocess call just in _msvccompiler to use cmd /u, as I like that fix. Not so keen on doing it for all subprocess(shell=True) calls, since we can't reliably predict whether the output will actually respect the option.
msg268734 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-06-17 16:37
Oh, and before anyone asks, I used "errors='replace'" because we get all the env variables but don't use most of them. If we do end up needing one that can't be decoded, this should make it obvious, but there's no point failing early because of a variable that we're not going to use.
msg268737 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-06-17 19:38
vcvarsall.bat is mostly `set` and `echo` statements, which print using UTF-16LE with "/u". It also runs reg.exe, but with stdout and stderr redirected to nul, so that's no problem. The final `set` command prints cmd's UTF-16LE environment. Using errors='replace' shouldn't be necessary. It doesn't hurt, however.
History
Date User Action Args
2022-04-11 14:58:31 admin set github: 71235
2016-06-17 19:38:41 eryksun set messages: +
2016-06-17 16:37:47 steve.dower set messages: +
2016-06-17 16:36:07 steve.dower set status: open -> closedmessages: + assignee: steve.dowerresolution: fixedstage: patch review -> resolved
2016-06-17 16:33:51 python-dev set nosy: + python-devmessages: +
2016-05-22 12🔞32 vstinner set messages: +
2016-05-22 09🔞54 eryksun set nosy: + eryksunmessages: +
2016-05-17 23:27:45 vstinner set messages: +
2016-05-17 23:21:38 abarry set files: + distutils_windows_non_ascii_1.patchmessages: +
2016-05-17 23:20:30 vstinner set title: Breakages in subprocess.Popen with non-ascii characters in environment -> distutils._msvccompiler._get_vc_env() fails with UnicodeDecodeError if an env var is not encodable
2016-05-17 23:07:38 martin.panter set nosy: + martin.pantermessages: +
2016-05-17 22:51:45 vstinner set messages: +
2016-05-17 22:41:13 vstinner set messages: +
2016-05-17 22:38:04 vstinner set nosy: + dstufft, paul.moore, tim.golden, eric.araujo, zach.ware, steve.dowercomponents: + Distutils, Windowsversions: + Python 3.5
2016-05-17 22:37:48 vstinner set messages: +
2016-05-17 22:30:16 vstinner set messages: +
2016-05-17 21:27:48 abarry set files: + subprocess_distutils_errors_3.patchmessages: +
2016-05-17 21:17:13 vstinner set messages: +
2016-05-17 20:38:34 abarry set files: + subprocess_distutils_errors_2.patch
2016-05-17 20:38:24 abarry set files: + subprocess_errors_simple_2.patchmessages: +
2016-05-17 20:33:28 r.david.murray set nosy: + r.david.murraymessages: +
2016-05-17 20:09:20 abarry set files: + subprocess_distutils_errors_1.patch
2016-05-17 20:09:12 abarry create