[Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
Victor Stinner victor.stinner at haypocalc.com
Tue Oct 25 00:57:42 CEST 2011
Hi,
I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks. This change is incompatible with Python 3.2, even though such filenames are unusable and I consider the current behaviour a (Python?) bug, so I would like your opinion on the change before working on a patch.
--
Windows has worked internally on Unicode strings since Windows 95 (or something like that), but it also provides an "ANSI" API using the ANSI code page and byte strings for backward compatibility. It was already proposed to drop the bytes API completely in our nt (os) module, but that may break backward compatibility (and it is difficult to list the Python programs using the bytes API to access the file system).
The ANSI API uses the MultiByteToWideChar (decode) and WideCharToMultiByte (encode) functions in their default mode (flags=0): MultiByteToWideChar() replaces undecodable bytes with '?' and WideCharToMultiByte() ignores unencodable characters (!!!). This behaviour produces invalid filenames (see for example issue #13247) and the user is unable to detect codec errors.
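To illustrate why silently dropping characters is a problem, here is a small sketch. It does not call the Windows API: it assumes Python's cp1252 codec with the "ignore" and "replace" error handlers as a portable stand-in for the lossy ANSI conversion, and the filenames are made up.

```python
# Assumption: cp1252 stands in for the ANSI code page, and the "ignore" /
# "replace" error handlers approximate the lossy default mode described above.
ANSI = "cp1252"

# Two distinct Unicode filenames; U+0141 (Ł) is not in cp1252.
name_a = "\u0141a.txt"
name_b = "a.txt"

# Dropping the unencodable character makes both names collapse to the same
# byte string, so the caller cannot tell which file was meant.
ignored_a = name_a.encode(ANSI, "ignore")
ignored_b = name_b.encode(ANSI, "ignore")

# Substituting '?' yields a name that is not even valid on Windows,
# since '?' is a reserved character in file names.
replaced_a = name_a.encode(ANSI, "replace")

print(ignored_a, ignored_b, replaced_a)
```

In both cases no exception is raised, so nothing tells the caller that the resulting bytes no longer name the original file.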
In Python 3.2, I changed the MBCS codec to make it strict: it raises a UnicodeEncodeError if a character cannot be encoded to the ANSI code page (e.g. encode Ł to cp1252) and a UnicodeDecodeError if a byte sequence cannot be decoded from the ANSI code page (e.g. b'\xff' from cp932).
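The strict behaviour can be sketched portably as follows. This is only an illustration: it assumes the cp1252 codec instead of "mbcs" (which exists only on Windows), uses the byte 0x81 because Python's cp1252 table leaves it undefined, and the two helper names are hypothetical.

```python
ANSI = "cp1252"  # assumption: stand-in for the "mbcs" codec


def is_encodable(name, encoding=ANSI):
    """Return True if every character of name exists in the code page."""
    try:
        name.encode(encoding, "strict")
    except UnicodeEncodeError:
        return False
    return True


def is_decodable(raw, encoding=ANSI):
    """Return True if every byte of raw is defined in the code page."""
    try:
        raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_encodable("\u0141"))   # Ł is not in cp1252
print(is_decodable(b"\x81"))    # 0x81 is undefined in Python's cp1252
print(is_encodable("abc"), is_decodable(b"abc"))
```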
I propose to reuse our MBCS codec in strict mode (error handler="strict") together with the Windows native (wide character) API, so that encode/decode errors are noticed immediately. It should simplify the source code: replace 2 versions of a function by 1 version + optional code to decode arguments and/or encode the result.
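The "1 version + optional decode/encode" idea could look roughly like the sketch below. It is written in Python rather than C and is not the proposed patch: accepts_bytes and normalize are hypothetical names, normalize merely stands in for a function calling the wide character API, and cp1252 is assumed in place of "mbcs" so that it runs anywhere.

```python
import functools

ANSI = "cp1252"  # assumption: stand-in for the "mbcs" codec


def accepts_bytes(func):
    """Wrap a str-only function so that it also accepts bytes.

    Bytes arguments are decoded in strict mode, and the str result is
    encoded back in strict mode, so codec errors are raised, not hidden.
    """
    @functools.wraps(func)
    def wrapper(path):
        if isinstance(path, bytes):
            text = path.decode(ANSI, "strict")
            return func(text).encode(ANSI, "strict")
        return func(path)
    return wrapper


@accepts_bytes
def normalize(path):
    # Hypothetical stand-in for the single Unicode implementation.
    return path.replace("/", "\\")


print(normalize("a/b"))
print(normalize(b"a/b"))
```

With this shape, a bytes argument containing an undefined byte such as b"\x81" raises UnicodeDecodeError before the underlying function is ever called.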
--
Read also the previous thread:
[Python-Dev] Byte filenames in the posix module on Windows Wed Jun 8 00:23:20 CEST 2011 http://mail.python.org/pipermail/python-dev/2011-June/111831.html
--
FYI I patched the Python MBCS codec again: it now handles the ignore and replace modes correctly (for both encoding and decoding), and it also supports any other error handler.
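For readers unfamiliar with error handlers, here is what a few of the standard ones produce when encoding. Again this assumes cp1252 as a stand-in for "mbcs", and the filename is made up.

```python
ANSI = "cp1252"       # assumption: stand-in for the "mbcs" codec
NAME = "\u0141a.txt"  # Ł is not encodable to cp1252

HANDLERS = ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")

# Encode the same name with each handler and keep the results.
encoded = {handler: NAME.encode(ANSI, handler) for handler in HANDLERS}

for handler, data in encoded.items():
    print(handler, data)
```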
--
We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-U+DCFF). But the situation is the opposite of the situation on UNIX: on Windows, the problem is more on encoding (text->bytes) than on decoding (bytes->text). On UNIX, problems occur when the system is misconfigured (e.g. wrong locale encoding). On Windows, problems occur when your application uses the old (ANSI) API, whereas your filesystem is fully Unicode compliant and you created Unicode filenames with a program using the new (Windows) API.
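A short sketch of why PEP 383 helps with decoding but not with the encoding side. It assumes cp1252 in place of the ANSI code page; the byte string and the helper name are hypothetical.

```python
ANSI = "cp1252"  # assumption: stand-in for the ANSI code page

# Decoding: the undefined byte 0x81 is kept as the lone surrogate U+DC81
# and can be turned back into the original byte.
raw = b"caf\x81.txt"
text = raw.decode(ANSI, "surrogateescape")
round_trip = text.encode(ANSI, "surrogateescape")


def encodes_with_surrogateescape(name, encoding=ANSI):
    """Return True if name can be encoded with the surrogateescape handler."""
    try:
        name.encode(encoding, "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# Encoding: a real character missing from the code page is not a lone
# surrogate, so surrogateescape cannot represent it and the error remains.
print(repr(text), round_trip == raw)
print(encodes_with_surrogateescape("\u0141.txt"))
```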
Only a few programs are fully Unicode compliant. A lot of programs fail if a filename cannot be encoded to the ANSI code page (just 2 examples: Mercurial and Visual Studio).
Victor