[Python-Dev] PEP 529: Change Windows filesystem encoding to UTF-8

Steve Dower steve.dower at python.org
Mon Sep 5 01:59:04 EDT 2016


I posted an update to PEP 529 at https://github.com/python/peps/blob/master/pep-0529.txt and a diff below. The update includes more detail on the affected code within CPython - including a number of references to broken code that would be resolved with the change - and more details about the necessary changes.

As with PEP 528, I don't think it's possible to predict the impact better than I already have, and the beta period will be essential to determine whether this change is completely unworkable. I am fully prepared to back out the change if necessary prior to RC.

Cheers, Steve


@@ -16,7 +16,8 @@
 operating system, often via C Runtime functions. However, these have been long
 discouraged in favor of the UTF-16 APIs. Within the operating system, all text
 is represented as UTF-16, and the ANSI APIs perform encoding and decoding using
-the active code page.
+the active code page. See Naming Files, Paths, and Namespaces_ for
+more details.

 This PEP proposes changing the default filesystem encoding on Windows to utf-8,
 and changing all filesystem functions to use the Unicode APIs for filesystem
@@ -27,10 +28,10 @@
 characters outside of the user's active code page.

 Notably, this does not impact the encoding of the contents of files. These will
-continue to default to locale.getpreferredencoding (for text files) or plain
-bytes (for binary files). This only affects the encoding used when users pass a
-bytes object to Python where it is then passed to the operating system as a path
-name.
+continue to default to locale.getpreferredencoding() (for text files) or
+plain bytes (for binary files). This only affects the encoding used when users
+pass a bytes object to Python where it is then passed to the operating system as
+a path name.

 Background

 Update the path converter to always decode bytes or buffer objects into text
-using PyUnicode_DecodeFSDefaultAndSize.
+using PyUnicode_DecodeFSDefaultAndSize().

 Change the narrow field from a char* string into a flag that indicates whether
 the original object was bytes. This is required for functions that need
@@ -172,11 +195,13 @@

 Add a legacy mode flag, enabled by the environment variable
-PYTHONLEGACYWINDOWSFSENCODING. When this flag is set, the default filesystem
-encoding is set to mbcs rather than utf-8, and the error mode is set to
-'replace' rather than 'strict'. The path_converter will continue to decode
-to wide characters and only *W APIs will be called, however, the bytes passed in
-and received from Python will be encoded the same as prior to this change.
+PYTHONLEGACYWINDOWSFSENCODING.
+
+When this flag is set, the default filesystem encoding is set to mbcs rather
+than utf-8, and the error mode is set to replace rather than
+surrogatepass. Paths will continue to decode to wide characters and only *W
+APIs will be called, however, the bytes passed in and received from Python will
+be encoded the same as prior to this change.

 Undeprecate bytes paths on Windows

@@ -186,6 +211,52 @@
 whatever is returned from sys.getfilesystemencoding() rather than the user's
 active code page.

+Affected Modules
+----------------
+
+This PEP implicitly includes all modules within the Python that either pass path
+names to the operating system, or otherwise use sys.getfilesystemencoding().
+
+As of 3.6.0a4, the following modules require modification:
+
+* os
+* _overlapped
+* _socket
+* subprocess
+* zipimport
+
+The following modules use sys.getfilesystemencoding() but do not need
+modification:
+
+* gc (already assumes bytes are utf-8)
+* grp (not compiled for Windows)
+* http.server (correctly includes codec name with transmitted data)
+* idlelib.editor (should not be needed; has fallback handling)
+* nis (not compiled for Windows)
+* pwd (not compiled for Windows)
+* spwd (not compiled for Windows)
+* _ssl (only used for ASCII constants)
+* tarfile (code unused on Windows)
+* _tkinter (already assumes bytes are utf-8)
+* wsgiref (assumed as the default encoding for unknown environments)
+* zipapp (code unused on Windows)
+
+The following native code uses one of the encoding or decoding functions, but do
+not require any modification:
+
+* Parser/parsetok.c (docs already specify sys.getfilesystemencoding())
+* Python/ast.c (docs already specify sys.getfilesystemencoding())
+* Python/compile.c (undocumented, but Python filesystem encoding implied)
+* Python/errors.c (docs already specify os.fsdecode())
+* Python/fileutils.c (code unused on Windows)
+* Python/future.c (undocumented, but Python filesystem encoding implied)
+* Python/import.c (docs already specify utf-8)
+* Python/importdl.c (code unused on Windows)
+* Python/pythonrun.c (docs already specify sys.getfilesystemencoding())
+* Python/symtable.c (undocumented, but Python filesystem encoding implied)
+* Python/thread.c (code unused on Windows)
+* Python/traceback.c (encodes correctly for comparing strings)
+* Python/_warnings.c (docs already specify os.fsdecode())

 Rejected Alternatives

@@ -249,44 +320,50 @@
 Code that does not manage encodings when crossing protocol boundaries may
 currently be working by chance, but could encounter issues when either encoding
-changes. For example::
+changes. For example:

-    filename = open('filename_in_mbcs.txt', 'rb').read()
-    text = open(filename, 'r').read()
+    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
+    >>> text = open(filename, 'r').read()

 To correct this code, the encoding of the bytes in filename should be
-specified, either when reading from the file or before using the value::
+specified, either when reading from the file or before using the value:

-    # Fix 1: Open file as text
-    filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
-    text = open(filename, 'r').read()
+    >>> # Fix 1: Open file as text
+    >>> filename = open('filename_in_mbcs.txt', 'r', encoding='mbcs').read()
+    >>> text = open(filename, 'r').read()

-    # Fix 2: Decode path
-    filename = open('filename_in_mbcs.txt', 'rb').read()
-    text = open(filename.decode('mbcs'), 'r').read()
+    >>> # Fix 2: Decode path
+    >>> filename = open('filename_in_mbcs.txt', 'rb').read()
+    >>> text = open(filename.decode('mbcs'), 'r').read()

 Explicitly using 'mbcs'

 Code that explicitly encodes text using 'mbcs' before passing to file system
-APIs. For example::
+APIs is now passing incorrectly encoded bytes. For example:

-    filename = open('files.txt', 'r').readline()
-    text = open(filename.encode('mbcs'), 'r')
+    >>> filename = open('files.txt', 'r').readline()
+    >>> text = open(filename.encode('mbcs'), 'r')

 To correct this code, the string should be passed without explicit encoding, or
-should use os.fsencode()::
+should use os.fsencode():

-    # Fix 1: Do not encode the string
-    filename = open('files.txt', 'r').readline()
-    text = open(filename, 'r')
+    >>> # Fix 1: Do not encode the string
+    >>> filename = open('files.txt', 'r').readline()
+    >>> text = open(filename, 'r')

-    # Fix 2: Use correct encoding
-    filename = open('files.txt', 'r').readline()
-    text = open(os.fsencode(filename), 'r')
+    >>> # Fix 2: Use correct encoding
+    >>> filename = open('files.txt', 'r').readline()
+    >>> text = open(os.fsencode(filename), 'r')

+References
+==========
+
+.. _Naming Files, Paths, and Namespaces:
+   https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx
+
 Copyright
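
For anyone who wants to check how their own code is affected, here is a minimal sketch (not part of the PEP or the diff above) that reports which encoding and error handler Python applies to bytes paths, and round-trips a name through os.fsencode() and os.fsdecode() instead of hard-coding 'mbcs'. The helper names describe_fs_encoding and roundtrip_path and the sample file name are made up for illustration, and sys.getfilesystemencodeerrors() assumes Python 3.6 or later.

```python
import os
import sys


def describe_fs_encoding():
    """Return the (encoding, error handler) pair Python uses for bytes paths."""
    # sys.getfilesystemencodeerrors() assumes Python 3.6 or later.
    return sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()


def roundtrip_path(name):
    """Encode a str path with os.fsencode() and decode it back again.

    This uses whatever filesystem encoding is active rather than a
    hard-coded codec such as 'mbcs'.
    """
    encoded = os.fsencode(name)
    decoded = os.fsdecode(encoded)
    return encoded, decoded


if __name__ == "__main__":
    encoding, errors = describe_fs_encoding()
    print("filesystem encoding:", encoding)
    print("error handler:", errors)

    # 'files.txt' is only a sample name; no file is opened here.
    encoded, decoded = roundtrip_path("files.txt")
    print("bytes form:", encoded)
    print("round trip matches:", decoded == "files.txt")
```

Running this with and without PYTHONLEGACYWINDOWSFSENCODING set on a Windows build that includes the change should make the difference between the two modes visible; on other platforms the variable is not expected to have any effect.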


