[Python-Dev] File system path encoding on Windows
Steve Dower steve.dower at python.org
Fri Aug 19 14:59:32 EDT 2016
Hi python-dev
About a week ago I proposed on python-ideas making some changes to how Python deals with encodings on Windows, specifically in relation to how Python interacts with the operating system.
Changes to the console were uncontroversial, and I have posted patches at http://bugs.python.org/issue1602 and http://bugs.python.org/issue17620 to enable the full range of Unicode input to be used at interactive stdin/stdout.
However, changes to sys.getfilesystemencoding(), which determines how the os module (and most filesystem functions in general) interprets bytes parameters, were more heatedly discussed. I've summarised the discussion in this email.
I'll declare up front that my preferred change is to treat bytes as utf-8 in Python 3.6, and I've posted a patch to do that at http://bugs.python.org/issue27781. Hopefully I haven't been too biased in my presentation of the alternatives, but this is so you at least know which way I'm biased.
I'm looking for some agreement on the answers to the questions I pose in the summary.
There is much more detail about them presented after that, as there are a number of non-obvious issues at play here. I suspect this will eventually become a PEP, but it's presented here as a summary of a discussion and not a PEP.
Cheers, Steve
Summary
Representing file system paths on Windows as bytes may result in data loss due to the way Windows encodes/decodes strings via its bytes API.
We can mitigate this by only using Windows' Unicode API and doing our own encoding and decoding (i.e. within posixmodule.c's path converter). Invalid characters could cause encoding exceptions rather than data loss.
We can go further to fix this by declaring the encoding of bytes paths on Windows must be utf-8, which would also prevent encoding exceptions, as utf-8 can fully represent all paths on Windows (natively utf-16-le).
Even though using bytes for paths on Windows has been deprecated for three releases, this is not widely known and it may be too soon to change the behaviour.
Questions:
- should we always use Windows' Unicode APIs instead of switching between bytes/Unicode based on parameter type?
- should we allow users to pass bytes and interpret them as utf-8 rather than letting Windows do the decoding?
- should we do it in 3.6, 3.7 or 3.8?
Background
File system paths are almost universally represented as text in some encoding determined by the file system. In Python, we expose these paths via a number of interfaces, such as the os and io modules. Paths may be passed either direction across these interfaces, that is, from the filesystem to the application (for example, os.listdir()), or from the application to the filesystem (for example, os.unlink()).
When paths are passed between the filesystem and the application, they are either passed through as a bytes blob or converted to/from str using sys.getfilesystemencoding(). The result of encoding a string with sys.getfilesystemencoding() is a blob of bytes in the native format for the default file system.
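As a rough illustration of that conversion, the standard library exposes it through os.fsencode() and os.fsdecode(). The file name below is just an example, and the encoding printed will depend on the platform and Python version.

```python
import os
import sys

# Example file name; any str path would do.
name = "report.txt"

# str -> bytes using sys.getfilesystemencoding() (and its error handler).
raw = os.fsencode(name)

# bytes -> str using the same encoding, so the pair round-trips.
text = os.fsdecode(raw)

print(sys.getfilesystemencoding(), raw, text)
```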
On Windows, the native format for the filesystem is utf-16-le. The recommended platform APIs for accessing the filesystem all accept and return text encoded in this format. However, prior to Windows NT (and possibly further back), the native format was a configurable machine option and a separate set of APIs existed to accept this format. The option (the "active code page") and these APIs (the "*A functions") still exist in recent versions of Windows for backwards compatibility, though new functionality often only has a utf-16-le API (the "*W functions").
In Python, we recommend using str as the default format because (with the surrogateescape handling on POSIX), it can correctly round-trip all characters used in paths. On Windows this is strongly recommended because the legacy OS support for bytes cannot round-trip all characters used in paths. Our support for bytes explicitly uses the *A functions and hence the encoding for the bytes is "whatever the active code page is". Since the active code page cannot represent all Unicode characters, the conversion of a path into bytes can lose information without warning (and we can't get a warning from the OS here - more on this later).
As a demonstration of this:
>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']
The Unicode character in the second call to glob has been replaced by a '?', which means passing the path back into the filesystem will result in a FileNotFoundError (though ironically, passing it back into glob() will find the file again, since '?' is a single-character wildcard). You can observe the same results in os.listdir() or any function that matches the return type to the parameter type.
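The same loss can be approximated on any platform by encoding with a fixed code page and the 'replace' error handler. This is only a sketch: cp1252 is an assumed stand-in for the active code page (the real value is machine-dependent), and Python's codec is used here rather than the conversion Windows performs inside the *A functions.

```python
# cp1252 is an assumed stand-in for the machine's active code page.
ACTIVE_CODE_PAGE = "cp1252"

original = "test\uab00.txt"

# Roughly what happens when the name is handed back as bytes:
# characters outside the code page are replaced with '?'.
as_bytes = original.encode(ACTIVE_CODE_PAGE, errors="replace")

# Decoding the bytes again does not recover the original name.
round_tripped = as_bytes.decode(ACTIVE_CODE_PAGE)

print(as_bytes)
print(round_tripped == original)
```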
Why is this a problem?
While the obvious and correct answer is to just use str everywhere, in general on POSIX systems there is no possibility of confusion when using bytes exclusively. Even if the encoding is "incorrect" by some standard, the file system can still map the bytes back to the file. Making use of this avoids the cost of decoding and reencoding, such that (theoretically, and only on POSIX), code like below is faster because of the use of b'.':
>>> for f in os.listdir(b'.'):
...     os.stat(f)
...
On Windows, if a filename exists that cannot be encoded with the active code page, you will receive an error from the above code. These errors are why in Python 3.3 the use of bytes paths on Windows was deprecated (listed in the What's New, but not clearly obvious in the documentation - more on this later). The above code produces multiple deprecation warnings in 3.3, 3.4 and 3.5 on Windows.
However, we still keep seeing libraries use bytes paths, which can cause unexpected issues on Windows (well, all platforms, but less and less common on POSIX as systems move to utf-8 - Windows long ago decided to move to utf-16 for the same reason, but Python's bytes interface did not keep up). Given that the current approach of not-very-aggressively recommending that library developers either write their code twice (once for bytes and once for str) or use str exclusively is not working, we should consider alternative mitigations.
Proposals
There are two dimensions here - the fix and the timing. We can basically choose any fix and any timing.
The main differences between the fixes are the balance between incorrect behaviour and backwards-incompatible behaviour. The main issue with respect to timing is whether or not we believe using bytes as paths on Windows was correctly deprecated in 3.3 and sufficiently advertised since to allow us to change the behaviour in 3.6.
Fixes
Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows
Currently the default filesystem encoding is 'mbcs', which is a meta-encoder that uses the active code page. However, when bytes are passed to the filesystem they go through the *A APIs and the operating system handles encoding. In this case, paths are always encoded using the equivalent of 'mbcs:replace' - we have no ability to change this (though there is a user/machine configuration option to change the encoding from CP_ACP to CP_OEM, so it won't necessarily always match mbcs...)
This proposal would remove all use of the *A APIs and only ever call the *W APIs. When Windows returns paths to Python as str, they will be decoded from utf-16-le and returned as text. When paths are to be returned as bytes, we would transcode from utf-16-le to utf-8 using surrogatepass (as Windows does not validate surrogate pairs, so it is possible to have invalid surrogates in filenames). Equally, when paths are provided as bytes, they are transcoded from utf-8 into utf-16-le and passed to the *W APIs.
The use of utf-8 will not be configurable, with the possible exception of a "legacy mode" environment variable or -X flag.
surrogateescape does not apply here, as we are not concerned about keeping arbitrary bytes in the path. Any bytes path returned from the operating system will be valid; any bytes path created by the user may raise a decoding error (currently it would raise a file not found or similar OSError).
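A minimal sketch of that transcoding in pure Python may help. The two helper names are hypothetical and exist only for illustration; the real conversion would live in the C path converter, not in Python code like this.

```python
def wide_to_bytes(wide: bytes) -> bytes:
    """Hypothetical helper: utf-16-le from the OS -> utf-8 bytes for the caller."""
    text = wide.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def bytes_to_wide(narrow: bytes) -> bytes:
    """Hypothetical helper: utf-8 bytes from the caller -> utf-16-le for the *W APIs."""
    text = narrow.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


# A name containing a lone surrogate, which Windows does not reject.
wide_name = "bad\ud800.txt".encode("utf-16-le", "surrogatepass")

narrow_name = wide_to_bytes(wide_name)
round_trips = bytes_to_wide(narrow_name) == wide_name

# Bytes that are not valid utf-8 raise a decoding error instead of being passed through.
try:
    bytes_to_wide(b"caf\xe9.txt")  # latin-1 encoded name, invalid as utf-8
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(narrow_name, round_trips, rejected)
```

Anything the operating system hands back survives the trip in both directions, while bytes a user built with some other encoding fail early with a decoding error.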
The choice of utf-8 (as opposed to returning utf-16-le bytes) is to ensure the ability to round-trip, while also allowing basic manipulation of paths - essentially just slicing and concatenating at '\' characters. Applications doing this have to ensure that their encoding matches sys.getfilesystemencoding(), or just use str everywhere.
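To make the "basic manipulation" point concrete, here is a small sketch comparing a byte-level split at the path separator in both encodings. The example path is made up; U+5C5C was picked because its utf-16-le encoding happens to consist of two 0x5C bytes.

```python
# U+5C5C encodes in utf-16-le as the bytes 0x5C 0x5C, which look like two backslashes.
path = "dir\\\u5c5c.txt"

utf8 = path.encode("utf-8")
utf16 = path.encode("utf-16-le")

# In utf-8, the byte 0x5C only ever means a real backslash.
utf8_parts = utf8.split(b"\\")

# In utf-16-le, a naive byte-level split also cuts inside other characters.
utf16_parts = utf16.split(b"\\")

print([part.decode("utf-8") for part in utf8_parts])
print(len(utf8_parts), len(utf16_parts))
```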
It is debated, but I believe this is not a backwards compatibility issue because:
- byte paths in Python are specified as being encoded by sys.getfilesystemencoding()
- byte paths on Windows have been deprecated for three versions
Unfortunately, the deprecation is not explicitly called out anywhere in the docs apart from the What's New page, so there is an argument that it shouldn't be counted despite the warnings in the interpreter. However, this is more directly addressed in the discussion of timing below.
Equally, sys.getfilesystemencoding() documents the specific return values for various platforms, as well as that it is part of the protocol for using bytes to represent filesystem strings.
I believe both of these arguments are invalid, that the only code that will break as a result of this change is relying on deprecated functionality and incorrect encoding, and that the (probably noisy) breakage that will occur is less bad than the silent breakage that currently exists.
As far as implementation goes, there is already a patch for this at http://bugs.python.org/issue27781. In short, we update the path converter to decode bytes (path->narrow) to Unicode (path->wide) and remove all the code that would call *A APIs. In my patch I've changed path->narrow to a flag that indicates whether to convert back to bytes on return, and also to prevent compilation of code that tries to use ->narrow as a string on Windows (maybe that will get too annoying for contributors? good discussion for the tracker IMHO).
Fix #2: Do the mbcs decoding ourselves
This is essentially the same as fix #1, but instead of changing to utf-8 we keep mbcs as the encoding.
This approach will allow us to utilise new functionality that is only available as *W APIs, and also lets us be more strict about encoding/decoding to bytes. For example, rather than silently replacing Unicode characters with '?', we could warn or fail the operation, potentially modifying that behaviour with an environment variable or flag.
Compared to fix #1, this will enable some new functionality but will not fix any of the problems immediately. New runtime errors may cause some problems to be more obvious and lead to fixes, provided library maintainers are interested in supporting Windows and adding a separate code path to treat filesystem paths as strings.
This is a middle-ground proposal. On the positive side, it significantly reduces the code we have to maintain in CPython (e.g. posixmodule.c), as we won't require separate code paths to call the *A APIs. However, it doesn't really improve things for users apart from giving more exceptions, which are likely unexpected (people probably handle OSError but not UnicodeDecodeError when accessing the file system).
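The difference between the current lossy behaviour and a stricter fix #2 can be sketched as follows. This is an approximation under stated assumptions: cp1252 stands in for 'mbcs' (which only exists on Windows), and encode_path is a hypothetical helper, not anything in CPython.

```python
# cp1252 is an assumed stand-in for 'mbcs', which only exists on Windows.
CODE_PAGE = "cp1252"


def encode_path(path: str, strict: bool) -> bytes:
    """Hypothetical helper: encode a path ourselves instead of via the *A APIs."""
    return path.encode(CODE_PAGE, errors="strict" if strict else "replace")


name = "test\uab00.txt"

# Current behaviour, approximated: silent replacement with '?'.
lossy = encode_path(name, strict=False)

# Fix #2 could instead fail loudly.
try:
    encode_path(name, strict=True)
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The new failure is a ValueError subclass, so `except OSError` would not catch it.
caught_by_oserror_handler = isinstance(error, OSError)

print(lossy, type(error).__name__, caught_by_oserror_handler)
```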
Fix #3: Make bytes paths on Windows an error
By preventing the use of bytes paths on Windows completely we prevent users from hitting encoding issues. However, we do this at the expense of usability. Obviously the deprecation concerns also play a big role in whether this is feasible.
I don't have numbers of libraries that will simply fail on Windows if this "fix" is made, but given I've already had people directly email me and tell me about their problems we can safely assume it's non-zero.
I'm really not a fan of this fix, because it doesn't actually make things better in a practical way, despite being more "pure".
Timing #1: Change it in 3.6
This timing assumes that we believe the deprecation of using bytes for paths in Python 3.3 was sufficiently well advertised that we can freely make changes in 3.6. A typical deprecation cycle would be two versions before removal (though we also often leave things in forever when they aren't fundamentally broken), so we have passed that point and theoretically can remove or change the functionality without breaking it.
In this case, we would announce in 3.6 that using bytes as paths on Windows is no longer deprecated, and that the encoding used is whatever is returned by sys.getfilesystemencoding().
Timing #2: Change it in 3.7
This timing assumes that the deprecation in 3.3 was valid, but acknowledges that it was not well publicised. For 3.6, we aggressively make it known that only strings should be used to represent paths on Windows and bytes are invalid and going to change in 3.7. (It has been suggested that I could use a keynote at PyCon to publicise this, and while I'd totally accept a keynote, I'd hate to subject a crowd to just this issue for an hour :) ).
My concern with this approach is that there is no benefit to the change at all. If we aggressively publicise the fact that libraries that don't handle Unicode paths on Windows properly are using deprecated functionality and need to be fixed by 3.7 in order to avoid breaking (more precisely - continuing to be broken, but with a different error message), then we will alienate non-Windows developers further from the platform (net loss for the ecosystem) and convince some to switch to str everywhere (net gain for the ecosystem). It doesn't
For those who listen and change to str, it removes the need to make any change in 3.7 at all, so we would really just be making noise about something that some people may not have noticed without necessarily going in and fixing anything. For those who don't listen, the change in 3.7 is going to break them just as much as if we made the change in 3.6.
Timing #3: Change it in 3.8
This timing assumes that the deprecation in 3.3 was not sufficient and we need to start a new deprecation cycle. This is strengthened by the fact that the deprecation announcement does not explicitly include the io module or the builtin open() function, and so some developers may believe that using bytes for paths with these is okay despite the os module being deprecated.
The one upside to this approach is that it would also allow us to change locale.getpreferredencoding() to utf-8 on Windows (to affect the default behaviour of open(..., 'r') ), which I don't believe is going to be possible without a new deprecation cycle. There is a strong argument that the following code should also round-trip regardless of platform:
>>> with open('list.txt', 'w') as f:
...     for i in os.listdir('.'):
...         print(i, file=f)
...
>>> with open('list.txt', 'r') as f:
...     files = list(f)
...
Currently, the default encoding for open() cannot represent all filenames that may be returned from listdir(). This may affect makefiles and configuration files that contain paths. Currently they will work correctly for paths that can be represented in the machine's active code page (though it should be noted that the *A APIs may be changed in a process by user/machine configuration to use the OEM code page rather than the active code page, which would potentially lead to encoding issues even for CP_ACP compatible names).
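For comparison, code that does not want to depend on the default can already pass an explicit encoding. The sketch below uses a made-up list of names rather than a real directory listing, and a temporary directory so it does not touch the current one. Names containing lone surrogates would additionally need an error handler such as surrogatepass.

```python
import os
import tempfile

# Example names; the second cannot be encoded in most legacy code pages.
names = ["plain.txt", "test\uab00.txt"]

with tempfile.TemporaryDirectory() as tmp:
    listing = os.path.join(tmp, "list.txt")

    # An explicit encoding avoids depending on locale.getpreferredencoding().
    with open(listing, "w", encoding="utf-8") as f:
        for name in names:
            print(name, file=f)

    with open(listing, "r", encoding="utf-8") as f:
        read_back = [line.rstrip("\n") for line in f]

print(read_back == names)
```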
Possibly resolving both issues simultaneously is worth waiting for two more releases? I'm not convinced the change to getfilesystemencoding() needs to wait for getpreferredencoding() to also change, or that they necessarily need to match, but it would not be hugely surprising to see the changes bundled together.
I'll also note that there has been limited discussion about changing getpreferredencoding() so far, though there have been a number of "+1" votes alongside some "+1 with significant concerns" votes. Changing the default encoding of the contents of data files is pretty scary, so I'm not in any rush to force it in. On the other hand, changing the encoding for paths without changing the default encoding for text files may break "bytes in, bytes through, bytes out" for some files (especially makefiles and .ini files). Arguably this idea was already deprecated with Python 3's bytes/text separation anyway.
Acknowledgements
Thanks to Stephen Turnbull, Eryk Sun, Victor Stinner and Random832 for their significant contributions and willingness to engage, and to everyone else on python-ideas for contributing to the discussion.