[Python-Dev] File system path encoding on Windows

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Mon Aug 22 05:47:06 EDT 2016


Nick Coghlan writes:

On 21 August 2016 at 06:31, Steve Dower <steve.dower at python.org> wrote:

My biggest concern is that it then falls onto users to know how to start Python with that flag.

The users I'm most worried about belong to organizations where concerted effort has been made to "purify" the environment so that they can use bytes-oriented code. That is, getfilesystemencoding() == getpreferredencoding() == what is actually used throughout the system. Such organizations will be able to choose the flag correctly, and implement it organization-wide, I'm pretty sure. I doubt that all will choose UTF-8 at this point in time, though I wish they would.
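A minimal sketch of the first two parts of that equality, assuming only the standard library. The helper names canonical() and encodings_agree() are made up for illustration. It cannot verify the third part, what is actually used throughout the system, and on Windows before 3.6 the file system encoding is reported as 'mbcs' (the ANSI code page), so a plain name comparison there is conservative.

```python
import codecs
import locale
import sys


def canonical(name):
    """Map an encoding name to Python's canonical codec name, if known."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to Python's codec registry: fall back to a lowercase compare.
        return name.lower()


def encodings_agree():
    """Return (agree, fs_encoding, preferred_encoding) for this process."""
    fs_encoding = sys.getfilesystemencoding()
    # False: report the preferred encoding without calling setlocale().
    preferred = locale.getpreferredencoding(False)
    agree = canonical(fs_encoding) == canonical(preferred)
    return agree, fs_encoding, preferred


if __name__ == "__main__":
    agree, fs_encoding, preferred = encodings_agree()
    print(f"filesystem={fs_encoding} preferred={preferred} agree={agree}")
```

A site that has standardized on one encoding would expect agree to be True on every machine; a False result is a hint to look further, not proof of a problem.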

Not necessarily, as this is one of the areas where commercial redistributors can earn their revenue stream - by deciding that flipping the default behaviour is the right thing to do for their user base (which is inevitably only a subset of the overall Python user base).

This assumes that the Python applications are the mission-critical ones for their clients. What if they're not? I think the commercial redistributors will have to make their decisions on a client-by-client basis, too. They may be in a better position to do so, but why buy trouble? They'll be quite conservative (unless they're basically monopoly IT supplier to a whole organization, but they'll still have to face potential problems from existing files on users' storage, and perhaps applications that they supply but don't "own").

I have real trouble seeing trying to force UTF-8 as a good idea until the organization mandates UTF-8. :-( This really is an organizational decision, to be implemented with client resources. We can't do it for them, redistributors can't do it for them. It needs to be an option in Python.

Python itself is already ready for UTF-8, except that on Windows getfilesystemencoding and getpreferredencoding can't honestly return 'utf-8', AIUI. I understand that that is exactly what Steve wants to change, but "honestly" is the rub. What happens if Python 3.6 is only part of a bytes-oriented system, receives a filename forced to UTF-8-encoded bytes, and passes that over a pipe or in shared memory or in a file to a non-Python-3.6 application that trusts the system defaults? "Boom!", no? Is there any experience anywhere in any implementation language with systems used on Windows that use this approach of pretending the Windows world is UTF-8? If not, why is it a good idea for Python to go first?
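A rough illustration of that "Boom!", assuming the receiving application decodes with cp1252 as a stand-in for "the system default". The function name as_seen_by_legacy_consumer() and the sample filename are hypothetical. Nothing here touches the real file system; it only shows what happens to the bytes.

```python
def as_seen_by_legacy_consumer(filename, legacy_encoding="cp1252"):
    """Encode a filename as UTF-8, then decode it the way a consumer
    that trusts a legacy system code page would."""
    utf8_bytes = filename.encode("utf-8")
    # errors="replace" keeps the sketch from raising on undecodable bytes;
    # a strict consumer could fail outright instead.
    return utf8_bytes.decode(legacy_encoding, errors="replace")


if __name__ == "__main__":
    original = "r\u00e9sum\u00e9.txt"  # "résumé.txt"
    seen = as_seen_by_legacy_consumer(original)
    print(original, "->", seen)
    # The consumer now holds a different name, so a lookup by that name
    # would not find the file the producer meant.
    print("names match:", seen == original)
```

Pure-ASCII names survive the trip unchanged, which is one reason this kind of mismatch can go unnoticed until a non-ASCII filename shows up.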

Making that possible doesn't mean redistributors will actually follow through, but it's an option worth keeping in mind, as while it does increase the ecosystem complexity in the near term (since default behaviour may vary based on how you obtained your Python runtime), in the longer term it can allow for better informed design decisions at the reference interpreter level. (For business process wonks, it's essentially like running through a deliberate divergence/convergence cycle at the level of the entire language ecosystem: http://theagilepirate.net/archives/1392 )

It's worse than "the entire language ecosystem" -- it's your whole business.[1] If the proposed change to getfilesystemencoding and file system APIs creates issues at all, it matters because files on disk, or other applications that receive bytes from Python, refer to filenames encoded in the preferred encoding != UTF-8. It's unlikely in the extreme that all such files are exclusively used by Python, which at best means individual users will need to manage encodings file by file. At worst, some of the filenames so encoded will be shared with applications that expect the preferred encoding, and then you've got a war on your hands.

On the other hand, having code opt-in or out of the new handling requires changing code (which is presumably not going to happen, or we wouldn't consider keeping the old behaviour and/or letting the user control it),

I don't understand why this argument doesn't cut both ways equally. If you believe that, you should also believe that the same people who won't change code to opt in also won't use a Python containing fix #1, and may not install it at all. Doesn't that matter?

I think you'll want to escalate this to a PEP as well

+1 for the reasons Nick gives. The conclusions of this discussion need a canonical URL.

Footnotes: [1] I'm assuming that readers are going to associate "language" <--> "Python". The blog post Nick refers to is about the whole business.


