[Python-Dev] Unicode strings as filenames

Martin v. Loewis martin@v.loewis.de
Fri, 4 Jan 2002 00:34:25 +0100


>> What's the correct way to deal with filenames in a Unicode
>> environment? Consider this:
>>
>>     >>> import site
>>     >>> site.encoding
>>     'latin-1'

Martin> Setting site.encoding is certainly the wrong thing to do. How
Martin> can you know all users of your system use latin-1?

Why is setting site.encoding appropriate to your environment at the time you install Python wrong? I can't know that all users of my system (whatever the definition of "my system" is) will use latin-1. Somewhere along the way I have to make some assumptions, however.

Well, then accept the assumption that almost everybody will use an ASCII superset. That may still be wrong in the case of EBCDIC users, but those are rare on Unix.

However, on our typical Unix system, three different encodings are in use: ISO-8859-1 (for tradition), ISO-8859-15 (because it has the Euro), and UTF-8 (because it removes all the limitations). Notice that all of our users speak German, and we still could not set a meaningful site.encoding except for 'ascii'.
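To illustrate why no single site-wide value fits such a system, here is a minimal sketch in present-day Python 3 (which postdates this thread). The file name "Bücher" and the helper try_decode are hypothetical, chosen only for the example; the codec names are the standard ones.

```python
# One and the same file name, as users with different encodings would store it.
name = "B\u00fccher"                      # "Bücher"

as_latin1 = name.encode("iso-8859-1")    # one byte for the u-umlaut
as_latin9 = name.encode("iso-8859-15")   # identical bytes for this name
as_utf8 = name.encode("utf-8")           # two bytes for the u-umlaut


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# A Latin-1 name read under a UTF-8 assumption cannot be decoded at all.
latin1_read_as_utf8 = try_decode(as_latin1, "utf-8")

# A UTF-8 name read under a Latin-1 assumption decodes, but to the wrong text.
utf8_read_as_latin1 = try_decode(as_utf8, "iso-8859-1")

print(as_latin1, as_utf8, latin1_read_as_utf8, repr(utf8_read_as_latin1))
```

Whichever non-ASCII default is chosen, some user's existing file names either fail to decode or decode to garbage.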

On any given computer I assume the people who install Python will set site.encoding appropriate to their environment.

That is probably wrong. Most users will install precompiled packages, and thus site.py will have the value that the package held, which will be 'ascii' for most packages.

The example I used was latin-1 simply because the folks I'm working with are in Austria and they came up with the example. I assume the best default encoding for them is latin-1.

Well, latin-1 does not have a Euro sign, which may be more and more of a problem.
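The Euro limitation can be checked directly. This is a small sketch in present-day Python 3, not code from the thread; encode_or_none is a hypothetical helper name.

```python
EURO = "\u20ac"  # EURO SIGN


def encode_or_none(text, encoding):
    """Encode `text`, returning None when the target encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


results = {
    enc: encode_or_none(EURO, enc)
    for enc in ("iso-8859-1", "iso-8859-15", "utf-8")
}

for enc, raw in results.items():
    print(enc, raw)
```

ISO-8859-1 has no code point for the character, ISO-8859-15 stores it as the single byte 0xA4, and UTF-8 uses three bytes.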

The application writers themselves will have no problem restricting internal filenames to be ascii. I assume if users want to save files of their own, they will choose characters from the Unicode character set they use most frequently.

That is a meaningful assumption. However, it is one that you have to make in your application, not one that you should expect users to make in their Python installations.

The above setlocale call prints

'LC_CTYPE=en_US;LC_NUMERIC=en_US;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en;LC_NAME=en;LC_ADDRESS=en;LC_TELEPHONE=en;LC_MEASUREMENT=en;LC_IDENTIFICATION=en'

You may want to extend your system to support the same configuration that your users have, i.e. you might want to install an Austrian locale on your system and set LANG to de_AT. If your system also sets all the LC_* variables for you, I recommend unsetting them; setting LANG is enough. (To override all other LC_* variables, setting LC_ALL to de_AT should also work.)

I can't get to the machines in Austria right now to see how their locales are set, though I suspect they haven't fiddled their LC* environment, because they are having the problems I described.

Even if they set the environment variables, they'd still have the problem because your application doesn't call setlocale.
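As a rough sketch of what calling setlocale looks like from the application side, here it is in present-day Python 3 rather than the Python of this thread. The function name adopt_user_locale is hypothetical, and locale.getpreferredencoding was not part of the discussion; it is used here only to show what the environment reports once the locale is active.

```python
import locale


def adopt_user_locale():
    """Switch from the default "C" locale to the one named by LANG / LC_*.

    Returns the locale string reported by setlocale, or None if the
    environment names a locale that is not installed on this system.
    """
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None


active = adopt_user_locale()

# Ask for the user's preferred encoding without changing the locale again.
encoding = locale.getpreferredencoding(False)

print("active locale:", active)
print("preferred encoding:", encoding)
```

Without the setlocale call the process stays in the "C" locale regardless of how LANG is set. The output depends entirely on the machine it runs on.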

I do expect that they have set LANG to de_AT, or de_AT.ISO-8859-1.

Perhaps they also have this problem because they use Python 2.1 or earlier.

This suggests to me that the Python docs need some introductory material on this topic. It appears to me that the two people in the Python community who live and breathe this stuff are you, Martin, and Marc-André. For most of the rest of us, especially if we've never consciously written code for consumption outside an ascii environment, the whole thing just looks like a quagmire.

Well, I'd happily review any introductory material somebody else writes :-)

Regards, Martin