[Python-Dev] Unicode Imports

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Sat Sep 9 23:22:10 CEST 2006

Martin v. Löwis wrote:

David Hopwood wrote:

On Windows, file system pathnames can contain arbitrary Unicode characters (well, almost). Despite the existence of "ANSI" filesystem APIs, and regardless of what 'sys.getfilesystemencoding()' returns, the underlying file system encoding for NTFS and FAT filesystems is UTF-16LE.

Thus, either:

- the fact that sys.getfilesystemencoding() returns a non-Unicode encoding on Windows is a bug, or
- any program that relies on sys.getfilesystemencoding() being able to encode arbitrary Windows pathnames has a bug.

We need to decide which of these is the case.

There is a third option:

- the operating system has a bug

This behaviour is by design. If it is a bug, then it is a "won't ever fix -- no way, no how" bug, that Python must accommodate if it is to properly support Unicode on Windows.

It is actually this option that rules out the other two. sys.getfilesystemencoding() returns "mbcs" on Windows, which means CP_ACP. The file system encoding is an encoding that converts a file name into a byte string. Unfortunately, on Windows, there are file names which cannot be converted into a byte string in a standard manner. This is an operating system bug (or mis-design: they should have chosen UTF-8 as the byte encoding of file names instead of making it depend on the system locale, but they of course did so for backwards compatibility with Windows 3.1 and 9x).
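The limitation described here can be illustrated with a minimal sketch in current Python 3. It makes two assumptions: cp1252 stands in for the ANSI code page (CP_ACP) of a Western European system, because the "mbcs" codec exists only on Windows, and the file name is a hypothetical example. The helper name can_encode is also made up for illustration.

```python
# Hypothetical file name with characters outside the Western European code page.
name = "データ.py"

def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# cp1252 is an assumed stand-in for CP_ACP; the real code page depends on
# the system locale.
print(can_encode(name, "cp1252"))      # False
# UTF-16LE can represent every Unicode character in the name.
print(can_encode(name, "utf-16-le"))   # True
```

The same name is a legal file name as far as the file system is concerned; only the conversion to a locale-dependent byte string fails.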

Although UTF-8 was invented (in September 1992) technically before the release of the first version of NT supporting NTFS (NT 3.1 in July 1993), it had not been invented before the decision to use Unicode in NTFS, or in Windows NT's file APIs, had been made.

(I believe OS/2 HPFS had not supported Unicode, even though NTFS was otherwise almost identical to it.)

At that time, the decision to use Unicode at all was quite forward-looking; the final version of Unicode 1.0 had only been published in June 1992 (although it had been approved earlier; see <http://www.unicode.org/history/>).

UTF-8 was only officially added to the Unicode standard in an appendix of Unicode 2.0 (published July 1996), and only given essentially equal status to UTF-16 and UTF-32 in Unicode 3.0 (September 1999).

As a side note: every encoding in Python is a Unicode encoding; so there aren't any "non-Unicode encodings".

It was clear from context that I meant "encoding capable of representing all Unicode characters".

Programs that rely on sys.getfilesystemencoding() being able to represent arbitrary file names on Windows might have a bug; programs that rely on sys.getfilesystemencoding() being able to encode all elements of sys.path do not (at least not for Python 2.5 and earlier).

Elements of sys.path can be Unicode strings in Python 2.5, and should be pathnames supported by the underlying OS. Where is it documented that there is any further restriction on them? And why should there be any further restriction on them?
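To make the question concrete, here is a rough sketch in current Python 3 that reports which path entries a given byte encoding cannot represent. The function name unencodable_entries, the appended directory, and the choice of cp1252 as the code page are all assumptions for illustration. An explicit encoding name is used rather than the value of sys.getfilesystemencoding(), which differs between platforms and Python versions.

```python
import sys

def unencodable_entries(paths, encoding):
    """Return the string entries of paths that encoding cannot represent."""
    bad = []
    for entry in paths:
        if not isinstance(entry, str):
            continue  # skip anything that is not a text string
        try:
            entry.encode(encoding)
        except UnicodeEncodeError:
            bad.append(entry)
    return bad

# Copy sys.path and add a hypothetical directory with a Cyrillic name.
candidate_path = list(sys.path) + ["C:\\проекты\\lib"]

# cp1252 is an assumed code page; the Cyrillic entry cannot be encoded in it.
for entry in unencodable_entries(candidate_path, "cp1252"):
    print("not representable in cp1252:", entry)
```

A directory like the hypothetical one above is a valid pathname for the OS, which is exactly the case under discussion.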

-- David Hopwood <david.nospam.hopwood at blueyonder.co.uk>
