
Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0.


First, os.listdir("f:") returns a list of strings for those file names... but those Unicode strings are ill-formed.

You can't even print them without getting an error from Python. In fact, you can't print strings containing the proposed half-surrogate encodings either: in both cases, the output encoder rejects them with a UnicodeEncodeError. (If not even Python, with its generally lenient attitude, can print those things, other libraries will probably fail, too.)
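
Here is roughly what happens (a minimal sketch, not my actual session; the surrogate value is made up, and it assumes an output encoding that rejects lone surrogates, e.g. strict UTF-8):

    # a string containing a lone surrogate, as the PEP 383 encoding
    # would produce for an undecodable byte -- not valid Unicode
    s = "caf\udce9.txt"
    try:
        print(s)
    except UnicodeEncodeError as e:
        # the output encoder refuses it
        print("cannot print:", e)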


What about round tripping? If you take a malformed file name from an external device (say, because it was actually encoded in ISO8859-15 or an East Asian encoding) and write it to an NTFS directory, Windows seems to write a malformed UTF-16 file name. In essence, Windows doesn't really use Unicode; it just implements raw 16-bit character strings, just as UNIX historically implements raw 8-bit character strings.
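
In case anyone wants to try it, the experiment boils down to something like this (a sketch from memory, on Windows with Python 3; the file name is made up, and it assumes the string is passed through to the wide filesystem API unchanged):

    import os

    # a "file name" containing a lone surrogate -- i.e. malformed UTF-16
    name = "bad\udcff.txt"
    with open(name, "w") as f:
        f.write("test")
    # NTFS stores and returns the malformed name as-is
    print(name in os.listdir("."))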


Then I tried the same thing on my Ubuntu 9.04 machine. It turns out that, unlike Windows, Linux seems to be moving to consistent use of valid UTF-8. If you plug in an external device and nothing else is known about it, it gets mounted with the utf8 option, and the kernel actually seems to enforce UTF-8 encoding. I think this calls into question the rationale behind PEP 383, and we should first look into what the roadmap for UNIX/Linux and UTF-8 actually is. UNIX may have consistent Unicode support (via UTF-8) before Windows.
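
You can check this directly by asking for raw byte names and seeing whether they decode (a sketch; the mount point is hypothetical):

    import os

    # bytes in, bytes out: look at the raw names the kernel hands back
    for raw in os.listdir(b"/media/usbstick"):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            print("not valid UTF-8:", raw)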


As I was saying, I think PEP 383 needs a lot more thought and research...

Tom