[Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

Guido van Rossum guido at python.org
Tue Sep 30 15:59:42 CEST 2008


On Mon, Sep 29, 2008 at 11:22 PM, Georg Brandl <g.brandl at gmx.net> wrote:

No, that was not what I meant (although it is another possibility). As I wrote, Martin's proposal that I support here is using the modified UTF-8 codec that successfully roundtrips otherwise invalid UTF-8 data.

I thought that the "successful roundtripping" pretty much stopped as soon as the unicode data is exported to somewhere else -- doesn't it contain invalid surrogate sequences?
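A minimal sketch of both sides of this point, assuming the `surrogateescape` error handler that later Python 3 releases provide as a stand-in for the utf-8b idea (it is not the codec that existed at the time of this thread). The byte string `raw` is a made-up example name. Undecodable bytes become lone surrogates, which round-trip with the same handler but are rejected by a strict encoder:

```python
# Hypothetical file name containing a byte that is not valid UTF-8.
raw = b"dir\xffname"

# Each undecodable byte is mapped to a lone surrogate in U+DC80..U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encoder, as used when exporting the string elsewhere,
# refuses the lone surrogate.
try:
    name.encode("utf-8")
    exportable = True
except UnicodeEncodeError:
    exportable = False

print(repr(name), restored == raw, exportable)
```

The roundtrip only holds as long as both ends agree on the escape handler; any consumer that expects well-formed Unicode sees an error instead.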

In general, I'm very reluctant to use utf-8b given that it doesn't seem to be well documented as a standard anywhere. Providing some minimal APIs that can process raw-bytes filenames still makes more sense -- it is mostly analogous to our treatment of text files, where the underlying binary data is also accessible.
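As an illustration of what a raw-bytes filename API can look like, here is a short sketch based on how `os.listdir` and `open` behave in current Python 3 when given `bytes` paths. The file name is a made-up Latin-1 example, and the fallback to an ASCII name is an assumption for platforms that refuse names that are not valid UTF-8:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)  # directory path as bytes

    # Latin-1 encoded name, not valid UTF-8 (hypothetical example).
    wanted = b"caf\xe9.txt"
    try:
        with open(os.path.join(tmp_b, wanted), "wb") as f:
            f.write(b"data")
    except (OSError, ValueError):
        # Some platforms reject such names; fall back to plain ASCII.
        wanted = b"cafe.txt"
        with open(os.path.join(tmp_b, wanted), "wb") as f:
            f.write(b"data")

    # A bytes argument yields bytes names, with no decoding step.
    names = os.listdir(tmp_b)

    # The bytes name can be passed straight back to open().
    with open(os.path.join(tmp_b, names[0]), "rb") as f:
        content = f.read()

print(names, content)
```

No decoding happens anywhere in this path, so the question of how to represent invalid bytes as text never arises; the cost is that the caller has to work with `bytes` throughout.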

You seem to forget that (disregarding OSX here, since it already enforces UTF-8) the majority of file names on Posix systems will be encoded correctly.

Apparently under certain circumstances (external FS mounted) OSX can also have non-UTF-8 filenames.

[...]

With the filenames decoded by UTF-8, your files named têste, ô, dossié will be displayed and handled correctly. The others are invalid in the filesystem encoding UTF-8 and therefore would be represented by something like

u'dir\uXXffname' where XX is some private use Unicode namespace. It won't look pretty when printed, but then, what do other applications do? They e.g. display a question mark as you show above, which is not better in terms of readability. But it will work when given to a filename-handling function. Valid filenames can be compared to Unicode strings. A real-world example: OpenOffice can't open files with invalid bytes in their name. They are displayed in the "Open file" dialog, but trying to open them fails. This regularly drives me crazy. Let's not make Python fail this way too, or, even worse, not even display those filenames.

How can it regularly drive you crazy when "the majority of file names [...] will be encoded correctly" (as you assert above)?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


