[Python-Dev] Unicode strings as filenames

Martin v. Loewis martin@v.loewis.de
Sun, 6 Jan 2002 20:44:45 +0100


That's the global, sure, but the code using it is scattered across fileobject.c and the posix module. I think it would be a good idea to put all this file naming code into some Python/fileapi.c file which then also provides C APIs for extensions to use. These APIs should then take the file name as PyObject* rather than char* to enable them to handle Unicode directly.

What do you gain by that? Most of the posixmodule functions that take filenames are direct wrappers around the system call. Another level of indirection is only useful if the fileapi.c functions are used in different places. Notice that each function (open, access, stat, etc.) is currently used exactly once, so putting this all into a single place just makes the code more complex.

The extension-module argument is a red herring: I don't think there are many extension modules out there which want to call access(2) but would like to do so with a PyObject* as the first argument and plain numbers as the other arguments.
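For illustration, the kind of layer under discussion can be sketched in Python terms. This is purely a sketch: `access_any` is a hypothetical name, not an API from this thread, and the UTF-8 filesystem encoding is an assumption:

```python
import os

def access_any(path, mode=os.F_OK):
    """Hypothetical helper in the spirit of the proposed fileapi layer:
    accept either a Unicode or a byte-string path, normalize it, and
    delegate to the thin wrapper around the access(2) system call."""
    if isinstance(path, str):
        path = path.encode("utf-8")  # assumed filesystem encoding
    return os.access(path, mode)

# Both spellings of the current directory reach the same system call.
assert access_any(".") is True
assert access_any(b".") is True
```

As the reply above argues, each such helper would be called from exactly one place, so the indirection buys little; the sketch only shows what the proposal's extra layer would look like.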

> Of course, if the system has an open function that expects wchar_t*,
> we might want to use that instead of going through a codec. Off hand,
> Win32 seems to be the only system where this might work, and even
> there, it won't work on Win95.

I expect this to become a standard in the next few years.

I doubt that. Posix people (including developers of various posixish systems) have frequently rejected that idea in recent years. Even on the most recent system in this respect (OS X), we hear that they still open files with a char*, where a char is a byte - the only advance is that there is a guarantee that those bytes are UTF-8.

It turns out that this is all you need: with that guarantee, there is no need for an additional set of APIs. UTF-8 was originally invented precisely to represent file names (it was called FSS-UTF, the "File System Safe" transformation format, at that time); it is likely that more systems will follow this convention. If so, a global per-system file system encoding is all that's needed.
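A small sketch of that point, in modern Python and assuming a POSIX system where byte paths are passed through to the kernel unchanged: with UTF-8 as the single agreed filesystem encoding, Unicode names round-trip through the ordinary byte-oriented calls with no additional API surface.

```python
import os
import shutil
import tempfile

name = "caf\u00e9.txt"          # a non-ASCII Unicode filename ('café.txt')
encoded = name.encode("utf-8")  # the bytes the kernel actually sees

d = tempfile.mkdtemp()
try:
    # Create the file through the byte-oriented interface.
    with open(os.path.join(d.encode("utf-8"), encoded), "w") as f:
        f.write("hello")
    # Listing with a bytes argument yields bytes entries; decoding with
    # the same global encoding recovers the original Unicode name.
    entries = [e.decode("utf-8") for e in os.listdir(d.encode("utf-8"))]
finally:
    shutil.rmtree(d)

assert entries == [name]
```

The one global convention (UTF-8) is doing all the work here; no second, wide-character set of calls is involved.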

The only problem is that on Windows, MS has already decided that the narrow-character APIs use the ANSI code page (CP_ACP), so they cannot change it to UTF-8 now; that's why Windows will need special casing if people are unhappy with the "mbcs" approach (which some apparently are).
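The trouble with a fixed ANSI code page can be seen in a plain codec experiment (cp1252 stands in here for a Windows ANSI code page, since the real "mbcs" codec exists only on Windows): the code-page bytes differ from the UTF-8 bytes for the same name, and names outside the page cannot be encoded at all.

```python
name = "caf\u00e9"  # 'café'

# The same Unicode name maps to different bytes under the two encodings,
# so a system committed to an ANSI code page cannot silently switch to UTF-8.
assert name.encode("utf-8") == b"caf\xc3\xa9"
assert name.encode("cp1252") == b"caf\xe9"

# A name outside the code page cannot be represented at all, which is
# why the "mbcs" approach makes some people unhappy.
lossy = False
try:
    "\u0442\u0435\u0441\u0442".encode("cp1252")  # Cyrillic 'тест'
except UnicodeEncodeError:
    lossy = True
assert lossy
```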

> Also, it is more difficult than threads: for threads, there is a fixed
> set of API features that need to be represented. Doing Py_UNICODE*
> opening alone is easy, but look at the number of posixmodule functions
> that all expect file names of some sort.

Doesn't that support the idea of having a small subsystem in Python which exposes the Unicode-aware APIs to Python and its extensions?

No. It is a lot of work, and an additional layer of indirection, with no apparent advantage. Feel free to write a PEP, though.

Regards, Martin