Issue 767645: incorrect os.path.supports_unicode_filenames - Python tracker

Created on 2003-07-08 09:42 by jvr, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Messages (30)

msg16955 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2003-07-08 09:42

At least on OSX, unicode file names are pretty much fully supported, yet os.path.supports_unicode_filenames is False (it comes from posixpath.py, which hard codes it). What would be a proper way to detect unicode filename support for posix platforms?

msg16956 - (view)

Author: Brett Cannon (brett.cannon) * (Python committer)

Date: 2003-07-09 18:07

Logged In: YES user_id=357491

What happens if you try to create a file using Unicode names?
Could a test get the temp directory for the platform, write a file with Unicode in it, and then check for an error? Or if it always succeeds, write it, and then see if the results match?

In other words, does writing Unicode to an ASCII file system ever lead to a mangling of the name?
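The probe suggested above could look roughly like the following. This is a minimal sketch, not anything from the tracker: the helper name `unicode_name_round_trips` and the sample file name are hypothetical, and the result only says something about the temporary directory's file system under the current locale.

```python
import os
import tempfile


def unicode_name_round_trips(name="f\u00f2\u00f2b\u00e0r.txt"):
    """Create a file with a non-ASCII name and report how the name comes back.

    Hypothetical helper: returns "error" if the file cannot be created,
    "match" if os.listdir() reports exactly the name we asked for, and
    "mangled" if the name comes back in some other form.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name)
        try:
            with open(path, "w"):
                pass
        except (UnicodeEncodeError, OSError):
            return "error"
        listed = os.listdir(tmp)
        return "match" if listed == [name] else "mangled"


result = unicode_name_round_trips()
print(result)
```

A "mangled" result does not necessarily mean data loss; a file system that normalizes names would also produce it.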

msg16957 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2003-07-10 21:01

Logged In: YES user_id=21627

On POSIX platforms in general, detecting Unicode file name support is not possible. POSIX uses open(2), and only open(2) (along with creat(2), stat(2), etc.) to access files. There is no open_w, or open_utf8, or the like. So file names are byte strings on POSIX, and it will stay that way forever. (There is actually also fopen, but that doesn't change the situation at all.)

On OSX, the situation is somewhat different from POSIX, as you have additional functions to open files (which Python apparently does not use, though), and because OSX specifies that the byte strings have to be NFD UTF-8 (which Python violates AFAICT).

The documentation for supports_unicode_filenames says

True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if \function{os.listdir()} returns Unicode strings for a Unicode argument.

While the first part is true for OSX, I don't think the second part is. If that ever gets corrected (or verified), no further detection is necessary - just set macpath.supports_unicode_filenames for darwin (assuming you use macpath.py on that system).

msg16958 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2003-07-10 21:05

Logged In: YES user_id=21627

Brett: As for "writing Unicode to an ASCII file system": there is no such thing. POSIX file systems accept arbitrary bytes, and don't interpret them except by looking at the path separator (in ASCII).

So you can put Latin-1, KOI8-r, EUC-JP, UTF-8, gb2312, etc all on a single file system, and people actually do that. The convention is that bytes in file names are interpreted according to the locale's encoding. This is just a convention, and it has some significant flaws. Python follows that convention, meaning that you can use arbitrary Unicode strings in open(), as long as they are supported in the locale's encoding.
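To illustrate the convention described above, here is a small sketch (Python 3 syntax, with encodings chosen only for illustration). It shows that one Unicode name maps to different byte strings depending on the encoding, and that reading those bytes back under a different encoding yields a different name.

```python
name = "f\u00f2\u00f2b\u00e0r"

# The same name becomes different byte strings under different encodings.
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")
print(latin1_bytes)
print(utf8_bytes)

# A POSIX file system stores only the bytes. Decoding them with the
# "wrong" locale encoding gives a different, mangled name.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))

# A name is usable only if the locale's encoding can represent it.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```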

msg16959 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2003-07-10 21:13

Logged In: YES user_id=92689

> On OSX, the situation is somewhat different from POSIX, as you have additional functions to open files (which Python apparently does not use, though), and because OSX specifies that the byte strings have to be NFD UTF-8 (which Python violates AFAICT).

(I'm not 100% sure, but I think the OS corrects that)

> True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if \function{os.listdir()} returns Unicode strings for a Unicode argument.

> While the first part is true for OSX, I don't think the second part is.

It is, we had a long discussion about that back when I implemented that ;-)

> If that ever gets corrected (or verified), no further detection is necessary - just set macpath.supports_unicode_filenames for darwin (assuming you use macpath.py on that system).

Darwin is a posix platform, so I'll have to add a switch to posixpath.py. Unless you object to that, I will do that.

msg16960 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2003-07-10 22:34

Logged In: YES user_id=21627

> I'm not 100% sure, but I think the OS corrects that

I'm relatively sure that the OS doesn't. The OS won't complain if you pass a file name that isn't UTF-8 at all - Finder will then fail to display the file correctly. There are CoreFoundationsBasicServicesSomething functions that you are supposed to call to correct that; Python does not use them.

If you think setting the flag for darwin is fine in posixpath, just go ahead.

msg16961 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2003-07-11 07:48

Logged In: YES user_id=92689

Done in rev. 1.61 of posixpath.py.

(Actually, OSX does complain when you feed open() an invalid utf-8 string (albeit with a misleading error message). The OS also makes sure the name is converted to its preferred form, e.g. if I create a file named u'\xc7', I can also open it as u'C\u0327', and os.listdir() will always show the latter, no matter how you created the file.)
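The two spellings mentioned above are the composed and decomposed forms of the same character. A minimal sketch with the standard `unicodedata` module shows the relationship; the helper `same_name` is hypothetical, and plain NFD is only an approximation of whatever normalization the OS actually applies.

```python
import unicodedata

composed = "\u00c7"      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"   # "C" followed by COMBINING CEDILLA

# The two strings differ as code point sequences.
print(composed == decomposed)

# Normalizing converts one form into the other.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)


def same_name(a, b):
    """Hypothetical helper: compare two names after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(same_name(composed, decomposed))
```

Comparing names this way is useful when a listing may return a different form than the one used to create the file.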

msg16962 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2003-07-17 16:20

Logged In: YES user_id=92689

Reopening as the fix I checked in caused problems in test_pep277.py. Postponing work on this until after 2.3 is released.

msg16963 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2003-07-17 16:21

Logged In: YES user_id=92689

(forgot to mention: my checkin was backed out)

msg16964 - (view)

Author: Just van Rossum (jvr) * (Python triager)

Date: 2005-06-28 09:46

Logged In: YES user_id=92689

Hmm, two years later and this still hasn't been resolved. Is anyone interested in taking a stab at it? It would be nice if it could be fixed for 2.5.

(Btw. the only code using os.path.supports_unicode_filenames that I'm aware of is Jason Orendorff's path module.)

msg16965 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2005-06-28 21:04

Logged In: YES user_id=21627

I don't care about this issue, as I think supports_unicode_filenames is a pretty useless property these days. If somebody changes the current value from False to True, just make sure that the testsuite still passes.

msg97652 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2010-01-12 19:44

Maybe os.path.supports_unicode_filenames should be deprecated. The doc currently says: "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."

On Linux both the things work, even if the value of os.path.supports_unicode_filenames is still False:

>>> os.path.supports_unicode_filenames
False
>>> open(u'fòòbàr', 'w')
<open file u'f\xf2\xf2b\xe0r', mode 'w' at 0x9470778>
>>> os.listdir(u'.')
[u'f\xf2\xf2b\xe0r', ...]
>>> open(u'fòòbàr')
<open file u'f\xf2\xf2b\xe0r', mode 'r' at 0x9470778>

msg97655 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2010-01-12 20:16

In addition, whether or not true unicode filenames are supported really depends, at least on Linux, on the filesystem, not on the OS (for some definition of support). In other words, I think os.path.supports_unicode_filenames is an API design that is broken and should probably be dropped.

msg97658 - (view)

Author: Florent Xicluna (flox) * (Python committer)

Date: 2010-01-12 21:14

Additionally it filters out test_pep277 on some platforms.

But seemingly, it is not needed anymore with this patch.

msg97660 - (view)

Author: Joe Amenta (joe.amenta)

Date: 2010-01-12 21:35

If it is decided to keep supports_unicode_filenames, here is a patch for test_os.py that verifies the value of supports_unicode_filenames against the following line from the documentation: "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."

msg101132 - (view)

Author: Florent Xicluna (flox) * (Python committer)

Date: 2010-03-15 18:35

With r78594, test_pep277 is active on all platforms having Unicode-friendly filesystem encoding.

msg114252 - (view)

Author: Mark Lawrence (BreamoreBoy) *

Date: 2010-08-18 17:22

There are at least three messages stating that os.path.supports_unicode_filenames should go so can someone please provide a definitive statement regarding its future.

msg116064 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-11 00:10

test_pep277.patch removes the usage of os.path.supports_unicode_filenames from test_pep277: the test still passes on Debian Sid (Linux). Can someone test the patch on Mac OS X, FreeBSD and Solaris (and maybe other POSIX/UNIX OSes)?

About Windows: supports_unicode_filenames is False if sys.getwindowsversion().platform < 2: win32s (0) or Windows 9x/ME (1). I don't know win32s, but I know that Windows 9x/ME is no longer supported.

msg116065 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-11 00:15

Oops, forget test_pep277.patch: I misunderstood r81149 (new way to detect if the filesystem supports unicode or not). test_pep277 fails with my patch on Linux with LC_CTYPE=C.

msg116068 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-11 00:24

r84701 fixes supports_unicode_filenames's definition in Python 3.2 (and r84702 in Python 3.1): os.listdir(str) now always returns unicode filenames (including non-ascii characters).
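The behaviour described above can be checked with a short sketch in current Python 3. The bytes counterpart shown here is not mentioned in this message and is included only for contrast; the file name `plain.txt` is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the listing is not empty.
    with open(os.path.join(tmp, "plain.txt"), "w"):
        pass

    # A str argument yields str names.
    str_names = os.listdir(tmp)

    # A bytes argument yields bytes names.
    bytes_names = os.listdir(os.fsencode(tmp))

print(str_names)
print(bytes_names)
```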

msg116069 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-11 00:37

> Maybe os.path.supports_unicode_filenames should be deprecated. The doc currently says: "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."

> On Linux both the things work, even if the value of os.path.supports_unicode_filenames is still False: (...)

It depends on the locale encoding:

$ LC_CTYPE=C ./python
Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43)
>>> import sys; sys.getfilesystemencoding()
'ascii'
>>> open('\xe9', 'w').close()
...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

With utf-8, surrogates are forbidden. E.g.

$ ./python
Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43)
>>> import sys; sys.getfilesystemencoding()
'utf-8'
>>> open('\uDC00', 'w').close()
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

On Windows, Python uses the unicode API and so the unicode support doesn't depend on the locale encoding (on the ansi code page). Surrogates are accepted on Windows: '\uDC00' is a valid filename.
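The two POSIX failures above can be reproduced without touching the file system, since they come from encoding the name. A minimal sketch follows; the helper `can_encode` is hypothetical and passes the encoding explicitly instead of reading it from the locale.

```python
def can_encode(name, encoding):
    """Hypothetical helper: True if `name` is representable in `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# With an ASCII file system encoding, non-ASCII names are rejected.
print(can_encode("\xe9", "ascii"))

# With UTF-8, ordinary non-ASCII names are fine...
print(can_encode("\xe9", "utf-8"))

# ...but a lone surrogate still cannot be encoded.
print(can_encode("\udc00", "utf-8"))
```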

I think that supports_unicode_filenames is still useful to check if the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or characters (Mac OS X, Windows). Mac OS X is a special case because the C API uses char* (byte string), but the filesystem encoding is fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I would like to say that supports_unicode_filenames should be True on Mac OS X (which was the initial request).

msg116214 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2010-09-12 16:31

> About Windows: supports_unicode_filenames is False if sys.getwindowsversion().platform < 2: win32s (0) or Windows 9x/ME (1). I don't know win32s, but I know that Windows 9x/ME is no longer supported.

Win32s is long gone. It was an emulation layer to support Win32 on Windows 3.1.

msg116215 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2010-09-12 16:36

> I think that supports_unicode_filenames is still useful to check if the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or characters (Mac OS X, Windows). Mac OS X is a special case because the C API uses char* (byte string), but the filesystem encoding is fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I would like to say that supports_unicode_filenames should be True on Mac OS X (which was the initial request).

Sounds reasonable.

msg116347 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-13 20:26

r84784 sets os.path.supports_unicode_filenames to True on Mac OS X (macpath module).

About test_supports_unicode_filenames.patch. test_unicode_listdir() is wrong: os.listdir(str) always returns str (see r84701). "verify that the new file's name is equal to the name we tried" check of test_unicode_filename() is also wrong: newfile.name is always equal to fname, it doesn't depend on supports_unicode_filenames. Since the test is wrong, I don't want to commit it. test_pep277 is enough to test the creation of files with unicode names.

I don't see anything else to do now, so I close this issue. Reopen it if I forgot something, or open a new issue.

msg116348 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-13 20:32

I backported r84701 and r84784 to Python 2.7 (r84787).

msg116354 - (view)

Author: Ned Deily (ned.deily) * (Python committer)

Date: 2010-09-13 22:07

There seems to be some confusion about the macpath.py module. I'm not sure why it even exists in Python 3. Note it has to do with obsolete Classic MacOS-style paths (colon-separated paths) which are available on Mac OS X only through deprecated Carbon interfaces. I'm not even sure that those style paths do support unicode. More importantly, the underlying Carbon interfaces that macpath.py uses were removed for Python 3. AFAIK, virtually nothing on OS X uses these style paths anymore and, with the removal of all the old Mac Carbon support in Python 3, I don't think there is any Python module that can use these paths other than macpath. I think this module should be marked for deprecation and removed. There is no reason to modify it nor add a NEWS note, even for 2.7.

msg116366 - (view)

Author: Ned Deily (ned.deily) * (Python committer)

Date: 2010-09-14 05:18

(I've opened Issue9850 to document the brokenness of macpath and suggest its deprecation and removal.)

msg116386 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-14 11:47

> There seems to be some confusion about the macpath.py module. (...)

Oops. I thought that Mac OS X uses macpath, but in fact it is posixpath. Can you try my new patch posixpath_darwin.patch? I reopen the issue because I patched the wrong module. I suppose that Python 2.7 has the same issue: posixpath should be patched, not macpath.

My patch leaves macpath with supports_unicode_filenames=True. If I understood correctly: macpath should be removed (#9850).

msg116429 - (view)

Author: Ned Deily (ned.deily) * (Python committer)

Date: 2010-09-15 00:54

No problems noted with a quick test of posixpath_darwin.patch on 10.6 so looks good. It will get regression tested on more configurations sometime later.

msg116740 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2010-09-17 23:37

> No problems noted with a quick test of posixpath_darwin.patch on 10.6 so looks good.

Ok thanks. Fix committed to 3.2 (r84866) and 2.7 (r84868). I kept my patch on macpath (supports_unicode_filenames=True) because it is still valid (even if it is not used). Or is it wrong that Mac OS 9 speaks unicode?
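The patch itself is not shown in this thread. As a rough sketch of what the outcome amounts to, assuming the fix is a simple platform check for the POSIX path module, the hypothetical helper below models the expected value and prints what the running interpreter reports.

```python
import os.path
import posixpath
import sys


def expected_posix_flag(platform):
    """Hypothetical helper: the value the thread settles on for a POSIX
    platform name (True on Mac OS X, False on other POSIX systems)."""
    return platform == "darwin"


# What the running interpreter reports for its own platform.
print(sys.platform, os.path.supports_unicode_filenames)

# The flag as seen through the POSIX path module, alongside the model.
print(posixpath.supports_unicode_filenames, expected_posix_flag(sys.platform))
```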

History

Date

User

Action

Args

2022-04-10 16:09:51

admin

set

github: 38817

2010-09-17 23:37:39

vstinner

set

status: open -> closed
resolution: fixed
messages: +

2010-09-15 00:54:49

ned.deily

set

messages: +

2010-09-14 11:47:37

vstinner

set

status: closed -> open
files: + posixpath_darwin.patch
resolution: fixed -> (no value)
messages: +

2010-09-14 05:18:20

ned.deily

set

messages: +

2010-09-13 22:07:22

ned.deily

set

nosy: + ronaldoussoren, ned.deily
messages: +

2010-09-13 20:32:37

vstinner

set

messages: +

2010-09-13 20:26:59

vstinner

set

status: open -> closed
resolution: fixed
messages: +

2010-09-13 19:42:39

vstinner

set

files: - test_pep277.patch

2010-09-12 16:36:11

loewis

set

messages: +

2010-09-12 16:31:34

loewis

set

messages: +

2010-09-11 00:37:12

vstinner

set

messages: +

2010-09-11 00:24:47

vstinner

set

messages: +

2010-09-11 00:15:46

vstinner

set

messages: +

2010-09-11 00:10:36

vstinner

set

files: + test_pep277.patch

messages: +

2010-08-18 17:22:59

BreamoreBoy

set

nosy: + BreamoreBoy
messages: +

2010-07-31 23:21:27

eric.araujo

set

nosy: + vstinner

2010-03-15 18:35:38

flox

set

type: behavior
messages: +

2010-03-15 18:34:53

flox

set

files: - issue767645_test_pep277.py

2010-01-28 17:58:27

flox

set

nosy: loewis, jvr, ezio.melotti, r.david.murray, joe.amenta, flox
versions: + Python 3.1, Python 2.7, Python 3.2
components: + Tests
stage: patch review

2010-01-12 21:35:24

joe.amenta

set

files: + test_supports_unicode_filenames.patch

nosy: + joe.amenta
messages: +

keywords: + patch

2010-01-12 21:14:55

flox

set

files: + issue767645_test_pep277.py

nosy: + flox
messages: +

resolution: later -> (no value)

2010-01-12 20:16:44

r.david.murray

set

nosy: + r.david.murray
messages: +

2010-01-12 19:44:01

ezio.melotti

set

messages: +

2010-01-12 19:03:13

brett.cannon

set

nosy: - brett.cannon

2010-01-12 18:49:31

ezio.melotti

set

nosy: + ezio.melotti

2008-01-20 19:24:27

christian.heimes

set

priority: normal -> low
versions: + Python 2.6, - Python 2.3

2003-07-08 09:42:15

jvr

create