Issue 767645: incorrect os.path.supports_unicode_filenames - Python tracker
Created on 2003-07-08 09:42 by jvr, last changed 2022-04-10 16:09 by admin. This issue is now closed.
Messages (30)
Author: Just van Rossum (jvr) *
Date: 2003-07-08 09:42
At least on OSX, unicode file names are pretty much fully supported, yet os.path.supports_unicode_filenames is False (it comes from posixpath.py, which hard codes it). What would be a proper way to detect unicode filename support for posix platforms?
Author: Brett Cannon (brett.cannon) *
Date: 2003-07-09 18:07
What happens if you try to create a file using Unicode names?
Could a test get the temp directory for the platform, write a file
with Unicode in it, and then check for an error? Or if it always
succeeds, write it, and then see if the results match?
In other words, does writing Unicode to an ASCII file system ever lead to a mangling of the name?
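The probe suggested above could be sketched roughly as follows. This is only an illustration written for Python 3; the helper name probe_unicode_filename and its four result labels are made up here and are not part of any patch attached to this issue.

```python
import os
import tempfile
import unicodedata


def probe_unicode_filename(name="f\u00f2\u00f2b\u00e0r"):
    """Hypothetical helper: create a file with the given name in a fresh
    temporary directory and report what the directory listing returns.

    "unsupported": the file could not be created at all
    "preserved":   the listing contains the name unchanged
    "normalized":  the listing contains a canonically equivalent name
    "mangled":     the listing contains something else
    """
    with tempfile.TemporaryDirectory() as tmp:
        try:
            with open(os.path.join(tmp, name), "w"):
                pass
        except (UnicodeError, OSError):
            return "unsupported"
        listed = os.listdir(tmp)
        if name in listed:
            return "preserved"
        nfc = unicodedata.normalize("NFC", name)
        if any(unicodedata.normalize("NFC", entry) == nfc for entry in listed):
            return "normalized"
        return "mangled"


print(probe_unicode_filename())
```

The result depends on the platform, the file system, and the locale, so a single run says nothing about other directories or mount points.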
Author: Martin v. Löwis (loewis) *
Date: 2003-07-10 21:01
On POSIX platforms in general, detecting Unicode file name support is not possible. POSIX uses open(2), and only open(2) (along with creat(2), stat(2), etc.) to access files. There is no open_w, or open_utf8, or the like. So file names are byte strings on POSIX, and it will stay that way forever. (There is actually also fopen, but that doesn't change the situation at all.)
On OSX, the situation is somewhat different from POSIX, as you have additional functions to open files (which Python apparently does not use, though), and because OSX specifies that the byte strings have to be NFD UTF-8 (which Python violates AFAICT).
The documentation for supports_unicode_filenames says
True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if \function{os.listdir()} returns Unicode strings for a Unicode argument.
While the first part is true for OSX, I don't think the second part is. If that ever gets corrected (or verified), no further detection is necessary - just set macpath.supports_unicode_filenames for darwin (assuming you use macpath.py on that system).
Author: Martin v. Löwis (loewis) *
Date: 2003-07-10 21:05
Brett: As for "writing Unicode to an ASCII file system": there is no such thing. POSIX file systems accept arbitrary bytes, and don't interpret them except by looking at the path separator (in ASCII).
So you can put Latin-1, KOI8-r, EUC-JP, UTF-8, gb2312, etc all on a single file system, and people actually do that. The convention is that bytes in file names are interpreted according to the locale's encoding. This is just a convention, and it has some significant flaws. Python follows that convention, meaning that you can use arbitrary Unicode strings in open(), as long as they are supported in the locale's encoding.
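To make the "file names are bytes" point concrete, here is a small Python 3 sketch (the thread itself predates these APIs). os.fsencode() and os.fsdecode() apply the file system encoding convention described above; the helper name list_both_ways is invented for this example.

```python
import os
import tempfile


def list_both_ways(raw_name=b"plain-bytes-name"):
    """Hypothetical helper: create a file from a bytes name, then list the
    directory once with a bytes argument and once with a str argument."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_dir = os.fsencode(tmp)  # str path -> bytes path
        with open(os.path.join(raw_dir, raw_name), "wb"):
            pass
        # bytes argument gives bytes entries, str argument gives str entries
        return os.listdir(raw_dir), os.listdir(tmp)


byte_entries, text_entries = list_both_ways()
print(byte_entries)
print(text_entries)
```

The bytes listing is what the OS stores; the str listing is the same data after decoding with the file system encoding.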
Author: Just van Rossum (jvr) *
Date: 2003-07-10 21:13
> On OSX, the situation is somewhat different from POSIX, as you have additional functions to open files (which Python apparently does not use, though), and because OSX specifies that the byte strings have to be NFD UTF-8 (which Python violates AFAICT).
(I'm not 100% sure, but I think the OS corrects that)
> True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if \function{os.listdir()} returns Unicode strings for a Unicode argument.
> While the first part is true for OSX, I don't think the second part is.
It is, we had a long discussion about that back when I implemented that ;-)
> If that ever gets corrected (or verified), no further detection is necessary - just set macpath.supports_unicode_filenames for darwin (assuming you use macpath.py on that system).
Darwin is a posix platform, so I'll have to add a switch to posixpath.py. Unless you object to that, I will do that.
Author: Martin v. Löwis (loewis) *
Date: 2003-07-10 22:34
> I'm not 100% sure, but I think the OS corrects that
I'm relatively sure that the OS doesn't. The OS won't complain if you pass a file name that isn't UTF-8 at all - Finder will then fail to display the file correctly. There are CoreFoundationsBasicServicesSomething functions that you are supposed to call to correct that; Python does not use them.
If you think setting the flag for darwin is fine in posixpath, just go ahead.
Author: Just van Rossum (jvr) *
Date: 2003-07-11 07:48
Done in rev. 1.61 of posixpath.py.
(Actually, OSX does complain when you feed open() a non-valid utf-8 string (albeit with a misleading error message). The OS also makes sure the name is converted to its preferred form, eg. if I create a file named u'\xc7', I can also open it as u'C\u0327', and os.listdir() will always show the latter, no matter how you created the file.)
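The two spellings mentioned above are the composed and decomposed forms of the same character, which can be checked with the standard unicodedata module. The helper same_filename below is a hypothetical illustration (Python 3 syntax), not something Python or the OS provides.

```python
import unicodedata

composed = "\u00c7"     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"  # "C" followed by COMBINING CEDILLA

# The strings differ code point by code point...
assert composed != decomposed
# ...but each is the other's canonical (de)composition.
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed


def same_filename(a, b):
    """Hypothetical helper: compare two names up to canonical equivalence."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_filename(composed, decomposed))
```

A program that compares names from os.listdir() against names it created itself may need a comparison like this on file systems that normalize.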
Author: Just van Rossum (jvr) *
Date: 2003-07-17 16:20
Reopening as the fix I checked in caused problems in test_pep277.py. Postpone work on this until after 2.3 is released.
Author: Just van Rossum (jvr) *
Date: 2003-07-17 16:21
(forgot to mention: my checkin was backed out)
Author: Just van Rossum (jvr) *
Date: 2005-06-28 09:46
Hmm, two years later and this still hasn't been resolved. Is anyone interested to take a stab at it? It would be nice if it could be fixed for 2.5.
(Btw. the only code using os.path.supports_unicode_filenames that I'm aware of is Jason Orendorff's path module.)
Author: Martin v. Löwis (loewis) *
Date: 2005-06-28 21:04
I don't care about this issue, as I think supports_unicode_filenames is a pretty useless property these days. If somebody changes the current value from False to True, just make sure that the testsuite still passes.
Author: Ezio Melotti (ezio.melotti) *
Date: 2010-01-12 19:44
Maybe os.path.supports_unicode_filenames should be deprecated. The doc currently says: "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."
On Linux both the things work, even if the value of os.path.supports_unicode_filenames is still False:
>>> os.path.supports_unicode_filenames
False
>>> open(u'fòòbàr', 'w')
<open file u'f\xf2\xf2b\xe0r', mode 'w' at 0x9470778>
>>> os.listdir(u'.')
[u'f\xf2\xf2b\xe0r', ...]
>>> open(u'fòòbàr')
<open file u'f\xf2\xf2b\xe0r', mode 'r' at 0x9470778>
Author: R. David Murray (r.david.murray) *
Date: 2010-01-12 20:16
In addition, whether or not true unicode filenames are supported really depends, at least on Linux, on the filesystem, not on the OS (for some definition of support). In other words, I think os.path.supports_unicode_filenames is an API design that is broken and should probably be dropped.
Author: Florent Xicluna (flox) *
Date: 2010-01-12 21:14
Additionally it filters out test_pep277 on some platforms.
But seemingly, it is not needed anymore with this patch.
Author: Joe Amenta (joe.amenta)
Date: 2010-01-12 21:35
If it is decided to keep supports_unicode_filenames, here is a patch for test_os.py that verifies the value of supports_unicode_filenames against the following line from the documentation: "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."
Author: Florent Xicluna (flox) *
Date: 2010-03-15 18:35
With r78594, test_pep277 is active on all platforms having Unicode-friendly filesystem encoding.
Author: Mark Lawrence (BreamoreBoy) *
Date: 2010-08-18 17:22
There are at least three messages stating that os.path.supports_unicode_filenames should go so can someone please provide a definitive statement regarding its future.
Author: STINNER Victor (vstinner) *
Date: 2010-09-11 00:10
test_pep277.patch removes the usage of os.path.supports_unicode_filenames from test_pep277: the test still passes on Debian Sid (Linux). Can someone test the patch on Mac OS X, FreeBSD and Solaris (and maybe other POSIX/UNIX OSes)?
About Windows: supports_unicode_filenames is False if sys.getwindowsversion().platform < 2: win32s (0) or Windows 9x/ME (1). I don't know win32s, but I know that Windows 9x/ME is no longer supported.
Author: STINNER Victor (vstinner) *
Date: 2010-09-11 00:15
Oops, forget test_pep277.patch: I misunderstood r81149 (new way to detect if the filesystem supports unicode or not). test_pep277 fails with my patch on Linux with LC_CTYPE=C.
Author: STINNER Victor (vstinner) *
Date: 2010-09-11 00:24
r84701 fixes supports_unicode_filenames's definition in Python 3.2 (and r84702 in Python 3.1): os.listdir(str) now always returns unicode filenames (including non-ascii characters).
Author: STINNER Victor (vstinner) *
Date: 2010-09-11 00:37
> Maybe os.path.supports_unicode_filenames should be deprecated. The doc currently says: "True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system), and if os.listdir() returns Unicode strings for a Unicode argument."
> On Linux both the things work, even if the value of os.path.supports_unicode_filenames is still False: (...)
It depends on the locale encoding:
$ LC_CTYPE=C ./python
Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43)
>>> import sys; sys.getfilesystemencoding()
'ascii'
>>> open('\xe9', 'w').close()
...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
With utf-8, surrogates are forbidden. Eg.
$ ./python
Python 3.2a2+ (py3k, Sep 11 2010, 01:48:43)
>>> import sys; sys.getfilesystemencoding()
'utf-8'
>>> open('\uDC00', 'w').close()
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
On Windows, Python uses the unicode API and so the unicode support doesn't depend on the locale encoding (on the ansi code page). Surrogates are accepted on Windows: '\uDC00' is a valid filename.
I think that supports_unicode_filenames is still useful to check if the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or characters (Mac OS X, Windows). Mac OS X is a special case because the C API uses char* (byte string), but the filesystem encoding is fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I would like to say that supports_unicode_filenames should be True on Mac OS X (which was the initial request).
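The two failures shown in the sessions above can be reproduced without touching the file system by encoding the name directly. This is a simplified sketch: the helper can_encode is made up for illustration and uses the strict error handler, whereas the encoding Python actually applies to file names may use a different error handler depending on version and platform.

```python
import sys


def can_encode(name, encoding):
    """Hypothetical helper: True if name encodes with the strict handler."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Same inputs as in the interactive sessions above.
print(can_encode("\xe9", "ascii"))    # locale encoding is ASCII
print(can_encode("\xe9", "utf-8"))    # locale encoding is UTF-8
print(can_encode("\udc00", "utf-8"))  # lone surrogate
print(sys.getfilesystemencoding())    # what this interpreter uses
```

Only the first three lines are deterministic; the last one depends on the platform and locale the interpreter was started with.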
Author: Martin v. Löwis (loewis) *
Date: 2010-09-12 16:31
> About Windows: supports_unicode_filenames is False if sys.getwindowsversion().platform < 2: win32s (0) or Windows 9x/ME (1). I don't know win32s, but I know that Windows 9x/ME is no longer supported.
Win32s is long gone. It was an emulation layer to support Win32 on Windows 3.1.
Author: Martin v. Löwis (loewis) *
Date: 2010-09-12 16:36
> I think that supports_unicode_filenames is still useful to check if the filesystem API uses bytes (Linux, FreeBSD, Solaris, ...) or characters (Mac OS X, Windows). Mac OS X is a special case because the C API uses char* (byte string), but the filesystem encoding is fixed to utf-8 and it doesn't accept invalid utf-8 filenames. So I would like to say that supports_unicode_filenames should be True on Mac OS X (which was the initial request).
Sounds reasonable.
Author: STINNER Victor (vstinner) *
Date: 2010-09-13 20:26
r84784 sets os.path.supports_unicode_filenames to True on Mac OS X (macpath module).
About test_supports_unicode_filenames.patch: test_unicode_listdir() is wrong, because os.listdir(str) always returns str (see r84701). The "verify that the new file's name is equal to the name we tried" check in test_unicode_filename() is also wrong: newfile.name is always equal to fname, regardless of supports_unicode_filenames. Since the test is wrong, I don't want to commit it. test_pep277 is enough to test the creation of files with unicode names.
I don't see anything else to do now, so I close this issue. Reopen it if I forgot something, or open a new issue.
Author: STINNER Victor (vstinner) *
Date: 2010-09-13 20:32
I backported r84701 and r84784 to Python 2.7 (r84787).
Author: Ned Deily (ned.deily) *
Date: 2010-09-13 22:07
There seems to be some confusion about the macpath.py module. I'm not sure why it even exists in Python 3. Note it has to do with obsolete Classic MacOS-style paths (colon-separated paths) which are available on Mac OS X only through deprecated Carbon interfaces. I'm not even sure that those style paths do support unicode. More importantly, the underlying Carbon interfaces that macpath.py uses were removed for Python 3. AFAIK, virtually nothing on OS X uses these style paths anymore and, with the removal of all the old Mac Carbon support in Python 3, I don't think there is any Python module that can use these paths other than macpath. I think this module should be marked for deprecation and removed. There is no reason to modify it nor add a NEWS note, even for 2.7.
Author: Ned Deily (ned.deily) *
Date: 2010-09-14 05:18
(I've opened Issue9850 to document the brokenness of macpath and suggest its deprecation and removal.)
Author: STINNER Victor (vstinner) *
Date: 2010-09-14 11:47
> There seems to be some confusion about the macpath.py module. (...)
Oops. I thought that Mac OS X uses macpath, but in fact it is posixpath. Can you try my new patch posixpath_darwin.patch? I reopen the issue because I patched the wrong module. I suppose that Python 2.7 has the same issue: posixpath should be patched, not macpath.
My patch leaves macpath with supports_unicode_filenames=True. If I understood correctly: macpath should be removed (#9850).
Author: Ned Deily (ned.deily) *
Date: 2010-09-15 00:54
No problems noted with a quick test of posixpath_darwin.patch on 10.6 so looks good. It will get regression tested on more configurations sometime later.
Author: STINNER Victor (vstinner) *
Date: 2010-09-17 23:37
> No problems noted with a quick test of posixpath_darwin.patch on 10.6 so looks good.
Ok thanks. Fix committed to 3.2 (r84866) and 2.7 (r84868). I kept my patch on macpath (supports_unicode_filenames=True) because it is still valid (even if it is not used). Or is it wrong that Mac OS 9 speaks unicode?
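For readers who want to see the outcome on their own interpreter, a minimal check might look like the following. The helper name describe_flag is invented here; the expectation in the comment reflects the fix described in this thread (True on Mac OS X via posixpath) and is not re-verified against any particular release.

```python
import os.path
import posixpath
import sys


def describe_flag():
    """Hypothetical helper: report the flag next to the platform name."""
    return {
        "platform": sys.platform,
        # Module used as os.path on this platform.
        "os.path": os.path.supports_unicode_filenames,
        # POSIX implementation; per this thread, expected True on darwin.
        "posixpath": posixpath.supports_unicode_filenames,
    }


print(describe_flag())
```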
History
Date | User | Action | Args
2022-04-10 16:09:51 | admin | set | github: 38817
2010-09-17 23:37:39 | vstinner | set | status: open -> closed; resolution: fixed; messages: +
2010-09-15 00:54:49 | ned.deily | set | messages: +
2010-09-14 11:47:37 | vstinner | set | status: closed -> open; files: + posixpath_darwin.patch; resolution: fixed -> (no value); messages: +
2010-09-14 05:18:20 | ned.deily | set | messages: +
2010-09-13 22:07:22 | ned.deily | set | nosy: + ronaldoussoren, ned.deily; messages: +
2010-09-13 20:32:37 | vstinner | set | messages: +
2010-09-13 20:26:59 | vstinner | set | status: open -> closed; resolution: fixed; messages: +
2010-09-13 19:42:39 | vstinner | set | files: - test_pep277.patch
2010-09-12 16:36:11 | loewis | set | messages: +
2010-09-12 16:31:34 | loewis | set | messages: +
2010-09-11 00:37:12 | vstinner | set | messages: +
2010-09-11 00:24:47 | vstinner | set | messages: +
2010-09-11 00:15:46 | vstinner | set | messages: +
2010-09-11 00:10:36 | vstinner | set | files: + test_pep277.patch; messages: +
2010-08-18 17:22:59 | BreamoreBoy | set | nosy: + BreamoreBoy; messages: +
2010-07-31 23:21:27 | eric.araujo | set | nosy: + vstinner
2010-03-15 18:35:38 | flox | set | type: behavior; messages: +
2010-03-15 18:34:53 | flox | set | files: - issue767645_test_pep277.py
2010-01-28 17:58:27 | flox | set | nosy: loewis, jvr, ezio.melotti, r.david.murray, joe.amenta, flox; versions: + Python 3.1, Python 2.7, Python 3.2; components: + Tests; stage: patch review
2010-01-12 21:35:24 | joe.amenta | set | files: + test_supports_unicode_filenames.patch; nosy: + joe.amenta; messages: +; keywords: + patch
2010-01-12 21:14:55 | flox | set | files: + issue767645_test_pep277.py; nosy: + flox; messages: +; resolution: later -> (no value)
2010-01-12 20:16:44 | r.david.murray | set | nosy: + r.david.murray; messages: +
2010-01-12 19:44:01 | ezio.melotti | set | messages: +
2010-01-12 19:03:13 | brett.cannon | set | nosy: - brett.cannon
2010-01-12 18:49:31 | ezio.melotti | set | nosy: + ezio.melotti
2008-01-20 19:24:27 | christian.heimes | set | priority: normal -> low; versions: + Python 2.6, - Python 2.3
2003-07-08 09:42:15 | jvr | create |