Issue 3187: os.listdir can return byte strings (original) (raw)

Created on 2008-06-24 10:28 by HWJ, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (79)

msg68674 - (view)

Author: Helmut Jarausch (HWJ)

Date: 2008-06-24 10:28

The script below produces 1664 lines of output before it bails out with Traceback (most recent call last): File "WalkBug.py", line 5, in for Dir, SubDirs, Files in os.walk('/home/jarausch') : File "/usr/local/lib/python3.0/os.py", line 278, in walk for x in walk(path, topdown, onerror, followlinks): File "/usr/local/lib/python3.0/os.py", line 268, in walk if isdir(join(top, name)): File "/usr/local/lib/python3.0/posixpath.py", line 64, in join if b.startswith('/'): TypeError: expected an object with the buffer interface

========================= file WalkBug.py:

#!/usr/local/bin/python3.0

import os

for Dir, SubDirs, Files in os.walk('/home/jarausch') : print("processing {0:d} files in {1}".format(len(Files),Dir))

msg68679 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-06-24 11:45

Could you tell us what this 1665th line should be? Maybe the 1665th directory has something special (a filename with spaces or non-ascii chars...)

Can you try with an older version of python?

msg68684 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-06-24 13:02

It's failing because he's giving a string to bytes.startswith when it requires a byte string or such.

msg68685 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-06-24 13:12

"he's giving a string"... the user simply called os.walk, which accepts strings AFAIK.

We should discover what produced this bytestring. Does listdir() returns a mixed list of strings and bytes?

msg68686 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-06-24 13:21

It seems the conversion to unicode strings (PyUnicode vs PyBytes) was not complete in os.listdir. See the attached patch.

msg68688 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-06-24 13:37

The original problem seems to come from some Unix platform, but this patch only handles two cases:

on win32, when the argument is a bytestring.
on OS/2. And in both cases, the default (utf-8) conversion seems wrong. Something like cp1252 (the ANSI code page for Western Windows) would be more sensible.

In the posix part of the function, there is the comment (2003-03-04): /* fall back to the original byte string, as discussed in patch #683592 */ btw, I find the penultimate message of this other thread very pleasant, in the py3k context... I suppose the conclusions would not be the same today.

msg68689 - (view)

Author: Helmut Jarausch (HWJ)

Date: 2008-06-24 14:09

Could you tell us what this 1665th line should be? Maybe the 1665th directory has something special (a filename with >> spaces or non-ascii chars...)

Yes, the next directory contains a filename with an iso-latin1 but non- ascii character

Can you try with an older version of python? No problems - runs every night here

The patch (applied to SVN GMT 13:30) does NOT help.

msg70943 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-08-09 17:35

Hmm, I suppose that while the filename is latin1-encoded, Py_FileSystemDefaultEncoding is "utf-8" and therefore os.listdir fails decoding the filename and falls back on returning a byte string. It was acceptable in Python 2.x but is a very annoying problem in py3k now that unicode and bytes objects can't be mixed together anymore. I'm bumping this to critical, although there is probably no clean solution.

msg70953 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-08-10 00:17

Let's make this a release blocker for RCs.

msg71525 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-08-20 09:37

See #3616 for a consequence of this.

msg71612 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-21 08:20

If the filename can not be encoded correctly in the system charset, it's not really a problem. The goal is to be able to use open(), shutil.copyfile(), os.unlink(), etc. with the given filename.

orig = filename from the kernel (bytes) filename = filename from listdir() (str) dest = filename to the kernel (bytes)

The goal is to get orig == dest. In my program Hachoir, to workaround this problem I store the original filename (bytes) and convert it to unicode with characters replacements (eg. replace invalid byte sequence by "?"). So the bytes string is used for open(), unlink(), ... and the unicode string is displayed to stdout for the user.

IMHO, the best solution is to create such class:

class Filename: def init(self, orig): self.as_bytes = orig self.as_str = myformat(orig) def str(self): return self.as_str def bytes(self): return self.as_bytes

New problems: I guess that functions operating on filenames (os.path.*) will have to support this new type (Filename class).

msg71615 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-08-21 08:51

Selon STINNER Victor <report@bugs.python.org>:

IMHO, the best solution is to create such class:

class Filename: def init(self, orig): self.as_bytes = orig self.as_str = myformat(orig) def str(self): return self.as_str def bytes(self): return self.as_bytes

I agree that logically it's the right solution. It's also the most invasive. If that class is made a subclass of str, however, existing code shouldn't break more than it currently does.

msg71624 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-21 10:59

I wrote a Filename class. I tries different methods:

no parent class "class Filename: ..." -> I don't know how to make bytes(filename) works!? But it's the best option to avoid strange bugs (mix bytes/str, remember Python 2.x...)
str parent class "class Filename(str): ..." -> doesn't work because os functions uses the fake unicode filename before testing the bytes (real) filename
bytes parent class "class Filename(bytes): ..." -> that's the current implementation

The idea is to encode str -> bytes (and not bytes -> str because we want to avoid problems with such conversions). So I reimplemented most bytes methods: addr, raddr, contains, startswith, endswith and index. index method has no start/end arguments since the behaviour would be different than a real unicode string :-/

I added an example of fixed os.listdir(): create Filename() object if we get bytes. Should we always create Filename objects? I don't think so.

msg71629 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-08-21 12:55

bytes parent class "class Filename(bytes): ..." -> that's the current implementation

I don't think that makes sense (especially under Windows which has Unicode file APIs). os.listdir() and friends should really return str or str-like objects, not bytes-like objects with an additional str method.

str parent class "class Filename(str): ..." -> doesn't work because os functions uses the fake unicode filename before testing the bytes (real) filename

Well, of course, if we create a filename type, then all os functions must be adapted to accept it rather than assume str.

All this is highly speculative of course, and if we really follow this course (i.e. create a filename type) it should probably be postponed to 3.1: too many changes with far-reaching consequences.

msg71647 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-21 14:59

Le Thursday 21 August 2008 14:55:43 Antoine Pitrou, vous avez écrit :

bytes parent class "class Filename(bytes): ..." -> that's the current implementation

I don't think that makes sense (especially under Windows which has Unicode file APIs). os.listdir() and friends should really return str or str-like objects, not bytes-like objects with an additional str method.

In we use "class Filename(str): ...", we have to ensure that all operations takes care of the charset because the unicode version is invalid and not be used to access to the file system. Dummy example: Filename()+"/" should not return str but raise an error or create a new filename.

Well, of course, if we create a filename type, then all os functions must be adapted to accept it rather than assume str.

If Filename has no parent class but is convertible to bytes(), os functions requires no change and so we can fix it before final 3.0 ;-)

msg71648 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-08-21 15:09

If Filename has no parent class but is convertible to bytes(), os functions requires no change and so we can fix it before final 3.0 ;-)

This sounds highly optimistic.

Also, I think it's wrong to introduce a string-like class with implicit conversion both to bytes and to str, while we have taken all measures to make sure that bytes/str exchangeability doesn't exist any more in py3k.

msg71655 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-08-21 16:17

The proper work-around is for the app to pass bytes into os.listdir(); then it will return bytes. It would be nice if open() etc. accepted bytes (as well as strings of course), at least on Unix, but not absolutely necessary -- the app could also just know the right encoding.

I see two reasonable alternatives for what os.listdir() should return when the input is a string and one of the filenames can't be decoded: either omit it from the output list; or use errors='replace' in the encoding. Failing the entire os.listdir() call is not acceptable, and neither is returning a mixture of str and bytes instances.

msg71680 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-21 20:55

Le Thursday 21 August 2008 18:17:47 Guido van Rossum, vous avez écrit :

The proper work-around is for the app to pass bytes into os.listdir(); then it will return bytes.

In my case, I just would like to remove a directory with shutil.rmtree(). I don't know if it contains bytes or characters filenames :-)

It would be nice if open() etc. accepted bytes (as well as strings of course), at least on Unix, but not absolutely necessary -- the app could also just know the right encoding.

An invalid filename has no charset. It's just a "raw" byte string. So open(), unlink(), etc. have to accept byte string. Maybe not in the Python version with in low level (C version)?

I see two reasonable alternatives for what os.listdir() should return when the input is a string and one of the filenames can't be decoded: either omit it from the output list;

It's not a good option: rmtree() will fails because the directory in not empty :-/

or use errors='replace' in the encoding.

It will also fails because filenames will be invalid (valid unicode string but non existent file names :-/).

Failing the entire os.listdir() call is not acceptable, and neither is returning a mixture of str and bytes instances.

Ok, I have another suggestion:

by default, listdir() only returns str and raise an error (TypeError?) on invalid filename
add an optional argument (a callback), eg. "fallback_encoder", to catch such errors (similar to "onerror" from shutils.rmtree())

Example of new listdir implementation (pseudo-code):

charset = sys.getfilesystemcharset() dirobj = opendir(path) try: for bytesname in readdir(dirobj): try: name = str(bytesname, charset) exept UnicodeDecodeError: name = fallback_encoder(bytesname) yield name finally: closedir(dirobj)

The default fallback_encoder:

def fallback_encoder(name): raise

Keep raw bytes string:

def fallback_encoder(name): return name

Create my custom filename object:

class Filename: ...

def fallback_encoder(name): return Filename(name)

If a callback is overkill, we can just add an option, eg. "keep_invalid_filename=True", to ask listdir() to keep bytes string if the conversion to unicode fails.

In any case, open(), unlink(), etc. have to accept byte string to be accept to read, copy, remove invalid filenames. In a perfect world, all filenames would be valid UTF-8 strings, but in the real world (think to Matrix :-)), we have to support such strange cases...

msg71699 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-08-21 23:31

So shutil should be fixed to pass a bytes value to os.listdir(). But then os.remove() should be fixed to accept bytes as well. This is the crux I believe: on Unix at least, syscall wrappers should accept bytes for filenames. And this would then have to be extended to things like the functions in os.path, and we'd need bytes versions of os.sep and os.altsep... This sounds like a good project for 3.1.

I do not accept an os.listdir() that raises an error because one filename cannot be decoded. It sounds like using errors='replace' is also wrong -- so the only solution is for os.listdir() to skip files it cannot decode. While this doesn't help for rmtree(), it is better than errors='replace' for code that descends into the tree looking for files matching a pattern or other property. So I propose this as a patch for 3.0.

The callback variant is too complex; you could write it yourself by using os.listdir() with a bytes argument. This also applies to proposals like passing optional encoding and errors arguments to os.listdir().

msg71700 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-08-21 23:35

On Thu, Aug 21, 2008 at 6:31 PM, Guido van Rossum <report@bugs.python.org> wrote:

Guido van Rossum <guido@python.org> added the comment:

So shutil should be fixed to pass a bytes value to os.listdir(). But then os.remove() should be fixed to accept bytes as well. This is the crux I believe: on Unix at least, syscall wrappers should accept bytes for filenames. And this would then have to be extended to things like the functions in os.path, and we'd need bytes versions of os.sep and os.altsep... This sounds like a good project for 3.1.

I do not accept an os.listdir() that raises an error because one filename cannot be decoded. It sounds like using errors='replace' is also wrong -- so the only solution is for os.listdir() to skip files it cannot decode. While this doesn't help for rmtree(), it is better than errors='replace' for code that descends into the tree looking for files matching a pattern or other property. So I propose this as a patch for 3.0.

As much as this maybe the right idea, I don't like the idea of silently losing the contents of a directory. That's asking for difficult to discover bugs. Could Python emit a warning in this case?

The callback variant is too complex; you could write it yourself by using os.listdir() with a bytes argument. This also applies to proposals like passing optional encoding and errors arguments to os.listdir().

Python tracker <report@bugs.python.org> <http://bugs.python.org/issue3187>

msg71705 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-08-21 23:59

I do not accept an os.listdir() that raises an error because one filename cannot be decoded. It sounds like using errors='replace' is also wrong -- so the only solution is for os.listdir() to skip files it cannot decode. While this doesn't help for rmtree(), it is better than errors='replace' for code that descends into the tree looking for files matching a pattern or other property. So I propose this as a patch for 3.0.

As much as this maybe the right idea, I don't like the idea of silently losing the contents of a directory. That's asking for difficult to discover bugs.

Well, the other approaches also cause difficult to discover bugs (the original bug report here was an example :-).

Could Python emit a warning in this case?

This may be the best compromise yet. It would have to use the warnings module so that you could disable it.

msg71748 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-22 14:21

I implemented the "invalid filename" class feature:

by default, os.listdir() raise an error (UnicodeDecodeError) on invalid filename. The previous behaviour was to return bytes object instead of str.
if invalid_filename=True: create an InvalidFilename class instance

InvalidFilename is not a bytes string, it's not a str string, it's a new class. It has three attributes:

bytes: the real filename
charset: charset (type str)
str: fake filename (type str) used by str() method

My patch also fixes os.path.join() to accept InvalidFilename: if at last one argument is an InvalidFilename, use InvalidFilename.join() (class method).

os.listdir() and os.unlink() are patched to accept InvalidFilename. unlink always accept InvalidFilename whereas listdir() only produces InvalidFilename is os.listdir(path, invalid_filename=True) is used.

I added an optional argument "invalid_filename" to shutil.rmtree(), default value is True.

To sum up, visible changes:

os.listdir() raise an error on invalid filename instead of return a mixed list of str and bytes
shutil.rmtree() manipulate str and InvalidFilename instead of str and bytes

msg71749 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-08-22 14:53

I'm not interested in the InvalidFilename class; it's an API complification that might seem right for your situation but will hinder most other people. However I am interested in a patch that makes os.unlink() (and as many other functions as you can think of) accept bytes. You'll have to think what encoding to use on Windows though, since (AFAIK) the Windows filesystem APIs do use Unicode.

msg71751 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-22 15:19

@gvanrossum: os.unlink() and os.lstat() already accept byte filenames (but open() doesn't).

Ok, here is very small patch for posixpath.join() to accept bytes strings. This patch is enough to fix my initial problem (#3616).

msg71752 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-22 15:33

My last patch (posix_join_bytes.patch) is also enough to fix the initial reported problem: error in posixpath.join() called by os.walk(). I tried os.walk() on a directory with invalid filenames and invalid directory name and it works well.

So the last bug is open() which disallow opening a file with an invalid name. So here is another patch for that.

msg71756 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-22 16:53

To continue in the "accept bytes filenames" way, a new patch for fnmatch.filter(). Use sys.getfilesystemencoding() to convert the bytes. The patch contains a new unit test.

msg71757 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-08-22 16:54

Patch glob.glob() to accept directory with invalid filename (invalid in the filesystem charset): just ignore bytes => str conversion error.

msg71769 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-08-22 19:05

See http://codereview.appspot.com/3055 for a code review of Victor's latest patches.

msg71991 - (view)

Author: Dwayne Litzenberger (dlitz)

Date: 2008-08-26 18:15

I think Guido already understands this, but I haven't seen it stated very clearly here:

** Different systems use different "things" to identify files. **

On Linux/ext3, all filenames are octet strings (i.e. bytes), and only the following caveats apply:

a filename/pathname cannot contain the zero-octet (b"\x00").
a filename/pathname cannot be empty.
a filename cannot contain the slash (b"/"); In a pathname, the slash is used to separate filenames.
the filenames b"." and b".." have special meanings; They cannot be created, deleted, or renamed.

All filenames that meet these criteria are valid, and calling them "invalid" amounts to plugging one's ears and shouting "LA LA LA" while imagining Unicode having pre-dated Unix.

It is sometimes convenient to imagine filenames on Linux/ext3 as sequences of Unicode code points (where the encoding is specified by LC_CTYPE---it's not necessarily UTF-8), but other times (e.g. in backup tools that need to be robust in the face of mischievous users) it is an unnecessary abstraction that introduces bugs.

On Windows/NTFS, the situation is entirely different: Filenames are actually sequences of Unicode code points, and if you pretend they are octet strings, Windows will happily invent phantom filenames for you that will show up in the output of os.listdir(), but that will return "File not found" if you try to open them for reading (if you open them for writing, you risk clobbering other files that happens to have the same names).

To avoid bugs, it should be possible to work exclusively with filenames in the platform's native representation. It was possible in Python 2 (though you had to be very careful). Ideally, Python 3 would recognize and enforce the difference instead of trying to guess the translations; "Explicit is better than implicit" and all that.

msg72495 - (view)

Author: Baptiste Carvello (zegreek)

Date: 2008-09-04 12:03

If, as I understand, it is the application's job to call listdir with bytes or unicode depending on the platform, it might be useful to have a function in the os module telling whether the filesystem is bytes of unicode-native.

That way, the idiom for doing something to all files in a given directory would read:

directory="my_direcory" ... if not os.is_filesystem_unicode(): ... directory=directory.encode(sys.stdin.encoding) ... for name in os.listdir(directory): ... f=open(name) ... # do something ... f.close()

and it would work on all platforms, even if one of the filenames is not in the locale's normal encoding.

If this idiom is correct, it could also be useful to include it in the module documentation so that application authors know what they are supposed to do.

msg73362 - (view)

Author: Helmut Jarausch (HWJ)

Date: 2008-09-18 07:39

Hi, is this assumed to be fixed in 3.0rc1 ?

with SVN 66506 (3.0rc1+) for dirname, subdirs, files in os.walk(bytes(Top,'iso-8859-1')) :

still gives an error here:

for dirname, subdirs, files in os.walk(bytes(Top,'iso-8859-1')) :

File "/usr/local/lib/python3.0/os.py", line 268, in walk if isdir(join(top, name)): File "/usr/local/lib/python3.0/posixpath.py", line 64, in join if b.startswith('/'): TypeError: expected an object with the buffer interface

msg73534 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-09-21 21:44

Here's a potential patch for listdir. It emits a UnicodeWarning (or should that be a BytesWarning?) and skips the file when decoding fails. What would be the best way to test this?

msg73535 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-09-21 22:05

I did not test the patch, but I have some remarks about it:

%r does not seem to be handled by PyUnicode_FromFormat; %R maybe?
In this case, PyObject_Repr(v) is not necessary - and this will avoid a reference leak.
Does the warning warn multiple times? IIRC the default behaviour is to warn once.

msg73540 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-09-21 23:48

Here's two more patches. One is like the old one with Amaury's comments observed. The other simply notes if there were decoding problems and warns once at the end of the listdir call.

Making a warning happen more than once is tricky because it requires messing with the warnings filter. This of course takes away some of the user's control which is one of the main reasons for using the Python warning system in the first place.

(I almost wish we could write another listdir that returned the names it could decode and a list of those it couldn't.)

msg73678 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-09-23 23:05

Here's another patch. It simply propagates the UnicodeDecodeErrors. I like this because it avoids silent ignoring problem, and people can get bytes if they want by passing in a bytes path.

msg73680 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-09-24 00:01

Hmm... much of the os.path machinery (and os.walk) probably doesn't work with bytes, and neither do fnmatch.py and glob.py, I expect. Plus io.open() refuses bytes for the filename, even though _fileio accepts them. The latter should be fixed regardless, and one of the attachments here has a fix IIRC.

Gotta run, sorry.

msg73688 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-09-24 00:41

Guido compiled my patches here: http://codereview.appspot.com/3055

My patches allows bytes for fnmatch.filter(), glob.glob1(), os.path.join() and open().

msg73909 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-09-27 00:47

Ok. Here's another possibility. It adds another optional parameter to listdir. If False, bytes strings can be returned. Otherwise, the UnicodeDecodeError is reraised.

msg73910 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-09-27 01:15

On Fri, Sep 26, 2008 at 5:47 PM, Benjamin Peterson <report@bugs.python.org> wrote:

Ok. Here's another possibility. It adds another optional parameter to listdir. If False, bytes strings can be returned. Otherwise, the UnicodeDecodeError is reraised.

I don't see the advantage over the existing rule bytes in -> bytes out...

msg73911 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-09-27 01:23

Does that mean that the right thing to do is raise decoding errors when unicode is given and fix the path modules so they can use bytes?

msg73925 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-09-27 14:50

getcwd() fails with "NOT FOUNT" (not found?) if the current directory filename can't be converted to unicode (str type). Here is a patch to fallback to bytes if creation of the unicode failed.

msg73926 - (view)

Author: Dwayne Litzenberger (dlitz)

Date: 2008-09-27 15:12

On Sat, Sep 27, 2008 at 01:15:46AM +0000, Guido van Rossum wrote:

I don't see the advantage over the existing rule bytes in -> bytes out...

Guido,

I figure I should say something since I have some experience in this area.

I wrote some automatic backup software in Python 2 earlier this year. It had to work on ext3/Linux (where filenames are natively octet-strings) and on NTFS/Win32 (where filenames are natively unicode-strings). I had to be ridiculously careful to always use unicode paths on Win32, and to always use str paths on Linux, because otherwise Python would do the conversion automatically---poorly.

It was particularly bad on Win32, where if you used os.listdir() with a non-unicode path (Python 2.x str object) in a directory that contained non-ascii filenames, Windows would invent filenames that looked similar but couldn't actually be found when using open(). So, naive (Python 2) code like this would break:

for filename in os.listdir("."):
    f = open(filename, "rb")
    # ...

On Linux, it was bad too, since if you used unicode paths, the filenames actually opened would depend on your LANG or LC_CTYPE or LC_ALL environment variables, and those could vary from one system to another, or even from one invocation of the program to another.

The simple fact of the matter is that pathnames on Linux are not Unicode, and pathnames on Windows are not octet strings. They're fundamentally incompatible types that can only be reconciled when you make assumptions (e.g. specifying a character encoding) that allow you to convert from one to the other.

Ideally, io.open(), os.listdir(), os.path.*, etc. would accept only pathnames in their native format, and it would be the job of a wrapper to provide a portable-but-less-robust interface on top of that. Perhaps the built-in functions would use the wrapper (with reasonable defaults), but the native-only interface should be there for module-writers who want robust pathname handling.

msg73992 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-09-28 21:31

I'd like to propose yet another approach: make sure that conversion according to the file system encoding always succeeds. If an unconvertable byte is detected, map it into some private-use character. To reduce the chance of conflict with other people's private-use characters, we can use some of the plane 15 private-use characters, e.g. map byte 0xPQ to U+F30PQ (in two-byte Unicode mode, this would result in a surrogate pair).

This would make all file names accessible to all text processing (including glob and friends); UI display would typically either report an encoding error, or arrange for some replacement glyph to be shown.

There are certain variations of the approach possible, in case there is objection to a specific detail.

msg73999 - (view)

Author: Dwayne Litzenberger (dlitz)

Date: 2008-09-29 00:54

Martin,

Consider this scenario. On ext3/Linux, assume that UTF-8 is specified in the system locale. What would happen if you have two files, named b"\xf3\xb3\x83\x80\x00" and b"\xc0\x00"? Under your proposal, the first file would decode successfully as "\U000f30c0\x00", and the second file would decode unsuccessfully, so it would be mapped to "\U000f30c0\x00"---the same thing!

Under your proposal, you could end up with multiple files having the same filename (from Python's perspective). Python shouldn't break if somebody deliberately created some weird filenames. Your proposal would make it impossible to write a robust remote backup tool in Python 3.

Pathnames on ext3/Linux are not Unicode. Blindly pretending they're Unicode is a leaky abstraction at best, and a security hole at worst.

msg74000 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-09-29 02:51

You can call it a leaky abstraction all you want, but most people think of filenames as text strings most of the time, and we need to somehow support this, at least for users who agree . I agree we also need to support bytes strings (at least on Unix) in order to support backup routines, and support for bytes in -> bytes out in os.listdir() is meant for this. The open() function should also support a pure bytes filename (and almost does so -- _fileio does, but io.py doesn't yet). os.getcwd() is a weird case and will probably need to be given a flag to make it return bytes (I don't like that style of API much, but the alternative is perhaps worse -- os.getcwd_bytes()).

Conclusion: I support patches that make the I/O library work with either bytes or strings. (It's OK if the bytes don't actually work on Windows, where the native type is apparently strings -- though it has a bytes API too, doesn't it?)

msg74006 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-09-29 04:45

Consider this scenario. On ext3/Linux, assume that UTF-8 is specified in the system locale. What would happen if you have two files, named b"\xf3\xb3\x83\x80\x00" and b"\xc0\x00"? Under your proposal, the first file would decode successfully as "\U000f30c0\x00", and the second file would decode unsuccessfully, so it would be mapped to "\U000f30c0\x00"---the same thing!

Correct.

Under your proposal, you could end up with multiple files having the same filename (from Python's perspective). Python shouldn't break if somebody deliberately created some weird filenames.

I'm not so sure about that. Practicality beats purity.

Your proposal would make it impossible to write a robust remote backup tool in Python 3.

There could be an option to set the file system encoding via an API to some known safe value, such as Latin-1, or ASCII. If you set the file system encoding to Latin-1, this escaping would never happen; if you set it to ASCII, it would happen uniformly for all non-ASCII bytes. The robust backup tool would have to know to set this option on POSIX systems.

Pathnames on ext3/Linux are not Unicode. Blindly pretending they're Unicode is a leaky abstraction at best, and a security hole at worst.

I think most Linux users would disagree, and claim that file names are indeed character strings (which is synonym to "being Unicode"). It is technically true that it's possible to create file names which are not text, but that's really a bug, not a feature - Unix and POSIX were never intended to work this way. Also, in the overwhelming majority of Python applications, consistent support for practically-existing systems matters more than robustness against malicious users.

msg74007 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-09-29 04:47

I agree we also need to support bytes strings (at least on Unix) in order to support backup routines

How about letting such applications set the file system encoding to Latin-1?

msg74008 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-09-29 05:11

James Knight points out that UTF-8b can be used to give unambiguous round-tripping of characters in a UTF-8 locale. So I would like to amend my previous proposal:

for a non-UTF-8 encoding, use private-use characters for roundtripping
if the locale's charset is UTF-8, use UTF-8b as the file system encoding.

msg74027 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-09-29 12:51

About os.getcwd(), another solution is merge_os_getcwd_getcwdu.patch: os.getcwd() always return unicode string and raise an error on unicode decode error. Wheras os.getcwd(bytes=True) always return bytes.

The old function os.getcwdu() is removed since os.getcwd() already return unicode string.

Note: current version of os.getcwd() uses the wrong encoding to conversion bytes to unicode: it uses PyUnicode_FromString() instead of PyUnicode_Decode(..., Py_FileSystemDefaultEncoding, "strict") (as does getcwdu()).

msg74032 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-09-29 15:58

As Steven Bethard proposed, here is a new version of my getcwd() patch: instead of adding a keyword argument "bytes", I created a function getcwdb():

os.getcwd() -> unicode
os.getcwdb() -> bytes

In Python2 it was:

os.getcwd() -> str (bytes)
os.getcwdu() -> unicode

msg74059 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-09-30 01:08

Patch python3_bytes_filename.patch:

open() support bytes
listdir(unicode) -> only unicode, skip invalid filenames (as asked by Guido)
remove os.getcwdu()
create os.getcwdb() -> bytes
glob.glob() support bytes
fnmatch.filter() support bytes
posixpath.join() and posixpath.split() support bytes

Mixing bytes and str is invalid. Examples raising a TypeError:

posixpath.join(b'x', 'y')
fnmatch.filter([b'x', 'y'], '*')
fnmatch.filter([b'x', b'y'], '*')
glob.glob1('.', b'*')
glob.glob1(b'.', '*')

TODO:

review this patch :-)
support non-ASCII bytes in fnmatch.filter()
fix other functions, eg. posixpath.isabs() and fnmatch.fnmatchcase()
fix functions written in C: grep FileSystemDefaultEncoding
make sure that mixing bytes and str is rejected

msg74080 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-09-30 15:21

Here is a patch that solves the issue in a different way: it introduces sys.setfilesystemencoding. If applications invoke sys.setfilesystemencoding("iso-8859-1"), all file names can be successfully converted into a character string.

msg74083 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-09-30 17:04

On Tue, Sep 30, 2008 at 8:21 AM, Martin v. Löwis <report@bugs.python.org> wrote:

Martin v. Löwis <martin@v.loewis.de> added the comment: Here is a patch that solves the issue in a different way: it introduces sys.setfilesystemencoding. If applications invoke sys.setfilesystemencoding("iso-8859-1"), all file names can be successfully converted into a character string.

I'm not opposed to this going in as well, but I don't think it's the right approach, as it can cause severe cases of mojibake (which you have strongly opposed in the past). It's quite orthogonal to Victor's patch IMO.

msg74101 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-09-30 22:02

As I wrote, python3_bytes_filename.patch was just an initial support for bytes filename. So as asked by Guido, here is a new version of my patch.

Changes:

for all functions, support bytes as well as bytearray
os.readlink(unicode) -> unicode and raise an error if unicode conversion fails. Note: os.readlink(bytes)->bytes was already working.
many changes in posixpath to fix all functions: add many "if isinstance(...):" and repeat sep / curdir / parent / ... in bytes
current version of test_posixpath contains a duplicate to test_splitdrive() and test_normcase() calls normcase() twice which is wrong (fixed in my patch)
i used copy/paste + conversion to bytes to test posixpath with bytes arguments
i added some checks in posixpath tests to reject mixing bytes + str
fix quoting style
factorize pattern compilation in fnmatch
fnmatch.fnmatchcase() supports bytes
fix test_unicode_file: replace getcwdu() by getcwd(), and sometimes getcwd() by getcwdb()

Open issues:

pwd.getpwnam() and grp.getgrpnam() should accept bytes, and then expanduser() should use pwd with bytes. Now expanduser() supposes that an username is an ASCII string and the user directory can be converted using getfilesystemencoding()
expandvars() doesn't support non-ASCII variable value: that's new problem. os.environ key should be str or bytes? And the value: str or bytes? It str is choosen, what is the charset to convert str to bytes?

msg74173 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-10-02 16:43

Martin, can you check in your changes to add sys.setfilesystemencoding()?

I will check in Victor's changes (with some edits).

Together this means that the various suggested higher-level solutions (like returning path-like objects, or some kind of roudtripping almost-but-not-quite-utf-8 encoding) can be implemented in pure Python.

msg74186 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-10-02 18:57

I have committed a somewhat modified version of Victor's latest patch as r66743.

Assigning to MvL now for the sys.setfilesystemencoding() commit. Then you can hand it over to the doc folks; something derived from http://wiki.python.org/moin/Python3UnicodeDecodeError should probably be added to the Doc tree.

msg74192 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-10-02 19:35

Martin, can you check in your changes to add sys.setfilesystemencoding()?

Will do tomorrow.

msg74222 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-10-03 01:45

Here is a patch for Windows:

The failing tests on buildbots now pass (test_fnmatch test_posixpath test_unicode_file)

test_ntpath also runs functions with bytes.

I suppose macpath.py is broken as well.

msg74236 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-10-03 10:38

Le Friday 03 October 2008 03:45:44 Amaury Forgeot d'Arc, vous avez écrit :

Here is a patch for Windows: (...) test_ntpath also runs functions with bytes.

Which charset is used when you use bytes filename? I read somewhere that it's the "current codepage". How can the user get this codepage in Python? I ask this to complete my document: http://wiki.python.org/moin/Python3UnicodeDecodeError

Don't hesitate to edit directly the document, which may also be moved to Python3 Doc/ directory.

You should also support bytearray() in ntpath: isinstance(path, (bytes, bytearray))

The unit tests might use pure unicode on Windows and bytes on Linux, especially getcwd() vs getcwdb().

I don't have Windows nor Mac to test bytes filenames on these systems.

msg74237 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-10-03 10:43

Which charset is used when you use bytes filename?

It's the "ANSI" code page, which is a system-wide admin-modifiable indirection to some real code page (changing it requires a reboot). In the API, it's referred to as CP_ACP. It's also related to the "multi-byte" API, which has caused Mark Hammond to call the codec invoking it "mbcs" (IOW, "mbcs" is always the codec name for the file system encoding). The specific code page that CP_ACP denotes can be found with locale.getpreferredencoding(). Using that codec name (which might be e.g. "cp1252") is different from using "mbcs", as that goes through a regular (table-driven) Python codec. In particular, the Python codec will report errors, whereas the "mbcs" codec will find replacement characters.

msg74240 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-10-03 11:31

You should also support bytearray() in ntpath: isinstance(path, (bytes, bytearray))

The most generic way of allowing all bytes-alike objects is to write: path = bytes(path)

It raises a TypeError if path can't export a read-only buffer of contiguous bytes; also, it is a no-op if path is already a bytes object, so very cheap in the common case.

msg74241 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-10-03 11:43

The most generic way of allowing all bytes-alike objects is to write: path = bytes(path)

If you use that, any unicode may fails and the function will always return unicode. The goal is to get: func(bytes)->bytes func(bytearray)->bytes (or maybe bytearray, it doesn't matter) func(unicode)->unicode

msg74242 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2008-10-03 11:52

Le vendredi 03 octobre 2008 à 11:43 +0000, STINNER Victor a écrit :

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

The most generic way of allowing all bytes-alike objects is to write: path = bytes(path)

If you use that, any unicode may fails and the function will always return unicode. The goal is to get: func(bytes)->bytes func(bytearray)->bytes (or maybe bytearray, it doesn't matter) func(unicode)->unicode

Then make it:

path = path if isinstance(path, str) else bytes(path)

msg74246 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-10-03 12:23

path=path is useless most of the code (unicode path), this code is faster if both cases (bytes or unicode)! if not isinstance(path, str): path = bytes(path)

a if b else c: unicode=0.756730079651; bytes=1.93071103096
if test: path=...: unicode=0.681571006775; bytes=1.88843798637

msg74255 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-10-03 16:11

I've committed sys.setfilesystemencoding as r66769.

Declaring it as a documentation issue now. Not sure whether it should remain a release blocker; IMO, the documentation can still be produced after the release.

msg74256 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-10-03 16:34

Reducing priority to critical, it's just docs and tweaks from here.

You should also support bytearray() in ntpath:

isinstance(path, (bytes, bytearray))

No, you shouldn't. I changed my mind on this several times and in the end figured it's good enough to just support bytes and str instances.

Amaury: I've reviewed your patch and ran test_ntpath.py on a Linux box. I get this traceback:

====================================================================== ERROR: test_relpath (main.TestNtpath)

Traceback (most recent call last): File "Lib/test/test_ntpath.py", line 188, in test_relpath tester('ntpath.relpath("a")', 'a') File "Lib/test/test_ntpath.py", line 22, in tester gotResult = eval(fn) File "", line 1, in File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 530, in relpath start_list = abspath(start).split(sep) File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 499, in abspath path = join(os.getcwd(), path) File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 137, in join if b[:1] in seps: TypeError: 'in ' requires string as left operand, not bytes

The fix is to change the fallback abspath to this code:

def abspath(path):
    """Return the absolute version of a path."""
    if not isabs(path):
        if isinstance(path, bytes):
            cwd = os.getcwdb()
        else:
            cwd = os.getcwd()
        path = join(cwd, path)
    return normpath(path)

Once you fix that please check it in!

msg74257 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-10-03 16:35

Assigning to Amaury for Windows fix first.

msg74266 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-10-03 18:40

Thanks for testing the non-Windows part of ntpath. Committed patch in r66777.

Leaving the issue open: macpath.py should certainly be modified.

msg74267 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-10-03 18:51

Sorry Amaury, but there's another issue.

test_ntpath now fails when run with -bb:

====================================================================== ERROR: test_expandvars (main.TestNtpath)

Traceback (most recent call last): File "Lib/test/test_ntpath.py", line 151, in test_expandvars tester('ntpath.expandvars("$foo bar")', "bar bar") File "Lib/test/test_ntpath.py", line 10, in tester gotResult = eval(fn) File "", line 1, in File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 344, in expandvars if c in (''', b'''): # no expansion within single quotes BytesWarning: Comparison between bytes and string

====================================================================== ERROR: test_normpath (main.TestNtpath)

Traceback (most recent call last): File "Lib/test/test_ntpath.py", line 120, in test_normpath tester("ntpath.normpath('A//////././//.//B')", r'A\B') File "Lib/test/test_ntpath.py", line 10, in tester gotResult = eval(fn) File "", line 1, in File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 465, in normpath if comps[i] in ('.', '', b'.', b''): BytesWarning: Comparison between bytes and string

====================================================================== ERROR: test_relpath (main.TestNtpath)

Traceback (most recent call last): File "Lib/test/test_ntpath.py", line 188, in test_relpath tester('ntpath.relpath("a")', 'a') File "Lib/test/test_ntpath.py", line 10, in tester gotResult = eval(fn) File "", line 1, in File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 534, in relpath start_list = abspath(start).split(sep) File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 504, in abspath return normpath(path) File "/usr/local/google/home/guido/python/py3k/Lib/ntpath.py", line 465, in normpath if comps[i] in ('.', '', b'.', b''): BytesWarning: Comparison between bytes and string

msg74268 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-10-03 18:54

FWIW, I don't see a need to change macpath.py -- it's only used for MacOS 9 and the occasional legacy app. OSX uses posixpath.py.

msg74270 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-10-03 20:35

Committed r66779: test_ntpath now passes with the -bb option.

It seems that the Windows buildbots do not set -bb.

msg74271 - (view)

Author: Guido van Rossum (gvanrossum) * (Python committer)

Date: 2008-10-03 20:55

Thanks Amaury!

On to Georg for doc tweaks. Summary:

all the os.path functions now work on bytes as well, on all platforms
only on Unix (but not OSX) do we recommend using bytes
os.getcwdu() no longer exists
os.getcwdb() returns bytes
os.listdir() skips undecodable entries (previously it returned a mixture of str and bytes instances)
open() accepts bytes as filename

Stuff that didn't change but that you might want to mention:

all the syscalls in os support bytes args; readlink() and listdir() return bytes if the arg is bytes
getcwd() may raise UnicodeDecodeError

Martin already documented sys.setfilesystemencoding().

msg74275 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-10-03 21:21

I have a patch for macpath.py nonetheless. Tested on Windows (of course ;-) but all functions are pure text manipulation, except realpath(). It was much easier than ntpath.py.

I also added tests for three functions which were not exercised at all.

msg74276 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-10-03 21:50

Amaury, you're patch looks good.

msg74277 - (view)

Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer)

Date: 2008-10-03 21:57

Committed macpath.py in r66781.

msg74409 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-10-06 23:03

Would it possible to close this issue since os.listdir() is fixed and many other related functions (posix, posixpath, ntpath, macpath, etc.) are also fixed? I propose to open new issues for new bugs since this issue becomes a little big long :)

Eg. see new issues #4035 and #4036!

msg74412 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-10-06 23:13

Would it possible to close this issue since os.listdir() is fixed and many other related functions (posix, posixpath, ntpath, macpath, etc.) are also fixed?

IIUC, these fixes are still not complete: they lack documentation changes. Of course, it would have been better if the original patches already contained the necessary documentation and test suite changes. See for what Guido considers the lacking documentation; you may find that other aspects also need documentation.

As for test cases: it seems that those got waived, in the hurry.

msg74414 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2008-10-06 23:34

Le Tuesday 07 October 2008 01:13:22 Martin v. Löwis, vous avez écrit :

IIUC, these fixes are still not complete: they lack documentation changes. (...) Of course, it would have been better if the original patches already contained the necessary documentation and test suite changes.

Most (or all) patches include new tests about bytes. Here is a patch for os.rst documentation about listdir(), getcwdb() and readlink().

See for what Guido considers the lacking documentation; you may find that other aspects also need documentation.

I wrote a long document about bytes for filenames but not only. I'm still waiting for some contributors or reviewers: http://wiki.python.org/moin/Python3UnicodeDecodeError

As for test cases: it seems that those got waived, in the hurry.

Can you be more precise? Which tests have to be improved/rewritten?

msg74426 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-10-07 07:12

Most (or all) patches include new tests about bytes. Here is a patch for os.rst documentation about listdir(), getcwdb() and readlink().

Thanks! Committed as r66829.

I've added additional documentation in r66830, which should complete Guido's list of things to be documented. So the issue can be closed now.

See for what Guido considers the lacking documentation; you may find that other aspects also need documentation.

I wrote a long document about bytes for filenames but not only. I'm still waiting for some contributors or reviewers: http://wiki.python.org/moin/Python3UnicodeDecodeError

We should discuss that on python-dev, of course - the question is whether additional documentation patches are needed in response to this specific change.

As for test cases: it seems that those got waived, in the hurry.

Can you be more precise? Which tests have to be improved/rewritten?

I was probably looking at the wrong patches (such as getcwd_bytes.patch, merge_os_getcwd_getcwdu.patch, etc); I now see that the final patch did have tests. I recommend that patches that get superseded by other patches are removed from the issue. The won't be deleted; it's still possible to navigate to them through the History at the bottom of the issue.