Issue 22862: os.walk fails on undecodable filenames

Created on 2014-11-13 13:16 by fhoech, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)
msg231110	Author: Florian Höch (fhoech) *	Date: 2014-11-13 13:16
If 'top' is a unicode directory name, os.listdir can still return non-unicode (byte string) filenames if they can't be decoded. This case is not handled in the Python 2.x standard library version of os.walk and will cause join(top, name) to fail on such filenames with a UnicodeDecodeError.
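For illustration, the failure can be approximated with the sketch below. It is not the os.walk source: the directory and file names are hypothetical, and the ASCII decode that Python 2 performs implicitly when joining a unicode string with a byte string is written out explicitly so the snippet also runs on Python 3.

```python
top = u"/data"          # hypothetical unicode directory name
name = b"caf\xe9.txt"   # hypothetical name returned undecoded (byte 0xE9 is not ASCII)

error = None
try:
    # Python 2 does this decode implicitly (default encoding: ASCII)
    # when a unicode string and a byte string are concatenated.
    joined = top + u"/" + name.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

print(type(error).__name__)
```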
msg231111	Author: STINNER Victor (vstinner) *	Date: 2014-11-13 13:23
What is your OS?
msg231112	Author: Florian Höch (fhoech) *	Date: 2014-11-13 13:30
This problem only affects Linux as far as I know (in my case I'm using Fedora 21 Beta).
msg231115	Author: STINNER Victor (vstinner) *	Date: 2014-11-13 14:40
Your problem has two solutions:
1) Upgrade to Python 3, which handles your use case correctly (thanks to PEP 383 and its surrogateescape error handler).
2) Only process filenames as bytes, and encode/decode manually (so you can decide how to handle undecodable filenames).
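A minimal sketch of the bytes-only approach, assuming UTF-8 as the expected filename encoding and a "replace" policy for undecodable names. The helper names decode_name and walk_bytes are hypothetical, and the policy is only one of several reasonable choices:

```python
import os


def decode_name(raw, encoding="utf-8"):
    """Decode one filename; undecodable bytes become U+FFFD (assumed policy)."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw.decode(encoding, "replace")


def walk_bytes(top, encoding="utf-8"):
    """Walk `top` as bytes, yielding (raw_path, display_name) for each file."""
    if not isinstance(top, bytes):
        top = top.encode(encoding)
    # With a bytes argument, os.walk yields bytes and never decodes names.
    for dirpath, dirnames, filenames in os.walk(top):
        for raw in filenames:
            yield os.path.join(dirpath, raw), decode_name(raw, encoding)
```

The raw bytes path stays usable for opening the file, while the decoded name is only meant for display or logging.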
msg231117	Author: Florian Höch (fhoech) *	Date: 2014-11-13 14:50
1) Is not yet possible for me unfortunately, some libraries I require are not yet available for Python 3 (but in the long run, this would be my preferred solution).
2) Would necessitate too many changes in a carefully crafted, unicode-only application. I think I'll just override os.listdir and filter out filenames that are not decodable, or override os.walk and do something equivalent.
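The workaround described here could look roughly like the sketch below. The names listdir_decodable and walk_decodable are hypothetical, the walk is a simplified top-down variant without os.walk's onerror or followlinks options, and it relies on Python 2's os.listdir returning byte strings for entries it could not decode:

```python
import os

text_type = type(u"")  # unicode on Python 2, str on Python 3


def listdir_decodable(path):
    """Return only the entries that os.listdir managed to decode to text."""
    return [name for name in os.listdir(path) if isinstance(name, text_type)]


def walk_decodable(top):
    """Simplified top-down walk that silently skips undecodable names."""
    dirs, files = [], []
    for name in listdir_decodable(top):
        if os.path.isdir(os.path.join(top, name)):
            dirs.append(name)
        else:
            files.append(name)
    yield top, dirs, files
    for name in dirs:
        path = os.path.join(top, name)
        # Like os.walk's default, do not descend into symlinked directories.
        if not os.path.islink(path):
            for entry in walk_decodable(path):
                yield entry
```

Skipped entries are dropped without notice, so an application that needs to report them would have to collect the byte-string names separately.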
msg231118	Author: STINNER Victor (vstinner) *	Date: 2014-11-13 14:57
> 1) Is not yet possible for me unfortunately, some libraries I require are not yet available for Python 3 (but in the long run, this would be my preferred solution)
I'm curious, which libraries?
Oh, I forgot to say that it's not possible to fix this issue in Python 2. Backporting PEP 383 to Python 2 requires deep changes in the Unicode machinery, starting with the UTF-8 codec. Currently, the UTF-8 encoder encodes surrogates, which violates the Unicode standard and makes it impossible to use this codec with the surrogateescape error handler.
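As a rough illustration of the mechanism PEP 383 relies on, the sketch below shows the round trip on Python 3 only. The byte string is a made-up example of a Latin-1 encoded filename:

```python
raw = b"caf\xe9.txt"  # hypothetical Latin-1 filename, invalid as UTF-8

# Decoding: the undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

# The strict Python 3 UTF-8 encoder rejects lone surrogates, which is what
# leaves that code point range free for the error handler to use.
try:
    name.encode("utf-8")
    strict_accepts_surrogates = True
except UnicodeEncodeError:
    strict_accepts_surrogates = False
```

On Python 3, os.fsdecode and os.fsencode apply this handler with the filesystem encoding on POSIX systems, which is why os.walk can return such names as str there.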
msg231120	Author: Florian Höch (fhoech) *	Date: 2014-11-13 15:16
> I'm curious, which libraries?
wxPython and wexpect (wexpect I could probably port myself, so the problem is mainly with wx)
> Oh, I forgot to say that it's not possible to fix this issue in Python 2. Backporting the PEP 383 in Python 2 requires deep changes in the Unicode machinery, starting by the UTF-8 codec.
Ok, that's understandable of course.

History
Date	User	Action	Args
2022-04-11 14:58:10	admin	set	github: 67051
2014-11-13 15:16:27	fhoech	set	messages: +
2014-11-13 15:15:54	r.david.murray	set	status: open -> closed; resolution: wont fix; stage: resolved
2014-11-13 14:57:11	vstinner	set	messages: +
2014-11-13 14:50:07	fhoech	set	messages: +
2014-11-13 14:40:44	vstinner	set	messages: +
2014-11-13 13:30:54	fhoech	set	messages: +
2014-11-13 13:23:11	vstinner	set	nosy: + ezio.melotti, vstinner; messages: +; components: + Unicode
2014-11-13 13:16:20	fhoech	create