[Python-Dev] zipfile and unicode filenames
"Martin v. Löwis" martin at v.loewis.de
Sun Jun 10 10:38:15 CEST 2007
> sys.setdefaultencoding() exists for a reason; wouldn't it be better if the stdlib could cope with that, at least in zipfile?
sys.setdefaultencoding just does not work. Many more things break when you call it. It only exists because people like you insisted that it exists.
> Also note that I'm asking whether zipfile should be improved, and how. This possible improvement is not even for me (I now know how zipfile behaves and will work with it correctly, but someone else might stumble upon this very unexpectedly).
If you want to come up with a patch: sure. The zipfile module should handle Unicode strings, encoding them in the encoding that the ZIP specification defines (both the formal one and the informal one defined by PKWARE's implementation).
The tricky question is what to do when reading in zipfiles with non-ASCII characters (and yes, I understand that in your case there were only ASCII characters in the file names).
> The problem was that sourcedir was unicode, and on my machine everything went OK multiple times. zipfile.ZipInfo.FileHeader would return unicode, but when it was written to a file it was converted back to str (because the mappings back and forth were identical). The problem happened when, on a different machine, the header suddenly got byte 0x98 in position 10 (seems to be compress_size), which the cp1251 codec couldn't decode. You see, arcname didn't even contain non-ASCII characters, but the mere fact that it was unicode made the header upgrade to unicode in "return header + self.filename + self.extra".
OK, now I understand. If filename is a Unicode string, header is implicitly decoded to Unicode using the system default encoding; depending on the exact bytes in header and on that encoding, this may cause a decoding error.
This bug has been reported as http://bugs.python.org/1170311
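
To make the failure mode concrete, here is a minimal sketch. It is written for Python 3, where bytes and text never mix implicitly, so the helper py2_style_concat (a hypothetical name) imitates the Python 2 coercion by decoding explicitly. The header bytes are made up for illustration; only the 0x98 byte at position 10 and the cp1251 codec are taken from the report above.

```python
def py2_style_concat(header, filename, default_encoding):
    """Imitate Python 2's `str + unicode`: the byte string is first
    decoded with the default encoding, then concatenated as text."""
    return header.decode(default_encoding) + filename


# Hypothetical 30-byte header: local-file signature, padding, and a
# single 0x98 byte at index 10. This is not a real ZIP header layout.
header = b"PK\x03\x04" + bytes(6) + b"\x98" + bytes(19)

# With a default encoding that maps every byte, nothing goes wrong.
ok = py2_style_concat(header, "readme.txt", "latin-1")

# With cp1251, byte 0x98 is unassigned, so the implicit decode fails
# even though the filename itself is pure ASCII.
try:
    py2_style_concat(header, "readme.txt", "cp1251")
except UnicodeDecodeError as exc:
    print("decode failed at byte", exc.start)
```

Whether the concatenation succeeds therefore depends on the binary header contents and the machine's default encoding, not on the filename.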
> Because that's not supposed to work sanely when self.filename is unicode, I'm asking whether the right behavior would be to a) disallow unicode filenames in zipfile.ZipInfo, b) automatically convert filename to str in zipfile.ZipInfo, or c) leave everything as it is.
The correct behavior would be b); the difficult detail is which encoding to use.
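
One possible shape for option b) is sketched below in Python 3 syntax. The choice of encoding is exactly the open question, so the policy here is an assumption, not a decision made in this thread: try ASCII first, and otherwise fall back to UTF-8 while setting general-purpose flag bit 11 (0x800), which the PKWARE APPNOTE uses to mark UTF-8 names. encode_filename and UTF8_FLAG are hypothetical names.

```python
UTF8_FLAG = 0x800  # general-purpose bit 11: name is UTF-8 encoded


def encode_filename(filename, flag_bits=0):
    """Return (name_bytes, flag_bits) so that the header can be built
    from byte strings only, with no implicit text/bytes coercion."""
    if isinstance(filename, bytes):
        # Already a byte string: pass it through unchanged.
        return filename, flag_bits
    try:
        # Pure ASCII names need no flag and are readable everywhere.
        return filename.encode("ascii"), flag_bits
    except UnicodeEncodeError:
        # Assumed fallback: UTF-8, marked via the flag bit.
        return filename.encode("utf-8"), flag_bits | UTF8_FLAG


name, flags = encode_filename("docs/readme.txt")
# Concatenating bytes with bytes cannot trigger a decode error.
record = b"\x98" + name
```

With the name converted up front, a stray byte such as 0x98 in the binary fields is harmless, because the concatenation never leaves the byte-string domain.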
Regards, Martin