Issue 27344: zipfile does support utf-8 filenames

Created on 2016-06-17 15:08 by dholth, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (12)
msg268727 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-17 15:08
The zipfile documentation says "There is no official file name encoding for ZIP files." However, both the ZIP format and zipfile support utf-8 filenames; this has been true for a long time, at least since Python 2.7.
msg268750 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-06-18 00:22
There is a difference between 'official' and 'supported', and I don't quite know what you mean by the latter.
msg269035 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-06-22 00:24
See issue 10614 for the current state of play. This issue should probably be closed in favor of that one.
msg269041 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-22 02:50
This is a simple documentation bug about the ZIP file format supporting utf-8 and 'no encoding' filenames depending on whether two bits are set in a flag inside the archive member. Bug 10614 appears to be a different issue about out-of-band encoding information that you could pass to Python's zipfile implementation.
msg269117 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-06-23 15:24
OK, what do you propose as a documentation change? The current doc is accurate, but incomplete. New phrasing could include something about the two de-facto standards but that one can not be sure that filenames will be in one of those two encodings. Issue 10614 addresses the fact that the zipfile module doesn't make it easy to specify the encoding of filenames when creating an archive, IIUC, which also still needs to be addressed in any documentation change.
msg269120 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-23 15:46
The current documentation says "Note There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets all file names as encoded in CP437, also known as DOS Latin." This is bad advice because if you convert the filenames to bytes before passing them to zipfile, it won't remember that they should be unicode. Instead it should say "The ZIP file format supports Unicode filenames. If you have unicode filenames, zipfile will encode them to and from utf-8 internally. If you pass bytes filenames to write() then they will be stored without a specified encoding." I am not sure what current versions of WinZip or Windows file manager do.
msg269121 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-23 15:47
" ... zipfile will encode them to and from utf-8 internally, and the encoding is marked in a standard flag inside the archive member."
msg269123 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-23 16:08
The documentation should read: The ZIP file format supports Unicode filenames. If you have unicode filenames, zipfile will encode them to and from utf-8 internally, but if you pass bytes filenames to write() then they will be stored without a specified encoding. Even though the format itself supports Unicode, historically Windows' built-in ZIP utility has interpreted all ZIP filenames as CP437, also known as DOS Latin. There is a fix from Microsoft for Windows 7 available here: https://support.microsoft.com/en-us/kb/2704299
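A minimal sketch of the round trip described above, assuming Python 3 and an in-memory archive. The helper name `roundtrip_name` and the sample filename are made up for illustration; only `zipfile.ZipFile`, `writestr()` and `namelist()` come from the standard library.

```python
import io
import zipfile


def roundtrip_name(name):
    """Write one member called `name`, then return (name read back, raw archive bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # A str filename is passed as-is; zipfile picks the byte encoding itself.
        zf.writestr(name, b"payload")
    raw = buf.getvalue()
    # Re-open the archive from its bytes so the name is decoded from the file, not from memory.
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.namelist()[0], raw


name = "r\u00e9sum\u00e9.txt"  # hypothetical non-ASCII filename
read_back, raw = roundtrip_name(name)
print(read_back == name)            # the name survives the round trip
print(name.encode("utf-8") in raw)  # the stored bytes are the utf-8 encoding
```

The second check relies on filenames being stored uncompressed in the archive headers, so the utf-8 bytes can be searched for directly.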
msg269180 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-06-24 14:28
I bet the existing wording is just left over from the python2 docs. I think cp437 should still be mentioned explicitly. And mentioning "setting the utf-8 flag" would probably make the explanation clearer, though I'm not sure. Technically speaking, I think zipfile supports utf8, not unicode. Or it supports unicode via utf-8.
msg269190 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-24 16:24
https://hg.python.org/cpython/file/2.6/Lib/zipfile.py#l331 Python 2.6 zipfile supports utf8 properly. It has only improved since then.
msg269201 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-24 17:29
This note looks outdated. In 2.x, 8-bit file names are written as is, implying cp437 or whatever your consumers expect. Unicode file names are encoded to ascii or utf-8 (setting the utf-8 flag in the latter case). In 3.x, only Unicode file names are accepted, and they are always encoded to ascii or utf-8. There is no way to write non-ascii non-utf-8 file names. cp437 is not used at all. Maybe just remove this misleading note?
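The 3.x behaviour described here can be checked with a short sketch, assuming Python 3. The constant name `UTF8_FLAG` and the helper `utf8_flag_set` are made up for illustration; the value 0x800 is bit 11 of the ZIP general purpose bit flag, which marks a filename as utf-8. The archive is re-read before inspecting `flag_bits` so that the value comes from what was actually written.

```python
import io
import zipfile

# Illustrative name for bit 11 of the general purpose bit flag (utf-8 filename marker).
UTF8_FLAG = 0x800


def utf8_flag_set(name):
    """Write an empty member called `name` and report whether its utf-8 flag is set."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"")
    # Read the flag back from the archive's central directory.
    with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
        return bool(zf.infolist()[0].flag_bits & UTF8_FLAG)


print(utf8_flag_set("plain.txt"))       # pure-ASCII name: stored as ascii, flag not needed
print(utf8_flag_set("na\u00efve.txt"))  # non-ASCII name: stored as utf-8 with the flag set
```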
msg334969 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-02-06 18:25
This wording was removed as part of issue 32035.
History
Date User Action Args
2022-04-11 14:58:32 admin set github: 71531
2019-02-06 18:25:03 cheryl.sabella set status: open -> closed; superseder: Documentation of zipfile.ZipFile().writestr() fails to mention that 'data' may also be bytes; nosy: + cheryl.sabella; messages: +; resolution: duplicate; stage: needs patch -> resolved
2016-06-24 17:29:11 serhiy.storchaka set messages: +
2016-06-24 16:24:40 dholth set messages: +
2016-06-24 14:28:02 r.david.murray set messages: +
2016-06-23 16:08:32 dholth set messages: +
2016-06-23 15:47:29 dholth set messages: +
2016-06-23 15:46:24 dholth set messages: +
2016-06-23 15:24:39 r.david.murray set messages: +
2016-06-22 02:50:33 dholth set messages: +
2016-06-22 00:24:20 r.david.murray set nosy: + r.david.murray; messages: +
2016-06-18 00:22:16 terry.reedy set nosy: + terry.reedy; messages: +
2016-06-17 19:11:38 serhiy.storchaka set nosy: + serhiy.storchaka; stage: needs patch; versions: + Python 3.5
2016-06-17 15:08:46 dholth create