msg268727 - (view) |
Author: Daniel Holth (dholth) * |
Date: 2016-06-17 15:08 |
The zipfile documentation says "There is no official file name encoding for ZIP files." However, ZIP and zipfile support utf-8 filenames; this has been true for a long time, at least since Python 2.7. |
|
|
msg268750 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2016-06-18 00:22 |
There is a difference between 'official' and 'supported', and I don't quite know what you mean by the latter. |
|
|
msg269035 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-06-22 00:24 |
See issue 10614 for the current state of play. This issue should probably be closed in favor of that one. |
|
|
msg269041 - (view) |
Author: Daniel Holth (dholth) * |
Date: 2016-06-22 02:50 |
This is a simple documentation bug about the ZIP file format supporting utf-8 and 'no encoding' filenames depending on whether two bits are set in a flag inside the archive member. Bug 10614 appears to be a different issue about out-of-band encoding information that you could pass to Python's zipfile implementation. |
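For readers who want to see the flag being discussed, here is a minimal sketch rather than a definitive reference. It assumes the flag in question is bit 11 (0x800) of the general purpose bit flag, which is the bit Python 3's zipfile sets when a name cannot be stored as ASCII. The helper name `utf8_flags` and the sample file names are made up for illustration.

```python
import io
import zipfile

# Assumption: bit 11 of the general purpose bit flag marks a UTF-8 filename.
UTF8_FLAG = 0x800


def utf8_flags(names):
    """Write one small member per name, reopen the archive, and report
    for each stored name whether the UTF-8 flag is set."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"data")
    # Reopen for reading so flag_bits comes from the archive itself.
    with zipfile.ZipFile(buf) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }


flags = utf8_flags(["ascii.txt", "h\u00e9llo.txt"])
print(flags)
```

The archive is reopened before inspecting `flag_bits` because the flag is computed when the headers are written, so the values read back from disk are the ones other tools will see.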
|
|
msg269117 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-06-23 15:24 |
OK, what do you propose as a documentation change? The current doc is accurate, but incomplete. New phrasing could include something about the two de-facto standards, but note that one cannot be sure that filenames will be in one of those two encodings. Issue 10614 addresses the fact that the zipfile module doesn't make it easy to specify the encoding of filenames when creating an archive, IIUC, which also still needs to be addressed in any documentation change. |
|
|
msg269120 - (view) |
Author: Daniel Holth (dholth) * |
Date: 2016-06-23 15:46 |
The current documentation says "Note There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets all file names as encoded in CP437, also known as DOS Latin." This is bad advice because if you convert the filenames to bytes before passing them to zipfile, it won't remember that they should be unicode. Instead it should say "The ZIP file format supports Unicode filenames. If you have unicode filenames, zipfile will encode them to and from utf-8 internally. If you pass bytes filenames to write() then they will be stored without a specified encoding." I am not sure what current versions of WinZip or Windows file manager do. |
|
|
msg269121 - (view) |
Author: Daniel Holth (dholth) * |
Date: 2016-06-23 15:47 |
" ... zipfile will encode them to and from utf-8 internally, and the encoding is marked in a standard flag inside the archive member." |
|
|
msg269123 - (view) |
Author: Daniel Holth (dholth) * |
Date: 2016-06-23 16:08 |
The documentation should read: "The ZIP file format supports Unicode filenames. If you have unicode filenames, zipfile will encode them to and from utf-8 internally, but if you pass bytes filenames to write() then they will be stored without a specified encoding. Even though the format itself supports Unicode, historically Windows' built-in ZIP utility has interpreted all ZIP filenames as CP437, also known as DOS Latin. There is a fix from Microsoft for Windows 7 available here: https://support.microsoft.com/en-us/kb/2704299" |
|
|
msg269180 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-06-24 14:28 |
I bet the existing wording is just left over from the python2 docs. I think cp437 should still be mentioned explicitly. And mentioning "setting the utf-8 flag" would probably make the explanation clearer, though I'm not sure. Technically speaking, I think zipfile supports utf8, not unicode. Or it supports unicode via utf-8. |
|
|
msg269190 - (view) |
Author: Daniel Holth (dholth) * |
Date: 2016-06-24 16:24 |
https://hg.python.org/cpython/file/2.6/Lib/zipfile.py#l331 Python 2.6 zipfile supports utf8 properly. It has only improved since then. |
|
|
msg269201 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-06-24 17:29 |
This note looks outdated. In 2.x, 8-bit file names are written as is, implying cp437 or whatever your consumers expect. Unicode file names are encoded to ascii or utf-8 (setting the utf-8 flag in the latter case). In 3.x only Unicode file names are accepted, and they are always encoded to ascii or utf-8. There is no way to write non-ascii non-utf-8 file names. cp437 is not used at all. Maybe just remove this misleading note? |
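The 3.x writing behaviour described here can be checked by looking at the raw archive bytes. This is only a sketch under stated assumptions: the helper `raw_archive_bytes` and the sample name are hypothetical, and it only covers writing, not how names are decoded when an archive is read.

```python
import io
import zipfile


def raw_archive_bytes(name):
    """Return the raw bytes of an in-memory archive holding one member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"data")
    return buf.getvalue()


name = "h\u00e9llo.txt"  # hypothetical non-ASCII file name
raw = raw_archive_bytes(name)

# The name should appear in the headers as UTF-8 bytes, not as CP437 bytes.
stored_as_utf8 = name.encode("utf-8") in raw
stored_as_cp437 = name.encode("cp437") in raw
print(stored_as_utf8, stored_as_cp437)
```

Searching the raw bytes is a blunt check, but it avoids relying on zipfile's own decoding to confirm what zipfile wrote.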
|
|
msg334969 - (view) |
Author: Cheryl Sabella (cheryl.sabella) *  |
Date: 2019-02-06 18:25 |
This wording was removed as part of issue 32035. |
|
|