msg126724 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-01-21 12:00 |
ZipInfo._encodeFilename() tries the cp437 encoding and falls back to UTF-8. The caller cannot choose the encoding. To work around #10955 (bootstrap issue with python32.zip), it would be nice to be able to create a ZIP file using only UTF-8 filenames. The attached patch adds a unicode parameter to ZipFile.write(), ZipFile.writestr() and the ZipInfo constructor. |
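For readers following along, the fallback described above can be sketched roughly as follows. This is only an illustration, not the actual zipfile code or the attached patch: the helper names encode_filename and encode_filename_utf8 are hypothetical, and 0x800 is general-purpose bit 11, the ZIP flag that marks a filename as UTF-8.

```python
UTF8_FLAG = 0x800  # general-purpose bit 11: the filename is stored as UTF-8


def encode_filename(name, flag_bits=0):
    """Hypothetical helper: try cp437 first, otherwise fall back to UTF-8."""
    try:
        # Names representable in cp437 are stored without the UTF-8 flag.
        return name.encode("cp437"), flag_bits
    except UnicodeEncodeError:
        # Anything else is stored as UTF-8 with the flag set.
        return name.encode("utf-8"), flag_bits | UTF8_FLAG


def encode_filename_utf8(name, flag_bits=0):
    """Hypothetical variant: always use UTF-8 and always set the flag."""
    return name.encode("utf-8"), flag_bits | UTF8_FLAG
```

The second helper shows what "only UTF-8 filenames" would mean in practice: even a pure ASCII name is written with the flag set, so a reader never has to guess the encoding.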
|
|
msg126725 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-01-21 12:03 |
Oh, this patch also fixes a bug: ZipFile._RealGetContents() doesn't keep the unicode flag, so opening a ZIP file and then writing it somewhere else may change the unicode flag if the flag was set but the filename is also encodable to cp437 (e.g. an ASCII filename). |
|
|
msg126727 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-01-21 12:07 |
7zip and WinRAR use the same algorithm as ZipFile._encodeFilename(): try cp437 or use UTF-8. E.g. if a filename contains ∞ (U+221E), it is encoded to UTF-8. WinZIP encodes all filenames to cp437: ∞ (U+221E) is replaced by 8 (U+0038), ☺ (U+263A) is replaced by... U+0001! 7zip, WinRAR and WinZIP are all able to decode UTF-8 filenames (they handle the unicode flag correctly). |
|
|
msg126731 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2011-01-21 12:18 |
What kind of problem are you trying to solve? |
|
|
msg126734 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-01-21 13:00 |
> What kind of problem are you trying to solve?

Support non-ASCII filenames in python32.zip (#10955): at bootstrap, Python 3.2 can only use the UTF-8 codec (not cp437).

But I also suppose that forcing the encoding to UTF-8 gives better Unicode support (when you decompress the archive). |
|
|
msg126735 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2011-01-21 13:03 |
> Support non-ASCII filenames in python32.zip (#10955): at bootstrap,
> Python 3.2 can only use UTF-8 codec (not cp437).
>
> But I suppose also that forcing the encoding to UTF-8 gives a better
> Unicode support (when you decompress the archive).

The question is, rather, why you need an external flag for that. |
|
|
msg126745 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-01-21 15:02 |
> The question is, rather, why you need an external flag for that.

Because I don't want to change the default encoding. I am not sure that all applications support the UTF-8 encoding.

But if you control your environment, forcing the UTF-8 encoding should improve your Unicode support. |
|
|
msg126746 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2011-01-21 15:12 |
> > The question is, rather, why you need an external flag for that.
>
> Because I don't want to change the default encoding. I am not sure
> that all applications support UTF-8 encodings.

If this is a ZIP standard flag, why should we care about applications which don't support it? Should we add other flags to disable other features out of fear that other applications might not support them either?

> But if you control your environment, force UTF-8 encoding should
> improve your Unicode support.

How is a random user supposed to know if their tools support UTF-8 encoding? It's not like everyone is an expert in ZIP files. This is the kind of situation where asking the user to make a choice is more confusing than helpful.

When adding the flag, not only do you complicate the API, but you also have to support this flag for the rest of your life (well, almost :-)).

We could instead use UTF-8 by default for all non-ASCII filenames (and *perhaps* have a separate force_cp437 flag, but default it to False). |
|
|
msg126759 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2011-01-21 17:59 |
This looks similar to |
|
|
msg276182 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-09-13 06:18 |
Now UTF-8 is used for non-ASCII names. Can this issue be closed as outdated? |
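A quick way to check this behaviour, assuming a reasonably recent Python 3 zipfile module, is to write an archive in memory and inspect bit 11 (0x800) of each entry's flag_bits. The helper name utf8_flags and the sample filenames are made up for this sketch.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the filename is stored as UTF-8


def utf8_flags(names):
    """Write the given names to an in-memory ZIP and report which carry the UTF-8 flag."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"data")
    buf.seek(0)
    # Reopen the archive so the flags are read back from the central directory.
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


flags = utf8_flags(["ascii.txt", "caf\u00e9.txt"])
```

In this sketch the pure ASCII name should come back without the flag and the non-ASCII name with it, which is consistent with the statement above.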
|
|
msg297125 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-06-28 01:37 |
> This looks similar to

Right. Let's focus on that one, which has a better design. "unicode" means everything and nothing. It's more reliable to specify an encoding. |
|
|
msg297148 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-06-28 03:58 |
See also . |
|
|