[Python-Dev] zipfile and unicode filenames

Alexey Borzenkov snaury at gmail.com
Sun Jun 10 20:17:16 CEST 2007


On 6/10/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> I don't think always encoding them to utf-8 (and using bit 11 of
> flagbits) is a good idea, since there's a chance to create archives
> that won't be correctly readable by programs not supporting this bit
> (it's no secret that currently some programs just assume that
> filenames are encoded using one of system encodings).

I think it is also fairly uniformly agreed that these programs are incorrect; the official encoding of file names in a zip file is Windows/DOS code page 437.

Before replying to you I actually did some quick tests. I packed a file with a localized filename, then opened the archive in explorer and also viewed it in a hex editor:

7-Zip: directory cp866, header cp866: explorer sees correct filename.
zipfile: directory cp1251, header cp1251: explorer sees incorrect filename.
pkzip25.exe: directory cp866, header cp1251: explorer sees correct filenames, zipfile complains that filenames differ.
zip.exe: directory cp1251, header cp1251: explorer sees incorrect filenames.
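To make the "incorrect filename" results concrete, here is a minimal Python 3 sketch of what a reader sees when name bytes written with one code page are decoded with another. It is an illustration only, not part of the tests above; the helper name seen_as is made up for this example.

```python
def seen_as(name, written_with, read_with):
    """Return how `name` is displayed when its stored bytes are decoded
    with a different codec than the one used to write them."""
    raw = name.encode(written_with)           # bytes as stored in the archive
    return raw.decode(read_with, "replace")   # text the reading program shows


# Hypothetical usage: a Cyrillic name written as cp1251 but read as cp866
# comes back as different characters, while cp866 on both sides round-trips.
# seen_as("\u0444\u0430\u0439\u043b", "cp1251", "cp866")
# seen_as("\u0444\u0430\u0439\u043b", "cp866", "cp866")
```

Assuming the reader decodes with cp866, this matches the pattern above: archives whose directory names are cp866 display correctly, and those written as cp1251 do not.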

Also note that changing the filename in the directory to cp866 with a hex editor made explorer see correct filenames. Another experiment with pkzip25 showed that modifying the filename in the directory makes it extract files with that filename, i.e. it ignores the header filename. The same behavior is shown by 7-Zip.

So the general picture is that at least the directory filename follows some sort of convention of using the OEM (DOS, console) encoding on Windows, cp866 in my case. Header filenames come in different encodings and seem to be ignored.
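For anyone who wants to repeat the comparison without a hex editor, the following is a rough Python 3 sketch that pulls the raw name bytes out of both the central directory and the local file headers. It assumes a plain single-disk archive without zip64 extensions and without an archive comment that happens to contain the end-of-directory signature; the function name raw_names is hypothetical.

```python
import struct


def raw_names(data):
    """Yield (directory_name, header_name) as raw bytes for each member.

    `data` is the whole archive as bytes. Assumes no zip64 records.
    """
    # Find the end-of-central-directory record by its signature.
    eocd = data.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("end of central directory record not found")
    # Total entry count, central directory size and offset.
    total, _cd_size, pos = struct.unpack("<H2L", data[eocd + 10:eocd + 20])

    for _ in range(total):
        # Fixed 46-byte part of a central directory file header.
        cd = struct.unpack("<4s6H3L5H2L", data[pos:pos + 46])
        if cd[0] != b"PK\x01\x02":
            raise ValueError("bad central directory signature")
        name_len, extra_len, comment_len = cd[10:13]
        local_offset = cd[16]
        dir_name = data[pos + 46:pos + 46 + name_len]

        # Fixed 30-byte part of the matching local file header.
        lh = struct.unpack("<4s5H3L2H", data[local_offset:local_offset + 30])
        if lh[0] != b"PK\x03\x04":
            raise ValueError("bad local header signature")
        hdr_name = data[local_offset + 30:local_offset + 30 + lh[9]]

        yield dir_name, hdr_name
        pos += 46 + name_len + extra_len + comment_len
```

Reading an archive with open(path, "rb").read() and printing each pair with repr() should show whether the two copies of a name were written with the same code page.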

I don't think that the situation on Windows is that the OEM code page should be used. Instead, CP 437 should be used, independent of the OEM code page.

And on the contrary, pkzip25 made by PKWARE Inc. themselves behaves otherwise.

> + filename = str(self.filename)

That would be incorrect, as it relies on the system encoding, which shouldn't be relied upon.

Well, as I've seen in the examples above, the system (or actually DOS) encoding is what is used by at least three major programs: 7-Zip, pkzip25 and explorer, at least on Windows.

Plus, it would allow arbitrary non-string things as filenames.

Hmm... why is that bad?

What it should do instead (IMO) is to encode in CP437. Bonus points if it falls back to the UTF-8 feature of zip files if encoding as CP437 fails.
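The suggestion above could look roughly like this in Python 3. It is only a sketch of the proposal, not the zipfile module's code; the names encode_member_name and UTF8_FLAG are invented for the example, and 0x800 is simply bit 11 of the flag bits mentioned earlier in the thread.

```python
UTF8_FLAG = 0x800  # bit 11 of the general purpose flag bits


def encode_member_name(name):
    """Return (name_bytes, flag_bits) for a text member name.

    Tries CP437 first and falls back to UTF-8 with bit 11 set
    when the name cannot be represented in CP437.
    """
    try:
        return name.encode("cp437"), 0
    except UnicodeEncodeError:
        return name.encode("utf-8"), UTF8_FLAG
```

With this scheme an ASCII name is stored unchanged with no flag set, while a Cyrillic name, which CP437 cannot represent, is stored as UTF-8 with bit 11 set.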

And encoding to cp437 would be incorrect, as no currently existing program would then work correctly on non-English Windows OSes. I think that letting the user decide on the encoding is the right way to go here, as you can't know what the user actually wants these days; it's all too hazy to me. And in case unicode is passed, it just converts it using the ascii (or default) codec. One can specify the ascii codec there explicitly, if using the system encoding is really an issue.
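As a rough illustration of the "let the user decide" idea, here is a small Python 3 sketch. It is not the patch under discussion; the helper name_to_bytes and its encoding parameter are assumptions made for this example.

```python
def name_to_bytes(name, encoding="ascii"):
    """Turn a member name into bytes using a caller-chosen codec.

    Bytes are passed through untouched, so a caller who has already
    encoded the name (e.g. as cp866) stays in full control.
    """
    if isinstance(name, bytes):
        return name
    # Raises UnicodeEncodeError if the codec cannot represent the name.
    return name.encode(encoding)
```

A caller on a Russian Windows system could pass encoding="cp866" to match what 7-Zip and pkzip25 wrote in the tests above, while the strict ascii default fails loudly on anything it cannot represent.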


