[Python-Dev] zipfile and unicode filenames

"Martin v. Löwis" martin at v.loewis.de
Sun Jun 10 18:45:51 CEST 2007


> I don't think always encoding them to utf-8 (and using bit 11 of flagbits) is a good idea, since there's a chance to create archives that won't be correctly readable by programs not supporting this bit (it's no secret that currently some programs just assume that filenames are encoded using one of system encodings).

I think it is also fairly uniformly agreed that these programs are incorrect; the official encoding of file names in a zip file is Windows/DOS code page 437.

> This is too complex and hazy to implement. Even if I know what is the situation on Windows (i.e. using OEM, also called DOS encoding, but I'm not sure how to determine its codec name from within python apart from calling GetConsoleCP), I'm totally unaware of the situation on other operating systems.

I don't think that the situation on Windows is that the OEM code page should be used. Instead, CP 437 should be used, independent of the OEM code page.

>> The tricky question is what to do when reading in zipfiles with non-ASCII characters (and yes, I understand that in your case there were only ASCII characters in the file names).
>
> I don't think it should be changed.

In Python 3, it will certainly change, since the string type will be unicode-based. It probably should not change for the rest of 2.x.
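To make the reading side concrete, here is a minimal sketch of how a Unicode-based reader could pick the codec for a stored name. It assumes the reader looks only at bit 11 of the entry's flag bits; the names `UTF8_FLAG` and `decode_zip_filename` are made up for illustration and are not part of the zipfile module.

```python
UTF8_FLAG = 0x800  # bit 11 of the general-purpose flag bits


def decode_zip_filename(raw_name, flag_bits):
    """Decode a stored file name (bytes) to a Unicode string.

    Hypothetical helper: UTF-8 when bit 11 is set, CP437 otherwise.
    """
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    # CP437 is used regardless of the OEM code page of the local system.
    return raw_name.decode("cp437")


if __name__ == "__main__":
    print(decode_zip_filename(b"L\x94wis.txt", 0))
    print(decode_zip_filename("\u65e5\u672c.txt".encode("utf-8"), UTF8_FLAG))
```

Note that this never consults the system or console encoding, so the same archive decodes the same way on every platform.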

> Current zipfile seems to officially support ascii filenames only anyway

That's not true. You can use any byte string as the file name that you want, including non-ASCII strings encoded in CP437.

> + filename = str(self.filename)

That would be incorrect, as it relies on the system encoding, which shouldn't be relied upon. Plus, it would allow arbitrary non-string things as filenames. What it should do instead (IMO) is to encode in CP437. Bonus points if it falls back to the UTF-8 feature of zip files if encoding as CP437 fails.
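A rough sketch of that suggestion, written against a Unicode string type: try CP437 first, and only set bit 11 and switch to UTF-8 when CP437 cannot represent the name. The names `UTF8_FLAG` and `encode_zip_filename` are hypothetical and chosen for this example; this is not the actual patch.

```python
UTF8_FLAG = 0x800  # bit 11 of the general-purpose flag bits


def encode_zip_filename(name, flag_bits=0):
    """Return (encoded_name, flag_bits) for a Unicode file name.

    Hypothetical helper: prefers CP437, falls back to UTF-8 plus bit 11.
    """
    if not isinstance(name, str):
        # Reject arbitrary non-string objects instead of calling str() on them.
        raise TypeError("file name must be a string, not %s" % type(name).__name__)
    try:
        # The system encoding is never consulted.
        return name.encode("cp437"), flag_bits & ~UTF8_FLAG
    except UnicodeEncodeError:
        # Not representable in CP437: use the UTF-8 feature of zip files.
        return name.encode("utf-8"), flag_bits | UTF8_FLAG


if __name__ == "__main__":
    print(encode_zip_filename("L\u00f6wis.txt"))
    print(encode_zip_filename("\u65e5\u672c.txt"))
```

With this approach, names that fit in CP437 stay readable by programs that do not understand bit 11, and the flag is set only when there is no other way to store the name.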

Regards, Martin
