gh-72680: Fix false positives when using zipfile.is_zipfile() by jjolly · Pull Request #5053 · python/cpython (original) (raw)
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
Fix zipfile validation issue by ... providing more validation!
Originally, zipfile.is_zipfile() only checked the End Central Directory
signature. If the signature could be found in the last 64k of the file,
success! This produced false positives on any file with 'PK\x05\x06' in the
last 64k of the file - including PDFs and PNGs.
This is now corrected by actually validating the Central Directory location
and size based on the information provided by the End Central Directory
along with verifying the Central Directory signature of the first entry.
This should be sufficient for the vast number of zipfiles, but more could be
done to absolutely validate the zipfile content - such as validating all
local file headers and Central Directory entries.
https://bugs.python.org/issue28494
Can you expand the unit testing for is_zipfile
? is_zipfile
should effectively suggest that zipfile.ZipFile
will be able to read the archive and in particular a False
return value should always mean that zipfile.ZipFile
would fail to read it (ie: we don't want is_zipfile to fail something but be a file we could've read). We don't have much more than very basic coverage of this in the test_zipfile.py
and test_zipfile64.py
today.
It would not be unreasonable to assertTrue and assertFalse as appropriate the result of is_zipfile()
on any test .zip
file that a test is using (writing, reading, or both), valid or invalid, and any zip file contents being opened for read.
The zipfile.is_zipfile function would only search for the EndOfZipfile section header. This failed to correctly identify non-zipfiles that contained this header. Now the zipfile.is_zipfile function verifies the first central directory entry.
Changes:
- Extended zipfile.is_zipfile to verify zipfile catalog
- Added tests to validate failure of binary non-zipfiles
Can you expand the unit testing for
is_zipfile
?is_zipfile
should effectively suggest thatzipfile.ZipFile
will be able to read the archive and in particular aFalse
return value should always mean thatzipfile.ZipFile
would fail to read it (ie: we don't want is_zipfile to fail something but be a file we could've read). We don't have much more than very basic coverage of this in thetest_zipfile.py
andtest_zipfile64.py
today.
I've added a PNG file that would succeed under the old algorithm, but fails correctly with the new algorithm. Perhaps the PNG is overkill, but it's proof that a valid format for another file should fail as a ZIP archive.
It would not be unreasonable to assertTrue and assertFalse as appropriate the result of
is_zipfile()
on any test.zip
file that a test is using (writing, reading, or both), valid or invalid, and any zip file contents being opened for read.
I've added a few checks throughout the test_zipfile.py script. If you see others you would like to check, let me know and I can add them.
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request
Summary:
The default implementation is lenient in that it recognizes a zipfile if the magic number appears anywhere in the archive. So if someone has the bytes PK\x03\x04
in a tensor, it gets recognized as a zipfile. See https://bugs.python.org/issue28494
This implementation only checks the first 4 bytes of the file for the zip magic number. We could also copy python/cpython#5053 fix, but that seems like overkill.
Fixes #25214 ](https://our.intern.facebook.com/intern/diff/17102516/) Pull Request resolved: #25279
Pulled By: driazati
Differential Revision: D17102516
fbshipit-source-id: 4d09645bd97e9ff7136a2229fba1d9a1bce5665a
FYI - This PR as is causes zipfile.is_zipfile
to start rejecting some valid zipfiles that zipfile.ZipFile()
happily opens and properly processes. I'm looking into why and how to best update the test suite to reveal this...
This PR is stale because it has been open for 30 days with no activity.
I'm looking into why and how to best update the test suite to reveal this...
@gpshead ping
arhadthedev changed the title
bpo-28494: Fix false positives when using zipfile.is_zipfile() gh-72680: Fix false positives when using zipfile.is_zipfile()
This PR is stale because it has been open for 30 days with no activity.