gh-72680: Fix false positives when using zipfile.is_zipfile() by jjolly · Pull Request #5053 · python/cpython (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

jjolly

Fix zipfile validation issue by ... providing more validation!

Originally, zipfile.is_zipfile() only checked the End Central Directory
signature. If the signature could be found in the last 64k of the file,
success! This produced false positives on any file with 'PK\x05\x06' in the
last 64k of the file - including PDFs and PNGs.

This is now corrected by actually validating the Central Directory location
and size based on the information provided by the End Central Directory
along with verifying the Central Directory signature of the first entry.

This should be sufficient for the vast number of zipfiles, but more could be
done to absolutely validate the zipfile content - such as validating all
local file headers and Central Directory entries.

https://bugs.python.org/issue28494

@gpshead

Can you expand the unit testing for is_zipfile? is_zipfile should effectively suggest that zipfile.ZipFile will be able to read the archive and in particular a False return value should always mean that zipfile.ZipFile would fail to read it (ie: we don't want is_zipfile to fail something but be a file we could've read). We don't have much more than very basic coverage of this in the test_zipfile.py and test_zipfile64.py today.

It would not be unreasonable to assertTrue and assertFalse as appropriate the result of is_zipfile() on any test .zip file that a test is using (writing, reading, or both), valid or invalid, and any zip file contents being opened for read.

@jjolly

The zipfile.is_zipfile function would only search for the EndOfZipfile section header. This failed to correctly identify non-zipfiles that contained this header. Now the zipfile.is_zipfile function verifies the first central directory entry.

Changes:

@jjolly

Can you expand the unit testing for is_zipfile? is_zipfile should effectively suggest that zipfile.ZipFile will be able to read the archive and in particular a False return value should always mean that zipfile.ZipFile would fail to read it (ie: we don't want is_zipfile to fail something but be a file we could've read). We don't have much more than very basic coverage of this in the test_zipfile.py and test_zipfile64.py today.

I've added a PNG file that would succeed under the old algorithm, but fails correctly with the new algorithm. Perhaps the PNG is overkill, but it's proof that a valid format for another file should fail as a ZIP archive.

It would not be unreasonable to assertTrue and assertFalse as appropriate the result of is_zipfile() on any test .zip file that a test is using (writing, reading, or both), valid or invalid, and any zip file contents being opened for read.

I've added a few checks throughout the test_zipfile.py script. If you see others you would like to check, let me know and I can add them.

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request

Aug 30, 2019

@facebook-github-bot

Summary: The default implementation is lenient in that it recognizes a zipfile if the magic number appears anywhere in the archive. So if someone has the bytes PK\x03\x04 in a tensor, it gets recognized as a zipfile. See https://bugs.python.org/issue28494

This implementation only checks the first 4 bytes of the file for the zip magic number. We could also copy python/cpython#5053 fix, but that seems like overkill.

Fixes #25214 ](https://our.intern.facebook.com/intern/diff/17102516/) Pull Request resolved: #25279

Pulled By: driazati

Differential Revision: D17102516

fbshipit-source-id: 4d09645bd97e9ff7136a2229fba1d9a1bce5665a

@gpshead

FYI - This PR as is causes zipfile.is_zipfile to start rejecting some valid zipfiles that zipfile.ZipFile() happily opens and properly processes. I'm looking into why and how to best update the test suite to reveal this...

@github-actions

This PR is stale because it has been open for 30 days with no activity.

@arhadthedev

I'm looking into why and how to best update the test suite to reveal this...

@gpshead ping

@arhadthedev

@arhadthedev

@arhadthedev arhadthedev changed the titlebpo-28494: Fix false positives when using zipfile.is_zipfile() gh-72680: Fix false positives when using zipfile.is_zipfile()

Feb 25, 2023

@github-actions

This PR is stale because it has been open for 30 days with no activity.