Issue 17666: Extra gzip headers breaks _read_gzip_header
Regression from Python 3.3.0 to 3.3.1, tested under Mac OS X 10.8 and 64-bit CentOS Linux.
The same regression is also present going from Python 2.7.3 to 2.7.4. Does that need a separate issue filed?
Consider this VALID GZIP file, human link: https://github.com/biopython/biopython/blob/master/Tests/GenBank/cor6_6.gb.bgz
Binary link, only a small file: https://raw.github.com/biopython/biopython/master/Tests/GenBank/cor6_6.gb.bgz
This is compressed using a GZIP variant called BGZF, which uses multiple blocks and records additional tags in the extra field of each gzip header. For background, see: http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
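For anyone who wants to reproduce this without downloading the file, here is a minimal sketch that builds a single gzip member with the FEXTRA flag set, entirely in memory. The helper name make_gzip_with_extra is hypothetical, and the "BC" subfield carries a dummy value, so the result is only BGZF-like and not a valid BGZF block. It should still be a valid gzip stream per RFC 1952.

```python
import gzip
import io
import struct
import zlib

FEXTRA = 4  # gzip header flag bit meaning "an extra field follows" (RFC 1952)


def make_gzip_with_extra(payload, extra):
    """Hypothetical helper: wrap payload in one gzip member that has an extra field."""
    # Fixed 10-byte header: magic, CM=8 (deflate), FLG, MTIME=0, XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    # XLEN (little-endian uint16) followed by the extra field itself
    header += struct.pack("<H", len(extra)) + extra
    # Raw deflate stream (negative wbits means no zlib header or trailer)
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    body = compressor.compress(payload) + compressor.flush()
    # Trailer: CRC32 and length of the uncompressed data, both modulo 2**32
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer


# BGZF-like subfield: identifier "BC", subfield length 2, dummy 16-bit value (assumption)
extra = b"BC" + struct.pack("<HH", 2, 0)
payload = b"hello BGZF\n"
blob = make_gzip_with_extra(payload, extra)

with gzip.GzipFile(fileobj=io.BytesIO(blob)) as handle:
    data = handle.read()
print(len(data))
```

On a Python version without this regression the read should succeed and return the original payload. On 3.3.1 or 2.7.4 I would expect it to hit the same TypeError as the real file, although I have only tested the real file there.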
$ curl -O https://raw.github.com/biopython/biopython/master/Tests/GenBank/cor6_6.gb.bgz
$ cat cor6_6.gb.bgz | gunzip | wc
320 1183 14967
Now for the bug, expected behaviour:
$ python3.2
Python 3.2 (r32:88445, Feb 28 2011, 17:04:33)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> handle = gzip.open("cor6_6.gb.bgz", "rb")
>>> data = handle.read()
>>> handle.close()
>>> len(data)
14967
>>> quit()
Broken behaviour:
$ python3.3
Python 3.3.1 (default, Apr 8 2013, 17:54:08)
[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> handle = gzip.open("cor6_6.gb.bgz", "rb")
>>> data = handle.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pjcock/lib/python3.3/gzip.py", line 359, in read
    while self._read(readsize):
  File "/Users/pjcock/lib/python3.3/gzip.py", line 432, in _read
    if not self._read_gzip_header():
  File "/Users/pjcock/lib/python3.3/gzip.py", line 305, in _read_gzip_header
    self._read_exact(struct.unpack("<H", self._read_exact(2)))
  File "/Users/pjcock/lib/python3.3/gzip.py", line 282, in _read_exact
    data = self.fileobj.read(n)
  File "/Users/pjcock/lib/python3.3/gzip.py", line 81, in read
    return self.file.read(size)
TypeError: integer argument expected, got 'tuple'
The bug is very simple, an error in line 305 of gzip.py (the line reported in the traceback above):
303         if flag & FEXTRA:
304             # Read & discard the extra field, if present
305             self._read_exact(struct.unpack("<H", self._read_exact(2)))
The struct.unpack function returns a single-element tuple, so _read_exact is passed a tuple instead of an integer length. A fix is to unpack the value first:
303         if flag & FEXTRA:
304             # Read & discard the extra field, if present
305             extra_len, = struct.unpack("<H", self._read_exact(2))
306             self._read_exact(extra_len)
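To illustrate the tuple issue in isolation, here is a small standalone sketch. It uses io.BytesIO as a stand-in for the underlying file object (an assumption for illustration, not what gzip.py wraps in every case), and the variable names are mine.

```python
import io
import struct

# Two little-endian bytes, as the extra field length (XLEN) appears in a gzip header
raw = struct.pack("<H", 6)

# struct.unpack always returns a tuple, even for a single value
result = struct.unpack("<H", raw)
print(result)

# Passing that tuple straight to read() is what triggers the TypeError
stream = io.BytesIO(b"BC\x02\x00\x00\x00rest")
try:
    stream.read(result)
    raised = False
except TypeError:
    raised = True

# Tuple unpacking with a trailing comma extracts the integer, as in the proposed fix
extra_len, = struct.unpack("<H", raw)
extra = stream.read(extra_len)
```

The trailing comma in "extra_len, =" also has the side effect of failing loudly if the format string is ever changed to return more than one value.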
This bug was identified via failing Biopython unit tests under Python 2.7.4 and 3.3.1, which all pass with this minor fix applied.