gzip.compress(..., mtime=0) in cpython 3.11+ unexpectedly sets OS byte in gzip header · Issue #112346 · python/cpython

Bug report

description

When gzip.compress() is called with mtime=0 on 3.8<=cpython<=3.10, the OS byte, i.e. the 10th byte in the gzip header, is set to 255 "unknown" (also see e.g. #83302):

return struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, int(mtime), xfl, 255)

However, in cpython 3.11 and 3.12, the OS byte is suddenly set to a "known" value, e.g. 3 ("Unix") on Ubuntu.
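To see which value a given interpreter writes, the fixed 10-byte header can be unpacked with the same struct format as the line quoted above. This is only a rough sketch: `GZIP_HEADER` and `parse_gzip_header` are names made up for illustration, not part of the gzip module.

```python
import gzip
import struct

# Fixed 10-byte gzip header: two magic bytes, compression method, flags,
# mtime (uint32, little-endian), extra flags, OS byte.
GZIP_HEADER = struct.Struct("<BBBBLBB")


def parse_gzip_header(blob: bytes) -> dict:
    """Return the fixed header fields of a gzip stream (hypothetical helper)."""
    id1, id2, method, flags, mtime, xfl, os_byte = GZIP_HEADER.unpack_from(blob)
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_byte}


header = parse_gzip_header(gzip.compress(b"", mtime=0))
# Per this report: os is 255 on 3.8-3.10, and e.g. 3 on 3.11/3.12 on Ubuntu.
print(header["mtime"], header["os"])
```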

This is not mentioned in the changelog for Python 3.11.

This may lead to problems in the context of reproducible builds. In our case, hash checking fails after decompressing and re-compressing a gzipped archive.

how to reproduce

Here's an example, where byte 10 is \xff in python 3.10 and \x03 in python 3.11:

~ $ python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

~ $ pyenv shell 3.11
~ $ python
Python 3.11.6 (main, Nov 23 2023, 17:30:16) [GCC 11.4.0] on linux
>>> import gzip
>>> gzip.compress(b'', mtime=0)
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'
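To make the reproducibility problem concrete, the two outputs from the sessions above can be compared directly. The sketch below copies both byte strings verbatim from this report (the names `OUT_310` and `OUT_311` are just labels chosen here) and shows that they decompress to the same payload while hashing differently.

```python
import gzip
import hashlib

# Outputs copied from the two interpreter sessions in this report.
OUT_310 = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'
OUT_311 = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x02\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

# Zero-based indices at which the two byte strings differ.
diff = [i for i, (a, b) in enumerate(zip(OUT_310, OUT_311)) if a != b]
print(diff)  # [9], i.e. the 10th byte

same_payload = gzip.decompress(OUT_310) == gzip.decompress(OUT_311)
same_hash = hashlib.sha256(OUT_310).digest() == hashlib.sha256(OUT_311).digest()
print(same_payload, same_hash)  # True False
```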

cause

I guess this is caused by python 3.11 delegating the gzip.compress() call to zlib if mtime=0, as mentioned in the docs:

Changed in version 3.11: Speed is improved by compressing all data at once instead of in a streamed fashion. Calls with mtime set to 0 are delegated to zlib.compress() for better speed.

and source:

if mtime == 0:
    # Use zlib as it creates the header with 0 mtime by default.
    # This is faster and with less overhead.
    return zlib.compress(data, level=compresslevel, wbits=31)

Apparently zlib does set the OS byte.
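One possible workaround, until the behaviour is settled, might be to overwrite the OS byte after compressing. This is only a sketch under assumptions: `normalize_os_byte` and `compress_reproducible` are hypothetical helpers, only the first member of a multi-member stream is patched, and nothing here guarantees that the compressed body itself is identical across Python or zlib versions.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
OS_BYTE_INDEX = 9   # zero-based index of the 10th header byte
OS_UNKNOWN = 255    # value written by 3.8-3.10 according to this report


def normalize_os_byte(blob: bytes, os_value: int = OS_UNKNOWN) -> bytes:
    """Return blob with the OS byte of its first gzip header overwritten."""
    if len(blob) < 10 or blob[:2] != GZIP_MAGIC:
        raise ValueError("not a gzip stream")
    return blob[:OS_BYTE_INDEX] + bytes([os_value]) + blob[OS_BYTE_INDEX + 1:]


def compress_reproducible(data: bytes, compresslevel: int = 9) -> bytes:
    """gzip.compress() with mtime=0 and the OS byte pinned to 255."""
    raw = gzip.compress(data, compresslevel=compresslevel, mtime=0)
    return normalize_os_byte(raw)


blob = compress_reproducible(b"hello")
print(blob[OS_BYTE_INDEX], gzip.decompress(blob))  # 255 b'hello'
```

The OS byte is not covered by the CRC32 in the gzip trailer, which is computed over the uncompressed data, so patching it leaves the stream decodable.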

CPython versions tested on:

3.8, 3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux, macOS, Windows

Linked PRs