bpo-29708: support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) by bmwiedemann · Pull Request #296 · python/cpython (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation70 Commits2 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
See https://reproducible-builds.org/ for why this is good
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.
Background:
In some distributions like openSUSE, binary rpms contain precompiled .pyc files.
And packages like amqp or twisted dynamically generate .py files at build time
so those have the current time and that timestamp gets embedded
into the .pyc file header.
When we then adapt file timestamps in rpms to be constant,
the timestamp in the .pyc header will no more match
the .py timestamp in the filesystem.
The software will still work, but it will not use the .pyc file as it should.
https://bugs.python.org/issue29708
Hello, and thanks for your contribution!
I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).
Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA. This is necessary for legal reasons before we can look at your contribution. Please follow these steps to help rectify the issue:
- If you don't have an account on b.p.o, please create one
- Make sure your GitHub username is listed in "Your Details" at b.p.o
- If you have not already done so, please sign the PSF contributor agreement. The "bugs.python.org username " requested by the form is the "Login name" field under "Your Details".
- If you just signed the CLA, please wait at least one US business day and then check "Your Details" on bugs.python.org to see if your account has been marked as having signed the CLA (the delay is due to a person having to manually check your signed CLA)
- Reply here saying you have completed the above steps
Thanks again to your contribution and we look forward to looking at it!
@@ -137,6 +137,10 @@ def compile(file, cfile=None, dfile=None, doraise=False, optimize=-1): |
---|
except FileExistsError: |
pass |
source_stats = loader.path_stats(file) |
sde = os.environ.get('SOURCE_DATE_EPOCH') |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Debian, we don't ship binary packages with pyc files, instead we generate them when the package is installed. That makes more sense for us because we can support multiple versions of Python simultaneously, so we generate pycs based on the versions installed on the target system. That way we generally don't have to build all of our binary packages when we add a new Python interpreter, and we don't bloat our binary packages with pyc files.
But in any case, $SOURCE_DATE_EPOCH
is the de facto standard for transmitting this to programs for reproducible builds, so this change is reasonable. However I would like to see documentation and a test.
Will do the changes in a week (after vacation).
We had that discussion recently about dropping .pyc files in http://lists.opensuse.org/opensuse-packaging/2017-02/msg00086.html and it seems, many people prefer to keep them, because they want the package manager to track all files belonging to a package and because it makes it easier to predict disk-space consumption. Then there are some advantages in processing speed. Some even went as far as only wanting to ship .pyc files (similar to how we only ship binaries in C)
did testing, found of the originally affected 476 packages in openSUSE, most were fixed by this patch, except 12 (eric5 nml pcp python3 python3-base python3-Markdown python3-pycparser python-base python-Markdown python-pycparser qpid-cpp salt) which seem to use some other method of producing .pyc files that will need to be patched separately.
On Feb 25, 2017, at 09:37 PM, Bernhard M. Wiedemann wrote: We had that discussion recently about dropping .pyc files inhttp://lists.opensuse.org/opensuse-packaging/2017-02/msg00086.html and it seems, many people prefer to keep them, because they want the package manager to track all files belonging to a package and because it makes it easier to predict disk-space consumption. Then there are some advantages in processing speed. Some even went as far as only wanting to ship .pyc files (similar to how we only ship binaries in C)
Just out of curiosity, does OpenSUSE support multiple parallel installed versions of Python at the same time? E.g. can I install from system packages both Python 3.5 and 3.6? If so, do you have to rebuild all your library packages when you add a new version of Python? (In the Debian ecosystem, we can support multiple versions, although we tend to only do that during transitions. Still, it makes transitions easier for the archive and users. Since we don't ship pyc files in binary packages, we only have to rebuild packages with extension modules.)
Please open a general issue on bugs.python.org to track multiple changes to support reproducible Python builds and link PR to this issue.
vstinner changed the title
allow for reproducible builds of python packages support SOURCE_DATE_EPOCH env var in pycompile (allow for reproducible builds of python packages)
@@ -137,6 +137,10 @@ def compile(file, cfile=None, dfile=None, doraise=False, optimize=-1): |
---|
except FileExistsError: |
pass |
source_stats = loader.path_stats(file) |
sde = os.environ.get('SOURCE_DATE_EPOCH') |
if sde and source_stats['mtime'] > int(sde): |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur for the unit test. Document the variable in Doc/library/py_compile.rst
What happens if SOURCE_DATE_EPOCH is not a valid integer? Should it be ignored or raise a nice error message?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The traceback from int()
should be enough, right?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The traceback if int
does not tell the end user to set the `SOURCE_DATE_EPOCH' environment variable with a valid integer value.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might convert sde to int to raise a nice error message if the conversion fails.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed 4 of 5 comments, but for this one I don't know how you would prefer it to be...
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, if SOURCE_DATE_EPOCH
is set, it is very uncommon (aka exceptional) to be a non-int, because it is set by programs rather than users most of the time.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I say convert the string value to an int in a separate step in a try
, catch the ValueError
, and raise a chained ValueError
that SOURCE_DATE_EPOCH
is malformed.
vstinner changed the title
support SOURCE_DATE_EPOCH env var in pycompile (allow for reproducible builds of python packages) support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages)
bmwiedemann changed the title
support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages) bpo-29708: support SOURCE_DATE_EPOCH env var in py_compile (allow for reproducible builds of python packages)
added test and documentation
@warsaw Yes. Fedora, OpenSUSE, and others do support multiple Python interpreters. In OpenSUSE's case, they have a global flavors thing that controls this, and in Fedora's case, we can target them specifically. However, as a matter of course, we ship one and only one per major version in the core distribution.
Because all paths are already versioned by Python (we didn't do the thing that Debian did and mutate the package install paths so that they are unversioned), parallel installability isn't an issue. We do rebuild things, but it also allows us to be sure things still work as we upgrade Python versions.
f.read(4) |
---|
timebytes = f.read(4) # read timestamp from pyc header |
f.close() |
self.assertEqual(timebytes, b'\x15\xcd\x5b\x07') |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make this test a little more obvious, how about changing the bytes literal to (123456789).to_bytes(4, 'little')
?
I have just one suggestion left, but everything else about the branch LGTM.
updated again.
Are these failures in 'cpython/Lib/test/test_descr.py", line 2888, in test_keywords'
really related to my change?
Then I wondered if the added test would fail on big-endian platforms (not many left though)
And there might be a more elegant way to save+restore the environment in the test.
And directly accessing bytes from .pyc files might be a bit fragile - e.g. the header format might need to be changed to avoid the year-2038 bug... I guess the test can be updated then as needed.
@bmwiedemann those failures have been fixed, so just rebase with master again and they should go away.
@warsaw anything left for me to do here?
In Nixpkgs we have also a patch to reproducibly build the Python interpreters.
- We set the timestamp of the bytecode to 1 (this is the value we use for
SOURCE_DATE_EPOCH
whenDETERMINISTIC_BUILD
is set - We also set
PYTHONHASHSEED=0
so that the order of e.g. sets are not randomized - we force a rebuild of all the bytecode in the final stage of the build
Furthermore, whenever we build a Python package we also set PYTHONHASHSEED=0
and DETERMINISTIC_BUILD=1
.
did testing, found of the originally affected 476 packages in openSUSE, most were fixed by this patch, except 12 (eric5 nml pcp python3 python3-base python3-Markdown python3-pycparser python-base python-Markdown python-pycparser qpid-cpp salt) which seem to use some other method of producing .pyc files that will need to be patched separately.
@bmwiedemann you will also need to modify Lib/importlib/_bootstrap_external.py
Furthermore, for packages I found the following along with PYTHONHASHSEED=0
is needed:
https://bitbucket.org/pypa/wheel/pull-requests/77/bdist_wheel-deterministic-order-record/diff
FRidh mentioned this pull request
I still think, this patch is good to merge as it is.
I have made the requested changes; please review again.
I dropped the mtime modification as it is not strictly needed (rpm can already adjust file timestamps (but it is enabled by a different macro than the one setting SOURCE_DATE_EPOCH)) and even if in some cases it leads to .pyc timestamps being inconsistent with .py mtime it only slightly decreases performance, but does not break functionality.
As for other compilers, I only know about Free Pascal 'fpc' and one other (forgot which) but AFAIK they dont have good solutions yet either, so pyc files are sort of a prototype in that area.
except ValueError: |
---|
raise ValueError("SOURCE_DATE_EPOCH is not a valid integer") |
if source_stats['mtime'] > source_date_epoch: |
source_stats['mtime'] = source_date_epoch |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we making judgment calls about which one is greater? SOURCE_DATE_EPOCH means I decided I know better than you what the time is, that is the entire point. For example, Arch Linux stores all files in the package archive using the modification time of SOURCE_DATE_EPOCH.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTOH rpm and tar's --clamp-mtime option implement mtime clamping, so that old files keep their timestamps and just timestamps of files created during a build are modified to be SOURCE_DATE_EPOCH
can be tricky to get this right for both cases...
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in that case, the mtime clamp will set the pyc modification time to SOURCE_DATE_EPOCH anyway.
It would result in general files being clamped to a "latest" mtime of SOURCE_DATE_EPOCH and the pyc header timestamps being exactly equal to SOURCE_DATE_EPOCH, so everyone wins.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But python does an exact match when checking if it can use pyc files, so old .py files having .pyc files with 'too new' SOURCE_DATE_EPOCH in header would also result in .pyc files not to be used.
I think we might need something like
if source_stats['mtime'] > source_date_epoch or os.environ.get('ALWAYS_SET_MTIME'): source_stats['mtime'] = source_date_epoch
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think worrying about whether .pyc
files will be regenerated based on the static date set in them is getting beyond the scope of reproducible builds. If this is going to have a chance to land in Python 3.7 then let's keep this PR just to the production of .pyc
files and not their consumption.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once the clamping is dropped so that it's a straight setting of the timestamp based on the envvar then this PR is ready to go (assuming my docs changes build appropriately 😁 ).
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again
. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
IMHO that does not make sense. The original bug was about .pyc header timestamps not matching what .py timestamps end up in packages and some of it will still be there, depending on whether all timestamps or just new ones are modified.
One simple, but ugly solution would be to have python always set mtime of .py files, so that those match .pyc headers.
The other less ugly solution would be to give hints to python about what the packaging code will do to mtimes later and take that into account.
Or distributions add a small custom patch to the python they ship to match things up... might actually be the cleanest variant, but would probably need a bit more documentation.
The original report was about python not respecting SOURCE_DATE_EPOCH. Respecting SOURCE_DATE_EPOCH imho means using exactly that. Note that Arch Linux has recently gained support for reproducing packages, and we determined that the best policy was to make the package manager itself respect SOURCE_DATE_EPOCH, so makepkg
will use it internally for the timestamps of packaged files, among other things.
(Reproducible builds should not have implementation-defined dependencies like setting SOURCE_DATE_EPOCH to the past rather than the future. The reproducible-builds documentation firts references using --mtime, and then says "in some cases" you might prefer to use --clamp-mtime, which does not sound like a ringing endorsement.)
PR #5200 was merged, so we do not need this approach anymore => closing
akruis pushed a commit to akruis/cpython that referenced this pull request
It is now possible to cross-compile Stackless for Windows on ARM and ARM64.
gicmo mentioned this pull request
jaraco pushed a commit that referenced this pull request
mcepl added a commit to openSUSE-Python/cpython that referenced this pull request
Created by Bernhard M. Wiedemann bmwiedemann@suse.de -- gh#python#296
mcepl added a commit to openSUSE-Python/cpython that referenced this pull request
Created by Bernhard M. Wiedemann bmwiedemann@suse.de -- gh#python#296
Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
mcepl added a commit to openSUSE-Python/cpython that referenced this pull request
Created by Bernhard M. Wiedemann bmwiedemann@suse.de -- gh#python#296
Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
mcepl pushed a commit to openSUSE-Python/cpython that referenced this pull request
Code is originally from gh#python#296 (never released in this form).
Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
mcepl pushed a commit to openSUSE-Python/cpython that referenced this pull request
See https://reproducible-builds.org/ for why this is good idea and https://reproducible-builds.org/specs/source-date-epoch/ for the definition of this variable.
Background: In some distributions like openSUSE, binary rpms contain precompiled .pyc files.
And packages like amqp or twisted dynamically generate .py files at build time so those have the current time and that timestamp gets embedded into the .pyc file header. When we then adapt file timestamps in rpms to be constant, the timestamp in the .pyc header will no more match the .py timestamp in the filesystem. The software will still work, but it will not use the .pyc file as it should.
Original version which doesn't force hashed *.pyc files
Code is originally from gh#python#296 (never released in this form).
Patch: 0001-allow-for-reproducible-builds-of-python-packages.patch
daregit pushed a commit to daregit/yocto-combined that referenced this pull request
The compiled .pyc files contain time stamp corresponding to the compile time. This prevents binary reproducibility. This patch allows to achieve binary reproducibility by overriding the build time stamp by the value exported via SOURCE_DATE_EPOCH.
Patch by Bernhard M. Wiedemann, backported from python/cpython#296
[YOCTO#11241]
(From OE-Core rev: 2a044f1e4f5c63e11e631b31f741c7aabfa6f601)
Signed-off-by: Juro Bystricky juro.bystricky@intel.com Signed-off-by: Richard Purdie richard.purdie@linuxfoundation.org
daregit pushed a commit to daregit/yocto-combined that referenced this pull request
The compiled .pyc files contain time stamp corresponding to the compile time. This prevents binary reproducibility. This patch allows to achieve binary reproducibility by overriding the build time stamp by the value exported via SOURCE_DATE_EPOCH.
Patch by Bernhard M. Wiedemann, backported from python/cpython#296
[YOCTO#11241]
(From OE-Core rev: 2a044f1e4f5c63e11e631b31f741c7aabfa6f601)
Signed-off-by: Juro Bystricky juro.bystricky@intel.com Signed-off-by: Richard Purdie richard.purdie@linuxfoundation.org