Currently, the order of set or frozenset elements when saved to bytecode depends on the hash seed. This breaks reproducibility.

Example failure from an Arch Linux package: https://reproducible.archlinux.org/api/v0/builds/88454/diffoscope

Let's take an example file, `test_compile.py`:

```python
s = {
    'aaa',
    'bbb',
    'ccc',
    'ddd',
    'eee',
}
```

$ PYTHONHASHSEED=0 python -m compileall --invalidation-mode checked-hash test_compile.py
$ mv __pycache__ __pycache__1
$ PYTHONHASHSEED=1 python -m compileall --invalidation-mode checked-hash test_compile.py
$ diff __pycache__/test_compile.cpython-39.pyc __pycache__1/test_compile.cpython-39.pyc
Binary files __pycache__/test_compile.cpython-39.pyc and __pycache__1/test_compile.cpython-39.pyc differ
$ diff <(xxd __pycache__/test_compile.cpython-39.pyc) <(xxd __pycache__1/test_compile.cpython-39.pyc)
5,6c5,6
< 00000040: 005a 0362 6262 5a03 6464 645a 0361 6161 .Z.bbbZ.dddZ.aaa
< 00000050: 5a03 6363 635a 0365 6565 4e29 01da 0173 Z.cccZ.eeeN)...s
---
> 00000040: 005a 0361 6161 5a03 6363 635a 0364 6464 .Z.aaaZ.cccZ.ddd
> 00000050: 5a03 6565 655a 0362 6262 4e29 01da 0173 Z.eeeZ.bbbN)...s

I believe the issue is in the marshal module, particularly this line [1]. My simple fix was to create a list from the set, sort it, and iterate over that list instead.

[1] https://github.com/python/cpython/blob/00d7abd7ef588fc4ff0571c8579ab4aba8ada1c0/Python/marshal.c#L505
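For anyone who wants to see the underlying effect without inspecting `.pyc` files, here is a minimal sketch that prints the iteration order of the same set under two hash seeds. The helper name `iteration_order` is made up for illustration, and the sketch only shows set iteration order; that the marshalled bytes follow that order is my reading of the linked line, not something this script verifies.

```python
import ast
import os
import subprocess
import sys

# Same five string literals as in test_compile.py.
SNIPPET = "print(list({'aaa', 'bbb', 'ccc', 'ddd', 'eee'}))"


def iteration_order(seed):
    """Return the set's iteration order in a fresh interpreter
    started with PYTHONHASHSEED set to `seed` (hypothetical helper)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    # The child prints a list literal; parse it back into a list.
    return ast.literal_eval(result.stdout)


if __name__ == "__main__":
    for seed in (0, 1):
        print(seed, iteration_order(seed))
```

The same seed should always give the same order, while different seeds may or may not give different orders depending on the interpreter version.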
I just realized my fix is wrong: list.sort raises TypeError when the elements are of types that cannot be compared with each other. Similar to other reproducibility fixes, how does skipping the item randomization when SOURCE_DATE_EPOCH is set sound?
Never mind, AFAIK that randomization depends on the hash seed, correct? So the most viable option to me would be a sorting algorithm that takes the element type into account. Would that be an acceptable solution?
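To make the idea concrete, here is a rough Python sketch of what a type-aware ordering could look like. It is only an illustration: `type_aware_key` and `stable_order` are hypothetical names, the real change would have to live in C in marshal.c, and keying on `repr` is an assumption that breaks down for nested frozensets (whose repr is itself order-dependent) and for objects whose default repr includes a memory address.

```python
def type_aware_key(item):
    """Hypothetical sort key: group by type name, then by repr."""
    return (type(item).__qualname__, repr(item))


def stable_order(items):
    """Return the elements in an order that does not depend on hashing."""
    return sorted(items, key=type_aware_key)


mixed = {1, "a", 2.5, None, b"x"}

# A plain sort fails because these types cannot be compared with "<".
try:
    sorted(mixed)
except TypeError as exc:
    print("plain sort fails:", exc)

# The key compares only tuples of strings, so it never raises here.
print(stable_order(mixed))  # [None, b'x', 2.5, 1, 'a']
```

The key never compares the elements themselves, which is what lets it handle mixed types; the cost is that the resulting order is arbitrary rather than meaningful.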
Sorry for the spam, I am trying to figure out the best option here, which is hard to do by myself. IMO it would be reasonable to create set objects with elements in the order they appear in the code, instead of an order based on the hash. I am not really sure where the code responsible for this is, or whether there are any limitations preventing this from being implemented. So, my questions are: Would you consider this reasonable? Is there anything I am missing? If there are no issues, could someone point me to the relevant code?