Issue 24085: large memory overhead when pyc is recompiled


Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: asottile, benjamin.peterson, brett.cannon, bukzor, geoffreyspear, georg.brandl, jonathan.underwood, methane, ncoghlan, pitrou, r.david.murray, serhiy.storchaka
Priority: normal Keywords:

Created on 2015-04-30 18:59 by bukzor, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name               Uploaded                    Description
repro.py                bukzor, 2015-04-30 18:59    repro.py, from the demo in description
repro2.py               asottile, 2015-05-01 15:59
anon_city_hoods.tar.gz  asottile, 2015-05-01 16:03
Messages (24)
msg242281 - (view) Author: Buck Evan (bukzor) * Date: 2015-04-30 18:59
In the attached example I show that there's a significant memory overhead present whenever a pre-compiled pyc is not present. This only occurs with more than 5225 objects (dictionaries in this case) allocated. At 13756 objects, the mysterious pyc overhead is 50% of memory usage. I've reproduced this issue in Python 2.6, 2.7, and 3.4; I imagine it's present in all CPythons.

```
$ python -c 'import repro'
16736
$ python -c 'import repro'
8964
$ python -c 'import repro'
8964
$ rm *.pyc; python -c 'import repro'
16740
$ rm *.pyc; python -c 'import repro'
16736
$ rm *.pyc; python -c 'import repro'
16740
```
msg242282 - (view) Author: Buck Evan (bukzor) * Date: 2015-04-30 19:01
Also, we've reproduced this on both Linux and OS X.
msg242284 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-04-30 19:34
This is transitory memory consumption. Once the source is compiled to bytecode, memory consumption falls back to its previous level. Do you care that much about it?
msg242296 - (view) Author: Anthony Sottile (asottile) * Date: 2015-05-01 00:47
Adding `import gc; gc.collect()` doesn't change the outcome afaict
msg242301 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-01 10:40
> Adding `import gc; gc.collect()` doesn't change the outcome afaict

Of course it doesn't. The memory has already been released. "ru_maxrss" is the maximum memory consumption during the whole process lifetime. Add the following at the end of your script (Linux):

```
import os, resource
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
with open("/proc/%d/status" % os.getpid(), "r") as f:
    for line in f:
        if line.split(':')[0] in ('VmHWM', 'VmRSS'):
            print(line.strip())
```

And you'll see that VmRSS has already fallen back to the same level as when the pyc is not recompiled (it's a little bit more, perhaps due to fragmentation):

```
$ rm -r __pycache__/; ./python -c "import repro"
19244
VmHWM:     19244 kB
VmRSS:     12444 kB
$ ./python -c "import repro"
12152
VmHWM:     12152 kB
VmRSS:     12152 kB
```

("VmHWM" - the HighWater Mark - is the same as ru_maxrss)
msg242324 - (view) Author: Anthony Sottile (asottile) * Date: 2015-05-01 14:37
I'm still seeing a very large difference:

```
asottile@work:/tmp$ python repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.pyc'>
72604
VmHWM:     72604 kB
VmRSS:     60900 kB
asottile@work:/tmp$ rm *.pyc; python repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.py'>
1077232
VmHWM:   1077232 kB
VmRSS:    218040 kB
```

This file is significantly larger than the one attached; not sure if it makes much of a difference.
msg242327 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-01 15:32
Which Python version is that? Can you try with 3.4 or 3.5? (Is it under GNU/Linux?)

> This file is significantly larger than the one attached, not sure
> if it makes much of a difference.

Python doesn't make a difference internally, but perhaps it has some impact on your OS' memory management.
msg242328 - (view) Author: Anthony Sottile (asottile) * Date: 2015-05-01 15:39
3.4 seems happier:

```
asottile@work:/tmp$ rm *.pyc; python3.4 repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.py'>
77472
VmHWM:     77472 kB
VmRSS:     65228 kB
asottile@work:/tmp$ python3.4 repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.py'>
77472
VmHWM:     77472 kB
VmRSS:     65232 kB
```

The nasty result above is from 2.7:

```
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
```

3.3 also seems to have the same exaggerated problem:

```
$ rm *.pyc -f; python3.3 repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.py'>
1112996
VmHWM:   1112996 kB
VmRSS:    133468 kB
asottile@work:/tmp$ python3.3 repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.py'>
81392
VmHWM:     81392 kB
VmRSS:     69304 kB
$ python3.3
Python 3.3.6 (default, Jan 28 2015, 17:27:09)
[GCC 4.8.2] on linux
```

So it seems the leaky behaviour was fixed at some point. Any idea what change fixed it, and is there a possibility of backporting it to 2.7?
msg242329 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-01 15:40
Note under 3.x, you need to "rm -r __pycache__", not "rm *.pyc", since the pyc files are now stored in the __pycache__ subdirectory.
msg242330 - (view) Author: Anthony Sottile (asottile) * Date: 2015-05-01 15:42
Ah, then 3.4 still has the problem:

```
$ rm -rf __pycache__/ *.pyc; python3.4 repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.py'>
1112892
VmHWM:   1112892 kB
VmRSS:    127196 kB
asottile@work:/tmp$ python3.4 repro.py
ready
<module 'city_hoods' from '/tmp/city_hoods.py'>
77468
VmHWM:     77468 kB
VmRSS:     65228 kB
```
msg242331 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-01 15:47
Is there any chance you can upload a script that's large enough to exhibit the problem? (perhaps with anonymized data if there's something sensitive in there)
msg242332 - (view) Author: Anthony Sottile (asottile) * Date: 2015-05-01 15:59
Attached is repro2.py (slightly different so my editor doesn't hate itself when editing the file). I'll attach the other file in another comment, since it seems I can only do one at a time.
msg242339 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-01 17:31
Ok, I can reproduce:

```
$ rm -r __pycache__/; ./python repro2.py
ready
<module 'anon_city_hoods' from '/home/antoine/cpython/opt/anon_city_hoods.py'>
1047656
VmHWM:   1047656 kB
VmRSS:     50660 kB
$ ./python repro2.py
ready
<module 'anon_city_hoods' from '/home/antoine/cpython/opt/anon_city_hoods.py'>
77480
VmHWM:     77480 kB
VmRSS:     15664 kB
```

My guess is that memory fragmentation prevents the RSS mark from dropping any further, though one cannot rule out the possibility of an actual memory leak.
msg242340 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-05-01 17:32
(by the way, my numbers are with Python 3.5 - the in-development version - on 64-bit Linux)
msg242351 - (view) Author: Buck Evan (bukzor) * Date: 2015-05-01 20:32
New data: the memory consumption seems to be in the compiler rather than the marshaller:

```
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
16032
$ python -c 'import repro'
16032
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
$ PYTHONDONTWRITEBYTECODE=1 python -c 'import repro'
8984
```

We were trying to use PYTHONDONTWRITEBYTECODE as a workaround to this issue, but it didn't help us because of this.
msg242379 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-02 05:29
Using PYTHONDONTWRITEBYTECODE is not a workaround, because it makes you incur the memory overhead unconditionally. The compiler needs more memory than the compiled data itself requires. If this is an issue, I suggest using a different representation for the data: JSON, pickle, or just marshal. It may also be faster. Try also CSV or a custom simple format if appropriate.
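The suggestion above can be sketched as follows. This is a hypothetical illustration (the filename and data shape are made up, not from the attached repro files): the large constant data is stored as JSON and loaded at runtime, so importing the module never runs the bytecode compiler over megabytes of literals.

```python
import json
import os
import tempfile

# Made-up stand-in for the kind of big literal dict that would
# otherwise live in a .py file and stress the compiler.
data = {"city_%d" % i: list(range(10)) for i in range(1000)}

path = os.path.join(tempfile.mkdtemp(), "city_hoods.json")

# Write once, e.g. as a build/deploy step:
with open(path, "w") as f:
    json.dump(data, f)

# At runtime, load instead of importing a giant .py module:
with open(path) as f:
    loaded = json.load(f)
```

marshal would likely be faster but its format is CPython-version-specific; JSON and pickle are the portable options.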
msg242583 - (view) Author: Buck Evan (bukzor) * Date: 2015-05-04 21:31
@serhiy.storchaka This is a very stable piece of a legacy code base, so we're not keen to refactor it so dramatically, although we could. We've worked around this issue by compiling pyc files ahead of time and taking extra care that they're preserved through deployment. This isn't blocking our 2.7 transition anymore.
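The ahead-of-time compilation step described above can be done with the stdlib compileall module. A minimal sketch, demonstrated on a throwaway directory (the module name and contents here are made up):

```shell
# Create a throwaway source tree to demonstrate on.
src=$(mktemp -d)
echo 'GREETING = "hello"' > "$src/mod.py"

# Pre-compile every .py under $src before deployment; -q keeps it quiet.
python3 -m compileall -q "$src"

# Python 3 stores the bytecode in a __pycache__ subdirectory;
# Python 2 writes mod.pyc next to mod.py instead.
ls "$src/__pycache__"
```

As the message notes, the compiled files must then survive the deployment process intact, or the compile-time spike comes back on first import.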
msg318505 - (view) Author: Jonathan G. Underwood (jonathan.underwood) Date: 2018-06-02 16:37
Seeing a very similar problem - very high memory usage during byte compilation. Consider the very simple code in a file:

```
def test_huge():
    try:
        huge = b'\0' * 0x100000000  # this allocates 4GB of memory!
    except MemoryError:
        print('OOM')
```

Running this sequence of commands shows that during byte compilation, 4 GB of memory is used. Presumably this is because of the `huge` object - note of course the function isn't actually executed.

```
valgrind --tool=massif python memdemo.py
ms_print massif.out.7591 | less
```

You'll need to replace 7591 with whatever process number valgrind reports. Is there any hope of fixing this? It's currently a problem for me when running tests on Travis, where the memory limit is 3GB. I had hoped to use a conditional like the above to skip tests that would require more memory than is available. However, the testing is killed before that point, simply because the byte compilation is causing an OOM.
msg318507 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-06-02 18:29
That's presumably due to the compile-time constant-expression optimization. Have you tried bytes(0x1000000)? I don't think that gets treated as a constant by the optimizer (but I could be wrong, since a bunch of things have been added to it lately).
msg318508 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-02 18:31
Jonathan, this is a different problem, and it is fixed in 3.6+ (see issue 21074).
msg318509 - (view) Author: Jonathan G. Underwood (jonathan.underwood) Date: 2018-06-02 18:45
Thanks to both Serhiy Storchaka and David Murray - indeed you're both correct: that is the issue in 21074, and the workaround from there of declaring a variable for the size fixes the problem.
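The workaround referenced above can be sketched like this (my reading of the issue-21074 behaviour, not a guaranteed fix on every version): binding the size to a local name keeps the multiplication out of the constant folder, so compiling the file no longer materializes the 4 GB object.

```python
# Workaround sketch: because `size` is a variable rather than a literal,
# the optimizer cannot fold b'\0' * size into one huge bytes constant
# at compile time; the allocation happens only if the function is
# actually called.

def test_huge():
    size = 0x100000000  # 4 GB
    try:
        huge = b'\0' * size
    except MemoryError:
        print('OOM')
```

Compiling this module is cheap; no multi-gigabyte constant ends up in the function's co_consts.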
msg320980 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-03 13:12
In the repro2 case, the unreturned memory is in glibc malloc; jemalloc mitigates this issue. There is some fragmentation in pymalloc, but I think it's at an acceptable level.

```
$ python3 -B repro2.py
ready
<module 'anon_city_hoods' from '/home/inada-n/anon_city_hoods.py'>
1079124
VmHWM:   1079124 kB
VmRSS:     83588 kB
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 python3 -B repro2.py
ready
<module 'anon_city_hoods' from '/home/inada-n/anon_city_hoods.py'>
1108424
VmHWM:   1108424 kB
VmRSS:     28140 kB
```
msg320981 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-03 13:26
Since anon_city_hoods has massive constants, compiler_add_const makes the dict larger and larger. It creates many large tuples too. I suspect this makes glibc malloc unhappy. Maybe we can improve pymalloc for medium and large objects by porting the strategy from jemalloc; it could be a good GSoC project. But I suggest closing this issue as "won't fix" for now.
msg320984 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-07-03 14:41
VmRSS for different versions:

```
       malloc      jemalloc
2.7:   237316 kB    90524 kB
3.4:    53888 kB    14768 kB
3.5:    51396 kB    14908 kB
3.6:    90692 kB    31776 kB
3.7:   130952 kB    28296 kB
3.8:   130284 kB    27644 kB
```
History
Date User Action Args
2022-04-11 14:58:16 admin set github: 68273
2018-07-27 09:47:19 methane set status: open -> closed; resolution: wont fix; stage: resolved
2018-07-03 14:41:22 serhiy.storchaka set messages: +
2018-07-03 13:26:01 methane set messages: +
2018-07-03 13:12:33 methane set nosy: + methane; messages: +
2018-06-02 18:45:38 jonathan.underwood set messages: +
2018-06-02 18:31:30 serhiy.storchaka set messages: +
2018-06-02 18:29:02 r.david.murray set nosy: + r.david.murray; messages: +
2018-06-02 16:37:12 jonathan.underwood set nosy: + jonathan.underwood; messages: +
2015-05-04 21:31:13 bukzor set messages: +
2015-05-02 05:29:37 serhiy.storchaka set nosy: + serhiy.storchaka; messages: +
2015-05-01 23:49:42 pitrou set nosy: + brett.cannon, georg.brandl, ncoghlan, benjamin.peterson
2015-05-01 20:32:11 bukzor set messages: +
2015-05-01 17:51:30 geoffreyspear set nosy: + geoffreyspear; type: resource usage; components: + Interpreter Core; versions: + Python 3.5
2015-05-01 17:32:41 pitrou set messages: +
2015-05-01 17:31:44 pitrou set messages: +
2015-05-01 16:03:09 asottile set files: + anon_city_hoods.tar.gz
2015-05-01 15:59:20 asottile set files: + repro2.py; messages: +
2015-05-01 15:47:39 pitrou set messages: +
2015-05-01 15:42:21 asottile set messages: +
2015-05-01 15:40:46 pitrou set messages: +
2015-05-01 15:39:09 asottile set messages: +
2015-05-01 15:32:39 pitrou set messages: +
2015-05-01 14:37:44 asottile set messages: +
2015-05-01 10:40:39 pitrou set messages: +
2015-05-01 00:47:14 asottile set nosy: + asottile; messages: +
2015-04-30 19:34:20 pitrou set nosy: + pitrou; messages: +
2015-04-30 19:01:31 bukzor set messages: +
2015-04-30 18:59:04 bukzor create