Issue 33597: Compact PyGC_Head - Python tracker (original) (raw)

Created on 2018-05-22 07:06 by methane, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Messages (34)

msg317262 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-05-22 07:06

Currently, PyGC_Head takes three words; gc_prev, gc_next, and gc_refcnt.

gc_refcnt is used when collecting, for trial deletion. gc_prev is used for tracking and untracking.

So if we can avoid tracking/untracking while trial deletion, gc_prev and gc_refcnt can share same memory space.

This idea reduces PyGC_Head size to two words.

msg317295 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-05-22 14:56

$ ./python -m perf compare_to master.json twogc.json -G --min-speed=2 Slower (3):

scimark_monte_carlo: 268 ms +- 9 ms -> 278 ms +- 8 ms: 1.04x slower (+4%)
fannkuch: 1.03 sec +- 0.02 sec -> 1.06 sec +- 0.02 sec: 1.03x slower (+3%)
spectral_norm: 285 ms +- 9 ms -> 291 ms +- 6 ms: 1.02x slower (+2%)

Faster (13):

unpickle_pure_python: 980 us +- 13 us -> 886 us +- 11 us: 1.11x faster (-10%)
pickle_dict: 71.9 us +- 6.8 us -> 67.2 us +- 0.4 us: 1.07x faster (-7%)
mako: 40.5 ms +- 1.1 ms -> 38.7 ms +- 0.4 ms: 1.05x faster (-5%)
scimark_lu: 396 ms +- 18 ms -> 381 ms +- 16 ms: 1.04x faster (-4%)
genshi_text: 89.3 ms +- 1.2 ms -> 86.3 ms +- 1.2 ms: 1.03x faster (-3%)
xml_etree_generate: 270 ms +- 5 ms -> 262 ms +- 5 ms: 1.03x faster (-3%)
regex_dna: 286 ms +- 1 ms -> 279 ms +- 1 ms: 1.03x faster (-3%)
scimark_sor: 511 ms +- 6 ms -> 497 ms +- 10 ms: 1.03x faster (-3%)
xml_etree_iterparse: 220 ms +- 6 ms -> 214 ms +- 5 ms: 1.03x faster (-3%)
xml_etree_process: 231 ms +- 4 ms -> 225 ms +- 4 ms: 1.02x faster (-2%)
genshi_xml: 191 ms +- 2 ms -> 187 ms +- 2 ms: 1.02x faster (-2%)
unpack_sequence: 131 ns +- 2 ns -> 129 ns +- 2 ns: 1.02x faster (-2%)
richards: 180 ms +- 4 ms -> 176 ms +- 2 ms: 1.02x faster (-2%)

Benchmark hidden because not significant (44)

msg317297 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-05-22 15:01

Interesting. Do you have any comparisons on memory footprint too?

msg317298 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2018-05-22 15:31

This is an interesting idea.

The other problem with the garbage collecting is that it modifies the memory of all collectable objects. This leads to deduplicating virtually all memory blocks after the fork, even if these objects are not used in the child.

If gc_refcnt is used only when collecting, what if allocate a linear array for them for that time? This will save less memory (only one word per object in the peak), but avoid modifying the memory of not collected objects (except pointers at the ends of generation lists and neighbors of collected objects).

msg317299 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-05-22 15:31

$ ./python-gc -c 'import asyncio,sys; sys._debugmallocstats()'

master:

bytes in allocated blocks = 4,011,368

bytes in available blocks = 136,640

50 unused pools * 4096 bytes = 204,800

bytes lost to pool headers = 49,824

bytes lost to quantization = 53,816

bytes lost to arena alignment = 0

Total = 4,456,448

patched:

bytes in allocated blocks = 3,852,432

bytes in available blocks = 132,664

27 unused pools * 4096 bytes = 110,592

bytes lost to pool headers = 47,856

bytes lost to quantization = 50,760

bytes lost to arena alignment = 0

Total = 4,194,304

msg317301 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-05-22 15:49

@Serhiy

php implemented similar idea recently. https://react-etc.net/entry/improvements-to-garbage-collection-gc-php-7-3-boosts-performance-in-benchmark

In short, each tracked object have only "index" of GC struct, not "pointer". GC struct is in array and it can be resized.

I tried to copy it, but there are some challenges:

_PyObject_GC_TRACK() will resize GC array and cause MemoryError. It's not API compatible.
php's GC is not generational. This design may slow down moving objects between generation.
We need one word (index) for object header and two words (refcnt, pointer to the object) for GC struct. It means we can reduce memory footprint only for untracked dicts and tuples.

And this is my first time GC hack. So I gave up PHP way and choose easier way.

Anyway, we have gc.freeze() now which can be used for avoid CoW after fork.

msg317302 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-05-22 15:49

Le 22/05/2018 à 17:31, INADA Naoki a écrit :

INADA Naoki <songofacandy@gmail.com> added the comment:

$ ./python-gc -c 'import asyncio,sys; sys._debugmallocstats()'

Thanks. You can also collect peak memory stats during the benchmark suite, though the numbers may be approximate.

msg317306 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-05-22 16:20

In Doc folder:

make clean make PYTHON=../python venv /usr/bin/time make html

master:

113.15user 0.41system 1:55.46elapsed 98%CPU (0avgtext+0avgdata 205472maxresident)k 18800inputs+223544outputs (1major+66066minor)pagefaults 0swaps

111.07user 0.44system 1:51.72elapsed 99%CPU (0avgtext+0avgdata 205052maxresident)k 0inputs+223376outputs (0major+65855minor)pagefaults 0swaps

patched:

109.92user 0.44system 1:50.43elapsed 99%CPU (0avgtext+0avgdata 195832maxresident)k 0inputs+223376outputs (0major+63206minor)pagefaults 0swaps

110.70user 0.40system 1:51.50elapsed 99%CPU (0avgtext+0avgdata 195516maxresident)k 0inputs+223376outputs (0major+62723minor)pagefaults 0swaps

It seems reduced 5% memory footprint, and performance difference is very small.

msg317313 - (view)

Author: Yury Selivanov (yselivanov) * (Python committer)

Date: 2018-05-22 17:41

This is such a great idea. +1 from me.

msg318027 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-05-29 13:58

Are you sure that all memory allocators align at least on 8 bytes (to give up 3 unused bits)? I don't know the answer, it's an open question.

msg318028 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-05-29 14:00

Are you sure that all memory allocators align at least on 8 bytes (to give up 3 unused bits)?

If they don't then a simple double array will end up unaligned. It's not impossible but extremely unlikely.

msg318051 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-05-29 15:56

Ok, I ran a subset of the benchmarks to record their memory footprint and got these results:

master-mem.perf

Performance version: 0.6.2 Python version: 3.8.0a0 (64-bit) revision 73cbe7a Report on Linux-4.15.0-22-generic-x86_64-with-glibc2.9 Number of logical CPUs: 16 Start date: 2018-05-29 16:54:24.964539 End date: 2018-05-29 17:04:05.608437

methane-mem.perf

Performance version: 0.6.2 Python version: 3.8.0a0 (64-bit) revision 16c66ca Report on Linux-4.15.0-22-generic-x86_64-with-glibc2.9 Number of logical CPUs: 16 Start date: 2018-05-29 17:15:00.936857 End date: 2018-05-29 17:24:33.755119

+------------------------+-----------------+------------------+---------------+------------------------+ | Benchmark | master-mem.perf | methane-mem.perf | Change | Significance | +========================+=================+==================+===============+========================+ | 2to3 | 7121.8 kB | 6944.4 kB | 1.03x smaller | Significant (t=31.59) | +------------------------+-----------------+------------------+---------------+------------------------+ | chameleon | 15.7 MB | 15.3 MB | 1.02x smaller | Significant (t=24.25) | +------------------------+-----------------+------------------+---------------+------------------------+ | html5lib | 20.1 MB | 19.4 MB | 1.04x smaller | Significant (t=67.15) | +------------------------+-----------------+------------------+---------------+------------------------+ | json_dumps | 8282.2 kB | 8021.4 kB | 1.03x smaller | Significant (t=234.01) | +------------------------+-----------------+------------------+---------------+------------------------+ | json_loads | 7671.2 kB | 7494.0 kB | 1.02x smaller | Significant (t=37.43) | +------------------------+-----------------+------------------+---------------+------------------------+ | pickle | 7883.4 kB | 7698.6 kB | 1.02x smaller | Significant (t=28.64) | +------------------------+-----------------+------------------+---------------+------------------------+ | pickle_pure_python | 7867.4 kB | 7680.0 kB | 1.02x smaller | Significant (t=48.64) | +------------------------+-----------------+------------------+---------------+------------------------+ | sqlalchemy_declarative | 18.4 MB | 17.9 MB | 1.03x smaller | Significant (t=132.47) | +------------------------+-----------------+------------------+---------------+------------------------+ | sqlalchemy_imperative | 17.8 MB | 17.3 MB | 1.03x smaller | Significant (t=128.94) | +------------------------+-----------------+------------------+---------------+------------------------+ | tornado_http | 24.5 MB | 24.0 MB | 1.02x smaller | Significant (t=7.99) | +------------------------+-----------------+------------------+---------------+------------------------+ | unpickle | 7883.2 kB | 7694.2 kB | 1.02x smaller | Significant (t=31.19) | +------------------------+-----------------+------------------+---------------+------------------------+ | unpickle_pure_python | 7891.6 kB | 7699.4 kB | 1.02x smaller | Significant (t=63.68) | +------------------------+-----------------+------------------+---------------+------------------------+ | xml_etree_generate | 12.0 MB | 11.5 MB | 1.04x smaller | Significant (t=26.07) | +------------------------+-----------------+------------------+---------------+------------------------+ | xml_etree_iterparse | 11.8 MB | 11.3 MB | 1.04x smaller | Significant (t=146.82) | +------------------------+-----------------+------------------+---------------+------------------------+ | xml_etree_parse | 11.6 MB | 11.2 MB | 1.03x smaller | Significant (t=116.08) | +------------------------+-----------------+------------------+---------------+------------------------+ | xml_etree_process | 12.3 MB | 12.0 MB | 1.02x smaller | Significant (t=49.04) | +------------------------+-----------------+------------------+---------------+------------------------+

msg318078 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-05-29 19:58

Ok, I ran a subset of the benchmarks to record their memory footprint and got these results:

I'm not sure that the code tracking the memory usage in performance works :-) It may be worth to double check the code. By the way, perf has a --tracemalloc option, but performance doesn't have it.

perf has two options: --track-memory and --tracemalloc, see the doc:

http://perf.readthedocs.io/en/latest/runner.html#misc

perf has different implementations to track the memory usage:

resource.getrusage(resource.RUSAGE_SELF) * 1024
Windows: GetProcessMemoryInfo()
tracemalloc: tracemalloc.get_traced_memory()[1]

In the 3 cases, perf saves the peak of the memory usage.

msg318081 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-05-29 20:04

I'm not sure that the code tracking the memory usage in performance works

Why wouldn't it? It certainly gives believable numbers (see above).

And it seems to defer to your own "perf" tool anyway.

perf has two options: --track-memory and --tracemalloc, see the doc:

I don't think tracemalloc is a good tool to compare memory usage, as it comes with its own overhead. Also it won't account for issues such as memory fragmentation.

In the 3 cases, perf saves the peak of the memory usage.

Well, yes, that's the point, thank you.

msg318083 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-05-29 20:17

Why wouldn't it? It certainly gives believable numbers (see above). And it seems to defer to your own "perf" tool anyway.

I wrote the code and I never seriously tested it, that's why I have doubts :-) Maybe it works, I just suggest to double check the code ;-)

msg318084 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-05-29 20:19

As I said, the code just defers to "perf". And you should have tested that :-)

I'm not really interested in checking it. All I know is that the very old code (inherited from Unladen Swallow) did memory tracking correctly. And I have no reason to disbelieve the numbers I posted.

msg318086 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-05-29 20:34

As I said, the code just defers to "perf". And you should have tested that :-)

--track-memory and --tracemalloc have no unit tests, it's in the perf TODO list ;-) Well, it was just a remark. I'm looking for new contributors for perf!

msg318152 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2018-05-30 04:17

Even with smaller benefit the idea looks worth to me. Is the PR already ready for review?

msg318154 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-05-30 05:31

Is the PR already ready for review?

I think so.

msg318162 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-05-30 10:14

I asked if this change breaks the stable ABI. Steve Dower replied:

"Looks like it breaks the 3.7 ABI, which is certainly not allowed at this time. But it’s not a limited API structure, so no problem for 3.8."

https://mail.python.org/pipermail/python-dev/2018-May/153745.html

I didn't understand the answer. It breaks the ABI but it doesn't break the API?

It seems like PyObject.ob_refcnt is part of the "Py_LIMITED_API" and so an extension module using the stable API/ABI can access directly the field with no function call. For example, Py_INCREF() modifies directly the field at the ABI level.

But PyGC_Head is a strange thing since it's stored "before" the usual PyObject* pointer, so fields starting at PyObject* address are not affected by this change, no?

Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems like the PR 7043 doesn't break the stable ABI.

Can someone please confirm my analysis?

msg318181 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-05-30 12:41

On Wed, May 30, 2018 at 7:14 PM STINNER Victor <report@bugs.python.org> wrote:

STINNER Victor <vstinner@redhat.com> added the comment:

I asked if this change breaks the stable ABI. Steve Dower replied:

"Looks like it breaks the 3.7 ABI, which is certainly not allowed at this time. But it’s not a limited API structure, so no problem for 3.8."

https://mail.python.org/pipermail/python-dev/2018-May/153745.html

I didn't understand the answer. It breaks the ABI but it doesn't break the API?

It breaks ABI, but it is not part of the "stable" ABI. So we can't commit it on 3.7, but we can commit it on 3.8.

It seems like PyObject.ob_refcnt is part of the "Py_LIMITED_API" and so an extension module using the stable API/ABI can access directly the field with no function call. For example, Py_INCREF() modifies directly the field at the ABI level.

But PyGC_Head is a strange thing since it's stored "before" the usual PyObject* pointer, so fields starting at PyObject* address are not affected by this change, no?

I think so.

Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems like the PR 7043 doesn't break the stable ABI.

s/from PyGC_Head/from PyObject/ I think so.

msg318186 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-05-30 13:47

"Hopefully, PyGC_Head seems to be excluded from PyGC_Head, and so it seems like the PR 7043 doesn't break the stable ABI."

Oops, I mean: PyGC_Head seems to be excluded from the Py_LIMITED_API.

msg318321 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-05-31 14:50

Here is a micro-benchmark of GC overhead:

before:

$ ./python -m timeit -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, tkinter.tix, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()" 100 loops, best of 5: 2.41 msec per loop

after:

So it's a 4% slowdown, but GC runs themselves are a minor fraction of usual programs' runtime, so I'm not sure that matters. Though it would be better to test on an actual GC-heavy application.

msg318354 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-06-01 00:58

https://github.com/python/cpython/pull/7043/commits/053111f321d792fb26f8be10abba24a980f3590f

I added one micro optimization. Although it is not relating to two-word-gc directly, it makes gc_collect faster than master on the microbench (without importing tkinter.tix, because no GUI on my Linux environment).

$ ./python -m perf timeit --compare-to ./python-master -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()" python-master: ..................... 1.66 ms +- 0.08 ms python: ..................... 1.58 ms +- 0.00 ms

Mean +- std dev: [python-master] 1.66 ms +- 0.08 ms -> [python] 1.58 ms +- 0.00 ms: 1.05x faster (-5%)

msg318362 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-06-01 06:21

Oops, this optimization broke trace module. I reverted a part of the optimization. Current benchmark is:

$ ./python-patched -m perf timeit --compare-to ./python-master -s "import gc, doctest, ftplib, asyncio, email, http.client, pydoc, pdb, fractions, decimal, difflib, textwrap, statistics, shutil, shelve, lzma, concurrent.futures, telnetlib, smtpd, trace, distutils, pkgutil, tabnanny, pickletools, dis, argparse" "gc.collect()" python-master: ..................... 1.63 ms +- 0.04 ms python-patched: ..................... 1.64 ms +- 0.01 ms

Mean +- std dev: [python-master] 1.63 ms +- 0.04 ms -> [python-patched] 1.64 ms +- 0.01 ms: 1.01x slower (+1%)

msg318366 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2018-06-01 07:03

There are also problems with func_name and func_qualname. func_name can be an instance of the str subtype and has dict. func_qualname seems can be of any type.

Even if they would be exact strings, excluding them from tp_traverse will break functions that calculate the total size of the memory consumed by the specified set of objects by calling sys.getsizeof() recursively.

msg318367 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-06-01 07:08

You're right. I'll revert the optimization completely...

msg318387 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-06-01 10:16

It seems like the PR 7043 has no big impact on performances. Sometimes, it's a little bit faster, sometimes it's a little bit slower. The trend is a little bit more in the "faster" side, but it's not very obvious.

I approved PR 7043. It shouldn't break the world, it might only break very specialized Python extensions touching very low level details of the Python implementation. The performance seems ok, but it reduces the memory footprint which is a very good thing!

msg318399 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-06-01 11:22

@Victor

Thanks for review.

Do you think buildbots for master branch are sound enough to commit this change? Or should I wait one more week? http://buildbot.python.org/all/#/grid?branch=master

msg318400 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-06-01 11:34

Do you think buildbots for master branch are sound enough to commit this change? Or should I wait one more week?

CIs on master are stable again. Since PyGC_Head is a key feature of Python, I would suggest you to wait at least a second approval of another core dev.

msg321366 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-07-10 08:19

New changeset 5ac9e6eee5ed18172d70d28cf438df0be4e3b83d by INADA Naoki in branch 'master': bpo-33597: Reduce PyGC_Head size (GH-7043) https://github.com/python/cpython/commit/5ac9e6eee5ed18172d70d28cf438df0be4e3b83d

msg321402 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-07-10 23:55

Would you mind to mention your optimization in What's New in Python 3.8? IHMO any enhancement of the memory footprint should be documented :-)

msg321415 - (view)

Author: Inada Naoki (methane) * (Python committer)

Date: 2018-07-11 08:43

New changeset d5c875bbf1ef9b24349d3243b1ffaab6a9d5a78e by INADA Naoki in branch 'master': bpo-33597: Add What's New for PyGC_Head (GH-8236) https://github.com/python/cpython/commit/d5c875bbf1ef9b24349d3243b1ffaab6a9d5a78e

msg321430 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2018-07-11 10:17

bpo-33597: Add What's New for PyGC_Head (GH-8236)

Thank you!

History

Date

User

Action

Args

2022-04-11 14:59:00

admin

set

github: 77778

2018-07-11 10:17:58

vstinner

set

messages: +

2018-07-11 08:43:00

methane

set

messages: +

2018-07-11 01:31:05

methane

set

pull_requests: + <pull%5Frequest7772>

2018-07-10 23:55:18

vstinner

set

messages: +

2018-07-10 08:20:16

methane

set

status: open -> closed
resolution: fixed
stage: patch review -> resolved

2018-07-10 08:19:58

methane

set

messages: +

2018-06-01 11:34:01

vstinner

set

messages: +

2018-06-01 11:22:25

methane

set

messages: +

2018-06-01 10:16:27

vstinner

set

messages: +

2018-06-01 07:08:55

methane

set

messages: +

2018-06-01 07:03:26

serhiy.storchaka

set

messages: +

2018-06-01 06:21:44

methane

set

messages: +

2018-06-01 00:58:19

methane

set

messages: +

2018-05-31 14:50:39

pitrou

set

messages: +

2018-05-30 13:47:00

vstinner

set

messages: +

2018-05-30 12:41:24

methane

set

messages: +

2018-05-30 10:14:28

vstinner

set

messages: +

2018-05-30 05:31:10

methane

set

messages: +

2018-05-30 04:17:38

serhiy.storchaka

set

messages: +

2018-05-29 20:34:08

vstinner

set

messages: +

2018-05-29 20:19:46

pitrou

set

messages: +

2018-05-29 20:17:19

vstinner

set

messages: +

2018-05-29 20:04:07

pitrou

set

messages: +

2018-05-29 19:58:11

vstinner

set

messages: +

2018-05-29 15:56:52

pitrou

set

messages: +

2018-05-29 14:00:17

pitrou

set

messages: +

2018-05-29 13:58:29

vstinner

set

nosy: + vstinner
messages: +

2018-05-29 13:36:47

eitan.adler

set

nosy: + eitan.adler

2018-05-22 17:41:55

yselivanov

set

nosy: + yselivanov
messages: +

2018-05-22 16:20:05

methane

set

messages: +

2018-05-22 15:49:19

pitrou

set

messages: +

2018-05-22 15:49:07

methane

set

messages: +

2018-05-22 15:31:37

methane

set

messages: +

2018-05-22 15:31:12

serhiy.storchaka

set

messages: +

2018-05-22 15:01:22

pitrou

set

messages: +

2018-05-22 14:56:55

methane

set

messages: +

2018-05-22 10:32:31

serhiy.storchaka

set

nosy: + serhiy.storchaka

2018-05-22 07:07:23

methane

set

keywords: + patch
stage: patch review
pull_requests: + <pull%5Frequest6682>

2018-05-22 07:06:22

methane

create