[Python-Dev] Memory bitmaps for the Python cyclic garbage collector (original) (raw)

Tim Peters tim.peters at gmail.com
Fri Sep 8 16:00:35 EDT 2017

Previous message (by thread): [Python-Dev] Memory bitmaps for the Python cyclic garbage collector
Next message (by thread): [Python-Dev] Memory bitmaps for the Python cyclic garbage collector
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Python objects that participate in cyclic GC (things like lists, dicts, sets but not strings, ints and floats) have extra memory overhead. I think it is possible to mostly eliminate this overhead. Also, while the GC is running, this GC state is mutated, which destroys copy-on-write optimizations. This change would mostly fix that issue.

All objects that participate in cyclic GC have the PyTPFLAGSHAVEGC bit set in their type. That causes an extra chunk of memory to be allocated before the obrefcnt struct member. This is the PyGCHead struct. The whole object looks like this in memory (PyObject pointer is at arrow): _union gchead *gcnext; _union gchead *gcprev; Pyssizet gcrefs; --> Pyssizet obrefcnt struct typeobject *obtype; [rest of PyObject members]

So, 24 bytes of overhead on a 64-bit machine. The smallest Python object that can have a pointer to another object (e.g. a single PyObject * member) is 48 bytes. Removing PyGCHead would cut the size of these objects in half. Carl Shaprio questioned me today on why we use a double linked-list and not the memory bitmap. I think the answer is that there is no good reason. We use a double linked list only due to historical constraints that are no longer present.

Since you wrote this code to begin with, it will come back to you ;-) that the real purpose of the doubly-linked lists is to partition (not just find) the tracked objects. Between collections, they're partitioned by generation, and within a collection equivalence classes are first merged (up through the oldest generation to be scanned in this run), and then temporarily partitioned internally in various ways (based on things like whether objects turn out to be reachable from outside, and whether they have finalizers). The linked list representation makes all the required operations cheap: iteration, merging classes, moving an object from one class to another, removing an object entirely while iterating over its equivalence class. Don't know whether all that can be done efficiently with a bitmap representation instead.

Long ago, Python objects could be allocated using the system malloc or other memory allocators. Since we could not control the memory location, bitmaps would be inefficient. Today, we allocate all Python objects via our own function. Python objects under a certain size are allocated using our own malloc, obmalloc, and are stored in memory blocks known "arenas".

The PyGCHead struct performs three functions. First, it allows the GC to find all Python objects that will be checked for cycles (i.e. follow the linked list).

As above, the set of tracked objects is partitioned into more than one linked list.

Second, it stores a single bit of information to let the GC know if it is safe to traverse the object, set with PyObjectGCTrack().

? An object is "tracked" if and only if it appears in some doubly-linked list. There is no bit set (or cleared) for this. Untracking an object removes it entirely from whichever linked list it's in (leaving it in no linked lists), and tracking an object consists of adding it to the "generation 0" linked list. Unless the code has changed a whole lot recently.

For clarity, the top N-1 bits of gc_refs (which you cover next) are also set to a special _PyGC_REFS_UNTRACKED constant when an object is untracked:

""" /* True if the object is currently tracked by the GC. */ #define _PyObject_GC_IS_TRACKED(o)
(_PyGC_REFS(o) != _PyGC_REFS_UNTRACKED) """

But I believe it could just as well check to see whether the object's gc_next is NULL.

Finally, it has a scratch area to compute the effective reference count while tracing refs (gcrefs).

As above, the top N-1 bits of that are also used between collections to record whether an object is tracked.

The least significant bit of gc_refs now (not back when you or I were mucking with this code) records whether the object has a finalizer that has already been run, and that state needs to be preserved across gc runs. So that's another bit that would need to be stored somewhere else.

Here is a sketch of how we can remove the PyGCHead struct for small objects (say less than 512 bytes). Large objects or objects created by a different memory allocator will still have the PyGCHead overhead.

* Have memory arenas that contain only objects with the PyTPFLAGSHAVEGC flag. Objects like ints, strings, etc will be in different arenas, not have bitmaps, not be looked at by the cyclic GC. * For those arenas, add a memory bitmap. The bitmap is a bit array that has a bit for each fixed size object in the arena. The memory used by the bitmap is a fraction of what is needed by PyGCHead. E.g. an arena that holds up to 1024 objects of 48 bytes in size would have a bitmap of 1024 bits.

If it's based on obmalloc code, an arena (256KB) can hold objects of all kinds of different sizes simultaneously - the sizes are the same only within a relatively small (4KB) "pool". I suspect this gets complicated.

* The bits will be set and cleared by PyObjectGCTrack/Untrack()

* We also need an array of Pyssizet to take over the job of gcrefs. That could be allocated only when GC is working and it only needs to be the size of the number of true bits in the bitmap. Or, it could be allocated when the arena is allocated and be sized for the full arena. * Objects that are too large would still get the PyGCHead struct allocated "in front" of the PyObject. Because they are big, the overhead is not so bad. * The GC process would work nearly the same as it does now. Rather than only traversing the linked list, we would also have to crawl over the GC object arenas, check blocks of memory that have the tracked bit set.

Well, as checks are performed now, objects are also moved around among various lists. Without the lists, some other machinery would be needed to represent the fast-changing partitioning.

There are a lot of smaller details to work out but I see no reason why the idea should not work. It should significantly reduce memory usage. Also, because the bitmap and gcrefs are contiguous in memory, locality will be improved. Łukasz Langa has mentioned that the current GC causes issues with copy-on-write memory in big applications. This change should solve that issue.

Better yet, convince Unixy systems to get rid of the abominable fork() ;-)

There's a related problem when fork() is used intentionally to share read-only data: memory use balloons as child processes access (but don't mutate!) the shared data. In that case, it's because Python does mutate the objects' refcount members under the covers, and so the OS ends up making fresh copies of the memory anyway. Heh.

To implement, I think the easiest path is to create new malloc to be used by small GC objects, e.g. gcmalloc.c. It would be similar to obmalloc but have the features needed to keep track of the bitmap. obmalloc has some quirks that makes it hard to use for this purpose. Once the idea is proven, gcmalloc could be merged or made to be a variation of obmalloc. Or, maybe just optimized and remain separate. obmalloc is complicated and highly optimized. So, adding additional functionality to it will be challenging.

The hardest thing obmalloc needs to do isn't allocating memory - while very detailed, that's straightforward. The hard part is for a free() or realloc() function to figure out efficiently where the heck its argument came from. All it's given is a raw memory address. Especially since:

I believe this change would be ABI compatible.

that can't change. free()/realloc() functions need to figure out which lowest-level allocation system "owns" the address passed to it. That's excruciating now with only two possible answers ("it's either the system malloc or Python's obmalloc"). Not much code is required - "excruciating" means the logic is exceedingly delicate, and enrages memory debuggers because it may branch based on reading uninitialized memory. It's probably not going to get less painful if a third possible answer appears ;-)

All that said, ya, it would be wonderful to cut the gc space overhead for small objects!

Previous message (by thread): [Python-Dev] Memory bitmaps for the Python cyclic garbage collector
Next message (by thread): [Python-Dev] Memory bitmaps for the Python cyclic garbage collector
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list