[Python-Dev] Changing pymalloc behaviour for long running processes

Tim Peters tim.peters at gmail.com
Tue Oct 19 22:53:07 CEST 2004


[Evan Jones] ...

There is absolutely nothing I can do about that, however. On platforms that matter to me (Mac OS X, Linux) some number of large malloc() allocations are done via mmap(), and can be immediately released when free() is called. Hence, large blocks are reclaimable. I have no knowledge about the implementation of malloc() on Windows. Anyone care to enlighten me?

Not me, I'm too short on time. Memory pragmatics on Windows varies both across Windows flavors and MS C runtime releases, so it's not a simple topic. In practice, at least the NT+ flavors of Windows, under MS VC 6.0 and 7.1 + service packs, appear to do a reasonable job of releasing VM reservations when free() gives a large block back. I wouldn't worry about older Windows flavors anymore. The native Win32 API has many functions that could be used for fine control.

... I am not moving around Python objects, I'm just dealing with free pools and arenas in obmalloc.c at the moment.

Good.

There are two separate things I am doing:

1. Scan through the free pool list, and count the number of free pools in each arena. If an arena is completely unused, I free it. If there is even one pool in use, the arena cannot be freed.

Yup.
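For concreteness, the scan in step 1 might look something like the sketch below. The arena_info struct and its fields are made up for illustration; obmalloc's real bookkeeping is different, but the idea is the same: an arena whose pool-in-use count is zero is completely unused and can be handed back with free().

    #include <stdlib.h>

    /* Hypothetical, simplified bookkeeping -- not obmalloc's real structures.
       Each arena records how many of its pools hold at least one live block. */
    typedef struct arena_info {
        void *address;              /* base address returned by malloc() */
        unsigned int nusedpools;    /* pools with at least one live block */
    } arena_info;

    /* Walk the arena table and release every completely unused arena.
       Returns how many arenas were given back to the system. */
    static unsigned int
    release_empty_arenas(arena_info *arenas, unsigned int narenas)
    {
        unsigned int i, nfreed = 0;
        for (i = 0; i < narenas; i++) {
            if (arenas[i].address != NULL && arenas[i].nusedpools == 0) {
                free(arenas[i].address);    /* whole arena is unused */
                arenas[i].address = NULL;
                nfreed++;
            }
        }
        return nfreed;
    }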

2. Sorting the free pool list so that "nearly full" arenas are used before "nearly empty" arenas. Right now, when a pool becomes free, it is pushed onto the list. When one is needed, it is popped off. This leads to an LRU allocation of memory.

It's stack-like: it reuses the pool most recently emptied, because the expectation is that the most recently emptied pool is the most likely of all empty pools to be highest in the memory hierarchy. I really don't know what LRU (or MRU) might mean in this context (it's not like we're evicting something from a cache).

What I am doing is removing all the free pools from the list, and putting them back on so that arenas with more free pools are used later, while arenas with fewer free pools are used first.

That sounds reasonable.
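A sketch of that reordering, again with made-up structures: sort the free pool list ascending by the owning arena's free-pool count, so pools from nearly-full arenas are handed out first and nearly-empty arenas are left alone long enough to drain completely.

    #include <stdlib.h>

    /* Hypothetical, simplified view of an entry in the free pool list: the
       pool's address plus the owning arena's current count of free pools.
       Not obmalloc's real layout. */
    typedef struct free_pool {
        void *pool_addr;
        unsigned int arena_free_pools;  /* free pools in the owning arena */
    } free_pool;

    /* Ascending by the owning arena's free-pool count: few free pools means
       the arena is nearly full, so its pools are reused first. */
    static int
    by_arena_fullness(const void *a, const void *b)
    {
        const free_pool *pa = (const free_pool *)a;
        const free_pool *pb = (const free_pool *)b;
        if (pa->arena_free_pools < pb->arena_free_pools)
            return -1;
        if (pa->arena_free_pools > pb->arena_free_pools)
            return 1;
        return 0;
    }

    static void
    reorder_free_pools(free_pool *pools, size_t npools)
    {
        qsort(pools, npools, sizeof(free_pool), by_arena_fullness);
    }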

In my crude tests, the second detail increases the number of completely free arenas. However, I suspect that differentiating between free arenas and used arenas, as is already done for pools, would be a good idea.

Right.

...

Absolutely: I am not touching that. I'm working from the assumption that pymalloc has been well tested and well tuned and is appropriate for Python workloads. I'm just trying to make it free memory occasionally.

Harder than it looked, eh?

If the real point of this (whatever it is) is to identify free arenas, I expect that could be done a lot easier by keeping a count of allocated pools in each arena ...

You are correct, and this is something I would like to play with. This is, of course, a tradeoff between overhead on each allocation and deallocation,

It shouldn't be. Pool transitions among the "used", "full" and "empty" states don't occur on each alloc and dealloc. Note that PyObject_Free and PyObject_Malloc are both coded with the most frequent paths earliest in the function, and pool transitions don't occur until after a few return statements have passed. It's unusual not to get out via one of the "early returns"; the bulk of the code in each function (including pool transitions) isn't executed on most calls; in most calls, the affected pool both enters and leaves in the "used" state.
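To make the point about transitions concrete, here is a loose paraphrase of the shape of the free path, with made-up fields (the real PyObject_Free in obmalloc.c is considerably more involved): the common case exits via the first return with the pool still "used", and only the rare transitions further down would ever need to touch a per-arena counter.

    /* Hypothetical, simplified pool header -- not obmalloc's real layout. */
    typedef struct pool_header {
        unsigned int ref_count;   /* live blocks currently in this pool */
        unsigned int capacity;    /* total blocks this pool can hold    */
    } pool_header;

    static void
    sketch_free(pool_header *pool)
    {
        /* ... push the freed block on the pool's free list (elided) ... */

        --pool->ref_count;
        if (pool->ref_count != 0 && pool->ref_count != pool->capacity - 1) {
            /* Most frequent path: the pool stays "used"; nothing more to do. */
            return;
        }
        if (pool->ref_count == 0) {
            /* used -> empty transition (rare): the natural place to decrement
               an "allocated pools in this arena" counter. */
            return;
        }
        /* full -> used transition (also rare): relink the pool into the list
           of pools with room, so future allocations can find it. */
    }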

and one big occasional overhead caused by the "cleanup" process.

Or it may be small overhead, if all it's trying to do is free() empty arenas. Indeed, if arenas "grow states" too, arena transitions should be so rare that perhaps they could afford to do extra processing right then to decide whether to free() an arena that just transitioned to its notion of an empty state.
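A sketch of what that per-arena bookkeeping might look like, with hypothetical names: the counter is only touched at pool state transitions, and when it hits zero the arena has just transitioned to "empty" and could be released on the spot (or kept on a small reserve list, as a policy choice).

    #include <stdlib.h>

    /* Hypothetical per-arena state -- not obmalloc's real structures. */
    typedef struct arena_state {
        void *address;             /* base address from malloc() */
        unsigned int nusedpools;   /* pools with at least one live block */
    } arena_state;

    /* Called on a pool's empty -> used transition. */
    static void
    arena_note_pool_used(arena_state *arena)
    {
        arena->nusedpools++;
    }

    /* Called on a pool's used -> empty transition. */
    static void
    arena_note_pool_empty(arena_state *arena)
    {
        if (--arena->nusedpools == 0 && arena->address != NULL) {
            /* The arena just became completely unused: release it now. */
            free(arena->address);
            arena->address = NULL;
        }
    }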

...

Let me just make sure I am clear on this: Some extensions use native threads,

By extension module I mean a module coded in C; and yes, any extension module that uses threads is probably using native threads.

is that why this is a problem?

No, threads aren't the problem, in the sense that an alcoholic's problem isn't really alcohol, it's drinking <0.7 wink>. The problem is incorrect usage of the Python C API, and the most dangerous problem there is that old code may be calling PyMem_{Free, FREE, Del, DEL} while not holding the GIL. "Everyone always knew" that PyMem_{Free, FREE, Del, DEL} was just an irritating way to spell "free()", so some old code didn't worry about the GIL when calling it. Such code is fatally broken, but we're still trying to support it (or rather we were, when obmalloc was new; now it's still "supported" just in the sense that the excruciating support code still exists).

The other twist is that we couldn't map PyMem_{Free, FREE, Del, DEL} to the system free() directly (which would have solved the problem just above), because other broken old code called PyMem_{Free, FREE, Del, DEL} to release an object obtained via PyObject_New(). We're still supporting that too, but again just in the sense that the convolutions to support it still exist.

If we changed PyMem_{Free, FREE, Del, DEL} to map to the system free(), all would be golden (except for broken old code mixing PyObject_ with PyMem_ calls). If any such broken code still exists, that remapping would lead to dramatic failures, easy to reproduce; and old code broken in the other, infinitely more subtle way (calling PyMem_{Free, FREE, Del, DEL} when not holding the GIL) would continue to work fine.
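The idea only, not the exact header text (the PyMem_/PyObject_ names are real, the LEGACY_COMPAT_MAPPING switch is invented for illustration): today the PyMem_ release spellings resolve to PyObject_ functions for the sake of the broken callers described above; the change described here would route them straight to the platform allocator, so obmalloc never sees a GIL-less caller at all.

    #ifdef LEGACY_COMPAT_MAPPING
    /* Current compatibility mapping: the PyMem_ release spellings resolve to
       the object allocator, which is why obmalloc has to tolerate callers
       that don't hold the GIL and pointers that really came from
       PyObject_New(). */
    #define PyMem_Free(p)   PyObject_Free(p)
    #define PyMem_Del(p)    PyObject_Free(p)
    #else
    /* The proposed mapping: send them to the system allocator, so obmalloc's
       thread assumptions are never violated by these calls. */
    #define PyMem_Free(p)   free(p)
    #define PyMem_Del(p)    free(p)
    #endif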

Because as far as I am aware, the Python interpreter itself is not threaded.

Unsure what that means to you. Any number of threads can be running Python code in a single process, although the GIL serializes their execution while they're executing Python code. When a thread ends up in C code, it's up to the C code to decide whether to release the GIL and so allow other threads to run at the same time. If it does, that thread must reacquire the GIL before making another Python C API call (with very few exceptions, related to Python C API thread initialization and teardown functions).
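For anyone unfamiliar with that convention, a minimal sketch of an extension function doing it (the function and its purpose are invented; Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS are the standard macros for releasing and reacquiring the GIL around pure-C work):

    #include <Python.h>

    static PyObject *
    example_compute(PyObject *self, PyObject *args)
    {
        long n, i, result = 0;

        if (!PyArg_ParseTuple(args, "l", &n))
            return NULL;

        Py_BEGIN_ALLOW_THREADS          /* release the GIL: other threads may run */
        for (i = 0; i < n; i++)         /* pure C work: no Python C API calls here */
            result += i;
        Py_END_ALLOW_THREADS            /* reacquire the GIL */

        return PyLong_FromLong(result); /* back under the GIL: safe to call the API */
    }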

So how does the cyclical garbage collector work?

The same as every other part of Python's C implementation, except for this crazy exception in obmalloc: it assumes the GIL is held, and that no other thread can make a Python C API call until the GIL is released. Note that this doesn't necessarily mean that cyclic gc can assume that no other thread can run Python code until cyclic gc is done. Because gc may trigger destructors that in turn execute Python code (__del__ methods or weakref callbacks), it's all but certain other threads can run at such times (invoking Python code ends up in the interpreter main loop, which releases the GIL periodically to allow other threads to run).

obmalloc doesn't have that problem, though -- nothing obmalloc does can cause Python code to get executed, so obmalloc can assume that the thread calling into it holds the GIL for as long as obmalloc wants. Except, again, for the crazy PyMem_{Free, FREE, Del, DEL} exception.

Doesn't it require that there is no execution going on?

As above.

All such insane uses have now been officially deprecated, so you could be bold and just assume obmalloc is always entered by a thread holding the GIL.

I would rather not break this property of obmalloc.

I would -- it's backward compatibility hacks for insane code, which may not even exist anymore, and you'll find that it puts severe constraints on what you can do.

However, this leads to a big problem: I'm not sure it is possible to have an occasional cleanup task be lockless and co-operate nicely with other threads, since by definition it needs to go and mess with all the arenas. One of the reasons that obmalloc doesn't have this problem is that it never releases memory.

Yes, but that's backwards: obmalloc never releases memory in part because of this thread problem. Indeed, when new_arena() has to grow the vector of arena base addresses, it doesn't realloc(), it makes a copy into a new memory area, and deliberately lets the old vector leak. That's solely because some broken PyMem_{Free, FREE, Del, DEL} call may be simultaneously trying to access the vector, and without locking it's plain impossible to know whether or when that occurs. You'll have an equally impossible time trying to change the content of the arena base vector in virtually any way -- heck, we've got 40 lines of comments now just trying to explain what it took to support appending new values safely (and that's the only kind of mutation done on that vector now).
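A sketch of that copy-and-leak trick with made-up names (it deliberately glosses over the memory-ordering subtleties those 40 lines of comments are about): build the larger vector privately, publish it by swapping the base pointer, and never free or modify the old one, because a racing GIL-less reader might still be looking at it.

    #include <stdlib.h>
    #include <string.h>

    static void **arena_bases = NULL;  /* published vector of arena base addresses */
    static size_t narenas = 0;         /* slots in use */
    static size_t maxarenas = 0;       /* slots allocated */

    static int
    append_arena_base(void *base)
    {
        if (narenas == maxarenas) {
            size_t newmax = maxarenas ? maxarenas * 2 : 16;
            void **newvec = (void **)malloc(newmax * sizeof(void *));
            if (newvec == NULL)
                return -1;
            if (arena_bases != NULL)
                memcpy(newvec, arena_bases, narenas * sizeof(void *));
            /* Publish the new vector.  The old one is deliberately leaked:
               without a lock there is no way to know when a racing reader
               is done with it. */
            arena_bases = newvec;
            maxarenas = newmax;
        }
        arena_bases[narenas] = base;
        /* Bump the count only after the slot is filled, so a racing reader
           never sees an index pointing at an uninitialized entry. */
        narenas++;
        return 0;
    }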

Change PyMem_{Free, FREE, Del, DEL} to stop resolving to PyObject_ functions, and all that pain can go away -- obmalloc could then do anything it wanted to do without any thread worries.

It's only a waste if it ultimately fails.

It is also a waste if the core Python developers decide it is a bad idea, and don't want to accept patches! :)

Sad to say, it's more likely that making time to review patches will be the bottleneck, and in this area careful review is essential. It's great that you can make some time for this now -- be optimistic!


