[Python-Dev] pymalloc killer (original) (raw)

Tim Peters tim.one@comcast.net
Fri, 29 Mar 2002 05:32:32 -0500


[Tim]

Recall that pymalloc delegates requests for "not small" blocks to the system malloc. This means that when pymalloc's free() is called, it has to figure out whether the address passed to it was one it allocated itself, or came from the system malloc. ... However, the question remained whether a hostile user could study the source code to mount a successful attack. Alas, it turns out that's easy.

There may be hope for this, and another strong reason for pursuing it. Guido pointed out that if pymalloc's free() could know for sure whether an address passed to it did or didn't come from the system malloc, then we could also (probably -- needs some more thought) continue to use the PyObject interface, without breaking most "illegal by fiat" old code, via also mapping PyMem_{Free, FREE, Del, DEL} to the pymalloc free (when pymalloc is enabled): then the various PyMem_NukeIt spellings would know whether to pass a piece of memory on to the system free().

So I have a variant of pymalloc that doesn't use magic cookies -- it knows "for sure", by maintaining a sorted (by address) contiguous vector of the base addresses of the (256KB each) "arenas" allocated by pymalloc. A given address A either does or doesn't satisfy

B <= A < B+256KB

for some base address B in this list. At worst, a binary search is needed to see whether any such B exists. For a typical 231 byte VM address space, we can get at most 231/218 == 213 arenas, so at worst this vector can grow to a measly 8K entries (typically far fewer), and 13 is a reasonable upper bound on the number of compares needed.

So it can't be a disaster, but neither is it dirt cheap. A promising candidate for saving grace: in tests so far, frees tend to hit the same arena many times in a row. By saving a search finger into the vector that remembers the last hit index, and checking that location first, I'm seeing hit rates over 85%. The two-sided range test at the finger index can be inlined, and the two compares it requires are exactly as many compares as the current magic-cookie tests do. So, most of the time, it may run as fast as now.

For an address pymalloc doesn't handle on its own, though, a full binary search is required to discover that it's not a pymalloc address. Leaving at least one magic-cookie gimmick in could slash that (if the magic cookie isn't there, we know for sure it's not a pymalloc address). So the test becomes something like

def PyMalloc_Free(thisaddr): # ... weed out NULL ...

# not NULL
if (addrs[finger] <= thisaddr < addrs[finger] + 256KB
    or (thisaddr_has_magic_cookie and binary_search_for(thisaddr)):
    it's a pymalloc addr
else
    it goes to the system free()

That gives a likely fast-path for each class of address, and should never err (provided we're passed legitimate addresses, and user code doesn't write out of bounds, etc).

A nasty subtlety: if PyMem_NukeIt is called when the GIL isn't held, it's going to take a lot of thought to see whether this can be made safe when called simultaneously by multiple threads; e.g., the finger can change "mid-stream" then, and, worse, some thread that does hold the GIL may legitimately cause a new arena to be allocated midstream, mutating the address vector during the search. This may be intractable enough to kill the idea of mapping PyMem_XXX to this.

A nasty trick: due to the wonders of unsigned arithmetic, the two-sided range test can be reduced to one compare:

(Py_uintptr)thisaddr - (Py_uintptr)addrs[finger] < 256KB

In other news, "it's obvious" that the small object threshhold is too small for 2.3. I'm still inclined to max it out.