[Python-Dev] pymalloc killer
Tim Peters tim.one@comcast.net
Thu, 04 Apr 2002 22:14:19 -0500
> i.e., all references to pool->capacity could be replaced by capacity[pool->szidx], where "capacity" is a new small (at most 32 uints) file-static const vector computed at compile-time.
[Marangozov, Vladimir (Vladimir) -- I hope you're noticing that your name as reported by your email headers is getting stranger all the time.]
> The comparison in the "try to extend the free list" code is executed very frequently, so probing a static vector this way will slow down the thing noticeably. I've tried that. We are very much running after speed here.
Gotcha. I was left partly unhappy with the "bulletproof address check" for the same reason (needing to index into a static vector of arena base addresses).
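For the record, the rejected flavor would have looked something like the standalone toy below -- the constants are stand-ins, not the real obmalloc values, and the real table would be built at compile time rather than at startup:

    #include <stdio.h>

    /* Hypothetical stand-ins for the real obmalloc constants. */
    #define ALIGNMENT         8
    #define NB_SMALL_CLASSES  32
    #define POOL_SIZE         4096
    #define POOL_OVERHEAD     32

    /* szidx -> number of blocks a fresh pool of that class can hold.
     * Filled in at startup here; the proposal was a file-static const
     * table computed at compile time. */
    static unsigned int capacity[NB_SMALL_CLASSES];

    static void
    init_capacity(void)
    {
        unsigned int i;
        for (i = 0; i < NB_SMALL_CLASSES; i++) {
            unsigned int size = (i + 1) * ALIGNMENT;
            capacity[i] = (POOL_SIZE - POOL_OVERHEAD) / size;
        }
    }

    int
    main(void)
    {
        init_capacity();
        /* The allocator would then test its block count against
         * capacity[pool->szidx] instead of a pool->capacity field --
         * one extra memory read, right in the hot path. */
        printf("szidx 0 (8-byte blocks): %u blocks per pool\n", capacity[0]);
        return 0;
    }

That extra indexed load on every "extend the free list" test is exactly the cost being objected to above.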
> ... [on "the division" and "the multiplication" optimizations]
> Do both, despite that the multiplication optimization is more important.
> (for one division we have #blocks multiplications for a full pool)
Two things have shifted since you wrote this code:
1. The architecture trend has been to make division ever slower in comparison to multiplication (e.g., Itanium doesn't even have an integer division instruction); on some boxes integer division is over 100 cycles now, while integer multiplication is down to a few.
2. We can fit fewer objects into a pool now. For example, on Windows, Python container types are 16 bytes bigger than they used to be, due to gc-header overhead (and an empty dict is actually up to 148 bytes on Windows now!).
We could shift that back by making pools larger, but if they exceed the size of a system page then cutting a legit pointer back to a pool-aligned address may yield an invalid address. So it gets dicey short of teaching Python how to ask the OS directly for pages.
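To make the page-size worry concrete: pymalloc finds a block's pool by masking the address, roughly as in the toy below (illustrative constants only). That probe is only safe to dereference because the masked address stays inside the page that already contains the pointer; let pools span more than one page and the masked address can land in memory that was never mapped.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative: assume pools are exactly one system page. */
    #define SYSTEM_PAGE_SIZE  4096u
    #define POOL_SIZE         SYSTEM_PAGE_SIZE
    #define POOL_SIZE_MASK    (POOL_SIZE - 1)

    int
    main(void)
    {
        /* A pretend block address handed back to the allocator. */
        uintptr_t p = 0x20a31c8u;

        /* Cut it back to the (putative) pool base.  With POOL_SIZE no
         * bigger than a page, this stays within the page containing p,
         * which must be mapped because p itself is a valid pointer.
         * With multi-page pools, no such guarantee. */
        uintptr_t pool = p & ~(uintptr_t)POOL_SIZE_MASK;

        printf("block 0x%lx -> pool base 0x%lx\n",
               (unsigned long)p, (unsigned long)pool);
        return 0;
    }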
> 1) I think that adding back one uint slot (lastoffset) is not a problem (and we end up with the perfect balance of having a pool header layout with 4 pointers and 4 integers).
Yes, that is perfect.
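For concreteness, a sketch of that 4-pointer/4-uint shape -- the names and the union are only meant to suggest the layout, not the code as it will actually land:

    #include <stdio.h>

    typedef unsigned char block;
    typedef unsigned int uint;

    /* Sketch of the balanced pool header: four pointer-sized slots
     * followed by four uints.  Illustrative only. */
    struct pool_header {
        union { block *padding; uint count; } ref; /* blocks in use        */
        block *freeblock;                /* head of this pool's free list  */
        struct pool_header *nextpool;    /* next pool of this size class   */
        struct pool_header *prevpool;    /* previous pool of this class    */
        uint arenaindex;                 /* which arena this pool lives in */
        uint szidx;                      /* size class index               */
        uint lastoffset;                 /* the uint slot being added back */
        uint highoffset;                 /* what 'capacity' turns into     */
    };

    int
    main(void)
    {
        printf("pool_header: %zu bytes\n", sizeof(struct pool_header));
        return 0;
    }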
> 2) Now, with the advent of lastoffset, we can easily get rid of the division as well. For this, we switch to offset calculus for figuring out whether the free list can be extended. This means that 'capacity' becomes 'highoffset'. The 'highoffset' is the highwater mark which 'lastoffset' must not exceed by more than 'size' bytes.
>
> a) the test becomes then:
>
>        if (pool->lastoffset <= pool->highoffset)
>            ...
>            pool->lastoffset += size;
>            pool->freeblock = (block *)pool + pool->lastoffset;
>
> b) in the poolinit code:
>
>        pool->lastoffset = POOL_OVERHEAD + size;
>        pool->highoffset = POOL_SIZE - size - size;
>
> and you'll need to adjust the code accordingly at the places where you're using 'capacity' per se (and adapt the comments, of course).
I'll do this, but I expect I'll use a nextoffset instead of a lastoffset, in order to let the two additions in 2a proceed in parallel, and to skip one of the subtracts in 2b. Hey, if we're gonna optimize, I'm not gonna miss a cycle.
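Roughly, and as a standalone toy with made-up numbers rather than the code that will actually go in, the nextoffset flavor comes out like so:

    #include <stdio.h>

    /* Toy constants standing in for the real ones. */
    #define POOL_SIZE      4096u
    #define POOL_OVERHEAD  32u

    int
    main(void)
    {
        unsigned int size = 48;            /* block size for this pool */

        /* Pool initialization, nextoffset-style: note just the one
         * subtract, versus the two in the highoffset formulation. */
        unsigned int nextoffset    = POOL_OVERHEAD;
        unsigned int maxnextoffset = POOL_SIZE - size;

        unsigned int nblocks = 0;

        /* Each time the free list runs dry, ask whether a never-used
         * block remains.  The two statements at the top of the body
         * don't depend on each other, so they can issue in parallel. */
        while (nextoffset <= maxnextoffset) {
            unsigned int freeblock_offset = nextoffset;  /* carve here */
            nextoffset += size;
            (void)freeblock_offset;        /* hand this block out      */
            ++nblocks;
        }

        /* The offset bookkeeping must agree with the old division-based
         * capacity; meanwhile the divide at pool init and the multiply
         * per carved block are both gone. */
        printf("carved %u blocks; (POOL_SIZE - POOL_OVERHEAD) / size = %u\n",
               nblocks, (POOL_SIZE - POOL_OVERHEAD) / size);
        return 0;
    }

With lastoffset the address computation has to wait for the increment; with nextoffset the two are independent, and init drops one of the two subtracts -- which is all the not-missing-a-cycle remark amounts to.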
> This should pretty much settle these murky optimizations for now.
Cool! Thanks for your help. You should come back to Python, you know -- we don't know what you're doing instead, but we all have to suspect it's comparatively worthless.