[Python-Dev] Optimization targets - refcount

Mike Pall mikepy-0404 at mike.de
Fri Apr 16 14:26:05 EDT 2004


Hi,

pje wrote:

I think the idea is more that you could skip the 'Py_INCREF(Py_None)', which is a fairly common prelude to 'return Py_None'. You'd set its refcount to the maximum possible value, and the deallocation function would simply reset the refcount to the maximum again.
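
To make that idea concrete, here is a rough sketch (hypothetical names, not actual CPython code) of the prelude that would go away and of an "immortal" None whose deallocator just restores the count:

#include <Python.h>
#include <limits.h>

/* The common prelude today -- the Py_INCREF is what the proposal skips: */
static PyObject *
some_extension_func(PyObject *self, PyObject *args)
{
    /* ... do the actual work ... */
    Py_INCREF(Py_None);
    return Py_None;
}

/* With an "immortal" None, a stray Py_DECREF could never free it: its
   deallocator would simply restore the huge refcount.  IMMORTAL_REFCNT is
   a hypothetical constant. */
#define IMMORTAL_REFCNT (INT_MAX / 2)

static void
none_dealloc(PyObject *op)
{
    op->ob_refcnt = IMMORTAL_REFCNT;    /* reset; the object is never freed */
}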

I'm not sure, however, that this would be common enough to be helpful. It seems to me Py_INCREF should effectively translate to only a single machine instruction or two. I mean, it's just incrementing an integer whose address is known at compile time.
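
For reference, in a non-debug build that is essentially all the macro does (Py_REF_DEBUG bookkeeping omitted, so this is only approximate):

/* Roughly what Py_INCREF expands to in a non-debug build. */
#define Py_INCREF(op) ((op)->ob_refcnt++)

/* Py_None is the address of the static _Py_NoneStruct, so Py_INCREF(Py_None)
   compiles to a single increment of a fixed address. */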

I don't think it would matter much performance-wise. A quick scan does not reveal a single Py_INCREF(Py_None) in any hot code section. But with more than 800 occurrences it would reduce the code size a bit (5-10K). So that alone is not a good reason to throw it out.

Checking for Py_None in Py_DECREF is counter-productive because that would need another branch. The existing branch is always taken with Py_None since the refcount never goes to zero. I.e. this is not an issue.
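
For illustration only (nobody is proposing this exact macro), a Py_DECREF that special-cased Py_None would have to look something like this, and the first comparison is the extra, always-executed branch:

/* Hypothetical variant with a Py_None check -- illustrative only. */
#define Py_DECREF_WITH_NONE_CHECK(op)                           \
    do {                                                        \
        if ((PyObject *)(op) != Py_None &&                      \
            --((PyObject *)(op))->ob_refcnt == 0)               \
            _Py_Dealloc((PyObject *)(op));                      \
    } while (0)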

But let's look at the general case. Here's a common fragment of inlined x86 code for Py_DECREF:

mov  -something(%ebp),%esi   ; odep %esi
mov  (%esi),%eax             ; idep %esi, odep %eax
dec  %eax                    ; idep %eax, odep %eax
test %eax,%eax               ; [idep %eax], odep flags
mov  %eax,(%esi)             ; [idep %eax %esi]
jne  .L1                     ; idep flags, predict

sub  $0xc,%esp
mov  0x4(%esi),%eax
push %esi
call *0x18(%eax)
add  $0x10,%esp
.L1:

There are at least 2 to 3 data dependencies that are hard to untangle. Tough luck.
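
For comparison, the C source the compiler is inlining here is roughly the following (refcount-debug bookkeeping omitted); the load, decrement, store, test and conditional call map directly onto the instructions above:

/* Roughly what Py_DECREF expands to in a non-debug build. */
#define Py_DECREF(op)                                \
    if (--(op)->ob_refcnt != 0)                      \
        ;                                            \
    else                                             \
        _Py_Dealloc((PyObject *)(op))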

The other problem is branch prediction. I've done a little analysis with gcov on ceval.c and found that the situation is not that bad. The Py_DECREF occurrences that account for 90% of the time are almost exclusively 0/100 or 100/0 cases. The remaining 10% have a few 30/70 or 50/50 cases. The folded and weighted average is around 4/96. The branch misprediction penalty accounts for less than 0.1% of the runtime.

But the deallocation itself is costly. On average the deallocator is called 15% of the time (I checked ceval.c only). That is a lot, and it accounts for up to 5% of the runtime (based on an overall profile). My guess is that allocation is a bit cheaper but adds another 2-3%. I think tuples are the worst offender (again only a good guess from my profiling experiments).

To summarize: I think we can optimize the GC a little more (have not done an in-depth analysis yet), but we really should focus on making Python less malloc-happy. Suggestions welcome!
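
One direction, sketched here with made-up names (a rough illustration, not a concrete patch): give hot object types a small free list, so that freeing parks the object for reuse instead of going back to the allocator, much like the existing free lists for ints and small tuples:

/* Sketch of a per-type free list (hypothetical names). */
#define MAX_FREE 64

static PyObject *free_list[MAX_FREE];
static int num_free = 0;

static PyObject *
alloc_hot_object(PyTypeObject *type)
{
    if (num_free > 0) {
        PyObject *op = free_list[--num_free];
        _Py_NewReference(op);                 /* refcount back to 1 */
        return op;
    }
    return PyObject_New(PyObject, type);      /* slow path: real allocation */
}

static void
free_hot_object(PyObject *op)
{
    if (num_free < MAX_FREE)
        free_list[num_free++] = op;           /* park for reuse */
    else
        PyObject_Del(op);                     /* free list full: really free */
}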

Bye, Mike


