[Python-Dev] Optimization targets (was: String hash function multiplier)

Mike Pall mikepy-0404 at mike.de
Wed Apr 14 17:49:45 EDT 2004


Hi,

Raymond wrote:

> It looks like the best bet is to try to speedup the code without changing the multiplier.

Indeed. And while you are at it, there are other optimizations that seem more promising:

I compiled a recent CVS Python with profiling and here is a list of the top CPU hogs (on a Pentium III, your mileage may vary):

pystone:

CPU%   Function Name
55.44  eval_frame
 7.30  lookdict_string
 4.34  PyFrame_New
 3.73  frame_dealloc
 1.73  vgetargs1
 1.65  PyDict_SetItem
 1.42  string_richcompare
 1.15  PyObject_GC_UnTrack
 1.11  PyObject_RichCompare
 1.08  PyInt_FromLong
 1.08  tupledealloc
 1.04  insertdict

parrotbench:

CPU%   Function Name
23.65  eval_frame
 8.68  l_divmod
 4.43  lookdict_string
 2.95  k_mul
 2.27  PyType_IsSubtype
 2.23  PyObject_Malloc
 2.09  x_add
 2.05  PyObject_Free
 2.05  tupledealloc

Arguably parrotbench is a bit unrepresentative here. And beware: due to automatic inlining of static functions, the real offender may be hidden (x_divmod is the hog, not l_divmod).

Anyway, this just confirms that the most important optimization targets are:

  1. eval_frame
  2. string keyed dictionaries
  3. frame handling

I think 3. needs an algorithmic approach (there was some discussion about it a few weeks ago), while 1. and 2. look like a code generation issue.

So I took a look at the machine code that GCC generates on x86 with -O3 for these: uh oh, not a pretty sight. The main problems are bad branch predictions and lack of inlining.

About branch prediction:

The worst offender is the main code path for eval_frame: it gets split across half a dozen segments. The code path for the non-argument bytecodes is in fact the slowest.

Similarly, the code for lookdict_string branches around like crazy. The most likely code path is not the fastest path.

One solution is to use likely()/unlikely() macros (using __builtin_expect). This is a good solution for small, isolated and performance-critical code. It does not work well for eval_frame, though (I tried).
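
For reference, here is a minimal sketch of such macros, following the common Linux-kernel-style spelling (the likely()/unlikely() names and the demo function are illustrative, not taken from the Python sources):

  /* Minimal sketch of branch-hint macros built on GCC's
   * __builtin_expect.  The names follow the Linux kernel convention;
   * nothing here is from the Python sources. */
  #if defined(__GNUC__)
  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)
  #else
  #define likely(x)   (x)
  #define unlikely(x) (x)
  #endif

  /* Illustrative use: keep the common case on the fall-through path
   * so the statically predicted branch is the hot one. */
  long checked_double(long n)
  {
      if (unlikely(n < 0))
          return -1;      /* rare error path, moved out of the way */
      return n * 2;       /* common case falls straight through */
  }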

But GCC has more to offer: read the man page entries for -fprofile-arcs and -fbranch-probabilities. Here is a short recipe:

Go to your Python source directory and do this:

$ mkdir build_profile
$ cd build_profile
$ ../configure   # add any options you may need
$ make
$ mv python python_orig

Edit the Makefile and add -fprofile-arcs to OPT.

$ rm Python/ceval.o
$ make
$ mv python python_profile

Run your favourite benchmark(s), but only invoke python once:

$ ./python_profile -c 'import test.pystone; test.pystone.main(loops=100000)'

Forget about the performance numbers the benchmark reports. Never use an executable compiled with profiling for comparison. But ... you should now have a new (binary) file called Python/ceval.da that contains the profiled branch probabilities.

Edit the Makefile and replace -fprofile-arcs with -fbranch-probabilities.

$ rm Python/ceval.o
$ make
$ mv python python_opt

Then compare the benchmarks:

$ ./python_orig -c 'import test.pystone; test.pystone.main(loops=100000)'
$ ./python_opt -c 'import test.pystone; test.pystone.main(loops=100000)'

On my machine I get a 10-15% speedup. But we only optimized ceval.c ...

So repeat the above steps, but now delete Python/ceval.o and Objects/*.o each time. Now I get about 20-25% speedup for pystone and 10% speedup for parrotbench! Not bad, huh?

Now the bad news: I don't know how to integrate this into the regular build process. So this is not an option for everyone, but ambitious packagers might want to take the trouble to do this by hand.

Oh, and of course neither pystone nor parrotbench is representative of the branch probabilities of any particular application. But for predicting eval_frame and lookdict_string, they are probably good enough.

On a related note: GCC uses random branch probabilities with -O when no probability information is present (no __builtin_expect hints or *.da files). Use -fno-guess-branch-probability if you want predictable timings on recompiles.

About inlining:
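
As noted above, the other main problem is lack of inlining: GCC's size heuristics can leave small, hot static functions out of line. To give the flavour, here is a minimal sketch of the standard GCC controls (a hedged illustration with made-up function names, not code from the Python sources):

  /* A plain static function: GCC may inline this on its own at -O3
   * (via -finline-functions), but only if its size heuristics agree. */
  static long add_one(long n)
  {
      return n + 1;
  }

  /* Force the issue: always_inline tells GCC to inline regardless of
   * its size heuristics. */
  static inline __attribute__((always_inline)) long add_two(long n)
  {
      return n + 2;
  }

  long demo(long n)
  {
      return add_one(add_two(n));
  }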

I bet there are more optimization opportunities like these, but I'm running out of time right now -- sorry.

Bye, Mike


