[Python-Dev] Python Benchmarks
Steve Holden steve at holdenweb.com
Mon Jun 5 17:20:44 CEST 2006
M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
>> M.-A. Lemburg wrote:
>>> Seriously, I've been using and running pybench for years and even
>>> though tweaks to the interpreter do sometimes result in speedups or
>>> slow-downs where you wouldn't expect them (due to the interpreter
>>> using the Python objects), they are reproducible and often enough
>>> have uncovered that optimizations in one area may well result in
>>> slow-downs in other areas.
>>>
>>> Often enough the results are related to low-level features of the
>>> architecture you're using to run the code, such as cache size,
>>> cache lines, number of registers in the CPU or on the FPU stack,
>>> etc. etc.
>>
>> and that observation has never made you stop and think about whether
>> there might be some problem with the benchmarking approach you're
>> using?
>
> The approach pybench is using is as follows:
>
> * Run a calibration step which does the same as the actual test
>   without the operation being tested (i.e. call the function running
>   the test, set up the for-loop, constant variables, etc.)
>
>   The calibration step is run multiple times and is used to calculate
>   an average test overhead time.

I believe my recent changes now take the minimum time rather than
computing an average, since the minimum seems to be the best reflection
of achievable speed. I assumed that we wanted to measure achievable
speed rather than average speed as our benchmark of performance.
> * Run the actual test which runs the operation multiple times.
>   The test is then adjusted to make sure that the test overhead /
>   test run ratio remains within reasonable bounds. If needed, the
>   operation code is repeated verbatim in the for-loop, to decrease
>   the ratio.
>
> * Repeat the above for each test in the suite
>
> * Repeat the suite N number of rounds
>
> * Calculate the average run time of all test runs in all rounds.

Again, we are now using the minimum value. The reasons are similar: if
extraneous processes interfere with the timings, we don't want that
interference reflected in the reported figures. That's why we now report
a "notional minimum round time", since it's highly unlikely that any
single test round will give the minimum time for every test.
Even with these changes we still see some disturbing variations in timing both on Windows and on Unix-like platforms.
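To make the scheme above concrete, here is a minimal runnable sketch of
the idea, with made-up names (empty_pass, timed_pass and benchmark are
illustrations, not the actual pybench internals): a calibration pass
times the bare loop, the timed pass times the same loop with the
operation inside it, and the minimum over several rounds is reported.

    import time

    # Illustrative sketch only - not the actual pybench code. pybench of
    # this era used time.clock(); fall back to perf_counter() where
    # clock() no longer exists.
    clock = getattr(time, 'clock', None) or time.perf_counter

    def empty_pass(loops):
        # Calibration: time the bare loop and call overhead, without
        # the operation under test.
        t0 = clock()
        for _ in range(loops):
            pass
        return clock() - t0

    def timed_pass(loops, operation):
        # The real pass: the same loop, with the operation inside it.
        t0 = clock()
        for _ in range(loops):
            operation()
        return clock() - t0

    def benchmark(operation, loops=100000, rounds=10, calibration_runs=20):
        # Use the *minimum* observed overhead and round time, on the
        # theory that the minimum best reflects achievable speed.
        overhead = min(empty_pass(loops) for _ in range(calibration_runs))
        return min(timed_pass(loops, operation) - overhead
                   for _ in range(rounds))

    if __name__ == '__main__':
        print(benchmark(lambda: u'hello world'.upper()))

Taking the minimum rather than the mean is the "achievable speed"
argument in code form: scheduling noise can only ever add time to a
round, never subtract it.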
>> after all, if a change to e.g. the try/except code slows things down
>> or speeds things up, is it really reasonable to expect that the time
>> it takes to convert Unicode strings to uppercase should suddenly
>> change due to cache effects or a changing number of registers in the
>> CPU?  real hardware doesn't work that way...
>
> Of course, but then changes to try-except logic can interfere with
> the performance of setting up method calls. This is what pybench then
> uncovers.
>
> The only problem I see in the above approach is the way calibration
> is done. The run-time of the calibration code may be too small with
> respect to the resolution of the timers used.
>
> Again, please provide the parameters you've used to run the test case
> and the output. Things like warp factor, overhead, etc. could hint at
> the problem you're seeing.
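The timer-resolution concern is easy to check empirically. A rough
sketch (timer_granularity is a made-up helper, not pybench code) that
estimates the effective tick size of whatever clock is in use:

    import time

    # Rough illustration, not part of pybench.
    clock = getattr(time, 'clock', None) or time.perf_counter

    def timer_granularity(samples=100):
        # Spin until the reported time changes and record the smallest
        # observed jump; on a jiffy-based clock() this can be ~10ms,
        # which dwarfs a sub-millisecond calibration pass.
        ticks = []
        for _ in range(samples):
            t0 = clock()
            t1 = clock()
            while t1 == t0:
                t1 = clock()
            ticks.append(t1 - t0)
        return min(ticks)

    if __name__ == '__main__':
        print('effective timer resolution: %.6f seconds'
              % timer_granularity())

If the calibration pass completes within a tick or two, its measured
overhead is mostly quantisation noise.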
>> is PyBench perhaps using the following approach:
>>
>>     T = set of tests
>>     for N in range(number of test runs):
>>         for t in T:
>>             t0 = getprocesstime()
>>             t()
>>             t1 = getprocesstime()
>>             assign t1 - t0 to test t
>>     print assigned time
>>
>> where t1 - t0 is very short?
>
> See above (or the code in pybench.py). t1-t0 is usually around 20-50
> seconds:
>
> """ The tests must set .rounds to a value high enough to let the test
> run between 20-50 seconds. This is needed because clock()-timing only
> gives rather inaccurate values (on Linux, for example, it is accurate
> to a few hundredths of a second). If you don't want to wait that
> long, use a warp factor larger than 1. """

First, I'm not sure that this is the case for the default test
parameters on modern machines. On my current laptop, for example, I see
a round time of roughly four seconds and a notional minimum round time
of 3.663 seconds.
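Purely as an illustration, Fredrik's outline above corresponds to
something like the following runnable sketch (run_naively and the test
table are hypothetical, and getprocesstime here just stands for
whatever CPU or process timer is available); when each t() finishes
well within one tick of the clock, the per-test figures are dominated
by timer quantisation rather than by the code being measured:

    import time

    # Placeholder timer standing in for "getprocesstime" in the outline
    # above; this is an illustration, not pybench code.
    getprocesstime = getattr(time, 'clock', None) or time.perf_counter

    def run_naively(tests, runs=10):
        # Time each test individually per run; with a coarse timer and
        # fast tests, t1 - t0 is only a few ticks and the results are
        # unstable from run to run.
        assigned = {}
        for _ in range(runs):
            for name, t in tests.items():
                t0 = getprocesstime()
                t()
                t1 = getprocesstime()
                assigned.setdefault(name, []).append(t1 - t0)
        for name, times in sorted(assigned.items()):
            print('%-15s min=%.6f avg=%.6f'
                  % (name, min(times), sum(times) / len(times)))

    if __name__ == '__main__':
        run_naively({
            'unicode upper': lambda: u'hello world'.upper(),
            'empty call': lambda: None,
        })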
Secondly, while this recommendation may be very sensible, with 50 individual tests a decrease in the warp factor to 1 (the default is currently 20) isn't sufficient to raise individual test times to your recommended value, and decreasing the warp factor also tends to decrease reliability and repeatability.
Thirdly, since each round of the suite at warp factor 1 takes between 80 and 90 seconds, pybench run this way isn't something one can usefully use to quickly evaluate the impact of a single change - particularly since even continuing development work on the benchmark machine potentially affects the benchmark results in unknown ways.
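A back-of-the-envelope calculation (estimates based on the figures
above, not measured pybench output) shows how far the per-test times
fall short of the docstring's recommendation even at warp factor 1:

    # Rough arithmetic only, not measured output.
    tests_per_round = 50    # approximate number of individual tests
    round_time = 85.0       # seconds per round at warp 1 (80-90s observed)
    per_test = round_time / tests_per_round
    print('average time per individual test: ~%.1f seconds' % per_test)
    # => ~1.7 seconds, still far short of the 20-50 seconds per test
    #    that the pybench docstring recommends for accurate
    #    clock()-timing.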
>> that's not a very good idea, given how getprocesstime tends to be
>> implemented on current-era systems (google for "jiffies")... but it
>> definitely explains the bogus subtest results I'm seeing, and the
>> "magic hardware" behaviour you're seeing.
>
> That's exactly the reason why the tests run for a relatively long
> time - to minimize these effects.
>
> Of course, using wall time makes this approach vulnerable to other
> effects such as the current load of the system, other processes
> having a higher priority interfering with the timed process, etc.
>
> For this reason, I'm currently looking for ways to measure the
> process time on Windows.

I wish you luck with this search, as we clearly do need to improve the
repeatability of pybench results across all platforms, and particularly
on Windows.
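For what it's worth, one candidate on Windows is the Win32
GetProcessTimes() call, reachable from Python via ctypes. The sketch
below is only an illustration of that route (Windows only, minimal
error handling), not a statement of what pybench actually adopted:

    import ctypes
    from ctypes import wintypes

    # Illustrative sketch (Windows only); not necessarily what pybench
    # ended up using.
    kernel32 = ctypes.windll.kernel32
    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    kernel32.GetProcessTimes.restype = wintypes.BOOL
    kernel32.GetProcessTimes.argtypes = (
        [wintypes.HANDLE] + [ctypes.POINTER(wintypes.FILETIME)] * 4)

    def windows_process_time():
        # GetProcessTimes() reports kernel and user CPU time for the
        # current process in 100-nanosecond units, independent of what
        # other processes are doing to the wall clock.
        creation, exited, kernel, user = (wintypes.FILETIME()
                                          for _ in range(4))
        ok = kernel32.GetProcessTimes(kernel32.GetCurrentProcess(),
                                      ctypes.byref(creation),
                                      ctypes.byref(exited),
                                      ctypes.byref(kernel),
                                      ctypes.byref(user))
        if not ok:
            raise ctypes.WinError()
        def seconds(ft):
            return ((ft.dwHighDateTime << 32) | ft.dwLowDateTime) * 1e-7
        return seconds(kernel) + seconds(user)

    if __name__ == '__main__':
        print(windows_process_time())

User plus kernel CPU time is unaffected by other processes stealing the
CPU, although it is typically still quantised at the scheduler tick.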
regards
 Steve
Steve Holden                +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Love me, love my blog       http://holdenweb.blogspot.com
Recent Ramblings            http://del.icio.us/steve.holden