


On Fri, Apr 3, 2009 at 11:27, Antoine Pitrou <solipsis@pitrou.net> wrote:

> Thomas Wouters <thomas@python.org> writes:
> >
> > Pystone is pretty much a useless benchmark. If it measures anything, it's
> > the speed of the bytecode dispatcher (and it doesn't measure it
> > particularly well.) PyBench isn't any better, in my experience.

> I don't think pybench is useless. It gives a lot of performance data about
> crucial internal operations of the interpreter. It is of course very little
> real-world, but conversely makes you know immediately where a performance
> regression has happened. (by contrast, if you witness a regression in a
> high-level benchmark, you still have a lot of investigation to do to find out
> where exactly something bad happened)

Really? Have you tried it? I get at least 5% noise between runs without any changes. I have gotten results that include *negative* run times. And yes, I tried all the different settings for calibration runs and timing mechanisms. The tests in PyBench are not micro-benchmarks (they do way too much for that), they don't try to minimize overhead or noise, but they are also not representative of real-world code. That doesn't just mean "you can't infer the affected operation from the test name", but "you can't infer anything." You could just be looking at overhead that happened to get attributed differently between runs.

I have in the past written patches to Python that improved *every* micro-benchmark and *every* real-world measurement I made, except PyBench. Trying to pinpoint the slowdown invariably led to tests that did too much in the measurement loop, introduced too much noise in the "calibration" run, or just spent their time *in the measurement loop* doing setup and teardown of the test. Collin and Jeffrey have seen the exact same thing since starting work on Unladen Swallow.
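
To make concrete what I mean by "setup and teardown in the measurement loop", here's a rough sketch (not PyBench's actual code; the bench() helper and the attribute-lookup micro-operation are made up for illustration) of the calibrate-and-subtract pattern and why it can report negative times:

import time

def bench(op, setup, rounds=10, loops=100000):
    """Time op the way a pybench-style test does: run setup and loop
    overhead inside the measured region, then subtract a separately
    measured 'calibration' estimate of that overhead."""
    results = []
    for _ in range(rounds):
        # Measured run: loop overhead, setup and the operation itself.
        t0 = time.perf_counter()
        for _ in range(loops):
            data = setup()          # setup happens inside the timed loop
            op(data)
        measured = time.perf_counter() - t0

        # "Calibration" run: the same loop with the operation left out.
        t0 = time.perf_counter()
        for _ in range(loops):
            data = setup()
        overhead = time.perf_counter() - t0

        # When op() is cheap and the machine is noisy, overhead can come
        # out *larger* than measured, and the reported time goes negative.
        results.append(measured - overhead)
    return results

if __name__ == "__main__":
    class C(object):
        attr = 1
    obj = C()
    # Made-up micro-operation: a plain attribute lookup.
    times = bench(op=lambda o: o.attr, setup=lambda: obj)
    print(min(times), max(times))   # the spread between rounds is the noise

timeit avoids part of this by keeping setup out of the timed statement and by reporting several repeats (of which you take the minimum), which is one reason micro-benchmarks written with it come out much more stable.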

So, sure, it might be "useful" if you have 10% or more difference across the board, and if you don't have access to anything but pybench and pystone.
> Perhaps someone should start maintaining a suite of benchmarks, high-level and
> low-level; we currently have them all scattered around (pybench, pystone,
> stringbench, richard, iobench, and the various Unladen Swallow benchmarks; not
> to mention other third-party stuff that can be found in e.g. the Computer
> Language Shootout).

That's exactly what Collin proposed at the summits last week. Have you seen
http://code.google.com/p/unladen-swallow/wiki/Benchmarks ? Please feel free to
suggest more benchmarks to add :)

--
Thomas Wouters <thomas@python.org>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!