[Python-Dev] Micro-optimizations by adding special-case bytecodes? (original) (raw)

Ben Hoyt benhoyt at gmail.com
Wed Jun 28 09:50:10 EDT 2017

Previous message (by thread): [Python-Dev] Proposal for C++ metaclasses
Next message (by thread): [Python-Dev] Micro-optimizations by adding special-case bytecodes?
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I've now got a patch to add new COMPARE_IS_NONE and COMPARE_IS_NOT_NONE bytecodes to CPython to attempt to speed up the very common "is None" and "is not None" expressions. It's a very simple change overall. Here is my patch: https://gist.github.com/benhoyt/e5ba19afe7b869fd743c1c39fc2afdf8

Note: The intent here is a proof of concept, not a CPython-quality implementation. I've put the code to generate these in compile.c, though it may be better off in peephole.c. Also, I'm not sure two new bytecodes are needed -- it'd be simpler and probably almost as fast to have one bytecode instruction COMPARE_NONE with an argument that determines the "is" vs "is not" part.

Anyway, it looks like a nice improvement on a micro-benchmark, with a 20% speed increase for "is None" and "is not None" expressions using timeit. See results at [1] below.

However, I'm not very familiar with the python/performance benchmarks and with my first try at running those I'm getting wildly varying results (most faster, but some tests say 1.7x slower, so I'm pretty sure this is an artifact of how I'm running them). See [2] below for my results. Would someone with more experience running those be able to run the performance tests and post results?

Any other thoughts welcome as well.

-Ben

[1] Here are the micro-benchmarks using timeit against the latest master (normal build and also PGO) as well as with my patch:

latest master (e613e6add5f07ff6aad5802924596b631b707d2a, June 27)

./configure make -j ../test_none.sh x = 1234; x is None -- 20000000 loops, best of 5: 19.8 nsec per loop x = 1234; x is not None -- 10000000 loops, best of 5: 20 nsec per loop x = None; x is None -- 10000000 loops, best of 5: 20.7 nsec per loop x = None; x is not None -- 10000000 loops, best of 5: 20.8 nsec per loop avg 20.3 nsec per loop

./configure --enable-optimizations make -j ../test_none.sh x = 1234; x is None -- 20000000 loops, best of 5: 18.6 nsec per loop x = 1234; x is not None -- 20000000 loops, best of 5: 19.2 nsec per loop x = None; x is None -- 10000000 loops, best of 5: 19.5 nsec per loop x = None; x is not None -- 20000000 loops, best of 5: 19.4 nsec per loop avg 19.2 nsec per loop

is_none_bytecode branch (with my patch)

./configure make -j ../test_none.sh x = 1234; x is None -- 20000000 loops, best of 5: 16 nsec per loop x = 1234; x is not None -- 20000000 loops, best of 5: 16.4 nsec per loop x = None; x is None -- 20000000 loops, best of 5: 16.6 nsec per loop x = None; x is not None -- 20000000 loops, best of 5: 15.5 nsec per loop avg 16.1 nsec per loop (21% faster than master)

./configure --enable-optimizations make -j ../test_none.sh x = 1234; x is None -- 20000000 loops, best of 5: 15.6 nsec per loop x = 1234; x is not None -- 20000000 loops, best of 5: 16.4 nsec per loop x = None; x is None -- 20000000 loops, best of 5: 15.7 nsec per loop x = None; x is not None -- 20000000 loops, best of 5: 15.4 nsec per loop avg 15.8 nsec per loop (18% faster than master)

[2] Benchmarks comparing master and is_none_bytecode patch (each compiled with --enable-optimizations) using python/performance:

+-------------------------+------------+------------------------------+ | Benchmark | masteropt | isnonebytecodeopt | +=========================+============+==============================+ | 2to3 | 617 ms | 541 ms: 1.14x faster (-12%) | +-------------------------+------------+------------------------------+ | chameleon | 19.9 ms | 18.6 ms: 1.07x faster (-7%) | +-------------------------+------------+------------------------------+ | cryptopyaes | 208 ms | 201 ms: 1.04x faster (-3%) | +-------------------------+------------+------------------------------+ | deltablue | 13.8 ms | 12.9 ms: 1.07x faster (-7%) | +-------------------------+------------+------------------------------+ | djangotemplate | 245 ms | 233 ms: 1.05x faster (-5%) | +-------------------------+------------+------------------------------+ | dulwichlog | 132 ms | 126 ms: 1.05x faster (-5%) | +-------------------------+------------+------------------------------+ | fannkuch | 898 ms | 849 ms: 1.06x faster (-5%) | +-------------------------+------------+------------------------------+ | float | 204 ms | 191 ms: 1.07x faster (-7%) | +-------------------------+------------+------------------------------+ | genshitext | 57.5 ms | 54.8 ms: 1.05x faster (-5%) | +-------------------------+------------+------------------------------+ | genshixml | 122 ms | 114 ms: 1.07x faster (-6%) | +-------------------------+------------+------------------------------+ | hexiom | 18.4 ms | 26.9 ms: 1.46x slower (+46%) | +-------------------------+------------+------------------------------+ | html5lib | 155 ms | 265 ms: 1.72x slower (+72%) | +-------------------------+------------+------------------------------+ | jsondumps | 23.0 ms | 30.6 ms: 1.33x slower (+33%) | +-------------------------+------------+------------------------------+ | loggingformat | 23.1 us | 21.6 us: 1.07x faster (-6%) | +-------------------------+------------+------------------------------+ | loggingsimple | 19.6 us | 18.1 us: 1.09x faster (-8%) | +-------------------------+------------+------------------------------+ | mako | 35.4 ms | 33.5 ms: 1.06x faster (-5%) | +-------------------------+------------+------------------------------+ | meteorcontest | 177 ms | 169 ms: 1.04x faster (-4%) | +-------------------------+------------+------------------------------+ | nqueens | 184 ms | 174 ms: 1.06x faster (-5%) | +-------------------------+------------+------------------------------+ | pathlib | 46.7 ms | 45.1 ms: 1.03x faster (-3%) | +-------------------------+------------+------------------------------+ | pickle | 19.5 us | 17.4 us: 1.12x faster (-11%) | +-------------------------+------------+------------------------------+ | pickledict | 58.6 us | 49.9 us: 1.17x faster (-15%) | +-------------------------+------------+------------------------------+ | picklelist | 7.45 us | 6.86 us: 1.09x faster (-8%) | +-------------------------+------------+------------------------------+ | pythonstartup | 28.1 ms | 31.0 ms: 1.10x slower (+10%) | +-------------------------+------------+------------------------------+ | regexcompile | 335 ms | 353 ms: 1.05x slower (+5%) | +-------------------------+------------+------------------------------+ | regexdna | 230 ms | 247 ms: 1.07x slower (+7%) | +-------------------------+------------+------------------------------+ | regexeffbot | 4.16 ms | 4.49 ms: 1.08x slower (+8%) | +-------------------------+------------+------------------------------+ | regexv8 | 33.0 ms | 36.2 ms: 1.10x slower (+10%) | +-------------------------+------------+------------------------------+ | richards | 120 ms | 124 ms: 1.03x slower (+3%) | +-------------------------+------------+------------------------------+ | scimarkfft | 539 ms | 593 ms: 1.10x slower (+10%) | +-------------------------+------------+------------------------------+ | scimarkmontecarlo | 183 ms | 172 ms: 1.07x faster (-6%) | +-------------------------+------------+------------------------------+ | scimarksparsematmult | 6.42 ms | 6.54 ms: 1.02x slower (+2%) | +-------------------------+------------+------------------------------+ | spectralnorm | 251 ms | 233 ms: 1.08x faster (-7%) | +-------------------------+------------+------------------------------+ | sqlalchemydeclarative | 246 ms | 229 ms: 1.07x faster (-7%) | +-------------------------+------------+------------------------------+ | sqlalchemyimperative | 48.8 ms | 45.9 ms: 1.06x faster (-6%) | +-------------------------+------------+------------------------------+ | sqlitesynth | 4.56 us | 4.69 us: 1.03x slower (+3%) | +-------------------------+------------+------------------------------+ | sympyintegrate | 31.4 ms | 33.5 ms: 1.07x slower (+7%) | +-------------------------+------------+------------------------------+ | telco | 9.76 ms | 9.99 ms: 1.02x slower (+2%) | +-------------------------+------------+------------------------------+ | unpacksequence | 73.8 ns | 76.9 ns: 1.04x slower (+4%) | +-------------------------+------------+------------------------------+ | unpickle | 24.1 us | 24.8 us: 1.03x slower (+3%) | +-------------------------+------------+------------------------------+ | unpicklelist | 6.04 us | 6.46 us: 1.07x slower (+7%) | +-------------------------+------------+------------------------------+ | xmletreeparse | 223 ms | 245 ms: 1.10x slower (+10%) | +-------------------------+------------+------------------------------+ | xmletreeiterparse | 175 ms | 203 ms: 1.16x slower (+16%) | +-------------------------+------------+------------------------------+ | xmletreeprocess | 144 ms | 149 ms: 1.03x slower (+3%) | +-------------------------+------------+------------------------------+

Not significant (15): chaos; go; json_loads; nbody; pickle_pure_python; pidigits; python_startup_no_site; raytrace; scimark_lu; scimark_sor; sympy_expand; sympy_sum; sympy_str; unpickle_pure_python; xml_etree_generate

On Thu, May 25, 2017 at 10:28 AM, Ben Hoyt <benhoyt at gmail.com> wrote:

Thanks, Victor. That's very helpful. So RETURNNONE (and probably RETURNSMALLCONST) are not worth it, based on your empirical tests. Your patch shows how (relatively) straight-forward it is to test out new opcodes.

I'm still optimistic about the value of COMPAREISNONE and COMPAREISNOTNONE, though. Mainly because I've done a quick expansion of LOADCONST(None) + COMPAREOP and it's quite a bit more code and many more instructions than COMPAREISNONE would be: LOADCONST(None) COMPAREOP PyObject *value = ((PyTupleObject *)(consts))->obitem[oparg]; value->obrefcnt++; *stackpointer++ = value; FASTDISPATCH(); PyObject *right = *--stackpointer; PyObject *left = stackpointer[-1] // cmpoutcome(), presumably inlined int r = 0; switch (compareoparg) { case PyCmpIS: r = (left == right); break; case PyCmpISNOT: r = (left != right); break; case ... } PyObject *res = r ? PyTrue : PyFalse; res->obrefcnt++; if (--(left)->obrefcnt == 0) PyDealloc(left); if (--(right)->obrefcnt == 0) PyDealloc(right); stackpointer[-1] = res; if (res == NULL) goto error; PREDICT(POPJUMPIFFALSE); PREDICT(POPJUMPIFTRUE); DISPATCH();

COMPAREISNONE PyObject* left = stackpointer[-1]; // TOP() PyObject* res = (left == PyNone) ? PyTrue : PyFalse; res->obrefcnt++; if (--(left)->obrefcnt == 0) PyDealloc(left); stackpointer[-1] = res; // SETTOP(res) PREDICT(POPJUMPIFFALSE); PREDICT(POPJUMPIFTRUE); DISPATCH(); You don't have to get the const arg, there are fewer increfs/decrefs, you skip a pop, you don't have to test res==NULL (because it's PyTrue or PyFalse, which are never NULL), and if there are separate COMPAREISNONE and COMPAREISNOTNONE you don't have to switch on the compare arg (though I'm not sure if that part will be worth it). For reference, based on a grep, " is None" occurs 2737 times in the CPython source tree, and " is not None" 2010 times. And I know personally I often use them in loops as well is at the start of functions (for mutable default arg handling). Still, the performance proof will be in the pudding! I might hack these two opcodes together and test it at some point. -Ben On Thu, May 25, 2017 at 6:47 AM, Victor Stinner <victor.stinner at gmail.com> wrote:

Hi Ben, I am not convinced that combining operations will have a significant impact in term of performance. Mark Shanon implemented that in his HotPy project. I proposed a RETURNNONE opcode to combine LOADCONST with RETURNVALUE. The issue was rejected because I failed to show any speedup. https://bugs.python.org/issue28800 I would be interested to restart/finish my registervm project to use register-based bytecode. It allows to implement more optmisations and reduce the number of instructions. In my experience, less instructions = faster code. http://faster-cpython.readthedocs.io/registervm.html Mark's bytecode uses registers but also a stack. Victor Le 24 mai 2017 8:09 PM, "Ben Hoyt" <benhoyt at gmail.com> a écrit :

Hi folks, I was looking at some dis output today, and I was wondering if anyone has investigated optimizing Python (slightly) by adding special-case bytecodes for common expressions or statements involving constants? For example, I (and, based on a quick grep of the stdlib, many others) write "x is None" and "x is not None" very often. Or "return True" or "return None" or "return 1" and things like that. These all expand into two bytecodes, which seems pretty non-optimal (LOADCONST + COMPAREOP or LOADCONST + RETURNVALUE). It seems we could get an easy speedup for these common cases by adding a peephole optimization and some new opcodes (maybe COMPAREISSMALLCONST and RETURNSMALLCONST for these cases). I'm not proposing to do this yet, as I'd need to benchmark to see how much of a gain (if any) it would amount to, but I'm just wondering if there's any previous work on this kind of thing. Or, if not, any other thoughts before I try it? Thanks, Ben

Python-Dev mailing list Python-Dev at python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/victor.stinner%40gmail.com

Previous message (by thread): [Python-Dev] Proposal for C++ metaclasses
Next message (by thread): [Python-Dev] Micro-optimizations by adding special-case bytecodes?
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list