[Python-Dev] Micro-optimizations by adding special-case bytecodes?

Erik python at lucidity.plus.com
Wed May 24 16:14:18 EDT 2017


Hi Ben,

On 24/05/17 19:07, Ben Hoyt wrote:

> I'm not proposing to do this yet, as I'd need to benchmark to see how much of a gain (if any) it would amount to, but I'm just wondering if there's any previous work on this kind of thing. Or, if not, any other thoughts before I try it?

This is exactly what I looked into just over a year ago. As Stephane suggests, I did this by adding new opcodes that the peephole optimizer generated and the interpreter loop understood (the compiler itself did not need to know anything about the new opcodes, which made things much easier).
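
As a rough illustration of how you might go looking for candidate sequences in the first place, something along these lines (a quick Python sketch using the dis module - not the tooling I actually used, and json is just an arbitrary stdlib module to scan) will count adjacent opcode pairs in a module's functions:

    # Count adjacent opcode pairs across the functions of a module, as a
    # cheap way to spot frequent sequences that might be worth fusing.
    import collections
    import dis
    import json  # arbitrary stdlib module used as the scan target

    def pair_counts(module):
        counts = collections.Counter()
        for name in dir(module):
            obj = getattr(module, name)
            code = getattr(obj, "__code__", None)
            if code is None:
                continue  # skip anything that isn't a plain function
            ops = [ins.opname for ins in dis.get_instructions(code)]
            counts.update(zip(ops, ops[1:]))
        return counts

    for pair, n in pair_counts(json).most_common(5):
        print(n, pair)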

Adding new opcodes like this at the time wasn't straightforward because of issues with the build process (see this thread: https://mail.python.org/pipermail/python-dev/2015-December/142600.html - it starts out as a question about the bytecode format but ended up with some very useful information on the build process).

Note that since that thread a couple of things have changed: the bytecode is now wordcode, so some of my original questions are no longer relevant, and some of the things I had problems with in the build system are now auto-generated via a new 'make' target. So it should be easier now than it was then.

In terms of the results I got once I had things building and running, I didn't manage to find any particular magic bullets that gave a significant enough speedup. Perhaps I just didn't pick the right opcode sequences or the right test cases. The transformation itself worked well, though - for example, turning branches-to-RETURN into a single RETURN: a LOAD_CONST/RETURN_VALUE pair became a RETURN_CONST, and if the target of an unconditional jump was a RETURN_CONST op, the jump itself could then be replaced with that RETURN_CONST.
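
To sketch the first part of that in Python (this is illustration only - the real pass was done in C on raw bytecode, and the jump-retargeting step is left out):

    # Toy illustration: fuse (LOAD_CONST x, RETURN_VALUE) into a single
    # RETURN_CONST x, working on symbolic (opname, arg) pairs rather than
    # real bytecode.
    def fuse_return_const(instructions):
        out = []
        i = 0
        while i < len(instructions):
            op, arg = instructions[i]
            nxt = instructions[i + 1] if i + 1 < len(instructions) else None
            if op == "LOAD_CONST" and nxt == ("RETURN_VALUE", None):
                out.append(("RETURN_CONST", arg))  # the fused opcode
                i += 2
            else:
                out.append((op, arg))
                i += 1
        return out

    print(fuse_return_const([("LOAD_CONST", 42), ("RETURN_VALUE", None)]))
    # -> [('RETURN_CONST', 42)]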

I figured that one thing every function or method needs to do is return, so I tried to make that more efficient. I only had two weeks to spend on it though ...

I was trying to do that by avoiding trips around the interpreter loop, as that was historically something that would give speedups. However, with the new computed-goto version of the interpreter I came to the conclusion that it's not as important as it used to be. I was building with gcc, though, and what I didn't do was disable the computed-goto code (it's controlled by a #define) to see if my changes improved performance on platforms that can't use it.
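
If you do pick this up, one cheap sanity check before writing any C is to time the call/return path from Python itself, e.g. with timeit (a rough sketch - it measures the whole call, so it only bounds what a faster return path could buy you):

    # Time a function that does nothing but return a constant, to get a
    # feel for how much of the per-call cost the return path could ever
    # account for.
    import timeit

    setup = "def f(): return 42"
    print(timeit.timeit("f()", setup=setup, number=10_000_000))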

I identified some other opcode sequences that might be worth looking at further.

I didn't (and still don't) have the bandwidth to drive something like this through myself, but if you want to take it on I'd be more than happy to be kept in the loop on what you're doing, and I can possibly find time to write some code too.

Regards, E.


