[Python-Dev] [numpy wishlist] Interpreter support for temporary elision in third-party classes
Nathaniel Smith njs at pobox.com
Thu Jun 5 22:51:41 CEST 2014
Hi all,
There's a very valuable optimization -- temporary elision -- which numpy can almost do. It gives something like a 10-30% speedup for lots of common real-world expressions. It would probably be useful for non-numpy code too. (In fact it generalizes the str += str special case that's currently hardcoded in ceval.c.) But it can't be done safely without help from the interpreter, and possibly not even then. So I thought I'd raise it here and see if we can get any consensus on whether and how CPython could support this.
=== The dream ===
Here's the idea. Take an innocuous expression like:
    result = (a + b + c) / c
This gets evaluated as:
    tmp1 = a + b
    tmp2 = tmp1 + c
    result = tmp2 / c
All these temporaries are very expensive. Suppose that a, b, c are arrays with N bytes each, and N is large. For simple arithmetic like this, costs are dominated by memory access. Allocating an N byte array requires the kernel to clear the memory, which incurs N bytes of memory traffic. If all the operands are already allocated, then performing a three-operand operation like tmp1 = a + b involves 3N bytes of memory traffic (reading the two inputs plus writing the output). In total our example does 3 allocations and has 9 operands, so it does 12N bytes of memory access.
If our arrays are small, then the kernel doesn't get involved and some of these accesses will hit the cache, but OTOH the overhead of things like malloc won't be amortized out; the best case starting from a cold cache is 3 mallocs and 6N bytes worth of cache misses (or maybe 5N if we get lucky and malloc'ing 'result' returns the same memory that tmp1 used, and it's still in cache).
There's an obvious missed optimization in this code, though, which is that it keeps allocating new temporaries and throwing away old ones. It would be better to just allocate a temporary once and re-use it:
    tmp1 = a + b
    tmp1 += c
    tmp1 /= c
    result = tmp1
Now we have only 1 allocation and 7 operands, so we touch only 8N bytes of memory. For large arrays -- that don't fit into cache, and for which per-op overhead is amortized out -- this gives a theoretical 33% speedup, and we can realistically get pretty close to this. For smaller arrays, the re-use of tmp1 means that in the best case we have only 1 malloc and 4N bytes worth of cache misses, and we also have a smaller cache footprint, which means this best case will be achieved more often in practice. For small arrays it's harder to estimate the total speedup here, but 66% fewer mallocs and 33% fewer cache misses is certainly enough to make a practical difference.
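For concreteness, here's that accounting written out as arithmetic (a trivial check, in units of N bytes):

    # Memory traffic for the two versions above, in units of N bytes.
    N = 1
    naive  = 3 * N + 9 * N   # 3 allocations + 9 operands = 12N
    elided = 1 * N + 7 * N   # 1 allocation  + 7 operands =  8N
    print(naive, elided)        # 12 8
    print(1 - elided / naive)   # ~0.33, i.e. 33% less memory traffic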
Such optimizations are important enough that numpy operations always give the option of explicitly specifying the output array (like in-place operators but more general and with clumsier syntax). Here's an example small-array benchmark that IIUC uses Jacobi iteration to solve Laplace's equation. It's been written in both natural and hand-optimized formats (compare "num_update" to "num_inplace"):
https://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-inplace
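For the running example at the top, the two styles look like this (a minimal sketch using numpy's out= parameter; the linked benchmark applies the same idea to a Jacobi update):

    import numpy as np

    a, b, c = (np.random.rand(10**6) for _ in range(3))

    # Natural version: allocates two temporaries plus the result.
    result = (a + b + c) / c

    # Hand-elided version: one buffer, explicitly reused via out=.
    tmp = np.empty_like(a)
    np.add(a, b, out=tmp)
    np.add(tmp, c, out=tmp)
    np.divide(tmp, c, out=tmp)
    result2 = tmp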
num_inplace is totally unreadable, but because we've manually elided temporaries, it's 10-15% faster than num_update. With our prototype automatic temporary elision turned on, this difference disappears -- the natural code gets 10-15% faster, and we remove the temptation to write horrible things like num_inplace.
What do I mean by "automatic temporary elision"? It's almost possible for numpy to automatically convert the first example into the second. The idea is: we want to replace
    tmp2 = tmp1 + c
with
    tmp1 += c
    tmp2 = tmp1
And we can do this by defining
    def __add__(self, other):
        if is_about_to_be_thrown_away(self):
            return self.__iadd__(other)
        else:
            ...
now tmp1.__add__(c) does an in-place add and returns tmp1, no allocation occurs, woohoo.
The only little problem is implementing is_about_to_be_thrown_away().
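(To see why some such check is essential: the in-place rewrite is only safe when nothing else can see the array. Here's an illustrative numpy example of what goes wrong if it's applied to a non-temporary:)

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    c = np.array([10.0, 10.0, 10.0])

    tmp1 = a           # NOT a dying temporary: `a` still sees this buffer
    tmp2 = tmp1 + c    # correct: allocates a new array, `a` is untouched
    tmp1 += c          # the "optimized" rewrite would do this instead...
    print(a)           # ...and now a == [11. 12. 13.] -- caller's data clobbered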
=== The sneaky-but-flawed approach ===
The following implementation may make you cringe, but it comes tantalizingly close to working:
    bool is_about_to_be_thrown_away(PyObject *obj) {
        return (Py_REFCNT(obj) == 1);
    }
In fact, AFAICT it's 100% correct for libraries being called by regular python code (which is why I'm able to quote benchmarks at you :-)). The bytecode eval loop always holds a reference to all operands, and then immediately DECREFs them after the operation completes. If one of our arguments has no other references besides this one, then we can be sure that it is a dead obj walking, and steal its corpse.
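You can actually watch this from pure Python (CPython-specific, of course; note that sys.getrefcount counts the reference held by its own argument):

    import sys

    x = [1, 2, 3]
    # A bound name: the variable `x` plus getrefcount's own argument.
    print(sys.getrefcount(x))          # 2
    # A temporary: the only reference is the one getrefcount holds.
    print(sys.getrefcount([1, 2, 3]))  # 1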
But this has a fatal flaw: people are unreasonable creatures, and sometimes they call Python libraries without going through ceval.c :-(. It's legal for random C code to hold an array object with a single reference count, and then call PyNumber_Add on it, and then expect the original array object to still be valid. But who writes code like that in practice? Well, Cython does. So, this is a no-go.
=== A better (?) approach ===
This is a pretty arcane bit of functionality that we need, and it interacts with ceval.c, so I'm not at all confident about the best way to do it. (We even have an implementation using libunwind to walk the C stack and make sure that we're being called from ceval.c, which... works, actually, but is unsatisfactory in other ways.) I do have an idea that I think might work and be acceptable, but you tell me:
Proposal: We add an API call
    PyEval_LastOpDefinitelyMatches(frame, optype, *args)
which checks whether the last instruction executed in 'frame' was in fact an 'optype' instruction and did in fact have arguments 'args'. If it was, then it returns True. If it wasn't, or if we aren't sure, it returns False. The intention is that 'optype' is a semantic encoding of the instruction (like "+" or "function call") and thus can be preserved even if the bytecode details change.
Then, in numpy's __add__, we do:
1. fetch the current stack frame from TLS
2. check PyEval_LastOpDefinitelyMatches(frame, "+", arg1, arg2)
3. check for arguments with refcnt == 1
4. check that all arguments are base-class numpy array objects (i.e., PyArray_CheckExact)
The logic here is that step (2) tells us that someone did 'arg1 + arg2', so ceval.c is holding a temporary reference to the arguments, and step (3) tells us that at the time of the opcode evaluation there were no other references to these arguments, and step (4) tells us that 'arg1 + arg2' dispatched directly to ndarray.__add__ so there's no chance that anyone else has borrowed a reference in the meantime.
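Putting the four steps together, the fast path would look something like this sketch (Python-level pseudocode made runnable: the proposed PyEval_LastOpDefinitelyMatches and the raw refcount test have no Python equivalents, so both helpers below are hypothetical stubs for what would really happen in C):

    import sys
    import numpy as np

    def last_op_definitely_matches(frame, optype, *args):
        # Hypothetical stub for the proposed C API; answering False
        # just means we never take the fast path.
        return False

    def refcnt_is_one(obj):
        # Hypothetical stub for the C-level test Py_REFCNT(obj) == 1.
        return False

    def elided_add(a, b):
        # What ndarray.__add__ would do under the proposal.
        frame = sys._getframe(1)                          # step (1)
        if (last_op_definitely_matches(frame, "+", a, b)  # step (2)
                and refcnt_is_one(a)                      # step (3)
                and type(a) is np.ndarray                 # step (4):
                and type(b) is np.ndarray):               #   PyArray_CheckExact
            a += b       # reuse the dying temporary's buffer in place
            return a
        return np.add(a, b)  # normal path: allocate a fresh result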
AFAICT PyEval_LastOpDefinitelyMatches can almost be implemented now; the only problem is that stack_pointer is a local variable in PyEval_EvalFrameEx, and we would need it to be accessible via the frame object. The easy way would be to just move it in there. I don't know if this would have any weird effects on speed due to cache effects, but I guess we could arrange to put it into the same cache line as f_lasti, which is also updated on every opcode? OTOH someone has gone to some trouble to make sure that f_stacktop usually doesn't point to the top of the stack, and I guess there must have been some reason for this. Alternatively we could stash a pointer to stack_pointer in the frame object, and that would only need to be updated once per entry/exit to PyEval_EvalFrameEx.
Obviously there are a lot of details to work out here, like what the calling convention for PyEval_LastOpDefinitelyMatches should really be, but:
- Does this approach seem like it would successfully solve the problem?
- Does this approach seem like it would be acceptable in CPython?
- Is there a better idea I'm missing?
-n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org