[Python-Dev] C-level duck typing

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Wed May 16 16:59:16 CEST 2012


On 05/16/2012 02:47 PM, Mark Shannon wrote:
> Stefan Behnel wrote:
>> Dag Sverre Seljebotn, 16.05.2012 12:48:
>>> On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
>>>>> Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64-bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.
>>>>
>>>> I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).
>>>
>>> In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64-bit signature) is our final objective.
>>
>> I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
>>
>> Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially; they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
>>
>> Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signatures on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on the callee side.
>>
>> An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
>
> Given that use case, can I suggest the following:
>
> Separate the discovery of the function from its use. By this I mean first lookup the function (outside of the loop) then use the function (inside the loop).
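To make the mechanism described above concrete, here is a minimal C sketch. All names, the table layout and the signature encoding are hypothetical, invented purely for illustration; the real proposal would have to pin these down. The idea is simply that the callee exposes, at some offset from the PyObject*, a small array of (64-bit signature id, C function pointer) pairs; the caller scans it, checks the signature, and calls the C entry point directly, falling back to a normal Python-level call when nothing matches.

#include <Python.h>
#include <stdint.h>

/* Hypothetical table entry: a 64-bit signature id paired with a C entry
 * point.  A zero signature terminates the table.  None of these names
 * exist in CPython or Cython; they only illustrate the idea. */
typedef struct {
    uint64_t signature;
    void    *funcptr;
} native_entry;

/* Made-up id standing for the C signature "double (double)". */
#define SIG_DOUBLE_DOUBLE UINT64_C(0x6464000000000000)

typedef double (*dd_func)(double);

/* 'offset' is where the callee says its table lives, relative to the
 * PyObject*.  In the proposal that offset (or the table pointer itself)
 * would hang off the type object; here it is simply passed in. */
static double
call_native_or_python(PyObject *callable, Py_ssize_t offset, double x)
{
    native_entry *entry = (native_entry *)((char *)callable + offset);
    for (; entry->signature != 0; entry++) {
        if (entry->signature == SIG_DOUBLE_DOUBLE) {
            /* Signature matches: call the C entry point directly, with
             * no tuple packing and no boxing of the double. */
            return ((dd_func)entry->funcptr)(x);
        }
    }
    /* No matching C signature: fall back to an ordinary Python call. */
    PyObject *res = PyObject_CallFunction(callable, "d", x);
    if (res == NULL)
        return -1.0;   /* caller must check PyErr_Occurred() */
    double out = PyFloat_AsDouble(res);
    Py_DECREF(res);
    return out;
}

The scan over the entries is the "linear scan in 64-bit steps over something like a hundred bytes in the L1 cache" mentioned in the quoted discussion.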

We would obviously do that when we can. But Cython is a compiler/code translator, and we don't control the use cases. You can easily come up with use cases (i.e. Cython code people write) where the two can't easily be separated.
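One hedged example of such a case: when each element carries its own callback (say, per node in a tree), the signature lookup necessarily happens inside the loop. The sketch below reuses the hypothetical call_native_or_python() helper from above; the node type is likewise made up for illustration.

/* Made-up data structure: each node stores its own Python callable,
 * so the (cheap) signature lookup has to be repeated per call. */
typedef struct node {
    PyObject    *callback;   /* may be a different object at every node */
    double       value;
    struct node *next;
} node;

static int
visit_all(node *head, Py_ssize_t offset)
{
    for (node *n = head; n != NULL; n = n->next) {
        double r = call_native_or_python(n->callback, offset, n->value);
        if (r == -1.0 && PyErr_Occurred())
            return -1;
        n->value = r;
    }
    return 0;
}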

For instance, the Sage project has hundreds of thousands of lines of object-oriented Cython code (NOT just array-oriented, but also graphs and trees and such), which is all based on Cython's own fast vtable dispatch a la C++. They might want to clean up their code and use more generic callback objects in some places.

Other users currently pass around raw C pointers for callback functions, and we'd like to tell them "pass around these nicer Python callables instead; honestly, the penalty is only ~2 ns per call". And that should hold regardless of how they use them, not only when they make sure to call them in a loop where we can statically pull out the function pointer acquisition. Saying "this is only non-sluggish if you do x, y and z" puts users off.
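For contrast, this is roughly what "statically pull out the function pointer acquisition" means when the same callable is applied to every element (Mark's separate-discovery-from-use pattern). Again a sketch built on the hypothetical table layout above, not actual Cython output; it is only meant to show the shape of code a compiler could generate when it can prove the callable does not change inside the loop.

/* Same callable for every element: scan the table once outside the
 * loop, then call through the raw C function pointer inside it. */
static int
map_one_callable(PyObject *callable, Py_ssize_t offset,
                 double *data, Py_ssize_t n)
{
    dd_func f = NULL;
    native_entry *entry = (native_entry *)((char *)callable + offset);
    for (; entry->signature != 0; entry++) {
        if (entry->signature == SIG_DOUBLE_DOUBLE) {
            f = (dd_func)entry->funcptr;
            break;
        }
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        if (f != NULL) {
            data[i] = f(data[i]);                 /* direct C call */
        }
        else {
            PyObject *res = PyObject_CallFunction(callable, "d", data[i]);
            if (res == NULL)
                return -1;
            data[i] = PyFloat_AsDouble(res);
            Py_DECREF(res);
        }
    }
    return 0;
}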

I'm not asking you to consider the details of all that. Just to allow some kind of high-performance extensibility of PyTypeObject, so that we can stop bothering python-dev with specific requirements from our parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
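One possible shape for "some kind of high-performance extensibility of PyTypeObject", purely as a sketch and not a concrete proposal: a tp_flags bit announcing the capability plus one extra field locating the signature table. Neither the flag nor the field exists in CPython; the names and the bit value are invented here.

/* Hypothetical: a flag bit (which would have to be a currently unused
 * one) plus an extended type struct carrying the table offset. */
#define Py_TPFLAGS_HAVE_NATIVECALL (1UL << 20)      /* made-up bit */

typedef struct {
    PyTypeObject base;                   /* the ordinary type object   */
    Py_ssize_t   tp_nativecall_offset;   /* offset of the entry table
                                            within instances           */
} PyNativeCallableTypeObject;            /* made-up name */

static int
supports_nativecall(PyObject *obj)
{
    return (Py_TYPE(obj)->tp_flags & Py_TPFLAGS_HAVE_NATIVECALL) != 0;
}

static Py_ssize_t
nativecall_offset(PyObject *obj)
{
    /* Only valid after supports_nativecall() returned true. */
    return ((PyNativeCallableTypeObject *)Py_TYPE(obj))->tp_nativecall_offset;
}

A caller would check supports_nativecall() once, fetch the offset, and then use the lookup-and-call pattern sketched earlier.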

Dag


