Issue 26814: [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

Created on 2016-04-21 08:57 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (34)

msg263899 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-21 08:57

Attached patch adds the following new function:

PyObject* _PyObject_CallStack(PyObject *func, PyObject **stack, int na, int nk);

where na is the number of positional arguments and nk is the number of (key, value) pairs of keyword arguments stored in the stack.

Example of C code to call a function with one positional argument:

PyObject *stack[1];
stack[0] = arg;
return _PyObject_CallStack(func, stack, 1, 0);

Simple, isn't it?

The difference with PyObject_Call() is that this API avoids the creation of a tuple and a dictionary to pass parameters to functions when possible. Currently, the temporary tuple and dict can be avoided when calling Python functions (nice, isn't it?) and C functions declared with METH_O (not the most common convention, but many functions are declared like that).
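A minimal sketch of the idea, using a mock object type instead of CPython's real PyObject (all names below are illustrative, not from the patch): the caller builds the arguments in a plain C array and the callee consumes that array directly, so no tuple ever has to be allocated for the call.

```c
#include <assert.h>

/* Mock stand-in for PyObject; illustrative only. */
typedef struct { int value; } MockObj;

/* Callee using the stack-based convention: it reads the caller's C
 * array directly instead of unpacking a heap-allocated tuple. */
static int sum_fastcall(MockObj **stack, int na)
{
    int total = 0;
    for (int i = 0; i < na; i++) {
        total += stack[i]->value;
    }
    return total;
}

/* Caller side: the argument array lives on the C stack, so the call
 * itself needs no heap allocation at all. */
static int call_example(void)
{
    MockObj a = {40}, b = {2};
    MockObj *stack[2] = {&a, &b};
    return sum_fastcall(stack, 2);
}
```

The point of the convention is visible on the caller side: the array is built once, in automatic storage, and handed over by pointer.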

The patch only modifies property_descr_set() to test the feature, but I'm sure that a lot of C code can be modified to use this new function and benefit from its optimization.

Should we make this new _PyObject_CallStack() function official, i.e. call it PyObject_CallStack() (without the underscore prefix), or experiment with it in CPython 3.6 and decide later whether to make it public? If it stays private, it will require a large replacement patch later to replace all calls to _PyObject_CallStack() with PyObject_CallStack().

The next step is to add a new METH_STACK flag to pass parameters to C functions using a similar API (PyObject **stack, int na, int nk), and to modify Argument Clinic to use this new API.

Thanks to Larry Hasting who gave me the idea in a previous edition of Pycon US ;-)

This issue was created after the discussion on issue #26811, which concerns a micro-optimization in property_descr_set() that avoids the creation of a tuple by caching a private tuple inside property_descr_set().

msg263907 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-21 09:53

"Stack" in the function name looks a little confusing. I understand that it refers to the stack of the bytecode interpreter, but that exposes a pretty deep implementation detail. The way positional and keyword arguments are packed into a contiguous array is not obvious. Wouldn't it be better to provide separate parameters for positional and keyword arguments?
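For reference, the packing being questioned here follows the layout used by the bytecode interpreter's call machinery: the na positional arguments come first, followed by nk alternating (key, value) pairs. A mock sketch of a reader for that layout (MockObj and find_keyword are stand-ins, not CPython code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock object carrying an optional name, standing in for PyObject. */
typedef struct { const char *name; int value; } MockObj;

/* The flat layout:
 *   stack[0 .. na-1]         positional arguments
 *   stack[na .. na+2*nk-1]   nk alternating (key, value) pairs
 * Look up a keyword argument by name in that layout. */
static MockObj *find_keyword(MockObj **stack, int na, int nk,
                             const char *name)
{
    for (int i = 0; i < nk; i++) {
        MockObj *key = stack[na + 2 * i];
        if (strcmp(key->name, name) == 0) {
            return stack[na + 2 * i + 1];
        }
    }
    return NULL;  /* keyword not passed */
}
```

This shows why the packing is implicit: nothing in the array itself marks where positionals end, so na and nk must always travel with the pointer.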

What is the performance effect of using this function? For example, compare the performance of namedtuple's attribute access with the current code, the code with this patch, and the unoptimized code in 3.4:

./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a"

Is there any use of this function with keyword arguments?

msg263908 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-21 10:20

Microbenchmark on Python 3.6, best of 3 runs:

./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a"

"Python 3.6 with property_descr_get() of Python 3.4": replace the current optimization with "return PyObject_CallFunctionObjArgs(gs->prop_get, obj, NULL);".

Oh, in fact the tested code calls a property whose final function is operator.itemgetter(0). _PyObject_CallStack() creates a temporary tuple to call PyObject_Call(), which calls func->ob_type->tp_call, i.e. itemgetter_call().

Problem: the tp_call API uses (PyObject *args, PyObject *kwargs). It doesn't accept a stack (a C array of PyObject*) directly. And it may be more difficult to modify tp_call.
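The mismatch can be sketched with mock types (none of these names are CPython's): when a callee only implements the tuple-style slot, the fast-call entry point has to copy the C array into an allocated container first, which is exactly the cost the new API tries to avoid.

```c
#include <assert.h>
#include <stdlib.h>

/* Mock object model mirroring the shape of the problem only. */
typedef struct MockObj MockObj;
typedef MockObj *(*call_slot)(MockObj *self, MockObj **packed, int n);
typedef MockObj *(*fastcall_slot)(MockObj *self, MockObj **stack, int na);

struct MockObj {
    call_slot tp_call;          /* legacy: wants a packed container */
    fastcall_slot tp_fastcall;  /* proposed: reads the C array directly */
    int value;
};

/* Sample callee implementing only the fast convention. */
static MockObj *first_arg_fast(MockObj *self, MockObj **stack, int na)
{
    (void)self;
    return na > 0 ? stack[0] : NULL;
}

/* Sample callee implementing only the legacy convention. */
static MockObj *first_arg_legacy(MockObj *self, MockObj **packed, int n)
{
    (void)self;
    return n > 0 ? packed[0] : NULL;
}

/* Dispatch: take the fast path when available; otherwise copy the
 * stack into a heap "tuple" for the legacy slot -- the very
 * allocation the fast call is trying to avoid. */
static MockObj *do_call(MockObj *func, MockObj **stack, int na)
{
    if (func->tp_fastcall != NULL) {
        return func->tp_fastcall(func, stack, na);
    }
    MockObj **packed = malloc((size_t)na * sizeof(MockObj *));
    if (packed == NULL) {
        return NULL;
    }
    for (int i = 0; i < na; i++) {
        packed[i] = stack[i];
    }
    MockObj *res = func->tp_call(func, packed, na);
    free(packed);
    return res;
}
```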

In short, my patch disables the optimization on property with my current incomplete implementation.

msg263909 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-21 10:28

See also . Maybe your function can help to optimize filter(), map(), sorted()?

msg263910 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-21 10:42

call_stack-2.patch: a slightly more complete patch; it adds a tp_call_stack field to PyTypeObject and uses it in _PyObject_CallStack().

Updated microbenchmark on Python 3.6, best of 3 runs:

./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a"

call_stack-2.patch makes this micro-benchmark 31% faster, not bad! It also makes calls to C functions almost 2x as fast if you replace current unoptimized calls with _PyObject_CallStack()!!

IMHO we should continue to experiment; making function calls 2x faster is worth it ;-)

Serhiy: "See also . Maybe your function can help to optimize filter(), map(), sorted()?"

IMHO the API is generic enough to be usable in a lot of cases.

Serhiy: "Is there any use of this function with keyword arguments?"

Calling functions with keyword arguments is probably the least common case for function calls in C code. But I would like to provide a fast function for calls with keywords too. Maybe we need two functions just to make the API cleaner? The difference would just be that the "int nk" parameter is omitted?

I proposed an API (PyObject **stack, int na, int nk) based on the current code in Python/ceval.c. I'm not sure that it's the best API ever :-)

In fact, there is already PyObject_CallFunctionObjArgs(), which can be modified to reuse _PyObject_CallStack() internally, and its API is maybe more convenient than my proposed one.
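For illustration, an ObjArgs-style varargs front end can collect its NULL-terminated arguments straight into a C array and then work on that array directly. This is only a mock sketch of the pattern, not the actual CPython helper; the names and the size limit are assumptions.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

typedef struct { int value; } MockObj;  /* mock stand-in for PyObject */

#define SMALL_STACK 10  /* illustrative small-array cutoff */

/* ObjArgs-style front end: collect the NULL-terminated varargs into a
 * C array, then consume that array directly (here: just sum values).
 * Returns -1 if more than SMALL_STACK arguments are passed. */
static int sum_obj_args(MockObj *first, ...)
{
    MockObj *stack[SMALL_STACK];
    int n = 0;

    if (first != NULL) {
        stack[n++] = first;
        va_list va;
        va_start(va, first);
        MockObj *arg;
        while ((arg = va_arg(va, MockObj *)) != NULL) {
            if (n >= SMALL_STACK) {
                va_end(va);
                return -1;
            }
            stack[n++] = arg;
        }
        va_end(va);
    }

    int total = 0;
    for (int i = 0; i < n; i++) {
        total += stack[i]->value;
    }
    return total;
}
```

The caller's convenience is unchanged (trailing NULL sentinel, like PyObject_CallFunctionObjArgs), but internally no argument tuple is ever built.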

msg263918 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-21 13:45

With call_stack-2.patch, attribute access on a namedtuple is only 25% slower than attribute access on an ordinary Python object! This is definitely worth experimenting with further!

But adding a new slot to PyTypeObject sets the bar too high. Try to use your function to speed up all cases mentioned in : sorted()/list.sort(), min() and max() with the key argument, filter(), map(), some iterators from itertools (groupby(), dropwhile(), takewhile(), accumulate(), filterfalse()), and thin wrappers around special methods (round(), math.floor(), etc). Use it in wrappers around PyObject_Call() like PyObject_CallFunctionObjArgs(). Maybe this will have an effect even on some macrobenchmarks.

msg263920 - (view)

Author: Larry Hastings (larry) * (Python committer)

Date: 2016-04-21 14:24

Yes, I've been working on a patch to do this as well. I called the calling convention METH_RAW, to go alongside METH_ZERO METH_O etc. My calling convention was exactly the same as yours: PyObject *(PyObject *o, PyObject **stack, int na, int nk). I only had to modify two functions in ceval.c to support it: ext_do_call() and call_function().

And yes, the overarching goal was to have Argument Clinic generate custom argument parsing code for every function. Supporting the calling convention was the easy part; generating code was quite complicated. I believe I got a very simple version of it working at one point, supporting positional parameters only, with some optional arguments. Parsing arguments by hand gets very complicated indeed when you introduce keyword arguments.

I haven't touched this patch in most of a year. I hope to return to it someday. In the meantime it's fine by me if you add support for this and rewrite some functions by hand to use it.

p.s. My last name has two S's. If you continue to leave off one of them, I shall remove one from yours, Mr. TINNER.

msg263923 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-21 15:03

Since early microbenchmarks are promising, I wrote a more complete implementation which tries to use the fast path (avoiding the temporary tuple/dict) in all PyObject_Call*() functions.

The next step would be to add a METH_FASTCALL flag. IMHO adding such a new flag requires enhancing Argument Clinic to be able to use it, at least when a function doesn't accept keyword parameters.

PyObject_CallFunction() & friends have a weird API: if called with the format string "O", the behaviour depends on whether the object parameter is a tuple or not. If it is a tuple, the tuple is unpacked, which is a little bit weird. I recall that it led to a bug in the implementation of generators in Python: issue #21209! Moreover, if the format string is "(...)", the parentheses are ignored; if you want to call a function with one argument which is a tuple, you have to write "((...))". It's weird, but we cannot change that without breaking the (Python) world :-)

call_stack-3.patch:

Nice change in the WITH_CLEANUP_START opcode (ceval.c):

I don't know if it's a common bytecode, nor if the change is really faster.

msg263924 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-21 15:05

I believe I got a very simple version of it working at one point, supporting positional parameters only, with some optional arguments.

Yeah, that would be a nice first step.

p.s. My last name has two S's. If you continue to leave off one of them, I shall remove one from yours, Mr. TINNER.

Ooops, I'm sorry Guido Hastings :-(

msg263926 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-21 17:04

The PyObject_Call*() implementations with _PyObject_FastCall() look much more complex than with PyObject_Call() (even without counting the additional complex functions in modsupport.c). And I'm not sure there is a benefit. Maybe for a first stage we can do without this.

msg263946 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-22 00:44

I created a repository. I will work there and run some experiments. It will help to get a better idea of the concrete performance. When I have a better view of all the changes required to get the best performance everywhere, I will start a discussion to see which parts are worth it. In my latest microbenchmarks, function calls (C/Python, mixed) are between 8% and 40% faster. I'm now running the CPython benchmark suite.

msg263995 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-22 11:10

Changes in my current implementation: ad4a53ed1fbf.diff.

The good thing is that all changes are internal (really?). Even if you don't modify your C extensions (nor your Python code), you should benefit from the new fast call in a lot of cases.

IMHO the trickiest part is the changes to PyTypeObject. Is it ok to add a new tp_fastcall slot? Should we add even more slots using the fast call convention, like tp_fastnew and tp_fastinit? How should we handle inheritance of types with that?

(*) Add 2 new public functions:

PyObject* PyObject_CallNoArg(PyObject *func); PyObject* PyObject_CallArg1(PyObject *func, PyObject *arg);

(*) Add 1 new private function:

PyObject* _PyObject_FastCall(PyObject *func, PyObject **stack, int na, int nk);

_PyObject_FastCall() is the root of the new feature.

(*) type: add a new "tp_fastcall" field to the PyTypeObject structure.

It's unclear to me how inheritance is handled here. Maybe it's simply broken, but it's strange because it looks like it works :-) Maybe it's very rare that tp_call is overridden in a child class?

TODO: maybe reuse the "tp_call" field? (risk of major backward incompatibility...)

(*) slots: add a new "fastwrapper" field to the wrapperbase structure. Add a fast wrapper to all slots (really all? I should check).

I don't think that consumers of the C API are affected by this change, or maybe only a few projects.

TODO: maybe remove "fastwrapper" and reuse the "wrapper" field? (low risk of backward compatibility?)

(*) Implement fast call for Python function (_PyFunction_FastCall) and C functions (PyCFunction_FastCall)

(*) Add a new METH_FASTCALL calling convention for C functions. Right now, it is used for 4 builtin functions: sorted(), getattr(), iter(), next().

Argument Clinic should be modified to emit C code using this new fast calling convention.

(*) Implement fast call in the following functions (types):

(*) Modify the PyObject_Call*() functions to reuse the fast call internally. "tp_fastcall" is preferred over "tp_call" (FIXME: is it really useful to do that?).

The following functions are able to avoid temporary tuple/dict without having to modify the code calling them:

It's not required to modify code using these functions to use the 3 shiny new functions (PyObject_CallNoArg, PyObject_CallArg1, _PyObject_FastCall). For example, replacing PyObject_CallFunctionObjArgs(func, NULL) with PyObject_CallNoArg(func) is just a micro-optimization; the tuple is already avoided. But PyObject_CallNoArg() should use less C stack memory and be a "little bit" faster.

(*) Add new helpers: new Include/pystack.h file, Py_VaBuildStack(), etc.

Please ignore unrelated changes.

msg263996 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-22 11:12

Related issue: issue #23507, "Tuple creation is too slow".

msg263999 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-22 11:40

Some microbenchmarks: bench_fast.py.

== Python 3.6 / Python 3.6 FASTCALL ==

----------------------------------+--------------+---------------
Tests                             | /tmp/default | /tmp/fastcall
----------------------------------+--------------+---------------
filter                            | 241 us (*)   | 166 us (-31%)
map                               | 205 us (*)   | 168 us (-18%)
sorted(list, key=lambda x: x)     | 242 us (*)   | 162 us (-33%)
sorted(list)                      | 27.7 us (*)  | 27.8 us
b=MyBytes(); bytes(b)             | 549 ns (*)   | 533 ns
namedtuple.attr                   | 2.03 us (*)  | 1.56 us (-23%)
object.setattr(obj, "x", 1)       | 347 ns (*)   | 218 ns (-37%)
object.getattribute(obj, "x")     | 331 ns (*)   | 200 ns (-40%)
getattr(1, "real")                | 267 ns (*)   | 150 ns (-44%)
bounded_pymethod(1, 2)            | 193 ns (*)   | 190 ns
unbound_pymethod(obj, 1, 2)       | 195 ns (*)   | 192 ns
----------------------------------+--------------+---------------
Total                             | 719 us (*)   | 526 us (-27%)
----------------------------------+--------------+---------------

== Compare Python 3.4 / Python 3.6 / Python 3.6 FASTCALL ==

Common platform: Timer: time.perf_counter Python unicode implementation: PEP 393 Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09) CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three SCM: hg revision=abort: repository . not found! tag=abort: repository . not found! branch=abort: repository . not found! date=abort: no repository found in '/home/haypo/prog/python' (.hg not found)! Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign /tmp/py34: Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv Timer precision: 78 ns Date: 2016-04-22 13:37:52

Platform of campaign /tmp/default: Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)] CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes Timer precision: 103 ns Date: 2016-04-22 13:38:07

Platform of campaign /tmp/fastcall: Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)] Timer precision: 99 ns CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes Date: 2016-04-22 13:38:21

----------------------------------+-------------+----------------+---------------
Tests                             | /tmp/py34   | /tmp/default   | /tmp/fastcall
----------------------------------+-------------+----------------+---------------
filter                            | 325 us (*)  | 241 us (-26%)  | 166 us (-49%)
map                               | 260 us (*)  | 205 us (-21%)  | 168 us (-35%)
sorted(list, key=lambda x: x)     | 354 us (*)  | 242 us (-32%)  | 162 us (-54%)
sorted(list)                      | 46.9 us (*) | 27.7 us (-41%) | 27.8 us (-41%)
b=MyBytes(); bytes(b)             | 839 ns (*)  | 549 ns (-35%)  | 533 ns (-36%)
namedtuple.attr                   | 4.51 us (*) | 2.03 us (-55%) | 1.56 us (-65%)
object.setattr(obj, "x", 1)       | 447 ns (*)  | 347 ns (-22%)  | 218 ns (-51%)
object.getattribute(obj, "x")     | 401 ns (*)  | 331 ns (-17%)  | 200 ns (-50%)
getattr(1, "real")                | 236 ns (*)  | 267 ns (+13%)  | 150 ns (-36%)
bounded_pymethod(1, 2)            | 249 ns (*)  | 193 ns (-22%)  | 190 ns (-24%)
unbound_pymethod(obj, 1, 2)       | 251 ns (*)  | 195 ns (-22%)  | 192 ns (-23%)
----------------------------------+-------------+----------------+---------------
Total                             | 993 us (*)  | 719 us (-28%)  | 526 us (-47%)
----------------------------------+-------------+----------------+---------------

msg264003 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-22 11:52

For more fun, comparison between Python 2.7 / 3.4 / 3.6 / 3.6 FASTCALL.

----------------------------------+-------------+----------------+----------------+---------------
Tests                             | py27        | py34           | py36           | fast
----------------------------------+-------------+----------------+----------------+---------------
filter                            | 165 us (*)  | 318 us (+93%)  | 237 us (+43%)  | 165 us
map                               | 209 us (*)  | 258 us (+24%)  | 202 us         | 171 us (-18%)
sorted(list, key=lambda x: x)     | 272 us (*)  | 348 us (+28%)  | 237 us (-13%)  | 163 us (-40%)
sorted(list)                      | 33.7 us (*) | 47.8 us (+42%) | 27.3 us (-19%) | 27.7 us (-18%)
b=MyBytes(); bytes(b)             | 3.31 us (*) | 835 ns (-75%)  | 510 ns (-85%)  | 561 ns (-83%)
namedtuple.attr                   | 4.63 us (*) | 4.51 us        | 1.98 us (-57%) | 1.57 us (-66%)
object.setattr(obj, "x", 1)       | 463 ns (*)  | 440 ns         | 343 ns (-26%)  | 222 ns (-52%)
object.getattribute(obj, "x")     | 323 ns (*)  | 396 ns (+23%)  | 316 ns         | 196 ns (-39%)
getattr(1, "real")                | 218 ns (*)  | 237 ns (+8%)   | 264 ns (+21%)  | 147 ns (-33%)
bounded_pymethod(1, 2)            | 213 ns (*)  | 244 ns (+14%)  | 194 ns (-9%)   | 188 ns (-12%)
unbound_pymethod(obj, 1, 2)       | 345 ns (*)  | 247 ns (-29%)  | 196 ns (-43%)  | 191 ns (-45%)
func()                            | 161 ns (*)  | 211 ns (+31%)  | 161 ns         | 157 ns
func(1, 2, 3)                     | 219 ns (*)  | 247 ns (+13%)  | 196 ns (-10%)  | 190 ns (-13%)
----------------------------------+-------------+----------------+----------------+---------------
Total                             | 689 us (*)  | 980 us (+42%)  | 707 us         | 531 us (-23%)
----------------------------------+-------------+----------------+----------------+---------------

I didn't know that Python 3.4 was so much slower than Python 2.7 on function calls!?

Note: Python 2.7 and Python 3.4 are system binaries (Fedora 22), whereas Python 3.6 and Python 3.6 FASTCALL are compiled manually.

Ignore "b=MyBytes(); bytes(b)", this benchmark is written for Python 3.

--

details:

Common platform: Bits: int=32, long=64, long long=64, size_t=64, void*=64 Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Platform of campaign py27: CFLAGS: -fno-strict-aliasing -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv Python unicode implementation: UCS-4 Timer precision: 954 ns Python version: 2.7.10 (default, Sep 8 2015, 17:20:17) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] Timer: time.time

Platform of campaign py34: Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09) CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv Timer precision: 84 ns Python unicode implementation: PEP 393 Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] Timer: time.perf_counter

Platform of campaign py36: Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09) Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)] CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes Python unicode implementation: PEP 393 Timer: time.perf_counter

Platform of campaign fast: Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09) CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes Python unicode implementation: PEP 393 Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]

msg264009 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-22 12:52

Could you compare filter(), map() and sorted() performance with your patch and with patch?

msg264021 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-22 14:56

Results of the CPython benchmark suite on the revision 6c376e866330 of https://hg.python.org/sandbox/fastcall compared to CPython 3.6 at the revision 496e094f4734.

It's surprising that call_simple is 1.08x slower with fastcall. This slowdown is not acceptable and should be fixed. It probably explains why many other benchmarks are slower.

Fortunately, some benchmarks are faster, between 1.02x and 1.09x faster.

IMHO there are still performance issues in my current implementation that can and must be fixed. At least we have a starting point for comparing performance.

$ python3 -u perf.py ../default/python ../fastcall/python -b all
(...)
Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64
Total CPU cores: 8

[ slower ]

2to3

6.859604 -> 6.985351: 1.02x slower

call_method_slots

Min: 0.308846 -> 0.317780: 1.03x slower Avg: 0.308902 -> 0.318667: 1.03x slower Significant (t=-464.83) Stddev: 0.00003 -> 0.00026: 9.8974x larger

call_simple

Min: 0.232594 -> 0.251789: 1.08x slower Avg: 0.232816 -> 0.252443: 1.08x slower Significant (t=-911.97) Stddev: 0.00024 -> 0.00011: 2.2373x smaller

chaos

Min: 0.273084 -> 0.284790: 1.04x slower Avg: 0.273951 -> 0.293177: 1.07x slower Significant (t=-7.57) Stddev: 0.00036 -> 0.01796: 49.9421x larger

django_v3

Min: 0.549604 -> 0.569982: 1.04x slower Avg: 0.550557 -> 0.571038: 1.04x slower Significant (t=-204.09) Stddev: 0.00046 -> 0.00054: 1.1747x larger

float

Min: 0.261939 -> 0.269224: 1.03x slower Avg: 0.268475 -> 0.276515: 1.03x slower Significant (t=-12.22) Stddev: 0.00301 -> 0.00354: 1.1757x larger

formatted_logging

Min: 0.325786 -> 0.334440: 1.03x slower Avg: 0.326827 -> 0.335968: 1.03x slower Significant (t=-34.44) Stddev: 0.00129 -> 0.00136: 1.0503x larger

mako_v2

Min: 0.039642 -> 0.044765: 1.13x slower Avg: 0.040251 -> 0.045562: 1.13x slower Significant (t=-323.73) Stddev: 0.00028 -> 0.00024: 1.1558x smaller

meteor_contest

Min: 0.196589 -> 0.203667: 1.04x slower Avg: 0.197497 -> 0.204782: 1.04x slower Significant (t=-76.06) Stddev: 0.00050 -> 0.00045: 1.1111x smaller

nqueens

Min: 0.274664 -> 0.285866: 1.04x slower Avg: 0.275285 -> 0.286774: 1.04x slower Significant (t=-68.34) Stddev: 0.00091 -> 0.00076: 1.2036x smaller

pickle_list

Min: 0.262687 -> 0.269629: 1.03x slower Avg: 0.263804 -> 0.270789: 1.03x slower Significant (t=-50.14) Stddev: 0.00070 -> 0.00070: 1.0004x larger

raytrace

Min: 1.272960 -> 1.284516: 1.01x slower Avg: 1.276398 -> 1.368574: 1.07x slower Significant (t=-3.41) Stddev: 0.00157 -> 0.19115: 122.0022x larger

regex_compile

Min: 0.335753 -> 0.343820: 1.02x slower Avg: 0.336273 -> 0.344894: 1.03x slower Significant (t=-127.84) Stddev: 0.00026 -> 0.00040: 1.5701x larger

regex_effbot

Min: 0.048656 -> 0.050810: 1.04x slower Avg: 0.048692 -> 0.051619: 1.06x slower Significant (t=-69.92) Stddev: 0.00002 -> 0.00030: 16.7793x larger

silent_logging

Min: 0.069539 -> 0.071172: 1.02x slower Avg: 0.069679 -> 0.071230: 1.02x slower Significant (t=-124.08) Stddev: 0.00009 -> 0.00002: 3.7073x smaller

simple_logging

Min: 0.278439 -> 0.287736: 1.03x slower Avg: 0.279504 -> 0.288811: 1.03x slower Significant (t=-52.46) Stddev: 0.00084 -> 0.00093: 1.1074x larger

telco

Min: 0.012480 -> 0.013104: 1.05x slower Avg: 0.012561 -> 0.013157: 1.05x slower Significant (t=-100.42) Stddev: 0.00004 -> 0.00002: 1.5881x smaller

unpack_sequence

Min: 0.000047 -> 0.000048: 1.03x slower Avg: 0.000047 -> 0.000048: 1.03x slower Significant (t=-1170.16) Stddev: 0.00000 -> 0.00000: 1.0749x larger

unpickle_list

Min: 0.325310 -> 0.330080: 1.01x slower Avg: 0.326484 -> 0.333974: 1.02x slower Significant (t=-24.19) Stddev: 0.00100 -> 0.00195: 1.9392x larger

[ faster ]

chameleon_v2

Min: 5.525575 -> 5.263668: 1.05x faster Avg: 5.541444 -> 5.281893: 1.05x faster Significant (t=85.79) Stddev: 0.01107 -> 0.01831: 1.6539x larger

etree_iterparse

Min: 0.212073 -> 0.197146: 1.08x faster Avg: 0.215504 -> 0.200254: 1.08x faster Significant (t=61.07) Stddev: 0.00119 -> 0.00130: 1.0893x larger

etree_parse

Min: 0.282983 -> 0.260390: 1.09x faster Avg: 0.284333 -> 0.262758: 1.08x faster Significant (t=77.34) Stddev: 0.00102 -> 0.00169: 1.6628x larger

etree_process

Min: 0.218953 -> 0.213683: 1.02x faster Avg: 0.221036 -> 0.215280: 1.03x faster Significant (t=25.98) Stddev: 0.00114 -> 0.00108: 1.0580x smaller

hexiom2

Min: 122.001408 -> 118.967112: 1.03x faster Avg: 122.108010 -> 119.110115: 1.03x faster Significant (t=16.81) Stddev: 0.15076 -> 0.20224: 1.3415x larger

pathlib

Min: 0.088533 -> 0.084888: 1.04x faster Avg: 0.088916 -> 0.085280: 1.04x faster Significant (t=257.68) Stddev: 0.00014 -> 0.00017: 1.1725x larger

The following not significant results are hidden, use -v to show them: call_method, call_method_unknown, etree_generate, fannkuch, fastpickle, fastunpickle, go, json_dump_v2, json_load, nbody, normal_startup, pickle_dict, pidigits, regex_v8, richards, spectral_norm, startup_nosite, tornado_http.

msg264098 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-24 06:37

I have collected statistics about the use of CALL_FUNCTION* opcodes in compiled code while running the CPython test suite. According to them, 99.4% of emitted call opcodes are the plain CALL_FUNCTION opcode, 89% of emitted CALL_FUNCTION opcodes have only positional arguments, and 98% of those have no more than 3 arguments.

That was about calls from Python code. All convenience C API functions (like PyObject_CallFunction and PyObject_CallFunctionObjArgs) used for direct calls from C code use only positional arguments.

Thus I think we only need to optimize calls with a small number (0-3) of positional arguments.

msg264101 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-24 07:15

Thus I think we only need to optimize calls with a small number (0-3) of positional arguments.

My code is optimized for up to 10 positional arguments: with 0..10 arguments, the C stack is used to hold the array of PyObject*. For more arguments, an array is allocated on the heap.

+# define _PyStack_SIZE 10

For keyword parameters, I don't know yet what the best (fastest) API is. Right now I'm also using the same PyObject** array for positional and keyword arguments with "int nk", but maybe a dictionary is faster for combining and parsing keyword arguments.
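The small-buffer strategy described above can be sketched as follows (mock code; the names and the cutoff are stand-ins for whatever the real patch uses, e.g. its _PyStack_SIZE):

```c
#include <assert.h>
#include <stdlib.h>

#define SMALL_STACK_SIZE 10  /* stands in for the patch's _PyStack_SIZE */

typedef struct { int value; } MockObj;  /* mock stand-in for PyObject */

/* Choose storage for n argument slots: reuse the caller's fixed-size
 * buffer when it is big enough, otherwise fall back to the heap. */
static MockObj **alloc_stack(MockObj **small_buf, int small_size, int n)
{
    if (n <= small_size) {
        return small_buf;  /* fast path: no heap allocation */
    }
    return (MockObj **)malloc((size_t)n * sizeof(MockObj *));
}

/* Release the storage only if it came from the heap. */
static void free_stack(MockObj **stack, MockObj **small_buf)
{
    if (stack != small_buf) {
        free(stack);
    }
}
```

The trade-off Serhiy raises below is exactly the value of SMALL_STACK_SIZE: a bigger buffer serves more calls without malloc but costs more C stack per call frame.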

msg264102 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-24 07:37

I think you can simplify the patch by dropping keyword argument support from fastcall. Then you can decrease _PyStack_SIZE to 4 (a larger size would only serve 1.7% of calls), and maybe refactor the code, since an array of 4 pointers consumes less C stack than an array of 10 pointers.

msg264518 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-29 20:35

Results of the CPython benchmark suite. Reference = default branch at rev 496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

I ran into many issues getting a reliable benchmark output.

The benchmark was run with CPU isolation. Both binaries were compiled with PGO+LTO.

Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64
Total CPU cores: 8

call_method_slots

Min: 0.289704 -> 0.269634: 1.07x faster Avg: 0.290149 -> 0.275953: 1.05x faster Significant (t=162.17) Stddev: 0.00019 -> 0.00150: 8.1176x larger

call_method_unknown

Min: 0.275295 -> 0.302810: 1.10x slower Avg: 0.280201 -> 0.309166: 1.10x slower Significant (t=-200.65) Stddev: 0.00161 -> 0.00191: 1.1909x larger

call_simple

Min: 0.202163 -> 0.207939: 1.03x slower Avg: 0.202332 -> 0.208662: 1.03x slower Significant (t=-636.09) Stddev: 0.00008 -> 0.00015: 2.0130x larger

chameleon_v2

Min: 4.349474 -> 3.901936: 1.11x faster Avg: 4.377664 -> 3.942932: 1.11x faster Significant (t=62.39) Stddev: 0.01403 -> 0.06826: 4.8635x larger

django_v3

Min: 0.484456 -> 0.462013: 1.05x faster Avg: 0.489186 -> 0.465189: 1.05x faster Significant (t=53.10) Stddev: 0.00415 -> 0.00180: 2.3096x smaller

etree_generate

Min: 0.193538 -> 0.182069: 1.06x faster Avg: 0.196306 -> 0.184403: 1.06x faster Significant (t=65.94) Stddev: 0.00140 -> 0.00115: 1.2181x smaller

etree_iterparse

Min: 0.189955 -> 0.177583: 1.07x faster Avg: 0.195268 -> 0.183411: 1.06x faster Significant (t=27.04) Stddev: 0.00316 -> 0.00304: 1.0386x smaller

etree_process

Min: 0.166556 -> 0.158617: 1.05x faster Avg: 0.168822 -> 0.160672: 1.05x faster Significant (t=43.33) Stddev: 0.00125 -> 0.00140: 1.1205x larger

fannkuch

Min: 0.859842 -> 0.878412: 1.02x slower Avg: 0.865138 -> 0.889188: 1.03x slower Significant (t=-14.97) Stddev: 0.00718 -> 0.01436: 2.0000x larger

float

Min: 0.222095 -> 0.214706: 1.03x faster Avg: 0.226273 -> 0.218210: 1.04x faster Significant (t=21.61) Stddev: 0.00307 -> 0.00212: 1.4469x smaller

hexiom2

Min: 100.489630 -> 94.765364: 1.06x faster Avg: 101.204871 -> 94.885605: 1.07x faster Significant (t=77.45) Stddev: 0.25310 -> 0.05016: 5.0454x smaller

meteor_contest

Min: 0.181076 -> 0.176904: 1.02x faster Avg: 0.181759 -> 0.177783: 1.02x faster Significant (t=43.68) Stddev: 0.00061 -> 0.00067: 1.1041x larger

nbody

Min: 0.208752 -> 0.217011: 1.04x slower Avg: 0.211552 -> 0.219621: 1.04x slower Significant (t=-69.45) Stddev: 0.00080 -> 0.00084: 1.0526x larger

pathlib

Min: 0.077121 -> 0.070698: 1.09x faster Avg: 0.078310 -> 0.071958: 1.09x faster Significant (t=133.39) Stddev: 0.00069 -> 0.00081: 1.1735x larger

pickle_dict

Min: 0.530379 -> 0.514363: 1.03x faster Avg: 0.531325 -> 0.515902: 1.03x faster Significant (t=154.33) Stddev: 0.00086 -> 0.00050: 1.7213x smaller

pickle_list

Min: 0.253445 -> 0.263959: 1.04x slower Avg: 0.255362 -> 0.267402: 1.05x slower Significant (t=-95.47) Stddev: 0.00075 -> 0.00101: 1.3447x larger

raytrace

Min: 1.071042 -> 1.030849: 1.04x faster Avg: 1.076629 -> 1.109029: 1.03x slower Significant (t=-3.93) Stddev: 0.00199 -> 0.08246: 41.4609x larger

regex_compile

Min: 0.286053 -> 0.273454: 1.05x faster Avg: 0.287171 -> 0.274422: 1.05x faster Significant (t=153.16) Stddev: 0.00067 -> 0.00050: 1.3452x smaller

regex_effbot

Min: 0.044186 -> 0.048192: 1.09x slower Avg: 0.044336 -> 0.048513: 1.09x slower Significant (t=-172.41) Stddev: 0.00020 -> 0.00014: 1.4671x smaller

richards

Min: 0.137456 -> 0.135029: 1.02x faster
Avg: 0.138993 -> 0.136028: 1.02x faster
Significant (t=20.35)
Stddev: 0.00116 -> 0.00088: 1.3247x smaller

silent_logging

Min: 0.060288 -> 0.056344: 1.07x faster
Avg: 0.060380 -> 0.056518: 1.07x faster
Significant (t=310.27)
Stddev: 0.00011 -> 0.00005: 2.1029x smaller

telco

Min: 0.010735 -> 0.010441: 1.03x faster
Avg: 0.010849 -> 0.010557: 1.03x faster
Significant (t=34.04)
Stddev: 0.00007 -> 0.00005: 1.3325x smaller

unpickle_list

Min: 0.290750 -> 0.297958: 1.02x slower
Avg: 0.292741 -> 0.299419: 1.02x slower
Significant (t=-41.62)
Stddev: 0.00133 -> 0.00090: 1.4852x smaller

The following not significant results are hidden, use -v to show them: 2to3, call_method, chaos, etree_parse, fastpickle, fastunpickle, formatted_logging, go, json_dump_v2, json_load, mako_v2, normal_startup, nqueens, pidigits, regex_v8, simple_logging, spectral_norm, startup_nosite, tornado_http, unpack_sequence.

msg264519 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-29 20:37

Results of the CPython benchmark suite. Reference = default branch at rev 496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

Oh, I forgot to mention that I modified perf.py to run each benchmark using 10 fresh processes to test multiple random seeds for the randomized hash function, instead of testing a fixed seed (PYTHONHASHSEED=1). This change should reduce the noise in the benchmark results.

I ran the benchmark suite using --rigorous.

I will open a new issue later for my perf.py change.

msg264525 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-29 21:43

Could you repeat benchmarks on different computer? Better with different CPU or compiler.

msg264526 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-29 21:55

Could you repeat benchmarks on different computer? Better with different CPU or compiler.

Sorry, I don't really have the bandwidth to repeat the benchmarks. PGO+LTO compilation is slow, and running the benchmark suite in rigorous mode is very slow.

What do you expect from running the benchmark on a different computer?

msg264529 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-04-29 22:16

The results look like noise. Some tests become slower, others become faster. If results on a different machine show the same sets of tests slowing down and speeding up, then it is likely not noise.

msg264530 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-04-29 22:23

The results look like noise.

As I wrote, it's really hard to get a reliable benchmark result. I did my best.

See also discussions about the CPython benchmark suite on the speed list: https://mail.python.org/pipermail/speed/

I'm not sure that you will get less noise on other computers. IMHO many benchmarks are simply "broken" (not reliable).

msg265856 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-05-19 13:30

Hi,

I made progress on my FASTCALL branch. I removed the tp_fastnew, tp_fastinit and tp_fastcall fields from PyTypeObject and replaced them with new type flags (ex: Py_TPFLAGS_FASTNEW) to avoid code duplication and reduce the memory footprint. Before, each function was simply duplicated.

This change introduces a backward incompatibility: it is no longer possible to call tp_new, tp_init and tp_call directly. I don't know yet whether such a change would be acceptable in Python 3.6, nor whether it is worth it.

I spent a lot of time on the CPython benchmark suite to check for performance regressions. In fact, I spent most of my time trying to understand why most benchmarks looked completely unstable. I have now tuned my system correctly and patched perf.py to get reliable benchmarks.

On the latest run of the benchmark suite, most benchmarks are faster! I have to investigate why 3 benchmarks are still slower. In this run, normal_startup was not significant, etree_parse was faster (instead of slower), but raytrace was already slower (though only 1.13x slower). It may be the "noise" of the PGO compilation. I already noticed that once: see issue #27056 "pickle: constant propagation in _Unpickler_Read()".

Result of the benchmark suite:

slower (3):

faster (18):

not significant (21):

I know that my patch is simply giant and cannot be merged like that.

Since the performance is still promising, I plan to split my giant patch into smaller patches, easier to review. I will try to check that individual patches don't make Python slower. This work will take time.

msg265857 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-05-19 13:37

New patch: 34456cce64bb.patch

$ diffstat 34456cce64bb.patch
 .hgignore                                     |    3
 Makefile.pre.in                               |   37
 b/Doc/includes/shoddy.c                       |    2
 b/Include/Python.h                            |    1
 b/Include/abstract.h                          |   17
 b/Include/descrobject.h                       |   14
 b/Include/funcobject.h                        |    6
 b/Include/methodobject.h                      |    6
 b/Include/modsupport.h                        |   20
 b/Include/object.h                            |   28
 b/Lib/json/encoder.py                         |    1
 b/Lib/test/test_extcall.py                    |   19
 b/Lib/test/test_sys.py                        |    6
 b/Modules/_collectionsmodule.c                |   14
 b/Modules/_csv.c                              |   15
 b/Modules/_ctypes/_ctypes.c                   |   12
 b/Modules/_ctypes/stgdict.c                   |    2
 b/Modules/_datetimemodule.c                   |   47
 b/Modules/_elementtree.c                      |   11
 b/Modules/_functoolsmodule.c                  |  113
 b/Modules/_io/clinic/_iomodule.c.h            |    8
 b/Modules/_io/clinic/bufferedio.c.h           |   42
 b/Modules/_io/clinic/bytesio.c.h              |   42
 b/Modules/_io/clinic/fileio.c.h               |   26
 b/Modules/_io/clinic/iobase.c.h               |   26
 b/Modules/_io/clinic/stringio.c.h             |   34
 b/Modules/_io/clinic/textio.c.h               |   40
 b/Modules/_io/iobase.c                        |    4
 b/Modules/_json.c                             |   24
 b/Modules/_lsprof.c                           |    4
 b/Modules/_operator.c                         |   11
 b/Modules/_pickle.c                           |  106
 b/Modules/_posixsubprocess.c                  |   15
 b/Modules/_sre.c                              |   11
 b/Modules/_ssl.c                              |    9
 b/Modules/_testbuffer.c                       |    4
 b/Modules/_testcapimodule.c                   |    4
 b/Modules/_threadmodule.c                     |   32
 b/Modules/_tkinter.c                          |   11
 b/Modules/arraymodule.c                       |   29
 b/Modules/cjkcodecs/clinic/multibytecodec.c.h |   50
 b/Modules/clinic/_bz2module.c.h               |    8
 b/Modules/clinic/_codecsmodule.c.h            |  318
 b/Modules/clinic/_cryptmodule.c.h             |   10
 b/Modules/clinic/_datetimemodule.c.h          |    8
 b/Modules/clinic/_dbmmodule.c.h               |   26
 b/Modules/clinic/_elementtree.c.h             |   86
 b/Modules/clinic/_gdbmmodule.c.h              |   26
 b/Modules/clinic/_lzmamodule.c.h              |   16
 b/Modules/clinic/_opcode.c.h                  |   10
 b/Modules/clinic/_pickle.c.h                  |   34
 b/Modules/clinic/_sre.c.h                     |  124
 b/Modules/clinic/_ssl.c.h                     |   74
 b/Modules/clinic/_tkinter.c.h                 |   50
 b/Modules/clinic/_winapi.c.h                  |  124
 b/Modules/clinic/arraymodule.c.h              |   34
 b/Modules/clinic/audioop.c.h                  |  210
 b/Modules/clinic/binascii.c.h                 |   36
 b/Modules/clinic/cmathmodule.c.h              |   24
 b/Modules/clinic/fcntlmodule.c.h              |   34
 b/Modules/clinic/grpmodule.c.h                |   14
 b/Modules/clinic/md5module.c.h                |    8
 b/Modules/clinic/posixmodule.c.h              |  642
 b/Modules/clinic/pyexpat.c.h                  |   32
 b/Modules/clinic/sha1module.c.h               |    8
 b/Modules/clinic/sha256module.c.h             |   14
 b/Modules/clinic/sha512module.c.h             |   14
 b/Modules/clinic/signalmodule.c.h             |   50
 b/Modules/clinic/unicodedata.c.h              |   42
 b/Modules/clinic/zlibmodule.c.h               |   68
 b/Modules/itertoolsmodule.c                   |   20
 b/Modules/main.c                              |    2
 b/Modules/pyexpat.c                           |    3
 b/Modules/signalmodule.c                      |    9
 b/Modules/xxsubtype.c                         |    4
 b/Objects/abstract.c                          |  403
 b/Objects/bytesobject.c                       |    2
 b/Objects/classobject.c                       |   36
 b/Objects/clinic/bytearrayobject.c.h          |   90
 b/Objects/clinic/bytesobject.c.h              |   66
 b/Objects/clinic/dictobject.c.h               |   10
 b/Objects/clinic/unicodeobject.c.h            |   10
 b/Objects/descrobject.c                       |  162
 b/Objects/dictobject.c                        |   26
 b/Objects/enumobject.c                        |    8
 b/Objects/exceptions.c                        |   91
 b/Objects/fileobject.c                        |   29
 b/Objects/floatobject.c                       |   25
 b/Objects/funcobject.c                        |   77
 b/Objects/genobject.c                         |    2
 b/Objects/iterobject.c                        |    6
 b/Objects/listobject.c                        |   20
 b/Objects/longobject.c                        |   40
 b/Objects/methodobject.c                      |  139
 b/Objects/object.c                            |    4
 b/Objects/odictobject.c                       |    2
 b/Objects/rangeobject.c                       |   12
 b/Objects/tupleobject.c                       |   21
 b/Objects/typeobject.c                        | 1463
 b/Objects/unicodeobject.c                     |   58
 b/Objects/weakrefobject.c                     |   22
 b/PC/clinic/msvcrtmodule.c.h                  |   42
 b/PC/clinic/winreg.c.h                        |  128
 b/PC/clinic/winsound.c.h                      |   26
 b/PCbuild/pythoncore.vcxproj                  |    4
 b/Parser/tokenizer.c                          |    7
 b/Python/ast.c                                |   31
 b/Python/bltinmodule.c                        |  173
 b/Python/ceval.c                              |  591
 b/Python/clinic/bltinmodule.c.h               |  104
 b/Python/clinic/import.c.h                    |   18
 b/Python/codecs.c                             |   17
 b/Python/errors.c                             |  105
 b/Python/getargs.c                            |  284
 b/Python/import.c                             |   27
 b/Python/modsupport.c                         |  244
 b/Python/pythonrun.c                          |   10
 b/Python/sysmodule.c                          |   32
 b/Tools/clinic/clinic.py                      |  115
 pystack.c                                     |  288
 pystack.h                                     |   64
 121 files changed, 5420 insertions(+), 2802 deletions(-)

msg265859 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-05-19 13:38

Status of my FASTCALL implementation (34456cce64bb.patch):

A large part of the patch changes existing code to use the new calling convention in many functions of many modules. Some changes were generated by the Argument Clinic. IMHO the best approach would be to use the Argument Clinic in more places, rather than patching the code manually.

msg265887 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-05-19 19:51

Result of the benchmark suite:

slower (3):

Hum, I recompiled the patched Python, again with PGO+LTO, and ran the same benchmark with the same command. In short, I replayed exactly the same scenario. And... only raytrace remains slower; etree_parse and normal_startup moved to the "not significant" list.

The difference between the two reports doesn't come from the benchmark itself. For example, I ran the normal_startup benchmark again 3 times and got the same result each time.

normal_startup

Avg: 0.295168 +/- 0.000991 -> 0.294926 +/- 0.00048: 1.00x faster
Not significant

normal_startup

Avg: 0.294871 +/- 0.000606 -> 0.294883 +/- 0.00072: 1.00x slower
Not significant

normal_startup

Avg: 0.295096 +/- 0.000706 -> 0.294967 +/- 0.00068: 1.00x faster
Not significant

IMHO the difference comes from the data collected by PGO.

msg265896 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-05-19 21:03

In short, I replayed exactly the same scenario. And... Only raytrace remains slower, (...)

Oh, it looks like the reference binary calls the garbage collector less frequently than the patched Python. With the patched Python, collections of generation 2 are needed, whereas no generation 2 collection is needed with the reference binary.

msg265938 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-05-20 12:05

unpickle_list: 1.11x faster

This result was unfair: my fastcall branch contained the optimization of the issue #27056. I just pushed this optimization into the default branch.

I ran the benchmark again: the result is now "not significant", as expected. Since it is a microbenchmark testing the C functions of Modules/_pickle.c, it doesn't really rely on the performance of (C or Python) function calls.

msg266359 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-05-25 14:05

I fixed even more issues with my benchmark setup. Results should be even more reliable. Moreover, I fixed multiple reference leaks in the code which had introduced performance regressions. I started to write articles to explain how to run stable benchmarks:

Summary of benchmarks at the revision e6f3bf996c01:

Faster (25):

Slower (4):

Not significant (13):

I'm now investigating why 4 benchmarks are slower.

Note: I'm still using my patched CPython benchmark suite to get more stable benchmarks. I will send the patches upstream later.

msg274124 - (view)

Author: STINNER Victor (vstinner) * (Python committer)

Date: 2016-09-01 13:15

I split the giant patch into smaller patches that are easier to review. The first part (_PyObject_FastCall, _PyObject_FastCallDict) is already merged. Other issues were opened to implement the full feature. I now close this issue.

History

Date                 | User             | Action | Args
2022-04-11 14:58:29  | admin            | set    | github: 71001
2016-09-01 13:15:25  | vstinner         | set    | status: open -> closed; resolution: fixed; messages: +
2016-05-25 14:05:19  | vstinner         | set    | messages: +
2016-05-20 12:05:27  | vstinner         | set    | messages: +
2016-05-19 21:03:51  | vstinner         | set    | messages: +
2016-05-19 19:51:46  | vstinner         | set    | messages: +
2016-05-19 13:38:54  | vstinner         | set    | messages: +
2016-05-19 13:38:19  | vstinner         | set    | files: + 34456cce64bb.patch; messages: +
2016-05-19 13:36:19  | vstinner         | set    | files: - 34456cce64bb.diff
2016-05-19 13:35:17  | vstinner         | set    | files: + 34456cce64bb.diff
2016-05-19 13:30:46  | vstinner         | set    | messages: +
2016-05-09 22:55:14  | jstasiak         | set    | nosy: + jstasiak
2016-04-29 22:23:44  | vstinner         | set    | messages: +
2016-04-29 22:16:35  | serhiy.storchaka | set    | messages: +
2016-04-29 21:55:12  | vstinner         | set    | messages: +
2016-04-29 21:43:53  | serhiy.storchaka | set    | messages: +
2016-04-29 20:37:52  | vstinner         | set    | messages: +
2016-04-29 20:35:56  | vstinner         | set    | messages: +
2016-04-24 07:37:58  | serhiy.storchaka | set    | messages: +
2016-04-24 07:15:35  | vstinner         | set    | messages: +
2016-04-24 06:37:35  | serhiy.storchaka | set    | messages: +
2016-04-22 14:56:57  | vstinner         | set    | messages: +
2016-04-22 12:52:39  | serhiy.storchaka | set    | messages: +
2016-04-22 11:52:19  | vstinner         | set    | files: + bench_fast-2.py; messages: +
2016-04-22 11:40:11  | vstinner         | set    | files: + bench_fast.py; messages: +
2016-04-22 11:12:30  | vstinner         | set    | messages: +
2016-04-22 11:10:16  | vstinner         | set    | messages: +
2016-04-22 10:41:52  | vstinner         | set    | files: + ad4a53ed1fbf.diff
2016-04-22 00:44:11  | vstinner         | set    | messages: +; title: Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments -> [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
2016-04-22 00:41:28  | vstinner         | set    | hgrepos: + hgrepo342
2016-04-21 17:04:54  | serhiy.storchaka | set    | messages: +
2016-04-21 15:05:13  | vstinner         | set    | messages: +
2016-04-21 15:03:09  | vstinner         | set    | files: + call_stack-3.patch; messages: +; title: Add a new _PyObject_CallStack() function which avoids the creation of a tuple or dict for arguments -> Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
2016-04-21 14:24:16  | larry            | set    | messages: +
2016-04-21 13:45:49  | serhiy.storchaka | set    | messages: +
2016-04-21 10:42:27  | vstinner         | set    | files: + call_stack-2.patch; messages: +
2016-04-21 10:28:27  | serhiy.storchaka | set    | messages: +
2016-04-21 10:20:50  | vstinner         | set    | messages: +
2016-04-21 09:53:53  | serhiy.storchaka | set    | messages: +
2016-04-21 08:58:02  | vstinner         | set    | nosy: + yselivanov
2016-04-21 08:57:21  | vstinner         | create |