gh-87729: add LOAD_SUPER_ATTR instruction for faster super() by carljm · Pull Request #103497 · python/cpython
This PR speeds up `super()` (by around 85%, for a simple one-level `super().meth()` microbenchmark) by avoiding allocation of a new single-use `super` object on each use.
Microbenchmark results
With this PR:
➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
.....................
Mean +- std dev: 70.4 ns +- 1.4 ns
Without this PR:
➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
.....................
Mean +- std dev: 130 ns +- 1 ns
Microbenchmark code
➜ cat superbench.py
```python
class A:
    def meth(self):
        return 1

class B(A):
    def meth(self):
        return super().meth()

b = B()
```
Microbenchmark numbers are the same (both pre and post) if the microbenchmark is switched to use `return super(B, self).meth()` instead.
`super()` is already special-cased in the compiler to ensure the presence of the `__class__` cell needed by zero-argument `super()`. This PR extends that special-casing a bit in order to compile `super().meth()` as

```
 4 LOAD_GLOBAL       0 (super)
14 LOAD_DEREF        1 (__class__)
16 LOAD_FAST         0 (self)
18 LOAD_SUPER_ATTR   5 (NULL|self + meth)
20 CALL              0
```

instead of the current

```
 4 LOAD_GLOBAL       1 (NULL + super)
14 CALL              0
22 LOAD_ATTR         3 (NULL|self + meth)
42 CALL              0
```
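To compare the emitted bytecode on your own build, the stdlib `dis` module is enough. A quick sketch (the exact opcodes and offsets depend on which CPython version you run it under):

```python
import dis

class A:
    def meth(self):
        return 1

class B(A):
    def meth(self):
        return super().meth()

# Print the disassembly: on builds with this PR it should show
# LOAD_SUPER_ATTR; on older builds, the LOAD_GLOBAL / CALL /
# LOAD_ATTR sequence.
dis.dis(B.meth)

# In either compilation scheme, the lookup starts from a
# LOAD_GLOBAL of the name "super".
opnames = [i.opname for i in dis.Bytecode(B.meth)]
assert "LOAD_GLOBAL" in opnames
```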
Bytecode comparison for simple attribute
And compile `super().attr` as

```
 4 LOAD_GLOBAL       0 (super)
14 LOAD_DEREF        1 (__class__)
16 LOAD_FAST         0 (self)
18 LOAD_SUPER_ATTR   4 (attr)
```

instead of the current

```
 4 LOAD_GLOBAL       1 (NULL + super)
14 CALL              0
22 LOAD_ATTR         2 (attr)
```
The new bytecode has one more instruction, but still ends up executing much faster, because it eliminates the cost of allocating a new single-use `super` object each time. For zero-arg super, it also eliminates dynamically figuring out each time, via frame introspection, where to find the `self` argument and the `__class__` cell, even though the location of both is already known at compile time.
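The allocation cost described above can be made visible by writing out, in plain Python, what the old bytecode effectively does on every call (a sketch for illustration; the class names are invented):

```python
class A:
    def meth(self):
        return 1

class B(A):
    def meth(self):
        # What the pre-PR bytecode does, spelled out: allocate a
        # fresh super object on every call, use it exactly once,
        # then discard it.
        bound = super(B, self)
        return bound.meth()

assert B().meth() == 1
```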
The `LOAD_GLOBAL` of `super` remains only in order to support existing semantics in case the name `super` is re-bound to some other callable besides the built-in `super` type.
Besides being faster, the new bytecode is preferable because it regularizes the loading of `self` and `__class__` to use the normal `LOAD_FAST` and `LOAD_DEREF` opcodes, instead of custom code in the `super` object (not part of the interpreter) relying on private details of interpreter frames to load these in a bespoke way. This helps optimizers like the Cinder JIT that fully support `LOAD_FAST` and `LOAD_DEREF` but may not maintain frame locals in the same way. It also makes the bytecode more easily amenable to future optimization by a type-specializing tier 2 interpreter, because `__class__` and `self` will now be surfaced and visible to the optimizer in the usual way, rather than hidden inside the `super` object.
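The `__class__` cell in question is an ordinary free variable of the method, which is why a plain `LOAD_DEREF` can fetch it. This is observable from the code object (a small illustration, not part of the PR):

```python
class A:
    pass

class B(A):
    def meth(self):
        # Merely mentioning super (or __class__) in the body makes
        # the compiler create a __class__ cell for this function.
        return super()

# __class__ shows up as a regular free variable of the method.
assert "__class__" in B.meth.__code__.co_freevars
```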
I'll follow up with a specialization of `LOAD_SUPER_ATTR` for the case where we are looking up a method and a method is found (because this is a common case, and a case where the output of `LOAD_SUPER_ATTR` depends only on the type of `self` and not on the actual instance). But to simplify review, I'll do this in a separate PR. I think the benefits of this PR stand alone, even without further benefits of specialization. (ETA: the specialization is now also ready at https://github.com/carljm/cpython/compare/superopt...carljm:cpython:superopt_spec?expand=1 and increases the microbenchmark win from 85% to 2.3x.)
The frame introspection code for runtime/dynamic zero-arg `super()` still remains, but after this PR it would only ever be used in an odd edge case like `super(*args)` (if `args` turns out to be empty at runtime), where we can't detect at compile time whether we will have zero-arg or two-arg `super()`.
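For concreteness, here is the kind of call that still goes through the frame-introspection path (a hypothetical example, not taken from the PR):

```python
class A:
    def meth(self):
        return "A"

class B(A):
    def meth(self, *args):
        # The compiler cannot tell whether this will be a zero-arg
        # or a two-arg super() until runtime, so no LOAD_SUPER_ATTR
        # is emitted and the dynamic path is used instead.
        return super(*args).meth()

b = B()
assert b.meth() == "A"       # args empty: zero-arg super at runtime
assert b.meth(B, b) == "A"   # two-arg super at runtime
```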
"Odd" uses of `super()` (like one-argument `super`, use of a super object as a descriptor, etc.) are still supported and experience no change; the compiler will not emit the new `LOAD_SUPER_ATTR` opcode for them.
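Two of the "odd" uses in question, shown concretely (an illustration only; these calls compile to the ordinary `LOAD_GLOBAL`/`CALL`/`LOAD_ATTR` sequence and are unaffected by this PR):

```python
class A:
    def greet(self):
        return "A"

class B(A):
    def greet(self):
        return "B"

b = B()

# One-argument super is "unbound"; used as a descriptor it binds
# to the instance on attribute access.
B.up = super(B)
assert b.up.greet() == "A"

# An explicitly constructed super object can also be kept and reused,
# rather than being single-use.
s = super(B, b)
assert s.greet() == "A"
```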
I chose to make the new opcode more general by using it for both (statically detectable) zero- and two-arg super. Optimizing zero-arg super is more important because it is more common in modern Python code, and because it also eliminates the frame introspection. But supporting two-arg super costs only one extra bit smuggled via the oparg; this seems worth it.
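Working backwards from the disassembly shown earlier (`LOAD_SUPER_ATTR 5` for the method form, `LOAD_SUPER_ATTR 4` for plain attribute access), the oparg packing can be sketched roughly as below. Note this decoding is my reading of the listings above; the exact bit layout is an implementation detail and may differ between CPython versions:

```python
def decode_super_attr_oparg(oparg):
    """Hypothetical decoding of a LOAD_SUPER_ATTR oparg (sketch only)."""
    name_index = oparg >> 2        # index into co_names for the attribute
    load_method = bool(oparg & 1)  # method-call form (NULL|self + name)
    two_arg = bool(oparg & 2)      # statically-detected two-arg super()
    return name_index, load_method, two_arg

# Consistent with the disassembly shown above:
assert decode_super_attr_oparg(5) == (1, True, False)   # NULL|self + meth
assert decode_super_attr_oparg(4) == (1, False, False)  # attr
```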
Real-world results and macrobenchmarks
This approach provides a speed-up of about 0.5% globally on the Instagram server real-world workload (measured recently on Python 3.10). I can work on a macrobenchmark for the `pyperformance` suite that exercises `super()` (currently it isn't significantly exercised by any benchmark). (ETA: the benchmark is now ready at python/pyperformance#271 -- this diff improves its performance by 10%, and the specialization follow-up by another 10%.)
Prior art
This PR is essentially an updated version of #24936 -- thanks to @vladima for the original inspiration for this approach. Notable differences from that PR:

- I avoid turning the oparg for the new opcode into a const load, preferring to pass the needed bits of information by bit-shifting the oparg instead (following the precedent of `LOAD_ATTR`).
- I prioritize code simplicity over performance in edge cases like when a `super()` attribute access raises `AttributeError`, which also reduces the footprint of the PR.
#30992 was an attempt to optimize `super()` solely using the specializing interpreter, but it was never merged because there are too many problems caused by adaptive super-instructions in the tier 1 specializing interpreter.