gh-87729: add LOAD_SUPER_ATTR instruction for faster super() by carljm · Pull Request #103497 · python/cpython

This PR speeds up super() by around 85% on a simple one-level super().meth() microbenchmark, by avoiding allocation of a new single-use super object on each use.

Microbenchmark results

With this PR:

➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
.....................
Mean +- std dev: 70.4 ns +- 1.4 ns

Without this PR:

➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
.....................
Mean +- std dev: 130 ns +- 1 ns

Microbenchmark code

➜ cat superbench.py
class A:
    def meth(self):
        return 1

class B(A):
    def meth(self):
        return super().meth()

b = B()

Microbenchmark numbers are the same (both pre and post) if the microbenchmark is switched to use return super(B, self).meth() instead.
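
A rough stdlib-only way to reproduce this measurement is the timeit module (pyperf, used above, is more rigorous about warmup and system noise, so absolute numbers will differ):

```python
# Sketch: reproduce the microbenchmark with stdlib timeit instead of pyperf.
import timeit

class A:
    def meth(self):
        return 1

class B(A):
    def meth(self):
        return super().meth()

b = B()

# Best of 5 runs of 100k calls, converted to nanoseconds per call.
per_call = min(
    timeit.repeat("b.meth()", globals={"b": b}, number=100_000, repeat=5)
) / 100_000
print(f"{per_call * 1e9:.1f} ns per call")
```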

super() is already special-cased in the compiler to ensure the presence of the __class__ cell needed by zero-argument super(). This PR extends that special-casing slightly so that super().meth() compiles to

              4 LOAD_GLOBAL              0 (super)
             14 LOAD_DEREF               1 (__class__)
             16 LOAD_FAST                0 (self)
             18 LOAD_SUPER_ATTR          5 (NULL|self + meth)
             20 CALL                     0

instead of the current:

              4 LOAD_GLOBAL              1 (NULL + super)
             14 CALL                     0
             22 LOAD_ATTR                3 (NULL|self + meth)
             42 CALL                     0
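
Listings like the ones above can be regenerated with the stdlib dis module; exact opcodes, offsets, and opargs vary by CPython version (LOAD_SUPER_ATTR only exists from 3.12 on), so this sketch inspects opcode names rather than matching the output verbatim:

```python
# Sketch: inspect the bytecode the compiler emits for a super() call site.
import dis
import sys

class A:
    def meth(self):
        return 1

class B(A):
    def meth(self):
        return super().meth()

opnames = {instr.opname for instr in dis.get_instructions(B.meth)}

if sys.version_info >= (3, 12):
    # On 3.12+ the compiler emits the fused opcode described in this PR.
    assert "LOAD_SUPER_ATTR" in opnames

dis.dis(B.meth)  # prints the full listing for the running interpreter
```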

Bytecode comparison for simple attribute

And compile super().attr as

              4 LOAD_GLOBAL              0 (super)
             14 LOAD_DEREF               1 (__class__)
             16 LOAD_FAST                0 (self)
             18 LOAD_SUPER_ATTR          4 (attr)

instead of the current:

              4 LOAD_GLOBAL              1 (NULL + super)
             14 CALL                     0
             22 LOAD_ATTR                2 (attr)

The new bytecode has one more instruction, but still ends up executing much faster, because it eliminates the cost of allocating a new single-use super object each time. For zero-arg super, it also eliminates dynamically figuring out each time via frame introspection where to find the self argument and __class__ cell, even though the location of both is already known at compile time.
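
The allocation cost described above is easy to observe directly: each evaluation of super() constructs a brand-new, single-use object, which CPython does not cache or reuse:

```python
class A:
    def meth(self):
        return "A.meth"

class B(A):
    pass

b = B()

# Every super() call builds a fresh super object wrapping (type, instance).
s1 = super(B, b)
s2 = super(B, b)
assert s1 is not s2            # no caching: two distinct objects
assert type(s1) is super
assert s1.meth() == "A.meth"   # attribute lookup starts after B in the MRO
```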

The LOAD_GLOBAL of super remains only to preserve existing semantics in case the name super has been re-bound to some callable other than the built-in super type.
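
An illustrative sketch of the semantics being preserved: if a module rebinds the name super, the call site picks up the rebound callable instead of the builtin. (The names _logging_super and _builtin_super are invented for this sketch; the two-arg form is used so the replacement receives explicit arguments.)

```python
class A:
    def greet(self):
        return "A"

class B(A):
    def greet(self):
        # Two-arg form, so the rebound callable gets explicit arguments.
        return super(B, self).greet() + "B"

calls = []
_builtin_super = super

def _logging_super(*args):      # hypothetical replacement for super
    calls.append(args)
    return _builtin_super(*args)

super = _logging_super          # rebind the module-global name

assert B().greet() == "AB"
assert calls and calls[0][0] is B   # the rebound callable was used
```

On 3.12+, LOAD_SUPER_ATTR checks at runtime whether the loaded global really is the builtin super type and falls back to an ordinary call when it is not, which is why this keeps working.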

Besides being faster, the new bytecode is preferable because it regularizes the loading of self and __class__ to use the normal LOAD_FAST and LOAD_DEREF opcodes, instead of custom code in the super object (not part of the interpreter) relying on private details of interpreter frames to load these in a bespoke way. This helps optimizers like the Cinder JIT that fully support LOAD_FAST and LOAD_DEREF but may not maintain frame locals in the same way. It also makes the bytecode more easily amenable to future optimization by a type-specializing tier 2 interpreter, because __class__ and self will now be surfaced and visible to the optimizer in the usual way, rather than hidden inside the super object.
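
The __class__ value being "surfaced in the usual way" is possible because __class__ is already an ordinary closure cell, not super-specific magic; the compiler creates it for any method that references super() or __class__:

```python
# Sketch: __class__ is a normal cell variable, loaded with LOAD_DEREF.
class C:
    def which(self):
        return __class__    # compiler-injected closure cell

assert C().which() is C
# The cell shows up in the method's ordinary closure metadata:
assert "__class__" in C.which.__code__.co_freevars
```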

I'll follow up with a specialization of LOAD_SUPER_ATTR for the case where we are looking up a method and a method is found (because this is a common case, and a case where the output of LOAD_SUPER_ATTR depends only on the type of self and not on the actual instance). But to simplify review, I'll do this in a separate PR. I think the benefits of this PR stand alone, even without further benefits of specialization. (ETA: the specialization is now also ready at https://github.com/carljm/cpython/compare/superopt...carljm:cpython:superopt_spec?expand=1 and increases the microbenchmark win from 85% to 2.3x.)

The frame introspection code for runtime/dynamic zero-arg super() still remains, but after this PR it would only ever be used in an odd edge case like super(*args) (if args turns out to be empty at runtime), where we can't detect at compile time whether we will have zero-arg or two-arg super().
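
A concrete sketch of that edge case: with a star-args call the compiler cannot know the arity, so it emits no LOAD_SUPER_ATTR, and an empty args tuple takes the runtime frame-introspection path:

```python
class A:
    def meth(self):
        return "A"

class B(A):
    def meth(self, *args):
        # Arity is unknowable at compile time, so the old code path is used;
        # with empty args, super() falls back to frame introspection to
        # locate self and the __class__ cell at runtime.
        return super(*args).meth()

b = B()
assert b.meth() == "A"       # args empty -> dynamic zero-arg super
assert b.meth(B, b) == "A"   # explicit two-arg super
```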

"Odd" uses of super() (like one-argument super, use of a super object as a descriptor etc) are still supported and experience no change; the compiler will not emit the new LOAD_SUPER_ATTR opcode.

I chose to make the new opcode more general by using it for both (statically detectable) zero- and two-arg super. Optimizing zero-arg super is more important because it is more common in modern Python code, and because it also eliminates the frame introspection. But supporting two-arg super costs only one extra bit smuggled via the oparg; this seems worth it.
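
The exact oparg layout is a CPython implementation detail that has shifted across versions, but the bit-smuggling idea can be sketched generically: low bits carry flags, and the remaining bits index the co_names table. (The flag names and bit positions below are hypothetical, for illustration only.)

```python
# Hypothetical sketch of packing flag bits into an oparg.
TWO_ARG_FLAG = 0b01   # invented flag: statically-detected two-arg super
METHOD_FLAG = 0b10    # invented flag: attribute load feeds a method call

def encode(name_index: int, two_arg: bool, is_method: bool) -> int:
    oparg = name_index << 2
    if two_arg:
        oparg |= TWO_ARG_FLAG
    if is_method:
        oparg |= METHOD_FLAG
    return oparg

def decode(oparg: int) -> tuple[int, bool, bool]:
    return oparg >> 2, bool(oparg & TWO_ARG_FLAG), bool(oparg & METHOD_FLAG)

assert decode(encode(3, True, False)) == (3, True, False)
```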

Real-world results and macrobenchmarks

This approach provides a speed-up of about 0.5% globally on the Instagram server real-world workload (measured recently on Python 3.10). I can work on a macrobenchmark for the pyperformance suite that exercises super(), since currently no benchmark exercises it significantly. (ETA: the benchmark is now ready at python/pyperformance#271 -- this diff improves its performance by 10%, and the specialization follow-up by another 10%.)

Prior art

This PR is essentially an updated version of #24936 -- thanks to @vladima for the original inspiration for this approach. Notable differences from that PR:

#30992 was an earlier attempt to optimize super() solely via the specializing interpreter; it was never merged because adaptive super-instructions caused too many problems in the tier 1 specializing interpreter.