Analysis of CPython binary assembly — Unofficial Python Development (Victor's notes) documentation
Usually, you should not care how the C compiler optimizes Python. Analyzing the assembly code helps to check if the C compiler is able to optimize Python as you might expect.
See also Python builds and Assembly Intel x86.
Inline libpython function calls and LTO¶
Link Time Optimization (LTO) helps a lot to inline function calls.
If Python is configured with --enable-shared (the Python executable is linked to libpythonX.Y.so), GCC needs the -fno-semantic-interposition compiler flag to inline libpython function calls. Since Python 3.10, --enable-optimizations enables this flag. Clang disables semantic interposition by default and so doesn't need this flag.
See Red Hat Enterprise Linux 8.2 brings faster Python 3.8 run speeds for a concrete analysis of Python 3.8 performance on RHEL 8 with --enable-shared and -fno-semantic-interposition.
macOS doesn’t use LTO¶
The official Python macOS binaries are not built with LTO, to keep support for the old clang versions of macOS 10.6: see bpo-41181. See also bpo-42235: [macOS] Use --enable-optimizations in build-installer.py.
Concrete example of a performance issue caused by the lack of LTO on macOS: bpo-39542. Converting the PyTuple_Check() macro to a function call introduced a performance slowdown on macOS because clang was unable to inline the PyTuple_Check() function call. The change was reverted to restore performance on macOS.
In Python 3.10, LTO is used on macOS, but only on macOS 10.15 and newer (bpo-42235).
Security compiler flags¶
Position Independent Code (-fPIC)¶
On Fedora, Python is built with -fPIC for security. See Wikipedia: Position-independent code.
Control flow Enforcement Technology (CET) hardening¶
GCC has a -fcf-protection=branch flag which emits ENDBR64 (“End Branch 64 bit”) instructions at function entry points. It is used on Fedora.
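One way to see whether a given build carries these instructions is to read the first bytes of an exported function from the running interpreter. The sketch below assumes an x86-64 CPython where ctypes.pythonapi can resolve PyErr_Occurred; the helper names are made up for this note, and the result depends entirely on how the binary was built.

```python
import ctypes

# Machine encoding of the endbr64 instruction on x86-64.
ENDBR64 = bytes.fromhex("f30f1efa")


def starts_with_endbr64(code: bytes) -> bool:
    """Return True if the machine code begins with an endbr64 instruction."""
    return code[:4] == ENDBR64


def function_prologue(name: str, size: int = 4) -> bytes:
    """Read the first bytes of a function exported by the running interpreter.

    Hypothetical helper: it relies on ctypes.pythonapi exposing the symbol.
    """
    func = getattr(ctypes.pythonapi, name)
    address = ctypes.cast(func, ctypes.c_void_p).value
    return ctypes.string_at(address, size)


if __name__ == "__main__":
    prologue = function_prologue("PyErr_Occurred")
    print(prologue.hex(), starts_with_endbr64(prologue))
```

On a build without -fcf-protection, or on a non-x86 machine, the check simply returns False.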
Compiler and linker flags¶
Get compiler (CFLAGS) and linker (LDFLAGS) flags:
$ python3
Python 3.9.1 (default, Jan 20 2021, 00:00:00)
>>> import sysconfig
>>> cflags = sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST')
>>> ldflags = sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST')
>>> '-fPIC' in cflags
True
>>> '-fno-semantic-interposition' in cflags
True
>>> '-flto' in ldflags
True
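The session above concatenates the variables directly, which assumes every variable is a string. A slightly more defensive sketch is shown below: it tolerates variables that are missing (sysconfig.get_config_var() returns None for unknown names, which happens on some platforms) and compares whole flags rather than substrings. The helper names config_flags and has_flag are made up for this note.

```python
import sysconfig


def config_flags(*names):
    """Return the flags stored in the given sysconfig variables as a list."""
    flags = []
    for name in names:
        value = sysconfig.get_config_var(name)
        # A variable can be missing (None) depending on the platform.
        if isinstance(value, str):
            flags.extend(value.split())
    return flags


def has_flag(flags, prefix):
    """True if a flag equals `prefix` or starts with it (e.g. -flto=auto)."""
    return any(flag.startswith(prefix) for flag in flags)


cflags = config_flags("PY_CFLAGS", "PY_CFLAGS_NODIST")
ldflags = config_flags("PY_LDFLAGS", "PY_LDFLAGS_NODIST")

if __name__ == "__main__":
    for flag, where in [("-fPIC", cflags),
                        ("-fno-semantic-interposition", cflags),
                        ("-flto", ldflags)]:
        print(flag, has_flag(where, flag))
```

The printed values depend on the build; the True results shown in the session above are what a Fedora Python 3.9 reports.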
Python thread state (tstate)¶
Since Python 3.8, there is an ongoing effort to explicitly pass the current Python thread state (“tstate”) to internal functions:
- It avoids having to read an atomic variable: _PyThreadState_GET() reads the _PyRuntime.gilstate.tstate_current atomic variable with _Py_atomic_load_relaxed().
- It should help the C compiler to inline more code.
See Pass the Python thread state explicitly.
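The idea can be modelled outside of C. The toy below is only an analogy, not CPython code: the Runtime class, its reads counter and the helper functions are all hypothetical, and the counter stands in for one atomic load of tstate_current.

```python
class Runtime:
    """Hypothetical stand-in for _PyRuntime that counts tstate reads."""

    def __init__(self):
        self.tstate_current = {"curexc_type": None}
        self.reads = 0

    def get_tstate(self):
        # Models one atomic load of the current thread state.
        self.reads += 1
        return self.tstate_current


def err_occurred_implicit(runtime):
    # Fetches the thread state on every call.
    return runtime.get_tstate()["curexc_type"]


def err_occurred_explicit(tstate):
    # The caller already holds the thread state.
    return tstate["curexc_type"]


def check_three_times_implicit(runtime):
    return [err_occurred_implicit(runtime) for _ in range(3)]


def check_three_times_explicit(runtime):
    tstate = runtime.get_tstate()  # fetched once, then passed down
    return [err_occurred_explicit(tstate) for _ in range(3)]
```

With the implicit style, every check pays for a read of the global state; with the explicit style, the caller reads it once and the helpers only touch their argument.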
PyErr_Occurred()¶
Simplified C code of PyErr_Occurred():
PyObject* PyErr_Occurred(void)
{
    _PyRuntimeState *runtime = &_PyRuntime;
    _Py_atomic_address *ptstate = &runtime->gilstate.tstate_current;
    PyThreadState *tstate = (PyThreadState*)_Py_atomic_load_relaxed(ptstate);
    return tstate->curexc_type;
}
PyErr_Occurred() of Fedora Python 3.9 (built with -fPIC):
endbr64

# rax = &_PyRuntime = *(void **)0x7ffff7f45d38
mov    rax, QWORD PTR [rip+0x1fef9d]        # 0x7ffff7f45d38

# offsetof(_PyRuntimeState, gilstate.tstate_current) = 0x238
# rdx = tstate = *(_PyRuntime.gilstate.tstate_current) = *(void **)($rax + 0x238)
mov    rdx, QWORD PTR [rax+0x238]

# offsetof(PyThreadState, curexc_type) = 0x58
# rax = tstate->curexc_type = *(void **)($rdx + 0x58)
mov    rax, QWORD PTR [rdx+0x58]

ret
Getting tstate requires two pointer dereferences (two MOV):
- runtime = *($rip + 0x1fef9d) (&_PyRuntime)
- tstate = runtime->gilstate.tstate_current
PyErr_Occurred() requires 3 pointer dereferences.
Note: the $rip indirection is needed by the -fPIC flag, and the endbr64 instruction is related to the CET hardening flag.
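The chain of loads can be replayed with ctypes. This is a sketch under stated assumptions: the two structures are hypothetical, with padding arrays standing in for the real fields so that the offsets match the 0x238 and 0x58 values seen in the assembly, and got_entry models the slot read through $rip.

```python
import ctypes


class ThreadState(ctypes.Structure):
    # Padding stands in for the real fields that precede curexc_type.
    _fields_ = [("pad", ctypes.c_char * 0x58),
                ("curexc_type", ctypes.c_void_p)]


class Runtime(ctypes.Structure):
    # Padding stands in for everything before gilstate.tstate_current.
    _fields_ = [("pad", ctypes.c_char * 0x238),
                ("tstate_current", ctypes.POINTER(ThreadState))]


tstate = ThreadState()
tstate.curexc_type = 0x1234          # fake "exception type" address
runtime = Runtime()
runtime.tstate_current = ctypes.pointer(tstate)
got_entry = ctypes.pointer(runtime)  # models the [rip+0x1fef9d] slot


def err_occurred_global():
    rt = got_entry.contents          # load 1: address of the runtime
    ts = rt.tstate_current.contents  # load 2: current thread state
    return ts.curexc_type            # load 3: curexc_type field


if __name__ == "__main__":
    print(hex(Runtime.tstate_current.offset))
    print(hex(ThreadState.curexc_type.offset))
    print(hex(err_occurred_global()))
```

Each line of err_occurred_global() corresponds to one MOV in the listing above.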
_PyErr_Occurred()¶
C code:
static inline PyObject*
_PyErr_Occurred(PyThreadState *tstate)
{
    assert(tstate != NULL);
    return tstate->curexc_type;
}
_PyErr_Occurred() of Fedora Python 3.9 (built with -fPIC), inlined in _Py_CheckFunctionResult+12():
# $rdi = tstate argument
# offsetof(PyThreadState, curexc_type) = 0x58
mov    rax, QWORD PTR [rdi+0x58]
The function call becomes a single pointer dereference (one MOV):
result = (*tstate).curexc_type
On Fedora, calling PyErr_Occurred() requires 6 instructions (CALL, ENDBR64, 3 MOV, RET), whereas the inlined _PyErr_Occurred() is a single MOV instruction.