[Python-Dev] Support for Linux perf

Francis Giraldeau francis.giraldeau at gmail.com
Mon Nov 17 23:09:57 CET 2014


Hi,

PEP 418 is about performance counters, but it makes no mention of performance monitoring unit (PMU) counters, such as cache misses and instruction counts.

The Linux perf tool aims at recording these samples at the system level. I ran Linux perf on CPython for profiling. The resulting call stack is inside libpython.so, mostly recursive calls to PyEval_EvalFrameEx(), because the tool works at the ELF level. Here is an example with a dummy program (linux-tools on Ubuntu 14.04):

$ perf record python crunch.py
$ perf report --stdio

Overhead Command Shared Object       Symbol
........ ....... ..................  ................................
32.37%   python  python2.7           [.] PyEval_EvalFrameEx
13.70%   python  libm-2.19.so        [.] __sin_avx
 5.25%   python  python2.7           [.] binary_op1.5010
 4.82%   python  python2.7           [.] PyObject_GetAttr

While this may be insightful for the interpreter developers, it is not so for the average Python developer. The report should display Python code instead. This seems like an obvious feature, yet I haven't found it anywhere.

When a performance counter reaches a given value, a sample is recorded. The most basic sample only records a timestamp, the thread ID and the program counter (%rip). In addition, all executable memory maps of libraries are recorded. For the call stack, frame pointers are traversed, but most of the time they are optimized away on x86, so there is a fallback to unwinding, which requires saving register values and a chunk of the stack. The memory space of the process is reconstructed offline.
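To make the offline step concrete, here is a minimal sketch of how a sampled program counter can be attributed to a recorded memory map. The `Mapping` type, the function names and the addresses are hypothetical and chosen for illustration only; perf's own data structures are different.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    """One executable memory map recorded alongside the samples (hypothetical layout)."""
    start: int   # first address of the mapping
    end: int     # one past the last address
    offset: int  # offset of the mapping inside the file
    path: str    # ELF file backing the mapping


def find_mapping(maps: List[Mapping], ip: int) -> Optional[Mapping]:
    """Return the mapping containing `ip`; `maps` must be sorted by start address."""
    starts = [m.start for m in maps]
    i = bisect_right(starts, ip) - 1
    if i >= 0 and maps[i].start <= ip < maps[i].end:
        return maps[i]
    return None


def file_offset(mapping: Mapping, ip: int) -> int:
    """Translate a sampled address into an offset inside the ELF file."""
    return ip - mapping.start + mapping.offset


# Made-up addresses, for illustration only.
MAPS = [
    Mapping(0x400000, 0x6BD000, 0x0, "python2.7"),
    Mapping(0x7FFFF6A00000, 0x7FFFF6B05000, 0x0, "libm-2.19.so"),
]
```

The file offset is what gets looked up in the ELF symbol table, which is why the report can only name C functions such as PyEval_EvalFrameEx and not the Python function being interpreted.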

CPython seems to allocate code and frames on mmap() pages. If the data lies farther than about 1k from the top of the stack, it is not available offline in the trace. We need some way to reconstitute this memory space of the interpreter to resolve the symbols, probably by dumping the data to disk.

In Java, there is a small HotSpot agent that spits out the symbols of JIT code:

https://github.com/jrudolph/perf-map-agent

The problem is that CPython does not JIT code, and the executed code is the ELF library itself. The Python frames being executed are passed as arguments to functions of the interpreter. I don't think the same approach can be used (maybe this can be applied to PyPy?).
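For reference, here is a minimal sketch of the kind of symbol map such an agent produces. It assumes the convention, as I understand it from perf's JIT interface notes, of a text file named /tmp/perf-<pid>.map with one "START SIZE name" line per symbol, both numbers in hexadecimal. The function names and the example entry are hypothetical.

```python
import os


def format_perf_map_line(start: int, size: int, name: str) -> str:
    """Format one symbol as 'START SIZE name' with hexadecimal numbers (assumed format)."""
    return f"{start:x} {size:x} {name}\n"


def write_perf_map(pid: int, symbols, directory: str = "/tmp") -> str:
    """Write (start, size, name) tuples to <directory>/perf-<pid>.map and return the path."""
    path = os.path.join(directory, f"perf-{pid}.map")
    with open(path, "w") as out:
        for start, size, name in symbols:
            out.write(format_perf_map_line(start, size, name))
    return path


# Hypothetical entry: a 0x40-byte region labelled with a Python-level name.
EXAMPLE = [(0x7F0000001000, 0x40, "crunch.py:bar")]
```

This only helps when each function has its own address range of machine code, which is exactly what an interpreter without a JIT lacks.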

I looked at how Python frames are handled in GDB (file cpython/Tools/gdb/libpython.py). A Python frame is detected in Frame(gdbframe).is_evalframeex() by a C call to PyEval_EvalFrameEx(). However, the traceback accesses PyFrameObject on the heap (at least for f->f_back = 0xa57460), which is possible in GDB when the program is paused and the whole memory space is available, but this memory is not recorded for offline use in perf. Here is an example call stack from GDB:

#0 PyEval_EvalFrameEx (f=Frame 0x7ffff7f1b060, for file crunch.py, line 7, in bar (num=466829), throwflag=0) at ../Python/ceval.c:1039
#1 0x0000000000527877 in fast_function (func=<function at remote 0x7ffff6ec45a0>, pp_stack=0x7fffffffd280, n=1, na=1, nk=0) at ../Python/ceval.c:4106
#2 0x0000000000527582 in call_function (pp_stack=0x7fffffffd280, oparg=1) at ../Python/ceval.c:4041
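The information GDB reconstructs from PyFrameObject is the same chain that is reachable from inside the interpreter through f_back and f_code. A minimal sketch of that walk, using the CPython-specific sys._getframe(); the functions foo and bar are placeholders mirroring the dummy program:

```python
import sys


def python_stack(frame=None):
    """Return (filename, line, function) tuples, innermost frame first."""
    if frame is None:
        frame = sys._getframe(1)  # frame of the caller (CPython-specific helper)
    stack = []
    while frame is not None:
        code = frame.f_code
        stack.append((code.co_filename, frame.f_lineno, code.co_name))
        frame = frame.f_back  # same link GDB follows on the heap
    return stack


def bar(num):
    return python_stack()


def foo():
    return bar(466829)
```

A profiler working offline would need the memory behind each of these pointers, which is the part perf does not capture.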

We could add a kernel module that "knows" how to take samples of CPython, but it means Python structures become a sort of ABI, and kernel devs won't allow a Python interpreter in kernel mode ;-).

What we really want is f_code data and related objects:

(gdb) print (void *)(f->f_code)
$8 = (void *) 0x7ffff7e370f0

Maybe we could save these pages every time some code is loaded from the interpreter? (the memory range is about 1.7MB, but )
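As an alternative to saving raw pages, one could imagine saving only the metadata needed to name a code object. The sketch below is an assumption about what such a dump might contain, not an existing CPython feature. It relies on the CPython implementation detail that id() returns the object's address, and the function names are hypothetical.

```python
import json


def collect_code_objects(code):
    """Yield `code` and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if hasattr(const, "co_code"):  # nested function, class body, lambda...
            yield from collect_code_objects(const)


def build_code_table(code):
    """Describe each code object by address, name, file and first line."""
    return [
        {
            "address": hex(id(c)),  # address of the code object in CPython
            "name": c.co_name,
            "filename": c.co_filename,
            "firstlineno": c.co_firstlineno,
        }
        for c in collect_code_objects(code)
    ]


def dump_code_table(code, path):
    """Write the table as JSON so an offline tool could map addresses to names."""
    with open(path, "w") as out:
        json.dump(build_code_table(code), out, indent=2)
```

For example, build_code_table(compile(source, "crunch.py", "exec")) lists the module and the functions it defines. With such a table, an f_code pointer found in a sample could be resolved without the original heap contents.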

Anyway, I think we must change CPython to support tools such as perf. Any thoughts?

Cheers,

Francis


