http://www.lag.net/paramiko/)     it will give meaningless results.
    
    I looked for an existing profiler and found statprof, mentioned in     this blog: http://wingolog.org/archives/2005/10/28/profiling
    
    However, this still fails profiling multiple threads. I don't think     one can do any better without touching the interpreter core. So     that's what I tried...
    
    My approach implements the profiler from the following parts:
    
          
  1. Sample Trigger: A timer that triggers taking samples of all         threads at fixed intervals.
  2.       
  3. Sampling function injection: Ensures that a sampling function         is called on each thread at the first opportunity after the         sample trigger fired.
  4.       
  5. Sampling function: A function that assigns the current time         slice to the executing code on the thread it is called on.
  6.     
    Only the second part has to be implemented in the Python interpreter     core, the sample trigger and sampling functions can be provided by     external modules. My implementation of (2) can be found at https://bitbucket.org/bluehorn/sampling_prof
    
    I also prepared a first implementation of (1) and (3) at https://bitbucket.org/bluehorn/profiler_util.     Using that library, I can profile pybench with a 3 % slowdown     (taking 100 profiling samples per second):
    
    $ export PYTHON_PATH=$HOME/python-work/profiler_util
    $ python3 -m sampling_profiler -- Tools/pybench/pybench.py -c     stock.pybench -f prof.pybench
    [...]
    Totals:                          3119ms  3072ms   +1.5%  3198ms      3108ms   +2.9%
    
    This also shows that the string benchmark takes the most time:
    
      Thread MainThread (3842 samples, 38.434401 seconds)
------------------------------------------------------------------------
      cpu_time    (cum)  ticks  (cum) samples (cum)       filename:lineno                          function
         0.548    0.548     55     55     55     55       Tools/pybench/Strings.py:455             test
         0.448    0.448     45     45     45     45       Tools/pybench/Strings.py:568             calibrate
         0.428    0.428     43     43     43     43       Tools/pybench/Constructs.py:484          test
         0.380    0.380     38     38     38     38       Tools/pybench/Calls.py:445               f
         0.324    0.324     34     34     34     34       Tools/pybench/Calls.py:122               f
    

    The output is explained as follows:
    
    cpu_time: cpu time in seconds accounted to that line.
    (cum): This is the cpu time including the cumulative time of the     functions called on that line. Gathering that data means that the     sampling function has to walk the stack for each sample, which will     cause a dramatic slowdown for deeply recursive calls. There should     be a switch to disable this.
    ticks: Number of sampling profiler ticks accounted to that line. For     my example (where sample ticks are equidistant in wall time) this is     a measure of the time spent in that line, including any blocking     calls (especially I/O).
    (cum): ... including called functions
    samples: Number of times the sampling profiler function was called     for that line. This should equal ticks unless execution of that line     took longer than a profiler tick.
    
    The remaining output should be obvious.
    
    I have to admit that this is far from finished, but I hope to get     some feedback and perhaps even help by posting it here.
    
    What do you think?
    
    Greetings, Torsten
    
   ">

(original) (raw)

Hello world,

since I moved from Java development to Python development, I really miss the sampling profiler of jvisualvm (http://visualvm.java.net/).

Perhaps it is only me but especially for long running GUI
applications, I would really like to know why it sometimes slows
down to a crawl.



cProfile is not really useful for this as it makes the software
unresponsive alone. Also if using multiple threads (like doing some
I/O in background, when using http://www.lag.net/paramiko/)
it will give meaningless results.



I looked for an existing profiler and found statprof, mentioned in
this blog: http://wingolog.org/archives/2005/10/28/profiling



However, this still fails profiling multiple threads. I don't think
one can do any better without touching the interpreter core. So
that's what I tried...



My approach implements the profiler from the following parts:


  1. Sample Trigger: A timer that triggers taking samples of all
    threads at fixed intervals.

  2. Sampling function injection: Ensures that a sampling function
    is called on each thread at the first opportunity after the
    sample trigger fired.

  3. Sampling function: A function that assigns the current time
    slice to the executing code on the thread it is called on.


Only the second part has to be implemented in the Python interpreter
core, the sample trigger and sampling functions can be provided by
external modules. My implementation of (2) can be found at https://bitbucket.org/bluehorn/sampling_prof



I also prepared a first implementation of (1) and (3) at https://bitbucket.org/bluehorn/profiler_util.
Using that library, I can profile pybench with a 3 % slowdown
(taking 100 profiling samples per second):



$ export PYTHON_PATH=$HOME/python-work/profiler_util

$ python3 -m sampling_profiler -- Tools/pybench/pybench.py -c
stock.pybench -f prof.pybench

[...]

Totals: 3119ms 3072ms +1.5% 3198ms
3108ms +2.9%



This also shows that the string benchmark takes the most time:



Thread MainThread (3842 samples, 38.434401 seconds)

------------------------------------------------------------------------

cpu_time (cum) ticks (cum) samples (cum)
filename:lineno function

0.548 0.548 55 55 55 55
Tools/pybench/Strings.py:455 test

0.448 0.448 45 45 45 45
Tools/pybench/Strings.py:568 calibrate

0.428 0.428 43 43 43 43
Tools/pybench/Constructs.py:484 test

0.380 0.380 38 38 38 38
Tools/pybench/Calls.py:445 f

0.324 0.324 34 34 34 34
Tools/pybench/Calls.py:122 f



The output is explained as follows:



cpu_time: cpu time in seconds accounted to that line.

(cum): This is the cpu time including the cumulative time of the
functions called on that line. Gathering that data means that the
sampling function has to walk the stack for each sample, which will
cause a dramatic slowdown for deeply recursive calls. There should
be a switch to disable this.

ticks: Number of sampling profiler ticks accounted to that line. For
my example (where sample ticks are equidistant in wall time) this is
a measure of the time spent in that line, including any blocking
calls (especially I/O).

(cum): ... including called functions

samples: Number of times the sampling profiler function was called
for that line. This should equal ticks unless execution of that line
took longer than a profiler tick.



The remaining output should be obvious.



I have to admit that this is far from finished, but I hope to get
some feedback and perhaps even help by posting it here.



What do you think?



Greetings, Torsten