[Python-Dev] Joys of Optimization
Raymond Hettinger raymond.hettinger at verizon.net
Fri Mar 19 00:21:33 EST 2004
- Previous message: [Python-Dev] (class) module names clarification
- Next message: [Python-Dev] Joys of Optimization
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
[Hye-Shik]
Yay! I'm a big fan of your series of optimizations. Great thanks! :)
By the way, I think there are many junior Python hacker wannabes like me who would like to take on objects/compiler/VM optimizations. Can you share your roadmap or work queue/plans, or a list of known bottlenecks for them?
Here are the optimizations on my todo list:
Objects can define equality tests through rich comparisons or through cmp. A few timings suggest that the latter runs 30% to 50% faster. This may mean that PyObject_RichCompareBool(v, w, Py_EQ) has potential for a significant speedup. Python uses equality tests pervasively (internally in dictionaries and just about everywhere else), so any effort to speed up rich comparisons (without changing semantics) would result in an across-the-board improvement. As a proof of concept, I replaced all instances of rich equality testing with a new function, _PyObject_RichCompareBoolEQ(v, w), which tries PyObject_Compare() and, if there is an error, falls back to the regular rich comparison. This gave the expected speedup and passed most, but not all, of the test suite. The failures occurred because the simplistic redefinition slightly changed semantics when both cmp and rich comparisons were defined. The correct approach is to investigate rich comparisons and find the bottleneck. At the root of it all, most objects implement only one equality test, and rich comparisons need to be able to find it and run it as fast as cmp does.
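As a small illustrative aside (not part of the proposed patch): one detail of PyObject_RichCompareBool(v, w, Py_EQ) that is visible from pure Python is that it short-circuits on identity before dispatching to the object's equality test, which NaN makes observable:

```python
# PyObject_RichCompareBool(v, w, Py_EQ) underlies membership tests,
# list.index(), list.count(), and more.  One observable consequence:
# it checks "v is w" before calling the object's __eq__ at all.
nan = float("nan")

print(nan == nan)        # False: NaN never compares equal to anything
print(nan in [nan])      # True: the C-level test short-circuits on identity
print([nan].count(nan))  # 1: the same shortcut applies inside list.count
```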
Michael started an investigation into the costs of setting up new stack frames. Standing on his shoulders, Armin proposed the concept of a zombie frame, which goes beyond freelisting and takes advantage of the contents of older, discarded frames. His implementation eliminated freelists in favor of a single captive frame per code object. The timings were impressive, but there were some negative consequences: recursive calls and multi-threaded apps need more than one frame per code object, and they suffered miserably in the absence of a freelist. Also, a ton of code runs only once; it not only fails to benefit from the zombie frame, it costs memory because the zombie never gets freed. I and one other respondent independently suggested keeping the freelist (which would need to become doubly linked) and using a weak reference from the code object to see whether the matching pre-made frame is still on the freelist. This would capture all the benefits and incur none of the detriments. AFAICT, the only potential downfall is that so little setup time is saved by zombie frames that any significant tracking logic would completely offset the gain. IOW, if you do this, the mechanism needs to be dirt simple and fast.
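A minimal pure-Python sketch of that weak-reference scheme (all names here are hypothetical stand-ins; the real work would be done in C, and the freelist would be doubly linked for O(1) removal):

```python
import weakref

# Sketch of the proposal: keep the frame freelist, but let each code
# object hold a weak reference to its matching pre-made frame.  If the
# weakref is still live and the frame is on the freelist, reuse it.

class Frame:
    def __init__(self, code):
        self.code = code          # code object this frame was built for
        self.locals = [None] * 8  # pretend setup work worth reusing

freelist = []                      # stand-in for the doubly linked freelist
zombies = {}                       # code -> weakref to its frame on the freelist

def free_frame(frame):
    freelist.append(frame)
    zombies[frame.code] = weakref.ref(frame)

def alloc_frame(code):
    ref = zombies.get(code)
    frame = ref() if ref is not None else None
    if frame is not None and frame in freelist:
        freelist.remove(frame)     # hit: reuse the pre-made frame
        return frame
    return Frame(code)             # miss: build a fresh frame

f1 = alloc_frame("fib")
free_frame(f1)
f2 = alloc_frame("fib")
assert f2 is f1                    # the zombie frame was reused
```

The tracking here is deliberately dirt simple, per the caveat above: one dictionary lookup and one weakref dereference on the hot path.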
I have re-coded several modules in C with the goal of making pure Python more viable for certain high-performance apps (for instance, implementing heapq and bisect in C made them more viable as components of superfast code written in pure Python). If the threading module were re-coded in C, it would benefit a broad class of multi-threaded applications, making it more likely that Python becomes the language of choice for those apps.
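The payoff of having heapq and bisect in C is that performance-sensitive pure-Python code can lean on them freely; for example:

```python
import bisect
import heapq

# Keeping a running "top 3" and finding a sorted insertion point
# without writing any C at all.
scores = [82, 91, 15, 67, 99, 45, 73]

top3 = heapq.nlargest(3, scores)            # the three largest scores
pos = bisect.bisect_left(sorted(scores), 70)  # index where 70 would insert

print(top3, pos)
```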
The peephole optimizer in compile.c is currently limited in scope because it lacks a basic-block checker and the ability to re-insert opcodes of a different length than the original (which entails moving all the jump targets and recomputing the line numbering). I've already written the basic-block checker, have a jump retargeter, and have several peephole optimizations thoroughly tested; however, they can't go in until the line renumberer is done. For example, the code a,b=b,a currently runs slower than t=a;a=b;b=t, but it could run several times faster if the BUILD_TUPLE 2 / UNPACK_SEQUENCE 2 pair were replaced by ROT_TWO. This is an important optimization because it changes the way we write pure Python (people avoid multiple assignment because it is slower).
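The transformation described above can be inspected with dis. (In CPython releases after this message, the build/unpack pair for a two-element swap is indeed optimized away; the exact replacement opcode varies by version, ROT_TWO in older releases and SWAP in newer ones.)

```python
import dis

def swap(a, b):
    a, b = b, a   # compiles to a stack rotation, not build/unpack
    return a

# List the opcode names; no BUILD_TUPLE / UNPACK_SEQUENCE pair appears.
instrs = [i.opname for i in dis.get_instructions(swap)]
print(instrs)
```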
Python's startup time got much worse in Py2.3. Several people have looked into it, but there were no clear-cut results. This is an important optimization because there are plenty of short-running apps where Python is called to knock out a single basic task. Especially in web apps, frequent starts and exits can be the norm, and Python's startup time can dominate total resource utilization and response time.
The concept behind ceval.c's goto_fast_nextopcode is to bypass the periodic checking for signals and occasional thread switching. Taking this concept to the limit, those operations could be pulled completely out of the eval loop and called only by a handful of opcodes such as CALL_FUNCTION (which can be of arbitrarily long duration) and JUMP_ABSOLUTE (which appears in any non-recursive repeating code block). The goal is not to make error checking or thread switching less frequent; rather, the idea is to increase the number of things that can be written in pure Python without incurring tons of eval-loop overhead (I think it is enough to check at the first opcode, at every function call, and at every loop iteration). This would increase the viability of writing algorithms in pure Python. Of course, this needs to be handled carefully; many simple operations have the potential to trigger long-running operations. However, some of those situations exist today (e.g., map() can run a huge number of iterations with zero trips around the eval loop for task switching or signal checking).
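As a rough modern analogue of the same tradeoff, later CPython releases expose how often the eval loop even considers switching threads; raising the interval trims bookkeeping overhead at the cost of switch latency:

```python
import sys

# sys.setswitchinterval() (Python 3.2+) controls how often the
# interpreter considers a thread switch, the same frequency-vs-overhead
# tradeoff discussed above.
old = sys.getswitchinterval()
sys.setswitchinterval(0.05)        # consider switching every ~50 ms
print(sys.getswitchinterval())
sys.setswitchinterval(old)         # restore the previous setting
```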
There are three PEPs for reducing the overhead associated with attribute lookup. This, IMO, is the largest remaining potential optimization. More importantly, it would improve the way we code, allowing freer use of method calls, smaller functions, and less need to cache frequently called functions in a local variable (I truly hate seeing code like: _int=int # speed hack).
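For the curious, this is the speed hack in question: a global lookup on every loop iteration versus a local binding smuggled in as a default argument. Faster lookup would make the hack unnecessary.

```python
import timeit

def with_global(data):
    total = 0
    for x in data:
        total += int(x)            # LOAD_GLOBAL each time around the loop
    return total

def with_local(data, _int=int):    # the "_int = int" hack as a default arg
    total = 0
    for x in data:
        total += _int(x)           # LOAD_FAST each time around the loop
    return total

data = ["1", "2", "3"] * 1000
assert with_global(data) == with_local(data) == 6000

t_g = timeit.timeit(lambda: with_global(data), number=50)
t_l = timeit.timeit(lambda: with_local(data), number=50)
print(f"global: {t_g:.4f}s  local: {t_l:.4f}s")
```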
Aside from optimizations, there are some modules to be developed:
collections.heap(), a high performance Fibonacci heap implemented as a type and taking sort style arguments (cmp=, key=, reverse=).
finish the statistics module (except for the docs, this is basically done and awaiting collections.heap() to underlie nsmallest() and nlargest()).
write a calculator module - done right, this will take a long time, a lot of research, and require much public discussion (everyone will have an opinion on this).
explore whether the StringIO module would perform better if based on array objects rather than strings.
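A rough pure-Python sketch of the sort-style interface for the first item above, layered over today's binary-heap heapq rather than a Fibonacci heap (the Heap name and signature are illustrative only; cmp= is omitted, and reverse= assumes numeric keys so they can be negated):

```python
import heapq

class Heap:
    def __init__(self, iterable=(), key=None, reverse=False):
        self._key = key or (lambda x: x)
        self._sign = -1 if reverse else 1   # negate numeric keys for reverse=
        self._data = []
        self._count = 0                     # tie-breaker: never compare items
        for item in iterable:
            self.push(item)

    def push(self, item):
        k = self._sign * self._key(item)
        heapq.heappush(self._data, (k, self._count, item))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._data)[-1]

h = Heap(["banana", "fig", "cherry"], key=len, reverse=True)
print(h.pop())   # the longest string comes out first
```

The (priority, counter, item) triple keeps pops stable and avoids ever comparing the items themselves, mirroring what sorted()'s key= machinery does.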
Also, there are a couple of bugfixes in the queue:
fix operator.isMappingType() to exclude anything that passes operator.isSequenceType().
the logging module is in dire need of a quick-start tutorial (patterned after the one for unittest, which gives you 90% of what you need to know in one easy lesson) and a new method or two that encapsulate standard usage patterns, boiling them down to start_logging_stuff_here() and okay_log_this(). Right now, the module suffers from a near-vertical learning curve, and even the simplest logging applications require more instructions than you would expect. Otherwise, the module is extremely well engineered. If the docs were fixed up and a couple of methods added, it could be a runaway success instead of a debacle.
the Standard Library Tour in the tutorial was wildly successful and demands a sequel. A Part II should be written covering more of what you need to know to be productive from the start.
eval() only takes real dictionaries as arguments. A huge number of possibilities would open up if it could take dictionary look-alikes. Previous proposals were rejected because they cost a percentage point or two of execution time (why should everyone pay the price for this?). So, efforts need to be directed at how to do this without a performance penalty; several approaches were mentioned, but none have been tried and proven.
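As one data point, later CPython releases relaxed this halfway: the globals argument must still be a real dict, but the locals argument may be any mapping, so look-alikes already work there. A toy demonstration (the CaseInsensitive class is invented for illustration):

```python
class CaseInsensitive:
    """Toy dictionary look-alike: resolves names case-insensitively."""

    def __init__(self, data):
        self._data = {k.lower(): v for k, v in data.items()}

    def __getitem__(self, key):
        # eval's name lookups in locals go through the mapping protocol
        return self._data[key.lower()]

ns = CaseInsensitive({"Width": 4, "HEIGHT": 3})
area = eval("width * height", {}, ns)   # names resolved via __getitem__
print(area)
```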
Also, the generator expression patch is about ready for prime time. I have made it a priority to get it reviewed and to fix up any outstanding issues.
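For readers who haven't followed the patch, generator expressions (PEP 289) let a lazily evaluated expression be passed straight to a consumer, with no intermediate list:

```python
# The sum consumes squares one at a time; no list is ever built.
total = sum(x * x for x in range(10))
print(total)

# Works with any consumer of iterables.
longest = max(len(w) for w in ["map", "filter", "generator"])
print(longest)
```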
Okay, that's Raymond's list. Feel free to take one and run with it.
Raymond Hettinger