[Python-Dev] Python 3 optimizations continued...
stefan brunthaler s.brunthaler at uci.edu
Fri Sep 2 17:55:03 CEST 2011
> 1) The SFC optimisation is purely based on static code analysis, right? I assume it takes loops into account (and just multiplies scores for inner loops)? Is that what you mean with "nesting level"? Obviously, static analysis can sometimes be misleading, e.g. when there's a rare special case with lots of loops that needs to adapt input data in some way, but in general, I'd expect that this heuristic would tend to hit the important cases, especially for well structured code with short functions.

Yes, currently I only use the heuristic to statically estimate the utility of assigning an optimized slot to a local variable. And, another yes, nested blocks (like for-statements) are what I have in mind when using "nesting level". I was told that the algorithm itself is very similar to linear scan register allocation, modulo the ability to spill values, of course. From my benchmarks and in-depth analysis of several programs, I found this to work very well. In fact, the only problematic situation I found is (unfortunately) in one of the most frequently executed functions of Unladen Swallow's bm_django.py: there is one loop that is almost never executed, but it gives precedence to the local variables used inside it. Because of this, I already have an idea for a better approach: first, use the static heuristic to compute the stack slot score, then count back-branches (I would need this anyway, since _Py_CheckInterval is gone and OSR/hot-swapping is in general a good idea) and record their frequency. Next, just replace the current static weight of 100 by the dynamically recorded weight. Consequently, you should get better allocations. (Please note that I did some quantitative analysis of Python functions to determine that using 4 SFC slots covers a substantial fraction of functions [IIRC >95%] in the trivial scenario where there are at most 4 local variables.)
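To illustrate, here is a minimal sketch in C of that scoring idea (invented names and structure, not my actual implementation): every use of a local contributes weight^nesting-level to its score, and the highest-scoring locals would receive the 4 optimized slots.

    #include <stdio.h>

    /* Toy version of the static utility heuristic: each use of a
     * local variable is weighted by LOOP_WEIGHT per loop nesting
     * level; the real allocator would sort by score and assign at
     * most NUM_SFC_SLOTS optimized slots. */

    #define NUM_SFC_SLOTS 4
    #define LOOP_WEIGHT   100   /* the static weight mentioned above */

    struct local_var {
        const char *name;
        int uses;           /* static use count */
        int nesting_level;  /* loop nesting depth of those uses */
        long score;
    };

    static long
    static_score(const struct local_var *v)
    {
        long w = 1;
        for (int i = 0; i < v->nesting_level; i++)
            w *= LOOP_WEIGHT;
        return (long)v->uses * w;
    }

    int
    main(void)
    {
        struct local_var vars[] = {
            { "i",   3, 1, 0 },  /* used in a loop body          */
            { "tmp", 8, 0, 0 },  /* used often, but at top level */
            { "x",   2, 2, 0 },  /* used in a doubly nested loop */
        };
        int n = (int)(sizeof vars / sizeof vars[0]);

        for (int i = 0; i < n; i++)
            vars[i].score = static_score(&vars[i]);
        for (int i = 0; i < n; i++)
            printf("%-4s score=%ld\n", vars[i].name, vars[i].score);
        return 0;
    }

With the dynamic refinement sketched above, LOOP_WEIGHT would simply be replaced by the recorded back-branch frequency of the enclosing loop.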
> 2) The RC elimination is tricky to get right and thus somewhat dangerous, but sounds worthwhile and should work particularly well on a stack based byte code interpreter like CPython.

Well, it was very tricky to get right when I first implemented it around Christmas 2009. The current implementation is reasonably simple to understand; however, it depends on the function refcount_effect to give me correct information at all times. I got the biggest performance improvement on one benchmark on the PowerPC, and think that RISC architectures in general benefit more from this optimization (it eliminates the load, add, and store instructions) than x86 CISCs do (an INCREF is just an add on the memory location without data dependencies, so it is fairly cheap). In any case, however, you get the replication effect of improving CPU branch prediction by having these additional instruction derivatives. It would be interesting (research-wise, too) to measure whether the reduction in memory operations makes Python programs use less energy, and if so, how big the difference is.
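The effect is easiest to see on a toy example (plain C, not CPython code; names invented): the plain derivative bumps and drops the refcount around the consuming operation, while the derivative emitted when refcount_effect-style analysis proves the pair redundant skips both memory operations.

    #include <stdio.h>

    typedef struct { long refcnt; long value; } Obj;

    static void incref(Obj *o) { o->refcnt++; }
    static void decref(Obj *o) { o->refcnt--; }

    /* Plain derivative: LOAD bumps the refcount of each operand and
     * the consuming BINARY_ADD drops it again. */
    static long
    add_with_rc(Obj *a, Obj *b)
    {
        incref(a); incref(b);
        long r = a->value + b->value;
        decref(a); decref(b);
        return r;
    }

    /* Optimized derivative: the INCREF/DECREF pairs cancel because
     * both operands stay alive across the operation, so the loads,
     * adds, and stores on the refcount field are eliminated. */
    static long
    add_without_rc(Obj *a, Obj *b)
    {
        return a->value + b->value;
    }

    int
    main(void)
    {
        Obj x = { 1, 40 }, y = { 1, 2 };
        printf("%ld %ld\n", add_with_rc(&x, &y), add_without_rc(&x, &y));
        printf("refcounts unchanged: %ld %ld\n", x.refcnt, y.refcnt);
        return 0;
    }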
> 3) Inline caching also sounds worthwhile, although I wonder how large the savings will be here. You'd save a couple of indirect jumps at the C-API level, sure, but apart from that, my guess is that it would highly depend on the type of instruction. Certain (repeated) calls to C-implemented functions would likely benefit quite a bit, for example, which would be a nice optimisation by itself, e.g. for builtins. I would expect that the same applies to iterators; even a couple of percent faster iteration can make a great deal of difference, and a substantial set of iterators are implemented in C, e.g. itertools, range, zip and friends.
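To make the builtin-call case concrete, here is a toy quickening sketch (invented opcodes, plain C, not the actual implementation): the generic call instruction rewrites itself into a derivative that caches the resolved C function pointer, so later executions skip the generic lookup and dispatch.

    #include <stdio.h>

    typedef long (*cfunc)(long);

    static long twice(long x) { return 2 * x; }

    enum { OP_CALL_GENERIC, OP_CALL_C_CACHED, OP_HALT };

    struct instr {
        int   op;
        cfunc cached;  /* inline cache slot, filled on first execution */
    };

    static long
    run(struct instr *code, cfunc callee, long arg)
    {
        long acc = arg;
        for (struct instr *ip = code; ; ip++) {
            switch (ip->op) {
            case OP_CALL_GENERIC:
                /* slow path: resolve the callee, then quicken the
                 * instruction into its cached derivative */
                ip->cached = callee;
                ip->op = OP_CALL_C_CACHED;
                acc = callee(acc);
                break;
            case OP_CALL_C_CACHED:
                /* fast path: direct call through the cached pointer;
                 * a real system would guard that the callee is still
                 * the same object */
                acc = ip->cached(acc);
                break;
            case OP_HALT:
                return acc;
            }
        }
    }

    int
    main(void)
    {
        struct instr code[] = {
            { OP_CALL_GENERIC, NULL },
            { OP_HALT, NULL },
        };
        printf("%ld\n", run(code, twice, 21));  /* first run quickens    */
        printf("%ld\n", run(code, twice, 21));  /* second run hits cache */
        return 0;
    }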
> I'm not so sure about arithmetic operations. In Cython, we (currently?) do not optimistically replace these with more specific code (unless we know the types at compile time), because it complicates the generated C code, and indirect jumps aren't so slow that the benefit would be important. Savings are much higher when data can be unboxed, so much so that the slight improvement from optimistic type guesses is totally dwarfed in Cython. I would expect that the return on investment is better when the types are actually known at runtime, as in your case.

Well, in my thesis I already hint at another improvement of the existing design that can work on unboxed data as well (while still being an interpreter). I am eager to try this, but don't know how much time I can spend on it (because there are several other research projects I am actively involved in). In my experience, this works very well, and you cannot actually report good speedups without inline-caching arithmetic operations, simply because that's where all JITs shine and most benchmarks don't reflect real-world scenarios but are mathematics-inclined microbenchmarks. Also, if compilers (gcc and clang) become able to inline the invoked functions in the future, higher speedups will be possible.
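For the arithmetic case, the derivative is just a type-guarded fast path (again a sketch with invented names, not my actual code): the generic handler would quicken the instruction to INCA_LONG_ADD after observing two long operands, and the guard falls back to the generic path on a miss.

    #include <stdio.h>

    enum type_tag { T_LONG, T_DOUBLE };

    typedef struct {
        enum type_tag tag;
        union { long l; double d; } u;
    } Value;

    static double
    as_double(Value v)
    {
        return v.tag == T_LONG ? (double)v.u.l : v.u.d;
    }

    /* generic handler: inspects both type tags every time */
    static Value
    generic_add(Value a, Value b)
    {
        Value r;
        if (a.tag == T_LONG && b.tag == T_LONG) {
            r.tag = T_LONG;
            r.u.l = a.u.l + b.u.l;
        }
        else {
            r.tag = T_DOUBLE;
            r.u.d = as_double(a) + as_double(b);
        }
        return r;
    }

    /* quickened derivative: unboxed fast path behind a type guard */
    static Value
    inca_long_add(Value a, Value b)
    {
        if (a.tag == T_LONG && b.tag == T_LONG) {  /* guard */
            Value r;
            r.tag = T_LONG;
            r.u.l = a.u.l + b.u.l;
            return r;
        }
        return generic_add(a, b);  /* guard miss: generic path */
    }

    int
    main(void)
    {
        Value a = { .tag = T_LONG, .u = { .l = 40 } };
        Value b = { .tag = T_LONG, .u = { .l = 2 } };
        Value c = { .tag = T_DOUBLE, .u = { .d = 0.5 } };
        printf("%ld\n", inca_long_add(a, b).u.l);  /* fast path    */
        printf("%g\n", inca_long_add(a, c).u.d);   /* guarded miss */
        return 0;
    }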
> 4) Regarding inlined object references, I would expect that it's much more worthwhile to speed up LOAD_GLOBAL and LOAD_NAME than LOAD_CONST. I guess that this would be best helped by watching the module dict and the builtin dict internally and invalidating the interpreter state after changes (e.g. by providing a change counter in those dicts and checking that in the instructions that access them), and otherwise keeping the objects cached. Simply watching the dedicated instructions that change that state isn't enough, as Python allows code to change these dicts directly through their dict interface.

Ok, I thought about something along these lines, too, but in the end decided to go with the current design, as it is easy and language-neutral (for my research I primarily chose Python as a demonstration vehicle, and none of these techniques is specific to Python). LOAD_GLOBAL pays off handsomely, and I think that I could easily make it correct for all cases if I knew the places that need to call "invalidate_cache". Most of the LOAD_CONST instructions can be replaced with the inlined version (INCA_LOAD_CONST); while I did not benchmark this on its own, these instructions are executed so frequently that even small optimizations pay off nicely. Another point is that you can slim down the activation record of PyEval_EvalFrameEx, because you don't need the "consts" field anymore (similarly, you could probably eliminate the "names" and "fastlocals" fields if you find that most of the frequent and fast cases are covered by the optimized instructions).
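A minimal sketch of the change-counter scheme you suggest (plain C, all names invented, not actual CPython code): the dict bumps a version on every mutation, and the quickened LOAD_GLOBAL keeps a (version, value) pair per call site, so the lookup is skipped as long as the version still matches.

    #include <stdio.h>
    #include <string.h>

    #define DICT_CAP 8

    struct toy_dict {
        unsigned long version;       /* bumped on every mutation */
        const char *keys[DICT_CAP];
        long        vals[DICT_CAP];
        int n;
    };

    static void
    dict_set(struct toy_dict *d, const char *k, long v)
    {
        for (int i = 0; i < d->n; i++) {
            if (strcmp(d->keys[i], k) == 0) {
                d->vals[i] = v;
                d->version++;
                return;
            }
        }
        d->keys[d->n] = k;
        d->vals[d->n] = v;
        d->n++;
        d->version++;
    }

    static long
    dict_get(const struct toy_dict *d, const char *k)
    {
        for (int i = 0; i < d->n; i++)
            if (strcmp(d->keys[i], k) == 0)
                return d->vals[i];
        return -1;  /* toy stand-in for a real lookup failure */
    }

    struct global_cache {  /* per-call-site inline cache */
        unsigned long version;
        long value;
    };

    static long
    load_global(struct toy_dict *globals, const char *name,
                struct global_cache *cache)
    {
        if (cache->version == globals->version)
            return cache->value;  /* fast path: dict unchanged */
        cache->value = dict_get(globals, name);  /* slow path */
        cache->version = globals->version;       /* refill cache */
        return cache->value;
    }

    int
    main(void)
    {
        struct toy_dict globals = { 1, { 0 }, { 0 }, 0 };
        struct global_cache cache = { 0, 0 };

        dict_set(&globals, "answer", 42);
        printf("%ld\n", load_global(&globals, "answer", &cache)); /* miss */
        printf("%ld\n", load_global(&globals, "answer", &cache)); /* hit  */
        dict_set(&globals, "answer", 7);  /* mutation invalidates cache */
        printf("%ld\n", load_global(&globals, "answer", &cache)); /* miss */
        return 0;
    }

Because the counter lives in the dict itself, direct mutation through the dict interface invalidates the cache as well, which addresses the case you mention.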
> All in all, your list does sound like an interesting set of changes that are both understandable and worthwhile.

Thanks, I think so, too, which is why I wanted to integrate the optimizations with CPython in the first place.
Thanks for the pointers to the dict stuff, I will take a look (IIRC, Antoine pointed me in the same direction last year, but I think the design was slightly different then).

--stefan