[Python-Dev] Speeding up CPython 5-10%
Yury Selivanov yselivanov.ml at gmail.com
Tue Feb 2 09:15:58 EST 2016
On 2016-02-02 4:28 AM, Victor Stinner wrote: [..]
> I took a first look at your patch and, sorry,
Thanks for the initial code review!
> I'm skeptical about the design. I have to play with it a little bit more to check whether there is a better design.
So far I see two things you are worried about:
- The cache is attached to the code object vs function/frame.
I think the code object is the perfect place for such a cache.
The cache must be there (and survive!) "across" the frames. If you attach it to the function object, you'll have to re-attach it to a frame object on each PyEval call. I can't see how that would be better.
- Two levels of indirection in my cache -- offsets table + cache table.
In my other email thread "Opcode cache in ceval loop" I explained that optimizing every code object in the standard library and unittests adds 5% memory overhead. Optimizing only those that are called frequently is less than 1%.
Besides, many functions that you import are never called, or only called once or twice. And code objects for modules and class bodies are called once.
If we don't use an offset table and just allocate a cache entry for every opcode, then the memory usage will rise significantly. Right now the overhead of the offset table is 8 bits per opcode, and the overhead of the cache table is 32 bytes per optimized opcode. The overhead of using one extra indirection is minimal.
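Roughly, the layout can be pictured like this (a pure-Python sketch with illustrative names; the actual structures live in C and differ in detail):

    # Two-level cache: a small offsets table (one slot per opcode) pointing
    # into a dense entries table that only holds entries for optimized opcodes.
    class OpcodeCache:
        def __init__(self, n_opcodes, optimized_offsets):
            # Offsets table: one small slot per opcode; 0 means "no cache entry".
            self.offsets = [0] * n_opcodes
            # Cache table: one (larger) entry per *optimized* opcode only.
            # Index 0 is reserved so that 0 can mean "not cached".
            self.entries = [None]
            for op_index in optimized_offsets:
                self.offsets[op_index] = len(self.entries)
                self.entries.append({"obj": None, "globals_version": None})

        def entry_for(self, op_index):
            idx = self.offsets[op_index]
            return self.entries[idx] if idx else None

So non-optimized opcodes only pay for the small offset slot, while the expensive entries exist only where they are actually used.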
[..]
> 2016-01-27 19:25 GMT+01:00 Yury Selivanov <yselivanov.ml at gmail.com>:
>> tl;dr The summary is that I have a patch that improves CPython performance up to 5-10% on macro benchmarks. Benchmark results on Macbook Pro/Mac OS X, desktop CPU/Linux, and server CPU/Linux are available at [1]. There are no slowdowns that I could reproduce consistently.
>
> That's really impressive, great job Yury :-) Getting a non-negligible speedup on large macrobenchmarks has become really hard in CPython. CPython is already well optimized in all corners. It looks like overall Python performance still depends heavily on the performance of dictionary and attribute lookups. Even though that was well known, I didn't expect up to 10% speedup on macro benchmarks.
Thanks!
>> LOAD_METHOD & CALL_METHOD
>> -------------------------
>> We had a lot of conversations with Victor about his PEP 509, and he sent me a link to his amazing compilation of notes about CPython performance [2]. One optimization that he pointed out to me was the LOAD_METHOD/CALL_METHOD opcodes, an idea that first originated in PyPy. There is a patch that implements this optimization; it's tracked here: [3]. There are some low-level details that I explained in the issue, but I'll go over the high-level design in this email as well.
>
> Your cache is stored directly in code objects. Currently, code objects are immutable.
Code objects are immutable on the Python level. My cache doesn't make any previously immutable field mutable.
Adding a few mutable cache structures visible only at the C level is acceptable I think.
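As a quick illustration of that distinction (a tiny example, not from the patch), code object attributes cannot be assigned to from Python code:

    def f():
        return 1

    co = f.__code__
    try:
        co.co_consts = (None, 2)   # code object attributes are read-only
    except AttributeError as exc:
        print(exc)                  # e.g. "readonly attribute"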
> Antoine Pitrou's patch adding a LOAD_GLOBAL cache adds a cache to functions with an "alias" in each frame object: http://bugs.python.org/issue10401
>
> Andrea Griffini's patch, also adding a cache for LOAD_GLOBAL, adds a cache to code objects too: https://bugs.python.org/issue1616125
Those patches are nice, but optimizing just LOAD_GLOBAL won't give you a big speed-up. For instance, 2to3 became 7-8% faster once I started to optimize LOAD_ATTR.
The idea of my patch is that it implements caching in such a way that we can add it to several different opcodes.
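To make that concrete, these are the kinds of opcodes involved for a trivial function (a hedged example; the exact offsets and output format vary between CPython versions):

    import dis

    def greet(obj):
        print(obj.name)     # 'print' -> LOAD_GLOBAL, 'obj.name' -> LOAD_ATTR

    dis.dis(greet)
    # On CPython around 3.5/3.6 this prints something like:
    #   0 LOAD_GLOBAL    0 (print)
    #   2 LOAD_FAST      0 (obj)
    #   4 LOAD_ATTR      1 (name)
    #   6 CALL_FUNCTION  1
    #   ...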
>> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3. Let's look at the first one, which loads the 'print' function from builtins. The opcode knows the following bits of information:
>
> I tested your latest patch. It looks like LOAD_GLOBAL never invalidates the cache on a cache miss ("deoptimize" the instruction).
Yes, that was a deliberate decision (but we can add the deoptimization easily). So far I haven't seen a use case or benchmark where we really need to deoptimize.
> I suggest always invalidating the cache at each cache miss. Not only is it common to modify global variables, but there is also the issue of a different namespace being used with the same code object. Examples:
>
> * Late global initialization. See for example the a85chars cache of base64.a85encode.
> * A code object created in a temporary namespace and then always run in a different global namespace. See for example collections.namedtuple(). I'm not sure that it's the best example because it looks like the Python code only loads builtins, not globals. But it looks like your code keeps a copy of the version of the global namespace dict.
>
> I tested with a threshold of 1: always optimize all code objects. Maybe with your default threshold of 1024 runs, the issue with different namespaces doesn't occur in practice.
Yep. I added a constant in ceval.c that enables collection of opcode cache stats.
99.9% of all global dicts in benchmarks are stable.
The test suite was a bit different, only ~99% :) The one percent of cache misses was probably because of unittest.mock.
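For reference, the "same code object, different globals" scenario can be reproduced directly; this small illustrative example (not from the stdlib) binds one code object to two different globals dicts:

    import types

    def template():
        return CONFIG["mode"]     # 'CONFIG' is looked up via LOAD_GLOBAL

    # Bind the *same* code object to two different globals dicts.
    f1 = types.FunctionType(template.__code__, {"CONFIG": {"mode": "fast"}})
    f2 = types.FunctionType(template.__code__, {"CONFIG": {"mode": "safe"}})
    print(f1(), f2())   # fast safe

A cache keyed only on the code object has to account for which globals dict (or which version of it) filled the cache, otherwise f2() could return f1()'s value.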
>> A straightforward way to implement such a cache is simple, but it consumes a lot of memory that would just be wasted, since we only need such a cache for the LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about the cache design.
>
> I'm not sure that it's worth it to develop complex dynamic logic to only enable optimizations after a threshold (a design very close to a JIT compiler).
I think it's not even remotely close to what JITs do.
In my design I have a simple counter -- when it reaches 1000, we create the caches in the code objects. Some opcodes start to use it. That's basically it.
JIT compilers trace the code, collect information about types, think about memory, optimize, deoptimize, think about memory again, etc, etc :)
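In rough pseudo-Python, the whole mechanism is just this (the real patch does the equivalent in C inside ceval.c; the names here are only illustrative):

    OPT_THRESHOLD = 1000   # the "hot" threshold discussed above

    class CodeLikeObject:
        def __init__(self):
            self.run_count = 0
            self.cache = None              # no memory cost until optimized

        def on_eval(self):
            if self.cache is None:
                self.run_count += 1
                if self.run_count >= OPT_THRESHOLD:
                    self.cache = {}        # allocate caches only now;
                                           # opcodes start using them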
> What is the overhead (% of RSS memory) on a concrete application when all code objects are optimized at startup?
I've mentioned that in my other thread.
When the whole test suite is run with every code object being optimized (threshold = 1), about 73000 code objects were optimized, requiring >20 MB of memory (the test suite process consumed ~400 MB of memory). So 5% looks to be the worst case.
When I ran the test suite with threshold set to 1024, only 2000 objects were optimized, requiring less than 1% of the total process memory.
> Maybe we need a global boolean flag to disable the optimization? Or even a compilation option?
I'd hate to add such a thing. Why would you want to disable the cache? To save 1% of memory? TBH I think this would only add maintenance overhead for us.
> I mean that all these new counters have a cost, and the code may be even faster without these counters if everything is always optimized, no?
Yes, but only marginally. You'll save one "inc" in the eval loop, and a couple of "if"s. Maybe on a micro benchmark you can see a difference.
But optimizing everything will require much more memory. And we shouldn't optimize code objects that are run only once -- that is, code objects for modules and classes.
A threshold of 1024 is big enough to say that the code object is frequently used and will probably continue to be frequently used in the future.
> I'm not sure that the storage for the cache is really efficient. It's a compact data structure, but it looks "expensive" to access (there is one level of indirection). I understand that it's compact to reduce the memory footprint overhead.
>
> I'm not sure that the threshold of 1000x runs is ok for short scripts. It would be nice to also optimize scripts which only call a function 900x times :-) Classic memory vs. CPU compromise issue :-)
I'd be OK to change the threshold to 500 or something. But IMHO it won't change much. Short/small scripts won't hit it anyways. And even if they do, they typically don't run long enough to get a measurable speedup.
> I'm just thinking aloud :-)
Thanks! I'm happy that you are looking at this thing with a critical eye.
BTW, here's a debug output of unit tests with every code object optimized:
-- Opcode cache number of objects  = 72395
-- Opcode cache total extra mem    = 20925595

-- Opcode cache LOAD_METHOD hits   = 64569036 (63%)
-- Opcode cache LOAD_METHOD misses = 23899 (0%)
-- Opcode cache LOAD_METHOD opts   = 104872
-- Opcode cache LOAD_METHOD deopts = 19191
-- Opcode cache LOAD_METHOD dct-chk= 12805608
-- Opcode cache LOAD_METHOD total  = 101735114

-- Opcode cache LOAD_GLOBAL hits   = 123808815 (99%)
-- Opcode cache LOAD_GLOBAL misses = 310397 (0%)
-- Opcode cache LOAD_GLOBAL opts   = 125205

-- Opcode cache LOAD_ATTR hits     = 59089435 (53%)
-- Opcode cache LOAD_ATTR misses   = 33372 (0%)
-- Opcode cache LOAD_ATTR opts     = 73643
-- Opcode cache LOAD_ATTR deopts   = 20276
-- Opcode cache LOAD_ATTR total    = 111049468
Yury