[Python-Dev] Speeding up CPython 5-10% (original) (raw)

Yury Selivanov yselivanov.ml at gmail.com
Wed Jan 27 17:58:11 EST 2016

Previous message (by thread): [Python-Dev] Speeding up CPython 5-10%
Next message (by thread): [Python-Dev] Speeding up CPython 5-10%
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2016-01-27 3:46 PM, Glenn Linderman wrote:

On 1/27/2016 12:37 PM, Yury Selivanov wrote:

MicroPython also has dictionary lookup caching, but it's a bit different to your proposal. We do something much simpler: each opcode that has a cache ability (eg LOADGLOBAL, STOREGLOBAL, LOADATTR, etc) includes a single byte in the opcode which is an offset-guess into the dictionary to find the desired element. Eg for LOADGLOBAL we have (pseudo code): CASE(LOADGLOBAL): key = DECODEKEY; offsetguess = DECODEBYTE; if (globaldict[offsetguess].key == key) { // found the element straight away } else { // not found, do a full lookup and save the offset offsetguess = dictlookup(globaldict, key); UPDATEBYTECODE(offsetguess); } PUSH(globaldict[offsetguess].elem); We have found that such caching gives a massive performance increase, on the order of 20%. The issue (for us) is that it increases bytecode size by a considerable amount, requires writeable bytecode, and can be non-deterministic in terms of lookup time. Those things are important in the embedded world, but not so much on the desktop. That's a neat idea! You're right, it does require bytecode to become writeable. Would it? Remember "fixup lists"? Maybe they still exist for loading function addresses from one DLL into the code of another at load time? So the equivalent for bytecode requires a static table of offsetguess, and the offsets into that table are allocated by the byte-code loader at byte-code load time, and the byte-code is "fixed up" at load time to use the correct offsets into the offsetguess table. It takes one more indirection to find the guess, but if the result is a 20% improvement, maybe you'd still get 19%...

Right, in my current patch I have an offset table per code object. Essentially, this offset table adds 8bits per opcode. It also means that only first 255 LOAD_GLOBAL/LOAD_METHOD opcodes per-code-object are optimized (because the offset table only can store 8bit offsets), which is usually enough (I think you need to have more than a 500 lines of code function to reach that limit).

Yury

Previous message (by thread): [Python-Dev] Speeding up CPython 5-10%
Next message (by thread): [Python-Dev] Speeding up CPython 5-10%
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list