Issue 2013: Long object free list optimization (original) (raw)

Issue2013

Created on 2008-02-05 03:07 by christian.heimes, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
long_freelist.patch christian.heimes,2008-02-05 03:07
py3k_longfreelist.patch christian.heimes,2008-02-06 17:09
py3k_longfreelist2.patch pitrou,2008-02-12 22:19
py3k_longfreelist2-gps01.patch gregory.p.smith,2008-03-22 19:54 improvement over longfreelist2
py3k_longfreelist2-gps02.patch gregory.p.smith,2008-03-23 01:24
issue2013-benchmarks-gps02.txt gregory.p.smith,2008-03-23 02:04 benchmarks of the -gps02 patch
py3k_longfreelist2_gps03.patch gregory.p.smith,2008-08-18 04:56 updated, now works in debug mode.
py3k_longfreelist2_gps04.patch gregory.p.smith,2008-08-18 04:56 resimplified for only 1 digit longs
Messages (13)
msg62060 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-05 03:07
The patch adds a free list of long objects with size 1 or -1. It has quite a tremendous effect on code like list(range(1000)). $ ./python -m timeit "for i in range(100): list(range(1000))" Without patch: 10 loops, best of 3: 79 msec per loop With patch: 10 loops, best of 3: 20.8 msec per loop
msg62106 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-06 17:09
The updated patch limits the free list.
msg62338 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-12 22:19
I don't get the same impressive speedup as you do, although it's still significant. $ ./python -m timeit "for i in range(100): list(range(1000))" Without patch: 100 loops, best of 3: 5.05 msec per loop With patch: 100 loops, best of 3: 3.57 msec per loop Also, your patch is leaky. I'm attaching a fixed version (for py3k, didn't check the trunk version).
msg62339 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-12 22:32
Christian, maybe you did your measurements with the DEBUG flag turned on? That would explain the discrepancy. Also, I'm not sure the patch is useful for 2.x since long objects with size -1 or 1 should be represented as ints instead :-)
msg62342 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-12 22:48
Antoine Pitrou wrote: > Christian, maybe you did your measurements with the DEBUG flag turned > on? That would explain the discrepancy. > > Also, I'm not sure the patch is useful for 2.x since long objects with > size -1 or 1 should be represented as ints instead :-) Yes, I've used a Py_Debug build to measure the speed difference. You are right. The patch makes no sense for the 2.x series. Christian
msg64331 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-22 16:45
Here are some updated timings against the current py3k branch: $ ./python -m timeit "sum(range(10000))" Without patch: 1000 loops, best of 3: 675 usec per loop With patch: 1000 loops, best of 3: 462 usec per loop $ ./python -m timeit -s "n=1000000" "sum(range(n, n+10000))" Without patch: 1000 loops, best of 3: 1.36 msec per loop With patch: 1000 loops, best of 3: 1.38 msec per loop $ ./python -m timeit "min(range(255))" Without patch: 10000 loops, best of 3: 18.7 usec per loop With patch: 10000 loops, best of 3: 19.4 usec per loop As you can see the patch makes things quite a bit better for 1-digit long objects, and there is only a small slowdown for longer or tiny ints. Given that 1-digit long objects should be prevalent in most code I think it's probably a winner overall.
msg64337 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-22 19:54
I'm attaching a slightly improved version of the longfreelist2 patch. It moves the important test to be first and removes the unneeded ? : conditional. ........................ Here are some more benchmarks: ........................ --- -without-patch.txt 2008-03-22 12🔞37.000000000 -0700 +++ -longfreelist2-gps01-size-first.txt 2008-03-22 12:34:44.000000000 -0700 @@ -1,43 +1,43 @@ -3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) +3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) [GCC 4.0.1 (Apple Computer, Inc. build 5367)] ./python.exe -m timeit min(range(255)) -100000 loops, best of 3: 17.2 usec per loop +100000 loops, best of 3: 15.8 usec per loop ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000)) -1000 loops, best of 3: 1.21 msec per loop +1000 loops, best of 3: 1.22 msec per loop ./python.exe -m timeit sum(range(-32768,32768)) -100 loops, best of 3: 3.98 msec per loop +100 loops, best of 3: 2.63 msec per loop ./python.exe -m timeit sum(range(-1000000,1000000)) -10 loops, best of 3: 296 msec per loop +10 loops, best of 3: 273 msec per loop ./python.exe -m timeit sum(range(4000)) -1000 loops, best of 3: 239 usec per loop +10000 loops, best of 3: 151 usec per loop ./python.exe -m timeit sum(list(range(4000))) -1000 loops, best of 3: 360 usec per loop +1000 loops, best of 3: 274 usec per loop ./python.exe -m timeit sum(range(10000)) -1000 loops, best of 3: 615 usec per loop +1000 loops, best of 3: 382 usec per loop ./python.exe -m timeit sum(list(range(10000))) -1000 loops, best of 3: 887 usec per loop +1000 loops, best of 3: 802 usec per loop ./python.exe -m timeit sum(range(40000)) -100 loops, best of 3: 2.55 msec per loop +100 loops, best of 3: 1.79 msec per loop ./python.exe -m timeit sum(range(80000)) -100 loops, best of 3: 6.4 msec per loop +100 loops, best of 3: 5.77 msec per loop ........................ And a benchmarks of the longfreelist2 patch before my improvement: ........................ --- -without-patch.txt 2008-03-22 12🔞37.000000000 -0700 +++ -longfreelist2.txt 2008-03-22 12:19:46.000000000 -0700 @@ -1,43 +1,43 @@ -3.0a3+ (py3k:61746M, Mar 22 2008, 12:12:29) +3.0a3+ (py3k:61746M, Mar 22 2008, 12🔞57) [GCC 4.0.1 (Apple Computer, Inc. build 5367)] ./python.exe -m timeit min(range(255)) -100000 loops, best of 3: 17.2 usec per loop +100000 loops, best of 3: 16.1 usec per loop ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000)) -1000 loops, best of 3: 1.21 msec per loop +1000 loops, best of 3: 1.25 msec per loop ./python.exe -m timeit sum(range(-32768,32768)) -100 loops, best of 3: 3.98 msec per loop +100 loops, best of 3: 2.95 msec per loop ./python.exe -m timeit sum(range(-1000000,1000000)) -10 loops, best of 3: 296 msec per loop +10 loops, best of 3: 283 msec per loop ./python.exe -m timeit sum(range(4000)) -1000 loops, best of 3: 239 usec per loop +1000 loops, best of 3: 177 usec per loop ./python.exe -m timeit sum(list(range(4000))) -1000 loops, best of 3: 360 usec per loop +1000 loops, best of 3: 276 usec per loop ./python.exe -m timeit sum(range(10000)) -1000 loops, best of 3: 615 usec per loop +1000 loops, best of 3: 453 usec per loop ./python.exe -m timeit sum(list(range(10000))) -1000 loops, best of 3: 887 usec per loop +1000 loops, best of 3: 792 usec per loop ./python.exe -m timeit sum(range(40000)) -100 loops, best of 3: 2.55 msec per loop +100 loops, best of 3: 2.03 msec per loop ./python.exe -m timeit sum(range(80000)) -100 loops, best of 3: 6.4 msec per loop +100 loops, best of 3: 6.01 msec per loop ........................ And diffs of longfreelist2 vs longfreelist2-gps01: ........................ --- -longfreelist2.txt 2008-03-22 12:19:46.000000000 -0700 +++ -longfreelist2-gps01-size-first.txt 2008-03-22 12:34:44.000000000 -0700 @@ -1,43 +1,43 @@ -3.0a3+ (py3k:61746M, Mar 22 2008, 12🔞57) +3.0a3+ (py3k:61746M, Mar 22 2008, 12:33:47) [GCC 4.0.1 (Apple Computer, Inc. build 5367)] ./python.exe -m timeit min(range(255)) -100000 loops, best of 3: 16.1 usec per loop +100000 loops, best of 3: 15.8 usec per loop ./python.exe -m timeit -s n=1000000 sum(range(n,n+10000)) -1000 loops, best of 3: 1.25 msec per loop +1000 loops, best of 3: 1.22 msec per loop ./python.exe -m timeit sum(range(-32768,32768)) -100 loops, best of 3: 2.95 msec per loop +100 loops, best of 3: 2.63 msec per loop ./python.exe -m timeit sum(range(-1000000,1000000)) -10 loops, best of 3: 283 msec per loop +10 loops, best of 3: 273 msec per loop ./python.exe -m timeit sum(range(4000)) -1000 loops, best of 3: 177 usec per loop +10000 loops, best of 3: 151 usec per loop ./python.exe -m timeit sum(list(range(4000))) -1000 loops, best of 3: 276 usec per loop +1000 loops, best of 3: 274 usec per loop ./python.exe -m timeit sum(range(10000)) -1000 loops, best of 3: 453 usec per loop +1000 loops, best of 3: 382 usec per loop ./python.exe -m timeit sum(list(range(10000))) -1000 loops, best of 3: 792 usec per loop +1000 loops, best of 3: 802 usec per loop ./python.exe -m timeit sum(range(40000)) -100 loops, best of 3: 2.03 msec per loop +100 loops, best of 3: 1.79 msec per loop ./python.exe -m timeit sum(range(80000)) -100 loops, best of 3: 6.01 msec per loop +100 loops, best of 3: 5.77 msec per loop
msg64351 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-23 01:24
Looking at how much memory was actually used by PyLongObjects the way our code allocates them... we can use this freelist for 2 digit PyLongs on 32-bit systems and for 4 digit PyLongs on 64-bit systems. Doing this speeds things up even more as numbers > 32767 are still quite common. why? sizeof(PyVarObject) == 12, sizeof(digit) == 2. Objects/obmalloc.c allocates on 8, 16 and 32 byte boundaries. Theres no such thing as a 14 byte allocation, its 16 byte aligned. The same applies for 64-bit platforms where the sizeof(PyVarObject) == 24, our allocation size is 32 bytes there. patch attached as -gps02.
msg64352 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-23 02:04
And the benchmarks of gps02... see the attached file. At the end I had it run pybench rather than these microbenchmarks that obviously show the speedup, the -gps02 patch ran 1.3% faster best case a 0.5% faster on average (32-bit; my mac's toolchain can't build enough of py3k 64-bit to run pybench). Things left for someone to do? Determine what a good size for the PyLong free list is. 4096 was a magic number chosen by tiran.
msg64373 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-03-23 19:05
The problem with choosing a sensible freelist size is that we don't have any reference workloads. However, I just tested with 10000 and it doesn't seem to slow anything down anyway. It doesn't make our microbenchmarks I thought the patch to compact freelists at each full gc collection had been committed, but it doesn't seem there. Perhaps it will change matters quite a bit. On the one hand, it will allow for bigger freelists with less worries of degrading memory footprint (but still, potential cache pollution). On the other hand, the bigger the freelists, the more expensive it is to deallocate them.
msg69018 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-30 21:20
Gregory, your patch probably deserves checking in, it doesn't seem to me there is any concern preventing you to do that.
msg71306 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-18 01:15
preventing this right now is that when i apply this to py3k today, it fails miserably in a debug build. first, my patch has an invalid assert(size > 0) in it in _PyLong_New as 0 size is used when the value is 0. get rid of that line. then things at least run but you'll end up in an infinite loop when the interpreter exits at best if you've compiled in debug mode. things work great in a non-pydebug build. i believe the reason is this change is not properly looking at the structure allocation sizes. debug builds add extra structure fields. i'm investigating. the free_list code in floatobject.c does not have this problem so at least there's a good example to go from. and yet another reason why a more general free list library for various internals to use would be useful...
msg71314 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-18 04:56
attached is an updated patch that also works in debug mode. to do this i had to exclude all zero length (value = 0) PyLongObjects from the free list. i'm still not sure why, they all have the same underlying allocation size. simplifying it to only freelist one digit longs the only useful differences are in the microbenchmark of -32767..32768. otherwise things are slightly slower. pybench is identical and pystone drops a bit. i'm closing this as not worth it as things are written. if someone wants to pick up the idea and play with freelists go ahead but this doesn't need to be an open tracker issue. there is room for improvement but these patches aren't it.
History
Date User Action Args
2022-04-11 14:56:30 admin set github: 46297
2008-08-18 04:56:57 gregory.p.smith set files: + py3k_longfreelist2_gps04.patch
2008-08-18 04:56:17 gregory.p.smith set status: open -> closedresolution: rejectedmessages: + files: + py3k_longfreelist2_gps03.patch
2008-08-18 01:15:45 gregory.p.smith set assignee: gregory.p.smithmessages: +
2008-06-30 21:20:46 pitrou set messages: +
2008-03-23 19:05:33 pitrou set messages: +
2008-03-23 02:04:46 gregory.p.smith set files: + issue2013-benchmarks-gps02.txtmessages: +
2008-03-23 01:24:20 gregory.p.smith set files: + py3k_longfreelist2-gps02.patchmessages: +
2008-03-22 19:54:14 gregory.p.smith set files: + py3k_longfreelist2-gps01.patchtype: enhancement -> performancemessages: + nosy: + gregory.p.smithversions: - Python 2.6
2008-03-22 16:45:03 pitrou set messages: +
2008-02-12 22:48:19 christian.heimes set messages: +
2008-02-12 22:32:17 pitrou set messages: +
2008-02-12 22:19:05 pitrou set files: + py3k_longfreelist2.patchnosy: + pitroumessages: +
2008-02-06 17:09:57 christian.heimes set files: + py3k_longfreelist.patchtype: enhancementmessages: +
2008-02-05 03:07:46 christian.heimes create