Consider increasing the growth factor of list objects
The growth factor for list objects is about 1.125x, and it is easy
to run into degenerate performance cases by way of repeated append:
from random import shuffle

n_els = 1_000
L1 = list(range(n_els))
shuffle(L1)

# Pre-allocated list
def fast():
    for _ in range(10_000):
        L2 = [0] * n_els
        for i in range(n_els):
            L2[i] = L1[i]

# Dynamically sized list
def slow():
    for _ in range(10_000):
        L2 = []
        for x in L1:
            L2.append(x)

fast()
slow()
The growth factor 1.125x was set over 20 years ago and is today
unusually low compared to other runtimes, which often use 1.5x or
2x. Consider increasing it. RAM is plentiful. Particularly when
building small lists, the repeated allocations cause churn.
tim.one (Tim Peters) March 20, 2026, 6:57pm 2
It’s good to revisit things, but how to proceed is rarely clear.
On the flip side, we’ve gone over 20 years without major complaints.
Before 1.125 was picked, fatter overallocation was used: it started by doubling, later reduced to 1.5, because too many “large lists” in real life prematurely ran out of RAM. 1.125 wasn’t picked out of a hat, but driven by user experiences at the time, reducing the growth factor until howls of pain subsided.
For many apps, sure. Probably most. But not all important apps. As HW resources grow, so do programmer ambitions.
Do they? That requires measurement to back up, not “head arguments”. realloc() doesn’t copy anything if malloc reserved enough space last time to absorb the mild new “overallocation”. So what you measure will also vary depending on the platform malloc/realloc behavior.
Note that it’s not only giant lists that can cause problems. If a program has a great many small lists (some do!), leaving one half empty doesn’t matter, but wasting half in a billion adds up too.
Note that any growth factor over 1.0 gives O(1) amortized time, and that’s the high-order bit overallocation is aiming at. 2.0 or 1.125 have the same O() behavior, although the hidden constant factor gets larger the smaller the growth factor.
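To make the amortized-O(1) point concrete, here is a small sketch (not CPython’s code; a simplified model in which every grow copies the whole buffer) counting total elements copied while appending:

```python
def total_copies(n, k, start=4):
    """Elements copied by resizes while appending n items with growth factor k."""
    cap, size, copies = start, 0, 0
    for _ in range(n):
        if size == cap:
            copies += size                     # pessimistic: full copy on every grow
            cap = max(cap + 1, int(cap * k))   # grow capacity by factor k
        size += 1
    return copies

n = 100_000
# Any k > 1 keeps copies-per-append bounded (roughly k/(k-1)); smaller k
# means a larger hidden constant, same O() behavior.
assert total_copies(n, 2.0) < total_copies(n, 1.25) < total_copies(n, 1.125) < 12 * n
```

Per this model, factor 2.0 copies about 1.3 elements per append while 1.125 copies around 9: the same asymptotics, very different constants.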
I think any change here would have to be motivated by substantial real-world data. That’s A Project™, alas.
Another kind of compromise could be to add a new method to lists, allowing a program to suggest a maximum capacity. But that’s a different can of worms …
tim.one (Tim Peters) March 20, 2026, 7:32pm 3
There’s a decent brief discussion of “the math” here:
tim.one (Tim Peters) March 20, 2026, 8:51pm 4
And I found only one relevant issue report on the tracker, which is related to that:
It had nothing to do with the general list overallocation strategy, but with str.split() specifically over-allocating by more than that, for reasons that remain unclear. That’s since been repaired.
The bottom line is that a user saving away millions of small .split() results saw unreasonably high RAM usage, due to .split() grossly over-allocating space for each result list. It “adds up”.
We’d reintroduce the same kind of bloat if, e.g., list’s own growth factor were increased to 2.
1.125, like everything else, is a compromise. Why the “.125”? Why not, e.g., “.129” or “.140” instead? Because “.125” is 1/8, and getting an eighth of an int is just a right shift by 3 bits. That’s a “go fast” compromise, and no deeper than that.
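The shift trick is visible in list_resize’s formula in Objects/listobject.c; a minimal Python transcription (the `+ 6` and `& ~3` pad small lists and round down to a multiple of 4):

```python
def new_allocated(size, shift=3):
    # grow by ~size/8 (a right shift by `shift` bits), pad, round down
    # to a multiple of 4 slots
    return (size + (size >> shift) + 6) & ~3

assert 1000 >> 3 == 1000 // 8 == 125    # >> 3 is an integer divide by 8
assert new_allocated(1) == 4            # first append: empty -> 4 slots
assert new_allocated(100) == 116        # 100 + 12 + 6 = 118, rounded down to 116
```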
The history of growth factors went from 2 to 1+1/2 to 1+1/8. Why 1+1/4 was skipped is lost to history.
Perhaps that’s worth looking at again. Moving from 1+1/8 up to 1+1/4 wouldn’t change O() behavior, but would cut the hidden constant factor in half. And, of course, also potentially “waste” more memory. Tradeoffs!
storchaka (Serhiy Storchaka) March 20, 2026, 11:53pm 5
Moving from 1+1/8 up to 1+1/4 would increase memory usage for list storage by about 1/16, or 6%.
tim.one (Tim Peters) March 21, 2026, 1:04am 6
Best case? Worst case? Average (based on some assumptions about size distribution)? Please show your work.
For example, what are worst cases for percentage of unused capacity? When a list first resizes from C to C*k capacity, C*k-C = C*(k-1) of the slots are unused. So the fraction of unused slots is C*(k-1)/(C*k) = (k-1)/k = 1 - 1/k.
k      1 - 1/k
1.125  0.11
1.25   0.20
1.5    0.33
2.0    0.50
So moving from 1+1/8 to 1+1/4 could make worst-case “waste” almost twice as high (from 0.11 to 0.20). Best case is no waste for both. “Average” depends on details of expected size distributions.
I don’t know where 6% came from. To a hand-wavy first approximation, a list with L elements for which L*1.25 slots were allocated is L*1.25/(L*1.125) = 1.11 times larger than if L*1.125 slots were allocated, more like 11% larger. But that’s a worst case. In return, resizing would happen about log_{1.125}(1.25) \approx 1.9 times less often.
Stefan2 (Stefan Pochmann) March 21, 2026, 8:05am 7
Benchmark:
from timeit import repeat

for f in [fast, slow] * 3:
    print(f.__name__, min(repeat(f, number=1)))
Results:
fast 0.35265472903847694
slow 0.1905038021504879
fast 0.35569924861192703
slow 0.18192972615361214
fast 0.3486487250775099
slow 0.18666411191225052
Why did you call the slow function “fast” and the fast function “slow”?
storchaka (Serhiy Storchaka) March 21, 2026, 8:59am 8
Average. A more accurate estimation depends on the definition (is it a ratio to the number of items or to the allocated memory?), on the model, and on the fine details of the implementation (for example, on details of the memory pool and page size).
If the growth ratio is 1+1/8, in the worst case you will use 1/8 more memory than necessary. Or, from another view, have 1/9 of the storage unused. The difference between 1/8 and 1/9 is insignificant in such estimations. This is the maximum; the minimum is zero, and the average is something in between. There is not much difference between different kinds of average here:
>>> from math import sqrt
>>> (1 + (1+1/8))/2 - 1
0.0625
>>> sqrt((1**2 + (1+1/8)**2)/2) - 1
0.06433664787040017
>>> sqrt(1 * (1+1/8)) - 1
0.06066017177982119
>>> 2/(1/1 + 1/(1+1/8)) - 1
0.05882352941176472
so we can use arithmetic mean (1/16) for simplicity.
For the growth ratio 1+1/4 the average overhead is 1/8, and the difference is 1/8 - 1/16 = 1/16 = 6%. You need a very particular model to get very different average results. In a very unlikely scenario you can get 5% or 7%, but it is virtually impossible to get less than 4% or more than 8% without hard cheating. If you have many lists of different sizes with different initial sizes and histories, you will likely get a result very close to that estimation.
Now, my point is that 6% is pretty small. Even in the worst-case scenario, 11% or 12% is small (note that this is only memory for the list storage; memory is also needed for the items themselves, and they are usually much larger than a pointer). On the other hand, changing the growth ratio from 1+1/8 to 1+1/4 would reduce the number of reallocations by about half (you can argue that it is actually log(1+1/4)/log(1+1/8) = 1.8945304820059299, but 2 is a good estimate). Of course, reallocation is only a small part of the interesting things we do with lists (just as the overallocated capacity is only a small part of the memory used for the list and its items), but it may be time to reconsider the balance. We are now less restricted in memory than in the 32-bit era.
storchaka (Serhiy Storchaka) March 21, 2026, 9:08am 9
I’ll add that other runtimes have an API to preallocate the dynamic array and to shrink the capacity to the used size, so they have fewer issues with a large growth ratio. I do not think Python needs such an API – it would be premature optimization, and in most cases would have a negative effect on performance (not to mention readability), because Python has a different ratio between the costs of different operations than C++ and Java. So there were good reasons for a smaller growth ratio.
But it does not need to be this small.
vstinner (Victor Stinner) March 21, 2026, 10:20am 10
By the way, the function:
# Dynamically sized list
def slow():
    for _ in range(10_000):
        L2 = []
        for x in L1:
            L2.append(x)
can be rewritten using the more efficient list.extend() method:
def slow():
    for _ in range(10_000):
        L2 = []
        L2.extend(L1)
When the length is known, list.extend() only resizes the list once.
But I suppose that in your use case, you cannot use list.extend().
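For completeness, a quick check that the two spellings build identical lists (timing differences depend on platform and CPython version, so none are asserted here):

```python
from timeit import timeit

src = list(range(1_000))

def via_append():
    out = []
    for x in src:
        out.append(x)
    return out

def via_extend():
    out = []
    out.extend(src)    # len(src) is known, so the target can be sized up front
    return out

assert via_append() == via_extend() == src

# Hedged comparison; the ratio varies by platform and build:
t_append = timeit(via_append, number=1_000)
t_extend = timeit(via_extend, number=1_000)
```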
methane (Inada Naoki) March 21, 2026, 12:52pm 11
How about using size classes of mimalloc for >8word (>64bytes on 64bit) sizes?
It uses highest 3bits to determine size class.
- 64, 80, 96, 112,
- 128, 160, 192, 224,
- 256, 320, 384, 448,
- 512, 640, 768, 896,
- 1024, 1280, 1536, 1792,
- …
The advantage of this idea is that if you are using mimalloc, you can avoid wasting memory equal to the difference between the size of the list’s array and mimalloc’s size class.
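A sketch of the bucketing those classes imply (a reconstruction from the list above, not mimalloc’s actual code): keeping only the top 3 bits means each power-of-two octave is split into 4 steps, so rounding a request up to its class looks like:

```python
def size_class(n, min_class=64):
    # round n up so that only the top 3 bits can be set:
    # each [2**b, 2**(b+1)) octave is split into 4 steps of 2**(b-2) bytes
    if n <= min_class:
        return min_class
    b = n.bit_length() - 1          # position of the highest set bit
    step = 1 << (b - 2)             # granularity: 1/4 of the octave
    return (n + step - 1) & ~(step - 1)

assert size_class(65) == 80
assert size_class(129) == 160
assert size_class(300) == 320
```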
tim.one (Tim Peters) March 21, 2026, 9:06pm 12
I’m not sure I’m following the intent, but best I can make out this ends up doubling the size after every 4 resizes, basically acting like a growth factor of \sqrt[4]{2} \approx 1.189 overall.
tim.one (Tim Peters) March 21, 2026, 9:35pm 13
Thank you for fleshing things out!
I agree, but numbers emphatically do not “speak for themselves”. Spelling out “the point” explicitly is always better than leaving a reader to guess at what you think the numbers “mean”.
Yes and no. Object memory is always larger under CPython, because its small object allocator aligns to a 16-byte boundary, twice as large as a pointer on 64-bit boxes. “Almost all objects” end up taking at least 32 bytes, so even almost all the smallest objects take 4x as much RAM as a pointer (this is true, e.g., even of tiny ints, which are as small as non-trivial objects get).
OTOH, lists don’t always hold pointers to distinct objects. Each pointer in a list takes up its own 8 bytes, but what they point at can be shared. For example, I have a bunch of apps that routinely build up multi-million-element lists of tiny ints. But tiny ints in CPython are singleton objects (implementation detail), so the storage for the ints is tiny and constant regardless of how large the lists get. It’s the space for the pointers that grows without bound.
Alas, there is no “typical app”, and Python wants to work well with all apps.
And I agree. It’s an old decision from a time when HW realities were much different. Best I can tell, there is no surviving contemporaneous discussion of the thinking that went into picking 1.125, so even if it doesn’t change, it would be good to create a public record now of why that remains the best choice.
There are also variations that could be considered. The OP is right that 1.5 and 2.0 are overwhelmingly most common now in other similar implementations. But, e.g., Go appears to use 2.0 for “small” vectors, but backs off to 1.25 for larger ones.
For a start, someone needs to change the shift count from 3 to 2 in listobject.c, and see whether it makes a measurable time difference on real-life programs they care about. To spell that out, that would change the growth factor from 1.125 to 1.25. If that pays off, go on to try 1.5 (shift count 1) and 2.0 (no shift).
There are “head arguments” for why 1.5 could do better than 2.0 (having to do with reducing system malloc heap fragmentation), but I don’t really trust such stuff. Modern boxes are so complex that nothing can substitute for careful timing of how they actually work. Counterintuitive results abound.
cmaloney (Cody Maloney) March 21, 2026, 10:52pm 14
To me, modifying the growth factor just changes which cases are faster and slower. On new laptop and server machines there is RAM to spare and memcpy is fast; on older or more embedded hardware the costs are very different. For maintained code which wants to measure and hand-optimize, there is already a good solution: over-allocate and keep a length. Lots of bits of CPython do that already (e.g. bytearray, PyBytesWriter, io.BufferedWriter, …).
There is a system in PEP 424, __length_hint__, for code to get closer to the right allocation. That should lead to one pre-allocation of an estimated size and then filling. Unfortunately for small lists the extra call is measurable; for large there is quite a bit of improvement possible (recent benchmarks and analysis focused on islice: Add `__length_hint__` methods to `itertools` helpers · Issue #145361 · python/cpython · GitHub). Some way of figuring out that tradeoff (only try length hint if size is already over a threshold? Or when resize would be expensive / require a memcpy?) would make length hint cost effective to add and use in CPython code.
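PEP 424’s hook is exposed at the Python level as operator.length_hint, which falls back from __len__ to __length_hint__ to a caller-supplied default:

```python
from operator import length_hint

assert length_hint([1, 2, 3]) == 3           # real length via __len__
assert length_hint(iter(range(10))) == 10    # __length_hint__ on the iterator
assert length_hint(x for x in ()) == 0       # generators give no hint; default is 0
```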
tim.one (Tim Peters) March 22, 2026, 12:53am 15
There’s no reason to imagine that raising the growth factor would make any case slower. To the contrary, it would make it possible for the system realloc to copy fewer bytes over time. But at the cost of allocating more memory than an app is actually using, and more such “waste” the higher the growth factor.
Depends on the app. “Big data” apps are always pushing the HW limits, no matter how large they become. Double the RAM? Fine, they’ll twiddle a constant or two in the code to consume all of it. Add a dozen cores? Ditto.
The OP is coding at the Python level, not at the C API level. There’s nothing a user can say in Python that has any effect on the overallocation list.append(x) does under the covers. That’s hard-coded, with no tuning knobs of any kind. There’s nothing at the Python level akin to the C API’s PyList_New(size), which allows the caller to set the new list’s initial maximum capacity to that size. And even that has no effect on the overallocation strategy; it just lets the caller force the initial max capacity. Which is more than can be done at the Python level.
I don’t think that’s relevant here. .append() takes a single object as argument, and is a one-shot operation on its own.
The focus here isn’t on speeding “bulk” mutations, just on what a list is best off doing when it’s asked to add a single element that requires growing the list. It has no context whatsoever now to consult to get a clue about what the user’s code may do next. Maybe that’s the last append that will ever be done, maybe just the first a billion appends to follow. The element itself can’t know anything about that either. Which relevant object could supply a potentially useful length hint?
Whether the Python-level list API could/should be expanded to allow the user to explicitly suggest a maximum capacity is a different, although potentially related issue. And regardless of whether that’s done, the question remains what a bare list.append(element) should do when it needs to grow the list.
The OP would like it to over-allocate more aggressively than the current growth factor of ~1.125, which is indeed the smallest growth factor I’ve seen in any contemporary implementation of resizable lists/arrays/vectors. Whatever reasons we have had for that caution 20+ years ago (when it was picked) may no longer make best sense.
cmaloney (Cody Maloney) March 22, 2026, 1:57am 16
Would really like to see pyperformance numbers for that. If it’s always faster, I’m definitely a supporter. I suspect there will be more nuance, though, for performance across the range of systems and applications written in Python.
I also think small increases in allocation across many objects can significantly impact an application overall. I know PyPI/warehouse ran into less stable memory usage in 3.14 which caused them to roll back to 3.13 for the time being (chore: downgrade to Python 3.13.12 by miketheman · Pull Request #19557 · pypi/warehouse · GitHub). With adjusting the growth factor I’m worried about increased overheads for some software which was written and specialized for the value as they are currently set. It can make a barrier to Python version upgrade.
To expand on and connect to this point
In 3.14, pymalloc_alloc oversizes allocations to a particular size class (cpython/Objects/obmalloc.c at 3.14 · python/cpython · GitHub). When code reallocates upwards, it either fits in that overallocation or gets a new size-class allocation (which may be exact or over); when it reallocates downward, it uses a heuristic to reduce reallocation while not wasting too much memory (if (4 * nbytes > 3 * size)). Individual objects, such as bytearray in bytearray_resize_lock_held, have specialized, somewhat tuned implementations which add their own “how much to oversize” + “how much to undersize” logic on top of that.
If we change the underlying allocation growth factor, that may improve that second layer of bytearray logic, but it might also interact badly with it and make bytearray’s allocation slower. It might make bytearray’s custom logic useless / effectively dead code. For me at least, it’s really hard to know without measuring.
I agree very strongly that context matters a lot; and currently CPython doesn’t track that. If we’re in a loop over a container a and see repeated single appends to a distinct container b, would it make sense to over-allocate b for the length of the object we’re iterating? Currently prototyping that locally to see if it would help and how complex the code changes are to support it :).
My goal here is that if we can get to exactly right (or dramatically closer) preallocation without explicit user work more of the time, that should have a large positive effect across a range of codebases. As always there are edge cases (e.g. partial iteration over infinite iterators, preallocation that would run out of memory, …).
Structurally, to me, “exactly right more of the time” would be a really big gain compared to more incremental tuning. To be explicit, 10% from doubling a constant is very welcome to me as well. For exact preallocation, I know C++ and Go do it a lot, and in Python I’ve definitely preallocated lists for performance. It’s not something CPython has particularly done, to my understanding, but PyPy did and found a significant performance uplift for some cases as a result. That performance resulted in the proposal and standardization of __length_hint__. With things like the JIT, hopefully more of that analysis can work, so that things like a simple loop containing append → one right pre-allocation can just happen and make more Python code fast :).
There’s quite a few of those constants around, looking forward to seeing the impact io.DEFAULT_BUFFER_SIZE has (performance: Update io.DEFAULT_BUFFER_SIZE to make python IO faster? · Issue #117151 · python/cpython · GitHub, TIL: Python 3.15's new buffer size).
tim.one (Tim Peters) March 22, 2026, 2:42am 17
It would be a start for me.
Modern systems are just too varied and complex to out-think reliably. There are always surprises.
In particular, 2nd-order effects can certainly affect this. A lot depends on the system realloc. For example, if the growth factor is greater than about the Golden Ratio (the growth factor of the Fibonacci sequence \approx 1.62), across a sequence of resizes the sum of all discarded-so-far pieces never reaches the size of the next piece requested. So realloc always needs to find fresh memory to pass out.
If the growth factor is less than that, a realloc can theoretically, at times, find enough released space to satisfy a new request from previously discarded pieces.
That in turn can have direct effects on cache hierarchy locality, and more indirectly via system heap fragmentation (the higher the growth factor, the more pressure on fragmentation).
That seems to be “the reason” some systems now favor 1.5 over 2.0. But I rarely find “head theories” compelling on their own. “2nd- and 3rd-order” effects (like cache- or branch-prediction hostility) are often the dominant effects these days.
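The Fibonacci-style argument can be sketched numerically (an idealized model: one live buffer, with contiguity and actual allocator policy ignored):

```python
def freed_covers_next(k, n=40, c=8):
    """After n grows by factor k, can the sum of all previously freed
    blocks hold the next request?  Block n-1 is still live during the
    copy, so only blocks 0..n-2 have been returned to the allocator."""
    sizes = [c * k**i for i in range(n)]
    return sum(sizes[:-2]) >= sizes[-1]

assert not freed_covers_next(2.0)    # above the golden ratio: freed space never suffices
assert not freed_covers_next(1.7)
assert freed_covers_next(1.5)        # below ~1.618: freed space can suffice
```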
It certainly could. Far beyond what CPython has traditionally tried to do, though. More in PyPy’s ballpark, although they get enormous benefit from dynamically specializing to lists of small ints or floats, storing raw machine bits and avoiding “object overheads” entirely when they can. Far more bang for the gonzo-effort buck than any fiddling with container pre-allocation schemes.
Suggestive factoid: to reach a million elements via one append at a time, a list today undergoes 86 resizes. That “seems excessive”. “Just” raising the growth factor from 1.125 to 1.25 would cut that to 49 (and 29 at factor 1.5, and just 17 at factor 2.0).
tim.one (Tim Peters) March 22, 2026, 4:14am 18
Some exact figures to inform intuitions. This starts by modeling the actual code but in Python:
def gencap(shift=3):
    s = 0
    while True:
        s += 1
        s = (s + (s >> shift) + 6) & ~3
        yield s
That yields the exact sequence of list capacities CPython uses. Then, for each list size in 1 through a million inclusive:
- Add that size to a running total of sizes.
- Find the smallest capacity at least that large. That’s how many slots CPython will actually allocate to hold a list with that many elements, via appending one element at a time.
- Call “waste” the capacity minus the size. Add that to a running total of wastes.
This tries all sizes in range, not favoring “best” or “worst” or any other kinds of sizes.
At the end, call “overhead” the percentage total waste is of total sizes. For example, if we had lists with a total of 400 elements, and across all those we had 100 total temporarily unused extra allocated slots, overhead would be 25%.
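That procedure is easy to run; a sketch (with gencap repeated so the block is self-contained):

```python
def gencap(shift=3):
    # exact sequence of CPython list capacities (same model as above)
    s = 0
    while True:
        s += 1
        s = (s + (s >> shift) + 6) & ~3
        yield s

def simulate(shift, limit):
    caps = gencap(shift)
    cap, resizes = next(caps), 1
    total_size = total_waste = 0
    for size in range(1, limit + 1):
        while cap < size:           # grow until this size fits
            cap = next(caps)
            resizes += 1
        total_size += size
        total_waste += cap - size
    return resizes, 100.0 * total_waste / total_size

resizes, overhead = simulate(3, 10)
assert resizes == 3 and abs(overhead - 45.45) < 0.01   # matches the shift-3 row
assert simulate(3, 100)[0] == 11
```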
Results at various growth factors:
shift 4 growth factor 1.0625
through 10**1 resizes 3 overhead 30.91%
through 10**2 resizes 14 overhead 7.41%
through 10**3 resizes 44 overhead 3.49%
through 10**4 resizes 80 overhead 3.09%
through 10**5 resizes 118 overhead 3.10%
through 10**6 resizes 156 overhead 3.10%
shift 3 growth factor 1.125 *** CURRENT ***
through 10**1 resizes 3 overhead 45.45%
through 10**2 resizes 11 overhead 10.57%
through 10**3 resizes 28 overhead 6.60%
through 10**4 resizes 47 overhead 6.29%
through 10**5 resizes 66 overhead 5.90%
through 10**6 resizes 86 overhead 6.25%
shift 2 growth factor 1.25
through 10**1 resizes 2 overhead 60.00%
through 10**2 resizes 8 overhead 15.33%
through 10**3 resizes 18 overhead 12.88%
through 10**4 resizes 28 overhead 12.39%
through 10**5 resizes 38 overhead 11.34%
through 10**6 resizes 49 overhead 12.26%
shift 1 growth factor 1.5
through 10**1 resizes 2 overhead 60.00%
through 10**2 resizes 6 overhead 22.30%
through 10**3 resizes 12 overhead 25.33%
through 10**4 resizes 18 overhead 22.23%
through 10**5 resizes 23 overhead 23.01%
through 10**6 resizes 29 overhead 25.00%
shift 0 growth factor 2.0
through 10**1 resizes 2 overhead 103.64%
through 10**2 resizes 4 overhead 48.91%
through 10**3 resizes 7 overhead 34.79%
through 10**4 resizes 11 overhead 48.81%
through 10**5 resizes 14 overhead 47.62%
through 10**6 resizes 17 overhead 36.41%
Notes:
- For very short lists, the actual growth factor is much higher than for long lists. For example, under the current scheme we leap from empty to 4 allocated slots on the first resize. Then to 8, then 16, and then 24. Doesn’t really matter “in the large”, but prevents frequent thrashing for short lists.
- At the highest growth factors, we’d presumably tweak the code to not overallocate so very much for very small lists.
- To my eyes, the “sweet spots” are at 1.125 and 1.25, for their modest overheads. Python dicts have higher overheads in this sense (from 50% to 200%), but that’s to keep the load factor between 1/3 and 2/3, and going over 2/3 can have catastrophic effects on the expected number of performance-destroying collisions. The “wasted space” in that context pays off a whole lot in speed. It remains to be demonstrated that increasing “wasted space” here has any measurable effect on speed.
methane (Inada Naoki) March 22, 2026, 5:04am 19
Yes. My opinion is that if there is no clear evidence that a specific growth factor is superior to another, it would be most efficient to fit the allocator’s size class.
If 1.189 is too small, we can use the upper 2 bits instead of the upper 3 bits. In that case, the growth factor would be sqrt(2) = 1.414.
tim.one (Tim Peters) March 22, 2026, 5:31am 20
I don’t know mimalloc internals, but it’s unattractive to write higher-level code to cater to lower-level details. Python’s native small object allocator is limited to objects of at most 512 bytes, in 32 size classes (16-bytes apart, to cater to 16-byte alignment).
That already works out perfectly “by magic”, because lists happen to overallocate to a multiple of 4 elements, which is a multiple of 32 bytes on 64-bit boxes. So long as a list stays small, it always asks obmalloc for a chunk that exactly fills one of its 32 pool sizes. And that would remain so regardless of growth factor.
Lists didn’t have obmalloc in mind, though. “Multiple of 4” rounds down, to make the growth pattern more sensible for small lists. That it was perfect for obmalloc is just happy coincidence.