O(N²) compile-time regression for large constant tuples after const folding moved to CFG

While migrating a project from 3.11 to 3.14, I hit a case where a data file
containing ~81,000 nested 2-tuples (think ((x, y), (x, y), ...)) caused the
compiler to hang.

Turns out this is a regression from the const folding migration
(gh-126835). The root cause
is add_const() in Python/flowgraph.c — it does a linear scan over the
consts list:

for (index = 0; index < PyList_GET_SIZE(consts); index++) {
    if (PyList_GET_ITEM(consts, index) == newconst) {
        break;
    }
}

Before gh-126835, large constant tuples were folded in one shot at the AST stage,
so add_const was rarely called with big consts lists. After
gh-130769 moved tuple folding
to CFG, fold_tuple_of_constants() calls add_const() once per inner tuple
— N calls × O(N) scan = O(N²).
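To make the cost concrete, here is a minimal Python sketch (not CPython internals; names are illustrative) of an add_const-style linear identity scan, with the comparisons counted explicitly:

```python
def add_const_linear(consts, new, counter):
    """Model of add_const(): scan the consts list by identity, append if absent."""
    for index, c in enumerate(consts):
        counter[0] += 1
        if c is new:  # identity check, like PyList_GET_ITEM(consts, index) == newconst
            return index
    consts.append(new)
    return len(consts) - 1

def total_comparisons(n):
    """Add n distinct constants, as tuple folding does once per inner tuple."""
    consts, counter = [], [0]
    for obj in (object() for _ in range(n)):
        add_const_linear(consts, obj, counter)
    return counter[0]

# n distinct constants cost 0 + 1 + ... + (n-1) = n*(n-1)/2 comparisons,
# i.e. quadratic in n.
print(total_comparisons(100))
```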

perf profiling shows add_const taking 83.76% of CPU time at N=100K.

This isn’t limited to tuple folding — any CFG-stage constant folding that goes
through add_const is affected. Unary/binary op folding was also moved to CFG in
gh-129550, so patterns like (-1, -2, ..., -N) and (0+1, 0+2, ..., 0+N) show the
same O(N²) behavior.
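For completeness, a sketch of reproducer variants for those two cases (same idea as the tuple reproducer below; N is kept small here and can be raised to see the blow-up):

```python
import time

def compile_time(source):
    """Time a single compile() call of the given source string."""
    t0 = time.perf_counter()
    compile(source, "<test>", "exec")
    return time.perf_counter() - t0

N = 2000  # raise this to observe the quadratic growth

# Each -i / 0+i is folded to a constant in the CFG pass,
# hitting add_const once per element.
unary_src = "t = (" + ", ".join(f"-{i}" for i in range(1, N + 1)) + ",)"
binary_src = "t = (" + ", ".join(f"0+{i}" for i in range(1, N + 1)) + ",)"

print(f"unary neg:  {compile_time(unary_src):.4f}s")
print(f"binary add: {compile_time(binary_src):.4f}s")
```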

The issue is present on 3.15.0a7 as well.

Reproducer:

import sys, time

N = int(sys.argv[1]) if len(sys.argv) > 1 else 100000

# Generate source: vertices = ((0.000000, 0.000000), (0.001000, 0.002000), ...)
elements = ", ".join(f"({i * 0.001:.6f}, {i * 0.002:.6f})" for i in range(N))
source = f"vertices = ({elements},)"

t0 = time.perf_counter()
code_obj = compile(source, "<test>", "exec")
t1 = time.perf_counter()
print(f"Compile time: {t1-t0:.4f}s", flush=True)

Proposed fix:

Add an auxiliary _Py_hashtable_t (pointer → index) alongside the consts list
in _PyCfg_OptimizeCodeUnit, replacing the linear scan with O(1) lookup. It uses
_Py_hashtable_hash_ptr / _Py_hashtable_compare_direct — pure pointer ops, no
Python object overhead. The hashtable is created before optimize_cfg() and
destroyed right after, before remove_unused_consts() reindexes things.

This works because _PyCompile_ConstCacheMergeOne() already guarantees identity
uniqueness (equal-valued constants share the same pointer), so pointer-based lookup
is semantically equivalent to the existing identity scan. All affected functions are
static — no public API changes.
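A rough Python model of the idea (the actual patch uses a C _Py_hashtable_t keyed on pointers; here id() stands in for the pointer key, which is valid precisely because equal constants already share one object):

```python
def add_const_cached(consts, index_by_id, new):
    """O(1) dedup: map object identity -> index in the consts list."""
    index = index_by_id.get(id(new))
    if index is None:
        consts.append(new)
        index = len(consts) - 1
        index_by_id[id(new)] = index
    return index
```

The cache lives only for the duration of the optimization pass, mirroring the create-before-optimize_cfg / destroy-before-remove_unused_consts lifecycle described above.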

Results (N=100K):

Scenario                          3.11    3.14 (before)   3.14 (after fix)
Nested tuples ((f, f), ...)       1.24s   10.38s          1.48s
Unary neg (-1, -2, ..., -N)       0.22s   6.86s           1.12s
Binary add (0+1, 0+2, ..., 0+N)   0.27s   2.59s           1.40s

All existing tests pass (test_compile, test_peepholer, test_ast, test_dis).

There’s also a further optimization opportunity — folding all-constant tuples
directly at codegen stage to avoid the instruction bloat entirely (closing the
remaining gap vs 3.11) — but that’s a separate discussion.

Draft fix: Fix O(N²) in add_const() after constant folding moved to CFG (zSirius/cpython@120e518)

Thoughts?