perf: pool the 64 KiB decodeScratch buffer in PaginatedSegmentDimCursor

Follow-up to #4068. Tracks heap pressure in PaginatedSegmentDimCursor so it does not get lost.

Context

PaginatedSegmentDimCursor allocates a byte[pageContentSize] decode scratch per cursor instance at construction, plus a wrapping ByteBuffer. At the default 64 KiB page that is one ~64 KiB allocation per cursor, and a query opens one cursor per (query dim, segment) pair.

Concrete scale: a 30-query-dim top-K against 15 sealed segments opens 30 x 15 = 450 cursors, each carrying ~64 KiB of scratch, for ~28 MiB of allocations per query, all of which becomes garbage once the query finishes. Under concurrent load this is real heap pressure.
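
The arithmetic above can be checked with a small sketch. The class and method names here are hypothetical (not part of the codebase), and the figure counts only the byte[] payload, ignoring array headers and the wrapping ByteBuffer:

```java
import java.util.Locale;

/** Hypothetical helper for estimating decode-scratch allocation per query. */
final class ScratchFootprint {

    /** One scratch buffer per (query dim, segment) cursor, each pageContentSize bytes. */
    static long bytesPerQuery(int queryDims, int segments, int pageContentSize) {
        return (long) queryDims * segments * pageContentSize;
    }

    public static void main(String[] args) {
        int queryDims = 30;
        int segments = 15;
        int pageContentSize = 64 * 1024; // default 64 KiB page

        long bytes = bytesPerQuery(queryDims, segments, pageContentSize);
        double mib = bytes / (1024.0 * 1024.0);

        // 30 * 15 = 450 cursors; 450 * 64 KiB = 28800 KiB = 28.125 MiB
        System.out.println(String.format(Locale.ROOT,
                "cursors=%d scratch=%.3f MiB", queryDims * segments, mib));
    }
}
```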

The current code documents the cost in the field's Javadoc and defers the fix because the serial path's allocation cost is a small fraction of total query time at current numbers. The trigger for revisiting is parallel scoring (#4085): parallel dispatch multiplies the allocation rate by the parallelism factor, so the same workload running 4 cores wide churns the equivalent of ~112 MiB (4 x ~28 MiB) of scratch in the time the serial path allocates ~28 MiB.

Scope

- Replace the per-cursor field with a pool sourced at start() and returned at close(). Two natural shapes:
  - Thread-local stack of byte arrays. Lock-free, scales perfectly with concurrent queries, but each thread holds onto its high-water-mark allocation until it dies. Fine for the engine pool's bounded-thread executor, less ideal for embedded users with bursty thread creation.
  - Bounded ConcurrentLinkedDeque-backed pool. Hands out arrays on start(), returns on close(). Slightly more contention but a fixed cap on retained memory. Probably the better default since it composes with the dedicated SparseVectorScoringPool pattern.
- Audit start() / close() invariants: every cursor that calls start() must reach close() regardless of exception path. The existing topK finally-blocks already guarantee this, but a leak would silently exhaust the pool and force the fallback allocation path.
- Sizing: the buffer must be >= component.pageContentSize(). If two indexes have different page sizes (today they don't, but the design allows it), the pool should key by size or take the max.
- Benchmark: re-run LSMSparseVectorIndexLargeBenchmark with -XX:+PrintGC (-Xlog:gc on JDK 9+) to confirm allocation rate drops and to spot any tail-latency regression from pool contention.
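
A minimal sketch of the bounded-pool shape, for discussion only. DecodeScratchPool, acquire(), release(), and freshAllocations() are hypothetical names, not existing project API. The sketch assumes a single buffer size (keying by size is left out) and counts fresh allocations so that a leak or an undersized cap shows up as a number instead of failing silently:

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded pool of fixed-size decode scratch buffers. */
final class DecodeScratchPool {
    private final int bufferSize;   // assumed >= component.pageContentSize()
    private final int maxRetained;  // hard cap on idle buffers kept alive
    private final ConcurrentLinkedDeque<byte[]> idle = new ConcurrentLinkedDeque<>();
    private final AtomicInteger retained = new AtomicInteger();
    private final AtomicLong freshAllocations = new AtomicLong();

    DecodeScratchPool(int bufferSize, int maxRetained) {
        if (bufferSize <= 0 || maxRetained < 0) {
            throw new IllegalArgumentException("bufferSize must be > 0, maxRetained >= 0");
        }
        this.bufferSize = bufferSize;
        this.maxRetained = maxRetained;
    }

    /** Intended for the cursor's start(): reuse an idle buffer, else allocate. */
    byte[] acquire() {
        byte[] buf = idle.pollFirst(); // LIFO keeps recently used buffers warm
        if (buf != null) {
            retained.decrementAndGet();
            return buf;
        }
        freshAllocations.incrementAndGet(); // fallback allocation path
        return new byte[bufferSize];
    }

    /** Intended for the cursor's close(): keep the buffer if under the cap. */
    void release(byte[] buf) {
        if (buf == null || buf.length != bufferSize) {
            return; // never pool null or foreign-sized arrays
        }
        // Reserve a slot before publishing so the deque never exceeds the cap.
        while (true) {
            int n = retained.get();
            if (n >= maxRetained) {
                return; // over cap: drop the buffer and let GC reclaim it
            }
            if (retained.compareAndSet(n, n + 1)) {
                break;
            }
        }
        idle.offerFirst(buf);
    }

    /** Steady growth after warm-up suggests a leak or a cap that is too small. */
    long freshAllocations() {
        return freshAllocations.get();
    }

    public static void main(String[] args) {
        DecodeScratchPool pool = new DecodeScratchPool(64 * 1024, 512);
        for (int query = 0; query < 3; query++) {
            byte[][] held = new byte[450][]; // 30 dims x 15 segments
            try {
                for (int i = 0; i < held.length; i++) {
                    held[i] = pool.acquire(); // stands in for cursor.start()
                }
            } finally {
                for (byte[] b : held) {
                    pool.release(b); // stands in for cursor.close(); null-safe
                }
            }
        }
        // Only the first query allocates; the next two reuse pooled buffers.
        System.out.println("fresh allocations: " + pool.freshAllocations()); // 450
    }
}
```

The try/finally in main mirrors the audit item: release must run on every exit path, otherwise each later query falls back to fresh allocations and the counter keeps climbing. A thread-local variant would drop the deque and the cap in exchange for per-thread retention.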

Acceptance criteria

- PaginatedSegmentDimCursor no longer holds a per-instance decodeScratch.
- Buffers are sourced at start() and returned at close() from a shared pool.
- Allocation rate during a sustained 10M corpus benchmark is measurably lower (target: GC-rate reduction proportional to query fan-out).
- No regression in BmwScorerCorrectnessTest or the 47-test sparse-vector unit suite.
- The "FUTURE" Javadoc block on decodeScratch is removed once the pool lands.

Out of scope

- Pooling the per-cursor blockRids / blockWeights / blockTombstones arrays (each only params.blockSize() entries, default 128, ~3 KiB combined - dominated by decodeScratch).
- The payloadScratch in SparseSegmentBuilder (one allocation per builder, not per cursor; not on the query hot path).