perf: pool the 64 KiB decodeScratch buffer in PaginatedSegmentDimCursor
Follow-up to #4068. Tracks heap pressure in PaginatedSegmentDimCursor so it does not get lost.
Context
PaginatedSegmentDimCursor allocates a byte[pageContentSize] decode scratch per cursor instance at construction, plus a wrapping ByteBuffer. At the default 64 KiB page that is one ~64 KiB allocation per cursor, and a query opens one cursor per (query dim, segment) pair.
Concrete scale: a 30-query-dim top-K against 15 sealed segments opens 30 x 15 = 450 cursors, each carrying ~64 KiB of scratch -> ~28 MiB of allocations per query, all of which becomes garbage when the query finishes. Under concurrent load this is real heap pressure.
The current code documents the cost in the field's Javadoc and defers the fix because the serial path's allocation cost is a small fraction of total query time at current numbers. The trigger for revisiting is parallel scoring (#4085): parallel dispatch multiplies the allocation rate by the parallelism factor, so the same workload running 4 cores wide churns scratch about 4x as fast, i.e. ~112 MiB in the time the serial path allocates ~28 MiB.
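The figures above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch only: ScratchFootprint and perQueryBytes are hypothetical names, the page is taken as exactly 64 KiB, and the wrapping ByteBuffer and array headers are ignored.

```java
/** Back-of-the-envelope scratch footprint; names here are illustrative, not from the codebase. */
final class ScratchFootprint {
    static final long KIB = 1024L;
    static final long MIB = 1024L * KIB;

    /** Bytes of decode scratch one query allocates: one buffer per (query dim, segment) cursor. */
    static long perQueryBytes(int queryDims, int segments, long pageContentSize) {
        long cursors = (long) queryDims * segments;
        return cursors * pageContentSize;
    }

    public static void main(String[] args) {
        // 30 query dims x 15 sealed segments x 64 KiB = 450 x 65,536 = 29,491,200 bytes.
        long bytes = perQueryBytes(30, 15, 64 * KIB);
        System.out.printf("cursors=%d, scratch per query=%.1f MiB%n", 30 * 15, (double) bytes / MIB);

        // Assumption: a parallelism factor of 4 multiplies the allocation rate by 4.
        int parallelism = 4;
        System.out.printf("scratch churn at %d-wide=%.1f MiB%n",
                parallelism, (double) (parallelism * bytes) / MIB);
    }
}
```

The exact per-query value is 28.125 MiB, which the text rounds to ~28 MiB; four times that is 112.5 MiB.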
Scope
- Replace the per-cursor field with a pool sourced at start() and returned at close(). Two natural shapes:
  - Thread-local stack of byte arrays. Lock-free, scales perfectly with concurrent queries, but each thread holds onto its high-water-mark allocation until it dies. Fine for the engine pool's bounded-thread executor, less ideal for embedded users with bursty thread creation.
  - Bounded ConcurrentLinkedDeque-backed pool. Hands out arrays on start(), returns on close(). Slightly more contention but a fixed cap on retained memory. Probably the better default since it composes with the dedicated SparseVectorScoringPool pattern.
- Audit start()/close() invariants: every cursor that calls start() must reach close() regardless of exception path. The existing topK finally-blocks already guarantee this, but a leak would silently exhaust the pool and force the fallback allocation path.
- Sizing: the buffer must be >= component.pageContentSize(). If two indexes have different page sizes (today they don't, but the design allows it), the pool should key by size or take the max.
- Benchmark: re-run LSMSparseVectorIndexLargeBenchmark with -XX:+PrintGC to confirm allocation rate drops and to spot any tail-latency regression from pool contention.
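A minimal sketch of the bounded ConcurrentLinkedDeque-backed shape, to make the acquire/release contract concrete. Everything here is hypothetical: DecodeScratchPool, CursorSketch, and their methods are illustrative names, not existing code, and the sizing policy shown (one fixed buffer size, with oversized requests bypassing the pool) is only one of the options listed above. ConcurrentLinkedDeque is itself unbounded, so the cap is enforced with a separate counter.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical bounded scratch pool; buffers are not zeroed between uses. */
final class DecodeScratchPool {
    private final ConcurrentLinkedDeque<byte[]> free = new ConcurrentLinkedDeque<>();
    private final AtomicInteger retained = new AtomicInteger(); // slots reserved in `free`
    private final int bufferSize;
    private final int maxRetained;

    DecodeScratchPool(int bufferSize, int maxRetained) {
        this.bufferSize = bufferSize;
        this.maxRetained = maxRetained;
    }

    /** Returns a buffer of at least minSize bytes, allocating when the pool cannot serve it. */
    byte[] acquire(int minSize) {
        if (minSize > bufferSize) {
            return new byte[minSize]; // oversized request: allocated fresh, never pooled
        }
        byte[] buf = free.pollFirst();
        if (buf != null) {
            retained.decrementAndGet();
            return buf;
        }
        return new byte[bufferSize]; // pool empty: fallback allocation path
    }

    /** Returns a buffer to the pool, dropping it if the cap is reached or the size is foreign. */
    void release(byte[] buf) {
        if (buf.length != bufferSize) {
            return; // oversized fallback buffer: leave it to the GC
        }
        // Reserve a slot before publishing so the deque never holds more than maxRetained.
        if (retained.incrementAndGet() <= maxRetained) {
            free.offerFirst(buf);
        } else {
            retained.decrementAndGet();
        }
    }

    int retainedCount() {
        return retained.get();
    }
}

/** Hypothetical stand-in for a cursor; only the start()/close() pairing is shown. */
final class CursorSketch implements AutoCloseable {
    private final DecodeScratchPool pool;
    private final int pageContentSize;
    private byte[] decodeScratch; // non-null only between start() and close()

    CursorSketch(DecodeScratchPool pool, int pageContentSize) {
        this.pool = pool;
        this.pageContentSize = pageContentSize;
    }

    void start() {
        decodeScratch = pool.acquire(pageContentSize);
    }

    @Override
    public void close() {
        byte[] buf = decodeScratch;
        decodeScratch = null; // a second close() is a no-op, so a buffer is never returned twice
        if (buf != null) {
            pool.release(buf);
        }
    }
}
```

With this shape a caller can rely on try-with-resources (or an existing finally-block) to guarantee the return: `try (CursorSketch c = new CursorSketch(pool, 64 * 1024)) { c.start(); ... }`. Because returned buffers keep their old contents, the decoder has to overwrite a region before reading it.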
Acceptance criteria
- PaginatedSegmentDimCursor no longer holds a per-instance decodeScratch.
- Buffers are sourced at start() and returned at close() from a shared pool.
- Allocation rate during a sustained 10M corpus benchmark is measurably lower (target: GC-rate reduction proportional to query fan-out).
- No regression in BmwScorerCorrectnessTest or the 47-test sparse-vector unit suite.
- The "FUTURE" Javadoc block on decodeScratch is removed once the pool lands.
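One possible way to sanity-check the allocation-rate criterion outside the full benchmark is to compare bytes allocated by the current thread with and without buffer reuse. The sketch below is not the benchmark named above: AllocationProbe is a hypothetical harness, and it relies on com.sun.management.ThreadMXBean.getThreadAllocatedBytes, a HotSpot-specific extension that other JVMs may not support.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical probe: per-cursor scratch allocation vs. a single reused buffer. */
final class AllocationProbe {
    static volatile byte[] sink; // published so the JIT cannot discard the allocations

    /** Approximate bytes allocated so far by the current thread (HotSpot extension). */
    static long allocatedBytes() {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    /** Simulates opening `cursors` cursors and returns the bytes allocated while doing so. */
    static long measure(int cursors, int scratchSize, boolean reuse) {
        byte[] shared = reuse ? new byte[scratchSize] : null; // allocated before measuring starts
        long before = allocatedBytes();
        for (int i = 0; i < cursors; i++) {
            byte[] scratch = reuse ? shared : new byte[scratchSize];
            scratch[0] = (byte) i; // touch the buffer as a decoder would
            sink = scratch;
        }
        return allocatedBytes() - before;
    }

    public static void main(String[] args) {
        long fresh = measure(450, 64 * 1024, false);
        long reused = measure(450, 64 * 1024, true);
        System.out.println("fresh=" + fresh + " bytes, reused=" + reused + " bytes");
    }
}
```

The per-thread counter is approximate, so it is better suited to confirming an order-of-magnitude drop than to asserting exact byte counts.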
Out of scope
- Pooling the per-cursor blockRids/blockWeights/blockTombstones arrays (each only params.blockSize() entries, default 128, ~3 KiB combined - dominated by decodeScratch).
- The payloadScratch in SparseSegmentBuilder (one allocation per builder, not per cursor; not on the query hot path).