feat: per-segment parallel top-K scoring for LSM_SPARSE_VECTOR (Step 5 follow-up to #4068)

Follow-up to #4068. Tracks the deferred Step 5 (per-segment parallel scoring) so it does not get lost.

Context

The v2 sparse-vector backend ships serial BMW DAAT (BmwScorer.topK) over a merged per-dim cursor stack. At 10M docs / 30k-dim SPLADE-shape, this lands at 1.20 ms/query with size-tiered compaction in place. That is well inside the noise of network round-trip plus JSON serialization for any real-world workload, which is why parallel dispatch was deferred when v2 landed.

What is already in place in the tree (the plumbing needs no new code; only the hot-path dispatch is missing):

SparseVectorScoringPool - JVM-wide dedicated executor, daemon threads, bounded queue, caller-runs fallback with throttled saturation WARNING.
arcadedb.sparseVectorScoringPoolThreads / arcadedb.sparseVectorScoringQueueSize global config knobs (currently emit a startup warning that the topK path is still serial).
PoolMetrics MeterBinder with Micrometer gauges for size, active, queue depth, queue capacity, completed tasks, caller-run fallbacks - tagged pool=sparse_vector.
Studio "Executor Pools" card under the Server tab renders the live numbers.
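
For readers unfamiliar with the pool shape listed above, here is a minimal sketch of a dedicated executor with daemon threads, a bounded queue, and a counted caller-runs fallback, built only on java.util.concurrent. ScoringPoolSketch and its method names are hypothetical stand-ins, not the actual SparseVectorScoringPool API, and the throttled saturation WARNING and the Micrometer binding are left out.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for SparseVectorScoringPool; not the ArcadeDB class. */
public class ScoringPoolSketch {
  private final AtomicLong callerRunFallbacks = new AtomicLong();
  private final ThreadPoolExecutor executor;

  public ScoringPoolSketch(int threads, int queueSize) {
    ThreadFactory daemonFactory = r -> {
      Thread t = new Thread(r, "sparse-vector-scoring");
      t.setDaemon(true); // daemon threads never block JVM shutdown
      return t;
    };
    executor = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(queueSize), daemonFactory,
        (task, pool) -> {
          // Saturated (all workers busy, queue full): count the fallback
          // and run the task on the submitting thread instead.
          callerRunFallbacks.incrementAndGet();
          if (!pool.isShutdown()) task.run();
        });
  }

  public void execute(Runnable task) { executor.execute(task); }
  public long callerRunFallbacks() { return callerRunFallbacks.get(); }
  public int queueDepth() { return executor.getQueue().size(); }
  public void shutdown() { executor.shutdown(); }

  /** Waits on a latch, restoring the interrupt flag instead of throwing. */
  public static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    ScoringPoolSketch pool = new ScoringPoolSketch(1, 1);
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    pool.execute(() -> { started.countDown(); awaitQuietly(release); }); // occupies the only worker
    awaitQuietly(started);
    pool.execute(() -> {}); // fills the one-slot queue
    pool.execute(() -> {}); // rejected, so it runs on the caller thread
    System.out.println("fallbacks=" + pool.callerRunFallbacks()); // fallbacks=1
    release.countDown();
    pool.shutdown();
  }
}
```

The fallback counter and queue depth are the kind of values the PoolMetrics gauges would expose; a caller-runs policy trades latency on the submitting thread for never dropping a scoring task.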

Scope

Wire PaginatedSparseVectorEngine.topK to fan out to SparseVectorScoringPool when segments.length >= threshold and pool.getMaxParallelism() > 1.
Implement RID-range partitioning that preserves the newest-source-wins merge across cross-segment dim contributions. Each partition produces its own top-K, and a final merge step combines them.
Ensure correctness: add BmwScorerCorrectnessTest-shape coverage at multi-segment scale (the parallel result must match the serial result bit-for-bit, modulo quantization noise).
Benchmark against the existing LSMSparseVectorIndexLargeBenchmark 10M run; expected gain is ~3-4x on a 4-core machine, capped by the segment count.
Drop the "currently the topK path is serial" wording from the two SPARSE_VECTOR_SCORING_* config descriptions and from the SparseVectorScoringPool startup warning once the dispatch is live.
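
One possible shape for the gated fan-out and final merge described in this scope, as a sketch under stated assumptions: RIDs are modeled as plain longs, and the hypothetical RangeScorer interface stands in for a BmwScorer.topK call restricted to one RID range. Because each RID falls in exactly one partition, this sketch assumes newest-source-wins is resolved inside the partition's scorer, so the final merge only has to pick the best K among disjoint candidates. None of these names come from the ArcadeDB codebase.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelTopKSketch {
  /** A scored document; rid is a plain long here, not an ArcadeDB RID. */
  public record Hit(long rid, float score) {}

  /** Hypothetical scorer: top-k among docs whose rid lies in [ridLo, ridHi). */
  public interface RangeScorer { List<Hit> topK(long ridLo, long ridHi, int k); }

  // Highest score first; ties broken by rid so serial and parallel runs agree.
  public static final Comparator<Hit> BEST_FIRST = (a, b) -> {
    int c = Float.compare(b.score(), a.score());
    return c != 0 ? c : Long.compare(a.rid(), b.rid());
  };

  public static List<Hit> topK(RangeScorer scorer, long ridLo, long ridHi, int k,
      int segmentCount, int threshold, int parallelism, Executor pool) {
    if (segmentCount < threshold || parallelism <= 1)
      return scorer.topK(ridLo, ridHi, k); // small query: stay on the caller thread

    // Split [ridLo, ridHi) into at most `parallelism` contiguous ranges.
    long step = Math.max(1, (ridHi - ridLo + parallelism - 1) / parallelism);
    List<CompletableFuture<List<Hit>>> futures = new ArrayList<>();
    for (long lo = ridLo; lo < ridHi; lo += step) {
      final long from = lo, to = Math.min(ridHi, lo + step);
      futures.add(CompletableFuture.supplyAsync(() -> scorer.topK(from, to, k), pool));
    }

    // Final merge: bounded heap whose head is the worst hit currently kept.
    PriorityQueue<Hit> heap = new PriorityQueue<>(BEST_FIRST.reversed());
    for (CompletableFuture<List<Hit>> f : futures)
      for (Hit h : f.join()) {
        heap.add(h);
        if (heap.size() > k) heap.poll();
      }
    List<Hit> merged = new ArrayList<>(heap);
    merged.sort(BEST_FIRST);
    return merged;
  }

  /** Toy scorer for the demo: rid is an index into a score array. */
  public static RangeScorer arrayScorer(float[] scores) {
    return (lo, hi, k) -> {
      List<Hit> hits = new ArrayList<>();
      for (long rid = lo; rid < hi; rid++) hits.add(new Hit(rid, scores[(int) rid]));
      hits.sort(BEST_FIRST);
      return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
    };
  }

  public static void main(String[] args) {
    float[] scores = {0.2f, 0.9f, 0.1f, 0.7f, 0.4f, 0.8f, 0.3f, 0.6f};
    RangeScorer scorer = arrayScorer(scores);
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Hit> serial = topK(scorer, 0, scores.length, 3, 1, 4, 4, pool);   // below threshold
      List<Hit> parallel = topK(scorer, 0, scores.length, 3, 8, 4, 4, pool); // fans out
      System.out.println(serial.equals(parallel)); // true
    } finally {
      pool.shutdown();
    }
  }
}
```

The deterministic tie-break on rid matters for the correctness goal: without it, equal scores could surface in a different order depending on which partition finishes first.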

Acceptance criteria

BmwScorer.topK is callable through a parallel dispatcher in PaginatedSparseVectorEngine.topK, gated by a configurable threshold so single-segment queries stay on the caller thread.
RID-range partition correctness proven by a multi-segment correctness test.
LSMSparseVectorIndexLargeBenchmark shows measurable speedup at 10M (target: under 0.5 ms/query at K=10 on a 4-core box).
Studio "Executor Pools" card surfaces non-zero active / completedTasks / occasional caller-run fallbacks under load.
Startup warning + "currently serial" config wording removed.
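
The serial-versus-parallel equivalence check behind the correctness criterion could look roughly like the helper below. This is only a sketch: TopKEquivalenceSketch, Hit, and the eps tolerance are hypothetical, and the real test would presumably reuse whatever result type and quantization bound BmwScorerCorrectnessTest already works with.

```java
import java.util.List;

public class TopKEquivalenceSketch {
  /** A scored document; rid is a plain long here, not an ArcadeDB RID. */
  public record Hit(long rid, float score) {}

  /** Returns null when the ranked lists agree, else a description of the first mismatch. */
  public static String firstMismatch(List<Hit> serial, List<Hit> parallel, float eps) {
    if (serial.size() != parallel.size())
      return "size " + serial.size() + " vs " + parallel.size();
    for (int i = 0; i < serial.size(); i++) {
      Hit s = serial.get(i), p = parallel.get(i);
      if (s.rid() != p.rid())
        return "rank " + i + ": rid " + s.rid() + " vs " + p.rid();
      // eps is an assumed tolerance standing in for quantization noise.
      if (Math.abs(s.score() - p.score()) > eps)
        return "rank " + i + ": score " + s.score() + " vs " + p.score();
    }
    return null;
  }

  public static void main(String[] args) {
    List<Hit> serial = List.of(new Hit(7, 0.91f), new Hit(3, 0.80f));
    List<Hit> close = List.of(new Hit(7, 0.9100001f), new Hit(3, 0.80f));
    List<Hit> wrong = List.of(new Hit(3, 0.80f), new Hit(7, 0.91f));
    System.out.println(firstMismatch(serial, close, 1e-4f)); // null
    System.out.println(firstMismatch(serial, wrong, 1e-4f)); // rank 0: rid 7 vs 3
  }
}
```

A positional comparison like this is strict about rank order, so two documents whose scores differ by less than eps could legitimately swap places and still be reported as a mismatch; a tie-tolerant variant would be needed if that shows up in practice.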

Out of scope

Cross-query parallelism (already scales linearly, since independent queries run against the engine's thread-safe state).
Cross-dim parallelism within a single segment (BMW pruning is inherently serial across dims; the parallel axis is segments, not dims).

Also still tracked

The other deferred item from #4068 - one-shot backward-compat rebuild for #4065 MVP-era postings on first open - is not part of this issue. The v2 backend keeps the old LSM-Tree shell readable (so old schemas still open) but does not auto-rebuild MVP postings into segments; pre-v2 datasets need re-insertion. If a user reports it in the field, file a separate issue for that migration path.