feat: partition-aware planner pruning in SQL/Cypher + partitioning integrity guardrails
Context
PartitionedBucketSelectionStrategy is shipped and works on the write path: rows route to bucket = hash(properties) % bucketCount automatically when a type's strategy is partitioned. The strategy is also persisted in the schema and surfaced via getBucketSelectionStrategy().
But every read-path query ignores the partition strategy:
- SQL planner (engine/.../query/sql/executor/): one occurrence of getBucketSelectionStrategy, in FetchFromSchemaTypesStep line 92, and that is purely SHOW TYPES metadata. No planner step prunes buckets based on a WHERE predicate matching the partition key.
- OpenCypher planner (engine/.../query/opencypher/): zero occurrences.
- vector.neighbors / vector.sparseNeighbors: accept an allowedBucketIds filter (line 158), but the only thing populating it today is an explicit type qualifier. The planner never derives it from a WHERE prop = X predicate that matches the partition key.
So the partition strategy gives users zero query-time benefit today. Every SELECT FROM Doc WHERE tenant_id = X, every MATCH (n:Doc {tenant_id: 'X'}), every vector top-K with a tenant filter, scans all buckets. This is a real, broad missing optimization.
Scope
Part A: Partition-aware bucket pruning
Add a planner rule, on each query engine, that:
- Detects whether the type has a PartitionedBucketSelectionStrategy.
- Checks whether the query's filter contains an equality (or IN) predicate on a property that's part of the partition key.
- If yes, computes hash(value) % bucketCount for each constraint value and restricts the scan / index access / vector function to those bucket ids only.
- If no, falls back to the existing fan-out (no regression).
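The four steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the engine's planner API: the class PartitionPruningSketch, the methods bucketFor and allowedBuckets, and the use of Objects.hashCode as the hash are all hypothetical stand-ins, and the sketch covers a single-property partition key only. The real hash is whatever PartitionedBucketSelectionStrategy already applies on the write path, and the planner must reuse exactly that function.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of the pruning decision; not the engine's actual planner API. */
final class PartitionPruningSketch {

  /** Stand-in for the strategy's hash (assumption); the planner must reuse the write-path function. */
  static int bucketFor(Object partitionValue, int bucketCount) {
    // floorMod keeps the result in [0, bucketCount) even when the hash is negative.
    return Math.floorMod(Objects.hashCode(partitionValue), bucketCount);
  }

  /**
   * @param partitioned      whether the type uses a partitioned strategy
   * @param constraintValues values taken from an equality (one element) or IN (several)
   *                         predicate on the partition key; empty if there is no such predicate
   * @return bucket positions to scan, or Optional.empty() meaning "fan out across all buckets"
   */
  static Optional<SortedSet<Integer>> allowedBuckets(boolean partitioned,
                                                     List<?> constraintValues,
                                                     int bucketCount) {
    if (!partitioned || constraintValues.isEmpty())
      return Optional.empty(); // existing behaviour, no regression

    SortedSet<Integer> buckets = new TreeSet<>();
    for (Object value : constraintValues)
      buckets.add(bucketFor(value, bucketCount));
    return Optional.of(buckets);
  }
}
```

Returning an empty Optional for the fallback keeps "no pruning" distinct from "prune to zero buckets", so callers cannot confuse a missing predicate with an empty result.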
The rule applies broadly:
- SQL planner - FetchFromTypeStep, FetchFromIndexStep, and the vector.neighbors / vector.sparseNeighbors planner integration (so they receive allowedBucketIds automatically).
- OpenCypher planner - the analogous rule for MATCH (n:Type {prop: X}) and MATCH (n:Type) WHERE n.prop = X. Cypher's optimizer already has hooks for index-driven label scans; this slots in similarly.
Part B: Partitioning integrity via a needsRepartition flag
Today nothing prevents a user from breaking partitioning consistency after the fact:
- Adding a bucket to a partitioned type silently changes bucketCount, which invalidates the hash(value) % bucketCount mapping for every existing row. New rows go to the right bucket; old rows are now in the wrong bucket; queries pruned by partition would return wrong results.
- Removing a bucket has the same problem.
- Removing the partition property from the type leaves rows partitioned but the planner has nothing to derive a hash from.
- Changing the partition strategy between strategies that hash to different bucket sets has the same problem.
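The bucket-count failure mode is easy to see with small numbers. The Java snippet below is a worked example only: the hash value 10 and the class name BucketCountChangeDemo are made up for illustration, and the modulo is the hash(value) % bucketCount mapping described above.

```java
/** Worked example: one row, written under 4 buckets, looked up under 5 buckets. */
final class BucketCountChangeDemo {

  static int bucketFor(int hash, int bucketCount) {
    return Math.floorMod(hash, bucketCount);
  }

  public static void main(String[] args) {
    int hash = 10;                    // hypothetical hash of some row's partition value
    int before = bucketFor(hash, 4);  // 10 % 4 = 2: bucket the row was written to
    int after = bucketFor(hash, 5);   // 10 % 5 = 0: bucket a pruned query would scan after ADD BUCKET
    System.out.println("written to bucket " + before + ", looked up in bucket " + after);
    // prints "written to bucket 2, looked up in bucket 0"
  }
}
```

A pruned query would scan bucket 0 only and miss the row that still sits in bucket 2.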
The cleanest design is a persistent needsRepartition flag on the type that captures "is this type's partitioning currently trustworthy?" The planner treats the flag as a hard gate: if true, the partition-pruning rule from Part A is skipped and queries fan out across all buckets - no wrong results possible, just no optimization until the user reconciles. This avoids both the "block the DDL" and "throw on mutate" footguns: the DDL is always fast, queries are always correct, and the only cost is "no pruning until rebuild."
B.1 Schema state
Add boolean needsRepartition to LocalDocumentType (and equivalents). Persisted in schema.json. Replicated via the standard schema-replication path so followers see the same value. Set by exactly three paths:
- Bucket added or dropped on a partitioned type → true.
- BucketSelectionStrategy changes from non-partitioned to partitioned on a populated type, or between two partition strategies → true.
- REBUILD TYPE <Type> WITH repartition = true completes successfully across every record → false.
A type created fresh as partitioned (no records yet) starts at false because the partition mapping is trivially correct over zero rows.
B.2 Planner contract
The partition-pruning rule from Part A checks type.needsRepartition() before firing. If true, the rule does nothing and the query falls back to today's fan-out across all buckets. Pruning resumes automatically on the next query after the rebuild clears the flag.
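As a rough Java sketch of this contract: the flag is checked first, and the pruning rule is not even evaluated while it is set. Only needsRepartition() comes from the proposal; the TypeView interface, the PruningGateSketch class, and the Supplier-based rule are hypothetical names used to keep the example self-contained.

```java
import java.util.Optional;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical sketch of the hard gate in front of the pruning rule. */
final class PruningGateSketch {

  /** Minimal stand-in for the schema type; only needsRepartition() is named in the proposal. */
  interface TypeView {
    boolean needsRepartition();
  }

  /**
   * @param pruningRule computes the pruned bucket set, or empty if the query has no usable predicate
   * @return the buckets to scan, or Optional.empty() meaning "fan out across all buckets"
   */
  static Optional<Set<Integer>> bucketsToScan(TypeView type,
                                              Supplier<Optional<Set<Integer>>> pruningRule) {
    if (type.needsRepartition())
      return Optional.empty(); // mapping is not trustworthy: skip the rule entirely
    return pruningRule.get();
  }
}
```

Because the check runs on every planning pass, pruning resumes on the first query planned after the flag is cleared, with no cache to invalidate in this sketch.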
B.3 DDL ergonomics: two modes
-- Default: DDL is fast and non-blocking. Flag is set to true. WARNING is logged
-- and surfaced in Studio. Queries stay correct, just lose partition pruning until
-- rebuild.
ALTER TYPE Doc ADD BUCKET;
-- WARNING: type 'Doc' uses PartitionedBucketSelectionStrategy on property 'tenant_id'.
-- Adding a bucket has invalidated the partition mapping; run
-- REBUILD TYPE Doc WITH repartition = true when convenient to restore pruning.
-- Until then queries fan out across all buckets.
-- Atomic: DDL + rebuild as one operation. Blocks until rebuild completes; flag
-- never goes to true because the rebuild covers the whole type before the DDL
-- returns.
ALTER TYPE Doc ADD BUCKET WITH repartition = true;
B.4 Rebuild command
Extend the existing REBUILD TYPE rather than introduce a new top-level statement:
REBUILD TYPE <Type> [POLYMORPHIC] WITH repartition = true [, batchSize = N]
The existing rebuild already walks every record via db.scanType to apply schema-layout changes; the new setting extends the per-record handler to also recompute the target bucket via type.getBucketSelectionStrategy().getBucketIdByRecord(...) and move the record (delete from current bucket, insert into target) when they differ. batchSize and POLYMORPHIC apply unchanged. Linear in row count, never silently triggered. The existing RebuildTypeStatement parser/AST/executor takes minimal changes - one new setting key and one branch in the per-record path. On full success the flag is cleared; on failure the flag stays true so a retry remains correct.
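The per-record branch might look roughly like the Java sketch below. It models buckets as in-memory lists and the strategy as a plain function, since the signature of getBucketIdByRecord(...) is not spelled out here; RepartitionSketch and repartition are hypothetical names, and batching, transactions and POLYMORPHIC handling are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical in-memory model of the repartition pass; not the RebuildTypeStatement executor. */
final class RepartitionSketch {

  /**
   * Moves every record whose target bucket differs from the bucket it currently lives in.
   *
   * @param buckets      one list of records per bucket, indexed by bucket position
   * @param targetBucket stand-in for the strategy's per-record bucket computation
   * @return number of records moved
   */
  static <R> int repartition(List<List<R>> buckets, ToIntFunction<R> targetBucket) {
    int moved = 0;
    for (int current = 0; current < buckets.size(); current++) {
      // Iterate over a snapshot so that moving records does not disturb the scan.
      for (R record : new ArrayList<>(buckets.get(current))) {
        int target = targetBucket.applyAsInt(record);
        if (target != current) {
          buckets.get(current).remove(record); // delete from current bucket
          buckets.get(target).add(record);     // insert into target bucket
          moved++;
        }
      }
    }
    return moved;
  }
}
```

A record moved into a later bucket is visited again when that bucket is scanned, but its target then equals its current bucket, so it is left alone. In this model the caller would clear needsRepartition only after the method returns normally; an exception leaves the flag set.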
B.5 Query-time WARNING with throttle
When a query is executed against a type whose needsRepartition is true, the engine emits a Level.WARNING log so operators see the cost of the pending rebuild. Throttle to at most one message per minute per type (a 60-second window matching the existing QueryEngineManager and SparseVectorScoringPool saturation throttles: same pattern, same operator mental model). Implementation: an AtomicLong lastNeedsRepartitionWarnMs per LocalDocumentType; when the window has elapsed, the planner's pre-firing check on the partition-pruning rule advances it to the current time via compareAndSet, and only the thread that wins the compareAndSet logs. Spam is bounded; one entry per type per minute is enough to make the pending rebuild noticeable in any reasonable monitoring setup without drowning the log on a hot type.
Sample log line:
WARNING - type 'Doc' has needsRepartition=true; partition-aware bucket pruning is
disabled until `REBUILD TYPE Doc WITH repartition = true` runs. Queries continue
to return correct results but fan out across all 16 buckets.
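The throttle described in B.5 can be sketched in Java as below. The field name lastNeedsRepartitionWarnMs and the 60-second window come from the text above; the class RepartitionWarnThrottle, the shouldWarn method, the NEVER sentinel, and passing the clock value in as a parameter are assumptions made so the example is self-contained and testable.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-type throttle: at most one warning per 60-second window. */
final class RepartitionWarnThrottle {
  static final long WINDOW_MS = 60_000L;
  private static final long NEVER = Long.MIN_VALUE; // sentinel: no warning emitted yet

  private final AtomicLong lastNeedsRepartitionWarnMs = new AtomicLong(NEVER);

  /**
   * @param nowMs current time in milliseconds (e.g. System.currentTimeMillis())
   * @return true if the caller should emit the warning now
   */
  boolean shouldWarn(long nowMs) {
    long last = lastNeedsRepartitionWarnMs.get();
    boolean windowElapsed = last == NEVER || nowMs - last >= WINDOW_MS;
    // Only the thread that wins the CAS logs; concurrent losers stay silent.
    return windowElapsed && lastNeedsRepartitionWarnMs.compareAndSet(last, nowMs);
  }
}
```

The planner would call shouldWarn(System.currentTimeMillis()) whenever the gate skips pruning and log only on true. Taking the time as a parameter lets a test pin the interval without sleeping.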
B.6 Visibility
- SHOW TYPE Doc displays needsRepartition: true/false.
- schema:types system view exposes it as a queryable column.
- Studio's type details panel renders a warning banner when the flag is set, with a one-click "Run repartition" button that invokes REBUILD TYPE <Type> WITH repartition = true and shows progress.
- Server boot emits an info-level reminder ("3 partitioned types are pending repartition: Doc, Order, Event") if any are stale, so operators see the state without having to query for it.
B.7 Edge cases the flag handles
- Rebuild fails or is killed midway: flag stays true (cleared only on full success). Restart-safe.
- Multiple successive DDLs (add bucket, add bucket, drop bucket): each flips the flag to true if not already there; one REBUILD TYPE covers all of them.
- HA replication: the flag replicates with the schema; followers see the same state and skip pruning identically.
- Backup/restore: the flag rides along with the schema; a restored type with needsRepartition = true correctly skips pruning until rebuilt.
The principle stays: "if you opted into partitioning, the engine treats it as a load-bearing invariant; you cannot accidentally produce wrong results by mutating the schema." The flag is the mechanism that enforces it without throwing.
Part C: Documentation and Studio
- New docs page: "Schema design 101 - choosing a bucket strategy" with a 3-question decision tree, concrete CREATE TYPE Doc BUCKETS 16 examples, anti-patterns (low-cardinality / skewed / mutable partition keys), and a "how to verify it's working" section showing EXPLAIN-style query plans with bucket pruning visible.
- Add a paragraph to the type-creation reference and the vector-index reference linking to the design page.
- Promote the partition-aware-vector-filter pattern as the answer to the filterable-HNSW question, alongside the future ACORN integration follow-up.
- Studio: when creating a type, surface a hint near the bucket-strategy dropdown ("If your data is scoped by tenant, customer, region, or another high-cardinality identifier, partition by that property for query-time pruning") linking to the design page. Half-day of frontend work; high leverage because schema choices are made once.
Acceptance criteria
- SQL planner: a WHERE equality / IN on the partition key restricts the bucket set in FetchFromTypeStep / FetchFromIndexStep / vector.neighbors / vector.sparseNeighbors.
- OpenCypher planner: equivalent rule for MATCH (n:Type {prop: X}) and WHERE n.prop = X.
- Schema-mutation guardrails: bucket count change, partition-property removal, strategy change while data exists either throw a clear error or trigger an explicit REBUILD PARTITIONING command.
- REBUILD TYPE <Type> WITH repartition = true setting implemented and documented; existing POLYMORPHIC and batchSize keep working alongside it. Successful completion clears the type's needsRepartition flag.
- LocalDocumentType.needsRepartition flag persisted in schema.json, replicated via the standard schema-replication path, set by bucket add/drop and strategy change, cleared by successful rebuild.
- Planner partition-pruning rule from Part A is gated on !type.needsRepartition(); tests prove pruning is suppressed when the flag is true and resumes automatically after rebuild clears it.
- Throttled Level.WARNING per-type query-time log when queries hit a type with needsRepartition=true (one message per type per 60-second window). Test pins the throttle interval.
- SHOW TYPE and schema:types expose needsRepartition. Studio renders a warning banner with a "Run repartition" button.
- Tests: a PartitionPruningPlannerTest proving each engine prunes correctly + an AlterPartitionedTypeTest proving each schema-mutation path either throws or triggers a rebuild.
- Docs: the "Schema design 101" page is published and linked from the type-creation and vector-index reference pages.
- Studio: the bucket-strategy hint is rendered on the type-creation form.
- Default BucketSelectionStrategy stays RoundRobinBucketSelectionStrategy (do not flip). Promote partitioning via docs, not by changing the default.
Out of scope
- Auto-detecting a "good" partition key from data distribution. The user picks; the engine validates and prunes.
- Repartitioning across nodes in a distributed deployment - relevant only when horizontal sharding lands.
- Partitioning by computed expressions (e.g., hash(tenant_id, region) beyond the existing multi-property strategy).
- Time-window partitioning with automatic bucket creation per period.
Why now
Comes out of the post-#4068 roadmap discussion: with sparse-vector scaling shipped, the next-most-visible production gap is filter-aware vector retrieval, and the cheapest credible answer for the multi-tenant SaaS case is partition-aware planner pruning - which the engine already has the data for, just never used. The same rule then benefits every non-vector query that filters on a partition key, so the work is broadly reusable rather than vector-specific.
Related
- #4068, feat: WAND/BlockMax-WAND dynamic pruning for LSM_SPARSE_VECTOR (scale to 100M+) (sparse-vector scaling) - shipped; this issue extracts the partition-aware angle as a standalone deliverable
- #4085, feat: per-segment parallel top-K scoring for LSM_SPARSE_VECTOR (Step 5 follow-up to #4068) (per-segment parallel sparse scoring) - parallel work, also follow-up to #4068
- #4086, perf: pool the 64 KiB decodeScratch buffer in PaginatedSegmentDimCursor (decode-buffer pooling) - parallel work, also follow-up to #4068