feat: partition-aware planner pruning in SQL/Cypher + partitioning integrity guardrails

Context

PartitionedBucketSelectionStrategy is shipped and works on the write path: rows route to bucket = hash(properties) % bucketCount automatically when a type's strategy is partitioned. The strategy is also persisted in the schema and surfaced via getBucketSelectionStrategy().

But every read-path query ignores the partition strategy:

SQL planner (engine/.../query/sql/executor/): one occurrence of getBucketSelectionStrategy, in FetchFromSchemaTypesStep line 92, and that is purely SHOW TYPES metadata. No planner step prunes buckets based on a WHERE predicate matching the partition key.
OpenCypher planner (engine/.../query/opencypher/): zero occurrences.
vector.neighbors / vector.sparseNeighbors: accept an allowedBucketIds filter (line 158) but the only thing populating it today is an explicit type qualifier - the planner never derives it from a WHERE prop = X predicate that matches the partition key.

So the partition strategy gives users zero query-time benefit today. Every SELECT FROM Doc WHERE tenant_id = X, every MATCH (n:Doc {tenant_id: 'X'}), every vector top-K with a tenant filter, scans all buckets. This is a real, broad missing optimization.

Scope

Part A: Partition-aware bucket pruning

Add a planner rule, on each query engine, that:

Detects whether the type has a PartitionedBucketSelectionStrategy.
Checks whether the query's filter contains an equality (or IN) predicate on a property that's part of the partition key.
If yes, computes hash(value) % bucketCount for each constraint value and restricts the scan / index access / vector function to those bucket ids only.
If no, falls back to the existing fan-out (no regression).
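
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the class, method names, and parameters are hypothetical, it assumes a single-property partition key, and it uses Objects.hashCode as a stand-in for whatever hash PartitionedBucketSelectionStrategy really applies.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the pruning decision; none of these names are engine API. */
final class PartitionPruningSketch {

  /** Stand-in for the strategy's hash; the real hash function may differ. */
  static int bucketFor(Object value, int bucketCount) {
    // floorMod keeps the result in [0, bucketCount) even for negative hash codes.
    return Math.floorMod(Objects.hashCode(value), bucketCount);
  }

  /**
   * Returns the pruned bucket set, or empty when the rule must not fire
   * (type not partitioned, or no equality/IN constraint on the partition key).
   */
  static Optional<Set<Integer>> pruneBuckets(boolean partitioned, String partitionKey,
      String filterProperty, List<?> constraintValues, int bucketCount) {
    if (!partitioned || !partitionKey.equals(filterProperty) || constraintValues.isEmpty())
      return Optional.empty(); // caller falls back to the existing fan-out

    Set<Integer> buckets = new TreeSet<>();
    for (Object value : constraintValues) // one value for "=", several for "IN"
      buckets.add(bucketFor(value, bucketCount));
    return Optional.of(buckets);
  }
}
```

An empty Optional means "scan every bucket", which keeps the no-regression fallback explicit at the call site instead of encoding it as a full bucket list.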

The rule applies broadly:

SQL planner - FetchFromTypeStep, FetchFromIndexStep, and the vector.neighbors / vector.sparseNeighbors planner integration (so they receive allowedBucketIds automatically).
OpenCypher planner - the analogous rule for MATCH (n:Type {prop: X}) and MATCH (n:Type) WHERE n.prop = X. Cypher's optimizer already has hooks for index-driven label scans; this slots in similarly.

Part B: Partitioning integrity via a `needsRepartition` flag

Today nothing prevents a user from breaking partitioning consistency after the fact:

Adding a bucket to a partitioned type silently changes bucketCount, which invalidates the hash(value) % bucketCount mapping for every existing row. New rows go to the right bucket; old rows are now in the wrong bucket; queries pruned by partition would return wrong results.
Removing a bucket has the same problem.
Removing the partition property from the type leaves rows partitioned but the planner has nothing to derive a hash from.
Changing the partition strategy between strategies that hash to different bucket sets has the same problem.
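
To make the first failure mode concrete, the sketch below counts how many hash values land in a different bucket when bucketCount goes from 16 to 17. It is illustrative only: the class is hypothetical and sequential integers stand in for real property hashes.

```java
/** Hypothetical demo of how a bucket-count change invalidates hash % bucketCount. */
final class BucketCountChangeDemo {

  static int bucketFor(int hash, int bucketCount) {
    return Math.floorMod(hash, bucketCount);
  }

  /** Counts how many of the hashes 0..n-1 map to a different bucket after the change. */
  static int movedCount(int n, int oldCount, int newCount) {
    int moved = 0;
    for (int hash = 0; hash < n; hash++)
      if (bucketFor(hash, oldCount) != bucketFor(hash, newCount))
        moved++;
    return moved;
  }

  public static void main(String[] args) {
    // 936 of the 1000 sample hashes change bucket when going from 16 to 17 buckets.
    System.out.println(movedCount(1000, 16, 17));
  }
}
```

Almost every existing row would sit in a bucket that a pruned query no longer looks at, which is why pruning has to stay off until the rows are moved.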

The cleanest design is a persistent needsRepartition flag on the type that captures "is this type's partitioning currently trustworthy?" The planner treats the flag as a hard gate: if true, the partition-pruning rule from Part A is skipped and queries fan out across all buckets - no wrong results possible, just no optimization until the user reconciles. This avoids both the "block the DDL" and "throw on mutate" footguns: the DDL is always fast, queries are always correct, and the only cost is "no pruning until rebuild."

B.1 Schema state

Add boolean needsRepartition to LocalDocumentType (and equivalents). Persisted in schema.json. Replicated via the standard schema-replication path so followers see the same value. Its value changes on exactly three paths:

Bucket added or dropped on a partitioned type → true.
BucketSelectionStrategy changes from non-partitioned to partitioned on a populated type, or between two partition strategies → true.
REBUILD TYPE <Type> WITH repartition = true completes successfully across every record → false.

A type created fresh as partitioned (no records yet) starts at false because the partition mapping is trivially correct over zero rows.
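
The transitions above amount to a small state machine. A minimal sketch follows, assuming hypothetical hook names; in the real engine the flag would live on LocalDocumentType and be persisted and replicated with the schema, which this sketch does not model.

```java
/** Hypothetical model of the needsRepartition transitions; not engine API. */
final class NeedsRepartitionFlagSketch {

  // A type created fresh as partitioned starts with the flag cleared.
  private boolean needsRepartition = false;

  /** Bucket added or dropped: only matters when the type is partitioned. */
  void onBucketAddedOrDropped(boolean typeIsPartitioned) {
    if (typeIsPartitioned)
      needsRepartition = true;
  }

  /** Strategy change: partitioned -> partitioned, or non-partitioned -> partitioned with data. */
  void onStrategyChanged(boolean wasPartitioned, boolean isPartitioned, boolean hasRecords) {
    if (isPartitioned && (wasPartitioned || hasRecords))
      needsRepartition = true;
  }

  /** REBUILD TYPE ... WITH repartition = true: cleared only on full success. */
  void onRepartitionRebuildFinished(boolean everyRecordProcessed) {
    if (everyRecordProcessed)
      needsRepartition = false;
  }

  /** Gate consulted by the partition-pruning rule before it fires. */
  boolean pruningAllowed() {
    return !needsRepartition;
  }
}
```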

B.2 Planner contract

The partition-pruning rule from Part A checks type.needsRepartition() before firing. If true, the rule does nothing and the query falls back to today's fan-out across all buckets. Pruning resumes automatically on the next query after the rebuild clears the flag.

B.3 DDL ergonomics: two modes

-- Default: DDL is fast and non-blocking. Flag is set to true. WARNING is logged
-- and surfaced in Studio. Queries stay correct, just lose partition pruning until
-- rebuild.
ALTER TYPE Doc ADD BUCKET;
-- WARNING: type 'Doc' uses PartitionedBucketSelectionStrategy on property 'tenant_id'.
-- Adding a bucket has invalidated the partition mapping; run
-- REBUILD TYPE Doc WITH repartition = true when convenient to restore pruning.
-- Until then queries fan out across all buckets.

-- Atomic: DDL + rebuild as one operation. Blocks until rebuild completes; flag
-- never goes to true because the rebuild covers the whole type before the DDL
-- returns.
ALTER TYPE Doc ADD BUCKET WITH repartition = true;

B.4 Rebuild command

Extend the existing REBUILD TYPE rather than introduce a new top-level statement:

REBUILD TYPE <Type> [POLYMORPHIC] WITH repartition = true [, batchSize = N]

The existing rebuild already walks every record via db.scanType to apply schema-layout changes; the new setting extends the per-record handler to also recompute the target bucket via type.getBucketSelectionStrategy().getBucketIdByRecord(...) and move the record (delete from current bucket, insert into target) when they differ. batchSize and POLYMORPHIC apply unchanged. Linear in row count, never silently triggered. The existing RebuildTypeStatement parser/AST/executor takes minimal changes - one new setting key and one branch in the per-record path. On full success the flag is cleared; on failure the flag stays true so a retry remains correct.
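
The per-record branch could look roughly like the sketch below. It models buckets as in-memory lists of partition-key values and uses Objects.hashCode in place of getBucketIdByRecord(...); the class and method names are hypothetical, and batching, transactions, and POLYMORPHIC handling are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical in-memory model of the repartition pass; not the RebuildTypeStatement code. */
final class RepartitionRebuildSketch {

  /** Stand-in for getBucketIdByRecord(...); the real hash may differ. */
  static int targetBucket(String key, int bucketCount) {
    return Math.floorMod(Objects.hashCode(key), bucketCount);
  }

  /** Moves every misplaced key into its target bucket and returns how many moved. */
  static int repartition(Map<Integer, List<String>> buckets, int bucketCount) {
    // Snapshot (bucket, key) pairs first so records moved during the pass are not revisited.
    List<Map.Entry<Integer, String>> snapshot = new ArrayList<>();
    for (Map.Entry<Integer, List<String>> bucket : buckets.entrySet())
      for (String key : bucket.getValue())
        snapshot.add(Map.entry(bucket.getKey(), key));

    int moved = 0;
    for (Map.Entry<Integer, String> record : snapshot) {
      int current = record.getKey();
      int target = targetBucket(record.getValue(), bucketCount);
      if (current != target) {
        buckets.get(current).remove(record.getValue()); // delete from current bucket
        buckets.computeIfAbsent(target, b -> new ArrayList<>()).add(record.getValue()); // insert into target
        moved++;
      }
    }
    return moved;
  }
}
```

Running the pass a second time moves nothing, which is the property that makes a retry after a failed or interrupted rebuild safe.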

B.5 Query-time WARNING with throttle

When a query is executed against a type whose needsRepartition is true, the engine emits a Level.WARNING log so operators see the cost of the pending rebuild. Throttle to at most one message per minute per type (a 60-second window matching the existing QueryEngineManager and SparseVectorScoringPool saturation throttles: same pattern, same operator mental model). Implementation: one AtomicLong lastNeedsRepartitionWarnMs per LocalDocumentType; the planner's pre-firing check on the partition-pruning rule updates it to the current time via compareAndSet once the window has elapsed. Spam is bounded; one entry per type per minute is enough to make the pending rebuild noticeable in any reasonable monitoring setup without drowning the log on a hot type.
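
A minimal sketch of the throttle, assuming a hypothetical helper class and an injected clock value for testability; the existing QueryEngineManager and SparseVectorScoringPool throttles may be structured differently.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-type warning throttle built on AtomicLong.compareAndSet. */
final class RepartitionWarnThrottle {

  static final long WINDOW_MS = 60_000L;

  // 0 means "never warned"; with epoch-millisecond clocks the first call always passes.
  private final AtomicLong lastNeedsRepartitionWarnMs = new AtomicLong(0L);

  /** Returns true for at most one caller per 60-second window. */
  boolean shouldWarn(long nowMs) {
    long last = lastNeedsRepartitionWarnMs.get();
    if (nowMs - last < WINDOW_MS)
      return false; // still inside the current window
    // Only the thread that wins the CAS logs; concurrent losers stay silent.
    return lastNeedsRepartitionWarnMs.compareAndSet(last, nowMs);
  }
}
```

The planner check would call shouldWarn(System.currentTimeMillis()) and log only when it returns true, so a hot type costs one atomic read per query in the common case.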

Sample log line:

WARNING - type 'Doc' has needsRepartition=true; partition-aware bucket pruning is
disabled until `REBUILD TYPE Doc WITH repartition = true` runs. Queries continue
to return correct results but fan out across all 16 buckets.

B.6 Visibility

SHOW TYPE Doc displays needsRepartition: true/false.
schema:types system view exposes it as a queryable column.
Studio's type details panel renders a warning banner when the flag is set, with a one-click "Run repartition" button that invokes REBUILD TYPE <Type> WITH repartition = true and shows progress.
Server boot emits an info-level reminder ("3 partitioned types are pending repartition: Doc, Order, Event") if any are stale, so operators see the state without having to query for it.

B.7 Edge cases the flag handles

Rebuild fails or is killed midway: flag stays true (cleared only on full success). Restart-safe.
Multiple successive DDLs (add bucket, add bucket, drop bucket): each flips the flag to true if not already there; one REBUILD TYPE covers all of them.
HA replication: the flag replicates with the schema; followers see the same state and skip pruning identically.
Backup/restore: the flag rides along with the schema; a restored type with needsRepartition = true correctly skips pruning until rebuilt.

The principle stays: "if you opted into partitioning, the engine treats it as a load-bearing invariant; you cannot accidentally produce wrong results by mutating the schema." The flag is the mechanism that enforces it without throwing.

Part C: Documentation and Studio

New docs page: "Schema design 101 - choosing a bucket strategy" with a 3-question decision tree, concrete CREATE TYPE Doc BUCKETS 16 examples, anti-patterns (low-cardinality / skewed / mutable partition keys), and a "how to verify it's working" section showing EXPLAIN-style query plans with bucket pruning visible.
Add a paragraph to the type-creation reference and the vector-index reference linking to the design page.
Promote the partition-aware-vector-filter pattern as the answer to the filterable-HNSW question, alongside the future ACORN integration follow-up.
Studio: when creating a type, surface a hint near the bucket-strategy dropdown ("If your data is scoped by tenant, customer, region, or another high-cardinality identifier, partition by that property for query-time pruning") linking to the design page. Half-day of frontend work; high leverage because schema choices are made once.

Acceptance criteria

SQL planner: a WHERE equality / IN on the partition key restricts the bucket set in FetchFromTypeStep / FetchFromIndexStep / vector.neighbors / vector.sparseNeighbors.
OpenCypher planner: equivalent rule for MATCH (n:Type {prop: X}) and WHERE n.prop = X.
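Schema-mutation guardrails: bucket count change, partition-property removal, and strategy change while data exists are handled by the needsRepartition mechanism from Part B: the DDL either sets the flag and logs a WARNING (default) or runs the rebuild atomically (WITH repartition = true), so pruned queries never return wrong results.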
REBUILD TYPE <Type> WITH repartition = true setting implemented and documented; existing POLYMORPHIC and batchSize keep working alongside it. Successful completion clears the type's needsRepartition flag.
LocalDocumentType.needsRepartition flag persisted in schema.json, replicated via the standard schema-replication path, set by bucket add/drop and strategy change, cleared by successful rebuild.
Planner partition-pruning rule from Part A is gated on !type.needsRepartition(); tests prove pruning is suppressed when the flag is true and resumes automatically after rebuild clears it.
Throttled Level.WARNING per-type query-time log when queries hit a type with needsRepartition=true (one message per type per 60-second window). Test pins the throttle interval.
SHOW TYPE and schema:types expose needsRepartition. Studio renders a warning banner with a "Run repartition" button.
Tests: a PartitionPruningPlannerTest proving each engine prunes correctly + an AlterPartitionedTypeTest proving each schema-mutation path either sets needsRepartition or completes an atomic rebuild.
Docs: the "Schema design 101" page is published and linked from the type-creation and vector-index reference pages.
Studio: the bucket-strategy hint is rendered on the type-creation form.
Default BucketSelectionStrategy stays RoundRobinBucketSelectionStrategy (do not flip). Promote partitioning via docs, not by changing the default.

Out of scope

Auto-detecting a "good" partition key from data distribution. The user picks; the engine validates and prunes.
Repartitioning across nodes in a distributed deployment - relevant only when horizontal sharding lands.
Partitioning by computed expressions (e.g., hash(tenant_id, region) beyond the existing multi-property strategy).
Time-window partitioning with automatic bucket creation per period.

Why now

Comes out of the post-#4068 roadmap discussion: with sparse-vector scaling shipped, the next-most-visible production gap is filter-aware vector retrieval, and the cheapest credible answer for the multi-tenant SaaS case is partition-aware planner pruning - which the engine already has the data for, just never used. The same rule then benefits every non-vector query that filters on a partition key, so the work is broadly reusable rather than vector-specific.

#4068 "feat: WAND/BlockMax-WAND dynamic pruning for LSM_SPARSE_VECTOR (scale to 100M+)" (sparse-vector scaling) - shipped; this issue extracts the partition-aware angle as a standalone deliverable
#4085 "feat: per-segment parallel top-K scoring for LSM_SPARSE_VECTOR (Step 5 follow-up to #4068)" (per-segment parallel sparse scoring) - parallel work, also follow-up to #4068
#4086 "perf: pool the 64 KiB decodeScratch buffer in PaginatedSegmentDimCursor" (decode-buffer pooling) - parallel work, also follow-up to #4068