feat: groupBy/groupSize options on vector.neighbors for diversified retrieval
Feature Request: groupBy Option on Vector Neighbors
Originated from discussion #4044 (Qdrant to ArcadeDB migration).
Overview
Add groupBy and groupSize options to vector.neighbors (and the future vector.sparseNeighbors) so vector retrieval can be diversified at search time by a payload field, instead of forcing callers to over-fetch and post-partition with ROW_NUMBER() OVER (PARTITION BY ...).
This mirrors Qdrant's query_points_groups and is generally useful for any diversification scenario:
- Best chunk per document (where multiple chunks of the same document each have their own embedding).
- At most N items per author / category / source.
- Best variant per product family in e-commerce.
- The text-classification use case from discussion #4044, where each XML source produces multiple sparse points (positive / description / negative) and the caller wants the best hit per source file.
Design
API
SELECT * FROM vector.neighbors(
'Doc[embedding]',
$queryVec,
10,
{ groupBy: 'source_file', groupSize: 1, filter: "(status = 'active')" }
)
Options:
- groupBy (string): payload field to group by. Resolves to a STRING, INTEGER, or LONG value. Supports nested fields via dot notation, e.g. metadata.author.
- groupSize (integer, default 1): max points returned per group.
- The positional limit argument (third param of vector.neighbors) becomes the max number of distinct groups returned, mirroring Qdrant's limit semantics in the grouping API.
When groupBy is absent, behavior is identical to today (flat top-K, no breaking change).
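As an illustration of the dotted-path rule for groupBy, here is a minimal Python sketch that resolves a group key from a nested record. The function name resolve_group_key and the dict-shaped record are hypothetical, and returning None for a missing or unsupported value is an assumption: the issue does not say how such points should be treated.

```python
def resolve_group_key(record, path):
    """Follow a dotted path such as 'metadata.author' through nested dicts.

    Returns the value when it is a str or int (Python's int stands in for
    both INTEGER and LONG), otherwise None.
    """
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None  # assumption: a missing field yields no group key
        value = value[part]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (str, int)):
        return None
    return value


doc = {"source_file": "a.xml", "metadata": {"author": "ann", "year": 2024}}
author = resolve_group_key(doc, "metadata.author")  # nested lookup
source = resolve_group_key(doc, "source_file")      # top-level lookup
```

Whether points without a resolvable key are skipped or collected into a catch-all group is a design decision this sketch leaves open.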
Implementation
Best-per-group is enforced during HNSW traversal, not as a post-filter:
- Maintain a min-heap of size limit × groupSize.
- Track per-group occupancy in a hashmap groupKey -> count.
- Admit a candidate only if count[groupKey] < groupSize, or its distance beats the current worst member of that group (in which case we evict the worst).
- Traversal stops by the standard HNSW efSearch budget; if some groups cannot be filled, return what was found (best-effort, matching Qdrant's stated semantics).
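The admission rule above can be sketched in Python over a plain stream of candidates. This is a minimal illustration, not the proposed ArcadeDB implementation. It performs no HNSW traversal, and it keeps one small heap per group rather than a single heap of size limit × groupSize. The final step, ranking groups by their best member and keeping the first limit groups, is an assumption the issue does not spell out. The name grouped_top_k and the (point_id, group_key, distance) tuple shape are hypothetical.

```python
import heapq
from collections import defaultdict


def grouped_top_k(candidates, limit, group_size=1):
    """candidates: iterable of (point_id, group_key, distance); smaller is better.

    Returns up to `limit` groups, each as (group_key, [(point_id, distance), ...]).
    """
    # One max-heap per group (distances negated), holding at most group_size hits.
    per_group = defaultdict(list)
    for point_id, group_key, distance in candidates:
        heap = per_group[group_key]
        if len(heap) < group_size:
            # The group still has room: admit unconditionally.
            heapq.heappush(heap, (-distance, point_id))
        elif distance < -heap[0][0]:
            # The group is full, but the candidate beats its worst member: evict it.
            heapq.heapreplace(heap, (-distance, point_id))

    # Assumption: groups are ranked by their closest member.
    ranked = sorted(per_group.items(),
                    key=lambda kv: min(-neg for neg, _ in kv[1]))
    result = []
    for group_key, heap in ranked[:limit]:
        hits = sorted((-neg, pid) for neg, pid in heap)
        result.append((group_key, [(pid, dist) for dist, pid in hits]))
    return result


stream = [
    ("a1", "A", 0.30), ("a2", "A", 0.10), ("b1", "B", 0.20),
    ("a3", "A", 0.05), ("c1", "C", 0.40),
]
best_two_groups = grouped_top_k(stream, limit=2, group_size=1)
# -> [("A", [("a3", 0.05)]), ("B", [("b1", 0.2)])]
```

Because candidates only arrive while the efSearch budget lasts, a group may end up with fewer than groupSize members, which is the best-effort behavior described above.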
Composition
- Works with the filter option (post-filter applied before group accounting).
- Works inside vector.fuse (see feat: server-side hybrid retrieval fusion (vector.fuse with RRF/DBSF/LINEAR) #4066), both per-source and at the fusion level.
- Works for vector.sparseNeighbors (see feat: LSM_SPARSE_VECTOR index for sparse embedding retrieval #4065) with the same option shape.
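The ordering in the first bullet matters: if a rejected point were counted toward its group first, it could take the slot of a valid hit. The Python sketch below contrasts the two orderings on a toy candidate list. The helper names and the (point_id, group_key, distance, status) tuple shape are hypothetical and only illustrate the semantics, not the engine's code path.

```python
def best_per_group(hits):
    """Keep the closest hit per group; hits are (point_id, group_key, distance, status)."""
    best = {}
    for hit in hits:
        key = hit[1]
        if key not in best or hit[2] < best[key][2]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h[2])


def is_active(hit):
    # Stands in for the filter "(status = 'active')" from the API example.
    return hit[3] == "active"


hits = [
    ("a1", "A", 0.05, "archived"),  # closest in group A, but rejected by the filter
    ("a2", "A", 0.10, "active"),
    ("b1", "B", 0.20, "active"),
]

# Filter before group accounting (the proposed behavior): group A keeps a2.
filter_then_group = best_per_group([h for h in hits if is_active(h)])

# Group first, filter afterwards: a1 wins group A and is then dropped,
# so group A disappears from the result.
group_then_filter = [h for h in best_per_group(hits) if is_active(h)]
```

With groupSize = 1, filtering first returns a2 and b1, while grouping first returns only b1.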
Acceptance criteria
- groupBy and groupSize options on vector.neighbors.
- Same options on vector.sparseNeighbors (when that lands).
- Implemented during HNSW traversal, not as a post-filter.
- groupSize defaults to 1.
- groupBy accepts STRING, INTEGER, LONG, supports dotted nested-field access.
- Best-effort semantics documented (cannot guarantee groupSize × limit in all cases).
- Composes with filter.
- Tests: 100 docs across 10 source_files, top-10 with groupSize=1 returns 10 distinct sources; with groupSize=2, returns up to 20 with each group capped at 2.
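The test scenario in the last criterion could be checked against an exact brute-force oracle, sketched here in Python. The function brute_force_grouped, the record layout, and the synthetic distances are hypothetical. A real test would compare the output of vector.neighbors with such an oracle, and because HNSW search is approximate, the engine may legitimately return fewer hits than the oracle does.

```python
def brute_force_grouped(docs, limit, group_size):
    """Exact reference: scan docs by ascending distance, cap hits per group,
    and stop opening new groups once `limit` distinct groups exist."""
    counts = {}
    hits = []
    for doc in sorted(docs, key=lambda d: d["distance"]):
        key = doc["source_file"]
        if key not in counts:
            if len(counts) >= limit:
                continue  # already have `limit` groups; ignore new ones
            counts[key] = 0
        if counts[key] < group_size:
            counts[key] += 1
            hits.append(doc)
    return hits


# 100 docs across 10 source files; 37 is coprime to 100, so distances are distinct.
docs = [
    {"id": i, "source_file": f"src_{i % 10}", "distance": ((i * 37) % 100) / 100}
    for i in range(100)
]

top_one = brute_force_grouped(docs, limit=10, group_size=1)
top_two = brute_force_grouped(docs, limit=10, group_size=2)

per_source = {}
for doc in top_two:
    per_source[doc["source_file"]] = per_source.get(doc["source_file"], 0) + 1
```

Here top_one holds one hit for each of the 10 sources, and top_two holds 20 hits with every source capped at 2.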
Out of scope (future work)
- Cross-collection group lookup (Qdrant's with_lookup). Can be emulated today with a follow-up SQL join.
- Multi-key grouping (group by (field_a, field_b)).
- Group-level scoring / aggregation (e.g. avg score per group as a tiebreaker).
Related
- Discussion: Migrating from Qdrant to ArcadeDB — stuck on sparse vector indexing and server-side hybrid search #4044
- Companion issues: feat: LSM_SPARSE_VECTOR index for sparse embedding retrieval #4065 (LSM_SPARSE_VECTOR index), feat: server-side hybrid retrieval fusion (vector.fuse with RRF/DBSF/LINEAR) #4066 (hybrid fusion vector.fuse).
cc @astarso