feat: groupBy/groupSize options on vector.neighbors for diversified retrieval

Feature Request: `groupBy` Option on Vector Neighbors

Originated from discussion #4044 (Qdrant to ArcadeDB migration).

Overview

Add groupBy and groupSize options to vector.neighbors (and the future vector.sparseNeighbors) so vector retrieval can be diversified at search time by a payload field, instead of forcing callers to over-fetch and post-partition with ROW_NUMBER() OVER (PARTITION BY ...).

This mirrors Qdrant's query_points_groups and is generally useful for any diversification scenario:

Best chunk per document (where multiple chunks of the same document each have their own embedding).
At most N items per author / category / source.
Best variant per product family in e-commerce.
The text-classification use case from discussion #4044, where each XML source produces multiple sparse points (positive / description / negative) and the caller wants the best hit per source file.
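
For contrast, the client-side workaround this proposal replaces can be sketched roughly as follows. This is a Python illustration, not ArcadeDB code: `hits` stands in for an over-fetched flat result list, and `best_per_group` is a hypothetical helper name. The sample data shows why over-fetching is fragile: when one group dominates the fetched window, fewer distinct groups come back than were requested.

```python
from collections import defaultdict

def best_per_group(hits, group_field, group_size, limit):
    """Post-partition an over-fetched flat top-K list on the client.

    hits: list of (distance, record) pairs sorted by ascending distance.
    Returns at most `limit` groups, each holding at most `group_size` hits.
    """
    groups = defaultdict(list)  # dict order follows each group's best hit
    for distance, record in hits:
        key = record[group_field]
        if key not in groups and len(groups) == limit:
            continue  # enough distinct groups already
        if len(groups[key]) < group_size:
            groups[key].append((distance, record))
    return dict(groups)

# Over-fetched window of 6 hits: five chunks belong to the same document.
hits = [
    (0.10, {"source_file": "a.xml"}),
    (0.11, {"source_file": "a.xml"}),
    (0.12, {"source_file": "a.xml"}),
    (0.13, {"source_file": "a.xml"}),
    (0.14, {"source_file": "a.xml"}),
    (0.20, {"source_file": "b.xml"}),
]

# Three groups were requested, but the window only contains two.
result = best_per_group(hits, "source_file", group_size=1, limit=3)
print(list(result))  # ['a.xml', 'b.xml']
```

Enlarging the over-fetch window reduces the problem but never removes it, which is the motivation for grouping at search time.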

Design

API

SELECT * FROM vector.neighbors( 'Doc[embedding]', $queryVec, 10, { groupBy: 'source_file', groupSize: 1, filter: "(status = 'active')" } )

Options:

groupBy (string): payload field to group by. Resolves to a STRING, INTEGER, or LONG value. Supports nested fields via dot notation, e.g. metadata.author.
groupSize (integer, default 1): max points returned per group.
The positional limit argument (third param of vector.neighbors) becomes the max number of distinct groups returned, mirroring Qdrant's limit semantics in the grouping API.

When groupBy is absent, behavior is identical to today (flat top-K, no breaking change).
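
One way the groupBy resolution described above might behave, sketched in Python over plain dictionaries. The helper name `resolve_group_key` is hypothetical, Python `str` and `int` stand in for STRING and INTEGER/LONG, and returning `None` for a missing or unsupported value is an assumption: the proposal does not say how such records should be handled.

```python
def resolve_group_key(record, path):
    """Resolve a dotted path such as 'metadata.author' against nested dicts.

    Returns the group key, or None when the path is missing or the value
    is not a supported type (assumption: such records have no group).
    """
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (str, int)):
        return None
    return value

doc = {"source_file": "a.xml", "metadata": {"author": "ann", "score": 0.7}}
print(resolve_group_key(doc, "source_file"))      # a.xml
print(resolve_group_key(doc, "metadata.author"))  # ann
print(resolve_group_key(doc, "metadata.score"))   # None (float is unsupported)
```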

Implementation

Best-per-group is enforced during HNSW traversal, not as a post-filter:

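Maintain a bounded heap of at most limit × groupSize candidates, ordered so the worst retained candidate sits at the top (a min-heap on similarity score, or equivalently a max-heap on distance).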
Track per-group occupancy in a hashmap groupKey -> count.
Admit a candidate only if count[groupKey] < groupSize, or its distance beats the current worst member of that group (in which case we evict the worst).
Traversal stops by the standard HNSW efSearch budget; if some groups cannot be filled, return what was found (best-effort, matching Qdrant's stated semantics).
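
The admission rule above can be sketched in Python as a collector that candidates are offered to as traversal visits them. This is a simplified illustration under stated assumptions, not the proposed implementation: `GroupedCollector` is a hypothetical name, it keeps one small heap per group instead of a single bounded heap, it does not cap the number of tracked groups, and the HNSW traversal and efSearch budget are not modelled.

```python
import heapq

class GroupedCollector:
    """Keeps the best `group_size` candidates per group as they stream in."""

    def __init__(self, group_size):
        self.group_size = group_size
        # group key -> heap of (-distance, seq, record); worst member on top.
        self.groups = {}
        self._seq = 0  # unique tie-breaker so records are never compared

    def offer(self, distance, group_key, record):
        """Return True if the candidate was admitted."""
        heap = self.groups.setdefault(group_key, [])
        self._seq += 1
        entry = (-distance, self._seq, record)
        if len(heap) < self.group_size:
            heapq.heappush(heap, entry)      # group still has a free slot
            return True
        if distance < -heap[0][0]:
            heapq.heapreplace(heap, entry)   # beats the group's worst: evict it
            return True
        return False

    def results(self, limit):
        """Best `limit` groups, ranked by their closest member."""
        ranked = []
        for key, heap in self.groups.items():
            members = sorted(((-neg, rec) for neg, _, rec in heap),
                             key=lambda m: m[0])
            ranked.append((members[0][0], key, members))
        ranked.sort(key=lambda g: g[0])
        return [(key, members) for _, key, members in ranked[:limit]]

collector = GroupedCollector(group_size=1)
admitted = [
    collector.offer(0.5, "a.xml", "a1"),  # first hit for a.xml
    collector.offer(0.3, "a.xml", "a2"),  # closer: evicts a1
    collector.offer(0.4, "a.xml", "a3"),  # worse than a2: rejected
    collector.offer(0.2, "b.xml", "b1"),
    collector.offer(0.9, "c.xml", "c1"),
]
top = collector.results(limit=2)
```

Here `limit` selects groups only at the end. A production version would presumably also prune groups during traversal to keep memory within limit × groupSize.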

Composition

Works with the filter option (the filter is evaluated on each candidate before group accounting, so rejected points never count toward groupSize).
Works inside vector.fuse (see feat: server-side hybrid retrieval fusion (vector.fuse with RRF/DBSF/LINEAR) #4066), both per-source and at the fusion level.
Works for vector.sparseNeighbors (see feat: LSM_SPARSE_VECTOR index for sparse embedding retrieval #4065) with the same option shape.
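
Why the filter has to run before group accounting can be seen in a small Python sketch. The data and the helper `grouped` are illustrative only. With groupSize 1, grouping first lets an archived chunk take its document's only slot, and the later filter then removes that document from the results entirely.

```python
def grouped(hits, key_field, group_size):
    """Keep the first `group_size` hits per group; hits are sorted by distance."""
    counts, kept = {}, []
    for distance, record in hits:
        key = record[key_field]
        if counts.get(key, 0) < group_size:
            counts[key] = counts.get(key, 0) + 1
            kept.append((distance, record))
    return kept

def is_active(hit):
    return hit[1]["status"] == "active"

hits = [
    (0.1, {"source_file": "a.xml", "status": "archived"}),
    (0.2, {"source_file": "a.xml", "status": "active"}),
    (0.3, {"source_file": "b.xml", "status": "active"}),
]

# Filter, then group: the active chunk of a.xml takes the group's slot.
filter_then_group = grouped([h for h in hits if is_active(h)], "source_file", 1)

# Group, then filter: the archived chunk takes the slot and a.xml disappears.
group_then_filter = [h for h in grouped(hits, "source_file", 1) if is_active(h)]

print(len(filter_then_group), len(group_then_filter))  # 2 1
```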

Acceptance criteria

groupBy and groupSize options on vector.neighbors.
Same options on vector.sparseNeighbors (when that lands).
Implemented during HNSW traversal, not as a post-filter.
groupSize defaults to 1.
groupBy accepts STRING, INTEGER, LONG, supports dotted nested-field access.
Best-effort semantics documented (the search cannot guarantee groupSize × limit results in all cases).
Composes with filter.
Tests: with 100 docs across 10 source_files, a top-10 query with groupSize=1 returns 10 distinct sources; with groupSize=2 it returns up to 20 results, each group capped at 2.
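
The test scenario can be expressed against a brute-force oracle, which an index-backed implementation could then be compared with. The Python sketch below makes several assumptions: embeddings are one-dimensional floats for brevity, docs are spread evenly over the 10 source files, and `grouped_top_k` is a hypothetical exact reference, not the HNSW code path.

```python
import random
from collections import Counter

def grouped_top_k(docs, query, group_field, group_size, limit):
    """Exact grouped search: scan every doc in order of distance to `query`."""
    ranked = sorted(docs, key=lambda d: abs(d["embedding"] - query))
    kept, counts = [], Counter()
    for doc in ranked:
        key = doc[group_field]
        if key not in counts and len(counts) == limit:
            continue  # group budget exhausted
        if counts[key] < group_size:
            counts[key] += 1
            kept.append(doc)
    return kept

rng = random.Random(42)
docs = [
    {"id": i, "source_file": f"src_{i % 10}.xml", "embedding": rng.random()}
    for i in range(100)
]

top1 = grouped_top_k(docs, 0.5, "source_file", group_size=1, limit=10)
top2 = grouped_top_k(docs, 0.5, "source_file", group_size=2, limit=10)

per_group = Counter(d["source_file"] for d in top2)
print(len(top1), len(top2), max(per_group.values()))  # 10 20 2
```

Because the oracle is exact, it always fills every group here. The HNSW-backed version is best-effort, so its assertions should check "at most 20" and "at most 2 per group" rather than exact counts.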

Out of scope (future work)

Cross-collection group lookup (Qdrant's with_lookup). Can be emulated today with a follow-up SQL join.
Multi-key grouping (group by (field_a, field_b)).
Group-level scoring / aggregation (e.g. avg score per group as a tiebreaker).

Discussion: Migrating from Qdrant to ArcadeDB — stuck on sparse vector indexing and server-side hybrid search #4044
Companion issues: #4065 (feat: LSM_SPARSE_VECTOR index for sparse embedding retrieval), #4066 (feat: server-side hybrid retrieval fusion, vector.fuse with RRF/DBSF/LINEAR).

cc @astarso