GitHub - MinishLab/vicinity: Lightweight Nearest Neighbors with Flexible Backends

Vicinity is a lightweight, low-dependency vector store. It provides a simple and intuitive interface for nearest neighbor search, with support for different backends and evaluation.

There are many nearest neighbors packages and methods out there. However, we found it difficult to compare them. Every package has its own interface, quirks, and limitations, and learning a new package can be time-consuming. In addition to that, how do you effectively evaluate different packages? How do you know which one is the best for your use case?

This is where Vicinity comes in. Instead of learning a new interface for each new package or backend, Vicinity provides a unified interface for all backends. This allows you to easily experiment with different indexing methods and distance metrics and choose the best one for your use case. Vicinity also provides a simple way to evaluate the performance of different backends, allowing you to measure the queries per second and recall.

Quickstart

Install the package with:

pip install vicinity

Optionally, install specific backends and integrations, or simply install all of them with:

pip install vicinity[all]

The following code snippet demonstrates how to use Vicinity for nearest neighbor search:

import numpy as np
from vicinity import Vicinity, Backend, Metric

Create some dummy data as strings or other serializable objects

items = ["triforce", "master sword", "hylian shield", "boomerang", "hookshot"]
vectors = np.random.rand(len(items), 128)

Initialize the Vicinity instance (using the basic backend and cosine metric)

vicinity = Vicinity.from_vectors_and_items(
    vectors=vectors,
    items=items,
    backend_type=Backend.BASIC,
    metric=Metric.COSINE,
)

Create a query vector

query_vector = np.random.rand(128)

Query for nearest neighbors with a top-k search

results = vicinity.query(query_vector, k=3)

Query for nearest neighbors with a threshold search

results = vicinity.query_threshold(query_vector, threshold=0.9)

Query with a list of query vectors

query_vectors = np.random.rand(5, 128)
results = vicinity.query(query_vectors, k=3)
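To make concrete what a top-k cosine search computes, here is a minimal brute-force sketch in plain Python. It is not Vicinity's implementation: the helper names `cosine_similarity` and `top_k` are hypothetical, plain lists stand in for NumPy arrays to keep the sketch dependency-free, and the sketch returns similarity scores, which may differ from what `vicinity.query` returns.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, items, k):
    """Return the k (item, similarity) pairs most similar to the query."""
    scored = [
        (item, cosine_similarity(query, vec))
        for item, vec in zip(items, vectors)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Tiny hand-made example (2-dimensional vectors for readability)
items = ["triforce", "master sword", "hylian shield"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], vectors, items, k=2)
```

An exact search like this compares the query against every stored vector. Approximate backends trade some accuracy for speed by avoiding that full scan.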

Saving and loading a vector store:

vicinity.save('my_vector_store')
vicinity = Vicinity.load('my_vector_store')

Pushing a vector store to the Hugging Face Hub and loading it back (note that you can optionally add the model used for generating embeddings to the metadata, e.g. vicinity.metadata["model"] = "minishlab/potion-base-8M"):

vicinity.push_to_hub(repo_id='minishlab/my-vicinity-repo')
vicinity = Vicinity.load_from_hub(repo_id='minishlab/my-vicinity-repo')

Evaluating a backend:

Use the first 1000 vectors as query vectors

query_vectors = vectors[:1000]

Evaluate the Vicinity instance by measuring the queries per second and recall

qps, recall = vicinity.evaluate(
    full_vectors=vectors,
    query_vectors=query_vectors,
)
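The two reported numbers can be illustrated with a small sketch. Here recall is taken to mean the fraction of the exact nearest neighbors that the evaluated search also returned, and queries per second is the number of queries divided by elapsed wall-clock time. These definitions and the helper names `recall_at_k` and `queries_per_second` are assumptions for illustration; Vicinity's `evaluate` may differ in detail.

```python
import time

def recall_at_k(approx_results, exact_results):
    """Average fraction of exact neighbors also found by the approximate search."""
    total = 0.0
    for approx, exact in zip(approx_results, exact_results):
        total += len(set(approx) & set(exact)) / len(exact)
    return total / len(exact_results)

def queries_per_second(search_fn, queries):
    """Run search_fn once per query and return the measured throughput."""
    start = time.perf_counter()
    for query in queries:
        search_fn(query)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed

# Two queries: the first misses one of three true neighbors, the second is perfect
exact = [["a", "b", "c"], ["d", "e", "f"]]
approx = [["a", "b", "x"], ["d", "e", "f"]]
recall = recall_at_k(approx, exact)  # (2/3 + 3/3) / 2

# Stand-in search function; a real measurement would call the index's query method
qps = queries_per_second(lambda q: sorted(q), [[3, 1, 2]] * 1000)
```

Measured throughput depends on hardware and load, so compare backends on the same machine and the same query set.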

Main Features

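Vicinity provides the following features: a single interface across all supported backends, saving and loading of vector stores (locally or via the Hugging Face Hub), and evaluation of a backend's queries per second and recall.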

Supported Backends

The following backends are supported: BASIC, ANNOY, FAISS, HNSW, PYNNDESCENT, USEARCH, and VOYAGER.

NOTE: the ANN backends do not support dynamic deletion. To delete items, you need to recreate the index. Insertion is supported in the following backends: FAISS, HNSW, and Usearch. The BASIC backend supports both insertion and deletion.
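For backends without deletion, recreating the index amounts to filtering the stored data and building a new index from what remains. The sketch below shows only the filtering step in plain Python; the helper name `drop_items` is hypothetical, and it assumes you still have the original vectors available alongside the items.

```python
def drop_items(items, vectors, to_delete):
    """Return the items and vectors that remain after removing `to_delete`."""
    banned = set(to_delete)
    kept = [(item, vec) for item, vec in zip(items, vectors) if item not in banned]
    kept_items = [item for item, _ in kept]
    kept_vectors = [vec for _, vec in kept]
    return kept_items, kept_vectors

items = ["triforce", "master sword", "hylian shield", "boomerang"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

kept_items, kept_vectors = drop_items(items, vectors, ["boomerang"])
# The surviving data would then be passed to Vicinity.from_vectors_and_items
# (with the vectors converted back to a NumPy array) to build a fresh index.
```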

Backend Parameters

| Backend | Parameter | Description | Default Value |
|---|---|---|---|
| BASIC | metric | Similarity metric to use (cosine, euclidean). | "cosine" |
| ANNOY | metric | Similarity metric to use (dot, euclidean, cosine). | "cosine" |
| | trees | Number of trees to use for indexing. | 100 |
| | length | Optional length of the dataset. | None |
| FAISS | metric | Similarity metric to use (cosine, l2). | "cosine" |
| | index_type | Type of FAISS index (flat, ivf, hnsw, lsh, scalar, pq, ivf_scalar, ivfpq, ivfpqr). | "hnsw" |
| | nlist | Number of cells for IVF indexes. | 100 |
| | m | Number of subquantizers for PQ and HNSW indexes. | 8 |
| | nbits | Number of bits for LSH and PQ indexes. | 8 |
| | refine_nbits | Number of bits for the refinement stage in IVFPQR indexes. | 8 |
| HNSW | metric | Similarity space to use (cosine, l2). | "cosine" |
| | ef_construction | Size of the dynamic list during index construction. | 200 |
| | m | Number of connections per layer. | 16 |
| PYNNDESCENT | metric | Similarity metric to use (cosine, euclidean, manhattan). | "cosine" |
| | n_neighbors | Number of neighbors to use for search. | 15 |
| USEARCH | metric | Similarity metric to use (cos, ip, l2sq, hamming, tanimoto). | "cos" |
| | connectivity | Number of connections per node in the graph. | 16 |
| | expansion_add | Number of candidates considered during graph construction. | 128 |
| | expansion_search | Number of candidates considered during search. | 64 |
| VOYAGER | metric | Similarity space to use (cosine, l2). | "cosine" |
| | ef_construction | The number of vectors that this index searches through when inserting a new vector into the index. | 200 |
| | m | The number of connections between nodes in the tree’s internal data structure. | 16 |
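When experimenting with several backends, it can help to keep the defaults in one place and override only what you change. The sketch below copies a few rows of the parameter table into a plain mapping. The `DEFAULTS` layout and the `resolve_params` helper are hypothetical conveniences, not part of Vicinity; only the parameter names and default values come from the table.

```python
# Defaults copied from the parameter table (subset of backends shown)
DEFAULTS = {
    "ANNOY": {"metric": "cosine", "trees": 100, "length": None},
    "HNSW": {"metric": "cosine", "ef_construction": 200, "m": 16},
    "USEARCH": {
        "metric": "cos",
        "connectivity": 16,
        "expansion_add": 128,
        "expansion_search": 64,
    },
}

def resolve_params(backend, **overrides):
    """Merge overrides into a backend's defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DEFAULTS[backend])
    if unknown:
        raise ValueError(f"Unknown {backend} parameters: {sorted(unknown)}")
    return {**DEFAULTS[backend], **overrides}

# Raise ef_construction for HNSW while keeping the other defaults
params = resolve_params("HNSW", ef_construction=400)
```

Rejecting unknown names catches typos such as `ef_constuction` before an index is built.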

Installation

The following installation options are available:

Install the base package

pip install vicinity

Install all integrations and backends

pip install vicinity[all]

Install all integrations

pip install vicinity[integrations]

Install specific integrations

pip install vicinity[huggingface]

Install all backends

pip install vicinity[backends]

Install specific backends

pip install vicinity[annoy]
pip install vicinity[faiss]
pip install vicinity[hnsw]
pip install vicinity[pynndescent]
pip install vicinity[usearch]
pip install vicinity[voyager]

License

MIT

Citing

If you use Vicinity in your research, please cite the following:

@software{minishlab2024vicinity,
  author = {Stephan Tulkens and {van Dongen}, Thomas},
  title = {Vicinity: Lightweight Nearest Neighbors with Flexible Backends},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17265874},
  url = {https://github.com/MinishLab/vicinity},
  license = {MIT}
}