quantization — Sentence Transformers documentation

sentence_transformers.util.quantization provides helper functions for embedding quantization.

Note

Embedding quantization differs from model quantization. The former shrinks the embeddings themselves, making semantic search/retrieval faster and reducing memory and disk usage. The latter lowers the precision of the model weights to speed up inference. This page documents only the former.

sentence_transformers.util.quantization.quantize_embeddings(embeddings: Tensor | ndarray, precision: Literal['float32', 'int8', 'uint8', 'binary', 'ubinary'], ranges: ndarray | None = None, calibration_embeddings: ndarray | None = None) → ndarray

Quantizes embeddings to a lower precision. This can be used to reduce the memory footprint and increase the speed of similarity search. The supported precisions are “float32”, “int8”, “uint8”, “binary”, and “ubinary”.

Returns:

Quantized embeddings with the specified precision
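As an illustration of what binary quantization produces (a numpy-only sketch of the idea, not the library's exact implementation), the "ubinary" precision thresholds each dimension at zero and packs 8 bits into one byte, shrinking float32 storage by a factor of 32; "binary" reinterprets the same bits as signed int8:

```python
import numpy as np

# Hypothetical float32 embeddings: 4 vectors of dimension 16.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((4, 16)).astype(np.float32)

# "ubinary": threshold each dimension at 0 and pack 8 bits per uint8 byte.
# 16 float32 values (64 bytes) become 2 bytes per vector.
ubinary = np.packbits(embeddings > 0, axis=-1)

# "binary": the same bits shifted into signed int8 range (a sketch of the
# convention; check quantize_embeddings itself for the exact mapping).
binary = (ubinary.astype(np.int16) - 128).astype(np.int8)
```

Calling `quantize_embeddings(embeddings, precision="ubinary")` on real model output returns a packed array of this shape and dtype.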

sentence_transformers.util.quantization.semantic_search_faiss(query_embeddings: np.ndarray, corpus_embeddings: np.ndarray | None = None, corpus_index: faiss.Index | None = None, corpus_precision: Literal['float32', 'uint8', 'ubinary'] = 'float32', top_k: int = 10, ranges: np.ndarray | None = None, calibration_embeddings: np.ndarray | None = None, rescore: bool = True, rescore_multiplier: int = 2, exact: bool = True, output_index: bool = False) → tuple[list[list[dict[str, int | float]]], float, faiss.Index]

Performs semantic search using the FAISS library.

Rescoring will be performed only if all of the following hold:

1. rescore is True
2. The query embeddings are not quantized
3. The corpus is quantized, i.e. the corpus precision is not float32

Only then will the search retrieve top_k * rescore_multiplier samples and rescore them to keep the top_k.

Returns:

A tuple containing a list of search results and the time taken for the search. If output_index is True, the tuple will also contain the FAISS index used for the search.

Raises:

ValueError – If both corpus_embeddings and corpus_index are provided or if neither is provided.

The list of search results has the format: [[{“corpus_id”: int, “score”: float}, …], …]. The time taken for the search is a float value.
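The oversample-then-rescore pipeline described above can be sketched without FAISS at all. The following numpy illustration (an assumption-laden stand-in for the real function, which delegates the candidate search to a FAISS index) searches a bit-packed corpus by Hamming distance, retrieves top_k * rescore_multiplier candidates, and rescores them with float32 inner products:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 64)).astype(np.float32)
queries = rng.standard_normal((2, 64)).astype(np.float32)

top_k, rescore_multiplier = 3, 2

# Corpus quantized to "ubinary" (bit-packed); queries stay float32,
# but get a binarized view for the coarse candidate search.
corpus_bits = np.packbits(corpus > 0, axis=-1)
query_bits = np.packbits(queries > 0, axis=-1)

results = []
for qb, q in zip(query_bits, queries):
    # Step 1: oversampled candidate search by Hamming distance.
    hamming = np.unpackbits(corpus_bits ^ qb, axis=-1).sum(axis=-1)
    candidates = np.argsort(hamming)[: top_k * rescore_multiplier]
    # Step 2: rescore the candidates with float32 inner products, keep top_k.
    scores = corpus[candidates] @ q
    order = np.argsort(-scores)[:top_k]
    results.append(
        [{"corpus_id": int(candidates[i]), "score": float(scores[i])}
         for i in order]
    )
```

The nested-list-of-dicts shape of `results` matches the documented return format; semantic_search_faiss additionally returns the elapsed time and, optionally, the FAISS index.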

sentence_transformers.util.quantization.semantic_search_usearch(query_embeddings: np.ndarray, corpus_embeddings: np.ndarray | None = None, corpus_index: usearch.index.Index | None = None, corpus_precision: Literal['float32', 'int8', 'binary'] = 'float32', top_k: int = 10, ranges: np.ndarray | None = None, calibration_embeddings: np.ndarray | None = None, rescore: bool = True, rescore_multiplier: int = 2, exact: bool = True, output_index: bool = False) → tuple[list[list[dict[str, int | float]]], float, usearch.index.Index]

Performs semantic search using the usearch library.

Rescoring will be performed only if all of the following hold:

1. rescore is True
2. The query embeddings are not quantized
3. The corpus is quantized, i.e. the corpus precision is not float32

Only then will the search retrieve top_k * rescore_multiplier samples and rescore them to keep the top_k.

Returns:

A tuple containing a list of search results and the time taken for the search. If output_index is True, the tuple will also contain the usearch index used for the search.

Raises:

ValueError – If both corpus_embeddings and corpus_index are provided or if neither is provided.

The list of search results has the format: [[{“corpus_id”: int, “score”: float}, …]. The time taken for the search is a float value.
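Unlike the FAISS variant, this function supports the "int8" corpus precision, which is where the ranges and calibration_embeddings parameters come in. A numpy sketch of the idea (not necessarily the library's exact formula): learn per-dimension min/max ranges from a calibration set, then map each dimension linearly into the int8 range:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical calibration set and embeddings to quantize.
calibration_embeddings = rng.standard_normal((1000, 8)).astype(np.float32)
embeddings = rng.standard_normal((4, 8)).astype(np.float32)

# Per-dimension min/max learned from the calibration set; this plays the
# role of the `ranges` parameter (shape (2, embedding_dim)).
ranges = np.vstack([calibration_embeddings.min(axis=0),
                    calibration_embeddings.max(axis=0)])
starts, steps = ranges[0], (ranges[1] - ranges[0]) / 255

# "int8": scale each dimension into [0, 255], clip outliers that fall
# outside the calibration range, then shift into [-128, 127].
scaled = np.clip((embeddings - starts) / steps, 0, 255)
int8_embeddings = (scaled - 128).astype(np.int8)
```

Passing calibration_embeddings (or precomputed ranges) to semantic_search_usearch lets it quantize float32 inputs to match an int8 corpus index in this spirit.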