Computing Sparse Embeddings — Sentence Transformers documentation

Once you have installed Sentence Transformers, you can easily use Sparse Encoder models:

```python
from sentence_transformers import SparseEncoder

# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions

# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 35.629,  9.154,  0.098],
#         [  9.154, 27.478,  0.019],
#         [  0.098,  0.019, 29.553]])

# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}")  # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")
```

Note

Even though we talk about sentence embeddings, you can use Sparse Encoder models for shorter phrases as well as for longer texts with multiple sentences. See Input Sequence Length for notes on embeddings for longer texts.

Initializing a Sparse Encoder Model

The first step is to load a pretrained Sparse Encoder model. You can use any of the models from the Pretrained Models or a local model. See also SparseEncoder for information on parameters.

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
```

Alternatively, you can pass a path to a local model directory:

model = SparseEncoder("output/models/sparse-distilbert-nq-finetuned")

The model will automatically be placed on the most performant available device, e.g. cuda or mps if available. You can also specify the device explicitly:

model = SparseEncoder("naver/splade-cocondenser-ensembledistil", device="cuda")

Calculating Embeddings

The method to calculate embeddings is SparseEncoder.encode.
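For instance, a minimal sketch of a typical call; the batch_size and show_progress_bar arguments shown here are standard encode() options, and the sentences are made up:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Encode in batches of 32 and show a progress bar for larger inputs
embeddings = model.encode(
    ["The weather is lovely today.", "It's so sunny outside!"],
    batch_size=32,
    show_progress_bar=True,
)
print(embeddings.shape)  # (2, vocabulary size) sparse tensor
```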

Input Sequence Length

For transformer models like BERT, RoBERTa, DistilBERT etc., the runtime and memory requirements grow quadratically with the input length. This limits transformers to inputs of a certain length. A common value for BERT-based models is 512 tokens, which corresponds to about 300-400 words (for English).

Each model has a maximum sequence length under model.max_seq_length, which is the maximal number of tokens that can be processed. Longer texts will be truncated to the first model.max_seq_length tokens:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
print("Max Sequence Length:", model.max_seq_length)
# => Max Sequence Length: 256

# Change the length to 200
model.max_seq_length = 200

print("Max Sequence Length:", model.max_seq_length)
# => Max Sequence Length: 200
```

Note

You cannot increase the length beyond what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be as good.
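If you want to inspect those limits yourself, here is a minimal sketch; it assumes the model exposes its tokenizer as model.tokenizer and that the tokenizer reports its limit via the standard Hugging Face model_max_length attribute:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Limit applied by Sentence Transformers when encoding
print("max_seq_length:", model.max_seq_length)
# Limit reported by the underlying tokenizer (assumed to reflect the transformer's maximum)
print("tokenizer limit:", model.tokenizer.model_max_length)
```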

Controlling Sparsity

For sparse models, you can control the maximum number of active dimensions (non-zero values) in the output embeddings using the max_active_dims parameter. This is particularly useful for reducing memory and storage requirements, and for controlling the trade-off between accuracy and retrieval latency.

You can specify max_active_dims either when initializing the model or during encoding:

```python
from sentence_transformers import SparseEncoder

# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Embed a list of sentences
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# Generate embeddings
embeddings = model.encode(sentences)

# Print embedding dimensionality and sparsity
print(f"Embedding dim: {model.get_embedding_dimension()}")

stats = model.sparsity(embeddings)
print(f"Embedding sparsity: {stats}")
print(f"Average non-zero dimensions: {stats['active_dims']:.2f}")
print(f"Sparsity percentage: {stats['sparsity_ratio']:.2%}")
"""
Embedding dim: 30522
Embedding sparsity: {'active_dims': 56.333335876464844, 'sparsity_ratio': 0.9981543366792325}
Average non-zero dimensions: 56.33
Sparsity percentage: 99.82%
"""

# Example of using max_active_dims during encoding to limit the active dimensions
embeddings_limited = model.encode(sentences, max_active_dims=32)
stats_limited = model.sparsity(embeddings_limited)
print(f"Limited embedding sparsity: {stats_limited}")
print(f"Average non-zero dimensions: {stats_limited['active_dims']:.2f}")
print(f"Sparsity percentage: {stats_limited['sparsity_ratio']:.2%}")
"""
Limited embedding sparsity: {'active_dims': 32.0, 'sparsity_ratio': 0.9989515759124565}
Average non-zero dimensions: 32.00
Sparsity percentage: 99.90%
"""
```
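As mentioned above, the limit can also be set once when loading the model. A minimal sketch, assuming max_active_dims is accepted as a constructor argument as the description above implies:

```python
from sentence_transformers import SparseEncoder

# Set a default limit for all encode() calls made with this model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil", max_active_dims=64)

embeddings = model.encode(["The weather is lovely today."])
print(model.sparsity(embeddings)["active_dims"])  # at most 64
```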

When you set max_active_dims, the model will keep only the top-K dimensions with the highest values and set all other values to zero. This ensures your embeddings maintain a controlled level of sparsity while preserving the most important semantic information.
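To make that concrete, here is a minimal sketch of the top-K truncation using plain PyTorch on a small, made-up score vector; it illustrates the idea rather than the library's internal implementation:

```python
import torch

# Made-up sparse scores for a tiny 8-dimensional vocabulary
scores = torch.tensor([0.0, 1.3, 0.0, 2.7, 0.4, 0.0, 0.9, 0.2])
max_active_dims = 3

# Keep the top-K highest values and zero out everything else
values, indices = torch.topk(scores, k=max_active_dims)
truncated = torch.zeros_like(scores)
truncated[indices] = values

print(truncated)
# tensor([0.0000, 1.3000, 0.0000, 2.7000, 0.0000, 0.0000, 0.9000, 0.0000])
```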

Note

Setting a very low max_active_dims value may reduce the quality of search results. The optimal value depends on your specific use case and dataset.

One of the key benefits of controlling sparsity with max_active_dims is reduced memory usage. Here’s an example showing the memory savings:

```python
def get_sparse_embedding_memory_size(tensor):
    # For sparse tensors, only count non-zero elements
    return (
        tensor._values().element_size() * tensor._values().nelement()
        + tensor._indices().element_size() * tensor._indices().nelement()
    )


print(f"Original embeddings memory: {get_sparse_embedding_memory_size(embeddings) / 1024:.2f} KB")
print(f"Embeddings with max_active_dims=32 memory: {get_sparse_embedding_memory_size(embeddings_limited) / 1024:.2f} KB")
"""
Original embeddings memory: 3.32 KB
Embeddings with max_active_dims=32 memory: 1.88 KB
"""
```

As shown in the example, limiting the active dimensions to 32 reduced memory usage by approximately 43%. This efficiency becomes even more significant when working with large document collections, but it has to be balanced against a possible loss of quality in the embedding representations. Note that each of the Evaluator classes has a max_active_dims parameter that can be set to control the number of active dimensions during evaluation, so you can easily compare the performance of different settings.
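Even without the evaluators, you can get a quick feel for the trade-off by re-scoring the same texts at different settings. A small sketch that reuses only the APIs shown above; the query and documents are made up:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

query = ["How is the weather today?"]
documents = ["The weather is lovely today.", "He drove to the stadium."]

# Re-encode and re-score at different sparsity levels
for max_active_dims in (None, 64, 32):
    query_emb = model.encode(query, max_active_dims=max_active_dims)
    doc_embs = model.encode(documents, max_active_dims=max_active_dims)
    print(max_active_dims, model.similarity(query_emb, doc_embs))
```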

Interpretability with SPLADE Models

When using SPLADE models, a key advantage is interpretability. You can easily visualize which tokens contribute most to the embedding, providing insights into what the model considers important in the text:

```python
from sentence_transformers import SparseEncoder

# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Embed a list of sentences
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# Generate embeddings
embeddings = model.encode(sentences)

# Visualize top tokens for each text
top_k = 10

token_weights = model.decode(embeddings, top_k=top_k)

print(f"\nTop tokens {top_k} for each text:")
# decode returns, for each text, a list of (token, weight) tuples
for i, sentence in enumerate(sentences):
    token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in token_weights[i]])
    print(f"{i}: {sentence} -> Top tokens: {token_scores}")

"""
Top tokens 10 for each text:
0: This framework generates embeddings for each input sentence -> Top tokens: ("framework", 2.19), ("##bed", 2.12), ("input", 1.99), ("each", 1.60), ("em", 1.58), ("sentence", 1.49), ("generate", 1.42), ("##ding", 1.33), ("sentences", 1.10), ("create", 0.93)
1: Sentences are passed as a list of string. -> Top tokens: ("string", 2.72), ("pass", 2.24), ("sentences", 2.15), ("passed", 2.07), ("sentence", 1.90), ("strings", 1.86), ("list", 1.84), ("lists", 1.49), ("as", 1.18), ("passing", 0.73)
2: The quick brown fox jumps over the lazy dog. -> Top tokens: ("lazy", 2.18), ("fox", 1.67), ("brown", 1.56), ("over", 1.52), ("dog", 1.50), ("quick", 1.49), ("jump", 1.39), ("dogs", 1.25), ("foxes", 0.99), ("jumping", 0.84)
"""
```

This interpretability helps in understanding why certain documents match or don’t match in search applications, and provides transparency into the model’s behavior.
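For example, you can inspect which expansion tokens a query and a document share, since overlapping tokens are exactly what drives their dot-product score. A minimal sketch with made-up texts, reusing encode() and decode() from above:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

query_emb = model.encode(["What generates sentence embeddings?"])
doc_emb = model.encode(["This framework generates embeddings for each input sentence"])

# Top expansion tokens and weights for the query and the document
query_tokens = dict(model.decode(query_emb, top_k=20)[0])
doc_tokens = dict(model.decode(doc_emb, top_k=20)[0])

# Tokens present in both, ranked by their contribution to the dot product
overlap = sorted(set(query_tokens) & set(doc_tokens), key=lambda t: -query_tokens[t] * doc_tokens[t])
print("Overlapping tokens:", overlap)
```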

Multi-Process / Multi-GPU Encoding

You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). This tends to help significantly with large datasets, but for smaller datasets the overhead of starting multiple processes can outweigh the benefit.

You can use SparseEncoder.encode() (or SparseEncoder.encode_query() or SparseEncoder.encode_document()) with either:

- the device parameter set to a list of devices, e.g. device=["cuda:0", "cuda:1"], in which case a pool of processes is started and stopped automatically for that call, or
- the pool parameter set to a pool of worker processes started with SparseEncoder.start_multi_process_pool(), which can be reused across multiple calls and stopped with SparseEncoder.stop_multi_process_pool() when you are done.

Additionally, you can use the chunk_size parameter to control the size of the chunks sent to each process. This differs from the batch_size parameter. For example, with a chunk_size=1000 and a batch_size=32, the input texts will be split into chunks of 1000 texts, and each chunk will be sent to a process and embedded in batches of 32 texts at a time. This can help with memory management and performance, especially for large datasets.
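Putting it together, a hedged sketch of both options; the device names, dataset, and sizes are made up, and multi-process encoding should be run under a __main__ guard:

```python
from sentence_transformers import SparseEncoder


def main():
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
    sentences = ["This is one of many documents to embed."] * 10_000

    # Option 1: let encode() start and stop a temporary pool over the listed devices
    embeddings = model.encode(sentences, device=["cuda:0", "cuda:1"], batch_size=32, chunk_size=1000)

    # Option 2: manage the pool yourself and reuse it across several encode() calls
    pool = model.start_multi_process_pool(["cuda:0", "cuda:1"])
    embeddings = model.encode(sentences, pool=pool, batch_size=32, chunk_size=1000)
    model.stop_multi_process_pool(pool)


if __name__ == "__main__":
    main()
```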