Quickstart: Sentence Transformers documentation

Sentence Transformer

Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

Calculates a fixed-size vector representation (embedding) given texts, images, audio, or video.
Embedding calculation is often efficient, and embedding similarity calculation is very fast.
Applicable for a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
Often used as a first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.

Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:

Text

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

Multimodal

Tip

Multimodal models require additional dependencies. Install them with e.g. pip install -U "sentence-transformers[image]" for image support. See Installation for all options.

from sentence_transformers import SentenceTransformer

# 1. Load a model that supports both text and images
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# 2. Encode images from URLs
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# 3. Encode text queries (one matching + one hard negative per image)
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

# 4. Compute cross-modal similarities
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
# tensor([[0.5115, 0.1078],
#         [0.1999, 0.1108],
#         [0.1255, 0.6749],
#         [0.1283, 0.2704]])

With SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") we pick which Sentence Transformer model we load. In this example, we load sentence-transformers/all-MiniLM-L6-v2, which is a MiniLM model finetuned on a large dataset of over 1 billion training pairs. Using SentenceTransformer.similarity(), we compute the similarity between all pairs of sentences. As expected, the similarity between semantically related inputs is higher than between unrelated ones. Multimodal models like Qwen/Qwen3-VL-Embedding-2B can also encode images, audio, or video into the same embedding space.
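The pairwise comparison described above can be illustrated without loading a model. The sketch below assumes the similarity function is cosine similarity (the usual default for models such as all-MiniLM-L6-v2; `model.similarity_fn_name` should confirm this for a given model). The helper names and the three-dimensional toy vectors are made up for illustration and are not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(rows, cols):
    """Score every row vector against every column vector."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Toy "embeddings" (assumption: real ones have hundreds of dimensions).
toy_embeddings = [
    [1.0, 0.0, 1.0],
    [1.0, 0.2, 0.8],
    [0.0, 1.0, 0.0],
]

matrix = similarity_matrix(toy_embeddings, toy_embeddings)
for row in matrix:
    print(" ".join(f"{value:.4f}" for value in row))
```

As in the model output above, the diagonal is 1 because each vector is compared with itself, and the matrix is symmetric because cosine similarity does not depend on argument order.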

Finetuning Sentence Transformer models is easy and requires only a few lines of code. For more information, see the Training Overview section.

Cross Encoder

Characteristics of Cross Encoder (a.k.a. reranker) models:

Calculates a similarity score given pairs of inputs (typically text, but also images or other modalities).
Generally provides superior performance compared to a Sentence Transformer (a.k.a. bi-encoder) model.
Often slower than a Sentence Transformer model, as it requires computation for each pair rather than each text.
Because of the previous two characteristics, Cross Encoders are often used to re-rank the top-k results from a Sentence Transformer model.

The usage for Cross Encoder (a.k.a. reranker) models is similar to Sentence Transformers:

Text

from sentence_transformers import CrossEncoder

# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")

# We want to compute the similarity between the query sentence...
query = "A man is eating pasta."

# ... and all sentences in the corpus
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# 2. We rank all sentences in the corpus for the query
ranks = model.rank(query, corpus)

# Print the scores
print("Query: ", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
"""
Query:  A man is eating pasta.
0.67    A man is eating food.
0.34    A man is eating a piece of bread.
0.08    A man is riding a horse.
0.07    A man is riding a white horse on an enclosed ground.
0.01    The girl is carrying a baby.
0.01    Two men pushed carts through the woods.
0.01    A monkey is playing drums.
0.01    A woman is playing violin.
0.01    A cheetah is running behind its prey.
"""

# 3. Alternatively, you can also manually compute the score between two sentences
import numpy as np

sentence_combinations = [[query, sentence] for sentence in corpus]
scores = model.predict(sentence_combinations)

# Sort the scores in decreasing order to get the corpus indices
ranked_indices = np.argsort(scores)[::-1]
print("Scores:", scores)
print("Indices:", ranked_indices)
"""
Scores: [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378, 0.00536814, 0.06676237, 0.00534825, 0.00516717]
Indices: [0 1 3 6 2 5 7 4 8]
"""
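To make the link between the two approaches concrete, here is a small sketch in plain Python that takes the scores printed above and rebuilds an ordered list of entries. It only assumes what the example shows, namely that each ranked entry exposes a `corpus_id` and a `score`; it is an illustration, not the library's implementation.

```python
# Scores copied from the manual predict() output shown above.
scores = [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378,
          0.00536814, 0.06676237, 0.00534825, 0.00516717]

# Indices of the corpus sentences, best score first (a pure-Python argsort).
ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Entries shaped like the ones the ranking loop above iterates over.
ranks = [{"corpus_id": i, "score": scores[i]} for i in ranked_indices]

for entry in ranks:
    print(f"{entry['score']:.2f}\t(corpus sentence {entry['corpus_id']})")
```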

Multimodal

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B", revision="refs/pr/11")

query = "A green car parked in front of a yellow building"
documents = [
    # Image documents (URL or local file path)
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    # Text document
    "A vintage Volkswagen Beetle painted in bright green sits in a driveway.",
    # Combined text + image document
    {
        "text": "A car in a European city",
        "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    },
]

rankings = model.rank(query, documents)
for rank in rankings:
    print(f"{rank['score']:.4f}\t(document {rank['corpus_id']})")
"""
0.9375  (document 0)
0.5000  (document 3)
-1.2500 (document 2)
-2.4375 (document 1)
"""

With CrossEncoder("cross-encoder/stsb-distilroberta-base") we pick which CrossEncoder model we load. CrossEncoder models can also work with multimodal inputs: Qwen/Qwen3-VL-Reranker-2B can rank images and text by relevance to a query.

Finetuning CrossEncoder models is easy and requires only a few lines of code. For more information, see the Training Overview section.
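The characteristics listed at the start of this section describe a two-stage pipeline: a fast vector comparison narrows the corpus to the top-k candidates, and a slower pairwise scorer reorders only those candidates. The sketch below shows that control flow in plain Python. Everything in it is a stand-in: the two-dimensional vectors are invented, and `word_overlap` is a hypothetical placeholder for a real cross-encoder, which in practice would be something like the `model.rank()` or `model.predict()` calls shown earlier.

```python
def top_k_by_dot(query_vec, corpus_vecs, k):
    """Stage 1: cheap vector scoring over the whole corpus; keep the k best ids."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


def rerank(query, corpus, candidate_ids, pair_scorer):
    """Stage 2: score each (query, document) pair and sort by that score."""
    results = [
        {"corpus_id": idx, "score": pair_scorer(query, corpus[idx])}
        for idx in candidate_ids
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)


def word_overlap(query, document):
    """Hypothetical pair scorer: Jaccard overlap of lowercased whitespace tokens."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q | d)


corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A woman is playing violin.",
]
# Invented vectors standing in for precomputed bi-encoder embeddings.
corpus_vecs = [[0.9, 0.1], [0.6, 0.5], [0.1, 0.9]]
query, query_vec = "A man is eating pasta.", [1.0, 0.0]

candidates = top_k_by_dot(query_vec, corpus_vecs, k=2)
reranked = rerank(query, corpus, candidates, word_overlap)
for entry in reranked:
    print(f"{entry['score']:.2f}\t{corpus[entry['corpus_id']]}")
```

The point of the split is cost: stage 1 touches every document but only with a dot product, while the expensive scorer runs k times instead of once per corpus entry.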

Sparse Encoder

Characteristics of Sparse Encoder models:

Calculates sparse vector representations where most dimensions are zero.
Provides efficiency benefits for large-scale retrieval systems due to the sparse nature of embeddings.
Often more interpretable than dense embeddings, with non-zero dimensions corresponding to specific tokens.
Complementary to dense embeddings, enabling hybrid search systems that combine the strengths of both approaches.

The usage for Sparse Encoder models follows a similar pattern to Sentence Transformers:

from sentence_transformers import SparseEncoder

# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions

# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[   35.629,     9.154,     0.098],
#         [    9.154,    27.478,     0.019],
#         [    0.098,     0.019,    29.553]])

# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}")  # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")

With SparseEncoder("naver/splade-cocondenser-ensembledistil") we load a pretrained SPLADE model that generates sparse embeddings. SPLADE (SParse Lexical AnD Expansion) models use masked language modeling (MLM) predictions to create sparse representations that are particularly effective for information retrieval tasks.
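To give a feel for what the sparsity statistics and the dot-product similarity measure, the following sketch recomputes both on tiny hand-written vectors. It is an illustration under assumptions, not the library's implementation: the function names are hypothetical, the result keys simply mirror the ones printed in the example above, and the eight-dimensional vectors stand in for vocabulary-sized embeddings.

```python
def sparsity_stats(embeddings):
    """Fraction of zero entries overall and mean non-zero count per embedding."""
    total = sum(len(row) for row in embeddings)
    nonzero_per_row = [sum(1 for v in row if v != 0.0) for row in embeddings]
    return {
        "sparsity_ratio": 1.0 - sum(nonzero_per_row) / total,
        "active_dims": sum(nonzero_per_row) / len(embeddings),
    }


def sparse_dot(a, b):
    """Dot product of {dimension: weight} dicts; only shared dimensions contribute."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)


# Toy embeddings: mostly zeros, a few weighted dimensions.
toy = [
    [0.0, 1.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]

stats = sparsity_stats(toy)
as_dicts = [{i: v for i, v in enumerate(row) if v != 0.0} for row in toy]
score = sparse_dot(as_dicts[0], as_dicts[1])

print(f"Sparsity: {stats['sparsity_ratio']:.2%}")
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")
print(f"Dot product: {score}")
```

Storing only the non-zero entries is what makes large-scale retrieval cheap here: the dot product costs time proportional to the number of active dimensions rather than the vocabulary size.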

Finetuning Sparse Encoder models is easy and requires only a few lines of code. For more information, see the Training Overview section.
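The characteristics at the start of this section mention hybrid search that combines sparse and dense results. The source does not prescribe how to combine them; one widely used option is reciprocal rank fusion (RRF), sketched below. The function name, the constant `k=60` (a conventional default for RRF), and the two example rankings are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document earns 1 / (k + rank) from every list it appears in,
    with ranks starting at 1; higher totals rank first.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical rankings of four documents from a dense and a sparse retriever.
dense_ranking = [2, 0, 1, 3]
sparse_ranking = [0, 3, 2, 1]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF works on ranks rather than raw scores, which sidesteps the fact that dense cosine similarities and sparse dot products live on very different scales.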

Next Steps

Consider reading one of the following sections next:
