Usage
Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:
- Calculates a fixed-size vector representation (embedding) given texts, images, audio, video, or combinations thereof (depending on the model).
- Embedding calculation is often efficient, embedding similarity calculation is very fast.
- Applicable for a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
- Often used as a first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.
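As an example of that last point, a two-step retrieve-and-rerank pipeline could look like the sketch below; the corpus is made up for illustration, and the cross-encoder model name is one common choice rather than the only option:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer

# Step 1: the bi-encoder retrieves the top-k candidates from the corpus
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = "How do I bake bread at home?"
corpus = [
    "A beginner's guide to baking bread in a home oven.",
    "The history of the printing press.",
    "Tips for kneading and proofing dough.",
]
scores = bi_encoder.similarity(bi_encoder.encode([query]), bi_encoder.encode(corpus))[0]
top_k = scores.argsort(descending=True)[:2]

# Step 2: the cross-encoder (reranker) re-scores each (query, candidate) pair
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[int(i)]) for i in top_k]
print(reranker.predict(pairs))
```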
Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:
```python
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
```
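The similarity() method applies the model's configured similarity function, which is cosine similarity for most models. If you want to verify which function a loaded model uses, the similarity_fn_name attribute exposes it; the snippet below is a small continuation of the example above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The function model.similarity() applies; "cosine" for this model
print(model.similarity_fn_name)

# Embeddings from separate encode() calls can be compared directly
query_embedding = model.encode(["What is the weather like?"])
sentence_embeddings = model.encode([
    "The weather is lovely today.",
    "He drove to the stadium.",
])
print(model.similarity(query_embedding, sentence_embeddings))
```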
Some Sentence Transformer models support inputs beyond text, such as images, audio, or video. You can check which modalities a model supports using the modalities property and the supports() method. The encode() method accepts different input formats depending on the modality:
Tip
Multimodal models require additional dependencies. Install them with e.g. pip install -U "sentence-transformers[image]" for image support. See Installation for all options.
- Text: strings.
- Image: PIL images, file paths, URLs, or numpy/torch arrays.
- Audio: file paths, numpy/torch arrays, dicts with `"array"` and `"sampling_rate"` keys, or (if `torchcodec` is installed) `torchcodec.AudioDecoder` instances.
- Video: file paths, numpy/torch arrays, dicts with `"array"` and `"video_metadata"` keys, or (if `torchcodec` is installed) `torchcodec.VideoDecoder` instances.
- Multimodal dicts: a dict mapping modality names to values, e.g. `{"text": ..., "audio": ...}`. The keys must be `"text"`, `"image"`, `"audio"`, or `"video"`.
- Chat messages: a list of dicts with `"role"` and `"content"` keys, for multimodal models that use an uncommon chat template to combine text and non-text inputs.
The following example loads a multimodal model and computes similarities between text and image embeddings:
```python
from sentence_transformers import SentenceTransformer

# 1. Load a model that supports both text and images
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# 2. Encode images from URLs
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# 3. Encode text queries (one matching + one hard negative per image)
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

# 4. Compute cross-modal similarities
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
# tensor([[0.5115, 0.1078],
#         [0.1999, 0.1108],
#         [0.1255, 0.6749],
#         [0.1283, 0.2704]])
```
For retrieval tasks, encode_query() and encode_document() are the recommended methods. Many embedding models use different prompts or instructions for queries vs. documents, and these methods handle that automatically:
- encode_query() uses the model’s `"query"` prompt (if available) and sets `task="query"`.
- encode_document() uses the first available prompt from `"document"`, `"passage"`, or `"corpus"`, and sets `task="document"`.
These methods accept all the same input types as encode() (text, images, URLs, multimodal dicts, etc.) and pass through all the same parameters. For models without specialized query/document prompts, they behave identically to encode().
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# Encode text queries with the query prompt
query_embeddings = model.encode_query([
    "Find me a photo of a vehicle parked near a building",
    "Show me an image of a pollinating insect",
])

# Encode document images with the document prompt
doc_embeddings = model.encode_document([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.3907, 0.1490],
#         [0.1235, 0.4872]])
```
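The same pattern works for text-only retrieval. Since all-MiniLM-L6-v2 has no specialized query/document prompts, encode_query() and encode_document() in the sketch below behave exactly like encode(); the documents are made up for illustration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query_embeddings = model.encode_query(["How heavy is a bee?"])
doc_embeddings = model.encode_document([
    "Honey bees typically weigh around a tenth of a gram.",
    "Stadium parking is free on weekends.",
])
print(model.similarity(query_embeddings, doc_embeddings))
```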
Tasks and Advanced Usage