Usage
Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:
- Calculates a fixed-size vector representation (embedding) given texts, images, audio, video, or combinations thereof (depending on the model).
- Embedding calculation is often efficient, embedding similarity calculation is very fast.
- Applicable for a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
- Often used as a first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.
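As an example of that last point, a two-step retrieve-and-rerank pipeline could look like the sketch below; the corpus is made up for illustration, and the cross-encoder model name is one common choice rather than the only option:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer

# Step 1: the bi-encoder retrieves the top-k candidates from the corpus
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = "How do I bake bread at home?"
corpus = [
    "A beginner's guide to baking bread in a home oven.",
    "The history of the printing press.",
    "Tips for kneading and proofing dough.",
]
scores = bi_encoder.similarity(bi_encoder.encode([query]), bi_encoder.encode(corpus))[0]
top_k = scores.argsort(descending=True)[:2]

# Step 2: the cross-encoder (reranker) re-scores each (query, candidate) pair
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[int(i)]) for i in top_k]
print(reranker.predict(pairs))
```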
Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:
```python
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
```
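The similarity() method applies the model's configured similarity function, which is cosine similarity for most models. If you want to verify which function a loaded model uses, the similarity_fn_name attribute exposes it; the snippet below is a small continuation of the example above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The function model.similarity() applies; "cosine" for this model
print(model.similarity_fn_name)

# Embeddings from separate encode() calls can be compared directly
query_embedding = model.encode(["What is the weather like?"])
sentence_embeddings = model.encode([
    "The weather is lovely today.",
    "He drove to the stadium.",
])
print(model.similarity(query_embedding, sentence_embeddings))
```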
Some Sentence Transformer models support inputs beyond text, such as images, audio, or video. You can check which modalities a model supports using the modalities property and the supports() method. The encode() method accepts different input formats depending on the modality:
Tip
Multimodal models require additional dependencies. Install them with e.g. pip install -U "sentence-transformers[image]" for image support. See Installation for all options.
- Text: strings.
- Image: PIL images, file paths, URLs, or numpy/torch arrays.
- Audio: file paths, numpy/torch arrays, dicts with `"array"` and `"sampling_rate"` keys, or (if `torchcodec` is installed) `torchcodec.AudioDecoder` instances.
- Video: file paths, numpy/torch arrays, dicts with `"array"` and `"video_metadata"` keys, or (if `torchcodec` is installed) `torchcodec.VideoDecoder` instances.
- Multimodal dicts: a dict mapping modality names to values, e.g. `{"text": ..., "audio": ...}`. The keys must be `"text"`, `"image"`, `"audio"`, or `"video"`.
- Chat messages: a list of dicts with `"role"` and `"content"` keys, for multimodal models that use an uncommon chat template to combine text and non-text inputs.
The following example loads a multimodal model and computes similarities between text and image embeddings:
```python
from sentence_transformers import SentenceTransformer

# 1. Load a model that supports both text and images
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# 2. Encode images from URLs
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# 3. Encode text queries (one matching + one hard negative per image)
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

# 4. Compute cross-modal similarities
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
# tensor([[0.5115, 0.1078],
#         [0.1999, 0.1108],
#         [0.1255, 0.6749],
#         [0.1283, 0.2704]])
```
For retrieval tasks, encode_query() and encode_document() are the recommended methods. Many embedding models use different prompts or instructions for queries vs. documents, and these methods handle that automatically:
- encode_query() uses the model’s `"query"` prompt (if available) and sets `task="query"`.
- encode_document() uses the first available prompt from `"document"`, `"passage"`, or `"corpus"`, and sets `task="document"`.
These methods accept all the same input types as encode() (text, images, URLs, multimodal dicts, etc.) and pass through all the same parameters. For models without specialized query/document prompts, they behave identically to encode().
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# Encode text queries with the query prompt
query_embeddings = model.encode_query([
    "Find me a photo of a vehicle parked near a building",
    "Show me an image of a pollinating insect",
])

# Encode document images with the document prompt
doc_embeddings = model.encode_document([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.3907, 0.1490],
#         [0.1235, 0.4872]])
```
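The same pattern works for text-only retrieval. Since all-MiniLM-L6-v2 has no specialized query/document prompts, encode_query() and encode_document() in the sketch below behave exactly like encode(); the documents are made up for illustration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query_embeddings = model.encode_query(["How heavy is a bee?"])
doc_embeddings = model.encode_document([
    "Honey bees typically weigh around a tenth of a gram.",
    "Stadium parking is free on weekends.",
])
print(model.similarity(query_embeddings, doc_embeddings))
```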
Tasks and Advanced Usage