Usage — Sentence Transformers documentation

Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

  1. Calculates a fixed-size vector representation (embedding) given texts, images, audio, video, or combinations thereof (depending on the model).
  2. Embedding calculation is often efficient, embedding similarity calculation is very fast.
  3. Applicable for a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.
  4. Often used as a first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.
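The speed claim in point 2 follows from the math: once texts are embedded, all pairwise cosine similarities reduce to a single matrix multiplication over L2-normalized vectors. A minimal numpy sketch (random vectors stand in for real model embeddings; the shapes mirror the all-MiniLM-L6-v2 example below):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for model.encode(...) output: 3 embeddings of dimension 384
embeddings = rng.standard_normal((3, 384))

# L2-normalize so that a dot product equals cosine similarity
normed = embeddings / np.linalg.norm(normed_axis := embeddings, axis=1, keepdims=True)

# All pairwise cosine similarities in one matrix multiply
similarities = normed @ normed.T
print(similarities.shape)  # (3, 3)
```

The diagonal is 1.0 (each vector compared with itself) and the matrix is symmetric, exactly like the `model.similarity()` outputs shown later on this page.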

Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:

```python
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
```
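Entry `[i, j]` of this matrix is the similarity between sentences `i` and `j`. As a quick illustration of how you might consume it, here is a plain-numpy sketch (reusing the printed values rather than rerunning the model) that finds each sentence's nearest neighbour:

```python
import numpy as np

# Similarity matrix printed above, copied as plain numpy values
similarities = np.array([
    [1.0000, 0.6660, 0.1046],
    [0.6660, 1.0000, 0.1411],
    [0.1046, 0.1411, 1.0000],
])

# Mask the diagonal: every sentence is trivially most similar to itself
masked = similarities - np.eye(3)
nearest = masked.argmax(axis=1)
print(nearest)  # [1 0 1]
```

As expected, the two weather sentences pair up with each other, and the stadium sentence's best (if weak) match is the sunny one.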

Some Sentence Transformer models support inputs beyond text, such as images, audio, or video. You can check which modalities a model supports using the modalities property and the supports() method. The encode() method accepts different input formats depending on the modality:

Tip

Multimodal models require additional dependencies. Install them with e.g. pip install -U "sentence-transformers[image]" for image support. See Installation for all options.

The following example loads a multimodal model and computes similarities between text and image embeddings:

```python
from sentence_transformers import SentenceTransformer

# 1. Load a model that supports both text and images
model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# 2. Encode images from URLs
img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# 3. Encode text queries (one matching + one hard negative per image)
text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

# 4. Compute cross-modal similarities
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
# tensor([[0.5115, 0.1078],
#         [0.1999, 0.1108],
#         [0.1255, 0.6749],
#         [0.1283, 0.2704]])
```
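In this matrix, rows correspond to the four texts and columns to the two images, so taking the argmax down each column picks the best caption for each image. A plain-numpy sketch over the printed values (no model needed):

```python
import numpy as np

# Cross-modal similarity matrix printed above: rows = texts, columns = images
similarities = np.array([
    [0.5115, 0.1078],
    [0.1999, 0.1108],
    [0.1255, 0.6749],
    [0.1283, 0.2704],
])

# Best-matching text for each image: argmax down each column
best_text = similarities.argmax(axis=0)
print(best_text)  # [0 2]
```

The matching captions (green car, bee on a flower) win over the hard negatives (red car, wasp) for both images.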

For retrieval tasks, encode_query() and encode_document() are the recommended methods. Many embedding models use different prompts or instructions for queries vs. documents, and these methods handle that automatically.

These methods accept all the same input types as encode() (text, images, URLs, multimodal dicts, etc.) and pass through all the same parameters. For models without specialized query/document prompts, they behave identically to encode().

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B", revision="refs/pr/23")

# Encode text queries with the query prompt
query_embeddings = model.encode_query([
    "Find me a photo of a vehicle parked near a building",
    "Show me an image of a pollinating insect",
])

# Encode document screenshots with the document prompt
doc_embeddings = model.encode_document([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

# Compute similarities
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.3907, 0.1490],
#         [0.1235, 0.4872]])
```
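Here rows are queries and columns are documents, so retrieval is just sorting each row by score. A plain-numpy sketch over the printed values, ranking documents per query:

```python
import numpy as np

# Query-document similarity matrix printed above
similarities = np.array([
    [0.3907, 0.1490],
    [0.1235, 0.4872],
])

# For each query, rank documents from most to least similar
ranking = np.argsort(-similarities, axis=1)
print(ranking.tolist())  # [[0, 1], [1, 0]]
```

Query 0 ("vehicle parked near a building") retrieves the car image first; query 1 ("pollinating insect") retrieves the bee. In a real two-step pipeline, the top-k documents from such a ranking would then be passed to a Cross-Encoder for re-ranking.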

Tasks and Advanced Usage