Modules — Sentence Transformers documentation

sentence_transformers.sentence_transformer.modules defines different building blocks, a.k.a. Modules, that can be used to create SentenceTransformer models from scratch. For more details, see Creating Custom Models.

See also the modules from sentence_transformers.base.modules in Base > Modules.

Main Modules

class sentence_transformers.sentence_transformer.modules.Pooling(embedding_dimension: int, pooling_mode: Literal['cls', 'max', 'mean', 'mean_sqrt_len_tokens', 'weightedmean', 'lasttoken'] | tuple[Literal['cls', 'max', 'mean', 'mean_sqrt_len_tokens', 'weightedmean', 'lasttoken'], ...] | list[Literal['cls', 'max', 'mean', 'mean_sqrt_len_tokens', 'weightedmean', 'lasttoken']] = 'mean', include_prompt: bool = True)[source]

Performs pooling on token embeddings to produce fixed-size sentence embeddings.

Generates a fixed-size sentence embedding from variable-length token embeddings. Supports multiple pooling strategies that can also be combined by passing a tuple of mode names.

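Independent of the library, the default "mean" pooling mode can be illustrated with a minimal plain-Python sketch: average the token embeddings, counting only the non-padding positions indicated by the attention mask.

```python
# Minimal sketch of masked mean pooling (the "mean" mode):
# average token embeddings where the attention mask is 1.

def mean_pool(token_embeddings, attention_mask):
    """token_embeddings: list of token vectors; attention_mask: list of 0/1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    return [s / count for s in summed]

tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last token is padding
print(mean_pool(tokens, [1, 1, 0]))  # [2.0, 3.0]
```

In the real module, this runs batched on tensors; the sketch only shows the arithmetic.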

class sentence_transformers.sentence_transformer.modules.Normalize[source]

This layer normalizes embeddings to unit length (an L2 norm of 1), so that dot product and cosine similarity coincide.
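The underlying operation is simple and can be sketched without the library: divide each embedding by its L2 norm.

```python
# Sketch of L2 normalization, the operation Normalize applies
# to each embedding in the batch.
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```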

class sentence_transformers.sentence_transformer.modules.StaticEmbedding(tokenizer: Tokenizer | PreTrainedTokenizerFast, embedding_weights: ndarray | Tensor | None = None, embedding_dim: int | None = None, **kwargs)[source]

Initializes the StaticEmbedding model given a tokenizer. The model is a simple embedding bag model that takes the mean of trained per-token embeddings to compute text embeddings.


Tip

Because this module is extremely efficient, the overhead of moving inputs to the GPU can exceed the actual computation time. Therefore, consider using a CPU device for inference and training.

Example:

from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.modules import StaticEmbedding
from tokenizers import Tokenizer

# Pre-distilled embeddings:
static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M")
# or distill your own embeddings:
static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cuda")
# or start with randomized embeddings:
tokenizer = Tokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
static_embedding = StaticEmbedding(tokenizer, embedding_dim=512)

model = SentenceTransformer(modules=[static_embedding])

embeddings = model.encode([
    "What are Pandas?",
    "The giant panda, also known as the panda bear or simply the panda, is a bear native to south central China.",
])
similarity = model.similarity(embeddings[0], embeddings[1])
# tensor([[0.8093]]) if you use potion-base-8M
# tensor([[0.6234]]) if you use the distillation method
# tensor([[-0.0693]]) if you use randomized embeddings, for example


classmethod from_distillation(model_name: str, vocabulary: list[str] | None = None, device: str | None = None, pca_dims: int | None = 256, apply_zipf: bool = True, sif_coefficient: float | None = 0.0001, token_remove_pattern: str | None = '\\[unused\\d+\\]', quantize_to: str = 'float32', use_subword: bool = True, **kwargs: Any) → StaticEmbedding[source]

Creates a StaticEmbedding instance from a distillation process using the model2vec package.


Returns:

An instance of StaticEmbedding initialized with the distilled model’s tokenizer and embedding weights.

Return type:

StaticEmbedding

Raises:

ImportError – If the model2vec package is not installed.

classmethod from_model2vec(model_id_or_path: str) → StaticEmbedding[source]

Create a StaticEmbedding instance from a model2vec model. This method loads a pre-trained model2vec model and extracts the embedding weights and tokenizer to create a StaticEmbedding instance.

Parameters:

model_id_or_path (str) – The identifier or path to the pre-trained model2vec model.

Returns:

An instance of StaticEmbedding initialized with the tokenizer and embedding weights from the model2vec model.

Return type:

StaticEmbedding

Raises:

ImportError – If the model2vec package is not installed.

Further Modules

class sentence_transformers.sentence_transformer.modules.BoW(vocab: list[str], word_weights: dict[str, float] = {}, unknown_word_weight: float = 1, cumulative_term_frequency: bool = True)[source]

Implements a Bag-of-Words (BoW) model to derive sentence embeddings.

A weight can be assigned to each word, allowing the generation of TF-IDF vectors. The output vector has the size of the vocabulary.
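The idea can be sketched in plain Python, independent of the module's actual implementation: one dimension per vocabulary word, with optional per-word weights and cumulative term frequency.

```python
# Conceptual sketch of a bag-of-words sentence vector: one
# dimension per vocabulary word, optionally weighted (e.g. IDF),
# with counts accumulated when cumulative=True.

def bow_vector(tokens, vocab, word_weights=None, cumulative=True):
    weights = word_weights or {}
    index = {word: i for i, word in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for token in tokens:
        if token in index:
            weight = weights.get(token, 1.0)
            if cumulative:
                vec[index[token]] += weight
            else:
                vec[index[token]] = weight
    return vec

vocab = ["panda", "bear", "china"]
print(bow_vector(["panda", "bear", "panda"], vocab))  # [2.0, 1.0, 0.0]
```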

class sentence_transformers.sentence_transformer.modules.CNN(in_embedding_dimension: int, out_channels: int = 256, kernel_sizes: list[int] = [1, 3, 5], stride_sizes: list[int] = None)[source]

CNN layer with multiple kernel sizes applied over the word embeddings.
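As a rough, library-independent sketch of the idea: a 1D convolution slides a kernel over the token sequence with zero padding, producing one output per token; the module concatenates the outputs of several kernel sizes.

```python
# Naive 1D convolution with zero padding ("same"-length output),
# sketching what each kernel size in the CNN module computes over
# a single embedding dimension.

def conv1d_same(values, kernel):
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + values + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(values))
    ]

# A sum kernel of size 3 over three positions:
print(conv1d_same([3.0, 6.0, 9.0], [1.0, 1.0, 1.0]))  # [9.0, 18.0, 15.0]
```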

class sentence_transformers.sentence_transformer.modules.LSTM(embedding_dimension: int, hidden_dim: int, num_layers: int = 1, dropout: float = 0, bidirectional: bool = True)[source]

Bidirectional LSTM running over word embeddings.

class sentence_transformers.sentence_transformer.modules.WeightedLayerPooling(embedding_dimension, num_hidden_layers: int = 12, layer_start: int = 4, layer_weights=None)[source]

Token embeddings are computed as a weighted mean of their representations across different hidden layers.
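The core computation can be sketched for a single token: combine its hidden states from several layers using a learned (or given) weight per layer, then divide by the weight sum.

```python
# Sketch of weighted layer pooling for one token: the output is
# the weighted mean of the token's hidden states across layers.

def weighted_layer_pool(layer_states, layer_weights):
    """layer_states: one token vector per layer; layer_weights: one weight per layer."""
    total = sum(layer_weights)
    dim = len(layer_states[0])
    out = [0.0] * dim
    for state, weight in zip(layer_states, layer_weights):
        for i, value in enumerate(state):
            out[i] += weight * value
    return [v / total for v in out]

# Two layers with equal weights reduce to a plain mean:
print(weighted_layer_pool([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))  # [2.0, 3.0]
```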

class sentence_transformers.sentence_transformer.modules.WordEmbeddings(tokenizer: WordTokenizer | PreTrainedTokenizerBase, embedding_weights, update_embeddings: bool = False, max_seq_length: int = 1000000)[source]

Subclass of sentence_transformers.base.modules.Module, the base class for all input modules in the Sentence Transformers library, i.e. modules that are used to process inputs and optionally also perform processing in the forward pass.

This class provides a common interface for all input modules, including methods for loading and saving the module’s configuration and weights, as well as input processing. It also provides a method for performing the forward pass of the module.

Two abstract methods are inherited from Module and must be implemented by subclasses; subclasses should additionally override certain methods, and may optionally need to override others. To assist with loading and saving the module, several utility methods and class variables are provided.

class sentence_transformers.sentence_transformer.modules.WordWeights(vocab: list[str], word_weights: dict[str, float], unknown_word_weight: float = 1)[source]

This module weights word embeddings, for example with IDF values.

Initializes the WordWeights class.

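The effect can be sketched in plain Python: scale each word's embedding by its weight before pooling, falling back to an unknown-word weight for out-of-vocabulary words. The IDF values below are hypothetical, for illustration only.

```python
# Sketch of per-word weighting as performed before pooling:
# each word's embedding is scaled by its weight (e.g. IDF),
# with a fallback weight for unknown words.

def weight_embeddings(tokens, embeddings, word_weights, unknown_word_weight=1.0):
    weighted = []
    for token, vec in zip(tokens, embeddings):
        weight = word_weights.get(token, unknown_word_weight)
        weighted.append([weight * v for v in vec])
    return weighted

idf = {"the": 0.1, "panda": 4.0}  # hypothetical IDF values
print(weight_embeddings(["the", "panda"], [[1.0, 1.0], [1.0, 1.0]], idf))
# [[0.1, 0.1], [4.0, 4.0]]
```

Rare, informative words ("panda") thus contribute far more to the pooled sentence embedding than frequent ones ("the").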