Losses — Sentence Transformers documentation

sentence_transformers.losses defines different loss functions that can be used to fine-tune embedding models on training data. The choice of loss function plays a critical role when fine-tuning the model. It determines how well our embedding model will work for the specific downstream task.

Sadly, there is no “one size fits all” loss function. Which loss function is suitable depends on the available training data and on the target task. Consider checking out the Loss Overview to help narrow down your choice of loss function(s).

BatchAllTripletLoss

class sentence_transformers.losses.BatchAllTripletLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>, margin: float = 5)[source]

BatchAllTripletLoss takes a batch with (sentence, label) pairs and computes the loss for all possible, valid triplets, i.e., anchor and positive must have the same label, anchor and negative a different label. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class.

Parameters:

References

Requirements:

  1. Each sentence must be labeled with a class.
  2. Your dataset must contain at least 2 examples per label class.

Inputs:

Texts | Labels
single sentences | class

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")

# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})
loss = losses.BatchAllTripletLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

BatchHardSoftMarginTripletLoss

class sentence_transformers.losses.BatchHardSoftMarginTripletLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>)[source]

BatchHardSoftMarginTripletLoss takes a batch with (sentence, label) pairs and computes the loss for all possible, valid triplets, i.e., anchor and positive must have the same label, anchor and negative a different label. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. This soft-margin variant does not require setting a margin.

Parameters:

Definitions:

Easy triplets:

Triplets which have a loss of 0 because distance(anchor, positive) + margin < distance(anchor, negative).

Hard triplets:

Triplets where the negative is closer to the anchor than the positive, i.e., distance(anchor, negative) < distance(anchor, positive).

Semi-hard triplets:

Triplets where the negative is not closer to the anchor than the positive, but which still have a positive loss, i.e., distance(anchor, positive) < distance(anchor, negative) < distance(anchor, positive) + margin.
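As an illustration of these three categories, the following sketch (plain Python with hypothetical distance values, not library code) classifies a triplet from its anchor-positive and anchor-negative distances, using the default margin of 5 from the signature above:

```python
def triplet_category(d_ap: float, d_an: float, margin: float = 5.0) -> str:
    """Classify a triplet given distance(anchor, positive) and distance(anchor, negative)."""
    if d_ap + margin < d_an:
        return "easy"       # loss is already 0
    if d_an < d_ap:
        return "hard"       # the negative is closer than the positive
    return "semi-hard"      # correct ordering, but still within the margin

print(triplet_category(1.0, 7.0))  # easy
print(triplet_category(3.0, 2.0))  # hard
print(triplet_category(2.0, 4.0))  # semi-hard
```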

References

Requirements:

  1. Each sentence must be labeled with a class.
  2. Your dataset must contain at least 2 examples per label class.
  3. Your dataset should contain hard positives and negatives.

Inputs:

Texts | Labels
single sentences | class

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")

# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})
loss = losses.BatchHardSoftMarginTripletLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

BatchHardTripletLoss

class sentence_transformers.losses.BatchHardTripletLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>, margin: float = 5)[source]

BatchHardTripletLoss takes a batch with (sentence, label) pairs and computes the loss for all possible, valid triplets, i.e., anchor and positive must have the same label, anchor and negative a different label. It then looks for the hardest positive and the hardest negative. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class.

Parameters:

Definitions:

Easy triplets:

Triplets which have a loss of 0 because distance(anchor, positive) + margin < distance(anchor, negative).

Hard triplets:

Triplets where the negative is closer to the anchor than the positive, i.e., distance(anchor, negative) < distance(anchor, positive).

Semi-hard triplets:

Triplets where the negative is not closer to the anchor than the positive, but which still have a positive loss, i.e., distance(anchor, positive) < distance(anchor, negative) < distance(anchor, positive) + margin.

References

Requirements:

  1. Each sentence must be labeled with a class.
  2. Your dataset must contain at least 2 examples per label class.
  3. Your dataset should contain hard positives and negatives.

Inputs:

Texts | Labels
single sentences | class

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")

# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})
loss = losses.BatchHardTripletLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

BatchSemiHardTripletLoss

class sentence_transformers.losses.BatchSemiHardTripletLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>, margin: float = 5)[source]

BatchSemiHardTripletLoss takes a batch with (sentence, label) pairs and computes the loss for all possible, valid triplets, i.e., anchor and positive must have the same label, anchor and negative a different label. It then looks for the semi-hard positives and negatives. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class.

Parameters:

Definitions:

Easy triplets:

Triplets which have a loss of 0 because distance(anchor, positive) + margin < distance(anchor, negative).

Hard triplets:

Triplets where the negative is closer to the anchor than the positive, i.e., distance(anchor, negative) < distance(anchor, positive).

Semi-hard triplets:

Triplets where the negative is not closer to the anchor than the positive, but which still have a positive loss, i.e., distance(anchor, positive) < distance(anchor, negative) < distance(anchor, positive) + margin.

References

Requirements:

  1. Each sentence must be labeled with a class.
  2. Your dataset must contain at least 2 examples per label class.
  3. Your dataset should contain semi-hard positives and negatives.

Inputs:

Texts | Labels
single sentences | class

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")

# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})
loss = losses.BatchSemiHardTripletLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

ContrastiveLoss

class sentence_transformers.losses.ContrastiveLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function SiameseDistanceMetric.COSINE_DISTANCE>, margin: float = 0.5, size_average: bool = True)[source]

Contrastive loss. Expects as input two texts and a label of either 0 or 1. If the label == 1, then the distance between the two embeddings is reduced. If the label == 0, then the distance between the embeddings is increased.
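As a rough sketch of the standard contrastive loss formulation (a simplified per-pair illustration, not the library's exact implementation), the loss can be written as label · d² + (1 − label) · max(0, margin − d)², where d is the embedding distance:

```python
def contrastive_loss(distance: float, label: int, margin: float = 0.5) -> float:
    # label == 1: penalize any remaining distance between the pair
    # label == 0: penalize only if the pair is closer than the margin
    return 0.5 * (label * distance**2 + (1 - label) * max(0.0, margin - distance) ** 2)

# A positive pair at distance 0.4 still incurs loss; a negative pair
# already farther apart than the margin incurs none.
print(contrastive_loss(0.4, label=1) > 0)    # True
print(contrastive_loss(0.6, label=0) == 0.0)  # True
```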

Parameters:

References

Requirements:

  1. (anchor, positive/negative) pairs

Inputs:

Texts | Labels
(anchor, positive/negative) pairs | 1 if positive, 0 if negative

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "label": [1, 0],
})
loss = losses.ContrastiveLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

OnlineContrastiveLoss

class sentence_transformers.losses.OnlineContrastiveLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function SiameseDistanceMetric.COSINE_DISTANCE>, margin: float = 0.5)[source]

This Online Contrastive loss is similar to ContrastiveLoss, but it selects hard positive pairs (positives that are far apart) and hard negative pairs (negatives that are close) and computes the loss only for these pairs. This loss often yields better performance than ContrastiveLoss.
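The hard-pair selection can be sketched as follows (a simplified illustration with plain floats, not the library's tensor implementation): negatives count as hard when they are closer than the farthest positive, and positives count as hard when they are farther apart than the closest negative:

```python
def select_hard_pairs(distances: list, labels: list):
    """Keep only hard pairs from a batch of (distance, label) pairs."""
    pos = [d for d, label in zip(distances, labels) if label == 1]
    neg = [d for d, label in zip(distances, labels) if label == 0]
    hard_negatives = [d for d in neg if d < max(pos)]  # closer than the farthest positive
    hard_positives = [d for d in pos if d > min(neg)]  # farther than the closest negative
    return hard_positives, hard_negatives

# Toy batch of 4 pairs: two positives (distances 0.1, 0.9), two negatives (0.2, 0.8)
hard_pos, hard_neg = select_hard_pairs([0.1, 0.9, 0.2, 0.8], [1, 1, 0, 0])
print(hard_pos, hard_neg)  # [0.9] [0.2, 0.8]
```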

Parameters:

References

Requirements:

  1. (anchor, positive/negative) pairs
  2. Data should include hard positives and hard negatives

Inputs:

Texts | Labels
(anchor, positive/negative) pairs | 1 if positive, 0 if negative

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "label": [1, 0],
})
loss = losses.OnlineContrastiveLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

ContrastiveTensionLoss

class sentence_transformers.losses.ContrastiveTensionLoss(model: SentenceTransformer)[source]

This loss expects only single sentences, without any labels. Positive and negative pairs are automatically created via random sampling, such that a positive pair consists of two identical sentences and a negative pair consists of two different sentences. An independent copy of the encoder model is created, which is used for encoding the first sentence of each pair. The original encoder model encodes the second sentence. The embeddings are compared and scored using the generated labels (1 if positive, 0 if negative) using the binary cross entropy objective.

Generally, ContrastiveTensionLossInBatchNegatives is recommended over this loss, as it gives a stronger training signal.

Parameters:

model – SentenceTransformer model

References

Inputs:

Texts | Labels
(sentence_A, sentence_B) pairs | None

Relations:

Example

Using a dataset with sentence pairs that sometimes are identical (positive pairs) and sometimes different (negative pairs):

import random
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer

model = SentenceTransformer('all-MiniLM-L6-v2')

# The dataset pairs must sometimes contain identical sentences (positive pairs)
# and sometimes different sentences (negative pairs).
train_dataset = Dataset.from_dict({
    "sentence_A": [
        "It's nice weather outside today.",
        "He drove to work.",
        "It's so sunny.",
    ],
    "sentence_B": [
        "It's nice weather outside today.",
        "He drove to work.",
        "It's so sunny.",
    ],
})
train_loss = losses.ContrastiveTensionLoss(model=model)

args = SentenceTransformerTrainingArguments(
    num_train_epochs=10,
    per_device_train_batch_size=32,
    eval_steps=0.1,
    logging_steps=0.01,
    learning_rate=5e-5,
    save_strategy="no",
    fp16=True,
)
trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=train_loss)
trainer.train()

With a dataset of single sentences, generating the pairs on the fly:

import random
from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer

model = SentenceTransformer('all-MiniLM-L6-v2')
train_dataset = Dataset.from_dict({
    "text1": [
        "It's nice weather outside today.",
        "He drove to work.",
        "It's so sunny.",
    ]
})
sentences = train_dataset['text1']

def to_ct_pairs(sample, pos_neg_ratio=8):
    # With probability 1/pos_neg_ratio, duplicate the sentence (positive pair);
    # otherwise, pair it with a random sentence (negative pair).
    if random.random() < 1 / pos_neg_ratio:
        sample["text2"] = sample["text1"]
    else:
        sample["text2"] = random.choice(sentences)
    return sample

pos_neg_ratio = 8  # 1 positive pair for 7 negative pairs
train_dataset = train_dataset.map(to_ct_pairs, fn_kwargs={"pos_neg_ratio": pos_neg_ratio})

train_loss = losses.ContrastiveTensionLoss(model=model)

args = SentenceTransformerTrainingArguments(
    num_train_epochs=10,
    per_device_train_batch_size=32,
    eval_steps=0.1,
    logging_steps=0.01,
    learning_rate=5e-5,
    save_strategy="no",
    fp16=True,
)
trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=train_loss)
trainer.train()


ContrastiveTensionLossInBatchNegatives

class sentence_transformers.losses.ContrastiveTensionLossInBatchNegatives(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, scale: float = 20.0, similarity_fct=<function cos_sim>)[source]

This loss expects only single sentences, without any labels. Positive and negative pairs are automatically created via random sampling, such that a positive pair consists of two identical sentences and a negative pair consists of two different sentences. An independent copy of the encoder model is created, which is used for encoding the first sentence of each pair. The original encoder model encodes the second sentence. Unlike ContrastiveTensionLoss, this loss uses the batch negative sampling strategy, i.e. the negative pairs are sampled from the batch. Using in-batch negative sampling gives a stronger training signal than the original ContrastiveTensionLoss. The performance usually increases with increasing batch sizes.

The training datasets.Dataset must contain one text column.

Parameters:

References

Relations:

Inputs:

Texts | Labels
single sentences | none

Example

from datasets import Dataset
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.trainer import SentenceTransformerTrainer

model = SentenceTransformer('all-MiniLM-L6-v2')

train_dataset = Dataset.from_dict({
    "sentence": [
        "It's nice weather outside today.",
        "He drove to work.",
        "It's so sunny.",
    ]
})

train_loss = losses.ContrastiveTensionLossInBatchNegatives(model=model)

args = SentenceTransformerTrainingArguments(
    num_train_epochs=10,
    per_device_train_batch_size=32,
    eval_steps=0.1,
    logging_steps=0.01,
    learning_rate=5e-5,
    save_strategy="no",
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
)
trainer.train()

CoSENTLoss

class sentence_transformers.losses.CoSENTLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, scale: float = 20.0, similarity_fct=<function pairwise_cos_sim>)[source]

This class implements CoSENT (Consistent SENTence embedding) loss. It expects that each of the InputExamples consists of a pair of texts and a float-valued label representing the expected similarity score between the pair.

It computes the following loss function:

loss = log(1 + Σ exp(s(k, l) − s(i, j))), where (i, j) and (k, l) are any of the input pairs in the batch such that the expected similarity of (i, j) is greater than that of (k, l). The summation is over all pairs of input pairs in the batch that satisfy this condition.

Anecdotal experiments show that this loss function produces a more powerful training signal than CosineSimilarityLoss, resulting in faster convergence and a final model with superior performance. Consequently, CoSENTLoss may be used as a drop-in replacement for CosineSimilarityLoss in any training script.
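The loss above can be sketched numerically (a simplified per-batch computation, with the scale factor from the signature; not the library's tensor implementation):

```python
import math

def cosent_loss(pred_sims: list, gold_scores: list, scale: float = 20.0) -> float:
    # Sum exp(scale * (s(k, l) - s(i, j))) over every ordered pair of input
    # pairs where the gold score of (i, j) is strictly greater than (k, l)'s.
    total = sum(
        math.exp(scale * (s_kl - s_ij))
        for s_ij, g_ij in zip(pred_sims, gold_scores)
        for s_kl, g_kl in zip(pred_sims, gold_scores)
        if g_ij > g_kl
    )
    return math.log1p(total)

# Predictions that match the gold ordering give a much smaller loss
# than predictions that invert it.
print(cosent_loss([0.9, 0.1], [1.0, 0.0]) < cosent_loss([0.1, 0.9], [1.0, 0.0]))  # True
```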

Parameters:

References

Requirements:

Inputs:

Texts | Labels
(sentence_A, sentence_B) pairs | float similarity score

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "score": [1.0, 0.3],
})
loss = losses.CoSENTLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

AnglELoss

class sentence_transformers.losses.AnglELoss(model: SentenceTransformer, scale: float = 20.0)[source]

This class implements AnglE (Angle Optimized) loss. This is a modification of CoSENTLoss, designed to address the following issue: The cosine function’s gradient approaches 0 as the wave approaches the top or bottom of its form. This can hinder the optimization process, so AnglE proposes to instead optimize the angle difference in complex space in order to mitigate this effect.

It expects that each of the inputs consists of a pair of texts and a float valued label, representing the expected similarity score between the pair. Alternatively, it can also process triplet or n-tuple inputs consisting of an anchor, a positive, and one or more negatives, which will be converted to pairwise comparisons with labels of 1 for the positive pairs and 0 for the negatives.

It computes the following loss function:

loss = log(1 + Σ exp(s(k, l) − s(i, j))), where (i, j) and (k, l) are any of the input pairs in the batch such that the expected similarity of (i, j) is greater than that of (k, l). The summation is over all pairs of input pairs in the batch that satisfy this condition. This is the same as CoSENTLoss, with a different similarity function.

It is recommended to use this loss in conjunction with MultipleNegativesRankingLoss, as done in the original paper.

Parameters:

References

Requirements:

Inputs:

Texts | Labels
(sentence_A, sentence_B) pairs | float similarity score
(anchor, positive, negative) triplets | none
(anchor, positive, negative_1, …, negative_n) | none

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "score": [1.0, 0.3],
})
loss = losses.AnglELoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

CosineSimilarityLoss

SBERT Siamese Network Architecture

For each sentence pair, we pass sentence A and sentence B through our network, which yields the embeddings u and v. The similarity of these embeddings is computed using cosine similarity and the result is compared to the gold similarity score.

This allows our network to be fine-tuned to recognize the similarity of sentences.

class sentence_transformers.losses.CosineSimilarityLoss(model: SentenceTransformer, loss_fct: Module = MSELoss(), cos_score_transformation: Module = Identity())[source]

CosineSimilarityLoss expects that the InputExamples consist of two texts and a float label. It computes the vectors u = model(sentence_A) and v = model(sentence_B) and measures the cosine similarity between the two. By default, it minimizes the following loss: ||input_label − cos_score_transformation(cosine_sim(u, v))||_2.
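In plain terms, the default objective is the squared error between the cosine similarity of the two embeddings and the gold score; a minimal sketch with toy vectors (not the library's tensor implementation):

```python
def cosine_similarity_loss(u: list, v: list, label: float) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    cos = dot / (norm_u * norm_v)
    # Squared error between the predicted cosine similarity and the gold score
    return (label - cos) ** 2

# Identical embeddings with a gold score of 1.0 incur zero loss
print(cosine_similarity_loss([1.0, 0.0], [1.0, 0.0], label=1.0))  # 0.0
```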

Parameters:

References

Requirements:

  1. Sentence pairs with corresponding similarity scores in range [0, 1]

Inputs:

Texts | Labels
(sentence_A, sentence_B) pairs | float similarity score

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": ["It's nice weather outside today.", "He drove to work."],
    "sentence2": ["It's so sunny.", "She walked to the store."],
    "score": [1.0, 0.3],
})
loss = losses.CosineSimilarityLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

DenoisingAutoEncoderLoss

class sentence_transformers.losses.DenoisingAutoEncoderLoss(model: SentenceTransformer, decoder_name_or_path: str | None = None, tie_encoder_decoder: bool = True)[source]

This loss expects as input pairs of damaged sentences and the corresponding original sentences. During training, the decoder reconstructs the original sentences from the encoded sentence embeddings. The argument 'decoder_name_or_path' indicates the pretrained model (supported by Hugging Face) to be used as the decoder. Since the decoding process is involved, the decoder should have a class called XXXLMHead (in the context of Hugging Face's Transformers). The 'tie_encoder_decoder' flag indicates whether to tie the trainable parameters of the encoder and decoder, which has been shown to benefit model performance while limiting the amount of required memory. The 'tie_encoder_decoder' flag only works when the encoder and decoder share the same architecture.

The data generation process (i.e. the ‘damaging’ process) has already been implemented in DenoisingAutoEncoderDataset, allowing you to only provide regular sentences.

Parameters:

References

Requirements:

  1. The decoder should have a class called XXXLMHead (in the context of Hugging Face’s Transformers)
  2. Should use a large corpus

Inputs:

Texts | Labels
(damaged_sentence, original_sentence) pairs | none
sentence fed through DenoisingAutoEncoderDataset | none

Example

import random
from datasets import Dataset
from nltk import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import DenoisingAutoEncoderLoss
from sentence_transformers.trainer import SentenceTransformerTrainer

model_name = "bert-base-cased"
model = SentenceTransformer(model_name)

def noise_transform(batch, del_ratio=0.6):
    texts = batch["text"]
    noisy_texts = []
    for text in texts:
        words = word_tokenize(text)
        n = len(words)
        if n == 0:
            noisy_texts.append(text)
            continue

        # Delete each word with probability del_ratio
        kept_words = [word for word in words if random.random() > del_ratio]
        # Guarantee that at least one word remains
        if len(kept_words) == 0:
            noisy_texts.append(random.choice(words))
            continue

        noisy_texts.append(TreebankWordDetokenizer().detokenize(kept_words))
    return {"noisy": noisy_texts, "text": texts}

train_sentences = [
    "First training sentence",
    "Second training sentence",
    "Third training sentence",
    "Fourth training sentence",
]

train_dataset = Dataset.from_dict({"text": train_sentences})
train_dataset.set_transform(transform=lambda batch: noise_transform(batch), columns=["text"], output_all_columns=True)
train_loss = DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=train_loss)
trainer.train()

GISTEmbedLoss

class sentence_transformers.losses.GISTEmbedLoss(model: SentenceTransformer, guide: SentenceTransformer, temperature: float = 0.01, margin_strategy: Literal['absolute', 'relative'] = 'absolute', margin: float = 0.0, contrast_anchors: bool = True, contrast_positives: bool = True, gather_across_devices: bool = False)[source]

This loss is used to train a SentenceTransformer model using the GISTEmbed algorithm. It takes a model and a guide model as input, and uses the guide model to guide the in-batch negative sample selection. The cosine similarity is used to compute the loss and the temperature parameter is used to scale the cosine similarities.

You can apply different false-negative filtering strategies to discard hard negatives that are too similar to the positive. Two strategies are supported:

Parameters:

References

Requirements:

  1. (anchor, positive, negative) triplets
  2. (anchor, positive) pairs

Inputs:

Texts | Labels
(anchor, positive, negative) triplets | none
(anchor, positive) pairs | none

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
guide = SentenceTransformer("all-MiniLM-L6-v2")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.GISTEmbedLoss(model, guide)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

CachedGISTEmbedLoss

class sentence_transformers.losses.CachedGISTEmbedLoss(model: SentenceTransformer, guide: SentenceTransformer, temperature: float = 0.01, mini_batch_size: int = 32, show_progress_bar: bool = False, margin_strategy: Literal['absolute', 'relative'] = 'absolute', margin: float = 0.0, contrast_anchors: bool = True, contrast_positives: bool = True, gather_across_devices: bool = False)[source]

This loss is a combination of GISTEmbedLoss and CachedMultipleNegativesRankingLoss. Typically, MultipleNegativesRankingLoss requires a larger batch size for better performance. GISTEmbedLoss yields stronger training signals than MultipleNegativesRankingLoss due to the use of a guide model for in-batch negative sample selection. Meanwhile, CachedMultipleNegativesRankingLoss allows for scaling of the batch size by dividing the computation into two stages of embedding and loss calculation, which both can be scaled by mini-batches (https://huggingface.co/papers/2101.06983).

By combining the guided selection from GISTEmbedLoss and the Gradient Cache from CachedMultipleNegativesRankingLoss, it is possible to reduce memory usage while maintaining performance levels comparable to those of GISTEmbedLoss.

You can apply different false-negative filtering strategies to discard hard negatives that are too similar to the positive. Two strategies are supported:

Parameters:

References

Requirements:

  1. (anchor, positive) pairs or (anchor, positive, negative) triplets
  2. Should be used with large batch sizes for superior performance, but has slower training time than MultipleNegativesRankingLoss

Inputs:

Texts | Labels
(anchor, positive) pairs | none
(anchor, positive, negative) triplets | none
(anchor, positive, negative_1, …, negative_n) | none

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
guide = SentenceTransformer("all-MiniLM-L6-v2")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.CachedGISTEmbedLoss(
    model,
    guide,
    mini_batch_size=64,
    margin_strategy="absolute",  # or "relative" (e.g., margin=0.05 for max. 95% of positive similarity)
    margin=0.1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

GlobalOrthogonalRegularizationLoss

class sentence_transformers.losses.GlobalOrthogonalRegularizationLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, similarity_fct=<function cos_sim>, mean_weight: float | None = 1.0, second_moment_weight: float | None = 1.0, aggregation: ~typing.Literal['mean', 'sum'] = 'mean')[source]

Global Orthogonal Regularization (GOR) Loss that encourages embeddings to be well-distributed in the embedding space by penalizing high mean similarities and high second moments of similarities across unrelated inputs.

The loss consists of two terms:

  1. Mean term: Penalizes when the mean similarity across unrelated embeddings is high
  2. Second moment term: Penalizes when the second moment of similarities is high

A high second moment indicates that some embeddings have very high similarities, suggesting clustering or concentration in certain regions of the embedding space. A low second moment indicates that similarities are more uniformly distributed.

The loss is called independently on each input column (e.g., queries and passages) and combines the results using either mean or sum aggregation. This is why the loss can be used on any dataset configuration (e.g., single inputs, pairs, triplets, etc.).

It’s recommended to combine this loss with a primary loss function, such as MultipleNegativesRankingLoss.
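Following the original GOR formulation that this loss is based on, the two penalty terms can be sketched as follows (a simplified illustration, not the library's implementation): random unit vectors in d dimensions have an expected pairwise similarity of 0 and a second moment of roughly 1/d, so deviations above those targets are penalized; the weights mirror the mean_weight and second_moment_weight parameters above:

```python
def gor_penalty(sims: list, dim: int, mean_weight: float = 1.0, second_moment_weight: float = 1.0) -> float:
    m1 = sum(sims) / len(sims)                 # mean similarity of unrelated pairs
    m2 = sum(s * s for s in sims) / len(sims)  # second moment of similarities
    # Penalize a non-zero mean and a second moment above the ~1/dim target
    return mean_weight * m1**2 + second_moment_weight * max(0.0, m2 - 1.0 / dim)

print(gor_penalty([0.0, 0.0], dim=768))      # 0.0: well-spread embeddings
print(gor_penalty([0.9, 0.8], dim=768) > 0)  # True: clustered embeddings are penalized
```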

Parameters:

References

Inputs:

Texts | Labels
any | none

Example

import torch
from datasets import Dataset
from torch import Tensor
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import GlobalOrthogonalRegularizationLoss, MultipleNegativesRankingLoss
from sentence_transformers.util import cos_sim

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})

class InfoNCEGORLoss(torch.nn.Module):
    def __init__(self, model: SentenceTransformer, similarity_fct=cos_sim, scale=20.0) -> None:
        super().__init__()
        self.model = model
        self.info_nce_loss = MultipleNegativesRankingLoss(model, similarity_fct=similarity_fct, scale=scale)
        self.gor_loss = GlobalOrthogonalRegularizationLoss(model, similarity_fct=similarity_fct)

    def forward(self, sentence_features: list[dict[str, Tensor]], labels: Tensor | None = None) -> Tensor:
        embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
        info_nce_loss: dict[str, Tensor] = {
            "info_nce": self.info_nce_loss.compute_loss_from_embeddings(embeddings, labels)
        }
        gor_loss: dict[str, Tensor] = self.gor_loss.compute_loss_from_embeddings(embeddings, labels)
        return {**info_nce_loss, **gor_loss}

loss = InfoNCEGORLoss(model)
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

Alternatively, you can use multi-task learning to train with both losses:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import GlobalOrthogonalRegularizationLoss, MultipleNegativesRankingLoss
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
mnrl_loss = MultipleNegativesRankingLoss(model)
gor_loss = GlobalOrthogonalRegularizationLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset={"main": train_dataset, "gor": train_dataset},
    loss={"main": mnrl_loss, "gor": gor_loss},
)
trainer.train()

MSELoss

class sentence_transformers.losses.MSELoss(model: SentenceTransformer)[source]

Computes the MSE loss between the computed sentence embedding and a target sentence embedding. This loss is used when extending sentence embeddings to new languages as described in our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.

For an example, see the distillation documentation on extending language models to new languages.

Parameters:

model – SentenceTransformerModel

References

Requirements:

  1. Usually uses a finetuned teacher M in a knowledge distillation setup

Inputs:

Texts Labels
sentence model sentence embeddings
sentence_1, sentence_2, …, sentence_N model sentence embeddings

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")
train_dataset = Dataset.from_dict({
    "english": ["The first sentence", "The second sentence", "The third sentence", "The fourth sentence"],
    "french": ["La première phrase", "La deuxième phrase", "La troisième phrase", "La quatrième phrase"],
})

def compute_labels(batch):
    return {
        "label": teacher_model.encode(batch["english"])
    }

train_dataset = train_dataset.map(compute_labels, batched=True)
loss = losses.MSELoss(student_model)

trainer = SentenceTransformerTrainer(
    model=student_model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

MarginMSELoss

class sentence_transformers.losses.MarginMSELoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, similarity_fct=<function pairwise_dot_score>)[source]

Computes the MSE loss between |sim(Query, Pos) - sim(Query, Neg)| and |gold_sim(Query, Pos) - gold_sim(Query, Neg)|. By default, sim() is the dot-product. The gold_sim is often the similarity score from a teacher model.

In contrast to MultipleNegativesRankingLoss, the two passages do not have to be strictly positive and negative; both can be relevant or not relevant for a given query. This can be an advantage of MarginMSELoss over MultipleNegativesRankingLoss, but note that MarginMSELoss is much slower to train. With MultipleNegativesRankingLoss and a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query against only two passages. It is also possible to use multiple negatives with MarginMSELoss, but training is then even slower.
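The margin objective above can be sketched in plain PyTorch. This is an illustrative sketch with toy tensors standing in for model outputs, not the library implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy embeddings standing in for student model outputs
queries = torch.randn(4, 8)
positives = torch.randn(4, 8)
negatives = torch.randn(4, 8)

# Student margin: sim(Query, Pos) - sim(Query, Neg), with dot-product similarity
student_margin = (queries * positives).sum(-1) - (queries * negatives).sum(-1)

# Gold margin: gold_sim(Query, Pos) - gold_sim(Query, Neg), usually from a teacher model
gold_margin = torch.randn(4)

# MSE between the two margins
loss = F.mse_loss(student_margin, gold_margin)
```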

Parameters:

References

Requirements:

  1. (query, passage_one, passage_two) triplets or (query, positive, negative_1, …, negative_n)
  2. Usually used with a finetuned teacher M in a knowledge distillation setup

Inputs:

Texts Labels
(query, passage_one, passage_two) triplets M(query, passage_one) - M(query, passage_two)
(query, passage_one, passage_two) triplets [M(query, passage_one), M(query, passage_two)]
(query, positive, negative_1, …, negative_n) [M(query, positive) - M(query, negative_i) for i in 1..n]
(query, positive, negative_1, …, negative_n) [M(query, positive), M(query, negative_1), …, M(query, negative_n)]

Relations:

Example

With gold labels, e.g. if you have hard scores for sentences. Imagine you want a model to embed sentences with similar “quality” close to each other. If “text1” has quality 5 out of 5, “text2” has quality 1 out of 5, and “text3” has quality 3 out of 5, then the similarity of a pair can be defined as the difference of the quality scores: the similarity between “text1” and “text2” is 4, and the similarity between “text1” and “text3” is 2. If we use this as our “teacher model”, the label becomes similarity(“text1”, “text2”) - similarity(“text1”, “text3”) = 4 - 2 = 2.

Positive values denote that the first passage is more similar to the query than the second passage, while negative values denote the opposite.

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "text1": ["It's nice weather outside today.", "He drove to work."],
    "text2": ["It's so sunny.", "He took the car to work."],
    "text3": ["It's very sunny.", "She walked to the store."],
    "label": [0.1, 0.8],
})
loss = losses.MarginMSELoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

We can also use a teacher model to compute the similarity scores and use them as silver labels. This is often done in knowledge distillation.

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")
train_dataset = Dataset.from_dict({
    "query": ["It's nice weather outside today.", "He drove to work."],
    "passage1": ["It's so sunny.", "He took the car to work."],
    "passage2": ["It's very sunny.", "She walked to the store."],
})

def compute_labels(batch):
    emb_queries = teacher_model.encode(batch["query"])
    emb_passages1 = teacher_model.encode(batch["passage1"])
    emb_passages2 = teacher_model.encode(batch["passage2"])
    return {
        "label": teacher_model.similarity_pairwise(emb_queries, emb_passages1)
        - teacher_model.similarity_pairwise(emb_queries, emb_passages2)
    }

train_dataset = train_dataset.map(compute_labels, batched=True)
loss = losses.MarginMSELoss(student_model)

trainer = SentenceTransformerTrainer(
    model=student_model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

We can also use multiple negatives during the knowledge distillation.

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset
import torch

student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")

train_dataset = Dataset.from_dict({
    "query": ["It's nice weather outside today.", "He drove to work."],
    "passage1": ["It's so sunny.", "He took the car to work."],
    "passage2": ["It's very cold.", "She walked to the store."],
    "passage3": ["Its rainy", "She took the bus"],
})

def compute_labels(batch):
    emb_queries = teacher_model.encode(batch["query"])
    emb_passages1 = teacher_model.encode(batch["passage1"])
    emb_passages2 = teacher_model.encode(batch["passage2"])
    emb_passages3 = teacher_model.encode(batch["passage3"])
    return {
        "label": torch.stack(
            [
                teacher_model.similarity_pairwise(emb_queries, emb_passages1)
                - teacher_model.similarity_pairwise(emb_queries, emb_passages2),
                teacher_model.similarity_pairwise(emb_queries, emb_passages1)
                - teacher_model.similarity_pairwise(emb_queries, emb_passages3),
            ],
            dim=1,
        )
    }

train_dataset = train_dataset.map(compute_labels, batched=True)
loss = losses.MarginMSELoss(student_model)

trainer = SentenceTransformerTrainer(model=student_model, train_dataset=train_dataset, loss=loss)
trainer.train()

MatryoshkaLoss

class sentence_transformers.losses.MatryoshkaLoss(model: SentenceTransformer, loss: Module, matryoshka_dims: Sequence[int], matryoshka_weights: Sequence[float] | Sequence[int] | None = None, n_dims_per_step: int = -1)[source]

The MatryoshkaLoss can be seen as a loss modifier that allows you to use other loss functions at various different embedding dimensions. This is useful for when you want to train a model where users have the option to lower the embedding dimension to improve their embedding comparison speed and costs.

This loss is also compatible with the Cached… losses, which are in-batch negative losses that allow for higher batch sizes. The higher batch sizes allow for more negatives, and often result in a stronger model.
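The idea behind the loss modifier can be sketched in plain PyTorch: the same inner loss is applied to truncated (and re-normalized) prefixes of the embeddings, and the per-dimension losses are combined. A simplified illustration with toy tensors and an in-batch negatives inner loss, not the library implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy embeddings standing in for encoded (anchor, positive) pairs
anchors = torch.randn(4, 256)
positives = torch.randn(4, 256)

def info_nce(a: torch.Tensor, p: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # In-batch negatives loss: the matched positive sits on the diagonal
    a, p = F.normalize(a, dim=-1), F.normalize(p, dim=-1)
    return F.cross_entropy(scale * a @ p.T, torch.arange(a.shape[0]))

# Apply the inner loss at every Matryoshka dimension and sum (equal weights)
total_loss = sum(info_nce(anchors[:, :d], positives[:, :d]) for d in (256, 128, 64, 32))
```

Because early embedding dimensions are trained to be useful on their own, users can later truncate the embeddings at inference time with little quality loss.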

Parameters:

References

Inputs:

Texts Labels
any any

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.MultipleNegativesRankingLoss(model)
loss = losses.MatryoshkaLoss(model, loss, [768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

Matryoshka2dLoss

class sentence_transformers.losses.Matryoshka2dLoss(model: SentenceTransformer, loss: Module, matryoshka_dims: list[int], matryoshka_weights: list[float | int] | None = None, n_layers_per_step: int = 1, n_dims_per_step: int = 1, last_layer_weight: float = 1.0, prior_layers_weight: float = 1.0, kl_div_weight: float = 1.0, kl_temperature: float = 0.3)[source]

The Matryoshka2dLoss can be seen as a loss modifier that combines the AdaptiveLayerLoss and the MatryoshkaLoss. This allows you to train an embedding model where users can 1) specify the number of model layers to use, and 2) specify the output dimensions to use.

The former is useful for when you want users to have the option to lower the number of layers used to improve their inference speed and memory usage, and the latter is useful for when you want users to have the option to lower the output dimensions to improve the efficiency of their downstream tasks (e.g. retrieval) or to lower their storage costs.

Note, this uses n_layers_per_step=1 and n_dims_per_step=1 as default, following the original 2DMSE implementation.

Parameters:

References

Requirements:

  1. The base loss cannot be CachedMultipleNegativesRankingLoss, CachedMultipleNegativesSymmetricRankingLoss, or CachedGISTEmbedLoss.

Inputs:

Texts Labels
any any

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.MultipleNegativesRankingLoss(model)
loss = losses.Matryoshka2dLoss(model, loss, [768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

AdaptiveLayerLoss

class sentence_transformers.losses.AdaptiveLayerLoss(model: SentenceTransformer, loss: nn.Module, n_layers_per_step: int = 1, last_layer_weight: float = 1.0, prior_layers_weight: float = 1.0, kl_div_weight: float = 1.0, kl_temperature: float = 0.3)[source]

The AdaptiveLayerLoss can be seen as a loss modifier that allows you to use other loss functions at non-final layers of the Sentence Transformer model. This is useful for when you want to train a model where users have the option to lower the number of layers used to improve their inference speed and memory usage.

Parameters:

References

Requirements:

  1. The base loss cannot be CachedMultipleNegativesRankingLoss, CachedMultipleNegativesSymmetricRankingLoss, or CachedGISTEmbedLoss.

Inputs:

Texts Labels
any any

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.MultipleNegativesRankingLoss(model=model)
loss = losses.AdaptiveLayerLoss(model, loss)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

MegaBatchMarginLoss

class sentence_transformers.losses.MegaBatchMarginLoss(model: SentenceTransformer, positive_margin: float = 0.8, negative_margin: float = 0.3, use_mini_batched_version: bool = True, mini_batch_size: int = 50)[source]

Given a large batch (like 500 or more examples) of (anchor_i, positive_i) pairs, find for each pair in the batch the hardest negative, i.e. find j != i such that cos_sim(anchor_i, positive_j) is maximal. Then create from this a triplet (anchor_i, positive_i, positive_j) where positive_j serves as the negative for this triplet.

Then train as with the triplet loss.
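The hardest-negative selection step can be sketched as follows, using toy normalized embeddings. This sketches only the mining logic described above, not the library implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy embeddings for a batch of (anchor_i, positive_i) pairs
anchors = F.normalize(torch.randn(6, 8), dim=-1)
positives = F.normalize(torch.randn(6, 8), dim=-1)

# Cosine similarity of every anchor against every positive
scores = anchors @ positives.T

# Exclude the matched positive, then pick j != i maximizing cos_sim(anchor_i, positive_j)
scores.fill_diagonal_(float("-inf"))
hardest = scores.argmax(dim=1)

# Each (anchor_i, positive_i, positive_{hardest[i]}) triplet is then trained with triplet loss
```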

Parameters:

References

Requirements:

  1. (anchor, positive) pairs
  2. Large batches (500 or more examples)

Inputs:

Texts Labels
(anchor, positive) pairs none

Recommendations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments, SentenceTransformerTrainer, losses
from datasets import Dataset

train_batch_size = 250
train_mini_batch_size = 32

model = SentenceTransformer('all-MiniLM-L6-v2')
train_dataset = Dataset.from_dict({
    "anchor": [f"This is sentence number {i}" for i in range(500)],
    "positive": [f"This is sentence number {i}" for i in range(1, 501)],
})
loss = losses.MegaBatchMarginLoss(model=model, mini_batch_size=train_mini_batch_size)

args = SentenceTransformerTrainingArguments(
    output_dir="output",
    per_device_train_batch_size=train_batch_size,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

MultipleNegativesRankingLoss

MultipleNegativesRankingLoss is a great loss function if you only have positive pairs, for example, only pairs of similar texts like pairs of paraphrases, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language).

class sentence_transformers.losses.MultipleNegativesRankingLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, scale: float = 20.0, similarity_fct: ~collections.abc.Callable[[~torch.Tensor, ~torch.Tensor], ~torch.Tensor] = <function cos_sim>, gather_across_devices: bool = False, directions: tuple[~typing.Literal['query_to_doc', 'query_to_query', 'doc_to_query', 'doc_to_doc'], ...] = ('query_to_doc',), partition_mode: ~typing.Literal['joint', 'per_direction'] = 'joint', hardness_mode: ~typing.Literal['in_batch_negatives', 'hard_negatives', 'all_negatives'] | None = None, hardness_strength: float = 0.0)[source]

Given a dataset of (anchor, positive) pairs, (anchor, positive, negative) triplets, or (anchor, positive, negative_1, …, negative_n) n-tuples, this loss implements a contrastive learning objective that encourages the model to produce similar embeddings for the anchor and positive samples, while producing dissimilar embeddings for the negative samples.

In plain terms, the loss works as follows:

  1. For each anchor (often a query) in the batch, we want the similarity to its matched positive (often a document) to be higher than the similarity to all other documents in the batch (including optional hard negatives). This is the standard forward MultipleNegativesRankingLoss / InfoNCE term, denoted with “query_to_doc”.
  2. Optionally, we can also require the opposite: for each document, its matched query should have higher similarity than all other queries in the batch. This is the symmetric backward term, denoted with “doc_to_query”.
  3. Optionally, we can further require that for each query, its similarity to all other queries in the batch is lower than to its matched document. This is the “query_to_query” term.
  4. Optionally, we can also require that for each document, its similarity to all other documents in the batch is lower than to its matched query. This excludes documents that belong to the same query in the case of hard negatives (i.e. columns beyond the first two in the input). This is the “doc_to_doc” term.

All of these are implemented via different choices of interaction directions and how we normalize the scores, but they all share the same core idea: the correct pair (query, positive) should have the highest similarity compared to all in-batch alternatives.

All of these are expressed via the same underlying formulation by choosing different directions and partition_mode values. Optional negatives in the input are treated as additional hard-negative documents for the corresponding query.

The default configuration is also known as the InfoNCE loss, SimCSE loss, cross-entropy loss with in-batch negatives, or simply in-batch negatives loss.
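This default “query_to_doc” term boils down to cross-entropy over a similarity matrix, which can be sketched in plain PyTorch with toy embeddings (illustrative only, not the library implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy embeddings standing in for 4 encoded (anchor, positive) pairs
anchors = F.normalize(torch.randn(4, 8), dim=-1)
positives = F.normalize(torch.randn(4, 8), dim=-1)

scale = 20.0
scores = scale * anchors @ positives.T  # (4, 4) scaled cosine similarities

# For anchor i, the matched positive is column i; all other columns act as in-batch negatives
labels = torch.arange(4)
loss = F.cross_entropy(scores, labels)
```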

Parameters:

Requirements:

  1. (anchor, positive) pairs, (anchor, positive, negative) triplets, or (anchor, positive, negative_1, …, negative_n) n-tuples

Inputs:

Texts Labels
(anchor, positive) pairs none
(anchor, positive, negative) triplets none
(anchor, positive, negative_1, …, negative_n) none

Recommendations:

Relations:

Loss variants from the literature:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

CachedMultipleNegativesRankingLoss

class sentence_transformers.losses.CachedMultipleNegativesRankingLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, scale: float = 20.0, similarity_fct: ~collections.abc.Callable[[~torch.Tensor, ~torch.Tensor], ~torch.Tensor] = <function cos_sim>, mini_batch_size: int = 32, gather_across_devices: bool = False, directions: tuple[~typing.Literal['query_to_doc', 'query_to_query', 'doc_to_query', 'doc_to_doc'], ...] = ('query_to_doc',), partition_mode: ~typing.Literal['joint', 'per_direction'] = 'joint', show_progress_bar: bool = False, hardness_mode: ~typing.Literal['in_batch_negatives', 'hard_negatives', 'all_negatives'] | None = None, hardness_strength: float = 0.0)[source]

Boosted version of MultipleNegativesRankingLoss (https://huggingface.co/papers/1705.00652) by GradCache (https://huggingface.co/papers/2101.06983).

Contrastive learning (here, our MNRL loss) with in-batch negatives is usually hard to use with large batch sizes due to (GPU) memory limitations. Even batch-scaling methods like gradient accumulation cannot help, because the in-batch negatives make the data points within the same batch non-independent, so the batch cannot be broken down into mini-batches. GradCache is a smart way to solve this problem. It divides the computation into two stages, embedding and loss calculation, both of which can be scaled by mini-batches. As a result, a constant amount of memory (e.g. enough for batch size 32) can now process much larger batches (e.g. 65536).

In detail:

  1. It first does a quick embedding step without gradients/computation graphs to get all the embeddings;
  2. Then it calculates the loss, backpropagates up to the embeddings, and caches the gradients w.r.t. the embeddings;
  3. Finally, it runs a second embedding step with gradients/computation graphs and connects the cached gradients into the backward chain.

Notes: All steps are done with mini-batches. In the original implementation of GradCache, (2) is not done in mini-batches and requires a lot of memory when the batch size is large. One drawback is speed: gradient caching sacrifices around 20% computation time according to the paper.
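The three steps can be sketched with a toy encoder, where a single linear layer stands in for the transformer. This is a conceptual sketch of the gradient-caching idea, not the GradCache implementation:

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(8, 4)   # stand-in for the embedding model
batch = torch.randn(16, 8)        # a "large" batch
mini = 4                          # mini-batch size

# Step 1: embed everything in mini-batches, without gradients
with torch.no_grad():
    embeddings = torch.cat([encoder(batch[i : i + mini]) for i in range(0, 16, mini)])

# Step 2: compute the loss on the full batch of embeddings and cache their gradients
embeddings = embeddings.detach().requires_grad_()
loss = embeddings.pow(2).mean()   # stand-in for the in-batch negatives loss
loss.backward()
cached_grads = embeddings.grad

# Step 3: re-embed in mini-batches with gradients and connect the cached gradients
for i in range(0, 16, mini):
    encoder(batch[i : i + mini]).backward(cached_grads[i : i + mini])
```

The accumulated parameter gradients match a single full-batch backward pass, while only one mini-batch's computation graph is ever held in memory.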

See MultipleNegativesRankingLoss for more details about the underlying loss itself.

Parameters:

References

Requirements:

  1. (anchor, positive) pairs, (anchor, positive, negative) triplets, or (anchor, positive, negative_1, …, negative_n) n-tuples
  2. Should be used with large per_device_train_batch_size and low mini_batch_size for superior performance, but slower training time than MultipleNegativesRankingLoss.

Inputs:

Texts Labels
(anchor, positive) pairs none
(anchor, positive, negative) triplets none
(anchor, positive, negative_1, …, negative_n) none

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=64)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

MultipleNegativesSymmetricRankingLoss

class sentence_transformers.losses.MultipleNegativesSymmetricRankingLoss(**kwargs)[source]

Warning

This class has been merged into MultipleNegativesRankingLoss and is now deprecated. Please use MultipleNegativesRankingLoss with directions=("query_to_doc", "doc_to_query") and partition_mode="per_direction" for identical performance instead:

loss = MultipleNegativesRankingLoss(
    model,
    directions=("query_to_doc", "doc_to_query"),
    partition_mode="per_direction",
)

Given a list of (anchor, positive) pairs, this loss sums the following two losses:

  1. Forward loss: Given an anchor, find the sample with the highest similarity out of all positives in the batch. This is equivalent to MultipleNegativesRankingLoss.
  2. Backward loss: Given a positive, find the sample with the highest similarity out of all anchors in the batch.

For example with question-answer pairs, MultipleNegativesRankingLoss just computes the loss to find the answer given a question, but MultipleNegativesSymmetricRankingLoss additionally computes the loss to find the question given an answer.

Note: If you pass triplets, the negative entry will be ignored. The anchor is only matched against the positive.

Parameters:

Requirements:

  1. (anchor, positive) pairs

Inputs:

Texts Labels
(anchor, positive) pairs none

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.MultipleNegativesSymmetricRankingLoss(model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

CachedMultipleNegativesSymmetricRankingLoss

class sentence_transformers.losses.CachedMultipleNegativesSymmetricRankingLoss(**kwargs)[source]

Warning

This class has been merged into CachedMultipleNegativesRankingLoss and is now deprecated. Please use CachedMultipleNegativesRankingLoss with directions=("query_to_doc", "doc_to_query") and partition_mode="per_direction" for identical performance instead:

loss = CachedMultipleNegativesRankingLoss(
    model,
    mini_batch_size=32,
    directions=("query_to_doc", "doc_to_query"),
    partition_mode="per_direction",
)

Boosted version of MultipleNegativesSymmetricRankingLoss (MNSRL) by GradCache (https://huggingface.co/papers/2101.06983).

Given a list of (anchor, positive) pairs, MNSRL sums the following two losses:

  1. Forward loss: Given an anchor, find the sample with the highest similarity out of all positives in the batch.
  2. Backward loss: Given a positive, find the sample with the highest similarity out of all anchors in the batch.

For example with question-answer pairs, the forward loss finds the answer for a given question and the backward loss finds the question for a given answer. This loss is common in symmetric tasks, such as semantic textual similarity.

The caching modification allows for large batch sizes (which give a better training signal) with constant memory usage, allowing you to reach optimal training signal with regular hardware.

Note: If you pass triplets, the negative entry will be ignored. The anchor is only matched against the positive.

Parameters:

Requirements:

  1. (anchor, positive) pairs
  2. Should be used with large batch sizes for superior performance, but has slower training time than non-cached versions

Inputs:

Texts Labels
(anchor, positive) pairs none

Recommendations:

Relations:

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.CachedMultipleNegativesSymmetricRankingLoss(model, mini_batch_size=32)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

References

SoftmaxLoss

class sentence_transformers.losses.SoftmaxLoss(model: SentenceTransformer, sentence_embedding_dimension: int, num_labels: int, concatenation_sent_rep: bool = True, concatenation_sent_difference: bool = True, concatenation_sent_multiplication: bool = False, loss_fct: Callable = CrossEntropyLoss())[source]

This loss was used in our SBERT publication (https://huggingface.co/papers/1908.10084) to train the SentenceTransformer model on NLI data. It adds a softmax classifier on top of the output of two transformer networks.

MultipleNegativesRankingLoss is an alternative loss function that often yields better results, as per https://huggingface.co/papers/2004.09813.
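With the default settings (concatenation_sent_rep and concatenation_sent_difference enabled), the classifier sees the concatenation (u, v, |u - v|) of the two sentence embeddings. A sketch with toy embeddings, illustrative only:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy embeddings standing in for the two encoded sentences
u = torch.randn(4, 768)  # sentence_A embeddings
v = torch.randn(4, 768)  # sentence_B embeddings

# Default concatenation: (u, v, |u - v|)
features = torch.cat([u, v, (u - v).abs()], dim=-1)

# Softmax classifier on top, e.g. 3 NLI labels
classifier = torch.nn.Linear(3 * 768, 3)
labels = torch.tensor([1, 2, 0, 0])
loss = F.cross_entropy(classifier(features), labels)
```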

Parameters:

References

Requirements:

  1. sentence pairs with a class label

Inputs:

Texts Labels
(sentence_A, sentence_B) pairs class

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "sentence1": [
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "A person on a horse jumps over a broken down airplane.",
        "Children smiling and waving at camera",
    ],
    "sentence2": [
        "A person is training his horse for a competition.",
        "A person is at a diner, ordering an omelette.",
        "A person is outdoors, on a horse.",
        "There are children present.",
    ],
    "label": [1, 2, 0, 0],
})
loss = losses.SoftmaxLoss(model, model.get_sentence_embedding_dimension(), num_labels=3)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

TripletLoss

class sentence_transformers.losses.TripletLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function TripletDistanceMetric.<lambda>>, triplet_margin: float = 5)[source]

This class implements triplet loss. Given a triplet of (anchor, positive, negative), the loss minimizes the distance between anchor and positive while maximizing the distance between anchor and negative. It computes the following loss function:

loss = max(||anchor - positive|| - ||anchor - negative|| + margin, 0).

The margin is an important hyperparameter and needs to be tuned accordingly.
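The formula can be sketched directly in PyTorch with Euclidean distance, using toy embeddings (illustrative only, not the library implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy embeddings standing in for the encoded triplet
anchor = torch.randn(4, 8)
positive = torch.randn(4, 8)
negative = torch.randn(4, 8)
margin = 5.0

d_pos = F.pairwise_distance(anchor, positive)  # ||anchor - positive||
d_neg = F.pairwise_distance(anchor, negative)  # ||anchor - negative||

# max(||anchor - positive|| - ||anchor - negative|| + margin, 0), averaged over the batch
loss = F.relu(d_pos - d_neg + margin).mean()
```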

Parameters:

References

Requirements:

  1. (anchor, positive, negative) triplets

Inputs:

Texts Labels
(anchor, positive, negative) triplets none

Example

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
    "negative": ["It's quite rainy, sadly.", "She walked to the store."],
})
loss = losses.TripletLoss(model=model)

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

DistillKLDivLoss

class sentence_transformers.losses.DistillKLDivLoss(model: ~sentence_transformers.SentenceTransformer.SentenceTransformer, similarity_fct=<function pairwise_dot_score>, temperature: float = 1.0)[source]

Compute the KL divergence loss between probability distributions derived from student and teacher models’ similarity scores. By default, similarity is calculated using the dot-product. This loss is designed for knowledge distillation where a smaller student model learns from a more powerful teacher model.

The loss computes softmax probabilities from the teacher similarity scores and log-softmax probabilities from the student model, then calculates the KL divergence between these distributions.
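This computation can be sketched with toy score tensors, here with one positive and two negatives per query (illustrative only, not the library implementation):

```python
import torch
import torch.nn.functional as F

# Similarity scores for (query, positive, negative_1, negative_2), one row per query
teacher_scores = torch.tensor([[0.9, 0.2, 0.1], [0.8, 0.4, 0.3]])
student_scores = torch.tensor([[0.7, 0.3, 0.2], [0.5, 0.4, 0.1]])
temperature = 1.0

# Softmax probabilities from the teacher, log-softmax from the student
p_teacher = F.softmax(teacher_scores / temperature, dim=-1)
log_p_student = F.log_softmax(student_scores / temperature, dim=-1)

# KL divergence between the two distributions, averaged over the batch
loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```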

Parameters:

References

Requirements:

  1. (query, positive, negative_1, …, negative_n) examples
  2. Labels containing teacher model’s scores between query-positive and query-negative pairs

Inputs:

Texts Labels
(query, positive, negative) [Teacher(query, positive), Teacher(query, negative)]
(query, positive, negative_1, …, negative_n) [Teacher(query, positive), Teacher(query, negative_i)…]

Relations:

Example

Using a teacher model to compute similarity scores for distillation:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset
import torch

student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")
train_dataset = Dataset.from_dict({
    "query": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to work."],
    "negative": ["It's very cold.", "She walked to the store."],
})

def compute_labels(batch):
    emb_queries = teacher_model.encode(batch["query"])
    emb_positives = teacher_model.encode(batch["positive"])
    emb_negatives = teacher_model.encode(batch["negative"])

    pos_scores = teacher_model.similarity_pairwise(emb_queries, emb_positives)
    neg_scores = teacher_model.similarity_pairwise(emb_queries, emb_negatives)

    # Stack the scores for positive and negative pairs
    return {
        "label": torch.stack([pos_scores, neg_scores], dim=1)
    }

train_dataset = train_dataset.map(compute_labels, batched=True)
loss = losses.DistillKLDivLoss(student_model)

trainer = SentenceTransformerTrainer(model=student_model, train_dataset=train_dataset, loss=loss)
trainer.train()

With multiple negatives:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset
import torch

student_model = SentenceTransformer("microsoft/mpnet-base")
teacher_model = SentenceTransformer("all-mpnet-base-v2")

train_dataset = Dataset.from_dict({
    "query": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to work."],
    "negative1": ["It's very cold.", "She walked to the store."],
    "negative2": ["Its rainy", "She took the bus"],
})

def compute_labels(batch):
    emb_queries = teacher_model.encode(batch["query"])
    emb_positives = teacher_model.encode(batch["positive"])
    emb_negatives1 = teacher_model.encode(batch["negative1"])
    emb_negatives2 = teacher_model.encode(batch["negative2"])

    pos_scores = teacher_model.similarity_pairwise(emb_queries, emb_positives)
    neg_scores1 = teacher_model.similarity_pairwise(emb_queries, emb_negatives1)
    neg_scores2 = teacher_model.similarity_pairwise(emb_queries, emb_negatives2)

    # Stack the scores for positive and multiple negative pairs
    return {
        "label": torch.stack([pos_scores, neg_scores1, neg_scores2], dim=1)
    }

train_dataset = train_dataset.map(compute_labels, batched=True)
loss = losses.DistillKLDivLoss(student_model)

trainer = SentenceTransformerTrainer(model=student_model, train_dataset=train_dataset, loss=loss)
trainer.train()