Loss Overview — Sentence Transformers documentation

Loss Table

Loss functions play a critical role in the performance of your fine-tuned model. Sadly, there is no “one size fits all” loss function. Ideally, this table should help narrow down your choice of loss function(s) by matching them to your data formats.

Note

You can often convert one training data format into another, allowing more loss functions to be viable for your scenario. For example, (sentence_A, sentence_B) pairs with class labels can be converted into (anchor, positive, negative) triplets by sampling sentences with the same or different classes.
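As a concrete illustration of such a conversion, here is a minimal, library-free sketch (the function name and the toy data are made up for this example) that turns class-labeled pairs into (anchor, positive, negative) triplets by sampling the negative from a pair with a different class:

```python
import random

def pairs_to_triplets(labeled_pairs, seed=0):
    """Convert (sentence_A, sentence_B, class) rows into
    (anchor, positive, negative) triplets by sampling a negative
    sentence from a pair with a different class label."""
    rng = random.Random(seed)
    triplets = []
    for i, (anchor, positive, label) in enumerate(labeled_pairs):
        # Candidate negatives: sentences belonging to a different class.
        candidates = [b for j, (a, b, other) in enumerate(labeled_pairs)
                      if j != i and other != label]
        if candidates:
            triplets.append((anchor, positive, rng.choice(candidates)))
    return triplets

data = [
    ("A dog runs", "A puppy sprints", "animals"),
    ("Stocks fell", "Markets dropped", "finance"),
]
print(pairs_to_triplets(data))
```

With the triplet format available, the triplet-based losses in the table below become applicable to data that originally carried only class labels.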


| Inputs | Labels | Appropriate Loss Functions |
|---|---|---|
| single sentences | class | BatchAllTripletLoss, BatchHardSoftMarginTripletLoss, BatchHardTripletLoss, BatchSemiHardTripletLoss |
| single sentences | none | ContrastiveTensionLoss, DenoisingAutoEncoderLoss |
| (anchor, anchor) pairs | none | ContrastiveTensionLossInBatchNegatives |
| (damaged_sentence, original_sentence) pairs | none | DenoisingAutoEncoderLoss |
| (sentence_A, sentence_B) pairs | class | SoftmaxLoss |
| (anchor, positive) pairs | none | MultipleNegativesRankingLoss, CachedMultipleNegativesRankingLoss, MegaBatchMarginLoss, GISTEmbedLoss, CachedGISTEmbedLoss |
| (anchor, positive/negative) pairs | 1 if positive, 0 if negative | ContrastiveLoss, OnlineContrastiveLoss |
| (sentence_A, sentence_B) pairs | float similarity score between 0 and 1 | CoSENTLoss, AnglELoss, CosineSimilarityLoss |
| (anchor, positive, negative) triplets | none | MultipleNegativesRankingLoss, CachedMultipleNegativesRankingLoss, TripletLoss, CachedGISTEmbedLoss, GISTEmbedLoss |
| (anchor, positive, negative_1, ..., negative_n) tuples | none | MultipleNegativesRankingLoss, CachedMultipleNegativesRankingLoss, CachedGISTEmbedLoss |
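To make the frequently used (anchor, positive) row concrete: MultipleNegativesRankingLoss treats every other positive in the batch as a negative for each anchor, then applies cross-entropy over the similarity scores. A toy pure-Python sketch of that computation (the function names are illustrative, not the library API, and real implementations operate on tensors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """For anchor i, positive i is the target and every positive j != i
    acts as an in-batch negative; the loss is cross-entropy over the
    scaled cosine similarities, averaged over the batch."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        total += log_denom - scores[i]  # -log softmax at the target index
    return total / len(anchors)
```

Matching anchor/positive embeddings drive the loss toward zero, while mismatched ones push it up, which is the behavior the real batched implementation exploits.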

Loss modifiers

These loss functions can be seen as loss modifiers: they work on top of standard loss functions, applying them in modified ways to instill useful properties into the trained embedding model.

For example, models trained with MatryoshkaLoss produce embeddings whose size can be truncated without notable losses in performance, and models trained with AdaptiveLayerLoss still perform well when you remove model layers for faster inference.

| Texts | Labels | Appropriate Loss Functions |
|---|---|---|
| any | any | MatryoshkaLoss, AdaptiveLayerLoss, Matryoshka2dLoss |
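The Matryoshka property amounts to this: embeddings stay usable after truncation. A small library-free sketch of what "truncating" an embedding means at inference time (the function name is illustrative; re-normalization is the usual convention when comparing truncated embeddings by cosine similarity):

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep only the first `dim` dimensions of an embedding and
    re-normalize the result to unit length, as one would do when
    consuming Matryoshka-style embeddings at a reduced size."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 1.0, 2.0]      # toy "embedding"
small = truncate_and_normalize(full, 2)
print(small)
```

A model trained with MatryoshkaLoss is optimized so that such truncated vectors still rank and cluster sentences well.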

Regularization

These losses are designed to regularize the embedding space during training, encouraging certain properties in the learned embeddings. They can often be applied to any dataset configuration.

| Texts | Labels | Appropriate Loss Functions |
|---|---|---|
| any | none | GlobalOrthogonalRegularizationLoss |

Distillation

These loss functions are specifically designed for distilling the knowledge from one model into another: for example, when fine-tuning a small model to behave like a larger and stronger one, or when fine-tuning a model to become multilingual.

| Texts | Labels | Appropriate Loss Functions |
|---|---|---|
| sentence | model sentence embeddings | MSELoss |
| (sentence_1, sentence_2, ..., sentence_N) | model sentence embeddings | MSELoss |
| (query, passage_one, passage_two) | gold_sim(query, passage_one) - gold_sim(query, passage_two) | MarginMSELoss |
| (query, positive, negative_1, ..., negative_n) | [gold_sim(query, positive) - gold_sim(query, negative_i) for i in 1..n] | MarginMSELoss |
| (query, positive, negative) | [gold_sim(query, positive), gold_sim(query, negative)] | DistillKLDivLoss, MarginMSELoss |
| (query, positive, negative_1, ..., negative_n) | [gold_sim(query, positive), gold_sim(query, negative_i)...] | DistillKLDivLoss, MarginMSELoss |
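To make the label columns above concrete, here are minimal library-free sketches of the two simplest distillation objectives (function names are illustrative, not the library API): MSELoss regresses the student's embedding onto the teacher's, and MarginMSELoss matches the student's positive-minus-negative score margin to the teacher's gold margin.

```python
def mse_loss(student_emb, teacher_emb):
    """Mean squared error between a student embedding and the
    corresponding teacher (gold) embedding."""
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(student_emb)

def margin_mse_loss(student_pos, student_neg, gold_pos, gold_neg):
    """Squared error between the student's score margin
    (positive minus negative) and the teacher's gold margin."""
    return ((student_pos - student_neg) - (gold_pos - gold_neg)) ** 2
```

Note that MarginMSELoss only constrains the *difference* between scores, so the student can match the teacher's ranking behavior without reproducing its absolute similarity values.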

Commonly used Loss Functions

In practice, not all loss functions get used equally often. The most common scenarios are (anchor, positive) pairs without labels, typically trained with MultipleNegativesRankingLoss, and (sentence_A, sentence_B) pairs with a float similarity score, typically trained with CosineSimilarityLoss or CoSENTLoss.

Custom Loss Functions

Advanced users can create and train with their own loss functions. Custom loss functions only have a few requirements: they must subclass torch.nn.Module, their constructor must accept the model being trained as a parameter, and their forward method must accept the tokenized sentence features together with the labels and return the loss value.

To get full support with the automatic model card generation, you may also wish to implement a get_config_dict method that returns the loss's configuration parameters, and a citation property referencing the work the loss is based on.

Consider inspecting existing loss functions to get a feel for how loss functions are commonly implemented.