Sparse Encoder Evaluation

This directory contains examples demonstrating how to evaluate Sparse Encoder models using various metrics and evaluator classes.

To run any of these evaluations, simply execute the corresponding Python script. Each script will:

  1. Load a pretrained sparse encoder model.
  2. Prepare the evaluation dataset.
  3. Configure the appropriate evaluator.
  4. Run the evaluation.
  5. Report the results.
| Evaluator | Evaluation Script |
| --- | --- |
| SparseInformationRetrievalEvaluator | sparse_retrieval_evaluator.py |
| SparseNanoBEIREvaluator | sparse_nanobeir_evaluator.py |
| SparseEmbeddingSimilarityEvaluator | sparse_similarity_evaluator.py |
| SparseBinaryClassificationEvaluator | sparse_classification_evaluator.py |
| SparseTripletEvaluator | sparse_triplet_evaluator.py |
| SparseRerankingEvaluator | sparse_reranking_evaluator.py |
| SparseTranslationEvaluator | sparse_translation_evaluator.py |
| SparseMSEEvaluator | sparse_mse_evaluator.py |
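
Of the evaluators above, SparseNanoBEIREvaluator needs no manual dataset preparation, since it fetches the NanoBEIR datasets itself, which makes it a convenient way to see the common pattern shared by all of the scripts: load a model, configure an evaluator, run it, and read off the primary metric. The sketch below is only illustrative; the chosen checkpoint and the pair of NanoBEIR dataset names are assumptions you can swap for your own setup:

import logging

from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator

logging.basicConfig(format="%(message)s", level=logging.INFO)

# Load a pretrained sparse encoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Configure the evaluator; the dataset names here are illustrative NanoBEIR subsets
evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus"], batch_size=16)

# Run the evaluation and report the primary metric
results = evaluator(model)
print(f"{evaluator.primary_metric}: {results[evaluator.primary_metric]:.4f}")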

Example with Retrieval Evaluation:

This script demonstrates how to evaluate a sparse encoder on an information retrieval task (sparse_retrieval_evaluator.py):

import logging
import random

from datasets import load_dataset

from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import SparseInformationRetrievalEvaluator

logging.basicConfig(format="%(message)s", level=logging.INFO)

# Load a model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Load the NFcorpus IR dataset (https://huggingface.co/datasets/BeIR/nfcorpus, https://huggingface.co/datasets/BeIR/nfcorpus-qrels)
corpus = load_dataset("BeIR/nfcorpus", "corpus", split="corpus")
queries = load_dataset("BeIR/nfcorpus", "queries", split="queries")
relevant_docs_data = load_dataset("BeIR/nfcorpus-qrels", split="test")

# For this dataset, we want to concatenate the title and texts for the corpus
corpus = corpus.map(lambda x: {"text": x["title"] + " " + x["text"]}, remove_columns=["title"])

# Shrink the corpus size heavily to only the relevant documents + 1,000 random documents
required_corpus_ids = set(map(str, relevant_docs_data["corpus-id"]))
required_corpus_ids |= set(random.sample(corpus["_id"], k=1000))
corpus = corpus.filter(lambda x: x["_id"] in required_corpus_ids)

# Convert the datasets to dictionaries
corpus = dict(zip(corpus["_id"], corpus["text"]))  # Our corpus (cid => document)
queries = dict(zip(queries["_id"], queries["text"]))  # Our queries (qid => question)
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]):
    qid = str(qid)
    corpus_ids = str(corpus_ids)
    if qid not in relevant_docs:
        relevant_docs[qid] = set()
    relevant_docs[qid].add(corpus_ids)

# Given queries, a corpus and a mapping with relevant documents, the SparseInformationRetrievalEvaluator computes different IR metrics.
ir_evaluator = SparseInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="BeIR-nfcorpus-subset-test",
    show_progress_bar=True,
    batch_size=16,
)

# Run evaluation
results = ir_evaluator(model)
"""
Queries: 323
Corpus: 3269

Score-Function: dot
Accuracy@1: 50.77%
Accuracy@3: 64.40%
Accuracy@5: 66.87%
Accuracy@10: 71.83%
Precision@1: 50.77%
Precision@3: 40.45%
Precision@5: 34.06%
Precision@10: 25.98%
Recall@1: 6.27%
Recall@3: 11.69%
Recall@5: 13.74%
Recall@10: 17.23%
MRR@10: 0.5814
NDCG@10: 0.3621
MAP@100: 0.1838
Model Query Sparsity: Active Dimensions: 40.0, Sparsity Ratio: 0.9987
Model Corpus Sparsity: Active Dimensions: 206.2, Sparsity Ratio: 0.9932
"""

# Print the results
print(f"Primary metric: {ir_evaluator.primary_metric}")
# => Primary metric: BeIR-nfcorpus-subset-test_dot_ndcg@10
print(f"Primary metric value: {results[ir_evaluator.primary_metric]:.4f}")
# => Primary metric value: 0.3621
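
Because the evaluator simply returns a dictionary of metric names and values, it is easy to inspect every computed metric or to rerun the same evaluator on another checkpoint for a side-by-side comparison on the identical queries, corpus, and relevance judgments. In the sketch below, the second checkpoint (naver/splade-v3) is only an example of a model you might compare against:

# Inspect all metrics computed by the evaluator
for metric_name, value in results.items():
    print(f"{metric_name}: {value}")

# Rerun the same evaluator on a second sparse encoder for a direct comparison
# (the checkpoint name below is just an example)
other_model = SparseEncoder("naver/splade-v3")
other_results = ir_evaluator(other_model)
print(f"Baseline  {ir_evaluator.primary_metric}: {results[ir_evaluator.primary_metric]:.4f}")
print(f"Candidate {ir_evaluator.primary_metric}: {other_results[ir_evaluator.primary_metric]:.4f}")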