Evaluate Library for Hugging Face

Last Updated : 9 May, 2026

The Evaluate library is a Hugging Face tool for assessing model performance with a wide range of evaluation metrics. It simplifies measuring accuracy, precision, recall and other metrics across different tasks.

Techniques for Evaluation

**1. ROUGE:** Used for evaluating text summarization by measuring how much of the reference (actual) text is covered by the generated text.

**2. BLEU:** Used for evaluating generated text by comparing it with reference text and checking how many words and word sequences (n-grams) match.

**3. Accuracy:** Measures how many predictions are correct out of the total number of predictions, giving the overall correctness of the model.

**4. Precision:** Measures how many of the predicted positive results are actually positive.

**5. Recall:** Measures how many of the actual positive results are correctly identified.

**6. F1 Score:** Combines precision and recall into a single value (their harmonic mean).
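The four classification metrics above follow directly from counts of true positives, false positives and false negatives. As a minimal sketch, independent of the Evaluate library, the pure-Python function below computes them for binary labels. The function name `binary_classification_metrics` and the toy labels are illustrative assumptions, not part of the original walkthrough.

```python
def binary_classification_metrics(references, predictions, positive_label=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration; it is not part of the Evaluate library.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")

    pairs = list(zip(references, predictions))
    # Count the outcomes that the metric definitions are built from
    tp = sum(1 for r, p in pairs if r == positive_label and p == positive_label)
    fp = sum(1 for r, p in pairs if r != positive_label and p == positive_label)
    fn = sum(1 for r, p in pairs if r == positive_label and p != positive_label)
    correct = sum(1 for r, p in pairs if r == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, made up for this example
references = [1, 0, 1, 1, 0, 1]
predictions = [1, 0, 0, 1, 1, 1]

metrics = binary_classification_metrics(references, predictions)
print(metrics)
```

With these toy labels there are 3 true positives, 1 false positive and 1 false negative, so precision and recall are both 3/4 while accuracy is 4/6.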

Implementation

Let’s understand the implementation using a text summarization model from Hugging Face with a sample dataset.

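**Step 1: Importing the required libraries**

```python
import pandas as pd                    # reading the CSV dataset
from transformers import pipeline      # loading the summarization model
import evaluate                        # loading evaluation metrics such as ROUGE
from datasets import load_dataset      # Hub dataset loader (not used in the steps below)
```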

**Step 2: Loading the Dataset**

Loading the dataset from a CSV file that contains real article text and reference summaries.

You can download the dataset from here.

```python
# Read the CSV file; the later steps use its "text" and "summary" columns
df = pd.read_csv("bbc_real_dataset.csv")
print(df.head())
```

**Output:**

Figure: Dataset preview showing text and summaries

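**Step 3: Preparing the Dataset**

Selecting a small subset of the dataset and converting it into a list of dictionaries, one per row.

```python
# Keep the first two rows and convert each row to a {column: value} dictionary
dataset = df.head(2).to_dict(orient="records")
print(dataset)
```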
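To make the "records" format concrete without needing the CSV file, here is a small standard-library sketch that produces the same list-of-dictionaries shape that `to_dict(orient="records")` returns. The two rows are invented placeholder data, and `csv.DictReader` is swapped in for pandas only to keep the example self-contained.

```python
import csv
import io

# Placeholder CSV content with the same column names the walkthrough relies on
csv_text = (
    "text,summary\n"
    "First article body.,First summary.\n"
    "Second article body.,Second summary.\n"
)

# Each row becomes a dictionary keyed by column name
records = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]
print(records)

# Individual fields are then accessed by key, as in the later steps
for item in records:
    print(item["text"], "->", item["summary"])
```

Each element is one row, which is why the later steps can read `item["text"]` and `item["summary"]` inside a simple loop.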

**Output:**

Figure: Prepared dataset in dictionary format

**Step 4: Loading the Summarization Model**

Loading a pre-trained model to generate summaries from the given text.

```python
# Build a summarization pipeline around a pre-trained DistilBART checkpoint
summarizer = pipeline(task="summarization", model="sshleifer/distilbart-cnn-12-6")
print("\nModel loaded successfully")
```

**Step 5: Generating Summaries**

Generating summaries for the input text using the loaded model.

```python
predictions = []
for item in dataset:
    summary = summarizer(
        item["text"],      # article text
        max_length=40,     # upper bound on summary length (in tokens)
        min_length=15,     # lower bound on summary length (in tokens)
        do_sample=False,   # deterministic decoding
    )[0]["summary_text"]
    predictions.append(summary)

# Show the generated summaries
print("\nGenerated Summaries:")
for i, prediction in enumerate(predictions, start=1):
    print(f"{i}. {prediction}\n")
```

**Output:**

Figure: Generated summaries by the model

**Step 6: Preparing Reference Summaries**

Extracting the actual summaries from the dataset for comparison.

```python
# Collect the human-written summaries in the same order as the predictions
references = [item["summary"] for item in dataset]

print("Reference Summaries:")
for i, reference in enumerate(references, start=1):
    print(f"{i}. {reference}\n")
```

**Output:**

Figure: Actual reference summaries from the dataset

**Step 7: Loading the ROUGE metric**

```python
# Load the ROUGE metric through the Evaluate library
rouge = evaluate.load("rouge")
print("ROUGE metric loaded")
```

**Step 8: Computing the ROUGE Score**

Comparing the generated summaries with the actual summaries.

```python
# predictions and references must be lists of equal length, in matching order
result = rouge.compute(predictions=predictions, references=references)
```
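To give a feel for what a ROUGE-1 score measures, the sketch below computes a simplified unigram-overlap F1 between one prediction and one reference in pure Python. It is an illustration only: the helper name `rouge1_f1` is hypothetical, it lowercases and splits on whitespace, and the metric loaded through `evaluate` applies its own tokenization and options, so its numbers will generally not match this sketch exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over overlapping unigrams of two strings.

    Hypothetical helper for illustration; whitespace tokenization only.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)   # share of predicted words that match
    recall = overlap / len(ref_tokens)       # share of reference words recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy sentences, made up for this example
prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"

score = rouge1_f1(prediction, reference)
print(round(score, 3))
```

Here five of the six words overlap in each direction, so precision, recall and F1 are all 5/6. ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L is based on the longest common subsequence.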

**Step 9: Displaying the Results**

```python
# Each key holds the aggregated score for one ROUGE variant
print("ROUGE-1:", result["rouge1"])
print("ROUGE-2:", result["rouge2"])
print("ROUGE-L:", result["rougeL"])
print("ROUGE-Lsum:", result["rougeLsum"])
```
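If the raw values are hard to read, the scores can be printed as percentages in a loop. The following is a small sketch under the assumption that the result is a dictionary of plain floats between 0 and 1; the helper name `format_scores` is hypothetical and the numbers are invented placeholders, not results from this walkthrough.

```python
# Placeholder scores, invented for this example (not results from the article)
example_result = {
    "rouge1": 0.3125,
    "rouge2": 0.1042,
    "rougeL": 0.25,
    "rougeLsum": 0.25,
}


def format_scores(result, digits=2):
    """Return one 'name: percent' line per metric (hypothetical helper)."""
    return [f"{name}: {value * 100:.{digits}f}%" for name, value in result.items()]


lines = format_scores(example_result)
print("\n".join(lines))
```

The same helper could be applied to the `result` dictionary from Step 8 in place of the placeholder values.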

**Output:**

Figure: ROUGE scores printed by the code above

The ROUGE scores quantify how closely the generated summaries overlap with the actual summaries.

This implementation checks the quality of the model's output by comparing each generated summary with its reference summary. On this small two-article sample the similarity is quite low, which suggests that the model's summaries need improvement.