Task-Specific Metrics in Hugging Face

Last Updated : 3 Apr, 2026

Task-specific metrics evaluate a model against the objective of its task, such as text generation, question answering, or speech recognition. Hugging Face provides these metrics through the evaluate library, so each task can be scored with a measure suited to it.

Common Task-Specific Metrics

1. Text Generation and Translation

Text generation and translation tasks focus on producing human-like text or converting text between languages. Evaluation uses metrics such as BLEU and ROUGE, which compare the generated text with a reference text.

```python
!pip install rouge_score
!pip install evaluate

from transformers import pipeline
import evaluate

# Generate a short continuation with GPT-2
generator = pipeline("text-generation", model="gpt2")
output = generator("Machine learning is", max_length=20)[0]["generated_text"]

# Load the overlap-based metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

reference = "Machine learning is a field of AI"

# BLEU is given a list of references per prediction; ROUGE is given one reference string here
print("BLEU:", bleu.compute(predictions=[output], references=[[reference]]))
print("ROUGE:", rouge.compute(predictions=[output], references=[reference]))
```

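To make the idea of comparing generated text with reference text concrete, here is a minimal sketch of a ROUGE-1-style unigram overlap score. It is a simplification written for illustration, not the implementation used by the evaluate library: the function name `unigram_overlap` is hypothetical, tokenization is a plain whitespace split, and there is no stemming or n-gram handling beyond single words.

```python
from collections import Counter


def unigram_overlap(prediction: str, reference: str) -> dict:
    """Simplified ROUGE-1-style scores based on clipped unigram overlap."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token,
    # so a repeated word is only credited as often as the reference has it.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = unigram_overlap(
    "Machine learning is a branch of AI",
    "Machine learning is a field of AI",
)
print(scores)  # 6 of 7 words overlap, so all three scores equal 6/7
```

Precision asks how much of the generated text appears in the reference, while recall asks how much of the reference was recovered. BLEU is built around n-gram precision and ROUGE around n-gram recall and F-measure, which is why the two are often reported together.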

2. Question Answering (QA)

Extractive QA models pull the answer span out of a given context. Evaluation checks how closely the predicted answer matches the reference answer, using Exact Match and a token-level F1 score.

```python
from transformers import pipeline
import evaluate

# No model is named, so the pipeline falls back to its default QA checkpoint
qa = pipeline("question-answering")
context = "AI is artificial intelligence."
question = "What is AI?"
result = qa(question=question, context=context)

squad_metric = evaluate.load("squad")

# The SQuAD metric pairs each prediction with its reference through "id"
predictions = [{"prediction_text": result["answer"], "id": "1"}]
references = [{"answers": {"answer_start": [6], "text": ["artificial intelligence"]}, "id": "1"}]

results = squad_metric.compute(predictions=predictions, references=references)

print(f"Prediction: {result['answer']}")
print(f"F1 Score: {results['f1']}")
print(f"Exact Match: {results['exact_match']}")
```

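The two scores reported for QA can be sketched in plain Python. The version below follows the usual SQuAD-style recipe (lowercase, strip punctuation and the articles a/an/the, then compare), but it is an illustrative approximation with hypothetical helper names, and it returns values between 0 and 1 rather than on the 0 to 100 scale the library metric reports.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, otherwise 0.0."""
    return float(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    common = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


truth = "artificial intelligence"
print(exact_match("Artificial Intelligence.", truth))  # 1.0 after normalization
print(exact_match("intelligence", truth))              # 0.0, strings differ
print(round(token_f1("intelligence", truth), 3))       # 0.667, partial credit
```

Exact Match is all or nothing, so F1 is the score that rewards a prediction that captures only part of the reference answer.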

3. Named Entity Recognition (NER)

NER identifies entities such as people and organizations in text. Evaluation compares predicted and reference tag sequences and reports precision, recall, and F1 through the seqeval metric.

```python
!pip install seqeval

from transformers import pipeline
import evaluate

# aggregation_strategy="simple" merges word pieces into grouped entities
ner = pipeline("ner", aggregation_strategy="simple")
preds = ner("Elon Musk founded SpaceX")

# Hand-written BIO tags for the four words, used to illustrate the metric.
# They are not derived from `preds`, so the scores below are perfect by construction.
predictions = [["B-PER", "I-PER", "O", "B-ORG"]]
references = [["B-PER", "I-PER", "O", "B-ORG"]]

metric = evaluate.load("seqeval")
results = metric.compute(predictions=predictions, references=references)

print(preds)

print("Precision:", results["overall_precision"])
print("Recall:", results["overall_recall"])
print("F1 Score:", results["overall_f1"])
```

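Sequence-labeling scores of this kind are computed over whole entities rather than individual tags. The sketch below shows one way to do that for BIO tags, under simplifying assumptions: the helper names are hypothetical, an `I-` tag with no preceding `B-` of the same type is ignored, and an entity counts as correct only if its type and both boundaries match. It is meant to illustrate the idea, not to reproduce seqeval exactly.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if ent_type is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


def entity_prf(predictions, references):
    """Entity-level precision, recall, and F1 over a batch of sequences."""
    tp = n_pred = n_true = 0
    for pred_tags, true_tags in zip(predictions, references):
        pred_entities = extract_entities(pred_tags)
        true_entities = extract_entities(true_tags)
        tp += len(pred_entities & true_entities)
        n_pred += len(pred_entities)
        n_true += len(true_entities)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# The prediction truncates "Elon Musk" to "Elon", so only SpaceX matches.
references = [["B-PER", "I-PER", "O", "B-ORG"]]
predictions = [["B-PER", "O", "O", "B-ORG"]]
print(entity_prf(predictions, references))  # (0.5, 0.5, 0.5)
```

Three of the four tags in the prediction are correct, yet the entity-level score is only 0.5, because a partially labeled entity earns no credit.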

4. Speech Recognition (ASR)

Speech recognition converts audio into text. Evaluation measures transcription accuracy with word error rate (WER) and character error rate (CER), where lower values are better.

```python
!pip install -q transformers datasets evaluate librosa soundfile
!pip install jiwer

from transformers import pipeline
from datasets import load_dataset
import evaluate

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h"
)

# Small LibriSpeech sample used for testing
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

audio = dataset[0]["audio"]
reference = dataset[0]["text"]

# Transcribe the first clip
pred = asr(audio)["text"]

print("Prediction:", pred)
print("Reference:", reference)

wer = evaluate.load("wer")
cer = evaluate.load("cer")

print("WER:", wer.compute(predictions=[pred], references=[reference]))
print("CER:", cer.compute(predictions=[pred], references=[reference]))
```

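Both error rates rest on edit distance: the number of substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the reference length. The following is a minimal sketch with hypothetical function names; it assumes a non-empty reference and applies no text normalization, so casing and punctuation differences count as errors.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def char_error_rate(reference: str, prediction: str) -> float:
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, prediction) / len(reference)


# One substitution (SAT -> SIT) and one deletion (THE) over 6 reference words
print(word_error_rate("THE CAT SAT ON THE MAT", "THE CAT SIT ON MAT"))
# One substituted character out of 5
print(char_error_rate("HELLO", "HALLO"))  # 0.2
```

Because insertions are counted too, either rate can exceed 1.0 when the prediction is much longer than the reference.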

5. Image Generation

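Image generation models create images from text prompts. SSIM and PSNR measure structural similarity and pixel-level fidelity between two images, so they need a second image to compare against; the example below uses a noise-perturbed copy of the generated image for that purpose.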

```python
!pip install diffusers transformers accelerate evaluate pillow torchvision
!pip install -q scikit-image

from diffusers import StableDiffusionPipeline
import torch
from skimage.metrics import structural_similarity as ssim
from skimage.metrics import peak_signal_noise_ratio as psnr
from PIL import Image
import numpy as np

# float16 is intended for GPU inference, so fall back to float32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=dtype
).to(device)

prompt = "A cat sitting on a chair"

image = pipe(prompt).images[0]

image.save("generated.png")

# Compare the generated image with a copy perturbed by Gaussian noise (std 10)
img = np.array(Image.open("generated.png").convert("RGB"))
noisy = np.clip(img + np.random.normal(0, 10, img.shape), 0, 255).astype(np.uint8)

ssim_score = ssim(img, noisy, channel_axis=2, data_range=255)
psnr_score = psnr(img, noisy, data_range=255)

print(f"SSIM : {ssim_score:.4f} (1.0 = identical, > 0.9 = very similar)")
print(f"PSNR : {psnr_score:.2f} dB (> 30 dB = good quality)")
```

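PSNR in particular is simple enough to compute by hand, which helps when interpreting the decibel values. The sketch below works on flat lists of pixel values instead of NumPy arrays and uses a hypothetical function name; it applies the standard definition, 10 * log10(data_range^2 / MSE), and is not the scikit-image implementation.

```python
import math


def psnr_from_pixels(image_a, image_b, data_range=255):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(image_a) != len(image_b):
        raise ValueError("images must contain the same number of pixels")
    # Mean squared error over all pixels
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(data_range ** 2 / mse)


original = [100, 150, 200, 250]
perturbed = [110, 140, 210, 240]  # every pixel is off by 10, so MSE = 100

value = psnr_from_pixels(original, perturbed)
print(f"{value:.2f} dB")  # 28.13 dB
```

An error of about 10 intensity levels per pixel gives roughly 28 dB, so Gaussian noise with a standard deviation of 10 should land in the same range, slightly below the 30 dB mark often quoted for good quality.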

Advantages

Limitations