Assessing NLP Model Effectiveness: WER, CRT, and STS

Last Updated : 23 Jul, 2025

Word Error Rate (WER), Character Recognition Rate (CRT), and Semantic Textual Similarity (STS) are three metrics commonly used to evaluate the effectiveness of NLP models. In this article, we'll explore what each metric measures and how to compute it.

NLP Evaluation Metrics

Evaluation metrics in NLP assess the performance of models across different tasks. These metrics help determine how well a model understands, generates, or processes human language.

The choice of evaluation metric depends on the specific NLP task, such as text classification, machine translation, or text summarization.

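Evaluation methods for NLP models can be broadly categorized into intrinsic evaluation, which scores a model's output directly on the task itself (for example, against a reference transcript), and extrinsic evaluation, which measures how the model affects the performance of a downstream application.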

Need for Evaluation Metric

Word Error Rate (WER)

**Word Error Rate (WER)** evaluates speech recognition systems by assessing how closely the output of a speech-to-text model matches the actual transcription of the spoken content. It is calculated using:

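\text{WER} = \frac{S + D + I}{N}

Where S is the number of substituted words, D is the number of deleted words, I is the number of inserted words, and N is the total number of words in the reference transcript.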

In simpler terms, WER gives a ratio of the number of errors (substitutions, deletions, insertions) compared to the total number of words in the reference transcript.

Let's now implement WER in Python:

```python
import numpy as np

def error_rate(reference, hypothesis):
    # Initializing the matrix
    d = np.zeros((len(reference) + 1, len(hypothesis) + 1), dtype=np.uint32)
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j

    # Computing WER
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                substitution_cost = 0
            else:
                substitution_cost = 1
            d[i][j] = min(d[i - 1][j] + 1,                    # Deletion
                          d[i][j - 1] + 1,                    # Insertion
                          d[i - 1][j - 1] + substitution_cost)  # Substitution

    return d[len(reference)][len(hypothesis)] / len(reference)

reference = "this is a test".split()
hypothesis = "this is test".split()
print(f"WER: {error_rate(reference, hypothesis):.2f}")
```

**Output:**

WER: 0.25
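
The WER formula sums three kinds of errors, but the implementation above returns only their total. As a minimal sketch, the same edit-distance table can be walked backwards to recover the individual counts. The function name `edit_counts` and the tie-breaking order (diagonal step first, then deletion, then insertion) are assumptions made for illustration; when several minimal alignments exist, a different order can split the errors differently while the total stays the same.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for one minimal alignment."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = edit distance between reference[:i] and hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution

    # Walk back from the bottom-right cell, counting each edit type.
    # Assumed tie-breaking: diagonal first, then deletion, then insertion.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                subs += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


ref = "this is a test".split()
hyp = "this is test".split()
s, dl, i_ = edit_counts(ref, hyp)
print(f"S={s}, D={dl}, I={i_}, WER={(s + dl + i_) / len(ref):.2f}")
# S=0, D=1, I=0, WER=0.25
```

For the example sentence, the only error is the dropped word "a", so the single deletion over four reference words gives the 0.25 reported above.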

Character Recognition Rate (CRT)

**Character Recognition Rate (CRT)** assesses **Optical Character Recognition (OCR)** systems. OCR systems convert images of typed or handwritten text into text that can be processed by computers. CRT evaluates the percentage of characters that are correctly recognized by the OCR system.

CRT is calculated as:

\text{CRT} = \frac{\text{Number of correct characters}}{\text{Total number of characters}} \times 100

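To evaluate at the character level, we will use the same method as above, but rather than splitting the input text into words, we will iterate through each character to calculate the error. Note that this yields the Character Error Rate (CER), the character-level counterpart of WER, rather than the percentage in the formula above: a lower CER corresponds to a higher recognition rate.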

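```python
# Reuses error_rate() from the WER example; passing strings instead of
# word lists makes the comparison run character by character.
reference = "this is a test"
hypothesis = "this is test"
print(f"CER: {error_rate(reference, hypothesis):.2f}")
```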

**Output:**

CER: 0.14
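
The CRT formula is stated as a percentage of correctly recognized characters, whereas the snippet above reports an error ratio. A minimal sketch of the percentage form is shown here, assuming that a "correct" character is one that the standard library's `difflib.SequenceMatcher` aligns between the two strings. The function name `character_recognition_rate` is hypothetical, and `difflib` uses a heuristic alignment that is not guaranteed to match the minimum edit distance on every input.

```python
from difflib import SequenceMatcher


def character_recognition_rate(reference, hypothesis):
    """Percentage of reference characters aligned to an identical hypothesis character."""
    if not reference:
        raise ValueError("reference must not be empty")
    # autojunk=False disables difflib's popularity heuristic for long inputs
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    # Each matching block is a run of identical characters; the last block has size 0
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / len(reference) * 100


crt = character_recognition_rate("this is a test", "this is test")
print(f"CRT: {crt:.2f}%")
# CRT: 85.71%
```

Here 12 of the 14 reference characters are matched, which agrees with the CER of 0.14: the two missing characters ("a" and one space) are the two errors.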

Semantic Textual Similarity (STS)

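**Semantic Textual Similarity (STS)** differs from the metrics above: instead of counting edit errors, it measures how similar two texts are in meaning. This measure is particularly useful in applications like machine translation, text summarization, and information retrieval. STS is typically computed as the **cosine similarity** between vector representations of the two texts, produced by pre-trained models like **BERT**.

In this code, we will load a spaCy model and call the **similarity()** method, which returns the similarity score. Note that the small `en_core_web_sm` model does not include word vectors, so its similarity scores may not be very meaningful; a model with vectors, such as `en_core_web_md`, is generally a better choice for this purpose.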

```python
import spacy

def semantic_textual_similarity(text1, text2):
    nlp = spacy.load('en_core_web_sm')
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    return doc1.similarity(doc2)

text1 = "this is a test"
text2 = "this is test"
print(f"STS: {semantic_textual_similarity(text1, text2):.2f}")
```

**Output:**

STS: 0.68
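
To make the cosine similarity step concrete, here is a minimal sketch that computes it by hand using only the standard library. It assumes a toy bag-of-words vector (plain word counts) as a stand-in for the embeddings a pre-trained model would produce, so it captures word overlap rather than meaning and its score is not comparable to the spaCy result above. The helper names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * vec2.get(key, 0) for key, value in vec1.items())
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm1 * norm2)


def bag_of_words(text):
    # Toy stand-in for an embedding: word counts, not a learned representation
    return Counter(text.lower().split())


score = cosine_similarity(bag_of_words("this is a test"),
                          bag_of_words("this is test"))
print(f"Cosine similarity: {score:.2f}")
# Cosine similarity: 0.87
```

With real sentence embeddings, only the vector construction changes; the cosine computation stays the same.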

Challenges and Considerations

Evaluating NLP models is not without challenges:

NLP has significantly transformed the way humans interact with machines, enabling more intuitive and efficient communication. NLP encompasses a wide range of techniques and methodologies to understand, interpret, and generate human language.