Model-Based Evaluation

Haystack supports various kinds of model-based evaluation. This page explains what model-based evaluation is and discusses the options available in Haystack.

Model-based evaluation in Haystack uses a language model to check the results of a Pipeline. This method is easy to use because it usually doesn't require labels for the outputs. It's most often used with Retrieval-Augmented Generation (RAG) Pipelines, but it works with any Pipeline.

Currently, Haystack supports the end-to-end, model-based evaluation of a complete RAG Pipeline.

A common strategy for model-based evaluation is to use a large language model (LLM), such as one of OpenAI's GPT models, as the evaluator, often referred to as the golden model. The most frequently used golden model is GPT-4. To evaluate a RAG Pipeline, we provide this model with the Pipeline's results, sometimes additional information, and a prompt that outlines the evaluation criteria.

This method of using an LLM as the evaluator is very flexible because it gives you access to a number of metrics. Each of these metrics is ultimately a well-crafted prompt that describes to the LLM how to evaluate and score results. Common metrics include faithfulness and context relevance.
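To make this concrete, here is a minimal sketch that hand-rolls a faithfulness-style check by pairing Haystack's PromptBuilder with an OpenAIGenerator. The prompt text and the 0–5 scale are illustrative assumptions, not the prompts that the evaluation frameworks described below ship with.

```python
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

# An illustrative evaluation prompt: it tells the LLM what to judge and how to score it.
template = """You are an evaluator. Given a question, a context, and a generated answer,
rate from 0 (not grounded at all) to 5 (fully grounded) how faithful the answer is
to the context. Reply with the score only.

Question: {{ question }}
Context: {{ context }}
Answer: {{ answer }}
Score:"""

builder = PromptBuilder(template=template)
# Requires the OPENAI_API_KEY environment variable to be set.
llm = OpenAIGenerator(model="gpt-4")

prompt = builder.run(
    question="Where is the Eiffel Tower?",
    context="The Eiffel Tower is a landmark in Paris, France.",
    answer="It is located in Paris.",
)["prompt"]

print(llm.run(prompt=prompt)["replies"][0])  # e.g. "5"
```

The evaluation frameworks below do essentially this, but with carefully engineered prompts, parsing of the scores, and support for many metrics out of the box.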

Alongside LLMs, we can also use small cross-encoder models as evaluators. These models can calculate, for example, semantic answer similarity. In contrast to LLM-based metrics, metrics based on these smaller models don't require an API key from a model provider.

This method of using small cross-encoder models as evaluators is faster and cheaper to run, but it is less flexible in terms of which aspects you can evaluate: you can only evaluate what the small model was trained to evaluate.
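As a sketch of this approach, the example below scores semantic answer similarity with a small cross-encoder from the sentence-transformers library that runs locally. The model name is one common choice for semantic textual similarity, not something Haystack mandates.

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder fine-tuned for semantic textual similarity (STS);
# it runs locally, so no model-provider API key is needed.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

ground_truth = "The Eiffel Tower is located in Paris, France."
predicted_answer = "You can find the Eiffel Tower in Paris."

# predict() returns one similarity score per (text_a, text_b) pair,
# roughly between 0 and 1 for STS-trained cross-encoders.
score = model.predict([(ground_truth, predicted_answer)])[0]
print(f"Semantic answer similarity: {score:.2f}")
```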

There are two ways of performing model-based evaluation in Haystack, both of which leverage Pipelines and Evaluator components.

Currently, Haystack has integrations with DeepEval and Ragas. There is an Evaluator component available for each of these frameworks:

| Feature/Integration | RagasEvaluator | DeepEvalEvaluator |
| --- | --- | --- |
| Evaluator models | All GPT models from OpenAI, Google VertexAI models, Azure OpenAI models, Amazon Bedrock models | All GPT models from OpenAI |
| Supported metrics | ANSWER_CORRECTNESS, FAITHFULNESS, ANSWER_SIMILARITY, CONTEXT_PRECISION, CONTEXT_UTILIZATION, CONTEXT_RECALL, ASPECT_CRITIQUE, CONTEXT_RELEVANCY, ANSWER_RELEVANCY | ANSWER_RELEVANCY, FAITHFULNESS, CONTEXTUAL_PRECISION, CONTEXTUAL_RECALL, CONTEXTUAL_RELEVANCE |
| Customizable prompt for response evaluation | ✅, with the ASPECT_CRITIQUE metric | |
| Explanations of scores | | |
| Monitoring dashboard | | |
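As a minimal sketch of wiring one of these Evaluator components into a Pipeline, the example below runs the Ragas FAITHFULNESS metric. It assumes the ragas-haystack integration is installed and OPENAI_API_KEY is set; the expected input names vary by metric and integration version, so check the integration documentation for the exact signature.

```python
from haystack import Pipeline
from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric

# Each RagasEvaluator instance wraps a single metric; the evaluation prompt
# and scoring logic come from the Ragas framework.
evaluator = RagasEvaluator(metric=RagasMetric.FAITHFULNESS)

pipeline = Pipeline()
pipeline.add_component("evaluator", evaluator)

# For faithfulness, the evaluator is assumed here to take the questions,
# the retrieved contexts, and the generated responses of your RAG Pipeline.
results = pipeline.run({
    "evaluator": {
        "questions": ["Which is the most popular global sport?"],
        "contexts": [["Football is the world's most popular sport."]],
        "responses": ["Football is the most popular sport in the world."],
    }
})
print(results["evaluator"]["results"])  # per-input metric scores
```

The DeepEvalEvaluator is used in the same way: you pick a metric, add the component to a Pipeline, and pass it the outputs of the Pipeline you want to evaluate.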

📘 Framework Documentation

You can find more information about the metrics in the documentation of the respective evaluation frameworks, Ragas and DeepEval.
