DeepEval - The LLM Evaluation Framework (original) (raw)

Pytest-native evals that run in CI/CD or as Python scripts. Iterate locally, on your own environment, on your own criteria.

from deepeval.metrics import TaskCompletenessMetric

from deepeval.test_case import LLMTestCase

from deepeval import assert_test

@pytest.mark.parametrize("test_case", LLMTestCase)

def test_agent(test_case: LLMTestCase):

my_ai_agent(test_case.input)

assert_test(metrics=[TaskCompletenessMetric()])

$deepeval test run tests/test_agent.py

Research-backed metrics with transparent, explainable scores — every judgment comes with reasoning you can trust, debug, and defend.

50+ research-backed metrics

Hallucination, faithfulness, answer relevancy, summarization, toxicity, bias, and more — ready out of the box.

Native conversational evals

Role adherence, knowledge retention, and conversation completeness — dedicated metrics built for multi-turn from day one.

Text, images, and audio — all first-class. Same test case, same runner, same metrics across every modality.

Compose state-of-the-art techniques into metrics that fit your product — plain-English criteria, decision graphs, weighted scoring, and more, all in the same runner.

G-Eval

Criteria-based, chain-of-thought scoring via form-filling for reliable subjective evals.

DAG

Directed-acyclic-graph metrics for objective, multi-step conditional scoring.

QAG

Question-Answer Generation for close-ended, reference-grounded scoring.

DeepEval traces every step of your agent into something you can grade, and improve — visible in your terminal, testable in your runner, shippable in your next commit. No dashboards to open. No context switch required.

$deepeval test run agents/checkout.py

●test_checkout_agent

│

├─AGENTplan_refund_strategyG-Eval0.94220ms

│ ├─RETRIEVERretrieve_policy_docs(query=…)Context Recall0.8968ms

│ ├─TOOLlookup_order(id="#9281")Faithfulness1.0045ms

│ └─LLMgpt-4o · classify_intentAnswer Relevancy0.92130ms

│

├─TOOLprocess_refund(amount=29.99)deterministic85ms

│

└─LLMgpt-4o · draft_responseHelpfulness0.88195ms

Trace score 0.92 · 5/5 metrics passedpassed

Claude Code

>eval the refund agent and fix any regressions

⏺Bash(deepeval test run agents/checkout.py)⎿faithfulness 0.64 ⚠

⏺Edit(agents/retriever.py)⎿scoped to active refund policies

⏺Bash(deepeval test run agents/checkout.py)⎿faithfulness 0.98 ✓

●All metrics green — ready to commit.

>Try “ship it”

? for shortcuts

Generate synthetic goldens from your knowledge base, or simulate full conversations across user personas — all before a single real user shows up.

DeepEval is the eval harness for vibe coding agents — closing the build → eval → patch loop your coding agent has been missing. Cursor, Claude Code, and Codex shell out to one CLI, read scored traces with reasons, then patch the failing span and re-run to confirm.

Coding Agent

Cursor · Claude Code · Codex

Your AI App

Agent · RAG · Chatbot

deepeval test run

50+ metrics, one CLI

Scored Trace

Span-level scores + reasons

DeepEval integrates natively with Confident AI, an AI observability and evaluation platform for AI quality. It is our Vercel for DeepEval. The same test file you run on your laptop now poweres engineering, product, QAs, and domain experts.

Explore enterprise

Plug DeepEval into the tools you already ship with — evaluate across any LLM, any agent framework, and any CI/CD runner without rewriting a line.

Nothing would be possible without our community of 250+ contributors, thank you!

Start Evaluating