DeepEval - The LLM Evaluation Framework
Pytest-native evals that run in CI/CD or as Python scripts. Iterate locally, in your own environment, against your own criteria.
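import pytest
from deepeval import assert_test
from deepeval.metrics import TaskCompletenessMetric
from deepeval.test_case import LLMTestCase

# Placeholders: my_ai_agent is your application's entry point, and
# test_cases is your own list of LLMTestCase objects.
@pytest.mark.parametrize("test_case", test_cases)
def test_agent(test_case: LLMTestCase):
    my_ai_agent(test_case.input)  # run the agent on the test input
    assert_test(test_case=test_case, metrics=[TaskCompletenessMetric()])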
$ deepeval test run tests/test_agent.py
Research-backed metrics with transparent, explainable scores — every judgment comes with reasoning you can trust, debug, and defend.
50+ research-backed metrics
Hallucination, faithfulness, answer relevancy, summarization, toxicity, bias, and more — ready out of the box.
Native conversational evals
Role adherence, knowledge retention, and conversation completeness — dedicated metrics built for multi-turn from day one.
Multi-modal by default
Text, images, and audio — all first-class. Same test case, same runner, same metrics across every modality.
Compose state-of-the-art techniques into metrics that fit your product — plain-English criteria, decision graphs, weighted scoring, and more, all in the same runner.
G-Eval
Criteria-based, chain-of-thought scoring via form-filling for reliable subjective evals.
DAG
Directed-acyclic-graph metrics for objective, multi-step conditional scoring.
QAG
Question-Answer Generation for close-ended, reference-grounded scoring.
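To make the DAG idea concrete, here is a minimal sketch of conditional, multi-step scoring in plain Python. It does not use DeepEval's own DAG classes. The names `Verdict`, `Decision`, and `evaluate`, the example checks, and the scores are all hypothetical, and the yes/no checks are simple string tests where a real metric would more likely ask an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Verdict:
    """Leaf node: a fixed score and the reason it was reached."""
    score: float
    reason: str


@dataclass
class Decision:
    """Inner node: a yes/no check that routes to one of two children."""
    question: str
    check: Callable[[str], bool]
    if_yes: Union["Decision", Verdict]
    if_no: Union["Decision", Verdict]


def evaluate(node: Union[Decision, Verdict], output: str) -> Verdict:
    """Walk the graph from the root until a leaf verdict is reached."""
    while isinstance(node, Decision):
        node = node.if_yes if node.check(output) else node.if_no
    return node


# Hypothetical criteria for a refund reply: it must mention the refund,
# and it earns full marks only if it also states an amount.
has_amount = Decision(
    question="Does the reply state an amount?",
    check=lambda text: "$" in text,
    if_yes=Verdict(1.0, "mentions refund and amount"),
    if_no=Verdict(0.5, "mentions refund but no amount"),
)
root = Decision(
    question="Does the reply mention a refund?",
    check=lambda text: "refund" in text.lower(),
    if_yes=has_amount,
    if_no=Verdict(0.0, "no refund mentioned"),
)

result = evaluate(root, "Your refund of $29.99 is on its way.")
print(result.score, result.reason)
```

Because every path ends in a fixed verdict, the same output always receives the same score and reason, which is what makes this style suitable for objective criteria. Nodes can be shared by several parents, so the structure is a graph rather than a strict tree.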
DeepEval traces every step of your agent into something you can grade and improve: visible in your terminal, testable in your runner, shippable in your next commit. No dashboards to open. No context switch required.
$ deepeval test run agents/checkout.py
● test_checkout_agent
│
├─ AGENT  plan_refund_strategy  G-Eval  0.94  220ms
│  ├─ RETRIEVER  retrieve_policy_docs(query=…)  Context Recall  0.89  68ms
│  ├─ TOOL  lookup_order(id="#9281")  Faithfulness  1.00  45ms
│  └─ LLM  gpt-4o · classify_intent  Answer Relevancy  0.92  130ms
│
├─ TOOL  process_refund(amount=29.99)  deterministic  85ms
│
└─ LLM  gpt-4o · draft_response  Helpfulness  0.88  195ms
Trace score 0.92 · 5/5 metrics passed · passed
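As a rough illustration of how a scored trace like the one above can be treated as data, the sketch below models spans as a small tree and summarizes the graded ones. This is not DeepEval's tracing API: `Span`, `walk`, and `summarize` are hypothetical names, the 0.5 pass threshold is an assumption, and the plain mean is only one possible aggregation, since the page does not say how the trace score is computed.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """One step of an agent run, optionally graded by a metric."""
    kind: str
    name: str
    metric: Optional[str] = None
    score: Optional[float] = None  # None for ungraded (deterministic) steps
    children: List["Span"] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def summarize(roots: List[Span], threshold: float = 0.5):
    """Return (mean score, passed count, graded count) over graded spans.

    Assumes at least one graded span; the threshold is an assumed default.
    """
    scores = [s.score for r in roots for s in walk(r) if s.score is not None]
    passed = sum(score >= threshold for score in scores)
    return sum(scores) / len(scores), passed, len(scores)


# Spans and scores copied from the example trace.
trace = [
    Span("AGENT", "plan_refund_strategy", "G-Eval", 0.94, [
        Span("RETRIEVER", "retrieve_policy_docs", "Context Recall", 0.89),
        Span("TOOL", "lookup_order", "Faithfulness", 1.00),
        Span("LLM", "classify_intent", "Answer Relevancy", 0.92),
    ]),
    Span("TOOL", "process_refund"),  # deterministic step, no metric attached
    Span("LLM", "draft_response", "Helpfulness", 0.88),
]

mean, passed, graded = summarize(trace)
print(f"mean {mean:.3f} · {passed}/{graded} metrics passed")
```

With these five scores the plain mean is 0.926, close to but not identical with the 0.92 shown in the trace, so the real aggregation or rounding may differ from this sketch.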
> eval the refund agent and fix any regressions
⏺ Bash(deepeval test run agents/checkout.py)
  ⎿ faithfulness 0.64 ⚠
⏺ Edit(agents/retriever.py)
  ⎿ scoped to active refund policies
⏺ Bash(deepeval test run agents/checkout.py)
  ⎿ faithfulness 0.98 ✓
● All metrics green, ready to commit.
Generate synthetic goldens from your knowledge base, or simulate full conversations across user personas — all before a single real user shows up.
DeepEval is the eval harness for vibe coding agents — closing the build → eval → patch loop your coding agent has been missing. Cursor, Claude Code, and Codex shell out to one CLI, read scored traces with reasons, then patch the failing span and re-run to confirm.
Coding Agent
Cursor · Claude Code · Codex
Your AI App
Agent · RAG · Chatbot
deepeval test run
50+ metrics, one CLI
Scored Trace
Span-level scores + reasons
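The build → eval → patch loop described above could be driven by a small script along these lines. It is a sketch under stated assumptions, not how any particular coding agent works: `run_evals`, `eval_patch_loop`, and the `patch` hook are hypothetical names, and the code assumes the `deepeval test run` command exits with a non-zero status when a test fails. The `runner` parameter exists only so the loop can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable


def run_evals(test_file: str, runner: Callable = subprocess.run) -> bool:
    """Run `deepeval test run <test_file>` and report whether it succeeded.

    Assumes the CLI returns a non-zero exit code when any test fails.
    """
    result = runner(["deepeval", "test", "run", test_file])
    return result.returncode == 0


def eval_patch_loop(
    test_file: str,
    patch: Callable[[int], None],
    max_attempts: int = 3,
    runner: Callable = subprocess.run,
) -> bool:
    """Re-run the evals after each patch until they pass or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if run_evals(test_file, runner):
            return True
        if attempt < max_attempts:
            # Hypothetical hook: hand control to the coding agent to edit code.
            patch(attempt)
    return False
```

A caller might invoke `eval_patch_loop("agents/checkout.py", patch=my_patch_step)`, where `my_patch_step` is whatever function asks the coding agent to fix the failing span. Capping the number of attempts keeps a stuck agent from looping forever.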
DeepEval integrates natively with Confident AI, an observability and evaluation platform for AI quality. It is our Vercel for DeepEval. The same test file you run on your laptop now powers the work of engineering, product, QA, and domain experts.
Plug DeepEval into the tools you already ship with — evaluate across any LLM, any agent framework, and any CI/CD runner without rewriting a line.
None of this would be possible without our community of 250+ contributors. Thank you!