Detectors

Detectors answer “why did my agent fail?” by analyzing execution traces for failures, performing root cause analysis, and producing actionable fix recommendations. While evaluators tell you whether an agent performed well, detectors tell you what went wrong and how to fix it.

Detectors operate on Session objects (the same trace format used by trace-based evaluators) and use LLM-based analysis to identify semantic failures that go beyond simple error codes — hallucinations, tool misuse, reasoning breakdowns, policy violations, and more.

Evaluators give you a score. Detectors give you a diagnosis.

Evaluators alone:

Evaluators + Detectors:

When to Use Detectors

Use detectors when you need to:

Available Detectors

Failure Detection

detect_failures

Root Cause Analysis

analyze_root_cause

Session Diagnosis

diagnose_session


from strands_evals.detectors import diagnose_session

# session is a Session object from a trace provider or in-memory mapper
result = diagnose_session(session)

# What failed?
for failure in result.failures:
    print(f"Span {failure.span_id}: {failure.category}")
    print(f"  Evidence: {failure.evidence}")

# Why did it fail?
for rc in result.root_causes:
    print(f"Root cause at {rc.location}: {rc.root_cause_explanation}")
    print(f"  Fix ({rc.fix_type}): {rc.fix_recommendation}")

# Deduplicated recommendations
for rec in result.recommendations:
    print(f"  - {rec}")
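
When a session produces many failures, it can help to group them before reading the evidence. The sketch below is not part of strands_evals: `FailureStub` is a hypothetical stand-in that carries only the `span_id`, `category`, `confidence`, and `evidence` fields named on this page, and it assumes `confidence` is a float between 0 and 1, which this page does not state. The helper `group_by_category`, its `min_confidence` parameter, and the category strings are illustrative names as well.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FailureStub:
    """Hypothetical stand-in for a detected failure; not the library's class."""
    span_id: str
    category: str
    confidence: float  # assumption: a value between 0 and 1
    evidence: str


def group_by_category(failures, min_confidence=0.5):
    """Group failures by category, dropping those below min_confidence."""
    grouped = defaultdict(list)
    for failure in failures:
        if failure.confidence >= min_confidence:
            grouped[failure.category].append(failure)
    # Sort each group so the most confident failures come first.
    return {
        category: sorted(items, key=lambda f: f.confidence, reverse=True)
        for category, items in grouped.items()
    }


# Illustrative data only; real failures would come from result.failures.
failures = [
    FailureStub("span-1", "hallucination", 0.9, "Cited a file that does not exist"),
    FailureStub("span-2", "tool_misuse", 0.4, "Possibly wrong argument order"),
    FailureStub("span-3", "hallucination", 0.7, "Invented an API parameter"),
]

for category, items in group_by_category(failures).items():
    print(category, [f.span_id for f in items])
```

If the real failure objects expose the same attribute names, the helper should work on `result.failures` unchanged, but that depends on the actual types returned by the library.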

Detectors vs Evaluators

| Aspect | Evaluators | Detectors |
|---|---|---|
| Question | "How well did the agent do?" | "Why did it fail?" |
| Output | Score + pass/fail | Failures + root causes + fix recommendations |
| Granularity | Per-case or per-session | Per-span |
| Purpose | Measure quality | Diagnose problems |
| Use Case | Benchmarking, regression testing | Debugging, improvement |

Use Together: Run evaluators to score your agent, then use detectors on failing cases to understand what went wrong and how to fix it. The Experiment class supports this workflow natively via DiagnosisConfig.

Integration with Experiments

Detectors integrate directly into the evaluation pipeline. Pass a DiagnosisConfig to Experiment to automatically diagnose failing cases:


from strands_evals import Experiment, Case, DiagnosisConfig
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.detectors import DiagnosisTrigger

experiment = Experiment(
    cases=test_cases,
    evaluators=[GoalSuccessRateEvaluator()],
    diagnosis_config=DiagnosisConfig(trigger=DiagnosisTrigger.ON_FAILURE),
)

report = experiment.run_evaluations(my_task)

# View recommendations for failing cases
report.display(include_recommendations=True)

See the Session Diagnosis guide for the full integration walkthrough.

Detectors use a two-phase analysis pipeline:


flowchart TD
    A[Session traces] --> B[1. Failure Detection]
    B --> C[2. Root Cause Analysis]
    C --> D[DiagnosisResult]
    B -. "span_id, category,\nconfidence, evidence" .-> B
    C -. "causality, propagation impact,\nfix recommendation" .-> C
    D -. "failures[]\nroot_causes[]\nrecommendations[]" .-> D

Both phases handle large sessions that exceed LLM context limits through automatic chunking with overlap and merge strategies.
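
This page does not describe how the chunking is implemented, so the following is only a generic sketch of the idea: split a long list of spans into overlapping windows, analyze each window separately, then merge the findings so that a span seen in two windows is reported once. The function names, the dictionary-based findings, and the "keep the highest confidence" merge rule are all assumptions made for illustration, not the library's behavior.

```python
def chunk_with_overlap(spans, chunk_size, overlap):
    """Split spans into windows of chunk_size that share `overlap` items."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(spans), step):
        chunks.append(spans[start:start + chunk_size])
        if start + chunk_size >= len(spans):
            break  # this window already reaches the end of the session
    return chunks


def merge_findings(per_chunk_findings):
    """Merge findings from all chunks, keeping the most confident one per span."""
    best = {}
    for findings in per_chunk_findings:
        for finding in findings:
            current = best.get(finding["span_id"])
            if current is None or finding["confidence"] > current["confidence"]:
                best[finding["span_id"]] = finding
    return list(best.values())


spans = [f"span-{i}" for i in range(7)]
chunks = chunk_with_overlap(spans, chunk_size=4, overlap=1)
print(chunks)

# Hypothetical per-chunk findings; span-3 sits in the overlap and is seen twice.
per_chunk = [
    [{"span_id": "span-3", "category": "tool_misuse", "confidence": 0.6}],
    [
        {"span_id": "span-3", "category": "tool_misuse", "confidence": 0.8},
        {"span_id": "span-5", "category": "hallucination", "confidence": 0.7},
    ],
]
merged = merge_findings(per_chunk)
print(merged)
```

The overlap gives each window some context from its neighbor, which is why a failure near a boundary can show up in two windows and needs deduplicating at merge time.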