evaluate — 🦜️🛠️ LangSmith documentation

```python
langsmith.evaluation._runner.evaluate(
    target: TARGET_T | Runnable | EXPERIMENT_T | tuple[EXPERIMENT_T, EXPERIMENT_T],
    /,
    data: DATA_T | None = None,
    evaluators: Sequence[EVALUATOR_T] | Sequence[COMPARATIVE_EVALUATOR_T] | None = None,
    summary_evaluators: Sequence[SUMMARY_EVALUATOR_T] | None = None,
    metadata: dict | None = None,
    experiment_prefix: str | None = None,
    description: str | None = None,
    max_concurrency: int | None = 0,
    num_repetitions: int = 1,
    client: langsmith.Client | None = None,
    blocking: bool = True,
    experiment: EXPERIMENT_T | None = None,
    upload_results: bool = True,
    **kwargs: Any,
) -> ExperimentResults | ComparativeExperimentResults
```

Evaluate a target system on a given dataset.

Parameters:

- target: The system to evaluate. Either a function that takes a dict of inputs and returns a dict of outputs, a Runnable, an existing experiment, or (for comparative evaluation) a two-tuple of existing experiments.
- data: The dataset to evaluate on. A dataset name, a list of examples, or a generator of examples.
- evaluators: Evaluators to run on each example's output, or comparative evaluators when target is a two-tuple of experiments.
- summary_evaluators: Evaluators to run over the results of the entire experiment.
- metadata: Metadata to attach to the experiment.
- experiment_prefix: A prefix for the generated experiment name.
- description: A free-form description of the experiment.
- max_concurrency: The maximum number of concurrent evaluations. None means no limit; 0 (the default) means no concurrency at all.
- num_repetitions: The number of times to run the evaluation over the dataset.
- client: The LangSmith client to use.
- blocking: Whether to block until the evaluation is complete.
- experiment: An existing experiment to extend with the results.
- upload_results: Whether to upload the results to LangSmith.

Returns:

ExperimentResults: If target is a function, Runnable, or existing experiment. ComparativeExperimentResults: If target is a two-tuple of existing experiments.

Return type:

ExperimentResults

Examples

Prepare the dataset:

```python
>>> from typing import Sequence
>>> from langsmith import Client
>>> from langsmith.evaluation import evaluate
>>> from langsmith.schemas import Example, Run
>>> client = Client()
>>> dataset = client.clone_public_dataset(
...     "https://smith.langchain.com/public/419dcab2-1d66-4b94-8901-0357ead390df/d"
... )
>>> dataset_name = "Evaluate Examples"
```

Basic usage:

```python
>>> def accuracy(run: Run, example: Example):
...     # Row-level evaluator for accuracy.
...     pred = run.outputs["output"]
...     expected = example.outputs["answer"]
...     return {"score": expected.lower() == pred.lower()}
>>> def precision(runs: Sequence[Run], examples: Sequence[Example]):
...     # Experiment-level evaluator for precision.
...     # TP / (TP + FP)
...     predictions = [run.outputs["output"].lower() for run in runs]
...     expected = [example.outputs["answer"].lower() for example in examples]
...     # "yes" and "no" are the only possible answers
...     tp = sum([p == e for p, e in zip(predictions, expected) if p == "yes"])
...     fp = sum([p == "yes" and e == "no" for p, e in zip(predictions, expected)])
...     return {"score": tp / (tp + fp)}
>>> def predict(inputs: dict) -> dict:
...     # This can be any function or just an API call to your app.
...     return {"output": "Yes"}
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Evaluating the accuracy of a simple prediction model.",
...     metadata={
...         "my-prompt-version": "abcd-1234",
...     },
... )
View the evaluation results for experiment:...
```
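To see exactly what the `precision` summary evaluator computes, here is the same TP / (TP + FP) arithmetic run standalone on toy prediction lists. No LangSmith client is needed; the lists stand in for the `run.outputs` and `example.outputs` values extracted above:

```python
# Toy data: model predictions vs. expected answers ("yes"/"no" only).
predictions = ["yes", "yes", "no", "yes"]
expected = ["yes", "no", "no", "yes"]

# True positives: predicted "yes" and the prediction matches the answer.
tp = sum(p == e for p, e in zip(predictions, expected) if p == "yes")
# False positives: predicted "yes" but the answer was "no".
fp = sum(p == "yes" and e == "no" for p, e in zip(predictions, expected))

# 2 true positives and 1 false positive -> precision of 2/3.
precision_score = tp / (tp + fp)
```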

Evaluating over only a subset of the examples:

```python
>>> experiment_name = results.experiment_name
>>> examples = client.list_examples(dataset_name=dataset_name, limit=5)
>>> results = evaluate(
...     predict,
...     data=examples,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     experiment_prefix="My Experiment",
...     description="Just testing a subset synchronously.",
... )
View the evaluation results for experiment:...
```

Streaming each prediction, to debug more easily and eagerly:

```python
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[accuracy],
...     summary_evaluators=[precision],
...     description="I don't even have to block!",
...     blocking=False,
... )
View the evaluation results for experiment:...
>>> for i, result in enumerate(results):
...     pass
```
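What the non-blocking loop above buys you can be sketched with a plain generator: each result becomes available as soon as its prediction finishes, so you can inspect or break on any row before the rest of the experiment has run. This is only a toy stand-in for the real `ExperimentResults` iterator, using the same trivial `predict` function:

```python
def predict(inputs: dict) -> dict:
    return {"output": "Yes"}

def run_experiment(examples):
    # Yield each row's result as soon as its prediction finishes,
    # rather than waiting for the whole experiment to complete.
    for example in examples:
        yield {"inputs": example, "outputs": predict(example)}

examples = [{"question": "Is the sky blue?"}, {"question": "Is fire cold?"}]
results = run_experiment(examples)

# Eagerly consume: the first result is in hand before the
# second prediction has even been started.
first = next(results)
collected = [first] + list(results)
```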

Using the evaluate API with an off-the-shelf LangChain evaluator:

```python
>>> from langsmith.evaluation import LangChainStringEvaluator
>>> from langchain_openai import ChatOpenAI
>>> def prepare_criteria_data(run: Run, example: Example):
...     return {
...         "prediction": run.outputs["output"],
...         "reference": example.outputs["answer"],
...         "input": str(example.inputs),
...     }
>>> results = evaluate(
...     predict,
...     data=dataset_name,
...     evaluators=[
...         accuracy,
...         LangChainStringEvaluator("embedding_distance"),
...         LangChainStringEvaluator(
...             "labeled_criteria",
...             config={
...                 "criteria": {
...                     "usefulness": "The prediction is useful if it is correct"
...                     " and/or asks a useful followup question."
...                 },
...                 "llm": ChatOpenAI(model="gpt-4o"),
...             },
...             prepare_data=prepare_criteria_data,
...         ),
...     ],
...     description="Evaluating with off-the-shelf LangChain evaluators.",
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
```

Evaluating a LangChain object:

```python
>>> from langchain_core.runnables import chain as as_runnable
>>> @as_runnable
... def nested_predict(inputs):
...     return {"output": "Yes"}
>>> @as_runnable
... def lc_predict(inputs):
...     return nested_predict.invoke(inputs)
>>> results = evaluate(
...     lc_predict.invoke,
...     data=dataset_name,
...     evaluators=[accuracy],
...     description="This time we're evaluating a LangChain object.",
...     summary_evaluators=[precision],
... )
View the evaluation results for experiment:...
```
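When target is instead a two-tuple of existing experiments, the evaluators are comparative: each one receives the runs produced for the same example by both experiments and scores them against each other, and evaluate returns ComparativeExperimentResults. The pairwise idea can be sketched on plain dicts; the real signature is governed by COMPARATIVE_EVALUATOR_T, so the dict shapes and the `scores`-keyed-by-run-id return value here are illustrative assumptions, not the exact contract:

```python
def prefer_exact_match(runs, example):
    # Toy pairwise preference: award 1 to each run whose output exactly
    # matches the reference answer (case-insensitive), 0 otherwise.
    reference = example["outputs"]["answer"].lower()
    return {
        "key": "exact_match_preference",
        "scores": {
            run["id"]: int(run["outputs"]["output"].lower() == reference)
            for run in runs
        },
    }

example = {"outputs": {"answer": "yes"}}
runs = [
    {"id": "run-a", "outputs": {"output": "Yes"}},  # from experiment A
    {"id": "run-b", "outputs": {"output": "No"}},   # from experiment B
]
result = prefer_exact_match(runs, example)
```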

Changed in version 0.2.0: 'max_concurrency' default updated from None (no limit on concurrency) to 0 (no concurrency at all).
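The changed default matters in practice: max_concurrency=0 runs the target one example at a time in the caller's thread, while a positive value fans predictions out to a bounded worker pool (and None removes the bound). The effect of the setting can be sketched with concurrent.futures; this is an analogy for the observable behavior, not LangSmith's internal implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(predict, inputs_list, max_concurrency=0):
    # max_concurrency == 0: no concurrency at all -> plain serial loop.
    if max_concurrency == 0:
        return [predict(inputs) for inputs in inputs_list]
    # Positive value: bounded worker pool. (None would mean "no limit";
    # ThreadPoolExecutor then picks its own worker count.)
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(predict, inputs_list))

predict = lambda x: {"output": x["q"].upper()}
serial = run_all(predict, [{"q": "a"}, {"q": "b"}], max_concurrency=0)
pooled = run_all(predict, [{"q": "a"}, {"q": "b"}], max_concurrency=2)
```

Both calls produce the same ordered results; only the scheduling differs.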