Evaluate a target system on a given dataset.
    evaluate(
        self,
        target: Union[TARGET_T, Runnable, EXPERIMENT_T, tuple[EXPERIMENT_T, EXPERIMENT_T]],
        /,
        data: Optional[DATA_T] = None,
        evaluators: Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]] = None,
        summary_evaluators: Optional[Sequence[SUMMARY_EVALUATOR_T]] = None,
        metadata: Optional[dict] = None,
        experiment_prefix: Optional[str] = None,
        description: Optional[str] = None,
        max_concurrency: Optional[int] = 0,
        num_repetitions: int = 1,
        blocking: bool = True,
        experiment: Optional[EXPERIMENT_T] = None,
        upload_results: bool = True,
        error_handling: Literal['log', 'ignore'] = 'log',
        **kwargs: Any,
    ) -> Union[ExperimentResults, ComparativeExperimentResults]

| Name | Type | Description |
|---|---|---|
target* | Union[TARGET_T, Runnable, EXPERIMENT_T, Tuple[EXPERIMENT_T, EXPERIMENT_T]] | The target system or experiment(s) to evaluate. Can be a function that takes a dict and returns a dict, a langchain Runnable, an existing experiment ID, or a two-tuple of experiment IDs. |
data | Optional[DATA_T] | Default: `None`. The dataset to evaluate on. Can be a dataset name, a list of examples, or a generator of examples. |
evaluators | Optional[Union[Sequence[EVALUATOR_T], Sequence[COMPARATIVE_EVALUATOR_T]]] | Default: `None`. A list of evaluators to run on each example. The evaluator signature depends on the target type. |
summary_evaluators | Optional[Sequence[SUMMARY_EVALUATOR_T]] | Default: `None`. A list of summary evaluators to run on the entire dataset. Should not be specified if comparing two existing experiments. |
metadata | Optional[dict] | Default: `None`. Metadata to attach to the experiment. |
experiment_prefix | Optional[str] | Default: `None`. A prefix to provide for your experiment name. |
description | Optional[str] | Default: `None`. A free-form text description for the experiment. |
max_concurrency | Optional[int] | Default: `0`. The maximum number of concurrent evaluations to run. If `None`, no limit is set; if `0`, no concurrency is used. |
num_repetitions | int | Default: `1`. The number of times to run the evaluation. Each item in the dataset will be run and evaluated this many times. |
blocking | bool | Default: `True`. Whether to block until the evaluation is complete. |
experiment | Optional[EXPERIMENT_T] | Default: `None`. An existing experiment to extend. If provided, `experiment_prefix` is ignored. For advanced usage only. Should not be specified if target is an existing experiment or a two-tuple of experiments. |
upload_results | bool | Default: `True`. Whether to upload the results to LangSmith. |
error_handling | Literal['log', 'ignore'] | Default: `'log'`. How to handle individual run errors. |
**kwargs | Any | Additional keyword arguments to pass to the evaluator. |
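A minimal usage sketch: a function target is evaluated against a named dataset with one row-level evaluator. The dataset name `"my-dataset"`, the question/answer schema, and the evaluator logic are all hypothetical; it assumes the `langsmith` package is installed and a `LANGSMITH_API_KEY` is set in the environment.

```python
def target(inputs: dict) -> dict:
    # The system under test: takes an example's inputs dict, returns an outputs dict.
    return {"answer": inputs["question"].strip().lower()}

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    # Row-level evaluator: compares run outputs against the example's reference outputs.
    return {"key": "exact_match", "score": outputs["answer"] == reference_outputs["answer"]}

def run_evaluation():
    # Assumption: langsmith is installed and LANGSMITH_API_KEY is configured.
    from langsmith import Client

    client = Client()
    return client.evaluate(
        target,
        data="my-dataset",                # hypothetical dataset name
        evaluators=[exact_match],
        experiment_prefix="exact-match",  # experiment names will start with this
        max_concurrency=4,                # run up to 4 evaluations at once
        num_repetitions=1,
    )
```

The returned `ExperimentResults` can be iterated over or, with `blocking=True` (the default), inspected once all runs and evaluator scores have completed.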