Run and evaluate the workflow on a specified dataset. The supported dataset file types are `json`, `jsonl`, `csv`, `xls`, and `parquet`.
Download and use the datasets provided by the AgentIQ examples by running the following:
```bash
git lfs fetch
git lfs pull
```
The dataset used for evaluation is specified in the configuration file via `eval.general.dataset`. For example, to use the `langsmith.json` dataset, the configuration is as follows:
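A minimal sketch of the relevant section is shown below; the `_type` and `file_path` keys and the path used here are illustrative assumptions, so adjust them to your dataset and configuration schema:

```yaml
eval:
  general:
    dataset:
      _type: json                                      # matches the dataset file type
      file_path: examples/simple/data/langsmith.json   # path to the dataset file (illustrative)
```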
You can evaluate a dataset with previously generated answers via the `--skip_workflow` option. In this case the dataset has both the expected `answer` and the `generated_answer`.
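A hypothetical entry in such a pre-generated dataset might look like the following; only the `answer` and `generated_answer` fields are described above, and the other field names and values are illustrative:

```json
{
  "id": 1,
  "question": "What is LangSmith?",
  "answer": "LangSmith is a platform for building and monitoring LLM applications.",
  "generated_answer": "LangSmith helps developers build, test, and monitor LLM applications."
}
```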
The asynchronous evaluate method provided by the custom evaluator takes an `EvalInput` object as input and returns an `EvalOutput` object as output.
`EvalInput` is a list of `EvalInputItem` objects. Each `EvalInputItem` object contains the following fields:
- `id`: The unique identifier for the item. It is defined in the dataset file and can be an integer or a string.
- `input_obj`: This is typically the question. It is derived from the dataset file and can be a string or any serializable object.
- `expected_output_obj`: The expected answer for the question. It is derived from the dataset file and can be a string or any serializable object.
- `output_obj`: The answer generated by the workflow for the question. This can be a string or any serializable object.
- `trajectory`: List of intermediate steps returned by the workflow. This is a list of `IntermediateStep` objects.

The evaluate method computes the score for each item in the evaluation input and returns an `EvalOutput` object.
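As a rough sketch of this interface (the import path and the attribute names on `EvalInput`, `EvalOutput`, and `EvalOutputItem` are assumptions for illustration; consult the AgentIQ evaluator interfaces for the authoritative definitions), a custom evaluate method could be structured as follows:

```python
# Illustrative skeleton of a custom evaluator's asynchronous evaluate method.
# NOTE: the import path and the EvalOutput/EvalOutputItem field names are assumptions.
from aiq.eval.evaluator.evaluator_model import EvalInput, EvalOutput, EvalOutputItem


class ExactMatchEvaluator:
    """Toy evaluator that scores each item 1.0 on an exact string match, else 0.0."""

    async def evaluate(self, eval_input: EvalInput) -> EvalOutput:
        output_items = []
        for item in eval_input.eval_input_items:       # one EvalInputItem per dataset entry
            expected = str(item.expected_output_obj)   # expected answer from the dataset
            generated = str(item.output_obj)           # answer generated by the workflow
            score = 1.0 if expected.strip() == generated.strip() else 0.0
            output_items.append(EvalOutputItem(id=item.id, score=score, reasoning=""))

        average = sum(i.score for i in output_items) / max(len(output_items), 1)
        return EvalOutput(average_score=average, eval_output_items=output_items)
```

A real evaluator would replace the exact-match rule with the metric it implements, for example an embedding-based similarity score.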
### Similarity Evaluator
The similarity evaluator is used as an example to demonstrate the process of creating and registering a custom evaluator with AgentIQ. We add this code to a new `similarity_evaluator.py` file in the simple example directory for testing purposes.
The dataset file provides a list of questions and expected answers.
## Understanding the Evaluator Configuration
The evaluators section specifies the evaluators to use for evaluating the workflow output. The evaluator configuration includes the evaluator type, the metric to evaluate, and any additional parameters required by the evaluator.
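For illustration, an evaluator entry in the configuration might look like the following sketch; the evaluator name, `_type`, and `metric` values are placeholders, and the available evaluator types can be listed with the command shown below:

```yaml
eval:
  evaluators:
    answer_accuracy:           # arbitrary name for this evaluator instance
      _type: ragas             # evaluator type (placeholder)
      metric: AnswerAccuracy   # metric to evaluate (placeholder)
```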
### Display all evaluators
To display all existing evaluators, run the following command:
```bash
aiq info components -t evaluator
```
### Ragas Evaluator
[RAGAS](https://docs.ragas.io/) is an OSS evaluation framework that enables end-to-end
evaluation of RAG workflows. AgentIQ provides an interface to RAGAS to evaluate the performance of such workflows.