
Commit 1e4026b

Merge pull request #2 from AnuradhaKaruppiah/eval-doc-fixes
Update evaluation docs
2 parents e7199e8 + 9a4fe25 commit 1e4026b

3 files changed: +30, -12 lines


docs/source/concepts/evaluate.md (+13, -4)
@@ -29,15 +29,15 @@ aiq eval --config_file=examples/simple/configs/eval_config.yml
 ```

 ## Using Datasets
-Run and evaluate the workflow on a specified dataset. The dataset file types are `json`, `jsonl`, `csv`, `xls`, or `parquet`.
+Run and evaluate the workflow on a specified dataset. The dataset file types are `json`, `jsonl`, `csv`, `xls`, or `parquet`.

 Download and use datasets provided by AgentIQ examples by running the following.

 ```bash
 git lfs fetch
 git lfs pull
 ```
-The dataset used for evaluation is specified in the `config.yml` file via `eval.general.dataset`. For example, to use the `langsmith.json` dataset, the configuration is as follows:
+The dataset used for evaluation is specified in the configuration file via `eval.general.dataset`. For example, to use the `langsmith.json` dataset, the configuration is as follows:
 ```yaml
 eval:
   general:
@@ -246,11 +246,20 @@ aiq eval --config_file=examples/simple/configs/eval_config.yml --skip_completed_
 ## Running evaluation offline
 You can evaluate a dataset with previously generated answers via the `--skip_workflow` option. In this case the dataset has both the expected `answer` and the `generated_answer`.
 ```bash
-aiq eval --config_file=examples/simple/configs/config.yml --skip_workflow
+aiq eval --config_file=examples/simple/configs/eval_config.yml --skip_workflow --dataset=.tmp/aiq/examples/simple/workflow_output.json
 ```
+This assumes that the workflow output was previously generated and stored in the `.tmp/aiq/examples/simple/workflow_output.json` file.

 ## Running the workflow over a dataset without evaluation
-You can do this via a config.yml file that has no `evaluators`.
+You can do this by running `aiq eval` with a workflow configuration file that includes an `eval` section with no `evaluators`.
+```yaml
+eval:
+  general:
+    output_dir: ./.tmp/aiq/examples/simple/
+    dataset:
+      _type: json
+      file_path: examples/simple/data/langsmith.json
+```

 ## Evaluation output
 The output of the workflow is stored as `workflow_output.json` in the `output_dir` provided in the config.yml -
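For context on the `--skip_workflow` change above: that flow reuses the `workflow_output.json` produced by a previous run as the dataset, so each entry already carries both the expected `answer` and the `generated_answer`. The sketch below illustrates one plausible shape for such a file; the `question` field name and the sample values are illustrative assumptions, not copied from the repository.

```json
[
  {
    "id": 1,
    "question": "What is LangSmith?",
    "answer": "expected answer taken from the dataset",
    "generated_answer": "answer previously produced by the workflow"
  }
]
```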

docs/source/guides/custom-evaluator.md (+11, -8)
@@ -44,15 +44,17 @@ The following is an example of an evaluator configuration and evaluator function

 `examples/simple/src/aiq_simple/evaluator_register.py`:
 ```python
+from pydantic import Field
+
 from aiq.builder.builder import EvalBuilder
 from aiq.builder.evaluator import EvaluatorInfo
 from aiq.cli.register_workflow import register_evaluator
 from aiq.data_models.evaluator import EvaluatorBaseConfig


 class SimilarityEvaluatorConfig(EvaluatorBaseConfig, name="similarity"):
-    '''Configuration for custom evaluator'''
-    similarity_type: str = "cosine"
+    '''Configuration for custom similarity evaluator'''
+    similarity_type: str = Field(description="Similarity type to be computed", default="cosine")


 @register_evaluator(config_type=SimilarityEvaluatorConfig)
@@ -72,6 +74,7 @@ The `register_similarity_evaluator` function is used to register the evaluator w

 To ensure that the evaluator is registered, the evaluator function is imported, but not used, in the simple example's `register.py`

+`examples/simple/src/aiq_simple/register.py`:
 ```python
 from .evaluator_register import register_similarity_evaluator # pylint: disable=unused-import
 ```
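Continuing the `evaluator_register.py` hunk above, the registration function referenced by this import generally pairs the config class with an evaluate callable via `EvaluatorInfo`. The sketch below is a minimal illustration of that shape, not a verbatim copy of the example file: the `SimilarityEvaluator` constructor arguments and the exact `EvaluatorInfo` fields are assumptions.

```python
@register_evaluator(config_type=SimilarityEvaluatorConfig)
async def register_similarity_evaluator(config: SimilarityEvaluatorConfig, builder: EvalBuilder):
    # Imported lazily so the evaluator class is only loaded when this evaluator is used (assumed layout);
    # builder (EvalBuilder) is available for setup but unused in this sketch.
    from .similarity_evaluator import SimilarityEvaluator

    evaluator = SimilarityEvaluator(similarity_type=config.similarity_type)  # assumed constructor
    yield EvaluatorInfo(config=config,
                        evaluate_fn=evaluator.evaluate,
                        description="Similarity evaluator")  # assumed EvaluatorInfo fields
```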
@@ -80,9 +83,9 @@ from .evaluator_register import register_similarity_evaluator # pylint: disable
 The asynchronous evaluate method provided by the custom evaluator takes an `EvalInput` object as input and returns an `EvalOutput` object as output.

 `EvalInput` is a list of `EvalInputItem` objects. Each `EvalInputItem` object contains the following fields:
-- `id`: The unique identifier for the item.
-- `input_obj`: This is typically the question. It can be a string or any serializable object.
-- `expected_output_obj`: The expected answer for the question. This can be a string or any serializable object.
+- `id`: The unique identifier for the item. It is defined in the dataset file and can be an integer or a string.
+- `input_obj`: This is typically the question. It is derived from the dataset file and can be a string or any serializable object.
+- `expected_output_obj`: The expected answer for the question. It is derived from the dataset file and can be a string or any serializable object.
 - `output_obj`: The answer generated by the workflow for the question. This can be a string or any serializable object.
 - `trajectory`: List of intermediate steps returned by the workflow. This is a list of `IntermediateStep` objects.

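To make the `EvalInput`/`EvalOutput` contract concrete, a minimal evaluate method over the fields listed above could look like the sketch below. The exact-match scoring is a placeholder, and the `eval_input.eval_input_items` attribute name plus the `EvalOutputItem` constructor fields are assumptions; the real similarity logic appears in the `similarity_evaluator.py` hunks later in this diff.

```python
async def evaluate(self, eval_input: EvalInput) -> EvalOutput:
    """Score each item by comparing the generated answer with the expected answer."""
    sample_scores = []
    eval_output_items = []
    for item in eval_input.eval_input_items:  # assumed attribute holding the EvalInputItem list
        # Placeholder scoring: 1.0 on exact match, 0.0 otherwise
        score = 1.0 if item.output_obj == item.expected_output_obj else 0.0
        sample_scores.append(score)
        eval_output_items.append(
            EvalOutputItem(id=item.id, score=score, reasoning="exact-match placeholder"))  # assumed fields

    # Average score rounded to two decimals, matching the output shown further down
    avg_score = round(sum(sample_scores) / len(sample_scores), 2) if sample_scores else 0.0
    return EvalOutput(average_score=avg_score, eval_output_items=eval_output_items)
```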
@@ -97,8 +100,8 @@ The evaluate method computes the score for each item in the evaluation input and

 ### Similarity Evaluator
 Similarity evaluator is used as an example to demonstrate the process of creating and registering a custom evaluator with AgentIQ. We add this code to a new `similarity_evaluator.py` file in the simple example directory for testing purposes.
-`examples/simple/src/aiq_simple/similarity_evaluator.py`:

+`examples/simple/src/aiq_simple/similarity_evaluator.py`:
 ```python
 import asyncio

@@ -152,7 +155,7 @@ class SimilarityEvaluator:
         sample_scores, sample_reasonings = zip(*results) if results else ([], [])

         # Compute average score
-        avg_score = sum(sample_scores) / len(sample_scores) if sample_scores else 0.0
+        avg_score = round(sum(sample_scores) / len(sample_scores), 2) if sample_scores else 0.0

         # Construct EvalOutputItems
         eval_output_items = [
@@ -208,7 +211,7 @@ The results of each evaluator is stored in a separate file with name `<keyword>_
 `examples/simple/.tmp/aiq/examples/simple/similarity_eval_output.json`:
 ```json
 {
-  "average_score": 0.6333333333333334,
+  "average_score": 0.63,
   "eval_output_items": [
     {
       "id": 1,

docs/source/guides/evaluate.md (+6)
@@ -68,6 +68,12 @@ The dataset file provides a list of questions and expected answers. The followin
 ## Understanding the Evaluator Configuration
 The evaluators section specifies the evaluators to use for evaluating the workflow output. The evaluator configuration includes the evaluator type, the metric to evaluate, and any additional parameters required by the evaluator.

+### Display all evaluators
+To display all existing evaluators, run the following command:
+```bash
+aiq info components -t evaluator
+```
+
 ### Ragas Evaluator
 [RAGAS](https://docs.ragas.io/) is an OSS evaluation framework that enables end-to-end
 evaluation of RAG workflows. AgentIQ provides an interface to RAGAS to evaluate the performance
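Tying the new `aiq info components -t evaluator` command back to the configuration described above: each entry under `eval.evaluators` names an evaluator and sets its type, metric, and any extra parameters. The YAML below is only a hedged sketch; the `rag_accuracy` name and the `AnswerAccuracy` metric are assumptions, not values taken from this commit.

```yaml
eval:
  evaluators:
    rag_accuracy:               # arbitrary name for this evaluator instance (assumption)
      _type: ragas              # evaluator type, as listed by `aiq info components -t evaluator`
      metric: AnswerAccuracy    # metric to evaluate (assumption)
```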
