# chore(docs): update how to create custom metrics #2150
Alternatively, you can pass a list of answers directly to the `evaluate` function instead of `get_answer_fn`; in that case, you can also pass the retrieved documents through the optional `retrieved_documents` argument to compute the RAGAS metrics.
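For example, here is a minimal sketch of that call pattern, based on the description above; the answer and document values are illustrative, and `testset` and `knowledge_base` are the objects built earlier in this guide:

```python
from giskard.rag import evaluate

# Pre-computed answers, one per question in the testset (illustrative values)
answers = [
    "Paris is the capital of France.",
    "The warranty lasts two years.",
]
# Optionally, the documents retrieved for each question, used to compute the RAGAS metrics
retrieved_documents = [
    ["France's capital is Paris."],
    ["Our products are covered by a two-year warranty."],
]

report = evaluate(
    answers,
    testset=testset,                # a QATestset instance
    knowledge_base=knowledge_base,  # the knowledge base used to build the testset
    retrieved_documents=retrieved_documents,
)
```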
### Creating Custom Metrics

**You can easily create your own metrics by extending the base Metric class from Giskard.**

To illustrate how simple it is to build a custom metric, we'll implement the `CorrectnessScoreMetric`, a custom metric that uses an LLM as a judge to score the correctness of the agent's answers.

While `Giskard` provides a default boolean correctness metric that checks whether a RAG (Retrieval-Augmented Generation) answer is correct, here we want a numerical score to gain deeper insights into the model's performance.
#### Step 1: Building the Prompts

Before we can evaluate the model's output, we need to craft a system prompt and an input template for the language model. The system prompt will instruct the LLM on how to evaluate the Q&A system's response, and the input template will format the conversation history, the agent's response, and the reference answer.

##### System Prompt

The system prompt provides the LLM with clear instructions on how to rate the agent's response. You'll be guiding the LLM to give a score between 1 and 5 based on the correctness, accuracy, and factuality of the answer.
```python
SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""
```
##### Input Template

The input template formats the actual content you will send to the model: the conversation history, the agent's response, and the reference answer.
```python
INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""
```
#### Step 2: Subclassing the Metric Class

We implement the custom metric by subclassing the `Metric` class provided by `Giskard`.
```python
from giskard.rag import AgentAnswer  # dataclass describing the agent's answer (see below)
from giskard.rag.metrics.base import Metric


class CorrectnessScoreMetric(Metric):
    """
    Custom metric to evaluate the correctness of a system's response based on a reference answer.
    The metric returns a score between 1 and 5, with 1 being completely incorrect and 5 being completely correct.
    """

    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        pass
```
Now all we need to do is fill in this `__call__` method with our logic.

Metrics are meant to be used with the `evaluate` function, which is where their inputs come from.

Note the input types:
- `question_sample` is a dictionary that must have the following keys: `conversation_history`, `question` and `reference_answer`.
- `AgentAnswer` is a dataclass containing the output of the RAG system being evaluated. It is defined as follows:
```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AgentAnswer:
    message: str  # the agent's answer to the question
    documents: Optional[Sequence[str]] = None  # documents retrieved to produce the answer
```
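For illustration, this is what an `AgentAnswer` instance received by the metric might look like (the values are made up):

```python
# An example AgentAnswer as the metric will receive it: the agent's reply plus
# the documents it retrieved (illustrative values).
agent_answer = AgentAnswer(
    message="Paris is the capital of France.",
    documents=["France's capital is Paris.", "Paris has 2.1 million inhabitants."],
)
```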
The core of our new metric is an LLM call with the prompts specified earlier, made through Giskard's LLM client (which relies on LiteLLM).
```python
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation
from giskard.llm.client import ChatMessage, get_default_client


def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
    # Retrieve the LLM client
    llm_client = self._llm_client or get_default_client()

    # Call the LLM with the system prompt and the formatted user input
    out = llm_client.complete(
        messages=[
            ChatMessage(
                role="system",
                content=SYSTEM_PROMPT,
            ),
            ChatMessage(
                role="user",
                content=INPUT_TEMPLATE.format(
                    conversation=format_conversation(
                        question_sample.conversation_history
                        + [{"role": "user", "content": question_sample.question}]
                    ),
                    answer=answer.message,
                    reference_answer=question_sample.reference_answer,
                ),
            ),
        ],
        temperature=0,
        format="json_object",
    )

    # Parse the output string representation of a JSON object into a dictionary
    json_output = parse_json_output(
        out.content,
        llm_client=llm_client,
        keys=["correctness_score"],
        caller_id=self.__class__.__name__,
    )

    return json_output
```
Note that `keys=["correctness_score"]` must match the key you asked the LLM for in your `SYSTEM_PROMPT`.

As you can see, it boils down to calling the LLM and parsing its output into a dictionary with one key, `correctness_score`, containing the score given by the LLM.
<details>
<summary>Putting everything together, here is our final implementation:</summary>

```python
from giskard.llm.client import ChatMessage, get_default_client
from giskard.llm.errors import LLMGenerationError

from giskard.rag import AgentAnswer
from giskard.rag.metrics.base import Metric
from giskard.rag.question_generators.utils import parse_json_output
from giskard.rag.metrics.correctness import format_conversation

SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""

INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""


class CorrectnessScoreMetric(Metric):
    """
    Custom metric to evaluate the correctness of a system's response based on a reference answer.
    The metric returns a score between 1 and 5, with 1 being completely incorrect and 5 being completely correct.
    """

    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        """
        Compute the correctness *as a number from 1 to 5* between the agent answer and the reference answer.

        Parameters
        ----------
        question_sample : dict
            A question sample from a QATestset.
        answer : AgentAnswer
            The answer of the agent on the question.

        Returns
        -------
        dict
            The result of the correctness scoring. It contains the key 'correctness_score'.

        Raises
        ------
        LLMGenerationError
            If there is an issue during the LLM evaluation process.
        """
        # Implement your LLM call with litellm (Giskard's default client)
        llm_client = self._llm_client or get_default_client()
        try:
            out = llm_client.complete(
                messages=[
                    ChatMessage(
                        role="system",
                        content=SYSTEM_PROMPT,
                    ),
                    ChatMessage(
                        role="user",
                        content=INPUT_TEMPLATE.format(
                            conversation=format_conversation(
                                question_sample.conversation_history
                                + [{"role": "user", "content": question_sample.question}]
                            ),
                            answer=answer.message,
                            reference_answer=question_sample.reference_answer,
                        ),
                    ),
                ],
                temperature=0,
                format="json_object",
            )

            # We asked the LLM to output a JSON object, so we must parse the output into a dict
            json_output = parse_json_output(
                out.content,
                llm_client=llm_client,
                keys=["correctness_score"],
                caller_id=self.__class__.__name__,
            )

            return json_output

        except Exception as err:
            raise LLMGenerationError("Error while evaluating the agent") from err
```
</details>
#### Step 3: Use Your New Metric in the Evaluation Function

Integrating our new custom metric into the evaluation process is straightforward: simply instantiate the metric and pass it to the `evaluate` function.
```python
# Instantiate the custom correctness metric
correctness_score = CorrectnessScoreMetric(name="correctness_score")

# Run the evaluation with the custom metric
report = evaluate(
    answer_fn,
    testset=testset,  # a QATestset instance
    knowledge_base=knowledge_base,  # the knowledge base used for building the QATestset
    metrics=[correctness_score]
)
```

From the review discussion on the `name` argument:

> **Member:** Should we emphasise something about the `"correctness_score"` name?
>
> **Author:** @davidberenstein1957 I am not sure what to say about that. From what I tested, you can use whatever name you want; it does not seem to be used anywhere. I tried setting a random name as the metric name and there is no error: the `correctness_score` key still appears in the `RAGReport`, and when we render the report as HTML it is `correctness_score` that shows up in the "selected metrics" panel. Edit: the `.name` attribute seems mostly used by the `RagasMetric`, which uses `.name` as a key in the dict returned by the metric.
>
> **Member:** Okay, let's keep this for a more serious overhaul then.
Again, make sure the `name` argument passed to `CorrectnessScoreMetric` matches the key you return in the output dictionary.
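Once the evaluation has run, the custom metric's scores are included in the returned report alongside the built-in metrics. As a rough sketch (assuming the `RAGReport` HTML export available in recent Giskard versions; check your version's documentation for the exact helpers):

```python
# Export the evaluation report, including the custom correctness_score results,
# to an interactive HTML page. `to_html` is assumed from Giskard's RAGReport.
report.to_html("rag_eval_report.html")
```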
#### Wrap-up

In this tutorial, we covered the steps to create and use a custom evaluation metric.

Here's a quick summary of the key points:
1. *Define the prompts*: Create a system prompt and input template for the LLM to evaluate answers.
2. *Implement the new metric*: Subclass the `Metric` class and implement the `__call__` method to create the custom `CorrectnessScoreMetric`.
3. *Use the new metric*: Instantiate and integrate the custom metric into the evaluation function.
## Troubleshooting

If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.