chore(docs): update how to create custom metrics #2150
Alternatively, you can directly pass a list of answers instead of `get_answer_fn` to the `evaluate` function; you can then pass the retrieved documents as the optional argument `retrieved_documents` to compute the RAGAS metrics.
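For example, a minimal sketch (assuming `testset` and `knowledge_base` are already built as in the previous sections, and that both lists are aligned with the testset questions; the answer and document strings are placeholders):

```python
from giskard.rag import evaluate

# Precomputed answers, one per question in the testset (placeholder values)
answers = [
    "The capital of France is Paris.",
    "The Eiffel Tower is 330 metres tall.",
]

# Optional: the documents retrieved for each question, needed by the RAGAS metrics
retrieved_documents = [
    ["France is a country in Western Europe. Its capital is Paris."],
    ["The Eiffel Tower, completed in 1889, stands 330 metres tall."],
]

report = evaluate(
    answers,
    testset=testset,
    knowledge_base=knowledge_base,
    retrieved_documents=retrieved_documents,
)
```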
### Custom Metrics
**You can also implement your own metrics** using the base `Metric` class from Giskard.
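At its core, a custom metric is a subclass of `Metric` that implements `__call__` and returns a dict mapping score names to values. A minimal skeleton (the class and key names here are placeholders; the steps below flesh this pattern out):

```python
from giskard.rag.metrics.base import Metric


class MyCustomMetric(Metric):
    def __call__(self, question_sample, answer) -> dict:
        # Compare the agent's answer with the reference and return your score(s)
        return {"my_custom_score": 1.0}
```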
Here is an example of how you can implement a custom LLM-as-a-judge metric, as described in the [RAG cookbook](https://huggingface.co/learn/cookbook/en/rag_evaluation#evaluating-rag-performance) by Hugging Face.
#### Step 1 - Subclass the `Metric` class
Implement the new metric in the `__call__` method:
```python
from giskard.llm.client import ChatMessage, get_default_client
from giskard.llm.errors import LLMGenerationError
from giskard.rag import AgentAnswer
from giskard.rag.metrics.base import Metric
from giskard.rag.metrics.correctness import format_conversation
from giskard.rag.question_generators.utils import parse_json_output


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        """Your docstring here."""
        # Run the LLM call through Giskard's client (LiteLLM-based by default)
        llm_client = self._llm_client or get_default_client()
        try:
            out = llm_client.complete(
                messages=[
                    ChatMessage(
                        role="system",
                        content=SYSTEM_PROMPT,  # defined in Step 2 below
                    ),
                    ChatMessage(
                        role="user",
                        content=INPUT_TEMPLATE.format(
                            conversation=format_conversation(
                                question_sample.conversation_history
                                + [{"role": "user", "content": question_sample.question}]
                            ),
                            answer=answer.message,
                            reference_answer=question_sample.reference_answer,
                        ),
                    ),
                ],
                temperature=0,
                format="json_object",
            )

            # The system prompt asks the LLM for a JSON object; parse the output
            # and extract the "correctness_score" key
            json_output = parse_json_output(
                out.content,
                llm_client=llm_client,
                keys=["correctness_score"],
                caller_id=self.__class__.__name__,
            )

            return json_output

        except Exception as err:
            raise LLMGenerationError("Error while evaluating the agent") from err
```
#### Step 2 - Add your prompts
```python
SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""

INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""
```
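To check what the judge model will actually receive, you can render the template with toy values (purely illustrative):

```python
print(
    INPUT_TEMPLATE.format(
        conversation="user: What is the capital of France?",
        answer="Paris.",
        reference_answer="The capital of France is Paris.",
    )
)
```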
#### Step 3 - Use your new metric in the evaluation function
Now, using your custom metric is as easy as instantiating it and passing it to the `evaluate` function.
```python
correctness_score = CorrectnessScoreMetric(name="correctness_score")

report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[correctness_score],
)
```
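The returned report then includes your custom metric results alongside the built-in correctness evaluation; for instance, you can export it to HTML as usual (a usage sketch, assuming the call above succeeded):

```python
# Renders the evaluation report, including the custom metric, as an HTML file
report.to_html("rag_eval_report.html")
```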
#### Full code
Putting everything together, the final implementation would look like this:
```python
from giskard.llm.client import ChatMessage, get_default_client
from giskard.llm.errors import LLMGenerationError
from giskard.rag import AgentAnswer, evaluate
from giskard.rag.metrics.base import Metric
from giskard.rag.metrics.correctness import format_conversation
from giskard.rag.question_generators.utils import parse_json_output


SYSTEM_PROMPT = """Your task is to evaluate a Q/A system.
The user will give you a question, an expected answer and the system's response.
You will evaluate the system's response and provide a score.
We are asking ourselves if the response is correct, accurate and factual, based on the reference answer.

Guidelines:
1. Write a score that is an integer between 1 and 5. You should refer to the scores description.
2. Follow the JSON format provided below for your output.

Scores description:
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.

Output Format (JSON only):
{{
    "correctness_score": (your rating, as a number between 1 and 5)
}}

Do not include any additional text—only the JSON object. Any extra content will result in a grade of 0.
"""

INPUT_TEMPLATE = """
### CONVERSATION
{conversation}

### AGENT ANSWER
{answer}

### REFERENCE ANSWER
{reference_answer}
"""


class CorrectnessScoreMetric(Metric):
    def __call__(self, question_sample: dict, answer: AgentAnswer) -> dict:
        """Compute the correctness *as a number from 1 to 5* between the agent answer and the reference answer from the QATestset.

        Parameters
        ----------
        question_sample : dict
            A question sample from a QATestset.
        answer : AgentAnswer
            The answer of the agent on the question.

        Returns
        -------
        dict
            The result of the correctness scoring. It contains the key 'correctness_score'.
        """
        # Run the LLM call through Giskard's client (LiteLLM-based by default)
        llm_client = self._llm_client or get_default_client()
        try:
            out = llm_client.complete(
                messages=[
                    ChatMessage(
                        role="system",
                        content=SYSTEM_PROMPT,
                    ),
                    ChatMessage(
                        role="user",
                        content=INPUT_TEMPLATE.format(
                            conversation=format_conversation(
                                question_sample.conversation_history
                                + [{"role": "user", "content": question_sample.question}]
                            ),
                            answer=answer.message,
                            reference_answer=question_sample.reference_answer,
                        ),
                    ),
                ],
                temperature=0,
                format="json_object",
            )

            # The system prompt asks the LLM for a JSON object; parse the output
            # and extract the "correctness_score" key
            json_output = parse_json_output(
                out.content,
                llm_client=llm_client,
                keys=["correctness_score"],
                caller_id=self.__class__.__name__,
            )

            return json_output

        except Exception as err:
            raise LLMGenerationError("Error while evaluating the agent") from err

correctness_score = CorrectnessScoreMetric(name="correctness_score")

report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[correctness_score],
)
```

From the review discussion on this example:

**Member:** Should we emphasise something about the `"correctness_score"` name?

**Contributor (author):** @davidberenstein1957 I am not sure what to say about that. From what I tested, you can use whatever name you want; it does not seem to be used anywhere. I tried setting a random name as the metric name and there was no error: `correctness_score` was still in the `RAGReport`, and when the report is printed as HTML it is `correctness_score` that shows up in the "selected metrics" panel. Edit: the `.name` attribute seems mostly used by the `RagasMetric` metric, which uses `.name` as a key in the dict returned by the metric.

**Member:** Okay, let's keep this for a more serious overhaul then.
## Troubleshooting
If you encounter any issues, join our [Discord community](https://discord.gg/fkv7CAr3FE) and ask questions in our #support channel.