Conversation

@MukulLambat commented Nov 15, 2025

Description

This PR fixes compatibility issues between Giskard's RAG evaluation and ragas>=0.3.x.

In ragas 0.3.x, the RAGAS API introduced two breaking changes (illustrated in the sketch after this list):

  1. BaseRagasLLM now defines an abstract method is_finished(), which caused:
    TypeError: Can't instantiate abstract class RagasLLMWrapper with abstract method is_finished.

  2. RAGAS metrics (e.g. AnswerRelevancy) no longer expose .score(dict) and instead use
    single_turn_score(SingleTurnSample), causing:
    AttributeError: 'AnswerRelevancy' object has no attribute 'score'.
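
For illustration, here is a minimal sketch of the new scoring flow in ragas 0.3.x (not taken from this PR; metric stands for an already-configured metric instance such as AnswerRelevancy with an evaluator LLM/embeddings set, so the actual scoring calls are shown only as comments):

# Illustrative sketch of the ragas 0.3.x API change, not part of this PR.
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
    user_input="What is the capital of France?",
    response="The capital of France is Paris.",
    retrieved_contexts=["Paris is the capital of France."],
)

# ragas >= 0.3 (metric must have an evaluator LLM/embeddings configured):
#     score = metric.single_turn_score(sample)
# ragas < 0.3 accepted a plain dict instead:
#     score = metric.score({"question": ..., "answer": ..., "contexts": [...]})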

This PR:

  • Implements is_finished() in RagasLLMWrapper to conform to the new BaseRagasLLM interface.
  • Updates RagasMetric.__call__ to:
    • Convert the existing ragas_sample dict into a SingleTurnSample.
    • Call metric.single_turn_score(sample) when available.
    • Fall back to metric.score(ragas_sample) for older RAGAS versions (see the sketch after this list).
  • Restores successful execution of RAG-based metrics (Answer Relevancy, Faithfulness, Context Precision, Context Recall, etc.) when using ragas 0.3.x.
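
A minimal sketch of that version-tolerant scoring logic (illustrative only: score_ragas_sample is a hypothetical standalone helper, while the PR applies the same logic inside RagasMetric.__call__; the dict keys "question", "answer", "contexts", "ground_truth" follow the pre-0.3 ragas schema and are an assumption about what prepare_ragas_sample produces):

# Hypothetical helper, not the PR diff: score a ragas sample dict with either
# the new single_turn_score API (ragas >= 0.3) or the legacy .score(dict) API.
from ragas.dataset_schema import SingleTurnSample


def score_ragas_sample(metric, ragas_sample: dict) -> float:
    if hasattr(metric, "single_turn_score"):
        # ragas >= 0.3: metrics score a SingleTurnSample object
        sample = SingleTurnSample(
            user_input=ragas_sample.get("question"),
            response=ragas_sample.get("answer"),
            retrieved_contexts=ragas_sample.get("contexts"),
            reference=ragas_sample.get("ground_truth"),
        )
        return metric.single_turn_score(sample)
    # older ragas versions still expose .score(dict)
    return metric.score(ragas_sample)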

Related Issue

This PR fixes #2217

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.
  • I've updated the pdm.lock by running pdm update-lock (only applicable when pyproject.toml has been modified).
@kevinmessiaen self-requested a review November 18, 2025 10:51
@SurajBhar

Hey @MukulLambat – the issue is super relevant for anyone using ragas>=0.3.x with Giskard.

I had a look at your changes + the failing CI, and it seems we now also need to align with the async metric API that RAGAS uses (single_turn_ascore / multi_turn_ascore) instead of the older sync .score(...) / .single_turn_score(...).

Below is a concrete set of changes that should make the PR pass CI and be fully compatible with the current RAGAS API.

Update RagasMetric.__call__ in ragas_metrics.py

Use SingleTurnSample + async scoring, and execute it via asyncio:

# add at the top of ragas_metrics.py
import asyncio
from ragas.dataset_schema import SingleTurnSample

Then, inside RagasMetric.__call__ (where you currently build ragas_sample and compute the score), replace the scoring part with:

ragas_sample = self.prepare_ragas_sample(question_sample, answer)

sample = SingleTurnSample(
    user_input=ragas_sample.get("user_input") or ragas_sample.get("question"),
    response=ragas_sample.get("response") or ragas_sample.get("answer"),
    retrieved_contexts=ragas_sample.get("retrieved_contexts") or ragas_sample.get("contexts"),
    reference=ragas_sample.get("reference") or ragas_sample.get("ground_truth"),
)

async def _compute_score():
    if hasattr(self.metric, "single_turn_ascore"):
        return await self.metric.single_turn_ascore(sample)
    elif hasattr(self.metric, "multi_turn_ascore"):
        return await self.metric.multi_turn_ascore(sample)
    else:
        raise AttributeError(
            f"{self.metric} has neither single_turn_ascore nor multi_turn_ascore "
            "— check ragas version or metric type."
        )

loop = asyncio.get_event_loop()
val = loop.run_until_complete(_compute_score())

return {self.name: val}

This matches the new RAGAS guidance where metrics expose async evaluators like single_turn_ascore(...) / multi_turn_ascore(...) for SingleTurnSample.
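
One caveat with the asyncio.get_event_loop() + run_until_complete pattern above: on newer Python versions, calling asyncio.get_event_loop() with no running loop is deprecated, and run_until_complete raises if a loop is already running (e.g. in a notebook). A small, hypothetical helper, not part of the suggested diff, that covers both cases could look like this:

# Hypothetical helper, not part of the suggested diff: run a coroutine from
# sync code whether or not an event loop is already running in this thread.
import asyncio
import concurrent.futures


def run_coroutine_sync(coro):
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # no running loop in this thread: asyncio.run is the simplest option
        return asyncio.run(coro)
    # a loop is already running (e.g. Jupyter): execute the coroutine on its
    # own loop in a worker thread and block until it finishes
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

RagasMetric.__call__ could then use val = run_coroutine_sync(_compute_score()) instead of managing the loop directly.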

Fix tests with AsyncMock in tests/rag/test_ragas_metrics.py

Since we’re now calling async methods on the metric, the tests need to mock those with AsyncMock instead of MagicMock.
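
A quick standalone illustration (not meant for the test file itself): awaiting the result of a plain MagicMock raises TypeError, while AsyncMock attributes are awaitable and return the configured value.

# Standalone demo, not part of the test file: AsyncMock attributes can be
# awaited, MagicMock call results cannot.
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def demo():
    good = AsyncMock()
    good.single_turn_ascore.return_value = 0.5
    assert await good.single_turn_ascore("sample") == 0.5

    bad = MagicMock()
    try:
        await bad.single_turn_ascore("sample")
    except TypeError:
        print("MagicMock results cannot be awaited")


asyncio.run(demo())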

At the top of tests/rag/test_ragas_metrics.py:

from unittest.mock import MagicMock, Mock, patch, AsyncMock

Then update the two tests to mock single_turn_ascore:

def test_ragas_metric_computation_with_context(ragas_llm_wrapper, ragas_embeddings_wrapper):
    from giskard.rag.metrics.ragas_metrics import RagasMetric

    ragas_metric_mock = AsyncMock()
    ragas_metric_mock.single_turn_ascore.return_value = 0.5

    metric = RagasMetric(
        "test",
        ragas_metric_mock,
        requires_context=True,
        llm_client=Mock(),
        embedding_model=Mock(),
    )

    question_sample = {
        "question": "What is the capital of France?",
        "reference_answer": "Paris",
    }
    answer = AgentAnswer("The capital of France is Paris.", documents=["Paris"])

    result = metric(question_sample, answer)

    assert result == {"test": 0.5}


def test_ragas_metric_computation(ragas_llm_wrapper, ragas_embeddings_wrapper):
    from giskard.rag.metrics.ragas_metrics import RagasMetric

    ragas_metric_mock = AsyncMock()
    ragas_metric_mock.single_turn_ascore.return_value = 0.5

    metric = RagasMetric(
        "test",
        ragas_metric_mock,
        llm_client=Mock(),
        embedding_model=Mock(),
    )

    question_sample = {
        "question": "What is the capital of France?",
        "reference_answer": "Paris",
    }
    answer = AgentAnswer("The capital of France is Paris.")

    result = metric(question_sample, answer)

    assert result == {"test": 0.5}

With these changes:

  • RagasMetric uses the async scoring API (single_turn_ascore / multi_turn_ascore) that RAGAS now documents.
  • The tests correctly mock the async methods and no longer return raw MagicMock(...) objects (which caused the CI failures).
  • CI should be green again once you push the updated commits.

@MukulLambat (Author)

Raised a new PR with a few additional changes. You can check #2220.
