Evaluation Metrics for Retrieval-Augmented Generation (RAG) Systems
Retrieval-Augmented Generation (RAG) is an LLM framework that combines information retrieval with text generation to produce more accurate, factual and context-rich responses. Evaluation metrics help check whether the system retrieves relevant information, gives accurate answers and meets performance goals, while also guiding improvements and model comparisons.

Steps to Evaluate a RAG System
Evaluating a RAG system means checking how well it retrieves and generates accurate, relevant and grounded responses.
1. Set Goals: Define what matters most—accuracy, relevance, fluency or groundedness.
2. Pick Metrics:
- Retrieval level: Precision, Recall, F1, MRR, nDCG.
- Generation level: BLEU, ROUGE, METEOR, BERTScore, Perplexity.
- End-to-end: Groundedness, Hallucination Rate, Factual Consistency, Answer Relevance.
3. Automate: Use tools like NLTK, ROUGE-score, BERTScore or Textstat for quick evaluation; a minimal harness combining these steps is sketched after this list.
4. Add Human Review: Rate responses for clarity, accuracy and informativeness.
5. Analyze Results: Visualize performance, compare models and find weak spots.
6. Iterate: Refine retrieval and generation steps to improve factuality and coherence.
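One minimal way to wire steps 2 through 5 together is sketched below. It assumes a small list of evaluation records with hypothetical keys (query, retrieved, relevant, answer, reference) and uses a simple token-overlap score as a rough stand-in for the generation metrics covered in detail later in this article.
import nltk
nltk.download('punkt')

def precision_at_k(relevant, retrieved, k=5):
    # Fraction of the top-k retrieved documents that are relevant
    top_k = retrieved[:k]
    return len([d for d in top_k if d in relevant]) / k

def token_overlap(answer, reference):
    # Rough answer-quality proxy: fraction of reference tokens covered by the answer
    a = set(nltk.word_tokenize(answer.lower()))
    r = set(nltk.word_tokenize(reference.lower()))
    return len(a & r) / len(r) if r else 0

# Hypothetical evaluation records for illustration only
records = [{
    "query": "What is RAG?",
    "retrieved": ["doc2", "doc4"],
    "relevant": ["doc1", "doc2"],
    "answer": "RAG systems combine retrieval and generation.",
    "reference": "Retrieval-Augmented Generation combines retrieval and generation."
}]

for rec in records:
    print("Precision@2:", precision_at_k(rec["relevant"], rec["retrieved"], k=2))
    print("Answer overlap:", token_overlap(rec["answer"], rec["reference"]))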
Types of Evaluation Metrics
Some of the types of evaluation metrics are:

1. Retrieval Level Metrics
Some of the retrieval-level metrics are Precision, Recall and F1-Score.
1. Precision: Proportion of retrieved documents that are actually relevant.
2. Recall: Proportion of relevant documents that were successfully retrieved.
3. F1-Score: Harmonic mean of precision and recall, balancing both.
from sklearn.metrics import precision_score, recall_score, f1_score

# Binary relevance labels for each retrieved document: 1 = relevant, 0 = not relevant
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print("Precision:", precision, "Recall:", recall, "F1-Score:", f1)
Output:
Precision: 1.0, Recall: 0.6666666666666666, F1-Score: 0.8
4. Hit Rate: Fraction of queries for which at least one relevant document appears in the retrieved results, higher is better.
def hit_rate(y_true, y_pred):
    # A query counts as a "hit" if at least one retrieved document is relevant
    hits = 0
    for true_docs, pred_docs in zip(y_true, y_pred):
        if any(doc in true_docs for doc in pred_docs):
            hits += 1
    return hits / len(y_true)
y_true = [['doc1', 'doc2'], ['doc3']]
y_pred = [['doc2', 'doc4'], ['doc5']]
print("Hit Rate:", hit_rate(y_true, y_pred))
Output:
Hit Rate: 0.5
5. Mean Reciprocal Rank (MRR): Measures how quickly the correct answer appears in the ranked results, higher is better.
MRR = (1/N) * Σ_{i=1}^{N} (1 / rank_i)
- N: total number of queries
- rank_i: rank position of the first relevant document for the ith query
def mean_reciprocal_rank(y_true, y_pred):
    reciprocal_ranks = []
    for true_docs, pred_docs in zip(y_true, y_pred):
        rr = 0
        # Reciprocal rank of the first relevant document (0 if none is retrieved)
        for rank, doc in enumerate(pred_docs, start=1):
            if doc in true_docs:
                rr = 1 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
y_true = [['doc1', 'doc2'], ['doc3']]
y_pred = [['doc2', 'doc4'], ['doc5']]
print("MRR:", mean_reciprocal_rank(y_true, y_pred))
Output:
MRR: 0.5
6. Mean Average Precision (MAP): Evaluates ranking quality across multiple queries.
MAP = (1/N) * Σ_{i=1}^{N} AP_i, where AP_i = (1/R_i) * Σ_k P_i(k) * rel_i(k)
- N: total number of queries
- AP_i: average precision for the ith query
- R_i: number of relevant documents for query i
- P_i(k): precision at cutoff k
- rel_i(k): 1 if the document at rank k is relevant, else 0
def mean_average_precision(y_true, y_pred):
    avg_precisions = []
    for true_docs, pred_docs in zip(y_true, y_pred):
        hits = 0
        precision_sum = 0
        # Accumulate precision at each rank where a relevant document appears
        for rank, doc in enumerate(pred_docs, start=1):
            if doc in true_docs:
                hits += 1
                precision_sum += hits / rank
        avg_precisions.append(precision_sum / len(true_docs) if true_docs else 0)
    return sum(avg_precisions) / len(avg_precisions)
y_true = [['doc1', 'doc2'], ['doc3']]
y_pred = [['doc2', 'doc4'], ['doc5']]
print("MAP:", mean_average_precision(y_true, y_pred))
Output:
MAP: 0.25
7. Normalized Discounted Cumulative Gain (nDCG): Rewards highly relevant documents appearing earlier in results.
DCG@p = Σ_{i=1}^{p} rel_i / log2(i + 1)
IDCG@p = Σ_{i=1}^{p} rel_i^ideal / log2(i + 1)
nDCG@p = DCG@p / IDCG@p
- p: rank position cutoff
- rel_i: relevance score of the document at rank i
- rel_i^ideal: relevance of the document at rank i in the ideal ordering
import numpy as np

def ndcg(y_true, y_pred, k=5):
    ndcg_scores = []
    for true_docs, pred_docs in zip(y_true, y_pred):
        pred_docs_k = pred_docs[:k]
        # DCG with binary relevance: each relevant document contributes 1 / log2(rank + 1)
        dcg = sum([1 / np.log2(idx + 2) if doc in true_docs else 0 for idx, doc in enumerate(pred_docs_k)])
        # IDCG: the score of an ideal ranking with all relevant documents first
        ideal_docs_k = true_docs[:k]
        idcg = sum([1 / np.log2(idx + 2) for idx, _ in enumerate(ideal_docs_k)])
        ndcg_scores.append(dcg / idcg if idcg > 0 else 0)
    return np.mean(ndcg_scores)
y_true = [['doc1', 'doc2'], ['doc3']]
y_pred = [['doc2', 'doc4'], ['doc5']]
print("nDCG@5:", ndcg(y_true, y_pred))
Output:
nDCG@5: 0.3065735963827292
8. Recall@k and Precision@k: Check relevance within the top k retrieved items.
import numpy as np

def recall_precision_at_k(y_true, y_pred, k=5):
    recall_list = []
    precision_list = []
    for true_docs, pred_docs in zip(y_true, y_pred):
        # Only the top-k retrieved documents are considered
        top_k = pred_docs[:k]
        hits = len([doc for doc in top_k if doc in true_docs])
        recall_list.append(hits / len(true_docs) if true_docs else 0)
        precision_list.append(hits / k)
    return {"Recall@{}".format(k): np.mean(recall_list),
            "Precision@{}".format(k): np.mean(precision_list)}
y_true = [['doc1', 'doc2'], ['doc3']]
y_pred = [['doc2', 'doc4'], ['doc5']]
metrics = recall_precision_at_k(y_true, y_pred, k=2)
print(metrics)
Output:
{'Recall@2': np.float64(0.25), 'Precision@2': np.float64(0.25)}
9. Similarity Measures (Cosine, BM25): Quantify how closely retrieved documents match the query.
Here we have illustrated cosine similarity; a BM25 sketch follows the output below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_score(queries, documents):
    # Fit TF-IDF on queries and documents together so they share one vocabulary
    vectorizer = TfidfVectorizer()
    all_texts = queries + documents
    tfidf = vectorizer.fit_transform(all_texts)
    query_vecs = tfidf[:len(queries)]
    doc_vecs = tfidf[len(queries):]
    sim_matrix = cosine_similarity(query_vecs, doc_vecs)
    return sim_matrix.mean()
queries = ["RAG systems combine retrieval and generation"]
documents = ["Retrieval-Augmented Generation combines retrieval and generation.", "Cats sit on mats."]
print("Cosine Similarity:", cosine_similarity_score(queries, documents))
Output:
Cosine Similarity: 0.24755053441657565
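BM25 can be illustrated with the rank_bm25 package (installed with pip install rank-bm25). The sketch below is one minimal way to score documents against a query; the whitespace tokenization is an assumption made purely for illustration, not a requirement of the metric.
from rank_bm25 import BM25Okapi

documents = ["Retrieval-Augmented Generation combines retrieval and generation.",
             "Cats sit on mats."]
query = "RAG systems combine retrieval and generation"

# Simple whitespace tokenization for illustration
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Higher scores indicate documents that better match the query terms
scores = bm25.get_scores(query.lower().split())
print("BM25 scores:", scores)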
2. Generation Level Metrics
Some of the generation level metrics are:
1. BLEU, ROUGE, METEOR, BERTScore: Compare generated text with reference answers for similarity.
Here we have illustrated BLEU; a ROUGE sketch follows the output below.
BLEU = BP * exp(Σ_{n=1}^{N} w_n * log p_n)
- p_n: modified n-gram precision
- w_n: weight for the n-gram order n
- BP: brevity penalty
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import numpy as np

nltk.download('punkt')

def generation_metrics(predictions, references):
    # Smoothing avoids zero scores when higher-order n-grams have no matches
    chencherry = SmoothingFunction()
    bleu_scores = [sentence_bleu([nltk.word_tokenize(ref)], nltk.word_tokenize(pred), smoothing_function=chencherry.method1)
                   for pred, ref in zip(predictions, references)]
    metrics = {
        "BLEU": np.mean(bleu_scores),
    }
    return metrics
predictions = ["The cat sits on the mat.", "RAG systems combine retrieval and generation."]
references = ["A cat is sitting on the mat.", "Retrieval-Augmented Generation combines retrieval and generation."]
metrics = generation_metrics(predictions, references)
print(metrics)
Output:
{'BLEU': np.float64(0.3939917666748808)}
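ROUGE can be computed in a similar way. The sketch below uses the rouge-score package (installed with pip install rouge-score), which is one common choice rather than the only option; the metric selection (ROUGE-1 and ROUGE-L) is an assumption for illustration.
from rouge_score import rouge_scorer
import numpy as np

predictions = ["The cat sits on the mat.", "RAG systems combine retrieval and generation."]
references = ["A cat is sitting on the mat.", "Retrieval-Augmented Generation combines retrieval and generation."]

# ROUGE-1 measures unigram overlap, ROUGE-L measures the longest common subsequence
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_l_scores = [scorer.score(ref, pred)['rougeL'].fmeasure for pred, ref in zip(predictions, references)]
print("Average ROUGE-L F1:", np.mean(rouge_l_scores))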
2. Perplexity: Measures how well the model predicts the next word, lower perplexity is better.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np

def compute_perplexity(predictions, model_name='gpt2'):
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    perplexities = []
    for text in predictions:
        encodings = tokenizer(text, return_tensors='pt')
        with torch.no_grad():
            outputs = model(**encodings, labels=encodings["input_ids"])
        # Perplexity is the exponential of the average cross-entropy loss
        loss = outputs.loss
        perplexities.append(torch.exp(loss).item())
    return np.mean(perplexities)
predictions = ["The cat sits on the mat.", "RAG systems combine retrieval and generation."]
perplexity = compute_perplexity(predictions)
print("Perplexity:", perplexity)
Output:
Perplexity: 901.9484596252441
3. Factual Consistency: Checks if generated content aligns with retrieved information.
import nltk
import numpy as np

nltk.download('punkt')

def factual_consistency(predictions, references):
    scores = []
    for pred, ref in zip(predictions, references):
        pred_words = set(nltk.word_tokenize(pred.lower()))
        ref_words = set(nltk.word_tokenize(ref.lower()))
        # Fraction of generated tokens that also appear in the reference
        overlap = len(pred_words & ref_words) / len(pred_words) if len(pred_words) > 0 else 0
        scores.append(overlap)
    return np.mean(scores)
predictions = ["RAG systems combine retrieval and generation."]
references = ["Retrieval-Augmented Generation combines retrieval and generation."]
score = factual_consistency(predictions, references)
print("Factual Consistency:", score)
Output:
Factual Consistency: 0.5714285714285714
4. Fluency and Readability: Assesses how natural and easy to understand the text is.
!pip install textstat
import nltk
from textstat import flesch_reading_ease
import numpy as np

nltk.download('punkt')

def fluency_readability(predictions):
    readability_scores = [flesch_reading_ease(pred) for pred in predictions]
    sentence_counts = [len(nltk.sent_tokenize(pred)) for pred in predictions]
    # Proxy for fluency: average number of words per sentence
    fluency_scores = [len(nltk.word_tokenize(pred)) / max(s, 1) for pred, s in zip(predictions, sentence_counts)]
    return {
        "Average Readability (Flesch)": np.mean(readability_scores),
        "Average Fluency (words/sentence)": np.mean(fluency_scores)
    }
predictions = [
"RAG systems combine retrieval and generation effectively.",
"The cat sits on the mat."
]
metrics = fluency_readability(predictions)
print(metrics)
Output:
{'Average Readability (Flesch)': np.float64(55.2089285714286), 'Average Fluency (words/sentence)': np.float64(7.5)}
5. Diversity and Novelty: Evaluates variety and originality in generated responses.
import nltk
import numpy as np

nltk.download('punkt')

def diversity_novelty(predictions):
    all_unigrams = []
    all_bigrams = []
    for pred in predictions:
        tokens = nltk.word_tokenize(pred.lower())
        all_unigrams.extend(tokens)
        all_bigrams.extend(list(nltk.bigrams(tokens)))
    # Distinct-n: ratio of unique n-grams to total n-grams across all responses
    distinct_unigrams = len(set(all_unigrams)) / len(all_unigrams) if all_unigrams else 0
    distinct_bigrams = len(set(all_bigrams)) / len(all_bigrams) if all_bigrams else 0
    # Novelty: fraction of words in each response not seen in earlier responses
    seen_words = set()
    novel_counts = []
    for pred in predictions:
        tokens = set(nltk.word_tokenize(pred.lower()))
        novel = len(tokens - seen_words)
        novel_counts.append(novel / len(tokens) if tokens else 0)
        seen_words.update(tokens)
    return {
        "Distinct-Unigram": distinct_unigrams,
        "Distinct-Bigram": distinct_bigrams,
        "Novelty": np.mean(novel_counts)
    }
predictions = [
"RAG systems combine retrieval and generation effectively.",
"The cat sits on the mat."
]
metrics = diversity_novelty(predictions)
print(metrics)
Output:
{'Distinct-Unigram': 0.8666666666666667, 'Distinct-Bigram': 1.0, 'Novelty': np.float64(0.9166666666666667)}
3. End-to-End RAG System Evaluation
End-to-end evaluation looks at the overall performance of a RAG system, considering retrieval and generation together.
1. Answer Relevance and Context Utilization: Checks if the system’s answers address the user’s query and effectively use the retrieved information.
import nltk
import numpy as np

nltk.download('punkt')

def answer_relevance_context_utilization(responses, references, retrieved_docs, top_k=5):
    relevance_scores = []
    context_scores = []
    for resp, ref, docs in zip(responses, references, retrieved_docs):
        resp_words = set(nltk.word_tokenize(resp.lower()))
        ref_words = set(nltk.word_tokenize(ref.lower()))
        # Relevance: fraction of reference tokens covered by the response
        relevance_scores.append(len(resp_words & ref_words) / len(ref_words) if ref_words else 0)
        # Context utilization: fraction of response tokens drawn from the top-k retrieved documents
        doc_words = set(word for d in docs[:top_k] for word in nltk.word_tokenize(d.lower()))
        context_scores.append(len(resp_words & doc_words) / len(resp_words) if resp_words else 0)
    return {
        "Answer Relevance": np.mean(relevance_scores),
        "Context Utilization": np.mean(context_scores)
    }
responses = [
"RAG systems combine retrieval and generation effectively.",
"The cat sits on the mat."
]
references = [
"Retrieval-Augmented Generation combines retrieval and generation.",
"A cat is sitting on the mat."
]
retrieved_docs = [
["RAG pipelines retrieve relevant info.", "Then the generation model produces answers."],
["Cats often sit on mats.", "Cats are animals."]
]
metrics = answer_relevance_context_utilization(responses, references, retrieved_docs, top_k=2)
print(metrics)
Output:
{'Answer Relevance': np.float64(0.6458333333333333), 'Context Utilization': np.float64(0.35416666666666663)}
2. Groundedness: Measures whether the generated text is supported by the retrieved sources, reducing the risk of hallucinations.
import nltk
import numpy as np

nltk.download('punkt')

def groundedness(responses, retrieved_docs, top_k=3):
    scores = []
    for resp, docs in zip(responses, retrieved_docs):
        resp_words = set(nltk.word_tokenize(resp.lower()))
        doc_words = set(word for d in docs[:top_k] for word in nltk.word_tokenize(d.lower()))
        # Fraction of response tokens that appear somewhere in the retrieved documents
        overlap = len(resp_words & doc_words) / len(resp_words) if len(resp_words) > 0 else 0
        scores.append(overlap)
    return np.mean(scores)
responses = [
"RAG systems combine retrieval and generation effectively.",
"Cats often sit on mats."
]
retrieved_docs = [
["RAG uses external documents for knowledge retrieval.",
"The generation model integrates this retrieved info to produce final answers."],
["Cats love warm places such as mats.",
"They often sit in cozy areas."]
]
score = groundedness(responses, retrieved_docs)
print("Groundedness:", score)
Output:
Groundedness: 0.6666666666666667
3. Hallucination Rate: Tracks how often the system produces information that is incorrect or not backed by sources.
import nltk
import numpy as np

nltk.download('punkt')

def hallucination_rate(responses, retrieved_docs, top_k=3):
    rates = []
    for resp, docs in zip(responses, retrieved_docs):
        resp_words = set(nltk.word_tokenize(resp.lower()))
        doc_words = set(word for d in docs[:top_k] for word in nltk.word_tokenize(d.lower()))
        # Fraction of response tokens not supported by any retrieved document
        unsupported = len(resp_words - doc_words)
        hallucination = unsupported / len(resp_words) if len(resp_words) > 0 else 0
        rates.append(hallucination)
    return np.mean(rates)
responses = [
"RAG systems combine retrieval and generation effectively.",
"Cats fly over mountains."
]
retrieved_docs = [
["RAG retrieves documents and generates context-aware responses.",
"It combines retrieval and generation for better accuracy."],
["Cats are domestic animals that sit on mats.",
"They are known for agility, not flight."]
]
rate = hallucination_rate(responses, retrieved_docs)
print("Hallucination Rate:", rate)
Output:
Hallucination Rate: 0.4875
4. Response Coherence and Readability: Ensures the generated answers are clear, logically structured and easy to understand.
import nltk
from textstat import flesch_reading_ease
import numpy as np

nltk.download('punkt')

def response_coherence_readability(responses):
    coherence_scores = []
    readability_scores = []
    for resp in responses:
        sentences = nltk.sent_tokenize(resp)
        words = nltk.word_tokenize(resp)
        # Proxy for coherence: average sentence length in words
        coherence = len(words) / len(sentences) if len(sentences) > 0 else 0
        coherence_scores.append(coherence)
        readability = flesch_reading_ease(resp)
        readability_scores.append(readability)
    return {
        "Average Coherence (words/sentence)": np.mean(coherence_scores),
        "Average Readability (Flesch)": np.mean(readability_scores)
    }
responses = [
"RAG systems combine retrieval and generation effectively. This improves factual accuracy.",
"Cats sit on mats. They are domestic animals known for agility."
]
metrics = response_coherence_readability(responses)
print(metrics)
Output:
{'Average Coherence (words/sentence)': np.float64(6.5), 'Average Readability (Flesch)': np.float64(28.20704545454545)}
5. Relevancy Score: Measures how well the system’s output matches the user’s query intent.
import nltk
import numpy as np

nltk.download('punkt')

def relevancy_score(responses, queries):
    """Returns the average relevancy score (0 to 1) based on query-response token overlap."""
    scores = []
    for resp, query in zip(responses, queries):
        resp_words = set(nltk.word_tokenize(resp.lower()))
        query_words = set(nltk.word_tokenize(query.lower()))
        overlap = len(resp_words & query_words)
        relevancy = overlap / len(query_words) if len(query_words) > 0 else 0
        scores.append(relevancy)
    return np.mean(scores)
responses = [
"RAG systems combine retrieval and generation to improve accuracy.",
"Cats are domestic animals that often sit on mats."
]
queries = [
"What is Retrieval-Augmented Generation?",
"Tell me something about cats sitting on mats."
]
score = relevancy_score(responses, queries)
print("Relevancy Score:", score)
Output:
Relevancy Score: 0.3222222222222222
Human Evaluation in RAG Systems
Human evaluation assesses the quality and usefulness of a RAG system’s responses from a real user perspective.
Criteria for Human Evaluation
Criteria for Human Evaluation in RAG Systems:
- Relevance: Ensures the answer directly addresses the user’s query.
- Informativeness: Measures whether the response is helpful, detailed and meaningful.
- Factual Accuracy: Confirms that statements are correct and supported by sources.
- Clarity and Readability: Evaluates if the response is easy to understand and well structured.
Methods of Human Evaluation
Methods of Human Evaluation in RAG Systems:
- Rating Scales: Evaluators score responses on criteria like relevance, accuracy and clarity.
- Pairwise Comparison: Responses are compared in pairs to determine which is better.
- Expert Review: Subject matter experts assess the quality, factual correctness and usefulness of responses.
Emerging and Hybrid Evaluation Approaches
Advanced and combined evaluation methods that give a more complete picture of performance include:
- LLM based Evaluators: Using large language models to automatically assess relevance, factuality and coherence; a minimal LLM-as-judge sketch follows this list.
- Task Specific Evaluation Pipelines: Custom metrics tailored to the domain or application of the RAG system.
- Automatic Fact Checking and Citation Tracking: Tools that verify information against trusted sources.
- Hybrid Approaches: Combining automated metrics with human evaluation for a balanced, comprehensive assessment.
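As a concrete illustration of an LLM-based evaluator, the sketch below scores groundedness and relevance with a judge prompt, assuming the openai Python client is installed and an API key is configured; the model name, prompt wording and 1-5 scale are illustrative choices, not a prescribed method.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge(question, answer, context, model="gpt-4o-mini"):
    # Ask the judge model to rate groundedness and relevance on a 1-5 scale
    prompt = (
        "Rate from 1 to 5 how well the answer is grounded in the context "
        "and how relevant it is to the question. Reply with two numbers.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(llm_judge(
    "What is Retrieval-Augmented Generation?",
    "RAG systems combine retrieval and generation to improve accuracy.",
    "RAG pipelines retrieve relevant documents and generate context-aware answers."
))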
Comparative Analysis of Metrics
Comparison table of different RAG evaluation metrics:
| Metric Type | Examples | Strengths | Weaknesses |
|---|---|---|---|
| Retrieval Metrics | Hit Rate, MRR, Precision, Recall, nDCG | Simple, interpretable, directly measure relevance and ranking quality | Don't evaluate answer quality, fluency or coherence |
| Generation Metrics | BLEU, ROUGE, METEOR, BERTScore, Perplexity | Quantitative, widely used, easy to compute | May miss semantic meaning, context or factual correctness |
| End-to-End Metrics | Answer Relevance, Groundedness, Hallucination Rate, Coherence | Holistic evaluation of the system, includes factual grounding | Harder to compute automatically, may require human evaluation |
| Human Evaluation | Rating scales, Pairwise comparison, Expert review | Captures nuance, context, readability and factual correctness | Time consuming, subjective, not easily scalable |
Challenges in Evaluating RAG Systems
Some of the challenges faced when evaluating RAG systems are:
- Measuring Contextual Understanding: Ensuring the system correctly interprets the user’s intent and context.
- Balancing Factuality and Creativity: Avoiding hallucinations while allowing flexible, natural responses.
- Dataset Bias and Subjectivity: Evaluation may be affected by biased datasets or differing human judgments.
- Limited Automated Metrics: Existing metrics may not fully capture relevance, coherence or groundedness.
- Scaling Human Evaluation: Conducting thorough human assessments can be time consuming and resource intensive.
Best Practices for RAG Evaluation
We can follow these best practices to get reliable and meaningful results when evaluating RAG systems:
- Combine Multiple Metrics: Use retrieval, generation and end-to-end metrics together for better evaluation.
- Use Domain Specific Metrics: Tailor evaluation metrics to the application area, such as medical, legal or technical domains.
- Monitor Hallucinations and Groundedness: Regularly check for unsupported or fabricated content.
- Track Top-k Performance: Evaluate not just the top answer but also top-ranked results to assess retrieval effectiveness.
- Maintain Consistent Evaluation Pipelines: Ensure reproducibility by using standardized datasets, metrics and procedures.
- Incorporate User Feedback: Real world feedback helps assess usefulness, clarity and relevance.
- Visualize Results: Use dashboards or charts to track metrics over time and identify trends, as in the sketch below.
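A minimal visualization sketch using matplotlib is shown below; the metric names and values are placeholder numbers for illustration, not results from a real evaluation run.
import matplotlib.pyplot as plt

# Placeholder metric values for illustration only
metrics = {"Precision@5": 0.62, "Recall@5": 0.48, "Groundedness": 0.71, "Hallucination Rate": 0.18}

plt.bar(list(metrics.keys()), list(metrics.values()))
plt.ylabel("Score")
plt.title("RAG Evaluation Metrics")
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()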