Evaluating and Improving RAG Systems

Learning Objectives

By the end of this module you will understand:

Why evaluating RAG systems is difficult
The difference between retrieval evaluation and generation evaluation
Metrics used to measure retrieval quality
How to detect hallucinations
How to systematically improve a RAG system

Building a RAG system is only half the job.

The real challenge is answering the question:

“How good is my RAG system?”

Without evaluation, you cannot know if the system is:

retrieving the correct documents
generating correct answers
hallucinating information
improving over time

This module explains how RAG systems are evaluated in practice.

1. Why Evaluating RAG is Difficult

Traditional machine learning models are easier to evaluate.

Example:

Image classification → accuracy
Spam detection → precision and recall

But RAG systems contain two major components:

Retrieval System
+
Language Model

So errors can occur in multiple places:

The system retrieved the wrong documents
The system retrieved correct documents but the LLM ignored them
The LLM hallucinated information
The prompt was poorly constructed

Example:

User Question:
What is the refund policy?

Retrieved Document:
Return policy for electronics

Generated Answer:
Incorrect refund time

The retrieval may be correct, but the generation may still fail.

This is why RAG evaluation is more complex.

2. Two Levels of Evaluation

RAG systems must be evaluated at two levels.

Level 1 — Retrieval Evaluation

Measures whether the right documents were retrieved.

Key question:


Did the system retrieve the correct knowledge?

Level 2 — Generation Evaluation

Measures whether the LLM produced a correct answer.

Key question:


Did the LLM correctly use the retrieved knowledge?

Both evaluations are necessary.

3. Retrieval Evaluation Metrics

Retrieval metrics measure how well the vector search performs.

Recall@K

Recall@K checks whether the correct document appears in the top K results.

Example:

User Question:
What is the return period?

Correct Document:
Return policy document

If the correct document appears in:


Top 3 results → success

Then:


Recall@3 = 1

If it does not appear:


Recall@3 = 0

High Recall@K means the system rarely misses relevant documents.

Precision@K

Precision measures how many retrieved documents are actually relevant.

Example:

Top 5 retrieved chunks

Chunk 1 → relevant
Chunk 2 → relevant
Chunk 3 → irrelevant
Chunk 4 → irrelevant
Chunk 5 → relevant

Precision:


3 relevant / 5 retrieved = 0.6

Higher precision means less noise in retrieval.

Mean Reciprocal Rank (MRR)

MRR measures how early the correct result appears.

Example:

Rank 1 → correct → score = 1
Rank 2 → correct → score = 0.5
Rank 5 → correct → score = 0.2

Higher MRR means the system retrieves relevant documents earlier.

4. Generation Evaluation

After retrieval, we must evaluate the LLM response.

Important questions:

Is the answer correct?
Is the answer grounded in the retrieved documents?
Is the answer complete?

Human Evaluation

One approach is manual review.

Experts evaluate responses based on:

factual accuracy
completeness
relevance

Example rating scale:

→ incorrect
→ partially correct
→ correct but incomplete
→ correct and complete

Human evaluation is accurate but slow and expensive.

5. Detecting Hallucinations

Hallucination happens when the LLM makes up information not present in the documents.

Example:

Retrieved Document:
Return period is 30 days

Generated Answer:
Return period is 60 days

This is a hallucination.

Ways to detect hallucinations:

Grounding Check

Verify whether the answer is supported by retrieved documents.

Citation Requirement

Force the model to include citations.

Example:


Answer: The return period is 30 days [Doc 2]

Automated LLM Evaluation

Use another LLM to check if the answer is supported by the context.

6. Building an Evaluation Dataset

To evaluate RAG systems properly, we need benchmark questions.

Example dataset:

Question: What is the refund period?
Answer: 30 days
Relevant Document: Return policy document

A good evaluation dataset contains:

realistic user questions
verified answers
relevant documents

This dataset allows consistent testing.

7. RAG Evaluation Pipeline

A typical evaluation pipeline looks like this:

Evaluation Questions
↓
Run Through RAG System
↓
Retrieve Documents
↓
Generate Answers
↓
Compute Metrics

Metrics include:

Recall@K
Precision@K
Answer correctness
Hallucination rate

This pipeline helps track improvements.

8. Improving RAG Systems

Once evaluation identifies problems, improvements can be applied.

Improve Chunking

Poor chunking often causes retrieval failures.

Example improvement:

Add overlapping chunks
Use semantic chunking

Improve Embeddings

Better embedding models improve semantic search.

Example:

Switch from basic embeddings
to domain-specific embeddings

Improve Prompts

Prompt structure greatly affects answer quality.

Example:

Use only the provided context.
If the answer is not found, say "I don't know".

This reduces hallucinations.

Improve Retrieval

Techniques include:

hybrid search
query expansion
re-ranking

9. Continuous Evaluation

Production RAG systems require continuous monitoring.

Example workflow:

User Queries
↓
Log Retrieval + Responses
↓
Evaluate Quality
↓
Identify Failures
↓
Improve System

This process continuously improves the system.

Key Takeaways

RAG systems must be evaluated at two levels: retrieval and generation
Retrieval metrics include Recall@K, Precision@K, and MRR
Generation evaluation focuses on accuracy and grounding
Hallucinations must be detected and minimized
Evaluation datasets allow consistent system benchmarking
Continuous evaluation is essential for production systems

A well-evaluated RAG system becomes more reliable and trustworthy over time.

Next Module

In the next module we will explore:

Industry RAG Architectures and Design Patterns

You will learn how companies build:

large scale RAG pipelines
enterprise knowledge assistants
multi-agent retrieval systems