The RAG Architecture

Learning Objectives

By the end of this module you will understand:

What Retrieval-Augmented Generation (RAG) is
How retrieval and generation are combined
The step-by-step flow of a RAG system
The role of embeddings, vector databases, and chunked documents
How RAG improves accuracy and reduces hallucinations

In previous modules we learned:

How embeddings capture semantic meaning
How vector databases enable fast similarity search
How chunking prepares documents for retrieval

Now we connect these components into a complete RAG system.

1. What is RAG?

Retrieval-Augmented Generation (RAG) is a framework that combines:

1. A retrieval system → finds relevant knowledge
2. A language model → generates answers based on retrieved knowledge

Instead of asking the LLM to remember everything, we retrieve relevant context on demand.

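This helps LLMs:

give more accurate answers, because responses are grounded in retrieved text
stay up to date without retraining
answer questions over external knowledge bases too large to fit in a single prompt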

2. Why RAG?

Traditional LLM-only approaches:

rely solely on what the model memorized during training
cannot know anything published after the training cutoff
may hallucinate when the needed information is missing

RAG mitigates these problems by placing retrieved information in the prompt, so the model has the context it needs.

Example:

Question: "What is our company refund policy for electronics purchased last month?"

LLM alone: may not know last month’s updated policy
RAG: retrieves the relevant policy chunk → the LLM generates an answer grounded in the current policy

3. The RAG Pipeline Overview

The RAG process can be broken down into four main steps:

1. Query Encoding: convert the user query into an embedding vector
2. Document Retrieval: search the vector database for the top K most similar embeddings
3. Context Construction: combine the retrieved document chunks into a context for the LLM
4. Answer Generation: feed the query and the retrieved context to the LLM to generate the answer

User Query → Query Embedding → Vector DB Search → Retrieved Chunks → Prompt with Context → LLM → Answer
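One way to see how the four steps compose is as a function that receives each stage as a parameter. This is only a sketch: `rag_answer` and its parameter names are hypothetical, and every stage is a plain callable rather than a call into any particular library.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def rag_answer(
    query: str,
    embed: Callable[[str], Vector],              # hypothetical: text -> vector
    search: Callable[[Vector, int], list[str]],  # hypothetical: (vector, k) -> chunks
    build_prompt: Callable[[str, list[str]], str],
    generate: Callable[[str], str],              # hypothetical: prompt -> answer
    top_k: int = 3,
) -> str:
    """Wire the four pipeline stages together in order."""
    query_vec = embed(query)                # 1. query encoding
    chunks = search(query_vec, top_k)       # 2. document retrieval
    prompt = build_prompt(query, chunks)    # 3. context construction
    return generate(prompt)                 # 4. answer generation
```

Passing the stages in as arguments keeps the data flow visible and lets each stage be swapped or tested in isolation.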


4. Step 1: Query Encoding

Convert the user’s query into a vector using the same embedding model that was used for the documents
This ensures the query and document embeddings lie in the same vector space and can be compared
Example:

```python
# embedding_model must be the same model that embedded the document chunks
query_embedding = embedding_model.encode("Return policy for electronics")
```

This embedding is used to find similar document chunks in the vector database.
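To experiment without downloading a model, the sketch below uses a toy stand-in. `toy_embed` is a hypothetical helper that hashes words into a fixed number of buckets; it captures word overlap only, not semantic meaning, and the dimension of 64 is arbitrary. What it does illustrate is the rule above: chunks and queries go through the same function, so their vectors have the same length and can be compared.

```python
import hashlib
import math
import re

DIM = 64  # arbitrary toy dimension, not taken from any real model


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hash each word into one of `dim` buckets, then L2-normalize.

    A stand-in for a real embedding model, for illustration only.
    """
    vec = [0.0] * dim
    for word in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# Document chunks and queries must be encoded by the same function.
chunk_embedding = toy_embed("Electronics return policy, section 2")
query_embedding = toy_embed("Return policy for electronics")
```

Because both vectors are normalized to unit length, their dot product equals their cosine similarity.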

5. Step 2: Document Retrieval

Use the query embedding to run a nearest-neighbor search in the vector DB
Retrieve the top K chunks, ranked by a similarity metric such as cosine similarity
Example:

Top 3 retrieved chunks:
1. Electronics return policy, section 2
2. Refund process details
3. Exclusions and conditions

These chunks become the context for the LLM.
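A minimal sketch of top-K retrieval by cosine similarity follows. It scans a small in-memory list, whereas a vector database would typically use an approximate nearest-neighbor index. The helper names, the hand-made 3-dimensional vectors, and the "Office holiday schedule" distractor chunk are all invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """Return the k (score, chunk_text) pairs most similar to query_vec.

    `index` is a list of (chunk_text, vector) pairs, scanned linearly.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made vectors, for illustration only.
index = [
    ("Electronics return policy, section 2", [0.9, 0.1, 0.0]),
    ("Refund process details", [0.7, 0.6, 0.1]),
    ("Exclusions and conditions", [0.5, 0.2, 0.6]),
    ("Office holiday schedule", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]
results = top_k(query_vec, index, k=3)
```

With these vectors the three policy chunks outrank the unrelated one, so `results` holds them in descending order of similarity.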

6. Step 3: Context Construction

Combine retrieved chunks into a single prompt
May include metadata or citations
Use a template to guide the LLM:

You are a helpful assistant. Use the following retrieved information to answer the question.

Context:
[Chunk 1]
[Chunk 2]
[Chunk 3]

Question: What is the return policy for electronics purchased last month?
Answer:

Important: Be mindful of the LLM context window. Too many chunks, or chunks that are too long, may exceed it.
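The template and the size limit can be combined in one small function. The sketch below is an assumption-laden illustration: `build_prompt` and `max_context_chars` are hypothetical names, and the budget counts characters as a crude proxy for tokens, whereas a real system would count tokens with the LLM's own tokenizer.

```python
PROMPT_TEMPLATE = (
    "You are a helpful assistant. Use the following retrieved information "
    "to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, chunks, max_context_chars=2000):
    """Add chunks (most relevant first) until the character budget is used up."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the budget
        selected.append(entry)
        used += len(entry)
    return PROMPT_TEMPLATE.format(context="\n".join(selected), question=question)
```

Numbering the chunks as `[1]`, `[2]`, and so on gives the LLM a simple way to cite which chunk supports its answer.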

7. Step 4: Answer Generation

Send the prompt (query plus retrieved context) to the LLM
The LLM generates an answer grounded in the retrieved documents
Benefits:
- Reduces hallucinations
- Incorporates updated knowledge
- Handles large knowledge bases

Example output:

Customers can return electronics within 15 days of purchase, provided the items are unused and in original packaging.

8. RAG Variants

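The original RAG paper (Lewis et al., 2020) defines two variants, which differ in how the generator combines the retrieved documents:

8.1 RAG-Sequence

The model conditions on a single retrieved document for the entire answer
Each document yields its own sequence probability; these are weighted by retrieval score and summed
Pros: every candidate answer is grounded in one coherent source
Cons: cannot easily combine facts from different documents; decoding needs a separate beam search per document

8.2 RAG-Token

The model can draw on a different retrieved document for each generated token
Document probabilities are weighted and summed at every token step
Pros: can combine information from several documents in one answer
Cons: the answer is no longer tied to a single source document

The prompt-based pipeline described in Steps 1 to 4 is simpler than both: it places all retrieved chunks in one prompt and lets the LLM attend to them together.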
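The difference between the two variants comes down to where the sum over retrieved documents sits. The toy calculation below follows the formulation in the original RAG paper (Lewis et al., 2020): RAG-Sequence sums over documents once, for the whole answer, while RAG-Token sums over documents at every token and then multiplies across tokens. All probabilities are invented, and the function names are hypothetical.

```python
from math import prod

# Invented toy numbers: two retrieved documents and a two-token answer.
doc_prior = [0.6, 0.4]   # p(z | x): retriever weight for each document
token_prob = [           # p(y_i | x, z, y_<i) for each document z
    [0.9, 0.2],          # document 0 supports token 1 but not token 2
    [0.3, 0.8],          # document 1 supports token 2 but not token 1
]


def rag_sequence(doc_prior, token_prob):
    """Marginalize over documents once, at the level of the whole sequence."""
    return sum(p_z * prod(probs) for p_z, probs in zip(doc_prior, token_prob))


def rag_token(doc_prior, token_prob):
    """Marginalize over documents separately at every token position."""
    n_tokens = len(token_prob[0])
    return prod(
        sum(p_z * probs[i] for p_z, probs in zip(doc_prior, token_prob))
        for i in range(n_tokens)
    )


seq_score = rag_sequence(doc_prior, token_prob)  # 0.6*0.18 + 0.4*0.24
tok_score = rag_token(doc_prior, token_prob)     # 0.66 * 0.44
```

In this example the answer needs one token from each document, so the token-level score (about 0.29) is higher than the sequence-level score (about 0.20).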

9. Key Components of RAG

Embedding Model: converts text into vectors
Vector Database: stores embeddings, supports similarity search
Chunked Documents: keep retrieved passages small enough to fit in the LLM context window
Language Model (LLM): generates the final answer using context

Together, these components create a scalable and accurate RAG system.

Key Takeaways

RAG combines retrieval and generation for better LLM performance
Query embedding + vector database → relevant context retrieval
Context is fed to LLM to generate grounded answers
RAG reduces hallucinations and enables up-to-date knowledge
Variants (RAG-Sequence, RAG-Token) differ in whether retrieved documents are combined per answer or per token

Next Module

In the next module we will build a basic RAG system from scratch:

Integrate embedding model + vector database + LLM
Feed a query → retrieve context → generate answer
Hands-on example of a working RAG pipeline