Building a Basic RAG System

Learning Objectives

By the end of this module you will understand:

How to combine embeddings, vector databases, and an LLM into a working RAG system
How queries are processed through the RAG pipeline
How to retrieve top K chunks and generate answers
Basic code implementation of a RAG system
Key best practices for small-scale RAG pipelines

In previous modules we covered the theory:

embeddings
vector similarity
vector databases
document chunking
RAG architecture

Now we put it all together into a hands-on basic RAG system.

1. System Overview

A basic RAG system consists of three main components:

Embedding model → converts text into vectors
Vector database → stores document embeddings
Language model (LLM) → generates answers using retrieved context

Workflow:

User Query → Embedding → Vector DB → Retrieve top K chunks → LLM → Answer

2. Preparing Documents

Start with a set of documents you want your RAG system to know.

Example:

documents = [
"Customers can return products within 15 days of purchase.",
"Warranty covers defects for 1 year from the purchase date.",
"Shipping takes 3-5 business days."
]

3. Chunking Documents

Split documents into smaller chunks for better embeddings.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10
)

chunks = []
for doc in documents:
    chunks.extend(text_splitter.split_text(doc))

Each chunk will later be embedded.

4. Generating Embeddings

Use an embedding model to convert chunks into vectors.

from openai import OpenAI

client = OpenAI()

embeddings = []
for chunk in chunks:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk
    )
    embeddings.append(response.data[0].embedding)

Now each chunk has a corresponding embedding vector.

5. Storing Embeddings in a Vector Database

Choose a vector database (e.g., FAISS for local or Pinecone for cloud).

Example using FAISS:

import faiss
import numpy as np

dimension = len(embeddings[0])
index = faiss.IndexFlatL2(dimension)  # Euclidean distance
index.add(np.array(embeddings).astype("float32"))

Now the system can search for nearest neighbor embeddings efficiently.

6. Querying the System

Convert the user query into an embedding
Search vector database for top K most similar chunks

query = "How long can I return a product?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small",
    input=query
).data[0].embedding

# Search top 2 similar chunks
D, I = index.search(np.array([query_embedding]).astype("float32"), k=2)
retrieved_chunks = [chunks[i] for i in I[0]]

7. Constructing the Prompt

Combine retrieved chunks into a prompt for the LLM:

prompt = f"""
You are a helpful assistant. Use the following information to answer the question.

Context:
{''.join(retrieved_chunks)}

Question: {query}
Answer:
"""

8. Generating the Answer

Feed the prompt to an LLM to generate a response:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)

answer = response.choices[0].message.content
print(answer)

Example output:

Customers can return products within 15 days of purchase, provided the items are unused and in their original packaging.

9. Putting It All Together

Summary of steps:

Prepare documents and chunk them
Generate embeddings for each chunk
Store embeddings in vector database
Convert user query to embedding
Retrieve top K relevant chunks
Construct prompt with context
Feed prompt to LLM → generate answer

Even this basic system already implements RAG and can be extended to:

larger document collections
hybrid search
multiple LLMs
caching and batching

Key Takeaways

RAG systems combine embeddings, vector databases, and LLMs to answer questions
Chunking ensures documents fit within LLM context windows
Vector similarity retrieves the most relevant chunks
Prompt construction and feeding context to the LLM produces grounded answers
A simple RAG pipeline can be implemented with minimal code

Next Module

In the next module we will explore Advanced RAG Techniques:

Query refinement and re-ranking
Multi-step retrieval
Handling long documents efficiently
Scaling and performance optimizations