Document Chunking Strategies

Learning Objectives

By the end of this module you will understand:

- Why large documents must be split into chunks before retrieval
- How chunk size and overlap affect retrieval quality
- Common splitting strategies: fixed-size, sentence-based, paragraph-based, and hybrid
- Tools that automate chunking, such as LangChain's text splitters

In previous modules, we covered embeddings and vector databases.
Now we tackle large documents and how to make them compatible with retrieval systems.


1. Why Chunking is Necessary

Large documents often exceed the context window of language models.

Example:

A research report contains 50,000 words, but even large LLMs can only
process a limited number of tokens in a single prompt.

Problem: Feeding the entire document to the model in a single prompt is usually impossible.

Solution: Split the document into smaller, manageable chunks.
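As a minimal sketch of the idea (the function name and word-based sizing are illustrative, not from any library):

```python
def split_into_chunks(text: str, chunk_size: int = 150) -> list[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# A 400-word stand-in document yields 3 chunks (150 + 150 + 100 words).
doc = "word " * 400
chunks = split_into_chunks(doc, chunk_size=150)
print(len(chunks))  # → 3
```

Each of these chunks is now small enough to embed and retrieve individually.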


2. Chunking and Embeddings

When we generate embeddings, each chunk is converted into a vector.


Document → Split into chunks → Each chunk → Embedding

Benefits:

- Each chunk can be searched and retrieved independently
- Retrieval returns only the passages relevant to a query, not the whole document
- Chunks are small enough to fit into the LLM's context window

Example:


Document: Refund Policy (15,000 words)
Chunks: 100 chunks of ~150 words each
Embedding: Each chunk → vector stored in vector DB
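The chunk-to-vector pipeline above can be sketched as follows; `embed` here is a hash-based stand-in for a real embedding model, and all names are illustrative:

```python
import hashlib

def embed(chunk: str) -> list[float]:
    # Stand-in for a real embedding model: derive a tiny deterministic
    # 8-dimensional vector from a hash of the chunk text.
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:8]]

def index_document(chunks: list[str]) -> list[tuple[str, list[float]]]:
    # Pair each chunk with its vector, as a vector DB would store them.
    return [(chunk, embed(chunk)) for chunk in chunks]

index = index_document([
    "Refunds are accepted within 30 days of purchase.",
    "Electronics must be returned within 14 days.",
])
print(len(index), len(index[0][1]))  # → 2 8
```

In practice the vectors come from an embedding model and are stored in a vector database rather than a Python list.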


3. Chunk Size

Choosing the right chunk size is crucial: chunks that are too small lose surrounding context, while chunks that are too large dilute retrieval precision and waste prompt space.

Rule of thumb: 100–500 words per chunk (roughly 130–650 tokens).


4. Chunk Overlap

Overlap ensures context is preserved between chunks.

Example:


Chunk 1: words 1–150
Chunk 2: words 120–270 (overlap 30 words)
Chunk 3: words 240–390 (overlap 30 words)
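The overlap scheme above (150-word chunks, 30-word overlap) can be sketched as follows; the function is illustrative:

```python
def split_with_overlap(words: list[str], size: int = 150,
                       overlap: int = 30) -> list[list[str]]:
    step = size - overlap  # advance 120 words, so 30 words repeat
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

# 390 words reproduce the boundaries above (0-based indices).
words = [f"w{i}" for i in range(390)]
chunks = split_with_overlap(words)
print([(c[0], c[-1]) for c in chunks])
# → [('w0', 'w149'), ('w120', 'w269'), ('w240', 'w389')]
```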

Benefits:

- Sentences that span a chunk boundary are not cut off from their context
- Ideas that straddle two chunks remain retrievable from either one


5. Splitting Strategies

There are several ways to chunk text:

5.1 Fixed-Size Chunks

Split every N words or characters, regardless of content. Simple and fast, but may cut sentences in half.

5.2 Sentence-Based Chunks

Split on sentence boundaries and group sentences until a size limit is reached. Keeps every sentence intact.

5.3 Paragraph-Based Chunks

Use paragraphs as natural chunk units. Preserves the author's own topical structure.

5.4 Hybrid Chunks

Combine strategies: split on paragraphs first, then fall back to sentences or fixed sizes for oversized paragraphs.
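A sentence-based splitter (strategy 5.2) might look like this; the regex and word budget are a simplification:

```python
import re

def sentence_chunks(text: str, max_words: int = 40) -> list[str]:
    # Naive sentence split: a ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Refunds take 30 days. Electronics have 14 days. "
        "Contact support for help.")
print(sentence_chunks(text, max_words=8))
# → ['Refunds take 30 days. Electronics have 14 days.', 'Contact support for help.']
```

Unlike a fixed-size splitter, this never cuts a sentence in half.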


6. Impact on Retrieval

Proper chunking directly affects RAG retrieval quality: well-sized chunks make similarity search precise, while poorly sized chunks return irrelevant or truncated context.

Example:


Query: "Return policy for electronics"
Retrieval returns top 3 chunks → Combined into prompt → LLM generates answer
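The retrieval step above can be sketched with plain cosine similarity over an in-memory index; a real system would query a vector database instead, and all names here are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    # Rank stored chunks by similarity to the query vector.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 2-dimensional index; real embeddings have hundreds of dimensions.
index = [
    ("Returns accepted within 30 days.", [1.0, 0.0]),
    ("Shipping takes 3-5 business days.", [0.0, 1.0]),
    ("Electronics returns: 14 days.", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], index, k=2))
# → ['Returns accepted within 30 days.', 'Electronics returns: 14 days.']
```

The returned chunks are then concatenated into the prompt for the LLM.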


7. Tools for Chunking

There are several libraries and tools that automate chunking; LangChain's text splitters are a popular choice.

Example with LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries to split on paragraphs first, then sentences, then words,
# until each chunk fits within chunk_size.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between adjacent chunks
)

chunks = text_splitter.split_text(document_text)

Each chunk can then be embedded and stored in a vector database.


8. Best Practices

- Match chunk size to the embedding model and the LLM's context window
- Use overlap so boundary sentences keep their context
- Prefer natural boundaries (paragraphs, sentences) over arbitrary cuts
- Evaluate retrieval quality on real queries and tune chunk size accordingly

Chunking + embeddings + vector database → precise and scalable RAG retrieval


Key Takeaways

- Large documents must be chunked to fit model context windows
- Chunk size (rule of thumb: 100–500 words) and overlap are the key tuning parameters
- Splitting at natural boundaries usually preserves meaning better than fixed-size cuts
- Each chunk is embedded and stored in a vector database for retrieval


Next Module

In the next module we will explore the RAG (Retrieval-Augmented Generation) architecture.
