Document Chunking Strategies
Learning Objectives
By the end of this module you will understand:
- Why large documents need to be split into smaller chunks
- How chunking impacts embeddings and retrieval
- Different strategies for splitting text
- The concept of overlap in chunks
- How chunking improves RAG system performance
In previous modules, we learned about embeddings and vector databases.
Now we tackle large documents and how to make them compatible with retrieval systems.
1. Why Chunking is Necessary
Large documents often exceed the context window of language models.
Example:
A research report with 50,000 words
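Context windows are finite. Depending on the model, they range from a few thousand to a few hundred thousand tokens.
Problem: Feeding the entire document to the model is often impossible, and even when it fits, the relevant passage is diluted by unrelated text.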
Solution: Split the document into smaller, manageable chunks.
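To make the size problem concrete, here is a rough back-of-the-envelope sketch. The ratio of 1.3 tokens per word and the 8,000-token window are illustrative assumptions, not properties of any particular model; real counts depend on the tokenizer.

```python
# Assumed values for illustration only; real figures depend on the model and tokenizer.
TOKENS_PER_WORD = 1.3     # rough average for English text
CONTEXT_WINDOW = 8_000    # hypothetical context window, in tokens


def estimate_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Return a rough token estimate for a document of `word_count` words."""
    return round(word_count * tokens_per_word)


def fits_in_context(word_count: int, context_window: int = CONTEXT_WINDOW) -> bool:
    """Check whether the estimated token count fits in the context window."""
    return estimate_tokens(word_count) <= context_window


report_tokens = estimate_tokens(50_000)
print(report_tokens)             # 65000
print(fits_in_context(50_000))   # False
```

Under these assumptions the 50,000-word report is about eight times larger than the window, which is why it has to be split.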
2. Chunking and Embeddings
When we generate embeddings, each chunk is converted into a vector.
Document → Split into chunks → Each chunk → Embedding
Benefits:
- Smaller chunks fit into LLM context windows
- Retrieval is more precise
- Avoids diluting relevant information in long documents
Example:
Document: Refund Policy (15,000 words)
Chunks: 100 chunks of ~150 words each
Embedding: Each chunk → vector stored in vector DB
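The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production setup: `toy_embedding` is a hypothetical placeholder for a real embedding model, and a plain dictionary stands in for the vector database.

```python
import hashlib


def split_into_word_chunks(text: str, chunk_size: int = 150) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embedding(chunk: str, dim: int = 8) -> list[float]:
    """Placeholder for a real embedding model: hash each word into one of `dim` buckets."""
    vector = [0.0] * dim
    for word in chunk.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    return vector


# Synthetic stand-in for a 15,000-word refund policy.
document_text = " ".join(f"word{i}" for i in range(15_000))

chunks = split_into_word_chunks(document_text)
# A dict stands in for the vector database: chunk ID -> (chunk text, vector).
vector_store = {i: (chunk, toy_embedding(chunk)) for i, chunk in enumerate(chunks)}

print(len(vector_store))  # 100
```

In a real system the placeholder function would be replaced by a call to an embedding model, and the dictionary by a vector database client.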
3. Chunk Size
Choosing the right chunk size is crucial.
- Too small:
  - Number of chunks explodes
  - More storage and retrieval overhead
- Too large:
  - Embeddings may contain unrelated content
  - Reduced retrieval accuracy
Rule of thumb: 100–500 words per chunk (roughly 130–650 tokens, assuming about 1.3 tokens per English word)
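The trade-off is easy to see with a little arithmetic. The sketch below counts how many chunks a 15,000-word document produces at several chunk sizes; the sizes are arbitrary examples chosen for illustration.

```python
def chunk_count(total_words: int, chunk_size: int) -> int:
    """Number of non-overlapping chunks needed to cover `total_words` words."""
    # Integer ceiling division: the last chunk may be shorter than chunk_size.
    return (total_words + chunk_size - 1) // chunk_size


for size in (50, 150, 500, 2000):
    print(f"{size:>5} words per chunk -> {chunk_count(15_000, size)} chunks")
# 50 -> 300 chunks, 150 -> 100 chunks, 500 -> 30 chunks, 2000 -> 8 chunks
```

Very small chunks multiply the number of vectors to store and search, while very large chunks pack many topics into a single embedding.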
4. Chunk Overlap
Overlap repeats the last few words of one chunk at the start of the next, so context is preserved across chunk boundaries.
Example:
Chunk 1: words 1–150
Chunk 2: words 121–270 (overlaps chunk 1 by 30 words)
Chunk 3: words 241–390 (overlaps chunk 2 by 30 words)
Benefits:
- Maintains continuity across chunks
- Prevents missing important information at chunk boundaries
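A sliding window is one simple way to produce overlapping chunks. The sketch below is a minimal version under assumed defaults (150-word chunks, 30-word overlap); the function name is hypothetical. Each new chunk starts `chunk_size - overlap` words after the previous one.

```python
def chunk_with_overlap(words: list[str], chunk_size: int = 150, overlap: int = 30) -> list[list[str]]:
    """Split a word list into fixed-size chunks that share `overlap` words with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Label each word with its 1-based position so the boundaries are visible.
words = [f"w{i}" for i in range(1, 391)]
chunks = chunk_with_overlap(words)
for chunk in chunks:
    print(chunk[0], chunk[-1], len(chunk))
# w1 w150 150
# w121 w270 150
# w241 w390 150
```

The last 30 words of each chunk (for example, w121 to w150) reappear as the first 30 words of the next one.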
5. Splitting Strategies
There are several ways to chunk text:
5.1 Fixed-Size Chunks
- Divide text by a fixed number of words or tokens
- Simple but may cut sentences abruptly
5.2 Sentence-Based Chunks
- Split text at sentence boundaries
- More natural for LLMs, because every chunk contains only complete sentences
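A sentence-based splitter can be sketched with a regular expression. This is a deliberately naive version: it treats any `.`, `!`, or `?` followed by whitespace as a sentence end, so abbreviations such as "e.g." will be split incorrectly. Libraries like nltk or spaCy handle those cases better. The function names and the `max_words` limit are illustrative assumptions.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_chunks(text: str, max_words: int = 40) -> list[str]:
    """Group whole sentences into chunks of at most `max_words` words where possible."""
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        # Start a new chunk rather than cut a sentence in half.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Refunds are issued within 14 days. Items must be unused! Is a receipt required? Yes, always."
for chunk in sentence_chunks(text, max_words=8):
    print(chunk)
# Refunds are issued within 14 days.
# Items must be unused! Is a receipt required?
# Yes, always.
```

A single sentence longer than `max_words` is kept whole in this sketch, so the limit is a target rather than a hard cap.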
5.3 Paragraph-Based Chunks
- Use paragraphs as chunk boundaries
- Preserves semantic integrity
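Paragraph-based splitting can be as simple as breaking on blank lines. The sketch below assumes paragraphs are separated by one or more empty (or whitespace-only) lines, which is common for plain text but not guaranteed for every source format.

```python
import re


def paragraph_chunks(text: str) -> list[str]:
    """Split text into paragraphs at blank lines, dropping empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


text = "Para one, line one.\nPara one, line two.\n\nPara two.\n\n\n  \nPara three."
for paragraph in paragraph_chunks(text):
    print(repr(paragraph))
# 'Para one, line one.\nPara one, line two.'
# 'Para two.'
# 'Para three.'
```

Paragraph lengths vary widely, so this approach alone gives no control over chunk size.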
5.4 Hybrid Chunks
- Combine paragraph-based chunks with token limits
- Add overlap for continuity
- Works well for large documents
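One possible reading of the hybrid approach is sketched below: paragraphs are packed together until a size limit is reached, each new chunk begins with the tail of the previous one, and oversized paragraphs fall back to a sliding window. The function name and defaults are hypothetical, and word counts stand in for token counts to keep the example dependency-free.

```python
import re


def hybrid_chunks(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Pack paragraphs into chunks of at most `max_words` words, with `overlap` words carried over."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    paragraphs = [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in paragraphs:
        # Close the current chunk if this paragraph would push it past the limit.
        if current and len(current) + len(paragraph) > max_words:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(paragraph)
        # Still too long (oversized paragraph): fall back to a sliding window.
        while len(current) > max_words:
            chunks.append(" ".join(current[:max_words]))
            current = current[max_words - overlap:]
    if current:
        chunks.append(" ".join(current))
    return chunks


# Three synthetic paragraphs of 50, 50, and 30 words.
sample = "\n\n".join(
    " ".join(f"p{p}w{i}" for i in range(n)) for p, n in enumerate((50, 50, 30))
)
chunks = hybrid_chunks(sample, max_words=100, overlap=10)
print([len(c.split()) for c in chunks])  # [100, 40]
```

Here the first two paragraphs fill one chunk, and the second chunk holds 10 carried-over words plus the 30-word third paragraph.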
6. Impact on Retrieval
Proper chunking affects RAG retrieval:
- Smaller, precise chunks → more relevant results
- Overlapping chunks → prevent missing context
- Balanced chunk size → efficient storage and search
Example:
Query: "Return policy for electronics"
Retrieval returns top 3 chunks → Combined into prompt → LLM generates answer
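The retrieval step in this example can be sketched end to end. To stay self-contained, the code below uses word-count vectors and cosine similarity as a toy stand-in for real embeddings and a vector database; the sample chunks and the prompt wording are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the `top_k` chunks most similar to the query."""
    query_vector = bag_of_words(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, bag_of_words(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Combine the retrieved chunks and the question into a single prompt."""
    context = "\n\n".join(retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Illustrative chunks, not taken from a real policy document.
sample_chunks = [
    "Electronics can be returned within 30 days with a receipt.",
    "Our return policy for electronics excludes opened software.",
    "Shipping is free on orders over 50 dollars.",
    "Gift cards are not eligible for return or refund.",
    "Store hours are 9 to 5 on weekdays.",
]

query = "Return policy for electronics"
top_chunks = retrieve(query, sample_chunks)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

Note that this toy version matches only exact words ("returned" does not match "return"). Capturing that kind of similarity is what real embedding models are for.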
7. Tools for Chunking
There are several libraries and tools to automate chunking:
- LangChain text splitters (for example, RecursiveCharacterTextSplitter)
- LlamaIndex node parsers (for example, SentenceSplitter)
- Custom Python scripts using nltk, spaCy, or regex
Example with LangChain:
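# Newer LangChain releases expose this class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap are measured in characters by default, not words or tokens.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
# document_text is assumed to hold the full document as a single string.
chunks = text_splitter.split_text(document_text)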
Each chunk can then be embedded and stored in a vector database.
8. Best Practices
- Always test different chunk sizes for your data
- Use overlap to prevent information loss
- Maintain semantic integrity (don’t split mid-sentence)
- Keep track of metadata (source, chunk ID, position)
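Metadata tracking can be sketched with a small record type. The field names below (`chunk_id`, `source`, `position`, `start_word`) are hypothetical choices for illustration; use whatever schema your vector database expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str     # unique ID, here built from the source name and position
    source: str       # where the text came from, e.g. a file name
    position: int     # index of the chunk within its source document
    start_word: int   # 0-based offset of the chunk's first word
    text: str


def chunk_with_metadata(text: str, source: str, chunk_size: int = 150) -> list[Chunk]:
    """Split text into fixed-size word chunks and attach metadata to each one."""
    words = text.split()
    return [
        Chunk(
            chunk_id=f"{source}#{position}",
            source=source,
            position=position,
            start_word=start,
            text=" ".join(words[start:start + chunk_size]),
        )
        for position, start in enumerate(range(0, len(words), chunk_size))
    ]


# Synthetic 400-word document with a hypothetical file name.
document_text = " ".join(f"word{i}" for i in range(400))
records = chunk_with_metadata(document_text, source="refund_policy.txt")
for record in records:
    print(record.chunk_id, record.start_word, len(record.text.split()))
# refund_policy.txt#0 0 150
# refund_policy.txt#1 150 150
# refund_policy.txt#2 300 100
```

Storing this metadata next to each vector lets a retrieved chunk be traced back to its source and position, which helps with citations and debugging.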
Chunking + embeddings + vector database → precise and scalable RAG retrieval
Key Takeaways
- Large documents must be split into chunks to fit LLM context windows
- Proper chunk size and overlap are critical for retrieval accuracy
- Different chunking strategies: fixed-size, sentence-based, paragraph-based, hybrid
- Chunking improves embedding quality and relevance in RAG pipelines
Next Module
In the next module we will explore the RAG architecture:
- How retrieval and generation are combined
- How the query → retrieval → language model → answer pipeline works
- End-to-end flow of a RAG system