The Core Idea
Chunking is the most impactful decision for RAG quality. Get it wrong, and no amount of better embeddings or LLMs will save you.
Most RAG tutorials treat chunking as an afterthought: "just split by 500 tokens." This works for demos but fails in production because different documents need different chunking strategies.
Why Chunking Matters
The Retrieval Problem
When a user asks a question, your retriever must find the right chunk. This requires:
- The chunk contains the answer — Can't find what isn't there
- The chunk is semantically coherent — Embedding captures the meaning
- The chunk is appropriately sized — Not too big, not too small
Bad chunking breaks all three.
The Size Dilemma
Chunks too small:
- Lose context needed to understand meaning
- "The answer is yes" — yes to what?
- Retrieval returns many fragments the LLM can't synthesize
Chunks too large:
- Embedding averages too much meaning together
- "This chunk is about everything and nothing"
- Context window fills with irrelevant text
The sweet spot depends on the document type.
The Five Chunking Strategies
Strategy 1: Fixed-Size Chunking
How it works: Split every N tokens/characters, with optional overlap.
```python
def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50):
    tokens = tokenize(text)
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size so chunks overlap
    for i in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[i:i + chunk_size]))
    return chunks
```
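The `tokenize` and `detokenize` helpers above are placeholders. A minimal sketch using the tiktoken library (the package and encoding choice are assumptions; use whatever tokenizer matches your embedding model):

```python
import tiktoken

# cl100k_base is just a common default encoding
_enc = tiktoken.get_encoding("cl100k_base")

def tokenize(text: str) -> list[int]:
    return _enc.encode(text)

def detokenize(tokens: list[int]) -> str:
    return _enc.decode(tokens)
```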
Pros:
- Simple, predictable
- Works for any document
- Easy to tune
Cons:
- Ignores document structure
- Splits mid-sentence, mid-paragraph
- Semantically incoherent boundaries
Best for: Uniform text without clear structure (logs, transcripts).
Typical settings: 256-512 tokens, 10-20% overlap.
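A quick usage sketch with those typical settings (the file name is hypothetical):

```python
text = open("meeting_transcript.txt").read()
chunks = fixed_size_chunk(text, chunk_size=512, overlap=64)  # ~12% overlap
print(f"{len(chunks)} chunks")
```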
Strategy 2: Semantic Chunking
How it works: Split when the topic/meaning changes.
```python
def semantic_chunk(text: str, threshold: float = 0.5):
    sentences = split_sentences(text)
    if not sentences:
        return []
    embeddings = [embed(s) for s in sentences]
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i], embeddings[i - 1])
        if similarity < threshold:
            # Topic shift detected, start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
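`split_sentences`, `embed`, and `cosine_similarity` are assumed helpers. One possible sketch using sentence-transformers and NumPy (the packages and model name are assumptions, not requirements):

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence-embedding model works; this one is a common lightweight default
_model = SentenceTransformer("all-MiniLM-L6-v2")

def split_sentences(text: str) -> list[str]:
    # Naive splitter; a library like nltk or spaCy is more robust
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def embed(sentence: str) -> np.ndarray:
    return _model.encode(sentence)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```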
Pros:
- Respects meaning boundaries
- Better embedding quality
- Fewer retrieval misses
Cons:
- Computationally expensive
- Variable chunk sizes
- Threshold tuning required
Best for: Articles, documentation, narrative text.
Typical settings: threshold 0.3-0.7 depending on topic density.
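Because the right threshold varies by corpus, a quick diagnostic is to sweep a few values over a sample document (`sample_text` is a placeholder) and see how the chunk count changes:

```python
for t in (0.3, 0.4, 0.5, 0.6, 0.7):
    print(t, len(semantic_chunk(sample_text, threshold=t)))
```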
Strategy 3: Document Structure Chunking
How it works: Use document structure (headings, sections) as boundaries.
```python
def structure_chunk(document: Document):
    chunks = []
    for section in document.sections:
        # Each major section is a chunk
        if len(section.content) < MAX_CHUNK_SIZE:
            chunks.append(Chunk(
                text=section.content,
                metadata={
                    "heading": section.heading,
                    "level": section.level,
                    "parent": section.parent_heading,
                },
            ))
        else:
            # Section too large, split by subsections or paragraphs
            for subsection in section.split_by_structure():
                chunks.append(Chunk(
                    text=subsection.content,
                    metadata={  # mirror the fields used above
                        "heading": subsection.heading,
                        "level": subsection.level,
                        "parent": section.heading,
                    },
                ))
    return chunks
```
Pros:
- Preserves author's intended organization
- Rich metadata for filtering
- Natural semantic coherence
Cons:
- Requires structured documents
- Section sizes vary wildly
- Parsing complexity
Best for: Technical docs, manuals, wikis, legal documents.
Implementation tip: Parse markdown/HTML structure explicitly.
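For markdown, the parsing step can be as simple as splitting on heading lines. A minimal sketch (it ignores edge cases such as headings inside code fences):

```python
import re

def parse_markdown_sections(text: str) -> list[dict]:
    """Split markdown into sections, one per heading."""
    sections = []
    current = {"heading": None, "level": 0, "content": []}
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            if current["heading"] or current["content"]:
                sections.append(current)
            current = {"heading": m.group(2), "level": len(m.group(1)), "content": []}
        else:
            current["content"].append(line)
    sections.append(current)
    for s in sections:
        s["content"] = "\n".join(s["content"]).strip()
    return sections
```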
Strategy 4: Recursive Chunking
How it works: Try the coarsest separator first (paragraphs, then lines, then sentences), recursively splitting any piece that is still over the target size.
```python
def recursive_chunk(text: str, target_size: int = 500, separators: list = None):
    if separators is None:
        separators = ["\n\n", "\n", ". ", " "]
    if len(tokenize(text)) <= target_size:
        return [text]
    for i, separator in enumerate(separators):
        if separator in text:
            chunks = []
            # Note: splitting drops the separator itself
            for split in text.split(separator):
                if split:  # skip empty pieces from consecutive separators
                    chunks.extend(recursive_chunk(split, target_size, separators[i:]))
            return chunks
    # No separator applies, force a fixed-size split
    return fixed_size_chunk(text, target_size)
```

(The original list of separators often ends with `""` for a character-level fallback, but Python's `str.split("")` raises a ValueError, so here the fallback is an explicit call to `fixed_size_chunk`.)
Pros:
- Adapts to document structure
- Prefers natural boundaries
- Handles mixed content well
Cons:
- Still may split mid-thought
- Separator priority is heuristic
- Can create tiny chunks
Best for: General-purpose default, mixed document types.
Typical separators: ["\n\n", "\n", ". ", ", ", " "]
Strategy 5: Agentic Chunking
How it works: Use an LLM to decide chunk boundaries.
```python
def agentic_chunk(text: str, context: str = None):
    prompt = f"""
    Divide this text into coherent chunks. Each chunk should:
    1. Contain one complete idea or topic
    2. Be understandable without other chunks
    3. Be 100-500 words

    Return a JSON array of chunks.

    Text:
    {text}
    """
    response = llm.complete(prompt)
    return parse_json_chunks(response)
```
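Here `llm` stands for whatever client you use. `parse_json_chunks` can be a thin wrapper around `json.loads` (a sketch that assumes the model returned a bare JSON array):

```python
import json

def parse_json_chunks(response: str) -> list[str]:
    # Real pipelines should also strip markdown fences and retry on bad JSON
    chunks = json.loads(response)
    if not isinstance(chunks, list):
        raise ValueError("expected a JSON array of chunks")
    return chunks
```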
Pros:
- Human-quality boundaries
- Handles complex structures
- Can add summaries/metadata
Cons:
- Expensive (LLM call per document)
- Slow
- Non-deterministic
Best for: High-value documents worth the cost, pre-processing pipelines.
Cost example: $0.01-0.10 per page with GPT-4.
Choosing the Right Strategy
Decision Matrix
| Document Type | Recommended Strategy | Why |
|---|---|---|
| Technical docs | Structure | Natural sections, headings critical |
| Legal documents | Structure + small chunks | Precision required |
| Articles/blogs | Semantic or Recursive | Topic flow matters |
| Chat logs | Fixed-size | No structure to exploit |
| Code | Structure (by function) | Syntax boundaries critical |
| Books | Chapter → Semantic | Multi-level structure |
| PDFs (mixed) | Recursive | Handle tables, images, text |
The Hybrid Approach
In production, combine strategies:
```python
def hybrid_chunk(document: Document):
    # 1. First, split by structure if available
    if document.has_structure():
        sections = [chunk.text for chunk in structure_chunk(document)]
    else:
        sections = [document.text]
    # 2. Recursively split any section that is still too large
    chunks = []
    for section in sections:
        if len(tokenize(section)) > MAX_CHUNK_SIZE:
            chunks.extend(recursive_chunk(section, target_size=400))
        else:
            chunks.append(section)
    # 3. Add overlap for continuity across chunk boundaries
    return add_overlap(chunks, overlap_tokens=50)
```
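`add_overlap` is left undefined above. One simple implementation prepends the tail of each previous chunk, reusing the `tokenize`/`detokenize` helpers sketched earlier:

```python
def add_overlap(chunks: list[str], overlap_tokens: int = 50) -> list[str]:
    if not chunks:
        return []
    result = [chunks[0]]
    for prev, curr in zip(chunks, chunks[1:]):
        # Carry the last overlap_tokens of the previous chunk into the next
        tail = detokenize(tokenize(prev)[-overlap_tokens:])
        result.append(tail + " " + curr)
    return result
```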
Chunking Best Practices
1. Preserve Context with Metadata
Include surrounding context in chunk metadata:
```python
chunk = Chunk(
    text="The return policy allows 30-day returns.",
    metadata={
        "source": "handbook.pdf",
        "section": "Chapter 4: Returns",
        "page": 23,
        "preceding_heading": "Customer Policies",
        "document_summary": "Employee handbook covering HR policies...",
    },
)
```
This helps LLMs understand context even when chunks are retrieved in isolation.
2. Add Contextual Headers
Prepend section context to each chunk:
```python
def add_context_header(chunk: str, section: str, document: str) -> str:
    return f"[{document} > {section}]\n{chunk}"

# Before: "Returns are accepted within 30 days."
# After:  "[Employee Handbook > Return Policy]\nReturns are accepted within 30 days."
```
This helps embeddings capture the full meaning.
3. Handle Tables and Lists
Tables and lists need special treatment:
```python
def chunk_table(table: Table) -> list[Chunk]:
    chunks = []
    # Option A: serialize the table as markdown
    chunks.append(Chunk(
        text=table.to_markdown(),
        type="table",
    ))
    # Option B: index a natural-language summary alongside it
    summary = llm.complete(f"Summarize this table: {table.to_markdown()}")
    chunks.append(Chunk(
        text=summary,
        type="table_summary",
        original_table=table.to_markdown(),
    ))
    return chunks
```
4. Test with Eval Set
Build an eval set to measure retrieval quality:
```python
eval_set = [
    {
        "query": "What is the return policy?",
        "relevant_chunks": ["handbook_chunk_23", "handbook_chunk_24"],
        "irrelevant_chunks": ["handbook_chunk_1", "handbook_chunk_50"],
    },
    # ...more cases drawn from real user queries
]

def evaluate_chunking(chunks, eval_set):
    retriever = build_retriever(chunks)
    metrics = {"precision": [], "recall": []}
    for test in eval_set:
        results = retriever.search(test["query"], k=5)
        retrieved_ids = {r.id for r in results}
        relevant_ids = set(test["relevant_chunks"])
        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)
        metrics["precision"].append(precision)
        metrics["recall"].append(recall)
    return {k: sum(v) / len(v) for k, v in metrics.items()}
```
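Wiring it together (`chunks` here comes from whichever strategy you are testing, and `build_retriever` is assumed to be part of your pipeline):

```python
metrics = evaluate_chunking(chunks, eval_set)
print(f"precision@5: {metrics['precision']:.2f}  recall@5: {metrics['recall']:.2f}")
```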
5. Iterate and Measure
Chunking is not a one-time decision:
- Start with recursive chunking (good default)
- Build eval set from real user queries
- Measure retrieval quality
- Identify failure patterns
- Try alternative strategies
- A/B test in production
Common Mistakes
Mistake 1: One-Size-Fits-All
Using 500-token fixed chunks for everything. Different content needs different strategies.
Mistake 2: Ignoring Overlap
Zero overlap means context can be split across chunks. Always use 10-20% overlap.
Mistake 3: Chunking Before Cleaning
Chunk clean text, not raw HTML/markdown with artifacts.
Mistake 4: No Metadata
Chunks without source/section metadata make debugging impossible.
Mistake 5: Not Testing
Evaluating chunking "by feel" instead of with metrics.
Conclusion
Chunking is where RAG systems are won or lost.
The rules:
- Match strategy to document type
- Preserve structure and context
- Test with real queries
- Iterate based on metrics
Get chunking right, and retrieval quality follows. Get it wrong, and nothing else matters.
The best RAG engineers spend 50% of their time on chunking. The worst spend 0% and wonder why retrieval fails.
What's your chunking strategy?