The Demo vs. Production Gap
Every RAG demo looks the same: upload a PDF, ask questions, get answers with citations. It takes 20 lines of code and 20 minutes to build. "Look how easy this is!"
Then you try to put it in production.
The demo worked because it used one clean document, asked predictable questions, and had no users depending on correct answers. Production has thousands of messy documents, unpredictable queries, and real consequences for wrong answers.
I've built production RAG systems for three different organizations. Each time, I underestimated the complexity by at least 10x. This essay is everything I wish I'd known before starting.
The Ten Ways RAG Breaks
1. Chunking Failures
What you expect: Split documents into pieces, embed each piece.
What actually happens:
- Tables become gibberish when split mid-row
- Code blocks lose their structure
- Headers get separated from their content
- Lists become orphaned items without context
- PDFs extract with bizarre whitespace and layout artifacts
Real example: A legal contract contained a table of payment terms. Fixed-size chunking split it mid-table. When users asked about payment due dates, the retriever returned half the table—missing the due date column entirely.
The fix: Document-type-specific chunking. Tables get special handling. Code blocks stay intact. Headers stay attached to their content. This takes 10x longer than the naive approach.
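To make "document-type-specific" concrete, here is a minimal sketch of a markdown-aware chunker: code fences and tables stay intact, and each chunk carries the heading it appeared under. It assumes roughly markdown-shaped input; the size limit and heuristics are illustrative, not a drop-in solution.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker, built indirectly so this example stays renderable

def chunk_markdown(text: str, max_chars: int = 1500) -> list[str]:
    """Structure-aware chunking sketch: code fences and tables stay whole,
    and every chunk is prefixed with the heading it appeared under."""
    # Pull fenced code blocks out first so blank lines inside them can't split them.
    segments = re.split(rf"({FENCE}.*?{FENCE})", text, flags=re.DOTALL)
    chunks: list[str] = []
    heading, current = "", ""

    def flush() -> None:
        nonlocal current
        if current.strip():
            chunks.append(f"{heading}\n{current}".strip())
        current = ""

    for seg in segments:
        if seg.startswith(FENCE):                  # atomic: keep the whole code block together
            flush()
            chunks.append(f"{heading}\n{seg}".strip())
            continue
        for block in re.split(r"\n\s*\n", seg):    # paragraphs, lists, tables
            stripped = block.strip()
            if not stripped:
                continue
            if stripped.startswith("#"):           # remember the active heading
                heading = stripped.splitlines()[0]
            elif stripped.startswith("|"):         # atomic: keep the whole table together
                flush()
                chunks.append(f"{heading}\n{stripped}")
            else:
                if len(current) + len(block) > max_chars:
                    flush()
                current += block + "\n\n"
    flush()
    return chunks
```

Real document formats (PDF extractions, HTML, DOCX) each need their own version of this, which is where the 10x comes from.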
2. Embedding Limitations
What you expect: Embeddings capture semantic meaning. Similar questions retrieve similar content.
What actually happens:
- Embeddings are great at semantic similarity but bad at exact match
- "What is policy 3.2.1?" retrieves any mention of policies
- Negation is hard: "not covered" vs "covered" have similar embeddings
- Domain jargon maps poorly to generic embedding models
- Different questions about the same topic retrieve inconsistently
Real example: A healthcare RAG system confused "covered" and "not covered" procedures because the embeddings were nearly identical. Users got wrong answers about their insurance with real financial consequences.
The fix: Hybrid search (keywords + semantic), domain fine-tuned embeddings, and explicit handling of critical terms.
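One lightweight way to combine the keyword and semantic signals is reciprocal rank fusion over the two result lists. The sketch below assumes you already have a BM25-style ranking and a vector ranking; the document IDs are made up and `k=60` is just the conventional default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(keyword_hits: list[str],
                           vector_hits: list[str],
                           k: int = 60) -> list[str]:
    """Merge a keyword (BM25) ranking and a vector ranking with RRF.
    Exact-match queries win on the keyword side, paraphrases win on the
    vector side, and fusion keeps both honest."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Usage sketch: "What is policy 3.2.1?" ranks the exact document first on the
# keyword side even when the embedding side only finds generic policy text.
keyword_hits = ["policy-3.2.1", "policy-index", "policy-3.1.0"]
vector_hits = ["policy-overview", "policy-3.2.1", "benefits-faq"]
print(reciprocal_rank_fusion(keyword_hits, vector_hits))
```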
3. The Wrong Document Problem
What you expect: If the document is in the corpus, the system will find it.
What actually happens:
Retrieval fails silently. The system doesn't return "I couldn't find anything"—it returns the closest match, which might be completely wrong.
Real example: A user asked about the "WFH policy" (work from home). The corpus contained a "remote work policy" but not a document titled "WFH policy." The retriever returned documents about office WiFi setup because "WFH" → "Wi-Fi H..." was the closest string match.
The fix: Concept mapping (synonyms, abbreviations), retrieval confidence thresholds, and "I don't know" responses when confidence is low.
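A minimal sketch of the concept-mapping and abstention pieces, assuming your retriever returns (chunk, score) pairs sorted best-first. The alias table, threshold, and fallback wording are placeholders you would tune against real query logs.

```python
ALIASES = {                     # illustrative concept map; grow it from query logs
    "wfh": "work from home remote work",
    "pto": "paid time off vacation",
}
MIN_SCORE = 0.35                # tune against an eval set, not by feel

def expand_query(query: str) -> str:
    """Append known expansions so 'WFH policy' also matches 'remote work policy'."""
    extras = [ALIASES[w] for w in query.lower().split() if w in ALIASES]
    return (query + " " + " ".join(extras)) if extras else query

def answer_or_abstain(hits: list[tuple[str, float]]) -> str:
    """hits: (chunk_text, retrieval_score) pairs, best first."""
    if not hits or hits[0][1] < MIN_SCORE:
        return ("I couldn't find anything relevant to that. "
                "Can you rephrase, or would you like to ask a person?")
    return hits[0][0]
```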
4. Stale Data
What you expect: Upload once, query forever.
What actually happens:
- Policies change monthly
- Product specs update
- Org charts reorganize
- Old information contaminates new answers
Real example: An HR assistant kept telling employees about the old vacation policy for three months after it changed. Nobody remembered to re-index the updated handbook.
The fix: Incremental indexing pipelines, document versioning, and freshness metadata in prompts.
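A sketch of the incremental piece, assuming documents live on disk and you have some `reindex(path)` function (a stand-in for whatever chunks, embeds, and upserts a file). The state file and hashing scheme are illustrative; the `indexed_at` timestamp is what you would surface as freshness metadata in prompts.

```python
import hashlib
import json
import time
from pathlib import Path

STATE_FILE = Path("index_state.json")   # doc path -> {"hash": ..., "indexed_at": ...}

def sync_index(doc_paths: list[Path], reindex) -> None:
    """Re-embed only documents whose content hash changed since the last run."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    for path in doc_paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if state.get(str(path), {}).get("hash") == digest:
            continue                      # unchanged: skip the embedding cost
        reindex(path)
        state[str(path)] = {"hash": digest, "indexed_at": time.time()}
    STATE_FILE.write_text(json.dumps(state, indent=2))
```

Run it on a schedule and the handbook update lands in the index days after it changes, not months.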
5. Permission Nightmares
What you expect: Index everything, users see everything.
What actually happens:
Everyone has different access. Finance docs are sensitive. HR information is per-person. The CEO can see things the intern cannot.
Real example: An enterprise RAG system indexed all shared drives, including a folder with upcoming layoff plans. A junior employee asked about "organizational changes" and received chunks from the confidential layoff document.
The fix: Permission-aware retrieval (see my essay on the Permission Passthrough Pattern). Index permissions with documents. Filter at query time, not after.
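The core idea in a sketch: carry the source system's access groups on every chunk and drop anything the user can't read before it gets anywhere near the prompt. The field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float
    allowed_groups: frozenset[str]   # captured from the source system at index time

def permission_filtered(hits: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Filter at query time, before anything reaches the LLM context.
    A chunk the user can't read must never even be a candidate for the prompt."""
    return [c for c in hits if c.allowed_groups & user_groups]
```

In practice you push this filter into the vector store query itself (most stores support metadata filters), so ranking happens over the permitted subset rather than being trimmed afterward.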
6. Context Window Overflow
What you expect: Retrieve relevant chunks, send to LLM, get answer.
What actually happens:
- Retriever returns 10 "relevant" chunks
- Each chunk is 500 tokens
- Plus conversation history
- Plus system prompt
- Plus user query
- Equals: context overflow or degraded response quality
Real example: A debugging assistant retrieved 15 stack traces related to a query. Only one was actually relevant. The LLM hallucinated a solution mixing concepts from unrelated stack traces.
The fix: Aggressive reranking. Limit to top 3-5 chunks. Summarize when necessary. Monitor context utilization.
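A sketch of the context-packing step, using a rough 4-characters-per-token estimate (swap in your actual tokenizer); the budget and chunk cap are illustrative defaults.

```python
def pack_context(chunks: list[tuple[str, float]],
                 budget_tokens: int = 3000,
                 max_chunks: int = 5) -> str:
    """Keep only the best few chunks that fit a hard token budget.
    chunks: (text, rerank_score) pairs; token counts use a crude heuristic."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)[:max_chunks]
    picked, used = [], 0
    for text, _score in ranked:
        cost = len(text) // 4            # ~4 chars per token; replace with real counts
        if used + cost > budget_tokens:
            break
        picked.append(text)
        used += cost
    return "\n\n---\n\n".join(picked)
```

Fifteen stack traces stop being an option: the budget forces the reranker to commit to the few that actually matter.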
7. Answer Hallucination
What you expect: The LLM uses retrieved context to generate accurate answers.
What actually happens:
- LLM confidently states things not in the context
- It extrapolates beyond what documents say
- It mixes information from different documents incorrectly
- Citations don't always match the actual source of information
Real example: User asked "What's the maximum reimbursement for home office equipment?" Documents said "$500 for furniture, $300 for electronics." LLM answered "$800 total for home office equipment"—correctly adding the numbers but stating a policy that doesn't exist.
The fix: Constrained generation, citation verification, and "I don't know" training.
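A cheap post-generation check that would have caught the "$800" answer: flag any number in the answer that appears in none of the retrieved chunks. This is a sketch of one verification signal, not full citation verification.

```python
import re

NUMBER = r"\$?\d[\d,]*(?:\.\d+)?"

def unsupported_numbers(answer: str, chunks: list[str]) -> list[str]:
    """Return numbers stated in the answer that no retrieved chunk contains.
    Catches the model 'helpfully' adding $500 + $300 into an $800 policy
    that doesn't exist anywhere in the documents."""
    source_numbers = set()
    for chunk in chunks:
        source_numbers.update(re.findall(NUMBER, chunk))
    return [n for n in re.findall(NUMBER, answer) if n not in source_numbers]

# If this returns anything, regenerate with a stricter prompt or fall back
# to "I don't know" rather than shipping an invented figure.
```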
8. Query Understanding Failures
What you expect: Users ask clear questions. System understands them.
What actually happens:
- Users ask vague questions: "what about the thing from last time?"
- They use pronouns: "what's the deadline for it?"
- They ask compound questions: "how do I submit expenses and what's the approval process and where do I see status?"
- They misspell: "expance policy"
Real example: User asked "Where's the form?" The system had 200 forms. It guessed wrong.
The fix: Query expansion, spelling correction, follow-up questions, and conversation context injection.
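A sketch of the pre-retrieval side, assuming a small domain vocabulary and an LLM call (not shown) for the rewrite step. The vocabulary, cutoff, and prompt wording are placeholders.

```python
import difflib

DOMAIN_VOCAB = ["expense", "policy", "reimbursement", "approval", "deadline", "form"]

def correct_spelling(query: str) -> str:
    """Snap near-misses like 'expance' onto known domain terms; 0.7 is a tunable cutoff."""
    fixed = []
    for word in query.split():
        match = difflib.get_close_matches(word.lower(), DOMAIN_VOCAB, n=1, cutoff=0.7)
        fixed.append(match[0] if match else word)
    return " ".join(fixed)

def rewrite_prompt(history: list[str], query: str) -> str:
    """Build a prompt asking an LLM to resolve pronouns and split compound
    questions into standalone search queries before retrieval runs."""
    return (
        "Rewrite the final user question as one or more standalone search queries.\n"
        "Resolve pronouns and references using the conversation.\n\n"
        "Conversation:\n" + "\n".join(history) +
        f"\n\nFinal question: {correct_spelling(query)}\nStandalone queries:"
    )
```

"Where's the form?" becomes answerable only because the conversation history travels with it.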
9. Evaluation Blindness
What you expect: It works because it seems to work.
What actually happens:
- No systematic evaluation
- Success stories get remembered; failures get forgotten
- Edge cases aren't tested
- Degradation is invisible
Real example: A documentation RAG system seemed great for 6 months. A new employee used it and got wrong answers 30% of the time—but they assumed they were asking wrong questions. The system had been broken since a library upgrade two months prior.
The fix: Automated eval harnesses, user feedback tracking, regression testing, and shadow mode deployments.
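The simplest harness that would have caught that regression is a recall@k check over a hand-built eval set, run in CI on every dependency bump. `retrieve` here stands in for whatever your retrieval function is; the threshold is illustrative.

```python
def retrieval_recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """eval_set items look like {"question": ..., "expected_doc": ...};
    retrieve(question, k) returns a list of doc ids. A silent retrieval
    break then fails a build instead of misleading users for two months."""
    hits = sum(
        1 for case in eval_set
        if case["expected_doc"] in retrieve(case["question"], k)
    )
    return hits / len(eval_set)

# In CI: assert retrieval_recall_at_k(EVAL_SET, retrieve) >= 0.85
```

Fifty question/document pairs written by a domain expert is enough to start; coverage grows from real failures.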
10. The Cost Surprise
What you expect: RAG is cheaper than fine-tuning.
What actually happens:
- Embedding API costs add up with large corpora
- LLM costs multiply with long contexts
- Reranking adds more model calls
- Re-indexing after updates costs again
- Infrastructure for real-time retrieval isn't free
Real example: A startup budgeted $500/month for their RAG system. Embedding their 50,000 documents cost $2,000 upfront. Monthly retrieval and LLM costs hit $3,000. Re-indexing after updates added $500/month.
The fix: Caching strategies, tiered retrieval (cheap retrieval → expensive rerank → LLM), and token-efficient context packing.
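A sketch of the cheapest of those levers: an embedding cache keyed by content hash, so re-indexing after small edits only pays for the chunks that actually changed. `embed(text)` stands in for the paid API call; the JSONL file is illustrative.

```python
import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("embedding_cache.jsonl")
_cache: dict[str, list[float]] = {}
if CACHE_FILE.exists():
    _cache = {
        rec["key"]: rec["vec"]
        for rec in (json.loads(line) for line in CACHE_FILE.read_text().splitlines() if line)
    }

def cached_embed(text: str, embed) -> list[float]:
    """Only call the embedding API for text it hasn't seen before."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)
        with CACHE_FILE.open("a") as f:
            f.write(json.dumps({"key": key, "vec": _cache[key]}) + "\n")
    return _cache[key]
```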
What Actually Works
After three production deployments, here's what I've learned:
Start with a Narrow Scope
Don't index "all company documents." Start with one document type, one use case, one user group. Get that working well. Then expand.
Invest in Evaluation First
Before building features, build the system that tells you if features work. Eval sets, retrieval metrics, user feedback loops. Without measurement, you're optimizing for vibes.
Build Permission Passthrough from Day One
Adding security later is 10x harder. The AI should never see documents the user can't see.
Expect to Iterate on Chunking
Your first chunking strategy will be wrong. Plan for experimentation. Build tooling to re-chunk and re-embed easily.
Monitor Everything
Retrieval scores. Context lengths. Response times. User satisfaction. Without monitoring, degradation is invisible.
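At the request level, this can be as simple as one structured log line per query; the field names below are illustrative, and dashboards and alerts hang off these fields.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag.metrics")

def log_request(query: str, top_score: float, context_tokens: int, started: float) -> None:
    """Emit one structured record per request; a slow drift in top_score or
    context_tokens is only visible if someone is recording it."""
    log.info(json.dumps({
        "query_len": len(query),
        "top_retrieval_score": round(top_score, 3),
        "context_tokens": context_tokens,
        "latency_ms": round((time.time() - started) * 1000),
    }))
```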
Have a Human Escape Hatch
Every RAG system should offer a path to human help. "I'm not sure—would you like to contact support?" is better than a confidently wrong answer.
The Hidden Truth
RAG is harder than it looks because it's not one problem—it's twelve interconnected problems:
- Document processing
- Chunking strategy
- Embedding quality
- Retrieval accuracy
- Reranking
- Context assembly
- LLM prompting
- Answer verification
- Permission management
- Data freshness
- Evaluation
- Cost optimization
Each problem is solvable. But solving all twelve, together, in production, is the challenge.
The companies that succeed with RAG are the ones that treat it as a system engineering problem, not a demo you ship.
Conclusion
RAG looks easy because demos are easy. Production is hard because:
- Data is messy
- Users are unpredictable
- Wrong answers have consequences
- Scale reveals edge cases
- Nobody reads documentation
If you're starting a RAG project, budget 5x the time and 3x the cost you initially estimated. Not because the technology is bad, but because production systems require production engineering.
The teams that succeed are the ones who expect the failures and build systems to detect, prevent, and recover from them.
RAG in a demo: 20 lines of code. RAG in production: 20,000 lines of code and 2,000 test cases.
What's your RAG war story?