Essay
November 15, 2024

Why RAG is Harder Than It Looks

Retrieval-Augmented Generation seems simple in demos but breaks in a dozen ways in production. Here's why most RAG projects fail and what to do about it.

RAG · AI Engineering · Production Systems

The Demo vs. Production Gap

Every RAG demo looks the same: upload a PDF, ask questions, get answers with citations. It takes 20 lines of code and 20 minutes to build. "Look how easy this is!"

Then you try to put it in production.

The demo worked because it used one clean document, asked predictable questions, and had no users depending on correct answers. Production has thousands of messy documents, unpredictable queries, and real consequences for wrong answers.

I've built production RAG systems for three different organizations. Each time, I underestimated the complexity by at least 10x. This essay is everything I wish I'd known before starting.

The Ten Ways RAG Breaks

1. Chunking Failures

What you expect: Split documents into pieces, embed each piece.

What actually happens:

  • Tables become gibberish when split mid-row
  • Code blocks lose their structure
  • Headers get separated from their content
  • Lists become orphaned items without context
  • PDFs extract with bizarre whitespace and layout artifacts

Real example: A legal contract contained a table of payment terms. The fixed-size chunking split it mid-table. When users asked about payment due dates, the retriever returned half the table—missing the due date column entirely.

The fix: Document-type-specific chunking. Tables get special handling. Code blocks stay intact. Headers stay attached to their content. This takes 10x longer than the naive approach.
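
Here's roughly what I mean, sketched in Python. The Block type, the character budget, and the heading-prefix trick are illustrative choices, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass
class Block:
    kind: str  # "heading", "paragraph", "table", or "code"
    text: str

def chunk_blocks(blocks: list[Block], max_chars: int = 1500) -> list[str]:
    """Type-aware chunking: tables and code blocks stay intact, prose is
    packed up to max_chars, and every chunk carries its nearest heading."""
    chunks: list[str] = []
    buffer, heading = "", ""

    def flush():
        nonlocal buffer
        if buffer.strip():
            chunks.append(f"{heading}\n{buffer}".strip())
        buffer = ""

    for block in blocks:
        if block.kind == "heading":
            flush()
            heading = block.text
        elif block.kind in ("table", "code"):
            # Never split structured content mid-row or mid-function.
            flush()
            chunks.append(f"{heading}\n{block.text}".strip())
        else:
            if len(buffer) + len(block.text) > max_chars:
                flush()
            buffer += block.text + "\n"
    flush()
    return chunks
```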

2. Embedding Limitations

What you expect: Embeddings capture semantic meaning. Similar questions retrieve similar content.

What actually happens:

  • Embeddings are great at semantic similarity but bad at exact match
  • "What is policy 3.2.1?" retrieves any mention of policies
  • Negation is hard: "not covered" vs "covered" have similar embeddings
  • Domain jargon maps poorly to generic embedding models
  • Different questions about the same topic retrieve inconsistently

Real example: A healthcare RAG system confused "covered" and "not covered" procedures because the embeddings were nearly identical. Users got wrong answers about their insurance with real financial consequences.

The fix: Hybrid search (keyword + semantic), embeddings fine-tuned on your domain, and explicit handling of critical terms.
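
One way to get exact-match behavior back is to run a keyword search and a vector search separately and fuse the rankings. Here's a small sketch using reciprocal rank fusion; the document IDs and the constant k=60 are purely illustrative:

```python
def reciprocal_rank_fusion(keyword_ranked: list[str],
                           vector_ranked: list[str],
                           k: int = 60) -> list[str]:
    """Merge two ranked lists of document IDs. A document that is an exact
    keyword hit but only a mediocre semantic match still surfaces near the top."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "policy-3.2.1" is the top keyword hit but a weak semantic match; it still wins.
print(reciprocal_rank_fusion(
    ["policy-3.2.1", "policy-index", "policy-3.2"],
    ["policy-overview", "policy-3.2", "policy-3.2.1"],
))
```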

3. The Wrong Document Problem

What you expect: If the document is in the corpus, the system will find it.

What actually happens:

Retrieval fails silently. The system doesn't return "I couldn't find anything"—it returns the closest match, which might be completely wrong.

Real example: A user asked about the "WFH policy" (work from home). The corpus contained a "remote work policy" but not a document titled "WFH policy." The retriever returned documents about office WiFi setup because "WFH" → "Wi-Fi H..." was the closest string match.

The fix: Concept mapping (synonyms, abbreviations), retrieval confidence thresholds, and "I don't know" responses when confidence is low.
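
Both pieces are small. A sketch, with a hypothetical abbreviation map and a made-up score threshold:

```python
# Hypothetical mapping; in practice it comes from your domain glossary.
ABBREVIATIONS = {
    "wfh": "work from home remote work",
    "pto": "paid time off vacation",
}

def expand_query(query: str) -> str:
    """Expand known abbreviations so 'WFH policy' can reach the remote work policy."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in query.split())

def answer_or_abstain(hits: list[dict], threshold: float = 0.55):
    """Refuse to answer when the best retrieval score is weak, instead of
    silently passing the closest (possibly wrong) match to the LLM."""
    if not hits or hits[0]["score"] < threshold:
        return None  # caller responds "I couldn't find that" and offers escalation
    return hits[0]
```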

4. Stale Data

What you expect: Upload once, query forever.

What actually happens:

  • Policies change monthly
  • Product specs update
  • Org charts reorganize
  • Old information contaminates new answers

Real example: An HR assistant kept telling employees about the old vacation policy for three months after it changed. Nobody remembered to re-index the updated handbook.

The fix: Incremental indexing pipelines, document versioning, and freshness metadata in prompts.
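
The incremental part can start as simply as hashing every document and re-embedding only what changed since the last run. The state-file path and the *.md glob are assumptions for the sketch:

```python
import hashlib
import json
import pathlib

STATE_FILE = pathlib.Path("index_state.json")  # hypothetical location

def content_hash(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def documents_to_reindex(doc_dir: str) -> list[str]:
    """Compare current document hashes with those recorded at the last
    indexing run; only new or changed files need re-chunking and re-embedding."""
    previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    current = {str(p): content_hash(p) for p in pathlib.Path(doc_dir).rglob("*.md")}
    changed = [path for path, digest in current.items() if previous.get(path) != digest]
    STATE_FILE.write_text(json.dumps(current, indent=2))
    return changed
```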

5. Permission Nightmares

What you expect: Index everything, users see everything.

What actually happens:

Everyone has different access. Finance docs are sensitive. HR information is per-person. The CEO can see things the intern cannot.

Real example: An enterprise RAG system indexed all shared drives, including a folder with upcoming layoff plans. A junior employee asked about "organizational changes" and received chunks from the confidential layoff document.

The fix: Permission-aware retrieval (see my essay on the Permission Passthrough Pattern). Index permissions with the documents. Filter at query time, not after.
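
The core of it fits in a few lines: the ACL check runs before ranking, so a forbidden document never reaches the context window. The index layout here (dicts with vector, acl_groups, and text) is just an illustration:

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve_for_user(query_vector: list[float], index: list[dict],
                      user_groups: set[str], top_k: int = 5) -> list[dict]:
    """Drop documents the user cannot open before ranking, so confidential
    chunks never reach the LLM context in the first place."""
    visible = [d for d in index if d["acl_groups"] & user_groups]
    ranked = sorted(visible, key=lambda d: cosine(query_vector, d["vector"]), reverse=True)
    return ranked[:top_k]
```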

6. Context Window Overflow

What you expect: Retrieve relevant chunks, send to LLM, get answer.

What actually happens:

  • Retriever returns 10 "relevant" chunks
  • Each chunk is 500 tokens
  • Plus conversation history
  • Plus system prompt
  • Plus user query
  • Equals: context overflow or degraded response quality

Real example: A debugging assistant retrieved 15 stack traces related to a query. Only one was actually relevant. The LLM hallucinated a solution mixing concepts from unrelated stack traces.

The fix: Aggressive reranking. Limit to top 3-5 chunks. Summarize when necessary. Monitor context utilization.
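
The packing step is mostly bookkeeping. A rough sketch, using whitespace word counts as a stand-in for a real tokenizer:

```python
def pack_context(reranked_chunks: list[str], budget_tokens: int = 3000,
                 per_chunk_cap: int = 5) -> str:
    """Take chunks in reranked order and stop at either the chunk cap or the
    token budget. Swap len(chunk.split()) for your tokenizer's count in practice."""
    packed, used = [], 0
    for chunk in reranked_chunks[:per_chunk_cap]:
        cost = len(chunk.split())  # crude proxy for token count
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return "\n\n---\n\n".join(packed)
```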

7. Answer Hallucination

What you expect: The LLM uses retrieved context to generate accurate answers.

What actually happens:

  • LLM confidently states things not in the context
  • It extrapolates beyond what documents say
  • It mixes information from different documents incorrectly
  • Citations don't always match the actual source of information

Real example: User asked "What's the maximum reimbursement for home office equipment?" Documents said "$500 for furniture, $300 for electronics." LLM answered "$800 total for home office equipment"—correctly adding the numbers but stating a policy that doesn't exist.

The fix: Constrained generation, citation verification, and "I don't know" training.
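
Citation verification can start crude: check whether the content words of each answer sentence actually appear anywhere in the retrieved chunks, and flag the sentences that don't. This lexical check is only a first pass; a more serious version would use an entailment model or an LLM judge:

```python
def unsupported_sentences(answer_sentences: list[str],
                          retrieved_chunks: list[str],
                          min_overlap: float = 0.5) -> list[str]:
    """Flag answer sentences whose longer words barely appear in any retrieved
    chunk; these are candidates for hallucination review, not proof of it."""
    chunk_words = {w.lower() for chunk in retrieved_chunks for w in chunk.split()}
    flagged = []
    for sentence in answer_sentences:
        words = [w.lower() for w in sentence.split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in chunk_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```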

8. Query Understanding Failures

What you expect: Users ask clear questions. System understands them.

What actually happens:

  • Users ask vague questions: "what about the thing from last time?"
  • They use pronouns: "what's the deadline for it?"
  • They ask compound questions: "how do I submit expenses and what's the approval process and where do I see status?"
  • They misspell: "expance policy"

Real example: User asked "Where's the form?" The system had 200 forms. It guessed wrong.

The fix: Query expansion, spelling correction, follow-up questions, and conversation context injection.
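
Spelling correction alone recovers a surprising share of failed retrievals, and the standard library gets you most of the way there. A toy sketch with a vocabulary built from the corpus:

```python
import difflib

# Toy vocabulary; in practice, build it from the terms in your indexed corpus.
VOCAB = {"expense", "policy", "reimbursement", "approval", "deadline", "status"}

def correct_spelling(query: str) -> str:
    """Snap unknown words to the closest corpus term so 'expance policy'
    still retrieves the expense policy."""
    fixed = []
    for word in query.lower().split():
        if word in VOCAB:
            fixed.append(word)
        else:
            match = difflib.get_close_matches(word, sorted(VOCAB), n=1, cutoff=0.7)
            fixed.append(match[0] if match else word)
    return " ".join(fixed)

print(correct_spelling("expance policy"))  # -> "expense policy"
```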

9. Evaluation Blindness

What you expect: It works because it seems to work.

What actually happens:

  • No systematic evaluation
  • Success stories get remembered; failures get forgotten
  • Edge cases aren't tested
  • Degradation is invisible

Real example: A documentation RAG system seemed great for 6 months. A new employee used it and got wrong answers 30% of the time—but they assumed they were asking wrong questions. The system had been broken since a library upgrade two months prior.

The fix: Automated eval harnesses, user feedback tracking, regression testing, and shadow mode deployments.
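
The minimum viable harness is a hand-curated list of questions paired with the document each one should retrieve, plus one metric that runs in CI. A sketch:

```python
from typing import Callable

def recall_at_k(eval_set: list[tuple[str, str]],
                retrieve: Callable[[str], list[str]],
                k: int = 5) -> float:
    """eval_set holds (question, expected_doc_id) pairs curated by hand;
    retrieve returns ranked doc IDs. Run this on every change so a silent
    regression fails a build instead of misleading users for months."""
    hits = sum(expected in retrieve(question)[:k] for question, expected in eval_set)
    return hits / len(eval_set)

# Example gate (the threshold is yours to choose):
# assert recall_at_k(EVAL_SET, my_retriever) >= 0.85
```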

10. The Cost Surprise

What you expect: RAG is cheaper than fine-tuning.

What actually happens:

  • Embedding API costs add up with large corpora
  • LLM costs multiply with long contexts
  • Reranking adds more model calls
  • Re-indexing after updates costs again
  • Infrastructure for real-time retrieval isn't free

Real example: A startup budgeted $500/month for their RAG system. Embedding their 50,000 documents cost $2,000 upfront. Monthly retrieval and LLM costs hit $3,000. Reindexing after updates added $500/month.

The fix: Caching strategies, tiered retrieval (cheap retrieval → expensive rerank → LLM), and token-efficient context packing.
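
The cheapest token is the one you never send or embed twice. A small sketch of the caching piece, with a dummy function standing in for the paid embedding call:

```python
import functools

def embed_text(text: str) -> tuple[float, ...]:
    """Stand-in for a paid embedding API call; swap in your provider here."""
    return tuple(float(ord(c)) for c in text[:8])  # dummy vector for the sketch

@functools.lru_cache(maxsize=50_000)
def embed_cached(text: str) -> tuple[float, ...]:
    # Exact-text cache: unchanged chunks and repeated queries only pay once.
    # Pair this with the content-hash check from section 4 so re-indexing
    # never re-embeds documents that did not change.
    return embed_text(text)
```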

What Actually Works

After three production deployments, here's what I've learned:

Start with a Narrow Scope

Don't index "all company documents." Start with one document type, one use case, one user group. Get that working well. Then expand.

Invest in Evaluation First

Before building features, build the system that tells you if features work. Eval sets, retrieval metrics, user feedback loops. Without measurement, you're optimizing for vibes.

Build Permission Passthrough from Day One

Adding security later is 10x harder. The AI should never see documents the user can't see.

Expect to Iterate on Chunking

Your first chunking strategy will be wrong. Plan for experimentation. Build tooling to re-chunk and re-embed easily.

Monitor Everything

Retrieval scores. Context lengths. Response times. User satisfaction. Without monitoring, degradation is invisible.

Have a Human Escape Hatch

Every RAG system should offer a path to human help. "I'm not sure—would you like to contact support?" is better than a confidently wrong answer.

The Hidden Truth

RAG is harder than it looks because it's not one problem—it's twelve interconnected problems:

  1. Document processing
  2. Chunking strategy
  3. Embedding quality
  4. Retrieval accuracy
  5. Reranking
  6. Context assembly
  7. LLM prompting
  8. Answer verification
  9. Permission management
  10. Data freshness
  11. Evaluation
  12. Cost optimization

Each problem is solvable. But solving all twelve, together, in production, is the challenge.

The companies that succeed with RAG are the ones that treat it as a system engineering problem, not a demo you ship.

Conclusion

RAG looks easy because demos are easy. Production is hard because:

  • Data is messy
  • Users are unpredictable
  • Wrong answers have consequences
  • Scale reveals edge cases
  • Nobody reads documentation

If you're starting a RAG project, budget 5x the time and 3x the cost you initially estimated. Not because the technology is bad, but because production systems require production engineering.

The teams that succeed are the ones who expect the failures and build systems to detect, prevent, and recover from them.


RAG in a demo: 20 lines of code. RAG in production: 20,000 lines of code and 2,000 test cases.

What's your RAG war story?


Written by Abhinav Mahajan

AI Product & Engineering Leader

I write about building AI systems that work in production—from RAG pipelines to agent architectures. These insights come from real experience shipping enterprise AI.
