The Problem
Your RAG pipeline is live. Users are complaining. The answers are wrong, incomplete, or irrelevant. You stare at logs wondering where it's breaking.
Welcome to RAG debugging.
RAG systems fail in predictable ways. Debug them systematically, and you'll find the problem in minutes instead of hours.
The Debugging Framework
RAG has four stages. Failures cascade—an early failure causes all downstream stages to fail.
Query → Retrieval → Context Assembly → Generation → Response
Debug from left to right. Don't skip stages.
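To make the left-to-right rule concrete, here is a minimal sketch of a harness that walks the stages in order and reports the first one that fails. It assumes the pipeline helpers used throughout this post (retrieve, assemble_context, build_prompt, llm.generate) plus a snippet of the known-correct answer; adapt the names to your own code.

def first_failing_stage(query: str, expected_snippet: str) -> str:
    # Stage 2: did retrieval return anything at all?
    results = retrieve(query, k=10)
    if not results:
        return "FAIL at retrieval: no documents returned"

    # Stage 3: did the answer survive into the assembled context?
    context = assemble_context(results)
    if expected_snippet.lower() not in context.lower():
        return "FAIL at context assembly: answer not in the assembled context"

    # Stage 4: did the model actually use it?
    answer = llm.generate(build_prompt(query, context))
    if expected_snippet.lower() not in answer.lower():
        return "FAIL at generation: answer was in context but missing from the response"

    return "PASS: all stages look healthy for this query"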
Stage 1: Query Issues
Symptom: Wrong documents retrieved for reasonable questions
Check: Query preprocessing
def debug_query(raw_query: str):
    print(f"1. Raw query: '{raw_query}'")
    # Check: Is the query being modified unexpectedly?
    processed = preprocess_query(raw_query)
    print(f"2. Processed query: '{processed}'")
    # Check: Is the embedding what you expect?
    embedding = embed(processed)
    print(f"3. Embedding shape: {embedding.shape}")
    print(f"4. Embedding sample: {embedding[:5]}")
    # Check: What's the nearest neighbor to this query?
    similar = find_similar_queries_in_history(embedding)
    print(f"5. Similar past queries: {similar}")
Common issues:
- Spelling correction mangled the query
- Query expansion added wrong terms
- Conversation context injection confused the intent
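To pin down which of these happened, diff the raw query against the processed one. The standard library's difflib is enough for a quick token-level check:

import difflib

def show_query_diff(raw_query: str, processed_query: str):
    # ndiff marks removed tokens with "- " and added tokens with "+ "
    diff = difflib.ndiff(raw_query.split(), processed_query.split())
    changes = [d for d in diff if d.startswith(("+ ", "- "))]
    if changes:
        print("Preprocessing changed the query:")
        for change in changes:
            print(f"  {change}")
    else:
        print("Query unchanged by preprocessing.")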
Symptom: No documents retrieved
Check: Is the question answerable from your corpus?
def check_corpus_coverage(query: str):
    # Do a keyword search (ignores embedding quality)
    keyword_results = keyword_search(query)
    print(f"Keyword results: {len(keyword_results)}")
    if len(keyword_results) == 0:
        print("❌ Query topic not in corpus!")
    else:
        print(f"Documents found: {[r.source for r in keyword_results]}")
The corpus might simply not contain the answer.
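If your stack doesn't expose a keyword_search function, a throwaway BM25 pass over the raw chunks is a quick approximation. Here is a sketch using the rank_bm25 package (pip install rank-bm25); the corpus list and whitespace tokenization are placeholders for your own data:

from rank_bm25 import BM25Okapi

def keyword_coverage(query: str, corpus: list[str], n: int = 5):
    # Pure lexical matching: no embeddings involved, so it isolates coverage issues
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)

    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    print(f"Max BM25 score: {max(scores):.2f}")
    if max(scores) <= 0:
        print("❌ No keyword overlap at all: topic likely not in corpus")

    for doc in bm25.get_top_n(tokenized_query, corpus, n=n):
        print(f"  {doc[:80]}")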
Stage 2: Retrieval Issues
Symptom: Relevant documents exist but aren't retrieved
Check: Retrieval scores
def debug_retrieval(query: str, k: int = 10):
    results = retrieve(query, k=k)
    print("Top retrieval results:")
    for i, r in enumerate(results):
        print(f"{i+1}. Score: {r.score:.3f} | Source: {r.source} | Preview: {r.text[:100]}")
    # Check score distribution
    scores = [r.score for r in results]
    print(f"\nScore stats: max={max(scores):.3f}, min={min(scores):.3f}, mean={sum(scores)/len(scores):.3f}")
    if max(scores) < 0.7:
        print("⚠️ Low confidence in all results!")
What the scores tell you (a rough heuristic sketch follows this list):
- All scores low → Query doesn't match corpus well
- Scores close together → Embeddings aren't discriminating
- Big gap after top result → Retrieval is confident
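These patterns are easy to codify as a triage heuristic. The thresholds below are illustrative, not universal; calibrate them against your own embedding model's score scale:

def diagnose_scores(scores: list[float]) -> str:
    # Expects similarity scores for one query, e.g. [r.score for r in results]
    scores = sorted(scores, reverse=True)
    if scores[0] < 0.5:
        return "All scores low: query doesn't match the corpus well"
    if len(scores) > 1 and scores[0] - scores[-1] < 0.05:
        return "Scores bunched together: embeddings aren't discriminating"
    if len(scores) > 1 and scores[0] - scores[1] > 0.15:
        return "Clear gap after the top result: retrieval is confident"
    return "Inconclusive: inspect the top chunks manually"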
Symptom: Irrelevant documents ranked high
Check: Embedding quality
def debug_embedding_similarity(query: str, expected_doc: str, retrieved_doc: str):
    q_emb = embed(query)
    expected_emb = embed(expected_doc)
    retrieved_emb = embed(retrieved_doc)
    expected_sim = cosine_similarity(q_emb, expected_emb)
    retrieved_sim = cosine_similarity(q_emb, retrieved_emb)
    print(f"Expected doc similarity: {expected_sim:.3f}")
    print(f"Retrieved doc similarity: {retrieved_sim:.3f}")
    if retrieved_sim > expected_sim:
        print("⚠️ Embedding model prefers wrong document!")
        print("Consider: different embedding model, hybrid search, fine-tuning")
Symptom: Table/code content never retrieved
Check: How special content was indexed
def debug_table_chunk(table_chunk: str):
    print(f"Chunk content:\n{table_chunk}\n")
    # Tables often get mangled
    if "|" not in table_chunk and ":" not in table_chunk:
        print("⚠️ Table structure lost in extraction!")
    # Try natural language version
    nl_version = "The price for small is $10, medium is $15, large is $20"
    sim = cosine_similarity(embed(table_chunk), embed(nl_version))
    print(f"Similarity to natural language: {sim:.3f}")
Tables and code often need special handling.
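One mitigation worth prototyping, sketched here for simple markdown tables, is to verbalize each row into a sentence before indexing so the embedding has natural language to latch onto. Real-world tables usually need a more robust parser than this:

def table_rows_to_sentences(markdown_table: str) -> list[str]:
    lines = [l.strip() for l in markdown_table.strip().splitlines() if l.strip()]
    header = [h.strip() for h in lines[0].strip("|").split("|")]
    sentences = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        pairs = [f"{header[i]} is {cells[i]}" for i in range(min(len(header), len(cells)))]
        sentences.append(", ".join(pairs) + ".")
    return sentences

For example, a row like | small | $10 | under the headers | Size | Price | becomes "Size is small, Price is $10.", which embeds much more like the queries users actually type.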
Stage 3: Context Assembly Issues
Symptom: Right documents retrieved but wrong answer generated
Check: What context the LLM actually sees
def debug_context(query: str):
    results = retrieve(query, k=10)
    context = assemble_context(results)
    print(f"Context token count: {count_tokens(context)}")
    print(f"Number of chunks: {len(results)}")
    print(f"\n=== FULL CONTEXT ===\n{context}\n=== END ===\n")
    # Does the context contain the answer?
    expected_answer = "your known correct answer"
    if expected_answer.lower() in context.lower():
        print("✓ Answer is in context (LLM generation issue)")
    else:
        print("❌ Answer NOT in context (retrieval issue)")
This is the most important debug step. If the answer isn't in the context, no prompt magic will fix it.
Symptom: Context too long, irrelevant chunks included
Check: Reranking and filtering
def debug_reranking(query: str):
    initial_results = retrieve(query, k=20)
    reranked = rerank(query, initial_results)[:5]
    print("Initial top 5:")
    for r in initial_results[:5]:
        print(f"  {r.score:.3f}: {r.text[:80]}")
    print("\nReranked top 5:")
    for r in reranked:
        print(f"  {r.score:.3f}: {r.text[:80]}")
    # Did reranking improve? Replace the snippet with text from the known-correct chunk.
    known_snippet = "answer"
    correct_in_initial = any(known_snippet in r.text for r in initial_results[:5])
    correct_in_reranked = any(known_snippet in r.text for r in reranked)
    print(f"\nCorrect in initial top 5: {correct_in_initial}")
    print(f"Correct in reranked top 5: {correct_in_reranked}")
Stage 4: Generation Issues
Symptom: Answer ignores relevant context
Check: System prompt and context formatting
def debug_generation(query: str, context: str):
    # Check the actual prompt being sent
    prompt = build_prompt(query, context)
    print(f"=== FULL PROMPT ===\n{prompt}\n=== END ===")
    # Check token counts
    print(f"System prompt tokens: {count_tokens(system_prompt)}")
    print(f"Context tokens: {count_tokens(context)}")
    print(f"Query tokens: {count_tokens(query)}")
    print(f"Total input tokens: {count_tokens(prompt)}")
    print(f"Max output tokens: {max_output_tokens}")
Common issues:
- Context buried after a very long system prompt
- Context formatted as raw JSON, which the LLM often ignores
- Answer cut off by token limit
Symptom: LLM hallucinating beyond context
Check: Prompt grounding
# Bad prompt (encourages hallucination)
bad_prompt = f"""
Answer this question: {query}
Here's some context that might help: {context}
"""
# Good prompt (enforces grounding)
good_prompt = f"""
Answer the question using ONLY the following context.
If the context doesn't contain the answer, say "I don't have information on that."
Context:
{context}
Question: {query}
Answer:
"""
Symptom: Answers are correct but verbose/wrong format
Check: Output instructions
def debug_output_format(query: str, context: str):
    # Compare with explicit format instructions
    responses = {}
    for format_instruction in [
        "Answer briefly in 1-2 sentences.",
        "Provide a detailed answer with bullet points.",
        "Answer with just the specific value requested.",
    ]:
        prompt = f"{format_instruction}\n\nContext: {context}\n\nQuestion: {query}"
        responses[format_instruction] = llm.generate(prompt)
    for instr, resp in responses.items():
        print(f"\n{instr}\n→ {resp}\n")
The Quick Debugging Checklist
Run through this for any failing query:
□ Query: Is the processed query what you expect?
□ Corpus: Does the corpus contain the answer?
□ Retrieval: Are the right chunks retrieved?
□ Scores: Are retrieval scores confident?
□ Context: Is the answer in the assembled context?
□ Tokens: Is context within token limits?
□ Prompt: Does the prompt enforce grounding?
□ Output: Is the format clearly specified?
Common Fixes
| Issue | Symptom | Fix |
|---|---|---|
| Wrong embedding model | Similar docs score differently | Try different model |
| Chunking too large | Retrieves irrelevant sections | Smaller chunks with overlap |
| Chunking too small | Loses context | Larger chunks or parent retrieval |
| No reranking | Good docs buried in results | Add a reranker (Cohere, BGE) |
| Poor grounding | Hallucination | Stronger prompt constraints |
| Missing content | Always says "I don't know" | Check indexing, add data |
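The "smaller chunks with overlap" fix is cheap to prototype. Here is a character-based sketch; token-based splitting (or a library text splitter) is usually the better choice in production:

def chunk_with_overlap(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    # Fixed-size windows with overlap, so content that straddles a boundary
    # still appears intact in at least one chunk.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks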
Conclusion
RAG debugging is systematic, not random.
- Start at the query layer
- Verify each stage independently
- Check what the LLM actually sees
- Fix the earliest failure first
The answer is always in the logs—you just need to know where to look.
What's your worst RAG debugging war story?