November 28, 2024 · 6 min read

How to Debug a Failing RAG Pipeline

Your RAG system is returning bad answers. Here's a systematic approach to find out why and fix it.

RAG · Debugging · Tutorial

The Problem

Your RAG pipeline is live. Users are complaining. The answers are wrong, incomplete, or irrelevant. You stare at logs wondering where it's breaking.

Welcome to RAG debugging.

RAG systems fail in predictable ways. Debug them systematically, and you'll find the problem in minutes instead of hours.

The Debugging Framework

A RAG pipeline has four stages between the user's query and the final response. Failures cascade: when an early stage breaks, everything downstream breaks with it.

Query → Retrieval → Context Assembly → Generation → Response

Debug from left to right. Don't skip stages.
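To make that discipline stick, you can wrap the per-stage checks from the rest of this post into one harness that stops at the first broken stage. Here's a minimal sketch, assuming the same helper names used throughout this post (preprocess_query, retrieve, assemble_context, count_tokens, build_prompt, llm.generate):

def debug_pipeline(raw_query: str, expected_answer: str = ""):
    # Stage 1: Query
    query = preprocess_query(raw_query)
    print(f"[query] raw='{raw_query}' → processed='{query}'")

    # Stage 2: Retrieval
    results = retrieve(query, k=10)
    if not results:
        print("[retrieval] ❌ Nothing retrieved; stop and debug here")
        return
    print(f"[retrieval] {len(results)} chunks, top score {results[0].score:.3f}")

    # Stage 3: Context assembly
    context = assemble_context(results)
    print(f"[context] {count_tokens(context)} tokens")
    if expected_answer and expected_answer.lower() not in context.lower():
        print("[context] ❌ Known answer not in context; stop and debug retrieval")
        return

    # Stage 4: Generation
    response = llm.generate(build_prompt(query, context))
    print(f"[generation] {response}")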

Stage 1: Query Issues

Symptom: Wrong documents retrieved for reasonable questions

Check: Query preprocessing

def debug_query(raw_query: str):
    print(f"1. Raw query: '{raw_query}'")
    
    # Check: Is the query being modified unexpectedly?
    processed = preprocess_query(raw_query)
    print(f"2. Processed query: '{processed}'")
    
    # Check: Is the embedding what you expect?
    embedding = embed(processed)
    print(f"3. Embedding shape: {embedding.shape}")
    print(f"4. Embedding sample: {embedding[:5]}")
    
    # Check: What's the nearest neighbor to this query?
    similar = find_similar_queries_in_history(embedding)
    print(f"5. Similar past queries: {similar}")

Common issues:

  • Spelling correction mangled the query
  • Query expansion added wrong terms
  • Conversation context injection confused the intent
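A quick way to confirm that preprocessing is the culprit is to bypass it: retrieve with both the raw query and the processed query and compare what comes back. A minimal sketch, reusing the preprocess_query() and retrieve() helpers assumed above:

def compare_raw_vs_processed(raw_query: str, k: int = 5):
    processed = preprocess_query(raw_query)

    print(f"Raw query:       '{raw_query}'")
    for r in retrieve(raw_query, k=k):
        print(f"  {r.score:.3f}  {r.source}")

    print(f"Processed query: '{processed}'")
    for r in retrieve(processed, k=k):
        print(f"  {r.score:.3f}  {r.source}")

    # If the raw query finds the right documents and the processed one doesn't,
    # the bug is in preprocessing, not in the index.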

Symptom: No documents retrieved

Check: Is the question answerable from your corpus?

def check_corpus_coverage(query: str):
    # Do a keyword search (ignores embedding quality)
    keyword_results = keyword_search(query)
    print(f"Keyword results: {len(keyword_results)}")
    
    if len(keyword_results) == 0:
        print("❌ Query topic not in corpus!")
    else:
        print(f"Documents found: {[r.source for r in keyword_results]}")

The corpus might simply not contain the answer.

Stage 2: Retrieval Issues

Symptom: Relevant documents exist but aren't retrieved

Check: Retrieval scores

def debug_retrieval(query: str, k: int = 10):
    results = retrieve(query, k=k)
    
    print("Top retrieval results:")
    for i, r in enumerate(results):
        print(f"{i+1}. Score: {r.score:.3f} | Source: {r.source} | Preview: {r.text[:100]}")
    
    # Check score distribution
    scores = [r.score for r in results]
    print(f"\nScore stats: max={max(scores):.3f}, min={min(scores):.3f}, mean={sum(scores)/len(scores):.3f}")
    
    if max(scores) < 0.7:  # threshold is model-dependent; calibrate on known-good queries
        print("⚠️ Low confidence in all results!")

What the scores tell you:

  • All scores low → Query doesn't match corpus well
  • Scores close together → Embeddings aren't discriminating
  • Big gap after top result → Retrieval is confident
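Those three patterns are easy to check mechanically. Here's a small, self-contained helper that classifies a list of retrieval scores; the thresholds are illustrative and should be calibrated for your embedding model:

def classify_scores(scores: list[float]) -> str:
    # Illustrative thresholds; tune them on queries you know work well.
    LOW, FLAT_SPREAD, CONFIDENT_GAP = 0.7, 0.05, 0.15

    if not scores:
        return "no results"
    ranked = sorted(scores, reverse=True)
    top, spread = ranked[0], ranked[0] - ranked[-1]
    gap = ranked[0] - ranked[1] if len(ranked) > 1 else 0.0

    if top < LOW:
        return "all scores low: query doesn't match corpus well"
    if spread < FLAT_SPREAD:
        return "scores close together: embeddings aren't discriminating"
    if gap > CONFIDENT_GAP:
        return "big gap after top result: retrieval is confident"
    return "no dominant pattern: inspect the chunks manually"

# classify_scores([0.82, 0.64, 0.61, 0.58]) → "big gap after top result: ..."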

Symptom: Irrelevant documents ranked high

Check: Embedding quality

def debug_embedding_similarity(query: str, expected_doc: str, retrieved_doc: str):
    q_emb = embed(query)
    expected_emb = embed(expected_doc)
    retrieved_emb = embed(retrieved_doc)
    
    expected_sim = cosine_similarity(q_emb, expected_emb)
    retrieved_sim = cosine_similarity(q_emb, retrieved_emb)
    
    print(f"Expected doc similarity: {expected_sim:.3f}")
    print(f"Retrieved doc similarity: {retrieved_sim:.3f}")
    
    if retrieved_sim > expected_sim:
        print("⚠️ Embedding model prefers wrong document!")
        print("Consider: different embedding model, hybrid search, fine-tuning")

Symptom: Table/code content never retrieved

Check: How special content was indexed

def debug_table_chunk(table_chunk: str):
    print(f"Chunk content:\n{table_chunk}\n")
    
    # Tables often get mangled
    if "|" not in table_chunk and ":" not in table_chunk:
        print("⚠️ Table structure lost in extraction!")
    
    # Try natural language version
    nl_version = "The price for small is $10, medium is $15, large is $20"
    sim = cosine_similarity(embed(table_chunk), embed(nl_version))
    print(f"Similarity to natural language: {sim:.3f}")

Tables and code often need special handling.
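One common fix is to index a natural-language rendering of each table row alongside (or instead of) the raw extraction, so the embedding model gets sentence-like text to work with. A minimal, self-contained sketch of the idea; the row-to-sentence template is something you'd adapt to your own data:

def table_rows_to_sentences(header: list[str], rows: list[list[str]]) -> list[str]:
    """Render each table row as a standalone sentence for embedding."""
    sentences = []
    for row in rows:
        parts = [f"{col} is {val}" for col, val in zip(header, row)]
        sentences.append("The " + ", the ".join(parts) + ".")
    return sentences

# table_rows_to_sentences(["size", "price"], [["small", "$10"], ["medium", "$15"]])
# → ["The size is small, the price is $10.", "The size is medium, the price is $15."]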

Stage 3: Context Assembly Issues

Symptom: Right documents retrieved but wrong answer generated

Check: What context the LLM actually sees

def debug_context(query: str):
    results = retrieve(query, k=10)
    context = assemble_context(results)
    
    print(f"Context token count: {count_tokens(context)}")
    print(f"Number of chunks: {len(results)}")
    print(f"\n=== FULL CONTEXT ===\n{context}\n=== END ===\n")
    
    # Does the context contain the answer?
    expected_answer = "your known correct answer"
    if expected_answer.lower() in context.lower():
        print("✓ Answer is in context (LLM generation issue)")
    else:
        print("❌ Answer NOT in context (retrieval issue)")

This is the most important debug step. If the answer isn't in the context, no prompt magic will fix it.

Symptom: Context too long, irrelevant chunks included

Check: Reranking and filtering

def debug_reranking(query: str):
    initial_results = retrieve(query, k=20)
    reranked = rerank(query, initial_results)[:5]
    
    print("Initial top 5:")
    for r in initial_results[:5]:
        print(f"  {r.score:.3f}: {r.text[:80]}")
    
    print("\nReranked top 5:")
    for r in reranked:
        print(f"  {r.score:.3f}: {r.text[:80]}")
    
    # Did reranking help? Substitute a snippet of the known correct answer here.
    expected_snippet = "your known correct answer"
    correct_in_initial = any(expected_snippet in r.text for r in initial_results[:5])
    correct_in_reranked = any(expected_snippet in r.text for r in reranked)
    print(f"\nCorrect in initial top 5: {correct_in_initial}")
    print(f"Correct in reranked top 5: {correct_in_reranked}")

Stage 4: Generation Issues

Symptom: Answer ignores relevant context

Check: System prompt and context formatting

def debug_generation(query: str, context: str):
    # Check the actual prompt being sent
    prompt = build_prompt(query, context)
    print(f"=== FULL PROMPT ===\n{prompt}\n=== END ===")
    
    # Check token counts
    print(f"System prompt tokens: {count_tokens(system_prompt)}")
    print(f"Context tokens: {count_tokens(context)}")
    print(f"Query tokens: {count_tokens(query)}")
    print(f"Total input tokens: {count_tokens(prompt)}")
    print(f"Max output tokens: {max_output_tokens}")

Common issues:

  • Context buried after very long system prompt
  • Context formatted as JSON (LLM ignores)
  • Answer cut off by token limit
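Each of these is cheap to check automatically. A rough sketch, reusing count_tokens and max_output_tokens from the debug_generation example above (the thresholds are illustrative):

def check_prompt_layout(prompt: str, context: str, response: str):
    # 1. Is the context buried after a very long preamble?
    context_start = prompt.find(context[:200])
    if context_start > 0:
        tokens_before = count_tokens(prompt[:context_start])
        print(f"Tokens before context: {tokens_before}")
        if tokens_before > 1500:  # illustrative threshold
            print("⚠️ Context is buried deep in the prompt")

    # 2. Is the context a raw JSON dump?
    if context.lstrip().startswith(("{", "[")):
        print("⚠️ Context looks like raw JSON; try plain labeled sections")

    # 3. Did the answer run into the output token limit?
    if count_tokens(response) >= max_output_tokens - 5:
        print("⚠️ Response is at the token limit and likely cut off")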

Symptom: LLM hallucinating beyond context

Check: Prompt grounding

# Bad prompt (encourages hallucination)
bad_prompt = f"""
Answer this question: {query}
Here's some context that might help: {context}
"""

# Good prompt (enforces grounding)
good_prompt = f"""
Answer the question using ONLY the following context.
If the context doesn't contain the answer, say "I don't have information on that."

Context:
{context}

Question: {query}

Answer:
"""

Symptom: Answers are correct but verbose/wrong format

Check: Output instructions

def debug_output_format(query: str, context: str):
    # Compare with explicit format instructions
    responses = {}
    
    for format_instruction in [
        "Answer briefly in 1-2 sentences.",
        "Provide a detailed answer with bullet points.",
        "Answer with just the specific value requested.",
    ]:
        prompt = f"{format_instruction}\n\nContext: {context}\n\nQuestion: {query}"
        responses[format_instruction] = llm.generate(prompt)
    
    for instr, resp in responses.items():
        print(f"\n{instr}\n→ {resp}\n")

The Quick Debugging Checklist

Run through this for any failing query:

□ Query: Is the processed query what you expect?
□ Corpus: Does the corpus contain the answer?
□ Retrieval: Are the right chunks retrieved?
□ Scores: Are retrieval scores confident?
□ Context: Is the answer in the assembled context?
□ Tokens: Is context within token limits?
□ Prompt: Does the prompt enforce grounding?
□ Output: Is the format clearly specified?

Common Fixes

Issue                  | Symptom                        | Fix
Wrong embedding model  | Similar docs score differently | Try a different model
Chunking too large     | Retrieves irrelevant sections  | Smaller chunks with overlap
Chunking too small     | Loses context                  | Larger chunks or parent retrieval
No reranking           | Good docs buried in results    | Add a reranker (Cohere, BGE)
Poor grounding         | Hallucination                  | Stronger prompt constraints
Missing content        | Always says "I don't know"     | Check indexing, add data

Conclusion

RAG debugging is systematic, not random.

  1. Start at the query layer
  2. Verify each stage independently
  3. Check what the LLM actually sees
  4. Fix the earliest failure first

The answer is always in the logs—you just need to know where to look.


What's your worst RAG debugging war story?

Written by Abhinav Mahajan

AI Product & Engineering Leader

I write about building AI systems that work in production—from RAG pipelines to agent architectures. These insights come from real experience shipping enterprise AI.
