Learning the Hard Way
I've built chatbots for customer support, internal knowledge bases, and developer tools. Some were successful. Others taught me expensive lessons.
Here are the five mistakes I made more than once—and what I do differently now.
Mistake 1: No "I Don't Know" Training
What I did wrong
My first customer support chatbot was confident. Too confident. When it didn't have the answer, it made one up. Confidently.
A customer asked about a refund policy edge case. The bot invented a policy that didn't exist. The customer quoted it back to our support team. Our support team had no idea what they were talking about.
It got worse. The bot confidently gave wrong information about pricing, product features, and shipping times. Each wrong answer eroded trust.
Why it happened
LLMs are trained to be helpful. They'll generate plausible-sounding responses even when they shouldn't. Unless they're explicitly told that refusing is allowed, they rarely will.
What I do now
- Explicit refusal instructions in the system prompt:
If you are not certain about the answer, say: "I don't have confident information about that. Let me connect you with someone who can help."
- Retrieval confidence thresholds — if no documents match above 0.7 similarity, trigger the uncertainty response (see the sketch after this list).
- Test with out-of-scope questions — deliberately ask things the bot shouldn't know and verify it refuses appropriately.
- Make refusal easy — the bot should never feel like it's "failing" by saying "I don't know." It's the right answer to some questions.
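Here's a minimal sketch of that threshold check. The retriever and LLM interfaces (`retriever.search`, `llm.generate`) and the 0.7 cutoff are assumptions; swap in whatever your retrieval stack actually provides.

UNCERTAIN_REPLY = (
    "I don't have confident information about that. "
    "Let me connect you with someone who can help."
)

def answer(query: str, retriever, llm, threshold: float = 0.7) -> str:
    # Assumed interface: retriever.search returns a list of (text, similarity) pairs
    results = retriever.search(query, top_k=5)
    confident = [(text, score) for text, score in results if score >= threshold]

    if not confident:
        # Nothing matched well enough: refuse instead of guessing
        return UNCERTAIN_REPLY

    context = "\n\n".join(text for text, _ in confident)
    # Assumed interface: llm.generate answers the question grounded in the context
    return llm.generate(question=query, context=context)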
Mistake 2: Testing with Clean Input
What I did wrong
My test queries were well-formed:
- "What is the return policy?"
- "How do I reset my password?"
- "What are your business hours?"
Real users asked:
- "where tf is my order???"
- "i bought thing 2 weeks ago cant login help"
- "refund"
- "your app doesn't work" (with no other context)
The bot handled test queries perfectly and real queries terribly.
Why it happened
Testing with clean input tests a different product than users actually use. Users have typos, frustration, incomplete context, and unclear questions.
What I do now
- Test with production data — take real customer messages (anonymized) as your test set.
- Include edge cases deliberately:
  - Typos and misspellings
  - ALL CAPS ANGRY MESSAGES
  - Single-word queries
  - Multi-question messages
  - Non-English or mixed language
  - Irrelevant messages (spam, wrong chat)
- Shadow mode first — run the bot in production but show responses only to internal reviewers before going live.
- Preprocessing pipeline — spell correction, intent clarification, and follow-up questions before passing to the main model (a sketch follows this list).
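A rough sketch of what that preprocessing step can look like. The `preprocess` function, the vague-query list, and the clarification rule are illustrative assumptions; real spell correction would come from a dedicated library or model.

import re

VAGUE_QUERIES = {"refund", "help", "broken", "not working"}  # illustrative, not exhaustive

def preprocess(raw: str) -> dict:
    """Normalize a raw user message before it reaches the main model."""
    text = re.sub(r"\s+", " ", raw.strip())  # collapse stray whitespace and newlines
    if text.isupper():
        text = text.lower()  # defuse ALL-CAPS messages before the model sees them

    # Very short or single-word messages usually need a follow-up question, not an answer
    needs_clarification = len(text.split()) <= 2 or text.lower() in VAGUE_QUERIES
    return {"text": text, "needs_clarification": needs_clarification}

If `needs_clarification` comes back true, the bot asks a follow-up question instead of guessing at intent.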
Mistake 3: Ignoring Conversation History
What I did wrong
Each message was handled independently:
User: What's your cheapest plan?
Bot: Our Basic plan is $9/month.
User: Does it include email support?
Bot: I'd be happy to help with email support! What would you like to know?
The bot forgot the user was asking about the Basic plan. With no referent for "it," the reply was a generic non-answer. The experience was frustrating.
Why it happened
I was sending only the latest user message to the LLM. No context. Every message was a new conversation.
What I do now
- Include conversation history in every request:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "What's your cheapest plan?"},
    {"role": "assistant", "content": "Our Basic plan is $9/month."},
    {"role": "user", "content": "Does it include email support?"},  # current query
]
- History management — keep recent messages, summarize older ones:
if len(messages) > 10:
    old_messages = messages[1:-8]  # everything between the system prompt and the recent turns
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"Earlier context: {summarize(old_messages)}"},
        *messages[-8:],  # keep the last 4 exchanges
    ]
- Entity tracking — maintain state about what products, orders, or topics are being discussed (a sketch follows).
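One lightweight way to do that tracking, as a sketch. The `ConversationState` fields, the plan names, and the keyword matching are assumptions; the extraction step could just as well be an LLM call or a proper NER model.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConversationState:
    current_plan: Optional[str] = None
    order_id: Optional[str] = None
    topics: list = field(default_factory=list)

def update_state(state: ConversationState, user_message: str) -> ConversationState:
    """Naive keyword-based entity tracking; replace with LLM extraction or NER as needed."""
    lowered = user_message.lower()
    for plan in ("basic", "pro", "enterprise"):  # assumed plan names
        if plan in lowered:
            state.current_plan = plan
    if "refund" in lowered and "refund" not in state.topics:
        state.topics.append("refund")
    return state

The state can then be injected into the system prompt ("The user is currently discussing the Basic plan"), so follow-ups like "Does it include email support?" resolve correctly.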
Mistake 4: No Human Escalation Path
What I did wrong
I was proud of my chatbot's coverage. It handled everything! Users never needed to talk to humans!
Except when they did. And when they did, there was no way to reach one.
Users got stuck in loops. The bot kept trying to help. The users got angrier. Some gave up and left bad reviews. Others called the support phone number and complained about "the stupid bot that wouldn't let me talk to a person."
Why it happened
I designed for the happy path. Handle queries, provide answers, close conversations. I didn't design for the frustrated user who needs something the bot can't provide.
What I do now
- Always offer human escalation:
If you cannot help the user after 2 attempts, or if the user asks to speak to a human, respond:
"I'd be happy to connect you with our support team. [Click here] to chat with a team member, or reply 'human' and I'll transfer you."
- Detect frustration — repeated questions, capital letters, profanity, and negative sentiment trigger early escalation (see the sketch after this list).
- Graceful handoff — when escalating, include conversation history so the human has context.
- Make it obvious — I add a "Talk to a human" button in the UI, visible at all times.
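A deliberately crude sketch of that frustration check. The word list, thresholds, and function name are assumptions; a sentiment classifier or an LLM judgment call would do better in production.

PROFANITY = {"wtf", "damn"}  # abbreviated, illustrative word list

def looks_frustrated(message: str, recent_user_messages: list) -> bool:
    """Cheap heuristics that trigger early escalation."""
    text = message.strip()
    words = text.lower().split()

    shouting = text.isupper() and len(words) >= 2
    swearing = any(word.strip("?!.,") in PROFANITY for word in words)
    # Asking (roughly) the same thing again is a strong frustration signal
    repeating = any(text.lower() == prev.strip().lower() for prev in recent_user_messages)

    return shouting or swearing or repeating

When it returns True, the bot skips further attempts and goes straight to the handoff message.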
Mistake 5: Deploying Without an Eval Set
What I did wrong
I tested manually. Clicked around. Asked various questions. "Looks good to me!" Deployed.
A week later, users reported problems. I fixed them. Broke something else. Fixed that. Broke two more things.
I had no systematic way to know if the bot was improving or getting worse. Every change was a gamble.
Why it happened
Building an eval set felt like extra work. I wanted to ship. "I'll add tests later."
I never added tests later. I was too busy fixing bugs.
What I do now
- Build the eval set first — before writing any bot logic:
eval_set = [
    {
        "input": "What's your refund policy?",
        "must_include": ["30 days", "full refund"],
        "must_not_include": ["exchange only"],
        "max_response_length": 200,
    },
    {
        "input": "asf;lkj;salkfj",
        "expected_behavior": "asks for clarification",
    },
    {
        "input": "I want to speak to a human",
        "expected_behavior": "escalation offered",
    },
]
- Run evals on every change — automated, in CI/CD (a sketch of passes_criteria follows this list):
def test_chatbot():
    for test in eval_set:
        response = chatbot.respond(test["input"])
        assert passes_criteria(response, test)
- Track metrics over time — accuracy, satisfaction, escalation rate. Plot trends. Catch degradation early.
- Golden set vs. diverse set — maintain a stable "golden" set for regression testing, plus a diverse set that grows with production failures.
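And one possible passes_criteria for the declarative checks above. The field names mirror the eval set example; how to score "expected_behavior" cases is an open assumption, since they usually need a human or LLM judge.

def passes_criteria(response: str, test: dict) -> bool:
    """Check a response against the declarative fields of one eval case."""
    text = response.lower()

    for phrase in test.get("must_include", []):
        if phrase.lower() not in text:
            return False
    for phrase in test.get("must_not_include", []):
        if phrase.lower() in text:
            return False
    if "max_response_length" in test and len(response) > test["max_response_length"]:
        return False

    # "expected_behavior" cases (clarification, escalation) resist string matching;
    # an LLM-as-judge call or a manual review queue can cover them.
    return True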
Summary: My Chatbot Checklist
Before deploying any chatbot now, I verify:
✅ Refusal — Bot says "I don't know" for out-of-scope questions
✅ Messy input — Bot handles typos, anger, and vagueness
✅ Context — Bot maintains conversation history
✅ Escalation — Users can always reach a human
✅ Evals — Automated tests run on every deploy
If any of these fail, I don't deploy. Period.
The Meta-Lesson
All five mistakes share a root cause: I built for the ideal case instead of the real case.
- Ideal users ask clear questions. Real users don't.
- Ideal bots always know the answer. Real bots don't.
- Ideal conversations are single exchanges. Real conversations have history.
- Ideal automation replaces humans. Real automation supports humans.
- Ideal code works. Real code needs tests.
Build for reality, not for demos.
What chatbot mistakes have you made? I'd love to hear—we learn the best lessons from each other's failures.