The Two Faces of AI Engineering
On Twitter, AI engineering is:
- Training models
- Magical prompts that unlock capabilities
- Building agentic systems
- The cutting edge of technology
In production, AI engineering is:
- Debugging why PDFs parse differently
- Adding rate limiting because someone sent 10,000 requests in a minute
- Writing permission checks for the eighth integration
- On-call at 2 AM because the embedding service is down
I've done both. This is a defense of the unglamorous parts—because that's where actual value gets created.
The Unglamorous Reality
1. Data Pipeline Maintenance
What it looks like from outside: "We built an AI that answers questions about company documents."
What it actually involves:
def ingest_document(source: DocumentSource):
    """This function handles 847 edge cases."""
    # Step 1: Download (handles retries, timeouts, auth failures)
    raw = download_with_retry(source, max_retries=3)

    # Step 2: Detect format (PDF, DOCX, HTML, TXT, images...)
    fmt = detect_format(raw)  # Sniffing bytes, not trusting extensions

    # Step 3: Extract text (different for every format)
    if fmt == "pdf":
        text = extract_pdf(raw)  # 15 libraries for 15 kinds of PDFs
    elif fmt == "docx":
        text = extract_docx(raw)
    elif fmt == "html":
        text = clean_html(raw)  # Boilerplate removal
    # ... 8 more formats

    # Step 4: Clean text (encoding issues, artifacts, whitespace)
    text = clean_text(text)

    # Step 5: Detect language, structure, tables
    metadata = analyze_document(text)

    # Step 6: Chunk (strategy depends on document type)
    chunks = smart_chunk(text, metadata)

    # Step 7: Embed (handle API failures, rate limits)
    embeddings = embed_with_retry(chunks)

    # Step 8: Store with permissions (who can see this?)
    store_with_permissions(chunks, embeddings, source.permissions)

    # Step 9: Update index (atomic, handle partial failures)
    update_index(source.id)

    # Step 10: Log everything (for debugging when it breaks)
    log_ingestion(source, chunks, status="success")
Ninety percent of AI engineering time goes into robust versions of these 10 steps. Zero percent of it is glamorous.
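Even step 2 alone hides real work. Here's a minimal sketch of what byte-sniffing format detection can look like; this is one possible implementation of the detect_format helper above, not our exact one, and real pipelines usually layer libmagic or similar on top:

# Sketch of detect_format: trust the bytes, not the file extension.
def detect_format(raw: bytes) -> str:
    head = raw[:512].lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX/XLSX/PPTX are all ZIP containers; inspect the archive to be sure
        return "docx"
    if head[:64].lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    try:
        raw.decode("utf-8")
        return "txt"
    except UnicodeDecodeError:
        return "unknown"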
2. Permission Systems
The demo version:
def answer_question(question):
    context = search(question)
    return llm.generate(question, context)
The production version:
def answer_question(question, user: User):
    # Verify user authentication
    if not user.is_authenticated:
        return AuthError("Please log in")

    # Get user's permissions
    permissions = get_permissions(user)

    # Search only in documents user can access
    context = search(
        question,
        filter={
            "OR": [
                {"is_public": True},
                {"owner_id": user.id},
                {"team_ids": {"$in": permissions.team_ids}},
                {"department": permissions.department},
            ]
        },
    )

    # Log the access for audit
    log_access(user, question, context)

    # Generate with permission-aware system prompt
    return llm.generate(
        question,
        context,
        system=get_system_prompt(permissions.clearance_level),
    )
Permission systems touch everything and are never done. There's always another integration, another edge case, another team with special requirements.
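One thing that keeps the sprawl manageable is building the access filter in exactly one place, so every new integration inherits the same rules instead of reimplementing them. A sketch under that assumption; the helper name and attributes are illustrative, not tied to any specific vector DB client:

# Sketch: centralize filter construction so an integration can't forget a clause.
def build_access_filter(user, permissions) -> dict:
    clauses = [
        {"is_public": True},
        {"owner_id": user.id},
        {"team_ids": {"$in": permissions.team_ids}},
    ]
    if permissions.department:
        clauses.append({"department": permissions.department})
    return {"OR": clauses}

# Every retrieval path then calls:
# context = search(question, filter=build_access_filter(user, permissions))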
3. Eval Harness Development
What AI tutorials say: "And that's it! Deploy your chatbot."
What production requires:
class EvalHarness:
    def __init__(self):
        self.eval_set = load_eval_set("production_queries.json")  # 500+ cases
        self.metrics = []

    def run_full_eval(self):
        """This runs before every deploy."""
        results = {
            "retrieval_precision": [],
            "retrieval_recall": [],
            "answer_accuracy": [],
            "answer_relevance": [],
            "citation_accuracy": [],
            "response_time": [],
            "no_hallucination_rate": [],
        }
        for test_case in self.eval_set:
            start = time.time()
            response = system.answer(test_case.query, mock_user())
            elapsed = time.time() - start

            results["response_time"].append(elapsed)
            results["retrieval_precision"].append(
                self.score_retrieval_precision(response, test_case)
            )
            results["answer_accuracy"].append(
                self.score_accuracy(response, test_case)
            )
            # ... 5 more scoring functions
        return self.aggregate(results)

    def check_deploy_gate(self, results):
        """Block deploy if quality regresses."""
        baseline = load_baseline("last_deploy.json")
        for metric, threshold in DEPLOY_THRESHOLDS.items():
            if results[metric] < baseline[metric] - threshold:
                raise DeployBlocked(
                    f"{metric} regressed: {baseline[metric]} -> {results[metric]}"
                )
Building and maintaining the eval harness often takes as long as building the AI system itself. It's unglamorous but essential.
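The deploy gate above assumes a DEPLOY_THRESHOLDS mapping and a stored baseline. A minimal sketch of how those pieces might fit together; the threshold values and the save_baseline helper are illustrative assumptions:

# Illustrative thresholds: how much each aggregated metric may drop below the
# last deploy's baseline before the gate raises DeployBlocked.
DEPLOY_THRESHOLDS = {
    "retrieval_precision": 0.02,
    "answer_accuracy": 0.03,
    "citation_accuracy": 0.02,
    "no_hallucination_rate": 0.01,
}

# Wired into CI so a regression fails the pipeline, not the postmortem:
# harness = EvalHarness()
# results = harness.run_full_eval()
# harness.check_deploy_gate(results)              # raises -> deploy blocked
# save_baseline("last_deploy.json", results)      # new baseline on success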
4. Error Handling and Recovery
The happy path: everything works.
The production path:
def answer_question(question, user):
    try:
        context = search(question)
    except EmbeddingServiceDown:
        # Fall back to keyword search
        context = keyword_search(question)
        log.warning("Fell back to keyword search")
    except VectorDBTimeout:
        # Return apologetic response
        return "I'm having trouble searching right now. Try again in a moment."

    try:
        response = llm.generate(question, context)
    except RateLimitExceeded:
        # Queue for async processing
        job_id = queue_async(question, user, context)
        return f"Heavy load. Your answer will be ready at /results/{job_id}"
    except ContextTooLong:
        # Summarize context and retry
        compressed = summarize(context)
        response = llm.generate(question, compressed)
    except LLMTimeout:
        # Return partial answer
        return (
            "I found some relevant information but couldn't fully process it. "
            "Here's what I found: " + format_context(context[:3])
        )

    # Validate response before returning
    if contains_pii(response):
        response = redact_pii(response)
    if detected_hallucination(response, context):
        response += "\n\n⚠️ Please verify this information."
    return response
Every external dependency fails. Robust systems handle every failure mode.
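The other half of this is not calling flaky dependencies naively in the first place. A minimal sketch of the kind of retry-with-backoff wrapper that sits behind helpers like embed_with_retry and download_with_retry; stdlib only, and the details are an assumption rather than our exact implementation:

import random
import time

# Sketch: exponential backoff with jitter, then re-raise loudly so the
# caller's fallback path (keyword search, async queueing) can take over.
def with_retry(fn, *args, max_retries=3, base_delay=0.5, retry_on=(Exception,), **kwargs):
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# embeddings = with_retry(embed_chunks, chunks, retry_on=(RateLimitExceeded, LLMTimeout))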
5. Monitoring and Alerting
No one mentions monitoring until something breaks at 2 AM.
# Metrics we track for every request
metrics = {
    # Performance
    "response_time_ms": [...],
    "embedding_time_ms": [...],
    "retrieval_time_ms": [...],
    "llm_time_ms": [...],
    # Quality signals
    "retrieval_top_score": [...],
    "num_chunks_retrieved": [...],
    "response_length_tokens": [...],
    # User signals
    "thumbs_up_rate": [...],
    "thumbs_down_rate": [...],
    "regenerate_rate": [...],
    "escalation_rate": [...],
    # Errors
    "retrieval_failures": [...],
    "llm_failures": [...],
    "timeout_rate": [...],
}

# Alerts we've set up after incidents
alerts = [
    "retrieval_top_score avg < 0.5 for 10 minutes → on-call page",
    "thumbs_down_rate > 20% for 1 hour → Slack alert",
    "llm_failures > 5 in 5 minutes → on-call page",
    "response_time_ms p95 > 10000 for 5 minutes → Slack alert",
]
We added each alert after an incident taught us we needed it.
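Most of the per-stage timings above come from wrapping each step of the request path. A minimal sketch of that kind of timing helper, using only the stdlib; emit_metric is a placeholder for whatever metrics backend you actually use, not a real client call:

import time
from contextlib import contextmanager

# Sketch: time one stage of a request and push the result to the metrics sink.
@contextmanager
def timed(stage: str, request_id: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        emit_metric(f"{stage}_time_ms", elapsed_ms, tags={"request_id": request_id})

# with timed("retrieval", request_id):
#     context = search(question)
# with timed("llm", request_id):
#     response = llm.generate(question, context)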
6. Cost Management
AI systems cost money. Real money. Every month.
# Monthly cost breakdown for a medium-scale RAG system
costs = {
    "embedding_api": "$500",      # OpenAI embeddings
    "llm_api": "$3,000",          # GPT-4 for generation
    "vector_db": "$200",          # Pinecone/Weaviate
    "compute": "$400",            # API servers
    "document_storage": "$50",    # S3 or similar
    "monitoring": "$100",         # Datadog/New Relic
    # Total: ~$4,250/month
}

# What we did to reduce costs
optimizations = [
    "Caching: cache embeddings, don't re-embed same chunks",
    "Tiered LLMs: GPT-4o-mini for simple queries, GPT-4 for complex",
    "Context compression: reduce tokens sent to LLM by 40%",
    "Request batching: batch embedding calls",
    "Query deduplication: detect same question, return cached answer",
]

# After optimizations: ~$1,800/month
Every startup building with AI has had the "we spent how much?" moment.
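Caching sits first on that list for a reason: documents get re-ingested constantly and most chunks haven't changed. A minimal sketch of a content-hash embedding cache, assuming a generic key-value store and an embed_fn that calls the API; the names are illustrative:

import hashlib

# Sketch: key embeddings by a hash of the chunk text so unchanged chunks are
# never re-embedded. `cache` is any key-value store (dict, Redis, SQLite).
def embed_with_cache(chunks, cache, embed_fn):
    keys = [hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks]
    missing = [(k, c) for k, c in zip(keys, chunks) if k not in cache]
    if missing:
        # Batch the API call for the misses only (also helps with rate limits)
        new_embeddings = embed_fn([c for _, c in missing])
        for (k, _), emb in zip(missing, new_embeddings):
            cache[k] = emb
    return [cache[k] for k in keys]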
7. On-Call and Incident Response
What the job posting says: "Work on cutting-edge AI"
What the job involves: 2 AM incident response
2:14 AM - Alert: thumbs_down_rate spike (45% vs 8% baseline)
2:16 AM - Check dashboards: started at 1:30 AM
2:18 AM - Review recent deploys: no deploys in 24 hours
2:22 AM - Check external dependencies: LLM API latency elevated
2:25 AM - Correlation: high latency → timeouts → partial responses → thumbs down
2:28 AM - Mitigation: increase timeout, enable response streaming
2:35 AM - Thumbs down rate returning to baseline
2:40 AM - Root cause identified: OpenAI had degraded performance
3:00 AM - Write incident report, go back to sleep
This happens. Regularly.
Why This Matters
The unglamorous parts aren't obstacles to AI engineering—they're the core of AI engineering.
The glamorous 10%:
- Choosing the right model
- Crafting clever prompts
- Building agent architectures
The unglamorous 90%:
- Making it reliable
- Making it secure
- Making it fast
- Making it affordable
- Making it maintainable
Systems fail at the unglamorous parts. The team that ignores permission systems gets a data leak. The team that skips evals ships regressions. The team without monitoring doesn't know they're broken.
The Skills That Actually Matter
If you want to build AI systems that work in production, prioritize:
1. Systems thinking: understanding how components interact, where failures propagate, where bottlenecks emerge.
2. Defensive programming: assuming everything fails, handling every edge case, building recovery paths.
3. Operations instinct: knowing what to monitor, how to debug distributed systems, when to wake up vs. when to wait until morning.
4. Security mindset: treating every input as potentially malicious, validating everything, respecting permissions.
5. Cost consciousness: understanding that scale costs money, optimizing not just for performance but for dollars.
These skills aren't taught in AI courses. They're learned from production fires.
Conclusion
AI engineering isn't what Twitter shows. It's not just prompting and models and agents.
It's:
- Data pipelines that handle garbage input
- Permission systems that work for every user
- Eval harnesses that catch regressions
- Error handling that fails gracefully
- Monitoring that reveals problems
- Cost management that avoids surprise bills
- On-call rotations that restore service at 2 AM
The companies with successful AI products aren't always the ones with the best models. They're the ones who got the unglamorous parts right.
If this sounds like work you'd find satisfying, you'll do well in AI engineering. If it sounds tedious, you'll be happier in research.
Neither is wrong. But only one ships products.
The ratio of AI engineering time spent on models vs. production infrastructure is roughly 1:9. Embrace the 9.
What's your least glamorous AI engineering task?