Essay
September 22, 2024 · 8 min read

The Unglamorous Parts of AI Engineering

Twitter shows AI demos. Production requires data pipelines, permission systems, eval harnesses, and on-call rotations. Here's what AI engineering actually looks like.

AI Engineering · Career · Production Systems

The Two Faces of AI Engineering

On Twitter, AI engineering is:

  • Training models
  • Magical prompts that unlock capabilities
  • Building agentic systems
  • The cutting edge of technology

In production, AI engineering is:

  • Debugging why one PDF parses cleanly and the next one doesn't
  • Adding rate limiting because someone sent 10,000 requests in a minute
  • Writing permission checks for the eighth integration
  • On-call at 2 AM because the embedding service is down

I've done both. This is a defense of the unglamorous parts—because that's where actual value gets created.

The Unglamorous Reality

1. Data Pipeline Maintenance

What it looks like from outside: "We built an AI that answers questions about company documents."

What it actually involves:

def ingest_document(source: DocumentSource):
    """This function handles 847 edge cases."""
    
    # Step 1: Download (handles retries, timeouts, auth failures)
    raw = download_with_retry(source, max_retries=3)
    
    # Step 2: Detect format (PDF, DOCX, HTML, TXT, images...)
    doc_format = detect_format(raw)  # Sniffing bytes, not trusting extensions
    
    # Step 3: Extract text (different for every format)
    if doc_format == "pdf":
        text = extract_pdf(raw)  # 15 libraries for 15 kinds of PDFs
    elif doc_format == "docx":
        text = extract_docx(raw)
    elif doc_format == "html":
        text = clean_html(raw)  # Boilerplate removal
    # ... 8 more formats
    
    # Step 4: Clean text (encoding issues, artifacts, whitespace)
    text = clean_text(text)
    
    # Step 5: Detect language, structure, tables
    metadata = analyze_document(text)
    
    # Step 6: Chunk (strategy depends on document type)
    chunks = smart_chunk(text, metadata)
    
    # Step 7: Embed (handle API failures, rate limits)
    embeddings = embed_with_retry(chunks)
    
    # Step 8: Store with permissions (who can see this?)
    store_with_permissions(chunks, embeddings, source.permissions)
    
    # Step 9: Update index (atomic, handle partial failures)
    update_index(source.id)
    
    # Step 10: Log everything (for debugging when it breaks)
    log_ingestion(source, chunks, status="success")

Ninety percent of AI engineering time goes into robust versions of these 10 steps. Zero percent of it is glamorous.
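That "sniffing bytes" comment in step 2 hides real work: users upload files whose extensions lie. A minimal sketch of what a detect_format helper can do, checking magic bytes before trusting anything else (illustrative, not the production detector):

def detect_format(raw: bytes) -> str:
    """Guess a document's format from its leading bytes, never its extension."""
    if raw.startswith(b"%PDF-"):
        return "pdf"
    if raw.startswith(b"PK\x03\x04"):
        # ZIP container: DOCX, XLSX, and PPTX all start this way,
        # so a real detector inspects the archive contents next
        return "docx"
    head = raw.lstrip()[:100].lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    try:
        raw.decode("utf-8")
        return "txt"
    except UnicodeDecodeError:
        return "unknown"

A production version keeps growing branches for images, spreadsheets, and whatever else users throw at it.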

2. Permission Systems

The demo version:

def answer_question(question):
    context = search(question)
    return llm.generate(question, context)

The production version:

def answer_question(question, user: User):
    # Verify user authentication
    if not user.is_authenticated:
        return AuthError("Please log in")
    
    # Get user's permissions
    permissions = get_permissions(user)
    
    # Search only in documents user can access
    context = search(
        question, 
        filter={
            "OR": [
                {"is_public": True},
                {"owner_id": user.id},
                {"team_ids": {"$in": permissions.team_ids}},
                {"department": permissions.department}
            ]
        }
    )
    
    # Log the access for audit
    log_access(user, question, context)
    
    # Generate with permission-aware system prompt
    return llm.generate(
        question, 
        context,
        system=get_system_prompt(permissions.clearance_level)
    )

Permission systems touch everything and are never done. There's always another integration, another edge case, another team with special requirements.
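One thing that keeps the sprawl honest is turning the permission rules into regression tests that run alongside the evals, so a new integration can't silently widen access. A hypothetical example; make_user, ingest_test_doc, and permission_filter_for are stand-in fixtures, not real helpers from this codebase:

def test_cross_team_isolation():
    """A user must never retrieve chunks from a team they don't belong to."""
    alice = make_user(team_ids=["sales"])            # hypothetical test fixture
    legal_doc = ingest_test_doc(team_ids=["legal"])  # private to the legal team

    results = search("termination clause", filter=permission_filter_for(alice))

    assert all(chunk.source_id != legal_doc.id for chunk in results), \
        "permission filter leaked a legal-only document to a sales user"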

3. Eval Harness Development

What AI tutorials say: "And that's it! Deploy your chatbot."

What production requires:

class EvalHarness:
    def __init__(self):
        self.eval_set = load_eval_set("production_queries.json")  # 500+ cases
        self.metrics = []
    
    def run_full_eval(self):
        """This runs before every deploy."""
        results = {
            "retrieval_precision": [],
            "retrieval_recall": [],
            "answer_accuracy": [],
            "answer_relevance": [],
            "citation_accuracy": [],
            "response_time": [],
            "no_hallucination_rate": [],
        }
        
        for test_case in self.eval_set:
            start = time.time()
            response = system.answer(test_case.query, mock_user())
            elapsed = time.time() - start
            
            results["response_time"].append(elapsed)
            results["retrieval_precision"].append(
                self.score_retrieval_precision(response, test_case)
            )
            results["answer_accuracy"].append(
                self.score_accuracy(response, test_case)
            )
            # ... 5 more scoring functions
        
        return self.aggregate(results)
    
    def check_deploy_gate(self, results):
        """Block deploy if quality regresses."""
        baseline = load_baseline("last_deploy.json")
        
        for metric, threshold in DEPLOY_THRESHOLDS.items():
            if results[metric] < baseline[metric] - threshold:
                raise DeployBlocked(
                    f"{metric} regressed: {baseline[metric]} -> {results[metric]}"
                )

Building and maintaining the eval harness often takes as long as building the AI system itself. It's unglamorous but essential.
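For completeness, the DEPLOY_THRESHOLDS referenced above is just a table of how much regression each metric is allowed before the gate blocks the deploy. The numbers here are placeholders, not real thresholds:

# Maximum allowed drop per metric relative to the last deploy (placeholder values)
DEPLOY_THRESHOLDS = {
    "retrieval_precision": 0.02,
    "answer_accuracy": 0.03,
    "citation_accuracy": 0.02,
    "no_hallucination_rate": 0.01,
}

# Typical CI usage before a deploy
harness = EvalHarness()
results = harness.run_full_eval()
harness.check_deploy_gate(results)  # raises DeployBlocked on regression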

4. Error Handling and Recovery

The happy path: everything works.

The production path:

def answer_question(question, user):
    try:
        context = search(question)
    except EmbeddingServiceDown:
        # Fall back to keyword search
        context = keyword_search(question)
        log.warning("Fell back to keyword search")
    except VectorDBTimeout:
        # Return apologetic response
        return "I'm having trouble searching right now. Try again in a moment."
    
    try:
        response = llm.generate(question, context)
    except RateLimitExceeded:
        # Queue for async processing
        job_id = queue_async(question, user, context)
        return f"Heavy load. Your answer will be ready at /results/{job_id}"
    except ContextTooLong:
        # Summarize context and retry
        compressed = summarize(context)
        response = llm.generate(question, compressed)
    except LLMTimeout:
        # Return partial answer
        return "I found some relevant information but couldn't fully process it. Here's what I found: " + format_context(context[:3])
    
    # Validate response before returning
    if contains_pii(response):
        response = redact_pii(response)
    if detected_hallucination(response, context):
        response += "\n\n⚠️ Please verify this information."
    
    return response

Every external dependency fails. Robust systems handle every failure mode.
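The *_with_retry helpers that show up throughout these snippets are all variations on one pattern: exponential backoff with jitter, and a hard cap so the caller's fallback path eventually runs. A generic sketch with made-up defaults:

import random
import time

def with_retry(fn, *args, max_retries=3, base_delay=0.5, retriable=(TimeoutError,)):
    """Call fn(*args), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except retriable:
            if attempt == max_retries:
                raise  # out of retries: let the caller's fallback handle it
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))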

5. Monitoring and Alerting

No one mentions monitoring until something breaks at 2 AM.

# Metrics we track for every request
metrics = {
    # Performance
    "response_time_ms": [...],
    "embedding_time_ms": [...],
    "retrieval_time_ms": [...],
    "llm_time_ms": [...],
    
    # Quality signals
    "retrieval_top_score": [...],
    "num_chunks_retrieved": [...],
    "response_length_tokens": [...],
    
    # User signals
    "thumbs_up_rate": [...],
    "thumbs_down_rate": [...],
    "regenerate_rate": [...],
    "escalation_rate": [...],
    
    # Errors
    "retrieval_failures": [...],
    "llm_failures": [...],
    "timeout_rate": [...],
}

# Alerts we've set up after incidents
alerts = [
    "retrieval_top_score avg < 0.5 for 10 minutes → on-call page",
    "thumbs_down_rate > 20% for 1 hour → Slack alert",
    "llm_failures > 5 in 5 minutes → on-call page",
    "response_time_ms p95 > 10000 for 5 minutes → Slack alert",
]

We added each alert after an incident taught us we needed it.
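Each of those alert strings eventually becomes a rule in the monitoring system. Here's the first one sketched as a periodic check; get_metric and page_oncall are stand-ins for whatever your metrics backend and pager provide:

def check_retrieval_quality():
    """Page on-call if average retrieval_top_score stays below 0.5 for 10 minutes."""
    scores = get_metric("retrieval_top_score", last_minutes=10)  # hypothetical helper
    if scores and sum(scores) / len(scores) < 0.5:
        page_oncall(
            "retrieval_top_score avg < 0.5 for 10m: "
            "possible embedding drift, index corruption, or bad deploy"
        )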

6. Cost Management

AI systems cost money. Real money. Every month.

# Monthly cost breakdown for a medium-scale RAG system
costs = {
    "embedding_api": "$500",      # OpenAI embeddings
    "llm_api": "$3,000",          # GPT-4 for generation
    "vector_db": "$200",          # Pinecone/Weaviate
    "compute": "$400",            # API servers
    "document_storage": "$50",    # S3 or similar
    "monitoring": "$100",         # Datadog/New Relic
    # Total: ~$4,250/month
}

# What we did to reduce costs
optimizations = [
    "Caching: cache embeddings, don't re-embed same chunks",
    "Tiered LLMs: GPT-4o-mini for simple queries, GPT-4 for complex",
    "Context compression: reduce tokens sent to LLM by 40%",
    "Request batching: batch embedding calls",
    "Query deduplication: detect same question, return cached answer",
]
# After optimizations: ~$1,800/month
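The caching line is worth one sketch: key the cache on a hash of the chunk text so identical chunks are never embedded twice. The in-memory dict stands in for Redis or a database table, and embed_with_retry is the same helper used in the ingestion pipeline above:

import hashlib

_embedding_cache = {}  # in production: Redis or a DB table, not process memory

def _chunk_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def embed_cached(chunks: list[str]) -> list[list[float]]:
    """Only call the embedding API for chunks we haven't embedded before."""
    missing = [c for c in chunks if _chunk_key(c) not in _embedding_cache]
    if missing:
        for chunk, vector in zip(missing, embed_with_retry(missing)):
            _embedding_cache[_chunk_key(chunk)] = vector
    return [_embedding_cache[_chunk_key(c)] for c in chunks]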

Every startup building with AI has had the "we spent how much?" moment.

7. On-Call and Incident Response

What the job posting says: "Work on cutting-edge AI"

What the job involves: 2 AM incident response

2:14 AM - Alert: thumbs_down_rate spike (45% vs 8% baseline)
2:16 AM - Check dashboards: started at 1:30 AM
2:18 AM - Review recent deploys: no deploys in 24 hours
2:22 AM - Check external dependencies: LLM API latency elevated
2:25 AM - Correlation: high latency → timeouts → partial responses → thumbs down
2:28 AM - Mitigation: increase timeout, enable response streaming
2:35 AM - Thumbs down rate returning to baseline
2:40 AM - Root cause identified: OpenAI had degraded performance
3:00 AM - Write incident report, go back to sleep

This happens. Regularly.

Why This Matters

The unglamorous parts aren't obstacles to AI engineering—they're the core of AI engineering.

The glamorous 10%:

  • Choosing the right model
  • Crafting clever prompts
  • Building agent architectures

The unglamorous 90%:

  • Making it reliable
  • Making it secure
  • Making it fast
  • Making it affordable
  • Making it maintainable

Systems fail at the unglamorous parts. The team that ignores permission systems gets a data leak. The team that skips evals ships regressions. The team without monitoring doesn't know they're broken.

The Skills That Actually Matter

If you want to build AI systems that work in production, prioritize:

1. Systems thinking: understanding how components interact, where failures propagate, where bottlenecks emerge.

2. Defensive programming: assuming everything fails, handling every edge case, building recovery paths.

3. Operations instinct: knowing what to monitor, how to debug distributed systems, when to wake up vs. when to wait until morning.

4. Security mindset: treating every input as potentially malicious, validating everything, respecting permissions.

5. Cost consciousness: understanding that scale costs money, optimizing not just for performance but for dollars.

These skills aren't taught in AI courses. They're learned from production fires.

Conclusion

AI engineering isn't what Twitter shows. It's not just prompting and models and agents.

It's:

  • Data pipelines that handle garbage input
  • Permission systems that work for every user
  • Eval harnesses that catch regressions
  • Error handling that fails gracefully
  • Monitoring that reveals problems
  • Cost management that avoids surprise bills
  • On-call rotations that restore service at 2 AM

The companies with successful AI products aren't always the ones with the best models. They're the ones who got the unglamorous parts right.

If this sounds like work you'd find satisfying, you'll do well in AI engineering. If it sounds tedious, you'll be happier in research.

Neither is wrong. But only one ships products.


The ratio of AI engineering time spent on models vs. production infrastructure is roughly 1:9. Embrace the 9.

What's your least glamorous AI engineering task?


Written by Abhinav Mahajan

AI Product & Engineering Leader

I write about building AI systems that work in production—from RAG pipelines to agent architectures. These insights come from real experience shipping enterprise AI.
