
Context Window Engineering

The context window is your most expensive and limited resource. Learn to treat it like premium real estate—every token should earn its place.

Tags: LLM · Architecture · Performance

The Core Idea

Treat context window space like premium real estate. Every token must justify its presence with value delivered.

Most AI applications waste a large share of their context window on poorly structured prompts, redundant information, and unoptimized retrieval results. That waste costs money, slows responses, and degrades answer quality.

Context Window Engineering is the discipline of maximizing value per token.

Why This Matters

The Constraints

Model | Context Window | Input cost (per 1M tokens)
GPT-4o | 128K | ~$2.50
Claude 3.5 Sonnet | 200K | ~$3.00
GPT-4o-mini | 128K | ~$0.15

These seem large, but context fills up fast:

  • System prompt: 500-2,000 tokens
  • Conversation history: 500-5,000 tokens per exchange
  • RAG results: 500-10,000 tokens
  • User query: 50-500 tokens
  • Reserved for response: 2,000-4,000 tokens

A 128K window sounds infinite, but real systems cap the working budget for cost and latency reasons (the budgeting example below uses 16K), and a few RAG-heavy exchanges will have you rationing tokens.
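
To see how quickly that happens, here is a back-of-the-envelope tally using the midpoints of the ranges above. The numbers are illustrative, not measurements.

# Rough per-request tally using midpoints of the ranges above (illustrative only)
SYSTEM_PROMPT = 1_250
RAG_RESULTS = 5_250
USER_QUERY = 275
RESPONSE_RESERVE = 3_000
HISTORY_PER_EXCHANGE = 2_750  # history keeps growing every turn

def tokens_needed(exchanges: int) -> int:
    """Total prompt tokens after a given number of exchanges."""
    return (SYSTEM_PROMPT + RAG_RESULTS + USER_QUERY + RESPONSE_RESERVE
            + HISTORY_PER_EXCHANGE * exchanges)

for n in (1, 3, 5):
    print(f"Exchange {n}: ~{tokens_needed(n):,} tokens")
# Exchange 1: ~12,525 · Exchange 3: ~18,025 · Exchange 5: ~23,525
# Already past a 16K working budget by the third exchange.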

The Quality Paradox

Longer context ≠ better answers.

In experiments with long-context models:

  • Relevant information in the middle of context is often ignored ("lost in the middle" effect)
  • Too much information increases hallucination risk
  • Irrelevant context dilutes attention on relevant passages

More is not better. Better is better.

The Framework

Layer 1: System Prompt Optimization

Your system prompt sets the foundation. It should be concise, precise, and structured.

Before (567 tokens):

You are a helpful AI assistant for Acme Corp. You help employees with their questions about company policies, HR matters, benefits, and general workplace information. You should be professional, accurate, and helpful. If you don't know something, say so. Don't make up information. Be sure to cite your sources when providing information from company documents. You have access to company documentation including the employee handbook, benefits guide, vacation policy, expense policy, IT security policy, and other official documents. Always be respectful and maintain confidentiality...

After (147 tokens):

You are Acme Corp's HR assistant.

RULES:
- Answer only from provided context
- Cite sources: [DocName, Section]
- Say "I don't have information on that" if unsure
- Never discuss: salaries, PIPs, terminations

FORMAT: Brief answer first, then details if needed.

Token savings: 74% reduction, clearer behavior.
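
To verify savings like this, count tokens with the tokenizer your model actually uses. A minimal sketch with tiktoken; the o200k_base encoding is an assumption for GPT-4o-family models, so check your model's docs.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-family encoding; adjust per model

def prompt_tokens(text: str) -> int:
    """Count tokens the way the model will see them."""
    return len(enc.encode(text))

before = "You are a helpful AI assistant for Acme Corp. ..."  # paste the verbose prompt
after = "You are Acme Corp's HR assistant. ..."               # paste the optimized prompt

saving = 1 - prompt_tokens(after) / prompt_tokens(before)
print(f"Reduction: {saving:.0%}")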


Layer 2: Retrieval Compression

RAG often returns verbose chunks. Compress before inserting into context.

Technique 1: Extractive Summarization

def compress_chunk(chunk: str, query: str, max_tokens: int = 200) -> str:
    """Extract only query-relevant sentences from a retrieved chunk.

    Assumes three helpers exist elsewhere in your codebase:
    split_sentences (e.g. a sentence tokenizer), semantic_similarity
    (e.g. cosine similarity of sentence embeddings), and count_tokens
    (e.g. length of the tokenized text).
    """
    sentences = split_sentences(chunk)

    # Score each sentence by relevance to the query, most relevant first
    scored = [(s, semantic_similarity(s, query)) for s in sentences]
    scored.sort(key=lambda x: x[1], reverse=True)

    # Greedily take the most relevant sentences until the token budget is exhausted
    compressed = []
    tokens_used = 0
    for sentence, _ in scored:
        sent_tokens = count_tokens(sentence)
        if tokens_used + sent_tokens <= max_tokens:
            compressed.append(sentence)
            tokens_used += sent_tokens

    # Re-emit the selected sentences in their original order so the text still reads naturally
    compressed.sort(key=sentences.index)
    return " ".join(compressed)

Technique 2: LLM Preprocessing

def llm_compress(chunks: list[str], query: str) -> str:
    """Use a cheap LLM to pre-summarize chunks.

    llm_mini is assumed to be your client for a small, inexpensive model
    (e.g. GPT-4o-mini) with a complete() method.
    """
    prompt = f"""
    Query: {query}

    Extract ONLY the information relevant to answering this query.
    Be extremely concise. Use bullet points.

    Documents:
    {chr(10).join(chunks)}
    """

    return llm_mini.complete(prompt, max_tokens=500)

Savings: 60-80% context reduction with minimal quality loss.


Layer 3: Conversation History Management

Conversation history grows linearly. Without management, it dominates context.

Strategy 1: Sliding Window

def get_context_history(messages: list, max_tokens: int = 2000) -> list:
    """Keep recent messages that fit in budget."""
    result = []
    tokens = 0
    
    # Work backwards from most recent
    for msg in reversed(messages):
        msg_tokens = count_tokens(msg["content"])
        if tokens + msg_tokens > max_tokens:
            break
        result.insert(0, msg)
        tokens += msg_tokens
    
    return result

Strategy 2: Summarized History

def summarize_history(messages: list, keep_recent: int = 4) -> list:
    """Summarize old messages, keep recent ones verbatim.

    Assumes llm is your model client and format_messages renders a
    message list as plain text for the summarization prompt.
    """
    if len(messages) <= keep_recent:
        return messages

    old = messages[:-keep_recent]
    recent = messages[-keep_recent:]

    summary = llm.complete(f"""
        Summarize this conversation in 2-3 sentences,
        focusing on key facts and decisions:

        {format_messages(old)}
    """)

    return [{"role": "system", "content": f"Earlier: {summary}"}] + recent

Strategy 3: Smart Pruning

def prune_history(messages: list, query: str) -> list:
    """Keep only the messages relevant to the current query.

    semantic_similarity is the same embedding-based helper assumed above;
    the 0.5 threshold is a starting point to tune against your own data.
    """
    # Always keep the last 2 messages verbatim for conversational continuity
    recent = messages[-2:]
    older = messages[:-2]

    # From the older messages, keep only those relevant to the current query
    relevant = [m for m in older
                if semantic_similarity(m["content"], query) > 0.5]

    return relevant + recent

Layer 4: Token Budgeting

Explicitly allocate context budget:

class ContextBudget:
    def __init__(self, total_tokens: int = 16000):
        self.total = total_tokens
        self.allocations = {
            "system": 0.10,      # 10% for system prompt
            "history": 0.20,    # 20% for conversation
            "retrieval": 0.45,  # 45% for RAG results
            "query": 0.05,      # 5% for user query
            "response": 0.20,   # 20% reserved for response
        }
    
    def get_budget(self, component: str) -> int:
        return int(self.total * self.allocations[component])
    
    def remaining_for_retrieval(self, used: dict) -> int:
        """Dynamic adjustment based on actual usage."""
        fixed_usage = sum(used.values())
        response_reserve = self.get_budget("response")
        return self.total - fixed_usage - response_reserve

# Usage
budget = ContextBudget(total_tokens=16000)

system_tokens = budget.get_budget("system")       # 1,600
history_tokens = budget.get_budget("history")     # 3,200
retrieval_tokens = budget.get_budget("retrieval") # 7,200
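
remaining_for_retrieval is the dynamic half of this: measure what the fixed components actually consumed, then hand everything left (minus the response reserve) to retrieval. A sketch of that flow; the token counts and the top_chunks/query variables are illustrative.

# After building the fixed parts, measure what they actually consumed
used = {"system": 1_450, "history": 2_100, "query": 180}

retrieval_budget = budget.remaining_for_retrieval(used)  # 16,000 - 3,730 - 3,200 = 9,070

# Spread the remaining budget across the chunks you plan to include
per_chunk = retrieval_budget // len(top_chunks)
context_chunks = [compress_chunk(c, query, max_tokens=per_chunk) for c in top_chunks]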

Layer 5: Structured Formatting

How you format context matters enormously.

Verbose (wastes tokens):

The following is information from the employee handbook regarding vacation policy. According to the document titled "Employee Handbook 2024" in the section about "Paid Time Off", it states that full-time employees are entitled to fifteen days of paid vacation per year, which accrues monthly at a rate of 1.25 days per month...

Optimized (same information, fewer tokens):

[employee-handbook-2024, §Paid Time Off]
• Full-time: 15 days/year, accrues 1.25 days/month
• Rollover: max 5 days to next year
• Request: 2 weeks advance notice required

Token reduction: 60%+
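
If your retrieval layer returns structured records, this compact format can be produced mechanically rather than asked of the model. A minimal sketch, assuming each record carries a source, a section, and a list of short facts; the field names are placeholders.

def format_record(record: dict) -> str:
    """Render a retrieved record as a compact, citable block."""
    header = f"[{record['source']}, §{record['section']}]"
    bullets = "\n".join(f"• {fact}" for fact in record["facts"])
    return f"{header}\n{bullets}"

print(format_record({
    "source": "employee-handbook-2024",
    "section": "Paid Time Off",
    "facts": [
        "Full-time: 15 days/year, accrues 1.25 days/month",
        "Rollover: max 5 days to next year",
    ],
}))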

Measuring Context Efficiency

Track these metrics:

1. Token Efficiency Ratio

efficiency = quality_score / tokens_used

Compare responses with different context strategies at the same quality level.

2. Context Utilization

utilization = relevant_tokens / total_tokens

What percentage of your context is actually contributing to the answer?

3. Cost per Quality Point

cost_efficiency = (token_cost * tokens_used) / quality_score

Are you paying more for marginal improvements?
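
A minimal sketch of tracking all three per request, assuming you already log token counts and have some quality score (for example from an offline eval harness); the field names are placeholders.

from dataclasses import dataclass

@dataclass
class RequestMetrics:
    tokens_used: int        # total prompt tokens sent
    relevant_tokens: int    # tokens the answer actually drew on (e.g. cited spans)
    quality_score: float    # 0-1 score from your eval harness
    price_per_token: float  # input price, e.g. 2.50 / 1_000_000 for GPT-4o

    @property
    def token_efficiency(self) -> float:
        return self.quality_score / self.tokens_used

    @property
    def context_utilization(self) -> float:
        return self.relevant_tokens / self.tokens_used

    @property
    def cost_per_quality_point(self) -> float:
        return (self.price_per_token * self.tokens_used) / self.quality_score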

Common Anti-Patterns

Anti-Pattern 1: "Dump Everything"

Retrieving 20 documents "just in case" and stuffing them all into context.

Fix: Retrieve more, select fewer. Use reranking to pick top 3-5.
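
One common way to "retrieve more, select fewer" is a cross-encoder reranker. A sketch using sentence-transformers; the model name is just a widely used default, and retrieve() stands in for your existing retrieval call.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep only the best few."""
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]

# Over-retrieve, then let the reranker pick what actually enters the context window
candidates = retrieve(query, k=20)          # retrieve() = your existing retrieval call
context_chunks = rerank(query, candidates)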

Anti-Pattern 2: "Verbose System Prompts"

Multi-paragraph instructions that could be bullet points.

Fix: Rewrite system prompts quarterly. Challenge every sentence.

Anti-Pattern 3: "Full Conversation History"

Keeping every message since conversation start.

Fix: Summarize or prune aggressively. Users rarely notice.

Anti-Pattern 4: "Document as Context"

Inserting full documents instead of relevant excerpts.

Fix: Always chunk and select. Full documents waste tokens.
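
A minimal token-window chunker, as one way to "chunk and select"; the sizes are common starting points, not prescriptions.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def chunk_document(text: str, chunk_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split a document into overlapping token windows for indexing."""
    ids = enc.encode(text)
    step = chunk_tokens - overlap
    return [enc.decode(ids[start:start + chunk_tokens])
            for start in range(0, len(ids), step)]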

The ROI

Context Window Engineering typically delivers:

  • 40-60% cost reduction — Fewer tokens per request
  • 30-50% latency improvement — Less to process
  • 10-20% quality improvement — Less noise, more signal

The ROI compounds: lower costs enable more iterations, better testing, and faster shipping.

Conclusion

Context windows will keep growing, but the principles remain constant:

  1. Every token must earn its place
  2. Compress without losing meaning
  3. Prioritize relevance over completeness
  4. Budget explicitly, measure continuously

The best AI engineers treat context like a scarce resource—because it is.


The difference between a $10K/month AI bill and a $100K/month AI bill is often just context window engineering.

What's your context utilization rate?


Abhinav Mahajan

AI Product & Engineering Leader

Building AI systems that work in production. These frameworks come from real experience shipping enterprise AI products.
