The Core Idea
Treat context window space like premium real estate. Every token must justify its presence with value delivered.
Many AI applications waste a large share of their context window on poorly structured prompts, redundant information, and unoptimized retrieval results. That waste costs money, slows responses, and degrades quality.
Context Window Engineering is the discipline of maximizing value per token.
Why This Matters
The Constraints
| Model | Context Window | Cost (input per 1M tokens) |
|---|---|---|
| GPT-4o | 128K | ~$2.50 |
| Claude 3.5 Sonnet | 200K | ~$3.00 |
| GPT-4o-mini | 128K | ~$0.15 |
These seem large, but context fills up fast:
- System prompt: 500-2,000 tokens
- Conversation history: 500-5,000 tokens per exchange
- RAG results: 500-10,000 tokens
- User query: 50-500 tokens
- Reserved for response: 2,000-4,000 tokens
A 128K window sounds infinite until a naive chat loop that appends retrieval results to history has you near capacity within a dozen exchanges, as the rough tally below shows.
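To make that concrete, here is a back-of-the-envelope sketch of such a naive loop. The per-component figures are the upper ends of the ranges above and are purely illustrative:

```python
# Rough illustration only: a naive chat loop that appends each turn's RAG
# results and reply to history. Figures are the upper ends of the ranges above.
WINDOW = 128_000
SYSTEM = 2_000
RESERVE = 4_000                          # reserved for the model's response
PER_EXCHANGE = 10_000 + 500 + 4_000      # RAG results + user query + prior reply

history = 0
for turn in range(1, 13):
    used = SYSTEM + history + PER_EXCHANGE + RESERVE
    print(f"turn {turn:2d}: {used:>7,} tokens ({used / WINDOW:.0%} of 128K)")
    if used > WINDOW:
        print("  -> over budget: truncate, summarize, or fail")
        break
    history += PER_EXCHANGE              # the naive part: everything is kept
```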
The Quality Paradox
Longer context ≠ better answers.
In experiments with long-context models:
- Relevant information in the middle of context is often ignored ("lost in the middle" effect)
- Too much information increases hallucination risk
- Irrelevant context dilutes attention on relevant passages
More is not better. Better is better.
The Framework
Layer 1: System Prompt Optimization
Your system prompt sets the foundation. It should be concise, precise, and structured.
Before (567 tokens):
You are a helpful AI assistant for Acme Corp. You help employees with their questions about company policies, HR matters, benefits, and general workplace information. You should be professional, accurate, and helpful. If you don't know something, say so. Don't make up information. Be sure to cite your sources when providing information from company documents. You have access to company documentation including the employee handbook, benefits guide, vacation policy, expense policy, IT security policy, and other official documents. Always be respectful and maintain confidentiality...
After (147 tokens):
You are Acme Corp's HR assistant.
RULES:
- Answer only from provided context
- Cite sources: [DocName, Section]
- Say "I don't have information on that" if unsure
- Never discuss: salaries, PIPs, terminations
FORMAT: Brief answer first, then details if needed.
Token savings: 74% reduction, clearer behavior.
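Savings like this are easy to verify with a tokenizer. The sketch below assumes the `tiktoken` package and uses the `cl100k_base` encoding as a rough proxy; the file names are hypothetical, and your model's own tokenizer may count slightly differently:

```python
# Quick check of prompt sizes; assumes `pip install tiktoken`.
# cl100k_base is a rough proxy -- prefer your model's own tokenizer if available.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(enc.encode(text))

before = open("system_prompt_before.txt").read()   # hypothetical files
after = open("system_prompt_after.txt").read()
saved = 1 - token_count(after) / token_count(before)
print(f"{token_count(before)} -> {token_count(after)} tokens ({saved:.0%} saved)")
```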
Layer 2: Retrieval Compression
RAG often returns verbose chunks. Compress before inserting into context.
Technique 1: Extractive Summarization
```python
def compress_chunk(chunk: str, query: str, max_tokens: int = 200) -> str:
    """Extract only query-relevant sentences."""
    sentences = split_sentences(chunk)
    # Score each sentence by relevance to the query
    scored = [(s, semantic_similarity(s, query)) for s in sentences]
    scored.sort(key=lambda x: x[1], reverse=True)
    # Take top sentences until the token budget is exhausted
    compressed = []
    tokens_used = 0
    for sentence, score in scored:
        sent_tokens = count_tokens(sentence)
        if tokens_used + sent_tokens <= max_tokens:
            compressed.append(sentence)
            tokens_used += sent_tokens
    return " ".join(compressed)
```
Technique 2: LLM Preprocessing
```python
def llm_compress(chunks: list[str], query: str) -> str:
    """Use a cheap LLM to pre-summarize chunks."""
    prompt = f"""
    Query: {query}
    Extract ONLY the information relevant to answering this query.
    Be extremely concise. Use bullet points.
    Documents:
    {chr(10).join(chunks)}
    """
    return llm_mini.complete(prompt, max_tokens=500)
```
Savings: 60-80% context reduction with minimal quality loss.
Layer 3: Conversation History Management
Conversation history grows linearly. Without management, it dominates context.
Strategy 1: Sliding Window
```python
def get_context_history(messages: list, max_tokens: int = 2000) -> list:
    """Keep recent messages that fit in the budget."""
    result = []
    tokens = 0
    # Work backwards from the most recent message
    for msg in reversed(messages):
        msg_tokens = count_tokens(msg["content"])
        if tokens + msg_tokens > max_tokens:
            break
        result.insert(0, msg)
        tokens += msg_tokens
    return result
```
Strategy 2: Summarized History
```python
def summarize_history(messages: list, keep_recent: int = 4) -> list:
    """Summarize old messages, keep recent ones verbatim."""
    if len(messages) <= keep_recent:
        return messages
    old = messages[:-keep_recent]
    recent = messages[-keep_recent:]
    summary = llm.complete(f"""
    Summarize this conversation in 2-3 sentences,
    focusing on key facts and decisions:
    {format_messages(old)}
    """)
    return [{"role": "system", "content": f"Earlier: {summary}"}] + recent
```
Strategy 3: Smart Pruning
```python
def prune_history(messages: list, query: str) -> list:
    """Keep messages relevant to the current query."""
    scored = []
    for msg in messages:
        relevance = semantic_similarity(msg["content"], query)
        scored.append((msg, relevance))
    # Always keep the last 2 messages for continuity
    recent = messages[-2:]
    older = scored[:-2]
    # From older messages, keep only the highly relevant ones
    relevant = [m for m, score in older if score > 0.5]
    return relevant + recent
```
Layer 4: Token Budgeting
Explicitly allocate context budget:
```python
class ContextBudget:
    def __init__(self, total_tokens: int = 16000):
        self.total = total_tokens
        self.allocations = {
            "system": 0.10,     # 10% for system prompt
            "history": 0.20,    # 20% for conversation
            "retrieval": 0.45,  # 45% for RAG results
            "query": 0.05,      # 5% for user query
            "response": 0.20,   # 20% reserved for response
        }

    def get_budget(self, component: str) -> int:
        return int(self.total * self.allocations[component])

    def remaining_for_retrieval(self, used: dict) -> int:
        """Dynamic adjustment based on actual usage."""
        fixed_usage = sum(used.values())
        response_reserve = self.get_budget("response")
        return self.total - fixed_usage - response_reserve

# Usage
budget = ContextBudget(total_tokens=16000)
system_tokens = budget.get_budget("system")        # 1,600
history_tokens = budget.get_budget("history")      # 3,200
retrieval_tokens = budget.get_budget("retrieval")  # 7,200
```
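The `remaining_for_retrieval` method is what earns its keep in practice: count what the fixed components actually used, then hand the leftover room to retrieval instead of letting it go unused. A minimal sketch, with illustrative token counts:

```python
# Illustrative counts for the fixed components of a single request
used = {"system": 1_480, "history": 2_100, "query": 120}
retrieval_budget = budget.remaining_for_retrieval(used)
# 16,000 total - 3,700 used - 3,200 response reserve = 9,100 tokens for RAG results
```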
Layer 5: Structured Formatting
How you format context matters enormously.
Verbose (wastes tokens):
The following is information from the employee handbook regarding vacation policy. According to the document titled "Employee Handbook 2024" in the section about "Paid Time Off", it states that full-time employees are entitled to fifteen days of paid vacation per year, which accrues monthly at a rate of 1.25 days per month...
Optimized (same information, fewer tokens):
[vacation-policy.pdf, §Paid Time Off]
• Full-time: 15 days/year, accrues 1.25 days/month
• Rollover: max 5 days to next year
• Request: 2 weeks advance notice required
Token reduction: 60%+
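A small formatter can enforce this shape automatically at insertion time. The chunk fields below (source, section, facts) are a hypothetical structure for illustration:

```python
def format_chunk(source: str, section: str, facts: list[str]) -> str:
    """Render a retrieved chunk as a compact, citable context block."""
    header = f"[{source}, §{section}]"
    bullets = "\n".join(f"• {fact}" for fact in facts)
    return f"{header}\n{bullets}"

print(format_chunk(
    "vacation-policy.pdf",
    "Paid Time Off",
    ["Full-time: 15 days/year, accrues 1.25 days/month",
     "Rollover: max 5 days to next year",
     "Request: 2 weeks advance notice required"],
))
```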
Measuring Context Efficiency
Track these metrics:
1. Token Efficiency Ratio
efficiency = quality_score / tokens_used
Compare responses with different context strategies at the same quality level.
2. Context Utilization
utilization = relevant_tokens / total_tokens
What percentage of your context is actually contributing to the answer?
3. Cost per Quality Point
cost_efficiency = (token_cost * tokens_used) / quality_score
Are you paying more for marginal improvements?
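In code these metrics are one-liners; the hard part is producing `quality_score` (from an eval harness or LLM judge) and `relevant_tokens` (e.g., tokens in chunks the answer actually cites). A sketch with illustrative names:

```python
# Sketch only: quality_score and relevant_tokens come from your own eval
# pipeline (LLM judge, citation tracking, etc.); the names are illustrative.
def context_metrics(quality_score: float, tokens_used: int,
                    relevant_tokens: int, price_per_token: float) -> dict:
    return {
        "token_efficiency": quality_score / tokens_used,
        "context_utilization": relevant_tokens / tokens_used,
        "cost_per_quality_point": (price_per_token * tokens_used) / quality_score,
    }
```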
Common Anti-Patterns
Anti-Pattern 1: "Dump Everything"
Retrieving 20 documents "just in case" and stuffing them all into context.
Fix: Retrieve more, select fewer. Use reranking to pick top 3-5.
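A sketch of that fix, reusing the `semantic_similarity` helper assumed in the compression examples above; a dedicated cross-encoder reranker would score query-chunk pairs more accurately, but the shape is the same:

```python
def rerank_and_select(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    """Retrieve wide, then keep only the few chunks that best match the query."""
    ranked = sorted(chunks, key=lambda c: semantic_similarity(c, query), reverse=True)
    return ranked[:top_k]
```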
Anti-Pattern 2: "Verbose System Prompts"
Multi-paragraph instructions that could be bullet points.
Fix: Rewrite system prompts quarterly. Challenge every sentence.
Anti-Pattern 3: "Full Conversation History"
Keeping every message since conversation start.
Fix: Summarize or prune aggressively. Users rarely notice.
Anti-Pattern 4: "Document as Context"
Inserting full documents instead of relevant excerpts.
Fix: Always chunk and select. Full documents waste tokens.
The ROI
Context Window Engineering typically delivers:
- 40-60% cost reduction — Fewer tokens per request
- 30-50% latency improvement — Less to process
- 10-20% quality improvement — Less noise, more signal
The ROI compounds: lower costs enable more iterations, better testing, and faster shipping.
Conclusion
Context windows will keep growing, but the principles remain constant:
- Every token must earn its place
- Compress without losing meaning
- Prioritize relevance over completeness
- Budget explicitly, measure continuously
The best AI engineers treat context like a scarce resource—because it is.
The difference between a $10K/month AI bill and a $100K/month AI bill is often just context window engineering.
What's your context utilization rate?