Lesson 2.3: Context Window - Memory Limitations
Duration: 50 minutes
Learning Objectives
By the end of this lesson, you will be able to:
- Define what a context window is and why it matters
- Compare context window sizes across different models
- Implement strategies to work within context limits
- Manage long conversations without losing important information
- Decide when to use large-context models vs. other approaches
Introduction
Imagine having a conversation where you can only remember the last few sentences. That is essentially what an LLM faces with its context window. The context window is the model's "working memory" - everything it can consider when generating a response.
Understanding context windows is crucial for building applications that handle long documents, extended conversations, or complex multi-step tasks.
What is a Context Window?
The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request. This includes:
- Your system prompt
- The conversation history
- Your current message
- The model's response
┌─────────────────────────────────────────────────┐
│                 Context Window                  │
│                                                 │
│  ┌───────────────────────────────────────────┐  │
│  │ System Prompt (500 tokens)                │  │
│  ├───────────────────────────────────────────┤  │
│  │ User: Previous message (200 tokens)       │  │
│  ├───────────────────────────────────────────┤  │
│  │ Assistant: Previous response (300 tokens) │  │
│  ├───────────────────────────────────────────┤  │
│  │ User: Current message (100 tokens)        │  │
│  ├───────────────────────────────────────────┤  │
│  │ Assistant: [Response being generated]     │  │
│  └───────────────────────────────────────────┘  │
│                                                 │
│  Total used:     1,100 tokens                   │
│  Context limit:  128,000 tokens                 │
│  Remaining:      126,900 tokens                 │
└─────────────────────────────────────────────────┘
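A quick way to reason about this budget in code is to add up the tokens for each part of the request and compare the total against the model's limit. The snippet below simply mirrors the numbers in the diagram above.

// Back-of-the-envelope budget check using the diagram's example numbers
const contextLimit = 128_000;

const tokenCounts = {
  systemPrompt: 500,
  previousUserMessage: 200,
  previousAssistantResponse: 300,
  currentUserMessage: 100,
};

const used = Object.values(tokenCounts).reduce((sum, n) => sum + n, 0);
console.log(`Total used: ${used} tokens`); // 1,100
console.log(`Remaining: ${contextLimit - used} tokens`); // 126,900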
Why Context Windows Are Limited
- Memory: Each token requires storing attention information for every other token
- Computation: Processing time grows quadratically with context length
- Cost: Longer contexts require more GPU memory and compute
For standard self-attention, the relationship is roughly quadratic: memory and compute scale as O(n^2), where n is the context length. Doubling the context roughly quadruples the resources needed.
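To make the quadratic growth concrete, the short sketch below counts attention pairs (every token attending to every other token) for a few context lengths. Real systems use optimizations such as FlashAttention, so treat this as an order-of-magnitude illustration rather than an exact cost model.

// Illustration only: standard self-attention compares every token with every other token,
// so the number of pairs grows as n^2.
for (const n of [1_000, 2_000, 4_000, 8_000]) {
  console.log(`${n.toLocaleString()} tokens -> ${(n * n).toLocaleString()} attention pairs`);
}
// Doubling the context (1,000 -> 2,000 tokens) quadruples the pairs (1,000,000 -> 4,000,000).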
Context Window Sizes
Different models offer different context sizes:
| Model | Context Window | Approx. Pages* |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~20 pages |
| GPT-4 Turbo | 128K tokens | ~160 pages |
| GPT-4o | 128K tokens | ~160 pages |
| Claude 3 Haiku | 200K tokens | ~250 pages |
| Claude 3.5 Sonnet | 200K tokens | ~250 pages |
| Claude 3 Opus | 200K tokens | ~250 pages |
| Gemini 1.5 Pro | 1M tokens | ~1,250 pages |
| Gemini 1.5 Flash | 1M tokens | ~1,250 pages |
* Assuming ~800 tokens per page of text
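If you want the same back-of-the-envelope conversion in code, a tiny helper is enough. The 800 tokens-per-page figure is just the table's own assumption, not a universal constant.

// Rough pages estimate using the table's assumption of ~800 tokens per page
const TOKENS_PER_PAGE = 800;

function approxPages(contextTokens: number): number {
  return Math.round(contextTokens / TOKENS_PER_PAGE);
}

console.log(approxPages(128_000)); // ~160 pages
console.log(approxPages(1_000_000)); // ~1,250 pages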
What Can Fit in Different Context Sizes?
| Context Size | Real-World Equivalent |
|---|---|
| 4K tokens | A long email or short article |
| 16K tokens | A research paper |
| 128K tokens | A short novel |
| 200K tokens | Multiple research papers |
| 1M tokens | Several books |
The "Lost in the Middle" Problem
Research has shown that LLMs do not treat all parts of the context equally. Information in the middle of long contexts is often "lost" or given less attention.
┌─────────────────────────────────────────────────────────┐
│         Attention Distribution in Long Context          │
│                                                         │
│  Attention                                              │
│  Level                                                  │
│     ▲                                                   │
│     │ ██                                    ██          │
│     │ ██                                    ██          │
│     │ ██                                    ██          │
│     │ ██  ░░  ░░  ░░  ░░  ░░  ░░            ██          │
│     │ ██  ░░  ░░  ░░  ░░  ░░  ░░            ██          │
│     └──────────────────────────────────────────►       │
│      Start              Middle              End        │
│                                                         │
│  The model pays more attention to the beginning and     │
│  end of the context than to the middle.                 │
└─────────────────────────────────────────────────────────┘
Implications for Your Applications
// BAD: Important information buried in the middle
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: "Here's a long document..." }, // 50K tokens
  // Critical instruction buried here
  { role: 'user', content: 'The key requirement is X' }, // Middle
  { role: 'user', content: 'More context...' }, // 50K tokens
  { role: 'user', content: 'Summarize the document.' },
];

// BETTER: Important information at start or end
const betterMessages = [
  {
    role: 'system',
    content: 'You are a helpful assistant. KEY REQUIREMENT: X', // At start
  },
  { role: 'user', content: "Here's a long document..." },
  { role: 'user', content: 'More context...' },
  {
    role: 'user',
    content: 'Summarize the document. Remember: X', // At end
  },
];
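One way to apply this pattern consistently is a small helper that repeats a key instruction at both ends of the message list. This is a sketch of the idea, not a library API; it assumes the first message in the array is the system prompt.

// Sketch: place a critical instruction at the start (system prompt) and at the end
// (final reminder), where the model is most likely to attend to it.
// Assumes messages[0] is the system prompt.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

function emphasizeKeyInstruction(messages: ChatMessage[], instruction: string): ChatMessage[] {
  const [system, ...rest] = messages;
  return [
    { ...system, content: `${system.content}\nKEY REQUIREMENT: ${instruction}` },
    ...rest,
    { role: 'user', content: `Before answering, remember: ${instruction}` },
  ];
}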
Strategies for Managing Context
Strategy 1: Sliding Window
Keep only the most recent messages:
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function applySlidingWindow(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  const result: Message[] = [];
  let totalTokens = 0;

  // Always include the system message
  const systemMessage = messages.find((m) => m.role === 'system');
  if (systemMessage) {
    result.push(systemMessage);
    totalTokens += countTokens(systemMessage.content);
  }

  // Add messages from the end until we hit the limit
  const nonSystemMessages = messages.filter((m) => m.role !== 'system');
  for (let i = nonSystemMessages.length - 1; i >= 0; i--) {
    const msg = nonSystemMessages[i];
    const msgTokens = countTokens(msg.content);
    if (totalTokens + msgTokens > maxTokens) {
      break;
    }
    // Insert after the system message (or at the front if there is none)
    result.splice(systemMessage ? 1 : 0, 0, msg);
    totalTokens += msgTokens;
  }

  return result;
}

// Usage
const trimmedMessages = applySlidingWindow(messages, 4000, countTokens);
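The `countTokens` argument above is yours to supply. A rough character-based estimate (the same ~4 characters-per-token heuristic used later in this lesson) is usually good enough for trimming decisions; for exact counts you would use the model's tokenizer (for example, a library such as tiktoken), which this sketch deliberately avoids.

// Rough token estimator: ~4 characters per token for English text.
// Good enough for trimming decisions; use the model's tokenizer for exact counts.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}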
Strategy 2: Summarization
Periodically summarize older messages:
async function summarizeHistory(messages: Message[], openai: OpenAI): Promise<string> {
  const historyText = messages.map((m) => `${m.role}: ${m.content}`).join('\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // Use cheaper model for summarization
    messages: [
      {
        role: 'system',
        content: 'Summarize this conversation, preserving key facts and decisions.',
      },
      { role: 'user', content: historyText },
    ],
    max_tokens: 500,
  });

  return response.choices[0].message.content || '';
}
// Use in your chat application
class SmartChat {
  private openai = new OpenAI();
  private messages: Message[] = [];
  private summaryThreshold = 3000; // Tokens before summarizing

  private countTotalTokens(): number {
    // Rough estimate: ~4 characters per token
    return this.messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
  }

  async addMessage(message: Message): Promise<void> {
    this.messages.push(message);
    const totalTokens = this.countTotalTokens();
    if (totalTokens > this.summaryThreshold) {
      await this.compressHistory();
    }
  }

  private async compressHistory(): Promise<void> {
    // Keep system message and last 2 exchanges
    const toKeep = this.messages.slice(-4);
    const toSummarize = this.messages.slice(1, -4);
    if (toSummarize.length > 0) {
      const summary = await summarizeHistory(toSummarize, this.openai);
      this.messages = [
        this.messages[0], // System message
        { role: 'assistant', content: `[Previous conversation summary: ${summary}]` },
        ...toKeep,
      ];
    }
  }
}
Strategy 3: Chunking Long Documents
Break documents into chunks and process separately:
function chunkText(text: string, maxChunkTokens: number, overlap: number = 100): string[] {
  const chunks: string[] = [];
  const words = text.split(/\s+/);
  // Rough conversion: ~0.75 words per token, so a chunk of N tokens holds about 0.75 * N words
  const wordsPerChunk = Math.floor(maxChunkTokens * 0.75);

  // Step forward by (chunk size - overlap) words so consecutive chunks share `overlap` words of context
  for (let i = 0; i < words.length; i += wordsPerChunk - overlap) {
    const chunk = words.slice(i, i + wordsPerChunk).join(' ');
    chunks.push(chunk);
  }

  return chunks;
}
async function processLongDocument(
  document: string,
  question: string,
  openai: OpenAI
): Promise<string> {
  const chunks = chunkText(document, 3000);

  // Process each chunk
  const chunkResults = await Promise.all(
    chunks.map(async (chunk) => {
      const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          {
            role: 'system',
            content: `Extract information relevant to: "${question}"`,
          },
          { role: 'user', content: chunk },
        ],
      });
      return response.choices[0].message.content;
    })
  );

  // Combine results
  const combinedResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'Synthesize these extracted pieces into a coherent answer.',
      },
      { role: 'user', content: chunkResults.join('\n\n---\n\n') },
    ],
  });

  return combinedResponse.choices[0].message.content || '';
}
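Usage might look like the following; `longReport` and the question are placeholders for whatever document and query your application provides.

// Illustrative usage with placeholder inputs
const openai = new OpenAI();
const longReport = '...'; // imagine a document far larger than a single context window

const answer = await processLongDocument(longReport, 'What were the main findings?', openai);
console.log(answer);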
Strategy 4: Smart Context Selection (RAG)
Only include relevant information in the context:
// This is a simplified RAG approach
// Full RAG is covered in Course 5
interface DocumentChunk {
  content: string;
  embedding: number[];
}

async function selectRelevantContext(
  query: string,
  documents: DocumentChunk[],
  topK: number = 5
): Promise<string[]> {
  // Get query embedding
  const queryEmbedding = await getEmbedding(query);

  // Calculate similarity scores
  const scored = documents.map((doc) => ({
    content: doc.content,
    score: cosineSimilarity(queryEmbedding, doc.embedding),
  }));

  // Return top K most relevant chunks
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((item) => item.content);
}

// Usage
const relevantChunks = await selectRelevantContext(userQuestion, documentChunks);
const context = relevantChunks.join('\n\n');

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'Answer based on the provided context.' },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${userQuestion}` },
  ],
});
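The snippet above assumes two helpers that are not shown: `getEmbedding` and `cosineSimilarity`. Here is a minimal sketch of both, assuming the `openai` package is imported as in the other examples and using `text-embedding-3-small` as an arbitrary choice of embedding model; use whichever model your stack provides (full RAG, including embeddings, is covered in Course 5).

// Sketch of the helper functions assumed above
async function getEmbedding(text: string, openai: OpenAI = new OpenAI()): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // assumption: any embedding model works here
    input: text,
  });
  return response.data[0].embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}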
Practical Example: Context-Aware Chat Application
import OpenAI from 'openai';

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
  tokens?: number;
}

class ContextAwareChat {
  private openai: OpenAI;
  private messages: Message[] = [];
  private maxContextTokens: number;
  private reserveForResponse: number = 1000;

  constructor(options: { systemPrompt: string; maxContextTokens?: number }) {
    this.openai = new OpenAI();
    this.maxContextTokens = options.maxContextTokens || 8000;
    this.messages.push({
      role: 'system',
      content: options.systemPrompt,
      tokens: this.estimateTokens(options.systemPrompt),
    });
  }

  private estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }

  private getTotalTokens(): number {
    return this.messages.reduce((sum, msg) => sum + (msg.tokens || 0), 0);
  }

  private trimToFit(): void {
    const targetTokens = this.maxContextTokens - this.reserveForResponse;
    while (this.getTotalTokens() > targetTokens && this.messages.length > 2) {
      // Remove oldest non-system message
      const removed = this.messages.splice(1, 1)[0];
      console.log(`Trimmed message (${removed.tokens} tokens)`);
    }
  }

  async chat(userMessage: string): Promise<string> {
    // Add user message
    this.messages.push({
      role: 'user',
      content: userMessage,
      tokens: this.estimateTokens(userMessage),
    });

    // Ensure we fit in context
    this.trimToFit();

    // Log context usage
    const usedTokens = this.getTotalTokens();
    console.log(`Context: ${usedTokens}/${this.maxContextTokens} tokens`);

    // Make API call
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: this.messages.map(({ role, content }) => ({ role, content })),
      max_tokens: this.reserveForResponse,
    });

    const assistantMessage = response.choices[0].message.content || '';

    // Add assistant response
    this.messages.push({
      role: 'assistant',
      content: assistantMessage,
      tokens: this.estimateTokens(assistantMessage),
    });

    // Log actual usage
    console.log(`Actual tokens used: ${response.usage?.total_tokens}`);

    return assistantMessage;
  }

  getConversationLength(): number {
    return this.messages.length;
  }

  clearHistory(): void {
    // Keep only system message
    this.messages = [this.messages[0]];
  }
}

// Usage
const chat = new ContextAwareChat({
  systemPrompt: 'You are a helpful TypeScript tutor.',
  maxContextTokens: 4000,
});

// Simulate a long conversation
for (let i = 0; i < 20; i++) {
  const response = await chat.chat(`Question ${i + 1}: Explain a TypeScript concept.`);
  console.log(`Response ${i + 1}:`, response.slice(0, 100) + '...\n');
}
When to Use Large Context Models
Large context models (100K+ tokens) are useful for:
| Use Case | Why Large Context Helps |
|---|---|
| Long document analysis | Entire document fits in context |
| Code review | Full codebase context |
| Legal document review | Complete contract analysis |
| Book summarization | No chunking needed |
| Extended conversations | Full history retained |
Trade-offs of Large Context
| Consideration | Large Context | Small Context + Techniques |
|---|---|---|
| Simplicity | Just stuff everything in | Need chunking/RAG |
| Cost | Higher per request | Lower per request |
| Latency | Slower responses | Faster responses |
| "Lost in middle" | More severe | Less of an issue |
| Precision | May use irrelevant info | Only relevant info (with RAG) |
Sometimes smaller context with smart selection beats large context:
// Option A: Use the entire document with a large-context model
const responseA = await openai.chat.completions.create({
  model: 'gpt-4o', // 128K-token context window
  messages: [{ role: 'user', content: `${entireDocument}\n\nQuestion: ${question}` }],
});
// Cost: high, and the context may include irrelevant information

// Option B: Use RAG with a smaller context
const relevantChunks = await retrieveRelevant(question, documentChunks);
const responseB = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: `${relevantChunks.join('\n\n')}\n\nQuestion: ${question}` }],
});
// Cost: lower, and the context is focused on relevant information
Exercises
Exercise 1: Calculate Context Usage
You have a chatbot with:
- System prompt: 500 tokens
- Average user message: 100 tokens
- Average assistant response: 300 tokens
- Context limit: 8,000 tokens
How many conversation turns can you fit before needing to trim?
Solution
Each turn = user message (100) + assistant response (300) = 400 tokens
Available for conversation = 8,000 - 500 (system) = 7,500 tokens
Maximum turns = 7,500 / 400 = 18.75
You can fit approximately 18 full turns before needing to trim.
In practice, you would want to start trimming earlier to leave room for the response.
Exercise 2: Implement Message Trimming
Complete this function to trim messages while keeping the system message and most recent exchanges:
function trimMessages(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  // Your implementation here
}
Solution
function trimMessages(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  // Always keep system message
  const systemMsg = messages.find((m) => m.role === 'system');
  const nonSystemMsgs = messages.filter((m) => m.role !== 'system');

  const result: Message[] = systemMsg ? [systemMsg] : [];
  let totalTokens = systemMsg ? countTokens(systemMsg.content) : 0;

  // Add messages from the end (most recent) until we hit the limit
  for (let i = nonSystemMsgs.length - 1; i >= 0; i--) {
    const msg = nonSystemMsgs[i];
    const msgTokens = countTokens(msg.content);
    if (totalTokens + msgTokens <= maxTokens) {
      result.splice(systemMsg ? 1 : 0, 0, msg);
      totalTokens += msgTokens;
    } else {
      break; // Stop when we exceed the limit
    }
  }

  return result;
}
Exercise 3: Choose the Right Strategy
For each scenario, which context management strategy would you recommend?
- A customer service chatbot that needs to remember the entire conversation
- Analyzing a 500-page legal contract
- A coding assistant that needs access to a large codebase
- A quiz app that asks questions about a short article
Solution
- Customer service chatbot: Summarization strategy - Periodically summarize older messages to compress history while retaining key context.
- Legal contract analysis: Chunking + RAG - Break the document into chunks, use embeddings to find relevant sections, and include only those in context.
- Coding assistant with large codebase: RAG with code embeddings - Index the codebase and retrieve relevant files/functions based on the query. Large-context models can help but may be expensive.
- Quiz app with short article: Simple inclusion - If the article fits in context (< 4K tokens), just include the whole thing. No special strategy needed.
Key Takeaways
- Context window is the LLM's working memory - everything must fit: prompt, history, and response
- Bigger is not always better - large contexts are slower, more expensive, and suffer from "lost in the middle"
- Use sliding window for simple conversation trimming
- Use summarization to compress long conversation history
- Use chunking for documents larger than the context window
- Use RAG for precise, relevant context selection
- Place important information at the start and end of your context
Resources
| Resource | Type | Description |
|---|---|---|
| Lost in the Middle Paper | Paper | Research on context position effects |
| OpenAI Context Length Guide | Documentation | Managing context in GPT models |
| Anthropic Long Context | Documentation | Best practices for Claude's 200K context |
| LangChain Memory | Documentation | Memory management patterns |
Next Lesson
Now that you understand context windows, let us explore how to control the randomness and creativity of model outputs with temperature and other parameters.