From Zero to AI

Lesson 2.3: Context Window - Memory Limitations

Duration: 50 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  • Define what a context window is and why it matters
  • Compare context window sizes across different models
  • Implement strategies to work within context limits
  • Manage long conversations without losing important information
  • Decide when to use large-context models vs. other approaches

Introduction

Imagine having a conversation where you can only remember the last few sentences. That is essentially what an LLM faces with its context window. The context window is the model's "working memory" - everything it can consider when generating a response.

Understanding context windows is crucial for building applications that handle long documents, extended conversations, or complex multi-step tasks.


What is a Context Window?

The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request. This includes:

  • Your system prompt
  • The conversation history
  • Your current message
  • The model's response
┌─────────────────────────────────────────────────────────┐
│                     Context Window                      │
│                                                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │ System Prompt (500 tokens)                        │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ User: Previous message (200 tokens)               │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ Assistant: Previous response (300 tokens)          │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ User: Current message (100 tokens)                │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ Assistant: [Response being generated]             │  │
│  └───────────────────────────────────────────────────┘  │
│                                                         │
│  Total used: 1,100 tokens                               │
│  Context limit: 128,000 tokens                          │
│  Remaining: 126,900 tokens                              │
└─────────────────────────────────────────────────────────┘
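
You can track this budget in code by estimating the tokens in each message and subtracting the total from the model's limit. Here is a minimal sketch using the rough 4-characters-per-token heuristic (the helper names are illustrative; a real tokenizer library such as tiktoken is more accurate). Remember that the model's response must fit inside the same window.

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough estimate: ~4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Sum the estimated tokens of every message and report what is left in the window
function contextBudget(messages: ChatMessage[], contextLimit: number) {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return { used, remaining: contextLimit - used };
}

const conversation: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Summarize the attached report in three bullet points.' },
];

const { used, remaining } = contextBudget(conversation, 128_000);
console.log(`Used: ~${used} tokens, remaining: ~${remaining} tokens`);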

Why Context Windows Are Limited

  1. Memory: Each token requires storing attention information for every other token
  2. Computation: Processing time grows quadratically with context length
  3. Cost: Longer contexts require more GPU memory and compute

With standard attention, memory and compute scale roughly as O(n²), where n is the context length: doubling the context roughly quadruples the resources needed.
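
To see why doubling the context quadruples the work, count the token-to-token attention pairs. This is a toy calculation, not a real cost model:

// Every token attends to every other token, so the number of pairs grows as n * n
function attentionPairs(contextLength: number): number {
  return contextLength * contextLength;
}

console.log(attentionPairs(4_000));   //     16,000,000 pairs
console.log(attentionPairs(8_000));   //     64,000,000 pairs (2x the context, 4x the pairs)
console.log(attentionPairs(128_000)); // 16,384,000,000 pairs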


Context Window Sizes

Different models offer different context sizes:

┌────────────────────────────────────────────────────────────┐
│                   Context Window Comparison                 │
├────────────────────────────────────────────────────────────┤
│  Model              │ Context Window │ Approx Pages*       │
├─────────────────────┼────────────────┼─────────────────────┤
│  GPT-3.5 Turbo      │ 16K tokens     │ ~20 pages           │
│  GPT-4 Turbo        │ 128K tokens    │ ~160 pages          │
│  GPT-4o             │ 128K tokens    │ ~160 pages          │
│  Claude 3 Haiku     │ 200K tokens    │ ~250 pages          │
│  Claude 3.5 Sonnet  │ 200K tokens    │ ~250 pages          │
│  Claude 3 Opus      │ 200K tokens    │ ~250 pages          │
│  Gemini 1.5 Pro     │ 1M tokens      │ ~1,250 pages        │
│  Gemini 1.5 Flash   │ 1M tokens      │ ~1,250 pages        │
└────────────────────────────────────────────────────────────┘
* Assuming ~800 tokens per page of text

What Can Fit in Different Context Sizes?

┌──────────────┬───────────────────────────────┐
│ Context Size │ Real-World Equivalent         │
├──────────────┼───────────────────────────────┤
│ 4K tokens    │ A long email or short article │
│ 16K tokens   │ A research paper              │
│ 128K tokens  │ A short novel                 │
│ 200K tokens  │ Multiple research papers      │
│ 1M tokens    │ Several books                 │
└──────────────┴───────────────────────────────┘
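
These rough equivalents make it easy to sanity-check whether a document will fit before you send it. A quick sketch, reusing the ~800 tokens-per-page figure from the comparison table (the helper names and headroom value are just examples):

const TOKENS_PER_PAGE = 800;  // matches the comparison table above
const CHARS_PER_TOKEN = 4;    // rough average for English text

function estimateDocumentTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Leave some headroom for the prompt and the model's response
function fitsInContext(text: string, contextLimit: number, headroom = 2000): boolean {
  return estimateDocumentTokens(text) + headroom <= contextLimit;
}

// A stand-in for ~300 pages of text (~240,000 tokens)
const book = 'x'.repeat(300 * TOKENS_PER_PAGE * CHARS_PER_TOKEN);
console.log(fitsInContext(book, 128_000));   // false: too big for a 128K window
console.log(fitsInContext(book, 1_000_000)); // true: fits comfortably in a 1M window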

The "Lost in the Middle" Problem

Research has shown that LLMs do not treat all parts of the context equally. Information in the middle of long contexts is often "lost" or given less attention.

┌─────────────────────────────────────────────────────────┐
│             Attention Distribution in Long Context       │
│                                                         │
│  Attention                                              │
│  Level                                                  │
│    ▲                                                    │
│    │  ██                                         ██     │
│    │  ██                                         ██     │
│    │  ██                                         ██     │
│    │  ██    ░░    ░░    ░░    ░░    ░░    ░░    ██     │
│    │  ██    ░░    ░░    ░░    ░░    ░░    ░░    ██     │
│    └──────────────────────────────────────────────►    │
│       Start                Middle               End     │
│                                                         │
│  The model pays more attention to the beginning and     │
│  end of the context than to the middle.                 │
└─────────────────────────────────────────────────────────┘

Implications for Your Applications

// BAD: Important information buried in the middle
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: "Here's a long document..." }, // 50K tokens
  // Critical instruction buried here
  { role: 'user', content: 'The key requirement is X' }, // Middle
  { role: 'user', content: 'More context...' }, // 50K tokens
  { role: 'user', content: 'Summarize the document.' },
];

// BETTER: Important information at start or end
const betterMessages = [
  {
    role: 'system',
    content: 'You are a helpful assistant. KEY REQUIREMENT: X', // At start
  },
  { role: 'user', content: "Here's a long document..." },
  { role: 'user', content: 'More context...' },
  {
    role: 'user',
    content: 'Summarize the document. Remember: X', // At end
  },
];

Strategies for Managing Context

Strategy 1: Sliding Window

Keep only the most recent messages:

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function applySlidingWindow(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  const result: Message[] = [];
  let totalTokens = 0;

  // Always include system message
  const systemMessage = messages.find((m) => m.role === 'system');
  if (systemMessage) {
    result.push(systemMessage);
    totalTokens += countTokens(systemMessage.content);
  }

  // Add messages from the end until we hit the limit
  const nonSystemMessages = messages.filter((m) => m.role !== 'system');

  for (let i = nonSystemMessages.length - 1; i >= 0; i--) {
    const msg = nonSystemMessages[i];
    const msgTokens = countTokens(msg.content);

    if (totalTokens + msgTokens > maxTokens) {
      break;
    }

    result.splice(systemMessage ? 1 : 0, 0, msg); // Insert after the system message (or at the front if there is none)
    totalTokens += msgTokens;
  }

  return result;
}

// Usage
const trimmedMessages = applySlidingWindow(messages, 4000, countTokens);

Strategy 2: Summarization

Periodically summarize older messages:

async function summarizeHistory(messages: Message[], openai: OpenAI): Promise<string> {
  const historyText = messages.map((m) => `${m.role}: ${m.content}`).join('\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // Use cheaper model for summarization
    messages: [
      {
        role: 'system',
        content: 'Summarize this conversation, preserving key facts and decisions.',
      },
      { role: 'user', content: historyText },
    ],
    max_tokens: 500,
  });

  return response.choices[0].message.content || '';
}

// Use in your chat application
class SmartChat {
  private openai = new OpenAI();
  private messages: Message[] = [];
  private summaryThreshold = 3000; // Tokens before summarizing

  async addMessage(message: Message): Promise<void> {
    this.messages.push(message);

    const totalTokens = this.countTotalTokens();
    if (totalTokens > this.summaryThreshold) {
      await this.compressHistory();
    }
  }

  private countTotalTokens(): number {
    // Rough estimate: ~4 characters per token
    return this.messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
  }

  private async compressHistory(): Promise<void> {
    // Keep the system message (index 0) plus the last 2 exchanges (4 messages)
    const toKeep = this.messages.slice(-4);
    const toSummarize = this.messages.slice(1, -4);

    if (toSummarize.length > 0) {
      const summary = await summarizeHistory(toSummarize, this.openai);

      this.messages = [
        this.messages[0], // System message
        { role: 'assistant', content: `[Previous conversation summary: ${summary}]` },
        ...toKeep,
      ];
    }
  }
}
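
A brief usage sketch (the prompts here are placeholders):

const smartChat = new SmartChat();

// The first message becomes the system prompt that compressHistory always preserves
await smartChat.addMessage({ role: 'system', content: 'You are a helpful assistant.' });
await smartChat.addMessage({ role: 'user', content: 'Walk me through our deployment options.' });
// ...after enough turns, older messages are automatically replaced by a summary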

Strategy 3: Chunking Long Documents

Break documents into chunks and process separately:

function chunkText(text: string, maxChunkTokens: number, overlap: number = 100): string[] {
  const chunks: string[] = [];
  const words = text.split(/\s+/);
  const wordsPerChunk = Math.floor(maxChunkTokens * 0.75); // ~0.75 words per token

  // Step forward one chunk at a time, keeping `overlap` words shared between adjacent chunks
  for (let i = 0; i < words.length; i += wordsPerChunk - overlap) {
    const chunk = words.slice(i, i + wordsPerChunk).join(' ');
    chunks.push(chunk);
  }

  return chunks;
}

async function processLongDocument(
  document: string,
  question: string,
  openai: OpenAI
): Promise<string> {
  const chunks = chunkText(document, 3000);

  // Process each chunk
  const chunkResults = await Promise.all(
    chunks.map(async (chunk, index) => {
      const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          {
            role: 'system',
            content: `Extract information relevant to: "${question}"`,
          },
          { role: 'user', content: chunk },
        ],
      });
      return response.choices[0].message.content;
    })
  );

  // Combine results
  const combinedResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'Synthesize these extracted pieces into a coherent answer.',
      },
      { role: 'user', content: chunkResults.join('\n\n---\n\n') },
    ],
  });

  return combinedResponse.choices[0].message.content || '';
}

Strategy 4: Smart Context Selection (RAG)

Only include relevant information in the context:

// This is a simplified RAG approach
// Full RAG is covered in Course 5

interface DocumentChunk {
  content: string;
  embedding: number[];
}
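
// Helpers assumed by this sketch (they are not part of the OpenAI SDK itself):
// a shared client, an embedding call, and plain cosine similarity.
// Assumes `import OpenAI from 'openai'` at the top of the file.
const openai = new OpenAI();

// Embed a piece of text (the model name here is just an example)
async function getEmbedding(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return res.data[0].embedding;
}

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (normA * normB);
}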

async function selectRelevantContext(
  query: string,
  documents: DocumentChunk[],
  topK: number = 5
): Promise<string[]> {
  // Get query embedding
  const queryEmbedding = await getEmbedding(query);

  // Calculate similarity scores
  const scored = documents.map((doc) => ({
    content: doc.content,
    score: cosineSimilarity(queryEmbedding, doc.embedding),
  }));

  // Return top K most relevant chunks
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((item) => item.content);
}

// Usage
const relevantChunks = await selectRelevantContext(userQuestion, documentChunks);
const context = relevantChunks.join('\n\n');

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'Answer based on the provided context.' },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${userQuestion}` },
  ],
});

Practical Example: Context-Aware Chat Application

import OpenAI from 'openai';

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
  tokens?: number;
}

class ContextAwareChat {
  private openai: OpenAI;
  private messages: Message[] = [];
  private maxContextTokens: number;
  private reserveForResponse: number = 1000;

  constructor(options: { systemPrompt: string; maxContextTokens?: number }) {
    this.openai = new OpenAI();
    this.maxContextTokens = options.maxContextTokens || 8000;

    this.messages.push({
      role: 'system',
      content: options.systemPrompt,
      tokens: this.estimateTokens(options.systemPrompt),
    });
  }

  private estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }

  private getTotalTokens(): number {
    return this.messages.reduce((sum, msg) => sum + (msg.tokens || 0), 0);
  }

  private trimToFit(): void {
    const targetTokens = this.maxContextTokens - this.reserveForResponse;

    while (this.getTotalTokens() > targetTokens && this.messages.length > 2) {
      // Remove oldest non-system message
      const removed = this.messages.splice(1, 1)[0];
      console.log(`Trimmed message (${removed.tokens} tokens)`);
    }
  }

  async chat(userMessage: string): Promise<string> {
    // Add user message
    this.messages.push({
      role: 'user',
      content: userMessage,
      tokens: this.estimateTokens(userMessage),
    });

    // Ensure we fit in context
    this.trimToFit();

    // Log context usage
    const usedTokens = this.getTotalTokens();
    console.log(`Context: ${usedTokens}/${this.maxContextTokens} tokens`);

    // Make API call
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: this.messages.map(({ role, content }) => ({ role, content })),
      max_tokens: this.reserveForResponse,
    });

    const assistantMessage = response.choices[0].message.content || '';

    // Add assistant response
    this.messages.push({
      role: 'assistant',
      content: assistantMessage,
      tokens: this.estimateTokens(assistantMessage),
    });

    // Log actual usage
    console.log(`Actual tokens used: ${response.usage?.total_tokens}`);

    return assistantMessage;
  }

  getConversationLength(): number {
    return this.messages.length;
  }

  clearHistory(): void {
    // Keep only system message
    this.messages = [this.messages[0]];
  }
}

// Usage
const chat = new ContextAwareChat({
  systemPrompt: 'You are a helpful TypeScript tutor.',
  maxContextTokens: 4000,
});

// Simulate a long conversation
for (let i = 0; i < 20; i++) {
  const response = await chat.chat(`Question ${i + 1}: Explain a TypeScript concept.`);
  console.log(`Response ${i + 1}:`, response.slice(0, 100) + '...\n');
}

When to Use Large Context Models

Large context models (100K+ tokens) are useful for:

┌────────────────────────┬─────────────────────────────────┐
│ Use Case               │ Why Large Context Helps         │
├────────────────────────┼─────────────────────────────────┤
│ Long document analysis │ Entire document fits in context │
│ Code review            │ Full codebase context           │
│ Legal document review  │ Complete contract analysis      │
│ Book summarization     │ No chunking needed              │
│ Extended conversations │ Full history retained           │
└────────────────────────┴─────────────────────────────────┘

Trade-offs of Large Context

┌──────────────────┬──────────────────────────┬───────────────────────────────┐
│ Consideration    │ Large Context            │ Small Context + Techniques    │
├──────────────────┼──────────────────────────┼───────────────────────────────┤
│ Simplicity       │ Just stuff everything in │ Need chunking/RAG             │
│ Cost             │ Higher per request       │ Lower per request             │
│ Latency          │ Slower responses         │ Faster responses              │
│ "Lost in middle" │ More severe              │ Less of an issue              │
│ Precision        │ May use irrelevant info  │ Only relevant info (with RAG) │
└──────────────────┴──────────────────────────┴───────────────────────────────┘

Sometimes smaller context with smart selection beats large context:

// Option A: Send the entire document to a large-context model
const responseA = await openai.chat.completions.create({
  model: 'gpt-4o', // 128K context
  messages: [{ role: 'user', content: `${entireDocument}\n\nQuestion: ${question}` }],
});
// Cost: high, and the prompt may include a lot of irrelevant information

// Option B: Use RAG with a smaller, focused context
const relevantChunks = await retrieveRelevant(question, documentChunks);
const responseB = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'user', content: `${relevantChunks.join('\n\n')}\n\nQuestion: ${question}` },
  ],
});
// Cost: lower, and the prompt is focused on relevant information

Exercises

Exercise 1: Calculate Context Usage

You have a chatbot with:

  • System prompt: 500 tokens
  • Average user message: 100 tokens
  • Average assistant response: 300 tokens
  • Context limit: 8,000 tokens

How many conversation turns can you fit before needing to trim?

Solution

Each turn = user message (100) + assistant response (300) = 400 tokens

Available for conversation = 8,000 - 500 (system) = 7,500 tokens

Maximum turns = 7,500 / 400 = 18.75

You can fit approximately 18 full turns before needing to trim.

In practice, you would want to start trimming earlier to leave room for the response.
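
The same arithmetic in code:

const contextLimit = 8_000;
const systemPromptTokens = 500;
const tokensPerTurn = 100 + 300; // user message + assistant response

const availableTokens = contextLimit - systemPromptTokens;        // 7,500
const maxFullTurns = Math.floor(availableTokens / tokensPerTurn); // 18

console.log(maxFullTurns); // 18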

Exercise 2: Implement Message Trimming

Complete this function to trim messages while keeping the system message and most recent exchanges:

function trimMessages(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  // Your implementation here
}
Solution
function trimMessages(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  // Always keep system message
  const systemMsg = messages.find((m) => m.role === 'system');
  const nonSystemMsgs = messages.filter((m) => m.role !== 'system');

  const result: Message[] = systemMsg ? [systemMsg] : [];
  let totalTokens = systemMsg ? countTokens(systemMsg.content) : 0;

  // Add messages from the end (most recent) until we hit the limit
  for (let i = nonSystemMsgs.length - 1; i >= 0; i--) {
    const msg = nonSystemMsgs[i];
    const msgTokens = countTokens(msg.content);

    if (totalTokens + msgTokens <= maxTokens) {
      result.splice(systemMsg ? 1 : 0, 0, msg);
      totalTokens += msgTokens;
    } else {
      break; // Stop when we exceed the limit
    }
  }

  return result;
}

Exercise 3: Choose the Right Strategy

For each scenario, which context management strategy would you recommend?

  1. A customer service chatbot that needs to remember the entire conversation
  2. Analyzing a 500-page legal contract
  3. A coding assistant that needs access to a large codebase
  4. A quiz app that asks questions about a short article
Solution
  1. Customer service chatbot: Summarization strategy - Periodically summarize older messages to compress history while retaining key context.

  2. Legal contract analysis: Chunking + RAG - Break the document into chunks, use embeddings to find relevant sections, include only those in context.

  3. Coding assistant with large codebase: RAG with code embeddings - Index the codebase, retrieve relevant files/functions based on the query. Large context models can help but may be expensive.

  4. Quiz app with short article: Simple inclusion - If the article fits in context (< 4K tokens), just include the whole thing. No special strategy needed.


Key Takeaways

  1. Context window is the LLM's working memory - everything must fit: prompt, history, and response
  2. Bigger is not always better - large contexts are slower, more expensive, and suffer from "lost in the middle"
  3. Use sliding window for simple conversation trimming
  4. Use summarization to compress long conversation history
  5. Use chunking for documents larger than the context window
  6. Use RAG for precise, relevant context selection
  7. Place important information at the start and end of your context

Resources

┌─────────────────────────────┬───────────────┬──────────────────────────────────────────┐
│ Resource                    │ Type          │ Description                              │
├─────────────────────────────┼───────────────┼──────────────────────────────────────────┤
│ Lost in the Middle Paper    │ Paper         │ Research on context position effects     │
│ OpenAI Context Length Guide │ Documentation │ Managing context in GPT models           │
│ Anthropic Long Context      │ Documentation │ Best practices for Claude's 200K context │
│ LangChain Memory            │ Documentation │ Memory management patterns               │
└─────────────────────────────┴───────────────┴──────────────────────────────────────────┘

Next Lesson

Now that you understand context windows, let us explore how to control the randomness and creativity of model outputs with temperature and other parameters.

Continue to Lesson 2.4: Temperature and Other Parameters