From Zero to AI

Lesson 2.3: Context Window - Memory Limitations

Duration: 50 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  • Define what a context window is and why it matters
  • Compare context window sizes across different models
  • Implement strategies to work within context limits
  • Manage long conversations without losing important information
  • Decide when to use large-context models vs. other approaches

Introduction

Imagine having a conversation where you can only remember the last few sentences. That is essentially what an LLM faces with its context window. The context window is the model's "working memory" - everything it can consider when generating a response.

Understanding context windows is crucial for building applications that handle long documents, extended conversations, or complex multi-step tasks.


What is a Context Window?

The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request. This includes:

  • Your system prompt
  • The conversation history
  • Your current message
  • The model's response
┌─────────────────────────────────────────────────────────┐
│                     Context Window                      │
│                                                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │ System Prompt (500 tokens)                        │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ User: Previous message (200 tokens)               │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ Assistant: Previous response (300 tokens)          │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ User: Current message (100 tokens)                │  │
│  ├───────────────────────────────────────────────────┤  │
│  │ Assistant: [Response being generated]             │  │
│  └───────────────────────────────────────────────────┘  │
│                                                         │
│  Total used: 1,100 tokens                               │
│  Context limit: 128,000 tokens                          │
│  Remaining: 126,900 tokens                              │
└─────────────────────────────────────────────────────────┘
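
You can track this budget in code by estimating the tokens in each message and subtracting the total from the model's limit. Here is a minimal sketch using the rough 4-characters-per-token heuristic (the helper names are illustrative; a real tokenizer library such as tiktoken is more accurate). Remember that the model's response must fit inside the same window.

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough estimate: ~4 characters per token for English text
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Sum the estimated tokens of every message and report what is left in the window
function contextBudget(messages: ChatMessage[], contextLimit: number) {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return { used, remaining: contextLimit - used };
}

const conversation: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Summarize the attached report in three bullet points.' },
];

const { used, remaining } = contextBudget(conversation, 128_000);
console.log(`Used: ~${used} tokens, remaining: ~${remaining} tokens`);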

Why Context Windows Are Limited

  1. Memory: Each token requires storing attention information for every other token
  2. Computation: Processing time grows quadratically with context length
  3. Cost: Longer contexts require more GPU memory and compute

With standard attention, memory and compute scale roughly as O(n²), where n is the context length: doubling the context roughly quadruples the resources needed.
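
To see why doubling the context quadruples the work, count the token-to-token attention pairs. This is a toy calculation, not a real cost model:

// Every token attends to every other token, so the number of pairs grows as n * n
function attentionPairs(contextLength: number): number {
  return contextLength * contextLength;
}

console.log(attentionPairs(4_000));   //     16,000,000 pairs
console.log(attentionPairs(8_000));   //     64,000,000 pairs (2x the context, 4x the pairs)
console.log(attentionPairs(128_000)); // 16,384,000,000 pairs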


Context Window Sizes

Different models offer different context sizes:

┌────────────────────────────────────────────────────────────┐
│                   Context Window Comparison                 │
├────────────────────────────────────────────────────────────┤
│  Model              │ Context Window │ Approx Pages*       │
├─────────────────────┼────────────────┼─────────────────────┤
│  GPT-3.5 Turbo      │ 16K tokens     │ ~20 pages           │
│  GPT-4 Turbo        │ 128K tokens    │ ~160 pages          │
│  GPT-4o             │ 128K tokens    │ ~160 pages          │
│  Claude 3 Haiku     │ 200K tokens    │ ~250 pages          │
│  Claude 3.5 Sonnet  │ 200K tokens    │ ~250 pages          │
│  Claude 3 Opus      │ 200K tokens    │ ~250 pages          │
│  Gemini 1.5 Pro     │ 1M tokens      │ ~1,250 pages        │
│  Gemini 1.5 Flash   │ 1M tokens      │ ~1,250 pages        │
└────────────────────────────────────────────────────────────┘
* Assuming ~800 tokens per page of text

What Can Fit in Different Context Sizes?

┌──────────────┬───────────────────────────────┐
│ Context Size │ Real-World Equivalent         │
├──────────────┼───────────────────────────────┤
│ 4K tokens    │ A long email or short article │
│ 16K tokens   │ A research paper              │
│ 128K tokens  │ A short novel                 │
│ 200K tokens  │ Multiple research papers      │
│ 1M tokens    │ Several books                 │
└──────────────┴───────────────────────────────┘
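
These rough equivalents make it easy to sanity-check whether a document will fit before you send it. A quick sketch, reusing the ~800 tokens-per-page figure from the comparison table (the helper names and headroom value are just examples):

const TOKENS_PER_PAGE = 800;  // matches the comparison table above
const CHARS_PER_TOKEN = 4;    // rough average for English text

function estimateDocumentTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Leave some headroom for the prompt and the model's response
function fitsInContext(text: string, contextLimit: number, headroom = 2000): boolean {
  return estimateDocumentTokens(text) + headroom <= contextLimit;
}

// A stand-in for ~300 pages of text (~240,000 tokens)
const book = 'x'.repeat(300 * TOKENS_PER_PAGE * CHARS_PER_TOKEN);
console.log(fitsInContext(book, 128_000));   // false: too big for a 128K window
console.log(fitsInContext(book, 1_000_000)); // true: fits comfortably in a 1M window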

The "Lost in the Middle" Problem

Research has shown that LLMs do not treat all parts of the context equally. Information in the middle of long contexts is often "lost" or given less attention.

┌─────────────────────────────────────────────────────────┐
│             Attention Distribution in Long Context       │
│                                                         │
│  Attention                                              │
│  Level                                                  │
│    ▲                                                    │
│    │  ██                                         ██     │
│    │  ██                                         ██     │
│    │  ██                                         ██     │
│    │  ██    ░░    ░░    ░░    ░░    ░░    ░░    ██     │
│    │  ██    ░░    ░░    ░░    ░░    ░░    ░░    ██     │
│    └──────────────────────────────────────────────►    │
│       Start                Middle               End     │
│                                                         │
│  The model pays more attention to the beginning and     │
│  end of the context than to the middle.                 │
└─────────────────────────────────────────────────────────┘

Implications for Your Applications

// BAD: Important information buried in the middle
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: "Here's a long document..." }, // 50K tokens
  // Critical instruction buried here
  { role: 'user', content: 'The key requirement is X' }, // Middle
  { role: 'user', content: 'More context...' }, // 50K tokens
  { role: 'user', content: 'Summarize the document.' },
];

// BETTER: Important information at start or end
const betterMessages = [
  {
    role: 'system',
    content: 'You are a helpful assistant. KEY REQUIREMENT: X', // At start
  },
  { role: 'user', content: "Here's a long document..." },
  { role: 'user', content: 'More context...' },
  {
    role: 'user',
    content: 'Summarize the document. Remember: X', // At end
  },
];

Strategies for Managing Context

Strategy 1: Sliding Window

Keep only the most recent messages:

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function applySlidingWindow(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  const result: Message[] = [];
  let totalTokens = 0;

  // Always include system message
  const systemMessage = messages.find((m) => m.role === 'system');
  if (systemMessage) {
    result.push(systemMessage);
    totalTokens += countTokens(systemMessage.content);
  }

  // Add messages from the end until we hit the limit
  const nonSystemMessages = messages.filter((m) => m.role !== 'system');

  for (let i = nonSystemMessages.length - 1; i >= 0; i--) {
    const msg = nonSystemMessages[i];
    const msgTokens = countTokens(msg.content);

    if (totalTokens + msgTokens > maxTokens) {
      break;
    }

    result.splice(systemMessage ? 1 : 0, 0, msg); // Insert after the system message (or at the front if there is none)
    totalTokens += msgTokens;
  }

  return result;
}

// Usage
const trimmedMessages = applySlidingWindow(messages, 4000, countTokens);

Strategy 2: Summarization

Periodically summarize older messages:

async function summarizeHistory(messages: Message[], openai: OpenAI): Promise<string> {
  const historyText = messages.map((m) => `${m.role}: ${m.content}`).join('\n\n');

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // Use cheaper model for summarization
    messages: [
      {
        role: 'system',
        content: 'Summarize this conversation, preserving key facts and decisions.',
      },
      { role: 'user', content: historyText },
    ],
    max_tokens: 500,
  });

  return response.choices[0].message.content || '';
}

// Use in your chat application
class SmartChat {
  private openai = new OpenAI();
  private messages: Message[] = [];
  private summaryThreshold = 3000; // Tokens before summarizing

  async addMessage(message: Message): Promise<void> {
    this.messages.push(message);

    const totalTokens = this.countTotalTokens();
    if (totalTokens > this.summaryThreshold) {
      await this.compressHistory();
    }
  }

  private countTotalTokens(): number {
    // Rough estimate: ~4 characters per token
    return this.messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
  }

  private async compressHistory(): Promise<void> {
    // Keep the system message (index 0) plus the last 2 exchanges (4 messages)
    const toKeep = this.messages.slice(-4);
    const toSummarize = this.messages.slice(1, -4);

    if (toSummarize.length > 0) {
      const summary = await summarizeHistory(toSummarize, this.openai);

      this.messages = [
        this.messages[0], // System message
        { role: 'assistant', content: `[Previous conversation summary: ${summary}]` },
        ...toKeep,
      ];
    }
  }
}
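
A brief usage sketch (the prompts here are placeholders):

const smartChat = new SmartChat();

// The first message becomes the system prompt that compressHistory always preserves
await smartChat.addMessage({ role: 'system', content: 'You are a helpful assistant.' });
await smartChat.addMessage({ role: 'user', content: 'Walk me through our deployment options.' });
// ...after enough turns, older messages are automatically replaced by a summary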

Strategy 3: Chunking Long Documents

Break documents into chunks and process separately:

function chunkText(text: string, maxChunkTokens: number, overlap: number = 100): string[] {
  const chunks: string[] = [];
  const words = text.split(/\s+/);
  const wordsPerChunk = Math.floor(maxChunkTokens * 0.75); // ~0.75 words per token

  // Step forward one chunk at a time, keeping `overlap` words shared between adjacent chunks
  for (let i = 0; i < words.length; i += wordsPerChunk - overlap) {
    const chunk = words.slice(i, i + wordsPerChunk).join(' ');
    chunks.push(chunk);
  }

  return chunks;
}

async function processLongDocument(
  document: string,
  question: string,
  openai: OpenAI
): Promise<string> {
  const chunks = chunkText(document, 3000);

  // Process each chunk
  const chunkResults = await Promise.all(
    chunks.map(async (chunk, index) => {
      const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          {
            role: 'system',
            content: `Extract information relevant to: "${question}"`,
          },
          { role: 'user', content: chunk },
        ],
      });
      return response.choices[0].message.content;
    })
  );

  // Combine results
  const combinedResponse = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: 'Synthesize these extracted pieces into a coherent answer.',
      },
      { role: 'user', content: chunkResults.join('\n\n---\n\n') },
    ],
  });

  return combinedResponse.choices[0].message.content || '';
}

Strategy 4: Smart Context Selection (RAG)

Only include relevant information in the context:

// This is a simplified RAG approach
// Full RAG is covered in Course 5

interface DocumentChunk {
  content: string;
  embedding: number[];
}
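
// Helpers assumed by this sketch (they are not part of the OpenAI SDK itself):
// a shared client, an embedding call, and plain cosine similarity.
// Assumes `import OpenAI from 'openai'` at the top of the file.
const openai = new OpenAI();

// Embed a piece of text (the model name here is just an example)
async function getEmbedding(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return res.data[0].embedding;
}

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (normA * normB);
}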

async function selectRelevantContext(
  query: string,
  documents: DocumentChunk[],
  topK: number = 5
): Promise<string[]> {
  // Get query embedding
  const queryEmbedding = await getEmbedding(query);

  // Calculate similarity scores
  const scored = documents.map((doc) => ({
    content: doc.content,
    score: cosineSimilarity(queryEmbedding, doc.embedding),
  }));

  // Return top K most relevant chunks
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((item) => item.content);
}

// Usage
const relevantChunks = await selectRelevantContext(userQuestion, documentChunks);
const context = relevantChunks.join('\n\n');

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'Answer based on the provided context.' },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${userQuestion}` },
  ],
});

Practical Example: Context-Aware Chat Application

import OpenAI from 'openai';

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
  tokens?: number;
}

class ContextAwareChat {
  private openai: OpenAI;
  private messages: Message[] = [];
  private maxContextTokens: number;
  private reserveForResponse: number = 1000;

  constructor(options: { systemPrompt: string; maxContextTokens?: number }) {
    this.openai = new OpenAI();
    this.maxContextTokens = options.maxContextTokens || 8000;

    this.messages.push({
      role: 'system',
      content: options.systemPrompt,
      tokens: this.estimateTokens(options.systemPrompt),
    });
  }

  private estimateTokens(text: string): number {
    return Math.ceil(text.length / 4);
  }

  private getTotalTokens(): number {
    return this.messages.reduce((sum, msg) => sum + (msg.tokens || 0), 0);
  }

  private trimToFit(): void {
    const targetTokens = this.maxContextTokens - this.reserveForResponse;

    while (this.getTotalTokens() > targetTokens && this.messages.length > 2) {
      // Remove oldest non-system message
      const removed = this.messages.splice(1, 1)[0];
      console.log(`Trimmed message (${removed.tokens} tokens)`);
    }
  }

  async chat(userMessage: string): Promise<string> {
    // Add user message
    this.messages.push({
      role: 'user',
      content: userMessage,
      tokens: this.estimateTokens(userMessage),
    });

    // Ensure we fit in context
    this.trimToFit();

    // Log context usage
    const usedTokens = this.getTotalTokens();
    console.log(`Context: ${usedTokens}/${this.maxContextTokens} tokens`);

    // Make API call
    const response = await this.openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: this.messages.map(({ role, content }) => ({ role, content })),
      max_tokens: this.reserveForResponse,
    });

    const assistantMessage = response.choices[0].message.content || '';

    // Add assistant response
    this.messages.push({
      role: 'assistant',
      content: assistantMessage,
      tokens: this.estimateTokens(assistantMessage),
    });

    // Log actual usage
    console.log(`Actual tokens used: ${response.usage?.total_tokens}`);

    return assistantMessage;
  }

  getConversationLength(): number {
    return this.messages.length;
  }

  clearHistory(): void {
    // Keep only system message
    this.messages = [this.messages[0]];
  }
}

// Usage
const chat = new ContextAwareChat({
  systemPrompt: 'You are a helpful TypeScript tutor.',
  maxContextTokens: 4000,
});

// Simulate a long conversation
for (let i = 0; i < 20; i++) {
  const response = await chat.chat(`Question ${i + 1}: Explain a TypeScript concept.`);
  console.log(`Response ${i + 1}:`, response.slice(0, 100) + '...\n');
}

When to Use Large Context Models

Large context models (100K+ tokens) are useful for:

┌────────────────────────┬─────────────────────────────────┐
│ Use Case               │ Why Large Context Helps         │
├────────────────────────┼─────────────────────────────────┤
│ Long document analysis │ Entire document fits in context │
│ Code review            │ Full codebase context           │
│ Legal document review  │ Complete contract analysis      │
│ Book summarization     │ No chunking needed              │
│ Extended conversations │ Full history retained           │
└────────────────────────┴─────────────────────────────────┘

Trade-offs of Large Context

┌──────────────────┬──────────────────────────┬───────────────────────────────┐
│ Consideration    │ Large Context            │ Small Context + Techniques    │
├──────────────────┼──────────────────────────┼───────────────────────────────┤
│ Simplicity       │ Just stuff everything in │ Need chunking/RAG             │
│ Cost             │ Higher per request       │ Lower per request             │
│ Latency          │ Slower responses         │ Faster responses              │
│ "Lost in middle" │ More severe              │ Less of an issue              │
│ Precision        │ May use irrelevant info  │ Only relevant info (with RAG) │
└──────────────────┴──────────────────────────┴───────────────────────────────┘

Sometimes smaller context with smart selection beats large context:

// Option A: Send the entire document to a large-context model
const responseA = await openai.chat.completions.create({
  model: 'gpt-4o', // 128K context
  messages: [{ role: 'user', content: `${entireDocument}\n\nQuestion: ${question}` }],
});
// Cost: high, and the prompt may include a lot of irrelevant information

// Option B: Use RAG with a smaller, focused context
const relevantChunks = await retrieveRelevant(question, documentChunks);
const responseB = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [
    { role: 'user', content: `${relevantChunks.join('\n\n')}\n\nQuestion: ${question}` },
  ],
});
// Cost: lower, and the prompt is focused on relevant information

Exercises

Exercise 1: Calculate Context Usage

You have a chatbot with:

  • System prompt: 500 tokens
  • Average user message: 100 tokens
  • Average assistant response: 300 tokens
  • Context limit: 8,000 tokens

How many conversation turns can you fit before needing to trim?

Solution

Each turn = user message (100) + assistant response (300) = 400 tokens

Available for conversation = 8,000 - 500 (system) = 7,500 tokens

Maximum turns = 7,500 / 400 = 18.75

You can fit approximately 18 full turns before needing to trim.

In practice, you would want to start trimming earlier to leave room for the response.
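
The same arithmetic in code:

const contextLimit = 8_000;
const systemPromptTokens = 500;
const tokensPerTurn = 100 + 300; // user message + assistant response

const availableTokens = contextLimit - systemPromptTokens;        // 7,500
const maxFullTurns = Math.floor(availableTokens / tokensPerTurn); // 18

console.log(maxFullTurns); // 18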

Exercise 2: Implement Message Trimming

Complete this function to trim messages while keeping the system message and most recent exchanges:

function trimMessages(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  // Your implementation here
}
Solution
function trimMessages(
  messages: Message[],
  maxTokens: number,
  countTokens: (text: string) => number
): Message[] {
  // Always keep system message
  const systemMsg = messages.find((m) => m.role === 'system');
  const nonSystemMsgs = messages.filter((m) => m.role !== 'system');

  const result: Message[] = systemMsg ? [systemMsg] : [];
  let totalTokens = systemMsg ? countTokens(systemMsg.content) : 0;

  // Add messages from the end (most recent) until we hit the limit
  for (let i = nonSystemMsgs.length - 1; i >= 0; i--) {
    const msg = nonSystemMsgs[i];
    const msgTokens = countTokens(msg.content);

    if (totalTokens + msgTokens <= maxTokens) {
      result.splice(systemMsg ? 1 : 0, 0, msg);
      totalTokens += msgTokens;
    } else {
      break; // Stop when we exceed the limit
    }
  }

  return result;
}

Exercise 3: Choose the Right Strategy

For each scenario, which context management strategy would you recommend?

  1. A customer service chatbot that needs to remember the entire conversation
  2. Analyzing a 500-page legal contract
  3. A coding assistant that needs access to a large codebase
  4. A quiz app that asks questions about a short article
Solution
  1. Customer service chatbot: Summarization strategy - Periodically summarize older messages to compress history while retaining key context.

  2. Legal contract analysis: Chunking + RAG - Break the document into chunks, use embeddings to find relevant sections, include only those in context.

  3. Coding assistant with large codebase: RAG with code embeddings - Index the codebase, retrieve relevant files/functions based on the query. Large context models can help but may be expensive.

  4. Quiz app with short article: Simple inclusion - If the article fits in context (< 4K tokens), just include the whole thing. No special strategy needed.


Key Takeaways

  1. Context window is the LLM's working memory - everything must fit: prompt, history, and response
  2. Bigger is not always better - large contexts are slower, more expensive, and suffer from "lost in the middle"
  3. Use sliding window for simple conversation trimming
  4. Use summarization to compress long conversation history
  5. Use chunking for documents larger than the context window
  6. Use RAG for precise, relevant context selection
  7. Place important information at the start and end of your context

Resources

┌─────────────────────────────┬───────────────┬──────────────────────────────────────────┐
│ Resource                    │ Type          │ Description                              │
├─────────────────────────────┼───────────────┼──────────────────────────────────────────┤
│ Lost in the Middle Paper    │ Paper         │ Research on context position effects     │
│ OpenAI Context Length Guide │ Documentation │ Managing context in GPT models           │
│ Anthropic Long Context      │ Documentation │ Best practices for Claude's 200K context │
│ LangChain Memory            │ Documentation │ Memory management patterns               │
└─────────────────────────────┴───────────────┴──────────────────────────────────────────┘

Next Lesson

Now that you understand context windows, let us explore how to control the randomness and creativity of model outputs with temperature and other parameters.

Continue to Lesson 2.4: Temperature and Other Parameters