From Zero to AI

Lesson 2.1: Why Streaming Matters

Duration: 45 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  1. Explain why streaming improves user experience in AI applications
  2. Understand the difference between blocking and streaming responses
  3. Recognize when to use streaming vs standard API calls
  4. Measure and compare response latency

The Waiting Problem

When you make a standard API call to an AI model, your application waits for the entire response to be generated before displaying anything. For a response of 500 tokens, this might take 5-10 seconds of complete silence.

Think about typing a message to ChatGPT and waiting in silence. That wait feels much longer than it actually is. Studies show that users perceive waiting times as 36% longer when there is no visual feedback.


Streaming: The Solution

With streaming, the AI sends each word (or token) as soon as it is generated. The user sees the response build up in real-time, just like watching someone type.

The total generation time is identical, but the experience is completely different. Users see the first words within 200-500 milliseconds instead of waiting 8 seconds for everything.


Time to First Token (TTFT)

The key metric in streaming is Time to First Token - how quickly the user sees the first piece of the response.

Metric                  Non-Streaming   Streaming
Time to First Token     8 seconds       0.3 seconds
Total Response Time     8 seconds       8 seconds
Perceived Speed         Slow            Fast

With streaming, users see content almost immediately. This changes the perception from "waiting for AI" to "watching AI think."


Code Comparison

Here is a simple comparison of non-streaming vs streaming approaches.

Non-Streaming (Blocking)

import OpenAI from 'openai';

const openai = new OpenAI();

// Blocks until the model has finished generating, then returns
// the complete response in one piece.
async function chat(message: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: message }],
  });
  return response.choices[0].message.content ?? '';
}
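
For example, a call site might look like the following (assuming an ES module, where top-level await is available). Nothing prints until the entire response has arrived:

// The user sees nothing during the full generation time.
const reply = await chat('Explain streaming in one sentence.');
console.log(reply);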

Streaming

import OpenAI from 'openai';

const openai = new OpenAI();

// Prints each token the moment the model produces it.
async function chatStream(message: string): Promise<void> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: message }],
    stream: true, // request an async iterable of chunks instead of one response
  });

  for await (const chunk of stream) {
    // Each chunk carries a small delta; content may be absent (e.g. role-only chunks)
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      process.stdout.write(content); // append without a trailing newline
    }
  }
}

The key differences are the stream: true option in the request and the for await loop that consumes chunks as they arrive.
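
You can check the numbers from the TTFT table yourself. Below is a minimal sketch, assuming Node 18+ (where performance.now is available as a global) and the same client setup as above; measureLatency is a name invented for this example.

import OpenAI from 'openai';

const openai = new OpenAI();

async function measureLatency(message: string): Promise<void> {
  const start = performance.now();
  let firstTokenAt: number | null = null;

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: message }],
    stream: true,
  });

  for await (const chunk of stream) {
    // Record the moment the first piece of content arrives
    if (firstTokenAt === null && chunk.choices[0]?.delta?.content) {
      firstTokenAt = performance.now();
    }
  }

  const total = performance.now() - start;
  // Falls back to 0 ms if no content chunk ever arrived
  console.log(`Time to first token: ${((firstTokenAt ?? start) - start).toFixed(0)} ms`);
  console.log(`Total response time: ${total.toFixed(0)} ms`);
}

Exact timings depend on the model, prompt length, and network, so expect variation between runs.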


When to Use Streaming

Use streaming for:

  • Chat interfaces where users expect real-time responses
  • Responses longer than a few sentences
  • Interactive applications where users might interrupt
  • User-facing production applications

Skip streaming for:

  • Short responses that finish in under a second
  • Batch processing without user interaction (see the sketch after this list)
  • Background tasks and automated pipelines
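
As an example of the batch case, here is a hypothetical sketch (summarizeAll and its prompt are invented for illustration) that processes many documents with plain non-streaming calls, since no user is watching the output:

import OpenAI from 'openai';

const openai = new OpenAI();

// No user interaction, so blocking calls are simpler than streaming here.
async function summarizeAll(documents: string[]): Promise<string[]> {
  return Promise.all(
    documents.map(async (doc) => {
      const response = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: `Summarize in two sentences: ${doc}` }],
      });
      return response.choices[0].message.content ?? '';
    }),
  );
}

Collecting whole responses keeps the code simple, and Promise.all lets the requests run concurrently.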

Key Takeaways

  1. Streaming delivers tokens as they are generated instead of waiting for completion
  2. Time to First Token (TTFT) is typically 200-500ms with streaming
  3. Perceived performance improves dramatically even with same total time
  4. All major AI applications use streaming for better UX
  5. Implementation requires handling chunks in a loop

Resources

Resource                      Type            Level
OpenAI Streaming Guide        Documentation   Beginner
Perceived Performance (MDN)   Documentation   Beginner

Next Lesson

In the next lesson, you will learn about Server-Sent Events (SSE) - the technology that makes streaming possible.

Continue to Lesson 2.2: Server-Sent Events