# Lesson 2.1: Why Streaming Matters

Duration: 45 minutes

## Learning Objectives
By the end of this lesson, you will be able to:
- Explain why streaming improves user experience in AI applications
- Understand the difference between blocking and streaming responses
- Recognize when to use streaming vs standard API calls
- Measure and compare response latency
## The Waiting Problem
When you make a standard API call to an AI model, your application waits for the entire response to be generated before displaying anything. For a response of 500 tokens, this might take 5-10 seconds of complete silence.
Think about typing a message to ChatGPT and waiting in silence. That wait feels much longer than it actually is. Studies show that users perceive waiting times as 36% longer when there is no visual feedback.
## Streaming: The Solution

With streaming, the AI sends each token (roughly a word, or part of one) as soon as it is generated. The user sees the response build up in real time, just like watching someone type.

The total generation time is identical, but the experience is completely different: users see the first words within 200-500 milliseconds instead of staring at a blank screen for 8 seconds.
## Time to First Token (TTFT)

The key metric in streaming is Time to First Token (TTFT): how quickly the user sees the first piece of the response. The table below uses illustrative numbers for a long response:

| Metric | Non-Streaming | Streaming |
|---|---|---|
| Time to First Token | 8 seconds | 0.3 seconds |
| Total Response Time | 8 seconds | 8 seconds |
| Perceived Speed | Slow | Fast |

With streaming, users see content almost immediately. This changes the perception from "waiting for AI" to "watching AI think."
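You can turn these numbers into something measurable. Here is a minimal sketch that times both metrics for one request; it jumps ahead to the streaming call covered in the next section, and assumes the OpenAI Node SDK, an `OPENAI_API_KEY` in the environment, and Node 18+ for the global `performance` timer. The `measureLatency` name and the prompt are illustrative:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Records TTFT (when the first content chunk arrives) and total time
// (when the stream is fully consumed) for one streamed request.
async function measureLatency(prompt: string): Promise<void> {
  const start = performance.now();
  let firstTokenAt: number | null = null;

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (firstTokenAt === null && chunk.choices[0]?.delta?.content) {
      firstTokenAt = performance.now();
    }
  }

  const end = performance.now();
  // If no content chunk ever arrived, report TTFT equal to total time.
  console.log(`TTFT:  ${((firstTokenAt ?? end) - start).toFixed(0)} ms`);
  console.log(`Total: ${(end - start).toFixed(0)} ms`);
}

measureLatency('Explain streaming in three sentences.').catch(console.error);
```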
## Code Comparison

Here is a simple comparison of the non-streaming and streaming approaches, using the OpenAI Node SDK in TypeScript.

### Non-Streaming (Blocking)
```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Blocks until the entire completion has been generated,
// then returns the full text in one piece.
async function chat(message: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: message }],
  });
  return response.choices[0].message.content ?? '';
}
```
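Calling it looks like any other async function; nothing appears until the full reply is back (the prompt is arbitrary, and `chat` refers to the function above):

```typescript
// The console stays silent for the entire generation time.
chat('Explain streaming in one paragraph.').then(console.log);
```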
### Streaming
```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Prints each token to stdout the moment it arrives.
async function chatStream(message: string): Promise<void> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: message }],
    stream: true, // ask the API for an async iterable of chunks
  });

  for await (const chunk of stream) {
    // Some chunks (role, finish) carry no text, so guard before writing.
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      process.stdout.write(content);
    }
  }
}
```
The key differences are the `stream: true` option in the request and the `for await` loop that consumes chunks as they arrive.
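In a real application you usually need the complete text as well, for example to store chat history, not just the printed tokens. A common pattern is to accumulate while streaming; a sketch reusing the `openai` client from above (the function name is illustrative):

```typescript
// Streams tokens to stdout while also accumulating the full reply,
// so the caller gets the complete string once the stream ends.
async function chatStreamCollect(message: string): Promise<string> {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: message }],
    stream: true,
  });

  let full = '';
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      process.stdout.write(content);
      full += content;
    }
  }
  return full;
}
```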
## When to Use Streaming

Use streaming for:

- Chat interfaces where users expect real-time responses
- Responses longer than a few sentences
- Interactive applications where users might interrupt a response
- User-facing production applications
Skip streaming for:

- Short responses that finish in under a second
- Batch processing with no user waiting on the output
- Background tasks and automated pipelines
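When one code path has to serve both cases, you can make streaming a per-call choice. A minimal sketch reusing the `openai` client from earlier; the `ask` helper and its `onToken` callback are illustrative, not part of the SDK:

```typescript
// Illustrative helper: stream when the caller wants incremental
// output, otherwise fall back to a single blocking call.
async function ask(
  message: string,
  onToken?: (token: string) => void,
): Promise<string> {
  if (!onToken) {
    // No consumer for incremental output: a blocking call is simpler.
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: message }],
    });
    return response.choices[0].message.content ?? '';
  }

  // A consumer is watching: stream tokens as they arrive.
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: message }],
    stream: true,
  });

  let full = '';
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      onToken(content);
      full += content;
    }
  }
  return full;
}
```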
## Key Takeaways

- Streaming delivers tokens as they are generated instead of waiting for the full completion
- Time to First Token (TTFT) is typically 200-500 ms with streaming
- Perceived performance improves dramatically even though total generation time is unchanged
- Most major AI chat applications use streaming for better UX
- Implementation requires handling chunks in a loop
## Resources
| Resource | Type | Level |
|---|---|---|
| OpenAI Streaming Guide | Documentation | Beginner |
| Perceived Performance (MDN) | Documentation | Beginner |
## Next Lesson

In the next lesson, you will learn about Server-Sent Events (SSE), the transport most chat applications use to deliver streamed tokens to the browser.