Lesson 2.4: Temperature and Other Parameters
Duration: 45 minutes
Learning Objectives
By the end of this lesson, you will be able to:
- Understand what temperature controls and how to use it
- Configure top-p (nucleus sampling) for different use cases
- Apply frequency and presence penalties effectively
- Set max_tokens appropriately for your needs
- Choose parameter combinations for different tasks
Introduction
When you call an LLM API, you can adjust several parameters that control how the model generates responses. Understanding these parameters is like understanding the controls on a mixing board - each one affects the output in specific ways.
The most important parameter is temperature, but there are others that can help you get exactly the output you need.
Temperature: Controlling Randomness
Temperature controls how "creative" or "random" the model's responses are. It affects how the model chooses from its predicted probabilities.
How Temperature Works
Remember that LLMs predict probability distributions over tokens. Temperature modifies these probabilities:
Original probabilities for next token after "The capital of France is":
"Paris" → 95%
"Lyon" → 2%
"Nice" → 1%
"Berlin" → 0.5%
...other tokens → <1%
With temperature = 0:
Model ALWAYS picks "Paris" (the highest probability)
With temperature = 1:
Model samples according to original probabilities
Usually picks "Paris" but sometimes "Lyon" or others
With temperature = 2:
Probabilities are "flattened" - lower probability tokens become more likely
Might pick "Nice" or even "Berlin" occasionally
Temperature Values
| Temperature | 0.0 | 0.3 | 0.7 | 1.0 | 2.0 |
|---|---|---|---|---|---|
| Behavior | Very focused | Mostly focused | Balanced | Creative | Very random |
| Use for | Facts, math, data | Code, analysis, summaries | Chat, general, ideas | Stories, poetry | Rarely used |
Practical Examples
import OpenAI from 'openai';
const openai = new OpenAI();
// Temperature 0: Deterministic output (same every time)
const factualResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0,
messages: [{ role: 'user', content: 'What is 2 + 2?' }],
});
// Always returns exactly the same response
// Temperature 0.3: Mostly focused, slight variation
const codeResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0.3,
messages: [{ role: 'user', content: 'Write a function to sort an array in TypeScript' }],
});
// Consistent structure, minor wording variations
// Temperature 0.7: Balanced (default for many models)
const chatResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0.7,
messages: [{ role: 'user', content: 'Tell me about TypeScript' }],
});
// Natural variation, engaging responses
// Temperature 1.0: Creative
const storyResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 1.0,
messages: [{ role: 'user', content: 'Write a short story about a robot learning to code' }],
});
// More creative, surprising word choices
Choosing the Right Temperature
| Task | Recommended Temperature | Why |
|---|---|---|
| Factual Q&A | 0.0 - 0.2 | Accuracy matters, no creativity needed |
| Code generation | 0.0 - 0.3 | Code should be correct, some flexibility |
| Data extraction | 0.0 | Must be deterministic and accurate |
| Summarization | 0.3 - 0.5 | Some flexibility in wording |
| General chat | 0.7 | Balance of coherence and engagement |
| Creative writing | 0.8 - 1.0 | Encourage creative expression |
| Brainstorming | 0.9 - 1.2 | Generate diverse ideas |
Top-P (Nucleus Sampling)
Top-p is an alternative way to control randomness. Instead of modifying probabilities, it limits which tokens are considered.
How Top-P Works
Token probabilities:
"Paris" → 70% ─┐
"Lyon" → 15% │ Top 90% (cumulative)
"Nice" → 5% ─┘
"Berlin" → 3% ─── Excluded (below threshold)
"Rome" → 2%
...
With top_p = 0.9:
Only tokens in the top 90% cumulative probability are considered
Model chooses from: "Paris", "Lyon", "Nice"
Lower probability tokens are excluded entirely
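Conceptually, nucleus sampling can be sketched like this (an illustrative helper, not an SDK call): sort tokens by probability, keep the smallest set whose cumulative probability reaches top_p, renormalize, and sample only from that set.
// Hypothetical helper for illustration - not part of the OpenAI SDK.
function nucleusFilter(probs: Record<string, number>, topP: number): Record<string, number> {
  const sorted = Object.entries(probs).sort((a, b) => b[1] - a[1]);
  const kept: [string, number][] = [];
  let cumulative = 0;
  for (const [token, p] of sorted) {
    kept.push([token, p]);
    cumulative += p;
    if (cumulative >= topP) break; // stop once the nucleus covers top_p
  }
  // Renormalize the surviving tokens so they sum to 1
  const total = kept.reduce((sum, [, p]) => sum + p, 0);
  return Object.fromEntries(kept.map(([t, p]) => [t, p / total] as const));
}
// The example from above with top_p = 0.9: only "Paris", "Lyon", and "Nice" survive
console.log(nucleusFilter({ Paris: 0.7, Lyon: 0.15, Nice: 0.05, Berlin: 0.03, Rome: 0.02 }, 0.9));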
Top-P vs Temperature
| | Temperature | Top-P |
|---|---|---|
| Mechanism | Modifies ALL token probabilities | Limits which tokens can be selected |
| Tuning direction | Higher = flatter distribution | Lower = fewer candidates |
| Unlikely tokens | Can still be selected | Eliminated entirely |
Using Top-P
// Top-p = 1.0: Consider all tokens (default)
// Top-p = 0.9: Consider top 90% probability mass
// Top-p = 0.5: Consider only very likely tokens
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
top_p: 0.9,
messages: [{ role: 'user', content: 'Explain machine learning' }],
});
// Common patterns:
// - top_p = 1.0 with low temperature for focused responses
// - top_p = 0.9 for general use (slightly limits outliers)
// - top_p = 0.7 for more focused creative work
Should You Use Both?
OpenAI recommends adjusting either temperature OR top_p, not both:
// GOOD: Use one or the other
const focused = await openai.chat.completions.create({
model: "gpt-4o-mini",
temperature: 0.3,
// top_p left at default (1.0)
messages: [...]
});
// OR
const filtered = await openai.chat.completions.create({
model: "gpt-4o-mini",
top_p: 0.9,
// temperature left at default
messages: [...]
});
// AVOID: Using both can have unpredictable effects
const confusing = await openai.chat.completions.create({
model: "gpt-4o-mini",
temperature: 0.5,
top_p: 0.8, // Interaction is complex
messages: [...]
});
Frequency Penalty
Frequency penalty discourages the model from repeating the same tokens. Higher values make repetition less likely.
How It Works
Without frequency penalty:
"I like TypeScript. TypeScript is great. TypeScript helps me..."
(Model keeps using "TypeScript" because it's relevant)
With frequency_penalty = 0.5:
"I like TypeScript. It is great. This language helps me..."
(Model avoids repeating "TypeScript")
Range: -2.0 to 2.0 (negative values encourage repetition); in practice you will usually stay between 0.0 (no penalty) and 2.0 (strong penalty)
Default: 0.0
Practical Use Cases
// Reduce repetition in long-form content
const articleResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
frequency_penalty: 0.5,
messages: [{ role: 'user', content: 'Write a 500-word article about TypeScript' }],
});
// Higher penalty for creative variation
const creativeResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
frequency_penalty: 0.8,
messages: [{ role: 'user', content: "Generate 10 different ways to say 'hello'" }],
});
// No penalty for technical accuracy (repetition may be needed)
const technicalResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
frequency_penalty: 0.0,
messages: [{ role: 'user', content: 'Explain the TypeScript type system' }],
});
Presence Penalty
Presence penalty is related to frequency penalty but scales differently: it applies a flat, one-time penalty to any token that has already appeared, regardless of how many times.
Frequency vs Presence Penalty
Text: "TypeScript TypeScript TypeScript code"
Frequency penalty effect on "TypeScript":
First occurrence: No penalty
Second occurrence: Moderate penalty
Third occurrence: Strong penalty
Presence penalty effect on "TypeScript":
First occurrence: No penalty
Second, third, etc.: Same penalty (token is "present")
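Under the hood, OpenAI's API documentation describes both penalties as adjustments subtracted from a token's logit before sampling: the frequency term grows with how many times the token has already been generated, while the presence term is a flat, one-time subtraction. A rough sketch of that formula:
// Sketch of the adjustment described in OpenAI's docs:
// adjusted = logit - count * frequency_penalty - (count > 0 ? 1 : 0) * presence_penalty
// where count is how often the token already appears in the generated text.
function penalizedLogit(
  logit: number,
  count: number,
  frequencyPenalty: number,
  presencePenalty: number
): number {
  return logit - count * frequencyPenalty - (count > 0 ? 1 : 0) * presencePenalty;
}
// "TypeScript" generated 3 times, frequency_penalty = 0.5, presence_penalty = 0.6:
console.log(penalizedLogit(2.0, 3, 0.5, 0.6)); // 2.0 - 1.5 - 0.6 = -0.1 (frequency part grows with count)
// Generated only once: the presence part is the same, the frequency part is smaller
console.log(penalizedLogit(2.0, 1, 0.5, 0.6)); // 2.0 - 0.5 - 0.6 = 0.9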
When to Use Each
| Penalty Type | Use When |
|---|---|
| Frequency | You want to reduce excessive repetition |
| Presence | You want to encourage new topics |
// Presence penalty encourages exploring new topics
const brainstormResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
presence_penalty: 0.6,
messages: [{ role: 'user', content: 'Give me ideas for a programming project' }],
});
// Encourages moving to new concepts rather than elaborating on one
// Combine both for maximum variety
const diverseResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
frequency_penalty: 0.5,
presence_penalty: 0.5,
messages: [{ role: 'user', content: 'Write a varied, engaging blog post about coding' }],
});
Max Tokens
The max_tokens parameter caps the length of the model's response, measured in tokens (not words or characters).
Understanding Max Tokens
// max_tokens applies to the OUTPUT only
// It does not count your input/prompt
const shortResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
max_tokens: 50, // Response will be at most 50 tokens
messages: [{ role: 'user', content: 'Explain quantum computing' }],
});
// Response might get cut off mid-sentence if it reaches 50 tokens
const longResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
max_tokens: 2000, // Allow longer response
messages: [{ role: 'user', content: 'Explain quantum computing' }],
});
Setting Appropriate Limits
// For specific formats, set appropriate limits
const tweetResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
max_tokens: 100, // Tweets are short
messages: [
{
role: 'user',
content: 'Write a tweet about TypeScript',
},
],
});
// For code generation, allow more space
const codeResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
max_tokens: 1000, // Code can be long
messages: [
{
role: 'user',
content: 'Write a complete TypeScript class for a todo list',
},
],
});
// Check whether the code response was truncated
if (codeResponse.choices[0].finish_reason === 'length') {
  console.log('Response was cut off - increase max_tokens');
}
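If the response was truncated, one possible follow-up (a sketch, not a prescribed pattern) is to retry once with a larger budget:
// Retry once with a bigger max_tokens if the first attempt hit the limit.
// This simply re-sends the same prompt; a production version might instead
// ask the model to continue from where it stopped.
let fullResponse = codeResponse;
if (fullResponse.choices[0].finish_reason === 'length') {
  fullResponse = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    max_tokens: 2000, // double the original budget
    messages: [
      {
        role: 'user',
        content: 'Write a complete TypeScript class for a todo list',
      },
    ],
  });
}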
Cost Implications
Remember: you pay for output tokens too!
// Cost-conscious approach
async function getResponse(prompt: string, maxLength: 'short' | 'medium' | 'long') {
const tokenLimits = {
short: 100,
medium: 500,
long: 2000,
};
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
max_tokens: tokenLimits[maxLength],
messages: [{ role: 'user', content: prompt }],
});
return response;
}
// Use appropriate limits for each task
await getResponse('What is 2+2?', 'short'); // ~$0.00006
await getResponse('Explain recursion', 'medium'); // ~$0.0003
await getResponse('Write a tutorial', 'long'); // ~$0.0012
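The dollar figures above are rough estimates; actual cost depends on current pricing and on how many output tokens the model really generates, which you can read from the usage field of the response. A small sketch with placeholder per-token prices (check current pricing before relying on these numbers):
// Placeholder prices for illustration only - look up the current rates.
const PRICE_PER_INPUT_TOKEN = 0.15 / 1_000_000; // assumed USD per input token
const PRICE_PER_OUTPUT_TOKEN = 0.6 / 1_000_000; // assumed USD per output token
// Estimate the cost of a single completion from its reported token usage
function estimateCost(usage: { prompt_tokens: number; completion_tokens: number }): number {
  return (
    usage.prompt_tokens * PRICE_PER_INPUT_TOKEN +
    usage.completion_tokens * PRICE_PER_OUTPUT_TOKEN
  );
}
const reply = await getResponse('Explain recursion', 'medium');
if (reply.usage) {
  console.log(`Estimated cost: $${estimateCost(reply.usage).toFixed(6)}`);
}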
Stop Sequences
Stop sequences tell the model to stop generating as soon as it produces one of the strings you specify; the stop sequence itself is not included in the returned text, and OpenAI accepts up to four of them. They are useful for controlling output format.
Using Stop Sequences
// Stop at specific tokens
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
stop: ['\n\n', 'END', '---'],
messages: [{ role: 'user', content: 'Write a short paragraph about TypeScript' }],
});
// Model stops when it generates "\n\n", "END", or "---"
// Useful for structured output
const structuredResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
stop: ['```'],
messages: [
{
role: 'user',
content: 'Generate JSON for a user: ```json\n',
},
],
});
// Model stops before generating closing ```
Practical Example: Controlled Generation
// Generate a list that stops after 5 items
const listResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
stop: ['6.'], // Stop before item 6
messages: [
{
role: 'user',
content: 'List the benefits of TypeScript:\n1.',
},
],
});
// Generates at most 5 items (stops before '6.' is emitted)
// Generate code that stops at a specific point
const functionResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
stop: ['\n\nfunction', '\n\nclass'],
messages: [
{
role: 'user',
content: 'Write a single TypeScript function to validate email',
},
],
});
// Stops before generating additional functions
Putting It All Together
Parameter Presets for Common Tasks
type TaskType = 'factual' | 'code' | 'creative' | 'chat' | 'data';
interface ModelParams {
temperature: number;
top_p: number;
frequency_penalty: number;
presence_penalty: number;
max_tokens: number;
}
const presets: Record<TaskType, ModelParams> = {
factual: {
temperature: 0,
top_p: 1,
frequency_penalty: 0,
presence_penalty: 0,
max_tokens: 500,
},
code: {
temperature: 0.2,
top_p: 1,
frequency_penalty: 0,
presence_penalty: 0,
max_tokens: 1500,
},
creative: {
temperature: 0.9,
top_p: 1,
frequency_penalty: 0.5,
presence_penalty: 0.5,
max_tokens: 2000,
},
chat: {
temperature: 0.7,
top_p: 0.9,
frequency_penalty: 0.3,
presence_penalty: 0.3,
max_tokens: 1000,
},
data: {
temperature: 0,
top_p: 1,
frequency_penalty: 0,
presence_penalty: 0,
max_tokens: 2000,
},
};
async function smartComplete(
messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
taskType: TaskType
): Promise<string> {
const params = presets[taskType];
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages,
...params,
});
return response.choices[0].message.content || '';
}
// Usage
const factual = await smartComplete(
[{ role: 'user', content: 'What is the speed of light?' }],
'factual'
);
const creative = await smartComplete(
[{ role: 'user', content: 'Write a poem about coding' }],
'creative'
);
Anthropic Parameters
Anthropic (Claude) has similar but slightly different parameters:
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic();
const response = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 1024,
// temperature: 0-1 (similar to OpenAI)
temperature: 0.7,
// top_p: similar to OpenAI
top_p: 0.9,
// top_k: limits to K most likely tokens (not in OpenAI)
// top_k: 40,
messages: [{ role: 'user', content: 'Hello!' }],
});
Exercises
Exercise 1: Choose Parameters
For each scenario, specify the temperature and any other relevant parameters:
- A medical information chatbot
- A creative story generator
- A code completion tool
- A data extraction pipeline
Solution
- Medical chatbot:
  - temperature: 0.0 (accuracy is critical)
  - frequency_penalty: 0 (technical terms may repeat)
  - max_tokens: 1000 (detailed but not excessive)
- Story generator:
  - temperature: 0.9-1.0 (encourage creativity)
  - frequency_penalty: 0.6 (avoid repetitive phrases)
  - presence_penalty: 0.5 (encourage new elements)
  - max_tokens: 2000+ (stories need length)
- Code completion:
  - temperature: 0.0-0.2 (code should be correct)
  - max_tokens: 500 (complete the current block)
  - stop: ["\n\n"] (stop after the current function/block)
- Data extraction:
  - temperature: 0 (must be deterministic)
  - max_tokens: varies by data size
  - No penalties (exact extraction needed)
Exercise 2: Debug the Parameters
This code produces responses that are too repetitive and sometimes too long. Fix the parameters:
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 1.5,
max_tokens: 10000,
frequency_penalty: 0,
messages: [{ role: 'user', content: 'Write a blog post about JavaScript' }],
});
Solution
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: 0.7, // Reduced from 1.5 - too high causes incoherence
max_tokens: 1500, // Reduced - 10000 is excessive for a blog post
frequency_penalty: 0.5, // Added - reduces repetition
presence_penalty: 0.3, // Added - encourages topic variety
messages: [{ role: 'user', content: 'Write a blog post about JavaScript' }],
});
Problems with original:
- temperature 1.5 is too high, causing erratic output
- max_tokens 10000 is wasteful and allows rambling
- No frequency_penalty allows repetitive content
Exercise 3: Experiment with Temperature
Write code to generate the same prompt with temperatures 0, 0.5, and 1.0, then compare the outputs:
Solution
import OpenAI from 'openai';
const openai = new OpenAI();
async function compareTemperatures(prompt: string): Promise<void> {
const temperatures = [0, 0.5, 1.0];
for (const temp of temperatures) {
console.log(`\n--- Temperature: ${temp} ---`);
// Generate 3 responses at each temperature
for (let i = 0; i < 3; i++) {
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
temperature: temp,
max_tokens: 100,
messages: [{ role: 'user', content: prompt }],
});
console.log(`Response ${i + 1}: ${response.choices[0].message.content}`);
}
}
}
await compareTemperatures('Describe a sunset in one sentence.');
// Expected results:
// Temperature 0: All 3 responses identical
// Temperature 0.5: Similar but with variations
// Temperature 1.0: More diverse, creative descriptions
Key Takeaways
- Temperature (0-2) controls randomness - lower for facts, higher for creativity
- Top-p (0-1) limits which tokens can be selected - use OR temperature, not both
- Frequency penalty (-2 to 2, typically 0 to 2) reduces word repetition within a response
- Presence penalty (-2 to 2, typically 0 to 2) encourages new topics
- Max tokens limits response length and affects cost
- Stop sequences provide precise control over where generation ends
- Match parameters to your task - there is no universal best setting
Resources
| Resource | Type | Description |
|---|---|---|
| OpenAI API Parameters | Documentation | Full parameter reference |
| Anthropic Parameters | Documentation | Claude parameter reference |
| Temperature Guide | Documentation | OpenAI's temperature recommendations |
Next Lesson
Now that you understand how to control model outputs, let us explore a critical challenge: when models confidently output incorrect information - hallucinations.