Lesson 1.2: How Models Are Trained
Duration: 60 minutes
Learning Objectives
By the end of this lesson, you will be able to:
- Describe what training data is and why it matters
- Explain the basic training loop (predict, compare, adjust)
- Define what "weights" and "parameters" mean
- Recognize overfitting and underfitting
- Appreciate why good data is crucial for good models
The Core Idea: Learning from Examples
Machine learning is fundamentally about learning patterns from examples. Think of how you learned to recognize dogs:
- You saw many dogs as a child
- Adults told you "that's a dog"
- Eventually, you could recognize dogs you had never seen before
ML works the same way:
- Show the model many examples
- Tell it the correct answer for each
- Eventually, it can handle new examples it has never seen
Training Data: The Foundation
Training data is the collection of examples used to teach a model. Each example typically has:
- Input: The data the model sees (image, text, numbers)
- Label: The correct answer (what we want the model to predict)
// Example: Training data for sentiment analysis
type TrainingExample = {
input: string; // The text
label: 'positive' | 'negative'; // The correct sentiment
};
const trainingData: TrainingExample[] = [
{ input: 'I love this product!', label: 'positive' },
{ input: 'Terrible experience, avoid.', label: 'negative' },
{ input: 'Best purchase I ever made.', label: 'positive' },
{ input: 'Complete waste of money.', label: 'negative' },
{ input: 'Exceeded my expectations!', label: 'positive' },
// ... thousands more examples
];
Quality Over Quantity
Good training data must be:
- Accurate: Labels must be correct
- Representative: Cover the range of real-world cases
- Balanced: Roughly equal examples of each category
- Clean: No duplicates, errors, or irrelevant data
// Bad training data - all positive examples!
const badData: TrainingExample[] = [
{ input: 'Great!', label: 'positive' },
{ input: 'Excellent!', label: 'positive' },
{ input: 'Amazing!', label: 'positive' },
// Model will think everything is positive
];
// Good training data - balanced and diverse
const goodData: TrainingExample[] = [
{ input: 'Great product, works perfectly.', label: 'positive' },
{ input: 'Broke after one day.', label: 'negative' },
{ input: 'Does exactly what it says.', label: 'positive' },
{ input: 'Poor quality, disappointed.', label: 'negative' },
// Mix of positive and negative
];
The Training Loop
Training happens in cycles. Each cycle follows the same pattern:
┌─────────────────────────────────────────────────────────┐
│ Training Loop │
│ │
│ 1. PREDICT │
│ Model makes a guess │
│ │ │
│ ▼ │
│ 2. COMPARE │
│ Check guess vs correct answer │
│ │ │
│ ▼ │
│ 3. ADJUST │
│ Update model to reduce error │
│ │ │
│ ▼ │
│ Repeat thousands/millions of times │
│ │
└─────────────────────────────────────────────────────────┘
Step 1: Predict
The model sees an input and makes a prediction:
// Model sees: "This movie was boring"
// Model predicts: 60% positive, 40% negative
// (The model is not very good yet!)
type Prediction = {
positive: number; // Probability 0-1
negative: number;
};
const modelPrediction: Prediction = {
positive: 0.6,
negative: 0.4,
};
Step 2: Compare
We compare the prediction to the correct answer using a loss function:
// Correct answer: negative
// Model said: 60% positive
// Loss measures how wrong the model was
// Higher loss = more wrong
const correctLabel = 'negative';
const predictedProbability = modelPrediction.negative; // 0.4
// Simple loss: how far from correct?
// Should have predicted 1.0 for negative, got 0.4
const loss = 1.0 - predictedProbability; // 0.6 - significant error!
Step 3: Adjust
The model adjusts its internal parameters to reduce the error. The technique for working out which adjustments to make is called backpropagation - but you do not need to understand the math.
The key idea: if the model was wrong, adjust it to be less wrong next time.
// After adjustment:
// Model sees: "This movie was boring"
// Model predicts: 45% positive, 55% negative
// Better! But still not great.
// After more training:
// Model predicts: 15% positive, 85% negative
// Much better!
Repeat Many Times
Training involves millions of these cycles:
Epoch 1: Loss = 0.82 (model is mostly guessing)
Epoch 10: Loss = 0.54 (model is learning)
Epoch 100: Loss = 0.21 (model is getting good)
Epoch 1000: Loss = 0.08 (model is quite accurate)
An epoch is one complete pass through all training data.
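To make the loop concrete, here is a runnable toy version in the style of the snippets above. It trains a single weight (plus a bias) on one hand-picked feature - how many words from a small negative-word list appear in the text - using the trainingData defined earlier. The word list and learning rate are illustrative choices; real models learn their own features rather than having them hand-picked.
// Toy training loop: logistic-regression-style updates on one feature
const negativeWords = ['terrible', 'waste', 'avoid', 'hate', 'boring'];
let weight = 0; // untrained
let bias = 0;   // untrained
const learningRate = 0.5; // how big each adjustment is

function feature(text: string): number {
  const lower = text.toLowerCase();
  return negativeWords.filter((w) => lower.includes(w)).length;
}

function predictNegative(text: string): number {
  // Sigmoid squashes the score into a 0-1 probability
  return 1 / (1 + Math.exp(-(weight * feature(text) + bias)));
}

function trainOneEpoch(data: TrainingExample[]): number {
  let totalLoss = 0;
  for (const { input, label } of data) {
    const predicted = predictNegative(input);        // 1. PREDICT
    const target = label === 'negative' ? 1 : 0;
    const error = target - predicted;                // 2. COMPARE
    totalLoss += Math.abs(error);
    weight += learningRate * error * feature(input); // 3. ADJUST
    bias += learningRate * error;
  }
  return totalLoss / data.length; // average loss this epoch
}

// Loss falls across epochs, just like the numbers above
for (let epoch = 1; epoch <= 20; epoch++) {
  const loss = trainOneEpoch(trainingData);
  if (epoch % 5 === 0) console.log(`Epoch ${epoch}: loss = ${loss.toFixed(2)}`);
}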
Weights and Parameters
A model is essentially a mathematical function with adjustable numbers called weights or parameters.
A Simple Analogy
Imagine a basic spam filter with weights:
// Simplified spam scorer
type SpamWeights = {
freeWord: number; // How much "free" suggests spam
exclamationMarks: number; // How much !! suggests spam
knownSender: number; // How much known sender reduces spam
};
// Initial random weights (untrained)
let weights: SpamWeights = {
freeWord: 0.1,
exclamationMarks: 0.2,
knownSender: -0.1,
};
// Minimal supporting pieces so the example runs
// (simplified stand-ins for a real email system)
type Email = { sender: string; body: string };
const knownSenders = new Set(['friend@example.com']);
const isKnownSender = (sender: string) => knownSenders.has(sender);
const countExclamations = (text: string) => (text.match(/!/g) ?? []).length;

function spamScore(email: Email): number {
  let score = 0;
  if (email.body.includes('free')) score += weights.freeWord;
  score += countExclamations(email.body) * weights.exclamationMarks;
  if (isKnownSender(email.sender)) score += weights.knownSender;
  return score;
}
Training Adjusts the Weights
Through training, the weights get better:
// After training on 10,000 emails:
weights = {
freeWord: 0.45, // "free" is moderately spammy
exclamationMarks: 0.72, // Multiple !! is very spammy
knownSender: -0.89, // Known senders are rarely spam
};
The model "learned" that:
- "Free" is somewhat indicative of spam
- Multiple exclamation marks are very suspicious
- Known senders are almost never spam
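With the trained weights in place, the scorer now separates the two kinds of mail. The emails below are made-up illustrations:
// Hypothetical emails to exercise the trained scorer
const spam: Email = { sender: 'unknown@example.net', body: 'Get free money now!!!' };
const normal: Email = { sender: 'friend@example.com', body: 'See you at lunch tomorrow?' };

spamScore(spam);   // 0.45 + 3 * 0.72 = 2.61 -> high score, likely spam
spamScore(normal); // -0.89 -> low score, almost certainly legitimate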
Real Models Have Billions of Parameters
Modern LLMs have billions of parameters:
| Model | Parameters |
|---|---|
| GPT-2 (2019) | 1.5 billion |
| GPT-3 (2020) | 175 billion |
| GPT-4 (2023) | ~1.7 trillion (estimated) |
| Claude 3 Opus | Not disclosed, likely 100B+ |
These parameters encode everything the model "knows" about language.
Training vs Inference
Two distinct phases:
Training: Teaching the model (adjusting weights)
- Expensive (time, compute, data)
- Done once or periodically
- Requires labeled data
Inference: Using the trained model
- Fast
- Done many times
- Works on new, unseen data
// Training phase (done by OpenAI/Anthropic)
const model = await train({
data: trillionsOfTextExamples,
epochs: manyIterations,
computeResources: 'thousands of GPUs',
});
// This took months and millions of dollars
// Inference phase (what you do via API)
const response = await model.predict('Explain quantum physics');
// This takes seconds and costs fractions of a cent
As a developer using AI APIs, you only do inference. The training was done for you.
Overfitting and Underfitting
Two common problems in training:
Overfitting: Memorizing Instead of Learning
The model learns the training data too well, including its quirks and noise.
// Training data
const data = [
{ text: "John's review: Great product!", sentiment: 'positive' },
{ text: "Mary's review: Terrible!", sentiment: 'negative' },
];
// Overfitted model might learn:
// "If text contains 'John', it's positive"
// "If text contains 'Mary', it's negative"
// This fails on new data:
// "John's review: Worst purchase ever!" -> wrongly predicts positive
Signs of overfitting:
- Great accuracy on training data
- Poor accuracy on new data
- Model seems to "memorize" rather than generalize
Underfitting: Not Learning Enough
The model is too simple to capture the patterns in the data.
// Model is too simple - just checks for "good" or "bad"
function oversimplifiedSentiment(text: string): string {
if (text.includes('good')) return 'positive';
if (text.includes('bad')) return 'negative';
return 'neutral';
}
// Fails on:
// "Not good at all" -> wrongly predicts positive
// "I had a bad feeling but it turned out great" -> wrongly predicts negative
Signs of underfitting:
- Poor accuracy on training data
- Poor accuracy on new data
- Model misses obvious patterns
The Goal: Just Right
┌─────────────────────────────────────────────────────────┐
│ │
│ Underfitting Just Right Overfitting │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Too simple Good balance Too complex │
│ Misses patterns Generalizes well Memorizes │
│ Poor everywhere Good on new data Good only on │
│ training │
│ │
└─────────────────────────────────────────────────────────┘
Validation and Test Sets
To detect overfitting, we split data into three parts:
// Split data: 70% training, 15% validation, 15% test
// (shuffle allData first so each split is representative)
const allData = loadData(); // 10,000 examples
const trainingData = allData.slice(0, 7000); // Train on this
const validationData = allData.slice(7000, 8500); // Tune on this
const testData = allData.slice(8500); // Final evaluation
Why Three Sets?
- Training Set: Model learns from this
- Validation Set: Check progress during training, tune hyperparameters
- Test Set: Final evaluation on truly unseen data
Training accuracy: 98% (model saw this data)
Validation accuracy: 92% (model tuned on this)
Test accuracy: 91% (truly unseen - this is the real measure!)
If training accuracy is much higher than test accuracy, you have overfitting.
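A crude but useful sanity check is to compare those two numbers directly. The threshold below is an arbitrary illustration, not a standard value:
// Flag a suspicious gap between training and test accuracy
function looksOverfit(trainAccuracy: number, testAccuracy: number): boolean {
  return trainAccuracy - testAccuracy > 0.05; // more than 5 points apart
}

looksOverfit(0.98, 0.91); // true - worth investigating
looksOverfit(0.93, 0.91); // false - gap is small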
How LLMs Are Trained
Large Language Models follow the same principles but at massive scale.
Phase 1: Pre-training
The model learns from vast amounts of text:
// Simplified: predict the next word
const trainingExample = {
context: 'The cat sat on the',
nextWord: 'mat', // Model learns to predict this
};
// Trained on trillions of such examples from:
// - Books
// - Websites
// - Wikipedia
// - Code repositories
// - Scientific papers
This is self-supervised - no human labels are needed, because the next word itself serves as the label.
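Here is a minimal sketch of how raw text becomes next-word training pairs. Splitting on spaces is a deliberate simplification - real models work on subword tokens:
// Turn raw text into (context, nextWord) training pairs
function makeNextWordExamples(text: string) {
  const words = text.split(' ');
  const examples: { context: string; nextWord: string }[] = [];
  for (let i = 1; i < words.length; i++) {
    examples.push({
      context: words.slice(0, i).join(' '),
      nextWord: words[i], // the label comes from the text itself
    });
  }
  return examples;
}

makeNextWordExamples('The cat sat on the mat');
// [{ context: 'The', nextWord: 'cat' },
//  { context: 'The cat', nextWord: 'sat' },
//  ...
//  { context: 'The cat sat on the', nextWord: 'mat' }]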
Phase 2: Fine-tuning
The model is refined for specific tasks:
// Fine-tuning for conversation
const conversationExample = {
prompt: 'What is the capital of France?',
idealResponse: 'The capital of France is Paris.',
};
// Human trainers create thousands of ideal responses
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
Humans rate model outputs, and the model learns from these ratings:
// Model generates two responses
const response1 = 'Paris is the capital of France.';
const response2 = 'France, capital, Paris, yes.';
// Human rates: response1 is better
// Model adjusts to produce more responses like response1
This is why ChatGPT and Claude sound helpful and polished - they were trained on human preferences.
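Under the hood, those ratings are often collected as preference pairs. A hypothetical shape for such a record, reusing the two responses above:
// Hypothetical shape of a preference record used in RLHF-style training
type PreferencePair = {
  prompt: string;
  chosen: string;   // the response the human preferred
  rejected: string; // the response the human ranked lower
};

const pair: PreferencePair = {
  prompt: 'What is the capital of France?',
  chosen: 'Paris is the capital of France.',
  rejected: 'France, capital, Paris, yes.',
};
// A reward model trained on many such pairs then steers the LLM
// toward responses humans tend to prefer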
Why Training Takes So Long
Training modern AI models requires enormous resources:
| Resource | Requirement |
|---|---|
| Data | Trillions of tokens (words/pieces) |
| Compute | Thousands of high-end GPUs |
| Time | Weeks to months |
| Cost | Millions of dollars |
| Energy | Electricity comparable to a small town's usage |
This is why only large companies train foundation models. Everyone else uses their APIs.
What This Means for Developers
As a developer, you do not train models. But understanding training helps you:
- Understand limitations: Models only know what they were trained on
- Write better prompts: You are working with the model's learned patterns
- Evaluate outputs: Knowing about biases and overfitting helps spot issues
- Choose the right model: Different training = different strengths
// The model was trained on data up to a cutoff date
// It does not know recent events!
const response = await llm.complete('Who won the 2024 Olympics 100m sprint?');
// Model might not know - this is after its training data
// Better approach: provide context
const response2 = await llm.complete(`
Based on this information: [recent news article],
who won the 2024 Olympics 100m sprint?
`);
Exercises
Exercise 1: Identify the Problem
For each scenario, identify whether it is overfitting, underfitting, or well-trained:
- A spam filter that gets 99% accuracy on training data but only 60% on new emails
- A sentiment model that classifies everything as "neutral"
- A translation model that works well on both training texts and user inputs
- A model that perfectly memorizes all training examples but fails on slightly different inputs
Solution
- Overfitting: High training accuracy, low real-world accuracy
- Underfitting: Model is too simple, defaults to one answer
- Well-trained: Good performance on both training and new data
- Overfitting: Memorization instead of learning patterns
Exercise 2: Design Training Data
You want to train a model to classify customer support emails into categories: billing, technical, shipping, general.
What characteristics should your training data have?
Solution
Good training data should have:
- Balance: Roughly equal examples of each category (or at least representative proportions)
- Variety: Different writing styles, lengths, levels of detail
- Edge cases: Emails that could fit multiple categories
- Clean labels: Each email correctly categorized
- Real examples: Actual customer emails, not fabricated ones
- Diversity: Different products, issues, customer types
- Sufficient quantity: At least hundreds per category, ideally thousands
Example distribution:
- Billing: 2,500 emails
- Technical: 2,500 emails
- Shipping: 2,500 emails
- General: 2,500 emails
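One way to represent those examples in code, mirroring the TrainingExample shape from earlier (the category names come from the exercise):
// Training example shape for the support-email classifier
type SupportCategory = 'billing' | 'technical' | 'shipping' | 'general';

type SupportExample = {
  input: string; // the email text
  label: SupportCategory;
};

const example: SupportExample = {
  input: 'I was charged twice for my subscription this month.',
  label: 'billing',
};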
Exercise 3: Training Loop Trace
Given this simplified training process, trace what happens:
// Initial model: always predicts 50% positive, 50% negative
// Training example: "I hate this!" - correct label: negative
// Round 1:
// Prediction: 50% positive, 50% negative
// Correct: negative (should be 0% positive, 100% negative)
// Error: model was 50% off on the negative prediction
// Adjustment: increase weight toward negative
// What do you expect after 5 rounds of training on similar examples?
Solution
After 5 rounds of training on similar negative examples:
The model would likely predict:
- ~15-25% positive
- ~75-85% negative
Why not 0%/100%? Because:
- Learning is gradual - small adjustments each round
- The model needs to generalize, not memorize
- Some uncertainty is healthy (prevents overconfidence)
After many more rounds with mixed examples:
- The model learns to recognize negative words ("hate", "terrible", etc.)
- It learns context matters ("I hate to admit I love this" is positive)
- Confidence increases but is not absolute
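To check the intuition, here is a toy simulation of the trace. The rule - move 20% of the way toward the correct answer each round - is an arbitrary illustration of gradual learning:
// Toy simulation: nudge the prediction toward the target each round
let negativeProb = 0.5;   // initial model: 50/50
const target = 1.0;       // correct label: negative
const stepFraction = 0.2; // fraction of the remaining gap closed per round

for (let round = 1; round <= 5; round++) {
  negativeProb += stepFraction * (target - negativeProb);
  console.log(`Round ${round}: ${(negativeProb * 100).toFixed(0)}% negative`);
}
// Round 5 lands at ~84% negative - inside the 75-85% estimate above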
Key Takeaways
- Training data is the foundation - garbage in, garbage out
- Training loop: predict → compare → adjust → repeat millions of times
- Weights/parameters are the numbers the model adjusts during learning
- Overfitting = memorizing; underfitting = not learning enough
- Validation/test sets help detect problems before deployment
- LLM training involves pre-training, fine-tuning, and human feedback
- As a developer, you use pre-trained models via APIs
Resources
| Resource | Type | Description |
|---|---|---|
| Google ML Crash Course: Training | Tutorial | Interactive training concepts |
| 3Blue1Brown: Gradient Descent | Video | Visual explanation of how learning works |
| OpenAI: GPT-4 Technical Report | Paper | How a modern LLM was trained |
Next Lesson
Now that you understand how models learn, let us explore the different types of problems machine learning can solve.