From Zero to AI

Lesson 2.1: What is a Large Language Model

Duration: 45 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  • Define what a Large Language Model is
  • Understand how LLMs generate text through next-token prediction
  • Explain the Transformer architecture at a conceptual level
  • Recognize the key differences between major LLMs
  • Understand why "large" matters in LLMs

Introduction

You have probably used ChatGPT, Claude, or another AI assistant. These tools feel almost magical - you type a question, and they respond with coherent, helpful answers. But what is actually happening under the hood?

In this lesson, we will demystify Large Language Models (LLMs). You will not learn to build one from scratch (that requires years of research and millions of dollars), but you will understand how they work well enough to use them effectively.


What Makes a Language Model "Large"

A Large Language Model is a neural network trained on massive amounts of text to predict and generate human language. Let us break down each part:

Language Model

A language model predicts what comes next in a sequence of text. Given "The cat sat on the...", a language model might predict:

  • "mat" (high probability)
  • "floor" (medium probability)
  • "elephant" (very low probability)

This simple idea - predicting the next word - is the foundation of all modern LLMs.

Large

"Large" refers to two things:

1. Training Data: Modern LLMs are trained on enormous datasets:

  • Books, articles, websites, code repositories
  • Hundreds of billions to trillions of words
  • Multiple languages and domains

2. Model Size: The number of parameters (adjustable values) in the model:

┌─────────────────────────────────────────────────────────┐
│                    Model Sizes                           │
├─────────────────────────────────────────────────────────┤
│  GPT-2 (2019)         │  1.5 billion parameters         │
│  GPT-3 (2020)         │  175 billion parameters         │
│  GPT-4 (2023)         │  ~1.7 trillion parameters*      │
│  Claude 3 (2024)      │  Not disclosed                  │
│  Llama 3 (2024)       │  8B to 405B parameters          │
└─────────────────────────────────────────────────────────┘
* Estimated, not officially confirmed

More parameters let the model learn more complex patterns, but they also require more compute to train and run.
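
To get a feel for the scale, here is a rough back-of-envelope sketch (my own illustration, not an official figure) of how much memory it takes just to store a model's weights:

// Approximate memory needed to hold the weights in 16-bit precision.
// This ignores activations, key-value caches, and serving overhead.
function weightMemoryGB(parameters: number, bytesPerParam = 2): number {
  return (parameters * bytesPerParam) / 1e9;
}

console.log(weightMemoryGB(1.5e9)); // GPT-2 scale:    ~3 GB
console.log(weightMemoryGB(70e9));  // Llama 70B scale: ~140 GB
console.log(weightMemoryGB(175e9)); // GPT-3 scale:    ~350 GB

At hundreds of gigabytes for the weights alone, the largest models no longer fit on a single consumer GPU, which is part of why training and serving them is so expensive.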


How LLMs Generate Text

The core mechanism is surprisingly simple: LLMs predict one token at a time.

Next-Token Prediction

When you ask an LLM "What is the capital of France?", here is what happens:

Input: "What is the capital of France?"

Step 1: Model predicts next token → "The"
Step 2: Model sees "...France? The" → predicts "capital"
Step 3: Model sees "...The capital" → predicts "of"
Step 4: Model sees "...capital of" → predicts "France"
Step 5: Model sees "...of France" → predicts "is"
Step 6: Model sees "...France is" → predicts "Paris"
Step 7: Model sees "...is Paris" → predicts "."
Step 8: Model predicts end of response

Output: "The capital of France is Paris."

Each prediction considers all the text that came before. The model does not "know" facts - it has learned statistical patterns from training data.

Autoregressive Generation

This process is called "autoregressive" because each output becomes input for the next prediction:

┌────────────────────────────────────────────────────────────┐
│              Autoregressive Text Generation                 │
│                                                            │
│  Prompt: "Hello"                                           │
│                                                            │
│  Step 1: "Hello" ────────────► Model ────► ","             │
│  Step 2: "Hello," ───────────► Model ────► "how"           │
│  Step 3: "Hello, how" ───────► Model ────► "are"           │
│  Step 4: "Hello, how are" ───► Model ────► "you"           │
│  Step 5: "Hello, how are you" ► Model ────► "?"            │
│  Step 6: "Hello, how are you?" ► Model ────► [END]         │
│                                                            │
│  Output: "Hello, how are you?"                             │
└────────────────────────────────────────────────────────────┘

This is why LLM responses appear to "stream" - each token is generated one at a time.
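
In code, the whole process is a loop that keeps appending the predicted token to the context. Here is a toy sketch - the lookup table stands in for the real neural network, but the loop itself is the same idea:

// Toy stand-in for the real model: a lookup table instead of a neural network.
const toyModel: Record<string, string> = {
  'Hello': ',',
  'Hello,': ' how',
  'Hello, how': ' are',
  'Hello, how are': ' you',
  'Hello, how are you': '?',
  'Hello, how are you?': '[END]',
};

function predictNextToken(context: string): string {
  return toyModel[context] ?? '[END]';
}

function generate(prompt: string, maxTokens = 50): string {
  let text = prompt;
  for (let i = 0; i < maxTokens; i++) {
    const next = predictNextToken(text); // each call sees the full context so far
    if (next === '[END]') break;         // the model signals the end of its response
    text += next;
  }
  return text;
}

console.log(generate('Hello')); // "Hello, how are you?"

At each step the model actually produces a whole probability distribution over possible next tokens - the next section looks at what that means.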

Probability Distributions

For each position, the model outputs probabilities for every possible next token:

// Conceptual representation of model output
type TokenProbabilities = {
  token: string;
  probability: number;
};

// After "The cat sat on the..."
const predictions: TokenProbabilities[] = [
  { token: 'mat', probability: 0.35 },
  { token: 'floor', probability: 0.25 },
  { token: 'couch', probability: 0.15 },
  { token: 'bed', probability: 0.1 },
  { token: 'ground', probability: 0.08 },
  // ... thousands more tokens with tiny probabilities
];

// The model samples from this distribution
// With temperature=0, it always picks "mat"
// With higher temperature, it might pick "couch" or "floor"
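
Generation means sampling from this distribution. Here is a minimal sketch of temperature-style sampling over the predictions array above (a simplification - real implementations work on raw scores over the entire vocabulary):

// Sample one token, optionally reshaping the distribution with temperature.
// Lower temperature sharpens the distribution; higher temperature flattens it.
function sampleToken(candidates: TokenProbabilities[], temperature = 1): string {
  const weights = candidates.map(p => Math.pow(p.probability, 1 / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);
  let r = Math.random() * total;
  for (let i = 0; i < candidates.length; i++) {
    r -= weights[i];
    if (r <= 0) return candidates[i].token;
  }
  return candidates[candidates.length - 1].token;
}

console.log(sampleToken(predictions, 0.2)); // almost always 'mat'
console.log(sampleToken(predictions, 1.5)); // 'floor' or 'couch' show up more often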

We will explore temperature and sampling in Lesson 2.4.


The Transformer Architecture

All modern LLMs use an architecture called the "Transformer", introduced in the 2017 paper "Attention Is All You Need."

Why Transformers Changed Everything

Before Transformers, language models processed text sequentially - one word at a time, in order. This was slow and made it hard to capture long-range dependencies.

Transformers made "attention" the central mechanism - the ability to look at all parts of the input simultaneously:

┌─────────────────────────────────────────────────────────┐
│                  Attention Mechanism                     │
│                                                         │
│  Sentence: "The cat sat on the mat because it was soft" │
│                                                         │
│  Question: What does "it" refer to?                     │
│                                                         │
│  Attention weights:                                     │
│    "The"       0.02                                     │
│    "cat"       0.05                                     │
│    "sat"       0.01                                     │
│    "on"        0.01                                     │
│    "the"       0.02                                     │
│    "mat"       0.85  ← High attention!                  │
│    "because"   0.01                                     │
│    "was"       0.02                                     │
│    "soft"      0.01                                     │
│                                                         │
│  The model "attends" to "mat" to understand "it"        │
└─────────────────────────────────────────────────────────┘
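
If you are curious what attention weights look like computationally, here is a heavily simplified toy version - the vectors below are invented for this example, and real models use hundreds of dimensions with learned values:

// Toy dot-product attention: score each word against a query,
// then turn the scores into weights with a softmax.
const words = ['The', 'cat', 'sat', 'on', 'the', 'mat'];
const wordVectors = [
  [0.1, 0.0],  // The
  [0.9, 0.2],  // cat
  [0.2, 0.1],  // sat
  [0.0, 0.1],  // on
  [0.1, 0.0],  // the
  [0.3, 1.0],  // mat
];
const queryForIt = [0.3, 0.95]; // invented vector standing in for "it"

const dot = (a: number[], b: number[]) => a[0] * b[0] + a[1] * b[1];

const scores = wordVectors.map(v => dot(queryForIt, v));
const expScores = scores.map(Math.exp);
const total = expScores.reduce((sum, e) => sum + e, 0);
const weights = expScores.map(e => e / total);

words.forEach((word, i) => console.log(word, weights[i].toFixed(2)));
// "mat" gets the highest weight because its vector points in the most
// similar direction to the query - that is all "attending" means here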

Key Components

A simplified view of Transformer architecture:

┌─────────────────────────────────────────────────────────┐
│                 Transformer Architecture                 │
│                                                         │
│  Input Text                                             │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Tokenization │  ← Break text into tokens             │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Embedding   │  ← Convert tokens to vectors           │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Transformer │  ← Multiple layers of:                 │
│  │   Layers    │    • Self-attention                    │
│  │   (x N)     │    • Feed-forward networks             │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Output Head │  ← Predict next token probabilities    │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  Next Token                                             │
└─────────────────────────────────────────────────────────┘

You do not need to understand the mathematics. The key insight is that attention allows the model to consider relationships between all parts of the input when making predictions.
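
One way to hold the picture in your head is as a pipeline of functions. The signatures below are purely illustrative (they mirror the diagram above, not any real library):

// Illustrative signatures only - these mirror the diagram, not an actual API.
type Token = number;    // an integer id for a small chunk of text
type Vector = number[]; // a list of numbers representing one token

type Tokenize = (text: string) => Token[];                    // break text into tokens
type Embed = (tokens: Token[]) => Vector[];                   // convert tokens to vectors
type TransformerLayers = (vectors: Vector[]) => Vector[];     // attention + feed-forward, repeated N times
type OutputHead = (vectors: Vector[]) => Map<Token, number>;  // probabilities for the next token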


Training vs Inference

Understanding these two phases helps you use LLMs effectively:

Training (Done by AI Companies)

Training is the process of teaching the model:

  1. Pre-training: Model learns from massive text datasets

    • Predicts next tokens on billions of examples
    • Takes weeks/months on thousands of GPUs
    • Costs millions of dollars
  2. Fine-tuning: Model learns specific behaviors

    • Instruction following (ChatGPT, Claude)
    • Safety guidelines
    • Specific formats or domains
  3. RLHF (Reinforcement Learning from Human Feedback)

    • Humans rate model outputs
    • Model learns to produce preferred responses
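
Providers also expose fine-tuning (step 2 above) through their APIs. As a rough sketch - the file name and model here are placeholders, so check OpenAI's current fine-tuning docs for details - kicking off a job looks something like this:

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// Upload a JSONL file of example conversations, then start a fine-tuning job.
const file = await openai.files.create({
  file: fs.createReadStream('training_examples.jsonl'),
  purpose: 'fine-tune',
});

const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
});

console.log(job.status); // e.g. 'queued' - the training itself runs on OpenAI's servers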

Inference (What You Do)

Inference is using a trained model:

// This is inference - using a pre-trained model
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Explain quantum computing simply' }],
});

// The model weights are frozen (not learning)
// It is just making predictions based on training

When you call an API, you are doing inference. The model does not learn from your inputs (though the company might use them for future training, depending on their policies).


Major LLMs and Their Differences

As a developer, you will work with various LLMs. Here are the major players:

OpenAI (GPT Series)

// OpenAI API example
import OpenAI from 'openai';

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: 'gpt-4o', // Latest model
  messages: [{ role: 'user', content: 'Hello!' }],
});
  • Models: GPT-4, GPT-4o, GPT-4o-mini
  • Strengths: General capability, large ecosystem, multimodal (vision)
  • Pricing: Pay per token

Anthropic (Claude Series)

// Anthropic API example
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello!' }],
});
  • Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku
  • Strengths: Long context (200K tokens), safety focus, strong reasoning
  • Pricing: Pay per token

Google (Gemini Series)
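
For completeness, here is a sketch of a Gemini call using Google's @google/generative-ai SDK - model names and API details change frequently, so treat this as illustrative and check Google's docs:

// Google Gemini API example (illustrative sketch)
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });

const result = await model.generateContent('Hello!');
console.log(result.response.text());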

  • Models: Gemini Pro, Gemini Ultra
  • Strengths: Multimodal, Google integration
  • Pricing: Free tier available, then pay per token

Open Source (Meta, Mistral, etc.)

// Open source models can run locally or via providers
// Example using Ollama (local)
const response = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3',
    prompt: 'Hello!',
  }),
});
  • Models: Llama 3, Mistral, Mixtral
  • Strengths: Self-hosted, no API costs, full control
  • Trade-offs: Smaller than commercial models, need your own infrastructure

Comparison Table

Feature          GPT-4   Claude 3.5   Gemini Pro   Llama 3
Context Window   128K    200K         1M           8K-128K
Multimodal       Yes     Yes          Yes          Some versions
Self-hosting     No      No           No           Yes
API Required     Yes     Yes          Yes          Optional
Cost             $$$     $$           $$           Free*

*Infrastructure costs apply for self-hosting


Why "Large" Matters

You might wonder: why do bigger models perform better?

Emergent Capabilities

As models get larger, they develop capabilities that smaller models lack:

┌─────────────────────────────────────────────────────────┐
│              Capabilities by Model Size                  │
├─────────────────────────────────────────────────────────┤
│  Small (~1B)    │ Basic text completion                 │
│  Medium (~10B)  │ + Coherent paragraphs                 │
│  Large (~100B)  │ + Reasoning, following instructions   │
│  Very Large     │ + Complex reasoning, code generation, │
│  (>100B)        │   few-shot learning, multi-step tasks │
└─────────────────────────────────────────────────────────┘

Some capabilities only emerge at certain scales - they are not present in smaller models at all.

The Scaling Laws

Research has shown predictable relationships:

  • More data → better performance
  • More parameters → better performance
  • More compute → better performance

This is why AI companies keep building larger models.
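
One commonly cited shape for these relationships is a power law: the model's prediction error ("loss") falls off smoothly as you add parameters and training data. The sketch below is purely illustrative - the constants are invented, only the shape matters:

// Illustrative scaling-law shape: loss = E + A / N^alpha + B / D^beta
// where N is parameter count and D is training tokens.
// The constants are made up for this example, not fitted values.
function predictedLoss(params: number, tokens: number): number {
  const E = 1.7, A = 400, B = 1800, alpha = 0.34, beta = 0.28;
  return E + A / Math.pow(params, alpha) + B / Math.pow(tokens, beta);
}

console.log(predictedLoss(1e9, 2e10));    // smaller model, less data  -> higher loss
console.log(predictedLoss(7e10, 1.4e12)); // bigger model, more data   -> lower loss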

Trade-offs

Larger is not always better for your use case:

Consideration   Large Model     Small Model
Quality         Higher          Lower
Speed           Slower          Faster
Cost            Expensive       Cheap
Use Case        Complex tasks   Simple tasks

For classification or simple extraction, a small model might be perfect. For creative writing or complex reasoning, you need larger models.
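
In practice this often becomes a simple routing decision in your code. A sketch - the task categories and model names are just examples, not a recommendation:

// Pick a model based on how demanding the task is.
// Swap in whichever models you actually use.
type Task = 'classification' | 'extraction' | 'summarization' | 'complex-reasoning';

function chooseModel(task: Task): string {
  if (task === 'classification' || task === 'extraction') {
    return 'gpt-4o-mini'; // cheap and fast is usually enough
  }
  if (task === 'summarization') {
    return 'gpt-4o'; // middle ground: quality matters, volume is high
  }
  return 'claude-sonnet-4-20250514'; // spend more where quality matters most
}

console.log(chooseModel('classification')); // 'gpt-4o-mini'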


Practical Example: Comparing Models

Let us see how different models handle the same prompt:

import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

const openai = new OpenAI();
const anthropic = new Anthropic();

const prompt = 'Explain why the sky is blue in one sentence.';

// GPT-4o-mini (smaller, faster, cheaper)
const gptMiniResponse = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: prompt }],
});

// GPT-4o (larger, more capable)
const gpt4Response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: prompt }],
});

// Claude 3.5 Sonnet
const claudeResponse = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 100,
  messages: [{ role: 'user', content: prompt }],
});

console.log('GPT-4o-mini:', gptMiniResponse.choices[0].message.content);
console.log('GPT-4o:', gpt4Response.choices[0].message.content);
console.log('Claude:', claudeResponse.content[0].text);

For simple factual questions, you might not notice a difference. For complex reasoning or nuanced tasks, larger models shine.


Exercises

Exercise 1: Identify the Process

Match each description to its term: Training, Inference, Fine-tuning, or RLHF.

  1. A company shows the model millions of Reddit posts
  2. You ask Claude to summarize an article
  3. Humans rank which response is better
  4. A company teaches GPT to follow instructions

Solution

  1. Training (Pre-training): Learning from large datasets
  2. Inference: Using a trained model to generate output
  3. RLHF: Human feedback to improve responses
  4. Fine-tuning: Teaching specific behaviors

Exercise 2: Choose the Right Model

For each scenario, which model would you choose: GPT-4 (expensive, capable), GPT-4o-mini (cheap, fast), or self-hosted Llama (free, requires setup)?

  1. A startup with no budget classifying customer emails
  2. An enterprise building a legal document analyzer
  3. A developer building a quick chatbot prototype
  4. A company with strict data privacy requirements

Solution

  1. Self-hosted Llama: No API costs, email classification does not need the largest model
  2. GPT-4: Complex reasoning needed for legal documents, enterprise can afford it
  3. GPT-4o-mini: Fast and cheap for prototyping
  4. Self-hosted Llama: Data stays on your servers, no external API calls

Exercise 3: Predict the Output

Given the prompt "The quick brown fox jumps over the", what would you expect the model to predict next? Why?

Solution

The model would most likely predict "lazy" (followed by "dog"), completing the famous pangram "The quick brown fox jumps over the lazy dog."

Why? This sentence appears countless times in the training data (it is commonly used to demonstrate fonts and test keyboards). The model has learned this statistical pattern so strongly that it is almost certain to complete it this way.

This demonstrates that LLMs are pattern matchers - they predict what they have seen before.


Key Takeaways

  1. LLMs predict one token at a time based on all previous context
  2. "Large" means both training data and model parameters - more of both improves capability
  3. Transformers use attention to consider relationships across the entire input
  4. Training is expensive and done by AI companies; inference is what you do via APIs
  5. Different models have different strengths - choose based on your needs
  6. Emergent capabilities appear only in larger models

Resources

Resource                      Type            Description
OpenAI Models Documentation   Documentation   Detailed specs for GPT models
Anthropic Model Card          Documentation   Claude model specifications
The Illustrated Transformer   Article         Visual guide to Transformer architecture
Attention Is All You Need     Paper           Original Transformer paper (advanced)

Next Lesson

Now that you understand what LLMs are and how they generate text, let us explore how they actually process text - through tokenization.

Continue to Lesson 2.2: Tokens and Tokenization