From Zero to AI

Lesson 2.1: What is a Large Language Model

Duration: 45 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  • Define what a Large Language Model is
  • Understand how LLMs generate text through next-token prediction
  • Explain the Transformer architecture at a conceptual level
  • Recognize the key differences between major LLMs
  • Understand why "large" matters in LLMs

Introduction

You have probably used ChatGPT, Claude, or another AI assistant. These tools feel almost magical - you type a question, and they respond with coherent, helpful answers. But what is actually happening under the hood?

In this lesson, we will demystify Large Language Models (LLMs). You will not learn to build one from scratch (that requires years of research and millions of dollars), but you will understand how they work well enough to use them effectively.


What Makes a Language Model "Large"

A Large Language Model is a neural network trained on massive amounts of text to predict and generate human language. Let us break down each part:

Language Model

A language model predicts what comes next in a sequence of text. Given "The cat sat on the...", a language model might predict:

  • "mat" (high probability)
  • "floor" (medium probability)
  • "elephant" (very low probability)

This simple idea - predicting the next word - is the foundation of all modern LLMs.

Large

"Large" refers to two things:

1. Training Data: Modern LLMs are trained on enormous datasets:

  • Books, articles, websites, code repositories
  • Hundreds of billions to trillions of words
  • Multiple languages and domains

2. Model Size: The number of parameters (adjustable values) in the model:

┌─────────────────────────────────────────────────────────┐
│                    Model Sizes                           │
├─────────────────────────────────────────────────────────┤
│  GPT-2 (2019)         │  1.5 billion parameters         │
│  GPT-3 (2020)         │  175 billion parameters         │
│  GPT-4 (2023)         │  ~1.7 trillion parameters*      │
│  Claude 3 (2024)      │  Not disclosed                  │
│  Llama 3 (2024)       │  8B to 405B parameters          │
└─────────────────────────────────────────────────────────┘
* Estimated, not officially confirmed

More parameters let the model learn more complex patterns, but they also require more compute to train and run.
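
To get a feel for the scale, here is a rough back-of-envelope sketch (my own illustration, not an official figure) of how much memory it takes just to store a model's weights:

// Approximate memory needed to hold the weights in 16-bit precision.
// This ignores activations, key-value caches, and serving overhead.
function weightMemoryGB(parameters: number, bytesPerParam = 2): number {
  return (parameters * bytesPerParam) / 1e9;
}

console.log(weightMemoryGB(1.5e9)); // GPT-2 scale:    ~3 GB
console.log(weightMemoryGB(70e9));  // Llama 70B scale: ~140 GB
console.log(weightMemoryGB(175e9)); // GPT-3 scale:    ~350 GB

At hundreds of gigabytes for the weights alone, the largest models no longer fit on a single consumer GPU, which is part of why training and serving them is so expensive.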


How LLMs Generate Text

The core mechanism is surprisingly simple: LLMs predict one token at a time.

Next-Token Prediction

When you ask an LLM "What is the capital of France?", here is what happens:

Input: "What is the capital of France?"

Step 1: Model predicts next token → "The"
Step 2: Model sees "...France? The" → predicts "capital"
Step 3: Model sees "...The capital" → predicts "of"
Step 4: Model sees "...capital of" → predicts "France"
Step 5: Model sees "...of France" → predicts "is"
Step 6: Model sees "...France is" → predicts "Paris"
Step 7: Model sees "...is Paris" → predicts "."
Step 8: Model predicts end of response

Output: "The capital of France is Paris."

Each prediction considers all the text that came before. The model does not "know" facts - it has learned statistical patterns from training data.

Autoregressive Generation

This process is called "autoregressive" because each output becomes input for the next prediction:

┌────────────────────────────────────────────────────────────┐
│              Autoregressive Text Generation                 │
│                                                            │
│  Prompt: "Hello"                                           │
│                                                            │
│  Step 1: "Hello" ────────────► Model ────► ","             │
│  Step 2: "Hello," ───────────► Model ────► "how"           │
│  Step 3: "Hello, how" ───────► Model ────► "are"           │
│  Step 4: "Hello, how are" ───► Model ────► "you"           │
│  Step 5: "Hello, how are you" ► Model ────► "?"            │
│  Step 6: "Hello, how are you?" ► Model ────► [END]         │
│                                                            │
│  Output: "Hello, how are you?"                             │
└────────────────────────────────────────────────────────────┘

This is why LLM responses appear to "stream" - each token is generated one at a time.
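
In code, the whole process is a loop that keeps appending the predicted token to the context. Here is a toy sketch - the lookup table stands in for the real neural network, but the loop itself is the same idea:

// Toy stand-in for the real model: a lookup table instead of a neural network.
const toyModel: Record<string, string> = {
  'Hello': ',',
  'Hello,': ' how',
  'Hello, how': ' are',
  'Hello, how are': ' you',
  'Hello, how are you': '?',
  'Hello, how are you?': '[END]',
};

function predictNextToken(context: string): string {
  return toyModel[context] ?? '[END]';
}

function generate(prompt: string, maxTokens = 50): string {
  let text = prompt;
  for (let i = 0; i < maxTokens; i++) {
    const next = predictNextToken(text); // each call sees the full context so far
    if (next === '[END]') break;         // the model signals the end of its response
    text += next;
  }
  return text;
}

console.log(generate('Hello')); // "Hello, how are you?"

At each step the model actually produces a whole probability distribution over possible next tokens - the next section looks at what that means.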

Probability Distributions

For each position, the model outputs probabilities for every possible next token:

// Conceptual representation of model output
type TokenProbabilities = {
  token: string;
  probability: number;
};

// After "The cat sat on the..."
const predictions: TokenProbabilities[] = [
  { token: 'mat', probability: 0.35 },
  { token: 'floor', probability: 0.25 },
  { token: 'couch', probability: 0.15 },
  { token: 'bed', probability: 0.1 },
  { token: 'ground', probability: 0.08 },
  // ... thousands more tokens with tiny probabilities
];

// The model samples from this distribution
// With temperature=0, it always picks "mat"
// With higher temperature, it might pick "couch" or "floor"
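
Generation means sampling from this distribution. Here is a minimal sketch of temperature-style sampling over the predictions array above (a simplification - real implementations work on raw scores over the entire vocabulary):

// Sample one token, optionally reshaping the distribution with temperature.
// Lower temperature sharpens the distribution; higher temperature flattens it.
function sampleToken(candidates: TokenProbabilities[], temperature = 1): string {
  const weights = candidates.map(p => Math.pow(p.probability, 1 / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);
  let r = Math.random() * total;
  for (let i = 0; i < candidates.length; i++) {
    r -= weights[i];
    if (r <= 0) return candidates[i].token;
  }
  return candidates[candidates.length - 1].token;
}

console.log(sampleToken(predictions, 0.2)); // almost always 'mat'
console.log(sampleToken(predictions, 1.5)); // 'floor' or 'couch' show up more often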

We will explore temperature and sampling in Lesson 2.4.


The Transformer Architecture

All modern LLMs use an architecture called the "Transformer", introduced in the 2017 paper "Attention Is All You Need."

Why Transformers Changed Everything

Before Transformers, language models processed text sequentially - one word at a time, in order. This was slow and made it hard to capture long-range dependencies.

Transformers made "attention" the central mechanism - the ability to look at all parts of the input simultaneously:

┌─────────────────────────────────────────────────────────┐
│                  Attention Mechanism                     │
│                                                         │
│  Sentence: "The cat sat on the mat because it was soft" │
│                                                         │
│  Question: What does "it" refer to?                     │
│                                                         │
│  Attention weights:                                     │
│    "The"       0.02                                     │
│    "cat"       0.05                                     │
│    "sat"       0.01                                     │
│    "on"        0.01                                     │
│    "the"       0.02                                     │
│    "mat"       0.85  ← High attention!                  │
│    "because"   0.01                                     │
│    "was"       0.02                                     │
│    "soft"      0.01                                     │
│                                                         │
│  The model "attends" to "mat" to understand "it"        │
└─────────────────────────────────────────────────────────┘
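
If you are curious what attention weights look like computationally, here is a heavily simplified toy version - the vectors below are invented for this example, and real models use hundreds of dimensions with learned values:

// Toy dot-product attention: score each word against a query,
// then turn the scores into weights with a softmax.
const words = ['The', 'cat', 'sat', 'on', 'the', 'mat'];
const wordVectors = [
  [0.1, 0.0],  // The
  [0.9, 0.2],  // cat
  [0.2, 0.1],  // sat
  [0.0, 0.1],  // on
  [0.1, 0.0],  // the
  [0.3, 1.0],  // mat
];
const queryForIt = [0.3, 0.95]; // invented vector standing in for "it"

const dot = (a: number[], b: number[]) => a[0] * b[0] + a[1] * b[1];

const scores = wordVectors.map(v => dot(queryForIt, v));
const expScores = scores.map(Math.exp);
const total = expScores.reduce((sum, e) => sum + e, 0);
const weights = expScores.map(e => e / total);

words.forEach((word, i) => console.log(word, weights[i].toFixed(2)));
// "mat" gets the highest weight because its vector points in the most
// similar direction to the query - that is all "attending" means here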

Key Components

A simplified view of Transformer architecture:

┌─────────────────────────────────────────────────────────┐
│                 Transformer Architecture                 │
│                                                         │
│  Input Text                                             │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Tokenization │  ← Break text into tokens             │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Embedding   │  ← Convert tokens to vectors           │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Transformer │  ← Multiple layers of:                 │
│  │   Layers    │    • Self-attention                    │
│  │   (x N)     │    • Feed-forward networks             │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  ┌─────────────┐                                        │
│  │ Output Head │  ← Predict next token probabilities    │
│  └─────────────┘                                        │
│       │                                                 │
│       ▼                                                 │
│  Next Token                                             │
└─────────────────────────────────────────────────────────┘

You do not need to understand the mathematics. The key insight is that attention allows the model to consider relationships between all parts of the input when making predictions.
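
One way to hold the picture in your head is as a pipeline of functions. The signatures below are purely illustrative (they mirror the diagram above, not any real library):

// Illustrative signatures only - these mirror the diagram, not an actual API.
type Token = number;    // an integer id for a small chunk of text
type Vector = number[]; // a list of numbers representing one token

type Tokenize = (text: string) => Token[];                    // break text into tokens
type Embed = (tokens: Token[]) => Vector[];                   // convert tokens to vectors
type TransformerLayers = (vectors: Vector[]) => Vector[];     // attention + feed-forward, repeated N times
type OutputHead = (vectors: Vector[]) => Map<Token, number>;  // probabilities for the next token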


Training vs Inference

Understanding these two phases helps you use LLMs effectively:

Training (Done by AI Companies)

Training is the process of teaching the model:

  1. Pre-training: Model learns from massive text datasets

    • Predicts next tokens on billions of examples
    • Takes weeks/months on thousands of GPUs
    • Costs millions of dollars
  2. Fine-tuning: Model learns specific behaviors

    • Instruction following (ChatGPT, Claude)
    • Safety guidelines
    • Specific formats or domains
  3. RLHF (Reinforcement Learning from Human Feedback)

    • Humans rate model outputs
    • Model learns to produce preferred responses
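
Providers also expose fine-tuning (step 2 above) through their APIs. As a rough sketch - the file name and model here are placeholders, so check OpenAI's current fine-tuning docs for details - kicking off a job looks something like this:

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// Upload a JSONL file of example conversations, then start a fine-tuning job.
const file = await openai.files.create({
  file: fs.createReadStream('training_examples.jsonl'),
  purpose: 'fine-tune',
});

const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
});

console.log(job.status); // e.g. 'queued' - the training itself runs on OpenAI's servers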

Inference (What You Do)

Inference is using a trained model:

// This is inference - using a pre-trained model
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Explain quantum computing simply' }],
});

// The model weights are frozen (not learning)
// It is just making predictions based on training

When you call an API, you are doing inference. The model does not learn from your inputs (though the company might use them for future training, depending on their policies).


Major LLMs and Their Differences

As a developer, you will work with various LLMs. Here are the major players:

OpenAI (GPT Series)

// OpenAI API example
import OpenAI from 'openai';

const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: 'gpt-4o', // Latest model
  messages: [{ role: 'user', content: 'Hello!' }],
});
  • Models: GPT-4, GPT-4o, GPT-4o-mini
  • Strengths: General capability, large ecosystem, multimodal (vision)
  • Pricing: Pay per token

Anthropic (Claude Series)

// Anthropic API example
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello!' }],
});
  • Models: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku
  • Strengths: Long context (200K tokens), safety focus, strong reasoning
  • Pricing: Pay per token

Google (Gemini Series)
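
For completeness, here is a sketch of a Gemini call using Google's @google/generative-ai SDK - model names and API details change frequently, so treat this as illustrative and check Google's docs:

// Google Gemini API example (illustrative sketch)
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });

const result = await model.generateContent('Hello!');
console.log(result.response.text());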

  • Models: Gemini Pro, Gemini Ultra
  • Strengths: Multimodal, Google integration
  • Pricing: Free tier available, then pay per token

Open Source (Meta, Mistral, etc.)

// Open source models can run locally or via providers
// Example using Ollama (local)
const response = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3',
    prompt: 'Hello!',
  }),
});
  • Models: Llama 3, Mistral, Mixtral
  • Strengths: Self-hosted, no API costs, full control
  • Trade-offs: Smaller than commercial models, need your own infrastructure

Comparison Table

Feature          GPT-4   Claude 3.5   Gemini Pro   Llama 3
Context Window   128K    200K         1M           8K-128K
Multimodal       Yes     Yes          Yes          Some versions
Self-hosting     No      No           No           Yes
API Required     Yes     Yes          Yes          Optional
Cost             $$$     $$           $$           Free*

*Infrastructure costs apply for self-hosting


Why "Large" Matters

You might wonder: why do bigger models perform better?

Emergent Capabilities

As models get larger, they develop capabilities that smaller models lack:

┌─────────────────────────────────────────────────────────┐
│              Capabilities by Model Size                  │
├─────────────────────────────────────────────────────────┤
│  Small (~1B)    │ Basic text completion                 │
│  Medium (~10B)  │ + Coherent paragraphs                 │
│  Large (~100B)  │ + Reasoning, following instructions   │
│  Very Large     │ + Complex reasoning, code generation, │
│  (>100B)        │   few-shot learning, multi-step tasks │
└─────────────────────────────────────────────────────────┘

Some capabilities only emerge at certain scales - they are not present in smaller models at all.

The Scaling Laws

Research has shown predictable relationships:

  • More data → better performance
  • More parameters → better performance
  • More compute → better performance

This is why AI companies keep building larger models.
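
One commonly cited shape for these relationships is a power law: the model's prediction error ("loss") falls off smoothly as you add parameters and training data. The sketch below is purely illustrative - the constants are invented, only the shape matters:

// Illustrative scaling-law shape: loss = E + A / N^alpha + B / D^beta
// where N is parameter count and D is training tokens.
// The constants are made up for this example, not fitted values.
function predictedLoss(params: number, tokens: number): number {
  const E = 1.7, A = 400, B = 1800, alpha = 0.34, beta = 0.28;
  return E + A / Math.pow(params, alpha) + B / Math.pow(tokens, beta);
}

console.log(predictedLoss(1e9, 2e10));    // smaller model, less data  -> higher loss
console.log(predictedLoss(7e10, 1.4e12)); // bigger model, more data   -> lower loss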

Trade-offs

Larger is not always better for your use case:

Consideration   Large Model     Small Model
Quality         Higher          Lower
Speed           Slower          Faster
Cost            Expensive       Cheap
Use Case        Complex tasks   Simple tasks

For classification or simple extraction, a small model might be perfect. For creative writing or complex reasoning, you need larger models.
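
In practice this often becomes a simple routing decision in your code. A sketch - the task categories and model names are just examples, not a recommendation:

// Pick a model based on how demanding the task is.
// Swap in whichever models you actually use.
type Task = 'classification' | 'extraction' | 'summarization' | 'complex-reasoning';

function chooseModel(task: Task): string {
  if (task === 'classification' || task === 'extraction') {
    return 'gpt-4o-mini'; // cheap and fast is usually enough
  }
  if (task === 'summarization') {
    return 'gpt-4o'; // middle ground: quality matters, volume is high
  }
  return 'claude-sonnet-4-20250514'; // spend more where quality matters most
}

console.log(chooseModel('classification')); // 'gpt-4o-mini'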


Practical Example: Comparing Models

Let us see how different models handle the same prompt:

import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

const openai = new OpenAI();
const anthropic = new Anthropic();

const prompt = 'Explain why the sky is blue in one sentence.';

// GPT-4o-mini (smaller, faster, cheaper)
const gptMiniResponse = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: prompt }],
});

// GPT-4o (larger, more capable)
const gpt4Response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: prompt }],
});

// Claude 3.5 Sonnet
const claudeResponse = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 100,
  messages: [{ role: 'user', content: prompt }],
});

console.log('GPT-4o-mini:', gptMiniResponse.choices[0].message.content);
console.log('GPT-4o:', gpt4Response.choices[0].message.content);
console.log('Claude:', claudeResponse.content[0].text);

For simple factual questions, you might not notice a difference. For complex reasoning or nuanced tasks, larger models shine.


Exercises

Exercise 1: Identify the Process

Match each description to its term: Training, Inference, Fine-tuning, or RLHF.

  1. A company shows the model millions of Reddit posts
  2. You ask Claude to summarize an article
  3. Humans rank which response is better
  4. A company teaches GPT to follow instructions

Solution

  1. Training (Pre-training): Learning from large datasets
  2. Inference: Using a trained model to generate output
  3. RLHF: Human feedback to improve responses
  4. Fine-tuning: Teaching specific behaviors

Exercise 2: Choose the Right Model

For each scenario, which model would you choose: GPT-4 (expensive, capable), GPT-4o-mini (cheap, fast), or self-hosted Llama (free, requires setup)?

  1. A startup with no budget classifying customer emails
  2. An enterprise building a legal document analyzer
  3. A developer building a quick chatbot prototype
  4. A company with strict data privacy requirements

Solution

  1. Self-hosted Llama: No API costs, email classification does not need the largest model
  2. GPT-4: Complex reasoning needed for legal documents, enterprise can afford it
  3. GPT-4o-mini: Fast and cheap for prototyping
  4. Self-hosted Llama: Data stays on your servers, no external API calls

Exercise 3: Predict the Output

Given the prompt "The quick brown fox jumps over the", what would you expect the model to predict next? Why?

Solution

The model would most likely predict "lazy" (followed by "dog"), completing the famous pangram "The quick brown fox jumps over the lazy dog."

Why? This sentence appears countless times in the training data (it is commonly used to demonstrate fonts and test keyboards). The model has learned this statistical pattern so strongly that it is almost certain to complete it this way.

This demonstrates that LLMs are pattern matchers - they predict what they have seen before.


Key Takeaways

  1. LLMs predict one token at a time based on all previous context
  2. "Large" means both training data and model parameters - more of both improves capability
  3. Transformers use attention to consider relationships across the entire input
  4. Training is expensive and done by AI companies; inference is what you do via APIs
  5. Different models have different strengths - choose based on your needs
  6. Emergent capabilities appear only in larger models

Resources

Resource                      Type            Description
OpenAI Models Documentation   Documentation   Detailed specs for GPT models
Anthropic Model Card          Documentation   Claude model specifications
The Illustrated Transformer   Article         Visual guide to Transformer architecture
Attention Is All You Need     Paper           Original Transformer paper (advanced)

Next Lesson

Now that you understand what LLMs are and how they generate text, let us explore how they actually process text - through tokenization.

Continue to Lesson 2.2: Tokens and Tokenization