
How LLMs Work: From Tokens to Tokens

šŸ“š Tutorial 7 šŸ”“ Advanced

The complete journey: how ChatGPT, Claude, and all modern LLMs actually work


The Big Picture: What Is An LLM Really?

An LLM (Large Language Model) is fundamentally a massive neural network that predicts the next token given a sequence of previous tokens. This deceptively simple objective - next-token prediction - is the foundation of ChatGPT, Claude, GPT-4, and every modern AI assistant.

šŸŽÆ The Core Function

Input:  "The capital of France is"
                                  ↓
              [LLM thinks for a moment...]
                                  ↓
Output: P("Paris" | context) = 0.87  ← Very confident!
        P("Lyon" | context) = 0.05
        P("Nice" | context) = 0.02
        P("London" | context) = 0.01  ← Wrong, low probability
        P(all other ~50,000 tokens) = 0.05
                                  ↓
        Sample from distribution → "Paris"

Mathematical Definition

import torch

# LLM is a function: Tokens → Probability Distribution
def llm(input_tokens):
    """
    Args:
        input_tokens: List of token IDs [142, 2368, 3332, ...]
    
    Returns:
        probabilities: [vocab_size] tensor summing to 1.0
                      P(next_token | input_tokens)
    """
    # Step 1: Convert tokens to embeddings
    embeddings = embed(input_tokens)  # [seq_len, d_model]
    
    # Step 2: Process through transformer layers
    hidden_states = transformer_layers(embeddings)  # [seq_len, d_model]
    
    # Step 3: Project last token to vocabulary
    logits = output_projection(hidden_states[-1])  # [vocab_size]
    
    # Step 4: Convert to probabilities
    probabilities = softmax(logits)  # [vocab_size], sums to 1.0
    
    return probabilities


# Example usage
print("=" * 70)
print("LLM as Probability Distribution")
print("=" * 70)

# Input: "What is machine learning?"
input_text = "What is machine learning?"
input_tokens = tokenize(input_text)  # [2061, 318, 4572, 4673, 30]
print(f"Input: {input_text}")
print(f"Tokens: {input_tokens}")

# LLM predicts next token
probs = llm(input_tokens)
print(f"\nProbability distribution over {len(probs):,} possible next tokens")

# Top 5 predictions
top_5_probs, top_5_tokens = torch.topk(probs, 5)
print("\nTop 5 predictions:")
for prob, token in zip(top_5_probs, top_5_tokens):
    word = detokenize(token)
    print(f"  {word:15s}: {prob:.4f} ({prob*100:.2f}%)")

# Output:
# Top 5 predictions:
#   Machine        : 0.2847 (28.47%)  ← Most likely
#   It             : 0.1823 (18.23%)
#   A              : 0.0956 (9.56%)
#   Learning       : 0.0734 (7.34%)
#   The            : 0.0521 (5.21%)

šŸ’” The Fundamental Insight

Everything emerges from next-token prediction:

  • Translation: Predict French tokens after English context
  • Reasoning: Predict logical next step in chain-of-thought
  • Coding: Predict syntactically correct next code token
  • Creativity: Predict plausible continuations of stories
  • Knowledge: Predict factually correct completions

No separate "reasoning module" or "knowledge base" - just predict next token really, really well.

Autoregressive Generation: The Loop

def generate_text(prompt, max_tokens=100):
    """
    Generate text autoregressively: one token at a time.
    """
    tokens = tokenize(prompt)
    generated_tokens = tokens.copy()
    
    print(f"Starting with: {detokenize(tokens)}")
    print("\nGenerating...")
    
    for step in range(max_tokens):
        # Step 1: Get probabilities for next token
        probs = llm(generated_tokens)
        
        # Step 2: Sample next token
        next_token = sample(probs, temperature=0.7, top_p=0.9)
        
        # Step 3: Append to sequence
        generated_tokens.append(next_token)
        
        # Step 4: Check for stop
        if next_token == END_TOKEN:
            break
        
        # Progress update
        if step % 10 == 0:
            current_text = detokenize(generated_tokens)
            print(f"Step {step}: {current_text}")
    
    return detokenize(generated_tokens)


# Example run
result = generate_text("The future of AI is")

# Output (example):
# Starting with: The future of AI is
# 
# Generating...
# Step 0: The future of AI is bright
# Step 10: The future of AI is bright and full of possibilities. As
# Step 20: The future of AI is bright and full of possibilities. As machine learning algorithms become
# Step 30: The future of AI is bright and full of possibilities. As machine learning algorithms become more sophisticated, we
# ...
#
# Each step: Run entire model, predict 1 token, append, repeat

āš ļø Key Limitation: Sequential Generation

Unlike training (where all tokens are processed in parallel), generation is sequential. To generate 100 tokens, you must run the model 100 times. This is why:

  • ChatGPT responses "stream" one token at a time
  • LLM inference is expensive (compute + memory)
  • Response time = num_tokens Ɨ time_per_token
  • Optimization focuses on making each forward pass faster
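
To make that response-time formula concrete, here is a quick back-of-envelope estimate. The 50 ms-per-token figure is an assumption for illustration, not a benchmark of any particular model:

# Back-of-envelope latency for sequential generation
# (per_token_ms is an illustrative assumption, not a measured number)
per_token_ms = 50        # one full forward pass per generated token
response_tokens = 500    # a long-ish answer

total_seconds = response_tokens * per_token_ms / 1000
print(f"{response_tokens} tokens x {per_token_ms} ms/token = {total_seconds:.0f} s")
# -> 500 tokens x 50 ms/token = 25 s, which is why responses stream token by token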

Step 1: Tokenization

Text is converted to tokens. Modern models use Byte-Pair Encoding (BPE) or similar subword tokenization:


# Example: GPT-3 tokenization
text = "Hello, world!"

# Tokenizer breaks into subwords
tokens = [
    "Hello",     # token_id: 15496
    ",",         # token_id: 11
    " world",    # token_id: 995
    "!"          # token_id: 0
]

token_ids = [15496, 11, 995, 0]
# Now we have integers that the model can process

# Key insight: tokens ≠ characters
# One token ā‰ˆ 4 characters for English
# So "Hello, world!" (13 chars) ā‰ˆ 4 tokens

Why not just characters?

  • Efficiency: Fewer tokens = faster processing
  • Semantics: Subwords capture meaning better than characters
  • Vocabulary Size: ~50k tokens instead of millions of rare words
Cost: Subword tokenization means the model "thinks" in chunks, not characters. This is why LLMs sometimes struggle with character-level tasks like counting letters or spelling words backwards.
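
You can verify these token IDs yourself with the GPT-2 tokenizer from the `transformers` library (the same BPE vocabulary GPT-3 uses, and the same tokenizer loaded in the end-to-end example later in this tutorial):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Hello, world!"
token_ids = tokenizer.encode(text)
print(token_ids)                                   # [15496, 11, 995, 0]
print([tokenizer.decode([t]) for t in token_ids])  # ['Hello', ',', ' world', '!']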

Step 2: Embedding Lookup

Convert token IDs to dense vectors:


# Token embeddings: learned during training
# Typical: 50k vocabulary Ɨ 768D embedding vectors
embedding_matrix = torch.randn(50000, 768)

token_ids = [15496, 11, 995, 0]  # "Hello, world!"

# Lookup embeddings
embeddings = embedding_matrix[token_ids]
# Shape: [4, 768]
# Each token is now a 768-dimensional vector

# Add positional information
positions = torch.arange(4)  # [0, 1, 2, 3]
pos_embeddings = pos_embedding_matrix[positions]
# Shape: [4, 768]

# Combine
x = embeddings + pos_embeddings
# Now each token knows: "what are you?" + "where in sequence?"

The embedding space is learned. Similar tokens (semantically or syntactically) have similar embeddings.
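
One quick way to see this is to pull GPT-2's learned embedding matrix out of the Hugging Face model (the same checkpoint used later in this tutorial) and compare token vectors with cosine similarity. The exact values depend on the checkpoint, so treat this as an exploratory sketch rather than a guaranteed result:

import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

wte = model.transformer.wte.weight   # token embedding matrix: [50257, 768]

def embedding(word):
    # Use the first BPE token of the word (leading space = "word after a space")
    token_id = tokenizer.encode(word)[0]
    return wte[token_id]

for a, b in [(" cat", " dog"), (" cat", " the")]:
    sim = F.cosine_similarity(embedding(a), embedding(b), dim=0)
    print(f"cosine({a!r}, {b!r}) = {sim.item():.3f}")
# Related tokens (" cat", " dog") typically score higher than unrelated pairs.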

Step 3: Forward Through Transformer Layers

The magic happens here. Input embeddings flow through a deep stack of transformer layers - 12 in GPT-2 small, 96 in GPT-3; this walkthrough assumes 40:


# Initial embeddings: [batch_size, seq_len, d_model]
# Example: [1, 4, 768]
x = embeddings + pos_embeddings

# Layer 1
ā”œā”€ Multi-head attention (8 heads)
│  └─ Each token attends to all previous tokens
│  └─ Learn which tokens are relevant
ā”œā”€ Feed-forward network
│  └─ Position-wise transformations
└─ Residual + Layer Norm
Result: [1, 4, 768] (same shape)

# Layer 2
ā”œā”€ Multi-head attention
ā”œā”€ Feed-forward
└─ Residual + Layer Norm
Result: [1, 4, 768]

# Layers 3-40: Same structure, different learned parameters

# Final layer output
x = [1, 4, 768]

Each layer refines representations. Early layers learn syntactic patterns, later layers learn semantic meanings.
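
Here is a minimal, runnable sketch of one such decoder layer in PyTorch, using the built-in `nn.MultiheadAttention` with a causal mask. Real LLMs use heavily optimized fused kernels, but the structure is the same:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One transformer layer: causal self-attention + feed-forward, with residuals."""
    def __init__(self, d_model=768, n_heads=8, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True = "not allowed to attend" (no peeking at future tokens)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)     # residual + layer norm
        x = self.norm2(x + self.ff(x))   # residual + layer norm
        return x                         # same shape as the input

x = torch.randn(1, 4, 768)               # [batch, seq_len, d_model]
print(DecoderBlock()(x).shape)            # torch.Size([1, 4, 768])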

What's Happening in Attention?

In each attention head, tokens "talk" to each other:

Sentence: "The cat sat on the mat"

Token "sat":
  "The" (1.2%)  ← Low attention (not very relevant)
  "cat" (45%)   ← High attention (who's doing the action?)
  "sat" (22%)   ← Some attention to self
  "on" (8%)     ← Low attention
  "the" (3%)    ← Low attention
  "mat" (20.8%) ← Moderate attention (where?)

Output for "sat" = 0.012*"The" + 0.45*"cat" + 0.22*"sat" + ...
                 = context-aware representation of "sat"
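
In code, that context-aware output is nothing more than a weighted sum of value vectors, using the attention weights from the example above (the value vectors here are random, purely for illustration):

import torch

# Attention weights for the query token "sat" (from the example above), sum to 1.0
weights = torch.tensor([0.012, 0.45, 0.22, 0.08, 0.03, 0.208])

# One value vector per token of "The cat sat on the mat" (random, illustrative)
values = torch.randn(6, 768)

# Context-aware representation of "sat" = attention-weighted sum of value vectors
sat_output = weights @ values
print(sat_output.shape)   # torch.Size([768])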

This happens in parallel across 8+ heads, each learning different relationships. Then they're combined:


# After multi-head attention: concatenate the outputs of all 8 heads
output = torch.cat([head_1, head_2, head_3, head_4,
                    head_5, head_6, head_7, head_8], dim=-1)
# [batch, seq_len, 768]  (8 heads x 96 dims each)

# Project back through output linear layer
output = output_projection(output)
# [batch, seq_len, 768]

# Add residual connection
x = x + output
# Original signal preserved + new attention information

# Normalize
x = layer_norm(x)

Step 4: Output Layer & Probability Distribution

After all layers, convert to vocabulary probabilities:


# After 40 layers of processing
x = [1, 4, 768]  # 4 tokens, each refined and contextualized

# Take last token's representation
last_token_repr = x[:, -1, :]  # [1, 768]

# Project to vocabulary
logits = output_projection(last_token_repr)
# [1, 50000] - score for each possible next token

# Convert to probabilities
probabilities = F.softmax(logits, dim=-1)
# [1, 50000] - sums to 1.0

# Top-5 most likely next tokens
top_5_probs, top_5_ids = torch.topk(probabilities, 5)

# Example:
# Token 12: "I" (prob: 0.25)
# Token 451: "it" (prob: 0.18)
# Token 89: "there" (prob: 0.15)
# Token 340: "the" (prob: 0.12)
# Token 756: "we" (prob: 0.08)

The model "predicts" the next token by assigning probabilities. Higher probability = more confident.

Step 5: Sampling & Repetition

Sample from the probability distribution to get next token:


# Sample with top-p filtering
next_token_id = sample_top_p(probabilities, p=0.9)
# Example: returns 12 (token "I")

# Append to sequence
sequence = torch.cat([sequence, next_token_id.unsqueeze(0)])
# "Hello, world! I"

# REPEAT from Step 2 with new sequence
new_embeddings = embedding_matrix[sequence]  # Now 5 tokens
new_x = transformer_layers(new_embeddings)
new_logits = output_projection(new_x[:, -1, :])  # Only last token
new_probs = torch.softmax(new_logits, dim=-1)
next_token_id = sample_top_p(new_probs, p=0.9)
# Now gets the 2nd generated token

# Continue until:
# 1. Max length reached
# 2. End-of-sequence token sampled
# 3. Stop signal received
Autoregressive = Sequential: Must generate one token at a time, feeding each back into the model. This is why LLM inference is "slow" (milliseconds per token).
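
The `sample_top_p` helper used above is not a library function; here is one plausible implementation of nucleus (top-p) sampling over a probability vector, written as a sketch to show the idea:

import torch

def sample_top_p(probs, p=0.9):
    """Nucleus (top-p) sampling: sample only from the smallest set of tokens
    whose combined probability reaches p."""
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    keep = cumulative - sorted_probs < p   # tokens needed to cover probability mass p
    keep[0] = True                         # always keep the single most likely token

    filtered = sorted_probs * keep         # zero out the long tail
    filtered = filtered / filtered.sum()   # renormalize into a valid distribution

    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_ids[choice].item()

# Usage with a fake distribution over a 50k-token vocabulary:
probs = torch.softmax(torch.randn(50000), dim=-1)
print(sample_top_p(probs, p=0.9))          # prints one sampled token id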

Complete End-to-End Example


import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Input
prompt = "The future of AI is"

# Step 1: Tokenize
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# tensor of shape [1, 5]: one token ID per word in this prompt

# Step 2-4: Forward through model
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits  # [1, 5, 50257]

# Step 5: Sample next token
next_token_logits = logits[0, -1, :] / 0.7  # Apply temperature
next_token_probs = F.softmax(next_token_logits, dim=-1)

# Top-p sampling
sorted_probs, sorted_indices = torch.sort(next_token_probs, descending=True)
cum_probs = torch.cumsum(sorted_probs, dim=-1)
sorted_indices_to_keep = cum_probs <= 0.9
sorted_indices_to_keep[0] = True  # Always keep best
filtered_indices = sorted_indices[sorted_indices_to_keep]

# Sample
next_token_id = filtered_indices[torch.multinomial(
    next_token_probs[filtered_indices], 1
)]

# Step 6: Repeat!
# Append and continue...

# For full generation:
@torch.no_grad()
def generate(model, tokenizer, prompt, max_length=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    
    for _ in range(max_length):
        outputs = model(input_ids)
        next_token_logits = outputs.logits[0, -1, :] / 0.7
        
        next_token_probs = F.softmax(next_token_logits, dim=-1)
        next_token_id = torch.multinomial(next_token_probs, 1)
        
        input_ids = torch.cat([input_ids, next_token_id.unsqueeze(0)], dim=1)
        
        if next_token_id == tokenizer.eos_token_id:
            break
    
    return tokenizer.decode(input_ids[0])

# Generate!
result = generate(model, tokenizer, "The future of AI is")
print(result)
# Output: "The future of AI is not a matter of if, but when. AI..."

Scaling Laws: Why Bigger = Better

A remarkable discovery: model capabilities don't just improve linearly with scale - they follow power laws:

Loss ā‰ˆ a * N^(-α) + b

where N = number of parameters

Examples (from the OpenAI and DeepMind scaling-law papers):
GPT-2: 1.5B parameters
GPT-3: 175B parameters (~117x more)

Loss doesn't improve by some fixed percentage - it slides down a smooth,
predictable power-law curve, and downstream capabilities improve far more
than the raw parameter ratio alone would suggest.
All from just predicting next tokens

This is why companies are building 1T, 10T parameter models. Scaling seems to be the path to intelligence.
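
To get a feel for what a power law implies, you can plug numbers into the formula above. The constants a, α, and b below are made up purely for illustration; real values come from fitting curves to training runs (e.g. Kaplan et al., 2020):

# Loss ā‰ˆ a * N^(-alpha) + b, with purely illustrative constants
a, alpha, b = 10.0, 0.076, 1.7    # hypothetical fit, not published values

for n_params in [1.5e9, 175e9, 1e12]:
    loss = a * n_params ** (-alpha) + b
    print(f"{n_params:10.1e} params -> predicted loss {loss:.2f}")

# Every ~10x increase in parameters shaves off another predictable slice of loss -
# steady gains, but with diminishing absolute returns.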

Emergent Abilities: New capabilities appear at scale:
  • 1B params: Good at copying
  • 10B params: Basic reasoning
  • 100B+ params: Complex reasoning, creative writing
  • Trillion+ params: Potential AGI properties?

Training: The Secret Sauce

What makes an LLM powerful? Training on massive amounts of text:

GPT-3 Training:
- 300 billion tokens of text
- From: Internet, books, papers, etc.
- Cost: ~$10 million
- Time: weeks of training on a cluster of thousands of GPUs
- Objective: predict the next token at every position in the text

This is it. Just next-token prediction.
No labels. No supervised examples.
Raw unsupervised learning at scale.

Why does this work? Predicting text requires:

  • Language Understanding: Parse grammar, syntax
  • World Knowledge: Know facts to predict sensible continuations
  • Reasoning: Multi-step thinking for complex texts
  • Common Sense: Understand human values and expectations

All of this emerges naturally from the training objective.
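
In code, that training objective is just cross-entropy between the model's prediction at each position and the token that actually comes next. A minimal sketch, using a random tensor as a stand-in for the transformer's output:

import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 50000, 128, 2

# Stand-in for a transformer's output: one logit vector per position
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # training text as token IDs

# Next-token prediction: position t must predict the token at position t+1
pred = logits[:, :-1, :]     # predictions at positions 0 .. T-2
target = tokens[:, 1:]       # the "next tokens" at positions 1 .. T-1

loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()              # gradients flow back; repeat over hundreds of billions of tokens
print(loss.item())           # ā‰ˆ ln(50000) ā‰ˆ 10.8 for an untrained (random) model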

Inference: The Challenge

Training is expensive (one-time). Inference is the ongoing cost:


# GPU memory needed (weights alone):
GPT-3 (175B params): ~700GB in float32, ~350GB in float16
Claude-class models: parameter counts aren't public, but likely hundreds of GB too

# Forward pass time per token:
- Single GPU (H100): ~50ms per token
- Batch of 64 inputs: ~100ms total
- Throughput: ~640 tokens/sec

# This is why:
- Models are quantized (8-bit, 4-bit)
- Batching is used
- KV-cache is critical for speed
āš ļø Key Bottleneck: LLMs are memory-bound, not compute-bound. Moving weights around is more expensive than computations. This is why inference optimization is an active research area.

KV-Cache: Making Inference Fast

Without a KV-cache, each generation step re-processes the entire growing sequence, so producing N tokens costs O(N²) token computations in total. With a KV-cache, each step only processes the one new token, bringing the total down to O(N):


# WITHOUT KV-cache (naive):
input_ids = [token_1]
output_1 = model(input_ids)  # Compute attention from scratch
next_token = sample(output_1)

input_ids = [token_1, token_2]
output_2 = model(input_ids)  # RE-compute attention for token_1!
next_token = sample(output_2)

# This is wasteful! We already computed attention for token_1

# WITH KV-cache (smart):
input_ids = [token_1]
cache = {}
output_1 = model(input_ids, cache=cache)
# cache now stores: K and V for token_1

next_token = sample(output_1)

input_ids = [token_2]  # Only new token!
output_2 = model(input_ids, cache=cache)
# Use cached K,V for token_1, only compute for token_2!

# Same result, massive speedup

Difference: Without the cache, generating 100 tokens means processing 100+99+98+...+1 = 5050 token positions. With the cache: just 100. Roughly a 50x reduction in redundant attention compute!
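
With Hugging Face models this cache is exposed as `past_key_values`. A sketch of using it by hand (in practice, `model.generate()` manages the cache for you):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of AI is", return_tensors="pt")

with torch.no_grad():
    # First pass: process the whole prompt and ask for the K/V cache back
    out = model(input_ids, use_cache=True)
    past = out.past_key_values

    for _ in range(20):
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy, for simplicity
        input_ids = torch.cat([input_ids, next_id], dim=1)
        # Subsequent passes: feed ONLY the new token plus the cached K/V
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values

print(tokenizer.decode(input_ids[0]))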

Why Do LLMs Sometimes Fail?

Understanding limitations:

āŒ Hallucination

Confident wrong answers. No grounding to reality, just predicting plausible-seeming text.

āŒ Token Limit

Can't process arbitrarily long sequences. Memory-bound by sequence length.

āŒ Training Data Cutoff

Knowledge frozen at training time. Can't know current events.

āŒ Reasoning Limits

Next-token prediction can't do multi-step reasoning reliably.

āŒ Tokenization Artifacts

Struggles with tasks needing character-level awareness (counting, spelling).

āŒ Training Bias

Inherits biases from training data (gender, nationality, etc.).

The Future: What's Next?

Current directions in LLM research:

  • Longer Context: From 2K tokens → 100K → 1M+ tokens (better memory models)
  • Faster Inference: Distillation, quantization, speculative decoding
  • Better Reasoning: Chain-of-thought, retrieval-augmented generation (RAG)
  • Multimodal: Vision + text + audio in single model
  • Efficient Training: Cut the GPU-hours and cost needed to train frontier models
  • Specialization: Task-specific fine-tuning instead of one-model-for-all

The Big Takeaway

LLMs are Not Magic

They're neural networks that predict the next token. But at sufficient scale, with sufficient data, this simple objective creates:

  • Language understanding
  • Reasoning ability
  • Knowledge of the world
  • Creative generation
  • Problem-solving skills

This is emergence: complex behavior from simple components.

Key Takeaways

  • Tokenization: Text → Token IDs (subword BPE)
  • Embedding: Token IDs → Dense vectors
  • Positional Info: Add position embeddings
  • Transformer Layers: Multi-head attention + Feed-forward (12-40 layers)
  • Output Projection: Last token → Vocabulary probabilities
  • Sampling: Select next token from distribution
  • Autoregressive: Repeat for each token in output
  • Training: Predict next tokens on massive text corpus
  • Scaling: Bigger models learn better (power law)
  • Emergence: Simple objective creates complex capabilities

You've Completed the Course! šŸŽ‰

Congratulations!

You've learned:

  • āœ… Why RNNs failed (vanishing gradients, sequential bottleneck)
  • āœ… Attention mechanism & self-attention mathematics
  • āœ… Multi-head attention & positional encoding
  • āœ… Complete transformer architecture (encoder-decoder)
  • āœ… Decoder-only models & generation techniques
  • āœ… Complete LLM pipeline & how modern AI works

You now understand the foundations of every modern AI system.

Test Your Knowledge

Q1: What is the first step in LLM processing?

Attention computation
Output generation
Tokenization - converting text into token IDs
Training the model

Q2: What do token embeddings convert tokens into?

Images
Dense vector representations
Audio signals
Binary code

Q3: How do LLMs generate the next token?

Random selection
Copying from input
Looking up in a database
Computing probability distribution over vocabulary and sampling

Q4: What is the final layer of an LLM that produces token probabilities?

Language modeling head (linear layer + softmax)
Attention layer
Embedding layer
Normalization layer

Q5: What happens after an LLM generates a token?

The model resets
Generation stops immediately
The token is appended to the sequence and the process repeats autoregressively
All tokens are regenerated