The complete journey: how ChatGPT, Claude, and all modern LLMs actually work
An LLM (Large Language Model) is fundamentally a massive neural network that predicts the next token given a sequence of previous tokens. This deceptively simple objective - next-token prediction - is the foundation of ChatGPT, Claude, GPT-4, and every modern AI assistant.
Input: "The capital of France is"
    ↓
[LLM thinks for a moment...]
    ↓
Output: P("Paris" | context)   = 0.87  ← Very confident!
        P("Lyon" | context)    = 0.05
        P("Nice" | context)    = 0.02
        P("London" | context)  = 0.01  ← Wrong, low probability
        P(other 50,000 words)  = 0.05
    ↓
Sample from distribution → "Paris"
import torch

# LLM is a function: Tokens → Probability Distribution
def llm(input_tokens):
    """
    Args:
        input_tokens: List of token IDs [142, 2368, 3332, ...]
    Returns:
        probabilities: [vocab_size] tensor summing to 1.0
                       P(next_token | input_tokens)
    """
    # Step 1: Convert tokens to embeddings
    embeddings = embed(input_tokens)                # [seq_len, d_model]

    # Step 2: Process through transformer layers
    hidden_states = transformer_layers(embeddings)  # [seq_len, d_model]

    # Step 3: Project last token to vocabulary
    logits = output_projection(hidden_states[-1])   # [vocab_size]

    # Step 4: Convert to probabilities
    probabilities = softmax(logits)                 # [vocab_size], sums to 1.0
    return probabilities
# Example usage
print("=" * 70)
print("LLM as Probability Distribution")
print("=" * 70)
# Input: "What is machine learning?"
input_text = "What is machine learning?"
input_tokens = tokenize(input_text) # [2061, 318, 4572, 4673, 30]
print(f"Input: {input_text}")
print(f"Tokens: {input_tokens}")
# LLM predicts next token
probs = llm(input_tokens)
print(f"\nProbability distribution over {len(probs):,} possible next tokens")
# Top 5 predictions
top_5_probs, top_5_tokens = torch.topk(probs, 5)
print("\nTop 5 predictions:")
for prob, token in zip(top_5_probs, top_5_tokens):
    word = detokenize(token)
    print(f"  {word:15s}: {prob:.4f} ({prob*100:.2f}%)")
# Output:
# Top 5 predictions:
# Machine : 0.2847 (28.47%) ← Most likely
# It : 0.1823 (18.23%)
# A : 0.0956 (9.56%)
# Learning : 0.0734 (7.34%)
# The : 0.0521 (5.21%)
Everything emerges from next-token prediction. There is no separate "reasoning module" or "knowledge base" inside the model; it just learns to predict the next token really, really well.
def generate_text(prompt, max_tokens=100):
    """
    Generate text autoregressively: one token at a time.
    """
    tokens = tokenize(prompt)
    generated_tokens = tokens.copy()
    print(f"Starting with: {detokenize(tokens)}")
    print("\nGenerating...")

    for step in range(max_tokens):
        # Step 1: Get probabilities for next token
        probs = llm(generated_tokens)

        # Step 2: Sample next token
        next_token = sample(probs, temperature=0.7, top_p=0.9)

        # Step 3: Append to sequence
        generated_tokens.append(next_token)

        # Step 4: Check for stop
        if next_token == END_TOKEN:
            break

        # Progress update
        if step % 10 == 0:
            current_text = detokenize(generated_tokens)
            print(f"Step {step}: {current_text}")

    return detokenize(generated_tokens)
# Example run
result = generate_text("The future of AI is")
# Output (example):
# Starting with: The future of AI is
#
# Generating...
# Step 0: The future of AI is bright
# Step 10: The future of AI is bright and full of possibilities. As
# Step 20: The future of AI is bright and full of possibilities. As machine learning algorithms become
# Step 30: The future of AI is bright and full of possibilities. As machine learning algorithms become more sophisticated, we
# ...
#
# Each step: Run entire model, predict 1 token, append, repeat
Unlike training (where all tokens are processed in parallel), generation is sequential. To generate 100 tokens, you must run the model 100 times. This is why generation latency and cost grow with output length, and why optimizations like the KV-cache (covered below) matter so much.
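To make that contrast concrete, here is a minimal sketch in which random logits stand in for a real model's output (the shapes and the one-position shift are the point, not the numbers): during training, a single forward pass scores every position against the token that actually follows it, whereas generation needs a fresh pass per new token.

import torch
import torch.nn.functional as F

# Toy shapes for illustration only
vocab_size, seq_len = 100, 8
logits = torch.randn(1, seq_len, vocab_size)         # one forward pass over the whole sequence
tokens = torch.randint(0, vocab_size, (1, seq_len))  # the training text itself

# Training: every position predicts the NEXT token, all in parallel.
# Shift by one so position t is scored against token t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..T-2
    tokens[:, 1:].reshape(-1),               # targets are tokens 1..T-1
)
print(f"One pass scores {seq_len - 1} positions at once, loss = {loss.item():.3f}")

# Generation: token t+1 cannot be sampled before token t exists,
# so each of the 100 new tokens needs its own forward pass.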
Text is converted to tokens. Modern models use Byte-Pair Encoding (BPE) or similar subword tokenization:
# Example: GPT-3 tokenization
text = "Hello, world!"
# Tokenizer breaks into subwords
tokens = [
    "Hello",   # token_id: 15496
    ",",       # token_id: 11
    " world",  # token_id: 995
    "!",       # token_id: 0
]
token_ids = [15496, 11, 995, 0]
# Now we have integers that the model can process
# Key insight: tokens ≠ characters
# One token ≈ 4 characters for English
# So "Hello, world!" (13 chars) → 4 tokens
Why not just characters? Character-level sequences would be several times longer, making attention much more expensive, while whole-word vocabularies cannot handle rare or misspelled words. Subword tokens are the compromise: a fixed-size vocabulary that can still represent any string.
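To see real token IDs instead of the illustrative ones above, you can run GPT-2's actual tokenizer (the same transformers tokenizer loaded later in this tutorial); a quick sketch:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "Hello, world!"
token_ids = tokenizer.encode(text)
pieces = tokenizer.convert_ids_to_tokens(token_ids)

print(token_ids)  # [15496, 11, 995, 0] for GPT-2's vocabulary (matches the IDs above)
print(pieces)     # subword pieces; GPT-2 marks a leading space with 'Ġ'
print(f"{len(text)} characters -> {len(token_ids)} tokens")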
Convert token IDs to dense vectors:
# Token embeddings: learned during training
# Typical: 50k vocabulary × 768D embedding vectors
embedding_matrix = torch.randn(50000, 768)
token_ids = [15496, 11, 995, 0] # "Hello, world!"
# Lookup embeddings
embeddings = embedding_matrix[token_ids]
# Shape: [4, 768]
# Each token is now a 768-dimensional vector
# Add positional information
positions = torch.arange(4) # [0, 1, 2, 3]
pos_embeddings = pos_embedding_matrix[positions]
# Shape: [4, 768]
# Combine
x = embeddings + pos_embeddings
# Now each token knows: "what are you?" + "where in sequence?"
The embedding space is learned. Similar tokens (semantically or syntactically) have similar embeddings.
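One way to check this empirically is to pull GPT-2's learned embedding matrix (via the same transformers model used later in this tutorial) and compare cosine similarities; a small sketch, with the caveat that exact values vary:

import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2's learned token embedding matrix: [50257, 768]
emb = model.transformer.wte.weight.detach()

def embedding_similarity(word_a, word_b):
    """Cosine similarity between the embeddings of two single-token words."""
    id_a = tokenizer.encode(word_a)[0]
    id_b = tokenizer.encode(word_b)[0]
    return F.cosine_similarity(emb[id_a], emb[id_b], dim=0).item()

# Related words tend to score noticeably higher than unrelated ones.
print(embedding_similarity(" cat", " dog"))
print(embedding_similarity(" cat", " France"))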
The magic happens here. Input embeddings flow through 12-40 transformer layers:
# Initial embeddings: [batch_size, seq_len, d_model]
# Example: [1, 4, 768]
x = embeddings + pos_embeddings

# Layer 1
├── Multi-head attention (8 heads)
│   ├── Each token attends to all previous tokens
│   └── Learn which tokens are relevant
├── Feed-forward network
│   └── Position-wise transformations
└── Residual + Layer Norm
Result: [1, 4, 768] (same shape)

# Layer 2
├── Multi-head attention
├── Feed-forward
└── Residual + Layer Norm
Result: [1, 4, 768]

# Layers 3-40: Same structure, different learned parameters

# Final layer output
x = [1, 4, 768]
Each layer refines representations. Early layers learn syntactic patterns, later layers learn semantic meanings.
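You can watch this shape invariance directly with GPT-2's 12 layers by asking the transformers model (used later in this tutorial) for all hidden states; a short sketch:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The cat sat on the mat", return_tensors="pt")

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[1..12] are the
# outputs of GPT-2's 12 transformer layers. The shape never changes.
for i, h in enumerate(out.hidden_states):
    print(f"layer {i:2d}: {tuple(h.shape)}")  # (1, 6, 768) every time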
In each attention head, tokens "talk" to each other:
Sentence: "The cat sat on the mat"
Token "sat":
  "The"  (1.2%)   ← Low attention (not very relevant)
  "cat"  (45%)    ← High attention (who's doing the action?)
  "sat"  (22%)    ← Some attention to self
  "on"   (8%)     ← Low attention
  "the"  (3%)     ← Low attention
  "mat"  (20.8%)  ← Moderate attention (where?)
Output for "sat" = 0.012*"The" + 0.45*"cat" + 0.22*"sat" + ...
= context-aware representation of "sat"
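Where do those percentages come from? Each head computes them as a softmax over query-key dot products. A minimal single-head sketch with random weights (toy dimensions, not a trained model):

import torch
import torch.nn.functional as F

seq_len, d_model, d_head = 6, 768, 64  # 6 tokens: "The cat sat on the mat"

x = torch.randn(seq_len, d_model)      # token representations entering the layer
W_q = torch.randn(d_model, d_head)     # learned projections in a real model
W_k = torch.randn(d_model, d_head)
W_v = torch.randn(d_model, d_head)

Q, K, V = x @ W_q, x @ W_k, x @ W_v    # [seq_len, d_head] each
scores = Q @ K.T / d_head ** 0.5       # [seq_len, seq_len] relevance scores

# Causal mask: each token may only attend to itself and earlier tokens.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)    # each row sums to 1 (the percentages above)
output = weights @ V                   # context-aware representation per token

print(weights[2])                      # how the 3rd token ("sat") spreads its attention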
This happens in parallel across 8+ heads, each learning different relationships. Then they're combined:
# After multi-head attention
output = torch.cat([head_1, head_2, ..., head_8], dim=-1)
# [batch, seq_len, 768]
# Project back through output linear layer
output = output_projection(output)
# [batch, seq_len, 768]
# Add residual connection
x = x + output
# Original signal preserved + new attention information
# Normalize
x = layer_norm(x)
After all layers, convert to vocabulary probabilities:
# After 40 layers of processing
x = [1, 4, 768] # 4 tokens, each refined and contextualized
# Take last token's representation
last_token_repr = x[:, -1, :] # [1, 768]
# Project to vocabulary
logits = output_projection(last_token_repr)
# [1, 50000] - score for each possible next token
# Convert to probabilities
probabilities = F.softmax(logits, dim=-1)
# [1, 50000] - sums to 1.0
# Top-5 most likely next tokens
top_5_probs, top_5_ids = torch.topk(probabilities, 5)
# Example:
# Token 12: "I" (prob: 0.25)
# Token 451: "it" (prob: 0.18)
# Token 89: "there" (prob: 0.15)
# Token 340: "the" (prob: 0.12)
# Token 756: "we" (prob: 0.08)
The model "predicts" the next token by assigning probabilities. Higher probability = more confident.
Sample from the probability distribution to get next token:
# Sample with top-p filtering
next_token_id = sample_top_p(probabilities, p=0.9)
# Example: returns 12 (token "I")
# Append to sequence
sequence = torch.cat([sequence, next_token_id.unsqueeze(0)])
# "Hello, world! I"
# REPEAT from Step 2 with new sequence
new_embeddings = embedding_matrix[sequence]      # Now 5 tokens
new_x = transformer_layers(new_embeddings)
new_logits = output_projection(new_x[:, -1, :])  # Only last token
new_probs = F.softmax(new_logits, dim=-1)
next_token_id = sample_top_p(new_probs, p=0.9)
# Now gets 2nd generated token
# Continue until:
# 1. Max length reached
# 2. End-of-sequence token sampled
# 3. Stop signal received
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pre-trained model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
# Input
prompt = "The future of AI is"
# Step 1: Tokenize
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# [[464, 2003, 286, 7590, 318]] (5 tokens)
# Step 2-4: Forward through model
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits  # [1, 5, 50257]
# Step 5: Sample next token
next_token_logits = logits[0, -1, :] / 0.7 # Apply temperature
next_token_probs = F.softmax(next_token_logits, dim=-1)
# Top-p sampling
sorted_probs, sorted_indices = torch.sort(next_token_probs, descending=True)
cum_probs = torch.cumsum(sorted_probs, dim=-1)
sorted_indices_to_keep = cum_probs <= 0.9
sorted_indices_to_keep[0] = True # Always keep best
filtered_indices = sorted_indices[sorted_indices_to_keep]
# Sample
next_token_id = filtered_indices[torch.multinomial(
    next_token_probs[filtered_indices], 1
)]
# Step 6: Repeat!
# Append and continue...
# For full generation:
@torch.no_grad()
def generate(model, tokenizer, prompt, max_length=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_length):
        outputs = model(input_ids)
        next_token_logits = outputs.logits[0, -1, :] / 0.7
        next_token_probs = F.softmax(next_token_logits, dim=-1)
        next_token_id = torch.multinomial(next_token_probs, 1)
        input_ids = torch.cat([input_ids, next_token_id.unsqueeze(0)], dim=1)
        if next_token_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0])
# Generate!
result = generate(model, tokenizer, "The future of AI is")
print(result)
# Output: "The future of AI is not a matter of if, but when. AI..."
A remarkable discovery: model capabilities don't just improve linearly with scale - they follow power laws:
Loss ≈ a * N^(-α) + b
where N = number of parameters

Illustrative examples (in the spirit of the OpenAI and DeepMind scaling papers):
GPT-2: 1.5B params  → ~2% error rate
GPT-3: 175B params  → ~0.1% error rate
Not marginally better - roughly 20x fewer errors
All from just predicting next tokens
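As a toy illustration of the power-law form above (the constants a, α, and b below are made up for demonstration, not fitted values from any paper):

def scaling_loss(n_params, a=10.0, alpha=0.076, b=1.7):
    """Loss ≈ a * N^(-alpha) + b, with illustrative (not fitted) constants."""
    return a * n_params ** (-alpha) + b

for n in [1.5e9, 13e9, 175e9, 1e12]:
    print(f"{n / 1e9:>7.1f}B params -> predicted loss {scaling_loss(n):.3f}")

# Loss keeps falling smoothly as N grows: every 10x in parameters buys a
# roughly constant multiplicative improvement in the reducible (loss - b) term.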
This is why companies are building 1T, 10T parameter models. Scaling seems to be the path to intelligence.
What makes an LLM powerful? Training on massive amounts of text:
GPT-3 Training:
- 300 billion tokens of text
- From: Internet, books, papers, etc.
- Cost: ~$10 million
- Time: ~2 months on thousands of GPUs
- Objective: Predict next token on random sequence positions
This is it. Just next-token prediction.
No labels. No supervised examples.
Raw unsupervised learning at scale.
Why does this work? Predicting text well forces the model to pick up grammar, facts about the world, and enough reasoning to continue an argument coherently. All of this emerges naturally from the training objective.
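With the Hugging Face GPT-2 model used later in this tutorial, one training-style step looks roughly like this: passing labels=input_ids tells the model to score every position against the token that actually comes next, which is the entire objective.

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Self-supervised: the text is its own label.
text = "The cat sat on the mat."
input_ids = tokenizer.encode(text, return_tensors="pt")

outputs = model(input_ids, labels=input_ids)  # internally shifts labels by one position
print(f"Next-token prediction loss: {outputs.loss.item():.3f}")

# A real pre-training run just repeats this (plus an optimizer step)
# over hundreds of billions of tokens.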
Training is expensive (one-time). Inference is the ongoing cost:
# GPU memory needed:
GPT-3 (175B params): ~700GB in float32 (~350GB in float16)
Claude (100B+ params): ~200GB
# Forward pass time per token:
- Single GPU (H100): ~50ms per token
- Batch of 64 inputs: ~100ms total
- Throughput: ~640 tokens/sec
# This is why:
- Models are quantized (8-bit, 4-bit)
- Batching is used
- KV-cache is critical for speed
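The memory figures above are just parameter count times bytes per parameter, covering weights only (activations, KV-cache, and optimizer state come on top). A quick back-of-the-envelope helper:

def weight_memory_gb(n_params, bytes_per_param):
    """Memory for storing the weights alone, in GB."""
    return n_params * bytes_per_param / 1e9

for name, n in [("GPT-2 (1.5B)", 1.5e9), ("GPT-3 (175B)", 175e9)]:
    print(f"{name}: "
          f"fp32 {weight_memory_gb(n, 4):.0f} GB, "
          f"fp16 {weight_memory_gb(n, 2):.0f} GB, "
          f"int8 {weight_memory_gb(n, 1):.0f} GB, "
          f"4-bit {weight_memory_gb(n, 0.5):.0f} GB")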
Without a KV-cache, generating N tokens means re-processing the entire growing prefix at every step, roughly N²/2 token positions in total. With a KV-cache, each step processes only the one new token, so the total is O(N):
# WITHOUT KV-cache (naive):
input_ids = [token_1]
output_1 = model(input_ids) # Compute attention from scratch
next_token = sample(output_1)
input_ids = [token_1, token_2]
output_2 = model(input_ids) # RE-compute attention for token_1!
next_token = sample(output_2)
# This is wasteful! We already computed attention for token_1
# WITH KV-cache (smart):
input_ids = [token_1]
cache = {}
output_1 = model(input_ids, cache=cache)
# cache now stores: K and V for token_1
next_token = sample(output_1)
input_ids = [token_2] # Only new token!
output_2 = model(input_ids, cache=cache)
# Use cached K,V for token_1, only compute for token_2!
# Same result, massive speedup
Difference: Without cache, generating 100 tokens requires computing 100+99+98+...+1 = 5050 token forward passes. With cache: 100 passes. 50x speedup!
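With the Hugging Face GPT-2 model from the earlier example, the same idea looks roughly like this (greedy decoding for brevity; past_key_values holds the cached K and V tensors):

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The future of AI is", return_tensors="pt")

with torch.no_grad():
    # First pass: process the whole prompt and keep the cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values                   # cached K and V for every layer

    for _ in range(20):
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_id], dim=1)
        # Later passes: feed ONLY the new token plus the cache.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values

print(tokenizer.decode(input_ids[0]))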
Understanding limitations:
Hallucination: confident wrong answers. No grounding in reality, just predicting plausible-seeming text.
Limited context window: can't process arbitrarily long sequences; memory is bound by sequence length.
Knowledge cutoff: knowledge is frozen at training time; the model can't know current events.
Shallow reasoning: next-token prediction can't do multi-step reasoning reliably.
Tokenization artifacts: struggles with tasks needing character-level awareness (counting, spelling).
Bias: inherits biases from training data (gender, nationality, etc.).
Current LLM research is largely aimed at addressing these limitations.
LLMs are Not Magic
They're neural networks that predict the next token. But at sufficient scale, with sufficient data, this simple objective produces fluent language, broad factual knowledge, and surprisingly capable problem-solving.
This is emergence: complex behavior from simple components.
Congratulations!
You've learned the full pipeline: tokenization, embeddings, transformer layers, attention, output projection, and autoregressive sampling, plus how training, scaling, and inference costs fit together.
You now understand the foundations of every modern AI system.
Q1: What is the first step in LLM processing?
Q2: What do token embeddings convert tokens into?
Q3: How do LLMs generate the next token?
Q4: What is the final layer of an LLM that produces token probabilities?
Q5: What happens after an LLM generates a token?