Understand GPT, autoregressive generation, and the art of controlling model outputs
The original "Attention is All You Need" transformer (2017) used an encoder-decoder architecture for machine translation. But by 2018-2019, a fundamental shift occurred: modern large language models like GPT, Claude, LLaMA, and Mistral all use decoder-only architecture. This wasn't just a design preference - it was a discovery about how to build truly scalable AI systems.
| Year | Model | Architecture | Impact |
|---|---|---|---|
| 2017 | Transformer | Encoder-Decoder | Machine translation breakthrough |
| 2018 | GPT-1 | Decoder-Only | Showed next-token prediction works |
| 2018 | BERT | Encoder-Only | Best for classification/understanding |
| 2019 | GPT-2 | Decoder-Only | 1.5B params, coherent long-form text |
| 2020 | GPT-3 | Decoder-Only | 175B params, few-shot learning emerges |
| 2023 | LLaMA, GPT-4 | Decoder-Only | Open-source & commercial dominance |
| 2024+ | Claude, Gemini | Decoder-Only | De facto standard architecture for generative LLMs |
Encoder-Decoder: 2 separate stacks, cross-attention between them, complex data flow
Decoder-Only: Single stack, self-attention only, clean architecture
Why it matters: Simpler = easier to optimize, debug, and scale to trillions of parameters
Encoder-Decoder: Needs paired data (e.g., English→French sentences)
Decoder-Only: Any text is training data! Just predict next token.
Why it matters: Can train on entire internet (trillions of tokens) vs. limited parallel corpora (millions)
At scale (>10B parameters), decoder-only models spontaneously develop emergent abilities such as few-shot in-context learning, instruction following, and step-by-step reasoning.
Why it matters: A single model can handle many tasks without task-specific fine-tuning
Everything becomes "text completion": translation, question answering, summarization, and even coding can all be phrased as a prompt for the model to continue (see the example below).
Why it matters: One training objective (next-token prediction) covers all tasks
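As an illustration, here is how several different tasks reduce to the same "continue the text" operation. The prompts below are hypothetical examples, not prescribed templates from any particular model:
# Every task is phrased as a prefix for the model to continue.
tasks = {
    "translation":     "Translate to French: 'Hello, how are you?' ->",
    "question answer": "Q: What is the capital of France?\nA:",
    "summarization":   "Article: <long text here>\nSummary:",
    "classification":  "Review: 'Great movie, loved it!'\nSentiment:",
}
for name, prompt in tasks.items():
    print(f"{name:15s} | model continues after: {prompt!r}")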
| Aspect | Encoder-Only / Encoder-Decoder (BERT, T5) | Decoder-Only (GPT, Llama) |
|---|---|---|
| Structure | Encoder stack (BERT) or encoder + decoder (T5) | Single decoder stack |
| Training | Masked language modeling (BERT), span corruption (T5) | Causal language modeling (next-token prediction) |
| Attention | Bidirectional (plus cross-attention in encoder-decoder) | Causal only (left-to-right) |
| Best For | Classification, understanding | Generation, open-ended tasks |
| Examples | BERT, T5, mBART | GPT-3/4, Llama, Claude, Mistral |
| Industry Adoption | Niche (classification, embeddings, retrieval) | Dominant for modern generative LLMs |
Simplify: remove the encoder entirely. Stack decoder layers that attend to previous tokens (causal):
import torch

class GPTLikeModel(torch.nn.Module):
def __init__(self, vocab_size, d_model=768, num_layers=12,
num_heads=12, d_ff=3072, max_seq_len=2048):
super().__init__()
# Token + positional embeddings
self.token_embedding = torch.nn.Embedding(vocab_size, d_model)
self.pos_embedding = torch.nn.Embedding(max_seq_len, d_model)
# Stack of decoder layers (only self-attention, no cross-attention)
self.layers = torch.nn.ModuleList([
GPTDecoderLayer(d_model, num_heads, d_ff)
for _ in range(num_layers)
])
# Output layer
self.ln_final = torch.nn.LayerNorm(d_model)
self.lm_head = torch.nn.Linear(d_model, vocab_size)
def forward(self, token_ids):
"""
token_ids: [batch, seq_len]
"""
seq_len = token_ids.size(1)
# Embeddings
x = self.token_embedding(token_ids)
pos_ids = torch.arange(seq_len, device=token_ids.device).unsqueeze(0)
x = x + self.pos_embedding(pos_ids)
# Apply causal mask (can't look forward)
causal_mask = create_causal_mask(seq_len, device=token_ids.device)
# Decoder layers
for layer in self.layers:
x = layer(x, causal_mask)
# Output
x = self.ln_final(x)
logits = self.lm_head(x) # [batch, seq_len, vocab_size]
return logits
def create_causal_mask(seq_len, device):
"""Lower triangular matrix for causal masking"""
mask = torch.tril(torch.ones(seq_len, seq_len, device=device))
return mask.bool()
class GPTDecoderLayer(torch.nn.Module):
def __init__(self, d_model, num_heads, d_ff):
super().__init__()
self.attention = MultiHeadAttention(d_model, num_heads)
self.ff = torch.nn.Sequential(
torch.nn.Linear(d_model, d_ff),
torch.nn.GELU(), # Modern: GELU instead of ReLU
torch.nn.Linear(d_ff, d_model)
)
self.norm1 = torch.nn.LayerNorm(d_model)
self.norm2 = torch.nn.LayerNorm(d_model)
def forward(self, x, causal_mask):
# Self-attention with causal mask
attn, _ = self.attention(x, mask=causal_mask)
x = self.norm1(x + attn)
# Feed-forward
ff = self.ff(x)
x = self.norm2(x + ff)
return x
# That's it! Much simpler than encoder-decoder
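To sanity-check the pieces above, you can print a small causal mask and run a tiny, randomly initialized model on dummy token IDs. This is a minimal sketch; it assumes the `MultiHeadAttention` module from the earlier attention tutorial is in scope and returns `(output, weights)`:
# Inspect the causal mask for a short sequence: True = "may attend"
print(create_causal_mask(4, device='cpu'))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

# Tiny model, dummy input: the output has one logit vector per position
tiny = GPTLikeModel(vocab_size=1000, d_model=64, num_layers=2,
                    num_heads=4, d_ff=256, max_seq_len=128)
dummy_ids = torch.randint(0, 1000, (1, 8))   # [batch=1, seq_len=8]
print(tiny(dummy_ids).shape)                 # torch.Size([1, 8, 1000])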
Decoder-only models learn through the simplest possible objective: given a sequence of tokens, predict the next one. This is called causal language modeling or autoregressive modeling. The beauty is in its simplicity and power.
Training Objective:
Given: tokens[0], tokens[1], ..., tokens[n-1]
Predict: tokens[n]
Repeat for ALL positions in ALL training documents!
import torch
import torch.nn.functional as F
# Training sentence: "The quick brown fox jumps"
# After tokenization:
tokens = ["The", "quick", "brown", "fox", "jumps"]
token_ids = [464, 2068, 7586, 21831, 18045] # Hypothetical IDs
print("From one sentence, we create MULTIPLE training examples:\n")
# Example 1: Predict "quick" given "The"
input_1 = [464] # "The"
target_1 = 2068 # "quick"
print(f"Context: ['The'] β Predict: 'quick'")
# Example 2: Predict "brown" given "The quick"
input_2 = [464, 2068] # "The quick"
target_2 = 7586 # "brown"
print(f"Context: ['The', 'quick'] β Predict: 'brown'")
# Example 3: Predict "fox" given "The quick brown"
input_3 = [464, 2068, 7586] # "The quick brown"
target_3 = 21831 # "fox"
print(f"Context: ['The', 'quick', 'brown'] β Predict: 'fox'")
# Example 4: Predict "jumps" given "The quick brown fox"
input_4 = [464, 2068, 7586, 21831] # "The quick brown fox"
target_4 = 18045 # "jumps"
print(f"Context: ['The', 'quick', 'brown', 'fox'] β Predict: 'jumps'")
print("\nβ
From 5 tokens, we get 4 training examples!")
print("β
From 1000 tokens β 999 training examples")
print("β
From 1 trillion tokens β ~1 trillion training examples")
The transformer's key advantage: we can compute ALL predictions in parallel! Here's how:
def train_step(model, token_ids, optimizer):
"""
Single training step on a batch of sequences.
Args:
model: Decoder-only transformer
token_ids: [batch_size, seq_len] - input tokens
optimizer: PyTorch optimizer
"""
batch_size, seq_len = token_ids.shape
# Forward pass: get predictions for ALL positions at once
# Causal masking ensures position i can only see tokens 0...i-1
logits = model(token_ids) # [batch_size, seq_len, vocab_size]
print(f"Input shape: {token_ids.shape}")
print(f"Output logits shape: {logits.shape}")
print(f"For each position, we have {logits.shape[-1]} probabilities (one per vocab token)\n")
# Create targets: shift inputs left by 1
# We want to predict token[i+1] given tokens[0:i]
targets = token_ids[:, 1:].contiguous() # Remove first token
logits = logits[:, :-1, :].contiguous() # Remove last prediction
print("After alignment:")
print(f"Logits shape: {logits.shape} # Predictions for positions 0 to seq_len-2")
print(f"Targets shape: {targets.shape} # Actual tokens at positions 1 to seq_len-1\n")
# Reshape for loss computation
logits_flat = logits.view(-1, logits.size(-1)) # [batch*seq_len, vocab_size]
targets_flat = targets.view(-1) # [batch*seq_len]
print(f"Flattened logits: {logits_flat.shape}")
print(f"Flattened targets: {targets_flat.shape}\n")
# Compute cross-entropy loss
# This measures how well our predictions match the actual next tokens
loss = F.cross_entropy(logits_flat, targets_flat)
print(f"Loss: {loss.item():.4f}")
print("(Lower = better predictions)\n")
# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
return loss.item()
# Example usage
vocab_size = 50000
d_model = 768
model = GPTLikeModel(vocab_size, d_model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Batch of sequences
batch_size = 2
seq_len = 10
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
loss = train_step(model, token_ids, optimizer)
print(f"Training step complete! Loss: {loss:.4f}")
Cross-entropy measures prediction quality:
# Example: Predicting next token after "The cat"
# Correct next token: "sat" (token_id = 5678)
# Model's prediction (before softmax)
logits = torch.tensor([
2.1, # token 0: "and"
1.5, # token 1: "is"
0.8, # token 2: "the"
    4.2,  # token 3: "sat" ← Should be high!
1.2, # token 4: "ran"
# ... 49,995 more tokens
])
# Convert to probabilities
probs = F.softmax(logits, dim=-1)
print("Probability distribution:")
print(f"P('and') = {probs[0]:.4f}")
print(f"P('is') = {probs[1]:.4f}")
print(f"P('sat') = {probs[3]:.4f} β Should be high!")
print(f"P('ran') = {probs[4]:.4f}")
# Cross-entropy loss = -log(probability of correct token)
correct_token_id = 3 # "sat"
loss = -torch.log(probs[correct_token_id])
print(f"\nCross-entropy loss: {loss:.4f}")
# Interpretation:
# If model gives "sat" 90% probability β loss β 0.10 (very good!)
# If model gives "sat" 50% probability β loss β 0.69 (okay)
# If model gives "sat" 10% probability β loss β 2.30 (bad!)
# If model gives "sat" 1% probability β loss β 4.61 (terrible!)
print("\nβ
Training pushes loss down β model learns to assign high probability to correct tokens")
"The cat" β likely: "sat", "ran", "meowed"
"The cats" β likely: "were", "are" (plural agreement)
"2 + 2 =" β likely: "4" (arithmetic)
"The capital of France is" β "Paris"
"Einstein developed" β "relativity"
"Water boils at" β "100 degrees Celsius"
"If it rains, then" β likely continuation involves wetness
"Therefore, we can conclude" β logical deduction
Chain of thought emerges at scale
"def factorial(" β "(n):"
"import" β "numpy", "torch", "os"
Learns syntax, APIs, patterns
| Model | Training Tokens (approx.) | Training Time (approx.) | Compute (GPU-hours, approx.) |
|---|---|---|---|
| GPT-2 | 40 billion | ~1 week | ~1,000 |
| GPT-3 | 300 billion | ~1 month | ~350,000 |
| LLaMA-2 70B | 2 trillion | ~3 months | ~1,700,000 |
| GPT-4 (est.) | ~13 trillion | ~6 months | ~10,000,000+ |
def train_language_model(model, train_dataloader, num_epochs=3,
device='cuda', learning_rate=1e-4):
"""
Complete training loop for decoder-only language model.
"""
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# Learning rate scheduler (cosine decay)
total_steps = len(train_dataloader) * num_epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=total_steps
)
model.train()
global_step = 0
for epoch in range(num_epochs):
epoch_loss = 0.0
for batch_idx, token_ids in enumerate(train_dataloader):
token_ids = token_ids.to(device)
# token_ids shape: [batch_size, seq_len]
# Forward pass
logits = model(token_ids) # [batch, seq_len, vocab_size]
# Prepare targets (shift by 1)
targets = token_ids[:, 1:].contiguous()
logits = logits[:, :-1, :].contiguous()
# Compute loss
loss = F.cross_entropy(
logits.view(-1, logits.size(-1)),
targets.view(-1)
)
# Backward pass
optimizer.zero_grad()
loss.backward()
# Gradient clipping (prevents exploding gradients)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
epoch_loss += loss.item()
global_step += 1
# Logging
if batch_idx % 100 == 0:
perplexity = torch.exp(loss).item()
print(f"Epoch {epoch+1}/{num_epochs} | "
f"Step {batch_idx}/{len(train_dataloader)} | "
f"Loss: {loss.item():.4f} | "
f"Perplexity: {perplexity:.2f} | "
f"LR: {scheduler.get_last_lr()[0]:.2e}")
avg_epoch_loss = epoch_loss / len(train_dataloader)
print(f"\nβ
Epoch {epoch+1} complete. Avg Loss: {avg_epoch_loss:.4f}\n")
return model
# Usage example
# model = GPTLikeModel(vocab_size=50000, d_model=768, num_layers=12)
# trained_model = train_language_model(model, train_dataloader)
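The loop above logs perplexity, which is simply the exponential of the cross-entropy loss; it can be read as the effective number of tokens the model is "hesitating between" at each step. A quick illustration using the loss values from the earlier cross-entropy example:
import math
# Perplexity = exp(cross-entropy). Lower is better; 1.0 would be a perfect model.
for loss in [0.10, 0.69, 2.30, 4.61]:
    print(f"loss = {loss:.2f} -> perplexity = {math.exp(loss):.1f}")
# loss = 0.10 -> perplexity = 1.1
# loss = 0.69 -> perplexity = 2.0
# loss = 2.30 -> perplexity = 10.0
# loss = 4.61 -> perplexity = 100.5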
Once trained, generate by repeatedly predicting the next token:
@torch.no_grad()
def generate_greedy(model, prompt_ids, max_length=100, eos_token_id=50256):
"""
Greedy generation: pick token with highest probability
"""
sequence = prompt_ids.clone()
for _ in range(max_length):
# Get logits for next token
logits = model(sequence) # [batch, seq_len, vocab_size]
next_logits = logits[:, -1, :] # [batch, vocab_size]
# Greedy: pick token with highest probability
next_token = torch.argmax(next_logits, dim=-1) # [batch]
# Append to sequence
sequence = torch.cat([sequence, next_token.unsqueeze(1)], dim=1)
# Stop if end-of-sequence token
        if (next_token == eos_token_id).all():
break
return sequence
# Usage
model.eval()
prompt = "Once upon a time"
prompt_ids = tokenizer.encode(prompt, return_tensors='pt')  # [1, seq_len] tensor (assumes a Hugging Face-style tokenizer)
generated_ids = generate_greedy(model, prompt_ids, max_length=50)
text = tokenizer.decode(generated_ids[0])
print(text)
But greedy has a problem: it can get stuck in repetitive loops or miss good continuations. Enter: sampling.
Temperature is one of the most important hyperparameters in text generation. It controls the tradeoff between "safe and predictable" vs. "creative and surprising." Understanding temperature is essential for controlling model behavior.
Temperature Scaling Formula:
scaled_logits = logits / temperature
probabilities = softmax(scaled_logits)
Key Insight: Temperature doesn't change which token has the highest probability; it changes how sharp or flat the distribution is.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
# Model's raw predictions (logits) for next token
# These come from the final linear layer: [vocab_size]
logits = torch.tensor([3.0, 2.5, 2.0, 1.0, 0.5, 0.1, 0.0])
token_names = ["is", "was", "will", "can", "may", "could", "would"]
temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
fig, axes = plt.subplots(1, len(temperatures), figsize=(20, 4))
for idx, temp in enumerate(temperatures):
# Apply temperature
scaled_logits = logits / temp
probs = F.softmax(scaled_logits, dim=-1).numpy()
# Plot
ax = axes[idx]
ax.bar(range(len(token_names)), probs, color='steelblue')
ax.set_xticks(range(len(token_names)))
ax.set_xticklabels(token_names, rotation=45)
ax.set_ylim([0, 1])
ax.set_title(f'Temperature = {temp}', fontsize=12)
ax.set_ylabel('Probability' if idx == 0 else '')
# Show actual probabilities
for i, p in enumerate(probs):
ax.text(i, p + 0.02, f'{p:.2f}', ha='center', fontsize=8)
plt.tight_layout()
plt.show()
print("Observation:")
print("- Low temp (0.1): Spiky distribution, almost all weight on 'is'")
print("- Medium temp (1.0): Balanced distribution, multiple options viable")
print("- High temp (2.0): Flat distribution, even low-probability tokens get picked")
def demonstrate_temperature_effects(model, prompt="The future of AI is",
max_length=20):
"""
Generate same prompt with different temperatures.
"""
prompt_ids = tokenizer.encode(prompt, return_tensors='pt')
temps = [0.1, 0.7, 1.0, 1.5, 2.0]
print(f"Prompt: \"{prompt}\"\n")
print("=" * 70)
for temp in temps:
generated = generate(model, prompt_ids, max_length=max_length,
temperature=temp, top_p=0.95)
text = tokenizer.decode(generated[0])
print(f"\nTemperature = {temp}:")
print(f" {text}")
print("\n" + "=" * 70)
# Example output:
# Prompt: "The future of AI is"
# ======================================================================
#
# Temperature = 0.1:
# The future of AI is a very important topic that has been discussed extensively in recent years
# (Note: Safe, predictable, somewhat repetitive)
#
# Temperature = 0.7:
# The future of AI is fascinating and holds immense potential for transforming industries
# (Note: Good balance, natural and varied)
#
# Temperature = 1.0:
# The future of AI is uncertain but promising, with many exciting developments on the horizon
# (Note: More varied word choices)
#
# Temperature = 1.5:
# The future of AI is remarkable yet challenging, potentially reshaping society in unexpected ways
# (Note: More creative, less common word combinations)
#
# Temperature = 2.0:
# The future of AI is wonderfully mysterious, perhaps revolutionizing beyond imagination or confusion
# (Note: Very creative but starting to lose coherence)
| Use Case | Recommended Temp | Why |
|---|---|---|
| Code Generation | 0.1 - 0.3 | Need correctness, syntactic precision |
| Factual Q&A | 0.2 - 0.5 | Want accurate, confident answers |
| General Chat | 0.7 - 0.9 | Balance between coherence and variety |
| Creative Writing | 0.9 - 1.2 | Want surprising, creative language |
| Brainstorming | 1.0 - 1.5 | Need diverse, unconventional ideas |
| Poetry/Art | 1.2 - 2.0 | Maximum creativity, experimental language |
# With temperature=0 (or very low), always pick most probable token
def greedy_generation_problem():
"""Demonstrates repetition problem with greedy decoding."""
prompt = "I think"
    # Step 1: "I think" → most probable: "that"
    # Step 2: "I think that" → most probable: "I"
    # Step 3: "I think that I" → most probable: "think"
    # Step 4: "I think that I think" → most probable: "that"
    # Step 5: "I think that I think that" → most probable: "I"
# ...
# Result: "I think that I think that I think that..." (infinite loop!)
# Another example:
prompt = "The most important"
# β "The most important thing is the most important thing is the most..."
print("Greedy decoding often causes:")
print("1. Repetition loops")
print("2. Predictable, boring outputs")
print("3. Missing better sequences that start with less probable tokens")
print("\nβ
Solution: Add randomness via temperature > 0!")
def generate_with_dynamic_temperature(model, prompt_ids, max_length=100):
"""
Use different temperatures for different parts of generation.
Strategy:
- Start high (creative opening)
- Middle moderate (balanced development)
- End lower (coherent conclusion)
"""
sequence = prompt_ids.clone()
for step in range(max_length):
# Dynamic temperature based on generation progress
progress = step / max_length
if progress < 0.2:
temperature = 1.2 # Creative start
elif progress < 0.8:
temperature = 0.8 # Balanced middle
else:
temperature = 0.5 # Focused ending
logits = model(sequence)[:, -1, :]
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
sequence = torch.cat([sequence, next_token], dim=1)
return sequence
# This technique can produce more natural-seeming generations
# with varied creativity throughout
| Preset | Typical Temperature | Strengths | Weaknesses |
|---|---|---|---|
| Deterministic | ~0.1-0.3 | ✅ Accurate, consistent | ❌ Repetitive, boring |
| Balanced | ~0.7 | ✅ Natural, varied; the most popular setting | |
| Default/Neutral | 1.0 | ✅ Unmodified distribution | ⚠️ Sometimes too random |
| Creative | ~1.5-2.0 | ✅ Surprising, novel | ❌ Often incoherent |
Temperature controls how we scale probabilities, but sampling strategies control which tokens we consider. These techniques prevent the model from generating unlikely/nonsensical tokens while maintaining creativity.
❌ Pure Random Sampling Issue:
# With just temperature, might sample ANY token from vocabulary
probs = F.softmax(logits / temperature, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
# Example probability distribution (50,257 tokens in GPT-2):
# Token "is": 40.2% β High probability, makes sense
# Token "was": 25.1% β Also reasonable
# Token "will": 15.3% β Good option
# ...
# Token "xyzabc": 0.0001% β Nonsense, but can still be sampled!
# Problem: With 50K+ tokens, even with temperature, we might
# sample truly bizarre tokens with tiny probabilities
Solution: Limit sampling to a "reasonable subset" using top-k or top-p.
Idea: Only consider the k most probable tokens. Set all others to zero probability.
def top_k_sampling(logits, k=50, temperature=1.0):
"""
Top-k sampling: Keep only top k tokens, set rest to -inf
Args:
logits: [vocab_size] unnormalized predictions
k: Number of top tokens to consider (e.g., 50)
temperature: Controls randomness
Returns:
next_token: Sampled token ID
"""
# Apply temperature
logits = logits / temperature
# Get top k logits and their indices
top_k_logits, top_k_indices = torch.topk(logits, k)
# top_k_logits: [k] highest logit values
# top_k_indices: [k] positions of those logits
# Create new tensor with -inf everywhere
logits_filtered = torch.full_like(logits, float('-inf'))
# Put back the top k logits at their original positions
logits_filtered.scatter_(0, top_k_indices, top_k_logits)
# Now softmax will give 0 probability to -inf tokens
probs = F.softmax(logits_filtered, dim=-1)
# Sample from top k only
next_token = torch.multinomial(probs, num_samples=1)
return next_token
# Example walkthrough:
print("Example: Top-k sampling with k=3")
print("=" * 60)
# Suppose model outputs these logits:
logits = torch.tensor([3.0, 2.8, 2.5, 1.0, 0.5, 0.1, 0.0])
tokens = ["is", "was", "will", "can", "may", "could", "would"]
print("Original logits:")
for tok, logit in zip(tokens, logits):
print(f" {tok:8s}: {logit:.2f}")
# Apply top-k with k=3
k = 3
top_k_logits, top_k_indices = torch.topk(logits, k)
print(f"\nTop-{k} selection:")
for idx, logit in zip(top_k_indices, top_k_logits):
print(f" {tokens[idx]:8s}: {logit:.2f}")
# After filtering
logits_filtered = torch.full_like(logits, float('-inf'))
logits_filtered.scatter_(0, top_k_indices, top_k_logits)
probs = F.softmax(logits_filtered / 1.0, dim=-1)
print(f"\nProbabilities (only top-{k} non-zero):")
for tok, prob in zip(tokens, probs):
if prob > 0.001:
print(f" {tok:8s}: {prob:.3f}")
else:
print(f" {tok:8s}: 0.000 (filtered out)")
# Output:
# Original logits:
# is : 3.00
# was : 2.80
# will : 2.50
# can : 1.00
# may : 0.50
# could : 0.10
# would : 0.00
#
# Top-3 selection:
# is : 3.00
# was : 2.80
# will : 2.50
#
# Probabilities (only top-3 non-zero):
#   is      : 0.412
#   was     : 0.338
#   will    : 0.250
# can : 0.000 (filtered out)
# may : 0.000 (filtered out)
# could : 0.000 (filtered out)
# would : 0.000 (filtered out)
✅ Advantages: simple to implement, cheap to compute, and reliably cuts off the long tail of nonsense tokens.
❌ Disadvantages: the cutoff is fixed; k tokens are kept whether the model is confident (wasteful) or uncertain (good options may be excluded), as the comparison below shows.
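To make the fixed-cutoff issue concrete, the short sketch below uses two hypothetical next-token distributions (the same shapes that appear in the nucleus-sampling walkthrough later in this section) and measures how much probability mass a fixed k=3 keeps in each case:
# Fixed k behaves very differently depending on the model's confidence.
confident = torch.tensor([5.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.0])   # one clear winner
uncertain = torch.tensor([2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4])   # many plausible options
for name, logits in [("confident", confident), ("uncertain", uncertain)]:
    probs = F.softmax(logits, dim=-1)
    kept = torch.topk(probs, k=3).values.sum().item()
    print(f"{name}: top-3 tokens cover {kept:.1%} of the probability mass")
# confident: top-3 tokens cover ~97% (k=3 keeps more candidates than needed)
# uncertain: top-3 tokens cover ~52% (k=3 discards many reasonable tokens)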
Idea: Keep the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9). This adapts to the model's confidence.
def top_p_sampling(logits, p=0.9, temperature=1.0):
"""
Nucleus sampling: Keep tokens until cumulative probability > p
Key insight: If model is confident, nucleus is small (few tokens).
If model is uncertain, nucleus is large (many tokens).
Args:
logits: [vocab_size] unnormalized predictions
p: Cumulative probability threshold (e.g., 0.9 = 90%)
temperature: Controls randomness
Returns:
next_token: Sampled token ID
"""
# Apply temperature
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
# Sort by probability (descending)
sorted_probs, sorted_indices = torch.sort(probs, descending=True, dim=-1)
# Cumulative probability
cumsum_probs = torch.cumsum(sorted_probs, dim=-1)
    # Find where cumulative probability exceeds p.
    # Shift the mask right by one so the token that first crosses p is KEPT
    # (we want the smallest set whose cumulative probability exceeds p).
    mask = cumsum_probs > p
    mask[..., 1:] = mask[..., :-1].clone()
    # Always keep the most probable token (prevent empty set)
    mask[..., 0] = False
# Zero out probabilities of tail tokens
sorted_probs[mask] = 0.0
# Renormalize (so probabilities sum to 1 again)
sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
# Sample from filtered distribution
next_token_sorted_idx = torch.multinomial(sorted_probs, num_samples=1)
# Map back to original vocabulary index
next_token = sorted_indices.gather(-1, next_token_sorted_idx)
return next_token
# Example walkthrough with two scenarios:
print("Example 1: Model is CONFIDENT")
print("=" * 60)
# Confident case: One token dominates
logits_confident = torch.tensor([5.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.0])
probs = F.softmax(logits_confident, dim=-1)
tokens = ["yes", "sure", "ok", "yeah", "yep", "affirmative", "indeed"]
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)
print("Token probabilities (sorted):")
for i in range(len(tokens)):
idx = sorted_indices[i]
print(f" {tokens[idx]:15s}: {sorted_probs[i]:.3f} | Cumulative: {cumsum[i]:.3f}")
if cumsum[i] > 0.9:
print(f" ββ Top-p (p=0.9) stops here! Nucleus size: {i+1} tokens")
break
print("\n" + "=" * 60)
print("Example 2: Model is UNCERTAIN")
print("=" * 60)
# Uncertain case: Many tokens have similar probability
logits_uncertain = torch.tensor([2.0, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4])
probs = F.softmax(logits_uncertain, dim=-1)
tokens = ["maybe", "perhaps", "possibly", "might", "could", "uncertain", "unclear"]
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)
print("Token probabilities (sorted):")
for i in range(len(tokens)):
idx = sorted_indices[i]
print(f" {tokens[idx]:15s}: {sorted_probs[i]:.3f} | Cumulative: {cumsum[i]:.3f}")
if cumsum[i] > 0.9:
print(f" ββ Top-p (p=0.9) stops here! Nucleus size: {i+1} tokens")
break
# Output shows:
# Confident case: Nucleus is just 1 token here (model knows what to say)
# Uncertain case: Nucleus covers nearly all 7 tokens (model is less sure, keeps more options)
Adaptive to Confidence: the nucleus shrinks to a handful of tokens when the model is confident and expands when many continuations are plausible.
Prevents Both Problems: it cuts off the long tail of nonsense tokens without freezing the model into a single predictable choice.
| Aspect | Top-k (k=50) | Top-p (p=0.9) |
|---|---|---|
| Nucleus Size | Always 50 tokens | Varies (2-200+ tokens) |
| When Model Confident | Still considers 50 tokens (wasteful) | Considers 2-5 tokens (efficient) |
| When Model Uncertain | Only 50 tokens (might miss good ones) | Considers 100+ tokens (flexible) |
| Use Case | Code generation, structured output | Chat, creative writing, general use |
| Typical Value | k=40-50 | p=0.9-0.95 |
def generate_token(logits, temperature=0.8, top_k=50, top_p=0.95):
"""
Industry standard: Use BOTH top-k and top-p together
1. First apply top-k to limit to reasonable candidates
2. Then apply top-p to adapt to confidence
"""
# Step 1: Temperature scaling
logits = logits / temperature
# Step 2: Top-k filtering (hard cutoff)
if top_k > 0:
top_k_logits, top_k_indices = torch.topk(logits, min(top_k, logits.size(-1)))
logits_filtered = torch.full_like(logits, float('-inf'))
logits_filtered.scatter_(-1, top_k_indices, top_k_logits)
logits = logits_filtered
# Step 3: Top-p filtering (adaptive cutoff)
if top_p < 1.0:
probs = F.softmax(logits, dim=-1)
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumsum_probs = torch.cumsum(sorted_probs, dim=-1)
        mask = cumsum_probs > top_p
        mask[..., 1:] = mask[..., :-1].clone()  # shift right so the token that crosses top_p is kept
        mask[..., 0] = False  # Keep at least one
sorted_probs[mask] = 0.0
sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
next_token_idx = torch.multinomial(sorted_probs, num_samples=1)
next_token = sorted_indices.gather(-1, next_token_idx)
else:
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
return next_token
# Commonly used parameter combinations (illustrative; production settings are not public):
configs = {
    "Chat assistant":   {"temperature": 0.7, "top_p": 0.95, "top_k": None},
    "Code generation":  {"temperature": 0.2, "top_p": 0.9,  "top_k": 40},
    "Creative writing": {"temperature": 1.0, "top_p": 0.95, "top_k": None},
    "Factual QA":       {"temperature": 0.3, "top_p": 0.85, "top_k": 20},
}
@torch.no_grad()
def generate(model, prompt_ids, max_length=100, temperature=0.7,
top_p=0.9, vocab_size=50000, end_token=50256):
"""
Generate text with temperature and top-p sampling
"""
sequence = prompt_ids.clone()
for _ in range(max_length):
# Forward pass
logits = model(sequence) # [batch, seq_len, vocab_size]
next_logits = logits[:, -1, :] # [batch, vocab_size]
        # Apply temperature and top-p filtering, then sample the next token
        # (reuses generate_token defined above; top_k=0 disables the top-k step)
        next_token = generate_token(next_logits, temperature=temperature,
                                    top_k=0, top_p=top_p)
# Append to sequence
sequence = torch.cat([sequence, next_token], dim=1)
# Stop if end token
if (next_token == end_token).all():
break
return sequence
# Different generation styles
def generate_deterministic(model, prompt_ids):
return generate(model, prompt_ids, temperature=0.1, top_p=0.99)
def generate_balanced(model, prompt_ids):
return generate(model, prompt_ids, temperature=0.7, top_p=0.9)
def generate_creative(model, prompt_ids):
return generate(model, prompt_ids, temperature=1.2, top_p=0.95)
Several advantages of decoder-only models:
- Architectural simplicity: a single stack instead of two (encoder-decoder)
- Unified computation: the same parameters and a single path of computation for every task
- Scalable training data: learn from raw text via next-token prediction
- Versatility: works for generation, classification, and reasoning
- Efficiency: a single forward pass per token, with no cross-attention overhead
- In-context learning: just describe the task in the prompt and the model picks it up from context (see the example below)
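For instance, a few-shot prompt specifies a task entirely in the context, and a sufficiently large decoder-only model simply continues the pattern without any fine-tuning. The prompt below is illustrative:
# Few-shot in-context learning: the task is defined by examples in the prompt,
# and generation just continues the pattern. No task-specific head is needed.
few_shot_prompt = (
    "Translate English to French.\n"
    "sea otter -> loutre de mer\n"
    "cheese -> fromage\n"
    "peppermint ->"
)
# A large decoder-only model typically completes this with " menthe poivrée".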
Q1: What is the key characteristic of decoder-only models like GPT?
Q2: What is causal masking in decoder-only models?
Q3: What is autoregressive generation?
Q4: Which model is an example of a decoder-only architecture?
Q5: Why are decoder-only models well-suited for language generation?