The Transformer Revolution
In June 2017, a team of Google researchers published a paper titled "Attention is All You Need" that would fundamentally transform artificial intelligence. The Transformer architecture they introduced went on to displace RNNs and LSTMs across sequence modeling, becoming the foundation for virtually every state-of-the-art AI system today: GPT-4, Claude, Gemini, BERT, ChatGPT, and countless others.
The Big Breakthrough: Transformers process all positions simultaneously using attention, not sequentially like RNNs. This enables:
- Massive parallelization: Train on thousands of GPUs at once
- Long-range dependencies: Handle 100,000+ token sequences
- Better performance: State-of-the-art on virtually all NLP tasks
- Scalability: Grow to 100B+ parameters efficiently
Why RNNs Failed to Scale
Before Transformers, RNNs and LSTMs dominated sequential modeling. But they had fundamental limitations:
• Sequential processing: Must wait for step t-1 before computing step t, so no parallelization
• Limited context: Struggle with sequences longer than a few hundred tokens
• Slow training: Can't use full GPU power
• Vanishing gradients: Hard to learn very long dependencies
• Memory bottleneck: Compress the entire history into a fixed-size hidden state
Transformers address each of these:
• Parallel processing: Compute all positions at once → 100x faster training
• Long context: Handle 100,000+ tokens with efficient attention variants
• Fast training: Saturate thousands of GPUs
• Clean gradients: Direct connections via attention
• No bottleneck: Every position can access any other position directly
The Impact: From Research to Reality
2017: Original Transformer paper (machine translation)
2018: BERT revolutionizes NLP understanding tasks
2019: GPT-2 shows large-scale language modeling potential
2020: GPT-3 demonstrates few-shot learning with 175B parameters
2021: Vision Transformers (ViT) beat CNNs on image tasks
2022: ChatGPT brings LLMs to mainstream (100M users in 2 months)
2023-2024: GPT-4, Claude 3, and Gemini Ultra bring multimodal AI
2025: Transformers in audio, video, robotics, drug discovery, and beyond
Key insight: Transformers aren't just better; they're fundamentally different. They enabled the AI revolution by making it possible to train on internet-scale data efficiently.
What Makes Transformers Special?
Parallel Processing
Process entire sequence at once, not word-by-word. Train on 1000s of GPUs simultaneously.
Self-Attention
Every word directly attends to every other word. No information bottleneck.
Long Context
Handle 100,000+ tokens with techniques like sparse attention and efficient implementations.
Transfer Learning
Pre-train once on massive data, fine-tune for any task. Democratizes AI development.
Learning Goal: By the end of this tutorial, you'll understand:
- How attention mechanisms work (Query, Key, Value)
- Self-attention and multi-head attention
- The complete Transformer architecture (encoder, decoder, positional encoding)
- Why Transformers outperform RNNs
- Different Transformer variants (BERT, GPT, T5)
- How to implement and use Transformers
The Attention Mechanism: The Core Innovation
Core Idea: Attention allows the model to dynamically focus on relevant parts of the input. Instead of processing word-by-word sequentially, it calculates relationships between ALL positions simultaneously.
Think of it like reading a paragraph and being able to instantly reference any word while understanding the current word, with no need to squeeze everything into a fixed-size memory!
Intuitive Example: Translation with Attention
Without Attention (RNN):
• Encoder processes word-by-word: "The" → "cat" → "sat" → "on" → "the" → "mat"
• Final hidden state must remember the entire sentence
• Decoder generates the translation from this single fixed vector
• Problem: Information bottleneck! Long sentences lose details.
With Attention (Transformer):
When generating "assis" (sat):
• Attends strongly to "sat" (0.7 weight)
• Attends moderately to "cat" (0.2 weight) → subject agreement
• Attends weakly to other words (0.1 total)
• Result: Direct access to relevant source words, no bottleneck!
The Mathematics: Scaled Dot-Product Attention
Attention uses three learned linear projections called Query, Key, and Value. This is the famous QKV attention.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
• Q (Query): "What am I looking for?" (shape: seq_len × d_k)
• K (Key): "What do I represent?" (shape: seq_len × d_k)
• V (Value): "What information do I carry?" (shape: seq_len × d_v)
• d_k: Dimension of queries/keys (used in the scaling factor)
Step-by-step:
1. Compute similarity: QK^T → scores matrix (seq_len × seq_len)
2. Scale by √d_k to keep the softmax from saturating (which would cause vanishing gradients)
3. Normalize with softmax: each row sums to 1 (attention weights)
4. Weight values: multiply weights × V to get the output
Concrete Example with Numbers
Goal: Understand what "sat" should attend to
Step 1: Create Q, K, V matrices
Assume d_k = 3 (tiny for illustration; real models use 64-128)
Word embeddings (simplified 3D vectors):
"The" = [1.0, 0.2, 0.1]
"cat" = [0.5, 1.2, 0.8]
"sat" = [0.3, 0.4, 1.5]
Apply learned weight matrices W_Q, W_K, W_V (learned during training):
Q = embeddings × W_Q: ["The": [0.8, 0.3, 0.2], "cat": [1.1, 0.9, 0.4], "sat": [0.7, 1.2, 0.6]]
K = embeddings × W_K: ["The": [0.9, 0.1, 0.3], "cat": [0.6, 1.3, 0.5], "sat": [0.4, 0.5, 1.4]]
V = embeddings × W_V: ["The": [1.1, 0.3, 0.2], "cat": [0.7, 1.4, 0.6], "sat": [0.5, 0.6, 1.8]]
Step 2: Compute attention scores for "sat"
Q_sat · K_The = [0.7, 1.2, 0.6] · [0.9, 0.1, 0.3] = 0.63 + 0.12 + 0.18 = 0.93
Q_sat · K_cat = [0.7, 1.2, 0.6] · [0.6, 1.3, 0.5] = 0.42 + 1.56 + 0.30 = 2.28
Q_sat · K_sat = [0.7, 1.2, 0.6] · [0.4, 0.5, 1.4] = 0.28 + 0.60 + 0.84 = 1.72
Step 3: Scale by √d_k = √3 ≈ 1.73
Scaled scores: [0.93/1.73, 2.28/1.73, 1.72/1.73] = [0.54, 1.32, 0.99]
Step 4: Apply softmax
Weights: softmax([0.54, 1.32, 0.99]) ≈ [0.21, 0.46, 0.33]
Interpretation: When processing "sat":
β’ Attend to "The": 14% (low - not very relevant)
β’ Attend to "cat": 54% (high - subject of the verb!)
β’ Attend to "sat": 32% (moderate - self-attention)
Step 5: Weighted sum of values
Output = 0.14ΓVThe + 0.54ΓVcat + 0.32ΓVsat
Output = 0.14Γ[1.1,0.3,0.2] + 0.54Γ[0.7,1.4,0.6] + 0.32Γ[0.5,0.6,1.8]
Output = [0.154,0.042,0.028] + [0.378,0.756,0.324] + [0.160,0.192,0.576]
Output = [0.692, 0.990, 0.928]
Result: The output for "sat" is now enriched with information from "cat" (heavily weighted) and context from other words!
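Want to verify the arithmetic? Here is a minimal NumPy sketch that reproduces just the attention step for "sat", starting from the Q, K, V values in the worked example above (an illustration, not a full attention implementation):
import numpy as np

# Q row for "sat" and the K / V matrices from the worked example
Q_sat = np.array([0.7, 1.2, 0.6])
K = np.array([[0.9, 0.1, 0.3],   # "The"
              [0.6, 1.3, 0.5],   # "cat"
              [0.4, 0.5, 1.4]])  # "sat"
V = np.array([[1.1, 0.3, 0.2],   # "The"
              [0.7, 1.4, 0.6],   # "cat"
              [0.5, 0.6, 1.8]])  # "sat"

scores = K @ Q_sat                                # [0.93, 2.28, 1.72]
scaled = scores / np.sqrt(3)                      # [0.54, 1.32, 0.99]
weights = np.exp(scaled) / np.exp(scaled).sum()   # softmax: ~[0.21, 0.46, 0.33]
output = weights @ V                              # ~[0.72, 0.90, 0.91]
print(weights.round(2), output.round(2))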
Why Scaling by √d_k?
Critical Detail: Without scaling, dot products grow large when d_k is large.
Problem: Large dot products → large logits → softmax saturates → vanishing gradients
Example: If d_k = 512, unscaled dot products routinely reach magnitudes in the tens. After softmax, one weight ≈ 1.0 and the others ≈ 0, so gradient flow is broken.
Solution: Divide by √d_k = √512 ≈ 22.6, so dot products stay in a reasonable range and gradients stay healthy!
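You can see the saturation effect directly by comparing the softmax of unscaled and scaled dot products for random vectors. This is a minimal sketch with made-up sizes (one query, ten keys):
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)            # one random query
keys = torch.randn(10, d_k)     # ten random keys

scores = keys @ q               # unscaled dot products, std ~ sqrt(d_k) ~ 22.6
print(F.softmax(scores, dim=-1).max())                # typically close to 1.0: saturated
print(F.softmax(scores / d_k ** 0.5, dim=-1).max())   # much softer, spread across keys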
Self-Attention: Every Position Attends to Every Position
In Transformers, Q, K, and V all come from the same input sequence. This is called self-attention (vs cross-attention in encoder-decoder models).
For "The cat sat on the mat":
The cat sat on the mat
The  [0.3 0.1 0.1 0.1 0.2 0.2]  ← "The" attends weakly to all words
cat  [0.2 0.4 0.3 0.0 0.0 0.1]  ← "cat" attends to itself + the verb
sat  [0.1 0.5 0.2 0.1 0.0 0.1]  ← "sat" strongly attends to "cat"
on   [0.1 0.1 0.2 0.3 0.1 0.2]  ← "on" attends to the verb + object
the  [0.2 0.1 0.1 0.2 0.3 0.1]  ← "the" attends to nearby words
mat  [0.1 0.1 0.2 0.3 0.2 0.1]  ← "mat" attends to the preposition + verb
Each row shows attention weights (sums to 1.0). Larger values = stronger attention.
Notice: No sequential constraint! "sat" directly attends to "cat" without processing "The" first.
Implementation: Attention from Scratch
import numpy as np
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Compute scaled dot-product attention.
Args:
Q: Query matrix (batch, seq_len, d_k)
K: Key matrix (batch, seq_len, d_k)
V: Value matrix (batch, seq_len, d_v)
mask: Optional mask (batch, seq_len, seq_len)
Returns:
Output: Attention output (batch, seq_len, d_v)
Weights: Attention weights (batch, seq_len, seq_len)
"""
# Get dimension for scaling
d_k = Q.size(-1)
# Step 1: Compute attention scores (QK^T)
# Shape: (batch, seq_len, d_k) @ (batch, d_k, seq_len) → (batch, seq_len, seq_len)
scores = torch.matmul(Q, K.transpose(-2, -1))
# Step 2: Scale by sqrt(d_k)
scores = scores / np.sqrt(d_k)
# Step 3: Apply mask if provided (e.g., for causal attention in GPT)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
# Step 4: Apply softmax to get attention weights
# Each row sums to 1.0
weights = F.softmax(scores, dim=-1)
# Step 5: Weighted sum of values
# Shape: (batch, seq_len, seq_len) @ (batch, seq_len, d_v) → (batch, seq_len, d_v)
output = torch.matmul(weights, V)
return output, weights
# ============ EXAMPLE USAGE ============
batch_size = 2
seq_len = 6 # "The cat sat on the mat"
d_model = 512 # Model dimension (typical for Transformer)
# Simulate input embeddings
embeddings = torch.randn(batch_size, seq_len, d_model)
# Create Q, K, V with learned projection matrices
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)
Q = torch.matmul(embeddings, W_Q)
K = torch.matmul(embeddings, W_K)
V = torch.matmul(embeddings, W_V)
# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}") # (2, 6, 512)
print(f"Attention weights shape: {weights.shape}") # (2, 6, 6)
# Inspect attention pattern for first batch, position 2 ("sat")
print("\nAttention weights for 'sat' (position 2):")
print(weights[0, 2, :])  # Row sums to 1.0; with random, untrained projections the pattern is arbitrary
Key Takeaways:
- Query-Key-Value: Q asks "what?", K provides "here!", V gives "the info"
- Dot product: Measures similarity between queries and keys
- Softmax: Converts scores to probability distribution (weights sum to 1)
- Parallel: All positions computed simultaneously β massive speedup
- No recurrence: Direct connections enable clean gradient flow
- Interpretable: Attention weights show what model focuses on
Complete Transformer Architecture
The full Transformer combines multiple components into a powerful architecture. Let's build it piece by piece, understanding each component's role.
1. Multi-Head Attention: Multiple Perspectives
Instead of one attention mechanism, Transformers use multiple attention heads in parallel. Each head learns different relationships!
Intuition: When reading "The bank can guarantee deposits", different heads focus on different meanings:
- Head 1: "bank" ↔ "deposits" (financial institution context)
- Head 2: "can" ↔ "guarantee" (modal verb + action)
- Head 3: "The" ↔ "bank" (determiner + noun)
- Head 4: Overall sentence structure and syntax
Each head specializes in different linguistic patterns: syntax, semantics, coreference, and so on.
For h attention heads:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
Where:
• Each head has its own W_i^Q, W_i^K, W_i^V projection matrices
• Typical: 8 or 16 heads (GPT-3 uses 96 heads!)
• Each head has dimension d_k = d_model / h
• The final W^O matrix combines all heads
Example: 8 Heads with d_model = 512
• Model dimension: d_model = 512
• Number of heads: h = 8
• Dimension per head: d_k = 512 / 8 = 64
Process:
1. Split the 512-dim embedding into 8 × 64-dim chunks
2. Each chunk processed by a separate attention head
3. Head 1 learns subject-verb relationships (64 dims)
4. Head 2 learns semantic similarity (64 dims)
5. ... (6 more heads learning different patterns)
6. Concatenate all 8 outputs: 8 × 64 = 512 dims
7. Project with W^O to the final 512-dim output
Benefit: 8x more pattern types learned without increasing computation!
(8 heads × 64 dims has the same compute as 1 head × 512 dims)
import math
import torch
import torch.nn as nn
class MultiHeadAttention(nn.Module):
def __init__(self, d_model=512, num_heads=8):
super().__init__()
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads # 512 / 8 = 64
# Linear projections for Q, K, V (all heads at once)
self.W_Q = nn.Linear(d_model, d_model)
self.W_K = nn.Linear(d_model, d_model)
self.W_V = nn.Linear(d_model, d_model)
# Output projection
self.W_O = nn.Linear(d_model, d_model)
def forward(self, Q, K, V, mask=None):
batch_size = Q.size(0)
# 1. Linear projections
# Shape: (batch, seq_len, d_model)
Q = self.W_Q(Q)
K = self.W_K(K)
V = self.W_V(V)
# 2. Split into multiple heads
# Shape: (batch, seq_len, d_model) → (batch, seq_len, num_heads, d_k)
Q = Q.view(batch_size, -1, self.num_heads, self.d_k)
K = K.view(batch_size, -1, self.num_heads, self.d_k)
V = V.view(batch_size, -1, self.num_heads, self.d_k)
# Transpose for attention: (batch, num_heads, seq_len, d_k)
Q = Q.transpose(1, 2)
K = K.transpose(1, 2)
V = V.transpose(1, 2)
# 3. Apply scaled dot-product attention for each head
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
attention_output = torch.matmul(weights, V)
# 4. Concatenate heads
# (batch, num_heads, seq_len, d_k) → (batch, seq_len, num_heads, d_k)
attention_output = attention_output.transpose(1, 2).contiguous()
# (batch, seq_len, num_heads, d_k) → (batch, seq_len, d_model)
attention_output = attention_output.view(batch_size, -1, self.d_model)
# 5. Final linear projection
output = self.W_O(attention_output)
return output, weights
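A quick shape check for the module above, using toy sizes (the module is untrained here, so the attention pattern itself is random):
# Minimal usage sketch for MultiHeadAttention defined above
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 6, 512)       # (batch, seq_len, d_model)
out, attn = mha(x, x, x)         # self-attention: Q = K = V = x
print(out.shape)                 # torch.Size([2, 6, 512])
print(attn.shape)                # torch.Size([2, 8, 6, 6]) - one 6x6 map per head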
2. Positional Encoding: Where Am I?
Problem: Attention has no notion of order! "Cat chased dog" = "Dog chased cat" in pure attention.
Solution: Add positional information to embeddings.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Where:
• pos = position in sequence (0, 1, 2, ...)
• i = dimension index (0, 1, 2, ..., d_model/2)
• Even dimensions use sine, odd use cosine
• Different frequencies for each dimension
Why this formula?
• Allows the model to learn relative positions (PE_(pos+k) is a linear function of PE_pos)
• Extrapolates to unseen sequence lengths
• Each position gets a unique encoding
import numpy as np
import torch
def get_positional_encoding(max_seq_len, d_model):
"""
Generate sinusoidal positional encodings.
Args:
max_seq_len: Maximum sequence length
d_model: Model dimension (512 typical)
Returns:
Positional encoding matrix (max_seq_len, d_model)
"""
position = np.arange(max_seq_len)[:, np.newaxis] # (max_seq_len, 1)
div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
pe = np.zeros((max_seq_len, d_model))
pe[:, 0::2] = np.sin(position * div_term) # Even dimensions
pe[:, 1::2] = np.cos(position * div_term) # Odd dimensions
return torch.FloatTensor(pe)
# Visualize positional encoding patterns
pe = get_positional_encoding(100, 512)
print(f"Positional encoding shape: {pe.shape}") # (100, 512)
# Position 0 and position 1 have different patterns
print(f"Position 0: {pe[0, :8]}") # First 8 dims
print(f"Position 1: {pe[1, :8]}") # Different values!
3. Feed-Forward Network: Non-Linear Transformation
After attention, each position passes through a 2-layer feed-forward network (same network applied to each position independently).
Typical dimensions:
• Input: d_model = 512
• Hidden: d_ff = 2048 (4x expansion!)
• Output: d_model = 512
Purpose: Add non-linearity and capacity. The attention output is just a weighted average of value vectors; the FFN supplies the non-linear, per-position processing.
class FeedForward(nn.Module):
def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
super().__init__()
self.linear1 = nn.Linear(d_model, d_ff)
self.linear2 = nn.Linear(d_ff, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# x: (batch, seq_len, d_model)
x = self.linear1(x) # (batch, seq_len, d_ff)
x = torch.relu(x) # ReLU activation
x = self.dropout(x)
x = self.linear2(x) # (batch, seq_len, d_model)
return x
4. Layer Normalization: Stabilize Training
Normalize activations to have mean=0, variance=1. Critical for training deep Transformers (GPT-3 has 96 layers!).
class LayerNorm(nn.Module):
def __init__(self, d_model, eps=1e-6):
super().__init__()
self.gamma = nn.Parameter(torch.ones(d_model)) # Scale
self.beta = nn.Parameter(torch.zeros(d_model)) # Shift
self.eps = eps
def forward(self, x):
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
return self.gamma * (x - mean) / (std + self.eps) + self.beta
5. Residual Connections: Skip Connections
Add input to output of each sub-layer. Enables gradient flow in very deep networks.
Why it works:
β’ Direct path for gradients to flow backwards
β’ Each layer learns residual (what to add), not full transformation
β’ Allows training 100+ layer models
Putting It All Together: Transformer Block
class TransformerBlock(nn.Module):
"""
Complete Transformer encoder block with:
- Multi-head self-attention
- Feed-forward network
- Layer normalization
- Residual connections
"""
def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
super().__init__()
self.attention = MultiHeadAttention(d_model, num_heads)
self.feed_forward = FeedForward(d_model, d_ff, dropout)
self.norm1 = LayerNorm(d_model)
self.norm2 = LayerNorm(d_model)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
def forward(self, x, mask=None):
# 1. Multi-head self-attention with residual + norm
attention_output, _ = self.attention(x, x, x, mask)
x = self.norm1(x + self.dropout1(attention_output))
# 2. Feed-forward with residual + norm
ff_output = self.feed_forward(x)
x = self.norm2(x + self.dropout2(ff_output))
return x
# Stack multiple blocks for full Transformer
class Transformer(nn.Module):
def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6,
d_ff=2048, max_seq_len=512, dropout=0.1):
super().__init__()
# Token + positional embeddings
self.embedding = nn.Embedding(vocab_size, d_model)
self.pos_encoding = get_positional_encoding(max_seq_len, d_model)
self.dropout = nn.Dropout(dropout)
# Stack of Transformer blocks
self.blocks = nn.ModuleList([
TransformerBlock(d_model, num_heads, d_ff, dropout)
for _ in range(num_layers)
])
# Output projection
self.norm = LayerNorm(d_model)
self.output = nn.Linear(d_model, vocab_size)
def forward(self, x, mask=None):
# x: (batch, seq_len) - token IDs
seq_len = x.size(1)
# Embedding + positional encoding
x = self.embedding(x) # (batch, seq_len, d_model)
x = x + self.pos_encoding[:seq_len, :].to(x.device)
x = self.dropout(x)
# Pass through Transformer blocks
for block in self.blocks:
x = block(x, mask)
# Output projection
x = self.norm(x)
logits = self.output(x) # (batch, seq_len, vocab_size)
return logits
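A minimal forward-pass check for this model, with a toy vocabulary and random token IDs (values are illustrative):
# Minimal usage sketch for the Transformer defined above
vocab_size = 1000
model = Transformer(vocab_size, d_model=512, num_heads=8, num_layers=6)

tokens = torch.randint(0, vocab_size, (2, 10))   # (batch=2, seq_len=10) token IDs
logits = model(tokens)
print(logits.shape)                              # torch.Size([2, 10, 1000])

# For language modeling, train with cross-entropy between logits[:, :-1]
# and tokens[:, 1:] (predict the next token at every position).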
The Complete Architecture Visualized
Input token IDs
        ↓
Token Embedding (vocab_size → d_model)
        +
Positional Encoding (add position info)
        ↓
┌──────────────────────────────────┐
│ Transformer Block 1              │
│  • Multi-Head Self-Attention     │
│  • Add & Norm (residual)         │
│  • Feed-Forward Network          │
│  • Add & Norm (residual)         │
└──────────────────────────────────┘
        ↓
┌──────────────────────────────────┐
│ Transformer Block 2              │
│  • (same structure)              │
└──────────────────────────────────┘
        ↓
... (4 more blocks) ...
        ↓
┌──────────────────────────────────┐
│ Transformer Block 6              │
└──────────────────────────────────┘
        ↓
Layer Normalization
        ↓
Linear (d_model → vocab_size)
        ↓
Output logits (unnormalized scores for the next token)
Key Architecture Components:
- Multi-Head Attention: 8-16 parallel heads learn different patterns
- Positional Encoding: Sinusoidal functions inject position information
- Feed-Forward: 2-layer MLP with 4x expansion (512 → 2048 → 512)
- Layer Norm: Stabilizes training, mean=0 std=1
- Residual Connections: x + Sublayer(x) enables deep stacking
- Parallel Processing: All positions computed simultaneously
Transformer Variants: BERT, GPT, T5
The original Transformer had both encoder and decoder. Modern models specialize: encoder-only for understanding, decoder-only for generation, or encoder-decoder for translation tasks.
1. Encoder-Only: BERT (Bidirectional Encoder Representations)
BERT (Google, 2018) uses only the Transformer encoder. It processes text bidirectionally, seeing both past and future context simultaneously.
• Stack of Transformer encoder blocks only
• BERT-Base: 12 layers, 768 dims, 110M params
• BERT-Large: 24 layers, 1024 dims, 340M params
• Bidirectional attention (sees full context)
• Masked Language Model (MLM): Predict masked words (demonstrated below)
"The [MASK] sat on the mat" → "cat"
• Next Sentence Prediction: Does sentence B follow A?
• Trained on BookCorpus + Wikipedia
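To see the masked-language-model objective in action, here is a minimal sketch using the Hugging Face fill-mask pipeline with pre-trained BERT:
from transformers import pipeline

# Ask BERT to fill in the masked token
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("The [MASK] sat on the mat.")[:3]:
    print(f"{pred['token_str']:>10s}  {pred['score']:.3f}")
# Expect plausible fills such as "cat" or "dog" near the top.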
Best For:
- Text Classification: Sentiment analysis, spam detection, intent classification
- Named Entity Recognition (NER): Extract people, organizations, locations
- Question Answering: Find answers in documents (SQuAD dataset)
- Semantic Similarity: Compare sentence meanings
- Token Classification: POS tagging, chunking
- NOT for text generation: Encoder only, no autoregressive generation
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
import torch
# ============ LOAD PRE-TRAINED BERT ============
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=2 # Binary classification (positive/negative)
)
# ============ PREPARE DATA ============
texts = [
"This movie was fantastic! I loved every minute.",
"Terrible film, waste of time and money.",
"Pretty good, would recommend to friends."
]
labels = [1, 0, 1] # 1 = positive, 0 = negative
# Tokenize
encodings = tokenizer(
texts,
padding=True,
truncation=True,
max_length=128,
return_tensors='pt'
)
# ============ FINE-TUNE ON YOUR DATA ============
# (In practice, use Trainer API with train/eval datasets)
outputs = model(**encodings, labels=torch.tensor(labels))
loss = outputs.loss
logits = outputs.logits
print(f"Loss: {loss.item()}")
print(f"Predictions: {torch.argmax(logits, dim=1)}")
# ============ INFERENCE ============
new_text = "Amazing experience, highly recommend!"
inputs = tokenizer(new_text, return_tensors='pt')
with torch.no_grad():
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1)
sentiment = "Positive" if prediction == 1 else "Negative"
print(f"Sentiment: {sentiment}")
2. Decoder-Only: GPT (Generative Pre-trained Transformer)
GPT (OpenAI, 2018-2023) uses only the Transformer decoder. It processes text left to right, generating one token at a time while conditioning on the previous tokens.
• Stack of Transformer decoder blocks only
• GPT-2: 48 layers, 1600 dims, 1.5B params
• GPT-3: 96 layers, 12288 dims, 175B params
• Causal (unidirectional) attention → cannot see future tokens
• Causal Language Modeling: Predict the next token (see the sketch below)
"The cat sat" → "on"
"The cat sat on" → "the"
• Trained on WebText (GPT-2), internet-scale data (GPT-3)
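Before looking at full generation, it helps to inspect the raw next-token distribution GPT-2 produces for a prompt. This is a minimal sketch (the prompt and top-k size are illustrative):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok.encode("The cat sat on", return_tensors="pt")
with torch.no_grad():
    logits = lm(ids).logits                     # (1, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)    # distribution over the NEXT token
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(int(i))!r:>10}  {p.item():.3f}")
# High-probability continuations typically include " the" (as in "...sat on the mat").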
Best For:
- Text Generation: Creative writing, stories, articles
- Chatbots: Conversational AI (ChatGPT, Claude)
- Code Generation: GitHub Copilot, programming assistants
- Few-Shot Learning: Solve tasks from examples in the prompt
- Text Completion: Autocomplete, suggestions
- Zero-Shot Tasks: GPT-3+ can do tasks without fine-tuning
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# ============ LOAD PRE-TRAINED GPT ============
model_name = "gpt2" # or "gpt2-medium", "gpt2-large", "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Set pad token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token
# ============ TEXT GENERATION ============
prompt = "Once upon a time in a distant galaxy,"
# Encode prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt')
# Generate with different strategies
outputs = model.generate(
input_ids,
max_length=100, # Maximum length
num_return_sequences=3, # Generate 3 variations
temperature=0.8, # Randomness (lower = more focused, higher = more creative)
top_k=50, # Sample from top 50 tokens
top_p=0.95, # Nucleus sampling (top 95% probability mass)
do_sample=True, # Enable sampling
pad_token_id=tokenizer.eos_token_id
)
# Decode and print
for i, output in enumerate(outputs):
text = tokenizer.decode(output, skip_special_tokens=True)
print(f"\n=== Generation {i+1} ===")
print(text)
# ============ FEW-SHOT PROMPTING ============
few_shot_prompt = """Translate English to French:
English: Hello, how are you?
French: Bonjour, comment allez-vous?
English: I love learning AI.
French: J'aime apprendre l'IA.
English: The weather is beautiful today.
French:"""
input_ids = tokenizer.encode(few_shot_prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=150, do_sample=True, temperature=0.3, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Causal Masking in GPT:
GPT uses a causal attention mask to prevent each position from looking at future tokens:
# Attention mask for "The cat sat"
        The   cat   sat
The      ✓     ✗     ✗     (can only see "The")
cat      ✓     ✓     ✗     (can see "The cat")
sat      ✓     ✓     ✓     (can see all previous tokens)
# This enables autoregressive generation!
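Here is a minimal sketch of how such a mask can be built with torch.tril and passed to the scaled_dot_product_attention function defined earlier (toy sizes, random inputs):
import torch

seq_len = 3                                        # "The cat sat"
mask = torch.tril(torch.ones(seq_len, seq_len))    # lower triangle = visible positions
print(mask)
# tensor([[1., 0., 0.],
#         [1., 1., 0.],
#         [1., 1., 1.]])

Q = K = V = torch.randn(1, seq_len, 8)             # toy self-attention inputs
out, weights = scaled_dot_product_attention(Q, K, V, mask=mask)
print(weights[0])                                  # upper triangle is exactly 0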
3. Encoder-Decoder: T5 (Text-to-Text Transfer Transformer)
T5 (Google, 2019) uses the full Transformer architecture (encoder + decoder). Frames every task as text-to-text!
Translation:
Input: "translate English to German: Hello world"
Output: "Hallo Welt"
Summarization:
Input: "summarize: [long article text]"
Output: "[short summary]"
Question Answering:
Input: "question: What is the capital? context: France is a country..."
Output: "Paris"
Classification:
Input: "sentiment: This movie was terrible"
Output: "negative"
Best For:
- Machine Translation: Encoder processes the source, decoder generates the target
- Summarization: Compress long documents into short summaries
- Question Answering: Generate answers (not just extract spans)
- Data-to-Text: Generate descriptions from structured data
- Paraphrasing: Rewrite text while preserving meaning
from transformers import T5Tokenizer, T5ForConditionalGeneration
# ============ LOAD T5 ============
model_name = "t5-small" # or "t5-base", "t5-large", "t5-3b", "t5-11b"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# ============ SUMMARIZATION ============
article = """The Transformer architecture has revolutionized natural language processing.
Introduced in 2017, it relies entirely on attention mechanisms, eschewing recurrence.
This allows for much more parallelization and has led to models like BERT and GPT-3."""
input_text = "summarize: " + article
input_ids = tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True)
# Generate summary
summary_ids = model.generate(
input_ids,
max_length=50,
num_beams=4, # Beam search for better quality
early_stopping=True
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary: {summary}")
# ============ TRANSLATION ============
input_text = "translate English to French: How are you today?"
input_ids = tokenizer.encode(input_text, return_tensors='pt')
translation_ids = model.generate(input_ids, max_length=50)
translation = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
print(f"Translation: {translation}")
# ============ QUESTION ANSWERING ============
context = "Paris is the capital and largest city of France. It has a population of 2.2 million."
question = "What is the capital of France?"
input_text = f"question: {question} context: {context}"
input_ids = tokenizer.encode(input_text, return_tensors='pt', max_length=512)
answer_ids = model.generate(input_ids, max_length=20)
answer = tokenizer.decode(answer_ids[0], skip_special_tokens=True)
print(f"Answer: {answer}")
Comparison: BERT vs GPT vs T5
| Aspect | BERT (Encoder-Only) | GPT (Decoder-Only) | T5 (Encoder-Decoder) |
|---|---|---|---|
| Attention Type | Bidirectional (sees all) | Causal (left-to-right) | Bi (encoder) + Causal (decoder) |
| Pre-training | Masked LM + NSP | Next token prediction | Span corruption (fill in blanks) |
| Best For | Classification, NER, Q&A | Generation, chat, few-shot | Translation, summarization |
| Generation | No (encoder only) | Yes (autoregressive) | Yes (seq2seq) |
| Understanding | Excellent (full context) | Good (but causal only) | Excellent (encoder) |
| Parameters | 110M-340M | 117M-175B (GPT-3) | 60M-11B |
| Example Use | Sentiment analysis API | ChatGPT, Copilot | Google Translate |
Quick Decision Guide:
- Need to classify/extract? → Use BERT (or RoBERTa, ALBERT)
- Need to generate text? → Use GPT (or GPT-2, GPT-3, LLaMA)
- Need sequence-to-sequence? → Use T5 (or BART, mT5)
- Need both understanding + generation? → Use modern instruct-tuned LLMs (GPT-4, Claude)
Summary & Key Takeaways
Congratulations! You've mastered the Transformer architecture, the foundation of modern AI. Let's consolidate everything you've learned.
Core Concepts Mastered
Attention Mechanism
Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V
Purpose: Focus on relevant positions dynamically
Benefit: No sequential bottleneck, direct connections
Multi-Head Attention
Concept: 8-16 parallel attention heads
Purpose: Learn different relationship patterns
Example: syntax, semantics, coreference, etc.
Positional Encoding
Method: Sinusoidal functions (sin/cos)
Purpose: Inject position information
Why needed: Attention has no notion of order
Parallel Processing
Key innovation: Process all tokens simultaneously
vs RNN: 100x faster training
Enables: Internet-scale pre-training
Architecture Components
| Component | Purpose | Key Detail |
|---|---|---|
| Multi-Head Attention | Focus on relevant positions | 8 heads, d_k = 64 each (typical) |
| Feed-Forward Network | Add non-linearity | 512 → 2048 → 512 (4x expansion) |
| Layer Normalization | Stabilize training | Mean=0, Std=1 per layer |
| Residual Connections | Enable deep networks | x + Sublayer(x) pattern |
| Positional Encoding | Inject position info | Sinusoidal functions |
Transformer Variants Quick Reference
Encoder-Only
Attention: Bidirectional
Task: Understanding
Use cases:
• Sentiment analysis
• NER
• Classification
• Q&A (extraction)
Examples:
BERT, RoBERTa, ALBERT, DistilBERT
Decoder-Only
Attention: Causal (left-to-right)
Task: Generation
Use cases:
• Text generation
• Chat
• Code generation
• Few-shot learning
Examples:
GPT-2, GPT-3, GPT-4, LLaMA, Claude
Encoder-Decoder
Attention: Both types
Task: Seq2seq
Use cases:
• Translation
• Summarization
• Q&A (generation)
• Paraphrasing
Examples:
T5, BART, mBART, mT5
Why Transformers Dominate: The Full Picture
RNNs/LSTMs:
• Sequential processing (slow)
• Limited context (a few hundred tokens)
• Vanishing gradients
• Memory bottleneck
• Can't parallelize training
• Difficult to scale
Transformers:
• Parallel processing (100x faster)
• Very long context (100,000+ tokens)
• Clean gradient flow
• Direct connections via attention
• Saturate thousands of GPUs
• Scale to 175B+ parameters
Mental Models & Decision Trees
Question 1: What's your task?
• Classify/extract information? → BERT-style encoder
Examples: Sentiment, NER, intent classification
• Generate new text? → GPT-style decoder
Examples: Writing, chat, code completion
• Transform one sequence to another? → T5-style encoder-decoder
Examples: Translation, summarization
Question 2: Do you need bidirectional context?
• Yes (full sentence available): BERT/T5
• No (generate left-to-right): GPT
Question 3: How much data/compute?
• Limited: Use smaller models (BERT-base, DistilBERT)
• Abundant: Scale up (BERT-large, GPT-3)
Question 4: Fine-tune or few-shot?
• Have labeled data: Fine-tune BERT/GPT/T5
• Few examples only: Use a large GPT with prompting
Common Pitfalls & Best Practices
Training crashes with CUDA OOM
Solutions:
• Reduce batch size
• Use gradient accumulation (sketched below)
• Enable mixed precision (fp16)
• Try gradient checkpointing
• Use a smaller model variant
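As an illustration of the first three fixes, here is a minimal sketch that combines gradient accumulation with mixed precision. It assumes model, optimizer, and train_loader already exist, and that the model returns a Hugging Face style output with a .loss attribute:
import torch

scaler = torch.cuda.amp.GradScaler()
accum_steps = 4                                  # effective batch = 4 x loader batch size

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    with torch.cuda.amp.autocast():              # run the forward pass in fp16 where safe
        loss = model(**batch).loss / accum_steps # assumes a HF-style model returning .loss
    scaler.scale(loss).backward()                # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                   # unscales gradients, then optimizer step
        scaler.update()
        optimizer.zero_grad()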
Takes too long to generate
Solutions:
• Use model distillation (DistilBERT)
• Quantization (int8, int4), sketched below
• Reduce the max sequence length
• Batch predictions
• Consider a smaller model
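For example, a minimal sketch of post-training dynamic int8 quantization for CPU inference, using a generic BERT classifier (model name and label count are illustrative):
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Replace nn.Linear weights with int8 versions; activations stay in fp32
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# `quantized` is called exactly like `model`; expect a smaller footprint and
# faster CPU inference, usually at a small accuracy cost.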
Model overfits or doesn't learn
Solutions:
• Use a lower learning rate (1e-5 to 5e-5)
• Add warmup steps
• More training data
• Try a different pre-trained model
• Freeze early layers (sketched below, together with warmup)
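A minimal sketch of a conservative fine-tuning setup along these lines: low learning rate, linear warmup, and frozen embeddings plus the first few encoder layers (all values are illustrative):
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the first 4 of BERT-Base's 12 encoder layers
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:4]:
    for param in layer.parameters():
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
num_training_steps = 1000        # assumed: num_epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)
# During training: optimizer.step(); scheduler.step(); optimizer.zero_grad()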
Input exceeds max length (512/1024)
Solutions:
• Truncate intelligently (keep the important parts)
• Use hierarchical processing
• Try Longformer/BigBird (longer context)
• Split into overlapping chunks and aggregate (sketched below)
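A minimal sketch of the chunk-and-aggregate approach using the tokenizer's built-in striding (the document text, chunk size, and overlap are illustrative):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "your very long document text here ..."   # assumed to exceed 512 tokens

enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=128,                      # 128-token overlap between consecutive chunks
    return_overflowing_tokens=True,
    padding="max_length",
    return_tensors="pt",
)
print(enc["input_ids"].shape)        # (num_chunks, 512)
# Run the model on each chunk, then aggregate (e.g., average or max over logits).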
Practice Projects
Project 1: Sentiment Classifier
Goal: Fine-tune BERT for sentiment analysis
Dataset: IMDB reviews (Hugging Face)
Skills: Tokenization, fine-tuning, evaluation
Bonus: Compare BERT vs RoBERTa vs DistilBERT
Project 2: Text Generator
Goal: Generate creative stories with GPT-2
Dataset: Pre-trained GPT-2 (no training needed!)
Skills: Prompting, sampling strategies, temperature
Bonus: Implement top-k and nucleus sampling
Project 3: Translator
Goal: Fine-tune T5 for translation
Dataset: WMT translation datasets
Skills: Seq2seq, BLEU evaluation, beam search
Bonus: Try multiple language pairs
Project 4: Q&A System
Goal: Build extractive Q&A with BERT
Dataset: SQuAD 2.0 (Stanford)
Skills: Span extraction, F1 score, confidence
Bonus: Add "no answer" detection
What You've Mastered
Congratulations! You now understand:
- Attention mechanism: Query-Key-Value, scaled dot-product, why it works
- Self-attention: Every position attends to every position
- Multi-head attention: Multiple perspectives in parallel
- Positional encoding: Sinusoidal injection of position info
- Transformer architecture: Encoder blocks, decoder blocks, residual connections
- Model variants: BERT (encoder), GPT (decoder), T5 (both)
- Implementation: How to use Transformers in practice
- Design decisions: When to use which architecture
The Transformer Impact
2017: Attention is All You Need paper
2018: BERT revolutionizes NLP understanding
2019: GPT-2 shows emergence with scale
2020: GPT-3 achieves few-shot learning
2021: Vision Transformers beat CNNs
2022: ChatGPT reaches 100M users in 2 months
2023-2025: Multimodal AI, agents, reasoning systems
Key Insight: Transformers didn't just improve performance; they fundamentally changed what's possible with AI by enabling:
• Internet-scale pre-training (175B+ parameters)
• Transfer learning across domains
• Few-shot and zero-shot capabilities
• Emergent abilities at scale
• Foundation models for AGI research
What's Next?
You've conquered the architecture that powers modern AI! In the next tutorial, Transfer Learning & Fine-tuning, you'll learn how to:
- Leverage pre-trained models for your tasks
- Fine-tune BERT, GPT, and T5 on custom data
- Use adapters and LoRA for efficient fine-tuning
- Evaluate and deploy your models
- Build production ML systems with transfer learning
Outstanding Achievement!
You've mastered Transformers, the most important architecture in modern AI! You now understand the technology behind GPT-4, Claude, BERT, and every major AI breakthrough of the past 8 years.
Next: Learn how to harness pre-trained Transformers for your own applications!
Knowledge Check
Test your understanding of attention mechanisms and Transformers!