
Attention & Transformers

Master the revolutionary architecture behind GPT, BERT, and modern AI. Learn self-attention and how Transformers process sequences in parallel

📅 Tutorial 5 📊 Advanced


🚀 The Transformer Revolution

In June 2017, a team of Google researchers published a paper titled "Attention is All You Need" that would fundamentally transform artificial intelligence. The Transformer architecture they introduced replaced RNNs and LSTMs entirely, becoming the foundation for virtually every state-of-the-art AI system today: GPT-4, Claude, Gemini, BERT, ChatGPT, and countless others.

💡 The Big Breakthrough: Transformers process all positions simultaneously using attention, not sequentially like RNNs. This enables:

  • ⚡ Massive parallelization: Train on thousands of GPUs at once
  • 📏 Long-range dependencies: Handle 100,000+ token sequences
  • 🎯 Better performance: State-of-the-art on virtually all NLP tasks
  • 🚀 Scalability: Grow to 100B+ parameters efficiently

Why RNNs Failed to Scale

Before Transformers, RNNs and LSTMs dominated sequential modeling. But they had fundamental limitations:

❌ RNN/LSTM Problems

• Sequential processing: Must wait for step t-1 before computing step t → no parallelization
• Limited context: Struggle with sequences > 100 tokens
• Slow training: Can't use full GPU power
• Vanishing gradients: Hard to learn very long dependencies
• Memory bottleneck: Compress entire history into fixed-size hidden state

✅ Transformer Solutions

• Parallel processing: Compute all positions at once → 100x faster training
• Long context: Handle 100,000+ tokens with efficient attention variants
• Fast training: Saturate thousands of GPUs
• Clean gradients: Direct connections via attention
• No bottleneck: Every position can access any other position directly

The Impact: From Research to Reality

Timeline of Transformer Dominance:

2017: Original Transformer paper (machine translation)
2018: BERT revolutionizes NLP understanding tasks
2019: GPT-2 shows large-scale language modeling potential
2020: GPT-3 demonstrates few-shot learning with 175B parameters
2021: Vision Transformers (ViT) beat CNNs on image tasks
2022: ChatGPT brings LLMs to mainstream (100M users in 2 months)
2023-2024: GPT-4, Claude 3, Gemini Ultra → multimodal AI
2025: Transformers in audio, video, robotics, drug discovery, and beyond

Key insight: Transformers aren't just better; they're fundamentally different. They enabled the AI revolution by making it possible to train on internet-scale data efficiently.

What Makes Transformers Special?

⚡

Parallel Processing

Process entire sequence at once, not word-by-word. Train on 1000s of GPUs simultaneously.

🔍

Self-Attention

Every word directly attends to every other word. No information bottleneck.

📏

Long Context

Handle 100,000+ tokens with techniques like sparse attention and efficient implementations.

🎯

Transfer Learning

Pre-train once on massive data, fine-tune for any task. Democratizes AI development.

🎓 Learning Goal: By the end of this tutorial, you'll understand:

  • ✅ How attention mechanisms work (Query, Key, Value)
  • ✅ Self-attention and multi-head attention
  • ✅ Complete Transformer architecture (encoder, decoder, positional encoding)
  • ✅ Why Transformers outperform RNNs
  • ✅ Different Transformer variants (BERT, GPT, T5)
  • ✅ How to implement and use Transformers

👀 The Attention Mechanism: The Core Innovation

💡 Core Idea: Attention allows the model to dynamically focus on relevant parts of the input. Instead of processing word-by-word sequentially, it calculates relationships between ALL positions simultaneously.

Think of it like reading a paragraph and being able to instantly reference any word while understanding the current word: no need to remember everything in a fixed-size memory!

Intuitive Example: Translation with Attention

Translating: "The cat sat on the mat" → "Le chat était assis sur le tapis"

Without Attention (RNN):
• Encoder processes word-by-word: "The" → "cat" → "sat" → "on" → "the" → "mat"
• Final hidden state must remember entire sentence
• Decoder generates translation from this single fixed vector
• Problem: Information bottleneck! Long sentences lose details.

With Attention (Transformer):
When generating "assis" (sat):
• Attends strongly to "sat" (0.7 weight)
• Attends moderately to "cat" (0.2 weight), for subject agreement
• Attends weakly to other words (0.1 total)
• Result: Direct access to relevant source words, no bottleneck!

The Mathematics: Scaled Dot-Product Attention

Attention uses three learned linear projections called Query, Key, and Value. This is the famous QKV attention.

Attention Formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
• Q (Query): "What am I looking for?" (shape: seq_len × d_k)
• K (Key): "What do I represent?" (shape: seq_len × d_k)
• V (Value): "What information do I carry?" (shape: seq_len × d_v)
• d_k: Dimension of queries/keys (scaling factor)

Step-by-step:
1. Compute similarity: QK^T → scores matrix (seq_len × seq_len)
2. Scale by √d_k so the softmax doesn't saturate (which would cause vanishing gradients)
3. Normalize with softmax: each row sums to 1 (attention weights)
4. Weight values: multiply weights × V to get output

Concrete Example with Numbers

Sentence: "The cat sat"
Goal: Understand what "sat" should attend to

Step 1: Create Q, K, V matrices
Assume d_k = 3 (tiny for illustration; real models use 64-128)

Word embeddings (simplified 3D vectors):
"The" = [1.0, 0.2, 0.1]
"cat" = [0.5, 1.2, 0.8]
"sat" = [0.3, 0.4, 1.5]

Apply learned weight matrices W_Q, W_K, W_V (learned during training):

Q = embeddings × W_Q: ["The": [0.8, 0.3, 0.2], "cat": [1.1, 0.9, 0.4], "sat": [0.7, 1.2, 0.6]]
K = embeddings × W_K: ["The": [0.9, 0.1, 0.3], "cat": [0.6, 1.3, 0.5], "sat": [0.4, 0.5, 1.4]]
V = embeddings × W_V: ["The": [1.1, 0.3, 0.2], "cat": [0.7, 1.4, 0.6], "sat": [0.5, 0.6, 1.8]]

Step 2: Compute attention scores for "sat"
Q_sat · K_The = [0.7, 1.2, 0.6] · [0.9, 0.1, 0.3] = 0.63 + 0.12 + 0.18 = 0.93
Q_sat · K_cat = [0.7, 1.2, 0.6] · [0.6, 1.3, 0.5] = 0.42 + 1.56 + 0.30 = 2.28
Q_sat · K_sat = [0.7, 1.2, 0.6] · [0.4, 0.5, 1.4] = 0.28 + 0.60 + 0.84 = 1.72

Step 3: Scale by √d_k = √3 ≈ 1.73
Scaled scores: [0.93/1.73, 2.28/1.73, 1.72/1.73] = [0.54, 1.32, 0.99]

Step 4: Apply softmax
Weights: softmax([0.54, 1.32, 0.99]) ≈ [0.21, 0.46, 0.33]

Interpretation: When processing "sat":
• Attend to "The": 21% (low - not very relevant)
• Attend to "cat": 46% (high - subject of the verb!)
• Attend to "sat": 33% (moderate - self-attention)

Step 5: Weighted sum of values
Output = 0.21×V_The + 0.46×V_cat + 0.33×V_sat
Output = 0.21×[1.1,0.3,0.2] + 0.46×[0.7,1.4,0.6] + 0.33×[0.5,0.6,1.8]
Output = [0.231,0.063,0.042] + [0.322,0.644,0.276] + [0.165,0.198,0.594]
Output = [0.718, 0.905, 0.912]

Result: The output for "sat" is now enriched with information from "cat" (most heavily weighted) and context from the other words!
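
To double-check these numbers yourself, here is a minimal NumPy sketch of the same computation (the vectors are the toy values from the example above, not learned parameters):

import numpy as np

# Toy values from the worked example: the query for "sat" and the keys/values
# for "The", "cat", "sat"
q_sat = np.array([0.7, 1.2, 0.6])
K = np.array([[0.9, 0.1, 0.3],    # K_The
              [0.6, 1.3, 0.5],    # K_cat
              [0.4, 0.5, 1.4]])   # K_sat
V = np.array([[1.1, 0.3, 0.2],    # V_The
              [0.7, 1.4, 0.6],    # V_cat
              [0.5, 0.6, 1.8]])   # V_sat

d_k = 3
scores = K @ q_sat                               # [0.93, 2.28, 1.72]
scaled = scores / np.sqrt(d_k)                   # ≈ [0.54, 1.32, 0.99]
weights = np.exp(scaled) / np.exp(scaled).sum()  # softmax ≈ [0.21, 0.46, 0.33]
output = weights @ V                             # ≈ [0.72, 0.90, 0.91]

print(weights.round(2), output.round(2))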

Why Scaling by √d_k?

⚠️ Critical Detail: Without scaling, dot products grow large when d_k is large (their variance grows with d_k).

Problem: Large dot products → large logits → softmax saturates → vanishing gradients

Example: If d_k = 512, unscaled dot products can reach 100+. After softmax, one weight ≈ 1.0, others ≈ 0 → gradient flow breaks down.

Solution: Divide by √d_k = √512 ≈ 22.6 → dot products stay in a reasonable range → healthy gradients!
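
A quick way to see the effect: with random vectors, the unscaled logits already push softmax into a near one-hot distribution, while the scaled ones stay spread out. A minimal illustrative sketch (the dimensions are arbitrary):

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)
keys = torch.randn(10, d_k)

logits = keys @ q                # unscaled dot products, std ≈ √d_k ≈ 22.6
scaled = logits / d_k ** 0.5     # scaled dot products, std ≈ 1

print(torch.softmax(logits, dim=0))  # nearly one-hot → gradients vanish
print(torch.softmax(scaled, dim=0))  # spread out → healthy gradients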

Self-Attention: Every Position Attends to Every Position

In Transformers, Q, K, and V all come from the same input sequence. This is called self-attention (vs cross-attention in encoder-decoder models).

Attention Matrix Visualization:

For "The cat sat on the mat":

     The  cat  sat  on   the  mat
The [0.3  0.1  0.1  0.1  0.2  0.2] ← "The" attends weakly to all
cat [0.2  0.4  0.3  0.0  0.0  0.1] ← "cat" attends to itself + verb
sat [0.1  0.5  0.2  0.1  0.0  0.1] ← "sat" strongly attends to "cat"
on  [0.1  0.1  0.2  0.3  0.1  0.2] ← "on" attends to verb + object
the [0.2  0.1  0.1  0.2  0.3  0.1] ← "the" attends to nearby
mat [0.1  0.1  0.2  0.3  0.2  0.1] ← "mat" attends to prep + verb


Each row shows attention weights (sums to 1.0). Darker values = stronger attention.
Notice: No sequential constraint! "sat" directly attends to "cat" without processing "The" first.

Implementation: Attention from Scratch

import numpy as np
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix (batch, seq_len, d_k)
        K: Key matrix (batch, seq_len, d_k)
        V: Value matrix (batch, seq_len, d_v)
        mask: Optional mask (batch, seq_len, seq_len)
    
    Returns:
        Output: Attention output (batch, seq_len, d_v)
        Weights: Attention weights (batch, seq_len, seq_len)
    """
    # Get dimension for scaling
    d_k = Q.size(-1)
    
    # Step 1: Compute attention scores (QK^T)
    # Shape: (batch, seq_len, d_k) @ (batch, d_k, seq_len) → (batch, seq_len, seq_len)
    scores = torch.matmul(Q, K.transpose(-2, -1))
    
    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply mask if provided (e.g., for causal attention in GPT)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    
    # Step 4: Apply softmax to get attention weights
    # Each row sums to 1.0
    weights = F.softmax(scores, dim=-1)
    
    # Step 5: Weighted sum of values
    # Shape: (batch, seq_len, seq_len) @ (batch, seq_len, d_v) → (batch, seq_len, d_v)
    output = torch.matmul(weights, V)
    
    return output, weights

# ============ EXAMPLE USAGE ============
batch_size = 2
seq_len = 6  # "The cat sat on the mat"
d_model = 512  # Model dimension (typical for Transformer)

# Simulate input embeddings
embeddings = torch.randn(batch_size, seq_len, d_model)

# Create Q, K, V with learned projection matrices
W_Q = torch.randn(d_model, d_model)
W_K = torch.randn(d_model, d_model)
W_V = torch.randn(d_model, d_model)

Q = torch.matmul(embeddings, W_Q)
K = torch.matmul(embeddings, W_K)
V = torch.matmul(embeddings, W_V)

# Compute attention
output, weights = scaled_dot_product_attention(Q, K, V)

print(f"Output shape: {output.shape}")        # (2, 6, 512)
print(f"Attention weights shape: {weights.shape}")  # (2, 6, 6)

# Inspect attention pattern for first batch, position 2 ("sat")
print("\nAttention weights for 'sat' (position 2):")
print(weights[0, 2, :])  # One weight per position (random projections here, so no meaningful pattern yet)
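
To actually see an attention pattern, it helps to plot the weights as a heatmap. A minimal matplotlib sketch, assuming the `weights` tensor from the code above (with random projections the pattern itself is not meaningful yet):

import matplotlib.pyplot as plt

tokens = ["The", "cat", "sat", "on", "the", "mat"]
attn = weights[0].numpy()  # (seq_len, seq_len) attention matrix for the first batch item

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Attended-to position (keys)")
ax.set_ylabel("Query position")
fig.colorbar(im)
plt.show()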

✅ Key Takeaways:

  • Query-Key-Value: Q asks "what?", K provides "here!", V gives "the info"
  • Dot product: Measures similarity between queries and keys
  • Softmax: Converts scores to probability distribution (weights sum to 1)
  • Parallel: All positions computed simultaneously → massive speedup
  • No recurrence: Direct connections enable clean gradient flow
  • Interpretable: Attention weights show what model focuses on

πŸ—οΈ Complete Transformer Architecture

The full Transformer combines multiple components into a powerful architecture. Let's build it piece by piece, understanding each component's role.

1. Multi-Head Attention: Multiple Perspectives

Instead of one attention mechanism, Transformers use multiple attention heads in parallel. Each head learns different relationships!

💡 Intuition: When reading "The bank can guarantee deposits", different heads focus on different meanings:

  • Head 1: "bank" → "deposits" (financial institution context)
  • Head 2: "can" → "guarantee" (modal verb + action)
  • Head 3: "The" → "bank" (determiner + noun)
  • Head 4: Overall sentence structure and syntax

Each head specializes in different linguistic patterns: syntax, semantics, coreference, etc.

Multi-Head Attention Formula:

For h attention heads:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

Where:
• Each head has its own W_i^Q, W_i^K, W_i^V projection matrices
• Typical: 8 or 16 heads (GPT-3 uses 96 heads!)
• Each head has dimension d_k = d_model / h
• Final W^O matrix combines all heads

Example: 8 Heads with d_model = 512

Configuration:
• Model dimension: d_model = 512
• Number of heads: h = 8
• Dimension per head: d_k = 512 / 8 = 64

Process:
1. Split 512-dim embedding into 8 × 64-dim chunks
2. Each chunk processed by separate attention head
3. Head 1 learns subject-verb relationships (64 dims)
4. Head 2 learns semantic similarity (64 dims)
5. ... (6 more heads learning different patterns)
6. Concatenate all 8 outputs: 8 × 64 = 512 dims
7. Project with W^O to final 512-dim output

Benefit: 8 different kinds of patterns learned for roughly the same compute!
(8 heads × 64 dims costs about the same as 1 head × 512 dims)
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # 512 / 8 = 64
        
        # Linear projections for Q, K, V (all heads at once)
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_O = nn.Linear(d_model, d_model)
        
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        # 1. Linear projections
        # Shape: (batch, seq_len, d_model)
        Q = self.W_Q(Q)
        K = self.W_K(K)
        V = self.W_V(V)
        
        # 2. Split into multiple heads
        # Shape: (batch, seq_len, d_model) → (batch, seq_len, num_heads, d_k)
        Q = Q.view(batch_size, -1, self.num_heads, self.d_k)
        K = K.view(batch_size, -1, self.num_heads, self.d_k)
        V = V.view(batch_size, -1, self.num_heads, self.d_k)
        
        # Transpose for attention: (batch, num_heads, seq_len, d_k)
        Q = Q.transpose(1, 2)
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)
        
        # 3. Apply scaled dot-product attention for each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)  # scale by sqrt(d_k); avoids needing numpy here
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        weights = torch.softmax(scores, dim=-1)
        attention_output = torch.matmul(weights, V)
        
        # 4. Concatenate heads
        # (batch, num_heads, seq_len, d_k) → (batch, seq_len, num_heads, d_k)
        attention_output = attention_output.transpose(1, 2).contiguous()
        
        # (batch, seq_len, num_heads, d_k) → (batch, seq_len, d_model)
        attention_output = attention_output.view(batch_size, -1, self.d_model)
        
        # 5. Final linear projection
        output = self.W_O(attention_output)
        
        return output, weights
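
A quick shape check for the module above (random input, self-attention so Q = K = V):

x = torch.randn(2, 6, 512)                    # (batch, seq_len, d_model)
mha = MultiHeadAttention(d_model=512, num_heads=8)
out, attn = mha(x, x, x)

print(out.shape)   # torch.Size([2, 6, 512]) - same shape as the input
print(attn.shape)  # torch.Size([2, 8, 6, 6]) - one 6×6 attention map per head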

2. Positional Encoding: Where Am I?

Problem: Attention has no notion of order! "Cat chased dog" = "Dog chased cat" in pure attention.
Solution: Add positional information to embeddings.

Sinusoidal Positional Encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:
• pos = position in sequence (0, 1, 2, ...)
• i = dimension-pair index (0, 1, 2, ..., d_model/2 - 1)
• Even dimensions use sine, odd use cosine
• Different frequencies for each dimension

Why this formula?
• Allows model to learn relative positions (PE(pos+k) is a linear function of PE(pos))
• Extrapolates to unseen sequence lengths
• Each position gets unique encoding
import numpy as np
import torch

def get_positional_encoding(max_seq_len, d_model):
    """
    Generate sinusoidal positional encodings.
    
    Args:
        max_seq_len: Maximum sequence length
        d_model: Model dimension (512 typical)
    
    Returns:
        Positional encoding matrix (max_seq_len, d_model)
    """
    position = np.arange(max_seq_len)[:, np.newaxis]  # (max_seq_len, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    pe = np.zeros((max_seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions
    
    return torch.FloatTensor(pe)

# Visualize positional encoding patterns
pe = get_positional_encoding(100, 512)
print(f"Positional encoding shape: {pe.shape}")  # (100, 512)

# Position 0 and position 1 have different patterns
print(f"Position 0: {pe[0, :8]}")  # First 8 dims
print(f"Position 1: {pe[1, :8]}")  # Different values!

3. Feed-Forward Network: Non-Linear Transformation

After attention, each position passes through a 2-layer feed-forward network (same network applied to each position independently).

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

Typical dimensions:
• Input: d_model = 512
• Hidden: d_ff = 2048 (4x expansion!)
• Output: d_model = 512

Purpose: Add non-linearity and capacity. The attention output is just a weighted average of the value vectors, so the position-wise FFN is where the model applies its non-linear transformations.
class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        x = self.linear1(x)              # (batch, seq_len, d_ff)
        x = torch.relu(x)                # ReLU activation
        x = self.dropout(x)
        x = self.linear2(x)              # (batch, seq_len, d_model)
        return x

4. Layer Normalization: Stabilize Training

Normalize activations to have mean=0, variance=1. Critical for training deep Transformers (GPT-3 has 96 layers!).

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))  # Scale
        self.beta = nn.Parameter(torch.zeros(d_model))  # Shift
        self.eps = eps
        
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
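
As a sanity check, this hand-rolled version should approximately match PyTorch's built-in nn.LayerNorm (a small sketch; exact values differ slightly because torch.std() is unbiased by default and eps is added to the std rather than the variance):

x = torch.randn(2, 6, 512)

ours = LayerNorm(512)(x)
builtin = nn.LayerNorm(512)(x)

print(torch.allclose(ours, builtin, atol=1e-2))        # close, but not bit-identical
print(ours.mean(-1).abs().max(), ours.std(-1).mean())  # per-position mean ≈ 0, std ≈ 1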

5. Residual Connections: Skip Connections

Add input to output of each sub-layer. Enables gradient flow in very deep networks.

output = LayerNorm(x + Sublayer(x))

Why it works:
• Direct path for gradients to flow backwards
• Each layer learns residual (what to add), not full transformation
• Allows training 100+ layer models

Putting It All Together: Transformer Block

class TransformerBlock(nn.Module):
    """
    Complete Transformer encoder block with:
    - Multi-head self-attention
    - Feed-forward network
    - Layer normalization
    - Residual connections
    """
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # 1. Multi-head self-attention with residual + norm
        attention_output, _ = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attention_output))
        
        # 2. Feed-forward with residual + norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_output))
        
        return x

# Stack multiple blocks for full Transformer
class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6, 
                 d_ff=2048, max_seq_len=512, dropout=0.1):
        super().__init__()
        
        # Token + positional embeddings
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = get_positional_encoding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        
        # Stack of Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        # Output projection
        self.norm = LayerNorm(d_model)
        self.output = nn.Linear(d_model, vocab_size)
        
    def forward(self, x, mask=None):
        # x: (batch, seq_len) - token IDs
        seq_len = x.size(1)
        
        # Embedding + positional encoding
        x = self.embedding(x)  # (batch, seq_len, d_model)
        x = x + self.pos_encoding[:seq_len, :].to(x.device)
        x = self.dropout(x)
        
        # Pass through Transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        
        # Output projection
        x = self.norm(x)
        logits = self.output(x)  # (batch, seq_len, vocab_size)
        
        return logits
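
A minimal forward pass through the model defined above (toy vocabulary and random token IDs, just to confirm the shapes):

# Toy forward pass: small configuration so it runs quickly on CPU
vocab_size = 1000
model = Transformer(vocab_size=vocab_size, d_model=128, num_heads=4,
                    num_layers=2, d_ff=512, max_seq_len=64)

tokens = torch.randint(0, vocab_size, (2, 10))  # (batch=2, seq_len=10) random token IDs
logits = model(tokens)

print(logits.shape)  # torch.Size([2, 10, 1000]) - a score over the vocabulary at every position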

The Complete Architecture Visualized

Input Tokens: ["The", "cat", "sat"]
  ↓
Token Embedding (vocab_size → d_model)
  +
Positional Encoding (add position info)
  ↓
┌─────────────────────────────────┐
│ Transformer Block 1             │
│ • Multi-Head Self-Attention     │
│ • Add & Norm (residual)         │
│ • Feed-Forward Network          │
│ • Add & Norm (residual)         │
└─────────────────────────────────┘
  ↓
┌─────────────────────────────────┐
│ Transformer Block 2             │
│ • (same structure)              │
└─────────────────────────────────┘
  ↓
... (4 more blocks) ...
  ↓
┌─────────────────────────────────┐
│ Transformer Block 6             │
└─────────────────────────────────┘
  ↓
Layer Normalization
  ↓
Linear (d_model → vocab_size)
  ↓
Output Logits (unnormalized scores for the next token)

✅ Key Architecture Components:

  • Multi-Head Attention: 8-16 parallel heads learn different patterns
  • Positional Encoding: Sinusoidal functions inject position information
  • Feed-Forward: 2-layer MLP with 4x expansion (512 → 2048 → 512)
  • Layer Norm: Stabilizes training, mean=0 std=1
  • Residual Connections: x + Sublayer(x) enables deep stacking
  • Parallel Processing: All positions computed simultaneously

🤖 Transformer Variants: BERT, GPT, T5

The original Transformer had both encoder and decoder. Modern models specialize: encoder-only for understanding, decoder-only for generation, or encoder-decoder for translation tasks.

1. Encoder-Only: BERT (Bidirectional Encoder Representations from Transformers)

BERT (Google, 2018) uses only the Transformer encoder. Processes text bidirectionally: it sees both past and future context simultaneously.

Architecture

• Stack of Transformer encoder blocks only
• BERT-Base: 12 layers, 768 dims, 110M params
• BERT-Large: 24 layers, 1024 dims, 340M params
• Bidirectional attention (sees full context)

Pre-training Tasks

• Masked Language Model (MLM): Predict masked words
  "The [MASK] sat on the mat" → "cat"
• Next Sentence Prediction: Does sentence B follow A?
• Trained on BookCorpus + Wikipedia

💡 Best For:

  • ✅ Text Classification: Sentiment analysis, spam detection, intent classification
  • ✅ Named Entity Recognition (NER): Extract people, organizations, locations
  • ✅ Question Answering: Find answers in documents (SQuAD dataset)
  • ✅ Semantic Similarity: Compare sentence meanings
  • ✅ Token Classification: POS tagging, chunking
  • ❌ NOT for text generation: Encoder only, no autoregressive generation
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
import torch

# ============ LOAD PRE-TRAINED BERT ============
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2  # Binary classification (positive/negative)
)

# ============ PREPARE DATA ============
texts = [
    "This movie was fantastic! I loved every minute.",
    "Terrible film, waste of time and money.",
    "Pretty good, would recommend to friends."
]
labels = [1, 0, 1]  # 1 = positive, 0 = negative

# Tokenize
encodings = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors='pt'
)

# ============ FINE-TUNE ON YOUR DATA ============
# (In practice, use Trainer API with train/eval datasets)
outputs = model(**encodings, labels=torch.tensor(labels))
loss = outputs.loss
logits = outputs.logits

print(f"Loss: {loss.item()}")
print(f"Predictions: {torch.argmax(logits, dim=1)}")

# ============ INFERENCE ============
new_text = "Amazing experience, highly recommend!"
inputs = tokenizer(new_text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1)
    
sentiment = "Positive" if prediction == 1 else "Negative"
print(f"Sentiment: {sentiment}")

2. Decoder-Only: GPT (Generative Pre-trained Transformer)

GPT (OpenAI, 2018-2023) uses only the Transformer decoder. Processes text left-to-right: it generates one token at a time, conditioning on previous tokens.

Architecture

• Stack of Transformer decoder blocks only
• GPT-2: 48 layers, 1600 dims, 1.5B params
• GPT-3: 96 layers, 12288 dims, 175B params
• Causal (unidirectional) attention → can't see future

Pre-training Task

• Causal Language Modeling: Predict next token
  "The cat sat" → "on"
  "The cat sat on" → "the"
• Trained on WebText (GPT-2), internet-scale data (GPT-3)

💡 Best For:

  • ✅ Text Generation: Creative writing, stories, articles
  • ✅ Chatbots: Conversational AI (ChatGPT, Claude)
  • ✅ Code Generation: GitHub Copilot, programming assistants
  • ✅ Few-Shot Learning: Solve tasks from examples in prompt
  • ✅ Text Completion: Autocomplete, suggestions
  • ✅ Zero-Shot Tasks: GPT-3+ can do tasks without fine-tuning
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# ============ LOAD PRE-TRAINED GPT ============
model_name = "gpt2"  # or "gpt2-medium", "gpt2-large", "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Set pad token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

# ============ TEXT GENERATION ============
prompt = "Once upon a time in a distant galaxy,"

# Encode prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate with different strategies
outputs = model.generate(
    input_ids,
    max_length=100,              # Maximum length
    num_return_sequences=3,      # Generate 3 variations
    temperature=0.8,             # Randomness (0.0 = deterministic, 1.0+ = creative)
    top_k=50,                    # Sample from top 50 tokens
    top_p=0.95,                  # Nucleus sampling (top 95% probability mass)
    do_sample=True,              # Enable sampling
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print
for i, output in enumerate(outputs):
    text = tokenizer.decode(output, skip_special_tokens=True)
    print(f"\n=== Generation {i+1} ===")
    print(text)

# ============ FEW-SHOT PROMPTING ============
few_shot_prompt = """Translate English to French:
English: Hello, how are you?
French: Bonjour, comment allez-vous?

English: I love learning AI.
French: J'aime apprendre l'IA.

English: The weather is beautiful today.
French:"""

input_ids = tokenizer.encode(few_shot_prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=150, do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0], skip_special_tokens=True))

⚠️ Causal Masking in GPT:

GPT uses a causal attention mask to prevent looking at future tokens:

# Attention mask for "The cat sat"
      The   cat   sat
The   ✅    ❌    ❌     (can only see "The")
cat   ✅    ✅    ❌     (can see "The cat")
sat   ✅    ✅    ✅     (can see all previous)

# This enables autoregressive generation!
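
In code, that lower-triangular mask is typically built with torch.tril and passed to the attention function. A small sketch reusing the scaled_dot_product_attention defined earlier in this tutorial:

import torch

seq_len = 3  # "The cat sat"
# Lower-triangular matrix: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(1, seq_len, seq_len))

Q = K = V = torch.randn(1, seq_len, 8)  # toy self-attention inputs
_, w = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(w[0])  # upper triangle is 0: "The" cannot attend to "cat" or "sat"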

3. Encoder-Decoder: T5 (Text-to-Text Transfer Transformer)

T5 (Google, 2019) uses the full Transformer architecture (encoder + decoder). Frames every task as text-to-text!

Text-to-Text Framework:

Translation:
Input: "translate English to German: Hello world"
Output: "Hallo Welt"

Summarization:
Input: "summarize: [long article text]"
Output: "[short summary]"

Question Answering:
Input: "question: What is the capital? context: France is a country..."
Output: "Paris"

Classification:
Input: "sentiment: This movie was terrible"
Output: "negative"

💡 Best For:

  • ✅ Machine Translation: Encoder processes source, decoder generates target
  • ✅ Summarization: Compress long documents to short summaries
  • ✅ Question Answering: Generate answers (not just extract spans)
  • ✅ Data-to-Text: Generate descriptions from structured data
  • ✅ Paraphrasing: Rewrite text while preserving meaning
from transformers import T5Tokenizer, T5ForConditionalGeneration

# ============ LOAD T5 ============
model_name = "t5-small"  # or "t5-base", "t5-large", "t5-3b", "t5-11b"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# ============ SUMMARIZATION ============
article = """The Transformer architecture has revolutionized natural language processing. 
Introduced in 2017, it relies entirely on attention mechanisms, eschewing recurrence. 
This allows for much more parallelization and has led to models like BERT and GPT-3."""

input_text = "summarize: " + article
input_ids = tokenizer.encode(input_text, return_tensors='pt', max_length=512, truncation=True)

# Generate summary
summary_ids = model.generate(
    input_ids,
    max_length=50,
    num_beams=4,           # Beam search for better quality
    early_stopping=True
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary: {summary}")

# ============ TRANSLATION ============
input_text = "translate English to French: How are you today?"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

translation_ids = model.generate(input_ids, max_length=50)
translation = tokenizer.decode(translation_ids[0], skip_special_tokens=True)
print(f"Translation: {translation}")

# ============ QUESTION ANSWERING ============
context = "Paris is the capital and largest city of France. It has a population of 2.2 million."
question = "What is the capital of France?"
input_text = f"question: {question} context: {context}"

input_ids = tokenizer.encode(input_text, return_tensors='pt', max_length=512)
answer_ids = model.generate(input_ids, max_length=20)
answer = tokenizer.decode(answer_ids[0], skip_special_tokens=True)
print(f"Answer: {answer}")

Comparison: BERT vs GPT vs T5

Aspect | BERT (Encoder-Only) | GPT (Decoder-Only) | T5 (Encoder-Decoder)
Attention Type | Bidirectional (sees all) | Causal (left-to-right) | Bi (encoder) + Causal (decoder)
Pre-training | Masked LM + NSP | Next token prediction | Span corruption (fill in blanks)
Best For | Classification, NER, Q&A | Generation, chat, few-shot | Translation, summarization
Generation | ❌ No (encoder only) | ✅ Yes (autoregressive) | ✅ Yes (seq2seq)
Understanding | ✅ Excellent (full context) | ⚠️ Good (but causal only) | ✅ Excellent (encoder)
Parameters | 110M-340M | 117M-175B (GPT-3) | 60M-11B
Example Use | Sentiment analysis API | ChatGPT, Copilot | Google Translate

✅ Quick Decision Guide:

  • Need to classify/extract? → Use BERT (or RoBERTa, ALBERT)
  • Need to generate text? → Use GPT (or GPT-2, GPT-3, LLaMA)
  • Need sequence-to-sequence? → Use T5 (or BART, mT5)
  • Need both understanding + generation? → Use modern instruct-tuned LLMs (GPT-4, Claude)

📋 Summary & Key Takeaways

Congratulations! You've mastered the Transformer architecture, the foundation of modern AI. Let's consolidate everything you've learned.

Core Concepts Mastered

🎯

Attention Mechanism

Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V

Purpose: Focus on relevant positions dynamically

Benefit: No sequential bottleneck, direct connections

🔍

Multi-Head Attention

Concept: 8-16 parallel attention heads

Purpose: Learn different relationship patterns

Example: syntax, semantics, coreference, etc.

📏

Positional Encoding

Method: Sinusoidal functions (sin/cos)

Purpose: Inject position information

Why needed: Attention has no notion of order

⚡

Parallel Processing

Key innovation: Process all tokens simultaneously

vs RNN: 100x faster training

Enables: Internet-scale pre-training

Architecture Components

Component | Purpose | Key Detail
Multi-Head Attention | Focus on relevant positions | 8 heads, d_k=64 each (typical)
Feed-Forward Network | Add non-linearity | 512 → 2048 → 512 (4x expansion)
Layer Normalization | Stabilize training | Mean=0, Std=1 per layer
Residual Connections | Enable deep networks | x + Sublayer(x) pattern
Positional Encoding | Inject position info | Sinusoidal functions

Transformer Variants Quick Reference

πŸ” BERT
Encoder-Only

Attention: Bidirectional
Task: Understanding
Use cases:
β€’ Sentiment analysis
β€’ NER
β€’ Classification
β€’ Q&A (extraction)

Examples:
BERT, RoBERTa, ALBERT, DistilBERT
✍️ GPT
Decoder-Only

Attention: Causal (left-to-right)
Task: Generation
Use cases:
β€’ Text generation
β€’ Chat
β€’ Code generation
β€’ Few-shot learning

Examples:
GPT-2, GPT-3, GPT-4, LLaMA, Claude
πŸ”„ T5
Encoder-Decoder

Attention: Both types
Task: Seq2seq
Use cases:
β€’ Translation
β€’ Summarization
β€’ Q&A (generation)
β€’ Paraphrasing

Examples:
T5, BART, mBART, mT5

Why Transformers Dominate: The Full Picture

❌ RNN/LSTM Limitations

• Sequential processing (slow)
• Limited context (~100 tokens)
• Vanishing gradients
• Memory bottleneck
• Can't parallelize training
• Difficult to scale

✅ Transformer Advantages

• Parallel processing (100x faster)
• Long context (100,000+ tokens in modern variants)
• Clean gradient flow
• Direct connections via attention
• Saturate thousands of GPUs
• Scale to 175B+ parameters

Mental Models & Decision Trees

🎯 Choosing the Right Model:

Question 1: What's your task?
• Classify/Extract information? → BERT-style encoder
  Examples: Sentiment, NER, intent classification

• Generate new text? → GPT-style decoder
  Examples: Writing, chat, code completion

• Transform one sequence to another? → T5-style encoder-decoder
  Examples: Translation, summarization

Question 2: Do you need bidirectional context?
• Yes (full sentence available): BERT/T5
• No (generate left-to-right): GPT

Question 3: How much data/compute?
• Limited: Use smaller models (BERT-base, DistilBERT)
• Abundant: Scale up (BERT-large, GPT-3)

Question 4: Fine-tune or few-shot?
• Have labeled data: Fine-tune BERT/GPT/T5
• Few examples only: Use large GPT with prompting

Common Pitfalls & Best Practices

❌ Pitfall: Out of Memory
Training crashes with CUDA OOM

Solutions:
• Reduce batch size
• Use gradient accumulation (see the sketch after these cards)
• Enable mixed precision (fp16)
• Try gradient checkpointing
• Use smaller model variant

❌ Pitfall: Slow Inference
Takes too long to generate

Solutions:
• Use model distillation (DistilBERT)
• Quantization (int8, int4)
• Reduce max sequence length
• Batch predictions
• Consider smaller model

❌ Pitfall: Poor Fine-tuning
Model overfits or doesn't learn

Solutions:
• Use lower learning rate (1e-5 to 5e-5)
• Add warmup steps
• More training data
• Try different pre-trained model
• Freeze early layers

❌ Pitfall: Sequence Length Issues
Input exceeds max length (512/1024)

Solutions:
• Truncate intelligently (keep important parts)
• Use hierarchical processing
• Try Longformer/BigBird (longer context)
• Split into chunks and aggregate
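
For the out-of-memory pitfall above, gradient accumulation and mixed precision are usually the first things to try. A minimal PyTorch sketch with a stand-in model and synthetic data (a real run would swap in your own model, optimizer, and data loader):

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 2).to(device)          # stand-in for a real Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling for fp16
accum_steps = 4                               # effective batch = micro-batch × 4

optimizer.zero_grad()
for step in range(8):                         # stand-in for a real data loader loop
    x = torch.randn(8, 512, device=device)
    y = torch.randint(0, 2, (8,), device=device)
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    scaler.scale(loss).backward()             # accumulate (scaled) gradients
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                # unscale gradients + optimizer step
        scaler.update()
        optimizer.zero_grad()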

Practice Projects

😊

Project 1: Sentiment Classifier

Goal: Fine-tune BERT for sentiment analysis

Dataset: IMDB reviews (Hugging Face)

Skills: Tokenization, fine-tuning, evaluation

Bonus: Compare BERT vs RoBERTa vs DistilBERT

✍️

Project 2: Text Generator

Goal: Generate creative stories with GPT-2

Dataset: Pre-trained GPT-2 (no training needed!)

Skills: Prompting, sampling strategies, temperature

Bonus: Implement top-k and nucleus sampling

🌐

Project 3: Translator

Goal: Fine-tune T5 for translation

Dataset: WMT translation datasets

Skills: Seq2seq, BLEU evaluation, beam search

Bonus: Try multiple language pairs

🔍

Project 4: Q&A System

Goal: Build extractive Q&A with BERT

Dataset: SQuAD 2.0 (Stanford)

Skills: Span extraction, F1 score, confidence

Bonus: Add "no answer" detection

What You've Mastered

🎓 Congratulations! You now understand:

  • ✅ Attention mechanism: Query-Key-Value, scaled dot-product, why it works
  • ✅ Self-attention: Every position attends to every position
  • ✅ Multi-head attention: Multiple perspectives in parallel
  • ✅ Positional encoding: Sinusoidal injection of position info
  • ✅ Transformer architecture: Encoder blocks, decoder blocks, residual connections
  • ✅ Model variants: BERT (encoder), GPT (decoder), T5 (both)
  • ✅ Implementation: How to use transformers in practice
  • ✅ Design decisions: When to use which architecture

The Transformer Impact

🚀 The AI Revolution, Enabled by Transformers:

2017: Attention is All You Need paper
2018: BERT revolutionizes NLP understanding
2019: GPT-2 shows emergence with scale
2020: GPT-3 achieves few-shot learning
2021: Vision Transformers beat CNNs
2022: ChatGPT reaches 100M users in 2 months
2023-2025: Multimodal AI, agents, reasoning systems

Key Insight: Transformers didn't just improve performance; they fundamentally changed what's possible with AI by enabling:
• Internet-scale pre-training (175B+ parameters)
• Transfer learning across domains
• Few-shot and zero-shot capabilities
• Emergent abilities at scale
• Foundation models for AGI research

What's Next?

You've conquered the architecture that powers modern AI! In the next tutorial, Transfer Learning & Fine-tuning, you'll learn how to:

  • 🎯 Leverage pre-trained models for your tasks
  • πŸ”§ Fine-tune BERT, GPT, and T5 on custom data
  • ⚑ Use adapters and LoRA for efficient fine-tuning
  • πŸ“Š Evaluate and deploy your models
  • πŸ’‘ Build production ML systems with transfer learning

🎉 Outstanding Achievement!

You've mastered Transformers, the most important architecture in modern AI! You now understand the technology behind GPT-4, Claude, BERT, and every major AI breakthrough of the past 8 years.

Next: Learn how to harness pre-trained Transformers for your own applications! 🚀

πŸ“ Knowledge Check

Test your understanding of attention mechanisms and transformers!

1. What is the primary purpose of the attention mechanism?

A) To reduce model size
B) To speed up training
C) To allow models to focus on relevant parts of the input
D) To eliminate the need for training data

2. In self-attention, what are the three key components?

A) Input, output, and hidden state
B) Query, key, and value
C) Encoder, decoder, and attention
D) Forward, backward, and update

3. What is a key advantage of Transformers over RNNs?

A) They require less training data
B) They use less memory
C) They are easier to implement
D) They can process sequences in parallel

4. What does the encoder in a Transformer do?

A) Processes and encodes the input sequence into a representation
B) Generates the output sequence
C) Calculates the loss function
D) Applies regularization to prevent overfitting

5. What problem does multi-head attention help solve?

A) Reduces computational cost
B) Allows the model to attend to different aspects of the input simultaneously
C) Eliminates the need for positional encoding
D) Prevents overfitting