
Attention Is All You Need

πŸ“š Tutorial 2 🟒 Beginner

Understand the revolutionary 2017 paper that changed AI forever


The Paper That Changed Everything

In June 2017, researchers at Google published "Attention Is All You Need" - one of the most impactful papers in AI history. Its core claim was radical: you don't need recurrence. Pure attention is sufficient for sequence modeling.

Historic Impact: This paper has been cited over 100,000 times and spawned BERT (2018), GPT-2 (2019), GPT-3 (2020), GPT-4 (2023), Claude, Llama, and every modern LLM. Understanding it is understanding modern AI.

The Revolutionary Context

To appreciate how radical this paper was, consider the state of AI in 2016-2017:

State of the Art Before Transformers (2016)

  • Machine Translation: LSTM Seq2Seq with attention (GNMT - Google Neural Machine Translation)
  • Text Generation: Character-level RNNs, word-level LSTMs
  • Speech Recognition: Deep RNN stacks with CTC loss
  • Training Time: Weeks to months on large GPU clusters
  • Sequence Length Limit: ~100-200 tokens (beyond this, quality degraded)
  • Parallelization: Limited to batch dimension only

The Paper's Bold Claims

❌ Old Wisdom (2016)

  • "Recurrence is essential for sequences"
  • "LSTMs are needed for long-range dependencies"
  • "You must process sequences step-by-step"
  • "Attention is a nice add-on to RNNs"

βœ… Transformer Insight (2017)

  • "Attention is ALL you need"
  • "No recurrence necessary"
  • "Fully parallel processing"
  • "Attention as the primary mechanism"

The Authors and Their Vision

The paper was authored by eight researchers at Google Brain, Google Research, and University of Toronto:

  • Ashish Vaswani (lead author) - "The attention mechanism could replace recurrence"
  • Noam Shazeer - Pioneered many architectural innovations
  • Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

πŸ’‘ Fun Fact: The paper title "Attention Is All You Need" was inspired by The Beatles song "All You Need Is Love". The authors wanted to convey that attention alone, without recurrence or convolutions, was sufficient for state-of-the-art results.

The Impact Timeline

πŸ“… June 2017: Paper published on arXiv

πŸ“… December 2017: Presented at NIPS 2017 (the conference now known as NeurIPS)

πŸ“… October 2018: BERT released (transformer encoder for understanding)

πŸ“… February 2019: GPT-2 released (transformer decoder for generation)

πŸ“… June 2020: GPT-3 (175B parameters) - transformers scaled massively

πŸ“… 2021-2023: Transformer variants dominate all of AI (vision, audio, code, proteins)

πŸ“… 2023-2024: GPT-4, Claude 3, Gemini, Llama 3 - all built on transformer architecture

πŸ† Result: From zero to dominating AI in just 7 years. This is one of the fastest paradigm shifts in the history of computer science.

The Core Insight: Attention Replaces Recurrence

Instead of processing sequences step-by-step (like RNNs), transformers process all tokens at once using attention - a mechanism that learns which tokens are important for understanding each position.

RNN Approach:
h₁ = RNN(token₁, hβ‚€)
hβ‚‚ = RNN(tokenβ‚‚, h₁)  ← Must wait for h₁
h₃ = RNN(token₃, hβ‚‚)  ← Must wait for hβ‚‚
Sequential! ⏳

Attention Approach:
attention_weights = compute_attention(all_tokens)
output = weighted_sum(all_tokens)  ← All at once! πŸš€

This simple change unlocked massive parallelization and better long-range learning.
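
To make the contrast concrete, here is a minimal PyTorch sketch (my own, not from the paper) of the two processing styles: an RNN cell that must loop over time steps versus a single batched attention computation over the whole sequence (projections and multiple heads omitted for brevity):

import torch
import torch.nn as nn

seq_len, d_model = 128, 64
x = torch.randn(1, seq_len, d_model)               # one sequence of 128 token embeddings

# RNN-style: each step depends on the previous hidden state, so the loop is inherently serial
rnn_cell = nn.RNNCell(d_model, d_model)
h = torch.zeros(1, d_model)
for t in range(seq_len):                           # 128 dependent steps, one after another
    h = rnn_cell(x[0, t].unsqueeze(0), h)

# Attention-style: one batched matrix product compares every token with every other token
scores = x @ x.transpose(-2, -1) / d_model ** 0.5  # (1, 128, 128) similarity matrix, one op
weights = scores.softmax(dim=-1)                   # attention weights for all positions at once
out = weights @ x                                  # all 128 positions updated simultaneously

print(h.shape, out.shape)   # torch.Size([1, 64]) torch.Size([1, 128, 64])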

What is Attention? The Core Mechanism

Attention is deceptively simple: for each token, compute how relevant every other token is, then take a weighted average of their representations. This one idea replaces the complex sequential processing of RNNs.

The Human Analogy

Intuition: When you read a sentence, your brain doesn't process each word in isolation. You dynamically focus on relevant words based on what you're trying to understand.

"The cat sat on the mat because it was comfortable."

Question: What does "it" refer to?

  • Your brain attends to "cat" (likely referent)
  • Considers "mat" (possible but less likely)
  • Ignores "the", "on", "because", "was" (grammatical but uninformative)

This selective focusing is what attention mechanisms formalize mathematically!

The Attention Process: 4 Steps

For each position i in the sequence:

Step 1: Compute Similarity Scores
    - Compare token i to ALL other tokens (including itself)
    - Result: "How relevant is each token to understanding token i?"
    - Output: Raw attention scores [score₁, scoreβ‚‚, ..., scoreβ‚™]

Step 2: Normalize with Softmax
    - Convert scores to probabilities (sum to 1)
    - Output: Attention weights [weight₁, weightβ‚‚, ..., weightβ‚™]
    - High weight = very relevant, Low weight = less relevant

Step 3: Weighted Combination
    - Take weighted sum of token representations
    - Relevant tokens contribute more to the result
    - Output: Context-aware representation

Step 4: Apply to All Positions
    - Repeat for every position simultaneously
    - Fully parallelizable!
    - Output: Entire sequence with context

Concrete Example: Pronoun Resolution

Let's trace attention step-by-step for the sentence: "Alice helped Bob because she cared"

Processing "she" (position 4):

Step 1: Compute Similarities

Token     Similarity Score
Alice     4.12  ← High (likely referent, female name)
helped    1.60  ← Low (verb, not directly relevant)
Bob       2.89  ← Medium (possible, but the name suggests male)
because   0.00  ← Very low (connector word)
she       2.39  ← Medium (self-reference)
cared     1.09  ← Low (verb)

Step 2: Apply Softmax (normalize to probabilities)

Token     Attention Weight
Alice     0.62  ← Highest probability (62%)
helped    0.05
Bob       0.18
because   0.01
she       0.11
cared     0.03
Total:    1.00  ← Sums to 100%

Step 3: Weighted Combination

representation("she") = 
    0.62 Γ— embedding("Alice") +
    0.18 Γ— embedding("Bob") +
    0.11 Γ— embedding("she") +
    0.05 Γ— embedding("helped") +
    ... (small contributions from others)

Result: "she" representation strongly influenced by "Alice"
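
If you want to check Step 2 yourself, here is a quick sanity check using PyTorch's built-in softmax on the illustrative raw scores from Step 1:

import torch
import torch.nn.functional as F

tokens = ["Alice", "helped", "Bob", "because", "she", "cared"]
scores = torch.tensor([4.12, 1.60, 2.89, 0.00, 2.39, 1.09])   # illustrative Step 1 scores

weights = F.softmax(scores, dim=-1)                # Step 2: normalize to probabilities
for token, w in zip(tokens, weights):
    print(f"{token:8s} {w.item():.2f}")            # Alice 0.62, Bob 0.18, she 0.11, ...
print(f"total    {weights.sum().item():.2f}")      # 1.00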

Mathematical Formulation

Here's the precise mathematical definition:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Breaking down each component:

  • Q (Query): "What am I looking for?" - Representation of current token asking questions
  • K (Key): "What do I represent?" - Representation of all tokens being considered
  • V (Value): "What information do I carry?" - Actual content to be aggregated
  • QK^T: Dot product between queries and keys β†’ similarity scores (matrix: seq_len Γ— seq_len)
  • √d_k: Scaling factor (d_k = dimension) - prevents scores from getting too large
  • softmax(...): Convert scores to probabilities (each row sums to 1)
  • (...) V: Use probabilities as weights to combine values

🎯 Key Insight: Q, K, V are all learned linear projections of the input. The model learns what queries to ask, what keys to match, and what values to return. This is all trained end-to-end with backpropagation!

Why Query, Key, Value?

The Q, K, V terminology comes from information retrieval (like database queries):

πŸ” Query

Like a search query: "Find me documents about cats"

In attention: "What information do I need to understand this token?"

πŸ”‘ Key

Like document tags: "This document is about [animals, pets]"

In attention: "What does this token represent?"

πŸ’Ž Value

Like document content: "Here's the actual information"

In attention: "What information does this token carry?"

Database Analogy: Imagine searching a database where queries match against keys, and the best matches return their associated values. That's exactly what attention does, but learned and differentiable!
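
Here is a toy sketch of that retrieval analogy (all numbers invented for illustration): a query is scored against a small set of keys, softmax turns the scores into weights, and the result is a blend of the values dominated by the best match.

import torch
import torch.nn.functional as F

# Toy "database": three items, each with a key (what it is about) and a value (its content)
keys = torch.tensor([[1.0, 0.0],      # item about "animals"
                     [0.0, 1.0],      # item about "vehicles"
                     [0.7, 0.7]])     # item about both
values = torch.tensor([[10.0,  0.0],
                       [ 0.0, 10.0],
                       [ 5.0,  5.0]])

query = torch.tensor([2.0, 0.2])      # mostly asking about "animals"

scores = keys @ query                 # how well each key matches the query
weights = F.softmax(scores, dim=-1)   # soft, differentiable selection instead of a hard lookup
result = weights @ values             # weighted blend of the values

print(weights)   # β‰ˆ tensor([0.56, 0.09, 0.35]) - the "animals" item dominates
print(result)    # β‰ˆ tensor([7.3, 2.7])         - mostly the "animals" value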

Scaled Dot-Product Attention: The Implementation

The paper introduced "Scaled Dot-Product Attention" - the specific attention variant that became the industry standard. Let's implement it from scratch and understand every detail.

Step-by-Step Implementation

Here's a complete, annotated implementation with detailed explanations:

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class ScaledDotProductAttention(nn.Module):
    """
    Scaled Dot-Product Attention from 'Attention Is All You Need'
    
    Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V
    """
    
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, query, key, value, mask=None):
        """
        Args:
            query: (batch, seq_len_q, d_k)
            key:   (batch, seq_len_k, d_k)
            value: (batch, seq_len_v, d_v)  where seq_len_v == seq_len_k
            mask:  optional; broadcastable to (batch, seq_len_q, seq_len_k) - for padding/causality
        
        Returns:
            output: (batch, seq_len_q, d_v)
            attention_weights: (batch, seq_len_q, seq_len_k)
        """
        
        # Get dimension for scaling
        d_k = query.size(-1)
        
        # Step 1: Compute attention scores (QK^T)
        # Shape: (batch, seq_len_q, seq_len_k)
        scores = torch.matmul(query, key.transpose(-2, -1))
        
        # Step 2: Scale by √d_k
        # Why? Prevent dot products from getting too large
        # (which would push softmax into regions with tiny gradients)
        scores = scores / math.sqrt(d_k)
        
        # Step 3: Apply mask (if provided)
        if mask is not None:
            # Set masked positions to large negative value
            # After softmax, these become β‰ˆ0
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # Step 4: Apply softmax to get probabilities
        # Each row sums to 1.0
        attention_weights = F.softmax(scores, dim=-1)
        
        # Step 5: Apply dropout (optional, for regularization)
        attention_weights = self.dropout(attention_weights)
        
        # Step 6: Multiply by values
        # Shape: (batch, seq_len_q, d_v)
        output = torch.matmul(attention_weights, value)
        
        return output, attention_weights

# Example usage
batch_size = 2
seq_len = 4
d_model = 512
d_k = 64  # Typically d_model // num_heads

# Create random Q, K, V matrices
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)

# Apply attention
attention = ScaledDotProductAttention()
output, weights = attention(Q, K, V)

print(f"Input Q shape:  {Q.shape}")      # torch.Size([2, 4, 64])
print(f"Output shape:   {output.shape}")  # torch.Size([2, 4, 64])
print(f"Weights shape:  {weights.shape}") # torch.Size([2, 4, 4])
print(f"\nAttention weights (batch 0, first token):")
print(weights[0, 0, :])  # Shows which tokens attended to
# Example: tensor([0.32, 0.18, 0.41, 0.09]) - token 0 attends most to token 2

Understanding the Scaling Factor √d_k

Why do we divide by √d_k? This is a crucial detail often overlooked:

⚠️ The Problem Without Scaling

Dot products between high-dimensional vectors can become very large:

# Example: Why scaling matters
import torch
import torch.nn.functional as F

d_k = 512  # High dimensional

# Generate random Gaussian vectors (entries ~ N(0, 1))
q = torch.randn(1, d_k)
k = torch.randn(1, d_k)

# Compute dot product
score = torch.matmul(q, k.T).item()
print(f"Unscaled score: {score:.2f}")  # Often 15-30 in magnitude

# Apply softmax
softmax_input = torch.tensor([score, score * 0.5, score * 0.3])
probs = F.softmax(softmax_input, dim=-1)
print(f"Softmax probs: {probs}")
# Example output: tensor([0.9998, 0.0002, 0.0000]) - exact values vary per run
# Nearly all weight on the first token - gradient β‰ˆ 0!

# With scaling
scaled_score = score / (d_k ** 0.5)
print(f"\nScaled score: {scaled_score:.2f}")  # Much smaller

softmax_input_scaled = torch.tensor([scaled_score, scaled_score * 0.5, scaled_score * 0.3])
probs_scaled = F.softmax(softmax_input_scaled, dim=-1)
print(f"Softmax probs (scaled): {probs_scaled}")
# Example output: tensor([0.5763, 0.2719, 0.1518]) - exact values vary per run
# More balanced distribution - better gradients!

🎯 Key Insight: Without scaling, large dot products push softmax into saturation regions where gradients vanish. Scaling by √d_k keeps scores in a reasonable range where gradients flow well.

Mathematical Intuition for √d_k

Why specifically √d_k and not just d_k?

If Q and K have independent entries with mean 0 and variance 1:

Dot product QΒ·K = Ξ£(q_i Γ— k_i) for i=1 to d_k

Expected value: E[QΒ·K] = 0 (since entries have mean 0)

Variance: Var(QΒ·K) = Ξ£ Var(q_i Γ— k_i) = d_k

Standard deviation: Οƒ(QΒ·K) = √d_k

By dividing by √d_k, we normalize the dot product to have variance 1:

Οƒ(QΒ·K / √d_k) = Οƒ(QΒ·K) / √d_k = √d_k / √d_k = 1

Result: Scaled dot products have consistent variance regardless of dimension, keeping softmax inputs in the optimal range for gradient flow.
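
A quick empirical check of this derivation (a sketch, not from the paper): sample many random query/key pairs with unit-variance entries and measure the dot-product variance before and after scaling.

import torch

d_k = 64
n_samples = 100_000
q = torch.randn(n_samples, d_k)       # entries ~ N(0, 1)
k = torch.randn(n_samples, d_k)

dots = (q * k).sum(dim=-1)            # one dot product per sampled pair

print(f"Var(Q·K)        β‰ˆ {dots.var().item():.1f}   (theory: d_k = {d_k})")
print(f"Var(Q·K / √d_k) β‰ˆ {(dots / d_k ** 0.5).var().item():.2f}  (theory: 1)")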

Complete Example: Processing a Real Sentence

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sentence: "The cat sat on the mat"
tokens = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(tokens)
d_model = 8  # Small for illustration

# Simulated token embeddings (normally from embedding layer)
x = torch.randn(1, seq_len, d_model)

# Create Q, K, V projections (learned linear layers)
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

# Project inputs to Q, K, V
Q = W_q(x)  # (1, 6, 8)
K = W_k(x)  # (1, 6, 8)
V = W_v(x)  # (1, 6, 8)

# Compute attention
d_k = d_model
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)  # (1, 6, 6)
attention_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attention_weights, V)  # (1, 6, 8)

# Visualize attention for token 1 ("cat")
print("Attention weights for 'cat' (token 1):")
print("Token  | Attention")
print("-------|----------")
for i, token in enumerate(tokens):
    weight = attention_weights[0, 1, i].item()
    bar = "β–ˆ" * int(weight * 50)
    print(f"{token:6} | {weight:.3f} {bar}")

# Example output:
# Token  | Attention
# -------|----------
# The    | 0.142 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
# cat    | 0.284 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
# sat    | 0.198 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
# on     | 0.089 β–ˆβ–ˆβ–ˆβ–ˆ
# the    | 0.125 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
# mat    | 0.162 β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ

Attention Patterns: What Does the Model Learn?

Attention weights reveal what the model considers important. Common patterns include:

πŸ” Self-Attention

Tokens place high weight on their own position (the diagonal of the attention matrix is high)

Example: "running" attends to itself to maintain word identity

πŸ‘₯ Dependency Attention

Tokens attend to syntactically related words

Example: Verbs attend to their subjects and objects

πŸ”— Positional Attention

Tokens attend to nearby positions

Example: Words attend to immediate neighbors for local context

πŸ“š Semantic Attention

Tokens attend based on meaning

Example: "king" attends to "queen", "throne", "crown"

Why Attention Solves RNN Problems

βœ… No Vanishing Gradients

Every token connects to every other token through a single weighted sum, so gradients take a short, direct path instead of being multiplied through many time steps.

βœ… Parallelizable

All tokens are processed simultaneously, so training makes full use of GPU parallelism instead of waiting on a sequential loop.

βœ… Long-Range Dependencies

Direct connections between distant tokens. No information lost through bottlenecks.

βœ… Flexible Context

Attention weights learned during training. Model learns what to focus on.

Multi-Head Attention: Multiple Perspectives

Single attention is powerful, but the paper found that multiple attention heads work even better. Each head can learn different patterns.

The Intuition

Analogy: When understanding a sentence, humans consider multiple aspects simultaneously:

  • Syntax Head: Which words are grammatically related?
  • Semantic Head: Which words have similar meanings?
  • Positional Head: Which nearby words provide context?
  • Referential Head: Which nouns do pronouns refer to?

Multi-head attention lets the model learn these different perspectives and combine them!

How Multi-Head Attention Works

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention from 'Attention Is All You Need'
    
    Key idea: Instead of one attention, run h parallel attention heads,
    then concatenate and project the results.
    """
    
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # Dimension per head
        
        # Linear projections for Q, K, V (for all heads combined)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(dropout)
    
    def split_heads(self, x):
        """
        Split the last dimension into (num_heads, d_k)
        x: (batch, seq_len, d_model)
        Returns: (batch, num_heads, seq_len, d_k)
        """
        batch_size, seq_len, d_model = x.size()
        x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
        return x.transpose(1, 2)  # (batch, num_heads, seq_len, d_k)
    
    def combine_heads(self, x):
        """
        Combine heads back
        x: (batch, num_heads, seq_len, d_k)
        Returns: (batch, seq_len, d_model)
        """
        batch_size, num_heads, seq_len, d_k = x.size()
        x = x.transpose(1, 2)  # (batch, seq_len, num_heads, d_k)
        return x.contiguous().view(batch_size, seq_len, self.d_model)
    
    def forward(self, query, key, value, mask=None):
        """
        Args:
            query, key, value: (batch, seq_len, d_model)
            mask: (batch, 1, seq_len, seq_len) or None
        
        Returns:
            output: (batch, seq_len, d_model)
            attention_weights: (batch, num_heads, seq_len, seq_len)
        """
        batch_size = query.size(0)
        
        # 1. Linear projections
        Q = self.W_q(query)  # (batch, seq_len, d_model)
        K = self.W_k(key)
        V = self.W_v(value)
        
        # 2. Split into multiple heads
        Q = self.split_heads(Q)  # (batch, num_heads, seq_len, d_k)
        K = self.split_heads(K)
        V = self.split_heads(V)
        
        # 3. Scaled dot-product attention for each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # 4. Apply attention to values
        context = torch.matmul(attention_weights, V)  # (batch, num_heads, seq_len, d_k)
        
        # 5. Concatenate heads
        context = self.combine_heads(context)  # (batch, seq_len, d_model)
        
        # 6. Final linear projection
        output = self.W_o(context)
        
        return output, attention_weights

# Example usage
batch_size = 2
seq_len = 10
d_model = 512
num_heads = 8

x = torch.randn(batch_size, seq_len, d_model)

mha = MultiHeadAttention(d_model, num_heads)
output, weights = mha(x, x, x)  # Self-attention

print(f"Input shape:    {x.shape}")        # torch.Size([2, 10, 512])
print(f"Output shape:   {output.shape}")   # torch.Size([2, 10, 512])
print(f"Weights shape:  {weights.shape}")  # torch.Size([2, 8, 10, 10])
#                                            # ↑ 8 heads, each with 10Γ—10 attention matrix

Visualizing Different Attention Heads

Research has shown that different heads learn different linguistic patterns:

Example from BERT: Analyzing Attention Heads on "The cat sat on the mat"

Head 1: Syntax
  β€’ sat β†’ cat (subject)
  β€’ sat β†’ mat (object)

Head 2: Determiners
  β€’ cat β†’ The
  β€’ mat β†’ the

Head 3: Positional
  β€’ each token β†’ its neighbors (local context)

Head 4: Semantics
  β€’ cat β†’ sat (action)
  β€’ cat β†’ mat (location)

🎯 Key Finding: Different heads specialize! The model automatically learns to use different heads for different types of relationships. This emergent behavior wasn't explicitly programmed - it arose from training.
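
As a minimal sketch of how you might inspect per-head weights yourself, the snippet below reuses the MultiHeadAttention class defined earlier. Note that an untrained, randomly initialized model produces arbitrary patterns; the specialization described above only emerges after training.

import torch

torch.manual_seed(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, num_heads = 64, 4

x = torch.randn(1, len(tokens), d_model)           # stand-in token embeddings
mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
mha.eval()                                         # disable dropout for inspection
_, weights = mha(x, x, x)                          # weights: (1, num_heads, 6, 6)

for h in range(num_heads):
    row = weights[0, h, 1]                         # attention from "cat" in head h
    top = row.argmax().item()
    print(f"Head {h}: 'cat' attends most to '{tokens[top]}' (weight {row[top].item():.2f})")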

Why Multiple Heads Help

❌ Single Head

  • One attention pattern
  • Must compromise between different needs
  • Limited expressiveness

Example: Can't simultaneously capture syntax AND semantics

βœ… Multi-Head (8 heads)

  • 8 different attention patterns
  • Each head specializes
  • Rich, diverse representations

Example: Head 1=syntax, Head 2=semantics, Head 3=position, etc.

Empirical Results: The original paper's base model used 8 heads of 64 dimensions each, which worked best in their ablations. Much larger models use many more heads - GPT-3 (175B parameters), for example, uses 96.

The Complete Transformer Architecture

The paper combined attention with several other key components to create the full transformer:

Architecture Overview

TRANSFORMER ARCHITECTURE

Input: "The cat sat on the mat"
  ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ TOKEN EMBEDDINGS (learned lookup table)     β”‚
β”‚ "The" β†’ [0.12, -0.34, 0.56, ...]           β”‚
β”‚ "cat" β†’ [-0.23, 0.67, -0.12, ...]          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ + POSITIONAL ENCODING (position info)       β”‚
β”‚ Position 0: [sin(...), cos(...), ...]      β”‚
β”‚ Position 1: [sin(...), cos(...), ...]      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ENCODER STACK (6 identical layers)          β”‚
β”‚                                              β”‚
β”‚ Layer 1:                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Multi-Head Self-Attention            β”‚  β”‚
β”‚  β”‚ (each token attends to all tokens)   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚    ↓ + Residual & LayerNorm               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Feed-Forward Network                 β”‚  β”‚
β”‚  β”‚ FFN(x) = max(0, xW₁ + b₁)Wβ‚‚ + bβ‚‚    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚    ↓ + Residual & LayerNorm               β”‚
β”‚                                              β”‚
β”‚ Layers 2-6: (same structure)                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ DECODER STACK (6 identical layers)          β”‚
β”‚                                              β”‚
β”‚ Layer 1:                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Masked Multi-Head Self-Attention     β”‚  β”‚
β”‚  β”‚ (causal: can't see future tokens)    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚    ↓ + Residual & LayerNorm               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Multi-Head Cross-Attention           β”‚  β”‚
β”‚  β”‚ (decoder attends to encoder)         β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚    ↓ + Residual & LayerNorm               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Feed-Forward Network                 β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚    ↓ + Residual & LayerNorm               β”‚
β”‚                                              β”‚
β”‚ Layers 2-6: (same structure)                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LINEAR + SOFTMAX                            β”‚
β”‚ Project to vocabulary size (50K dims)       β”‚
β”‚ Softmax β†’ probability distribution          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  ↓
Output: Next token probabilities

Key Components Explained

🎯 Multi-Head Attention

8 parallel attention heads learn different patterns (syntax, semantics, position, etc.)

Original paper: 8 heads, 64 dims each

πŸ“ Positional Encoding

Sinusoidal functions encode position since attention is permutation-invariant

PE(pos, 2i) = sin(pos/10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))
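
A short sketch of how this table of encodings can be built (a standard implementation of the formula above; nothing beyond plain PyTorch assumed):

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos/10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos/10000^(2i/d_model))"""
    position = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # torch.Size([50, 512]) - added to token embeddings: x = embeddings + pe[:seq_len]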

🧠 Feed-Forward Network

2-layer MLP applied to each position independently. Adds non-linearity and expressiveness

FFN(x) = ReLU(xW₁ + b₁)Wβ‚‚ + bβ‚‚

πŸ”— Residual Connections

Add input to output: y = x + Sublayer(x). Helps gradients flow through 12+ layers

Essential for training deep networks

πŸ“Š Layer Normalization

Normalize activations to mean=0, var=1 within each layer. Stabilizes training

LN(x) = Ξ³(x - ΞΌ)/Οƒ + Ξ²

🎭 Masked Attention

In decoder, mask future positions so token i can only attend to positions ≀ i

Ensures autoregressive generation
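
To see how these pieces fit together, here is a hedged sketch of a single encoder-style layer built from the MultiHeadAttention class above plus the feed-forward, residual, and layer-norm components just described (post-norm layout as in the original paper; dropout placement simplified). Passing a causal mask turns the self-attention into the masked, decoder-style variant.

import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One layer: self-attention and FFN, each wrapped in residual + LayerNorm (simplified)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = nn.Sequential(          # FFN(x) = ReLU(x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_out, _ = self.self_attn(x, x, x, mask)    # every token attends to every token
        x = self.norm1(x + self.dropout(attn_out))     # residual connection + LayerNorm
        ffn_out = self.ffn(x)                          # position-wise feed-forward network
        x = self.norm2(x + self.dropout(ffn_out))      # residual connection + LayerNorm
        return x

batch_size, seq_len, d_model = 2, 10, 512
x = torch.randn(batch_size, seq_len, d_model)
layer = TransformerEncoderLayer()

print(layer(x).shape)                                  # torch.Size([2, 10, 512])

# Causal (decoder-style) mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(1, 1, seq_len, seq_len))
print(layer(x, mask=causal_mask).shape)                # torch.Size([2, 10, 512])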

But Attention Seems Expensive!

It's a fair concern: computing attention between all pairs of tokens is O(nΒ²) in the sequence length, which looks costly for long sequences:

⚠️ Concern: If the sequence length is 1,000, the attention matrix has 1,000Β² = 1,000,000 pairwise scores (per head, per layer)

But modern GPUs excel at exactly this kind of dense matrix math: for typical sequence lengths, one large matrix multiplication is far faster than a thousand sequential RNN steps.
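
As a rough sense of scale (a back-of-the-envelope sketch, assuming fp16 scores at 2 bytes each), here is how the n Γ— n score matrix grows with sequence length:

# Size of the n x n attention score matrix, per head and per layer, assuming fp16 (2 bytes)
for n in (512, 2_048, 32_768):
    entries = n * n
    megabytes = entries * 2 / 1e6
    print(f"n = {n:6d}: {entries:>13,d} scores  (β‰ˆ {megabytes:,.1f} MB)")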

Plus, there are workarounds:

  • Flash Attention: Algorithmic improvement (3-4x speedup)
  • Sparse Attention: Only attend to nearby tokens
  • Grouped Query Attention: Share attention heads across tokens

Why This Was Revolutionary

Before transformers, sequence modeling was dominated by RNNs. The insight that you can remove recurrence entirely and still get better results was counterintuitive and transformative.

Before (2016): RNN variants with tricks to handle sequences

After (2017): Pure attention, no recurrence needed

This paradigm shift enabled scaling to billions of parameters and unlocked modern AI.

Key Takeaways

  • Attention Mechanism: Learn which tokens are important for each position
  • Scaled Dot-Product: Core formula: softmax(QK^T/√d_k)V
  • Parallel Processing: All tokens processed simultaneously (vs RNN's sequential)
  • Long-Range Learning: Direct connections between distant tokens
  • Positional Encoding: Add position information since attention is permutation-invariant
  • Multi-Head Attention: Multiple attention perspectives combined
  • Historic Impact: This 2017 paper is the foundation of all modern LLMs

Test Your Knowledge

Q1: What was the key claim of the "Attention Is All You Need" paper?

  β€’ RNNs are sufficient for all tasks
  β€’ Convolutions are better than attention
  β€’ Self-attention alone can effectively model sequences without recurrence
  β€’ Attention should only be used with RNNs

Q2: What is the main advantage of the Transformer architecture?

  β€’ It uses less memory than RNNs
  β€’ It can process all positions in parallel, enabling faster training
  β€’ It doesn't require GPUs
  β€’ It works only for short sequences

Q3: What components make up a Transformer block?

  β€’ Only attention layers
  β€’ Only feedforward layers
  β€’ RNNs and attention
  β€’ Self-attention, feedforward networks, and normalization layers

Q4: Why did the Transformer architecture revolutionize NLP?

  β€’ It enabled efficient training on much larger datasets due to parallelization
  β€’ It eliminated the need for data
  β€’ It made models smaller
  β€’ It only works for translation

Q5: What year was the "Attention Is All You Need" paper published?

  β€’ 2014
  β€’ 2015
  β€’ 2017
  β€’ 2020