Understand the revolutionary 2017 paper that changed AI forever
In June 2017, researchers at Google published "Attention Is All You Need" - one of the most impactful papers in AI history. Its core claim was radical: you don't need recurrence. Pure attention is sufficient for sequence modeling.
To appreciate how radical this claim was, consider the state of AI in 2016-2017: sequence modeling was dominated by recurrent networks (LSTMs and GRUs) that processed tokens one at a time, with attention used only as an add-on to help those models.
The paper was authored by eight researchers at Google Brain, Google Research, and the University of Toronto.
💡 Fun Fact: The paper title "Attention Is All You Need" was inspired by The Beatles song "All You Need Is Love". The authors wanted to convey that attention alone, without recurrence or convolutions, was sufficient for state-of-the-art results.
📅 June 2017: Paper published on arXiv
📅 December 2017: Presented at NeurIPS (then called NIPS)
📅 October 2018: BERT released (transformer encoder for understanding)
📅 February 2019: GPT-2 released (transformer decoder for generation)
📅 June 2020: GPT-3 (175B parameters) - transformers scaled massively
📅 2021-2023: Transformer variants dominate all of AI (vision, audio, code, proteins)
📅 2023-2024: GPT-4, Claude 3, Gemini, Llama 3 - all built on transformer architecture
🚀 Result: From zero to dominating AI in just seven years. This is one of the fastest paradigm shifts in the history of computer science.
Instead of processing sequences step-by-step (like RNNs), transformers process all tokens at once using attention - a mechanism that learns which tokens are important for understanding each position.
RNN Approach:
h₁ = RNN(token₁, h₀)
h₂ = RNN(token₂, h₁)  ← Must wait for h₁
h₃ = RNN(token₃, h₂)  ← Must wait for h₂
Sequential! ⏳
Attention Approach:
attention_weights = compute_attention(all_tokens)
output = weighted_sum(all_tokens)  ← All at once! 🚀
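To make this contrast concrete, here is a minimal sketch (toy shapes and untrained weights, both assumed purely for illustration) of a sequential RNN loop versus a single attention-style matrix product:

import torch
import torch.nn as nn

# Toy setup: one sequence of 6 token vectors, 16 dimensions each
seq_len, d = 6, 16
x = torch.randn(1, seq_len, d)

# RNN-style: a Python loop where each step must wait for the previous hidden state
rnn_cell = nn.GRUCell(d, d)
h = torch.zeros(1, d)
for t in range(seq_len):                      # inherently sequential
    h = rnn_cell(x[0, t].unsqueeze(0), h)

# Attention-style: one batched matrix product covers every pair of positions
scores = x @ x.transpose(-2, -1) / d ** 0.5   # (1, 6, 6) similarity of every token to every other
weights = scores.softmax(dim=-1)              # each row sums to 1
context = weights @ x                         # (1, 6, 16) all positions updated at once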
This simple change unlocked massive parallelization and better long-range learning.
Attention is deceptively simple: for each token, compute how important every other token is, then take a weighted average. This simple idea replaces the complex sequential processing of RNNs.
Intuition: When you read a sentence, your brain doesn't process each word in isolation. You dynamically focus on relevant words based on what you're trying to understand.
"The cat sat on the mat because it was comfortable."
Question: What does "it" refer to?
To answer, you instinctively weigh the surrounding words and settle on the most plausible referent. This selective focusing is what attention mechanisms formalize mathematically!
For each position i in the sequence:
Step 1: Compute Similarity Scores
- Compare token i to ALL other tokens (including itself)
- Result: "How relevant is each token to understanding token i?"
- Output: Raw attention scores [score₁, score₂, ..., scoreₙ]
Step 2: Normalize with Softmax
- Convert scores to probabilities (sum to 1)
- Output: Attention weights [weight₁, weight₂, ..., weightₙ]
- High weight = very relevant, Low weight = less relevant
Step 3: Weighted Combination
- Take weighted sum of token representations
- Relevant tokens contribute more to the result
- Output: Context-aware representation
Step 4: Apply to All Positions
- Repeat for every position simultaneously
- Fully parallelizable!
- Output: Entire sequence with context
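As a compact code view of these four steps, here is a minimal self-attention sketch (random vectors stand in for real token embeddings; the shapes are assumptions for illustration):

import torch
import torch.nn.functional as F

seq_len, d = 5, 32
tokens = torch.randn(seq_len, d)              # stand-in token representations

# Step 1: similarity of every position to every other position
scores = tokens @ tokens.T                    # (5, 5) raw attention scores

# Step 2: normalize each row into probabilities
# (the full formula also divides scores by sqrt(d); that detail is covered below)
weights = F.softmax(scores, dim=-1)           # each row sums to 1

# Steps 3-4: weighted combination, computed for all positions at once
context = weights @ tokens                    # (5, 32) context-aware representations
print(weights.sum(dim=-1))                    # all ones - every row is a probability distribution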
Let's trace attention step-by-step for the sentence: "Alice helped Bob because she cared"
Processing "she" (position 4):
Step 1: Compute Similarities
Token     Similarity Score
Alice     0.85   ← High (likely referent, female name)
helped    0.12   ← Low (verb, not relevant)
Bob       0.31   ← Medium (possible, but the name suggests male)
because   0.03   ← Very low (connector word)
she       0.45   ← Medium (self-reference)
cared     0.18   ← Low (verb)
Step 2: Apply Softmax (normalize to probabilities; the weights below are illustrative)
Token     Attention Weight
Alice     0.62   ← Highest weight (62%)
helped    0.05
Bob       0.18
because   0.01
she       0.11
cared     0.03
Total:    1.00   ← Sums to 100%
Step 3: Weighted Combination
representation("she") =
0.62 × embedding("Alice") +
0.18 × embedding("Bob") +
0.11 × embedding("she") +
0.05 × embedding("helped") +
... (small contributions from others)
Result: "she" representation strongly influenced by "Alice"
Here's the precise mathematical definition:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Breaking down each component:
🎯 Key Insight: Q, K, V are all learned linear projections of the input. The model learns what queries to ask, what keys to match, and what values to return. This is all trained end-to-end with backpropagation!
The Q, K, V terminology comes from information retrieval (like database queries):
Like a search query: "Find me documents about cats"
In attention: "What information do I need to understand this token?"
Like document tags: "This document is about [animals, pets]"
In attention: "What does this token represent?"
Like document content: "Here's the actual information"
In attention: "What information does this token carry?"
Database Analogy: Imagine searching a database where queries match against keys, and the best matches return their associated values. That's exactly what attention does, but learned and differentiable!
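To make the analogy concrete, here is a small sketch contrasting a hard dictionary lookup with the soft, differentiable lookup attention performs (the query, keys, and values are made up for illustration):

import torch
import torch.nn.functional as F

# Hard lookup: a query either matches a key exactly or returns nothing
database = {"cats": "feline facts", "dogs": "canine facts"}
print(database["cats"])                        # exact match only

# Soft lookup: the query matches every key to some degree,
# and the result is a weighted blend of ALL the values
query  = torch.tensor([1.0, 0.0])              # what am I looking for?
keys   = torch.tensor([[0.9, 0.1],             # key for item 0 (close to the query)
                       [0.1, 0.9]])            # key for item 1 (far from the query)
values = torch.tensor([[10.0, 0.0],            # information carried by item 0
                       [0.0, 10.0]])           # information carried by item 1

weights = F.softmax(keys @ query, dim=-1)      # how well the query matches each key
result  = weights @ values                     # mostly item 0's value, a little of item 1's
print(weights, result)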
The paper introduced "Scaled Dot-Product Attention" - the specific attention variant that became the industry standard. Let's implement it from scratch and understand every detail.
Here's a complete, production-quality implementation with detailed explanations:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class ScaledDotProductAttention(nn.Module):
"""
Scaled Dot-Product Attention from 'Attention Is All You Need'
    Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V
"""
def __init__(self, dropout=0.1):
super().__init__()
self.dropout = nn.Dropout(dropout)
def forward(self, query, key, value, mask=None):
"""
Args:
query: (batch, seq_len_q, d_k)
key: (batch, seq_len_k, d_k)
value: (batch, seq_len_v, d_v) where seq_len_v == seq_len_k
            mask: broadcastable to (batch, seq_len_q, seq_len_k) - optional, for padding/causality
Returns:
output: (batch, seq_len_q, d_v)
attention_weights: (batch, seq_len_q, seq_len_k)
"""
# Get dimension for scaling
d_k = query.size(-1)
# Step 1: Compute attention scores (QK^T)
# Shape: (batch, seq_len_q, seq_len_k)
scores = torch.matmul(query, key.transpose(-2, -1))
# Step 2: Scale by βd_k
# Why? Prevent dot products from getting too large
# (which would push softmax into regions with tiny gradients)
scores = scores / math.sqrt(d_k)
# Step 3: Apply mask (if provided)
if mask is not None:
# Set masked positions to large negative value
            # After softmax, these become ≈ 0
scores = scores.masked_fill(mask == 0, -1e9)
# Step 4: Apply softmax to get probabilities
# Each row sums to 1.0
attention_weights = F.softmax(scores, dim=-1)
# Step 5: Apply dropout (optional, for regularization)
attention_weights = self.dropout(attention_weights)
# Step 6: Multiply by values
# Shape: (batch, seq_len_q, d_v)
output = torch.matmul(attention_weights, value)
return output, attention_weights
# Example usage
batch_size = 2
seq_len = 4
d_model = 512
d_k = 64 # Typically d_model // num_heads
# Create random Q, K, V matrices
Q = torch.randn(batch_size, seq_len, d_k)
K = torch.randn(batch_size, seq_len, d_k)
V = torch.randn(batch_size, seq_len, d_k)
# Apply attention
attention = ScaledDotProductAttention()
output, weights = attention(Q, K, V)
print(f"Input Q shape: {Q.shape}") # torch.Size([2, 4, 64])
print(f"Output shape: {output.shape}") # torch.Size([2, 4, 64])
print(f"Weights shape: {weights.shape}") # torch.Size([2, 4, 4])
print(f"\nAttention weights (batch 0, first token):")
print(weights[0, 0, :]) # Shows which tokens attended to
# Example: tensor([0.32, 0.18, 0.41, 0.09]) - token 0 attends most to token 2
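The optional mask argument is what makes this module usable for autoregressive (left-to-right) generation. Continuing the example above, here is a brief sketch of passing a causal mask:

# Causal mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)  # (1, 4, 4), broadcasts over the batch

attention.eval()  # disable dropout so the returned weights are exact probabilities
masked_output, masked_weights = attention(Q, K, V, mask=causal_mask)
print(masked_weights[0])           # upper-triangular entries are ~0: future tokens are hidden
print(masked_weights.sum(dim=-1))  # each row still sums to 1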
Why do we divide by √d_k? This is a crucial detail often overlooked:
Dot products between high-dimensional vectors can become very large:
# Example: Why scaling matters
import torch
import torch.nn.functional as F
d_k = 512 # High dimensional
# Generate random vectors with entries drawn from N(0, 1)
q = torch.randn(1, d_k)
k = torch.randn(1, d_k)
# Compute dot product
score = torch.matmul(q, k.T).item()
print(f"Unscaled score: {score:.2f}") # Often 15-30 in magnitude
# Apply softmax
softmax_input = torch.tensor([score, score * 0.5, score * 0.3])
probs = F.softmax(softmax_input, dim=-1)
print(f"Softmax probs: {probs}")
# Result: tensor([0.9998, 0.0002, 0.0000])
# Nearly all weight on first token - gradient β 0!
# With scaling
scaled_score = score / (d_k ** 0.5)
print(f"\nScaled score: {scaled_score:.2f}") # Much smaller
softmax_input_scaled = torch.tensor([scaled_score, scaled_score * 0.5, scaled_score * 0.3])
probs_scaled = F.softmax(softmax_input_scaled, dim=-1)
print(f"Softmax probs (scaled): {probs_scaled}")
# Result: tensor([0.5763, 0.2719, 0.1518])
# More balanced distribution - better gradients!
🎯 Key Insight: Without scaling, large dot products push softmax into saturation regions where gradients vanish. Scaling by √d_k keeps scores in a reasonable range where gradients flow well.
Why specifically √d_k and not just d_k?
If Q and K have independent entries with mean 0 and variance 1:
Dot product: Q·K = Σ (q_i × k_i) for i = 1 to d_k
Expected value: E[Q·K] = 0 (since entries have mean 0)
Variance: Var(Q·K) = Σ Var(q_i × k_i) = d_k
Standard deviation: σ(Q·K) = √d_k
By dividing by √d_k, we normalize the dot product to have variance 1:
σ(Q·K / √d_k) = σ(Q·K) / √d_k = √d_k / √d_k = 1
Result: Scaled dot products have consistent variance regardless of dimension, keeping softmax inputs in the optimal range for gradient flow.
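A quick empirical check of this derivation (a sketch; the sample count is arbitrary):

import torch

d_k = 512
n_samples = 10_000

# Entries drawn i.i.d. from N(0, 1), so E[Q·K] = 0 and Var(Q·K) = d_k
q = torch.randn(n_samples, d_k)
k = torch.randn(n_samples, d_k)

dots = (q * k).sum(dim=-1)                # one dot product per sample
print(dots.var().item())                  # ≈ 512 (= d_k)
print((dots / d_k ** 0.5).var().item())   # ≈ 1 after scaling by √d_k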
import torch
import torch.nn as nn
import torch.nn.functional as F
# Sentence: "The cat sat on the mat"
tokens = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = len(tokens)
d_model = 8 # Small for illustration
# Simulated token embeddings (normally from embedding layer)
x = torch.randn(1, seq_len, d_model)
# Create Q, K, V projections (learned linear layers)
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)
# Project inputs to Q, K, V
Q = W_q(x) # (1, 6, 8)
K = W_k(x) # (1, 6, 8)
V = W_v(x) # (1, 6, 8)
# Compute attention
d_k = d_model
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5) # (1, 6, 6)
attention_weights = F.softmax(scores, dim=-1)
output = torch.matmul(attention_weights, V) # (1, 6, 8)
# Visualize attention for token 1 ("cat")
print("Attention weights for 'cat' (token 1):")
print("Token | Attention")
print("-------|----------")
for i, token in enumerate(tokens):
weight = attention_weights[0, 1, i].item()
bar = "β" * int(weight * 50)
print(f"{token:6} | {weight:.3f} {bar}")
# Example output:
# Token | Attention
# -------|----------
# The    | 0.142 ███████
# cat    | 0.284 ██████████████
# sat    | 0.198 █████████
# on     | 0.089 ████
# the    | 0.125 ██████
# mat    | 0.162 ████████
Attention weights reveal what the model considers important. Common patterns include:
- Attention to self: tokens attend to themselves (the diagonal of the attention matrix is high). Example: "running" attends to itself to maintain word identity.
- Syntactic attention: tokens attend to syntactically related words. Example: verbs attend to their subjects and objects.
- Positional attention: tokens attend to nearby positions. Example: words attend to immediate neighbors for local context.
- Semantic attention: tokens attend based on meaning. Example: "king" attends to "queen", "throne", "crown".
- Short gradient paths: attention is a single matrix multiplication, giving a direct path from any token to any other, so gradients flow well.
- Full parallelism: all tokens are processed simultaneously, which maps efficiently onto GPUs.
- Long-range dependencies: direct connections between distant tokens mean no information is squeezed through a sequential bottleneck.
- Learned focus: attention weights are learned during training, so the model learns what to focus on.
Single attention is powerful, but the paper found that multiple attention heads work even better. Each head can learn different patterns.
Analogy: When understanding a sentence, humans consider multiple aspects simultaneously - grammatical structure, word meaning, tone, and references to earlier context.
Multi-head attention lets the model learn these different perspectives and combine them!
import torch
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadAttention(nn.Module):
"""
Multi-Head Attention from 'Attention Is All You Need'
Key idea: Instead of one attention, run h parallel attention heads,
then concatenate and project the results.
"""
def __init__(self, d_model=512, num_heads=8, dropout=0.1):
super().__init__()
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads # Dimension per head
# Linear projections for Q, K, V (for all heads combined)
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
# Output projection
self.W_o = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def split_heads(self, x):
"""
Split the last dimension into (num_heads, d_k)
x: (batch, seq_len, d_model)
Returns: (batch, num_heads, seq_len, d_k)
"""
batch_size, seq_len, d_model = x.size()
x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
return x.transpose(1, 2) # (batch, num_heads, seq_len, d_k)
def combine_heads(self, x):
"""
Combine heads back
x: (batch, num_heads, seq_len, d_k)
Returns: (batch, seq_len, d_model)
"""
batch_size, num_heads, seq_len, d_k = x.size()
x = x.transpose(1, 2) # (batch, seq_len, num_heads, d_k)
return x.contiguous().view(batch_size, seq_len, self.d_model)
def forward(self, query, key, value, mask=None):
"""
Args:
query, key, value: (batch, seq_len, d_model)
mask: (batch, 1, seq_len, seq_len) or None
Returns:
output: (batch, seq_len, d_model)
attention_weights: (batch, num_heads, seq_len, seq_len)
"""
batch_size = query.size(0)
# 1. Linear projections
Q = self.W_q(query) # (batch, seq_len, d_model)
K = self.W_k(key)
V = self.W_v(value)
# 2. Split into multiple heads
Q = self.split_heads(Q) # (batch, num_heads, seq_len, d_k)
K = self.split_heads(K)
V = self.split_heads(V)
# 3. Scaled dot-product attention for each head
scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
if mask is not None:
scores = scores.masked_fill(mask == 0, -1e9)
attention_weights = F.softmax(scores, dim=-1)
attention_weights = self.dropout(attention_weights)
# 4. Apply attention to values
context = torch.matmul(attention_weights, V) # (batch, num_heads, seq_len, d_k)
# 5. Concatenate heads
context = self.combine_heads(context) # (batch, seq_len, d_model)
# 6. Final linear projection
output = self.W_o(context)
return output, attention_weights
# Example usage
batch_size = 2
seq_len = 10
d_model = 512
num_heads = 8
x = torch.randn(batch_size, seq_len, d_model)
mha = MultiHeadAttention(d_model, num_heads)
output, weights = mha(x, x, x) # Self-attention
print(f"Input shape: {x.shape}") # torch.Size([2, 10, 512])
print(f"Output shape: {output.shape}") # torch.Size([2, 10, 512])
print(f"Weights shape: {weights.shape}") # torch.Size([2, 8, 10, 10])
#   ↑ 8 heads, each with a 10×10 attention matrix
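Continuing the example above, a quick way to peek at what each head is doing is to print one row of every head's attention matrix:

# Each head produces its own seq_len × seq_len attention pattern
mha.eval()  # disable dropout so the weights are exact probabilities
_, weights = mha(x, x, x)
for h in range(num_heads):
    # attention distribution of the first token (first batch item) under head h
    print(f"Head {h}:", weights[0, h, 0])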
Research has shown that different heads learn different linguistic patterns:
Example from BERT: Analyzing Attention Heads on "The cat sat on the mat"
- One head tracks grammatical roles: sat → cat (subject), sat → mat (object)
- Another head links nouns to their determiners: cat → The, mat → the
- Another head focuses on local context: each token → its neighbors
- Another head captures semantic relations: cat → sat (action), cat → mat (location)
🎯 Key Finding: Different heads specialize! The model automatically learns to use different heads for different types of relationships. This emergent behavior wasn't explicitly programmed - it arose from training.
With a single head, the model can't simultaneously capture syntax AND semantics in one set of attention weights.
With multiple heads, it can: Head 1 = syntax, Head 2 = semantics, Head 3 = position, etc.
Empirical Results: The original paper found 8 heads optimal for its setup. Modern large models use many more heads at much larger dimensions (GPT-3, for example, uses 96 heads per layer).
The paper combined attention with several other key components to create the full transformer:
TRANSFORMER ARCHITECTURE

Input: "The cat sat on the mat"
        ↓
TOKEN EMBEDDINGS (learned lookup table)
    "The" → [0.12, -0.34, 0.56, ...]
    "cat" → [-0.23, 0.67, -0.12, ...]
        ↓
+ POSITIONAL ENCODING (position info)
    Position 0: [sin(...), cos(...), ...]
    Position 1: [sin(...), cos(...), ...]
        ↓
ENCODER STACK (6 identical layers)
    Layer 1:
        Multi-Head Self-Attention (each token attends to all tokens)
            ↓ + Residual & LayerNorm
        Feed-Forward Network: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
            ↓ + Residual & LayerNorm
    Layers 2-6: (same structure)
        ↓
DECODER STACK (6 identical layers)
    Layer 1:
        Masked Multi-Head Self-Attention (causal: can't see future tokens)
            ↓ + Residual & LayerNorm
        Multi-Head Cross-Attention (decoder attends to encoder output)
            ↓ + Residual & LayerNorm
        Feed-Forward Network
            ↓ + Residual & LayerNorm
    Layers 2-6: (same structure)
        ↓
LINEAR + SOFTMAX
    Project to vocabulary size (50K dims)
    Softmax → probability distribution
        ↓
Output: Next token probabilities
8 parallel attention heads learn different patterns (syntax, semantics, position, etc.)
Original paper: 8 heads, 64 dims each
Sinusoidal functions encode position, since attention by itself is permutation-invariant (see the sketch after this overview)
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
2-layer MLP applied to each position independently. Adds non-linearity and expressiveness
FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂
Add the input to the output: y = x + Sublayer(x). Helps gradients flow through the deep stack (12 layers in the original model)
Essential for training deep networks
Normalize activations to mean 0, variance 1 within each layer. Stabilizes training
LN(x) = γ(x - μ)/σ + β
In the decoder, mask future positions so token i can only attend to positions ≤ i
Ensures autoregressive generation
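Here is a minimal sketch of the sinusoidal positional encoding formula above (the sizes are chosen only for illustration):

import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    position = torch.arange(max_len).unsqueeze(1)                   # (max_len, 1)
    div_term = 10000 ** (torch.arange(0, d_model, 2) / d_model)     # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)                    # even dimensions
    pe[:, 1::2] = torch.cos(position / div_term)                    # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # torch.Size([50, 512])
# These encodings are simply added to the token embeddings before the first layer:
# x = token_embeddings + pe[:seq_len]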
A fair objection: computing attention between all pairs of tokens is O(n²), which seems costly for long sequences:
⚠️ Concern: If the sequence length is 1,000, attention computes 1,000² = 1M pairwise scores
But modern GPUs excel at exactly this kind of dense matrix operation, so in practice it is often faster in wall-clock time than sequential processing.
Plus, later work introduced workarounds such as sparse, local, and linear attention variants that reduce the quadratic cost for very long sequences.
Before transformers, sequence modeling was dominated by RNNs. The insight that you can remove recurrence entirely and still work better was counterintuitive and transformative.
Before (2016): RNN variants with tricks to handle sequences
After (2017): Pure attention, no recurrence needed
This paradigm shift enabled scaling to billions of parameters and unlocked modern AI.
Q1: What was the key claim of the "Attention Is All You Need" paper?
Q2: What is the main advantage of the Transformer architecture?
Q3: What components make up a Transformer block?
Q4: Why did the Transformer architecture revolutionize NLP?
Q5: What year was the "Attention Is All You Need" paper published?