The curse of dimensionality makes many ML algorithms struggle with high-dimensional data: as the number of features grows, distances between points become less informative and models need far more data to generalize (a short demonstration follows the list below). Dimensionality reduction techniques compress data into fewer dimensions while retaining the essential patterns, improving model performance, reducing computational cost, and enabling visualization.
🎯 What You'll Learn
- PCA (Principal Component Analysis) for linear dimensionality reduction
- t-SNE for non-linear visualization of high-dimensional data
- UMAP as a modern, more scalable alternative to t-SNE
- LDA (Linear Discriminant Analysis) for supervised dimensionality reduction
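Before the techniques themselves, a quick illustration of the problem. The sketch below is a minimal, self-contained demonstration on random Gaussian data (not the digits dataset used later): as dimensionality grows, the gap between the nearest and farthest neighbor shrinks relative to the average distance, which is why distance-based methods degrade.
import numpy as np
rng = np.random.default_rng(42)
# As dimensionality grows, pairwise distances "concentrate":
# the nearest and farthest points become almost equally far away.
for d in [2, 10, 100, 1000]:
    X_demo = rng.normal(size=(500, d))                       # 500 random points in d dimensions
    dists = np.linalg.norm(X_demo - X_demo[0], axis=1)[1:]   # distances from the first point
    contrast = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:4d}  relative spread of distances: {contrast:.3f}")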
Principal Component Analysis (PCA)
PCA finds orthogonal directions (principal components) along which the projected data has maximum variance; components are ordered by how much variance they capture. A small eigendecomposition check after the code below makes this concrete.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
# Load high-dimensional data (64 features)
digits = load_digits()
X, y = digits.data, digits.target
# Standardize (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=0.95) # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions: {X_pca.shape[1]}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.4f}")
print(f"Components: {pca.n_components_}")
Scree Plot: Choosing the Number of Components
# Full PCA to see all components
pca_full = PCA()
pca_full.fit(X_scaled)
# Plot explained variance
plt.figure(figsize=(12, 5))
# Individual explained variance (first 20 components)
plt.subplot(1, 2, 1)
plt.bar(range(1, 21), pca_full.explained_variance_ratio_[:20])
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')
# Cumulative variance
plt.subplot(1, 2, 2)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cumsum) + 1), cumsum)  # x-axis starts at 1 component
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Variance')
plt.legend()
plt.tight_layout()
plt.show()
PCA for Visualization
# Reduce to 2D for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%} variance)')
plt.title('PCA: 64D → 2D Visualization')
plt.show()
t-SNE: Non-Linear Dimensionality Reduction
t-SNE is excellent for 2D/3D visualization: it preserves local neighborhood structure, but global distances and cluster sizes in the embedding are not meaningful.
from sklearn.manifold import TSNE
# Apply t-SNE (works better on pre-reduced data)
# First reduce with PCA to ~50 dimensions for speed
pca_pre = PCA(n_components=50)
X_pca_pre = pca_pre.fit_transform(X_scaled)
# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # runs 1000 iterations by default
X_tsne = tsne.fit_transform(X_pca_pre)
# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.title('t-SNE: Digits Dataset')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()
⚠️ t-SNE Limitations
- Slow on large datasets (roughly >10,000 samples)
- Non-deterministic unless random_state is fixed; different runs can produce different layouts
- Cannot transform new data (sklearn's TSNE has no .transform() method)
- Sensitive to hyperparameters, especially perplexity (see the sketch below)
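To see the perplexity sensitivity in practice, here is a small sketch that reuses X_pca_pre and y from the example above and embeds the same data at three perplexity values; the cluster layout can change noticeably between panels.
# Same data, three perplexity values: compare how the layout changes
plt.figure(figsize=(15, 4))
for i, perplexity in enumerate([5, 30, 50], start=1):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=42).fit_transform(X_pca_pre)
    plt.subplot(1, 3, i)
    plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10', s=5, alpha=0.6)
    plt.title(f'perplexity={perplexity}')
plt.tight_layout()
plt.show()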
UMAP: Modern Alternative to t-SNE
# Install: pip install umap-learn
import umap
# Apply UMAP (fixing random_state makes results reproducible, at the cost of parallelism)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.title('UMAP: Digits Dataset')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()
✅ UMAP Advantages Over t-SNE
- Much faster in practice; scales to far larger datasets
- Tends to preserve more global structure than t-SNE while keeping local structure
- Can transform new data with .transform() (see the sketch after this list)
- Generally less sensitive to hyperparameter choices
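A minimal sketch of the .transform() advantage, reusing X_scaled from above (and assuming umap-learn is installed): fit UMAP on a training split, then project held-out samples into the same embedding without refitting.
from sklearn.model_selection import train_test_split
# Fit on training data, then embed unseen samples - something sklearn's TSNE cannot do
X_train, X_test = train_test_split(X_scaled, random_state=42)
reducer_train = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit(X_train)
X_test_umap = reducer_train.transform(X_test)  # new data, no refitting
print(f"Embedded {X_test_umap.shape[0]} unseen samples into {X_test_umap.shape[1]}D")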
Linear Discriminant Analysis (LDA)
LDA is supervised: it finds directions that maximize between-class separation relative to within-class scatter, so it can produce at most n_classes - 1 components (9 for the 10 digit classes).
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# LDA (requires labels)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
print(f"Original dimensions: {X.shape[1]}")
print(f"LDA dimensions: {X_lda.shape[1]}")
print(f"Explained variance ratio: {lda.explained_variance_ratio_}")
# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', alpha=0.6)
plt.colorbar(scatter, label='Digit')
plt.title('LDA: Supervised Dimensionality Reduction')
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.show()
Comparison: PCA vs LDA
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Compare classification performance on different feature representations.
# Note: PCA and LDA are fit on the full dataset before cross-validation, and LDA also
# uses the labels, so these scores are optimistically biased; a leak-free comparison
# would put the reducer inside a Pipeline passed to cross_val_score.
representations = {
    'Original (64D)': X_scaled,
    'PCA (20D)': PCA(n_components=20).fit_transform(X_scaled),
    'LDA (9D)': LinearDiscriminantAnalysis(n_components=9).fit_transform(X_scaled, y)  # max = n_classes - 1
}
rf = RandomForestClassifier(n_estimators=100, random_state=42)
for name, data in representations.items():
    scores = cross_val_score(rf, data, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
🔍 When to Use Each Method
- PCA: Unsupervised, variance-based. Use when labels unavailable or for preprocessing.
- t-SNE: Visualization only. Best for exploring data structure in 2D/3D.
- UMAP: Visualization + transformation. Often preferable to t-SNE, especially for larger datasets or when new data must be embedded.
- LDA: Supervised, class-aware. Use when labels available and classes well-separated.
Complete Pipeline with Dimensionality Reduction
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# Pipeline: Scale → PCA → Classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # keep 95% of variance
    ('classifier', LogisticRegression(max_iter=1000))
])
# Fit once on the full data so we can inspect the fitted PCA step below
pipe.fit(X, y)
# Evaluate: the scaler and PCA are re-fit inside each CV fold, so there is no leakage
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"Pipeline accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Check final dimensionality
n_components = pipe.named_steps['pca'].n_components_
print(f"PCA reduced to {n_components} components")
🧠 Knowledge Check
Question 1: What does PCA maximize?
- Classification accuracy
- Variance along principal components
- Distance between classes
- Correlation between features
Question 2: What is the main limitation of t-SNE?
- Requires labeled data
- Only works with categorical features
- Cannot transform new data and is computationally expensive
- Preserves too much variance
Question 3: How is LDA different from PCA?
- LDA is supervised and maximizes class separability
- LDA is faster than PCA
- LDA works with non-linear relationships
- LDA doesn't require standardization
Question 4: Why is UMAP often preferred over t-SNE?
- UMAP is always more accurate
- UMAP requires less data
- UMAP is older and more tested
- UMAP is faster and can transform new data
📝 Summary
Key Takeaways
- PCA: Linear, unsupervised. Maximizes variance. Use for preprocessing and feature extraction.
- t-SNE: Non-linear visualization. Slow but excellent for exploring data structure.
- UMAP: Modern alternative to t-SNE. Faster, can transform new data, preserves structure.
- LDA: Supervised reduction. Maximizes class separation when labels available.
- Always standardize: scale features before applying dimensionality reduction; this matters most for variance-based methods like PCA and for LDA.