Imagine training a machine learning model to predict house prices using features like square footage (2000 sqft) and number of bedrooms (3). The model might think square footage is 667 times more important simply because the numbers are larger! This is where feature scaling comes in: it ensures all features contribute proportionally to the model's learning process.
🎯 What You'll Learn
- Why feature scaling is critical for many ML algorithms
- Four essential scaling techniques: Standardization, Min-Max, Robust, and Normalization
- Which algorithms require scaling and which don't
- How to prevent data leakage when scaling
- Building scalable preprocessing pipelines
Why Feature Scaling Matters
The Scale Problem
Consider this dataset for house price prediction:
import pandas as pd
df = pd.DataFrame({
'square_feet': [1500, 2000, 1200, 2500],
'bedrooms': [3, 4, 2, 4],
'age_years': [10, 5, 15, 3],
'price': [300000, 400000, 250000, 450000]
})
print(df.describe())
#        square_feet  bedrooms  age_years     price
# mean        1800.0      3.25       8.25  350000.0
# std          571.5      0.96       5.38   91287.1
# min         1200.0      2.00       3.00  250000.0
# max         2500.0      4.00      15.00  450000.0
# (describe() output abridged)
# Problem: Features on vastly different scales!
# - square_feet: 1200 to 2500 (range: 1300)
# - bedrooms: 2 to 4 (range: 2)
# - age_years: 3 to 15 (range: 12)
Impact on Different Algorithms
⚠️ Distance-Based Algorithms (Highly Affected)
K-Nearest Neighbors (KNN), K-Means, SVM: These algorithms calculate distances between points. Features with larger scales dominate the distance calculation.
# Without scaling: Euclidean distance dominated by square_feet
# House A: [1500 sqft, 3 bed]
# House B: [1600 sqft, 4 bed]
# Distance = sqrt((1600-1500)² + (4-3)²)
#          = sqrt(10000 + 1)
#          = sqrt(10001) ≈ 100.005
# Bedroom difference (1) is negligible compared to sqft difference (100)!
⚠️ Gradient-Based Algorithms (Moderately Affected)
Linear Regression, Logistic Regression, Neural Networks: Features on different scales cause uneven gradients, leading to slower convergence and zigzagging optimization paths.
# Without scaling, gradient descent struggles:
# - Large feature values → large gradients → overshooting
# - Small feature values → small gradients → slow learning
# Result: Requires many more iterations to converge
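To make the convergence effect concrete, here's a minimal sketch (synthetic data, not part of the tutorial's housing example) comparing the number of solver iterations LogisticRegression needs with and without scaling; exact counts vary by solver, data, and scikit-learn version.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Synthetic data with one feature on a much larger scale
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=42)
X_demo[:, 0] *= 1000
lr_raw = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)
lr_scaled = LogisticRegression(max_iter=10000).fit(StandardScaler().fit_transform(X_demo), y_demo)
print("Iterations without scaling:", lr_raw.n_iter_[0])
print("Iterations with scaling:   ", lr_scaled.n_iter_[0])
# The scaled version typically converges in noticeably fewer iterations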
✅ Tree-Based Algorithms (Not Affected)
Decision Trees, Random Forests, XGBoost: These algorithms split data based on thresholds, not distances. Feature scales don't matter.
# Tree-based models ask: "Is square_feet > 1800?"
# They don't care if it's 1800 or 0.8 (scaled)
# Splitting logic works regardless of scale
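Because a split only compares a feature against a threshold, rescaling the feature simply rescales the threshold. A quick sketch (synthetic data, not from the tutorial) illustrates this:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
scaler = StandardScaler().fit(X_tr)
tree_raw = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(scaler.transform(X_tr), y_tr)
print("Tree accuracy, unscaled:", tree_raw.score(X_te, y_te))
print("Tree accuracy, scaled:  ", tree_scaled.score(scaler.transform(X_te), y_te))
# The two scores should be essentially identical (tiny differences can come
# from floating-point tie-breaking between equally good splits)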
Demonstration: Scaling Impact on KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
# Create synthetic dataset with features on different scales
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
n_informative=2, random_state=42)
# Scale one feature to be 1000x larger
X[:, 0] = X[:, 0] * 1000
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Test WITHOUT scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)
score_unscaled = knn_unscaled.score(X_test, y_test)
# Test WITH scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
score_scaled = knn_scaled.score(X_test_scaled, y_test)
print(f"KNN Accuracy WITHOUT scaling: {score_unscaled:.4f}")
print(f"KNN Accuracy WITH scaling: {score_scaled:.4f}")
print(f"Improvement: {(score_scaled - score_unscaled) * 100:.2f}%")
# Typical output:
# KNN Accuracy WITHOUT scaling: 0.7533
# KNN Accuracy WITH scaling: 0.9267
# Improvement: 17.34%
Method 1: Standardization (Z-Score Normalization)
Best for: Most algorithms (especially those assuming normally distributed data)
Standardization transforms features to have mean = 0 and standard deviation = 1. Also called "Z-score normalization."
The Formula
# For each feature value x:
# z = (x - mean) / standard_deviation
# Example:
# Feature values: [1, 2, 3, 4, 5]
# Mean = 3, Std = 1.414
# Standardized:
# (1-3)/1.414 = -1.41
# (2-3)/1.414 = -0.71
# (3-3)/1.414 = 0.00
# (4-3)/1.414 = 0.71
# (5-3)/1.414 = 1.41
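As a quick sanity check, you can reproduce these numbers with NumPy; note that np.std defaults to the population standard deviation (ddof=0), which is also what StandardScaler uses.
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)
z = (x - x.mean()) / x.std()
print(z.round(2))  # [-1.41 -0.71  0.    0.71  1.41]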
Implementation with Scikit-learn
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
# Sample data
data = np.array([[1, 2000, 3],
[2, 1500, 2],
[3, 2500, 4],
[4, 1800, 3]])
df = pd.DataFrame(data, columns=['age', 'square_feet', 'bedrooms'])
print("Original data:")
print(df)
print("\nStatistics:")
print(df.describe())
# Create and fit scaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
# Convert back to DataFrame
df_scaled = pd.DataFrame(scaled_data, columns=df.columns)
print("\nStandardized data:")
print(df_scaled)
print("\nStandardized statistics:")
print(df_scaled.describe())
# Output shows:
# - Mean ≈ 0 for all features
# - Std = 1 for all features
# Access scaling parameters
print("\nScaling parameters:")
print(f"Means: {scaler.mean_}")
print(f"Standard deviations: {scaler.scale_}")
Properties of Standardization
📊 Key Characteristics
- Output range: Unbounded (typically -3 to +3 for most data)
- Preserves outliers: They remain outliers, just scaled
- Assumes normality: Works best when data is roughly normally distributed
- Center and scale: Centers around 0, scales by standard deviation
When to Use Standardization
✅ Use when:
- Working with algorithms that benefit from centered, unit-variance features (Logistic Regression, SVM, Neural Networks)
- Features have different units or magnitudes (e.g., meters vs. kilometers, dollars vs. thousands of dollars)
- You want to preserve outlier information
- You're applying PCA or other variance-sensitive techniques
❌ Avoid when:
- Data has extreme outliers (use Robust Scaling instead)
- You need values in a specific range like [0, 1] (use Min-Max instead)
- Using tree-based models only (scaling not needed)
Method 2: Min-Max Scaling
Best for: When you need values in a specific bounded range
Min-Max scaling transforms features to a fixed range, typically [0, 1]. Also called "normalization" (though this term can be ambiguous).
The Formula
# For each feature value x:
# x_scaled = (x - min) / (max - min)
# This maps:
# - minimum value → 0
# - maximum value → 1
# - everything else → proportionally between 0 and 1
# Example:
# Feature values: [10, 20, 30, 40, 50]
# Min = 10, Max = 50
# Scaled:
# (10-10)/(50-10) = 0.00
# (20-10)/(50-10) = 0.25
# (30-10)/(50-10) = 0.50
# (40-10)/(50-10) = 0.75
# (50-10)/(50-10) = 1.00
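Again, a one-line NumPy check reproduces the hand-computed values:
import numpy as np
x = np.array([10, 20, 30, 40, 50], dtype=float)
print((x - x.min()) / (x.max() - x.min()))  # [0.   0.25 0.5  0.75 1.  ]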
Implementation
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
# Sample data
data = pd.DataFrame({
'age': [25, 30, 35, 40, 45],
'salary': [50000, 60000, 70000, 80000, 90000],
'experience': [1, 3, 5, 7, 10]
})
print("Original data:")
print(data)
# Create Min-Max scaler (default range [0, 1])
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
df_scaled = pd.DataFrame(scaled_data, columns=data.columns)
print("\nMin-Max scaled data [0, 1]:")
print(df_scaled)
# Custom range: scale to [-1, 1]
scaler_custom = MinMaxScaler(feature_range=(-1, 1))
scaled_custom = scaler_custom.fit_transform(data)
df_custom = pd.DataFrame(scaled_custom, columns=data.columns)
print("\nMin-Max scaled data [-1, 1]:")
print(df_custom)
# Verify ranges
print("\nVerification:")
print(f"Min values: {df_scaled.min().values}") # All 0.0
print(f"Max values: {df_scaled.max().values}") # All 1.0
Handling New Data
from sklearn.preprocessing import MinMaxScaler
# Training data
train_data = pd.DataFrame({
'temperature': [20, 25, 30, 35, 40]
})
# Fit scaler on training data
scaler = MinMaxScaler()
scaler.fit(train_data)
print(f"Training min: {scaler.data_min_}") # [20.]
print(f"Training max: {scaler.data_max_}") # [40.]
# Transform training data
train_scaled = scaler.transform(train_data)
print("\nTraining data scaled:")
print(train_scaled.ravel()) # [0.0, 0.25, 0.5, 0.75, 1.0]
# New test data (includes values outside training range!)
test_data = pd.DataFrame({
'temperature': [15, 22, 45] # 15 < min, 45 > max
})
# Transform test data using same scaler
test_scaled = scaler.transform(test_data)
print("\nTest data scaled:")
print(test_scaled.ravel())
# Output:
# [-0.25, 0.1, 1.25]
# Note: Values can be outside [0,1] for out-of-range data!
# 15 → -0.25 (below training min)
# 45 → 1.25 (above training max)
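If out-of-range test values are a concern, recent scikit-learn versions (the option was added around version 0.24) let MinMaxScaler clip transformed values to the target range. A short sketch, reusing train_data and test_data from above:
scaler_clipped = MinMaxScaler(clip=True)
scaler_clipped.fit(train_data)
print(scaler_clipped.transform(test_data).ravel())
# [0.  0.1 1. ]  -- 15 and 45 are clipped to the [0, 1] boundaries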
⚠️ Sensitive to Outliers
Min-Max scaling is highly sensitive to outliers. A single extreme value can compress all other values into a tiny range:
# Example: One outlier ruins scaling
data = [10, 20, 30, 40, 1000] # 1000 is outlier
# After Min-Max scaling [0, 1]:
# 10   → 0.000
# 20   → 0.010
# 30   → 0.020
# 40   → 0.030
# 1000 → 1.000
# All normal values compressed to 0.00-0.03 range!
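You can verify the compression yourself with a few lines of code:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
values = np.array([10, 20, 30, 40, 1000], dtype=float).reshape(-1, 1)
print(MinMaxScaler().fit_transform(values).ravel().round(3))
# approximately [0.    0.01  0.02  0.03  1.   ]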
When to Use Min-Max Scaling
✅ Use when:
- You need features in a specific bounded range (e.g., [0, 1] for neural network inputs)
- Data doesn't have significant outliers
- Working with image data (pixel values 0-255 → 0-1)
- Algorithm performs better with bounded inputs (e.g., neural networks with sigmoid/tanh)
❌ Avoid when:
- Data contains outliers (use Robust Scaling)
- Distribution is heavily skewed
- You'll receive test data that might exceed training min/max
Method 3: Robust Scaling
Best for: Data with outliers
Robust scaling uses statistics that are resistant to outliers: median and interquartile range (IQR) instead of mean and standard deviation.
The Formula
# For each feature value x:
# x_scaled = (x - median) / IQR
# Where:
# - median = 50th percentile (Q2)
# - IQR = Q3 - Q1 (75th percentile - 25th percentile)
# Example:
# Data: [1, 2, 3, 4, 5, 100] # 100 is outlier
# Q1 = 2, Median = 3.5, Q3 = 5  (simple quartile convention; sklearn's
#   RobustScaler interpolates percentiles, so its numbers differ slightly)
# IQR = 5 - 2 = 3
# Scaled:
# (1-3.5)/3 = -0.83
# (2-3.5)/3 = -0.50
# (3-3.5)/3 = -0.17
# (4-3.5)/3 = 0.17
# (5-3.5)/3 = 0.50
# (100-3.5)/3 = 32.17 # Outlier still far, but doesn't compress others
Implementation
from sklearn.preprocessing import RobustScaler
import numpy as np
import pandas as pd
# Data with outliers
data = pd.DataFrame({
'salary': [50000, 55000, 60000, 65000, 70000, 500000], # Last value is outlier
'age': [25, 28, 30, 32, 35, 40]
})
print("Original data:")
print(data)
print("\nStatistics:")
print(data.describe())
# Standard Scaler (for comparison)
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
std_scaled = std_scaler.fit_transform(data)
df_std = pd.DataFrame(std_scaled, columns=data.columns)
# Robust Scaler
robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(data)
df_robust = pd.DataFrame(robust_scaled, columns=data.columns)
print("\nStandardScaler (affected by outlier):")
print(df_std['salary'].describe())
print("\nRobustScaler (resistant to outlier):")
print(df_robust['salary'].describe())
# Compare outlier impact
print("\nOutlier scaling comparison:")
print(f"StandardScaler: salary[5] = {df_std['salary'].iloc[5]:.2f}")
print(f"RobustScaler: salary[5] = {df_robust['salary'].iloc[5]:.2f}")
# StandardScaler: outlier might be ~2-3 std devs
# RobustScaler: outlier in terms of IQRs (more interpretable)
Comparison: Robust vs Standard Scaling
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, RobustScaler
import numpy as np
# Create data with outliers
np.random.seed(42)
normal_data = np.random.normal(100, 15, 95)
outliers = np.array([300, 350, 400, 450, 500])
data = np.concatenate([normal_data, outliers]).reshape(-1, 1)
# Apply both scalers
std_scaler = StandardScaler()
robust_scaler = RobustScaler()
data_std = std_scaler.fit_transform(data)
data_robust = robust_scaler.fit_transform(data)
# Compare results
print("Original data range:", data.min(), "to", data.max())
print("\nStandardScaler:")
print(f" Normal data range: {data_std[:95].min():.2f} to {data_std[:95].max():.2f}")
print(f" Outliers range: {data_std[95:].min():.2f} to {data_std[95:].max():.2f}")
print("\nRobustScaler:")
print(f" Normal data range: {data_robust[:95].min():.2f} to {data_robust[:95].max():.2f}")
print(f" Outliers range: {data_robust[95:].min():.2f} to {data_robust[95:].max():.2f}")
# RobustScaler keeps normal data in a more reasonable range
# while still flagging outliers as extreme
When to Use Robust Scaling
✅ Use when:
- Data contains outliers that you want to preserve but not let dominate
- Distribution is skewed
- You want scaling that's resistant to extreme values
- Financial data, sensor data, or any domain with natural outliers
❌ Avoid when:
- Data is clean with no outliers (StandardScaler is simpler)
- You need a specific bounded range (use MinMaxScaler)
- Outliers are actually errors that should be removed first
Method 4: Normalization (L1, L2)
Best for: Text processing, when direction matters more than magnitude
Normalization scales individual samples (rows) to have unit norm, not features (columns). It's about making vectors have length 1.
The Concept
# L2 Normalization (Euclidean norm)
# For each sample (row) x = [x1, x2, ..., xn]:
# x_normalized = x / ||x||₂
# where ||x||₂ = sqrt(x1² + x2² + ... + xn²)
# L1 Normalization (Manhattan norm)
# x_normalized = x / ||x||₁
# where ||x||₁ = |x1| + |x2| + ... + |xn|
# Example with L2:
# Sample: [3, 4]
# L2 norm = sqrt(3² + 4²) = sqrt(9 + 16) = 5
# Normalized: [3/5, 4/5] = [0.6, 0.8]
# Verify: sqrt(0.6² + 0.8²) = sqrt(0.36 + 0.64) = 1.0 ✓
Implementation
from sklearn.preprocessing import Normalizer
import numpy as np
# Sample data (each row is a sample)
data = np.array([
[3, 4], # Sample 1
[1, 1], # Sample 2
[5, 12] # Sample 3
])
print("Original data:")
print(data)
# Calculate L2 norms manually
l2_norms = np.sqrt((data ** 2).sum(axis=1))
print("\nL2 norms per sample:")
print(l2_norms) # [5.0, 1.414, 13.0]
# L2 Normalization
normalizer_l2 = Normalizer(norm='l2')
data_l2 = normalizer_l2.fit_transform(data)
print("\nL2 normalized:")
print(data_l2)
# Verify: each row should have norm = 1
l2_norms_after = np.sqrt((data_l2 ** 2).sum(axis=1))
print("\nL2 norms after normalization:")
print(l2_norms_after) # All approximately 1.0
# L1 Normalization
normalizer_l1 = Normalizer(norm='l1')
data_l1 = normalizer_l1.fit_transform(data)
print("\nL1 normalized:")
print(data_l1)
# For L1: each row sums to 1
l1_sums = np.abs(data_l1).sum(axis=1)
print("\nL1 sums after normalization:")
print(l1_sums) # All 1.0
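If you don't need a fitted estimator object (for example, outside a Pipeline), the normalize() helper from sklearn.preprocessing gives the same results in a single call:
from sklearn.preprocessing import normalize
print(normalize(data, norm='l2'))  # same as data_l2 above
print(normalize(data, norm='l1'))  # same as data_l1 above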
Use Case: Text Processing (TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
# Documents
documents = [
"machine learning is awesome",
"deep learning requires GPUs",
"machine learning and deep learning"
]
# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print("TF-IDF matrix shape:", tfidf_matrix.shape)
print("\nFeature names:", vectorizer.get_feature_names_out())
# TF-IDF already applies L2 normalization by default!
print("\nRow norms (should be 1.0):")
import numpy as np
for i in range(tfidf_matrix.shape[0]):
row_norm = np.sqrt(tfidf_matrix[i].multiply(tfidf_matrix[i]).sum())
print(f"Document {i+1}: {row_norm:.4f}")
# Each document vector has unit length
# This means cosine similarity = dot product
# Useful for comparing document similarity!
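A quick check of that claim, reusing tfidf_matrix from above (cosine_similarity comes from sklearn.metrics.pairwise):
from sklearn.metrics.pairwise import cosine_similarity
dot_products = tfidf_matrix.dot(tfidf_matrix.T).toarray()
cosines = cosine_similarity(tfidf_matrix)
print(np.allclose(dot_products, cosines))  # True -- dot products equal cosine similarities here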
Normalization vs Standardization
📊 Key Differences
| Aspect | Standardization | Normalization |
|---|---|---|
| Operates on | Features (columns) | Samples (rows) |
| Goal | Mean=0, Std=1 | Vector length=1 |
| Use case | Make features comparable | Make samples comparable |
| Best for | Most ML algorithms | Text, cosine similarity |
When to Use Normalization
✅ Use when:
- Working with text data (TF-IDF, word embeddings)
- Using cosine similarity or dot products
- Direction of data matters more than magnitude
- Building recommendation systems
❌ Avoid when:
- You need to standardize features across samples (use StandardScaler)
- Magnitude information is important
- Working with typical tabular data
Preventing Data Leakage in Scaling
⚠️ Critical: Fit Only on Training Data!
The most common mistake in feature scaling is fitting the scaler on the entire dataset (train + test). This causes data leakage because the test set influences the scaling parameters.
Wrong Way (Data Leakage)
# ❌ WRONG: Fitting on all data
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X = df[features]
y = df['target']
# BAD: Scale entire dataset first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ← Uses test data statistics!
# Then split
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42
)
# Problem: Test set statistics influenced training data scaling
# Model has indirectly "seen" test data distribution
Right Way (No Leakage)
# ✅ RIGHT: Split first, then fit on training only
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X = df[features]
y = df['target']
# GOOD: Split first
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Fit scaler on training data only
scaler = StandardScaler()
scaler.fit(X_train)  # ← Learn parameters from training only
# Transform both sets using training parameters
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Now test set is truly "unseen"
Using Pipelines (Recommended)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Create pipeline (handles fit/transform automatically)
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression())
])
# Cross-validation with pipeline
# Pipeline ensures scaling is done correctly in each fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {scores}")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Pipeline automatically:
# 1. Fits scaler on training folds
# 2. Transforms training folds
# 3. Transforms validation fold (using training stats)
# No leakage!
Cross-Validation Leakage Example
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
# Create sample data
np.random.seed(42)
X = np.random.randn(1000, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# ❌ WRONG: Scale before cross-validation
scaler = StandardScaler()
X_scaled_wrong = scaler.fit_transform(X) # All data
model_wrong = LogisticRegression()
scores_wrong = cross_val_score(model_wrong, X_scaled_wrong, y, cv=5)
# ✅ RIGHT: Use pipeline for cross-validation
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
scores_right = cross_val_score(pipeline, X, y, cv=5)
print("Wrong way (with leakage):")
print(f" Mean accuracy: {scores_wrong.mean():.4f}")
print("\nRight way (no leakage):")
print(f" Mean accuracy: {scores_right.mean():.4f}")
# Wrong way often shows inflated scores due to leakage
Choosing the Right Scaling Method
📋 Decision Flowchart
- Are you using tree-based models only?
  - YES → No scaling needed
- Is your data text/documents?
  - YES → L2 Normalization (often built into TF-IDF)
- Does your data have significant outliers?
  - YES → Robust Scaling
- Do you need a bounded range [0, 1]?
  - YES → Min-Max Scaling
- Otherwise:
  - Use Standardization (the default choice)
Algorithm-Specific Requirements
# Summary table
import pandas as pd
scaling_guide = pd.DataFrame({
'Algorithm': [
'Linear/Logistic Regression',
'SVM',
'K-Nearest Neighbors',
'K-Means Clustering',
'Neural Networks',
'PCA',
'Decision Trees',
'Random Forest',
'XGBoost/LightGBM',
'Naive Bayes'
],
'Needs Scaling': [
'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes',
'No', 'No', 'No', 'Depends'
],
'Recommended Method': [
'StandardScaler',
'StandardScaler',
'StandardScaler or MinMaxScaler',
'StandardScaler',
'MinMaxScaler or StandardScaler',
'StandardScaler',
'None',
'None',
'None',
'MinMaxScaler (for continuous features)'
]
})
print(scaling_guide.to_string(index=False))
Practical Comparison
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
import pandas as pd
# Create test data with outliers
np.random.seed(42)
data = pd.DataFrame({
'normal': np.random.normal(100, 15, 100),
'with_outlier': list(np.random.normal(100, 15, 95)) + [500, 600, 700, 800, 900]
})
# Apply all scalers
scalers = {
'Original': None,
'StandardScaler': StandardScaler(),
'MinMaxScaler': MinMaxScaler(),
'RobustScaler': RobustScaler()
}
results = {}
for name, scaler in scalers.items():
if scaler is None:
results[name] = data
else:
scaled = scaler.fit_transform(data)
results[name] = pd.DataFrame(scaled, columns=data.columns)
# Compare statistics
print("Comparison of scaling methods:\n")
for name, df in results.items():
print(f"{name}:")
print(df.describe()[['normal', 'with_outlier']].loc[['mean', 'std', 'min', 'max']])
print()
# Key observations:
# - StandardScaler: Mean ≈ 0, std = 1 for both
# - MinMaxScaler: Both in [0, 1], but outliers compress normal data
# - RobustScaler: Resistant to outliers in 'with_outlier' column
Building a Complete Preprocessing Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
# Sample dataset
np.random.seed(42)
df = pd.DataFrame({
'age': np.random.randint(18, 80, 1000),
'salary': np.random.normal(60000, 20000, 1000),
'credit_score': np.random.randint(300, 850, 1000),
'num_loans': np.random.randint(0, 5, 1000),
'debt': np.concatenate([
np.random.normal(10000, 5000, 950),
np.random.normal(100000, 20000, 50) # Some with high debt (outliers)
]),
'approved': np.random.binomial(1, 0.6, 1000)
})
# Add some missing values
df.loc[np.random.choice(df.index, 50), 'salary'] = np.nan
df.loc[np.random.choice(df.index, 30), 'debt'] = np.nan
print("Dataset shape:", df.shape)
print("\nMissing values:")
print(df.isnull().sum())
# Define feature groups
standard_features = ['age', 'salary', 'credit_score']
minmax_features = ['num_loans']
robust_features = ['debt'] # Has outliers
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('standard', Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
]), standard_features),
('minmax', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', MinMaxScaler())
]), minmax_features),
('robust', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', RobustScaler())
]), robust_features)
]
)
# Split data
X = df.drop('approved', axis=1)
y = df['approved']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create full pipeline with model
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegression(random_state=42))
])
# Train
full_pipeline.fit(X_train, y_train)
# Evaluate
train_score = full_pipeline.score(X_train, y_train)
test_score = full_pipeline.score(X_test, y_test)
print(f"\nTraining accuracy: {train_score:.4f}")
print(f"Test accuracy: {test_score:.4f}")
# Save pipeline
import joblib
joblib.dump(full_pipeline, 'preprocessing_pipeline.pkl')
# Load and use
loaded_pipeline = joblib.load('preprocessing_pipeline.pkl')
new_data = pd.DataFrame({
'age': [35],
'salary': [75000],
'credit_score': [720],
'num_loans': [2],
'debt': [15000]
})
prediction = loaded_pipeline.predict(new_data)
print(f"\nPrediction for new data: {prediction[0]}")
🧠 Knowledge Check
Question 1: Which algorithms are most affected by unscaled features?
Question 2: What does StandardScaler transform features to?
Question 3: When should you use RobustScaler instead of StandardScaler?
Question 4: What's the main risk of MinMaxScaler?
Question 5: When should you fit the scaler?
Question 6: What does L2 normalization do?
Question 7: Which scaling method is best for neural network input layers?
Question 8: Why don't Decision Trees need feature scaling?
💻 Practice Exercises
Exercise 1: Impact of Scaling on KNN
Compare KNN performance with and without scaling:
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load wine dataset
wine = load_wine()
X, y = wine.data, wine.target
# TODO:
# 1. Split into train/test (80/20)
# 2. Train KNN (k=5) without scaling
# 3. Train KNN with StandardScaler
# 4. Compare test accuracies
# 5. Try different values of k (3, 5, 7, 10)
# 6. Plot accuracy vs k for scaled and unscaled
Exercise 2: Scaling Method Comparison
Compare all scaling methods on data with outliers:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Create data with outliers
np.random.seed(42)
normal_data = np.random.normal(50, 10, 95)
outliers = [150, 200, 250, 300, 350]
data = np.concatenate([normal_data, outliers]).reshape(-1, 1)
# TODO:
# 1. Apply StandardScaler, MinMaxScaler, RobustScaler
# 2. For each method, calculate:
# - Range of normal data (indices 0:95)
# - Range of outliers (indices 95:100)
# 3. Visualize with box plots or histograms
# 4. Which method best preserves normal data range?
# 5. Which method is most affected by outliers?
Exercise 3: Preventing Data Leakage
Fix this leaky preprocessing code:
# WRONG implementation with leakage
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Scale entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Leakage!
# Cross-validation
model = LogisticRegression()
scores = cross_val_score(model, X_scaled, y, cv=5)
# TODO:
# 1. Identify the data leakage problem
# 2. Fix it using Pipeline
# 3. Compare scores before and after fix
# 4. Explain why scores might differ
Exercise 4: Multi-Feature Scaling Pipeline
Build a pipeline with different scalers for different feature types:
# Dataset features:
# - age: normal distribution, no outliers β StandardScaler
# - salary: has outliers β RobustScaler
# - num_transactions: bounded [0, 100] β MinMaxScaler
# - credit_score: normal, range [300-850] β StandardScaler
# TODO:
# 1. Create ColumnTransformer with appropriate scalers
# 2. Build Pipeline with preprocessor + classifier
# 3. Evaluate with cross-validation
# 4. Compare to using StandardScaler for all features
# 5. Save and load the pipeline
Exercise 5: Neural Network Scaling
Compare scaling methods for neural networks:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# TODO:
# 1. Load or create a classification dataset
# 2. Train neural network with no scaling
# 3. Train with StandardScaler
# 4. Train with MinMaxScaler
# 5. Compare:
# - Training convergence speed (n_iter_)
# - Final accuracy
# - Loss curves
# 6. Which scaling method works best for neural networks?
📝 Summary
Feature scaling is essential for many machine learning algorithms to perform optimally:
Key Takeaways
- StandardScaler (Z-score): Most common method. Transforms to mean=0, std=1. Use for most algorithms that aren't tree-based. Best when data is roughly normally distributed.
- MinMaxScaler: Scales to bounded range [0,1] or custom range. Good for neural networks. Highly sensitive to outliersβavoid if data has extreme values.
- RobustScaler: Uses median and IQR instead of mean and std. Resistant to outliers. Best choice when data contains significant outliers you want to preserve.
- Normalizer (L1/L2): Scales individual samples to unit norm, not features. Used for text data, cosine similarity, and when direction matters more than magnitude.
- Prevention of Data Leakage: ALWAYS fit scalers on training data only, then transform both train and test. Use Pipeline for automatic handling.
- Algorithm Requirements: Distance-based (KNN, SVM, K-Means) and gradient-based (Linear models, Neural Nets) algorithms need scaling. Tree-based models (RF, XGBoost) don't.
🎯 Quick Reference
- Default choice: StandardScaler
- Have outliers? RobustScaler
- Need [0,1] range? MinMaxScaler
- Text data? L2 Normalization
- Trees only? No scaling needed
In the next tutorial, we'll explore Feature Extraction & Creation: generating new features from existing ones. See you there!
🎯 Ready for the Next Challenge?
Continue your feature engineering journey