Real-world data is messy. Features often have skewed distributions, extreme outliers, or non-normal shapes that violate the assumptions of many ML algorithms. Feature transformation reshapes these distributions to make them better suited to modeling, improving both performance and interpretability.
🎯 What You'll Learn
- Log, square root, and power transformations for skewed data
- Box-Cox and Yeo-Johnson transforms for automatic optimization
- Binning and discretization for creating categorical features
- When and why to apply each transformation technique
Understanding Skewed Distributions
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Generate skewed data
np.random.seed(42)
normal_data = np.random.normal(100, 15, 1000)
right_skewed = np.random.exponential(2, 1000)
left_skewed = 10 - np.random.exponential(2, 1000)
# Calculate skewness
print(f"Normal skewness: {stats.skew(normal_data):.2f}")
print(f"Right-skewed: {stats.skew(right_skewed):.2f}")
print(f"Left-skewed: {stats.skew(left_skewed):.2f}")
# Plot distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(normal_data, bins=30)
axes[0].set_title('Normal (Skew ≈ 0)')
axes[1].hist(right_skewed, bins=30)
axes[1].set_title('Right-Skewed (Skew > 0)')
axes[2].hist(left_skewed, bins=30)
axes[2].set_title('Left-Skewed (Skew < 0)')
plt.tight_layout()
plt.show()
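A common rule of thumb for reading these numbers: |skew| below about 0.5 is roughly symmetric, 0.5 to 1 is moderately skewed, and above 1 is highly skewed. This helps you decide whether a transformation is worth applying at all.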
Log Transformation
Best for: Right-skewed data with positive values
import numpy as np
from scipy import stats
from sklearn.preprocessing import FunctionTransformer
# Right-skewed data (e.g., income, house prices)
data = np.random.exponential(50000, 1000).reshape(-1, 1)
# Apply log transformation
log_transformer = FunctionTransformer(np.log1p, validate=True)
data_log = log_transformer.fit_transform(data)
print(f"Original skewness: {stats.skew(data):.2f}")
print(f"Log-transformed skewness: {stats.skew(data_log):.2f}")
# Manual log transformation
data_log_manual = np.log1p(data) # log1p(x) = log(1 + x), handles zeros
⚠️ Important: Use log1p for Data with Zeros
Use np.log1p(x) instead of np.log(x) to handle zeros safely: log1p computes log(1 + x), avoiding log(0) = -∞. To map values back to the original scale, use its inverse, np.expm1.
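Square Root Transformation
Best for: Moderately right-skewed, non-negative data (e.g., counts)
The objectives above list square root alongside log; here is a minimal sketch using FunctionTransformer (the Poisson count data is a hypothetical example):
import numpy as np
from scipy import stats
from sklearn.preprocessing import FunctionTransformer
# Moderately right-skewed count data (hypothetical example)
counts = np.random.poisson(3, 1000).reshape(-1, 1).astype(float)
sqrt_transformer = FunctionTransformer(np.sqrt, validate=True)
counts_sqrt = sqrt_transformer.fit_transform(counts)
print(f"Original skewness: {stats.skew(counts.ravel()):.2f}")
print(f"Sqrt-transformed skewness: {stats.skew(counts_sqrt.ravel()):.2f}")
The square root is milder than the log, so it is a good first try when the skew is noticeable but not extreme.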
Box-Cox Transformation
Automatically finds the optimal power transformation via maximum-likelihood estimation of its lambda parameter (requires strictly positive values)
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer
# Box-Cox (positive values only)
data_positive = np.random.exponential(50, 1000).reshape(-1, 1)
boxcox_transformer = PowerTransformer(method='box-cox')
data_boxcox = boxcox_transformer.fit_transform(data_positive)
print(f"Original skewness: {stats.skew(data_positive):.2f}")
print(f"Box-Cox skewness: {stats.skew(data_boxcox):.2f}")
print(f"Optimal lambda: {boxcox_transformer.lambdas_[0]:.4f}")
Yeo-Johnson Transformation
Like Box-Cox but handles zero and negative values
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer
# Data with zeros and negatives
data_mixed = np.concatenate([
np.random.exponential(50, 500),
-np.random.exponential(20, 300),
np.zeros(200)
]).reshape(-1, 1)
yeo_transformer = PowerTransformer(method='yeo-johnson')
data_yeo = yeo_transformer.fit_transform(data_mixed)
print(f"Original skewness: {stats.skew(data_mixed):.2f}")
print(f"Yeo-Johnson skewness: {stats.skew(data_yeo):.2f}")
Binning and Discretization
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
# Age data
ages = np.random.randint(18, 80, 1000)
# Custom-edge binning (pass bins=4 instead for true equal-width bins)
age_bins = pd.cut(ages, bins=[18, 30, 45, 60, 80],
                  labels=['Young', 'Adult', 'Middle', 'Senior'],
                  include_lowest=True)  # keep age 18 in the first bin
# Equal-frequency binning (quantiles)
age_quantiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
# K-bins discretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
ages_discretized = discretizer.fit_transform(ages.reshape(-1, 1))
print("Value counts by bin:")
print(pd.Series(age_bins).value_counts().sort_index())
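To see where the quantile strategy placed its boundaries, you can inspect the fitted discretizer's bin_edges_ attribute (continuing from the snippet above):
# One array of edges per input feature
print("Learned bin edges:", discretizer.bin_edges_[0])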
Practical Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.ensemble import RandomForestRegressor
# Define transformations for different feature types
preprocessor = ColumnTransformer([
    # Yeo-Johnson is a power transform, so the step is named 'power'
    ('power', PowerTransformer(method='yeo-johnson'), ['income', 'price']),
    ('standard', StandardScaler(), ['age', 'experience']),
], remainder='passthrough')
# Complete pipeline
pipeline = Pipeline([
('transform', preprocessor),
('model', RandomForestRegressor())
])
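# NOTE: X_train/y_train are not defined in the lesson. For illustration
# only, build a hypothetical synthetic dataset whose column names match
# the transformer spec above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = pd.DataFrame({
    'income': rng.exponential(50000, 500),
    'price': rng.exponential(200000, 500),
    'age': rng.integers(18, 80, 500),
    'experience': rng.integers(0, 40, 500),
})
y = 0.3 * X['income'] + 100 * X['age'] + rng.normal(0, 5000, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)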
# Train pipeline
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"R² Score: {score:.4f}")
🧠 Knowledge Check
Question 1: When should you use log transformation?
- For normally distributed data
- For right-skewed data with positive values
- For left-skewed data
- For data with negative values
Question 2: What's the advantage of Yeo-Johnson over Box-Cox?
- It's faster
- It's more accurate
- It handles zero and negative values
- It requires less data
Question 3: What does binning/discretization do?
- Converts continuous features into categorical bins
- Removes outliers from data
- Scales features to [0, 1]
- Creates polynomial features
📝 Summary
Key Takeaways
- Log Transform: Use for right-skewed positive data. Use log1p to handle zeros.
- Box-Cox: Automatically finds optimal transformation for positive data.
- Yeo-Johnson: Like Box-Cox but handles zeros and negatives.
- Binning: Convert continuous to categorical for non-linear relationships.
- Always check: Skewness before/after transformation to verify improvement.