🔄 Feature Transformation Techniques

Transform skewed features into better representations for machine learning

📚 Tutorial 5 of 8 ⏱️ 65 minutes 📊 Intermediate

Real-world data is messy. Features often have skewed distributions, extreme outliers, or non-normal shapes that violate the assumptions of many ML algorithms. Feature transformation reshapes these distributions into forms that are easier to model, which can improve both performance and interpretability.

🎯 What You'll Learn

  • Log, square root, and power transformations for skewed data
  • Box-Cox and Yeo-Johnson transforms for automatic optimization
  • Binning and discretization for creating categorical features
  • When and why to apply each transformation technique

Understanding Skewed Distributions

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Generate skewed data
np.random.seed(42)
normal_data = np.random.normal(100, 15, 1000)
right_skewed = np.random.exponential(2, 1000)
left_skewed = 10 - np.random.exponential(2, 1000)

# Calculate skewness
print(f"Normal skewness: {stats.skew(normal_data):.2f}")
print(f"Right-skewed: {stats.skew(right_skewed):.2f}")
print(f"Left-skewed: {stats.skew(left_skewed):.2f}")

# Plot distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(normal_data, bins=30)
axes[0].set_title('Normal (Skew ≈ 0)')
axes[1].hist(right_skewed, bins=30)
axes[1].set_title('Right-Skewed (Skew > 0)')
axes[2].hist(left_skewed, bins=30)
axes[2].set_title('Left-Skewed (Skew < 0)')
plt.tight_layout()
plt.show()
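
A common rule of thumb for reading these numbers: |skew| below 0.5 is roughly symmetric, 0.5–1 is moderately skewed, and above 1 is highly skewed. The helper below (a hypothetical convenience function, reusing the arrays generated above) wraps that interpretation:

def describe_skew(values):
    """Return the skewness of `values` plus a rule-of-thumb label."""
    s = stats.skew(values)
    if abs(s) < 0.5:
        label = 'approximately symmetric'
    elif abs(s) < 1.0:
        label = 'moderately skewed'
    else:
        label = 'highly skewed'
    return s, label

for name, values in [('normal', normal_data), ('right', right_skewed), ('left', left_skewed)]:
    s, label = describe_skew(values)
    print(f"{name}: skew = {s:.2f} ({label})")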

Log Transformation

Best for: Right-skewed data with positive values

import numpy as np
from scipy import stats
from sklearn.preprocessing import FunctionTransformer

# Right-skewed data (e.g., income, house prices)
data = np.random.exponential(50000, 1000).reshape(-1, 1)

# Apply log transformation
log_transformer = FunctionTransformer(np.log1p, validate=True)
data_log = log_transformer.fit_transform(data)

print(f"Original skewness: {stats.skew(data):.2f}")
print(f"Log-transformed skewness: {stats.skew(data_log):.2f}")

# Manual log transformation
data_log_manual = np.log1p(data)  # log1p(x) = log(1 + x), handles zeros

⚠️ Important: Use log1p for Data with Zeros

Use np.log1p(x) instead of np.log(x) to handle zeros safely. log1p computes log(1 + x), avoiding log(0) = -∞.
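
A quick illustration of the difference, plus np.expm1 (the exact inverse of log1p) for mapping transformed values back to the original scale:

import numpy as np

x = np.array([0.0, 1.0, 100.0, 10000.0])
print(np.log(x))              # first entry is -inf (with a divide-by-zero warning)
print(np.log1p(x))            # first entry is 0.0 -- zeros are handled safely
print(np.expm1(np.log1p(x)))  # expm1 undoes log1p, recovering the original values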

Box-Cox Transformation

Automatically finds optimal power transformation (requires strictly positive values)

from sklearn.preprocessing import PowerTransformer

# Box-Cox (positive values only)
data_positive = np.random.exponential(50, 1000).reshape(-1, 1)

boxcox_transformer = PowerTransformer(method='box-cox')
data_boxcox = boxcox_transformer.fit_transform(data_positive)

print(f"Original skewness: {stats.skew(data_positive):.2f}")
print(f"Box-Cox skewness: {stats.skew(data_boxcox):.2f}")
print(f"Optimal lambda: {boxcox_transformer.lambdas_[0]:.4f}")

Yeo-Johnson Transformation

Like Box-Cox but handles zero and negative values

from sklearn.preprocessing import PowerTransformer

# Data with zeros and negatives
data_mixed = np.concatenate([
    np.random.exponential(50, 500),
    -np.random.exponential(20, 300),
    np.zeros(200)
]).reshape(-1, 1)

yeo_transformer = PowerTransformer(method='yeo-johnson')
data_yeo = yeo_transformer.fit_transform(data_mixed)

print(f"Original skewness: {stats.skew(data_mixed):.2f}")
print(f"Yeo-Johnson skewness: {stats.skew(data_yeo):.2f}")

Binning and Discretization

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Age data
ages = np.random.randint(18, 80, 1000)

# Fixed-edge binning with custom labels (include_lowest keeps age 18 in the first bin)
age_bins = pd.cut(ages, bins=[18, 30, 45, 60, 80],
                  labels=['Young', 'Adult', 'Middle', 'Senior'],
                  include_lowest=True)

# Equal-frequency binning (quantiles)
age_quantiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# K-bins discretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
ages_discretized = discretizer.fit_transform(ages.reshape(-1, 1))

print("Value counts by bin:")
print(pd.Series(age_bins).value_counts().sort_index())
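
To see where the quantile strategy placed its cut points, inspect the fitted bin_edges_ attribute; switching encode to 'onehot-dense' returns dummy columns instead of ordinal codes:

print("Quantile bin edges:", discretizer.bin_edges_[0])

onehot = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='quantile')
ages_onehot = onehot.fit_transform(ages.reshape(-1, 1))
print("One-hot shape:", ages_onehot.shape)  # (1000, 5)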

Practical Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic example data -- replace with your own DataFrame containing these columns
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'income': rng.exponential(50000, 1000),
    'price': rng.exponential(300000, 1000),
    'age': rng.integers(18, 80, 1000),
    'experience': rng.integers(0, 40, 1000),
})
y = 0.3 * X['income'] + 100 * X['age'] + rng.normal(0, 5000, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Skew-correct the heavy-tailed columns, standardize the rest
preprocessor = ColumnTransformer([
    ('power', PowerTransformer(method='yeo-johnson'), ['income', 'price']),
    ('standard', StandardScaler(), ['age', 'experience']),
], remainder='passthrough')

# Complete pipeline: transformers are fit on training data only, avoiding leakage
pipeline = Pipeline([
    ('transform', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])

# Train and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"R² Score: {score:.4f}")

🧠 Knowledge Check

Question 1: When should you use log transformation?

  • For normally distributed data
  • For right-skewed data with positive values
  • For left-skewed data
  • For data with negative values

Question 2: What's the advantage of Yeo-Johnson over Box-Cox?

  • It's faster
  • It's more accurate
  • It handles zero and negative values
  • It requires less data

Question 3: What does binning/discretization do?

  • Converts continuous features into categorical bins
  • Removes outliers from data
  • Scales features to [0, 1]
  • Creates polynomial features

📝 Summary

Key Takeaways

  • Log Transform: Use for right-skewed positive data. Use log1p to handle zeros.
  • Box-Cox: Automatically finds optimal transformation for positive data.
  • Yeo-Johnson: Like Box-Cox but handles zeros and negatives.
  • Binning: Convert continuous to categorical for non-linear relationships.
  • Always check: Skewness before/after transformation to verify the improvement (see the sketch below).
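
One way to make that last check routine is a small helper (hypothetical) that reports per-column skewness before and after any transformer, shown here on the mixed-sign data from the Yeo-Johnson example:

def skew_report(transformer, X):
    """Print per-column skewness before and after fitting a transformer."""
    Xt = transformer.fit_transform(X)
    X = np.asarray(X)
    for i in range(X.shape[1]):
        print(f"column {i}: {stats.skew(X[:, i]):.2f} -> {stats.skew(Xt[:, i]):.2f}")

skew_report(PowerTransformer(method='yeo-johnson'), data_mixed)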

🎯 Ready for the Next Challenge?

← Previous: Feature Extraction 📚 Course Hub Next: Feature Selection →