Real-world data is messy. Features often have skewed distributions, extreme outliers, or non-normal shapes that violate the assumptions of many ML algorithms. Feature transformation reshapes these distributions to make them better suited to modeling, improving both performance and interpretability.
🎯 What You'll Learn
- Log, square root, and power transformations for skewed data
- Box-Cox and Yeo-Johnson transforms for automatic optimization
- Binning and discretization for creating categorical features
- When and why to apply each transformation technique
Understanding Skewed Distributions
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Generate skewed data
np.random.seed(42)
normal_data = np.random.normal(100, 15, 1000)
right_skewed = np.random.exponential(2, 1000)
left_skewed = 10 - np.random.exponential(2, 1000)
# Calculate skewness
print(f"Normal skewness: {stats.skew(normal_data):.2f}")
print(f"Right-skewed: {stats.skew(right_skewed):.2f}")
print(f"Left-skewed: {stats.skew(left_skewed):.2f}")
# Plot distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(normal_data, bins=30)
axes[0].set_title('Normal (Skew ≈ 0)')
axes[1].hist(right_skewed, bins=30)
axes[1].set_title('Right-Skewed (Skew > 0)')
axes[2].hist(left_skewed, bins=30)
axes[2].set_title('Left-Skewed (Skew < 0)')
plt.tight_layout()
plt.show()
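A common rule of thumb for reading these numbers: |skew| below about 0.5 is roughly symmetric, 0.5 to 1 is moderately skewed, and above 1 is highly skewed. This helps you decide whether a transformation is worth applying at all.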
Log Transformation
Best for: Right-skewed data with positive values
import numpy as np
from scipy import stats
from sklearn.preprocessing import FunctionTransformer
# Right-skewed data (e.g., income, house prices)
data = np.random.exponential(50000, 1000).reshape(-1, 1)
# Apply log transformation
log_transformer = FunctionTransformer(np.log1p, validate=True)
data_log = log_transformer.fit_transform(data)
print(f"Original skewness: {stats.skew(data):.2f}")
print(f"Log-transformed skewness: {stats.skew(data_log):.2f}")
# Manual log transformation
data_log_manual = np.log1p(data) # log1p(x) = log(1 + x), handles zeros
⚠️ Important: Use log1p for Data with Zeros
Use np.log1p(x) instead of np.log(x) to handle zeros safely: log1p computes log(1 + x), avoiding log(0) = -∞. To map values back to the original scale, use its inverse, np.expm1.
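Square Root Transformation
Best for: Moderately right-skewed, non-negative data (e.g., counts)
The objectives above list square root alongside log; here is a minimal sketch using FunctionTransformer (the Poisson count data is a hypothetical example):
import numpy as np
from scipy import stats
from sklearn.preprocessing import FunctionTransformer
# Moderately right-skewed count data (hypothetical example)
counts = np.random.poisson(3, 1000).reshape(-1, 1).astype(float)
sqrt_transformer = FunctionTransformer(np.sqrt, validate=True)
counts_sqrt = sqrt_transformer.fit_transform(counts)
print(f"Original skewness: {stats.skew(counts.ravel()):.2f}")
print(f"Sqrt-transformed skewness: {stats.skew(counts_sqrt.ravel()):.2f}")
The square root is milder than the log, so it is a good first try when the skew is noticeable but not extreme.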
Box-Cox Transformation
Automatically finds the optimal power transformation via maximum-likelihood estimation of its lambda parameter (requires strictly positive values)
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer
# Box-Cox (positive values only)
data_positive = np.random.exponential(50, 1000).reshape(-1, 1)
boxcox_transformer = PowerTransformer(method='box-cox')
data_boxcox = boxcox_transformer.fit_transform(data_positive)
print(f"Original skewness: {stats.skew(data_positive):.2f}")
print(f"Box-Cox skewness: {stats.skew(data_boxcox):.2f}")
print(f"Optimal lambda: {boxcox_transformer.lambdas_[0]:.4f}")
Yeo-Johnson Transformation
Like Box-Cox but handles zero and negative values
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer
# Data with zeros and negatives
data_mixed = np.concatenate([
np.random.exponential(50, 500),
-np.random.exponential(20, 300),
np.zeros(200)
]).reshape(-1, 1)
yeo_transformer = PowerTransformer(method='yeo-johnson')
data_yeo = yeo_transformer.fit_transform(data_mixed)
print(f"Original skewness: {stats.skew(data_mixed):.2f}")
print(f"Yeo-Johnson skewness: {stats.skew(data_yeo):.2f}")
Binning and Discretization
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
# Age data
ages = np.random.randint(18, 80, 1000)
# Custom-edge binning (pass bins=4 instead for true equal-width bins)
age_bins = pd.cut(ages, bins=[18, 30, 45, 60, 80],
                  labels=['Young', 'Adult', 'Middle', 'Senior'],
                  include_lowest=True)  # keep age 18 in the first bin
# Equal-frequency binning (quantiles)
age_quantiles = pd.qcut(ages, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
# K-bins discretizer
discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
ages_discretized = discretizer.fit_transform(ages.reshape(-1, 1))
print("Value counts by bin:")
print(pd.Series(age_bins).value_counts().sort_index())
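To see where the quantile strategy placed its boundaries, you can inspect the fitted discretizer's bin_edges_ attribute (continuing from the snippet above):
# One array of edges per input feature
print("Learned bin edges:", discretizer.bin_edges_[0])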
Practical Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.ensemble import RandomForestRegressor
# Define transformations for different feature types
preprocessor = ColumnTransformer([
    # Yeo-Johnson is a power transform, so the step is named 'power'
    ('power', PowerTransformer(method='yeo-johnson'), ['income', 'price']),
    ('standard', StandardScaler(), ['age', 'experience']),
], remainder='passthrough')
# Complete pipeline
pipeline = Pipeline([
('transform', preprocessor),
('model', RandomForestRegressor())
])
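# NOTE: X_train/y_train are not defined in the lesson. For illustration
# only, build a hypothetical synthetic dataset whose column names match
# the transformer spec above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = pd.DataFrame({
    'income': rng.exponential(50000, 500),
    'price': rng.exponential(200000, 500),
    'age': rng.integers(18, 80, 500),
    'experience': rng.integers(0, 40, 500),
})
y = 0.3 * X['income'] + 100 * X['age'] + rng.normal(0, 5000, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)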
# Train pipeline
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"R² Score: {score:.4f}")
🧠 Knowledge Check
Question 1: When should you use log transformation?
- For normally distributed data
- For right-skewed data with positive values
- For left-skewed data
- For data with negative values
Question 2: What's the advantage of Yeo-Johnson over Box-Cox?
- It's faster
- It's more accurate
- It handles zero and negative values
- It requires less data
Question 3: What does binning/discretization do?
- Converts continuous features into categorical bins
- Removes outliers from data
- Scales features to [0, 1]
- Creates polynomial features
📝 Summary
Key Takeaways
- Log Transform: Use for right-skewed positive data. Use log1p to handle zeros.
- Box-Cox: Automatically finds optimal transformation for positive data.
- Yeo-Johnson: Like Box-Cox but handles zeros and negatives.
- Binning: Convert continuous to categorical for non-linear relationships.
- Always check: Skewness before/after transformation to verify improvement.