Manual feature engineering is time-consuming and requires domain expertise. Automated feature engineering uses algorithms to systematically discover, create, and select features, often finding patterns humans would miss while dramatically reducing development time.
🎯 What You'll Learn
- Advanced Featuretools with custom primitives and deep feature synthesis
- Genetic algorithms for automated feature selection
- AutoML feature engineering with TPOT
- Production pipelines and best practices for scaling
Featuretools: Deep Feature Synthesis
# Install: pip install featuretools
import featuretools as ft
import pandas as pd
# Sample relational data
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'join_date': pd.to_datetime(['2020-01-01', '2020-03-15', '2020-06-10']),
'age': [25, 45, 35]
})
transactions = pd.DataFrame({
'transaction_id': [1, 2, 3, 4, 5],
'customer_id': [1, 1, 2, 2, 3],
'amount': [50, 100, 75, 200, 30],
'timestamp': pd.to_datetime([
'2021-01-15', '2021-02-20',
'2021-01-10', '2021-03-05',
'2021-02-18'
])
})
# Create EntitySet
es = ft.EntitySet(id='customer_data')
es = es.add_dataframe(
dataframe_name='customers',
dataframe=customers,
index='customer_id',
time_index='join_date'
)
es = es.add_dataframe(
dataframe_name='transactions',
dataframe=transactions,
index='transaction_id',
time_index='timestamp'
)
# Define relationship
es = es.add_relationship('customers', 'customer_id',
'transactions', 'customer_id')
# Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name='customers',
max_depth=2,
verbose=True
)
print(f"Generated {len(feature_defs)} features")
print(feature_matrix.head())
Custom Aggregation Primitives
from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema
# Define custom primitive (Featuretools 1.x declares input/return types with Woodwork column schemas)
class Variance(AggregationPrimitive):
    """Calculate variance of a numeric column"""
    name = 'variance'
    input_types = [ColumnSchema(semantic_tags={'numeric'})]
    return_type = ColumnSchema(semantic_tags={'numeric'})

    def get_function(self):
        def variance(x):
            return x.var()
        return variance
# Use custom primitive
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name='customers',
agg_primitives=['sum', 'mean', 'max', 'min', Variance],
max_depth=2
)
print(f"Features with custom primitive: {len(feature_defs)}")
Feature Selection with Featuretools
from featuretools.selection import remove_low_information_features
# Remove features that carry little information (e.g., constant or all-null columns)
feature_matrix_filtered = remove_low_information_features(feature_matrix)
print(f"Original features: {feature_matrix.shape[1]}")
print(f"After filtering: {feature_matrix_filtered.shape[1]}")
# Manual feature importance ranking
from sklearn.ensemble import RandomForestClassifier
# Prepare data (assuming we have a target variable)
X = feature_matrix_filtered.fillna(0)
y = [0, 1, 0]  # Toy labels, one per customer
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importances
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 10 important features:")
print(feature_importance.head(10))
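As a follow-up, the ranking can be used to keep only the highest-scoring generated features; a minimal sketch, where the cutoff of 10 is an arbitrary illustrative choice:
# Keep only the top-ranked features (arbitrary cutoff for illustration)
top_features = feature_importance['feature'].head(10).tolist()
X_top = X[top_features]
print(f"Reduced feature matrix shape: {X_top.shape}")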
Genetic Algorithms for Feature Selection
# Install: pip install sklearn-genetic
from genetic_selection import GeneticSelectionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
# Load data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Genetic algorithm for feature selection
estimator = RandomForestClassifier(n_estimators=50, random_state=42)
selector = GeneticSelectionCV(
estimator=estimator,
cv=5,
n_generations=20,
n_population=50,
crossover_proba=0.5,
mutation_proba=0.2,
n_gen_no_change=10,
verbose=1
)
selector.fit(X_train, y_train)
print(f"\nSelected {selector.n_features_} features from {X.shape[1]}")
print(f"Selected feature indices: {list(selector.support_)}")
# Evaluate
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
estimator.fit(X_train_selected, y_train)
score = estimator.score(X_test_selected, y_test)
print(f"Test accuracy: {score:.4f}")
TPOT: AutoML Feature Engineering
# Install: pip install tpot
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
digits.data, digits.target, test_size=0.3, random_state=42
)
# TPOT automatically finds best pipeline
# (includes feature engineering + model selection)
tpot = TPOTClassifier(
generations=5,
population_size=20,
cv=5,
random_state=42,
verbosity=2,
n_jobs=-1
)
tpot.fit(X_train, y_train)
score = tpot.score(X_test, y_test)
print(f"\nTPOT Test Accuracy: {score:.4f}")
# Export best pipeline
tpot.export('best_pipeline.py')
print("Best pipeline exported to 'best_pipeline.py'")
📊 TPOT Pipeline Example Output
TPOT might discover a pipeline like:
- PCA(n_components=10)
- PolynomialFeatures(degree=2)
- SelectKBest(k=50)
- GradientBoostingClassifier()
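For illustration, an exported best_pipeline.py for a pipeline of that shape might look roughly like the sketch below (hypothetical code, not actual TPOT output):
# Hypothetical sketch of an exported pipeline (not actual TPOT output)
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import GradientBoostingClassifier
exported_pipeline = make_pipeline(
    PCA(n_components=10),
    PolynomialFeatures(degree=2),
    SelectKBest(f_classif, k=50),
    GradientBoostingClassifier(random_state=42)
)
# exported_pipeline.fit(X_train, y_train)
# predictions = exported_pipeline.predict(X_test)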
Production-Ready Feature Engineering Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import joblib
# Define preprocessing
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment_status']
preprocessor = ColumnTransformer([
('num', Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=0.95))
]), numeric_features),
('cat', OneHotEncoder(drop='first', handle_unknown='ignore'),
categorical_features)
])
# Complete pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train (example)
# pipeline.fit(X_train, y_train)
# Save the pipeline (in practice, fit first so the saved artifact is ready to serve)
joblib.dump(pipeline, 'feature_engineering_pipeline.pkl')
# Load and use
# loaded_pipeline = joblib.load('feature_engineering_pipeline.pkl')
# predictions = loaded_pipeline.predict(X_new)
Monitoring Feature Quality in Production
import pandas as pd
from scipy import stats

class FeatureMonitor:
    """Monitor feature distributions for drift detection"""

    def __init__(self, reference_data):
        """Store reference data and summary statistics"""
        self.reference_data = reference_data
        self.reference_stats = {}
        for col in reference_data.columns:
            self.reference_stats[col] = {
                'mean': reference_data[col].mean(),
                'std': reference_data[col].std(),
                'min': reference_data[col].min(),
                'max': reference_data[col].max()
            }

    def check_drift(self, new_data, threshold=0.05):
        """Check for distribution drift using the Kolmogorov-Smirnov test"""
        drift_detected = {}
        for col in new_data.columns:
            if col not in self.reference_stats:
                continue
            # KS test compares the reference and new distributions
            ks_stat, p_value = stats.ks_2samp(
                self.reference_data[col],
                new_data[col]
            )
            drift_detected[col] = {
                'drift': p_value < threshold,
                'p_value': p_value,
                'ks_statistic': ks_stat
            }
        return drift_detected

    def get_summary(self, new_data):
        """Compare reference and new statistics side by side"""
        summary = []
        for col in new_data.columns:
            if col not in self.reference_stats:
                continue
            ref = self.reference_stats[col]
            summary.append({
                'feature': col,
                'ref_mean': ref['mean'],
                'new_mean': new_data[col].mean(),
                'ref_std': ref['std'],
                'new_std': new_data[col].std()
            })
        return pd.DataFrame(summary)
# Usage (assuming X_train and X_new are DataFrames with matching columns)
# monitor = FeatureMonitor(X_train)
# drift_report = monitor.check_drift(X_new)
# summary = monitor.get_summary(X_new)
Best Practices for Automated Feature Engineering
✅ Do's
- Start simple: Try manual features first to understand the data
- Use domain knowledge: Guide automated tools with relevant constraints
- Validate thoroughly: Test on held-out data, check for leakage
- Monitor in production: Track feature distributions for drift
- Version pipelines: Save preprocessing + model together
⚠️ Don'ts
- Don't blindly trust: Verify generated features make logical sense
- Avoid feature explosion: Limit depth and aggregations to prevent overfitting (see the sketch after this list)
- Don't ignore compute costs: Deep synthesis can generate thousands of features
- Don't skip interpretability: Some features may be hard to explain to stakeholders
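To keep deep feature synthesis in check, the search itself can be constrained; a minimal sketch reusing the EntitySet from earlier, where the primitive list and the max_features cap are illustrative choices, not recommendations:
# Constrain DFS to avoid feature explosion (illustrative settings)
feature_matrix_small, feature_defs_small = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['mean', 'count'],  # fewer aggregations
    trans_primitives=[],               # no transform primitives
    max_depth=1,                       # no stacked primitives
    max_features=25                    # hard cap on generated features
)
print(f"Constrained run generated {len(feature_defs_small)} features")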
🧠 Knowledge Check
Question 1: What is Deep Feature Synthesis in Featuretools?
- A deep learning technique
- Automated creation of features through stacking primitives
- A type of dimensionality reduction
- A feature selection method
Question 2: How do genetic algorithms perform feature selection?
- Evolve feature subsets through crossover and mutation
- Use gradient descent to optimize features
- Apply PCA transformations
- Remove correlated features
Question 3: What does TPOT optimize?
- Only feature selection
- Only model hyperparameters
- Entire ML pipeline including preprocessing and models
- Only data cleaning steps
Question 4: Why monitor features in production?
- To reduce storage costs
- To increase training speed
- To improve accuracy
- To detect distribution drift and data quality issues
📝 Summary
Key Takeaways
- Featuretools: Automates feature generation through deep feature synthesis and custom primitives.
- Genetic Algorithms: Evolve optimal feature subsets through natural selection principles.
- TPOT: AutoML tool that optimizes entire pipelines including feature engineering.
- Production pipelines: Version, save, and monitor feature transformations.
- Balance automation with understanding: Automated tools accelerate development but require validation and domain knowledge.
🎓 Congratulations!
You've completed the Feature Engineering course! You now have the skills to:
- ✅ Preprocess and clean data effectively
- ✅ Encode categorical variables appropriately
- ✅ Scale and normalize features for different algorithms
- ✅ Extract and create powerful domain-specific features
- ✅ Transform skewed distributions
- ✅ Select the most informative features
- ✅ Apply dimensionality reduction techniques
- ✅ Leverage automated feature engineering tools
Ready to apply these skills? Check out the course projects and earn your certificate!