🤖 Automated Feature Engineering

Let algorithms discover powerful features automatically

📚 Tutorial 8 of 8 ⏱️ 85 minutes 🚀 Advanced

Manual feature engineering is time-consuming and requires domain expertise. Automated feature engineering uses algorithms to systematically discover, create, and select features, often finding patterns humans would miss while dramatically reducing development time.

🎯 What You'll Learn

  • Advanced Featuretools with custom primitives and deep feature synthesis
  • Genetic algorithms for automated feature selection
  • AutoML feature engineering with TPOT
  • Production pipelines and best practices for scaling

Featuretools: Deep Feature Synthesis

# Install: pip install featuretools
import featuretools as ft
import pandas as pd

# Sample relational data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'join_date': pd.to_datetime(['2020-01-01', '2020-03-15', '2020-06-10']),
    'age': [25, 45, 35]
})

transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'customer_id': [1, 1, 2, 2, 3],
    'amount': [50, 100, 75, 200, 30],
    'timestamp': pd.to_datetime([
        '2021-01-15', '2021-02-20', 
        '2021-01-10', '2021-03-05', 
        '2021-02-18'
    ])
})

# Create EntitySet
es = ft.EntitySet(id='customer_data')
es = es.add_dataframe(
    dataframe_name='customers',
    dataframe=customers,
    index='customer_id',
    time_index='join_date'
)
es = es.add_dataframe(
    dataframe_name='transactions',
    dataframe=transactions,
    index='transaction_id',
    time_index='timestamp'
)

# Define relationship
es = es.add_relationship('customers', 'customer_id', 
                         'transactions', 'customer_id')

# Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    max_depth=2,
    verbose=True
)

print(f"Generated {len(feature_defs)} features")
print(feature_matrix.head())

Custom Aggregation Primitives

from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema

# Define custom primitive (Featuretools 1.x API, which describes
# column types with Woodwork ColumnSchema objects)
class Variance(AggregationPrimitive):
    """Calculate variance of a numeric column"""
    name = 'variance'
    input_types = [ColumnSchema(semantic_tags={'numeric'})]
    return_type = ColumnSchema(semantic_tags={'numeric'})
    
    def get_function(self):
        def variance(x):
            return x.var()
        return variance

# Use custom primitive
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['sum', 'mean', 'max', 'min', Variance],
    max_depth=2
)

print(f"Features with custom primitive: {len(feature_defs)}")

Feature Selection with Featuretools

from featuretools.selection import remove_low_information_features

# Remove features that carry little information
# (e.g. constant columns or columns where every value is unique)
feature_matrix_filtered = remove_low_information_features(feature_matrix)

print(f"Original features: {feature_matrix.shape[1]}")
print(f"After filtering: {feature_matrix_filtered.shape[1]}")

# Rank features by importance with a random forest
from sklearn.ensemble import RandomForestClassifier

# Prepare data (assumes a target variable is available; the labels
# below are made up for the three example customers)
X = feature_matrix_filtered.fillna(0)
y = [0, 1, 0]  # Example labels

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 important features:")
print(feature_importance.head(10))

Genetic Algorithms for Feature Selection

# Install: pip install sklearn-genetic
from genetic_selection import GeneticSelectionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Genetic algorithm for feature selection
estimator = RandomForestClassifier(n_estimators=50, random_state=42)

selector = GeneticSelectionCV(
    estimator=estimator,
    cv=5,
    n_generations=20,
    n_population=50,
    crossover_proba=0.5,
    mutation_proba=0.2,
    n_gen_no_change=10,
    verbose=1
)

selector.fit(X_train, y_train)

print(f"\nSelected {selector.n_features_} features from {X.shape[1]}")
print(f"Selected feature indices: {list(selector.support_)}")

# Evaluate
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

estimator.fit(X_train_selected, y_train)
score = estimator.score(X_test_selected, y_test)
print(f"Test accuracy: {score:.4f}")

TPOT: AutoML Feature Engineering

# Install: pip install tpot
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

# TPOT automatically finds best pipeline
# (includes feature engineering + model selection)
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    cv=5,
    random_state=42,
    verbosity=2,
    n_jobs=-1
)

tpot.fit(X_train, y_train)
score = tpot.score(X_test, y_test)

print(f"\nTPOT Test Accuracy: {score:.4f}")

# Export best pipeline
tpot.export('best_pipeline.py')
print("Best pipeline exported to 'best_pipeline.py'")

📊 TPOT Pipeline Example Output

TPOT might discover a pipeline like this one, sketched in code after the list:

  • PCA(n_components=10)
  • PolynomialFeatures(degree=2)
  • SelectKBest(k=50)
  • GradientBoostingClassifier()
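
The exported best_pipeline.py is a plain scikit-learn script. Below is a hypothetical sketch of what it could contain for a pipeline like the one above; the actual steps and hyperparameters depend entirely on your search run.

from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical pipeline resembling TPOT's exported output for the steps above
exported_pipeline = make_pipeline(
    PCA(n_components=10),                     # compress to 10 components
    PolynomialFeatures(degree=2),             # add interaction and squared terms
    SelectKBest(score_func=f_classif, k=50),  # keep the 50 most predictive features
    GradientBoostingClassifier(random_state=42)
)

# exported_pipeline.fit(X_train, y_train)
# predictions = exported_pipeline.predict(X_test)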

Production-Ready Feature Engineering Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import joblib

# Define preprocessing
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment_status']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=0.95))
    ]), numeric_features),
    
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), 
     categorical_features)
])

# Complete pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train (example)
# pipeline.fit(X_train, y_train)

# Save pipeline (in practice, fit it first so the saved artifact is ready to predict)
joblib.dump(pipeline, 'feature_engineering_pipeline.pkl')

# Load and use
# loaded_pipeline = joblib.load('feature_engineering_pipeline.pkl')
# predictions = loaded_pipeline.predict(X_new)

Monitoring Feature Quality in Production

import pandas as pd
from scipy import stats

class FeatureMonitor:
    """Monitor feature distributions for drift detection"""
    
    def __init__(self, reference_data):
        """Store reference statistics"""
        self.reference_stats = {}
        for col in reference_data.columns:
            self.reference_stats[col] = {
                'mean': reference_data[col].mean(),
                'std': reference_data[col].std(),
                'min': reference_data[col].min(),
                'max': reference_data[col].max()
            }
    
    def check_drift(self, new_data, threshold=0.05):
        """Check for distribution drift using Kolmogorov-Smirnov test"""
        drift_detected = {}
        
        for col in new_data.columns:
            if col not in self.reference_stats:
                continue
                
            # KS test for distribution change
            ks_stat, p_value = stats.ks_2samp(
                self.reference_data[col],
                new_data[col]
            )
            
            drift_detected[col] = {
                'drift': p_value < threshold,
                'p_value': p_value,
                'ks_statistic': ks_stat
            }
        
        return drift_detected
    
    def get_summary(self, new_data):
        """Compare statistics"""
        summary = []
        for col in new_data.columns:
            if col not in self.reference_stats:
                continue
            
            ref = self.reference_stats[col]
            summary.append({
                'feature': col,
                'ref_mean': ref['mean'],
                'new_mean': new_data[col].mean(),
                'ref_std': ref['std'],
                'new_std': new_data[col].std()
            })
        
        return pd.DataFrame(summary)

# Usage (assuming X_train and X_new are pandas DataFrames with matching columns)
# monitor = FeatureMonitor(X_train)
# drift_report = monitor.check_drift(X_new)
# summary = monitor.get_summary(X_new)

Best Practices for Automated Feature Engineering

✅ Do's

  • Start simple: Try manual features first to understand the data
  • Use domain knowledge: Guide automated tools with relevant constraints
  • Validate thoroughly: Test on held-out data and check for leakage (see the cutoff-time sketch after this list)
  • Monitor in production: Track feature distributions for drift
  • Version pipelines: Save preprocessing + model together
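
Featuretools can also help with the leakage check itself: passing cutoff times to dfs restricts each customer's aggregations to data observed up to that point in time. A minimal sketch, reusing the es EntitySet from earlier (the cutoff dates are made up for illustration):

import pandas as pd
import featuretools as ft

# One cutoff time per customer: features are built only from rows
# whose time index falls at or before that customer's cutoff
cutoff_times = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'time': pd.to_datetime(['2021-02-01', '2021-02-01', '2021-03-01'])
})

feature_matrix_cut, feature_defs_cut = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    cutoff_time=cutoff_times,
    max_depth=2
)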

⚠️ Don'ts

  • Don't blindly trust: Verify generated features make logical sense
  • Avoid feature explosion: Limit depth and aggregations to prevent overfitting (see the constrained dfs sketch after this list)
  • Don't ignore compute costs: Deep synthesis can generate thousands of features
  • Don't skip interpretability: Some features may be hard to explain to stakeholders
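
To keep the feature space (and compute cost) in check, constrain dfs directly. A minimal sketch, again assuming the es EntitySet from earlier:

import featuretools as ft

# Restrict depth, primitives, and total feature count so deep feature
# synthesis stays interpretable and cheap to compute
feature_matrix_small, feature_defs_small = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['mean', 'sum', 'count'],
    trans_primitives=['month'],   # only a simple datetime transform
    max_depth=1,                  # no stacked (depth-2) features
    max_features=100              # hard cap on the number of features returned
)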

🧠 Knowledge Check

Question 1: What is Deep Feature Synthesis in Featuretools?

  • A deep learning technique
  • Automated creation of features through stacking primitives
  • A type of dimensionality reduction
  • A feature selection method

Question 2: How do genetic algorithms perform feature selection?

  • Evolve feature subsets through crossover and mutation
  • Use gradient descent to optimize features
  • Apply PCA transformations
  • Remove correlated features

Question 3: What does TPOT optimize?

  • Only feature selection
  • Only model hyperparameters
  • Entire ML pipeline including preprocessing and models
  • Only data cleaning steps

Question 4: Why monitor features in production?

  • To reduce storage costs
  • To increase training speed
  • To improve accuracy
  • To detect distribution drift and data quality issues

📝 Summary

Key Takeaways

  • Featuretools: Automates feature generation through deep feature synthesis and custom primitives.
  • Genetic Algorithms: Evolve optimal feature subsets through natural selection principles.
  • TPOT: AutoML tool that optimizes entire pipelines including feature engineering.
  • Production pipelines: Version, save, and monitor feature transformations.
  • Balance automation with understanding: Automated tools accelerate development but require validation and domain knowledge.

🎓 Congratulations!

You've completed the Feature Engineering course! You now have the skills to:

  • ✅ Preprocess and clean data effectively
  • ✅ Encode categorical variables appropriately
  • ✅ Scale and normalize features for different algorithms
  • ✅ Extract and create powerful domain-specific features
  • ✅ Transform skewed distributions
  • ✅ Select the most informative features
  • ✅ Apply dimensionality reduction techniques
  • ✅ Leverage automated feature engineering tools

Ready to apply these skills? Check out the course projects and earn your certificate!

🎉 Course Complete!
