🤖 Automated Feature Engineering

Let algorithms discover powerful features automatically

📚 Tutorial 8 of 8 ⏱️ 85 minutes 🚀 Advanced

Manual feature engineering is time-consuming and requires domain expertise. Automated feature engineering uses algorithms to systematically discover, create, and select features, often finding patterns humans would miss while dramatically reducing development time.

🎯 What You'll Learn

  • Advanced Featuretools with custom primitives and deep feature synthesis
  • Genetic algorithms for automated feature selection
  • AutoML feature engineering with TPOT
  • Production pipelines and best practices for scaling

Featuretools: Deep Feature Synthesis

# Install: pip install featuretools
import featuretools as ft
import pandas as pd

# Sample relational data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'join_date': pd.to_datetime(['2020-01-01', '2020-03-15', '2020-06-10']),
    'age': [25, 45, 35]
})

transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'customer_id': [1, 1, 2, 2, 3],
    'amount': [50, 100, 75, 200, 30],
    'timestamp': pd.to_datetime([
        '2021-01-15', '2021-02-20', 
        '2021-01-10', '2021-03-05', 
        '2021-02-18'
    ])
})

# Create EntitySet
es = ft.EntitySet(id='customer_data')
es = es.add_dataframe(
    dataframe_name='customers',
    dataframe=customers,
    index='customer_id',
    time_index='join_date'
)
es = es.add_dataframe(
    dataframe_name='transactions',
    dataframe=transactions,
    index='transaction_id',
    time_index='timestamp'
)

# Define relationship
es = es.add_relationship('customers', 'customer_id', 
                         'transactions', 'customer_id')

# Deep Feature Synthesis
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    max_depth=2,
    verbose=True
)

print(f"Generated {len(feature_defs)} features")
print(feature_matrix.head())

Custom Aggregation Primitives

from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema

# Define custom primitive (Featuretools 1.x API, which describes
# column types with Woodwork ColumnSchema objects)
class Variance(AggregationPrimitive):
    """Calculate variance of a numeric column"""
    name = 'variance'
    input_types = [ColumnSchema(semantic_tags={'numeric'})]
    return_type = ColumnSchema(semantic_tags={'numeric'})
    
    def get_function(self):
        def variance(x):
            return x.var()
        return variance

# Use custom primitive
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['sum', 'mean', 'max', 'min', Variance],
    max_depth=2
)

print(f"Features with custom primitive: {len(feature_defs)}")

Feature Selection with Featuretools

from featuretools.selection import remove_low_information_features

# Remove features that carry little information
# (e.g. constant columns or columns where every value is unique)
feature_matrix_filtered = remove_low_information_features(feature_matrix)

print(f"Original features: {feature_matrix.shape[1]}")
print(f"After filtering: {feature_matrix_filtered.shape[1]}")

# Rank features by importance with a random forest
from sklearn.ensemble import RandomForestClassifier

# Prepare data (assumes a target variable is available; the labels
# below are made up for the three example customers)
X = feature_matrix_filtered.fillna(0)
y = [0, 1, 0]  # Example labels

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 important features:")
print(feature_importance.head(10))

Genetic Algorithms for Feature Selection

# Install: pip install sklearn-genetic
from genetic_selection import GeneticSelectionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Genetic algorithm for feature selection
estimator = RandomForestClassifier(n_estimators=50, random_state=42)

selector = GeneticSelectionCV(
    estimator=estimator,
    cv=5,
    n_generations=20,
    n_population=50,
    crossover_proba=0.5,
    mutation_proba=0.2,
    n_gen_no_change=10,
    verbose=1
)

selector.fit(X_train, y_train)

print(f"\nSelected {selector.n_features_} features from {X.shape[1]}")
print(f"Selected feature indices: {list(selector.support_)}")

# Evaluate
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

estimator.fit(X_train_selected, y_train)
score = estimator.score(X_test_selected, y_test)
print(f"Test accuracy: {score:.4f}")

TPOT: AutoML Feature Engineering

# Install: pip install tpot
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

# Load data
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42
)

# TPOT automatically finds best pipeline
# (includes feature engineering + model selection)
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    cv=5,
    random_state=42,
    verbosity=2,
    n_jobs=-1
)

tpot.fit(X_train, y_train)
score = tpot.score(X_test, y_test)

print(f"\nTPOT Test Accuracy: {score:.4f}")

# Export best pipeline
tpot.export('best_pipeline.py')
print("Best pipeline exported to 'best_pipeline.py'")

📊 TPOT Pipeline Example Output

TPOT might discover a pipeline like this one, sketched in code after the list:

  • PCA(n_components=10)
  • PolynomialFeatures(degree=2)
  • SelectKBest(k=50)
  • GradientBoostingClassifier()
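
The exported best_pipeline.py is a plain scikit-learn script. Below is a hypothetical sketch of what it could contain for a pipeline like the one above; the actual steps and hyperparameters depend entirely on your search run.

from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical pipeline resembling TPOT's exported output for the steps above
exported_pipeline = make_pipeline(
    PCA(n_components=10),                     # compress to 10 components
    PolynomialFeatures(degree=2),             # add interaction and squared terms
    SelectKBest(score_func=f_classif, k=50),  # keep the 50 most predictive features
    GradientBoostingClassifier(random_state=42)
)

# exported_pipeline.fit(X_train, y_train)
# predictions = exported_pipeline.predict(X_test)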

Production-Ready Feature Engineering Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import joblib

# Define preprocessing
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['education', 'employment_status']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=0.95))
    ]), numeric_features),
    
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), 
     categorical_features)
])

# Complete pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train (example)
# pipeline.fit(X_train, y_train)

# Save pipeline (in practice, fit it first so the saved artifact is ready to predict)
joblib.dump(pipeline, 'feature_engineering_pipeline.pkl')

# Load and use
# loaded_pipeline = joblib.load('feature_engineering_pipeline.pkl')
# predictions = loaded_pipeline.predict(X_new)

Monitoring Feature Quality in Production

import pandas as pd
from scipy import stats

class FeatureMonitor:
    """Monitor feature distributions for drift detection"""
    
    def __init__(self, reference_data):
        """Store reference statistics"""
        self.reference_stats = {}
        for col in reference_data.columns:
            self.reference_stats[col] = {
                'mean': reference_data[col].mean(),
                'std': reference_data[col].std(),
                'min': reference_data[col].min(),
                'max': reference_data[col].max()
            }
    
    def check_drift(self, new_data, threshold=0.05):
        """Check for distribution drift using Kolmogorov-Smirnov test"""
        drift_detected = {}
        
        for col in new_data.columns:
            if col not in self.reference_stats:
                continue
                
            # KS test for distribution change
            ks_stat, p_value = stats.ks_2samp(
                self.reference_data[col],
                new_data[col]
            )
            
            drift_detected[col] = {
                'drift': p_value < threshold,
                'p_value': p_value,
                'ks_statistic': ks_stat
            }
        
        return drift_detected
    
    def get_summary(self, new_data):
        """Compare statistics"""
        summary = []
        for col in new_data.columns:
            if col not in self.reference_stats:
                continue
            
            ref = self.reference_stats[col]
            summary.append({
                'feature': col,
                'ref_mean': ref['mean'],
                'new_mean': new_data[col].mean(),
                'ref_std': ref['std'],
                'new_std': new_data[col].std()
            })
        
        return pd.DataFrame(summary)

# Usage (assuming X_train and X_new are pandas DataFrames with matching columns)
# monitor = FeatureMonitor(X_train)
# drift_report = monitor.check_drift(X_new)
# summary = monitor.get_summary(X_new)

Best Practices for Automated Feature Engineering

✅ Do's

  • Start simple: Try manual features first to understand the data
  • Use domain knowledge: Guide automated tools with relevant constraints
  • Validate thoroughly: Test on held-out data and check for leakage (see the cutoff-time sketch after this list)
  • Monitor in production: Track feature distributions for drift
  • Version pipelines: Save preprocessing + model together
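
Featuretools can also help with the leakage check itself: passing cutoff times to dfs restricts each customer's aggregations to data observed up to that point in time. A minimal sketch, reusing the es EntitySet from earlier (the cutoff dates are made up for illustration):

import pandas as pd
import featuretools as ft

# One cutoff time per customer: features are built only from rows
# whose time index falls at or before that customer's cutoff
cutoff_times = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'time': pd.to_datetime(['2021-02-01', '2021-02-01', '2021-03-01'])
})

feature_matrix_cut, feature_defs_cut = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    cutoff_time=cutoff_times,
    max_depth=2
)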

⚠️ Don'ts

  • Don't blindly trust: Verify generated features make logical sense
  • Avoid feature explosion: Limit depth and aggregations to prevent overfitting (see the constrained dfs sketch after this list)
  • Don't ignore compute costs: Deep synthesis can generate thousands of features
  • Don't skip interpretability: Some features may be hard to explain to stakeholders
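
To keep the feature space (and compute cost) in check, constrain dfs directly. A minimal sketch, again assuming the es EntitySet from earlier:

import featuretools as ft

# Restrict depth, primitives, and total feature count so deep feature
# synthesis stays interpretable and cheap to compute
feature_matrix_small, feature_defs_small = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['mean', 'sum', 'count'],
    trans_primitives=['month'],   # only a simple datetime transform
    max_depth=1,                  # no stacked (depth-2) features
    max_features=100              # hard cap on the number of features returned
)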

🧠 Knowledge Check

Question 1: What is Deep Feature Synthesis in Featuretools?

  • A deep learning technique
  • Automated creation of features through stacking primitives
  • A type of dimensionality reduction
  • A feature selection method

Question 2: How do genetic algorithms perform feature selection?

  • Evolve feature subsets through crossover and mutation
  • Use gradient descent to optimize features
  • Apply PCA transformations
  • Remove correlated features

Question 3: What does TPOT optimize?

  • Only feature selection
  • Only model hyperparameters
  • Entire ML pipeline including preprocessing and models
  • Only data cleaning steps

Question 4: Why monitor features in production?

  • To reduce storage costs
  • To increase training speed
  • To improve accuracy
  • To detect distribution drift and data quality issues

📝 Summary

Key Takeaways

  • Featuretools: Automates feature generation through deep feature synthesis and custom primitives.
  • Genetic Algorithms: Evolve optimal feature subsets through natural selection principles.
  • TPOT: AutoML tool that optimizes entire pipelines including feature engineering.
  • Production pipelines: Version, save, and monitor feature transformations.
  • Balance automation with understanding: Automated tools accelerate development but require validation and domain knowledge.

🎓 Congratulations!

You've completed the Feature Engineering course! You now have the skills to:

  • ✅ Preprocess and clean data effectively
  • ✅ Encode categorical variables appropriately
  • ✅ Scale and normalize features for different algorithms
  • ✅ Extract and create powerful domain-specific features
  • ✅ Transform skewed distributions
  • ✅ Select the most informative features
  • ✅ Apply dimensionality reduction techniques
  • ✅ Leverage automated feature engineering tools

Ready to apply these skills? Check out the course projects and earn your certificate!

🎉 Course Complete!
