✨ Feature Extraction & Creation

Generate powerful new features from existing data

📚 Tutorial 4 of 8 ⏱️ 75 minutes 📊 Intermediate

The raw features in your dataset are rarely optimal for machine learning. Often, the relationships your model needs to learn are hidden in combinations, transformations, or extractions from existing features. Feature extraction and creation is the art and science of generating new, more informative features that make patterns easier for models to discover.

🎯 What You'll Learn

  • Creating polynomial features and interactions for non-linear patterns
  • Extracting rich features from datetime data
  • Engineering features from text (TF-IDF, n-grams, embeddings)
  • Domain-specific feature engineering strategies
  • Automated feature generation techniques

💡 The Impact of Good Feature Engineering

Andrew Ng famously said: "Applied machine learning is basically feature engineering." In Kaggle competitions, clever feature engineering often makes the difference between winning and mediocrity—more so than algorithm choice.

Polynomial Features

Purpose: Capture non-linear relationships by creating powers and interactions of features

The Problem: Linear Relationships Aren't Enough

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1) * 3

# Try linear regression (will fail to capture curve)
model_linear = LinearRegression()
model_linear.fit(X, y)
y_pred_linear = model_linear.predict(X)

# Plot
plt.figure(figsize=(10, 5))
plt.scatter(X, y, alpha=0.5, label='Actual data')
plt.plot(X, y_pred_linear, 'r-', label='Linear fit', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Model Fails on Non-Linear Data')
plt.legend()
plt.show()

# Linear model can't capture the quadratic relationship!
print(f"Linear R² score: {model_linear.score(X, y):.4f}")

Solution: Polynomial Features

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Create polynomial features (degree 2)
# Transforms [x] into [1, x, x²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("Original shape:", X.shape)      # (100, 1)
print("Polynomial shape:", X_poly.shape)  # (100, 2)
print("\nFirst 5 rows:")
print("Original:", X[:5].ravel())
print("Polynomial:", X_poly[:5])

# Output:
# Original: [0.0, 0.101, 0.202, 0.303, 0.404]
# Polynomial: 
# [[0.0,    0.0]      # [x, x²]
#  [0.101, 0.010]
#  [0.202, 0.041]
#  [0.303, 0.092]
#  [0.404, 0.163]]

# Train with polynomial features
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
y_pred_poly = model_poly.predict(X_poly)

print(f"\nPolynomial R² score: {model_poly.score(X_poly, y):.4f}")
# Much better fit! (R² close to 1.0)

Feature Interactions

With multiple features, polynomial features also create interaction terms:

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

# Sample data: house features
data = pd.DataFrame({
    'length': [10, 12, 15],
    'width': [8, 10, 12]
})

print("Original features:")
print(data)

# Create polynomial features with interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data)

# Get feature names
feature_names = poly.get_feature_names_out(data.columns)
poly_df = pd.DataFrame(poly_features, columns=feature_names)

print("\nPolynomial features with interactions:")
print(poly_df)

# Output:
#    length  width  length²  length×width  width²
# 0    10      8      100        80          64
# 1    12     10      144       120         100
# 2    15     12      225       180         144

# Note: length×width gives us area!
# This interaction feature might be very predictive for house prices

Practical Example: House Price Prediction

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
import numpy as np

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Use only 2 features for demonstration
X = X[:, :2]  # MedInc and HouseAge
feature_names = ['MedInc', 'HouseAge']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline without polynomial features
pipe_linear = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

# Pipeline with polynomial features (degree 2)
pipe_poly = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))
])

# Train both
pipe_linear.fit(X_train, y_train)
pipe_poly.fit(X_train, y_train)

# Compare
print("Linear features only:")
print(f"  Train R²: {pipe_linear.score(X_train, y_train):.4f}")
print(f"  Test R²:  {pipe_linear.score(X_test, y_test):.4f}")

print("\nWith polynomial features:")
print(f"  Train R²: {pipe_poly.score(X_train, y_train):.4f}")
print(f"  Test R²:  {pipe_poly.score(X_test, y_test):.4f}")

# Polynomial features improve model performance!

⚠️ Curse of Dimensionality

Polynomial features explode in number with higher degrees and more features:

  • 10 features, degree 2 → 65 features
  • 10 features, degree 3 → 285 features
  • 20 features, degree 2 → 230 features

Solution: Use feature selection, regularization (Ridge/Lasso), or specify interaction_only=True to create only interaction terms without powers.
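 
To see both mitigations in code, here is a small sketch (assuming scikit-learn): the fitted transformer's n_output_features_ attribute reports how many columns it will produce, and interaction_only=True keeps cross-terms while dropping the pure powers.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.random.rand(5, 10)  # 10 input features

# Full degree-2 expansion: linear terms + squares + pairwise interactions
poly_full = PolynomialFeatures(degree=2, include_bias=False).fit(X_demo)
print("Degree-2 features:", poly_full.n_output_features_)  # 65

# Interaction terms only: pairwise products, no squared terms
poly_inter = PolynomialFeatures(degree=2, include_bias=False,
                                interaction_only=True).fit(X_demo)
print("Interaction-only features:", poly_inter.n_output_features_)  # 55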

DateTime Feature Engineering

Purpose: Extract temporal patterns, seasonality, and cyclical features from dates/times

Raw datetime values are not directly usable by ML models. We need to extract meaningful components that capture temporal patterns.

Basic DateTime Extraction

import pandas as pd
import numpy as np

# Sample datetime data
dates = pd.date_range('2023-01-01', periods=100, freq='D')
df = pd.DataFrame({'date': dates})

# Extract basic components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['dayofyear'] = df['date'].dt.dayofyear

# Boolean flags
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_start'] = df['date'].dt.is_quarter_start.astype(int)

print(df.head(10))

# Example output showing rich temporal features

Cyclical Features (Sin/Cos Encoding)

Month, day of week, and hour are cyclical: December (12) is close to January (1), not far away. Regular encoding doesn't capture this.

import numpy as np
import pandas as pd

# Sample data
df = pd.DataFrame({
    'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'hour': [0, 6, 12, 18, 23, 5, 11, 17, 22, 4, 10, 16]
})

# ❌ Regular encoding: December (12) seems far from January (1)
print("Regular encoding issues:")
print("  Distance from Dec to Jan:", abs(12 - 1))  # 11
print("  Distance from Jun to Jul:", abs(6 - 7))   # 1

# ✅ Cyclical encoding with sin/cos
def encode_cyclical(data, col, max_val):
    """Encode cyclical feature using sin/cos transformation."""
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data

# Encode month (1-12)
df = encode_cyclical(df, 'month', 12)

# Encode hour (0-23)
df = encode_cyclical(df, 'hour', 24)

print("\nCyclical encoding:")
print(df[['month', 'month_sin', 'month_cos']].head())

# Now December and January are close in the feature space!
# month=12: sin ≈ 0, cos ≈ 1
# month=1:  sin ≈ 0.5, cos ≈ 0.87
# Distance in 2D space is small
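 
To make that claim concrete, here is a small verification sketch (not part of the tutorial pipeline) comparing distances under the raw and the sin/cos encodings:

import numpy as np

def cyc(value, max_val):
    """Map a cyclical value to its (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * value / max_val
    return np.array([np.sin(angle), np.cos(angle)])

# Raw encoding: December and January look 11 apart, June and July only 1 apart
print("Raw |Dec - Jan|:", abs(12 - 1))  # 11
print("Raw |Jun - Jul|:", abs(6 - 7))   # 1

# Cyclical encoding: every pair of adjacent months is equally close
print("Cyclic Dec-Jan:", round(np.linalg.norm(cyc(12, 12) - cyc(1, 12)), 3))
print("Cyclic Jun-Jul:", round(np.linalg.norm(cyc(6, 12) - cyc(7, 12)), 3))
# Both distances are about 0.518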

Time Since Events

import pandas as pd
import numpy as np

# E-commerce example: customer purchase history
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'purchase_date': pd.to_datetime([
        '2024-01-15', '2024-02-20', '2024-03-25',
        '2024-01-10', '2024-03-15', '2024-02-05'
    ]),
    'amount': [100, 150, 200, 80, 120, 90]
})

# Sort by customer and date
df = df.sort_values(['customer_id', 'purchase_date'])

# Time since last purchase (in days)
df['days_since_last_purchase'] = df.groupby('customer_id')['purchase_date'].diff().dt.days

# Time since first purchase (customer age)
df['days_since_first_purchase'] = (
    df['purchase_date'] - 
    df.groupby('customer_id')['purchase_date'].transform('min')
).dt.days

# Recency: days since most recent event (useful for churn prediction)
reference_date = pd.to_datetime('2024-04-01')
df['recency'] = (reference_date - df['purchase_date']).dt.days

print(df)

# These features are highly predictive for:
# - Purchase frequency models
# - Churn prediction
# - Customer lifetime value

Aggregated Time Windows

import pandas as pd
import numpy as np

# Transaction data
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'sales': np.random.randint(100, 1000, 100),
    'num_transactions': np.random.randint(5, 50, 100)
})

df = df.set_index('date')

# Rolling/Moving averages
df['sales_ma_7'] = df['sales'].rolling(window=7).mean()
df['sales_ma_30'] = df['sales'].rolling(window=30).mean()

# Expanding windows (cumulative)
df['cumulative_sales'] = df['sales'].expanding().sum()
df['avg_sales_to_date'] = df['sales'].expanding().mean()

# Lagged features (previous values)
df['sales_lag_1'] = df['sales'].shift(1)  # Yesterday
df['sales_lag_7'] = df['sales'].shift(7)  # Last week

# Change/growth features
df['sales_change_1d'] = df['sales'] - df['sales_lag_1']
df['sales_pct_change_1d'] = df['sales'].pct_change()
df['sales_change_7d'] = df['sales'] - df['sales_lag_7']

print(df[['sales', 'sales_ma_7', 'sales_lag_1', 'sales_change_1d']].head(10))

# These are essential for time series forecasting!

Complete DateTime Feature Engineer

import pandas as pd
import numpy as np

class DateTimeFeatureEngineer:
    """Comprehensive datetime feature extraction."""
    
    def fit_transform(self, df, date_col):
        """Extract all datetime features."""
        df = df.copy()
        
        # Ensure datetime type
        df[date_col] = pd.to_datetime(df[date_col])
        
        # Basic components
        df['year'] = df[date_col].dt.year
        df['month'] = df[date_col].dt.month
        df['day'] = df[date_col].dt.day
        df['dayofweek'] = df[date_col].dt.dayofweek
        df['dayofyear'] = df[date_col].dt.dayofyear
        df['quarter'] = df[date_col].dt.quarter
        df['week'] = df[date_col].dt.isocalendar().week
        
        # Time components (if datetime has time)
        if df[date_col].dt.hour.sum() > 0:
            df['hour'] = df[date_col].dt.hour
            df['minute'] = df[date_col].dt.minute
        
        # Boolean flags
        df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
        df['is_month_start'] = df[date_col].dt.is_month_start.astype(int)
        df['is_month_end'] = df[date_col].dt.is_month_end.astype(int)
        df['is_quarter_start'] = df[date_col].dt.is_quarter_start.astype(int)
        df['is_quarter_end'] = df[date_col].dt.is_quarter_end.astype(int)
        df['is_year_start'] = df[date_col].dt.is_year_start.astype(int)
        df['is_year_end'] = df[date_col].dt.is_year_end.astype(int)
        
        # Cyclical encoding
        df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
        df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
        df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)
        df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)
        df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
        df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)
        
        return df

# Example usage
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({'date': dates, 'value': np.random.randn(365)})

engineer = DateTimeFeatureEngineer()
df_engineered = engineer.fit_transform(df, 'date')

print(f"Original columns: {2}")
print(f"Engineered columns: {len(df_engineered.columns)}")
print("\nNew features:", df_engineered.columns.tolist())

Text Feature Engineering

Purpose: Convert text into numerical features that capture semantic meaning

Bag of Words & TF-IDF

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

# Sample text data
documents = [
    "machine learning is awesome",
    "deep learning requires GPUs",
    "machine learning and deep learning are related",
    "natural language processing uses transformers"
]

# Method 1: Bag of Words (Count Vectorization)
count_vec = CountVectorizer()
bow_features = count_vec.fit_transform(documents)

print("Bag of Words:")
print("Feature names:", count_vec.get_feature_names_out())
print("\nBOW Matrix:")
print(pd.DataFrame(bow_features.toarray(), 
                   columns=count_vec.get_feature_names_out()))

# Method 2: TF-IDF (Term Frequency - Inverse Document Frequency)
tfidf_vec = TfidfVectorizer()
tfidf_features = tfidf_vec.fit_transform(documents)

print("\n\nTF-IDF:")
print("Feature names:", tfidf_vec.get_feature_names_out())
print("\nTF-IDF Matrix:")
tfidf_df = pd.DataFrame(tfidf_features.toarray(), 
                        columns=tfidf_vec.get_feature_names_out())
print(tfidf_df.round(3))

# TF-IDF down-weights common words (like "and", "is")
# and up-weights rare, discriminative words
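 
To see the down-weighting directly, you can inspect the idf_ attribute of a fitted TfidfVectorizer. A self-contained sketch that refits on the same four documents:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning is awesome",
    "deep learning requires GPUs",
    "machine learning and deep learning are related",
    "natural language processing uses transformers"
]

vec = TfidfVectorizer().fit(docs)

# idf_ holds the learned inverse-document-frequency weights:
# terms appearing in many documents get small weights, rare terms get large ones
idf = pd.Series(vec.idf_, index=vec.get_feature_names_out()).sort_values()

print("Most common terms (down-weighted):")
print(idf.head(3).round(3))
print("\nRarest terms (up-weighted):")
print(idf.tail(3).round(3))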

N-grams: Capturing Context

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "New York is a great city",
    "The weather in New York is cold",
    "I love New York pizza"
]

# Unigrams only (single words)
tfidf_unigram = TfidfVectorizer(ngram_range=(1, 1))
unigram_features = tfidf_unigram.fit_transform(documents)

print("Unigrams (single words):")
print(tfidf_unigram.get_feature_names_out())

# Bigrams (2-word phrases)
tfidf_bigram = TfidfVectorizer(ngram_range=(2, 2))
bigram_features = tfidf_bigram.fit_transform(documents)

print("\nBigrams (2-word phrases):")
print(tfidf_bigram.get_feature_names_out())
# Output includes: 'new york', 'york is', 'great city', etc.

# Combined: Unigrams + Bigrams
tfidf_combined = TfidfVectorizer(ngram_range=(1, 2))
combined_features = tfidf_combined.fit_transform(documents)

print("\nCombined (1-grams and 2-grams):")
print(f"Total features: {len(tfidf_combined.get_feature_names_out())}")
print("Sample features:", tfidf_combined.get_feature_names_out()[:10])

# "New York" is now captured as a single meaningful phrase!

Advanced Text Features

import pandas as pd
import re
from textblob import TextBlob  # pip install textblob

class TextFeatureExtractor:
    """Extract statistical and linguistic features from text."""
    
    def extract_features(self, text):
        """Extract multiple text features."""
        features = {}
        
        # Basic statistics
        features['char_count'] = len(text)
        features['word_count'] = len(text.split())
        features['sentence_count'] = len(text.split('.'))
        features['avg_word_length'] = (
            sum(len(word) for word in text.split()) / len(text.split())
            if text.split() else 0
        )
        
        # Special characters
        features['exclamation_count'] = text.count('!')
        features['question_count'] = text.count('?')
        features['uppercase_count'] = sum(1 for c in text if c.isupper())
        features['digit_count'] = sum(1 for c in text if c.isdigit())
        
        # Uppercase ratio
        features['uppercase_ratio'] = (
            features['uppercase_count'] / len(text) if len(text) > 0 else 0
        )
        
        # Sentiment analysis (requires textblob)
        try:
            blob = TextBlob(text)
            features['sentiment_polarity'] = blob.sentiment.polarity  # -1 to 1
            features['sentiment_subjectivity'] = blob.sentiment.subjectivity  # 0 to 1
        except Exception:
            features['sentiment_polarity'] = 0
            features['sentiment_subjectivity'] = 0
        
        # Part of speech counts (simplified)
        # In real scenarios, use spaCy or NLTK
        features['has_url'] = int(bool(re.search(r'http[s]?://', text)))
        features['has_email'] = int(bool(re.search(r'\S+@\S+', text)))
        features['has_phone'] = int(bool(re.search(r'\d{3}[-.]?\d{3}[-.]?\d{4}', text)))
        
        return features

# Example usage
texts = [
    "Check out this AMAZING deal at http://example.com! Contact us at info@example.com",
    "This is a normal sentence without special features.",
    "WHY ARE YOU YELLING??? Please stop!!!"
]

extractor = TextFeatureExtractor()
features_list = [extractor.extract_features(text) for text in texts]

df_features = pd.DataFrame(features_list)
print(df_features)

# These features are valuable for:
# - Spam detection
# - Sentiment analysis
# - Content classification
# - Quality assessment

Word Embeddings (Advanced)

# Using pre-trained embeddings (requires download)
# Popular options: Word2Vec, GloVe, FastText

# Simple example with sentence transformers
# pip install sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample texts
texts = [
    "machine learning is fascinating",
    "deep learning requires GPUs",
    "natural language processing"
]

# Generate embeddings (dense vectors)
embeddings = model.encode(texts)

print("Embedding shape:", embeddings.shape)  # (3, 384)
# Each text is now a 384-dimensional vector!

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
print("\nCosine similarity between texts:")
print(similarity_matrix)

# Texts with similar meanings have higher similarity scores
# "machine learning" and "deep learning" will be more similar
# than "machine learning" and "language processing"

Domain-Specific Feature Engineering

The most powerful features often come from domain knowledge—understanding what matters in your specific problem.

Example 1: E-Commerce / Retail

import pandas as pd
import numpy as np

# Customer transaction data
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'transaction_date': pd.to_datetime([
        '2024-01-05', '2024-01-12', '2024-02-15',
        '2024-01-08', '2024-03-10',
        '2024-01-03', '2024-01-10', '2024-02-05', '2024-03-15'
    ]),
    'amount': [50, 75, 100, 200, 150, 30, 40, 35, 45],
    'category': ['Electronics', 'Clothing', 'Electronics', 
                 'Electronics', 'Electronics',
                 'Food', 'Food', 'Food', 'Clothing']
})

# Customer-level features (aggregated per customer)
customer_features = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'std', 'min', 'max', 'count'],
    'transaction_date': ['min', 'max'],
    'category': lambda x: x.nunique()
}).reset_index()

# Flatten the MultiIndex columns into descriptive names
customer_features.columns = ['customer_id', 'total_spend', 'avg_spend',
                              'std_spend', 'min_spend', 'max_spend',
                              'num_transactions', 'first_purchase',
                              'last_purchase', 'num_categories']

# Additional engineered features
customer_features['customer_lifetime_days'] = (
    customer_features['last_purchase'] - customer_features['first_purchase']
).dt.days

customer_features['purchase_frequency'] = (
    customer_features['num_transactions'] / 
    (customer_features['customer_lifetime_days'] + 1)  # +1 to avoid division by zero
)

customer_features['avg_days_between_purchases'] = (
    customer_features['customer_lifetime_days'] / 
    (customer_features['num_transactions'] - 1).clip(lower=1)
)

# Recency (days since last purchase)
reference_date = pd.to_datetime('2024-04-01')
customer_features['recency'] = (
    reference_date - customer_features['last_purchase']
).dt.days

# RFM Score (Recency, Frequency, Monetary)
customer_features['frequency_score'] = pd.qcut(
    customer_features['num_transactions'], q=3, labels=[1, 2, 3]
)
customer_features['monetary_score'] = pd.qcut(
    customer_features['total_spend'], q=3, labels=[1, 2, 3]
)
customer_features['recency_score'] = pd.qcut(
    customer_features['recency'], q=3, labels=[3, 2, 1]  # Lower recency = better
)

print(customer_features)

# These features are highly predictive for:
# - Churn prediction
# - Customer lifetime value
# - Next purchase prediction
# - Customer segmentation

Example 2: Finance / Credit Scoring

import pandas as pd
import numpy as np

# Loan application data
df = pd.DataFrame({
    'income': [50000, 75000, 40000, 100000],
    'loan_amount': [20000, 30000, 25000, 40000],
    'existing_debt': [5000, 10000, 15000, 20000],
    'employment_years': [3, 7, 2, 10],
    'num_credit_lines': [2, 4, 1, 6],
    'delinquencies': [0, 1, 2, 0]
})

# Domain-specific ratios and features
df['debt_to_income_ratio'] = df['existing_debt'] / df['income']
df['loan_to_income_ratio'] = df['loan_amount'] / df['income']
df['total_debt_ratio'] = (df['existing_debt'] + df['loan_amount']) / df['income']

# Credit utilization proxy
df['debt_per_credit_line'] = df['existing_debt'] / df['num_credit_lines']

# Stability score
df['employment_stability'] = np.where(df['employment_years'] >= 5, 1, 0)

# Risk score (simple heuristic)
df['risk_score'] = (
    df['debt_to_income_ratio'] * 0.3 +
    df['delinquencies'] * 0.4 +
    (1 - df['employment_stability']) * 0.3
)

# Available income after debt
df['disposable_income'] = df['income'] - df['existing_debt']
df['payment_capacity'] = df['disposable_income'] / df['loan_amount']

print(df)

# These domain-specific features capture financial health
# much better than raw features alone

Example 3: Healthcare / Medical

import pandas as pd
import numpy as np

# Patient data
df = pd.DataFrame({
    'age': [45, 67, 52, 38, 71],
    'weight_kg': [80, 90, 75, 68, 85],
    'height_cm': [175, 170, 168, 180, 165],
    'systolic_bp': [120, 140, 135, 118, 150],
    'diastolic_bp': [80, 90, 85, 78, 95],
    'glucose': [95, 110, 105, 92, 125],
    'cholesterol': [180, 220, 200, 175, 240]
})

# Body Mass Index (BMI)
df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2

# BMI category
df['bmi_category'] = pd.cut(
    df['bmi'],
    bins=[0, 18.5, 25, 30, 100],
    labels=['Underweight', 'Normal', 'Overweight', 'Obese']
)

# Blood pressure category (using standard guidelines)
df['hypertension_stage'] = pd.cut(
    df['systolic_bp'],
    bins=[0, 120, 130, 140, 180, 300],
    labels=['Normal', 'Elevated', 'Stage1', 'Stage2', 'Crisis']
)

# Pulse pressure (indicator of cardiovascular health)
df['pulse_pressure'] = df['systolic_bp'] - df['diastolic_bp']

# Mean arterial pressure
df['map'] = df['diastolic_bp'] + (df['pulse_pressure'] / 3)

# Metabolic syndrome indicators
df['glucose_high'] = (df['glucose'] >= 100).astype(int)
df['cholesterol_high'] = (df['cholesterol'] >= 200).astype(int)
df['bp_high'] = (df['systolic_bp'] >= 130).astype(int)
df['obesity'] = (df['bmi'] >= 30).astype(int)

# Risk score (sum of risk factors)
df['metabolic_risk_score'] = (
    df['glucose_high'] + df['cholesterol_high'] + 
    df['bp_high'] + df['obesity']
)

# Age-adjusted risk
df['age_risk_multiplier'] = np.where(df['age'] >= 60, 1.5, 1.0)
df['adjusted_risk'] = df['metabolic_risk_score'] * df['age_risk_multiplier']

print(df[['age', 'bmi', 'bmi_category', 'hypertension_stage', 
          'metabolic_risk_score', 'adjusted_risk']])

# Medical domain knowledge creates highly predictive features
# for disease risk prediction

Automated Feature Engineering

While domain expertise is invaluable, automated tools can help discover features you might miss.

Featuretools: Automated Feature Generation

# Install: pip install featuretools

import featuretools as ft
import pandas as pd

# Sample data: customers and transactions
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'join_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-01-10']),
    'age': [25, 35, 45]
})

transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5, 6],
    'customer_id': [1, 1, 2, 2, 3, 3],
    'amount': [50, 75, 200, 150, 30, 40],
    'timestamp': pd.to_datetime([
        '2023-03-01', '2023-03-15',
        '2023-03-05', '2023-03-20',
        '2023-03-10', '2023-03-25'
    ])
})

# Create EntitySet (data structure for featuretools)
es = ft.EntitySet(id='customer_data')

# Add entities (tables)
es = es.add_dataframe(
    dataframe_name='customers',
    dataframe=customers,
    index='customer_id',
    time_index='join_date'
)

es = es.add_dataframe(
    dataframe_name='transactions',
    dataframe=transactions,
    index='transaction_id',
    time_index='timestamp'
)

# Define relationship
es = es.add_relationship('customers', 'customer_id', 
                         'transactions', 'customer_id')

# Deep Feature Synthesis (DFS) - automatically creates features
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    max_depth=2,
    verbose=True
)

print("\nAutomatically generated features:")
print(feature_matrix)

print("\nFeature definitions:")
for feature in feature_defs:
    print(f"  - {feature.get_name()}")

# Featuretools creates features like:
# - SUM(transactions.amount)
# - MEAN(transactions.amount)
# - COUNT(transactions)
# - MAX(transactions.amount)
# - And many more complex aggregations!
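 
By default, DFS applies a broad set of primitives. If you want tighter control over what gets generated, you can name the aggregation and transform primitives explicitly; a sketch continuing from the EntitySet built above (parameter names assume the Featuretools 1.x API):

# Restrict DFS to a handful of named primitives
feature_matrix_small, feature_defs_small = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['sum', 'mean', 'count'],   # aggregations over transactions
    trans_primitives=['month', 'weekday'],     # transforms of datetime columns
    max_depth=2
)

print(feature_matrix_small.columns.tolist())
# Expect features such as SUM(transactions.amount) and MONTH(join_date)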

Feature Engineering Best Practices

🎯 Guidelines for Creating Good Features

  1. Start with domain knowledge: Understand what matters in your domain
  2. Look for relationships: Ratios, differences, interactions often work well
  3. Consider non-linearity: Polynomial features, log transforms for skewed data
  4. Extract from complex types: Dates, text, categorical have hidden information
  5. Create aggregations: Group statistics (mean, sum, std, etc.)
  6. Mind leakage: Don't use future information or target-derived features
  7. Validate usefulness: Use feature importance or selection methods (see the sketch after this list)
  8. Iterate: Feature engineering is experimental; try, test, refine
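 
As a quick illustration of point 7 above, one common check is to fit a tree-based model and inspect its feature importances (a minimal sketch on synthetic data, assuming scikit-learn; permutation importance or a feature-selection method works just as well):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on an interaction, not the raw columns alone
rng = np.random.default_rng(42)
df = pd.DataFrame({'length': rng.uniform(5, 20, 500),
                   'width': rng.uniform(3, 15, 500)})
y = df['length'] * df['width'] + rng.normal(0, 5, 500)  # "area" drives the target

# Candidate engineered feature
df['area'] = df['length'] * df['width']

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(df, y)

importances = pd.Series(model.feature_importances_, index=df.columns)
print(importances.sort_values(ascending=False).round(3))
# 'area' should dominate, confirming the engineered feature is worth keeping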

🧠 Knowledge Check

Question 1: What do polynomial features help capture?

Missing values in the dataset
Non-linear relationships and feature interactions
Outliers in the data
Categorical encodings

Question 2: Why use sin/cos encoding for cyclical features like month or hour?

It makes the data smaller
It removes outliers
It captures that December is close to January, not far away
It's required by all ML algorithms

Question 3: What does TF-IDF do compared to simple word counts?

Counts words faster
Down-weights common words and up-weights rare, discriminative words
Removes stop words automatically
Translates text to different languages

Question 4: What are n-grams useful for in text processing?

Reducing the size of text data
Removing punctuation
Spell checking
Capturing multi-word phrases and context (like "New York" as one unit)

Question 5: In e-commerce, what does the "recency" feature measure?

Days since the customer's last purchase
Total number of purchases
Average purchase amount
Customer's age

Question 6: What problem does polynomial features create with many input features?

Makes models train faster
Reduces model accuracy
Curse of dimensionality - exponential growth in feature count
Removes important information

Question 7: What's the most important ingredient for domain-specific feature engineering?

Having the most data
Understanding the problem domain and what factors matter
Using the latest machine learning algorithm
Having the fastest computer

Question 8: Rolling/moving averages are particularly useful for what type of data?

Categorical data
Text data
Image data
Time series data to capture trends

💻 Practice Exercises

Exercise 1: Polynomial Features Impact

Compare model performance with and without polynomial features:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Create non-linear dataset
X, y = make_regression(n_samples=200, n_features=2, noise=10, random_state=42)
y = y + 0.5 * X[:, 0]**2 + 0.3 * X[:, 1]**2  # Add non-linearity

# TODO:
# 1. Split into train/test (80/20)
# 2. Train Ridge model WITHOUT polynomial features
# 3. Train Ridge model WITH polynomial features (degree 2)
# 4. Compare R² scores
# 5. Try degrees 2, 3, 4 - what happens?
# 6. Use interaction_only=True - how does it affect results?

Exercise 2: DateTime Feature Engineering

Extract comprehensive features from datetime data:

import pandas as pd
import numpy as np

# Sales data with dates
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
df = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(100, 1000, len(dates))
})

# TODO:
# 1. Extract: year, month, day, dayofweek, quarter
# 2. Create binary flags: is_weekend, is_month_start, is_month_end
# 3. Add cyclical encoding for month and dayofweek (sin/cos)
# 4. Create rolling average features (7-day, 30-day)
# 5. Add lagged features (sales from 1, 7 days ago)
# 6. Train a model to predict sales using these features

Exercise 3: Text Feature Extraction

Build a spam classifier using text features:

from sklearn.datasets import fetch_20newsgroups

# Load some text data
categories = ['alt.atheism', 'soc.religion.christian']
data = fetch_20newsgroups(subset='train', categories=categories)
texts = data.data[:100]
labels = data.target[:100]

# TODO:
# 1. Extract TF-IDF features (unigrams and bigrams)
# 2. Extract statistical features:
#    - Character count, word count, avg word length
#    - Count of uppercase chars, exclamation marks
# 3. Combine TF-IDF and statistical features
# 4. Train a classifier (Logistic Regression or SVM)
# 5. Compare with using only TF-IDF vs only statistical features

Exercise 4: Customer RFM Features

Create e-commerce customer features for churn prediction:

# Generate customer transaction data
# TODO:
# 1. Create sample transaction data (customer_id, date, amount)
# 2. Calculate per-customer aggregations:
#    - Total spend, average spend, number of transactions
#    - First and last purchase dates
# 3. Engineer RFM features:
#    - Recency: days since last purchase
#    - Frequency: number of transactions
#    - Monetary: total or average spend
# 4. Add derived features:
#    - Purchase frequency (transactions per day active)
#    - Customer lifetime value prediction
#    - Days between purchases (average)
# 5. Segment customers using these features (K-Means)

Exercise 5: Domain-Specific Healthcare Features

Create medical features for diabetes risk prediction:

# Patient data with: age, weight, height, glucose, bp
# TODO:
# 1. Calculate BMI from weight and height
# 2. Create BMI categories (underweight/normal/overweight/obese)
# 3. Create blood pressure categories (normal/elevated/hypertension)
# 4. Engineer metabolic syndrome indicators:
#    - High glucose (>100 mg/dL)
#    - High BP (systolic >130)
#    - Obesity (BMI >30)
# 5. Create risk scores combining multiple factors
# 6. Add age-adjusted risk multipliers
# 7. Train a classifier to predict diabetes risk

📝 Summary

Feature extraction and creation is where machine learning becomes an art informed by both data science and domain expertise:

Key Takeaways

  • Polynomial Features: Capture non-linear relationships and feature interactions. Use PolynomialFeatures with degree 2-3. Watch for dimensionality explosion with many features.
  • DateTime Features: Extract components (year, month, day, hour), create boolean flags (weekends, month-end), use sin/cos for cyclical features, calculate time since events, and create rolling/lagged features for time series.
  • Text Features: Use TF-IDF for term importance, n-grams for context, statistical features (length, special chars), and embeddings for semantic meaning. Combine multiple approaches for best results.
  • Domain-Specific Features: Most powerful features come from domain knowledge. Create meaningful ratios, aggregations, and combinations that capture what matters in your problem space.
  • Automated Tools: Libraries like Featuretools can discover features automatically through deep feature synthesis, but domain expertise is still crucial for interpreting and selecting the results.
  • Best Practice: Feature engineering is iterative. Create features, test their importance, refine based on results. Always validate that features improve model performance on held-out data.

🎯 Feature Engineering Mindset

  • Ask: "What patterns would help a model make better predictions?"
  • Think about relationships between features (ratios, differences, products)
  • Consider transformations (logs, roots, polynomials for non-linearity)
  • Extract hidden information (from dates, text, complex types)
  • Use domain knowledge to create meaningful derived features
  • Test and validate - not all features will be useful!

In the next tutorial, we'll explore Feature Transformation Techniques: handling skewed distributions and creating better feature representations. See you there! 🚀
