The raw features in your dataset are rarely optimal for machine learning. Often, the relationships your model needs to learn are hidden in combinations, transformations, or extractions from existing features. Feature extraction and creation is the art and science of generating new, more informative features that make patterns easier for models to discover.
🎯 What You'll Learn
- Creating polynomial features and interactions for non-linear patterns
- Extracting rich features from datetime data
- Engineering features from text (TF-IDF, n-grams, embeddings)
- Domain-specific feature engineering strategies
- Automated feature generation techniques
💡 The Impact of Good Feature Engineering
Andrew Ng famously said: "Applied machine learning is basically feature engineering." In Kaggle competitions, clever feature engineering often makes the difference between winning and mediocrity—more so than algorithm choice.
Polynomial Features
Purpose: Capture non-linear relationships by creating powers and interactions of features
The Problem: Linear Relationships Aren't Enough
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Create non-linear data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1) * 3
# Try linear regression (will fail to capture curve)
model_linear = LinearRegression()
model_linear.fit(X, y)
y_pred_linear = model_linear.predict(X)
# Plot
plt.figure(figsize=(10, 5))
plt.scatter(X, y, alpha=0.5, label='Actual data')
plt.plot(X, y_pred_linear, 'r-', label='Linear fit', linewidth=2)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Model Fails on Non-Linear Data')
plt.legend()
plt.show()
# Linear model can't capture the quadratic relationship!
print(f"Linear R² score: {model_linear.score(X, y):.4f}")
Solution: Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# Create polynomial features (degree 2)
# Transforms [x] into [1, x, x²]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print("Original shape:", X.shape) # (100, 1)
print("Polynomial shape:", X_poly.shape) # (100, 2)
print("\nFirst 5 rows:")
print("Original:", X[:5].ravel())
print("Polynomial:", X_poly[:5])
# Output:
# Original: [0.0, 0.101, 0.202, 0.303, 0.404]
# Polynomial:
# [[0.0, 0.0] # [x, x²]
# [0.101, 0.010]
# [0.202, 0.041]
# [0.303, 0.092]
# [0.404, 0.163]]
# Train with polynomial features
model_poly = LinearRegression()
model_poly.fit(X_poly, y)
y_pred_poly = model_poly.predict(X_poly)
print(f"\nPolynomial R² score: {model_poly.score(X_poly, y):.4f}")
# Much better fit! (R² close to 1.0)
Feature Interactions
With multiple features, polynomial features also create interaction terms:
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
# Sample data: house features
data = pd.DataFrame({
'length': [10, 12, 15],
'width': [8, 10, 12]
})
print("Original features:")
print(data)
# Create polynomial features with interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data)
# Get feature names
feature_names = poly.get_feature_names_out(data.columns)
poly_df = pd.DataFrame(poly_features, columns=feature_names)
print("\nPolynomial features with interactions:")
print(poly_df)
# Output:
# length width length² length×width width²
# 0 10 8 100 80 64
# 1 12 10 144 120 100
# 2 15 12 225 180 144
# Note: length×width gives us area!
# This interaction feature might be very predictive for house prices
Practical Example: House Price Prediction
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
import numpy as np
# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Use only 2 features for demonstration
X = X[:, :2] # MedInc and HouseAge
feature_names = ['MedInc', 'HouseAge']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Pipeline without polynomial features
pipe_linear = Pipeline([
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
# Pipeline with polynomial features (degree 2)
pipe_poly = Pipeline([
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('scaler', StandardScaler()),
('model', Ridge(alpha=1.0))
])
# Train both
pipe_linear.fit(X_train, y_train)
pipe_poly.fit(X_train, y_train)
# Compare
print("Linear features only:")
print(f" Train R²: {pipe_linear.score(X_train, y_train):.4f}")
print(f" Test R²: {pipe_linear.score(X_test, y_test):.4f}")
print("\nWith polynomial features:")
print(f" Train R²: {pipe_poly.score(X_train, y_train):.4f}")
print(f" Test R²: {pipe_poly.score(X_test, y_test):.4f}")
# Polynomial features improve model performance!
⚠️ Curse of Dimensionality
Polynomial features explode in number with higher degrees and more features:
- 10 features, degree 2 → 65 features
- 10 features, degree 3 → 285 features
- 20 features, degree 2 → 230 features
Solution: Use feature selection, regularization (Ridge/Lasso), or specify interaction_only=True to create only interaction terms without powers.
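To make the explosion concrete, here is a quick sketch (X_demo below is a throwaway random array; only the column counts matter) that counts the generated features and shows how interaction_only=True limits the growth:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# Throwaway matrix with 10 input features -- only the number of columns matters
X_demo = np.random.randn(5, 10)
n_deg2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo).shape[1]
n_deg3 = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X_demo).shape[1]
n_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False).fit_transform(X_demo).shape[1]
print("Degree 2:", n_deg2) # 65
print("Degree 3:", n_deg3) # 285
print("Degree 2, interaction_only=True:", n_inter) # 55 -- squared terms dropped, only original features and pairwise products remain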
DateTime Feature Engineering
Purpose: Extract temporal patterns, seasonality, and cyclical features from dates/times
Raw datetime values are not directly usable by ML models. We need to extract meaningful components that capture temporal patterns.
Basic DateTime Extraction
import pandas as pd
import numpy as np
# Sample datetime data
dates = pd.date_range('2023-01-01', periods=100, freq='D')
df = pd.DataFrame({'date': dates})
# Extract basic components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek # Monday=0, Sunday=6
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['dayofyear'] = df['date'].dt.dayofyear
# Boolean flags
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_start'] = df['date'].dt.is_quarter_start.astype(int)
print(df.head(10))
# Example output showing rich temporal features
Cyclical Features (Sin/Cos Encoding)
Month, day of week, and hour are cyclical: December (12) is close to January (1), not far away. Regular encoding doesn't capture this.
import numpy as np
import pandas as pd
# Sample data
df = pd.DataFrame({
'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'hour': [0, 6, 12, 18, 23, 5, 11, 17, 22, 4, 10, 16]
})
# ❌ Regular encoding: December (12) seems far from January (1)
print("Regular encoding issues:")
print(" Distance from Dec to Jan:", abs(12 - 1)) # 11
print(" Distance from Jun to Jul:", abs(6 - 7)) # 1
# ✅ Cyclical encoding with sin/cos
def encode_cyclical(data, col, max_val):
"""Encode cyclical feature using sin/cos transformation."""
data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
return data
# Encode month (1-12)
df = encode_cyclical(df, 'month', 12)
# Encode hour (0-23)
df = encode_cyclical(df, 'hour', 24)
print("\nCyclical encoding:")
print(df[['month', 'month_sin', 'month_cos']].head())
# Now December and January are close in the feature space!
# month=12: sin ≈ 0, cos ≈ 1
# month=1: sin ≈ 0.5, cos ≈ 0.87
# Distance in 2D space is small
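To verify that claim numerically, the short sketch below (a standalone illustration, not tied to the DataFrame above) compares raw distances with Euclidean distances in the sin/cos space:
import numpy as np
# Raw encoding: December and January look 11 apart, June and July only 1 apart
print("Raw distance Dec-Jan:", abs(12 - 1)) # 11
print("Raw distance Jun-Jul:", abs(6 - 7)) # 1
# Cyclical encoding: place each month on the unit circle, then measure Euclidean distance
months = np.array([12, 1, 6, 7])
points = np.column_stack([np.sin(2 * np.pi * months / 12), np.cos(2 * np.pi * months / 12)])
print(f"Cyclical distance Dec-Jan: {np.linalg.norm(points[0] - points[1]):.3f}") # ≈ 0.518
print(f"Cyclical distance Jun-Jul: {np.linalg.norm(points[2] - points[3]):.3f}") # ≈ 0.518 -- adjacent months are all equally close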
Time Since Events
import pandas as pd
import numpy as np
# E-commerce example: customer purchase history
df = pd.DataFrame({
'customer_id': [1, 1, 1, 2, 2, 3],
'purchase_date': pd.to_datetime([
'2024-01-15', '2024-02-20', '2024-03-25',
'2024-01-10', '2024-03-15', '2024-02-05'
]),
'amount': [100, 150, 200, 80, 120, 90]
})
# Sort by customer and date
df = df.sort_values(['customer_id', 'purchase_date'])
# Time since last purchase (in days)
df['days_since_last_purchase'] = df.groupby('customer_id')['purchase_date'].diff().dt.days
# Time since first purchase (customer age)
df['days_since_first_purchase'] = (
df['purchase_date'] -
df.groupby('customer_id')['purchase_date'].transform('min')
).dt.days
# Recency: days since most recent event (useful for churn prediction)
reference_date = pd.to_datetime('2024-04-01')
df['recency'] = (reference_date - df['purchase_date']).dt.days
print(df)
# These features are highly predictive for:
# - Purchase frequency models
# - Churn prediction
# - Customer lifetime value
Aggregated Time Windows
import pandas as pd
import numpy as np
# Transaction data
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=100, freq='D'),
'sales': np.random.randint(100, 1000, 100),
'num_transactions': np.random.randint(5, 50, 100)
})
df = df.set_index('date')
# Rolling/Moving averages
df['sales_ma_7'] = df['sales'].rolling(window=7).mean()
df['sales_ma_30'] = df['sales'].rolling(window=30).mean()
# Expanding windows (cumulative)
df['cumulative_sales'] = df['sales'].expanding().sum()
df['avg_sales_to_date'] = df['sales'].expanding().mean()
# Lagged features (previous values)
df['sales_lag_1'] = df['sales'].shift(1) # Yesterday
df['sales_lag_7'] = df['sales'].shift(7) # Last week
# Change/growth features
df['sales_change_1d'] = df['sales'] - df['sales_lag_1']
df['sales_pct_change_1d'] = df['sales'].pct_change()
df['sales_change_7d'] = df['sales'] - df['sales_lag_7']
print(df[['sales', 'sales_ma_7', 'sales_lag_1', 'sales_change_1d']].head(10))
# These are essential for time series forecasting!
Complete DateTime Feature Engineer
import pandas as pd
import numpy as np
class DateTimeFeatureEngineer:
"""Comprehensive datetime feature extraction."""
def fit_transform(self, df, date_col):
"""Extract all datetime features."""
df = df.copy()
# Ensure datetime type
df[date_col] = pd.to_datetime(df[date_col])
# Basic components
df['year'] = df[date_col].dt.year
df['month'] = df[date_col].dt.month
df['day'] = df[date_col].dt.day
df['dayofweek'] = df[date_col].dt.dayofweek
df['dayofyear'] = df[date_col].dt.dayofyear
df['quarter'] = df[date_col].dt.quarter
df['week'] = df[date_col].dt.isocalendar().week
# Time components (if datetime has time)
if df[date_col].dt.hour.sum() > 0:
df['hour'] = df[date_col].dt.hour
df['minute'] = df[date_col].dt.minute
# Boolean flags
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df[date_col].dt.is_month_start.astype(int)
df['is_month_end'] = df[date_col].dt.is_month_end.astype(int)
df['is_quarter_start'] = df[date_col].dt.is_quarter_start.astype(int)
df['is_quarter_end'] = df[date_col].dt.is_quarter_end.astype(int)
df['is_year_start'] = df[date_col].dt.is_year_start.astype(int)
df['is_year_end'] = df[date_col].dt.is_year_end.astype(int)
# Cyclical encoding
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)
df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)
return df
# Example usage
dates = pd.date_range('2024-01-01', periods=365, freq='D')
df = pd.DataFrame({'date': dates, 'value': np.random.randn(365)})
engineer = DateTimeFeatureEngineer()
df_engineered = engineer.fit_transform(df, 'date')
print(f"Original columns: {2}")
print(f"Engineered columns: {len(df_engineered.columns)}")
print("\nNew features:", df_engineered.columns.tolist())
Text Feature Engineering
Purpose: Convert text into numerical features that capture semantic meaning
Bag of Words & TF-IDF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
# Sample text data
documents = [
"machine learning is awesome",
"deep learning requires GPUs",
"machine learning and deep learning are related",
"natural language processing uses transformers"
]
# Method 1: Bag of Words (Count Vectorization)
count_vec = CountVectorizer()
bow_features = count_vec.fit_transform(documents)
print("Bag of Words:")
print("Feature names:", count_vec.get_feature_names_out())
print("\nBOW Matrix:")
print(pd.DataFrame(bow_features.toarray(),
columns=count_vec.get_feature_names_out()))
# Method 2: TF-IDF (Term Frequency - Inverse Document Frequency)
tfidf_vec = TfidfVectorizer()
tfidf_features = tfidf_vec.fit_transform(documents)
print("\n\nTF-IDF:")
print("Feature names:", tfidf_vec.get_feature_names_out())
print("\nTF-IDF Matrix:")
tfidf_df = pd.DataFrame(tfidf_features.toarray(),
columns=tfidf_vec.get_feature_names_out())
print(tfidf_df.round(3))
# TF-IDF down-weights common words (like "and", "is")
# and up-weights rare, discriminative words
N-grams: Capturing Context
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"New York is a great city",
"The weather in New York is cold",
"I love New York pizza"
]
# Unigrams only (single words)
tfidf_unigram = TfidfVectorizer(ngram_range=(1, 1))
unigram_features = tfidf_unigram.fit_transform(documents)
print("Unigrams (single words):")
print(tfidf_unigram.get_feature_names_out())
# Bigrams (2-word phrases)
tfidf_bigram = TfidfVectorizer(ngram_range=(2, 2))
bigram_features = tfidf_bigram.fit_transform(documents)
print("\nBigrams (2-word phrases):")
print(tfidf_bigram.get_feature_names_out())
# Output includes: 'new york', 'york is', 'great city', etc.
# Combined: Unigrams + Bigrams
tfidf_combined = TfidfVectorizer(ngram_range=(1, 2))
combined_features = tfidf_combined.fit_transform(documents)
print("\nCombined (1-grams and 2-grams):")
print(f"Total features: {len(tfidf_combined.get_feature_names_out())}")
print("Sample features:", tfidf_combined.get_feature_names_out()[:10])
# "New York" is now captured as a single meaningful phrase!
Advanced Text Features
import pandas as pd
import re
from textblob import TextBlob # pip install textblob
class TextFeatureExtractor:
"""Extract statistical and linguistic features from text."""
def extract_features(self, text):
"""Extract multiple text features."""
features = {}
# Basic statistics
features['char_count'] = len(text)
features['word_count'] = len(text.split())
features['sentence_count'] = len(text.split('.'))
features['avg_word_length'] = (
sum(len(word) for word in text.split()) / len(text.split())
if text.split() else 0
)
# Special characters
features['exclamation_count'] = text.count('!')
features['question_count'] = text.count('?')
features['uppercase_count'] = sum(1 for c in text if c.isupper())
features['digit_count'] = sum(1 for c in text if c.isdigit())
# Uppercase ratio
features['uppercase_ratio'] = (
features['uppercase_count'] / len(text) if len(text) > 0 else 0
)
# Sentiment analysis (requires textblob)
try:
blob = TextBlob(text)
features['sentiment_polarity'] = blob.sentiment.polarity # -1 to 1
features['sentiment_subjectivity'] = blob.sentiment.subjectivity # 0 to 1
except Exception:
features['sentiment_polarity'] = 0
features['sentiment_subjectivity'] = 0
# Part of speech counts (simplified)
# In real scenarios, use spaCy or NLTK
features['has_url'] = int(bool(re.search(r'http[s]?://', text)))
features['has_email'] = int(bool(re.search(r'\S+@\S+', text)))
features['has_phone'] = int(bool(re.search(r'\d{3}[-.]?\d{3}[-.]?\d{4}', text)))
return features
# Example usage
texts = [
"Check out this AMAZING deal at http://example.com! Contact us at info@example.com",
"This is a normal sentence without special features.",
"WHY ARE YOU YELLING??? Please stop!!!"
]
extractor = TextFeatureExtractor()
features_list = [extractor.extract_features(text) for text in texts]
df_features = pd.DataFrame(features_list)
print(df_features)
# These features are valuable for:
# - Spam detection
# - Sentiment analysis
# - Content classification
# - Quality assessment
Word Embeddings (Advanced)
# Using pre-trained embeddings (requires download)
# Popular options: Word2Vec, GloVe, FastText
# Simple example with sentence transformers
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sample texts
texts = [
"machine learning is fascinating",
"deep learning requires GPUs",
"natural language processing"
]
# Generate embeddings (dense vectors)
embeddings = model.encode(texts)
print("Embedding shape:", embeddings.shape) # (3, 384)
# Each text is now a 384-dimensional vector!
# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)
print("\nCosine similarity between texts:")
print(similarity_matrix)
# Texts with similar meanings have higher similarity scores
# "machine learning" and "deep learning" will be more similar
# than "machine learning" and "language processing"
Domain-Specific Feature Engineering
The most powerful features often come from domain knowledge—understanding what matters in your specific problem.
Example 1: E-Commerce / Retail
import pandas as pd
import numpy as np
# Customer transaction data
df = pd.DataFrame({
'customer_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
'transaction_date': pd.to_datetime([
'2024-01-05', '2024-01-12', '2024-02-15',
'2024-01-08', '2024-03-10',
'2024-01-03', '2024-01-10', '2024-02-05', '2024-03-15'
]),
'amount': [50, 75, 100, 200, 150, 30, 40, 35, 45],
'category': ['Electronics', 'Clothing', 'Electronics',
'Electronics', 'Electronics',
'Food', 'Food', 'Food', 'Clothing']
})
# Customer-level features (aggregated per customer)
customer_features = df.groupby('customer_id').agg({
'amount': ['sum', 'mean', 'std', 'min', 'max', 'count'],
'transaction_date': ['min', 'max'],
'category': lambda x: x.nunique()
}).reset_index()
# Flatten the MultiIndex columns into descriptive names
customer_features.columns = ['customer_id', 'total_spend', 'avg_spend',
'std_spend', 'min_spend', 'max_spend',
'num_transactions', 'first_purchase',
'last_purchase', 'num_categories']
# Additional engineered features
customer_features['customer_lifetime_days'] = (
customer_features['last_purchase'] - customer_features['first_purchase']
).dt.days
customer_features['purchase_frequency'] = (
customer_features['num_transactions'] /
(customer_features['customer_lifetime_days'] + 1) # +1 to avoid division by zero
)
customer_features['avg_days_between_purchases'] = (
customer_features['customer_lifetime_days'] /
(customer_features['num_transactions'] - 1).clip(lower=1)
)
# Recency (days since last purchase)
reference_date = pd.to_datetime('2024-04-01')
customer_features['recency'] = (
reference_date - customer_features['last_purchase']
).dt.days
# RFM Score (Recency, Frequency, Monetary)
customer_features['frequency_score'] = pd.qcut(
customer_features['num_transactions'], q=3, labels=[1, 2, 3]
)
customer_features['monetary_score'] = pd.qcut(
customer_features['total_spend'], q=3, labels=[1, 2, 3]
)
customer_features['recency_score'] = pd.qcut(
customer_features['recency'].rank(method='first'), # rank() avoids duplicate bin edges when recencies tie
q=3, labels=[3, 2, 1] # Lower recency = better
)
print(customer_features)
# These features are highly predictive for:
# - Churn prediction
# - Customer lifetime value
# - Next purchase prediction
# - Customer segmentation
Example 2: Finance / Credit Scoring
import pandas as pd
import numpy as np
# Loan application data
df = pd.DataFrame({
'income': [50000, 75000, 40000, 100000],
'loan_amount': [20000, 30000, 25000, 40000],
'existing_debt': [5000, 10000, 15000, 20000],
'employment_years': [3, 7, 2, 10],
'num_credit_lines': [2, 4, 1, 6],
'delinquencies': [0, 1, 2, 0]
})
# Domain-specific ratios and features
df['debt_to_income_ratio'] = df['existing_debt'] / df['income']
df['loan_to_income_ratio'] = df['loan_amount'] / df['income']
df['total_debt_ratio'] = (df['existing_debt'] + df['loan_amount']) / df['income']
# Credit utilization proxy
df['debt_per_credit_line'] = df['existing_debt'] / df['num_credit_lines']
# Stability score
df['employment_stability'] = np.where(df['employment_years'] >= 5, 1, 0)
# Risk score (simple heuristic)
df['risk_score'] = (
df['debt_to_income_ratio'] * 0.3 +
df['delinquencies'] * 0.4 +
(1 - df['employment_stability']) * 0.3
)
# Available income after debt
df['disposable_income'] = df['income'] - df['existing_debt']
df['payment_capacity'] = df['disposable_income'] / df['loan_amount']
print(df)
# These domain-specific features capture financial health
# much better than raw features alone
Example 3: Healthcare / Medical
import pandas as pd
import numpy as np
# Patient data
df = pd.DataFrame({
'age': [45, 67, 52, 38, 71],
'weight_kg': [80, 90, 75, 68, 85],
'height_cm': [175, 170, 168, 180, 165],
'systolic_bp': [120, 140, 135, 118, 150],
'diastolic_bp': [80, 90, 85, 78, 95],
'glucose': [95, 110, 105, 92, 125],
'cholesterol': [180, 220, 200, 175, 240]
})
# Body Mass Index (BMI)
df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2
# BMI category
df['bmi_category'] = pd.cut(
df['bmi'],
bins=[0, 18.5, 25, 30, 100],
labels=['Underweight', 'Normal', 'Overweight', 'Obese']
)
# Blood pressure category (using standard guidelines)
df['hypertension_stage'] = pd.cut(
df['systolic_bp'],
bins=[0, 120, 130, 140, 180, 300],
labels=['Normal', 'Elevated', 'Stage1', 'Stage2', 'Crisis']
)
# Pulse pressure (indicator of cardiovascular health)
df['pulse_pressure'] = df['systolic_bp'] - df['diastolic_bp']
# Mean arterial pressure
df['map'] = df['diastolic_bp'] + (df['pulse_pressure'] / 3)
# Metabolic syndrome indicators
df['glucose_high'] = (df['glucose'] >= 100).astype(int)
df['cholesterol_high'] = (df['cholesterol'] >= 200).astype(int)
df['bp_high'] = (df['systolic_bp'] >= 130).astype(int)
df['obesity'] = (df['bmi'] >= 30).astype(int)
# Risk score (sum of risk factors)
df['metabolic_risk_score'] = (
df['glucose_high'] + df['cholesterol_high'] +
df['bp_high'] + df['obesity']
)
# Age-adjusted risk
df['age_risk_multiplier'] = np.where(df['age'] >= 60, 1.5, 1.0)
df['adjusted_risk'] = df['metabolic_risk_score'] * df['age_risk_multiplier']
print(df[['age', 'bmi', 'bmi_category', 'hypertension_stage',
'metabolic_risk_score', 'adjusted_risk']])
# Medical domain knowledge creates highly predictive features
# for disease risk prediction
Automated Feature Engineering
While domain expertise is invaluable, automated tools can help discover features you might miss.
Featuretools: Automated Feature Generation
# Install: pip install featuretools
import featuretools as ft
import pandas as pd
# Sample data: customers and transactions
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'join_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-01-10']),
'age': [25, 35, 45]
})
transactions = pd.DataFrame({
'transaction_id': [1, 2, 3, 4, 5, 6],
'customer_id': [1, 1, 2, 2, 3, 3],
'amount': [50, 75, 200, 150, 30, 40],
'timestamp': pd.to_datetime([
'2023-03-01', '2023-03-15',
'2023-03-05', '2023-03-20',
'2023-03-10', '2023-03-25'
])
})
# Create EntitySet (data structure for featuretools)
es = ft.EntitySet(id='customer_data')
# Add entities (tables)
es = es.add_dataframe(
dataframe_name='customers',
dataframe=customers,
index='customer_id',
time_index='join_date'
)
es = es.add_dataframe(
dataframe_name='transactions',
dataframe=transactions,
index='transaction_id',
time_index='timestamp'
)
# Define relationship
es = es.add_relationship('customers', 'customer_id',
'transactions', 'customer_id')
# Deep Feature Synthesis (DFS) - automatically creates features
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name='customers',
max_depth=2,
verbose=True
)
print("\nAutomatically generated features:")
print(feature_matrix)
print("\nFeature definitions:")
for feature in feature_defs:
print(f" - {feature.get_name()}")
# Featuretools creates features like:
# - SUM(transactions.amount)
# - MEAN(transactions.amount)
# - COUNT(transactions)
# - MAX(transactions.amount)
# - And many more complex aggregations!
Feature Engineering Best Practices
🎯 Guidelines for Creating Good Features
- Start with domain knowledge: Understand what matters in your domain
- Look for relationships: Ratios, differences, interactions often work well
- Consider non-linearity: Polynomial features, log transforms for skewed data
- Extract from complex types: Dates, text, categorical have hidden information
- Create aggregations: Group statistics (mean, sum, std, etc.)
- Mind leakage: Don't use future information or target-derived features
- Validate usefulness: Use feature importance or selection methods (see the sketch after this list)
- Iterate: Feature engineering is experimental; try, test, refine
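As one way to act on the "validate usefulness" guideline above, the sketch below (synthetic data and invented feature names standing in for engineered features) uses permutation importance on held-out data to spot weak candidates:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
# Synthetic stand-in for a feature-engineered dataset (column names are invented)
X, y = make_regression(n_samples=500, n_features=4, n_informative=2, random_state=42)
X = pd.DataFrame(X, columns=['ratio_a', 'rolling_mean_b', 'interaction_c', 'noise_d'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# How much does held-out performance drop when each feature is shuffled?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances) # features with near-zero importance are candidates to drop
A feature with suspiciously dominant importance is also worth double-checking for leakage.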
🧠 Knowledge Check
Question 1: What do polynomial features help capture?
Question 2: Why use sin/cos encoding for cyclical features like month or hour?
Question 3: What does TF-IDF do compared to simple word counts?
Question 4: What are n-grams useful for in text processing?
Question 5: In e-commerce, what does the "recency" feature measure?
Question 6: What problem does polynomial features create with many input features?
Question 7: What's the most important ingredient for domain-specific feature engineering?
Question 8: Rolling/moving averages are particularly useful for what type of data?
💻 Practice Exercises
Exercise 1: Polynomial Features Impact
Compare model performance with and without polynomial features:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
# Create non-linear dataset
X, y = make_regression(n_samples=200, n_features=2, noise=10, random_state=42)
y = y + 0.5 * X[:, 0]**2 + 0.3 * X[:, 1]**2 # Add non-linearity
# TODO:
# 1. Split into train/test (80/20)
# 2. Train Ridge model WITHOUT polynomial features
# 3. Train Ridge model WITH polynomial features (degree 2)
# 4. Compare R² scores
# 5. Try degrees 2, 3, 4 - what happens?
# 6. Use interaction_only=True - how does it affect results?
Exercise 2: DateTime Feature Engineering
Extract comprehensive features from datetime data:
import pandas as pd
import numpy as np
# Sales data with dates
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
df = pd.DataFrame({
'date': dates,
'sales': np.random.randint(100, 1000, len(dates))
})
# TODO:
# 1. Extract: year, month, day, dayofweek, quarter
# 2. Create binary flags: is_weekend, is_month_start, is_month_end
# 3. Add cyclical encoding for month and dayofweek (sin/cos)
# 4. Create rolling average features (7-day, 30-day)
# 5. Add lagged features (sales from 1, 7 days ago)
# 6. Train a model to predict sales using these features
Exercise 3: Text Feature Extraction
Build a spam classifier using text features:
from sklearn.datasets import fetch_20newsgroups
# Load some text data
categories = ['alt.atheism', 'soc.religion.christian']
data = fetch_20newsgroups(subset='train', categories=categories)
texts = data.data[:100]
labels = data.target[:100]
# TODO:
# 1. Extract TF-IDF features (unigrams and bigrams)
# 2. Extract statistical features:
# - Character count, word count, avg word length
# - Count of uppercase chars, exclamation marks
# 3. Combine TF-IDF and statistical features
# 4. Train a classifier (Logistic Regression or SVM)
# 5. Compare with using only TF-IDF vs only statistical features
Exercise 4: Customer RFM Features
Create e-commerce customer features for churn prediction:
# Generate customer transaction data
# TODO:
# 1. Create sample transaction data (customer_id, date, amount)
# 2. Calculate per-customer aggregations:
# - Total spend, average spend, number of transactions
# - First and last purchase dates
# 3. Engineer RFM features:
# - Recency: days since last purchase
# - Frequency: number of transactions
# - Monetary: total or average spend
# 4. Add derived features:
# - Purchase frequency (transactions per day active)
# - Customer lifetime value prediction
# - Days between purchases (average)
# 5. Segment customers using these features (K-Means)
Exercise 5: Domain-Specific Healthcare Features
Create medical features for diabetes risk prediction:
# Patient data with: age, weight, height, glucose, bp
# TODO:
# 1. Calculate BMI from weight and height
# 2. Create BMI categories (underweight/normal/overweight/obese)
# 3. Create blood pressure categories (normal/elevated/hypertension)
# 4. Engineer metabolic syndrome indicators:
# - High glucose (>100 mg/dL)
# - High BP (systolic >130)
# - Obesity (BMI >30)
# 5. Create risk scores combining multiple factors
# 6. Add age-adjusted risk multipliers
# 7. Train a classifier to predict diabetes risk
📝 Summary
Feature extraction and creation is where machine learning becomes an art informed by both data science and domain expertise:
Key Takeaways
- Polynomial Features: Capture non-linear relationships and feature interactions. Use PolynomialFeatures with degree 2-3. Watch for dimensionality explosion with many features.
- DateTime Features: Extract components (year, month, day, hour), create boolean flags (weekends, month-end), use sin/cos for cyclical features, calculate time since events, and create rolling/lagged features for time series.
- Text Features: Use TF-IDF for term importance, n-grams for context, statistical features (length, special chars), and embeddings for semantic meaning. Combine multiple approaches for best results.
- Domain-Specific Features: Most powerful features come from domain knowledge. Create meaningful ratios, aggregations, and combinations that capture what matters in your problem space.
- Automated Tools: Libraries like Featuretools can discover features automatically through deep feature synthesis, but domain expertise remains crucial for feature interpretation and selection.
- Best Practice: Feature engineering is iterative. Create features, test their importance, refine based on results. Always validate that features improve model performance on held-out data.
🎯 Feature Engineering Mindset
- Ask: "What patterns would help a model make better predictions?"
- Think about relationships between features (ratios, differences, products)
- Consider transformations (logs, roots, polynomials for non-linearity)
- Extract hidden information (from dates, text, complex types)
- Use domain knowledge to create meaningful derived features
- Test and validate - not all features will be useful!
In the next tutorial, we'll explore Feature Transformation Techniques—handling skewed distributions and creating better feature representations. See you there! 🚀