Having too many features can hurt model performance through the curse of dimensionality, overfitting, and increased computational cost. Feature selection identifies and keeps only the most relevant features, improving model performance, interpretability, and training speed.
🎯 What You'll Learn
- Filter methods: correlation, chi-square, mutual information
- Wrapper methods: RFE (Recursive Feature Elimination)
- Embedded methods: LASSO (L1 regularization), tree-based feature importance
- When to use each method and how to combine them
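Before diving into the individual techniques, here is a quick, self-contained sketch of the problem described above. The dataset (iris) and model (k-nearest neighbors) are illustrative choices, not part of the examples that follow: appending random noise columns usually lowers the cross-validated accuracy of a distance-based model.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 50))])  # 50 irrelevant columns

knn = KNeighborsClassifier()
print(f"4 informative features:     {cross_val_score(knn, X, y, cv=5).mean():.4f}")
print(f"4 informative + 50 noise:   {cross_val_score(knn, X_noisy, y, cv=5).mean():.4f}")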
Filter Methods: Correlation
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
# Load regression data (California housing)
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Correlation with target
correlations = df.corr()['target'].abs().sort_values(ascending=False)
print("Feature correlations with target:")
print(correlations)
# Keep features whose absolute correlation with the target exceeds a cutoff
# (the cutoff is dataset-dependent; 0.1 is used here for illustration)
threshold = 0.1
selected_features = correlations[correlations > threshold].index.tolist()
selected_features.remove('target')
print(f"\nSelected features: {selected_features}")
# Remove highly correlated features (multicollinearity)
corr_matrix = df[selected_features].corr().abs()
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
to_drop = [column for column in upper_triangle.columns
           if any(upper_triangle[column] > 0.85)]
print(f"Drop due to multicollinearity: {to_drop}")
Filter Methods: Chi-Square Test
Use the chi-square test for classification problems whose features are non-negative, such as counts or encoded categories.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris
# Load classification data
iris = load_iris()
X, y = iris.data, iris.target
# Chi-square test (requires non-negative features)
selector = SelectKBest(chi2, k=2) # Select top 2 features
X_selected = selector.fit_transform(X, y)
# Get selected feature indices
selected_indices = selector.get_support(indices=True)
selected_names = [iris.feature_names[i] for i in selected_indices]
print(f"Selected features: {selected_names}")
print(f"Chi-square scores: {selector.scores_}")
print(f"P-values: {selector.pvalues_}")
Filter Methods: Mutual Information
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
# For classification
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_scores = pd.Series(mi_scores, index=iris.feature_names)
mi_scores = mi_scores.sort_values(ascending=False)
print("Mutual Information scores:")
print(mi_scores)
# Select top features
threshold = 0.3
selected_features = mi_scores[mi_scores > threshold].index.tolist()
print(f"\nSelected features (MI > {threshold}): {selected_features}")
Wrapper Methods: Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Create model
model = LogisticRegression(max_iter=1000)
# RFE with fixed number of features
rfe = RFE(estimator=model, n_features_to_select=2)
rfe.fit(X, y)
print("RFE selected features:")
selected_features = [iris.feature_names[i] for i in rfe.get_support(indices=True)]
print(selected_features)
print(f"Feature ranking: {rfe.ranking_}")
# RFECV with cross-validation (automatic selection)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)
print(f"\nOptimal number of features: {rfecv.n_features_}")
print(f"Selected features: {[iris.feature_names[i] for i in range(len(iris.feature_names)) if rfecv.support_[i]]}")
Embedded Methods: LASSO (L1 Regularization)
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
# Standardize features (important for LASSO)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# LASSO with cross-validation to find optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
# Get feature coefficients
lasso = Lasso(alpha=lasso_cv.alpha_)
lasso.fit(X_scaled, y)
# Features with non-zero coefficients are selected
feature_importance = pd.Series(abs(lasso.coef_), index=iris.feature_names)
selected_features = feature_importance[feature_importance > 0].index.tolist()
print(f"\nLASSO selected features: {selected_features}")
print("\nFeature coefficients:")
print(feature_importance.sort_values(ascending=False))
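One caveat: the iris target holds class labels, so Lasso above is treating them as ordinary numbers. For classification problems, an L1-penalized logistic regression wrapped in SelectFromModel is the more natural embedded selector. A minimal sketch (the C value is an arbitrary illustrative choice):
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# The L1 penalty drives some coefficients to zero; SelectFromModel keeps the rest
l1_logreg = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
selector = SelectFromModel(l1_logreg)
selector.fit(X_scaled, y)
selected = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print(f"L1 logistic regression selected features: {selected}")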
Embedded Methods: Tree-Based Importance
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importances
importances = pd.Series(rf.feature_importances_, index=iris.feature_names)
importances = importances.sort_values(ascending=False)
print("Feature importances:")
print(importances)
# Select features above threshold
threshold = 0.1
selected_features = importances[importances > threshold].index.tolist()
print(f"\nSelected features (importance > {threshold}): {selected_features}")
# Visualize
importances.plot(kind='bar', figsize=(10, 6))
plt.title('Feature Importances')
plt.ylabel('Importance')
plt.axhline(y=threshold, color='r', linestyle='--', label=f'Threshold ({threshold})')
plt.legend()
plt.tight_layout()
plt.show()
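To turn these importances into an actual selection step that can sit inside a pipeline, SelectFromModel can wrap the forest. A short sketch reusing the same illustrative threshold:
from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds the threshold
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=threshold,
)
X_reduced = sfm.fit_transform(X, y)
print(f"Shape after selection: {X_reduced.shape}")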
Complete Feature Selection Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Method 1: Filter + Model
pipe_filter = Pipeline([
    ('filter', SelectKBest(f_classif, k=3)),
    ('classifier', RandomForestClassifier())
])
# Method 2: Embedded selection (no explicit selector needed)
pipe_embedded = Pipeline([
    ('classifier', RandomForestClassifier(n_estimators=100))
])
# Compare performance
scores_filter = cross_val_score(pipe_filter, X, y, cv=5, scoring='accuracy')
scores_embedded = cross_val_score(pipe_embedded, X, y, cv=5, scoring='accuracy')
scores_all = cross_val_score(RandomForestClassifier(), X, y, cv=5, scoring='accuracy')
print(f"With filter selection: {scores_filter.mean():.4f} (+/- {scores_filter.std():.4f})")
print(f"Embedded (RF): {scores_embedded.mean():.4f} (+/- {scores_embedded.std():.4f})")
print(f"All features: {scores_all.mean():.4f} (+/- {scores_all.std():.4f})")
🧠 Knowledge Check
Question 1: What's the main advantage of filter methods?
- Fast and model-agnostic - work independently of the ML algorithm
- Always provide the best features for any model
- Automatically tune hyperparameters
- Require no data preprocessing
Question 2: What does RFE (Recursive Feature Elimination) do?
- Removes features with high correlation
- Iteratively removes least important features based on model performance
- Transforms features to reduce dimensionality
- Creates new features from existing ones
Question 3: How does LASSO perform feature selection?
- By removing correlated features
- By ranking features by importance
- By shrinking some feature coefficients to exactly zero
- By applying PCA transformation
Question 4: Which method is computationally most expensive?
- Correlation filter
- Chi-square test
- Tree-based importance
- RFE with cross-validation
📝 Summary
Key Takeaways
- Filter Methods: Fast, model-independent. Use correlation, chi-square, mutual information.
- Wrapper Methods: RFE evaluates feature subsets with the model itself. More accurate but slower.
- Embedded Methods: LASSO and tree importance select during training. Good balance.
- Combine methods: Use a filter for initial reduction, then wrapper/embedded selection for refinement (see the sketch after this list).
- Always validate: Ensure selected features improve performance on held-out data.
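As a concrete (illustrative) example of the filter-then-refine idea from the takeaways, a cheap univariate filter can prune the feature pool before RFE performs the model-driven refinement. The step names and the k / n_features_to_select values below are arbitrary choices:
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

combined = Pipeline([
    ('filter', SelectKBest(f_classif, k=3)),            # cheap univariate pruning
    ('refine', RFE(LogisticRegression(max_iter=1000),   # model-driven refinement
                   n_features_to_select=2)),
    ('classifier', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(combined, X, y, cv=5, scoring='accuracy')
print(f"Filter + RFE pipeline accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")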