Having too many features can hurt model performance through the curse of dimensionality, overfitting, and increased computational cost. Feature selection identifies and keeps only the most relevant features, improving model performance, interpretability, and training speed.
🎯 What You'll Learn
- Filter methods: correlation, chi-square, mutual information
- Wrapper methods: RFE (Recursive Feature Elimination)
- Embedded methods: LASSO (L1 regularization), tree-based feature importance
- When to use each method and how to combine them
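Before diving into the individual techniques, here is a quick, self-contained sketch of the problem described above. The dataset (iris) and model (k-nearest neighbors) are illustrative choices, not part of the examples that follow: appending random noise columns usually lowers the cross-validated accuracy of a distance-based model.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 50))])  # 50 irrelevant columns

knn = KNeighborsClassifier()
print(f"4 informative features:     {cross_val_score(knn, X, y, cv=5).mean():.4f}")
print(f"4 informative + 50 noise:   {cross_val_score(knn, X_noisy, y, cv=5).mean():.4f}")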
Filter Methods: Correlation
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
# Load regression data (California housing)
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
# Correlation with target
correlations = df.corr()['target'].abs().sort_values(ascending=False)
print("Feature correlations with target:")
print(correlations)
# Keep features whose absolute correlation with the target exceeds a cutoff
# (the cutoff is dataset-dependent; 0.1 is used here for illustration)
threshold = 0.1
selected_features = correlations[correlations > threshold].index.tolist()
selected_features.remove('target')
print(f"\nSelected features: {selected_features}")
# Remove highly correlated features (multicollinearity)
corr_matrix = df[selected_features].corr().abs()
upper_triangle = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
to_drop = [column for column in upper_triangle.columns
           if any(upper_triangle[column] > 0.85)]
print(f"Drop due to multicollinearity: {to_drop}")
Filter Methods: Chi-Square Test
Use the chi-square test for classification problems whose features are non-negative, such as counts or encoded categories.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris
# Load classification data
iris = load_iris()
X, y = iris.data, iris.target
# Chi-square test (requires non-negative features)
selector = SelectKBest(chi2, k=2) # Select top 2 features
X_selected = selector.fit_transform(X, y)
# Get selected feature indices
selected_indices = selector.get_support(indices=True)
selected_names = [iris.feature_names[i] for i in selected_indices]
print(f"Selected features: {selected_names}")
print(f"Chi-square scores: {selector.scores_}")
print(f"P-values: {selector.pvalues_}")
Filter Methods: Mutual Information
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
# For classification
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_scores = pd.Series(mi_scores, index=iris.feature_names)
mi_scores = mi_scores.sort_values(ascending=False)
print("Mutual Information scores:")
print(mi_scores)
# Select top features
threshold = 0.3
selected_features = mi_scores[mi_scores > threshold].index.tolist()
print(f"\nSelected features (MI > {threshold}): {selected_features}")
Wrapper Methods: Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Create model
model = LogisticRegression(max_iter=1000)
# RFE with fixed number of features
rfe = RFE(estimator=model, n_features_to_select=2)
rfe.fit(X, y)
print("RFE selected features:")
selected_features = [iris.feature_names[i] for i in rfe.get_support(indices=True)]
print(selected_features)
print(f"Feature ranking: {rfe.ranking_}")
# RFECV with cross-validation (automatic selection)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='accuracy')
rfecv.fit(X, y)
print(f"\nOptimal number of features: {rfecv.n_features_}")
print(f"Selected features: {[iris.feature_names[i] for i in range(len(iris.feature_names)) if rfecv.support_[i]]}")
Embedded Methods: LASSO (L1 Regularization)
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
# Standardize features (important for LASSO)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# LASSO with cross-validation to find optimal alpha
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_scaled, y)
print(f"Optimal alpha: {lasso_cv.alpha_:.4f}")
# Get feature coefficients
lasso = Lasso(alpha=lasso_cv.alpha_)
lasso.fit(X_scaled, y)
# Features with non-zero coefficients are selected
feature_importance = pd.Series(abs(lasso.coef_), index=iris.feature_names)
selected_features = feature_importance[feature_importance > 0].index.tolist()
print(f"\nLASSO selected features: {selected_features}")
print("\nFeature coefficients:")
print(feature_importance.sort_values(ascending=False))
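One caveat: the iris target holds class labels, so Lasso above is treating them as ordinary numbers. For classification problems, an L1-penalized logistic regression wrapped in SelectFromModel is the more natural embedded selector. A minimal sketch (the C value is an arbitrary illustrative choice):
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# The L1 penalty drives some coefficients to zero; SelectFromModel keeps the rest
l1_logreg = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
selector = SelectFromModel(l1_logreg)
selector.fit(X_scaled, y)
selected = [iris.feature_names[i] for i in selector.get_support(indices=True)]
print(f"L1 logistic regression selected features: {selected}")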
Embedded Methods: Tree-Based Importance
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Get feature importances
importances = pd.Series(rf.feature_importances_, index=iris.feature_names)
importances = importances.sort_values(ascending=False)
print("Feature importances:")
print(importances)
# Select features above threshold
threshold = 0.1
selected_features = importances[importances > threshold].index.tolist()
print(f"\nSelected features (importance > {threshold}): {selected_features}")
# Visualize
importances.plot(kind='bar', figsize=(10, 6))
plt.title('Feature Importances')
plt.ylabel('Importance')
plt.axhline(y=threshold, color='r', linestyle='--', label=f'Threshold ({threshold})')
plt.legend()
plt.tight_layout()
plt.show()
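To turn these importances into an actual selection step that can sit inside a pipeline, SelectFromModel can wrap the forest. A short sketch reusing the same illustrative threshold:
from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds the threshold
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=threshold,
)
X_reduced = sfm.fit_transform(X, y)
print(f"Shape after selection: {X_reduced.shape}")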
Complete Feature Selection Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Method 1: Filter + Model
pipe_filter = Pipeline([
    ('filter', SelectKBest(f_classif, k=3)),
    ('classifier', RandomForestClassifier())
])
# Method 2: Embedded selection (no explicit selector needed)
pipe_embedded = Pipeline([
    ('classifier', RandomForestClassifier(n_estimators=100))
])
# Compare performance
scores_filter = cross_val_score(pipe_filter, X, y, cv=5, scoring='accuracy')
scores_embedded = cross_val_score(pipe_embedded, X, y, cv=5, scoring='accuracy')
scores_all = cross_val_score(RandomForestClassifier(), X, y, cv=5, scoring='accuracy')
print(f"With filter selection: {scores_filter.mean():.4f} (+/- {scores_filter.std():.4f})")
print(f"Embedded (RF): {scores_embedded.mean():.4f} (+/- {scores_embedded.std():.4f})")
print(f"All features: {scores_all.mean():.4f} (+/- {scores_all.std():.4f})")
🧠 Knowledge Check
Question 1: What's the main advantage of filter methods?
- Fast and model-agnostic - work independently of the ML algorithm
- Always provide the best features for any model
- Automatically tune hyperparameters
- Require no data preprocessing
Question 2: What does RFE (Recursive Feature Elimination) do?
- Removes features with high correlation
- Iteratively removes least important features based on model performance
- Transforms features to reduce dimensionality
- Creates new features from existing ones
Question 3: How does LASSO perform feature selection?
- By removing correlated features
- By ranking features by importance
- By shrinking some feature coefficients to exactly zero
- By applying PCA transformation
Question 4: Which method is computationally most expensive?
- Correlation filter
- Chi-square test
- Tree-based importance
- RFE with cross-validation
📝 Summary
Key Takeaways
- Filter Methods: Fast, model-independent. Use correlation, chi-square, mutual information.
- Wrapper Methods: RFE evaluates feature subsets with the model itself. More accurate but slower.
- Embedded Methods: LASSO and tree importance select during training. Good balance.
- Combine methods: Use a filter for initial reduction, then wrapper/embedded selection for refinement (see the sketch after this list).
- Always validate: Ensure selected features improve performance on held-out data.
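As a concrete (illustrative) example of the filter-then-refine idea from the takeaways, a cheap univariate filter can prune the feature pool before RFE performs the model-driven refinement. The step names and the k / n_features_to_select values below are arbitrary choices:
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

combined = Pipeline([
    ('filter', SelectKBest(f_classif, k=3)),            # cheap univariate pruning
    ('refine', RFE(LogisticRegression(max_iter=1000),   # model-driven refinement
                   n_features_to_select=2)),
    ('classifier', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(combined, X, y, cv=5, scoring='accuracy')
print(f"Filter + RFE pipeline accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")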