Project Overview
Build a fraud detection system using the Credit Card Fraud Detection dataset. You'll tackle highly imbalanced data (99.8% legitimate transactions), engineer temporal and statistical features, and apply advanced feature selection techniques.
Learning Objectives
- Handle severely imbalanced datasets with SMOTE/undersampling
- Create time-based and statistical transaction features
- Apply feature selection to reduce dimensionality
- Use anomaly detection techniques (Isolation Forest)
- Evaluate models with precision, recall, F1-score, and ROC-AUC
Dataset
Download from: Kaggle Credit Card Fraud Detection
Or use the synthetic data generation code below to get started quickly.
Step 1: Load and Explore Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# If you have the Kaggle dataset:
# df = pd.read_csv('creditcard.csv')
# Or generate synthetic data for practice:
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=50000,
n_features=30,
n_informative=20,
n_redundant=5,
n_classes=2,
weights=[0.998, 0.002], # Highly imbalanced
random_state=42
)
df = pd.DataFrame(X, columns=[f'V{i}' for i in range(1, 31)])
df['Time'] = np.arange(len(df))
np.random.seed(42)  # seed so the synthetic amounts are reproducible
df['Amount'] = np.abs(np.random.normal(100, 50, len(df)))
df['Class'] = y
print(f"Dataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['Class'].value_counts())
print(f"\nFraud percentage: {(df['Class'].sum() / len(df)) * 100:.2f}%")
# Visualize class imbalance
plt.figure(figsize=(8, 5))
df['Class'].value_counts().plot(kind='bar')
plt.title('Class Distribution (0: Legitimate, 1: Fraud)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
Step 2: Feature Engineering - Transaction Features
# Create time-based features
df['Hour'] = (df['Time'] // 3600) % 24  # Hour of day (integer, 0-23)
df['Day'] = df['Time'] // (3600 * 24)   # Day number
# Amount-based features
df['Amount_Log'] = np.log1p(df['Amount']) # Log-transform for skewness
df['Amount_Sqrt'] = np.sqrt(df['Amount'])
# Statistical features (rolling windows)
# Group by customer (the Kaggle dataset has no customer IDs, so we assign
# synthetic ones purely to demonstrate per-customer features)
df['Customer_ID'] = np.random.randint(1, 1000, len(df))
# Calculate rolling statistics per customer
df = df.sort_values(['Customer_ID', 'Time'])
# Transaction frequency features
# NOTE: rolling(window=86400) counts the last 86400 ROWS, not seconds --
# a true one-day window requires a datetime index
df['Trans_Count_1day'] = df.groupby('Customer_ID')['Time'].transform(
lambda x: x.rolling(window=86400, min_periods=1).count()
)
# Amount aggregations over each customer's last 10 transactions
# (row-based; the '_1day' suffix is kept for consistency with the prints below)
df['Amount_Mean_1day'] = df.groupby('Customer_ID')['Amount'].transform(
lambda x: x.rolling(window=10, min_periods=1).mean()
)
df['Amount_Std_1day'] = df.groupby('Customer_ID')['Amount'].transform(
lambda x: x.rolling(window=10, min_periods=1).std()
).fillna(0)
# Deviation from customer's typical behavior
df['Amount_Deviation'] = df['Amount'] - df['Amount_Mean_1day']
df['Amount_Z_Score'] = df['Amount_Deviation'] / (df['Amount_Std_1day'] + 1e-5)
print("\nNew features created:")
print(df[['Hour', 'Day', 'Amount_Log', 'Trans_Count_1day',
'Amount_Deviation', 'Amount_Z_Score']].head())
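The rolling windows above are row-based, so "1day" is only an approximation. If you want a genuine 24-hour window, pandas supports time-offset rolling on a datetime index. Here is a minimal, self-contained sketch on a small synthetic frame (the column names mirror the ones above; `Trans_Count_24h` and `Amount_Mean_24h` are illustrative names, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Customer_ID": rng.integers(1, 4, 20),
    "Time": np.arange(20) * 5000,               # seconds; spans ~27.8 hours
    "Amount": rng.normal(100, 50, 20).round(2),
})

# Convert the seconds offset to a datetime so pandas can use a true
# time-based window instead of a row-count window.
demo["ts"] = pd.to_datetime(demo["Time"], unit="s")
demo = demo.sort_values(["Customer_ID", "ts"]).set_index("ts")

# Rolling count/mean over a genuine 24-hour window, per customer.
rolled = demo.groupby("Customer_ID")["Amount"].rolling("24h").agg(["count", "mean"])
r = rolled.reset_index(level=0, drop=True)  # back to the ts index for alignment
demo["Trans_Count_24h"] = r["count"]
demo["Amount_Mean_24h"] = r["mean"]
print(demo.head())
```

The key difference: `rolling("24h")` interprets the window as a time span over the index, so a quiet customer and a bursty customer get correctly sized windows.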
Step 3: Feature Selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, RFE
# Prepare features
feature_cols = [col for col in df.columns if col not in ['Class', 'Time', 'Customer_ID']]
X = df[feature_cols]
y = df['Class']
# Split data first (avoid data leakage)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
# Method 1: Random Forest Feature Importance
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
feature_importance = pd.DataFrame({
'feature': feature_cols,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 15 Important Features:")
print(feature_importance.head(15))
# Select top 20 features
top_features = feature_importance.head(20)['feature'].tolist()
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]
print(f"\nReduced features: {X_train_selected.shape[1]}")
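The imports above also bring in `SelectKBest` and `f_classif`, which give a second, model-free selection method: a univariate ANOVA F-test scoring each feature against the label. A minimal sketch on stand-in data (fit the selector on training data only, just like the Random Forest above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Small stand-in for X_train / y_train from the steps above.
X_demo, y_demo = make_classification(n_samples=500, n_features=30,
                                     n_informative=10, random_state=42)

# Score each feature with the ANOVA F-test, keep the 20 highest-scoring.
selector = SelectKBest(score_func=f_classif, k=20)
X_demo_sel = selector.fit_transform(X_demo, y_demo)
print(X_demo_sel.shape)  # (500, 20)

kept = selector.get_support(indices=True)  # column indices of kept features
print(kept)
```

Univariate tests are fast but ignore feature interactions, so comparing their picks against the Random Forest importances is a useful sanity check.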
Step 4: Handle Class Imbalance
# Install: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
print("Original class distribution:")
print(y_train.value_counts())
# Strategy 1: SMOTE (Synthetic Minority Over-sampling)
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_selected, y_train)
print("\nAfter SMOTE:")
print(pd.Series(y_train_smote).value_counts())
# Strategy 2: Combined (SMOTE + Undersampling)
sampling_pipeline = ImbPipeline([
('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
('undersample', RandomUnderSampler(sampling_strategy=0.8, random_state=42))
])
X_train_balanced, y_train_balanced = sampling_pipeline.fit_resample(
X_train_selected, y_train
)
print("\nAfter SMOTE + Undersampling:")
print(pd.Series(y_train_balanced).value_counts())
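Resampling is not the only option: most scikit-learn classifiers accept `class_weight`, which reweights the loss instead of generating synthetic rows and avoids touching the data at all. A minimal sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in imbalanced data (~2% positives).
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=42)

# 'balanced' weights each class inversely to its frequency, so the rare
# fraud class contributes as much to the loss as the majority class.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
clf_plain = LogisticRegression(max_iter=1000).fit(X, y)

print("positives flagged (weighted):", clf_weighted.predict(X).sum())
print("positives flagged (plain):   ", clf_plain.predict(X).sum())
```

Weighted models typically flag more positives (higher recall, lower precision) than unweighted ones; whether that trade is worth it depends on the cost analysis in Step 8.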
Step 5: Anomaly Detection Approach
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
# Isolation Forest (unsupervised anomaly detection)
# Train only on legitimate transactions
X_train_legit = X_train_selected[y_train == 0]
# Scale features
scaler = StandardScaler()
X_train_legit_scaled = scaler.fit_transform(X_train_legit)
X_test_scaled = scaler.transform(X_test_selected)
# Train Isolation Forest
iso_forest = IsolationForest(
contamination=0.002, # Expected fraud rate
random_state=42,
n_jobs=-1
)
iso_forest.fit(X_train_legit_scaled)
# Predict anomalies
y_pred_iso = iso_forest.predict(X_test_scaled)
y_pred_iso = (y_pred_iso == -1).astype(int) # -1 = anomaly = fraud
# Evaluate
from sklearn.metrics import classification_report, confusion_matrix
print("\nIsolation Forest Results:")
print(classification_report(y_test, y_pred_iso,
target_names=['Legitimate', 'Fraud']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_iso))
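The `contamination` parameter hard-codes the alert rate at fit time. If you would rather pick the cutoff yourself, `score_samples` exposes the raw anomaly score (higher = more normal), and you can threshold it at whatever quantile the review team can handle. A minimal sketch on stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Stand-in imbalanced data; train on the "legitimate" class only.
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=42)
iso = IsolationForest(random_state=42).fit(X[y == 0])

# score_samples: higher = more normal, lower = more anomalous.
scores = iso.score_samples(X)

# Flag the lowest-scoring 2% instead of relying on `contamination`.
threshold = np.quantile(scores, 0.02)
y_pred = (scores <= threshold).astype(int)
print("flagged:", y_pred.sum())
```

This decouples model fitting from the operating point, so you can sweep the quantile against precision/recall without refitting.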
Step 6: Supervised Classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test_selected)
# Train multiple models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
results = {}
for name, model in models.items():
print(f"\n{'='*50}")
print(f"Training {name}...")
model.fit(X_train_scaled, y_train_balanced)
# Predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
# Metrics
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
roc_auc = roc_auc_score(y_test, y_pred_proba)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
pr_auc = auc(recall, precision)
results[name] = {
'model': model,
'roc_auc': roc_auc,
'pr_auc': pr_auc,
'y_pred_proba': y_pred_proba
}
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"PR-AUC: {pr_auc:.4f}")
# Find best model
best_model = max(results.items(), key=lambda x: x[1]['pr_auc'])
print(f"\nBest Model: {best_model[0]} (PR-AUC: {best_model[1]['pr_auc']:.4f})")
Step 7: Visualize Results
from sklearn.metrics import roc_curve  # precision_recall_curve and plt were already imported above
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# ROC Curve
for name, result in results.items():
fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
axes[0].plot(fpr, tpr, label=f"{name} (AUC={result['roc_auc']:.3f})")
axes[0].plot([0, 1], [0, 1], 'k--', label='Random')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Precision-Recall Curve
for name, result in results.items():
precision, recall, _ = precision_recall_curve(y_test, result['y_pred_proba'])
axes[1].plot(recall, precision, label=f"{name} (AUC={result['pr_auc']:.3f})")
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Step 8: Cost-Sensitive Analysis
# Define business costs
cost_fp = 1 # False Positive: Manual review cost
cost_fn = 100 # False Negative: Fraud not detected (high cost!)
# Find optimal threshold based on cost
best_threshold = 0.5
best_cost = float('inf')
thresholds = np.linspace(0.01, 0.9, 90)  # start low: with cost_fn >> cost_fp the optimum can sit well below 0.1
costs = []
best_model_name, best_model_results = best_model
y_pred_proba = best_model_results['y_pred_proba']
for threshold in thresholds:
y_pred_thresh = (y_pred_proba >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_thresh).ravel()
total_cost = (fp * cost_fp) + (fn * cost_fn)
costs.append(total_cost)
if total_cost < best_cost:
best_cost = total_cost
best_threshold = threshold
print(f"Optimal Threshold: {best_threshold:.3f}")
print(f"Minimum Total Cost: ${best_cost:.2f}")
# Plot cost vs threshold
plt.figure(figsize=(10, 6))
plt.plot(thresholds, costs, linewidth=2)
plt.axvline(best_threshold, color='r', linestyle='--',
label=f'Optimal: {best_threshold:.3f}')
plt.xlabel('Decision Threshold')
plt.ylabel('Total Cost ($)')
plt.title('Cost Analysis: Finding Optimal Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
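For well-calibrated probabilities there is also a closed-form answer that the grid search should roughly agree with: flag a transaction when the expected cost of missing it exceeds the cost of reviewing it, i.e. when p * cost_fn > (1 - p) * cost_fp, which rearranges to p > cost_fp / (cost_fp + cost_fn):

```python
cost_fp = 1    # manual review cost
cost_fn = 100  # missed fraud cost

# Expected-cost break-even point for a calibrated probability p.
t_star = cost_fp / (cost_fp + cost_fn)
print(round(t_star, 4))  # 0.0099
```

With a 100:1 cost ratio the theoretical optimum is below 0.01, which is why the threshold grid above needs to start well under 0.5. Tree ensembles are often poorly calibrated, so the empirical grid search remains the safer choice in practice.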
Challenge Exercises
Exercise 1: Advanced Feature Engineering
Create velocity features: transaction frequency and amount changes in different time windows (1hr, 6hr, 24hr).
Exercise 2: Ensemble Methods
Build a stacking ensemble combining Logistic Regression, Random Forest, and XGBoost with a meta-learner.
Exercise 3: Real-Time Scoring
Implement a function that scores new transactions in real-time, including all preprocessing steps.
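As a starting point for Exercise 3, here is one possible shape for a real-time scorer: a function that takes a raw feature dict, enforces feature order, applies the training-time scaler, and returns a fraud probability. The feature names and the tiny demo fit are stand-ins, not the pipeline built above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names -- stand-ins for the engineered features.
FEATURES = ["Amount_Log", "Amount_Z_Score", "Trans_Count_1day"]

# Tiny stand-in training run so the sketch is self-contained.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

def score_transaction(raw: dict) -> float:
    """Order the features, scale with the TRAINING scaler, return P(fraud)."""
    row = np.array([[raw[f] for f in FEATURES]])
    return float(model.predict_proba(scaler.transform(row))[0, 1])

p = score_transaction({"Amount_Log": 1.2, "Amount_Z_Score": 3.5,
                       "Trans_Count_1day": 7.0})
print(f"fraud probability: {p:.4f}")
```

The full exercise additionally requires computing the rolling per-customer features online, which means maintaining per-customer state (recent amounts and timestamps) between calls.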
Exercise 4: Explainability
Use SHAP values to explain which features contributed most to fraud predictions for specific transactions.
Project Summary
What You've Built
- Handled severely imbalanced data (99.8% vs 0.2%)
- Engineered temporal and statistical transaction features
- Applied feature selection to reduce dimensionality
- Used SMOTE and undersampling for balancing
- Compared supervised and unsupervised approaches
- Optimized the decision threshold based on business costs
- Evaluated with precision, recall, F1, ROC-AUC, and PR-AUC
Key Takeaways
- Accuracy is misleading on imbalanced data; use precision and recall instead
- PR-AUC is more informative than ROC-AUC for rare events
- Business costs should drive threshold selection
- Feature engineering is crucial for fraud detection
- Anomaly detection can work without labeled fraud data
Next Steps
Real Dataset
Download the Kaggle Credit Card Fraud dataset and apply these techniques.
Deep Learning
Try autoencoders for unsupervised anomaly detection on fraud data.
Next Project
Try the Customer Churn Prediction project with automated feature engineering.