Project Overview
Build a fraud detection system using the Credit Card Fraud Detection dataset. You'll tackle highly imbalanced data (99.8% legitimate transactions), engineer temporal and statistical features, and apply advanced feature selection techniques.
Learning Objectives
- Handle severely imbalanced datasets with SMOTE/undersampling
- Create time-based and statistical transaction features
- Apply feature selection to reduce dimensionality
- Use anomaly detection techniques (Isolation Forest)
- Evaluate models with precision, recall, F1-score, and ROC-AUC
Dataset
Download from: Kaggle Credit Card Fraud Detection
Or use the synthetic data generation code below to get started quickly.
Step 1: Load and Explore Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# If you have the Kaggle dataset:
# df = pd.read_csv('creditcard.csv')
# Or generate synthetic data for practice:
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=50000,
n_features=30,
n_informative=20,
n_redundant=5,
n_classes=2,
weights=[0.998, 0.002], # Highly imbalanced
random_state=42
)
df = pd.DataFrame(X, columns=[f'V{i}' for i in range(1, 31)])
df['Time'] = np.arange(len(df))
np.random.seed(42)  # seed so the synthetic amounts are reproducible
df['Amount'] = np.abs(np.random.normal(100, 50, len(df)))
df['Class'] = y
print(f"Dataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['Class'].value_counts())
print(f"\nFraud percentage: {(df['Class'].sum() / len(df)) * 100:.2f}%")
# Visualize class imbalance
plt.figure(figsize=(8, 5))
df['Class'].value_counts().plot(kind='bar')
plt.title('Class Distribution (0: Legitimate, 1: Fraud)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
Step 2: Feature Engineering - Transaction Features
# Create time-based features
df['Hour'] = (df['Time'] // 3600) % 24  # Hour of day (integer, 0-23)
df['Day'] = df['Time'] // (3600 * 24)   # Day number
# Amount-based features
df['Amount_Log'] = np.log1p(df['Amount']) # Log-transform for skewness
df['Amount_Sqrt'] = np.sqrt(df['Amount'])
# Statistical features (rolling windows)
# Group by customer (the Kaggle dataset has no customer IDs, so we assign
# synthetic ones purely to demonstrate per-customer features)
df['Customer_ID'] = np.random.randint(1, 1000, len(df))
# Calculate rolling statistics per customer
df = df.sort_values(['Customer_ID', 'Time'])
# Transaction frequency features
# NOTE: rolling(window=86400) counts the last 86400 ROWS, not seconds --
# a true one-day window requires a datetime index
df['Trans_Count_1day'] = df.groupby('Customer_ID')['Time'].transform(
lambda x: x.rolling(window=86400, min_periods=1).count()
)
# Amount aggregations over each customer's last 10 transactions
# (row-based; the '_1day' suffix is kept for consistency with the prints below)
df['Amount_Mean_1day'] = df.groupby('Customer_ID')['Amount'].transform(
lambda x: x.rolling(window=10, min_periods=1).mean()
)
df['Amount_Std_1day'] = df.groupby('Customer_ID')['Amount'].transform(
lambda x: x.rolling(window=10, min_periods=1).std()
).fillna(0)
# Deviation from customer's typical behavior
df['Amount_Deviation'] = df['Amount'] - df['Amount_Mean_1day']
df['Amount_Z_Score'] = df['Amount_Deviation'] / (df['Amount_Std_1day'] + 1e-5)
print("\nNew features created:")
print(df[['Hour', 'Day', 'Amount_Log', 'Trans_Count_1day',
'Amount_Deviation', 'Amount_Z_Score']].head())
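The rolling windows above are row-based, so "1day" is only an approximation. If you want a genuine 24-hour window, pandas supports time-offset rolling on a datetime index. Here is a minimal, self-contained sketch on a small synthetic frame (the column names mirror the ones above; `Trans_Count_24h` and `Amount_Mean_24h` are illustrative names, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Customer_ID": rng.integers(1, 4, 20),
    "Time": np.arange(20) * 5000,               # seconds; spans ~27.8 hours
    "Amount": rng.normal(100, 50, 20).round(2),
})

# Convert the seconds offset to a datetime so pandas can use a true
# time-based window instead of a row-count window.
demo["ts"] = pd.to_datetime(demo["Time"], unit="s")
demo = demo.sort_values(["Customer_ID", "ts"]).set_index("ts")

# Rolling count/mean over a genuine 24-hour window, per customer.
rolled = demo.groupby("Customer_ID")["Amount"].rolling("24h").agg(["count", "mean"])
r = rolled.reset_index(level=0, drop=True)  # back to the ts index for alignment
demo["Trans_Count_24h"] = r["count"]
demo["Amount_Mean_24h"] = r["mean"]
print(demo.head())
```

The key difference: `rolling("24h")` interprets the window as a time span over the index, so a quiet customer and a bursty customer get correctly sized windows.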
Step 3: Feature Selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, RFE
# Prepare features
feature_cols = [col for col in df.columns if col not in ['Class', 'Time', 'Customer_ID']]
X = df[feature_cols]
y = df['Class']
# Split data first (avoid data leakage)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
# Method 1: Random Forest Feature Importance
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
feature_importance = pd.DataFrame({
'feature': feature_cols,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop 15 Important Features:")
print(feature_importance.head(15))
# Select top 20 features
top_features = feature_importance.head(20)['feature'].tolist()
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]
print(f"\nReduced features: {X_train_selected.shape[1]}")
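The imports above also bring in `SelectKBest` and `f_classif`, which give a second, model-free selection method: a univariate ANOVA F-test scoring each feature against the label. A minimal sketch on stand-in data (fit the selector on training data only, just like the Random Forest above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Small stand-in for X_train / y_train from the steps above.
X_demo, y_demo = make_classification(n_samples=500, n_features=30,
                                     n_informative=10, random_state=42)

# Score each feature with the ANOVA F-test, keep the 20 highest-scoring.
selector = SelectKBest(score_func=f_classif, k=20)
X_demo_sel = selector.fit_transform(X_demo, y_demo)
print(X_demo_sel.shape)  # (500, 20)

kept = selector.get_support(indices=True)  # column indices of kept features
print(kept)
```

Univariate tests are fast but ignore feature interactions, so comparing their picks against the Random Forest importances is a useful sanity check.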
Step 4: Handle Class Imbalance
# Install: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
print("Original class distribution:")
print(y_train.value_counts())
# Strategy 1: SMOTE (Synthetic Minority Over-sampling)
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_selected, y_train)
print("\nAfter SMOTE:")
print(pd.Series(y_train_smote).value_counts())
# Strategy 2: Combined (SMOTE + Undersampling)
sampling_pipeline = ImbPipeline([
('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
('undersample', RandomUnderSampler(sampling_strategy=0.8, random_state=42))
])
X_train_balanced, y_train_balanced = sampling_pipeline.fit_resample(
X_train_selected, y_train
)
print("\nAfter SMOTE + Undersampling:")
print(pd.Series(y_train_balanced).value_counts())
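Resampling is not the only option: most scikit-learn classifiers accept `class_weight`, which reweights the loss instead of generating synthetic rows and avoids touching the data at all. A minimal sketch on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in imbalanced data (~2% positives).
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=42)

# 'balanced' weights each class inversely to its frequency, so the rare
# fraud class contributes as much to the loss as the majority class.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
clf_plain = LogisticRegression(max_iter=1000).fit(X, y)

print("positives flagged (weighted):", clf_weighted.predict(X).sum())
print("positives flagged (plain):   ", clf_plain.predict(X).sum())
```

Weighted models typically flag more positives (higher recall, lower precision) than unweighted ones; whether that trade is worth it depends on the cost analysis in Step 8.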
Step 5: Anomaly Detection Approach
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
# Isolation Forest (unsupervised anomaly detection)
# Train only on legitimate transactions
X_train_legit = X_train_selected[y_train == 0]
# Scale features
scaler = StandardScaler()
X_train_legit_scaled = scaler.fit_transform(X_train_legit)
X_test_scaled = scaler.transform(X_test_selected)
# Train Isolation Forest
iso_forest = IsolationForest(
contamination=0.002, # Expected fraud rate
random_state=42,
n_jobs=-1
)
iso_forest.fit(X_train_legit_scaled)
# Predict anomalies
y_pred_iso = iso_forest.predict(X_test_scaled)
y_pred_iso = (y_pred_iso == -1).astype(int) # -1 = anomaly = fraud
# Evaluate
from sklearn.metrics import classification_report, confusion_matrix
print("\nIsolation Forest Results:")
print(classification_report(y_test, y_pred_iso,
target_names=['Legitimate', 'Fraud']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_iso))
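The `contamination` parameter hard-codes the alert rate at fit time. If you would rather pick the cutoff yourself, `score_samples` exposes the raw anomaly score (higher = more normal), and you can threshold it at whatever quantile the review team can handle. A minimal sketch on stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Stand-in imbalanced data; train on the "legitimate" class only.
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=42)
iso = IsolationForest(random_state=42).fit(X[y == 0])

# score_samples: higher = more normal, lower = more anomalous.
scores = iso.score_samples(X)

# Flag the lowest-scoring 2% instead of relying on `contamination`.
threshold = np.quantile(scores, 0.02)
y_pred = (scores <= threshold).astype(int)
print("flagged:", y_pred.sum())
```

This decouples model fitting from the operating point, so you can sweep the quantile against precision/recall without refitting.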
Step 6: Supervised Classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test_selected)
# Train multiple models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
results = {}
for name, model in models.items():
print(f"\n{'='*50}")
print(f"Training {name}...")
model.fit(X_train_scaled, y_train_balanced)
# Predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
# Metrics
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
roc_auc = roc_auc_score(y_test, y_pred_proba)
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
pr_auc = auc(recall, precision)
results[name] = {
'model': model,
'roc_auc': roc_auc,
'pr_auc': pr_auc,
'y_pred_proba': y_pred_proba
}
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"PR-AUC: {pr_auc:.4f}")
# Find best model
best_model = max(results.items(), key=lambda x: x[1]['pr_auc'])
print(f"\nBest Model: {best_model[0]} (PR-AUC: {best_model[1]['pr_auc']:.4f})")
Step 7: Visualize Results
from sklearn.metrics import roc_curve  # precision_recall_curve and plt were already imported above
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# ROC Curve
for name, result in results.items():
fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
axes[0].plot(fpr, tpr, label=f"{name} (AUC={result['roc_auc']:.3f})")
axes[0].plot([0, 1], [0, 1], 'k--', label='Random')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curve Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Precision-Recall Curve
for name, result in results.items():
precision, recall, _ = precision_recall_curve(y_test, result['y_pred_proba'])
axes[1].plot(recall, precision, label=f"{name} (AUC={result['pr_auc']:.3f})")
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Step 8: Cost-Sensitive Analysis
# Define business costs
cost_fp = 1 # False Positive: Manual review cost
cost_fn = 100 # False Negative: Fraud not detected (high cost!)
# Find optimal threshold based on cost
best_threshold = 0.5
best_cost = float('inf')
thresholds = np.linspace(0.01, 0.9, 90)  # start low: with cost_fn >> cost_fp the optimum can sit well below 0.1
costs = []
best_model_name, best_model_results = best_model
y_pred_proba = best_model_results['y_pred_proba']
for threshold in thresholds:
y_pred_thresh = (y_pred_proba >= threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_thresh).ravel()
total_cost = (fp * cost_fp) + (fn * cost_fn)
costs.append(total_cost)
if total_cost < best_cost:
best_cost = total_cost
best_threshold = threshold
print(f"Optimal Threshold: {best_threshold:.3f}")
print(f"Minimum Total Cost: ${best_cost:.2f}")
# Plot cost vs threshold
plt.figure(figsize=(10, 6))
plt.plot(thresholds, costs, linewidth=2)
plt.axvline(best_threshold, color='r', linestyle='--',
label=f'Optimal: {best_threshold:.3f}')
plt.xlabel('Decision Threshold')
plt.ylabel('Total Cost ($)')
plt.title('Cost Analysis: Finding Optimal Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
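For well-calibrated probabilities there is also a closed-form answer that the grid search should roughly agree with: flag a transaction when the expected cost of missing it exceeds the cost of reviewing it, i.e. when p * cost_fn > (1 - p) * cost_fp, which rearranges to p > cost_fp / (cost_fp + cost_fn):

```python
cost_fp = 1    # manual review cost
cost_fn = 100  # missed fraud cost

# Expected-cost break-even point for a calibrated probability p.
t_star = cost_fp / (cost_fp + cost_fn)
print(round(t_star, 4))  # 0.0099
```

With a 100:1 cost ratio the theoretical optimum is below 0.01, which is why the threshold grid above needs to start well under 0.5. Tree ensembles are often poorly calibrated, so the empirical grid search remains the safer choice in practice.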
Challenge Exercises
Exercise 1: Advanced Feature Engineering
Create velocity features: transaction frequency and amount changes in different time windows (1hr, 6hr, 24hr).
Exercise 2: Ensemble Methods
Build a stacking ensemble combining Logistic Regression, Random Forest, and XGBoost with a meta-learner.
Exercise 3: Real-Time Scoring
Implement a function that scores new transactions in real-time, including all preprocessing steps.
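As a starting point for Exercise 3, here is one possible shape for a real-time scorer: a function that takes a raw feature dict, enforces feature order, applies the training-time scaler, and returns a fraud probability. The feature names and the tiny demo fit are stand-ins, not the pipeline built above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names -- stand-ins for the engineered features.
FEATURES = ["Amount_Log", "Amount_Z_Score", "Trans_Count_1day"]

# Tiny stand-in training run so the sketch is self-contained.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

def score_transaction(raw: dict) -> float:
    """Order the features, scale with the TRAINING scaler, return P(fraud)."""
    row = np.array([[raw[f] for f in FEATURES]])
    return float(model.predict_proba(scaler.transform(row))[0, 1])

p = score_transaction({"Amount_Log": 1.2, "Amount_Z_Score": 3.5,
                       "Trans_Count_1day": 7.0})
print(f"fraud probability: {p:.4f}")
```

The full exercise additionally requires computing the rolling per-customer features online, which means maintaining per-customer state (recent amounts and timestamps) between calls.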
Exercise 4: Explainability
Use SHAP values to explain which features contributed most to fraud predictions for specific transactions.
Project Summary
What You've Built
- Handled severely imbalanced data (99.8% vs 0.2%)
- Engineered temporal and statistical transaction features
- Applied feature selection to reduce dimensionality
- Used SMOTE and undersampling for balancing
- Compared supervised and unsupervised approaches
- Optimized the decision threshold based on business costs
- Evaluated with precision, recall, F1, ROC-AUC, and PR-AUC
Key Takeaways
- Accuracy is misleading on imbalanced data; use precision and recall instead
- PR-AUC is more informative than ROC-AUC for rare events
- Business costs should drive threshold selection
- Feature engineering is crucial for fraud detection
- Anomaly detection can work without labeled fraud data
Next Steps
Real Dataset
Download the Kaggle Credit Card Fraud dataset and apply these techniques.
Deep Learning
Try autoencoders for unsupervised anomaly detection on fraud data.
Next Project
Try the Customer Churn Prediction project with automated feature engineering.