Project Overview
Predict house prices end to end. The walkthrough below uses scikit-learn's California Housing dataset (plus two synthetic categorical columns) so it runs out of the box, but the same workflow applies to the Ames Housing dataset from Kaggle. You'll apply data preprocessing, categorical encoding, feature engineering, feature scaling, and build a complete ML pipeline.
Learning Objectives
- Handle missing values using multiple imputation strategies
- Encode categorical variables (neighborhoods, quality ratings)
- Create new features (house age, total square footage)
- Apply feature scaling for regression models
- Build and evaluate a complete pipeline
Step 1: Load and Explore Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Load dataset (download from Kaggle: House Prices - Advanced Regression Techniques)
# Or use sklearn's California housing for quick start
from sklearn.datasets import fetch_california_housing
# Using California housing as example
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# Add some categorical features for practice
np.random.seed(42)
df['Neighborhood'] = np.random.choice(['Downtown', 'Suburbs', 'Rural'], len(df))
df['HouseStyle'] = np.random.choice(['1Story', '2Story', 'SplitLevel'], len(df))
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())
print("\nBasic statistics:")
print(df.describe())
print("\nMissing values:")
print(df.isnull().sum())
# Target variable
target = 'MedHouseVal'
Step 2: Handle Missing Values
from sklearn.impute import SimpleImputer
# Introduce some missing values for demonstration
df_missing = df.copy()
for col in ['AveRooms', 'AveBedrms', 'Population']:
    mask = np.random.rand(len(df_missing)) < 0.05  # 5% missing
    df_missing.loc[mask, col] = np.nan
print("Missing values:")
print(df_missing.isnull().sum())
# Imputation strategy
numeric_cols = df_missing.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove(target)
# Median imputation for numeric features
median_imputer = SimpleImputer(strategy='median')
df_missing[numeric_cols] = median_imputer.fit_transform(df_missing[numeric_cols])
print("\nAfter imputation:")
print(df_missing.isnull().sum())
Step 3: Feature Engineering
# Create new features
df_engineered = df_missing.copy()
# 1. Rooms per person (living space per occupant)
df_engineered['RoomsPerPerson'] = df_engineered['AveRooms'] / df_engineered['AveOccup']
# 2. Bedroom ratio
df_engineered['BedroomRatio'] = df_engineered['AveBedrms'] / df_engineered['AveRooms']
# 3. Population density
df_engineered['PopulationDensity'] = df_engineered['Population'] / df_engineered['AveRooms']
# 4. Income per room (affordability indicator)
df_engineered['IncomePerRoom'] = df_engineered['MedInc'] / df_engineered['AveRooms']
# 5. Age categories (binning)
df_engineered['AgeCategory'] = pd.cut(
df_engineered['HouseAge'],
bins=[0, 10, 25, 50, 100],
labels=['New', 'Modern', 'Old', 'VeryOld']
)
print("New features created:")
print(df_engineered[['RoomsPerPerson', 'BedroomRatio', 'PopulationDensity',
'IncomePerRoom', 'AgeCategory']].head())
Step 4: Encode Categorical Variables
from sklearn.preprocessing import OneHotEncoder
# One-Hot Encoding for nominal variables
ohe = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
cat_encoded = ohe.fit_transform(df_engineered[['Neighborhood', 'HouseStyle']])
# Create DataFrame with encoded columns
encoded_cols = ohe.get_feature_names_out(['Neighborhood', 'HouseStyle'])
df_encoded = pd.DataFrame(cat_encoded, columns=encoded_cols, index=df_engineered.index)
# Ordinal encoding for AgeCategory: pd.cut returns ordered categories,
# so the category codes preserve New < Modern < Old < VeryOld
# (LabelEncoder would sort the labels alphabetically and break that order)
df_engineered['AgeCategory_Encoded'] = df_engineered['AgeCategory'].cat.codes
# Combine all features
df_final = pd.concat([
df_engineered.drop(['Neighborhood', 'HouseStyle', 'AgeCategory'], axis=1),
df_encoded
], axis=1)
print(f"Final dataset shape: {df_final.shape}")
print("\nFeature columns:")
print(df_final.columns.tolist())
Step 5: Feature Scaling & Split Data
from sklearn.preprocessing import StandardScaler
# Separate features and target
X = df_final.drop(target, axis=1)
y = df_final[target]
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)
print(f"Training set: {X_train_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")
Step 6: Build Complete Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
# Start fresh with original data
X = df.drop(columns=[target])
y = df[target]
# Identify column types
numeric_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
'Population', 'AveOccup', 'Latitude', 'Longitude']
categorical_features = ['Neighborhood', 'HouseStyle']
# Preprocessing pipelines
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
])
# Combine preprocessing
preprocessor = ColumnTransformer([
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
# Complete pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('regressor', Ridge(alpha=1.0))
])
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print("Pipeline trained successfully!")
Step 7: Evaluate Model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Make predictions
y_pred = pipeline.predict(X_test)
# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("Model Performance:")
print(f"MAE: ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"RΒ² Score: {r2:.4f}")
# Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual vs Predicted House Prices')
plt.tight_layout()
plt.show()
Step 8: Feature Importance Analysis
# Train Random Forest to get feature importances
rf_pipeline = Pipeline([
('preprocessor', preprocessor),
('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
rf_pipeline.fit(X_train, y_train)
# Get feature names after transformation
feature_names = (numeric_features +
list(rf_pipeline.named_steps['preprocessor']
.named_transformers_['cat']
.named_steps['onehot']
.get_feature_names_out(categorical_features)))
# Feature importances
importances = rf_pipeline.named_steps['regressor'].feature_importances_
feature_importance_df = pd.DataFrame({
'feature': feature_names,
'importance': importances
}).sort_values('importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance_df.head(10))
# Visualize
ax = feature_importance_df.head(10).plot(x='feature', y='importance',
                                         kind='barh', figsize=(10, 6), legend=False)
ax.invert_yaxis()  # largest importance at the top
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')
plt.tight_layout()
plt.show()
Challenge Exercises
Exercise 1: Advanced Feature Engineering
Create polynomial features (e.g., MedInc², MedInc × AveRooms) and test if they improve model performance.
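One possible starting point: a sketch that bolts PolynomialFeatures onto the Step 6 pipeline (note it expands every preprocessed column, one-hot columns included).
from sklearn.preprocessing import PolynomialFeatures
# Squares and pairwise interactions of all preprocessed features
poly_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('regressor', Ridge(alpha=1.0))
])
poly_pipeline.fit(X_train, y_train)
print(f"R² with polynomial features: {poly_pipeline.score(X_test, y_test):.4f}")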
Exercise 2: Feature Selection
Apply RFE (Recursive Feature Elimination) to select the top 10 features. Compare performance with all features.
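A sketch of one way to wire RFE into the same pipeline; RFE needs an estimator exposing coef_ or feature_importances_, so Ridge serves as the selector here.
from sklearn.feature_selection import RFE
rfe_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('rfe', RFE(estimator=Ridge(alpha=1.0), n_features_to_select=10)),
    ('regressor', Ridge(alpha=1.0))
])
rfe_pipeline.fit(X_train, y_train)
print(f"R² with top-10 features: {rfe_pipeline.score(X_test, y_test):.4f}")
# Which transformed columns survived selection?
kept = rfe_pipeline.named_steps['rfe'].support_
print(list(rfe_pipeline.named_steps['preprocessor'].get_feature_names_out()[kept]))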
Exercise 3: Handle Outliers
Detect outliers in house prices using IQR method and decide whether to remove or cap them.
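One possible detection step, applying the 1.5 × IQR rule to the target from Step 6; whether to drop or cap the flagged rows is the part to experiment with.
# Flag price outliers with the 1.5 * IQR rule
q1, q3 = y.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = (y < lower) | (y > upper)
print(f"Outliers: {outlier_mask.sum()} of {len(y)} ({outlier_mask.mean():.1%})")
# Option A: drop the flagged rows from X and y before splitting
# Option B: cap (winsorize) the target instead
y_capped = y.clip(lower=lower, upper=upper)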
Exercise 4: Try Different Models
Compare Ridge, Lasso, Random Forest, and Gradient Boosting. Which performs best?
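A compact way to run the comparison with the Step 6 preprocessor; the hyperparameters below are illustrative defaults, not tuned values.
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor
models = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.001),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    candidate = Pipeline([('preprocessor', preprocessor), ('regressor', model)])
    candidate.fit(X_train, y_train)
    print(f"{name}: R² = {candidate.score(X_test, y_test):.4f}")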
Project Summary
What You've Built
- Complete data preprocessing pipeline with imputation
- Feature engineering: created 5+ new features
- Categorical encoding for neighborhoods and styles
- Feature scaling with StandardScaler
- End-to-end ML pipeline with sklearn
- Model evaluation with multiple metrics
- Feature importance analysis
Key Takeaways
- Always explore data before feature engineering
- Handle missing values before creating new features
- Use pipelines to prevent data leakage
- Feature engineering can significantly improve model performance
- Document your feature creation process for reproducibility
Next Steps
More Practice
Try the Ames Housing dataset from Kaggle for more challenging feature engineering
Advanced Topics
Learn about automated feature engineering with Featuretools
Next Project
Try the Credit Card Fraud Detection project for imbalanced data