🏠 House Price Prediction Project

Build a complete regression model with advanced feature engineering

🎯 Beginner Project ⏱️ 2-3 hours 📊 Regression

Project Overview

Predict house prices using the Ames Housing dataset, or follow the walkthrough below with scikit-learn's built-in California housing data for a quicker start. You'll apply data preprocessing, handle categorical encoding and feature scaling, and build a complete ML pipeline.

🎯 Learning Objectives

  • Handle missing values using multiple imputation strategies
  • Encode categorical variables (neighborhoods, quality ratings)
  • Create new features (house age, total square footage)
  • Apply feature scaling for regression models
  • Build and evaluate a complete pipeline

Step 1: Load and Explore Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load dataset (download from Kaggle: House Prices - Advanced Regression Techniques)
# Or use sklearn's California housing for quick start
from sklearn.datasets import fetch_california_housing

# Using California housing as example
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Add some categorical features for practice
np.random.seed(42)
df['Neighborhood'] = np.random.choice(['Downtown', 'Suburbs', 'Rural'], len(df))
df['HouseStyle'] = np.random.choice(['1Story', '2Story', 'SplitLevel'], len(df))

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df.head())

print("\nBasic statistics:")
print(df.describe())

print("\nMissing values:")
print(df.isnull().sum())

# Target variable
target = 'MedHouseVal'

Step 2: Handle Missing Values

from sklearn.impute import SimpleImputer

# Introduce some missing values for demonstration
df_missing = df.copy()
for col in ['AveRooms', 'AveBedrms', 'Population']:
    mask = np.random.rand(len(df_missing)) < 0.05  # 5% missing
    df_missing.loc[mask, col] = np.nan

print("Missing values:")
print(df_missing.isnull().sum())

# Imputation strategy
numeric_cols = df_missing.select_dtypes(include=[np.number]).columns.tolist()
numeric_cols.remove(target)

# Median imputation for numeric features
median_imputer = SimpleImputer(strategy='median')
df_missing[numeric_cols] = median_imputer.fit_transform(df_missing[numeric_cols])

print("\nAfter imputation:")
print(df_missing.isnull().sum())

Step 3: Feature Engineering

# Create new features
df_engineered = df_missing.copy()

# 1. Rooms per occupant (living space per person)
df_engineered['RoomsPerPerson'] = df_engineered['AveRooms'] / df_engineered['AveOccup']

# 2. Bedroom ratio
df_engineered['BedroomRatio'] = df_engineered['AveBedrms'] / df_engineered['AveRooms']

# 3. Approximate number of households in the block group
df_engineered['Households'] = df_engineered['Population'] / df_engineered['AveOccup']

# 4. Income per room (affordability indicator)
df_engineered['IncomePerRoom'] = df_engineered['MedInc'] / df_engineered['AveRooms']

# 5. Age categories (binning)
df_engineered['AgeCategory'] = pd.cut(
    df_engineered['HouseAge'], 
    bins=[0, 10, 25, 50, 100],
    labels=['New', 'Modern', 'Old', 'VeryOld']
)

print("New features created:")
print(df_engineered[['RoomsPerPerson', 'BedroomRatio', 'Households',
                      'IncomePerRoom', 'AgeCategory']].head())

Step 4: Encode Categorical Variables

from sklearn.preprocessing import OneHotEncoder

# One-Hot Encoding for nominal variables
ohe = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')
neighborhood_encoded = ohe.fit_transform(df_engineered[['Neighborhood', 'HouseStyle']])

# Create DataFrame with encoded columns
neighborhood_cols = ohe.get_feature_names_out(['Neighborhood', 'HouseStyle'])
df_encoded = pd.DataFrame(neighborhood_encoded, columns=neighborhood_cols, index=df_engineered.index)

# Ordinal encoding for AgeCategory: pd.cut returns an ordered categorical,
# so .cat.codes keeps the New < Modern < Old < VeryOld order
# (LabelEncoder would sort the labels alphabetically and scramble it)
df_engineered['AgeCategory_Encoded'] = df_engineered['AgeCategory'].cat.codes

# Combine all features
df_final = pd.concat([
    df_engineered.drop(['Neighborhood', 'HouseStyle', 'AgeCategory'], axis=1),
    df_encoded
], axis=1)

print(f"Final dataset shape: {df_final.shape}")
print("\nFeature columns:")
print(df_final.columns.tolist())

Step 5: Feature Scaling & Split Data

from sklearn.preprocessing import StandardScaler

# Separate features and target
X = df_final.drop(target, axis=1)
y = df_final[target]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

print(f"Training set: {X_train_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")

Step 6: Build Complete Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# Start fresh with original data
X = df[df.columns.difference([target])]
y = df[target]

# Identify column types
numeric_features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 
                    'Population', 'AveOccup', 'Latitude', 'Longitude']
categorical_features = ['Neighborhood', 'HouseStyle']

# Preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# Combine preprocessing
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Complete pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', Ridge(alpha=1.0))
])

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)

print("Pipeline trained successfully!")

Step 7: Evaluate Model

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Make predictions
y_pred = pipeline.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Model Performance:")
print(f"MAE: ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"RΒ² Score: {r2:.4f}")

# Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Median House Value ($100k)')
plt.ylabel('Predicted Median House Value ($100k)')
plt.title('Actual vs Predicted House Prices')
plt.tight_layout()
plt.show()

Step 8: Feature Importance Analysis

# Train Random Forest to get feature importances
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

rf_pipeline.fit(X_train, y_train)

# Get feature names after transformation
feature_names = (numeric_features + 
                list(rf_pipeline.named_steps['preprocessor']
                     .named_transformers_['cat']
                     .named_steps['onehot']
                     .get_feature_names_out(categorical_features)))

# Feature importances
importances = rf_pipeline.named_steps['regressor'].feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance_df.head(10))

# Visualize (reverse the order so the most important feature sits at the top)
top_features = feature_importance_df.head(10).iloc[::-1]
plt.figure(figsize=(10, 6))
plt.barh(top_features['feature'], top_features['importance'])
plt.xlabel('Importance')
plt.title('Top 10 Feature Importances')
plt.tight_layout()
plt.show()

🎯 Challenge Exercises

Exercise 1: Advanced Feature Engineering

Create polynomial features (e.g., MedInc², MedInc × AveRooms) and test whether they improve model performance.
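
A minimal sketch of one way to start, assuming the Step 6 objects (numeric_features, categorical_features, categorical_transformer, X_train, X_test, y_train, y_test) are still in memory; only PolynomialFeatures is new, and degree=2 is just a starting choice:

from sklearn.preprocessing import PolynomialFeatures

# Degree-2 terms (squares and pairwise interactions) for the numeric columns only
poly_numeric = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler())
])

poly_preprocessor = ColumnTransformer([
    ('num', poly_numeric, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

poly_pipeline = Pipeline([
    ('preprocessor', poly_preprocessor),
    ('regressor', Ridge(alpha=1.0))
])

poly_pipeline.fit(X_train, y_train)
print(f"Ridge + polynomial features R^2: {poly_pipeline.score(X_test, y_test):.4f}")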

Exercise 2: Feature Selection

Apply RFE (Recursive Feature Elimination) to select the top 10 features. Compare performance with all features.
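A sketch using the scaled DataFrames from Step 5 (X_train_scaled, X_test_scaled) and the matching y_train / y_test, assuming they are still in memory; Ridge is used as the estimator here because RFE needs a model that exposes coefficients:

from sklearn.feature_selection import RFE

# Rank features by recursively dropping the weakest Ridge coefficient
rfe = RFE(estimator=Ridge(alpha=1.0), n_features_to_select=10)
rfe.fit(X_train_scaled, y_train)

selected = X_train_scaled.columns[rfe.support_]
print("Selected features:", list(selected))

# Compare the reduced feature set against the full one
ridge_full = Ridge(alpha=1.0).fit(X_train_scaled, y_train)
ridge_rfe = Ridge(alpha=1.0).fit(X_train_scaled[selected], y_train)
print(f"R^2 with all features: {ridge_full.score(X_test_scaled, y_test):.4f}")
print(f"R^2 with top 10:       {ridge_rfe.score(X_test_scaled[selected], y_test):.4f}")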

Exercise 3: Handle Outliers

Detect outliers in house prices using IQR method and decide whether to remove or cap them.
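A sketch of the 1.5 × IQR rule applied to the target column from Step 1; whether to drop or cap the flagged rows is the judgement call the exercise asks you to make:

# IQR fences on the target
q1, q3 = df[target].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = (df[target] < lower) | (df[target] > upper)
print(f"Outliers flagged: {outliers.sum()} of {len(df)} rows")

# Option A: drop them
df_trimmed = df[~outliers]

# Option B: cap (winsorize) them at the fences
df_capped = df.copy()
df_capped[target] = df_capped[target].clip(lower, upper)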

Exercise 4: Try Different Models

Compare Ridge, Lasso, Random Forest, and Gradient Boosting. Which performs best?
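One way to line the models up, assuming the Step 6 preprocessor and training split are still in memory; the hyperparameters below are placeholders, not tuned values:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor

candidates = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
}

# Same preprocessing for every model, 5-fold cross-validated R^2 on the training data
for name, model in candidates.items():
    candidate = Pipeline([('preprocessor', preprocessor), ('regressor', model)])
    scores = cross_val_score(candidate, X_train, y_train, cv=5, scoring='r2')
    print(f"{name:>16}: mean R^2 = {scores.mean():.4f} (+/- {scores.std():.4f})")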

πŸ“ Project Summary

What You've Built

  • ✅ Complete data preprocessing pipeline with imputation
  • ✅ Feature engineering: created 5+ new features
  • ✅ Categorical encoding for neighborhoods and styles
  • ✅ Feature scaling with StandardScaler
  • ✅ End-to-end ML pipeline with sklearn
  • ✅ Model evaluation with multiple metrics
  • ✅ Feature importance analysis

💡 Key Takeaways

  • Always explore data before feature engineering
  • Handle missing values before creating new features
  • Use pipelines to prevent data leakage
  • Feature engineering can significantly improve model performance
  • Document your feature creation process for reproducibility

🚀 Next Steps

📊 More Practice

Try the Ames Housing dataset from Kaggle for more challenging feature engineering
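
A minimal starting point, assuming you have downloaded train.csv from the "House Prices - Advanced Regression Techniques" competition into your working directory (its target column is SalePrice):

import pandas as pd

ames = pd.read_csv('train.csv')  # path is an assumption; adjust to wherever you saved the file
print(ames.shape)
print(ames['SalePrice'].describe())
print(ames.isnull().sum().sort_values(ascending=False).head(15))  # real missing values this time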

🔬 Advanced Topics

Learn about automated feature engineering with Featuretools

💳 Next Project

Try the Credit Card Fraud Detection project for imbalanced data

🎉 Project Complete!
