
🏠 House Price Prediction

Build an end-to-end machine learning system to predict house prices using real data

📊 Intermediate ⏱️ 5 hours 💻 Python + Scikit-learn 🎯 Regression Project

🎯 Project Overview

In this comprehensive project, you'll build a complete machine learning system to predict house prices based on features like location, size, number of rooms, and more. This is one of the most popular beginner-to-intermediate ML projects and closely mirrors real-world data science work.

What You'll Build

  • Data Pipeline: Load, clean, and explore real housing data
  • Feature Engineering: Create new features and handle missing values
  • Multiple Models: Train Linear Regression, Random Forest, and Gradient Boosting
  • Model Evaluation: Compare models using RMSE, MAE, and RΒ² metrics
  • Visualization Dashboard: Create plots showing predictions vs actual prices
  • Deployment-Ready Code: Export model for production use

Skills You'll Practice

📊 Data Science
  • Exploratory Data Analysis (EDA)
  • Correlation analysis
  • Outlier detection
🔧 Feature Engineering
  • Handling missing data
  • Creating interaction features
  • Feature scaling
🤖 Machine Learning
  • Training multiple models
  • Hyperparameter tuning
  • Cross-validation
📈 Evaluation
  • Model comparison
  • Residual analysis
  • Performance visualization

🎓 Portfolio Worthy: This project demonstrates end-to-end ML skills that employers look for. Add it to your GitHub and resume!

📋 Prerequisites

Before starting this project, make sure you're comfortable with:

  • βœ… Python basics (variables, functions, loops)
  • βœ… Pandas for data manipulation
  • βœ… Linear Regression concepts
  • βœ… Basic statistics (mean, median, correlation)


πŸ› οΈ Setup & Installation

1. Install Required Libraries

# Create virtual environment (recommended)
python -m venv house_price_env
source house_price_env/bin/activate  # On Windows: house_price_env\Scripts\activate

# Install required packages
pip install numpy pandas scikit-learn matplotlib seaborn jupyter

2. Download the Dataset

We'll use the California Housing dataset, which scikit-learn downloads and caches automatically, or you can use Kaggle's House Prices dataset for a more realistic challenge.

# Option 1: Built-in California Housing (easier)
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Option 2: Kaggle House Prices (more realistic)
# Download from: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
# import pandas as pd
# df = pd.read_csv('train.csv')

3. Create Project Structure

house-price-prediction/
├── data/
│   └── housing_data.csv
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_modeling.ipynb
├── models/
│   └── saved_model.pkl
├── src/
│   ├── data_preprocessing.py
│   ├── model_training.py
│   └── predict.py
└── README.md

📊 Part 1: Exploratory Data Analysis

Load and Inspect Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# First look at the data
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset Info:")
print(df.info())

print("\nBasic Statistics:")
print(df.describe())

Expected Output

You should see 20,640 rows and 9 columns: the 8 features MedInc (median income), HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude, plus the target MedHouseVal.

Check for Missing Values

# Missing values analysis
missing = df.isnull().sum()
print("Missing Values:")
print(missing[missing > 0])

# Visualize missing data
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()

# Good news: California Housing has NO missing values!
# For Kaggle dataset, you'd need imputation strategies
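
If you go with the Kaggle dataset instead, you'll need an imputation step at this point. Below is a minimal sketch of one common strategy (median for numeric columns, most frequent value for categorical ones); it assumes you've downloaded the Kaggle train.csv into your working directory and is purely illustrative:

# Hypothetical imputation sketch for the Kaggle dataset (not needed for California Housing)
import pandas as pd
from sklearn.impute import SimpleImputer

kaggle_df = pd.read_csv('train.csv')  # assumes the Kaggle file is present

numeric_cols = kaggle_df.select_dtypes(include='number').columns
categorical_cols = kaggle_df.select_dtypes(exclude='number').columns

# Fill numeric gaps with the median, categorical gaps with the most frequent value
kaggle_df[numeric_cols] = SimpleImputer(strategy='median').fit_transform(kaggle_df[numeric_cols])
kaggle_df[categorical_cols] = SimpleImputer(strategy='most_frequent').fit_transform(kaggle_df[categorical_cols])

print("Remaining missing values:", kaggle_df.isnull().sum().sum())  # should be 0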

Understand Target Variable Distribution

# Target variable: MedHouseVal (median house value in $100,000s)
target = df['MedHouseVal']

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
axes[0].hist(target, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median House Value ($100k)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of House Prices')

# Box plot (detect outliers)
axes[1].boxplot(target)
axes[1].set_ylabel('Median House Value ($100k)')
axes[1].set_title('Box Plot - Outlier Detection')

# Q-Q plot (check normality)
from scipy import stats
stats.probplot(target, dist="norm", plot=axes[2])
axes[2].set_title('Q-Q Plot')

plt.tight_layout()
plt.show()

# Key statistics
print(f"Mean House Price: ${target.mean() * 100000:,.0f}")
print(f"Median House Price: ${target.median() * 100000:,.0f}")
print(f"Price Range: ${target.min() * 100000:,.0f} - ${target.max() * 100000:,.0f}")

Feature Correlations

# Correlation matrix
correlation_matrix = df.corr()

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

# Top correlations with target
target_corr = correlation_matrix['MedHouseVal'].sort_values(ascending=False)
print("\nCorrelations with House Price:")
print(target_corr)

# Key insight: MedInc (median income) has strongest correlation (0.69)

Pairwise Relationships

# Select most important features for pairplot
important_features = ['MedInc', 'HouseAge', 'AveRooms', 'MedHouseVal']
sns.pairplot(df[important_features], diag_kind='kde', plot_kws={'alpha': 0.6})
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()

# Scatter plot: Price vs Income (strongest predictor)
plt.figure(figsize=(10, 6))
plt.scatter(df['MedInc'], df['MedHouseVal'], alpha=0.3)
plt.xlabel('Median Income (tens of thousands)')
plt.ylabel('Median House Value ($100k)')
plt.title('House Price vs Median Income')
plt.grid(alpha=0.3)
plt.show()

✅ Checkpoint 1: EDA Complete

By now you should have:

  • Loaded and inspected the dataset
  • Checked for missing values (none!)
  • Visualized target variable distribution
  • Identified MedInc as the strongest predictor
  • Created correlation heatmap

🔧 Part 2: Feature Engineering

Create New Features

# Create meaningful derived features
df_engineered = df.copy()

# 1. Rooms per household
df_engineered['RoomsPerHousehold'] = df_engineered['AveRooms'] * df_engineered['AveOccup']

# 2. Bedrooms ratio
df_engineered['BedroomsRatio'] = df_engineered['AveBedrms'] / df_engineered['AveRooms']

# 3. People per household
df_engineered['PeoplePerHousehold'] = df_engineered['Population'] / df_engineered['AveOccup']

# 4. Income category (binning)
df_engineered['IncomeCategory'] = pd.cut(df_engineered['MedInc'], 
                                          bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                          labels=['VeryLow', 'Low', 'Medium', 'High', 'VeryHigh'])

# 5. Location-based feature (simple geographical zones)
df_engineered['LocationZone'] = 'Other'
# Coastal (latitude > 37 and longitude < -122)
coastal_mask = (df_engineered['Latitude'] > 37) & (df_engineered['Longitude'] < -122)
df_engineered.loc[coastal_mask, 'LocationZone'] = 'Coastal'

print("New Features Created:")
print(df_engineered[['RoomsPerHousehold', 'BedroomsRatio', 'PeoplePerHousehold']].head())

# Check correlation of new features with target
new_features_corr = df_engineered[['RoomsPerHousehold', 'BedroomsRatio', 
                                     'PeoplePerHousehold', 'MedHouseVal']].corr()['MedHouseVal']
print("\nNew Feature Correlations:")
print(new_features_corr.sort_values(ascending=False))

Handle Outliers

# Detect outliers using IQR method
def remove_outliers_iqr(df, column, multiplier=1.5):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Remove outliers from target variable
print(f"Before outlier removal: {len(df_engineered)} samples")
df_clean = remove_outliers_iqr(df_engineered, 'MedHouseVal', multiplier=1.5)
print(f"After outlier removal: {len(df_clean)} samples")
print(f"Removed: {len(df_engineered) - len(df_clean)} outliers")

# Visualize before/after
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(df_engineered['MedInc'], df_engineered['MedHouseVal'], alpha=0.3)
axes[0].set_title('Before Outlier Removal')
axes[0].set_xlabel('Median Income')
axes[0].set_ylabel('House Price')

axes[1].scatter(df_clean['MedInc'], df_clean['MedHouseVal'], alpha=0.3)
axes[1].set_title('After Outlier Removal')
axes[1].set_xlabel('Median Income')
axes[1].set_ylabel('House Price')

plt.tight_layout()
plt.show()

Feature Scaling

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Prepare features and target
X = df_clean.drop(['MedHouseVal', 'IncomeCategory', 'LocationZone'], axis=1)
y = df_clean['MedHouseVal']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("\nFeature scaling complete!")
print("\nScaled features (first 3 rows):")
print(X_train_scaled.head(3))
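
Note that we dropped the two categorical features we created (IncomeCategory and LocationZone) to keep the feature matrix purely numeric. If you want to keep them, one simple option is one-hot encoding. A minimal sketch, shown for illustration only (the rest of the project sticks with the numeric features):

# Optional: one-hot encode the categorical features instead of dropping them
# (do this before the train/test split so both sets share the same columns)
X_with_cats = pd.get_dummies(
    df_clean.drop('MedHouseVal', axis=1),
    columns=['IncomeCategory', 'LocationZone'],
    drop_first=True
)
print(X_with_cats.columns.tolist())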

✅ Checkpoint 2: Feature Engineering Complete

You've successfully:

  • Created 5 new meaningful features
  • Removed outliers using IQR method
  • Split data into train/test sets (80/20)
  • Scaled features using StandardScaler

🤖 Part 3: Model Training & Evaluation

Baseline Model: Linear Regression

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Train Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# Predictions
y_train_pred_lr = lr_model.predict(X_train_scaled)
y_test_pred_lr = lr_model.predict(X_test_scaled)

# Evaluation metrics
def evaluate_model(y_true, y_pred, model_name):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n{model_name} Performance:")
    print(f"  RMSE: ${rmse * 100000:,.0f}")
    print(f"  MAE:  ${mae * 100000:,.0f}")
    print(f"  RΒ²:   {r2:.4f}")
    return rmse, mae, r2

# Evaluate
lr_train_metrics = evaluate_model(y_train, y_train_pred_lr, "Linear Regression (Train)")
lr_test_metrics = evaluate_model(y_test, y_test_pred_lr, "Linear Regression (Test)")
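
A single train/test split gives only one estimate of performance. Since cross-validation is one of the skills this project practices, here's a quick 5-fold check of the baseline model; this is an optional sketch, and the rest of the tutorial keeps using the simple split:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE for the baseline model
# (sklearn returns negative MSE for this scoring option, hence the minus sign)
cv_scores = cross_val_score(LinearRegression(), X_train_scaled, y_train,
                            cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)
print(f"CV RMSE: ${cv_rmse.mean() * 100000:,.0f} (+/- ${cv_rmse.std() * 100000:,.0f})")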

Advanced Model: Random Forest

from sklearn.ensemble import RandomForestRegressor

# Train Random Forest
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train_scaled, y_train)

# Predictions
y_train_pred_rf = rf_model.predict(X_train_scaled)
y_test_pred_rf = rf_model.predict(X_test_scaled)

# Evaluate
rf_train_metrics = evaluate_model(y_train, y_train_pred_rf, "Random Forest (Train)")
rf_test_metrics = evaluate_model(y_test, y_test_pred_rf, "Random Forest (Test)")

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 5 Most Important Features:")
print(feature_importance.head())

Champion Model: Gradient Boosting

from sklearn.ensemble import GradientBoostingRegressor

# Train Gradient Boosting
gb_model = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=5,
    min_samples_leaf=2,
    subsample=0.8,
    random_state=42
)
gb_model.fit(X_train_scaled, y_train)

# Predictions
y_train_pred_gb = gb_model.predict(X_train_scaled)
y_test_pred_gb = gb_model.predict(X_test_scaled)

# Evaluate
gb_train_metrics = evaluate_model(y_train, y_train_pred_gb, "Gradient Boosting (Train)")
gb_test_metrics = evaluate_model(y_test, y_test_pred_gb, "Gradient Boosting (Test)")

Model Comparison

# Create comparison DataFrame
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'Gradient Boosting'],
    'Train RMSE': [lr_train_metrics[0], rf_train_metrics[0], gb_train_metrics[0]],
    'Test RMSE': [lr_test_metrics[0], rf_test_metrics[0], gb_test_metrics[0]],
    'Train R²': [lr_train_metrics[2], rf_train_metrics[2], gb_train_metrics[2]],
    'Test R²': [lr_test_metrics[2], rf_test_metrics[2], gb_test_metrics[2]]
})

print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(results.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# RMSE comparison
axes[0].bar(results['Model'], results['Test RMSE'], color=['#3b82f6', '#10b981', '#f59e0b'])
axes[0].set_ylabel('RMSE (in $100k)')
axes[0].set_title('Model RMSE Comparison (Lower is Better)')
axes[0].tick_params(axis='x', rotation=45)

# R² comparison
axes[1].bar(results['Model'], results['Test R²'], color=['#3b82f6', '#10b981', '#f59e0b'])
axes[1].set_ylabel('R² Score')
axes[1].set_ylim([0, 1])
axes[1].set_title('Model R² Comparison (Higher is Better)')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Identify best model
best_model_idx = results['Test R²'].idxmax()
best_model_name = results.loc[best_model_idx, 'Model']
print(f"\n🏆 Best Model: {best_model_name}")

✅ Checkpoint 3: Models Trained

You've trained and compared:

  • Linear Regression (baseline)
  • Random Forest (ensemble)
  • Gradient Boosting (champion)
  • Typical result: Gradient Boosting wins with R² ~0.80

📈 Part 4: Model Analysis & Visualization

Prediction vs Actual Plot

# Use best model (Gradient Boosting)
y_pred = y_test_pred_gb

# Predictions vs Actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($100k)')
plt.ylabel('Predicted Price ($100k)')
plt.title('Predicted vs Actual House Prices')
plt.grid(alpha=0.3)
plt.show()

# Perfect predictions would fall on the red diagonal line

Residual Analysis

# Calculate residuals
residuals = y_test - y_pred

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Residual plot
axes[0].scatter(y_pred, residuals, alpha=0.5)
axes[0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0].set_xlabel('Predicted Price ($100k)')
axes[0].set_ylabel('Residuals ($100k)')
axes[0].set_title('Residual Plot')
axes[0].grid(alpha=0.3)

# Residual distribution
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals ($100k)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].axvline(x=0, color='r', linestyle='--', lw=2)

plt.tight_layout()
plt.show()

# Residual statistics
print(f"Mean Residual: ${residuals.mean() * 100000:,.0f} (should be ~$0)")
print(f"Std Dev of Residuals: ${residuals.std() * 100000:,.0f}")

Feature Importance Visualization

# Gradient Boosting feature importance
importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'][:10], importance_df['Importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features (Gradient Boosting)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nFeature Importance Rankings:")
print(importance_df)

Error Analysis: Best and Worst Predictions

# Calculate absolute errors
errors = np.abs(y_test - y_pred)
error_df = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': y_pred,
    'Error': errors.values
}, index=y_test.index)

# Best predictions (lowest error)
print("\n🎯 Best Predictions (Top 5):")
print(error_df.nsmallest(5, 'Error'))

# Worst predictions (highest error)
print("\n❌ Worst Predictions (Top 5):")
print(error_df.nlargest(5, 'Error'))

# Analyze worst predictions
worst_indices = error_df.nlargest(5, 'Error').index
worst_features = X_test.loc[worst_indices]
print("\nFeatures of worst predictions:")
print(worst_features)

✅ Checkpoint 4: Analysis Complete

You've created:

  • Prediction vs Actual scatter plot
  • Residual analysis plots
  • Feature importance visualization
  • Error analysis identifying best/worst predictions

🚀 Part 5: Hyperparameter Tuning (Optional)

Grid Search for Optimal Parameters

from sklearn.model_selection import GridSearchCV

# Define parameter grid for Gradient Boosting
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.1, 0.15],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10]
}

# Grid search with cross-validation
print("Starting Grid Search (this may take a few minutes)...")
grid_search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_scaled, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Score (RMSE): ${np.sqrt(-grid_search.best_score_) * 100000:,.0f}")

# Train final model with best parameters
final_model = grid_search.best_estimator_
y_test_pred_final = final_model.predict(X_test_scaled)

# Final evaluation
final_metrics = evaluate_model(y_test, y_test_pred_final, "Tuned Gradient Boosting")

⚠️ Note: Grid search can take 10-30 minutes depending on your machine. For quick iteration, start with fewer parameter combinations or use RandomizedSearchCV.
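
If the full grid is too slow on your machine, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying them all. A minimal sketch that reuses the same param_grid (the n_iter=20 budget is an arbitrary choice):

from sklearn.model_selection import RandomizedSearchCV

# Sample 20 random combinations from param_grid instead of the full exhaustive search
random_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=20,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train_scaled, y_train)

print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV Score (RMSE): ${np.sqrt(-random_search.best_score_) * 100000:,.0f}")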

💾 Part 6: Save Model for Production

import pickle
from datetime import datetime

# Save model
model_filename = f'house_price_model_{datetime.now().strftime("%Y%m%d")}.pkl'
with open(model_filename, 'wb') as f:
    pickle.dump({
        'model': gb_model,  # or final_model if you did tuning
        'scaler': scaler,
        'feature_names': X_train.columns.tolist(),
        'training_date': datetime.now(),
        'test_r2': gb_test_metrics[2],
        'test_rmse': gb_test_metrics[0]
    }, f)

print(f"Model saved as: {model_filename}")

# Load model (example)
with open(model_filename, 'rb') as f:
    loaded_data = pickle.load(f)
    loaded_model = loaded_data['model']
    loaded_scaler = loaded_data['scaler']

print("\nModel loaded successfully!")
print(f"Training Date: {loaded_data['training_date']}")
print(f"Test RΒ²: {loaded_data['test_r2']:.4f}")

Create Prediction Function

def predict_house_price(model, scaler, feature_values):
    """
    Predict house price for new data
    
    Parameters:
    -----------
    model : trained model object
    scaler : fitted StandardScaler
    feature_values : dict with feature names and values
    
    Returns:
    --------
    predicted_price : float (in dollars)
    """
    # Create DataFrame from input
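    # Note: the dict keys must match the names (and order) of the features the scaler was fit on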
    input_df = pd.DataFrame([feature_values])
    
    # Scale features
    input_scaled = scaler.transform(input_df)
    
    # Predict
    prediction = model.predict(input_scaled)[0]
    
    return prediction * 100000  # Convert to dollars

# Example usage
sample_house = {
    'MedInc': 3.5,
    'HouseAge': 25,
    'AveRooms': 5.5,
    'AveBedrms': 1.2,
    'Population': 1200,
    'AveOccup': 3.2,
    'Latitude': 37.5,
    'Longitude': -122.3,
    'RoomsPerHousehold': 17.6,
    'BedroomsRatio': 0.218,
    'PeoplePerHousehold': 375
}

predicted_price = predict_house_price(loaded_model, loaded_scaler, sample_house)
print(f"\nPredicted Price: ${predicted_price:,.0f}")

🎯 Project Summary & Next Steps

🎉 Congratulations!

You've built a complete machine learning system from data loading to model deployment!

📊 What You've Accomplished

  • ✅ Loaded and explored 20,000+ housing records
  • ✅ Engineered 5 new features to improve model performance
  • ✅ Trained 3 different models (Linear Regression, Random Forest, Gradient Boosting)
  • ✅ Achieved a test R² of ~0.80 (strong performance!)
  • ✅ Created visualizations for predictions and feature importance
  • ✅ Saved a model ready for production deployment

🚀 Challenge Yourself

📈 Improve Performance
  • Try XGBoost or LightGBM (see the sketch after this list)
  • Add more engineered features
  • Use ensemble stacking
🌐 Deploy Model
  • Build Flask API
  • Create Streamlit web app
  • Deploy to AWS/Heroku
📊 Advanced Analysis
  • Add confidence intervals
  • Perform residual analysis
  • Create interactive dashboard
🎯 Real Kaggle Data
  • Use Kaggle House Prices
  • Handle missing values
  • Submit to competition

📚 Recommended Reading

  • Book: "Hands-On Machine Learning" by Aurélien Géron (Chapter 2)
  • Kaggle: House Prices Competition - Advanced Regression Techniques
  • Book: "Feature Engineering for Machine Learning" by Alice Zheng and Amanda Casari

💡 Portfolio Tips

Make this project shine on GitHub:

  • Add comprehensive README with results
  • Include visualizations (save all plots)
  • Document your feature engineering decisions
  • Add requirements.txt for reproducibility
  • Create Jupyter notebook with clean narrative