🎯 Project Overview
In this comprehensive project, you'll build a complete machine learning system to predict house prices based on features like location, size, number of rooms, and more. This is one of the most popular beginner-to-intermediate ML projects and closely mirrors real-world data science work.
What You'll Build
- Data Pipeline: Load, clean, and explore real housing data
- Feature Engineering: Create new features and handle missing values
- Multiple Models: Train Linear Regression, Random Forest, and Gradient Boosting
- Model Evaluation: Compare models using RMSE, MAE, and R² metrics
- Visualization Dashboard: Create plots showing predictions vs actual prices
- Deployment-Ready Code: Export model for production use
Skills You'll Practice
- Exploratory Data Analysis (EDA)
- Correlation analysis
- Outlier detection
- Handling missing data
- Creating interaction features
- Feature scaling
- Training multiple models
- Hyperparameter tuning
- Cross-validation
- Model comparison
- Residual analysis
- Performance visualization
📌 Portfolio Worthy: This project demonstrates end-to-end ML skills that employers look for. Add it to your GitHub and resume!
📋 Prerequisites
Before starting this project, make sure you're comfortable with:
- ✅ Python basics (variables, functions, loops)
- ✅ Pandas for data manipulation
- ✅ Linear Regression concepts
- ✅ Basic statistics (mean, median, correlation)
Required Courses:
- Module 1: Introduction to Machine Learning
- Module 2: Linear Regression
- Module 5: Random Forests (optional but recommended)
🛠️ Setup & Installation
1. Install Required Libraries
# Create virtual environment (recommended)
python -m venv house_price_env
source house_price_env/bin/activate # On Windows: house_price_env\Scripts\activate
# Install required packages
pip install numpy pandas scikit-learn matplotlib seaborn jupyter
2. Download the Dataset
We'll use the California Housing dataset (built into scikit-learn); alternatively, you can use Kaggle's House Prices dataset for a messier, more realistic challenge.
# Option 1: Built-in California Housing (easier)
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# Option 2: Kaggle House Prices (more realistic)
# Download from: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
# import pandas as pd
# df = pd.read_csv('train.csv')
3. Create Project Structure
house-price-prediction/
├── data/
│   └── housing_data.csv
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_modeling.ipynb
├── models/
│   └── saved_model.pkl
├── src/
│   ├── data_preprocessing.py
│   ├── model_training.py
│   └── predict.py
└── README.md
📊 Part 1: Exploratory Data Analysis
Load and Inspect Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
# Load data
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# First look at the data
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())
print("\nBasic Statistics:")
print(df.describe())
Expected Output
You should see 20,640 rows (each a census block group, not an individual house) with 8 features: MedInc (median income), HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude, plus the target column MedHouseVal.
Check for Missing Values
# Missing values analysis
missing = df.isnull().sum()
print("Missing Values:")
print(missing[missing > 0])
# Visualize missing data
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()
# Good news: California Housing has NO missing values!
# For Kaggle dataset, you'd need imputation strategies
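For the Kaggle data, a minimal imputation sketch using scikit-learn's SimpleImputer (in a real pipeline you'd fit the imputers on the training split only, to avoid leakage):
from sklearn.impute import SimpleImputer
# Numeric columns: fill gaps with the median (robust to outliers)
num_cols = df.select_dtypes(include=[np.number]).columns
num_imputer = SimpleImputer(strategy='median')
df[num_cols] = num_imputer.fit_transform(df[num_cols])
# Categorical columns: fill gaps with the most frequent value
cat_cols = df.select_dtypes(include=['object']).columns
if len(cat_cols) > 0:
    cat_imputer = SimpleImputer(strategy='most_frequent')
    df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])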
Understand Target Variable Distribution
# Target variable: MedHouseVal (median house value in $100,000s)
target = df['MedHouseVal']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Histogram
axes[0].hist(target, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median House Value ($100k)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of House Prices')
# Box plot (detect outliers)
axes[1].boxplot(target)
axes[1].set_ylabel('Median House Value ($100k)')
axes[1].set_title('Box Plot - Outlier Detection')
# Q-Q plot (check normality)
from scipy import stats
stats.probplot(target, dist="norm", plot=axes[2])
axes[2].set_title('Q-Q Plot')
plt.tight_layout()
plt.show()
# Key statistics
print(f"Mean House Price: ${target.mean() * 100000:,.0f}")
print(f"Median House Price: ${target.median() * 100000:,.0f}")
print(f"Price Range: ${target.min() * 100000:,.0f} - ${target.max() * 100000:,.0f}")
Feature Correlations
# Correlation matrix
correlation_matrix = df.corr()
# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
center=0, square=True, linewidths=1)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
# Top correlations with target
target_corr = correlation_matrix['MedHouseVal'].sort_values(ascending=False)
print("\nCorrelations with House Price:")
print(target_corr)
# Key insight: MedInc (median income) has strongest correlation (0.69)
Pairwise Relationships
# Select most important features for pairplot
important_features = ['MedInc', 'HouseAge', 'AveRooms', 'MedHouseVal']
sns.pairplot(df[important_features], diag_kind='kde', plot_kws={'alpha': 0.6})
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()
# Scatter plot: Price vs Income (strongest predictor)
plt.figure(figsize=(10, 6))
plt.scatter(df['MedInc'], df['MedHouseVal'], alpha=0.3)
plt.xlabel('Median Income (tens of thousands)')
plt.ylabel('Median House Value ($100k)')
plt.title('House Price vs Median Income')
plt.grid(alpha=0.3)
plt.show()
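Because the dataset spans all of California, a latitude/longitude scatter is also worth a look; prices visibly cluster along the coast and around the big cities. A quick sketch:
# Geographic scatter: color encodes price, point size encodes population
plt.figure(figsize=(10, 7))
sc = plt.scatter(df['Longitude'], df['Latitude'], c=df['MedHouseVal'],
                 cmap='viridis', s=df['Population'] / 100, alpha=0.4)
plt.colorbar(sc, label='Median House Value ($100k)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('House Prices Across California')
plt.show()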
✅ Checkpoint 1: EDA Complete
By now you should have:
- Loaded and inspected the dataset
- Checked for missing values (none!)
- Visualized target variable distribution
- Identified MedInc as the strongest predictor
- Created correlation heatmap
🔧 Part 2: Feature Engineering
Create New Features
# Create meaningful derived features
df_engineered = df.copy()
# 1. Rooms-occupancy interaction (note: AveRooms is already rooms per household,
#    so this is an interaction term rather than a true "rooms per household")
df_engineered['RoomsPerHousehold'] = df_engineered['AveRooms'] * df_engineered['AveOccup']
# 2. Bedrooms ratio
df_engineered['BedroomsRatio'] = df_engineered['AveBedrms'] / df_engineered['AveRooms']
# 3. Approximate number of households in the block group (Population / AveOccup;
#    the feature name is kept for continuity with the rest of the project)
df_engineered['PeoplePerHousehold'] = df_engineered['Population'] / df_engineered['AveOccup']
# 4. Income category (binning)
df_engineered['IncomeCategory'] = pd.cut(df_engineered['MedInc'],
bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
labels=['VeryLow', 'Low', 'Medium', 'High', 'VeryHigh'])
# 5. Location-based feature (simple geographical zones)
df_engineered['LocationZone'] = 'Other'
# Coastal (latitude > 37 and longitude < -122)
coastal_mask = (df_engineered['Latitude'] > 37) & (df_engineered['Longitude'] < -122)
df_engineered.loc[coastal_mask, 'LocationZone'] = 'Coastal'
print("New Features Created:")
print(df_engineered[['RoomsPerHousehold', 'BedroomsRatio', 'PeoplePerHousehold']].head())
# Check correlation of new features with target
new_features_corr = df_engineered[['RoomsPerHousehold', 'BedroomsRatio',
'PeoplePerHousehold', 'MedHouseVal']].corr()['MedHouseVal']
print("\nNew Feature Correlations:")
print(new_features_corr.sort_values(ascending=False))
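Note that IncomeCategory and LocationZone are categorical, and we'll drop them before modeling in Part 3 to keep things simple. If you'd rather keep them, a minimal one-hot encoding sketch with pd.get_dummies:
# One-hot encode the two categorical features into 0/1 indicator columns
df_encoded = pd.get_dummies(df_engineered,
                            columns=['IncomeCategory', 'LocationZone'],
                            drop_first=True)  # drop one level to avoid redundancy
print(df_encoded.filter(like='IncomeCategory').head())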
Handle Outliers
# Detect outliers using IQR method
def remove_outliers_iqr(df, column, multiplier=1.5):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
# Remove outliers from target variable
# Note: strictly, you'd derive these bounds from the training split only; filtering
# the full dataset also removes test-set outliers, which flatters the final metrics.
print(f"Before outlier removal: {len(df_engineered)} samples")
df_clean = remove_outliers_iqr(df_engineered, 'MedHouseVal', multiplier=1.5)
print(f"After outlier removal: {len(df_clean)} samples")
print(f"Removed: {len(df_engineered) - len(df_clean)} outliers")
# Visualize before/after
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(df_engineered['MedInc'], df_engineered['MedHouseVal'], alpha=0.3)
axes[0].set_title('Before Outlier Removal')
axes[0].set_xlabel('Median Income')
axes[0].set_ylabel('House Price')
axes[1].scatter(df_clean['MedInc'], df_clean['MedHouseVal'], alpha=0.3)
axes[1].set_title('After Outlier Removal')
axes[1].set_xlabel('Median Income')
axes[1].set_ylabel('House Price')
plt.tight_layout()
plt.show()
Feature Scaling
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Prepare features and target; drop the two categorical features,
# which would need encoding first (see the one-hot sketch in Part 2)
X = df_clean.drop(['MedHouseVal', 'IncomeCategory', 'LocationZone'], axis=1)
y = df_clean['MedHouseVal']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
print("\nFeature scaling complete!")
print("\nScaled features (first 3 rows):")
print(X_train_scaled.head(3))
✅ Checkpoint 2: Feature Engineering Complete
You've successfully:
- Created 5 new meaningful features
- Removed outliers using IQR method
- Split data into train/test sets (80/20)
- Scaled features using StandardScaler
🤖 Part 3: Model Training & Evaluation
Baseline Model: Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Train Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
# Predictions
y_train_pred_lr = lr_model.predict(X_train_scaled)
y_test_pred_lr = lr_model.predict(X_test_scaled)
# Evaluation metrics
def evaluate_model(y_true, y_pred, model_name):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"\n{model_name} Performance:")
    print(f"  RMSE: ${rmse * 100000:,.0f}")
    print(f"  MAE:  ${mae * 100000:,.0f}")
    print(f"  R²:   {r2:.4f}")
    return rmse, mae, r2
# Evaluate
lr_train_metrics = evaluate_model(y_train, y_train_pred_lr, "Linear Regression (Train)")
lr_test_metrics = evaluate_model(y_test, y_test_pred_lr, "Linear Regression (Test)")
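A single train/test split can be noisy; as a sanity check, 5-fold cross-validation on the training set gives a more stable estimate of the baseline. A quick sketch with cross_val_score:
from sklearn.model_selection import cross_val_score
# Scoring is negative MSE by sklearn convention, so negate before taking the root
cv_scores = cross_val_score(LinearRegression(), X_train_scaled, y_train,
                            cv=5, scoring='neg_mean_squared_error')
cv_rmse = np.sqrt(-cv_scores)
print(f"CV RMSE: ${cv_rmse.mean() * 100000:,.0f} (+/- ${cv_rmse.std() * 100000:,.0f})")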
Advanced Model: Random Forest
from sklearn.ensemble import RandomForestRegressor
# Train Random Forest
rf_model = RandomForestRegressor(
n_estimators=100,
max_depth=20,
min_samples_split=5,
min_samples_leaf=2,
random_state=42,
n_jobs=-1
)
rf_model.fit(X_train_scaled, y_train)
# Predictions
y_train_pred_rf = rf_model.predict(X_train_scaled)
y_test_pred_rf = rf_model.predict(X_test_scaled)
# Evaluate
rf_train_metrics = evaluate_model(y_train, y_train_pred_rf, "Random Forest (Train)")
rf_test_metrics = evaluate_model(y_test, y_test_pred_rf, "Random Forest (Test)")
# Feature importance
feature_importance = pd.DataFrame({
'Feature': X_train.columns,
'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nTop 5 Most Important Features:")
print(feature_importance.head())
Champion Model: Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor
# Train Gradient Boosting
gb_model = GradientBoostingRegressor(
n_estimators=200,
learning_rate=0.1,
max_depth=5,
min_samples_split=5,
min_samples_leaf=2,
subsample=0.8,
random_state=42
)
gb_model.fit(X_train_scaled, y_train)
# Predictions
y_train_pred_gb = gb_model.predict(X_train_scaled)
y_test_pred_gb = gb_model.predict(X_test_scaled)
# Evaluate
gb_train_metrics = evaluate_model(y_train, y_train_pred_gb, "Gradient Boosting (Train)")
gb_test_metrics = evaluate_model(y_test, y_test_pred_gb, "Gradient Boosting (Test)")
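Because boosting adds trees one at a time, you can watch the test error evolve with staged_predict; this makes diminishing returns (or overfitting) visible. A sketch:
# Compute test RMSE after each boosting stage
stage_rmse = [np.sqrt(mean_squared_error(y_test, pred))
              for pred in gb_model.staged_predict(X_test_scaled)]
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(stage_rmse) + 1), stage_rmse)
plt.xlabel('Number of Boosting Stages')
plt.ylabel('Test RMSE ($100k)')
plt.title('Gradient Boosting: Test Error vs Number of Trees')
plt.grid(alpha=0.3)
plt.show()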
Model Comparison
# Create comparison DataFrame
results = pd.DataFrame({
'Model': ['Linear Regression', 'Random Forest', 'Gradient Boosting'],
'Train RMSE': [lr_train_metrics[0], rf_train_metrics[0], gb_train_metrics[0]],
'Test RMSE': [lr_test_metrics[0], rf_test_metrics[0], gb_test_metrics[0]],
'Train R²': [lr_train_metrics[2], rf_train_metrics[2], gb_train_metrics[2]],
'Test R²': [lr_test_metrics[2], rf_test_metrics[2], gb_test_metrics[2]]
})
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(results.to_string(index=False))
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# RMSE comparison
axes[0].bar(results['Model'], results['Test RMSE'], color=['#3b82f6', '#10b981', '#f59e0b'])
axes[0].set_ylabel('RMSE (in $100k)')
axes[0].set_title('Model RMSE Comparison (Lower is Better)')
axes[0].tick_params(axis='x', rotation=45)
# R² comparison
axes[1].bar(results['Model'], results['Test R²'], color=['#3b82f6', '#10b981', '#f59e0b'])
axes[1].set_ylabel('R² Score')
axes[1].set_ylim([0, 1])
axes[1].set_title('Model R² Comparison (Higher is Better)')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
# Identify best model
best_model_idx = results['Test R²'].idxmax()
best_model_name = results.loc[best_model_idx, 'Model']
print(f"\nπ Best Model: {best_model_name}")
✅ Checkpoint 3: Models Trained
You've trained and compared:
- Linear Regression (baseline)
- Random Forest (ensemble)
- Gradient Boosting (champion)
- Typical result: Gradient Boosting wins with R² ≈ 0.80
📈 Part 4: Model Analysis & Visualization
Prediction vs Actual Plot
# Use best model (Gradient Boosting)
y_pred = y_test_pred_gb
# Predictions vs Actual
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($100k)')
plt.ylabel('Predicted Price ($100k)')
plt.title('Predicted vs Actual House Prices')
plt.grid(alpha=0.3)
plt.show()
# Perfect predictions would fall on the red diagonal line
Residual Analysis
# Calculate residuals
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Residual plot
axes[0].scatter(y_pred, residuals, alpha=0.5)
axes[0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0].set_xlabel('Predicted Price ($100k)')
axes[0].set_ylabel('Residuals ($100k)')
axes[0].set_title('Residual Plot')
axes[0].grid(alpha=0.3)
# Residual distribution
axes[1].hist(residuals, bins=50, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Residuals ($100k)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].axvline(x=0, color='r', linestyle='--', lw=2)
plt.tight_layout()
plt.show()
# Residual statistics
print(f"Mean Residual: ${residuals.mean() * 100000:,.0f} (should be ~$0)")
print(f"Std Dev of Residuals: ${residuals.std() * 100000:,.0f}")
Feature Importance Visualization
# Gradient Boosting feature importance
importance_df = pd.DataFrame({
'Feature': X_train.columns,
'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'][:10], importance_df['Importance'][:10])
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features (Gradient Boosting)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
print("\nFeature Importance Rankings:")
print(importance_df)
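Impurity-based importances can overstate high-cardinality or correlated features. As a cross-check, permutation importance measures how much test-set R² drops when each feature is shuffled; a sketch using sklearn.inspection:
from sklearn.inspection import permutation_importance
# Shuffle each feature 10 times and average the resulting drop in test R²
perm = permutation_importance(gb_model, X_test_scaled, y_test,
                              n_repeats=10, random_state=42, n_jobs=-1)
perm_df = pd.DataFrame({
    'Feature': X_test_scaled.columns,
    'Importance': perm.importances_mean
}).sort_values('Importance', ascending=False)
print(perm_df)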
Error Analysis: Best and Worst Predictions
# Calculate absolute errors
errors = np.abs(y_test - y_pred)
error_df = pd.DataFrame({
'Actual': y_test.values,
'Predicted': y_pred,
'Error': errors.values
}, index=y_test.index)
# Best predictions (lowest error)
print("\nπ― Best Predictions (Top 5):")
print(error_df.nsmallest(5, 'Error'))
# Worst predictions (highest error)
print("\nβ Worst Predictions (Top 5):")
print(error_df.nlargest(5, 'Error'))
# Analyze worst predictions
worst_indices = error_df.nlargest(5, 'Error').index
worst_features = X_test.loc[worst_indices]
print("\nFeatures of worst predictions:")
print(worst_features)
✅ Checkpoint 4: Analysis Complete
You've created:
- Prediction vs Actual scatter plot
- Residual analysis plots
- Feature importance visualization
- Error analysis identifying best/worst predictions
🔍 Part 5: Hyperparameter Tuning (Optional)
Grid Search for Optimal Parameters
from sklearn.model_selection import GridSearchCV
# Define parameter grid for Gradient Boosting
param_grid = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.05, 0.1, 0.15],
'max_depth': [3, 5, 7],
'min_samples_split': [2, 5, 10]
}
# Grid search with cross-validation
print("Starting Grid Search (this may take a few minutes)...")
grid_search = GridSearchCV(
GradientBoostingRegressor(random_state=42),
param_grid,
cv=5,
scoring='neg_mean_squared_error',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Score (RMSE): ${np.sqrt(-grid_search.best_score_) * 100000:,.0f}")
# Train final model with best parameters
final_model = grid_search.best_estimator_
y_test_pred_final = final_model.predict(X_test_scaled)
# Final evaluation
final_metrics = evaluate_model(y_test, y_test_pred_final, "Tuned Gradient Boosting")
⚠️ Note: Grid search can take 10-30 minutes depending on your machine. For quick iteration, start with fewer parameter combinations or use RandomizedSearchCV.
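For faster iteration, here is a RandomizedSearchCV sketch that samples 10 of the 81 combinations in the grid above instead of trying them all:
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_grid,  # reuses the grid defined above
    n_iter=10,                       # number of random combinations to try
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train_scaled, y_train)
print(f"Best Parameters: {random_search.best_params_}")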
💾 Part 6: Save Model for Production
import pickle
from datetime import datetime
# Save model
model_filename = f'house_price_model_{datetime.now().strftime("%Y%m%d")}.pkl'
with open(model_filename, 'wb') as f:
    pickle.dump({
        'model': gb_model,  # or final_model if you did tuning
        'scaler': scaler,
        'feature_names': X_train.columns.tolist(),
        'training_date': datetime.now(),
        'test_r2': gb_test_metrics[2],
        'test_rmse': gb_test_metrics[0]
    }, f)
print(f"Model saved as: {model_filename}")
# Load model (example)
with open(model_filename, 'rb') as f:
    loaded_data = pickle.load(f)
loaded_model = loaded_data['model']
loaded_scaler = loaded_data['scaler']
print("\nModel loaded successfully!")
print(f"Training Date: {loaded_data['training_date']}")
print(f"Test RΒ²: {loaded_data['test_r2']:.4f}")
Create Prediction Function
def predict_house_price(model, scaler, feature_values):
    """
    Predict house price for new data.

    Parameters
    ----------
    model : trained model object
    scaler : fitted StandardScaler
    feature_values : dict with feature names and values

    Returns
    -------
    predicted_price : float (in dollars)
    """
    # Build a one-row DataFrame and order columns to match the training features
    input_df = pd.DataFrame([feature_values])[scaler.feature_names_in_]
    # Scale with the scaler fitted on the training data
    input_scaled = scaler.transform(input_df)
    # Predict and convert from $100k units to dollars
    prediction = model.predict(input_scaled)[0]
    return prediction * 100000
# Example usage
sample_house = {
'MedInc': 3.5,
'HouseAge': 25,
'AveRooms': 5.5,
'AveBedrms': 1.2,
'Population': 1200,
'AveOccup': 3.2,
'Latitude': 37.5,
'Longitude': -122.3,
'RoomsPerHousehold': 17.6,
'BedroomsRatio': 0.218,
'PeoplePerHousehold': 375
}
predicted_price = predict_house_price(loaded_model, loaded_scaler, sample_house)
print(f"\nPredicted Price: ${predicted_price:,.0f}")
🎯 Project Summary & Next Steps
🎉 Congratulations!
You've built a complete machine learning system from data loading to model deployment!
📊 What You've Accomplished
- ✅ Loaded and explored 20,000+ housing records
- ✅ Engineered 5 new features to improve model performance
- ✅ Trained 3 different models (Linear Regression, Random Forest, Gradient Boosting)
- ✅ Achieved R² of roughly 0.80 on test data (strong performance!)
- ✅ Created visualizations for predictions and feature importance
- ✅ Saved a model ready for production deployment
🚀 Challenge Yourself
- Try XGBoost or LightGBM
- Add more engineered features
- Use ensemble stacking
- Build a Flask API (see the sketch after this list)
- Create Streamlit web app
- Deploy to AWS/Heroku
- Add confidence intervals
- Perform residual analysis
- Create interactive dashboard
- Use Kaggle House Prices
- Handle missing values
- Submit to competition
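For the Flask API challenge, a minimal sketch to get you started (the /predict route and port are illustrative; it assumes loaded_model, loaded_scaler, and predict_house_price from Part 6 are in scope, and that you've run pip install flask):
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body with one entry per feature, e.g. {"MedInc": 3.5, ...}
    feature_values = request.get_json()
    price = predict_house_price(loaded_model, loaded_scaler, feature_values)
    return jsonify({'predicted_price': round(price, 2)})

if __name__ == '__main__':
    app.run(port=5000)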
📚 Recommended Reading
- Book: "Hands-On Machine Learning" by AurΓ©lien GΓ©ron (Chapter 2)
- Kaggle: House Prices Competition - Advanced Regression Techniques
- Book: "Feature Engineering for Machine Learning" by Alice Zheng and Amanda Casari
💡 Portfolio Tips
Make this project shine on GitHub:
- Add comprehensive README with results
- Include visualizations (save all plots)
- Document your feature engineering decisions
- Add requirements.txt for reproducibility
- Create Jupyter notebook with clean narrative