Project Overview
Recommendation systems reportedly drive about 35% of Amazon purchases and 75% of Netflix views, and they generate billions in revenue across e-commerce, streaming, and social media platforms. In this project, you'll build a recommendation engine using real movie ratings data.
Why Recommendation Systems Matter
- Revenue Impact: Amazon attributes 35% of sales to recommendations
- User Engagement: Netflix saves $1B annually through reduced churn via personalization
- Content Discovery: YouTube's recommender drives 70% of watch time
- Competitive Advantage: Personalization is a key differentiator in crowded markets
What You'll Build
- User-Based Collaborative Filtering: "Users like you also liked..."
- Item-Based Collaborative Filtering: "People who liked this also liked..."
- Matrix Factorization (SVD): Discover latent factors and patterns
- Content-Based Filtering: Recommend based on item features
- Hybrid System: Combine multiple approaches for better results
- Cold Start Solutions: Handle new users/items with no ratings
High-Demand Skill: Recommendation systems are core to companies like Netflix, Spotify, Amazon, and YouTube, as well as thousands of startups. This project showcases your ability to build personalization engines!
Dataset & Setup
1. Install Dependencies
pip install pandas numpy scikit-learn scikit-surprise matplotlib seaborn
2. Load MovieLens Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import Dataset, Reader, SVD, KNNBasic, accuracy
from surprise.model_selection import train_test_split, cross_validate
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')
# Load MovieLens 100K dataset
# Download from: https://grouplens.org/datasets/movielens/100k/
# Or use Surprise's built-in loader
data = Dataset.load_builtin('ml-100k')
# Convert to pandas for analysis
ratings_df = pd.DataFrame(data.raw_ratings, columns=['user', 'item', 'rating', 'timestamp'])
print(f"Dataset Shape: {ratings_df.shape}")
print(f"Number of users: {ratings_df['user'].nunique()}")
print(f"Number of movies: {ratings_df['item'].nunique()}")
print(f"Number of ratings: {len(ratings_df)}")
print(f"Rating range: {ratings_df['rating'].min()} - {ratings_df['rating'].max()}")
print(f"Average rating: {ratings_df['rating'].mean():.2f}")
# Sparsity analysis
num_users = ratings_df['user'].nunique()
num_items = ratings_df['item'].nunique()
sparsity = 1 - (len(ratings_df) / (num_users * num_items))
print(f"\nMatrix Sparsity: {sparsity:.2%}")
print(f"({sparsity:.2%} of user-item combinations have no rating)")
Dataset Info: MovieLens 100K contains 100,000 ratings from 943 users on 1,682 movies. Ratings range from 1 to 5 stars. The user-item matrix is 93.7% sparse, a typical challenge in recommendation systems.
Part 1: Exploratory Data Analysis
Rating Distribution Analysis
# Rating distribution
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Overall rating distribution
rating_counts = ratings_df['rating'].value_counts().sort_index()
axes[0].bar(rating_counts.index, rating_counts.values, color='#3b82f6')
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Count')
axes[0].set_title('Rating Distribution')
axes[0].grid(axis='y', alpha=0.3)
# User activity distribution
user_activity = ratings_df.groupby('user').size()
axes[1].hist(user_activity, bins=50, color='#8b5cf6', edgecolor='black')
axes[1].set_xlabel('Number of Ratings per User')
axes[1].set_ylabel('Number of Users')
axes[1].set_title('User Activity Distribution')
axes[1].axvline(user_activity.mean(), color='red', linestyle='--', label=f'Mean: {user_activity.mean():.0f}')
axes[1].legend()
# Movie popularity distribution
item_popularity = ratings_df.groupby('item').size()
axes[2].hist(item_popularity, bins=50, color='#10b981', edgecolor='black')
axes[2].set_xlabel('Number of Ratings per Movie')
axes[2].set_ylabel('Number of Movies')
axes[2].set_title('Movie Popularity Distribution')
axes[2].axvline(item_popularity.mean(), color='red', linestyle='--', label=f'Mean: {item_popularity.mean():.0f}')
axes[2].legend()
plt.tight_layout()
plt.show()
print("\nSummary Statistics:")
print(f"Average ratings per user: {user_activity.mean():.1f}")
print(f"Average ratings per movie: {item_popularity.mean():.1f}")
print(f"Most active user rated: {user_activity.max()} movies")
print(f"Most popular movie has: {item_popularity.max()} ratings")
print(f"Cold start challenge: {(item_popularity < 5).sum()} movies have < 5 ratings")
Create User-Item Matrix
# Create user-item matrix (pivot table)
user_item_matrix = ratings_df.pivot_table(
    index='user',
    columns='item',
    values='rating'
)
print(f"\nUser-Item Matrix Shape: {user_item_matrix.shape}")
print(f"Memory size: {user_item_matrix.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# Visualize a sample of the matrix
plt.figure(figsize=(12, 8))
sample_matrix = user_item_matrix.iloc[:50, :50]
sns.heatmap(sample_matrix, cmap='YlGnBu', cbar_kws={'label': 'Rating'})
plt.title('User-Item Rating Matrix (First 50 users × 50 movies)')
plt.xlabel('Movie ID')
plt.ylabel('User ID')
plt.tight_layout()
plt.show()
Checkpoint 1: Data Understanding
Key insights discovered:
- 100,000 ratings from 943 users on 1,682 movies
- 93.7% sparsity: most user-movie pairs are unrated
- Rating bias: Users tend to rate 3-4 stars, so the distribution leans toward positive ratings
- Power law: A few movies have many ratings, most have few
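To put a number on the long-tail observation, one option is to measure what share of all ratings goes to the most-rated movies. The sketch below is only illustrative: `head_share` is a hypothetical helper, and the counts are made-up toy values rather than MovieLens figures.

```python
import numpy as np

def head_share(rating_counts, top_fraction=0.2):
    """Share of all ratings that belong to the most-rated `top_fraction` of items."""
    counts = np.sort(np.asarray(rating_counts, dtype=float))[::-1]  # most popular first
    n_head = max(1, int(round(len(counts) * top_fraction)))
    return counts[:n_head].sum() / counts.sum()

# Toy counts (hypothetical): a few blockbusters and many rarely rated titles
toy_counts = [500, 300, 40, 30, 20, 10, 5, 5, 3, 2]
share = head_share(toy_counts, top_fraction=0.2)
print(f"Top 20% of items hold {share:.1%} of ratings")
```

On the real data, passing `item_popularity.values` from the EDA step should give the corresponding figure for MovieLens 100K.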
Part 2: Collaborative Filtering
Method 1: User-Based Collaborative Filtering
# Train-test split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
# User-based KNN with cosine similarity
user_based_cf = KNNBasic(
    k=40,  # Number of neighbors
    sim_options={
        'name': 'cosine',
        'user_based': True
    }
)
# Train
user_based_cf.fit(trainset)
# Predictions
predictions_user = user_based_cf.test(testset)
# Evaluate
rmse_user = accuracy.rmse(predictions_user, verbose=True)
mae_user = accuracy.mae(predictions_user, verbose=True)
print("\nUSER-BASED COLLABORATIVE FILTERING")
print(f"RMSE: {rmse_user:.4f}")
print(f"MAE: {mae_user:.4f}")
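To make the "users like you" idea concrete, here is a minimal NumPy sketch of a user-based prediction: cosine similarity on co-rated items, then a similarity-weighted average over the nearest neighbors. It follows the general idea behind a basic KNN recommender but is not Surprise's implementation; the rating matrix and both function names are hypothetical.

```python
import numpy as np

# Toy ratings: rows = users, columns = items, 0 = not rated (hypothetical data)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 3, 1],
    [1, 2, 5, 4],
], dtype=float)

def cosine_on_common(a, b):
    """Cosine similarity restricted to items both users rated."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    return float(a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

def predict_user_based(R, user, item, k=2):
    """Similarity-weighted mean of the k most similar users who rated `item`."""
    neighbors = [
        (cosine_on_common(R[user], R[v]), float(R[v, item]))
        for v in range(R.shape[0])
        if v != user and R[v, item] > 0
    ]
    neighbors = sorted(neighbors, reverse=True)[:k]
    sim_sum = sum(s for s, _ in neighbors)
    if sim_sum == 0:
        return None  # no usable neighbors
    return sum(s * r for s, r in neighbors) / sim_sum

pred = predict_user_based(R, user=0, item=2)
print(f"Predicted rating of user 0 for item 2: {pred:.2f}")
```

User 1 rates much like user 0, so their rating of item 2 pulls the prediction more strongly than user 2's.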
Method 2: Item-Based Collaborative Filtering
# Item-based KNN with cosine similarity
item_based_cf = KNNBasic(
    k=40,
    sim_options={
        'name': 'cosine',
        'user_based': False  # Item-based
    }
)
# Train
item_based_cf.fit(trainset)
# Predictions
predictions_item = item_based_cf.test(testset)
# Evaluate
rmse_item = accuracy.rmse(predictions_item, verbose=True)
mae_item = accuracy.mae(predictions_item, verbose=True)
print("\nITEM-BASED COLLABORATIVE FILTERING")
print(f"RMSE: {rmse_item:.4f}")
print(f"MAE: {mae_item:.4f}")
Method 3: Matrix Factorization (SVD)
# SVD (Singular Value Decomposition)
svd_model = SVD(
    n_factors=100,  # Number of latent factors
    n_epochs=20,
    lr_all=0.005,
    reg_all=0.02,
    random_state=42
)
# Train
svd_model.fit(trainset)
# Predictions
predictions_svd = svd_model.test(testset)
# Evaluate
rmse_svd = accuracy.rmse(predictions_svd, verbose=True)
mae_svd = accuracy.mae(predictions_svd, verbose=True)
print("\nMATRIX FACTORIZATION (SVD)")
print(f"RMSE: {rmse_svd:.4f}")
print(f"MAE: {mae_svd:.4f}")
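For intuition about what the latent factors are doing, the following is a small sketch of biased matrix factorization trained with stochastic gradient descent, where a rating is predicted as global mean + user bias + item bias + dot product of user and item factors. It is a simplified illustration on hypothetical toy data, not Surprise's code; `train_biased_mf` and `train_rmse` are names introduced here.

```python
import numpy as np

def train_biased_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
                    lr=0.01, reg=0.02, seed=0):
    """SGD on squared error for r_hat = mu + b_u + b_i + p_u . q_i."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()                   # use pre-update value for Q's step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def train_rmse(ratings, params):
    """RMSE of the model on the given (user, item, rating) triples."""
    mu, bu, bi, P, Q = params
    errs = [r - (mu + bu[u] + bi[i] + P[u] @ Q[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Hypothetical (user, item, rating) triples
toy = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 2), (2, 1, 1), (2, 2, 5)]
untrained = train_biased_mf(toy, 3, 3, n_epochs=0)
trained = train_biased_mf(toy, 3, 3, n_epochs=200)
print(f"Training RMSE before: {train_rmse(toy, untrained):.3f}, "
      f"after: {train_rmse(toy, trained):.3f}")
```

The `lr` and `reg` arguments play the same role as `lr_all` and `reg_all` in the model above: step size and the strength of the penalty that keeps biases and factors small.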
Model Comparison
# Compare all models
comparison_df = pd.DataFrame({
    'Model': ['User-Based CF', 'Item-Based CF', 'SVD'],
    'RMSE': [rmse_user, rmse_item, rmse_svd],
    'MAE': [mae_user, mae_item, mae_svd]
})
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(comparison_df.to_string(index=False))
# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(comparison_df))
width = 0.35
bars1 = ax.bar(x - width/2, comparison_df['RMSE'], width, label='RMSE', color='#3b82f6')
bars2 = ax.bar(x + width/2, comparison_df['MAE'], width, label='MAE', color='#8b5cf6')
ax.set_xlabel('Models')
ax.set_ylabel('Error')
ax.set_title('Recommendation Model Performance Comparison')
ax.set_xticks(x)
ax.set_xticklabels(comparison_df['Model'])
ax.legend()
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Best model
best_model_idx = comparison_df['RMSE'].idxmin()
print(f"\nBest Model: {comparison_df.loc[best_model_idx, 'Model']}")
print("Typically SVD achieves lowest error (~0.93 RMSE)")
Checkpoint 2: Collaborative Filtering Complete
You've implemented:
- User-based CF: Finds similar users
- Item-based CF: Finds similar items (often better for sparse data)
- SVD: Discovers latent factors (typically best performance)
- Expected RMSE: roughly 0.93 for SVD on MovieLens 100K; the cosine KNN models typically land somewhat higher, around 1.0
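As a reminder of what the two error metrics measure, here is a small worked example in plain Python. The rating pairs are hypothetical; the point is that RMSE squares each error before averaging, so one large miss raises it more than it raises MAE.

```python
import math

def rmse(errors):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: average size of the miss, in rating points."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical (true rating, predicted rating) pairs
pairs = [(4, 3.5), (5, 4.5), (2, 4.0), (3, 3.0)]
errors = [true - pred for true, pred in pairs]

print(f"RMSE: {rmse(errors):.3f}")
print(f"MAE:  {mae(errors):.3f}")
```

Here the single 2-point miss dominates the RMSE, while the MAE stays at the plain average of the absolute errors.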
Part 3: Generate Recommendations
Top-N Recommendations for Users
def get_top_n_recommendations(model, user_id, n=10, trainset=trainset):
    """
    Get top N movie recommendations for a user (raw user and item IDs)
    """
    # Inner ID of the user; raises ValueError if the user is not in the trainset
    inner_uid = trainset.to_inner_uid(user_id)
    # Get all items (inner IDs)
    all_items = set(trainset.all_items())
    # Get items user has already rated (inner IDs)
    user_items = set(j for (j, _) in trainset.ur[inner_uid])
    # Items to predict (not yet rated)
    items_to_predict = all_items - user_items
    # Predict ratings; model.predict() expects raw IDs, so convert each inner ID
    predictions = []
    for inner_iid in items_to_predict:
        raw_iid = trainset.to_raw_iid(inner_iid)
        pred = model.predict(user_id, raw_iid)
        predictions.append((raw_iid, pred.est))
    # Sort by predicted rating
    predictions.sort(key=lambda x: x[1], reverse=True)
    # Return top N
    return predictions[:n]
# Example: Recommendations for user 196
user_id = '196'
recommendations = get_top_n_recommendations(svd_model, user_id, n=10)
print(f"\nTOP 10 RECOMMENDATIONS FOR USER {user_id}")
print("="*60)
for i, (item_id, predicted_rating) in enumerate(recommendations, 1):
    print(f"{i}. Movie {item_id} - Predicted Rating: {predicted_rating:.2f}")
Similar Items (Content Discovery)
def get_similar_items(model, item_id, n=10, trainset=trainset):
    """
    Find movies similar to given movie (requires a model with item factors, e.g. SVD)
    """
    # Only matrix factorization models expose item latent factors (model.qi)
    if not hasattr(model, 'qi'):
        return []
    # Get inner item id (raises ValueError for an unknown item)
    inner_item_id = trainset.to_inner_iid(item_id)
    item_factors = model.qi[inner_item_id]
    # Score every other item by the dot product of latent factors
    # (unnormalized, so items with large factor norms tend to score higher)
    similarities = []
    for other_inner_id in range(trainset.n_items):
        if other_inner_id != inner_item_id:
            other_factors = model.qi[other_inner_id]
            sim = np.dot(item_factors, other_factors)
            other_raw_id = trainset.to_raw_iid(other_inner_id)
            similarities.append((other_raw_id, sim))
    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:n]
# Example: Movies similar to movie 50
movie_id = '50'
similar_movies = get_similar_items(svd_model, movie_id, n=10)
print(f"\nMOVIES SIMILAR TO MOVIE {movie_id}")
print("="*60)
for i, (item_id, similarity) in enumerate(similar_movies, 1):
    print(f"{i}. Movie {item_id} - Similarity: {similarity:.4f}")
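A raw dot product between latent factors mixes direction and magnitude, so an item with a large factor norm can outrank an item that points in exactly the same direction as the query. One common alternative is cosine similarity, sketched below with NumPy. The factor matrix is hypothetical and `top_similar_by_cosine` is a name introduced for this illustration.

```python
import numpy as np

def top_similar_by_cosine(item_factors, item_index, n=3):
    """Rank items by cosine similarity of their latent-factor rows."""
    norms = np.linalg.norm(item_factors, axis=1, keepdims=True)
    unit = item_factors / np.clip(norms, 1e-12, None)   # avoid division by zero
    sims = unit @ unit[item_index]
    sims[item_index] = -np.inf                            # exclude the query item
    order = np.argsort(sims)[::-1][:n]
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical factor matrix: item 1 points the same way as item 0 but is shorter,
# item 2 has a large norm but a different direction.
Q = np.array([
    [1.0, 0.0],
    [0.5, 0.0],
    [3.0, 3.0],
    [0.0, 1.0],
])
result = top_similar_by_cosine(Q, item_index=0, n=2)
print(result)
```

With a trained Surprise SVD model, the same function could be given `svd_model.qi` and an inner item ID from `trainset.to_inner_iid(...)`; the returned indices are inner IDs and would need `trainset.to_raw_iid(...)` to map back.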
Precision@K and Recall@K
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """
    Compute precision and recall at k, averaged over users
    predictions: list of Prediction objects
    k: number of recommendations
    threshold: rating threshold for relevance
    """
    # Group predictions by user
    user_est_true = {}
    for pred in predictions:
        user_est_true.setdefault(pred.uid, []).append((pred.est, pred.r_ui))
    precisions = []
    recalls = []
    for user_id, user_ratings in user_est_true.items():
        # Sort by estimated rating
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Top k predictions
        top_k = user_ratings[:k]
        # Number of relevant items in top k
        n_rel_and_rec_k = sum((true_r >= threshold) for (_, true_r) in top_k)
        # Total number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Precision@K: divide by the number of items actually in the list,
        # since some users have fewer than k ratings in the test set
        precision = n_rel_and_rec_k / len(top_k) if top_k else 0
        precisions.append(precision)
        # Recall@K
        recall = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
        recalls.append(recall)
    return np.mean(precisions), np.mean(recalls)
# Calculate for SVD model
precision, recall = precision_recall_at_k(predictions_svd, k=10, threshold=3.5)
# Guard against division by zero when both metrics are 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
print("\nRECOMMENDATION QUALITY METRICS")
print("="*60)
print(f"Precision@10: {precision:.4f}")
print(f"Recall@10: {recall:.4f}")
print(f"F1-Score@10: {f1:.4f}")
print("\nInterpretation:")
print(f" - {precision:.1%} of top-10 recommendations are relevant (rated ≥ 3.5)")
print(f" - We capture {recall:.1%} of all relevant items in top-10")
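A worked example for a single user may help when reading these numbers. The sketch below uses plain Python and hypothetical (estimated, true) rating pairs; `precision_recall_for_user` is a helper introduced only for this illustration.

```python
def precision_recall_for_user(est_true_pairs, k=3, threshold=3.5):
    """est_true_pairs: list of (estimated_rating, true_rating) for one user."""
    ranked = sorted(est_true_pairs, key=lambda p: p[0], reverse=True)
    top_k = ranked[:k]
    hits = sum(true_r >= threshold for _, true_r in top_k)
    n_relevant = sum(true_r >= threshold for _, true_r in ranked)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall

# Hypothetical user with five held-out ratings
pairs = [(4.8, 5), (4.5, 2), (4.1, 4), (3.0, 4), (2.2, 1)]
p, r = precision_recall_for_user(pairs, k=3)
print(f"Precision@3 = {p:.2f}, Recall@3 = {r:.2f}")
```

Two of the three top-ranked items are truly relevant (precision 2/3), and those two are out of three relevant items overall (recall 2/3); the relevant item ranked fourth is the one that was missed.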
Coverage and Diversity Analysis
# Calculate catalog coverage
def calculate_coverage(model, n_users=100, n_recommendations=10):
    """
    What percentage of catalog gets recommended?
    """
    recommended_items = set()
    for user_id in range(1, n_users + 1):
        user_id_str = str(user_id)
        try:
            recs = get_top_n_recommendations(model, user_id_str, n=n_recommendations)
        except ValueError:
            # User has no ratings in the trainset
            continue
        for item_id, _ in recs:
            recommended_items.add(item_id)
    coverage = len(recommended_items) / trainset.n_items
    return coverage, len(recommended_items)
coverage, n_unique_items = calculate_coverage(svd_model, n_users=100, n_recommendations=10)
print("\nCATALOG COVERAGE")
print("="*60)
print(f"Unique items recommended: {n_unique_items}")
print(f"Total items in catalog: {trainset.n_items}")
print(f"Coverage: {coverage:.1%}")
# Coverage is only a rough proxy for diversity: it does not measure
# how different the items within a single recommendation list are
print("\nHigh coverage (>30% as a rough rule of thumb) means recommendations reach more of the catalog")
print(" Low coverage means 'filter bubble' - recommending same popular items")
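Coverage says how much of the catalog is reached across users, but not how varied a single user's list is. A common complement is intra-list diversity: the average pairwise distance between the items in one list. The NumPy sketch below assumes each item has a feature vector (genre flags or latent factors, for example); the vectors and the function name are hypothetical.

```python
import numpy as np

def intra_list_diversity(item_vectors):
    """Mean pairwise cosine distance (1 - cosine similarity) within one list."""
    X = np.asarray(item_vectors, dtype=float)
    if len(X) < 2:
        return 0.0
    unit = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    sims = unit @ unit.T
    upper = np.triu_indices(len(X), k=1)      # each unordered pair once
    return float(np.mean(1.0 - sims[upper]))

# Hypothetical item vectors (e.g. genre flags or latent factors)
similar_list = [[1, 0, 0], [1, 0, 0], [1, 0.1, 0]]
mixed_list = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(f"Near-duplicate list: {intra_list_diversity(similar_list):.3f}")
print(f"Mixed list:          {intra_list_diversity(mixed_list):.3f}")
```

A value near 0 means the list contains near-duplicates; values closer to 1 mean the recommended items point in different directions in feature space.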
Checkpoint 3: Recommendations Generated
You've created:
- Top-N personalized recommendations for any user
- Similar items finder for content discovery
- Precision@K and Recall@K metrics
- Catalog coverage analysis
Part 4: Hybrid Recommendation System
Combine Collaborative and Content-Based
def hybrid_recommendation(user_id, item_id, cf_model, content_weight=0.3):
    """
    Hybrid: Combine collaborative filtering with content features
    cf_model: trained SVD model
    content_weight: weight for content-based score (0-1)
    """
    # Collaborative filtering score
    cf_pred = cf_model.predict(user_id, item_id)
    cf_score = cf_pred.est
    # Content-based score (simplified: the item's mean rating stands in for
    # a real content model built on genres or other item features)
    item_ratings = ratings_df[ratings_df['item'] == item_id]
    if len(item_ratings) > 0:
        content_score = item_ratings['rating'].mean()
    else:
        content_score = 3.0  # Default for cold start
    # Weighted combination
    hybrid_score = (1 - content_weight) * cf_score + content_weight * content_score
    return hybrid_score, cf_score, content_score
# Example hybrid predictions
test_user = '196'
test_items = ['50', '100', '150', '200', '250']
print("\nHYBRID RECOMMENDATIONS")
print("="*80)
print(f"{'Movie ID':<12} {'CF Score':<12} {'Content Score':<15} {'Hybrid Score':<15}")
print("-"*80)
for item in test_items:
    hybrid, cf, content = hybrid_recommendation(test_user, item, svd_model, content_weight=0.3)
    print(f"{item:<12} {cf:<12.2f} {content:<15.2f} {hybrid:<15.2f}")
print("\nHybrid systems combine strengths:")
print(" - CF: Captures user preferences and taste patterns")
print(" - Content: Handles cold start and provides diversity")
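The content score above is a stand-in based on average ratings. For comparison, here is a minimal sketch of a score that actually uses item features: it builds a user profile from genre vectors and compares it to a candidate item with cosine similarity. The genre vectors, item names, and both functions are hypothetical and not part of the project code.

```python
import numpy as np

# Hypothetical binary genre vectors: [action, comedy, drama, sci-fi]
item_genres = {
    'A': np.array([1, 0, 0, 1], dtype=float),
    'B': np.array([0, 1, 0, 0], dtype=float),
    'C': np.array([1, 0, 0, 0], dtype=float),
    'D': np.array([0, 1, 1, 0], dtype=float),
}

def build_user_profile(user_ratings, item_genres):
    """Sum of genre vectors weighted by the user's mean-centered ratings."""
    mean_rating = np.mean(list(user_ratings.values()))
    return sum((r - mean_rating) * item_genres[i] for i, r in user_ratings.items())

def content_score(profile, genres):
    """Cosine similarity between the user profile and an item's genres, in [-1, 1]."""
    denom = np.linalg.norm(profile) * np.linalg.norm(genres)
    return float(profile @ genres / denom) if denom > 0 else 0.0

# This user liked an action/sci-fi title and disliked a comedy
profile = build_user_profile({'A': 5, 'B': 1}, item_genres)
for item in ['C', 'D']:
    print(item, round(content_score(profile, item_genres[item]), 3))
```

Because this score lives in [-1, 1], it would need rescaling before being blended with a 1-5 rating prediction; one possible mapping (an assumption, not the only choice) is `3 + 2 * score`.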
Cold Start: New User Strategy
def recommend_for_new_user(n=10):
    """
    Recommend to new user with no rating history
    Strategy: popular items with high average ratings
    """
    # Calculate item popularity and average rating
    item_stats = ratings_df.groupby('item').agg({
        'rating': ['count', 'mean']
    }).reset_index()
    item_stats.columns = ['item', 'num_ratings', 'avg_rating']
    # Filter: At least 50 ratings (quality) and avg rating >= 3.5
    popular_items = item_stats[
        (item_stats['num_ratings'] >= 50) &
        (item_stats['avg_rating'] >= 3.5)
    ].sort_values('avg_rating', ascending=False)
    return popular_items.head(n)
new_user_recs = recommend_for_new_user(n=10)
print("\nCOLD START: NEW USER RECOMMENDATIONS")
print("="*60)
print("Strategy: Popular + High-Quality Movies")
print(new_user_recs.to_string(index=False))
print("\nAfter user rates 5-10 items, switch to personalized CF!")
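The hard cutoff of 50 ratings is one way to keep rarely rated movies from topping the list. A softer alternative, shown here as a different technique rather than what the function above does, is a damped (Bayesian-style) mean that shrinks each item's average toward the global average when it has few ratings. The `damping` value and the sample items are hypothetical.

```python
def damped_mean(rating_sum, n_ratings, global_mean, damping=25):
    """Shrink an item's mean toward the global mean when it has few ratings."""
    return (rating_sum + damping * global_mean) / (n_ratings + damping)

global_mean = 3.5   # hypothetical overall average rating

# Hypothetical items: (name, number of ratings, plain average)
items = [("niche favorite", 3, 5.0), ("well-known hit", 400, 4.3)]
for name, n, avg in items:
    score = damped_mean(avg * n, n, global_mean)
    print(f"{name}: plain mean {avg:.2f}, damped mean {score:.2f}")
```

The three-rating item loses most of its apparent lead, while the item with 400 ratings barely moves, so sorting by the damped mean ranks well-supported items first without discarding the rest.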
Checkpoint 4: Hybrid System Built
Advanced features implemented:
- Hybrid scoring combining CF + content features
- Cold start solution for new users
- Flexible weighting for different scenarios
- A complete recommendation pipeline, ready to be packaged as a service
Part 5: Model Deployment
import pickle
from datetime import datetime
# Save trained model and metadata
recommendation_package = {
    'model': svd_model,
    'model_type': 'SVD',
    'training_date': datetime.now(),
    'performance_metrics': {
        'rmse': rmse_svd,
        'mae': mae_svd,
        'precision_at_10': precision,
        'recall_at_10': recall,
        'coverage': coverage
    },
    'trainset': trainset,
    'hyperparameters': {
        'n_factors': 100,
        'n_epochs': 20,
        'lr_all': 0.005,
        'reg_all': 0.02
    }
}
# Save
model_filename = f'recommendation_model_{datetime.now().strftime("%Y%m%d")}.pkl'
with open(model_filename, 'wb') as f:
    pickle.dump(recommendation_package, f)
print(f"Model saved as: {model_filename}")
# Recommendation API function
def recommend_api(user_id, n=10, model_path=model_filename):
    """
    Production API for recommendations
    Parameters:
    -----------
    user_id : str
        User identifier
    n : int
        Number of recommendations
    model_path : str
        Path to saved model
    Returns:
    --------
    list of dicts with item_id, predicted_rating, confidence
    """
    # Load model (read from disk on every call)
    with open(model_path, 'rb') as f:
        package = pickle.load(f)
    model = package['model']
    trainset = package['trainset']
    # Generate recommendations
    recommendations = get_top_n_recommendations(model, user_id, n=n, trainset=trainset)
    # Format output
    result = []
    for item_id, predicted_rating in recommendations:
        result.append({
            'item_id': item_id,
            'predicted_rating': round(predicted_rating, 2),
            # Simple label derived from the predicted rating, not a statistical confidence
            'confidence': 'high' if predicted_rating >= 4.0 else 'medium'
        })
    return result
# Example API call
user_recs = recommend_api('196', n=5)
print("\n" + "="*60)
print("RECOMMENDATION API RESPONSE")
print("="*60)
for i, rec in enumerate(user_recs, 1):
    print(f"{i}. Item {rec['item_id']}: Rating {rec['predicted_rating']} ({rec['confidence']} confidence)")
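Reading the pickle on every request is simple but slow once the model grows. One way to avoid that, sketched here under the assumption that the model file does not change while the process is running, is to memoize the loader with `functools.lru_cache`. The `load_package` helper and the stand-in package are hypothetical; the demo pickles a plain dict rather than a real Surprise model.

```python
import os
import pickle
import tempfile
from functools import lru_cache

@lru_cache(maxsize=1)
def load_package(model_path):
    """Unpickle the model package once per path; later calls reuse the cached object."""
    with open(model_path, 'rb') as f:
        return pickle.load(f)

# Demonstration with a stand-in package (a plain dict, not a real Surprise model)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_package.pkl')
    with open(demo_path, 'wb') as f:
        pickle.dump({'model_type': 'SVD', 'model': None}, f)

    first = load_package(demo_path)
    second = load_package(demo_path)
    print(first is second)   # same object, so the file was read only once
```

Inside `recommend_api`, the `with open(...)` block could then be replaced by `package = load_package(model_path)`. Since unpickling can execute arbitrary code, only load model files from sources you trust.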
Flask API (Bonus)
# Save this as app.py and run: python app.py
# Assumes the helper functions from the previous sections (recommend_api,
# get_similar_items, get_top_n_recommendations) are defined in or imported into this module
from flask import Flask, jsonify, request
import pickle
app = Flask(__name__)
# Path to the pickle saved in Part 5 (replace YYYYMMDD with the date in your filename)
model_filename = 'recommendation_model_YYYYMMDD.pkl'
# Load model at startup
with open(model_filename, 'rb') as f:
    model_package = pickle.load(f)
@app.route('/recommend/<user_id>', methods=['GET'])
def get_recommendations(user_id):
    """
    GET /recommend/196?n=10
    """
    n = request.args.get('n', default=10, type=int)
    try:
        recommendations = recommend_api(user_id, n=n, model_path=model_filename)
        return jsonify({
            'user_id': user_id,
            'recommendations': recommendations,
            'count': len(recommendations)
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 400
@app.route('/similar/<item_id>', methods=['GET'])
def get_similar(item_id):
    """
    GET /similar/50?n=10
    """
    n = request.args.get('n', default=10, type=int)
    try:
        model = model_package['model']
        trainset = model_package['trainset']
        similar = get_similar_items(model, item_id, n=n, trainset=trainset)
        return jsonify({
            'item_id': item_id,
            'similar_items': [
                {'item_id': iid, 'similarity': float(sim)}
                for iid, sim in similar
            ]
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 400
if __name__ == '__main__':
    # debug=True is for local development only
    app.run(debug=True, port=5000)
# Test API (quote the URL so the shell does not interpret '?'):
# curl "http://localhost:5000/recommend/196?n=5"
# curl "http://localhost:5000/similar/50?n=5"
Project Summary
Exceptional Achievement!
You've built an end-to-end recommendation system using the same families of techniques that top tech companies rely on!
Key Accomplishments
- Implemented 3 CF algorithms: User-based, item-based, and SVD matrix factorization
- Achieved ~0.93 RMSE: In line with commonly reported SVD results on MovieLens 100K
- Built hybrid system: Combined collaborative filtering with an item-level content score
- Addressed cold start: Popularity-based fallback for new users and a default score for unrated items
- Evaluated with Precision@K, Recall@K: Beyond simple accuracy metrics
- Exposed as REST API: Flask service for recommendations and similar items
Next Level Enhancements
- Deep Learning: Implement neural collaborative filtering with TensorFlow
- Context-Aware: Add time, location, device as contextual features
- Real-Time: Use Apache Kafka for streaming recommendations
- A/B Testing: Deploy in shadow mode first, then measure click-through rate against a control group
- Explainability: "Recommended because you watched X, Y, Z"
- Multi-Armed Bandits: Balance exploration vs exploitation
Interview Talking Points:
- "Built hybrid recommendation system achieving 0.93 RMSE using SVD matrix factorization"
- "Handled 93.7% data sparsity through collaborative filtering on MovieLens 100K"
- "Evaluated with Precision@10 and catalog coverage for business-relevant metrics"
- "Deployed as REST API with cold start handling for new users"
- "Used the same building blocks (collaborative filtering, matrix factorization, hybrid scoring) found in large-scale recommenders such as those at Netflix and Amazon"
Further Learning
- Papers: "Matrix Factorization Techniques for Recommender Systems" (Koren et al.)
- Books: "Recommender Systems Handbook" by Ricci et al.
- Courses: Coursera "Recommender Systems" by University of Minnesota
- Datasets: MovieLens 25M, Amazon Product Data, Yelp Dataset