🔥 The Experiment Chaos Problem
Imagine you've trained 50 different models over the past month. You know one had 94% accuracy, but you can't remember which hyperparameters you used. Your notebook is a mess of commented-out code, and you've overwritten files multiple times. Sound familiar?
This is the experiment chaos problem, and it's one of the biggest productivity killers in ML development. Without proper tracking, you'll:
- Waste time re-running experiments you've already done
- Lose insights about what worked and what didn't
- Struggle to reproduce your best results
- Be unable to collaborate effectively with teammates
- Fail audits when stakeholders ask "how did you get this result?"
⚠️ Real Cost: By common industry estimates, data scientists spend 20-30% of their time trying to reproduce past experiments or debug "why did this work last week?" issues. That can easily add up to 10+ hours per week of wasted effort!
The solution? Systematic experiment tracking from day one. Let's dive in.
📊 What is Experiment Tracking?
Experiment tracking is the practice of automatically logging everything related to your ML experiments:
What to Track
- Hyperparameters: Learning rate, batch size, number of layers, regularization, etc.
- Metrics: Accuracy, loss, precision, recall, F1 score, AUC-ROC
- Artifacts: Model files, plots, confusion matrices, feature importance charts
- Code version: Git commit hash to know exactly what code was used
- Environment: Python version, library versions, hardware specs
- Data version: Which dataset version was used
- Metadata: Training time, who ran it, notes/tags
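Much of this metadata can be captured programmatically at the start of every run. Here's a minimal sketch (the helper name and the exact keys are illustrative, not a required schema):

import sys
import platform
import subprocess

def collect_run_metadata():
    """Gather environment and code-version info worth logging with every run."""
    try:
        git_commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]
        ).decode().strip()
    except Exception:
        git_commit = "unknown"
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "git_commit": git_commit,
    }

print(collect_run_metadata())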
💡 Key Insight: Good experiment tracking is like version control (Git) for your ML experiments. Instead of tracking code changes, you're tracking model changes and their results.
Benefits of Experiment Tracking
- Reproducibility: Recreate any experiment months later
- Comparison: Easily compare 100+ experiments side-by-side
- Collaboration: Share results with teammates instantly
- Optimization: Identify patterns in what works
- Compliance: Meet regulatory requirements for model auditing
- Debugging: Quickly find when/why performance degraded
🚀 Getting Started with MLflow
MLflow is an open-source platform for the complete ML lifecycle. It has four main components, but we'll focus on MLflow Tracking for experiment tracking.
Installing MLflow
# Install MLflow
pip install mlflow
# Optionally install extra dependencies (includes scikit-learn integration and more)
pip install mlflow[extras]
# Start the MLflow UI
mlflow ui
# Visit: http://localhost:5000
MLflow Key Concepts
- Run: A single execution of model training code
- Experiment: A collection of related runs (e.g., "Customer Churn Model v2")
- Parameters: Input values (hyperparameters)
- Metrics: Output values (accuracy, loss, etc.)
- Artifacts: Output files (models, plots, data)
- Tags: Metadata for organizing runs
💡 Pro Tip: MLflow stores everything locally by default in an mlruns folder. For teams, you can set up a central tracking server so everyone shares the same experiment database.
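For reference, a shared setup usually means starting mlflow server somewhere your team can reach and pointing each training script at it with mlflow.set_tracking_uri(). A minimal sketch, with placeholder host, port, and storage locations:

import mlflow

# On the shared machine, a tracking server might be started with something like:
#   mlflow server --backend-store-uri sqlite:///mlflow.db \
#                 --default-artifact-root ./mlartifacts --host 0.0.0.0 --port 5000

# In each training script, point the MLflow client at that server (placeholder URL)
mlflow.set_tracking_uri("http://your-tracking-server:5000")
mlflow.set_experiment("customer-churn-prediction")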
📝 Tracking Experiments with MLflow
Basic Tracking Example
Here's how to add MLflow tracking to your existing training code:
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
# Load your data
data = pd.read_csv('customer_data.csv')
X = data.drop('churned', axis=1)
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Set experiment name
mlflow.set_experiment("customer-churn-prediction")
# Start a run
with mlflow.start_run():
    # Define hyperparameters
    n_estimators = 100
    max_depth = 10
    min_samples_split = 5

    # Log parameters
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("min_samples_split", min_samples_split)

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Log metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)

    # Log model
    mlflow.sklearn.log_model(model, "random_forest_model")

    # Log additional artifacts
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")

    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")
Viewing Results in MLflow UI
After running the code above, open the MLflow UI:
mlflow ui
You'll see:
- 📊 All your experiments organized in a table
- 🔍 Filter and sort by any metric or parameter
- 📈 Compare multiple runs side-by-side
- 📁 Download models and artifacts
- 📝 Add notes and tags to runs
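The UI isn't the only way in: you can also pull runs into a DataFrame and compare them in code. A minimal sketch using mlflow.search_runs(), assuming you ran the example above (column names follow MLflow's params./metrics. prefix convention):

import mlflow

# Returns a pandas DataFrame with one row per run, including params and metrics
runs = mlflow.search_runs(
    experiment_names=["customer-churn-prediction"],
    order_by=["metrics.f1_score DESC"],
)
print(runs[["run_id", "params.n_estimators", "metrics.f1_score"]].head())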
Advanced: Hyperparameter Tuning with MLflow
import mlflow
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10]
}

mlflow.set_experiment("customer-churn-hyperparameter-tuning")

with mlflow.start_run(run_name="grid_search_parent"):
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
    grid_search.fit(X_train, y_train)

    # Log each configuration as a nested run
    for i, params in enumerate(grid_search.cv_results_['params']):
        with mlflow.start_run(nested=True, run_name=f"config_{i}"):
            mlflow.log_params(params)
            mlflow.log_metric("mean_f1_score", grid_search.cv_results_['mean_test_score'][i])
            mlflow.log_metric("std_f1_score", grid_search.cv_results_['std_test_score'][i])

    # Log best configuration
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_metric("best_f1_score", grid_search.best_score_)
    mlflow.sklearn.log_model(grid_search.best_estimator_, "best_model")

print(f"Best params: {grid_search.best_params_}")
print(f"Best F1 score: {grid_search.best_score_:.4f}")
🌐 Weights & Biases for Team Collaboration
While MLflow is great for local tracking, Weights & Biases (W&B) excels at team collaboration with a cloud-first approach and beautiful visualizations.
Why Choose W&B?
- ✅ Cloud-first: Automatic syncing, no server setup
- ✅ Real-time updates: Watch training in progress
- ✅ Beautiful dashboards: Interactive charts out of the box
- ✅ Team features: Comments, reports, sharing
- ✅ Integration: Works with PyTorch, TensorFlow, scikit-learn, Hugging Face
- ✅ Free tier: Unlimited personal projects
Getting Started with W&B
# Install wandb
pip install wandb
# Login (creates account if needed)
wandb login
Basic W&B Tracking
import wandb
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Initialize W&B
wandb.init(
    project="customer-churn",
    name="rf-experiment-1",
    config={
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5,
        "model": "RandomForest"
    }
)

# Access config
config = wandb.config

# Train model (uses X_train, X_test, y_train, y_test from the earlier split)
model = RandomForestClassifier(
    n_estimators=config.n_estimators,
    max_depth=config.max_depth,
    min_samples_split=config.min_samples_split,
    random_state=42
)
model.fit(X_train, y_train)

# Predictions and metrics
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Log metrics
wandb.log({
    "accuracy": accuracy,
    "f1_score": f1
})

# Log confusion matrix
wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=['Not Churned', 'Churned'])

# Save model (serialize it to disk first, then upload the file with the run)
joblib.dump(model, 'model.pkl')
wandb.save('model.pkl')

# Finish run
wandb.finish()
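If you want the model file itself to be versioned rather than just uploaded, W&B Artifacts are an alternative to wandb.save(). A minimal sketch, reusing the model trained above (the artifact name, file path, and job_type are placeholders):

import joblib
import wandb

run = wandb.init(project="customer-churn", job_type="upload-model")

# Serialize the trained model to disk first (placeholder path)
joblib.dump(model, "model.pkl")

# Log it as a versioned artifact (W&B assigns v0, v1, ... automatically)
artifact = wandb.Artifact("churn-random-forest", type="model")
artifact.add_file("model.pkl")
run.log_artifact(artifact)

run.finish()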
W&B for Deep Learning
import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
# Initialize
wandb.init(project="image-classification", config={
    "learning_rate": 0.001,
    "epochs": 50,
    "batch_size": 32
})
config = wandb.config

# Define model (YourNeuralNetwork, train_loader, val_loader, and validate()
# are placeholders for your own model, data loaders, and evaluation function)
model = YourNeuralNetwork()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

# Training loop
for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        # Log every 100 batches
        if batch_idx % 100 == 0:
            wandb.log({
                "epoch": epoch,
                "batch": batch_idx,
                "loss": loss.item()
            })

    # Validation
    val_accuracy = validate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "val_accuracy": val_accuracy
    })

wandb.finish()
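One more logging hook worth knowing about: wandb.watch() asks W&B to record gradients and parameter histograms during training. A minimal sketch, assuming the model from the example above and a run already started with wandb.init() (the logging frequency is arbitrary):

# Call once after wandb.init() and model construction; W&B then logs
# gradients and parameters every `log_freq` batches during training.
wandb.watch(model, log="all", log_freq=100)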
💡 Pro Tip: W&B automatically logs system metrics (CPU, GPU, memory) so you can identify bottlenecks and optimize training performance.
🔖 Model Versioning Best Practices
Model versioning is critical for tracking which model is in production, rolling back to previous versions, and maintaining a history of model evolution.
Semantic Versioning for Models
Use a versioning scheme like MAJOR.MINOR.PATCH:
- MAJOR: Complete model architecture change (e.g., Random Forest → Neural Network)
- MINOR: Significant retraining with new features or data
- PATCH: Bug fixes, small hyperparameter tweaks
Example: customer-churn-model-v2.3.1
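One lightweight way to make the version queryable later is to attach it as a tag, either on the training run or on the registered model version. A minimal sketch (the model name and version number below are placeholders):

import mlflow
from mlflow.tracking import MlflowClient

# Inside your training run, record the semantic version as a tag
with mlflow.start_run():
    mlflow.set_tag("model_semver", "2.3.1")
    # ... training and logging code ...

# Or tag a specific registered model version after the fact
client = MlflowClient()
client.set_model_version_tag(
    name="customer-churn-classifier", version="1", key="semver", value="2.3.1"
)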
MLflow Model Registry
MLflow provides a centralized model registry for managing model versions:
import mlflow
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Register a model
mlflow.set_experiment("customer-churn")

with mlflow.start_run():
    # Train model (train_your_model() is a placeholder for your training code)
    model = train_your_model()

    # Log and register the model in one step
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="customer-churn-classifier"
    )

# Transition model to production (use the version number shown in the registry)
client.transition_model_version_stage(
    name="customer-churn-classifier",
    version=1,
    stage="Production"
)

# Load production model
model_uri = "models:/customer-churn-classifier/Production"
loaded_model = mlflow.sklearn.load_model(model_uri)
Model Version Stages
- None: Initial state after registration
- Staging: Model being tested before production
- Production: Currently deployed model
- Archived: Old version, kept for reference
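To check which version sits in which stage, you can query the registry directly. A minimal sketch (the exact client API may differ slightly across MLflow versions):

from mlflow.tracking import MlflowClient

client = MlflowClient()

# List all registered versions of the model and their current stages
for mv in client.search_model_versions("name='customer-churn-classifier'"):
    print(f"version={mv.version} stage={mv.current_stage} run_id={mv.run_id}")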
🔄 Ensuring Reproducibility
Reproducibility means anyone (including future you) can recreate exact results. This requires tracking everything that affects model training.
Reproducibility Checklist
1. Set Random Seeds

import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For deterministic behavior
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

2. Log Python Environment

# Create requirements.txt
pip freeze > requirements.txt

# Or use conda
conda env export > environment.yml

3. Track Data Versions with DVC

# Initialize DVC
dvc init

# Track data file
dvc add data/customer_data.csv

# Commit .dvc file to Git
git add data/customer_data.csv.dvc .gitignore
git commit -m "Track data with DVC"

4. Log Git Commit Hash

import subprocess
import mlflow

def get_git_commit():
    try:
        commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
        return commit
    except Exception:
        return "unknown"

with mlflow.start_run():
    mlflow.set_tag("git_commit", get_git_commit())
    # ... rest of training code
⚠️ Common Pitfall: GPU non-determinism can still cause slight variations even with seeds set. For 100% reproducibility, use torch.use_deterministic_algorithms(True), but be aware this may slow training.
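Putting that pitfall into practice, a fuller determinism setup for PyTorch might look like this sketch (the CUBLAS environment variable is required for some deterministic CUDA ops; exact requirements depend on your CUDA and PyTorch versions):

import os
import random
import numpy as np
import torch

# Must be set before CUDA operations run when using deterministic algorithms
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

def set_fully_deterministic(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Raises an error if a nondeterministic op is used anywhere in the model
    torch.use_deterministic_algorithms(True)

set_fully_deterministic(42)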
⚖️ MLflow vs W&B vs DVC
Feature Comparison
| Feature | MLflow | Weights & Biases | DVC |
|---|---|---|---|
| Experiment Tracking | ✅ Excellent | ✅ Excellent | ⚠️ Basic |
| Data Versioning | ❌ No | ⚠️ Limited | ✅ Excellent |
| Hosting | Self-hosted or cloud | Cloud-only | Git + cloud storage |
| UI Quality | ⭐⭐⭐ Good | ⭐⭐⭐⭐⭐ Excellent | ⭐⭐ Basic |
| Free Tier | Unlimited | Personal projects | Unlimited |
| Best For | Open-source projects | Team collaboration | Data/model versioning |
Our Recommendation
Use all three together! They complement each other:
- DVC: Version control for data and model files
- MLflow: Track experiments, manage models locally
- W&B: Team collaboration and beautiful visualizations
For solo developers or small teams: Start with MLflow (free, self-hosted).
For larger teams needing collaboration: Add W&B.
When data versioning becomes critical: Add DVC.
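If you do run MLflow and W&B side by side, the simplest pattern is one training script that logs the same parameters and metrics to both. A minimal sketch (project and experiment names are placeholders, and the metric values are stand-ins for whatever your evaluation produces):

import mlflow
import wandb

params = {"n_estimators": 100, "max_depth": 10}

wandb.init(project="customer-churn", config=params)
mlflow.set_experiment("customer-churn-prediction")

with mlflow.start_run():
    mlflow.log_params(params)
    # ... train the model and compute real metrics here ...
    metrics = {"accuracy": 0.94, "f1_score": 0.91}  # placeholder values
    mlflow.log_metrics(metrics)
    wandb.log(metrics)

wandb.finish()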
💻 Hands-On: Complete Tracking Example
Let's put everything together in a complete example that tracks experiments, versions models, and ensures reproducibility.
"""
Complete ML Development Workflow with Tracking
"""
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import subprocess
import json
from datetime import datetime
# ===== 1. REPRODUCIBILITY =====
def set_seed(seed=42):
    """Set seeds for reproducibility"""
    np.random.seed(seed)
    import random
    random.seed(seed)

set_seed(42)

# ===== 2. HELPER FUNCTIONS =====
def get_git_commit():
    """Get current git commit hash"""
    try:
        commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
        return commit
    except Exception:
        return "unknown"

def plot_confusion_matrix(y_true, y_pred, labels, filename="confusion_matrix.png"):
    """Create and save confusion matrix plot"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig(filename, dpi=150, bbox_inches='tight')
    plt.close()
    return filename
# ===== 3. LOAD AND PREPARE DATA =====
print("Loading data...")
# Replace with your actual data loading
data = pd.read_csv('data/customer_churn.csv')
X = data.drop('churn', axis=1)
y = data['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")
# ===== 4. EXPERIMENT TRACKING =====
# Set experiment
mlflow.set_experiment("customer-churn-production")
# Define hyperparameters to try
hyperparameter_configs = [
    {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5},
    {"n_estimators": 200, "max_depth": 15, "min_samples_split": 2},
    {"n_estimators": 150, "max_depth": 12, "min_samples_split": 10},
]
best_model = None
best_f1 = 0
for config_idx, config in enumerate(hyperparameter_configs):
    print(f"\n{'='*60}")
    print(f"Training configuration {config_idx + 1}/{len(hyperparameter_configs)}")
    print(f"Config: {config}")
    print(f"{'='*60}")

    with mlflow.start_run(run_name=f"rf_config_{config_idx}"):
        # ===== LOG METADATA =====
        mlflow.set_tag("git_commit", get_git_commit())
        mlflow.set_tag("model_type", "RandomForest")
        mlflow.set_tag("dataset_version", "v1.0")
        mlflow.set_tag("training_date", datetime.now().isoformat())
        mlflow.set_tag("engineer", "your_name")

        # ===== LOG PARAMETERS =====
        mlflow.log_params(config)
        mlflow.log_param("test_size", 0.2)
        mlflow.log_param("random_state", 42)

        # ===== TRAIN MODEL =====
        model = RandomForestClassifier(**config, random_state=42, n_jobs=-1)

        import time
        start_time = time.time()
        model.fit(X_train, y_train)
        training_time = time.time() - start_time
        mlflow.log_metric("training_time_seconds", training_time)

        # ===== EVALUATE MODEL =====
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)

        # Training metrics
        train_acc = accuracy_score(y_train, y_pred_train)
        mlflow.log_metric("train_accuracy", train_acc)

        # Test metrics
        test_acc = accuracy_score(y_test, y_pred_test)
        test_precision = precision_score(y_test, y_pred_test, average='weighted')
        test_recall = recall_score(y_test, y_pred_test, average='weighted')
        test_f1 = f1_score(y_test, y_pred_test, average='weighted')

        mlflow.log_metric("test_accuracy", test_acc)
        mlflow.log_metric("test_precision", test_precision)
        mlflow.log_metric("test_recall", test_recall)
        mlflow.log_metric("test_f1_score", test_f1)

        # Check for overfitting
        overfit_gap = train_acc - test_acc
        mlflow.log_metric("overfit_gap", overfit_gap)

        print(f"Train Accuracy: {train_acc:.4f}")
        print(f"Test Accuracy: {test_acc:.4f}")
        print(f"Test F1 Score: {test_f1:.4f}")
        print(f"Overfit Gap: {overfit_gap:.4f}")

        # ===== LOG ARTIFACTS =====
        # Confusion matrix
        cm_file = plot_confusion_matrix(y_test, y_pred_test, ['No Churn', 'Churn'])
        mlflow.log_artifact(cm_file)

        # Classification report
        report = classification_report(y_test, y_pred_test, output_dict=True)
        with open('classification_report.json', 'w') as f:
            json.dump(report, f, indent=2)
        mlflow.log_artifact('classification_report.json')

        # Feature importance
        feature_importance = pd.DataFrame({
            'feature': X.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)

        plt.figure(figsize=(10, 6))
        sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
        plt.title('Top 10 Feature Importances')
        plt.tight_layout()
        plt.savefig('feature_importance.png', dpi=150)
        plt.close()
        mlflow.log_artifact('feature_importance.png')

        # ===== LOG MODEL =====
        mlflow.sklearn.log_model(
            model,
            "model",
            registered_model_name="customer-churn-classifier"
        )

        # Track best model
        if test_f1 > best_f1:
            best_f1 = test_f1
            best_model = model
            print(f"✨ New best model! F1 Score: {best_f1:.4f}")
print(f"\n{'='*60}")
print(f"🎉 Training Complete!")
print(f"Best F1 Score: {best_f1:.4f}")
print(f"View results: mlflow ui")
print(f"{'='*60}")
💡 What's Happening: This script trains 3 different model configurations, tracks everything in MLflow, saves artifacts, and automatically selects the best model. Run mlflow ui to see all experiments in a beautiful dashboard.
Next Steps
- Run the script above with your own data
- Open MLflow UI and explore the results
- Try adding W&B tracking alongside MLflow
- Set up a central MLflow tracking server for your team
- Create automated retraining pipelines (we'll cover this in later tutorials)
Test Your Knowledge
Q1: What percentage of time do data scientists typically waste trying to reproduce past experiments?
Q2: What kinds of information should you track for every ML experiment?
Q3: What is the main advantage of Weights & Biases over MLflow?
Q4: In semantic versioning for models (MAJOR.MINOR.PATCH), when should you increment the MAJOR version?
Q5: Which tool is best for data versioning in ML projects?