
ML Development Workflow

Master experiment tracking with MLflow and Weights & Biases. Learn model versioning, reproducibility, and collaboration for production ML

📅 Tutorial 2 📊 Beginner


🔥 The Experiment Chaos Problem

Imagine you've trained 50 different models over the past month. You know one had 94% accuracy, but you can't remember which hyperparameters you used. Your notebook is a mess of commented-out code, and you've overwritten files multiple times. Sound familiar?

This is the experiment chaos problem, and it's one of the biggest productivity killers in ML development. Without proper tracking, you'll:

  • Waste time re-running experiments you've already done
  • Lose insights about what worked and what didn't
  • Struggle to reproduce your best results
  • Be unable to collaborate effectively with teammates
  • Fail audits when stakeholders ask "how did you get this result?"

⚠️ Real Cost: By common estimates, data scientists spend 20-30% of their time trying to reproduce past experiments or debugging "why did this work last week?" issues. That can easily add up to 10+ hours per week of wasted effort!

The solution? Systematic experiment tracking from day one. Let's dive in.

📊 What is Experiment Tracking?

Experiment tracking is the practice of automatically logging everything related to your ML experiments:

What to Track

  • Hyperparameters: Learning rate, batch size, number of layers, regularization, etc.
  • Metrics: Accuracy, loss, precision, recall, F1 score, AUC-ROC
  • Artifacts: Model files, plots, confusion matrices, feature importance charts
  • Code version: Git commit hash to know exactly what code was used
  • Environment: Python version, library versions, hardware specs
  • Data version: Which dataset version was used
  • Metadata: Training time, who ran it, notes/tags

💡 Key Insight: Good experiment tracking is like version control (Git) for your ML experiments. Instead of tracking code changes, you're tracking model changes and their results.

Benefits of Experiment Tracking

  1. Reproducibility: Recreate any experiment months later
  2. Comparison: Easily compare 100+ experiments side-by-side
  3. Collaboration: Share results with teammates instantly
  4. Optimization: Identify patterns in what works
  5. Compliance: Meet regulatory requirements for model auditing
  6. Debugging: Quickly find when/why performance degraded

🚀 Getting Started with MLflow

MLflow is an open-source platform for the complete ML lifecycle. It has four main components, but we'll focus on MLflow Tracking for experiment tracking.

Installing MLflow

# Install MLflow
pip install mlflow

# Optionally install with extra dependencies (includes scikit-learn support)
pip install mlflow[extras]

# Start the MLflow UI
mlflow ui

# Visit: http://localhost:5000

MLflow Key Concepts

  • Run: A single execution of model training code
  • Experiment: A collection of related runs (e.g., "Customer Churn Model v2")
  • Parameters: Input values (hyperparameters)
  • Metrics: Output values (accuracy, loss, etc.)
  • Artifacts: Output files (models, plots, data)
  • Tags: Metadata for organizing runs

💡 Pro Tip: MLflow stores everything locally by default in an mlruns folder. For teams, you can set up a central tracking server so everyone shares the same experiment database.
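
Pointing your code at a shared server is a one-line change. Here is a minimal sketch, assuming your team has already started a tracking server (for example with mlflow server --backend-store-uri and --default-artifact-root) at a placeholder address:

import mlflow

# Placeholder address - replace with your team's MLflow tracking server
mlflow.set_tracking_uri("http://mlflow.your-company.internal:5000")

# Experiments and runs created after this call are stored on the shared server
mlflow.set_experiment("customer-churn-prediction")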

📝 Tracking Experiments with MLflow

Basic Tracking Example

Here's how to add MLflow tracking to your existing training code:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd

# Load your data
data = pd.read_csv('customer_data.csv')
X = data.drop('churned', axis=1)
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set experiment name
mlflow.set_experiment("customer-churn-prediction")

# Start a run
with mlflow.start_run():
    # Define hyperparameters
    n_estimators = 100
    max_depth = 10
    min_samples_split = 5
    
    # Log parameters
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("min_samples_split", min_samples_split)
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Log metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    
    # Log model
    mlflow.sklearn.log_model(model, "random_forest_model")
    
    # Log additional artifacts
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

Viewing Results in MLflow UI

After running the code above, open the MLflow UI:

mlflow ui

You'll see:

  • 📊 All your experiments organized in a table
  • 🔍 Filter and sort by any metric or parameter
  • 📈 Compare multiple runs side-by-side
  • 📁 Download models and artifacts
  • 📝 Add notes and tags to runs
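
The same results are also available programmatically, which is handy when you want a script to pick the winning run. A short sketch, assuming a recent MLflow version and the experiment name from the example above:

import mlflow

# One row per run, with columns like params.n_estimators and metrics.accuracy
runs = mlflow.search_runs(experiment_names=["customer-churn-prediction"])

best = runs.sort_values("metrics.accuracy", ascending=False)
print(best[["run_id", "params.n_estimators", "metrics.accuracy"]].head())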

Advanced: Hyperparameter Tuning with MLflow

import mlflow
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10]
}

mlflow.set_experiment("customer-churn-hyperparameter-tuning")

with mlflow.start_run(run_name="grid_search_parent"):
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
    
    grid_search.fit(X_train, y_train)
    
    # Log each configuration as a nested run
    for i, params in enumerate(grid_search.cv_results_['params']):
        with mlflow.start_run(nested=True, run_name=f"config_{i}"):
            mlflow.log_params(params)
            mlflow.log_metric("mean_f1_score", grid_search.cv_results_['mean_test_score'][i])
            mlflow.log_metric("std_f1_score", grid_search.cv_results_['std_test_score'][i])
    
    # Log best configuration
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_metric("best_f1_score", grid_search.best_score_)
    mlflow.sklearn.log_model(grid_search.best_estimator_, "best_model")
    
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1 score: {grid_search.best_score_:.4f}")

🌐 Weights & Biases for Team Collaboration

While MLflow is great for local tracking, Weights & Biases (W&B) excels at team collaboration with a cloud-first approach and beautiful visualizations.

Why Choose W&B?

  • Cloud-first: Automatic syncing, no server setup
  • Real-time updates: Watch training in progress
  • Beautiful dashboards: Interactive charts out of the box
  • Team features: Comments, reports, sharing
  • Integration: Works with PyTorch, TensorFlow, scikit-learn, Hugging Face
  • Free tier: Unlimited personal projects

Getting Started with W&B

# Install wandb
pip install wandb

# Login (creates account if needed)
wandb login

Basic W&B Tracking

import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Assumes X_train, X_test, y_train, y_test from the MLflow example above

# Initialize W&B
wandb.init(
    project="customer-churn",
    name="rf-experiment-1",
    config={
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5,
        "model": "RandomForest"
    }
)

# Access config
config = wandb.config

# Train model
model = RandomForestClassifier(
    n_estimators=config.n_estimators,
    max_depth=config.max_depth,
    min_samples_split=config.min_samples_split,
    random_state=42
)
model.fit(X_train, y_train)

# Predictions and metrics
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Log metrics
wandb.log({
    "accuracy": accuracy,
    "f1_score": f1
})

# Log confusion matrix
wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=['Not Churned', 'Churned'])

# Save model to disk, then upload the file to W&B
import joblib
joblib.dump(model, 'model.pkl')
wandb.save('model.pkl')

# Finish run
wandb.finish()

W&B for Deep Learning

import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Initialize
wandb.init(project="image-classification", config={
    "learning_rate": 0.001,
    "epochs": 50,
    "batch_size": 32
})

config = wandb.config

# Define model (YourNeuralNetwork is a placeholder for your own architecture)
model = YourNeuralNetwork()
criterion = nn.CrossEntropyLoss()  # loss function (adjust to your task)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

# Training loop (train_loader, val_loader, and validate() are placeholders
# for your own DataLoaders and evaluation function)
for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        # Log every 100 batches
        if batch_idx % 100 == 0:
            wandb.log({
                "epoch": epoch,
                "batch": batch_idx,
                "loss": loss.item()
            })
    
    # Validation
    val_accuracy = validate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "val_accuracy": val_accuracy
    })

wandb.finish()

💡 Pro Tip: W&B automatically logs system metrics (CPU, GPU, memory) so you can identify bottlenecks and optimize training performance.
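
Beyond system metrics, wandb.watch() can also record gradient and parameter histograms during training. A minimal sketch, assuming the model from the PyTorch example above:

import wandb

# Call once after wandb.init() and before the training loop starts;
# gradient histograms are then logged every log_freq steps
wandb.watch(model, log="gradients", log_freq=100)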

🔖 Model Versioning Best Practices

Model versioning is critical for tracking which model is in production, rolling back to previous versions, and maintaining a history of model evolution.

Semantic Versioning for Models

Use a versioning scheme like MAJOR.MINOR.PATCH:

  • MAJOR: Complete model architecture change (e.g., Random Forest → Neural Network)
  • MINOR: Significant retraining with new features or data
  • PATCH: Bug fixes, small hyperparameter tweaks

Example: customer-churn-model-v2.3.1
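
MLflow does not enforce any particular scheme, but you can record the semantic version as a run tag so it appears alongside the rest of the experiment metadata. A minimal sketch:

import mlflow

MODEL_VERSION = "2.3.1"  # MAJOR.MINOR.PATCH chosen for this training run

with mlflow.start_run():
    mlflow.set_tag("model_version", MODEL_VERSION)
    # ... training, logging, and model registration as in the sections below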

MLflow Model Registry

MLflow provides a centralized model registry for managing model versions:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model
mlflow.set_experiment("customer-churn")
with mlflow.start_run():
    # Train model (train_your_model() is a placeholder for your own training code)
    model = train_your_model()
    
    # Log model
    mlflow.sklearn.log_model(
        model, 
        "model",
        registered_model_name="customer-churn-classifier"
    )

# Transition model to production
client.transition_model_version_stage(
    name="customer-churn-classifier",
    version=1,
    stage="Production"
)

# Load production model
model_uri = "models:/customer-churn-classifier/Production"
loaded_model = mlflow.sklearn.load_model(model_uri)

Model Version Stages

  • None: Initial state after registration
  • Staging: Model being tested before production
  • Production: Currently deployed model
  • Archived: Old version, kept for reference
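
You can load a registered model either by stage (as above) or by pinning an explicit version number, which is often safer for reproducible batch jobs. A brief sketch:

import mlflow

# By stage: resolves to whichever version is currently in "Production"
prod_model = mlflow.sklearn.load_model("models:/customer-churn-classifier/Production")

# By version number: always the same artifact, regardless of stage transitions
pinned_model = mlflow.sklearn.load_model("models:/customer-churn-classifier/1")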

🔄 Ensuring Reproducibility

Reproducibility means anyone (including future you) can recreate exact results. This requires tracking everything that affects model training.

Reproducibility Checklist

  1. Set Random Seeds
    import random
    import numpy as np
    import torch
    
    def set_seed(seed=42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # For deterministic behavior
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    
    set_seed(42)
  2. Log Python Environment (see the sketch after this checklist for attaching it to a run)
    # Create requirements.txt
    pip freeze > requirements.txt
    
    # Or use conda
    conda env export > environment.yml
  3. Track Data Versions with DVC
    # Initialize DVC
    dvc init
    
    # Track data file
    dvc add data/customer_data.csv
    
    # Commit .dvc file to Git
    git add data/customer_data.csv.dvc .gitignore
    git commit -m "Track data with DVC"
  4. Log Git Commit Hash
    import subprocess
    import mlflow
    
    def get_git_commit():
        try:
            commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
            return commit
        except Exception:
            return "unknown"
    
    with mlflow.start_run():
        mlflow.set_tag("git_commit", get_git_commit())
        # ... rest of training code
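
To tie the checklist together, the environment snapshot and commit hash can be attached directly to the MLflow run, so each run carries its own reproducibility record. A small sketch, assuming requirements.txt was generated as in step 2 and get_git_commit() is defined as in step 4:

import mlflow

with mlflow.start_run():
    mlflow.set_tag("git_commit", get_git_commit())
    mlflow.log_artifact("requirements.txt")  # environment snapshot travels with the run
    # ... training code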

⚠️ Common Pitfall: GPU non-determinism can still cause slight variations even with seeds set. For 100% reproducibility, use torch.use_deterministic_algorithms(True), but be aware this may slow training.
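
A sketch of what a fully deterministic PyTorch setup typically looks like; note the environment variable, which some CUDA operations require before deterministic mode will work:

import os
import torch

# Must be set before the first CUDA call; required by certain cuBLAS operations
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Raise an error whenever an operation has no deterministic implementation
torch.use_deterministic_algorithms(True)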

⚖️ MLflow vs W&B vs DVC

Feature Comparison

Feature             | MLflow               | Weights & Biases     | DVC
--------------------|----------------------|----------------------|----------------------
Experiment Tracking | ✅ Excellent         | ✅ Excellent         | ⚠️ Basic
Data Versioning     | ❌ No                | ⚠️ Limited           | ✅ Excellent
Hosting             | Self-hosted or cloud | Cloud-only           | Git + cloud storage
UI Quality          | ⭐⭐⭐ Good           | ⭐⭐⭐⭐⭐ Excellent   | ⭐⭐ Basic
Free Tier           | Unlimited            | Personal projects    | Unlimited
Best For            | Open-source projects | Team collaboration   | Data/model versioning

Our Recommendation

Use all three together! They complement each other:

  • DVC: Version control for data and model files
  • MLflow: Track experiments, manage models locally
  • W&B: Team collaboration and beautiful visualizations

For solo developers or small teams: Start with MLflow (free, self-hosted).
For larger teams needing collaboration: Add W&B.
When data versioning becomes critical: Add DVC.

💻 Hands-On: Complete Tracking Example

Let's put everything together in a complete example that tracks experiments, versions models, and ensures reproducibility.

"""
Complete ML Development Workflow with Tracking
"""
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import subprocess
import json
from datetime import datetime

# ===== 1. REPRODUCIBILITY =====
def set_seed(seed=42):
    """Set seeds for reproducibility"""
    np.random.seed(seed)
    import random
    random.seed(seed)

set_seed(42)

# ===== 2. HELPER FUNCTIONS =====
def get_git_commit():
    """Get current git commit hash"""
    try:
        commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
        return commit
    except Exception:
        return "unknown"

def plot_confusion_matrix(y_true, y_pred, labels, filename="confusion_matrix.png"):
    """Create and save confusion matrix plot"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig(filename, dpi=150, bbox_inches='tight')
    plt.close()
    return filename

# ===== 3. LOAD AND PREPARE DATA =====
print("Loading data...")
# Replace with your actual data loading
data = pd.read_csv('data/customer_churn.csv')
X = data.drop('churn', axis=1)
y = data['churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")

# ===== 4. EXPERIMENT TRACKING =====
# Set experiment
mlflow.set_experiment("customer-churn-production")

# Define hyperparameters to try
hyperparameter_configs = [
    {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5},
    {"n_estimators": 200, "max_depth": 15, "min_samples_split": 2},
    {"n_estimators": 150, "max_depth": 12, "min_samples_split": 10},
]

best_model = None
best_f1 = 0

for config_idx, config in enumerate(hyperparameter_configs):
    print(f"\n{'='*60}")
    print(f"Training configuration {config_idx + 1}/{len(hyperparameter_configs)}")
    print(f"Config: {config}")
    print(f"{'='*60}")
    
    with mlflow.start_run(run_name=f"rf_config_{config_idx}"):
        # ===== LOG METADATA =====
        mlflow.set_tag("git_commit", get_git_commit())
        mlflow.set_tag("model_type", "RandomForest")
        mlflow.set_tag("dataset_version", "v1.0")
        mlflow.set_tag("training_date", datetime.now().isoformat())
        mlflow.set_tag("engineer", "your_name")
        
        # ===== LOG PARAMETERS =====
        mlflow.log_params(config)
        mlflow.log_param("test_size", 0.2)
        mlflow.log_param("random_state", 42)
        
        # ===== TRAIN MODEL =====
        model = RandomForestClassifier(**config, random_state=42, n_jobs=-1)
        
        import time
        start_time = time.time()
        model.fit(X_train, y_train)
        training_time = time.time() - start_time
        
        mlflow.log_metric("training_time_seconds", training_time)
        
        # ===== EVALUATE MODEL =====
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)
        
        # Training metrics
        train_acc = accuracy_score(y_train, y_pred_train)
        mlflow.log_metric("train_accuracy", train_acc)
        
        # Test metrics
        test_acc = accuracy_score(y_test, y_pred_test)
        test_precision = precision_score(y_test, y_pred_test, average='weighted')
        test_recall = recall_score(y_test, y_pred_test, average='weighted')
        test_f1 = f1_score(y_test, y_pred_test, average='weighted')
        
        mlflow.log_metric("test_accuracy", test_acc)
        mlflow.log_metric("test_precision", test_precision)
        mlflow.log_metric("test_recall", test_recall)
        mlflow.log_metric("test_f1_score", test_f1)
        
        # Check for overfitting
        overfit_gap = train_acc - test_acc
        mlflow.log_metric("overfit_gap", overfit_gap)
        
        print(f"Train Accuracy: {train_acc:.4f}")
        print(f"Test Accuracy: {test_acc:.4f}")
        print(f"Test F1 Score: {test_f1:.4f}")
        print(f"Overfit Gap: {overfit_gap:.4f}")
        
        # ===== LOG ARTIFACTS =====
        # Confusion matrix
        cm_file = plot_confusion_matrix(y_test, y_pred_test, ['No Churn', 'Churn'])
        mlflow.log_artifact(cm_file)
        
        # Classification report
        report = classification_report(y_test, y_pred_test, output_dict=True)
        with open('classification_report.json', 'w') as f:
            json.dump(report, f, indent=2)
        mlflow.log_artifact('classification_report.json')
        
        # Feature importance
        feature_importance = pd.DataFrame({
            'feature': X.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)
        
        plt.figure(figsize=(10, 6))
        sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
        plt.title('Top 10 Feature Importances')
        plt.tight_layout()
        plt.savefig('feature_importance.png', dpi=150)
        plt.close()
        mlflow.log_artifact('feature_importance.png')
        
        # ===== LOG MODEL =====
        mlflow.sklearn.log_model(
            model,
            "model",
            registered_model_name="customer-churn-classifier"
        )
        
        # Track best model
        if test_f1 > best_f1:
            best_f1 = test_f1
            best_model = model
            print(f"✨ New best model! F1 Score: {best_f1:.4f}")

print(f"\n{'='*60}")
print(f"🎉 Training Complete!")
print(f"Best F1 Score: {best_f1:.4f}")
print(f"View results: mlflow ui")
print(f"{'='*60}")

💡 What's Happening: This script trains 3 different model configurations, tracks everything in MLflow, saves artifacts, and automatically selects the best model. Run mlflow ui to see all experiments in a beautiful dashboard.

Next Steps

  1. Run the script above with your own data
  2. Open MLflow UI and explore the results
  3. Try adding W&B tracking alongside MLflow
  4. Set up a central MLflow tracking server for your team
  5. Create automated retraining pipelines (we'll cover this in later tutorials)

Test Your Knowledge

Q1: What percentage of time do data scientists typically waste trying to reproduce past experiments?

5-10%
10-15%
20-30%
40-50%

Q2: Which of the following should you track in ML experiments?

Only hyperparameters
Hyperparameters, metrics, artifacts, code version, environment, data version
Only metrics and model files
Only the final model

Q3: What is the main advantage of Weights & Biases over MLflow?

It's completely free
It has better model versioning
It supports more ML frameworks
Cloud-first with excellent team collaboration and visualizations

Q4: In semantic versioning for models (MAJOR.MINOR.PATCH), when should you increment the MAJOR version?

Complete model architecture change (e.g., Random Forest → Neural Network)
Small hyperparameter tweaks
Bug fixes
Retraining with the same architecture

Q5: Which tool is best for data versioning in ML projects?

MLflow
Weights & Biases
DVC (Data Version Control)
Git