
ML Development Workflow

Master experiment tracking with MLflow and Weights & Biases. Learn model versioning, reproducibility, and collaboration for production ML

📅 Tutorial 2 📊 Beginner


🔥 The Experiment Chaos Problem

Imagine you've trained 50 different models over the past month. You know one had 94% accuracy, but you can't remember which hyperparameters you used. Your notebook is a mess of commented-out code, and you've overwritten files multiple times. Sound familiar?

This is the experiment chaos problem, and it's one of the biggest productivity killers in ML development. Without proper tracking, you'll:

  • Waste time re-running experiments you've already done
  • Lose insights about what worked and what didn't
  • Struggle to reproduce your best results
  • Be unable to collaborate effectively with teammates
  • Fail audits when stakeholders ask "how did you get this result?"

⚠️ Real Cost: By common estimates, data scientists spend 20-30% of their time trying to reproduce past experiments or debugging "why did this work last week?" issues. That can easily add up to 10+ hours per week of wasted effort!

The solution? Systematic experiment tracking from day one. Let's dive in.

📊 What is Experiment Tracking?

Experiment tracking is the practice of automatically logging everything related to your ML experiments:

What to Track

  • Hyperparameters: Learning rate, batch size, number of layers, regularization, etc.
  • Metrics: Accuracy, loss, precision, recall, F1 score, AUC-ROC
  • Artifacts: Model files, plots, confusion matrices, feature importance charts
  • Code version: Git commit hash to know exactly what code was used
  • Environment: Python version, library versions, hardware specs
  • Data version: Which dataset version was used
  • Metadata: Training time, who ran it, notes/tags

💡 Key Insight: Good experiment tracking is like version control (Git) for your ML experiments. Instead of tracking code changes, you're tracking model changes and their results.

Benefits of Experiment Tracking

  1. Reproducibility: Recreate any experiment months later
  2. Comparison: Easily compare 100+ experiments side-by-side
  3. Collaboration: Share results with teammates instantly
  4. Optimization: Identify patterns in what works
  5. Compliance: Meet regulatory requirements for model auditing
  6. Debugging: Quickly find when/why performance degraded

🚀 Getting Started with MLflow

MLflow is an open-source platform for the complete ML lifecycle. It has four main components, but we'll focus on MLflow Tracking for experiment tracking.

Installing MLflow

# Install MLflow
pip install mlflow

# Optionally install with extra dependencies (includes scikit-learn support)
pip install mlflow[extras]

# Start the MLflow UI
mlflow ui

# Visit: http://localhost:5000

MLflow Key Concepts

  • Run: A single execution of model training code
  • Experiment: A collection of related runs (e.g., "Customer Churn Model v2")
  • Parameters: Input values (hyperparameters)
  • Metrics: Output values (accuracy, loss, etc.)
  • Artifacts: Output files (models, plots, data)
  • Tags: Metadata for organizing runs

💡 Pro Tip: MLflow stores everything locally by default in an mlruns folder. For teams, you can set up a central tracking server so everyone shares the same experiment database.
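
Pointing your code at a shared server is a one-line change. Here is a minimal sketch, assuming your team has already started a tracking server (for example with mlflow server --backend-store-uri and --default-artifact-root) at a placeholder address:

import mlflow

# Placeholder address - replace with your team's MLflow tracking server
mlflow.set_tracking_uri("http://mlflow.your-company.internal:5000")

# Experiments and runs created after this call are stored on the shared server
mlflow.set_experiment("customer-churn-prediction")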

📝 Tracking Experiments with MLflow

Basic Tracking Example

Here's how to add MLflow tracking to your existing training code:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd

# Load your data
data = pd.read_csv('customer_data.csv')
X = data.drop('churned', axis=1)
y = data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set experiment name
mlflow.set_experiment("customer-churn-prediction")

# Start a run
with mlflow.start_run():
    # Define hyperparameters
    n_estimators = 100
    max_depth = 10
    min_samples_split = 5
    
    # Log parameters
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("min_samples_split", min_samples_split)
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Log metrics
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    
    # Log model
    mlflow.sklearn.log_model(model, "random_forest_model")
    
    # Log additional artifacts
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

Viewing Results in MLflow UI

After running the code above, open the MLflow UI:

mlflow ui

You'll see:

  • 📊 All your experiments organized in a table
  • 🔍 Filter and sort by any metric or parameter
  • 📈 Compare multiple runs side-by-side
  • 📁 Download models and artifacts
  • 📝 Add notes and tags to runs
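
The same results are also available programmatically, which is handy when you want a script to pick the winning run. A short sketch, assuming a recent MLflow version and the experiment name from the example above:

import mlflow

# One row per run, with columns like params.n_estimators and metrics.accuracy
runs = mlflow.search_runs(experiment_names=["customer-churn-prediction"])

best = runs.sort_values("metrics.accuracy", ascending=False)
print(best[["run_id", "params.n_estimators", "metrics.accuracy"]].head())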

Advanced: Hyperparameter Tuning with MLflow

import mlflow
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10]
}

mlflow.set_experiment("customer-churn-hyperparameter-tuning")

with mlflow.start_run(run_name="grid_search_parent"):
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
    
    grid_search.fit(X_train, y_train)
    
    # Log each configuration as a nested run
    for i, params in enumerate(grid_search.cv_results_['params']):
        with mlflow.start_run(nested=True, run_name=f"config_{i}"):
            mlflow.log_params(params)
            mlflow.log_metric("mean_f1_score", grid_search.cv_results_['mean_test_score'][i])
            mlflow.log_metric("std_f1_score", grid_search.cv_results_['std_test_score'][i])
    
    # Log best configuration
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_metric("best_f1_score", grid_search.best_score_)
    mlflow.sklearn.log_model(grid_search.best_estimator_, "best_model")
    
print(f"Best params: {grid_search.best_params_}")
print(f"Best F1 score: {grid_search.best_score_:.4f}")

🌐 Weights & Biases for Team Collaboration

While MLflow is great for local tracking, Weights & Biases (W&B) excels at team collaboration with a cloud-first approach and beautiful visualizations.

Why Choose W&B?

  • Cloud-first: Automatic syncing, no server setup
  • Real-time updates: Watch training in progress
  • Beautiful dashboards: Interactive charts out of the box
  • Team features: Comments, reports, sharing
  • Integration: Works with PyTorch, TensorFlow, scikit-learn, Hugging Face
  • Free tier: Unlimited personal projects

Getting Started with W&B

# Install wandb
pip install wandb

# Login (creates account if needed)
wandb login

Basic W&B Tracking

import wandb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Assumes X_train, X_test, y_train, y_test from the MLflow example above

# Initialize W&B
wandb.init(
    project="customer-churn",
    name="rf-experiment-1",
    config={
        "n_estimators": 100,
        "max_depth": 10,
        "min_samples_split": 5,
        "model": "RandomForest"
    }
)

# Access config
config = wandb.config

# Train model
model = RandomForestClassifier(
    n_estimators=config.n_estimators,
    max_depth=config.max_depth,
    min_samples_split=config.min_samples_split,
    random_state=42
)
model.fit(X_train, y_train)

# Predictions and metrics
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Log metrics
wandb.log({
    "accuracy": accuracy,
    "f1_score": f1
})

# Log confusion matrix
wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=['Not Churned', 'Churned'])

# Save model to disk, then upload the file to W&B
import joblib
joblib.dump(model, 'model.pkl')
wandb.save('model.pkl')

# Finish run
wandb.finish()

W&B for Deep Learning

import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Initialize
wandb.init(project="image-classification", config={
    "learning_rate": 0.001,
    "epochs": 50,
    "batch_size": 32
})

config = wandb.config

# Define model (YourNeuralNetwork is a placeholder for your own architecture)
model = YourNeuralNetwork()
criterion = nn.CrossEntropyLoss()  # loss function (adjust to your task)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

# Training loop (train_loader, val_loader, and validate() are placeholders
# for your own DataLoaders and evaluation function)
for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        # Log every 100 batches
        if batch_idx % 100 == 0:
            wandb.log({
                "epoch": epoch,
                "batch": batch_idx,
                "loss": loss.item()
            })
    
    # Validation
    val_accuracy = validate(model, val_loader)
    wandb.log({
        "epoch": epoch,
        "val_accuracy": val_accuracy
    })

wandb.finish()

💡 Pro Tip: W&B automatically logs system metrics (CPU, GPU, memory) so you can identify bottlenecks and optimize training performance.
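
Beyond system metrics, wandb.watch() can also record gradient and parameter histograms during training. A minimal sketch, assuming the model from the PyTorch example above:

import wandb

# Call once after wandb.init() and before the training loop starts;
# gradient histograms are then logged every log_freq steps
wandb.watch(model, log="gradients", log_freq=100)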

🔖 Model Versioning Best Practices

Model versioning is critical for tracking which model is in production, rolling back to previous versions, and maintaining a history of model evolution.

Semantic Versioning for Models

Use a versioning scheme like MAJOR.MINOR.PATCH:

  • MAJOR: Complete model architecture change (e.g., Random Forest → Neural Network)
  • MINOR: Significant retraining with new features or data
  • PATCH: Bug fixes, small hyperparameter tweaks

Example: customer-churn-model-v2.3.1
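
MLflow does not enforce any particular scheme, but you can record the semantic version as a run tag so it appears alongside the rest of the experiment metadata. A minimal sketch:

import mlflow

MODEL_VERSION = "2.3.1"  # MAJOR.MINOR.PATCH chosen for this training run

with mlflow.start_run():
    mlflow.set_tag("model_version", MODEL_VERSION)
    # ... training, logging, and model registration as in the sections below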

MLflow Model Registry

MLflow provides a centralized model registry for managing model versions:

import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model
mlflow.set_experiment("customer-churn")
with mlflow.start_run():
    # Train model (train_your_model() is a placeholder for your own training code)
    model = train_your_model()
    
    # Log model
    mlflow.sklearn.log_model(
        model, 
        "model",
        registered_model_name="customer-churn-classifier"
    )

# Transition model to production
client.transition_model_version_stage(
    name="customer-churn-classifier",
    version=1,
    stage="Production"
)

# Load production model
model_uri = "models:/customer-churn-classifier/Production"
loaded_model = mlflow.sklearn.load_model(model_uri)

Model Version Stages

  • None: Initial state after registration
  • Staging: Model being tested before production
  • Production: Currently deployed model
  • Archived: Old version, kept for reference
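
You can load a registered model either by stage (as above) or by pinning an explicit version number, which is often safer for reproducible batch jobs. A brief sketch:

import mlflow

# By stage: resolves to whichever version is currently in "Production"
prod_model = mlflow.sklearn.load_model("models:/customer-churn-classifier/Production")

# By version number: always the same artifact, regardless of stage transitions
pinned_model = mlflow.sklearn.load_model("models:/customer-churn-classifier/1")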

🔄 Ensuring Reproducibility

Reproducibility means anyone (including future you) can recreate exact results. This requires tracking everything that affects model training.

Reproducibility Checklist

  1. Set Random Seeds
    import random
    import numpy as np
    import torch
    
    def set_seed(seed=42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # For deterministic behavior
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    
    set_seed(42)
  2. Log Python Environment (see the sketch after this checklist for attaching it to a run)
    # Create requirements.txt
    pip freeze > requirements.txt
    
    # Or use conda
    conda env export > environment.yml
  3. Track Data Versions with DVC
    # Initialize DVC
    dvc init
    
    # Track data file
    dvc add data/customer_data.csv
    
    # Commit .dvc file to Git
    git add data/customer_data.csv.dvc .gitignore
    git commit -m "Track data with DVC"
  4. Log Git Commit Hash
    import subprocess
    import mlflow
    
    def get_git_commit():
        try:
            commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
            return commit
        except Exception:
            return "unknown"
    
    with mlflow.start_run():
        mlflow.set_tag("git_commit", get_git_commit())
        # ... rest of training code
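
To tie the checklist together, the environment snapshot and commit hash can be attached directly to the MLflow run, so each run carries its own reproducibility record. A small sketch, assuming requirements.txt was generated as in step 2 and get_git_commit() is defined as in step 4:

import mlflow

with mlflow.start_run():
    mlflow.set_tag("git_commit", get_git_commit())
    mlflow.log_artifact("requirements.txt")  # environment snapshot travels with the run
    # ... training code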

⚠️ Common Pitfall: GPU non-determinism can still cause slight variations even with seeds set. For 100% reproducibility, use torch.use_deterministic_algorithms(True), but be aware this may slow training.
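
A sketch of what a fully deterministic PyTorch setup typically looks like; note the environment variable, which some CUDA operations require before deterministic mode will work:

import os
import torch

# Must be set before the first CUDA call; required by certain cuBLAS operations
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Raise an error whenever an operation has no deterministic implementation
torch.use_deterministic_algorithms(True)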

⚖️ MLflow vs W&B vs DVC

Feature Comparison

Feature             | MLflow               | Weights & Biases     | DVC
--------------------|----------------------|----------------------|----------------------
Experiment Tracking | ✅ Excellent         | ✅ Excellent         | ⚠️ Basic
Data Versioning     | ❌ No                | ⚠️ Limited           | ✅ Excellent
Hosting             | Self-hosted or cloud | Cloud-only           | Git + cloud storage
UI Quality          | ⭐⭐⭐ Good           | ⭐⭐⭐⭐⭐ Excellent   | ⭐⭐ Basic
Free Tier           | Unlimited            | Personal projects    | Unlimited
Best For            | Open-source projects | Team collaboration   | Data/model versioning

Our Recommendation

Use all three together! They complement each other:

  • DVC: Version control for data and model files
  • MLflow: Track experiments, manage models locally
  • W&B: Team collaboration and beautiful visualizations

For solo developers or small teams: Start with MLflow (free, self-hosted).
For larger teams needing collaboration: Add W&B.
When data versioning becomes critical: Add DVC.

💻 Hands-On: Complete Tracking Example

Let's put everything together in a complete example that tracks experiments, versions models, and ensures reproducibility.

"""
Complete ML Development Workflow with Tracking
"""
import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import subprocess
import json
from datetime import datetime

# ===== 1. REPRODUCIBILITY =====
def set_seed(seed=42):
    """Set seeds for reproducibility"""
    np.random.seed(seed)
    import random
    random.seed(seed)

set_seed(42)

# ===== 2. HELPER FUNCTIONS =====
def get_git_commit():
    """Get current git commit hash"""
    try:
        commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
        return commit
    except Exception:
        return "unknown"

def plot_confusion_matrix(y_true, y_pred, labels, filename="confusion_matrix.png"):
    """Create and save confusion matrix plot"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.savefig(filename, dpi=150, bbox_inches='tight')
    plt.close()
    return filename

# ===== 3. LOAD AND PREPARE DATA =====
print("Loading data...")
# Replace with your actual data loading
data = pd.read_csv('data/customer_churn.csv')
X = data.drop('churn', axis=1)
y = data['churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")

# ===== 4. EXPERIMENT TRACKING =====
# Set experiment
mlflow.set_experiment("customer-churn-production")

# Define hyperparameters to try
hyperparameter_configs = [
    {"n_estimators": 100, "max_depth": 10, "min_samples_split": 5},
    {"n_estimators": 200, "max_depth": 15, "min_samples_split": 2},
    {"n_estimators": 150, "max_depth": 12, "min_samples_split": 10},
]

best_model = None
best_f1 = 0

for config_idx, config in enumerate(hyperparameter_configs):
    print(f"\n{'='*60}")
    print(f"Training configuration {config_idx + 1}/{len(hyperparameter_configs)}")
    print(f"Config: {config}")
    print(f"{'='*60}")
    
    with mlflow.start_run(run_name=f"rf_config_{config_idx}"):
        # ===== LOG METADATA =====
        mlflow.set_tag("git_commit", get_git_commit())
        mlflow.set_tag("model_type", "RandomForest")
        mlflow.set_tag("dataset_version", "v1.0")
        mlflow.set_tag("training_date", datetime.now().isoformat())
        mlflow.set_tag("engineer", "your_name")
        
        # ===== LOG PARAMETERS =====
        mlflow.log_params(config)
        mlflow.log_param("test_size", 0.2)
        mlflow.log_param("random_state", 42)
        
        # ===== TRAIN MODEL =====
        model = RandomForestClassifier(**config, random_state=42, n_jobs=-1)
        
        import time
        start_time = time.time()
        model.fit(X_train, y_train)
        training_time = time.time() - start_time
        
        mlflow.log_metric("training_time_seconds", training_time)
        
        # ===== EVALUATE MODEL =====
        y_pred_train = model.predict(X_train)
        y_pred_test = model.predict(X_test)
        
        # Training metrics
        train_acc = accuracy_score(y_train, y_pred_train)
        mlflow.log_metric("train_accuracy", train_acc)
        
        # Test metrics
        test_acc = accuracy_score(y_test, y_pred_test)
        test_precision = precision_score(y_test, y_pred_test, average='weighted')
        test_recall = recall_score(y_test, y_pred_test, average='weighted')
        test_f1 = f1_score(y_test, y_pred_test, average='weighted')
        
        mlflow.log_metric("test_accuracy", test_acc)
        mlflow.log_metric("test_precision", test_precision)
        mlflow.log_metric("test_recall", test_recall)
        mlflow.log_metric("test_f1_score", test_f1)
        
        # Check for overfitting
        overfit_gap = train_acc - test_acc
        mlflow.log_metric("overfit_gap", overfit_gap)
        
        print(f"Train Accuracy: {train_acc:.4f}")
        print(f"Test Accuracy: {test_acc:.4f}")
        print(f"Test F1 Score: {test_f1:.4f}")
        print(f"Overfit Gap: {overfit_gap:.4f}")
        
        # ===== LOG ARTIFACTS =====
        # Confusion matrix
        cm_file = plot_confusion_matrix(y_test, y_pred_test, ['No Churn', 'Churn'])
        mlflow.log_artifact(cm_file)
        
        # Classification report
        report = classification_report(y_test, y_pred_test, output_dict=True)
        with open('classification_report.json', 'w') as f:
            json.dump(report, f, indent=2)
        mlflow.log_artifact('classification_report.json')
        
        # Feature importance
        feature_importance = pd.DataFrame({
            'feature': X.columns,
            'importance': model.feature_importances_
        }).sort_values('importance', ascending=False)
        
        plt.figure(figsize=(10, 6))
        sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
        plt.title('Top 10 Feature Importances')
        plt.tight_layout()
        plt.savefig('feature_importance.png', dpi=150)
        plt.close()
        mlflow.log_artifact('feature_importance.png')
        
        # ===== LOG MODEL =====
        mlflow.sklearn.log_model(
            model,
            "model",
            registered_model_name="customer-churn-classifier"
        )
        
        # Track best model
        if test_f1 > best_f1:
            best_f1 = test_f1
            best_model = model
            print(f"✨ New best model! F1 Score: {best_f1:.4f}")

print(f"\n{'='*60}")
print(f"🎉 Training Complete!")
print(f"Best F1 Score: {best_f1:.4f}")
print(f"View results: mlflow ui")
print(f"{'='*60}")

💡 What's Happening: This script trains 3 different model configurations, tracks everything in MLflow, saves artifacts, and automatically selects the best model. Run mlflow ui to see all experiments in a beautiful dashboard.

Next Steps

  1. Run the script above with your own data
  2. Open MLflow UI and explore the results
  3. Try adding W&B tracking alongside MLflow
  4. Set up a central MLflow tracking server for your team
  5. Create automated retraining pipelines (we'll cover this in later tutorials)

Test Your Knowledge

Q1: What percentage of time do data scientists typically waste trying to reproduce past experiments?

5-10%
10-15%
20-30%
40-50%

Q2: Which of the following should you track in ML experiments?

Only hyperparameters
Hyperparameters, metrics, artifacts, code version, environment, data version
Only metrics and model files
Only the final model

Q3: What is the main advantage of Weights & Biases over MLflow?

It's completely free
It has better model versioning
It supports more ML frameworks
Cloud-first with excellent team collaboration and visualizations

Q4: In semantic versioning for models (MAJOR.MINOR.PATCH), when should you increment the MAJOR version?

Complete model architecture change (e.g., Random Forest → Neural Network)
Small hyperparameter tweaks
Bug fixes
Retraining with the same architecture

Q5: Which tool is best for data versioning in ML projects?

MLflow
Weights & Biases
DVC (Data Version Control)
Git