
Model Serving at Scale

Master production serving with BentoML, TorchServe, and TensorFlow Serving. Learn batch vs real-time inference, load balancing, A/B testing, and deployment strategies

📅 Tutorial 7 📊 Intermediate


🚀 Beyond Basic Deployment

You've deployed your model to the cloud. It works! But now you're facing production challenges: How do you serve 10,000 requests per second? How do you roll out a new model version without downtime? How do you A/B test different models? How do you ensure predictions complete in under 100ms?

Model serving frameworks solve these production-scale challenges with specialized tools for ML inference. They provide batching, caching, model versioning, traffic routing, and performance optimizations that generic web frameworks don't offer.

⚠️ Production Serving Challenges:

  • Latency spikes under high load
  • Inefficient use of expensive GPU resources
  • Downtime during model updates
  • No way to compare model versions in production
  • Manual rollbacks when new models fail
  • Difficulty serving multiple models efficiently

🔧 Serving Framework Comparison

| Criterion         | BentoML                                | TorchServe          | TF Serving          |
|-------------------|----------------------------------------|---------------------|---------------------|
| Framework Support | ✅ All (sklearn, PyTorch, TF, XGBoost) | PyTorch only        | TensorFlow only     |
| Ease of Use       | ✅ Easiest                             | Moderate            | Complex             |
| Performance       | Excellent                              | ✅ Best for PyTorch | ✅ Best for TF      |
| Batching          | ✅ Adaptive                            | ✅ Dynamic          | ✅ Built-in         |
| Model Versioning  | ✅ Built-in                            | ✅ Built-in         | ✅ Built-in         |
| A/B Testing       | ✅ Native                              | Via config          | Via external proxy  |
| Learning Curve    | ✅ Low                                 | Moderate            | Steep               |
| Best For          | ✅ Multi-framework, Python-first       | PyTorch production  | TensorFlow at scale |

💡 Which to Choose?

  • BentoML: Best for most use cases, especially if using multiple frameworks
  • TorchServe: If you're PyTorch-only and need maximum performance
  • TensorFlow Serving: If you're TensorFlow-only and need ultra-low latency

🍱 BentoML: Universal Model Serving

Why BentoML?

BentoML is one of the most user-friendly ML serving frameworks. It works with any major framework and provides automatic API generation, adaptive batching, and straightforward deployment to any cloud.

Installation

pip install bentoml

Serving a Scikit-learn Model

"""
BentoML service for sklearn model
File: service.py
"""
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON

# Save model to BentoML model store
# model = train_model()
# bentoml.sklearn.save_model("iris_classifier", model)

# Create service
iris_classifier_runner = bentoml.sklearn.get("iris_classifier:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_classifier_runner])

@svc.api(input=NumpyNdarray(), output=JSON())
async def classify(input_series: np.ndarray) -> dict:
    """Classify iris species"""
    result = await iris_classifier_runner.predict.async_run(input_series)
    return {
        "prediction": int(result[0]),
        "species": ["setosa", "versicolor", "virginica"][result[0]]
    }
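
The service above assumes an iris_classifier model already exists in the BentoML model store. A minimal sketch of that step, using scikit-learn's bundled iris dataset (the RandomForestClassifier is only an illustrative choice):

"""
Save a trained model to the BentoML model store
File: train.py (hypothetical)
"""
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a simple classifier on the iris dataset
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

# Store it under the tag the service references (iris_classifier:latest)
saved = bentoml.sklearn.save_model("iris_classifier", model)
print(f"Model saved: {saved.tag}")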

Running BentoML Service

# Serve locally
bentoml serve service:svc --reload

# API available at http://localhost:3000

# Test with curl
curl -X POST http://localhost:3000/classify \
  -H "Content-Type: application/json" \
  -d '[[5.1, 3.5, 1.4, 0.2]]'

# Build containerized service
bentoml build

# Containerize
bentoml containerize iris_classifier:latest

# Run container
docker run -p 3000:3000 iris_classifier:latest
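
Note that bentoml build expects a bentofile.yaml next to service.py describing what to package. A minimal sketch (the dependency list is only illustrative):

# bentofile.yaml
service: "service:svc"
include:
  - "service.py"
python:
  packages:
    - scikit-learn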

PyTorch Model with BentoML

"""
BentoML service for PyTorch model
"""
import bentoml
import torch
from torchvision import transforms
from bentoml.io import JSON, Image
from PIL import Image as PILImage

# Save PyTorch model
# bentoml.pytorch.save_model("resnet50", model)

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load model runner
resnet_runner = bentoml.pytorch.get("resnet50:latest").to_runner()
svc = bentoml.Service("image_classifier", runners=[resnet_runner])

@svc.api(input=Image(), output=JSON())
async def predict_image(img: PILImage.Image) -> dict:
    """Classify image"""
    # Preprocess: PIL image -> normalized tensor with a batch dimension
    img_tensor = preprocess(img).unsqueeze(0)
    
    # Predict (automatic batching!)
    output = await resnet_runner.async_run(img_tensor)
    
    # Get top prediction
    _, predicted = torch.max(output, 1)
    
    return {
        "class_id": int(predicted[0]),
        "confidence": float(torch.softmax(output, dim=1)[0][predicted[0]])
    }

Adaptive Batching

"""
BentoML automatically batches requests for efficiency
"""
import bentoml

# Configure batching limits on the runner
iris_runner = bentoml.sklearn.get("iris_classifier:latest").to_runner(
    max_batch_size=32,      # Max requests batched together
    max_latency_ms=100,     # Max time to wait while filling a batch
)

# Note: the model must have been saved with a batchable signature
# for adaptive batching to apply (see the sketch below)

# BentoML will:
# 1. Collect requests for up to 100ms
# 2. Batch up to 32 requests together
# 3. Run inference once for the batch
# 4. Return individual results

# Result: throughput often improves several-fold on batch-friendly models
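
Adaptive batching only applies to runner methods that were saved as batchable. A sketch of how the model from earlier would be saved with a batchable predict signature:

"""
Save the model with a batchable signature so the runner can batch requests
"""
import bentoml

bentoml.sklearn.save_model(
    "iris_classifier",
    model,  # the trained model from earlier
    signatures={
        "predict": {"batchable": True, "batch_dim": 0},  # batch along the first axis
    },
)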

Model Store & Versioning

# List models
bentoml models list

# Get specific version
bentoml models get iris_classifier:v1.0.0

# Delete old versions
bentoml models delete iris_classifier:old_version

# Export model
bentoml models export iris_classifier:latest ./model.bentomodel

# Import model
bentoml models import ./model.bentomodel

✅ BentoML Benefits:

  • Automatic API generation and OpenAPI docs
  • Adaptive batching for throughput optimization
  • Easy deployment to any platform
  • Model versioning and management
  • Multi-model serving in one service

🔥 TorchServe: PyTorch Production Serving

What is TorchServe?

TorchServe is PyTorch's official serving solution, optimized for PyTorch models with features like multi-model serving, A/B testing, and metrics.

Installation

pip install torchserve torch-model-archiver torch-workflow-archiver

Writing a Custom Handler

# custom_handler.py
import io

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler

class ImageClassifier(BaseHandler):
    """
    Custom handler for image classification
    """
    
    def __init__(self):
        super().__init__()
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])
    
    def preprocess(self, data):
        """Transform raw request bytes into a batched model input tensor"""
        images = []
        for row in data:
            image = row.get("data") or row.get("body")
            # The request body arrives as raw bytes; decode it into a PIL image
            image = Image.open(io.BytesIO(image)).convert("RGB")
            image = self.transform(image)
            images.append(image)
        return torch.stack(images)
    
    def inference(self, data):
        """Run inference"""
        with torch.no_grad():
            results = self.model(data)
        return results
    
    def postprocess(self, data):
        """Transform model output to response"""
        probabilities = F.softmax(data, dim=1)
        top_prob, top_class = torch.topk(probabilities, 1)
        
        return [
            {
                "class": int(top_class[i]),
                "probability": float(top_prob[i])
            }
            for i in range(len(top_class))
        ]

Creating the Model Archive

# Archive model for TorchServe
torch-model-archiver \
  --model-name resnet50 \
  --version 1.0 \
  --model-file model.py \
  --serialized-file resnet50.pth \
  --handler custom_handler.py \
  --extra-files index_to_name.json \
  --export-path model_store/

Starting TorchServe

# Start server
torchserve --start \
  --model-store model_store \
  --models resnet50=resnet50.mar

# Check status
curl http://localhost:8080/ping

# List models
curl http://localhost:8081/models

# Make prediction
curl -X POST http://localhost:8080/predictions/resnet50 \
  -T kitten.jpg

# Stop server
torchserve --stop
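
The same inference endpoint can be called from Python. A minimal sketch using the requests library:

"""
Simple Python client for the TorchServe inference API
"""
import requests

with open("kitten.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8080/predictions/resnet50",
        data=f.read(),
    )

# The custom handler above returns a list of {"class": ..., "probability": ...}
print(response.json())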

Configuration for Production

# config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082

# Performance tuning
number_of_netty_threads=32
job_queue_size=1000
# 100 MB request/response limits (properties files don't support inline comments)
max_request_size=104857600
max_response_size=104857600

# Workers
default_workers_per_model=4
# Batching (batch_size, max_batch_delay) is a per-model setting,
# typically supplied when registering the model via the management API (see below)

# Logging & metrics
log_dir=/var/log/torchserve
metrics_format=prometheus
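
Because batching is configured per model, it is usually passed when the model is registered. A sketch using the management API from Python (the values mirror the config comments above and are only illustrative):

"""
Register a model with per-model batching via the management API
"""
import requests

response = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "resnet50.mar",
        "model_name": "resnet50",
        "batch_size": 8,          # requests batched together
        "max_batch_delay": 100,   # ms to wait while filling a batch
        "initial_workers": 2,
        "synchronous": "true",
    },
)
print(response.status_code, response.text)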

Model Versioning & A/B Testing

# Register multiple versions
curl -X POST "http://localhost:8081/models?url=resnet50_v1.mar&model_name=resnet50&initial_workers=2"
curl -X POST "http://localhost:8081/models?url=resnet50_v2.mar&model_name=resnet50&initial_workers=2"

# Set default version
curl -X PUT "http://localhost:8081/models/resnet50/1.0/set-default"

# Scale workers
curl -X PUT "http://localhost:8081/models/resnet50?min_worker=3&max_worker=10"

# Unregister model
curl -X DELETE "http://localhost:8081/models/resnet50/1.0"

Monitoring with Prometheus

# View metrics
curl http://localhost:8082/metrics

# Metrics available:
# - ts_inference_latency_microseconds
# - ts_queue_latency_microseconds
# - ts_inference_requests_total
# - Requests_2XX, 4XX, 5XX
# - CPUUtilization, MemoryUtilization
# - GPUMemoryUtilization, GPUUtilization

🧠 TensorFlow Serving

Overview

TensorFlow Serving is a high-performance serving system designed specifically for TensorFlow models. It's used by Google for serving models at massive scale.

Preparing Model for Serving

"""
Save TensorFlow model for serving
"""
import tensorflow as tf

# Train model (assumes x_train / y_train are already defined)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=5)

# Save in the SavedModel format TF Serving expects (a numbered version directory)
model_path = 'models/my_model/1'  # Version 1
model.save(model_path)
# Note: with Keras 3, use model.export(model_path) to write a SavedModel

# Directory structure:
# models/
# └── my_model/
#     └── 1/
#         ├── saved_model.pb
#         └── variables/
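
To confirm the exported signature, and the exact input/output tensor names the gRPC client below depends on, TensorFlow's saved_model_cli tool can inspect the SavedModel:

# Inspect the serving signature of the exported model
saved_model_cli show --dir models/my_model/1 \
  --tag_set serve --signature_def serving_default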

Running TensorFlow Serving

# Using Docker
docker pull tensorflow/serving

# Serve model
docker run -p 8501:8501 \
  --mount type=bind,source=$(pwd)/models,target=/models \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving

# REST API available at http://localhost:8501/v1/models/my_model

# Make prediction (REST)
curl -X POST http://localhost:8501/v1/models/my_model:predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [[1.0, 2.0, 3.0, 4.0]]
  }'

Using gRPC for Better Performance

"""
gRPC client for TensorFlow Serving
Requires: pip install tensorflow-serving-api
"""
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import numpy as np

# Create gRPC channel
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create request
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'

# Input data -- the 'input' key must match the model's serving signature
# (check it with saved_model_cli, shown earlier)
input_data = np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32)
request.inputs['input'].CopyFrom(
    tf.make_tensor_proto(input_data, shape=input_data.shape)
)

# Make prediction
result = stub.Predict(request, 10.0)  # 10 second timeout

# gRPC avoids JSON serialization and is often 2-3x faster than REST for large tensors
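
The PredictResponse holds output tensors keyed by the names in the serving signature. A sketch of extracting the result (the 'dense_1' key is a placeholder; check your own signature with saved_model_cli):

# Extract the prediction from the response
# (the output key depends on your model's serving signature)
output_proto = result.outputs['dense_1']
predictions = tf.make_ndarray(output_proto)
print(predictions)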

Model Versioning

# Directory structure with versions
models/
└── my_model/
    ├── 1/  # Version 1
    │   └── saved_model.pb
    ├── 2/  # Version 2
    │   └── saved_model.pb
    └── 3/  # Version 3 (latest)
        └── saved_model.pb

# TensorFlow Serving automatically:
# - Detects new version directories
# - Loads the new version
# - Serves the latest version by default
# - Leaves older version directories on disk for rollback

# Specify version in request
curl -X POST http://localhost:8501/v1/models/my_model/versions/2:predict \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
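
By default TF Serving only keeps the latest version loaded, so the versioned request above works only if older versions stay available via a model config file (passed with --model_config_file). A minimal sketch:

# models.config
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    model_version_policy { all {} }   # keep all versions loaded
  }
}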

Batching Configuration

# batching_config.txt
max_batch_size { value: 32 }
batch_timeout_micros { value: 100000 }  # 100ms
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }

# Start with batching enabled
docker run -p 8501:8501 \
  --mount type=bind,source=$(pwd)/models,target=/models \
  --mount type=bind,source=$(pwd)/batching_config.txt,target=/config/batching_config.txt \
  -e MODEL_NAME=my_model \
  -t tensorflow/serving \
  --enable_batching=true \
  --batching_parameters_file=/config/batching_config.txt

⚖️ Batch vs Real-Time Inference

When to Use Each

| Aspect         | Real-Time                         | Batch                           |
|----------------|-----------------------------------|---------------------------------|
| Latency        | Milliseconds                      | Minutes to hours                |
| Use Cases      | User-facing apps, fraud detection | Recommendations, analytics, ETL |
| Cost           | Higher (always-on servers)        | ✅ Lower (run periodically)     |
| Complexity     | Higher                            | ✅ Lower                        |
| Throughput     | Lower per request                 | ✅ Much higher                  |
| Infrastructure | Load balancers, auto-scaling      | Simple job scheduler            |

Batch Inference Example

"""
Batch inference for recommendations
"""
import joblib
import pandas as pd

model = joblib.load('recommendation_model.joblib')

def batch_predict(input_file, output_file, batch_size=10000):
    """Process a large dataset in batches"""
    
    # Read data
    df = pd.read_csv(input_file)
    predictions = []
    
    # Process in chunks to keep memory usage bounded
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i:i+batch_size]
        
        # Vectorized prediction on the whole chunk
        batch_predictions = model.predict(batch)
        predictions.extend(batch_predictions)
        
        print(f"Processed {min(i+batch_size, len(df))}/{len(df)}")
    
    # Save results
    df['prediction'] = predictions
    df.to_csv(output_file, index=False)
    print(f"✅ Saved predictions to {output_file}")

# Run batch job
batch_predict(
    'users.csv',
    'user_recommendations.csv',
    batch_size=10000
)

# Schedule with cron
# 0 2 * * * python batch_inference.py  # Run at 2 AM daily

Hybrid: Real-Time with Batch Preprocessing

"""
Precompute embeddings in batch, serve in real-time
(load_all_users, get_user_embedding and model are assumed to be defined elsewhere)
"""
import faiss
import numpy as np

EMBEDDING_DIM = 128  # must match the encoder's output dimension

# BATCH: Precompute user/item embeddings
def batch_compute_embeddings():
    """Run daily to update embeddings"""
    users = load_all_users()
    embeddings = model.encode(users).astype(np.float32)
    
    # Build FAISS index for fast similarity search
    index = faiss.IndexFlatL2(EMBEDDING_DIM)
    index.add(embeddings)
    
    faiss.write_index(index, 'embeddings.index')

# REAL-TIME: Fast lookup
# Load the index once at startup, not on every request
index = faiss.read_index('embeddings.index')

def realtime_recommend(user_id):
    """Real-time recommendations using the precomputed index"""
    # FAISS expects a 2D float32 array of shape (n_queries, dim)
    user_embedding = get_user_embedding(user_id).astype(np.float32).reshape(1, -1)
    
    # Fast nearest neighbor search
    distances, indices = index.search(user_embedding, 10)
    
    return indices[0]  # Top 10 recommendations, typically in a few milliseconds

🔄 Deployment Strategies

1. Blue-Green Deployment

Run two identical production environments. Route all traffic to one (blue) while you deploy the new version to the other (green). When green looks healthy, switch traffic over instantly, and switch back just as quickly if something goes wrong.

# kubernetes blue-green deployment
# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
      version: blue
  template:
    metadata:
      labels:
        app: ml-model
        version: blue
    spec:
      containers:
      - name: model
        image: ml-model:v1.0
        ports:
        - containerPort: 8080

---
# Service (routes to active version)
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80
    targetPort: 8080
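
The actual blue-to-green switch is just an update to the Service selector. A minimal sketch with kubectl (in practice this step is usually driven by your CD tooling):

# Switch traffic from blue to green
kubectl patch service ml-model-service \
  -p '{"spec":{"selector":{"app":"ml-model","version":"green"}}}'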

2. Canary Deployment

Gradually roll out the new version to a small percentage of traffic, monitor its metrics, and increase the percentage as confidence grows.

# Istio VirtualService for canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-model
spec:
  hosts:
  - ml-model
  http:
  - match:
    - headers:
        x-version:
          exact: v2
    route:
    - destination:
        host: ml-model
        subset: v2
  - route:
    - destination:
        host: ml-model
        subset: v1
      weight: 90  # 90% to v1
    - destination:
        host: ml-model
        subset: v2
      weight: 10  # 10% to v2 (canary)

3. A/B Testing

Split traffic between model versions based on user attributes or random selection. Compare business metrics.

"""
A/B testing with BentoML
(log_ab_test is assumed to be defined elsewhere)
"""
import hashlib

import bentoml
from bentoml.io import JSON

model_a_runner = bentoml.sklearn.get("model_a:latest").to_runner()
model_b_runner = bentoml.sklearn.get("model_b:latest").to_runner()

svc = bentoml.Service("ab_test", runners=[model_a_runner, model_b_runner])

@svc.api(input=JSON(), output=JSON())
async def predict(request: dict) -> dict:
    user_id = request['user_id']
    features = request['features']
    
    # Deterministic assignment: hash the user ID so the same user always
    # gets the same model (Python's built-in hash() is salted per process,
    # so use hashlib for stable bucketing)
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 2
    if bucket == 0:
        model = model_a_runner
        version = 'A'
    else:
        model = model_b_runner
        version = 'B'
    
    prediction = await model.predict.async_run(features)
    
    # Log for offline analysis
    log_ab_test(user_id, version, prediction)
    
    return {
        'prediction': prediction.tolist(),
        'model_version': version
    }

4. Shadow Mode

The new model receives a copy of production traffic, but its predictions aren't returned to users. They are logged and compared with the production model's results offline.

"""
Shadow mode deployment
(prod_model, shadow_model and log_shadow_comparison are defined elsewhere in the service)
"""
import asyncio

@svc.api(input=JSON(), output=JSON())
async def predict_with_shadow(request: dict) -> dict:
    # Production prediction
    prod_prediction = await prod_model.predict.async_run(request['features'])
    
    # Shadow prediction: fire-and-forget, never blocks the response
    asyncio.create_task(
        shadow_predict(request, prod_prediction)
    )
    
    # Return the production result immediately (assuming the runner returns a NumPy array)
    return {'prediction': prod_prediction.tolist()}

async def shadow_predict(request, prod_result):
    """Run shadow model and compare"""
    shadow_result = await shadow_model.predict.async_run(request['features'])
    
    # Log comparison
    log_shadow_comparison(
        prod_result=prod_result,
        shadow_result=shadow_result,
        features=request['features']
    )

✅ Deployment Strategy Selection:

  • Blue-Green: Zero-downtime deployments, instant rollback
  • Canary: Gradual rollout, low risk
  • A/B Testing: Compare business impact of models
  • Shadow: Validate new model without user impact

⚖️ Load Balancing for ML Models

NGINX Configuration

# nginx.conf for ML model load balancing
upstream ml_models {
    least_conn;  # Route to the server with the fewest active connections
    
    server model1:8080 weight=3;
    server model2:8080 weight=2;
    server model3:8080 weight=1;
    
    # Backup server, used only when the others are marked unavailable
    # (open-source NGINX health checks are passive, via max_fails/fail_timeout)
    server model4:8080 backup;
}

server {
    listen 80;
    
    location /predict {
        proxy_pass http://ml_models;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        
        # Timeouts for long predictions
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
        
        # Allow large request payloads (e.g. image uploads)
        client_max_body_size 100M;
    }
    
    # Health check endpoint
    location /health {
        access_log off;
        return 200 "healthy\n";
    }
}

Horizontal Pod Autoscaling (HPA)

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scale down
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately

🎯 Summary

You've mastered production model serving at scale:

  • 🍱 BentoML: Universal framework with adaptive batching and easy deployment
  • 🔥 TorchServe: Optimized PyTorch serving with model management
  • 🧠 TF Serving: High-performance TensorFlow inference at scale
  • ⚖️ Batch vs Real-Time: Choose the right approach for your use case
  • 🔄 Deployment Strategies: Blue-green, canary, A/B testing, shadow mode
  • 📊 Load Balancing: Auto-scaling and traffic distribution

Key Takeaways

  1. Use specialized serving frameworks for production ML inference
  2. BentoML for multi-framework, TorchServe for PyTorch, TF Serving for TensorFlow
  3. Leverage adaptive batching for throughput optimization
  4. Choose batch inference for periodic, high-volume workloads
  5. Use canary deployments for safe rollouts
  6. Implement A/B testing to compare model versions
  7. Configure auto-scaling based on traffic patterns

🚀 Next Steps:

Your models are serving at scale! Next, you'll learn orchestration and ML pipelines - automating the entire ML lifecycle from data ingestion to model deployment with tools like Airflow and Kubeflow.

Test Your Knowledge

Q1: What's the main advantage of using BentoML over a generic FastAPI service?

  • It's faster
  • It provides adaptive batching, model versioning, and framework-agnostic serving built specifically for ML
  • It's free
  • It only works with scikit-learn

Q2: When should you use batch inference instead of real-time inference?

  • Always, it's better
  • For user-facing applications
  • For periodic, high-volume workloads like daily recommendations where latency isn't critical
  • Never use batch inference

Q3: What is a canary deployment?

  • Gradually rolling out a new model version to a small percentage of traffic and monitoring before full rollout
  • Deploying to a yellow server
  • A type of bird-themed deployment
  • Instant full traffic switch

Q4: What's the benefit of adaptive batching in model serving?

  • It makes models more accurate
  • It reduces model size
  • It makes training faster
  • It significantly increases throughput by processing multiple requests together while keeping latency low

Q5: What's shadow mode deployment?

  • Deploying at night
  • Running a new model on production traffic without using its predictions, to validate performance before real deployment
  • Hiding the model from users
  • Using dark themes