🚀 Beyond Basic Deployment
You've deployed your model to the cloud. It works! But now you're facing production challenges: How do you serve 10,000 requests per second? How do you roll out a new model version without downtime? How do you A/B test different models? How do you ensure predictions complete in under 100ms?
Model serving frameworks solve these production-scale challenges with specialized tools for ML inference. They provide batching, caching, model versioning, traffic routing, and performance optimizations that generic web frameworks don't offer.
⚠️ Production Serving Challenges:
- Latency spikes under high load
- Inefficient use of expensive GPU resources
- Downtime during model updates
- No way to compare model versions in production
- Manual rollbacks when new models fail
- Difficulty serving multiple models efficiently
🔧 Serving Framework Comparison
| Feature | BentoML | TorchServe | TF Serving |
|---|---|---|---|
| Framework Support | ✅ All (sklearn, PyTorch, TF, XGBoost) | PyTorch only | TensorFlow only |
| Ease of Use | ✅ Easiest | Moderate | Complex |
| Performance | Excellent | ✅ Best for PyTorch | ✅ Best for TF |
| Batching | ✅ Adaptive | ✅ Dynamic | ✅ Built-in |
| Model Versioning | ✅ Built-in | ✅ Built-in | ✅ Built-in |
| A/B Testing | ✅ Native | Via config | Via external proxy |
| Learning Curve | ✅ Low | Moderate | Steep |
| Best For | ✅ Multi-framework, Python-first | PyTorch production | TensorFlow at scale |
💡 Which to Choose?
- BentoML: Best for most use cases, especially if using multiple frameworks
- TorchServe: If you're PyTorch-only and need maximum performance
- TensorFlow Serving: If you're TensorFlow-only and need ultra-low latency
🍱 BentoML: Universal Model Serving
Why BentoML?
BentoML is the most user-friendly of the three serving frameworks. It works with any ML framework and provides automatic API generation, adaptive batching, and easy deployment to any cloud.
Installation
pip install bentoml
Serving a Scikit-learn Model
"""
BentoML service for sklearn model
File: service.py
"""
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON
# Save model to BentoML model store
# model = train_model()
# bentoml.sklearn.save_model("iris_classifier", model)
# Create service
iris_classifier_runner = bentoml.sklearn.get("iris_classifier:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_classifier_runner])
@svc.api(input=NumpyNdarray(), output=JSON())
async def classify(input_series: np.ndarray) -> dict:
    """Classify iris species"""
    result = await iris_classifier_runner.predict.async_run(input_series)
    pred = int(result[0])
    return {
        "prediction": pred,
        "species": ["setosa", "versicolor", "virginica"][pred]
    }
Running BentoML Service
# Serve locally
bentoml serve service:svc --reload
# API available at http://localhost:3000
# Test with curl
curl -X POST http://localhost:3000/classify \
-H "Content-Type: application/json" \
-d '[[5.1, 3.5, 1.4, 0.2]]'
# Build containerized service
bentoml build
# Containerize
bentoml containerize iris_classifier:latest
# Run container
docker run -p 3000:3000 iris_classifier:latest
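Calling the API from Python
Besides curl, you can call the endpoint from any HTTP client. Here's a minimal sketch using requests, assuming the service above is running locally on port 3000:
"""
Calling the BentoML endpoint from Python
"""
import requests

# One row of iris features, matching the NumpyNdarray input
features = [[5.1, 3.5, 1.4, 0.2]]

response = requests.post(
    "http://localhost:3000/classify",
    json=features,
    timeout=5,
)
response.raise_for_status()
print(response.json())  # e.g. {"prediction": 0, "species": "setosa"}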
PyTorch Model with BentoML
"""
BentoML service for PyTorch model
"""
import bentoml
import torch
from bentoml.io import JSON, Image
from PIL import Image as PILImage
# Save PyTorch model
# bentoml.pytorch.save_model("resnet50", model)
# Load model runner
resnet_runner = bentoml.pytorch.get("resnet50:latest").to_runner()
svc = bentoml.Service("image_classifier", runners=[resnet_runner])
@svc.api(input=Image(), output=JSON())
async def predict_image(img: PILImage.Image) -> dict:
    """Classify image"""
    # Preprocess
    img_tensor = preprocess(img)
    # Predict (automatic batching!)
    output = await resnet_runner.async_run(img_tensor)
    # Get top prediction
    _, predicted = torch.max(output, 1)
    return {
        "class_id": int(predicted[0]),
        "confidence": float(torch.softmax(output, dim=1)[0][predicted[0]])
    }
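The service above assumes a preprocess() helper that isn't shown. A minimal sketch, assuming standard ImageNet preprocessing for a ResNet-50:
"""
Hypothetical preprocess() helper assumed by the service above
"""
from torchvision import transforms

_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

def preprocess(img):
    """PIL image -> batched float tensor of shape (1, 3, 224, 224)"""
    return _transform(img.convert("RGB")).unsqueeze(0)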
Adaptive Batching
"""
BentoML automatically batches requests for efficiency
"""
import bentoml
# Configure batching
iris_runner = bentoml.sklearn.get("iris_classifier:latest").to_runner(
    max_batch_size=32,   # Max batch size
    max_latency_ms=100,  # Max wait time
)
# BentoML will:
# 1. Collect requests for up to 100ms
# 2. Batch up to 32 requests together
# 3. Run inference once for the batch
# 4. Return individual results
# Result: large throughput gains (often several-fold) without changing client code
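Adaptive batching only applies to model methods marked as batchable when the model is saved. A sketch of the save step using the signatures argument of save_model (verify the exact options against your BentoML version):
"""
Marking the model's predict method as batchable when saving
"""
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

bentoml.sklearn.save_model(
    "iris_classifier",
    model,
    signatures={
        "predict": {"batchable": True, "batch_dim": 0},  # batch along axis 0
    },
)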
Model Store & Versioning
# List models
bentoml models list
# Get specific version
bentoml models get iris_classifier:v1.0.0
# Delete old versions
bentoml models delete iris_classifier:old_version
# Export model
bentoml models export iris_classifier:latest ./model.bentomodel
# Import model
bentoml models import ./model.bentomodel
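The same model store operations are also scriptable from Python. A sketch using the BentoML 1.x Python API (method names may differ slightly across versions, so double-check against your installed release):
"""
Model store operations from Python
"""
import bentoml

# List models in the local store
for model in bentoml.models.list():
    print(model.tag)

# Export / import a model programmatically
bentoml.models.export_model("iris_classifier:latest", "./model.bentomodel")
bentoml.models.import_model("./model.bentomodel")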
✅ BentoML Benefits:
- Automatic API generation and OpenAPI docs
- Adaptive batching for throughput optimization
- Easy deployment to any platform
- Model versioning and management
- Multi-model serving in one service
🔥 TorchServe: PyTorch Production Serving
What is TorchServe?
TorchServe is PyTorch's official serving solution, optimized for PyTorch models with features like multi-model serving, A/B testing, and metrics.
Installation
pip install torchserve torch-model-archiver torch-workflow-archiver
Creating a Model Archive
# custom_handler.py
import io

import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler

class ImageClassifier(BaseHandler):
    """
    Custom handler for image classification
    """
    def __init__(self):
        super().__init__()
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        ])

    def preprocess(self, data):
        """Transform raw input (image bytes) into model input"""
        images = []
        for row in data:
            image = row.get("data") or row.get("body")  # raw bytes
            image = Image.open(io.BytesIO(image)).convert("RGB")
            image = self.transform(image)
            images.append(image)
        return torch.stack(images)

    def inference(self, data):
        """Run inference"""
        with torch.no_grad():
            results = self.model(data)
        return results

    def postprocess(self, data):
        """Transform model output to response"""
        probabilities = F.softmax(data, dim=1)
        top_prob, top_class = torch.topk(probabilities, 1)
        return [
            {
                "class": int(top_class[i]),
                "probability": float(top_prob[i])
            }
            for i in range(len(top_class))
        ]
Creating Model Archive
# Archive model for TorchServe
torch-model-archiver \
--model-name resnet50 \
--version 1.0 \
--model-file model.py \
--serialized-file resnet50.pth \
--handler custom_handler.py \
--extra-files index_to_name.json \
--export-path model_store/
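The archiver command above references a model.py via --model-file. A minimal sketch of what that file might contain for a ResNet-50, assuming resnet50.pth holds a matching state_dict:
# model.py (referenced by --model-file above)
from torchvision.models.resnet import ResNet, Bottleneck

class ImageClassifier(ResNet):
    """ResNet-50 architecture; weights are loaded from resnet50.pth"""
    def __init__(self):
        super().__init__(Bottleneck, [3, 4, 6, 3])  # ResNet-50 layer layout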
Starting TorchServe
# Start server
torchserve --start \
--model-store model_store \
--models resnet50=resnet50.mar
# Check status
curl http://localhost:8080/ping
# List models
curl http://localhost:8081/models
# Make prediction
curl -X POST http://localhost:8080/predictions/resnet50 \
-T kitten.jpg
# Stop server
torchserve --stop
Configuration for Production
# config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082

# Performance tuning
number_of_netty_threads=32
job_queue_size=1000
# 100 MB request/response limits (comments must be on their own line in .properties files)
max_request_size=104857600
max_response_size=104857600

# Workers and batching (batch size and max batch delay can also be
# set per model when registering it via the management API)
default_workers_per_model=4
batch_size=8
# max batch delay in milliseconds
max_batch_delay=100

# Logging
log_dir=/var/log/torchserve
metrics_format=prometheus
Model Versioning & A/B Testing
# Register multiple versions
curl -X POST "http://localhost:8081/models?url=resnet50_v1.mar&model_name=resnet50&initial_workers=2"
curl -X POST "http://localhost:8081/models?url=resnet50_v2.mar&model_name=resnet50&initial_workers=2"
# Set default version
curl -X PUT "http://localhost:8081/models/resnet50/1.0/set-default"
# Scale workers
curl -X PUT "http://localhost:8081/models/resnet50?min_worker=3&max_worker=10"
# Unregister model
curl -X DELETE "http://localhost:8081/models/resnet50/1.0"
Monitoring with Prometheus
# View metrics
curl http://localhost:8082/metrics
# Metrics available:
# - ts_inference_latency_microseconds
# - ts_queue_latency_microseconds
# - ts_inference_requests_total
# - Requests_2XX, 4XX, 5XX
# - CPUUtilization, MemoryUtilization
# - GPUMemoryUtilization, GPUUtilization
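A quick way to sanity-check these metrics from Python is a plain HTTP GET against the metrics endpoint; the metric-name filter below is just an example:
"""
Quick check of TorchServe metrics from Python
"""
import requests

metrics_text = requests.get("http://localhost:8082/metrics", timeout=5).text

# Print only the inference latency lines
for line in metrics_text.splitlines():
    if line.startswith("ts_inference_latency_microseconds"):
        print(line)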
🧠 TensorFlow Serving
Overview
TensorFlow Serving is a high-performance serving system designed specifically for TensorFlow models. It's used by Google for serving models at massive scale.
Preparing Model for Serving
"""
Save TensorFlow model for serving
"""
import tensorflow as tf

# Train model (x_train, y_train assumed to be loaded elsewhere)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=5)

# Save in SavedModel format (TF2 / Keras 2; in Keras 3 use model.export())
model_path = 'models/my_model/1'  # Version 1
model.save(model_path)
# Directory structure:
# models/
# └── my_model/
#     └── 1/
#         ├── saved_model.pb
#         └── variables/
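Inspecting the SavedModel
Before serving, it helps to inspect the SavedModel's signatures to find the exact input and output tensor names (you'll need them for the gRPC client below). A short sketch:
"""
Inspect the SavedModel's serving signature
"""
import tensorflow as tf

loaded = tf.saved_model.load('models/my_model/1')
infer = loaded.signatures['serving_default']

print(infer.structured_input_signature)  # input tensor names and shapes
print(infer.structured_outputs)          # output tensor names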
Running TensorFlow Serving
# Using Docker
docker pull tensorflow/serving
# Serve model
docker run -p 8501:8501 \
--mount type=bind,source=$(pwd)/models,target=/models \
-e MODEL_NAME=my_model \
-t tensorflow/serving
# REST API available at http://localhost:8501/v1/models/my_model
# Make prediction (REST)
curl -X POST http://localhost:8501/v1/models/my_model:predict \
-H "Content-Type: application/json" \
-d '{
"instances": [[1.0, 2.0, 3.0, 4.0]]
}'
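The same REST call from Python with requests, assuming the container above is running and the model takes four features:
"""
REST client for TensorFlow Serving
"""
import requests

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}

response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=payload,
    timeout=5,
)
response.raise_for_status()
print(response.json()["predictions"])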
Using gRPC for Better Performance
"""
gRPC client for TensorFlow Serving
"""
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import numpy as np
# Create gRPC channel (the serving container must also publish
# port 8500, TensorFlow Serving's gRPC port, e.g. -p 8500:8500)
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create request
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'

# Input data (the key must match the input name in the model's
# serving_default signature -- see the inspection snippet above)
input_data = np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32)
request.inputs['input'].CopyFrom(
    tf.make_tensor_proto(input_data, shape=input_data.shape)
)

# Make prediction
result = stub.Predict(request, 10.0)  # 10 second timeout

# gRPC is typically much faster than REST for large tensors
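Continuing the client above, the response holds a map from output name to TensorProto. A sketch of unpacking it (the output name depends on your model's signature):
# Continuing from the gRPC client above: unpack the response
output_name = list(result.outputs.keys())[0]  # or use the known output name
predictions = tf.make_ndarray(result.outputs[output_name])
print(predictions)  # numpy array of model outputs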
Model Versioning
# Directory structure with versions
models/
└── my_model/
    ├── 1/   # Version 1
    │   └── saved_model.pb
    ├── 2/   # Version 2
    │   └── saved_model.pb
    └── 3/   # Version 3 (latest)
        └── saved_model.pb
# TensorFlow Serving automatically:
# - Detects new version directories
# - Loads the new version
# - Serves the latest version by default
# - Leaves old version directories on disk, so you can roll back
#   (or serve several versions at once via the model version policy)
# Specify version in request
curl -X POST http://localhost:8501/v1/models/my_model/versions/2:predict \
-d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
Batching Configuration
# batching_config.txt
max_batch_size { value: 32 }
batch_timeout_micros { value: 100000 } # 100ms
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
# Start with batching
docker run -p 8501:8501 \
--mount type=bind,source=$(pwd)/models,target=/models \
--mount type=bind,source=$(pwd)/batching_config.txt,target=/config/batching_config.txt \
-e MODEL_NAME=my_model \
-t tensorflow/serving \
--enable_batching=true \
--batching_parameters_file=/config/batching_config.txt
⚖️ Batch vs Real-Time Inference
When to Use Each
| Aspect | Real-Time | Batch |
|---|---|---|
| Latency | Milliseconds | Minutes to hours |
| Use Cases | User-facing apps, fraud detection | Recommendations, analytics, ETL |
| Cost | Higher (always-on servers) | ✅ Lower (run periodically) |
| Complexity | Higher | ✅ Lower |
| Throughput | Lower per request | ✅ Much higher |
| Infrastructure | Load balancers, auto-scaling | Simple job scheduler |
Batch Inference Example
"""
Batch inference for recommendations
"""
import joblib
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
model = joblib.load('recommendation_model.joblib')
def batch_predict(input_file, output_file, batch_size=10000):
"""Process large dataset in batches"""
# Read data
df = pd.read_csv(input_file)
predictions = []
# Process in chunks
for i in range(0, len(df), batch_size):
batch = df.iloc[i:i+batch_size]
# Parallel prediction
batch_predictions = model.predict(batch)
predictions.extend(batch_predictions)
print(f"Processed {i+batch_size}/{len(df)}")
# Save results
df['prediction'] = predictions
df.to_csv(output_file, index=False)
print(f"✅ Saved predictions to {output_file}")
# Run batch job
batch_predict(
'users.csv',
'user_recommendations.csv',
batch_size=10000
)
# Schedule with cron
# 0 2 * * * python batch_inference.py # Run at 2 AM daily
Hybrid: Real-Time with Batch Preprocessing
"""
Precompute embeddings in batch, serve in real-time
"""
import faiss
import numpy as np
# BATCH: Precompute user/item embeddings
def batch_compute_embeddings():
"""Run daily to update embeddings"""
users = load_all_users()
embeddings = model.encode(users)
# Build FAISS index for fast similarity search
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings)
faiss.write_index(index, 'embeddings.index')
# REAL-TIME: Fast lookup
def realtime_recommend(user_id):
"""Real-time recommendations using precomputed index"""
index = faiss.read_index('embeddings.index')
user_embedding = get_user_embedding(user_id)
# Fast nearest neighbor search
distances, indices = index.search(user_embedding, k=10)
return indices[0] # Top 10 recommendations in < 5ms!
🔄 Deployment Strategies
1. Blue-Green Deployment
Run two identical production environments. Route traffic to one (blue) while preparing the other (green). Switch instantly if needed.
# kubernetes blue-green deployment
# Blue deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-blue
  labels:
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
      version: blue
  template:
    metadata:
      labels:
        app: ml-model
        version: blue
    spec:
      containers:
      - name: model
        image: ml-model:v1.0
        ports:
        - containerPort: 8080
---
# Service (routes to active version)
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80
    targetPort: 8080
2. Canary Deployment
Gradually roll out the new version to a small percentage of traffic, monitor metrics, and increase the percentage if it performs well.
# Istio VirtualService for canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-model
spec:
  hosts:
  - ml-model
  http:
  - match:
    - headers:
        x-version:
          exact: v2
    route:
    - destination:
        host: ml-model
        subset: v2
  - route:
    - destination:
        host: ml-model
        subset: v1
      weight: 90  # 90% to v1
    - destination:
        host: ml-model
        subset: v2
      weight: 10  # 10% to v2 (canary)
3. A/B Testing
Split traffic between model versions based on user attributes or random selection. Compare business metrics.
"""
A/B testing with BentoML
"""
import bentoml
import random
model_a_runner = bentoml.sklearn.get("model_a:latest").to_runner()
model_b_runner = bentoml.sklearn.get("model_b:latest").to_runner()
svc = bentoml.Service("ab_test", runners=[model_a_runner, model_b_runner])
@svc.api(input=JSON(), output=JSON())
async def predict(request: dict) -> dict:
user_id = request['user_id']
features = request['features']
# Consistent hashing for user assignment
if hash(user_id) % 2 == 0:
model = model_a_runner
version = 'A'
else:
model = model_b_runner
version = 'B'
prediction = await model.predict.async_run(features)
# Log for analysis
log_ab_test(user_id, version, prediction)
return {
'prediction': prediction,
'model_version': version
}
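Once predictions and outcomes are logged per variant, you can compare them offline. A minimal sketch of a two-proportion z-test on conversion counts (the counts below are placeholders, not real results):
"""
Offline comparison of A/B variants: two-proportion z-test
"""
import math
from scipy.stats import norm

def compare_conversion_rates(conv_a, n_a, conv_b, n_b):
    """Is variant B's conversion rate significantly different from A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return {"rate_a": p_a, "rate_b": p_b, "z": z, "p_value": p_value}

# Placeholder counts -- replace with your logged A/B data
print(compare_conversion_rates(conv_a=480, n_a=10000, conv_b=540, n_b=10000))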
4. Shadow Mode
The new model receives a copy of production traffic, but its predictions are never returned to users. Compare them with the production model's predictions offline.
"""
Shadow mode deployment
"""
import asyncio
@svc.api(input=JSON(), output=JSON())
async def predict_with_shadow(request: dict) -> dict:
# Production prediction
prod_prediction = await prod_model.predict.async_run(request['features'])
# Shadow prediction (don't wait for it)
asyncio.create_task(
shadow_predict(request, prod_prediction)
)
# Return production result immediately
return prod_prediction
async def shadow_predict(request, prod_result):
"""Run shadow model and compare"""
shadow_result = await shadow_model.predict.async_run(request['features'])
# Log comparison
log_shadow_comparison(
prod_result=prod_result,
shadow_result=shadow_result,
features=request['features']
)
✅ Deployment Strategy Selection:
- Blue-Green: Zero-downtime deployments, instant rollback
- Canary: Gradual rollout, low risk
- A/B Testing: Compare business impact of models
- Shadow: Validate new model without user impact
⚖️ Load Balancing for ML Models
NGINX Configuration
# nginx.conf for ML model load balancing
upstream ml_models {
    least_conn;  # Route to server with fewest connections

    server model1:8080 weight=3;
    server model2:8080 weight=2;
    server model3:8080 weight=1;

    # Backup server (only used when the others are unavailable)
    server model4:8080 backup;
}

server {
    listen 80;

    location /predict {
        proxy_pass http://ml_models;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Timeouts for long predictions
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Allow large request bodies (e.g. images)
        client_max_body_size 100M;
    }

    # Health check endpoint
    location /health {
        access_log off;
        return 200 "healthy\n";
    }
}
Horizontal Pod Autoscaling (HPA)
# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric (requires a metrics adapter, e.g. Prometheus Adapter)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
🎯 Summary
You've mastered production model serving at scale:
- BentoML: Universal framework with adaptive batching and easy deployment
- TorchServe: Optimized PyTorch serving with model management
- TF Serving: High-performance TensorFlow inference at scale
- Batch vs Real-Time: Choose the right approach for your use case
- Deployment Strategies: Blue-green, canary, A/B testing, shadow mode
- Load Balancing: Auto-scaling and traffic distribution
Key Takeaways
- Use specialized serving frameworks for production ML inference
- BentoML for multi-framework, TorchServe for PyTorch, TF Serving for TensorFlow
- Leverage adaptive batching for throughput optimization
- Choose batch inference for periodic, high-volume workloads
- Use canary deployments for safe rollouts
- Implement A/B testing to compare model versions
- Configure auto-scaling based on traffic patterns
🚀 Next Steps:
Your models are serving at scale! Next, you'll learn orchestration and ML pipelines - automating the entire ML lifecycle from data ingestion to model deployment with tools like Airflow and Kubeflow.
Test Your Knowledge
Q1: What's the main advantage of using BentoML over a generic FastAPI service?
Q2: When should you use batch inference instead of real-time inference?
Q3: What is a canary deployment?
Q4: What's the benefit of adaptive batching in model serving?
Q5: What's shadow mode deployment?