☁️ Why Cloud Deployment?
You've containerized your ML model with Docker. Now you need to deploy it to production, where millions of users can access it. Running it on your laptop isn't an option, and a single server won't handle the traffic. You need scalability, reliability, and global reach.
Cloud platforms solve this by providing managed infrastructure, automatic scaling, load balancing, monitoring, and deployment tools specifically designed for ML workloads.
Auto-Scaling
Handle traffic spikes automatically - scale from 10 to 10,000 requests/second
Global Reach
Deploy to multiple regions for low latency worldwide
Reliability
99.99% uptime SLAs with automatic failover
Cost Efficiency
Pay only for what you use - no idle server costs
⚠️ Cloud Complexity: While powerful, cloud platforms have steep learning curves. This tutorial focuses on practical deployment patterns you'll actually use in production.
🏢 Cloud Platform Comparison
| Feature | AWS SageMaker | Google Cloud AI | Azure ML |
|---|---|---|---|
| Market Share | ✅ #1 (32%) | #3 (10%) | #2 (23%) |
| ML Services | Most comprehensive | ✅ TensorFlow native | Best for .NET |
| Ease of Use | Moderate | ✅ Easiest | Moderate |
| Pricing | Competitive | ✅ Cheapest compute | Competitive |
| GPU Options | ✅ Most variety | TPU available | Good variety |
| Enterprise | ✅ Most mature | Strong | ✅ Best MS integration |
| Deployment Speed | Fast | ✅ Fastest | Fast |
💡 Which to Choose?
- AWS: If you need maximum flexibility and already use AWS
- Google Cloud: If you use TensorFlow or want simplicity
- Azure: If you're in the Microsoft ecosystem
- Multi-cloud: Abstract with Kubernetes for portability
🟠 AWS SageMaker Deployment
What is SageMaker?
Amazon SageMaker is a fully managed ML platform that handles training, tuning, and deployment. It supports all major frameworks (scikit-learn, PyTorch, TensorFlow, XGBoost) as well as custom containers.
Deployment Options
- Real-time endpoints: Persistent, low-latency predictions
- Batch transform: Process large datasets offline (see the sketch after this list)
- Serverless inference: Auto-scaling, pay-per-invocation
- Asynchronous inference: Long-running, queued requests
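Only real-time and serverless endpoints are shown in the examples below. For comparison, here is a minimal, hedged batch transform sketch; it assumes the `sklearn_model` object created in the next example and uses hypothetical S3 paths:

```python
# Batch transform sketch (hypothetical bucket/paths; assumes the
# `sklearn_model` object from the real-time example below)
transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://your-bucket/batch-output/'
)

# Process a CSV of feature rows offline; results are written to output_path
transformer.transform(
    data='s3://your-bucket/batch-input/features.csv',
    content_type='text/csv',
    split_type='Line'
)
transformer.wait()
```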
Deploying a Scikit-learn Model
"""
Deploy sklearn model to SageMaker
"""
import sagemaker
from sagemaker.sklearn import SKLearnModel
from sagemaker.predictor import Predictor
import boto3
import joblib
# 1. Save model with entry point script
# model.py (inference script)
"""
import joblib
import numpy as np
def model_fn(model_dir):
    '''Load model from directory'''
    model = joblib.load(f'{model_dir}/model.joblib')
    return model

def predict_fn(input_data, model):
    '''Make predictions'''
    return model.predict(input_data)
"""
# 2. Upload model to S3
session = sagemaker.Session()
bucket = session.default_bucket()
# Upload model artifact
model_data = session.upload_data(
path='model.tar.gz', # tar.gz with model.joblib and model.py
bucket=bucket,
key_prefix='sklearn-model'
)
# 3. Create SageMaker model
sklearn_model = SKLearnModel(
model_data=model_data,
role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
entry_point='model.py',
framework_version='1.0-1',
py_version='py3'
)
# 4. Deploy to endpoint
predictor = sklearn_model.deploy(
initial_instance_count=1,
instance_type='ml.m5.large',
endpoint_name='iris-classifier'
)
# 5. Make predictions
import numpy as np
data = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = predictor.predict(data)
print(f"Prediction: {prediction}")
# 6. Clean up (delete endpoint to stop billing!)
predictor.delete_endpoint()
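Client applications that don't have the SageMaker SDK installed can call the same endpoint through the low-level runtime API. A minimal sketch using boto3, assuming the endpoint name above; CSV serialization is one common choice that the default scikit-learn container accepts:

```python
import boto3

runtime = boto3.client('sagemaker-runtime')

# One iris sample as a CSV row (text/csv is handled by the default sklearn container)
response = runtime.invoke_endpoint(
    EndpointName='iris-classifier',
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2'
)
print(response['Body'].read().decode())
```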
Deploying Docker Container to SageMaker
"""
Deploy custom Docker container to SageMaker
"""
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
# 1. Build and push Docker image to ECR
"""
# Build image
docker build -t ml-api:latest .
# Tag for ECR
docker tag ml-api:latest \
123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest
# Login to ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin \
123456789.dkr.ecr.us-east-1.amazonaws.com
# Push image
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest
"""
# 2. Create SageMaker model from container
model = Model(
image_uri='123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest',
model_data='s3://bucket/model.tar.gz', # Optional
role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole'
)
# 3. Deploy
predictor = model.deploy(
initial_instance_count=1,
instance_type='ml.g4dn.xlarge', # GPU instance
endpoint_name='ml-api-gpu'
)
# 4. Invoke endpoint
response = predictor.predict({
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
})
print(response)
Serverless Inference (Cost-Optimized)
from sagemaker.serverless import ServerlessInferenceConfig
# Deploy with serverless config
predictor = sklearn_model.deploy(
serverless_inference_config=ServerlessInferenceConfig(
memory_size_in_mb=2048, # 1GB - 6GB
max_concurrency=10 # Max concurrent invocations
)
)
# Automatically scales to zero when not in use!
# Pay only per invocation
✅ SageMaker Benefits:
- Automatic scaling and load balancing
- Built-in A/B testing and traffic routing (see the sketch after this list)
- Model monitoring and drift detection
- One-click rollback to previous versions
- Integration with AWS ecosystem
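The built-in traffic routing works through production variants: two registered models share one endpoint, and traffic is split by variant weight. A hedged sketch using the low-level boto3 API; the model names here are hypothetical and must already be registered in SageMaker:

```python
import boto3

sm = boto3.client('sagemaker')

# Two registered models share one endpoint; traffic is split by variant weight
sm.create_endpoint_config(
    EndpointConfigName='iris-ab-config',
    ProductionVariants=[
        {
            'VariantName': 'model-a',
            'ModelName': 'iris-classifier-v1',   # hypothetical registered model
            'InstanceType': 'ml.m5.large',
            'InitialInstanceCount': 1,
            'InitialVariantWeight': 0.8          # 80% of traffic
        },
        {
            'VariantName': 'model-b',
            'ModelName': 'iris-classifier-v2',   # hypothetical registered model
            'InstanceType': 'ml.m5.large',
            'InitialInstanceCount': 1,
            'InitialVariantWeight': 0.2          # 20% of traffic
        }
    ]
)
sm.create_endpoint(EndpointName='iris-ab', EndpointConfigName='iris-ab-config')
```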
🔵 Google Cloud AI Platform
Vertex AI Overview
Vertex AI is Google's unified ML platform. It's particularly strong for TensorFlow models and offers TPU acceleration.
Deploying to Vertex AI
"""
Deploy model to Google Cloud Vertex AI
"""
from google.cloud import aiplatform
# Initialize
aiplatform.init(
project='your-project-id',
location='us-central1'
)
# 1. Upload model to Google Cloud Storage
# gsutil cp model.joblib gs://your-bucket/models/
# 2. Create model
model = aiplatform.Model.upload(
display_name='iris-classifier',
artifact_uri='gs://your-bucket/models/',
serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest'
)
# 3. Deploy to endpoint
endpoint = model.deploy(
machine_type='n1-standard-2',
min_replica_count=1,
max_replica_count=5,
traffic_split={'0': 100} # 100% traffic to this version
)
# 4. Make prediction
prediction = endpoint.predict(
instances=[[5.1, 3.5, 1.4, 0.2]]
)
print(prediction.predictions)
# 5. Update with new model version (A/B testing)
new_model = aiplatform.Model.upload(
display_name='iris-classifier-v2',
artifact_uri='gs://your-bucket/models-v2/'
)
# Deploy alongside the existing model. In traffic_split, the key '0' refers to the
# model being deployed in this call; other keys must be existing deployed-model IDs
endpoint.deploy(
model=new_model,
traffic_split={'0': 20, 'EXISTING_DEPLOYED_MODEL_ID': 80} # 20% new, 80% old
)
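As with SageMaker, tear down endpoints you're not using, since deployed replicas bill continuously. A minimal cleanup sketch:

```python
# Undeploy all models from the endpoint, then delete it to stop billing
endpoint.undeploy_all()
endpoint.delete()
```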
Deploying Custom Container
# 1. Build and push to Google Container Registry
gcloud builds submit --tag gcr.io/your-project/ml-api:v1
# 2. Create custom prediction container
# Must respond to health checks and prediction requests
# prediction.py (FastAPI app for Vertex AI)
from fastapi import FastAPI, Request
import joblib
import numpy as np
app = FastAPI()
model = joblib.load('model.joblib')
# Health check (required by Vertex AI)
@app.get('/health')
def health():
    return {'status': 'healthy'}

# Prediction endpoint (required format)
@app.post('/predict')
async def predict(request: Request):
    data = await request.json()
    instances = np.array(data['instances'])
    predictions = model.predict(instances)
    return {'predictions': predictions.tolist()}
# Run with: uvicorn prediction:app --host 0.0.0.0 --port 8080
# Deploy custom container
from google.cloud import aiplatform
model = aiplatform.Model.upload(
display_name='custom-ml-api',
serving_container_image_uri='gcr.io/your-project/ml-api:v1',
serving_container_predict_route='/predict',
serving_container_health_route='/health',
serving_container_ports=[8080]
)
endpoint = model.deploy(machine_type='n1-standard-4')
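Clients without the Vertex SDK can call the endpoint's REST API directly with an OAuth token. A sketch assuming application-default credentials are configured; ENDPOINT_ID is a placeholder for the numeric endpoint ID from the deploy step:

```python
import requests
import google.auth
from google.auth.transport.requests import Request

# Obtain an access token from application-default credentials
credentials, project = google.auth.default()
credentials.refresh(Request())

# ENDPOINT_ID is a placeholder for the deployed endpoint's ID
url = (
    'https://us-central1-aiplatform.googleapis.com/v1/'
    f'projects/{project}/locations/us-central1/endpoints/ENDPOINT_ID:predict'
)
response = requests.post(
    url,
    headers={'Authorization': f'Bearer {credentials.token}'},
    json={'instances': [[5.1, 3.5, 1.4, 0.2]]}
)
print(response.json())
```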
Batch Predictions
# Process large datasets offline
batch_prediction_job = model.batch_predict(
job_display_name='iris-batch-prediction',
gcs_source='gs://your-bucket/input-data.csv',
gcs_destination_prefix='gs://your-bucket/predictions/',
machine_type='n1-standard-4',
starting_replica_count=5,
max_replica_count=10
)
# Wait for completion
batch_prediction_job.wait()
print(f"Output: {batch_prediction_job.output_info}")
💡 Google Cloud Advantages: Simplest deployment process, excellent documentation, TPU support for deep learning, tight integration with TensorFlow, and generous free tier.
🔷 Azure Machine Learning
Azure ML Service Overview
Azure ML provides enterprise-grade ML lifecycle management and integrates tightly with the Microsoft ecosystem.
Deploying to Azure ML
"""
Deploy model to Azure ML
"""
from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
# 1. Connect to workspace
ws = Workspace.from_config()
# 2. Register model
model = Model.register(
workspace=ws,
model_name='iris-classifier',
model_path='model.joblib',
description='Iris classification model'
)
# 3. Create inference environment
env = Environment.from_conda_specification(
name='sklearn-env',
file_path='conda_env.yml'
)
# 4. Create scoring script
"""
# score.py
import json
import joblib
import numpy as np
from azureml.core.model import Model
def init():
    global model
    model_path = Model.get_model_path('iris-classifier')
    model = joblib.load(model_path)

def run(raw_data):
    data = json.loads(raw_data)
    input_data = np.array(data['data'])
    predictions = model.predict(input_data)
    return predictions.tolist()
"""
# 5. Configure inference
inference_config = InferenceConfig(
entry_script='score.py',
environment=env
)
# 6. Deploy to Azure Container Instances (dev/test)
aci_config = AciWebservice.deploy_configuration(
cpu_cores=1,
memory_gb=1,
auth_enabled=True
)
service = Model.deploy(
workspace=ws,
name='iris-classifier-service',
models=[model],
inference_config=inference_config,
deployment_config=aci_config
)
service.wait_for_deployment(show_output=True)
# 7. Get scoring URI
print(f"Scoring URI: {service.scoring_uri}")
# 8. Test endpoint (auth_enabled=True, so include the service key)
import requests
primary_key, secondary_key = service.get_keys()
headers = {
'Content-Type': 'application/json',
'Authorization': f'Bearer {primary_key}'
}
data = {'data': [[5.1, 3.5, 1.4, 0.2]]}
response = requests.post(
service.scoring_uri,
json=data,
headers=headers
)
print(response.json())
Deploy to Azure Kubernetes Service (Production)
from azureml.core.webservice import AksWebservice
from azureml.core.compute import AksCompute, ComputeTarget
# 1. Create or attach to AKS cluster
aks_name = 'ml-aks-cluster'
if aks_name not in ws.compute_targets:
    # Create new AKS cluster
    prov_config = AksCompute.provisioning_configuration(
        agent_count=3,
        vm_size='Standard_D3_v2',
        location='eastus'
    )
    aks_target = ComputeTarget.create(
        workspace=ws,
        name=aks_name,
        provisioning_configuration=prov_config
    )
    aks_target.wait_for_completion(show_output=True)
else:
    aks_target = ws.compute_targets[aks_name]
# 2. Configure deployment
aks_config = AksWebservice.deploy_configuration(
autoscale_enabled=True,
autoscale_min_replicas=2,
autoscale_max_replicas=10,
cpu_cores=2,
memory_gb=4,
auth_enabled=True,
enable_app_insights=True # Monitoring
)
# 3. Deploy
service = Model.deploy(
workspace=ws,
name='iris-classifier-aks',
models=[model],
inference_config=inference_config,
deployment_config=aks_config,
deployment_target=aks_target
)
service.wait_for_deployment(show_output=True)
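Because auth_enabled=True, calls to the AKS service need a key. A quick test using the SDK's own helpers, following the same pattern as the ACI test above:

```python
import json

# Webservice.run() handles the scoring URI and auth key for you
result = service.run(input_data=json.dumps({'data': [[5.1, 3.5, 1.4, 0.2]]}))
print(result)

# Or fetch the keys to call the scoring URI from any HTTP client
primary_key, secondary_key = service.get_keys()
```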
Managed Online Endpoints (Recommended)
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential
# New Azure ML SDK v2 approach
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
# Create endpoint
endpoint = ManagedOnlineEndpoint(
name='iris-endpoint',
description='Iris classification endpoint'
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
# Create deployment
deployment = ManagedOnlineDeployment(
name='blue',
endpoint_name='iris-endpoint',
model=model, # a registered model: an SDK v2 Model entity or an 'azureml:<name>:<version>' ID
instance_type='Standard_DS2_v2',
instance_count=2
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
# Route 100% of traffic to the 'blue' deployment
endpoint.traffic = {'blue': 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
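To test the managed endpoint, the v2 SDK can invoke it directly from a JSON request file. A short sketch; the file name and payload format here are hypothetical and must match whatever your scoring script expects:

```python
import json

# Hypothetical request file with the payload format your scoring script expects
with open('sample-request.json', 'w') as f:
    json.dump({'data': [[5.1, 3.5, 1.4, 0.2]]}, f)

response = ml_client.online_endpoints.invoke(
    endpoint_name='iris-endpoint',
    deployment_name='blue',
    request_file='sample-request.json'
)
print(response)
```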
⚡ Serverless ML Deployment
When to Use Serverless
- Sporadic traffic: Requests come in bursts
- Cost-sensitive: Don't want idle server costs
- Simple models: Fast inference (< 30 seconds)
- Auto-scaling: Unpredictable load patterns
AWS Lambda with Container
# Dockerfile for Lambda
FROM public.ecr.aws/lambda/python:3.10
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and code
COPY model.joblib .
COPY lambda_function.py .
# Lambda handler
CMD ["lambda_function.handler"]
# lambda_function.py
import json
import joblib
import numpy as np
# Load model once (cold start)
model = joblib.load('model.joblib')
def handler(event, context):
    """AWS Lambda handler"""
    try:
        # Parse input
        body = json.loads(event['body'])
        features = np.array([body['features']])

        # Predict
        prediction = model.predict(features)[0]
        probabilities = model.predict_proba(features)[0]

        return {
            'statusCode': 200,
            'body': json.dumps({
                'prediction': int(prediction),
                'confidence': float(max(probabilities))
            })
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }
# Deploy to Lambda
# 1. Build and push to ECR
docker build -t ml-lambda .
docker tag ml-lambda:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest
# 2. Create Lambda function via AWS CLI
aws lambda create-function \
--function-name ml-predictor \
--package-type Image \
--code ImageUri=123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest \
--role arn:aws:iam::123456789:role/lambda-execution-role \
--timeout 30 \
--memory-size 1024
# 3. Create API Gateway trigger (via console or CLI)
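Before wiring up API Gateway, you can smoke-test the function directly. A minimal sketch with boto3; the payload mimics the API Gateway proxy event shape the handler parses:

```python
import boto3
import json

lambda_client = boto3.client('lambda')

# Mimic the API Gateway proxy event the handler expects (event['body'])
event = {'body': json.dumps({'features': [5.1, 3.5, 1.4, 0.2]})}

response = lambda_client.invoke(
    FunctionName='ml-predictor',
    Payload=json.dumps(event)
)
print(json.loads(response['Payload'].read()))
```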
Lambda Performance Optimization
# Optimize cold starts
import sys
sys.path.insert(0, '/opt/python') # Lambda layers
# Use Lambda layers for heavy dependencies
# Layer 1: numpy, scipy (shared across functions)
# Layer 2: scikit-learn
# Function code: just your model and handler
# Provisioned concurrency (keeps instances warm)
# aws lambda put-provisioned-concurrency-config \
# --function-name ml-predictor \
# --provisioned-concurrent-executions 5
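The same provisioned-concurrency setting can be applied from Python. Note that it must target a published version or alias, not $LATEST; the alias name below is a hypothetical example:

```python
import boto3

lambda_client = boto3.client('lambda')

# Provisioned concurrency requires a published version or alias (here: 'prod')
lambda_client.put_provisioned_concurrency_config(
    FunctionName='ml-predictor',
    Qualifier='prod',
    ProvisionedConcurrentExecutions=5
)
```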
⚠️ Serverless Limitations:
- 15-minute maximum execution time (Lambda)
- Cold start latency (500ms - 5s)
- Memory limits (up to 10GB Lambda)
- No GPU support
- Not cost-effective for constant high traffic
💰 Cost Optimization Strategies
1. Right-Sizing Instances
| Instance Type | Use Case | Approx. On-Demand Cost (AWS, per month) |
|---|---|---|
| t3.medium | Dev/test, low traffic | $30 |
| m5.large | Production, CPU models | $70 |
| c5.xlarge | Compute-intensive | $125 |
| g4dn.xlarge | GPU inference | $390 |
| p3.2xlarge | Heavy GPU workloads | $2,200 |
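When deciding between an always-on instance and serverless, a quick break-even estimate helps. A rough sketch with illustrative placeholder prices - plug in current numbers for your region before relying on it:

```python
# Back-of-the-envelope break-even between serverless and an always-on endpoint.
# Prices are illustrative placeholders - check current pricing for your region.
LAMBDA_PER_REQUEST = 0.20 / 1_000_000      # request charge (USD)
LAMBDA_PER_GB_SECOND = 0.0000166667        # compute charge (USD)
MEMORY_GB = 1.0
AVG_DURATION_S = 0.2                       # 200 ms per prediction
ALWAYS_ON_MONTHLY = 70.0                   # e.g. one m5.large from the table above

cost_per_invocation = LAMBDA_PER_REQUEST + MEMORY_GB * AVG_DURATION_S * LAMBDA_PER_GB_SECOND
break_even = ALWAYS_ON_MONTHLY / cost_per_invocation

print(f"Serverless cost per prediction: ${cost_per_invocation:.7f}")
print(f"Break-even: ~{break_even:,.0f} predictions/month before the instance is cheaper")
```

With these placeholder numbers the crossover sits around 20 million predictions per month; below that, serverless tends to win on cost.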
2. Auto-Scaling Configuration
# AWS SageMaker auto-scaling
import boto3
client = boto3.client('application-autoscaling')
# Register scalable target
client.register_scalable_target(
ServiceNamespace='sagemaker',
ResourceId='endpoint/iris-classifier/variant/AllTraffic',
ScalableDimension='sagemaker:variant:DesiredInstanceCount',
MinCapacity=1,
MaxCapacity=10
)
# Target tracking scaling policy
client.put_scaling_policy(
PolicyName='scale-on-invocations',
ServiceNamespace='sagemaker',
ResourceId='endpoint/iris-classifier/variant/AllTraffic',
ScalableDimension='sagemaker:variant:DesiredInstanceCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 1000.0, # Target ~1000 invocations per minute per instance
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
},
'ScaleInCooldown': 300,
'ScaleOutCooldown': 60
}
)
3. Spot Instances for Batch Workloads
# Use spot instances (up to 90% cheaper!)
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
entry_point='train.py',
role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
framework_version='2.12',
py_version='py310',
instance_type='ml.p3.2xlarge',
instance_count=4,
use_spot_instances=True, # Enable spot
max_wait=7200, # Max total wait (must be >= max_run)
max_run=3600 # Max training time
)
estimator.fit() # Will use spot instances when available
4. Model Optimization Techniques
# Quantization (reduce model size & speed up)
import tensorflow as tf
# Convert to TensorFlow Lite with quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
# Result: 4x smaller model, 2-3x faster inference
# PyTorch quantization
import torch
model = torch.load('model.pth')
model.eval()
# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
model,
{torch.nn.Linear},
dtype=torch.qint8
)
# 4x smaller, similar accuracy
5. Caching Strategies
# Redis caching (prevent redundant predictions)
import redis
import hashlib
import json
redis_client = redis.from_url('redis://cache:6379')
def predict_with_cache(features):
    # Create cache key from the raw feature values
    key = hashlib.md5(str(features).encode()).hexdigest()

    # Check cache
    cached = redis_client.get(f'pred:{key}')
    if cached:
        return json.loads(cached)

    # Make prediction (assumes a model loaded elsewhere in the app)
    result = model.predict(features).tolist()

    # Cache for 1 hour
    redis_client.setex(f'pred:{key}', 3600, json.dumps(result))
    return result
# Can reduce costs by 50%+ for repeated queries!
Cost Monitoring
# AWS Cost Explorer API
import boto3
from datetime import datetime, timedelta
ce = boto3.client('ce')
# Get costs for SageMaker
response = ce.get_cost_and_usage(
TimePeriod={
'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
'End': datetime.now().strftime('%Y-%m-%d')
},
Granularity='DAILY',
Metrics=['UnblendedCost'],
Filter={
'Dimensions': {
'Key': 'SERVICE',
'Values': ['Amazon SageMaker']
}
}
)
for result in response['ResultsByTime']:
    print(f"{result['TimePeriod']['Start']}: ${result['Total']['UnblendedCost']['Amount']}")
🎯 Summary
You've mastered cloud deployment for ML models:
AWS SageMaker
Most comprehensive platform with serverless and real-time options
Google Cloud
Simplest deployment with TPU support
Azure ML
Enterprise features with Microsoft integration
Serverless
Cost-effective for sporadic workloads
Cost Optimization
Right-sizing, auto-scaling, caching, quantization
Monitoring
Track costs and performance continuously
Key Takeaways
- Choose cloud provider based on your tech stack and requirements
- Use managed services (SageMaker, Vertex AI) for easier deployment
- Implement auto-scaling to handle traffic spikes
- Consider serverless for sporadic workloads
- Optimize costs with spot instances, caching, and quantization
- Monitor costs continuously - cloud bills can surprise you
- Start small, scale based on actual traffic patterns
🚀 Next Steps:
Your models are in the cloud! Next, you'll learn model serving frameworks like BentoML and TorchServe for production-grade inference at scale, with advanced features like A/B testing and canary deployments.
Test Your Knowledge
Q1: What's the main advantage of using managed ML platforms like SageMaker over deploying containers yourself?
Q2: When should you use serverless inference (like AWS Lambda)?
Q3: Which cloud platform is generally best for TensorFlow models?
Q4: What's a good cost optimization strategy for ML inference?
Q5: What's the purpose of A/B testing in model deployment?