
Cloud Deployment

Deploy ML models to AWS SageMaker, Google Cloud AI Platform, and Azure ML. Master serverless ML and cost-optimization strategies.

📅 Tutorial 6 📊 Intermediate


☁️ Why Cloud Deployment?

You've containerized your ML model with Docker. Now you need to deploy it to production where millions of users can access it. Running on your laptop isn't an option. A single server won't handle the traffic. You need scalability, reliability, and global reach.

Cloud platforms solve this by providing managed infrastructure, automatic scaling, load balancing, monitoring, and deployment tools specifically designed for ML workloads.

  • 📈 Auto-Scaling: Handle traffic spikes automatically; scale from 10 to 10,000 requests/second
  • 🌍 Global Reach: Deploy to multiple regions for low latency worldwide
  • 🛡️ Reliability: 99.99% uptime SLAs with automatic failover
  • 💰 Cost Efficiency: Pay only for what you use, with no idle server costs

⚠️ Cloud Complexity: While powerful, cloud platforms have steep learning curves. This tutorial focuses on practical deployment patterns you'll actually use in production.

🏢 Cloud Platform Comparison

| Feature | AWS SageMaker | Google Cloud AI | Azure ML |
|---|---|---|---|
| Market Share | ✅ #1 (32%) | #3 (10%) | #2 (23%) |
| ML Services | Most comprehensive | ✅ TensorFlow native | Best for .NET |
| Ease of Use | Moderate | ✅ Easiest | Moderate |
| Pricing | Competitive | ✅ Cheapest compute | Competitive |
| GPU Options | ✅ Most variety | TPU available | Good variety |
| Enterprise | ✅ Most mature | Strong | ✅ Best MS integration |
| Deployment Speed | Fast | ✅ Fastest | Fast |

💡 Which to Choose?

  • AWS: If you need maximum flexibility and already use AWS
  • Google Cloud: If you use TensorFlow or want simplicity
  • Azure: If you're in the Microsoft ecosystem
  • Multi-cloud: Abstract with Kubernetes for portability

🟠 AWS SageMaker Deployment

What is SageMaker?

Amazon SageMaker is a fully managed ML platform that handles training, tuning, and deployment. It supports all major frameworks (scikit-learn, PyTorch, TensorFlow, XGBoost) as well as custom containers.

Deployment Options

  1. Real-time endpoints: Persistent, low-latency predictions
  2. Batch transform: Process large datasets offline (a quick sketch follows this list)
  3. Serverless inference: Auto-scaling, pay-per-invocation
  4. Asynchronous inference: Long-running, queued requests
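
Batch transform is worth a quick sketch before we dive into real-time endpoints: it spins up a transient fleet of instances, runs predictions over a dataset in S3, writes the results back to S3, and then shuts everything down, so there is no endpoint left running. A minimal sketch, assuming the SKLearnModel object (sklearn_model) built in the next section and placeholder S3 paths:

# Batch transform: offline predictions over a dataset in S3 (no persistent endpoint)
transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://your-bucket/batch-predictions/'  # placeholder output path
)

transformer.transform(
    data='s3://your-bucket/batch-input/data.csv',  # placeholder input path
    content_type='text/csv',
    split_type='Line'  # treat each line of the CSV as one record
)
transformer.wait()  # you are billed only while the job runs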

Deploying a Scikit-learn Model

"""
Deploy sklearn model to SageMaker
"""
import sagemaker
from sagemaker.sklearn import SKLearnModel

# 1. Save model with entry point script
# model.py (inference script)
"""
import joblib
import numpy as np

def model_fn(model_dir):
    '''Load model from directory'''
    model = joblib.load(f'{model_dir}/model.joblib')
    return model

def predict_fn(input_data, model):
    '''Make predictions'''
    return model.predict(input_data)
"""

# 2. Upload model to S3
session = sagemaker.Session()
bucket = session.default_bucket()

# Upload model artifact
model_data = session.upload_data(
    path='model.tar.gz',  # tar.gz with model.joblib and model.py
    bucket=bucket,
    key_prefix='sklearn-model'
)

# 3. Create SageMaker model
sklearn_model = SKLearnModel(
    model_data=model_data,
    role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
    entry_point='model.py',
    framework_version='1.0-1',
    py_version='py3'
)

# 4. Deploy to endpoint
predictor = sklearn_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='iris-classifier'
)

# 5. Make predictions
import numpy as np
data = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = predictor.predict(data)
print(f"Prediction: {prediction}")

# 6. Clean up (delete endpoint to stop billing!)
predictor.delete_endpoint()
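
Client applications that don't have the SageMaker SDK installed can call the same endpoint (while it still exists) through the low-level boto3 runtime API, for example from a web backend. A quick sketch; the scikit-learn serving container's default input handler accepts CSV:

import boto3

runtime = boto3.client('sagemaker-runtime')

# Send one row as CSV to the endpoint deployed above
response = runtime.invoke_endpoint(
    EndpointName='iris-classifier',
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2'
)
print(response['Body'].read().decode())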

Deploying Docker Container to SageMaker

"""
Deploy custom Docker container to SageMaker
"""
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor

# 1. Build and push Docker image to ECR
"""
# Build image
docker build -t ml-api:latest .

# Tag for ECR
docker tag ml-api:latest \
  123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest

# Login to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.us-east-1.amazonaws.com

# Push image
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest
"""

# 2. Create SageMaker model from container
model = Model(
    image_uri='123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest',
    model_data='s3://bucket/model.tar.gz',  # Optional
    role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
    predictor_cls=Predictor  # so deploy() returns a Predictor we can call
)

# 3. Deploy (JSON serialization lets us send and receive plain dicts)
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',  # GPU instance
    endpoint_name='ml-api-gpu',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

# 4. Invoke endpoint
response = predictor.predict({
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2
})

print(response)

Serverless Inference (Cost-Optimized)

from sagemaker.serverless import ServerlessInferenceConfig

# Deploy with serverless config
predictor = sklearn_model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,  # 2 GB (allowed range: 1 GB - 6 GB)
        max_concurrency=10       # Max concurrent invocations
    )
)

# Automatically scales to zero when not in use!
# Pay only per invocation
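
Whether serverless actually saves money depends on traffic volume. A rough back-of-the-envelope comparison; the prices below are illustrative placeholders, not current AWS rates, so substitute real pricing before deciding:

# Rough break-even estimate: serverless vs. an always-on endpoint instance.
# All prices are illustrative placeholders; check current AWS pricing.
ENDPOINT_HOURLY = 0.115          # assumed $/hour for an always-on instance
SERVERLESS_PER_SECOND = 0.00008  # assumed $/second of 2 GB serverless compute
AVG_LATENCY_S = 0.2              # assumed average inference time per request

monthly_endpoint_cost = ENDPOINT_HOURLY * 24 * 30

def serverless_monthly_cost(requests_per_month: int) -> float:
    return requests_per_month * AVG_LATENCY_S * SERVERLESS_PER_SECOND

for reqs in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{reqs:>10,} req/month: serverless ~ ${serverless_monthly_cost(reqs):8.2f} "
          f"vs always-on ~ ${monthly_endpoint_cost:.2f}")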

✅ SageMaker Benefits:

  • Automatic scaling and load balancing
  • Built-in A/B testing and traffic routing (see the traffic-shifting sketch below)
  • Model monitoring and drift detection
  • One-click rollback to previous versions
  • Integration with the AWS ecosystem
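
The built-in traffic routing works through production variants on a single endpoint: deploy two model versions as separate variants, then shift request weights between them. A minimal sketch with the low-level boto3 API; the endpoint and variant names are placeholders and assume an endpoint created with two variants:

import boto3

sm = boto3.client('sagemaker')

# Assumes an endpoint with two production variants, e.g. 'ModelA' (current)
# and 'ModelB' (candidate)
sm.update_endpoint_weights_and_capacities(
    EndpointName='iris-classifier',
    DesiredWeightsAndCapacities=[
        {'VariantName': 'ModelA', 'DesiredWeight': 80.0},  # 80% of traffic
        {'VariantName': 'ModelB', 'DesiredWeight': 20.0},  # 20% canary traffic
    ]
)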

🔵 Google Cloud AI Platform

Vertex AI Overview

Vertex AI is Google's unified ML platform. It's particularly strong for TensorFlow models and offers TPU acceleration.

Deploying to Vertex AI

"""
Deploy model to Google Cloud Vertex AI
"""
from google.cloud import aiplatform

# Initialize
aiplatform.init(
    project='your-project-id',
    location='us-central1'
)

# 1. Upload model to Google Cloud Storage
# gsutil cp model.joblib gs://your-bucket/models/

# 2. Create model
model = aiplatform.Model.upload(
    display_name='iris-classifier',
    artifact_uri='gs://your-bucket/models/',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest'
)

# 3. Deploy to endpoint
endpoint = model.deploy(
    machine_type='n1-standard-2',
    min_replica_count=1,
    max_replica_count=5,
    traffic_split={'0': 100}  # 100% traffic to this version
)

# 4. Make prediction
prediction = endpoint.predict(
    instances=[[5.1, 3.5, 1.4, 0.2]]
)
print(prediction.predictions)

# 5. Update with new model version (A/B testing)
new_model = aiplatform.Model.upload(
    display_name='iris-classifier-v2',
    artifact_uri='gs://your-bucket/models-v2/',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest'
)

# Deploy alongside the existing model. In traffic_split, the key '0' refers to
# the model being deployed in this call; other keys must be the IDs of models
# already deployed on the endpoint.
endpoint.deploy(
    model=new_model,
    machine_type='n1-standard-2',
    traffic_split={'0': 20, 'EXISTING_DEPLOYED_MODEL_ID': 80}  # 20% new, 80% old
)
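
As with SageMaker, a deployed Vertex AI endpoint bills per replica-hour even when idle, so tear it down once you're done experimenting. A quick cleanup sketch:

# Tear down to stop billing: undeploy all models, then delete the endpoint
endpoint.undeploy_all()
endpoint.delete()

# The uploaded model resources can also be removed if no longer needed
model.delete()
new_model.delete()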

Deploying Custom Container

# 1. Build and push to Google Container Registry:
#    gcloud builds submit --tag gcr.io/your-project/ml-api:v1

# 2. Create custom prediction container
# Must respond to health checks and prediction requests
# prediction.py (FastAPI app for Vertex AI)
from fastapi import FastAPI, Request
import joblib
import numpy as np

app = FastAPI()
model = joblib.load('model.joblib')

# Health check (required by Vertex AI)
@app.get('/health')
def health():
    return {'status': 'healthy'}

# Prediction endpoint (required format)
@app.post('/predict')
async def predict(request: Request):
    data = await request.json()
    instances = np.array(data['instances'])
    predictions = model.predict(instances)
    return {'predictions': predictions.tolist()}

# Run with: uvicorn prediction:app --host 0.0.0.0 --port 8080

# Deploy the custom container with the Vertex AI SDK
from google.cloud import aiplatform

model = aiplatform.Model.upload(
    display_name='custom-ml-api',
    serving_container_image_uri='gcr.io/your-project/ml-api:v1',
    serving_container_predict_route='/predict',
    serving_container_health_route='/health',
    serving_container_ports=[8080]
)

endpoint = model.deploy(machine_type='n1-standard-4')

Batch Predictions

# Process large datasets offline
batch_prediction_job = model.batch_predict(
    job_display_name='iris-batch-prediction',
    gcs_source='gs://your-bucket/input-data.csv',
    gcs_destination_prefix='gs://your-bucket/predictions/',
    instances_format='csv',  # input file is CSV (default is JSONL)
    machine_type='n1-standard-4',
    starting_replica_count=5,
    max_replica_count=10
)

# Wait for completion
batch_prediction_job.wait()
print(f"Output: {batch_prediction_job.output_info}")

💡 Google Cloud Advantages: Simplest deployment process, excellent documentation, TPU support for deep learning, tight integration with TensorFlow, and a generous free tier.

🔷 Azure Machine Learning

Azure ML Service Overview

Azure ML provides enterprise-grade ML lifecycle management with deep integration into the Microsoft ecosystem.

Deploying to Azure ML

"""
Deploy model to Azure ML
"""
from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# 1. Connect to workspace
ws = Workspace.from_config()

# 2. Register model
model = Model.register(
    workspace=ws,
    model_name='iris-classifier',
    model_path='model.joblib',
    description='Iris classification model'
)

# 3. Create inference environment
env = Environment.from_conda_specification(
    name='sklearn-env',
    file_path='conda_env.yml'
)

# 4. Create scoring script
"""
# score.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

def init():
    global model
    model_path = Model.get_model_path('iris-classifier')
    model = joblib.load(model_path)

def run(raw_data):
    data = json.loads(raw_data)
    input_data = np.array(data['data'])
    predictions = model.predict(input_data)
    return predictions.tolist()
"""

# 5. Configure inference
inference_config = InferenceConfig(
    entry_script='score.py',
    environment=env
)

# 6. Deploy to Azure Container Instances (dev/test)
aci_config = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    auth_enabled=True
)

service = Model.deploy(
    workspace=ws,
    name='iris-classifier-service',
    models=[model],
    inference_config=inference_config,
    deployment_config=aci_config
)

service.wait_for_deployment(show_output=True)

# 7. Get scoring URI
print(f"Scoring URI: {service.scoring_uri}")

# 8. Test endpoint (auth is enabled, so include the service key)
import requests

key, _ = service.get_keys()
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {key}'
}
data = {'data': [[5.1, 3.5, 1.4, 0.2]]}

response = requests.post(
    service.scoring_uri,
    json=data,
    headers=headers
)
print(response.json())

Deploy to Azure Kubernetes Service (Production)

from azureml.core.webservice import AksWebservice
from azureml.core.compute import AksCompute, ComputeTarget

# 1. Create or attach to AKS cluster
aks_name = 'ml-aks-cluster'

if aks_name not in ws.compute_targets:
    # Create new AKS cluster
    prov_config = AksCompute.provisioning_configuration(
        agent_count=3,
        vm_size='Standard_D3_v2',
        location='eastus'
    )
    
    aks_target = ComputeTarget.create(
        workspace=ws,
        name=aks_name,
        provisioning_configuration=prov_config
    )
    aks_target.wait_for_completion(show_output=True)
else:
    aks_target = ws.compute_targets[aks_name]

# 2. Configure deployment
aks_config = AksWebservice.deploy_configuration(
    autoscale_enabled=True,
    autoscale_min_replicas=2,
    autoscale_max_replicas=10,
    cpu_cores=2,
    memory_gb=4,
    auth_enabled=True,
    enable_app_insights=True  # Monitoring
)

# 3. Deploy
service = Model.deploy(
    workspace=ws,
    name='iris-classifier-aks',
    models=[model],
    inference_config=inference_config,
    deployment_config=aks_config,
    deployment_target=aks_target
)

service.wait_for_deployment(show_output=True)

Managed Online Endpoints (Recommended)

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# New Azure ML SDK v2 approach
ml_client = MLClient.from_config()

# Create endpoint
endpoint = ManagedOnlineEndpoint(
    name='iris-endpoint',
    description='Iris classification endpoint'
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()  # wait for completion

# Create deployment
# Note: `model` must be a model asset registered with the v2 SDK
# (e.g. the ID string 'azureml:iris-classifier:1'), not the v1 Model object above.
deployment = ManagedOnlineDeployment(
    name='blue',
    endpoint_name='iris-endpoint',
    model='azureml:iris-classifier:1',
    instance_type='Standard_DS2_v2',
    instance_count=2
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Set 100% of traffic to this deployment
endpoint.traffic = {'blue': 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
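
To smoke-test the managed endpoint, the v2 SDK can invoke it directly with a JSON request file. The file name here is an assumption; its contents must match what score.py expects:

# sample_request.json might look like: {"data": [[5.1, 3.5, 1.4, 0.2]]}
response = ml_client.online_endpoints.invoke(
    endpoint_name='iris-endpoint',
    deployment_name='blue',
    request_file='sample_request.json'
)
print(response)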

⚡ Serverless ML Deployment

When to Use Serverless

  • Sporadic traffic: Requests come in bursts
  • Cost-sensitive: Don't want idle server costs
  • Simple models: Fast inference (< 30 seconds)
  • Auto-scaling: Unpredictable load patterns

AWS Lambda with Container

# Dockerfile for Lambda
FROM public.ecr.aws/lambda/python:3.10

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.joblib .
COPY lambda_function.py .

# Lambda handler entry point
CMD ["lambda_function.handler"]

# lambda_function.py
import json
import joblib
import numpy as np

# Load model once (cold start)
model = joblib.load('model.joblib')

def handler(event, context):
    """AWS Lambda handler"""
    try:
        # Parse input
        body = json.loads(event['body'])
        features = np.array([body['features']])
        
        # Predict
        prediction = model.predict(features)[0]
        probabilities = model.predict_proba(features)[0]
        
        return {
            'statusCode': 200,
            'body': json.dumps({
                'prediction': int(prediction),
                'confidence': float(max(probabilities))
            })
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }

# Deploy to Lambda
# 1. Build and push the container image to ECR
docker build -t ml-lambda .
docker tag ml-lambda:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest

# 2. Create Lambda function via AWS CLI
aws lambda create-function \
  --function-name ml-predictor \
  --package-type Image \
  --code ImageUri=123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest \
  --role arn:aws:iam::123456789:role/lambda-execution-role \
  --timeout 30 \
  --memory-size 1024

# 3. Create API Gateway trigger (via console or CLI)
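
Before wiring up API Gateway, you can sanity-check the function directly. A sketch using boto3 that matches the event shape the handler above expects:

import json

import boto3

lambda_client = boto3.client('lambda')

# The handler reads event['body'] as a JSON string, so wrap the payload accordingly
event = {'body': json.dumps({'features': [5.1, 3.5, 1.4, 0.2]})}

response = lambda_client.invoke(
    FunctionName='ml-predictor',
    Payload=json.dumps(event)
)
print(json.loads(response['Payload'].read()))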

Lambda Performance Optimization

# Optimize cold starts

# 1. Use Lambda layers for heavy dependencies. Layers are mounted under
#    /opt/python, which Lambda already adds to sys.path:
#    Layer 1: numpy, scipy (shared across functions)
#    Layer 2: scikit-learn
#    Function code: just your model and handler

# 2. Provisioned concurrency keeps instances warm (requires a published
#    version or alias as the qualifier):
# aws lambda put-provisioned-concurrency-config \
#   --function-name ml-predictor \
#   --qualifier 1 \
#   --provisioned-concurrent-executions 5

⚠️ Serverless Limitations:

  • 15-minute maximum execution time (Lambda)
  • Cold start latency (500ms - 5s)
  • Memory limits (up to 10GB Lambda)
  • No GPU support
  • Not cost-effective for constant high traffic

💰 Cost Optimization Strategies

1. Right-Sizing Instances

| Instance Type | Use Case | Approx. Cost (AWS, per month) |
|---|---|---|
| t3.medium | Dev/test, low traffic | $30 |
| m5.large | Production, CPU models | $70 |
| c5.xlarge | Compute-intensive | $125 |
| g4dn.xlarge | GPU inference | $390 |
| p3.2xlarge | Heavy GPU workloads | $2,200 |
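
Right-sizing should be driven by measured utilization rather than guesswork. A sketch that pulls average CPU utilization for a SageMaker endpoint from CloudWatch (endpoint and variant names are placeholders); consistently low numbers suggest a smaller instance type:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Hardware metrics for endpoints live in the /aws/sagemaker/Endpoints namespace
stats = cloudwatch.get_metric_statistics(
    Namespace='/aws/sagemaker/Endpoints',
    MetricName='CPUUtilization',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'iris-classifier'},
        {'Name': 'VariantName', 'Value': 'AllTraffic'},
    ],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,           # hourly data points
    Statistics=['Average']
)

for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(f"{point['Timestamp']:%Y-%m-%d %H:%M}  avg CPU: {point['Average']:.1f}%")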

2. Auto-Scaling Configuration

# AWS SageMaker auto-scaling
import boto3

client = boto3.client('application-autoscaling')

# Register scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/iris-classifier/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)

# Target tracking scaling policy
client.put_scaling_policy(
    PolicyName='scale-on-invocations',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/iris-classifier/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000.0,  # Target ~1000 invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)

3. Spot Instances for Batch Workloads

# Use spot instances (up to 90% cheaper!)
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',
    role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
    framework_version='2.12',
    py_version='py310',
    instance_type='ml.p3.2xlarge',
    instance_count=4,
    use_spot_instances=True,  # Enable spot instances
    max_wait=7200,            # Max total wait time in seconds (must be >= max_run)
    max_run=3600              # Max training time in seconds
)

estimator.fit('s3://your-bucket/training-data/')  # Uses spot capacity when available

4. Model Optimization Techniques

# Quantization (reduce model size & speed up)
import tensorflow as tf

# Convert to TensorFlow Lite with quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Result: 4x smaller model, 2-3x faster inference

# PyTorch quantization
import torch

model = torch.load('model.pth')
model.eval()

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# 4x smaller, similar accuracy

5. Caching Strategies

# Redis caching (prevent redundant predictions)
import redis
import hashlib
import json

redis_client = redis.from_url('redis://cache:6379')

def predict_with_cache(features):
    # Create cache key
    key = hashlib.md5(str(features).encode()).hexdigest()
    
    # Check cache
    cached = redis_client.get(f'pred:{key}')
    if cached:
        return json.loads(cached)
    
    # Make prediction (model is assumed to be a globally loaded estimator)
    result = model.predict(features).tolist()  # convert ndarray so it is JSON-serializable
    
    # Cache for 1 hour
    redis_client.setex(f'pred:{key}', 3600, json.dumps(result))
    
    return result

# Can reduce costs by 50%+ for repeated queries!

Cost Monitoring

# AWS Cost Explorer API
import boto3
from datetime import datetime, timedelta

ce = boto3.client('ce')

# Get costs for SageMaker
response = ce.get_cost_and_usage(
    TimePeriod={
        'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
        'End': datetime.now().strftime('%Y-%m-%d')
    },
    Granularity='DAILY',
    Metrics=['UnblendedCost'],
    Filter={
        'Dimensions': {
            'Key': 'SERVICE',
            'Values': ['Amazon SageMaker']
        }
    }
)

for result in response['ResultsByTime']:
    print(f"{result['TimePeriod']['Start']}: ${result['Total']['UnblendedCost']['Amount']}")

🎯 Summary

You've mastered cloud deployment for ML models:

  • 🟠 AWS SageMaker: Most comprehensive platform, with serverless and real-time options
  • 🔵 Google Cloud (Vertex AI): Simplest deployment, with TPU support
  • 🔷 Azure ML: Enterprise features with Microsoft integration
  • ⚡ Serverless: Cost-effective for sporadic workloads
  • 💰 Cost Optimization: Right-sizing, auto-scaling, caching, quantization
  • 📊 Monitoring: Track costs and performance continuously

Key Takeaways

  1. Choose cloud provider based on your tech stack and requirements
  2. Use managed services (SageMaker, Vertex AI) for easier deployment
  3. Implement auto-scaling to handle traffic spikes
  4. Consider serverless for sporadic workloads
  5. Optimize costs with spot instances, caching, and quantization
  6. Monitor costs continuously - cloud bills can surprise you
  7. Start small, scale based on actual traffic patterns

🚀 Next Steps:

Your models are in the cloud! Next, you'll learn model serving frameworks like BentoML and TorchServe for production-grade inference at scale, with advanced features like A/B testing and canary deployments.

Test Your Knowledge

Q1: What's the main advantage of using managed ML platforms like SageMaker over deploying containers yourself?

  • They're always cheaper
  • They're faster
  • They handle auto-scaling, monitoring, and deployment infrastructure automatically
  • They're more secure

Q2: When should you use serverless inference (like AWS Lambda)?

  • For all ML deployments, always
  • For sporadic traffic and cost-sensitive applications with fast inference times
  • For GPU-intensive workloads
  • For long-running batch jobs

Q3: Which cloud platform is generally best for TensorFlow models?

  • Google Cloud (Vertex AI), with native TensorFlow support and TPUs
  • AWS, always
  • Azure, for everything
  • They're all exactly the same

Q4: What's a good cost optimization strategy for ML inference?

  • Always use the largest instances
  • Never use auto-scaling
  • Deploy to all regions
  • Implement caching, use auto-scaling, consider spot instances for batch jobs, and quantize models

Q5: What's the purpose of A/B testing in model deployment?

  • To make deployment faster
  • To reduce costs
  • To gradually roll out new model versions and compare performance with the old version
  • To test different cloud providers