
Cloud Deployment

Deploy ML models to AWS SageMaker, Google Cloud AI Platform, and Azure ML. Master serverless ML and cost-optimization strategies.

📅 Tutorial 6 📊 Intermediate


☁️ Why Cloud Deployment?

You've containerized your ML model with Docker. Now you need to deploy it to production where millions of users can access it. Running on your laptop isn't an option. A single server won't handle the traffic. You need scalability, reliability, and global reach.

Cloud platforms solve this by providing managed infrastructure, automatic scaling, load balancing, monitoring, and deployment tools specifically designed for ML workloads.

  • 📈 Auto-Scaling: Handle traffic spikes automatically; scale from 10 to 10,000 requests/second
  • 🌍 Global Reach: Deploy to multiple regions for low latency worldwide
  • 🛡️ Reliability: 99.99% uptime SLAs with automatic failover
  • 💰 Cost Efficiency: Pay only for what you use, with no idle server costs

⚠️ Cloud Complexity: While powerful, cloud platforms have steep learning curves. This tutorial focuses on practical deployment patterns you'll actually use in production.

🏢 Cloud Platform Comparison

| Feature | AWS SageMaker | Google Cloud AI | Azure ML |
|---|---|---|---|
| Market Share | ✅ #1 (32%) | #3 (10%) | #2 (23%) |
| ML Services | Most comprehensive | ✅ TensorFlow native | Best for .NET |
| Ease of Use | Moderate | ✅ Easiest | Moderate |
| Pricing | Competitive | ✅ Cheapest compute | Competitive |
| GPU Options | ✅ Most variety | TPU available | Good variety |
| Enterprise | ✅ Most mature | Strong | ✅ Best MS integration |
| Deployment Speed | Fast | ✅ Fastest | Fast |

💡 Which to Choose?

  • AWS: If you need maximum flexibility and already use AWS
  • Google Cloud: If you use TensorFlow or want simplicity
  • Azure: If you're in the Microsoft ecosystem
  • Multi-cloud: Abstract with Kubernetes for portability

🟠 AWS SageMaker Deployment

What is SageMaker?

Amazon SageMaker is a fully managed ML platform that handles training, tuning, and deployment. It supports all major frameworks (scikit-learn, PyTorch, TensorFlow, XGBoost) as well as custom containers.

Deployment Options

  1. Real-time endpoints: Persistent, low-latency predictions
  2. Batch transform: Process large datasets offline (a quick sketch follows this list)
  3. Serverless inference: Auto-scaling, pay-per-invocation
  4. Asynchronous inference: Long-running, queued requests
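
Batch transform is worth a quick sketch before we dive into real-time endpoints: it spins up a transient fleet of instances, runs predictions over a dataset in S3, writes the results back to S3, and then shuts everything down, so there is no endpoint left running. A minimal sketch, assuming the SKLearnModel object (sklearn_model) built in the next section and placeholder S3 paths:

# Batch transform: offline predictions over a dataset in S3 (no persistent endpoint)
transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://your-bucket/batch-predictions/'  # placeholder output path
)

transformer.transform(
    data='s3://your-bucket/batch-input/data.csv',  # placeholder input path
    content_type='text/csv',
    split_type='Line'  # treat each line of the CSV as one record
)
transformer.wait()  # you are billed only while the job runs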

Deploying a Scikit-learn Model

"""
Deploy sklearn model to SageMaker
"""
import sagemaker
from sagemaker.sklearn import SKLearnModel

# 1. Save model with entry point script
# model.py (inference script)
"""
import joblib
import numpy as np

def model_fn(model_dir):
    '''Load model from directory'''
    model = joblib.load(f'{model_dir}/model.joblib')
    return model

def predict_fn(input_data, model):
    '''Make predictions'''
    return model.predict(input_data)
"""

# 2. Upload model to S3
session = sagemaker.Session()
bucket = session.default_bucket()

# Upload model artifact
model_data = session.upload_data(
    path='model.tar.gz',  # tar.gz with model.joblib and model.py
    bucket=bucket,
    key_prefix='sklearn-model'
)

# 3. Create SageMaker model
sklearn_model = SKLearnModel(
    model_data=model_data,
    role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
    entry_point='model.py',
    framework_version='1.0-1',
    py_version='py3'
)

# 4. Deploy to endpoint
predictor = sklearn_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name='iris-classifier'
)

# 5. Make predictions
import numpy as np
data = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = predictor.predict(data)
print(f"Prediction: {prediction}")

# 6. Clean up (delete endpoint to stop billing!)
predictor.delete_endpoint()
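
Client applications that don't have the SageMaker SDK installed can call the same endpoint (while it still exists) through the low-level boto3 runtime API, for example from a web backend. A quick sketch; the scikit-learn serving container's default input handler accepts CSV:

import boto3

runtime = boto3.client('sagemaker-runtime')

# Send one row as CSV to the endpoint deployed above
response = runtime.invoke_endpoint(
    EndpointName='iris-classifier',
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2'
)
print(response['Body'].read().decode())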

Deploying Docker Container to SageMaker

"""
Deploy custom Docker container to SageMaker
"""
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor

# 1. Build and push Docker image to ECR
"""
# Build image
docker build -t ml-api:latest .

# Tag for ECR
docker tag ml-api:latest \
  123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest

# Login to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.us-east-1.amazonaws.com

# Push image
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest
"""

# 2. Create SageMaker model from container
model = Model(
    image_uri='123456789.dkr.ecr.us-east-1.amazonaws.com/ml-api:latest',
    model_data='s3://bucket/model.tar.gz',  # Optional
    role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
    predictor_cls=Predictor  # so deploy() returns a Predictor we can call
)

# 3. Deploy (JSON serialization lets us send and receive plain dicts)
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',  # GPU instance
    endpoint_name='ml-api-gpu',
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

# 4. Invoke endpoint
response = predictor.predict({
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2
})

print(response)

Serverless Inference (Cost-Optimized)

from sagemaker.serverless import ServerlessInferenceConfig

# Deploy with serverless config
predictor = sklearn_model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,  # 2 GB (allowed range: 1 GB - 6 GB)
        max_concurrency=10       # Max concurrent invocations
    )
)

# Automatically scales to zero when not in use!
# Pay only per invocation
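
Whether serverless actually saves money depends on traffic volume. A rough back-of-the-envelope comparison; the prices below are illustrative placeholders, not current AWS rates, so substitute real pricing before deciding:

# Rough break-even estimate: serverless vs. an always-on endpoint instance.
# All prices are illustrative placeholders; check current AWS pricing.
ENDPOINT_HOURLY = 0.115          # assumed $/hour for an always-on instance
SERVERLESS_PER_SECOND = 0.00008  # assumed $/second of 2 GB serverless compute
AVG_LATENCY_S = 0.2              # assumed average inference time per request

monthly_endpoint_cost = ENDPOINT_HOURLY * 24 * 30

def serverless_monthly_cost(requests_per_month: int) -> float:
    return requests_per_month * AVG_LATENCY_S * SERVERLESS_PER_SECOND

for reqs in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{reqs:>10,} req/month: serverless ~ ${serverless_monthly_cost(reqs):8.2f} "
          f"vs always-on ~ ${monthly_endpoint_cost:.2f}")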

✅ SageMaker Benefits:

  • Automatic scaling and load balancing
  • Built-in A/B testing and traffic routing (see the traffic-shifting sketch below)
  • Model monitoring and drift detection
  • One-click rollback to previous versions
  • Integration with the AWS ecosystem
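
The built-in traffic routing works through production variants on a single endpoint: deploy two model versions as separate variants, then shift request weights between them. A minimal sketch with the low-level boto3 API; the endpoint and variant names are placeholders and assume an endpoint created with two variants:

import boto3

sm = boto3.client('sagemaker')

# Assumes an endpoint with two production variants, e.g. 'ModelA' (current)
# and 'ModelB' (candidate)
sm.update_endpoint_weights_and_capacities(
    EndpointName='iris-classifier',
    DesiredWeightsAndCapacities=[
        {'VariantName': 'ModelA', 'DesiredWeight': 80.0},  # 80% of traffic
        {'VariantName': 'ModelB', 'DesiredWeight': 20.0},  # 20% canary traffic
    ]
)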

🔵 Google Cloud AI Platform

Vertex AI Overview

Vertex AI is Google's unified ML platform. It's particularly strong for TensorFlow models and offers TPU acceleration.

Deploying to Vertex AI

"""
Deploy model to Google Cloud Vertex AI
"""
from google.cloud import aiplatform

# Initialize
aiplatform.init(
    project='your-project-id',
    location='us-central1'
)

# 1. Upload model to Google Cloud Storage
# gsutil cp model.joblib gs://your-bucket/models/

# 2. Create model
model = aiplatform.Model.upload(
    display_name='iris-classifier',
    artifact_uri='gs://your-bucket/models/',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest'
)

# 3. Deploy to endpoint
endpoint = model.deploy(
    machine_type='n1-standard-2',
    min_replica_count=1,
    max_replica_count=5,
    traffic_split={'0': 100}  # 100% traffic to this version
)

# 4. Make prediction
prediction = endpoint.predict(
    instances=[[5.1, 3.5, 1.4, 0.2]]
)
print(prediction.predictions)

# 5. Update with new model version (A/B testing)
new_model = aiplatform.Model.upload(
    display_name='iris-classifier-v2',
    artifact_uri='gs://your-bucket/models-v2/',
    serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest'
)

# Deploy alongside the existing model. In traffic_split, the key '0' refers to
# the model being deployed in this call; other keys must be the IDs of models
# already deployed on the endpoint.
endpoint.deploy(
    model=new_model,
    machine_type='n1-standard-2',
    traffic_split={'0': 20, 'EXISTING_DEPLOYED_MODEL_ID': 80}  # 20% new, 80% old
)
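
As with SageMaker, a deployed Vertex AI endpoint bills per replica-hour even when idle, so tear it down once you're done experimenting. A quick cleanup sketch:

# Tear down to stop billing: undeploy all models, then delete the endpoint
endpoint.undeploy_all()
endpoint.delete()

# The uploaded model resources can also be removed if no longer needed
model.delete()
new_model.delete()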

Deploying Custom Container

# 1. Build and push to Google Container Registry:
#    gcloud builds submit --tag gcr.io/your-project/ml-api:v1

# 2. Create custom prediction container
# Must respond to health checks and prediction requests
# prediction.py (FastAPI app for Vertex AI)
from fastapi import FastAPI, Request
import joblib
import numpy as np

app = FastAPI()
model = joblib.load('model.joblib')

# Health check (required by Vertex AI)
@app.get('/health')
def health():
    return {'status': 'healthy'}

# Prediction endpoint (required format)
@app.post('/predict')
async def predict(request: Request):
    data = await request.json()
    instances = np.array(data['instances'])
    predictions = model.predict(instances)
    return {'predictions': predictions.tolist()}

# Run with: uvicorn prediction:app --host 0.0.0.0 --port 8080

# Deploy the custom container with the Vertex AI SDK
from google.cloud import aiplatform

model = aiplatform.Model.upload(
    display_name='custom-ml-api',
    serving_container_image_uri='gcr.io/your-project/ml-api:v1',
    serving_container_predict_route='/predict',
    serving_container_health_route='/health',
    serving_container_ports=[8080]
)

endpoint = model.deploy(machine_type='n1-standard-4')

Batch Predictions

# Process large datasets offline
batch_prediction_job = model.batch_predict(
    job_display_name='iris-batch-prediction',
    gcs_source='gs://your-bucket/input-data.csv',
    gcs_destination_prefix='gs://your-bucket/predictions/',
    instances_format='csv',  # input file is CSV (default is JSONL)
    machine_type='n1-standard-4',
    starting_replica_count=5,
    max_replica_count=10
)

# Wait for completion
batch_prediction_job.wait()
print(f"Output: {batch_prediction_job.output_info}")

💡 Google Cloud Advantages: Simplest deployment process, excellent documentation, TPU support for deep learning, tight integration with TensorFlow, and a generous free tier.

🔷 Azure Machine Learning

Azure ML Service Overview

Azure ML provides enterprise-grade ML lifecycle management with deep integration into the Microsoft ecosystem.

Deploying to Azure ML

"""
Deploy model to Azure ML
"""
from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# 1. Connect to workspace
ws = Workspace.from_config()

# 2. Register model
model = Model.register(
    workspace=ws,
    model_name='iris-classifier',
    model_path='model.joblib',
    description='Iris classification model'
)

# 3. Create inference environment
env = Environment.from_conda_specification(
    name='sklearn-env',
    file_path='conda_env.yml'
)

# 4. Create scoring script
"""
# score.py
import json
import joblib
import numpy as np
from azureml.core.model import Model

def init():
    global model
    model_path = Model.get_model_path('iris-classifier')
    model = joblib.load(model_path)

def run(raw_data):
    data = json.loads(raw_data)
    input_data = np.array(data['data'])
    predictions = model.predict(input_data)
    return predictions.tolist()
"""

# 5. Configure inference
inference_config = InferenceConfig(
    entry_script='score.py',
    environment=env
)

# 6. Deploy to Azure Container Instances (dev/test)
aci_config = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=1,
    auth_enabled=True
)

service = Model.deploy(
    workspace=ws,
    name='iris-classifier-service',
    models=[model],
    inference_config=inference_config,
    deployment_config=aci_config
)

service.wait_for_deployment(show_output=True)

# 7. Get scoring URI
print(f"Scoring URI: {service.scoring_uri}")

# 8. Test endpoint (auth is enabled, so include the service key)
import requests

key, _ = service.get_keys()
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {key}'
}
data = {'data': [[5.1, 3.5, 1.4, 0.2]]}

response = requests.post(
    service.scoring_uri,
    json=data,
    headers=headers
)
print(response.json())

Deploy to Azure Kubernetes Service (Production)

from azureml.core.webservice import AksWebservice
from azureml.core.compute import AksCompute, ComputeTarget

# 1. Create or attach to AKS cluster
aks_name = 'ml-aks-cluster'

if aks_name not in ws.compute_targets:
    # Create new AKS cluster
    prov_config = AksCompute.provisioning_configuration(
        agent_count=3,
        vm_size='Standard_D3_v2',
        location='eastus'
    )
    
    aks_target = ComputeTarget.create(
        workspace=ws,
        name=aks_name,
        provisioning_configuration=prov_config
    )
    aks_target.wait_for_completion(show_output=True)
else:
    aks_target = ws.compute_targets[aks_name]

# 2. Configure deployment
aks_config = AksWebservice.deploy_configuration(
    autoscale_enabled=True,
    autoscale_min_replicas=2,
    autoscale_max_replicas=10,
    cpu_cores=2,
    memory_gb=4,
    auth_enabled=True,
    enable_app_insights=True  # Monitoring
)

# 3. Deploy
service = Model.deploy(
    workspace=ws,
    name='iris-classifier-aks',
    models=[model],
    inference_config=inference_config,
    deployment_config=aks_config,
    deployment_target=aks_target
)

service.wait_for_deployment(show_output=True)

Managed Online Endpoints (Recommended)

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# New Azure ML SDK v2 approach
ml_client = MLClient.from_config()

# Create endpoint
endpoint = ManagedOnlineEndpoint(
    name='iris-endpoint',
    description='Iris classification endpoint'
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()  # wait for completion

# Create deployment
# Note: `model` must be a model asset registered with the v2 SDK
# (e.g. the ID string 'azureml:iris-classifier:1'), not the v1 Model object above.
deployment = ManagedOnlineDeployment(
    name='blue',
    endpoint_name='iris-endpoint',
    model='azureml:iris-classifier:1',
    instance_type='Standard_DS2_v2',
    instance_count=2
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Set 100% of traffic to this deployment
endpoint.traffic = {'blue': 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
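
To smoke-test the managed endpoint, the v2 SDK can invoke it directly with a JSON request file. The file name here is an assumption; its contents must match what score.py expects:

# sample_request.json might look like: {"data": [[5.1, 3.5, 1.4, 0.2]]}
response = ml_client.online_endpoints.invoke(
    endpoint_name='iris-endpoint',
    deployment_name='blue',
    request_file='sample_request.json'
)
print(response)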

⚡ Serverless ML Deployment

When to Use Serverless

  • Sporadic traffic: Requests come in bursts
  • Cost-sensitive: Don't want idle server costs
  • Simple models: Fast inference (< 30 seconds)
  • Auto-scaling: Unpredictable load patterns

AWS Lambda with Container

# Dockerfile for Lambda
FROM public.ecr.aws/lambda/python:3.10

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.joblib .
COPY lambda_function.py .

# Lambda handler entry point
CMD ["lambda_function.handler"]

# lambda_function.py
import json
import joblib
import numpy as np

# Load model once (cold start)
model = joblib.load('model.joblib')

def handler(event, context):
    """AWS Lambda handler"""
    try:
        # Parse input
        body = json.loads(event['body'])
        features = np.array([body['features']])
        
        # Predict
        prediction = model.predict(features)[0]
        probabilities = model.predict_proba(features)[0]
        
        return {
            'statusCode': 200,
            'body': json.dumps({
                'prediction': int(prediction),
                'confidence': float(max(probabilities))
            })
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': json.dumps({'error': str(e)})
        }

# Deploy to Lambda
# 1. Build and push the container image to ECR
docker build -t ml-lambda .
docker tag ml-lambda:latest 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest

# 2. Create Lambda function via AWS CLI
aws lambda create-function \
  --function-name ml-predictor \
  --package-type Image \
  --code ImageUri=123456789.dkr.ecr.us-east-1.amazonaws.com/ml-lambda:latest \
  --role arn:aws:iam::123456789:role/lambda-execution-role \
  --timeout 30 \
  --memory-size 1024

# 3. Create API Gateway trigger (via console or CLI)
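
Before wiring up API Gateway, you can sanity-check the function directly. A sketch using boto3 that matches the event shape the handler above expects:

import json

import boto3

lambda_client = boto3.client('lambda')

# The handler reads event['body'] as a JSON string, so wrap the payload accordingly
event = {'body': json.dumps({'features': [5.1, 3.5, 1.4, 0.2]})}

response = lambda_client.invoke(
    FunctionName='ml-predictor',
    Payload=json.dumps(event)
)
print(json.loads(response['Payload'].read()))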

Lambda Performance Optimization

# Optimize cold starts

# 1. Use Lambda layers for heavy dependencies. Layers are mounted under
#    /opt/python, which Lambda already adds to sys.path:
#    Layer 1: numpy, scipy (shared across functions)
#    Layer 2: scikit-learn
#    Function code: just your model and handler

# 2. Provisioned concurrency keeps instances warm (requires a published
#    version or alias as the qualifier):
# aws lambda put-provisioned-concurrency-config \
#   --function-name ml-predictor \
#   --qualifier 1 \
#   --provisioned-concurrent-executions 5

⚠️ Serverless Limitations:

  • 15-minute maximum execution time (Lambda)
  • Cold start latency (500ms - 5s)
  • Memory limits (up to 10GB Lambda)
  • No GPU support
  • Not cost-effective for constant high traffic

💰 Cost Optimization Strategies

1. Right-Sizing Instances

| Instance Type | Use Case | Approx. Cost (AWS, per month) |
|---|---|---|
| t3.medium | Dev/test, low traffic | $30 |
| m5.large | Production, CPU models | $70 |
| c5.xlarge | Compute-intensive | $125 |
| g4dn.xlarge | GPU inference | $390 |
| p3.2xlarge | Heavy GPU workloads | $2,200 |
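
Right-sizing should be driven by measured utilization rather than guesswork. A sketch that pulls average CPU utilization for a SageMaker endpoint from CloudWatch (endpoint and variant names are placeholders); consistently low numbers suggest a smaller instance type:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Hardware metrics for endpoints live in the /aws/sagemaker/Endpoints namespace
stats = cloudwatch.get_metric_statistics(
    Namespace='/aws/sagemaker/Endpoints',
    MetricName='CPUUtilization',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'iris-classifier'},
        {'Name': 'VariantName', 'Value': 'AllTraffic'},
    ],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=3600,           # hourly data points
    Statistics=['Average']
)

for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(f"{point['Timestamp']:%Y-%m-%d %H:%M}  avg CPU: {point['Average']:.1f}%")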

2. Auto-Scaling Configuration

# AWS SageMaker auto-scaling
import boto3

client = boto3.client('application-autoscaling')

# Register scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/iris-classifier/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)

# Target tracking scaling policy
client.put_scaling_policy(
    PolicyName='scale-on-invocations',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/iris-classifier/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000.0,  # Target ~1000 invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)

3. Spot Instances for Batch Workloads

# Use spot instances (up to 90% cheaper!)
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',
    role='arn:aws:iam::YOUR_ACCOUNT:role/SageMakerRole',
    framework_version='2.12',
    py_version='py310',
    instance_type='ml.p3.2xlarge',
    instance_count=4,
    use_spot_instances=True,  # Enable spot instances
    max_wait=7200,            # Max total wait time in seconds (must be >= max_run)
    max_run=3600              # Max training time in seconds
)

estimator.fit('s3://your-bucket/training-data/')  # Uses spot capacity when available

4. Model Optimization Techniques

# Quantization (reduce model size & speed up)
import tensorflow as tf

# Convert to TensorFlow Lite with quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Result: 4x smaller model, 2-3x faster inference

# PyTorch quantization
import torch

model = torch.load('model.pth')
model.eval()

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# 4x smaller, similar accuracy

5. Caching Strategies

# Redis caching (prevent redundant predictions)
import redis
import hashlib
import json

redis_client = redis.from_url('redis://cache:6379')

def predict_with_cache(features):
    # Create cache key
    key = hashlib.md5(str(features).encode()).hexdigest()
    
    # Check cache
    cached = redis_client.get(f'pred:{key}')
    if cached:
        return json.loads(cached)
    
    # Make prediction (model is assumed to be a globally loaded estimator)
    result = model.predict(features).tolist()  # convert ndarray so it is JSON-serializable
    
    # Cache for 1 hour
    redis_client.setex(f'pred:{key}', 3600, json.dumps(result))
    
    return result

# Can reduce costs by 50%+ for repeated queries!

Cost Monitoring

# AWS Cost Explorer API
import boto3
from datetime import datetime, timedelta

ce = boto3.client('ce')

# Get costs for SageMaker
response = ce.get_cost_and_usage(
    TimePeriod={
        'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
        'End': datetime.now().strftime('%Y-%m-%d')
    },
    Granularity='DAILY',
    Metrics=['UnblendedCost'],
    Filter={
        'Dimensions': {
            'Key': 'SERVICE',
            'Values': ['Amazon SageMaker']
        }
    }
)

for result in response['ResultsByTime']:
    print(f"{result['TimePeriod']['Start']}: ${result['Total']['UnblendedCost']['Amount']}")

🎯 Summary

You've mastered cloud deployment for ML models:

  • 🟠 AWS SageMaker: Most comprehensive platform, with serverless and real-time options
  • 🔵 Google Cloud (Vertex AI): Simplest deployment, with TPU support
  • 🔷 Azure ML: Enterprise features with Microsoft integration
  • ⚡ Serverless: Cost-effective for sporadic workloads
  • 💰 Cost Optimization: Right-sizing, auto-scaling, caching, quantization
  • 📊 Monitoring: Track costs and performance continuously

Key Takeaways

  1. Choose cloud provider based on your tech stack and requirements
  2. Use managed services (SageMaker, Vertex AI) for easier deployment
  3. Implement auto-scaling to handle traffic spikes
  4. Consider serverless for sporadic workloads
  5. Optimize costs with spot instances, caching, and quantization
  6. Monitor costs continuously - cloud bills can surprise you
  7. Start small, scale based on actual traffic patterns

🚀 Next Steps:

Your models are in the cloud! Next, you'll learn model serving frameworks like BentoML and TorchServe for production-grade inference at scale, with advanced features like A/B testing and canary deployments.

Test Your Knowledge

Q1: What's the main advantage of using managed ML platforms like SageMaker over deploying containers yourself?

  • They're always cheaper
  • They're faster
  • They handle auto-scaling, monitoring, and deployment infrastructure automatically
  • They're more secure

Q2: When should you use serverless inference (like AWS Lambda)?

  • For all ML deployments, always
  • For sporadic traffic and cost-sensitive applications with fast inference times
  • For GPU-intensive workloads
  • For long-running batch jobs

Q3: Which cloud platform is generally best for TensorFlow models?

  • Google Cloud (Vertex AI), with native TensorFlow support and TPUs
  • AWS, always
  • Azure, for everything
  • They're all exactly the same

Q4: What's a good cost optimization strategy for ML inference?

  • Always use the largest instances
  • Never use auto-scaling
  • Deploy to all regions
  • Implement caching, use auto-scaling, consider spot instances for batch jobs, and quantize models

Q5: What's the purpose of A/B testing in model deployment?

  • To make deployment faster
  • To reduce costs
  • To gradually roll out new model versions and compare performance with the old version
  • To test different cloud providers