
Kubernetes for ML

Master Kubernetes for ML workloads. Deploy models, implement autoscaling, optimize GPU scheduling, and use Helm charts for production ML infrastructure

📅 Tutorial 9 📊 Advanced


☸️ Why Kubernetes for ML?

You've containerized your ML application with Docker. It runs great on your laptop! But production requires:

  • Running 10 replicas for high availability
  • Automatically scaling based on traffic
  • Zero-downtime deployments
  • Efficient GPU sharing across models
  • Self-healing when containers crash
  • Load balancing incoming requests

Kubernetes (K8s) is the industry-standard container orchestration platform. It manages containerized applications at scale with features ML workloads need: resource scheduling, autoscaling, GPU support, and high availability.

Kubernetes Architecture

Control Plane: API Server, Scheduler, Controller Manager, etcd (manages cluster state)
Worker Nodes: Run containerized applications (Pods)
Pods: Smallest deployable units (one or more containers)
Services: Expose Pods to network traffic with load balancing
Deployments: Manage desired state and rolling updates

💡 Kubernetes Benefits for ML:

  • Declarative configuration (define desired state, K8s maintains it)
  • Automatic scaling based on CPU, memory, or custom metrics
  • GPU scheduling and sharing across workloads
  • Rolling updates with zero downtime
  • Self-healing (automatic restarts of failed containers)
  • Resource quotas and limits for cost control

🚀 Kubernetes Setup

Local Development: Minikube

# Install Minikube and kubectl (macOS)
brew install minikube
brew install kubectl

# Start cluster
minikube start --cpus=4 --memory=8192 --driver=docker

# Enable GPU support (if available; requires the NVIDIA container toolkit)
minikube start --driver=docker --container-runtime=docker --gpus all

# Verify cluster
kubectl cluster-info
kubectl get nodes

Production: Cloud Kubernetes Services

# AWS EKS (p3.2xlarge is a GPU instance type)
eksctl create cluster \
  --name ml-cluster \
  --region us-west-2 \
  --nodegroup-name ml-nodes \
  --node-type p3.2xlarge \
  --nodes 3

# Google GKE
gcloud container clusters create ml-cluster \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 3 \
  --zone us-central1-a

# Azure AKS
az aks create \
  --resource-group ml-rg \
  --name ml-cluster \
  --node-vm-size Standard_NC6 \
  --node-count 3

Essential kubectl Commands

# View resources
kubectl get pods
kubectl get services
kubectl get deployments
kubectl get nodes

# Detailed info
kubectl describe pod <pod-name>
kubectl logs <pod-name>
kubectl logs -f <pod-name>   # Follow logs

# Execute commands in pod
kubectl exec -it <pod-name> -- bash

# Port forwarding for local testing
kubectl port-forward pod/<pod-name> 8080:8080

# Apply configuration
kubectl apply -f deployment.yaml

# Delete resources
kubectl delete -f deployment.yaml
kubectl delete pod <pod-name>
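
The same information is also available programmatically through the API server. A minimal sketch using the official kubernetes Python client (an extra dependency, not something this tutorial installs; it reads the same kubeconfig kubectl uses):

# Sketch: query the cluster from Python (pip install kubernetes)
from kubernetes import client, config

config.load_kube_config()   # inside a pod, use config.load_incluster_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Equivalent to `kubectl get pods`
for pod in core.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)

# Equivalent to `kubectl get deployments`
for dep in apps.list_namespaced_deployment(namespace="default").items:
    print(dep.metadata.name, dep.status.ready_replicas, "/", dep.spec.replicas)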

🚢 Deploying ML Models on Kubernetes

Step 1: Create Docker Image

# Dockerfile for ML model
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.pkl .
COPY app.py .

# Expose port
EXPOSE 8080

# Run server
CMD ["python", "app.py"]

# Build and push image
docker build -t myregistry/ml-model:v1 .
docker push myregistry/ml-model:v1
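
The Dockerfile above expects an app.py that loads model.pkl and serves /predict, plus the /health and /ready endpoints the probes in the next step will call. A minimal sketch, assuming FastAPI, uvicorn, and scikit-learn are in requirements.txt (adjust to your own framework and input schema):

# app.py - minimal serving sketch (framework and request schema are assumptions)
import pickle

import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
model = None

class PredictRequest(BaseModel):
    features: list[float]

@app.on_event("startup")
def load_model():
    global model
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

@app.get("/health")
def health():
    # Liveness: the process is up and responding
    return {"status": "ok"}

@app.get("/ready")
def ready():
    # Readiness: refuse traffic until the model is loaded
    if model is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"status": "ready"}

@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)  # matches EXPOSE 8080 / containerPort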

Step 2: Create Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  labels:
    app: ml-model
spec:
  replicas: 3  # Run 3 pods for high availability
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
        version: v1
    spec:
      containers:
      - name: model
        image: myregistry/ml-model:v1
        ports:
        - containerPort: 8080
        
        # Resource requests and limits
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"      # 0.5 CPU
          limits:
            memory: "1Gi"
            cpu: "1000m"     # 1 CPU
        
        # Health checks
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        
        # Environment variables
        env:
        - name: MODEL_VERSION
          value: "v1"
        - name: LOG_LEVEL
          value: "INFO"

Step 3: Create Service

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  type: LoadBalancer  # External access
  ports:
  - protocol: TCP
    port: 80          # External port
    targetPort: 8080  # Container port
  
  sessionAffinity: ClientIP  # Sticky sessions

Deploy to Kubernetes

# Apply configurations
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

# Check deployment status
kubectl get deployments
kubectl get pods
kubectl get services

# Get external IP
kubectl get service ml-model-service

# Test endpoint (replace <EXTERNAL-IP> with the service's EXTERNAL-IP)
curl http://<EXTERNAL-IP>/predict -X POST -d '{"features": [1,2,3,4]}'

# View logs
kubectl logs -l app=ml-model --tail=100

# Scale deployment
kubectl scale deployment ml-model --replicas=5

✅ Deployment Best Practices:

  • Always set resource requests and limits
  • Implement health checks (liveness and readiness probes)
  • Use multiple replicas for high availability
  • Tag images with versions, not 'latest'
  • Store sensitive data in Secrets, not in image

📈 Horizontal Pod Autoscaling (HPA)

What is HPA?

HPA automatically adjusts the number of pods based on observed metrics (CPU, memory, or custom metrics such as request rate or queue length). The controller computes desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue) and scales the Deployment toward that value, bounded by minReplicas and maxReplicas.

CPU-Based Autoscaling

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale when avg CPU > 70%
  
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60  # Max 50% reduction per minute
    
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15  # Max 100% increase per 15s

# Apply HPA
kubectl apply -f hpa.yaml

# Check HPA status
kubectl get hpa
kubectl describe hpa ml-model-hpa

# Generate load to test autoscaling
kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh
# Inside the load-generator shell:
while true; do wget -q -O- http://ml-model-service/predict; done

# Watch pods scale up
kubectl get pods -w

Custom Metrics Autoscaling

# hpa-custom.yaml - Scale based on request rate
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # CPU metric
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  
  # Memory metric
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  
  # Custom metric: requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"  # Scale when > 1000 req/s per pod
  
  # Custom metric: inference queue length
  - type: Pods
    pods:
      metric:
        name: inference_queue_length
      target:
        type: AverageValue
        averageValue: "10"  # Scale when queue > 10

Vertical Pod Autoscaling (VPA)

# vpa.yaml - Automatically adjust resource requests
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ml-model-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  updatePolicy:
    updateMode: "Auto"  # Apply recommendations automatically (pods are evicted and recreated)
  resourcePolicy:
    containerPolicies:
    - containerName: model
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: 2000m
        memory: 4Gi

🎮 GPU Scheduling

Enable GPU Support

# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPU nodes
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

Deploy GPU Workload

# gpu-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-gpu-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pytorch-gpu
  template:
    metadata:
      labels:
        app: pytorch-gpu
    spec:
      containers:
      - name: model
        image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
        command: ["python", "serve.py"]
        
        resources:
          limits:
            nvidia.com/gpu: 1  # Request 1 GPU
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
        
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        
        volumeMounts:
        - name: model-storage
          mountPath: /models
      
      # Node selector for GPU nodes
      nodeSelector:
        accelerator: nvidia-tesla-t4
      
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
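
Inside the container, serve.py can confirm that the GPU requested above is actually visible before loading the model. A short check, assuming PyTorch (which the pytorch/pytorch base image ships with):

# Sketch: verify GPU visibility inside the pod before serving
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        # With nvidia.com/gpu: 1, exactly one device is exposed to this container
        print(f"CUDA devices visible: {torch.cuda.device_count()}")
        print(f"Using: {torch.cuda.get_device_name(0)}")
        return torch.device("cuda:0")
    print("No GPU visible, falling back to CPU")
    return torch.device("cpu")

device = pick_device()
# model = load_model().to(device)  # hypothetical model loading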

GPU Sharing with Time-Slicing

# gpu-time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  time-slicing-config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Share 1 GPU among 4 pods

Multi-GPU Training Job

# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: myregistry/pytorch-trainer:latest
        command: ["python", "-m", "torch.distributed.launch"]  # torchrun is the modern replacement
        args:
        - "--nproc_per_node=4"  # 4 GPUs
        - "train.py"
        
        resources:
          limits:
            nvidia.com/gpu: 4  # Request 4 GPUs
          requests:
            memory: "32Gi"
            cpu: "16"
        
        env:
        - name: MASTER_ADDR
          value: "localhost"
        - name: MASTER_PORT
          value: "29500"

⚠️ GPU Best Practices:

  • Use GPU time-slicing for inference workloads to maximize utilization
  • Reserve full GPUs for training jobs
  • Set appropriate node selectors to target GPU nodes
  • Monitor GPU utilization with Prometheus + NVIDIA DCGM
  • Use node pools for different GPU types (training vs inference)

🔐 Configuration Management

ConfigMaps for Configuration

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-model-config
data:
  # Simple key-value pairs
  model_version: "v2.1.0"
  batch_size: "32"
  max_sequence_length: "512"
  
  # Configuration file
  config.json: |
    {
      "model_path": "/models/model.pkl",
      "preprocessing": {
        "normalize": true,
        "scale_features": true
      },
      "inference": {
        "batch_size": 32,
        "timeout": 30
      }
    }

# Use ConfigMap in Deployment (snippet)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  template:
    spec:
      containers:
      - name: model
        image: myregistry/ml-model:v1
        
        # Environment variables from ConfigMap
        envFrom:
        - configMapRef:
            name: ml-model-config
        
        # Mount config file
        volumeMounts:
        - name: config-volume
          mountPath: /app/config
      
      volumes:
      - name: config-volume
        configMap:
          name: ml-model-config
          items:
          - key: config.json
            path: config.json
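
Inside the container, the application sees the ConfigMap as ordinary environment variables (from envFrom) and a mounted JSON file. A small sketch of reading both, following the paths above:

# Sketch: consume ConfigMap data inside the container
import json
import os

model_version = os.environ.get("model_version", "unknown")
batch_size = int(os.environ.get("batch_size", "32"))

with open("/app/config/config.json") as f:
    cfg = json.load(f)

print(f"Model {model_version}, batch size {batch_size}")
print(f"Model path: {cfg['model_path']}, normalize: {cfg['preprocessing']['normalize']}")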

Secrets for Sensitive Data

# Create secret from literal
kubectl create secret generic ml-api-keys \
  --from-literal=api-key=your-api-key \
  --from-literal=db-password=your-password

# Create secret from file
kubectl create secret generic ml-model-weights \
  --from-file=model.pkl

# Create TLS secret
kubectl create secret tls ml-tls-secret \
  --cert=tls.crt \
  --key=tls.key

# Use Secrets in Deployment (snippet)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  template:
    spec:
      containers:
      - name: model
        image: myregistry/ml-model:v1
        
        # Environment variables from Secret
        env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: ml-api-keys
              key: api-key
        
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: ml-api-keys
              key: db-password
        
        # Mount secret as volume
        volumeMounts:
        - name: model-secret
          mountPath: /secrets
          readOnly: true
      
      volumes:
      - name: model-secret
        secret:
          secretName: ml-model-weights
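
The application consumes Secrets the same way: referenced keys appear as environment variables, and the mounted Secret shows up as read-only files. A short sketch:

# Sketch: consume Secret data inside the container
import os
import pickle

api_key = os.environ["API_KEY"]          # from secretKeyRef ml-api-keys/api-key
db_password = os.environ["DB_PASSWORD"]  # from secretKeyRef ml-api-keys/db-password

with open("/secrets/model.pkl", "rb") as f:   # from the ml-model-weights mount
    model = pickle.load(f)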

⎈ Helm Charts for ML Applications

What is Helm?

Helm is a package manager for Kubernetes. It uses "charts" to define, install, and manage complex K8s applications with a single command.

Install Helm

# Install Helm
brew install helm

# Add popular charts repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

Create Helm Chart for ML Model

# Create chart structure
helm create ml-model

# Chart structure:
# ml-model/
#   Chart.yaml          # Chart metadata
#   values.yaml         # Default configuration values
#   templates/          # K8s manifests with templating
#     deployment.yaml
#     service.yaml
#     hpa.yaml
#     ingress.yaml

Chart.yaml

# Chart.yaml
apiVersion: v2
name: ml-model
description: Production ML model deployment
type: application
version: 1.0.0
appVersion: "2.1.0"

values.yaml

# values.yaml - Default values
replicaCount: 3

image:
  repository: myregistry/ml-model
  tag: "v1.0.0"
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80
  targetPort: 8080

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

gpu:
  enabled: false
  count: 0

model:
  version: "v2.1.0"
  batchSize: 32

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: ml-model.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: ml-tls-secret
      hosts:
        - ml-model.example.com

templates/deployment.yaml

# templates/deployment.yaml with templating
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "ml-model.fullname" . }}
  labels:
    {{- include "ml-model.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "ml-model.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "ml-model.selectorLabels" . | nindent 8 }}
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        ports:
        - name: http
          containerPort: {{ .Values.service.targetPort }}
        
        # Merge GPU limits into a single resources block (duplicate keys are invalid)
        resources:
          requests:
            {{- toYaml .Values.resources.requests | nindent 12 }}
          limits:
            {{- toYaml .Values.resources.limits | nindent 12 }}
            {{- if .Values.gpu.enabled }}
            nvidia.com/gpu: {{ .Values.gpu.count }}
            {{- end }}
        
        env:
        - name: MODEL_VERSION
          value: {{ .Values.model.version | quote }}
        - name: BATCH_SIZE
          value: {{ .Values.model.batchSize | quote }}

Deploy with Helm

# Install chart
helm install my-ml-model ./ml-model

# Install with custom values
helm install my-ml-model ./ml-model \
  --set replicaCount=5 \
  --set image.tag=v2.0.0 \
  --set gpu.enabled=true \
  --set gpu.count=1

# Install with values file
helm install my-ml-model ./ml-model -f production-values.yaml

# Upgrade deployment
helm upgrade my-ml-model ./ml-model --set image.tag=v2.1.0

# Rollback
helm rollback my-ml-model 1

# Uninstall
helm uninstall my-ml-model

# List releases
helm list

# View release history
helm history my-ml-model

Production Values Override

# production-values.yaml
replicaCount: 10

image:
  tag: "v2.1.0"

resources:
  requests:
    memory: "2Gi"
    cpu: "2000m"
  limits:
    memory: "4Gi"
    cpu: "4000m"

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50
  targetCPUUtilizationPercentage: 60

gpu:
  enabled: true
  count: 1

ingress:
  enabled: true
  hosts:
    - host: ml-api.production.com

# Deploy to production
helm install prod-ml-model ./ml-model \
  -f production-values.yaml \
  --namespace production \
  --create-namespace

📊 Monitoring ML Workloads

Prometheus & Grafana Stack

# Install Prometheus using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

# Access Grafana dashboard
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Default credentials: admin / prom-operator

Custom Metrics for ML Models

"""
Expose custom metrics from ML service
"""
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
predictions_total = Counter(
    'ml_predictions_total',
    'Total number of predictions',
    ['model_version', 'status']
)

prediction_latency = Histogram(
    'ml_prediction_latency_seconds',
    'Prediction latency in seconds',
    ['model_version']
)

model_accuracy = Gauge(
    'ml_model_accuracy',
    'Current model accuracy',
    ['model_version']
)

active_connections = Gauge(
    'ml_active_connections',
    'Number of active connections'
)

# Instrument code
@app.post("/predict")
async def predict(request: PredictRequest):
    start_time = time.time()
    
    try:
        # Make prediction
        result = model.predict(request.features)
        
        # Record metrics
        predictions_total.labels(
            model_version='v2.1.0',
            status='success'
        ).inc()
        
        duration = time.time() - start_time
        prediction_latency.labels(model_version='v2.1.0').observe(duration)
        
        return result
    
    except Exception as e:
        predictions_total.labels(
            model_version='v2.1.0',
            status='error'
        ).inc()
        raise

# Start metrics server
start_http_server(9090)  # Metrics at :9090/metrics

ServiceMonitor for Prometheus

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ml-model-metrics
  labels:
    app: ml-model
spec:
  selector:
    matchLabels:
      app: ml-model
  endpoints:
  - port: metrics       # must match a named port on the ml-model Service (e.g. the :9090 metrics port)
    interval: 30s
    path: /metrics

🎯 Summary

You've mastered Kubernetes for ML workloads:

  • ☸️ K8s Fundamentals: Pods, Services, Deployments, and cluster architecture
  • 🚢 Model Deployment: production-ready ML model deployments with health checks
  • 📈 Autoscaling: HPA and VPA for automatic resource scaling
  • 🎮 GPU Scheduling: efficient GPU sharing and multi-GPU training
  • ⎈ Helm Charts: package and deploy complex ML applications
  • 📊 Monitoring: Prometheus and custom metrics for ML workloads

Key Takeaways

  1. Kubernetes provides container orchestration at scale for ML workloads
  2. Use Deployments for stateless ML serving, StatefulSets for training
  3. Implement HPA for automatic scaling based on traffic and metrics
  4. Leverage GPU time-slicing for inference, full GPUs for training
  5. Use Helm charts to manage complex ML application deployments
  6. Monitor with Prometheus and expose custom ML metrics
  7. Set resource requests/limits for cost control and stability

🚀 Next Steps:

Your ML infrastructure is production-ready on Kubernetes! Next tutorials will cover model monitoring, data pipelines, feature stores, and CI/CD - completing your MLOps toolkit for end-to-end production ML systems.

Test Your Knowledge

Q1: What is a Pod in Kubernetes?

  • A type of container
  • A deployment strategy
  • The smallest deployable unit that can contain one or more containers
  • A load balancer

Q2: What does Horizontal Pod Autoscaler (HPA) do?

  • Increases container memory
  • Automatically scales the number of pods based on metrics like CPU usage or custom metrics
  • Distributes pods across nodes
  • Monitors pod health

Q3: How do you request a GPU in a Kubernetes pod?

  • Set resources.limits.nvidia.com/gpu in the container spec
  • Use a special GPU image
  • Add --gpu flag to kubectl
  • GPUs are automatically assigned

Q4: What is the benefit of using Helm charts?

  • Faster pod startup
  • Better GPU performance
  • Automatic scaling
  • Package and manage complex K8s applications with reusable templates and configuration

Q5: What should you use to store sensitive information like API keys in Kubernetes?

  • ConfigMaps
  • Environment variables in Dockerfile
  • Secrets
  • Regular files in the image