☸️ Why Kubernetes for ML?
You've containerized your ML application with Docker. It runs great on your laptop! But production requires:
- Running 10 replicas for high availability
- Automatically scaling based on traffic
- Zero-downtime deployments
- Efficient GPU sharing across models
- Self-healing when containers crash
- Load balancing incoming requests
Kubernetes (K8s) is the industry-standard container orchestration platform. It manages containerized applications at scale with features ML workloads need: resource scheduling, autoscaling, GPU support, and high availability.
Kubernetes Architecture
💡 Kubernetes Benefits for ML:
- Declarative configuration (define desired state, K8s maintains it)
- Automatic scaling based on CPU, memory, or custom metrics
- GPU scheduling and sharing across workloads
- Rolling updates with zero downtime
- Self-healing (automatic restarts of failed containers)
- Resource quotas and limits for cost control
🚀 Kubernetes Setup
Local Development: Minikube
# Install Minikube and kubectl (macOS)
brew install minikube kubectl

# Start cluster
minikube start --cpus=4 --memory=8192 --driver=docker

# Enable GPU support (if an NVIDIA GPU and the NVIDIA container toolkit are available)
minikube start --driver=docker --container-runtime=docker --gpus=all

# Verify cluster
kubectl cluster-info
kubectl get nodes
Production: Cloud Kubernetes Services
# AWS EKS (p3.2xlarge is a GPU instance type)
eksctl create cluster \
  --name ml-cluster \
  --region us-west-2 \
  --nodegroup-name ml-nodes \
  --node-type p3.2xlarge \
  --nodes 3

# Google GKE
gcloud container clusters create ml-cluster \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 3 \
  --zone us-central1-a

# Azure AKS (Standard_NC6 is a GPU VM size)
az aks create \
  --resource-group ml-rg \
  --name ml-cluster \
  --node-vm-size Standard_NC6 \
  --node-count 3
Essential kubectl Commands
# View resources
kubectl get pods
kubectl get services
kubectl get deployments
kubectl get nodes

# Detailed info
kubectl describe pod <pod-name>
kubectl logs <pod-name>
kubectl logs -f <pod-name>   # Follow logs

# Execute commands in a pod
kubectl exec -it <pod-name> -- bash

# Port forwarding for local testing
kubectl port-forward pod/<pod-name> 8080:8080

# Apply configuration
kubectl apply -f deployment.yaml

# Delete resources
kubectl delete -f deployment.yaml
kubectl delete pod <pod-name>
🚢 Deploying ML Models on Kubernetes
Step 1: Create Docker Image
# Dockerfile for ML model
FROM python:3.10-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and code
COPY model.pkl .
COPY app.py .
# Expose port
EXPOSE 8080
# Run server
CMD ["python", "app.py"]
# Build and push image
docker build -t myregistry/ml-model:v1 .
docker push myregistry/ml-model:v1
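The Dockerfile above assumes an app.py that serves predictions on port 8080 and exposes the /health and /ready endpoints referenced by the probes in the next step. A minimal sketch, assuming FastAPI and uvicorn are listed in requirements.txt and model.pkl is a pickled scikit-learn-style model:

# app.py - minimal serving sketch (illustrative; adapt to your own model and schema)
import pickle

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictRequest(BaseModel):
    features: list[float]

@app.get("/health")
def health():
    # Liveness: the process is up
    return {"status": "ok"}

@app.get("/ready")
def ready():
    # Readiness: the model is loaded and the pod can take traffic
    return {"status": "ready"}

@app.post("/predict")
def predict(request: PredictRequest):
    prediction = model.predict([request.features])
    return {"prediction": prediction.tolist()}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)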
Step 2: Create Deployment
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  labels:
    app: ml-model
spec:
  replicas: 3  # Run 3 pods for high availability
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
        version: v1
    spec:
      containers:
        - name: model
          image: myregistry/ml-model:v1
          ports:
            - containerPort: 8080
          # Resource requests and limits
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"   # 0.5 CPU
            limits:
              memory: "1Gi"
              cpu: "1000m"  # 1 CPU
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          # Environment variables
          env:
            - name: MODEL_VERSION
              value: "v1"
            - name: LOG_LEVEL
              value: "INFO"
Step 3: Create Service
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  type: LoadBalancer  # External access
  ports:
    - protocol: TCP
      port: 80          # External port
      targetPort: 8080  # Container port
  sessionAffinity: ClientIP  # Sticky sessions
Deploy to Kubernetes
# Apply configurations
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Check deployment status
kubectl get deployments
kubectl get pods
kubectl get services
# Get external IP
kubectl get service ml-model-service
# Test endpoint
curl -X POST http://<EXTERNAL-IP>/predict -H "Content-Type: application/json" -d '{"features": [1,2,3,4]}'
# View logs
kubectl logs -l app=ml-model --tail=100
# Scale deployment
kubectl scale deployment ml-model --replicas=5
✅ Deployment Best Practices:
- Always set resource requests and limits
- Implement health checks (liveness and readiness probes)
- Use multiple replicas for high availability
- Tag images with versions, not 'latest'
- Store sensitive data in Secrets, not in image
📈 Horizontal Pod Autoscaling (HPA)
What is HPA?
HPA automatically scales the number of pods based on observed metrics (CPU, memory, or custom metrics such as request rate or queue length). Resource-based HPA relies on the Kubernetes metrics-server running in the cluster.
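Under the hood, the controller computes desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A quick illustration in Python:

# hpa_math.py - how the HPA derives the desired replica count (illustrative)
import math

current_replicas = 3
current_cpu_utilization = 90   # average CPU utilization across pods (%)
target_cpu_utilization = 70    # target from the HPA spec

desired_replicas = math.ceil(
    current_replicas * current_cpu_utilization / target_cpu_utilization
)
print(desired_replicas)  # 4 -> the HPA scales the Deployment from 3 to 4 pods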
CPU-Based Autoscaling
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Scale when avg CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60  # Max 50% reduction per minute
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15  # Max 100% increase per 15s
# Apply HPA
kubectl apply -f hpa.yaml
# Check HPA status
kubectl get hpa
kubectl describe hpa ml-model-hpa
# Generate load to test autoscaling (the while loop runs inside the busybox shell)
kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh
# Inside the shell:
while true; do wget -q -O- http://ml-model-service/predict; done
# Watch pods scale up
kubectl get pods -w
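If you prefer driving load from outside the cluster, a small Python sketch works too (assumes the requests package is installed; replace <EXTERNAL-IP> with the Service address from kubectl get service):

# load_test.py - concurrent POSTs to /predict to trigger the HPA (sketch)
import concurrent.futures

import requests

URL = "http://<EXTERNAL-IP>/predict"   # placeholder Service address
PAYLOAD = {"features": [1, 2, 3, 4]}

def fire(_):
    # One prediction request; the status code is enough for load testing
    return requests.post(URL, json=PAYLOAD, timeout=5).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    codes = list(pool.map(fire, range(2000)))

print(f"sent {len(codes)} requests, {codes.count(200)} returned 200")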
Custom Metrics Autoscaling
# hpa-custom.yaml - Scale based on request rate
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # CPU metric
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory metric
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric: requests per second
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"  # Scale when > 1000 req/s per pod
    # Custom metric: inference queue length
    - type: Pods
      pods:
        metric:
          name: inference_queue_length
        target:
          type: AverageValue
          averageValue: "10"  # Scale when queue > 10
Note that Pods-type metrics are not available out of the box: they must be exposed through the custom metrics API by an adapter such as prometheus-adapter.
Vertical Pod Autoscaling (VPA)
# vpa.yaml - Automatically adjust resource requests
# (VPA is not built into Kubernetes; install it from the kubernetes/autoscaler project first)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ml-model-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  updatePolicy:
    updateMode: "Auto"  # Auto-update pod resources
  resourcePolicy:
    containerPolicies:
      - containerName: model
        minAllowed:
          cpu: 100m
          memory: 256Mi
        maxAllowed:
          cpu: 2000m
          memory: 4Gi
🎮 GPU Scheduling
Enable GPU Support
# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPU nodes
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
Deploy GPU Workload
# gpu-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-gpu-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pytorch-gpu
  template:
    metadata:
      labels:
        app: pytorch-gpu
    spec:
      containers:
        - name: model
          image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
          command: ["python", "serve.py"]
          resources:
            limits:
              nvidia.com/gpu: 1  # Request 1 GPU
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
              cpu: "4"
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          volumeMounts:
            - name: model-storage
              mountPath: /models
      # Node selector for GPU nodes
      nodeSelector:
        accelerator: nvidia-tesla-t4
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
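To confirm a pod scheduled this way actually sees its GPU, a quick check from inside the container helps (assumes PyTorch, which the image above provides):

# gpu_check.py - verify the scheduled GPU is visible to the framework
import torch

if torch.cuda.is_available():
    print(f"GPUs visible: {torch.cuda.device_count()}")
    print(f"Device 0: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU visible - check the device plugin, node selector, and resource limits")

Run it with kubectl exec -it <pod-name> -- python gpu_check.py.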
GPU Sharing with Time-Slicing
# gpu-time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  time-slicing-config: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Share 1 GPU among 4 pods
Multi-GPU Training Job
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: myregistry/pytorch-trainer:latest
          command: ["python", "-m", "torch.distributed.launch"]
          args:
            - "--nproc_per_node=4"  # 4 GPUs
            - "train.py"
          resources:
            limits:
              nvidia.com/gpu: 4  # Request 4 GPUs
            requests:
              memory: "32Gi"
              cpu: "16"
          env:
            - name: MASTER_ADDR
              value: "localhost"
            - name: MASTER_PORT
              value: "29500"
⚠️ GPU Best Practices:
- Use GPU time-slicing for inference workloads to maximize utilization
- Reserve full GPUs for training jobs
- Set appropriate node selectors to target GPU nodes
- Monitor GPU utilization with Prometheus + NVIDIA DCGM
- Use node pools for different GPU types (training vs inference)
🔐 Configuration Management
ConfigMaps for Configuration
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-model-config
data:
  # Simple key-value pairs
  model_version: "v2.1.0"
  batch_size: "32"
  max_sequence_length: "512"
  # Configuration file
  config.json: |
    {
      "model_path": "/models/model.pkl",
      "preprocessing": {
        "normalize": true,
        "scale_features": true
      },
      "inference": {
        "batch_size": 32,
        "timeout": 30
      }
    }
# Use ConfigMap in Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  template:
    spec:
      containers:
        - name: model
          image: myregistry/ml-model:v1
          # Environment variables from ConfigMap
          envFrom:
            - configMapRef:
                name: ml-model-config
          # Mount config file
          volumeMounts:
            - name: config-volume
              mountPath: /app/config
      volumes:
        - name: config-volume
          configMap:
            name: ml-model-config
            items:
              - key: config.json
                path: config.json
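Inside the container, the application sees the ConfigMap as plain environment variables and a mounted file; a minimal sketch of reading them:

# read_config.py - consume ConfigMap values injected into the pod (sketch)
import json
import os

# Keys injected via envFrom become environment variables
model_version = os.environ.get("model_version", "unknown")
batch_size = int(os.environ.get("batch_size", "32"))

# The config.json item is mounted at /app/config/config.json
with open("/app/config/config.json") as f:
    config = json.load(f)

print(model_version, batch_size, config["inference"]["timeout"])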
Secrets for Sensitive Data
# Create secret from literal
kubectl create secret generic ml-api-keys \
--from-literal=api-key=your-api-key \
--from-literal=db-password=your-password
# Create secret from file
kubectl create secret generic ml-model-weights \
--from-file=model.pkl
# Create TLS secret
kubectl create secret tls ml-tls-secret \
--cert=tls.crt \
--key=tls.key
# Use Secrets in Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  template:
    spec:
      containers:
        - name: model
          image: myregistry/ml-model:v1
          # Environment variables from Secret
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: ml-api-keys
                  key: api-key
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: ml-api-keys
                  key: db-password
          # Mount secret as volume
          volumeMounts:
            - name: model-secret
              mountPath: /secrets
              readOnly: true
      volumes:
        - name: model-secret
          secret:
            secretName: ml-model-weights
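From the application's point of view, Secrets look exactly like environment variables and files; a short sketch of consuming the values defined above:

# read_secrets.py - consume Secret values injected into the pod (sketch)
import os
from pathlib import Path

api_key = os.environ["API_KEY"]          # from secretKeyRef
db_password = os.environ["DB_PASSWORD"]

# Files from the mounted Secret appear under /secrets (key name = file name)
model_weights = Path("/secrets/model.pkl").read_bytes()
print(f"api key: {len(api_key)} chars, model weights: {len(model_weights)} bytes")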
⎈ Helm Charts for ML Applications
What is Helm?
Helm is a package manager for Kubernetes. It uses "charts" to define, install, and manage complex K8s applications with a single command.
Install Helm
# Install Helm
brew install helm
# Add popular charts repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
Create Helm Chart for ML Model
# Create chart structure
helm create ml-model

# Chart structure:
# ml-model/
#   Chart.yaml        # Chart metadata
#   values.yaml       # Default configuration values
#   templates/        # K8s manifests with templating
#     deployment.yaml
#     service.yaml
#     hpa.yaml
#     ingress.yaml
Chart.yaml
# Chart.yaml
apiVersion: v2
name: ml-model
description: Production ML model deployment
type: application
version: 1.0.0
appVersion: "2.1.0"
values.yaml
# values.yaml - Default values
replicaCount: 3

image:
  repository: myregistry/ml-model
  tag: "v1.0.0"
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80
  targetPort: 8080

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "1Gi"
    cpu: "1000m"

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

gpu:
  enabled: false
  count: 0

model:
  version: "v2.1.0"
  batchSize: 32

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: ml-model.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: ml-tls-secret
      hosts:
        - ml-model.example.com
templates/deployment.yaml
# templates/deployment.yaml with templating
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "ml-model.fullname" . }}
  labels:
    {{- include "ml-model.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "ml-model.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "ml-model.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
          # Single resources block; the GPU limit is merged in when gpu.enabled is true
          resources:
            requests:
              {{- toYaml .Values.resources.requests | nindent 14 }}
            limits:
              {{- toYaml .Values.resources.limits | nindent 14 }}
              {{- if .Values.gpu.enabled }}
              nvidia.com/gpu: {{ .Values.gpu.count }}
              {{- end }}
          env:
            - name: MODEL_VERSION
              value: {{ .Values.model.version | quote }}
            - name: BATCH_SIZE
              value: {{ .Values.model.batchSize | quote }}
Deploy with Helm
# Install chart
helm install my-ml-model ./ml-model
# Install with custom values
helm install my-ml-model ./ml-model \
--set replicaCount=5 \
--set image.tag=v2.0.0 \
--set gpu.enabled=true \
--set gpu.count=1
# Install with values file
helm install my-ml-model ./ml-model -f production-values.yaml
# Upgrade deployment
helm upgrade my-ml-model ./ml-model --set image.tag=v2.1.0
# Rollback
helm rollback my-ml-model 1
# Uninstall
helm uninstall my-ml-model
# List releases
helm list
# View release history
helm history my-ml-model
Production Values Override
# production-values.yaml
replicaCount: 10

image:
  tag: "v2.1.0"

resources:
  requests:
    memory: "2Gi"
    cpu: "2000m"
  limits:
    memory: "4Gi"
    cpu: "4000m"

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50
  targetCPUUtilizationPercentage: 60

gpu:
  enabled: true
  count: 1

ingress:
  enabled: true
  hosts:
    - host: ml-api.production.com

# Deploy to production
helm install prod-ml-model ./ml-model \
  -f production-values.yaml \
  --namespace production \
  --create-namespace
📊 Monitoring ML Workloads
Prometheus & Grafana Stack
# Install Prometheus using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Access Grafana dashboard
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Default credentials: admin / prom-operator
Custom Metrics for ML Models
"""
Expose custom metrics from ML service
"""
from prometheus_client import Counter, Histogram, Gauge, start_http_server
# Define metrics
predictions_total = Counter(
'ml_predictions_total',
'Total number of predictions',
['model_version', 'status']
)
prediction_latency = Histogram(
'ml_prediction_latency_seconds',
'Prediction latency in seconds',
['model_version']
)
model_accuracy = Gauge(
'ml_model_accuracy',
'Current model accuracy',
['model_version']
)
active_connections = Gauge(
'ml_active_connections',
'Number of active connections'
)
# Instrument code
@app.post("/predict")
async def predict(request: PredictRequest):
start_time = time.time()
try:
# Make prediction
result = model.predict(request.features)
# Record metrics
predictions_total.labels(
model_version='v2.1.0',
status='success'
).inc()
duration = time.time() - start_time
prediction_latency.labels(model_version='v2.1.0').observe(duration)
return result
except Exception as e:
predictions_total.labels(
model_version='v2.1.0',
status='error'
).inc()
raise
# Start metrics server
start_http_server(9090) # Metrics at :9090/metrics
ServiceMonitor for Prometheus
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ml-model-metrics
  labels:
    app: ml-model
spec:
  selector:
    matchLabels:
      app: ml-model
  endpoints:
    - port: metrics   # must match a named port on the Service (e.g., 9090)
      interval: 30s
      path: /metrics
🎯 Summary
You've mastered Kubernetes for ML workloads:
- K8s Fundamentals: Pods, Services, Deployments, and cluster architecture
- Model Deployment: production-ready ML model deployments with health checks
- Autoscaling: HPA and VPA for automatic resource scaling
- GPU Scheduling: efficient GPU sharing and multi-GPU training
- Helm Charts: package and deploy complex ML applications
- Monitoring: Prometheus and custom metrics for ML workloads
Key Takeaways
- Kubernetes provides container orchestration at scale for ML workloads
- Use Deployments for stateless ML serving and Jobs (or StatefulSets when stable pod identities are needed) for training
- Implement HPA for automatic scaling based on traffic and metrics
- Leverage GPU time-slicing for inference, full GPUs for training
- Use Helm charts to manage complex ML application deployments
- Monitor with Prometheus and expose custom ML metrics
- Set resource requests/limits for cost control and stability
🚀 Next Steps:
Your ML infrastructure is production-ready on Kubernetes! Next tutorials will cover model monitoring, data pipelines, feature stores, and CI/CD - completing your MLOps toolkit for end-to-end production ML systems.
Test Your Knowledge
Q1: What is a Pod in Kubernetes?
Q2: What does Horizontal Pod Autoscaler (HPA) do?
Q3: How do you request a GPU in a Kubernetes pod?
Q4: What is the benefit of using Helm charts?
Q5: What should you use to store sensitive information like API keys in Kubernetes?