The Infrastructure Problem
Your ML pipeline works perfectly in development. Now you need to deploy to staging and production. You:
- Manually click through AWS console creating S3 buckets, IAM roles, EC2 instances
- Document each step in a 20-page runbook
- Recreate everything for staging (slightly different)
- Someone needs to update a security group - which one was it?
- Disaster recovery? Start clicking again...
Infrastructure as Code (IaC) treats infrastructure like software. You define infrastructure in code files, version control them, and automatically provision identical environments. Changes go through code review instead of manual clicks.
IaC Benefits:
- Reproducibility: Create identical environments every time
- Version Control: Track infrastructure changes in Git
- Documentation: Code is documentation
- Automation: No manual provisioning
- Testing: Test infrastructure changes before production
- Disaster Recovery: Rebuild entire infrastructure from code
Terraform for ML Infrastructure
What is Terraform?
Terraform is a cloud-agnostic IaC tool. You write infrastructure in HCL (HashiCorp Configuration Language), and Terraform provisions resources on AWS, Google Cloud, Azure, and over a thousand other providers.
Installation
# Install Terraform (macOS)
brew install terraform
# Verify installation
terraform version
# Configure AWS credentials (used by the AWS provider)
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_DEFAULT_REGION="us-west-2"
Your First Terraform Configuration
# main.tf - Create S3 bucket for ML data
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-west-2"
}

# S3 bucket for training data
resource "aws_s3_bucket" "ml_data" {
  bucket = "my-ml-data-bucket"

  tags = {
    Name        = "ML Training Data"
    Environment = "production"
    Project     = "ml-pipeline"
  }
}

# Enable versioning
resource "aws_s3_bucket_versioning" "ml_data_versioning" {
  bucket = aws_s3_bucket.ml_data.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Block public access
resource "aws_s3_bucket_public_access_block" "ml_data_public_access" {
  bucket = aws_s3_bucket.ml_data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
# Terraform workflow
terraform init # Initialize providers
terraform plan # Preview changes
terraform apply # Apply changes
terraform destroy # Destroy infrastructure
Complete ML Infrastructure
# ml-infrastructure.tf
# Variables for reusability
variable "project_name" {
  description = "Project name"
  default     = "ml-pipeline"
}

variable "environment" {
  description = "Environment (dev/staging/prod)"
  default     = "production"
}

variable "db_password" {
  description = "Master password for the metadata database"
  type        = string
  sensitive   = true
}

# S3 buckets for different stages
resource "aws_s3_bucket" "raw_data" {
  bucket = "${var.project_name}-raw-data-${var.environment}"
}

resource "aws_s3_bucket" "processed_data" {
  bucket = "${var.project_name}-processed-data-${var.environment}"
}

resource "aws_s3_bucket" "models" {
  bucket = "${var.project_name}-models-${var.environment}"
}

# IAM role for ML training job
resource "aws_iam_role" "ml_training_role" {
  name = "${var.project_name}-training-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "sagemaker.amazonaws.com"
      }
    }]
  })
}

# IAM policy for S3 access
resource "aws_iam_policy" "ml_s3_access" {
  name = "${var.project_name}-s3-access"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        # s3:ListBucket applies to the bucket itself, so list both the
        # bucket ARNs and the object ARNs
        Resource = [
          aws_s3_bucket.raw_data.arn,
          aws_s3_bucket.processed_data.arn,
          aws_s3_bucket.models.arn,
          "${aws_s3_bucket.raw_data.arn}/*",
          "${aws_s3_bucket.processed_data.arn}/*",
          "${aws_s3_bucket.models.arn}/*"
        ]
      }
    ]
  })
}

# Attach policy to role
resource "aws_iam_role_policy_attachment" "ml_s3_attach" {
  role       = aws_iam_role.ml_training_role.name
  policy_arn = aws_iam_policy.ml_s3_access.arn
}

# EC2 instance for training
resource "aws_instance" "ml_training" {
  ami                  = "ami-0c55b159cbfafe1f0" # Deep Learning AMI (AMI IDs are region-specific)
  instance_type        = "p3.2xlarge"            # GPU instance
  iam_instance_profile = aws_iam_instance_profile.ml_training_profile.name

  root_block_device {
    volume_size = 100 # GB
    volume_type = "gp3"
  }

  tags = {
    Name = "${var.project_name}-training-${var.environment}"
  }

  user_data = <<-EOF
    #!/bin/bash
    pip install -r /app/requirements.txt
    python /app/train.py
  EOF
}

# IAM instance profile
resource "aws_iam_instance_profile" "ml_training_profile" {
  name = "${var.project_name}-training-profile"
  role = aws_iam_role.ml_training_role.name
}

# RDS database for metadata
resource "aws_db_instance" "ml_metadata" {
  identifier        = "${var.project_name}-metadata"
  engine            = "postgres"
  engine_version    = "15.3"
  instance_class    = "db.t3.micro"
  allocated_storage = 20

  db_name  = "mlmetadata"
  username = "admin"
  password = var.db_password # Declared above as a sensitive variable

  skip_final_snapshot = true

  tags = {
    Name = "${var.project_name}-metadata-${var.environment}"
  }
}

# Outputs
output "raw_data_bucket" {
  value = aws_s3_bucket.raw_data.bucket
}

output "models_bucket" {
  value = aws_s3_bucket.models.bucket
}

output "training_instance_ip" {
  value = aws_instance.ml_training.public_ip
}

output "database_endpoint" {
  value = aws_db_instance.ml_metadata.endpoint
}
Multi-Environment with Workspaces
# Create workspaces for different environments
terraform workspace new dev
terraform workspace new staging
terraform workspace new production
# Switch workspace
terraform workspace select production
# Apply to specific environment
terraform apply -var="environment=production"
# Each workspace has separate state!
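If you prefer to derive names from the active workspace instead of passing -var flags, Terraform exposes the workspace name as terraform.workspace. A minimal sketch (the bucket name here is illustrative):
# Derive environment-specific names from the active workspace
locals {
  environment = terraform.workspace # "dev", "staging", or "production"
}

resource "aws_s3_bucket" "data" {
  bucket = "ml-pipeline-data-${local.environment}"
}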
Modules for Reusability
# modules/ml-storage/main.tf
variable "project_name" {}
variable "environment" {}

resource "aws_s3_bucket" "data" {
  bucket = "${var.project_name}-${var.environment}"
}

resource "aws_s3_bucket_versioning" "data_versioning" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = "Enabled"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.data.bucket
}

output "bucket_arn" {
  value = aws_s3_bucket.data.arn
}

# main.tf - Use module
module "raw_data_storage" {
  source       = "./modules/ml-storage"
  project_name = "ml-pipeline"
  environment  = "production"
}

module "models_storage" {
  source       = "./modules/ml-storage"
  project_name = "ml-models"
  environment  = "production"
}
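Module outputs are how the rest of the configuration consumes what a module creates. As a sketch (the policy below is hypothetical, not part of the infrastructure above), the bucket_arn output of raw_data_storage could feed an IAM policy in the root module:
# Reference a module output from the root module
resource "aws_iam_policy" "raw_data_read" {
  name = "ml-pipeline-raw-data-read"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        module.raw_data_storage.bucket_arn,
        "${module.raw_data_storage.bucket_arn}/*"
      ]
    }]
  })
}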
AWS CloudFormation
CloudFormation vs Terraform
| Aspect | Terraform | CloudFormation |
|---|---|---|
| Cloud Support | Multi-cloud | AWS only |
| Language | HCL | YAML/JSON |
| State Management | Requires state file | Managed by AWS |
| AWS Integration | Good | Native |
| Community | Larger | AWS-focused |
CloudFormation Template
# ml-infrastructure.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: ML Pipeline Infrastructure

Parameters:
  ProjectName:
    Type: String
    Default: ml-pipeline
    Description: Project name prefix
  Environment:
    Type: String
    Default: production
    AllowedValues:
      - dev
      - staging
      - production
    Description: Environment name

Resources:
  # S3 Bucket for training data
  TrainingDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub '${ProjectName}-data-${Environment}'
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - Key: Name
          Value: !Sub '${ProjectName}-data'
        - Key: Environment
          Value: !Ref Environment

  # S3 Bucket for models
  ModelsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub '${ProjectName}-models-${Environment}'
      VersioningConfiguration:
        Status: Enabled

  # IAM Role for SageMaker
  SageMakerExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${ProjectName}-sagemaker-role'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                  - s3:ListBucket
                Resource:
                  - !GetAtt TrainingDataBucket.Arn
                  - !Sub '${TrainingDataBucket.Arn}/*'
                  - !GetAtt ModelsBucket.Arn
                  - !Sub '${ModelsBucket.Arn}/*'

  # ECR Repository for Docker images
  MLModelRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Sub '${ProjectName}-models'
      ImageScanningConfiguration:
        ScanOnPush: true
      Tags:
        - Key: Environment
          Value: !Ref Environment

  # ECS Cluster for model serving
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub '${ProjectName}-cluster-${Environment}'
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 1

  # VPC for ML workloads
  MLVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub '${ProjectName}-vpc'

Outputs:
  DataBucketName:
    Description: Training data bucket
    Value: !Ref TrainingDataBucket
    Export:
      Name: !Sub '${AWS::StackName}-DataBucket'
  ModelsBucketName:
    Description: Models bucket
    Value: !Ref ModelsBucket
    Export:
      Name: !Sub '${AWS::StackName}-ModelsBucket'
  SageMakerRoleArn:
    Description: SageMaker execution role ARN
    Value: !GetAtt SageMakerExecutionRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-SageMakerRole'
# Deploy CloudFormation stack
aws cloudformation create-stack \
  --stack-name ml-infrastructure-prod \
  --template-body file://ml-infrastructure.yaml \
  --parameters ParameterKey=Environment,ParameterValue=production \
  --capabilities CAPABILITY_NAMED_IAM

# Update stack (the IAM capability must be acknowledged on updates too)
aws cloudformation update-stack \
  --stack-name ml-infrastructure-prod \
  --template-body file://ml-infrastructure.yaml \
  --capabilities CAPABILITY_NAMED_IAM

# Delete stack
aws cloudformation delete-stack \
  --stack-name ml-infrastructure-prod

# View stack outputs
aws cloudformation describe-stacks \
  --stack-name ml-infrastructure-prod \
  --query 'Stacks[0].Outputs'
Secrets Management with Vault
Why Vault?
Never hardcode secrets (API keys, passwords, credentials) in code or config files. HashiCorp Vault provides centralized secrets management with encryption, access control, and audit logs.
Install Vault
# Install Vault
brew install vault
# Start Vault dev server (development only)
vault server -dev
# Point the CLI at the dev server; use the root token printed when the dev server starts
export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_TOKEN=''
Store and Retrieve Secrets
# Write secrets
vault kv put secret/ml-pipeline/db \
username=admin \
password=secure-password-123
vault kv put secret/ml-pipeline/aws \
access_key=AKIAIOSFODNN7EXAMPLE \
secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# Read secrets
vault kv get secret/ml-pipeline/db
# Get specific field
vault kv get -field=password secret/ml-pipeline/db
# List secrets
vault kv list secret/ml-pipeline
Use Vault in Python
"""
Access secrets from Vault in Python
"""
import hvac

# Initialize Vault client
client = hvac.Client(
    url='http://127.0.0.1:8200',
    token='your-vault-token'
)

# Read database credentials
db_secrets = client.secrets.kv.v2.read_secret_version(
    path='ml-pipeline/db'
)
db_username = db_secrets['data']['data']['username']
db_password = db_secrets['data']['data']['password']

# Use credentials
import psycopg2

conn = psycopg2.connect(
    host='ml-db.example.com',
    database='mlmetadata',
    user=db_username,
    password=db_password
)

# Read AWS credentials
aws_secrets = client.secrets.kv.v2.read_secret_version(
    path='ml-pipeline/aws'
)

import boto3

s3_client = boto3.client(
    's3',
    aws_access_key_id=aws_secrets['data']['data']['access_key'],
    aws_secret_access_key=aws_secrets['data']['data']['secret_key']
)
Vault with Terraform
# Configure Vault provider
variable "vault_token" {
  type      = string
  sensitive = true
}

provider "vault" {
  address = "http://127.0.0.1:8200"
  token   = var.vault_token
}

# Read secret from Vault (KV v2 read paths include /data/)
data "vault_generic_secret" "db_creds" {
  path = "secret/data/ml-pipeline/db"
}

# Use secret in resource
resource "aws_db_instance" "ml_metadata" {
  # ... other configuration
  username = data.vault_generic_secret.db_creds.data["username"]
  password = data.vault_generic_secret.db_creds.data["password"]
}

# Write secret to Vault
resource "vault_generic_secret" "api_key" {
  path = "secret/ml-pipeline/api"

  data_json = jsonencode({
    api_key = random_password.api_key.result
  })
}

resource "random_password" "api_key" {
  length  = 32
  special = true
}
AWS Secrets Manager Alternative
# Create secret in AWS Secrets Manager
aws secretsmanager create-secret \
--name ml-pipeline/db-credentials \
--secret-string '{"username":"admin","password":"secure-pass"}'
# Retrieve secret
aws secretsmanager get-secret-value \
--secret-id ml-pipeline/db-credentials
# Access from Python
import boto3
import json
client = boto3.client('secretsmanager', region_name='us-west-2')
response = client.get_secret_value(SecretId='ml-pipeline/db-credentials')
secrets = json.loads(response['SecretString'])
db_username = secrets['username']
db_password = secrets['password']
IaC Best Practices
- Version Control: Store IaC in Git so every infrastructure change gets a code review and a commit history
- Never Commit Secrets: Add terraform.tfstate and *.tfvars to .gitignore; store secrets in Vault or Secrets Manager
- Remote State: Store Terraform state in S3 with DynamoDB locking for team collaboration
- Tagging: Tag all resources with project, environment, and owner for cost tracking (a default_tags sketch follows this list)
- Test Changes: Always run terraform plan before apply, and test in dev before production
- Use Modules: Create reusable modules for common patterns (ML storage, compute, networking)
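For the tagging practice, the AWS provider can stamp a common tag set onto every resource it creates via default_tags, so individual resources only add tags specific to them. A minimal sketch (tag values are illustrative):
# Apply common tags to every resource created by this provider
provider "aws" {
  region = "us-west-2"

  default_tags {
    tags = {
      Project     = "ml-pipeline"
      Environment = "production"
      Owner       = "ml-platform-team"
    }
  }
}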
Remote State Configuration
# backend.tf - Store state in S3
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "ml-pipeline/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
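The backend block only points at the state bucket and lock table; it does not create them. They are usually bootstrapped once, by hand or from a small separate configuration. A minimal sketch matching the names above:
# One-time bootstrap for remote state (kept outside the main configuration)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # The S3 backend expects this exact attribute name

  attribute {
    name = "LockID"
    type = "S"
  }
}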
Security Considerations:
- Never store secrets in Terraform files or state (see the Secrets Manager lookup sketch after this list)
- Enable state encryption (S3 bucket encryption)
- Use IAM roles/policies for least-privilege access
- Enable CloudTrail/audit logging for infrastructure changes
- Regularly rotate secrets and credentials
- Use separate AWS accounts for dev/staging/prod
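To make the first two points concrete, Terraform can read the database password from the ml-pipeline/db-credentials secret created earlier rather than from a .tf or .tfvars file. A minimal sketch (note that the resolved value still ends up in Terraform state, which is why encrypting the state backend matters):
# Look up the secret stored in AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_creds" {
  secret_id = "ml-pipeline/db-credentials"
}

locals {
  db_creds = jsondecode(data.aws_secretsmanager_secret_version.db_creds.secret_string)
}

resource "aws_db_instance" "ml_metadata" {
  # ... other configuration as shown earlier
  username = local.db_creds["username"]
  password = local.db_creds["password"]
}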
Summary
You've mastered Infrastructure as Code for ML:
- Terraform: Multi-cloud IaC with HCL, modules, and workspaces
- CloudFormation: AWS-native infrastructure provisioning with YAML
- Secrets Management: Vault and AWS Secrets Manager for secure credentials
- Reproducibility: Create identical environments from code
- Version Control: Track all infrastructure changes in Git
- Automation: Eliminate manual provisioning and configuration
Key Takeaways
- Infrastructure as Code makes ML systems reproducible and maintainable
- Terraform provides multi-cloud support with modules and workspaces
- CloudFormation offers native AWS integration with managed state
- Never hardcode secrets - use Vault or Secrets Manager
- Store Terraform state remotely with locking for team collaboration
- Tag all resources for cost tracking and management
- Test infrastructure changes in dev before production
Next Steps:
Your infrastructure is now code! Next, you'll learn model monitoring and observability - tracking model performance, detecting drift, and ensuring production models stay healthy.
Test Your Knowledge
Q1: What is the main benefit of Infrastructure as Code?
Q2: What's the main advantage of Terraform over CloudFormation?
Q3: Where should you store sensitive credentials like database passwords?
Q4: What is the purpose of Terraform state?
Q5: What should you do before running 'terraform apply' in production?