Infrastructure as Code for ML

Automate ML infrastructure with Terraform and CloudFormation. Build reproducible environments, manage secrets with Vault, and eliminate manual configuration.

📅 Tutorial 10 📊 Advanced


๐Ÿ—๏ธ The Infrastructure Problem

Your ML pipeline works perfectly in development. Now you need to deploy to staging and production. You:

  • Manually click through AWS console creating S3 buckets, IAM roles, EC2 instances
  • Document each step in a 20-page runbook
  • Recreate everything for staging (slightly different)
  • Someone needs to update a security group - which one was it?
  • Disaster recovery? Start clicking again...

Infrastructure as Code (IaC) treats infrastructure like software. You define infrastructure in code files, version control them, and automatically provision identical environments. Changes go through code review instead of manual clicks.

💡 IaC Benefits:

  • Reproducibility: Create identical environments every time
  • Version Control: Track infrastructure changes in Git
  • Documentation: Code is documentation
  • Automation: No manual provisioning
  • Testing: Test infrastructure changes before production
  • Disaster Recovery: Rebuild entire infrastructure from code

๐ŸŒ Terraform for ML Infrastructure

What is Terraform?

Terraform is a cloud-agnostic IaC tool. Write infrastructure in HCL (HashiCorp Configuration Language), and Terraform provisions resources across AWS, Google Cloud, Azure, and 1000+ providers.

Installation

# Install Terraform (macOS, via the official HashiCorp tap)
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Verify installation
terraform version

# Configure AWS credentials
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_DEFAULT_REGION="us-west-2"
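
Exporting raw keys works for quick experiments, but on shared machines a named AWS CLI profile is usually safer. A minimal sketch (the profile name ml-pipeline is just an example); Terraform's AWS provider honors AWS_PROFILE:

# Alternative: configure a named profile instead of exporting raw keys
aws configure --profile ml-pipeline

# Terraform and the AWS CLI will use this profile
export AWS_PROFILE=ml-pipeline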

Your First Terraform Configuration

# main.tf - Create S3 bucket for ML data
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-west-2"
}

# S3 bucket for training data
resource "aws_s3_bucket" "ml_data" {
  bucket = "my-ml-data-bucket"
  
  tags = {
    Name        = "ML Training Data"
    Environment = "production"
    Project     = "ml-pipeline"
  }
}

# Enable versioning
resource "aws_s3_bucket_versioning" "ml_data_versioning" {
  bucket = aws_s3_bucket.ml_data.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Block public access
resource "aws_s3_bucket_public_access_block" "ml_data_public_access" {
  bucket = aws_s3_bucket.ml_data.id
  
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
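
If every resource should carry the same tags, the AWS provider's default_tags block can apply them globally instead of repeating a tags block per resource. A sketch reusing the values above:

# Provider-level tags are merged into every AWS resource's tags
provider "aws" {
  region = "us-west-2"

  default_tags {
    tags = {
      Project     = "ml-pipeline"
      Environment = "production"
    }
  }
}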

# Terraform workflow
terraform init      # Initialize providers
terraform plan      # Preview changes
terraform apply     # Apply changes
terraform destroy   # Destroy infrastructure
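
In CI it is common to save the plan to a file and apply exactly that file, so the changes that were reviewed are the changes that run. For example:

# Save a plan, inspect it, then apply exactly that plan
terraform plan -out=tfplan
terraform show tfplan
terraform apply tfplan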

Complete ML Infrastructure

# ml-infrastructure.tf
# Variables for reusability
variable "project_name" {
  description = "Project name"
  default     = "ml-pipeline"
}

variable "environment" {
  description = "Environment (dev/staging/prod)"
  default     = "production"
}

# S3 buckets for different stages
resource "aws_s3_bucket" "raw_data" {
  bucket = "${var.project_name}-raw-data-${var.environment}"
}

resource "aws_s3_bucket" "processed_data" {
  bucket = "${var.project_name}-processed-data-${var.environment}"
}

resource "aws_s3_bucket" "models" {
  bucket = "${var.project_name}-models-${var.environment}"
}

# IAM role for ML training job
resource "aws_iam_role" "ml_training_role" {
  name = "${var.project_name}-training-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "sagemaker.amazonaws.com"
      }
    }]
  })
}

# IAM policy for S3 access
resource "aws_iam_policy" "ml_s3_access" {
  name = "${var.project_name}-s3-access"
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        # s3:ListBucket applies to the bucket ARN; Get/PutObject apply to object ARNs
        Resource = [
          aws_s3_bucket.raw_data.arn,
          "${aws_s3_bucket.raw_data.arn}/*",
          aws_s3_bucket.processed_data.arn,
          "${aws_s3_bucket.processed_data.arn}/*",
          aws_s3_bucket.models.arn,
          "${aws_s3_bucket.models.arn}/*"
        ]
      }
    ]
  })
}

# Attach policy to role
resource "aws_iam_role_policy_attachment" "ml_s3_attach" {
  role       = aws_iam_role.ml_training_role.name
  policy_arn = aws_iam_policy.ml_s3_access.arn
}

# EC2 instance for training
resource "aws_instance" "ml_training" {
  ami           = "ami-0c55b159cbfafe1f0"  # Deep Learning AMI
  instance_type = "p3.2xlarge"  # GPU instance
  
  iam_instance_profile = aws_iam_instance_profile.ml_training_profile.name
  
  root_block_device {
    volume_size = 100  # GB
    volume_type = "gp3"
  }
  
  tags = {
    Name = "${var.project_name}-training-${var.environment}"
  }
  
  user_data = <<-EOF
              #!/bin/bash
              pip install -r /app/requirements.txt
              python /app/train.py
              EOF
}

# IAM instance profile
resource "aws_iam_instance_profile" "ml_training_profile" {
  name = "${var.project_name}-training-profile"
  role = aws_iam_role.ml_training_role.name
}

# RDS database for metadata
resource "aws_db_instance" "ml_metadata" {
  identifier           = "${var.project_name}-metadata"
  engine              = "postgres"
  engine_version      = "15.3"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  
  db_name  = "mlmetadata"
  username = "admin"
  password = var.db_password  # From variable
  
  skip_final_snapshot = true
  
  tags = {
    Name = "${var.project_name}-metadata-${var.environment}"
  }
}

# Outputs
output "raw_data_bucket" {
  value = aws_s3_bucket.raw_data.bucket
}

output "models_bucket" {
  value = aws_s3_bucket.models.bucket
}

output "training_instance_ip" {
  value = aws_instance.ml_training.public_ip
}

output "database_endpoint" {
  value = aws_db_instance.ml_metadata.endpoint
}
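
The db_password variable declared above has no default, so supply it at plan/apply time. One option is Terraform's TF_VAR_ environment variable prefix, which avoids writing the password to a .tfvars file:

# Terraform maps TF_VAR_<name> environment variables to input variables
export TF_VAR_db_password='a-strong-password'   # ideally pulled from a secrets manager
terraform plan
terraform apply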

Multi-Environment with Workspaces

# Create workspaces for different environments
terraform workspace new dev
terraform workspace new staging
terraform workspace new production

# Switch workspace
terraform workspace select production

# Apply to specific environment
terraform apply -var="environment=production"

# Each workspace has separate state!
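
A common pattern is to derive names from the built-in terraform.workspace value so each workspace provisions its own copy of every resource. A minimal sketch:

# Use the current workspace name as the environment
locals {
  environment = terraform.workspace
}

resource "aws_s3_bucket" "models" {
  bucket = "ml-pipeline-models-${local.environment}"
}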

Modules for Reusability

# modules/ml-storage/main.tf
variable "project_name" {}
variable "environment" {}

resource "aws_s3_bucket" "data" {
  bucket = "${var.project_name}-${var.environment}"
}

resource "aws_s3_bucket_versioning" "data_versioning" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.data.bucket
}

output "bucket_arn" {
  value = aws_s3_bucket.data.arn
}

# main.tf - Use module
module "raw_data_storage" {
  source = "./modules/ml-storage"
  
  project_name = "ml-pipeline"
  environment  = "production"
}

module "models_storage" {
  source = "./modules/ml-storage"
  
  project_name = "ml-models"
  environment  = "production"
}
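
Module outputs are read as module.<name>.<output>, which is how the root configuration or other resources consume them. For example:

# Consume module outputs in the root configuration
output "raw_data_bucket_name" {
  value = module.raw_data_storage.bucket_name
}

output "models_bucket_arn" {
  value = module.models_storage.bucket_arn
}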

โ˜๏ธ AWS CloudFormation

CloudFormation vs Terraform

| Aspect           | Terraform           | CloudFormation    |
|------------------|---------------------|-------------------|
| Cloud Support    | ✅ Multi-cloud      | AWS only          |
| Language         | HCL                 | YAML/JSON         |
| State Management | Requires state file | ✅ Managed by AWS |
| AWS Integration  | Good                | ✅ Native         |
| Community        | ✅ Larger           | AWS-focused       |

CloudFormation Template

# ml-infrastructure.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: ML Pipeline Infrastructure

Parameters:
  ProjectName:
    Type: String
    Default: ml-pipeline
    Description: Project name prefix
  
  Environment:
    Type: String
    Default: production
    AllowedValues:
      - dev
      - staging
      - production
    Description: Environment name

Resources:
  # S3 Bucket for training data
  TrainingDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub '${ProjectName}-data-${Environment}'
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - Key: Name
          Value: !Sub '${ProjectName}-data'
        - Key: Environment
          Value: !Ref Environment
  
  # S3 Bucket for models
  ModelsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub '${ProjectName}-models-${Environment}'
      VersioningConfiguration:
        Status: Enabled
  
  # IAM Role for SageMaker
  SageMakerExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${ProjectName}-sagemaker-role'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                  - s3:ListBucket
                Resource:
                  - !GetAtt TrainingDataBucket.Arn
                  - !Sub '${TrainingDataBucket.Arn}/*'
                  - !GetAtt ModelsBucket.Arn
                  - !Sub '${ModelsBucket.Arn}/*'
  
  # ECR Repository for Docker images
  MLModelRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Sub '${ProjectName}-models'
      ImageScanningConfiguration:
        ScanOnPush: true
      Tags:
        - Key: Environment
          Value: !Ref Environment
  
  # ECS Cluster for model serving
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub '${ProjectName}-cluster-${Environment}'
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 1
  
  # VPC for ML workloads
  MLVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub '${ProjectName}-vpc'

Outputs:
  DataBucketName:
    Description: Training data bucket
    Value: !Ref TrainingDataBucket
    Export:
      Name: !Sub '${AWS::StackName}-DataBucket'
  
  ModelsBucketName:
    Description: Models bucket
    Value: !Ref ModelsBucket
    Export:
      Name: !Sub '${AWS::StackName}-ModelsBucket'
  
  SageMakerRoleArn:
    Description: SageMaker execution role ARN
    Value: !GetAtt SageMakerExecutionRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-SageMakerRole'

# Deploy CloudFormation stack
aws cloudformation create-stack \
  --stack-name ml-infrastructure-prod \
  --template-body file://ml-infrastructure.yaml \
  --parameters ParameterKey=Environment,ParameterValue=production \
  --capabilities CAPABILITY_NAMED_IAM

# Update stack
aws cloudformation update-stack \
  --stack-name ml-infrastructure-prod \
  --template-body file://ml-infrastructure.yaml \
  --capabilities CAPABILITY_NAMED_IAM

# Delete stack
aws cloudformation delete-stack \
  --stack-name ml-infrastructure-prod

# View stack outputs
aws cloudformation describe-stacks \
  --stack-name ml-infrastructure-prod \
  --query 'Stacks[0].Outputs'
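
CloudFormation change sets play the same role as terraform plan: they preview what an update would do before you execute it. A sketch (the change set name is arbitrary):

# Preview an update with a change set
aws cloudformation create-change-set \
  --stack-name ml-infrastructure-prod \
  --change-set-name preview-update \
  --template-body file://ml-infrastructure.yaml \
  --capabilities CAPABILITY_NAMED_IAM

# Inspect the proposed changes
aws cloudformation describe-change-set \
  --stack-name ml-infrastructure-prod \
  --change-set-name preview-update

# Apply the change set
aws cloudformation execute-change-set \
  --stack-name ml-infrastructure-prod \
  --change-set-name preview-update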

๐Ÿ” Secrets Management with Vault

Why Vault?

Never hardcode secrets (API keys, passwords, credentials) in code or config files. HashiCorp Vault provides centralized secrets management with encryption, access control, and audit logs.

Install Vault

# Install Vault (macOS, via the official HashiCorp tap)
brew tap hashicorp/tap
brew install hashicorp/tap/vault

# Start Vault dev server (development only)
vault server -dev

# Point the CLI at the dev server and use the root token it prints on startup
export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_TOKEN='<root-token-from-dev-server-output>'

Store and Retrieve Secrets

# Write secrets
vault kv put secret/ml-pipeline/db \
  username=admin \
  password=secure-password-123

vault kv put secret/ml-pipeline/aws \
  access_key=AKIAIOSFODNN7EXAMPLE \
  secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# Read secrets
vault kv get secret/ml-pipeline/db

# Get specific field
vault kv get -field=password secret/ml-pipeline/db

# List secrets
vault kv list secret/ml-pipeline
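
Access control comes from Vault policies. A minimal sketch of a read-only policy for this project's secrets (note that KV v2 read paths include data/ and list paths use metadata/):

# Define a read-only policy for the project's secrets
vault policy write ml-pipeline-read - <<'EOF'
path "secret/data/ml-pipeline/*" {
  capabilities = ["read"]
}
path "secret/metadata/ml-pipeline/*" {
  capabilities = ["list"]
}
EOF

# Issue a token limited to that policy
vault token create -policy=ml-pipeline-read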

Use Vault in Python

"""
Access secrets from Vault in Python
"""
import hvac

# Initialize Vault client
client = hvac.Client(
    url='http://127.0.0.1:8200',
    token='your-vault-token'
)

# Read database credentials
db_secrets = client.secrets.kv.v2.read_secret_version(
    path='ml-pipeline/db'
)

db_username = db_secrets['data']['data']['username']
db_password = db_secrets['data']['data']['password']

# Use credentials
import psycopg2
conn = psycopg2.connect(
    host='ml-db.example.com',
    database='mlmetadata',
    user=db_username,
    password=db_password
)

# Read AWS credentials
aws_secrets = client.secrets.kv.v2.read_secret_version(
    path='ml-pipeline/aws'
)

import boto3
s3_client = boto3.client(
    's3',
    aws_access_key_id=aws_secrets['data']['data']['access_key'],
    aws_secret_access_key=aws_secrets['data']['data']['secret_key']
)
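
Rather than hardcoding the token as above, read the Vault address and token from the environment (or use a proper auth method). A small sketch:

"""
Initialize the Vault client from environment variables
"""
import os

import hvac

client = hvac.Client(
    url=os.environ.get("VAULT_ADDR", "http://127.0.0.1:8200"),
    token=os.environ["VAULT_TOKEN"],  # fails loudly if the token is missing
)
assert client.is_authenticated(), "Vault authentication failed"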

Vault with Terraform

# Configure Vault provider
provider "vault" {
  address = "http://127.0.0.1:8200"
  token   = var.vault_token
}

# Read secret from Vault
data "vault_generic_secret" "db_creds" {
  path = "secret/data/ml-pipeline/db"
}

# Use secret in resource
resource "aws_db_instance" "ml_metadata" {
  # ... other configuration
  
  username = data.vault_generic_secret.db_creds.data["username"]
  password = data.vault_generic_secret.db_creds.data["password"]
}

# Write secret to Vault
resource "vault_generic_secret" "api_key" {
  path = "secret/ml-pipeline/api"
  
  data_json = jsonencode({
    api_key = random_password.api_key.result
  })
}

resource "random_password" "api_key" {
  length  = 32
  special = true
}
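
The var.vault_token reference above assumes a matching declaration; marking it sensitive keeps the value out of plan output. A sketch:

variable "vault_token" {
  description = "Token used by the Vault provider"
  type        = string
  sensitive   = true
}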

AWS Secrets Manager Alternative

# Create secret in AWS Secrets Manager
aws secretsmanager create-secret \
  --name ml-pipeline/db-credentials \
  --secret-string '{"username":"admin","password":"secure-pass"}'

# Retrieve secret
aws secretsmanager get-secret-value \
  --secret-id ml-pipeline/db-credentials

# Access from Python
import boto3
import json

client = boto3.client('secretsmanager', region_name='us-west-2')

response = client.get_secret_value(SecretId='ml-pipeline/db-credentials')
secrets = json.loads(response['SecretString'])

db_username = secrets['username']
db_password = secrets['password']

✅ IaC Best Practices

📝

Version Control

Store IaC in Git. Every infrastructure change gets a code review and a commit history

🔒

Never Commit Secrets

Use .gitignore for terraform.tfstate and *.tfvars. Store secrets in Vault/Secrets Manager

📦

Remote State

Store Terraform state in S3 with DynamoDB locking for team collaboration

๐Ÿท๏ธ

Tagging

Tag all resources with project, environment, owner for cost tracking

🧪

Test Changes

Always run terraform plan before apply. Test in dev before production

📚

Use Modules

Create reusable modules for common patterns (ML storage, compute, networking)
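
To back the 'Never Commit Secrets' practice above, a typical .gitignore for a Terraform repository looks roughly like this:

# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
*.tfvars
crash.log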

Remote State Configuration

# backend.tf - Store state in S3
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "ml-pipeline/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
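
The state bucket and lock table must exist before terraform init can use this backend. One way to bootstrap them with the AWS CLI (names must match the backend block; Terraform expects the lock table's partition key to be LockID):

# Create the state bucket with versioning enabled
aws s3api create-bucket \
  --bucket my-terraform-state-bucket \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2
aws s3api put-bucket-versioning \
  --bucket my-terraform-state-bucket \
  --versioning-configuration Status=Enabled

# Create the DynamoDB lock table
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST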

โš ๏ธ Security Considerations:

  • Never hardcode secrets in Terraform files, and treat state files as sensitive (they can contain secrets in plaintext)
  • Enable state encryption (S3 bucket encryption)
  • Use IAM roles/policies for least-privilege access
  • Enable CloudTrail/audit logging for infrastructure changes
  • Regularly rotate secrets and credentials
  • Use separate AWS accounts for dev/staging/prod

🎯 Summary

You've mastered Infrastructure as Code for ML:

🌍

Terraform

Multi-cloud IaC with HCL, modules, and workspaces

โ˜๏ธ

CloudFormation

AWS-native infrastructure provisioning with YAML

🔐

Secrets Management

Vault and AWS Secrets Manager for secure credentials

♻️

Reproducibility

Create identical environments from code

📋

Version Control

Track all infrastructure changes in Git

🔄

Automation

Eliminate manual provisioning and configuration

Key Takeaways

  1. Infrastructure as Code makes ML systems reproducible and maintainable
  2. Terraform provides multi-cloud support with modules and workspaces
  3. CloudFormation offers native AWS integration with managed state
  4. Never hardcode secrets - use Vault or Secrets Manager
  5. Store Terraform state remotely with locking for team collaboration
  6. Tag all resources for cost tracking and management
  7. Test infrastructure changes in dev before production

🚀 Next Steps:

Your infrastructure is now code! Next, you'll learn model monitoring and observability - tracking model performance, detecting drift, and ensuring production models stay healthy.

Test Your Knowledge

Q1: What is the main benefit of Infrastructure as Code?

It's faster than manual provisioning
It creates reproducible, version-controlled infrastructure that can be automatically provisioned
It's required by AWS
It reduces cloud costs

Q2: What's the main advantage of Terraform over CloudFormation?

Multi-cloud support - works with AWS, Google Cloud, Azure, and many other providers
It's faster
Better AWS integration
Managed state without configuration

Q3: Where should you store sensitive credentials like database passwords?

In Terraform .tf files
In Git repository
In secrets management systems like Vault or AWS Secrets Manager
In environment variables in code

Q4: What is the purpose of Terraform state?

To store secrets
To run the infrastructure
To deploy applications
To track the current state of infrastructure and enable updates

Q5: What should you do before running 'terraform apply' in production?

Nothing, just apply
Run 'terraform plan' to preview changes and test in dev/staging first
Delete existing infrastructure
Backup your laptop