Infrastructure as Code for ML

Automate ML infrastructure with Terraform and CloudFormation. Build reproducible environments, manage secrets with Vault, and eliminate manual configuration.

📅 Tutorial 10 📊 Advanced


๐Ÿ—๏ธ The Infrastructure Problem

Your ML pipeline works perfectly in development. Now you need to deploy to staging and production. You:

  • Manually click through AWS console creating S3 buckets, IAM roles, EC2 instances
  • Document each step in a 20-page runbook
  • Recreate everything for staging (slightly different)
  • Someone needs to update a security group - which one was it?
  • Disaster recovery? Start clicking again...

Infrastructure as Code (IaC) treats infrastructure like software. You define infrastructure in code files, version control them, and automatically provision identical environments. Changes go through code review instead of manual clicks.

💡 IaC Benefits:

  • Reproducibility: Create identical environments every time
  • Version Control: Track infrastructure changes in Git
  • Documentation: Code is documentation
  • Automation: No manual provisioning
  • Testing: Test infrastructure changes before production
  • Disaster Recovery: Rebuild entire infrastructure from code

๐ŸŒ Terraform for ML Infrastructure

What is Terraform?

Terraform is a cloud-agnostic IaC tool. Write infrastructure in HCL (HashiCorp Configuration Language), and Terraform provisions resources across AWS, Google Cloud, Azure, and 1000+ providers.

Installation

# Install Terraform (macOS, via the official HashiCorp tap)
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Verify installation
terraform version

# Configure AWS credentials
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_DEFAULT_REGION="us-west-2"
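
Exporting raw keys works for quick experiments, but on shared machines a named AWS CLI profile is usually safer. A minimal sketch (the profile name ml-pipeline is just an example); Terraform's AWS provider honors AWS_PROFILE:

# Alternative: configure a named profile instead of exporting raw keys
aws configure --profile ml-pipeline

# Terraform and the AWS CLI will use this profile
export AWS_PROFILE=ml-pipeline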

Your First Terraform Configuration

# main.tf - Create S3 bucket for ML data
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-west-2"
}

# S3 bucket for training data
resource "aws_s3_bucket" "ml_data" {
  bucket = "my-ml-data-bucket"
  
  tags = {
    Name        = "ML Training Data"
    Environment = "production"
    Project     = "ml-pipeline"
  }
}

# Enable versioning
resource "aws_s3_bucket_versioning" "ml_data_versioning" {
  bucket = aws_s3_bucket.ml_data.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Block public access
resource "aws_s3_bucket_public_access_block" "ml_data_public_access" {
  bucket = aws_s3_bucket.ml_data.id
  
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
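
If every resource should carry the same tags, the AWS provider's default_tags block can apply them globally instead of repeating a tags block per resource. A sketch reusing the values above:

# Provider-level tags are merged into every AWS resource's tags
provider "aws" {
  region = "us-west-2"

  default_tags {
    tags = {
      Project     = "ml-pipeline"
      Environment = "production"
    }
  }
}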

# Terraform workflow
terraform init      # Initialize providers
terraform plan      # Preview changes
terraform apply     # Apply changes
terraform destroy   # Destroy infrastructure
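
In CI it is common to save the plan to a file and apply exactly that file, so the changes that were reviewed are the changes that run. For example:

# Save a plan, inspect it, then apply exactly that plan
terraform plan -out=tfplan
terraform show tfplan
terraform apply tfplan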

Complete ML Infrastructure

# ml-infrastructure.tf
# Variables for reusability
variable "project_name" {
  description = "Project name"
  default     = "ml-pipeline"
}

variable "environment" {
  description = "Environment (dev/staging/prod)"
  default     = "production"
}

# S3 buckets for different stages
resource "aws_s3_bucket" "raw_data" {
  bucket = "${var.project_name}-raw-data-${var.environment}"
}

resource "aws_s3_bucket" "processed_data" {
  bucket = "${var.project_name}-processed-data-${var.environment}"
}

resource "aws_s3_bucket" "models" {
  bucket = "${var.project_name}-models-${var.environment}"
}

# IAM role for ML training job
resource "aws_iam_role" "ml_training_role" {
  name = "${var.project_name}-training-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "sagemaker.amazonaws.com"
      }
    }]
  })
}

# IAM policy for S3 access
resource "aws_iam_policy" "ml_s3_access" {
  name = "${var.project_name}-s3-access"
  
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        # s3:ListBucket applies to the bucket ARN; Get/PutObject apply to object ARNs
        Resource = [
          aws_s3_bucket.raw_data.arn,
          "${aws_s3_bucket.raw_data.arn}/*",
          aws_s3_bucket.processed_data.arn,
          "${aws_s3_bucket.processed_data.arn}/*",
          aws_s3_bucket.models.arn,
          "${aws_s3_bucket.models.arn}/*"
        ]
      }
    ]
  })
}

# Attach policy to role
resource "aws_iam_role_policy_attachment" "ml_s3_attach" {
  role       = aws_iam_role.ml_training_role.name
  policy_arn = aws_iam_policy.ml_s3_access.arn
}

# EC2 instance for training
resource "aws_instance" "ml_training" {
  ami           = "ami-0c55b159cbfafe1f0"  # Deep Learning AMI
  instance_type = "p3.2xlarge"  # GPU instance
  
  iam_instance_profile = aws_iam_instance_profile.ml_training_profile.name
  
  root_block_device {
    volume_size = 100  # GB
    volume_type = "gp3"
  }
  
  tags = {
    Name = "${var.project_name}-training-${var.environment}"
  }
  
  user_data = <<-EOF
              #!/bin/bash
              pip install -r /app/requirements.txt
              python /app/train.py
              EOF
}

# IAM instance profile
resource "aws_iam_instance_profile" "ml_training_profile" {
  name = "${var.project_name}-training-profile"
  role = aws_iam_role.ml_training_role.name
}

# RDS database for metadata
resource "aws_db_instance" "ml_metadata" {
  identifier           = "${var.project_name}-metadata"
  engine              = "postgres"
  engine_version      = "15.3"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  
  db_name  = "mlmetadata"
  username = "admin"
  password = var.db_password  # From variable
  
  skip_final_snapshot = true
  
  tags = {
    Name = "${var.project_name}-metadata-${var.environment}"
  }
}

# Outputs
output "raw_data_bucket" {
  value = aws_s3_bucket.raw_data.bucket
}

output "models_bucket" {
  value = aws_s3_bucket.models.bucket
}

output "training_instance_ip" {
  value = aws_instance.ml_training.public_ip
}

output "database_endpoint" {
  value = aws_db_instance.ml_metadata.endpoint
}
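
The db_password variable declared above has no default, so supply it at plan/apply time. One option is Terraform's TF_VAR_ environment variable prefix, which avoids writing the password to a .tfvars file:

# Terraform maps TF_VAR_<name> environment variables to input variables
export TF_VAR_db_password='a-strong-password'   # ideally pulled from a secrets manager
terraform plan
terraform apply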

Multi-Environment with Workspaces

# Create workspaces for different environments
terraform workspace new dev
terraform workspace new staging
terraform workspace new production

# Switch workspace
terraform workspace select production

# Apply to specific environment
terraform apply -var="environment=production"

# Each workspace has separate state!
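
A common pattern is to derive names from the built-in terraform.workspace value so each workspace provisions its own copy of every resource. A minimal sketch:

# Use the current workspace name as the environment
locals {
  environment = terraform.workspace
}

resource "aws_s3_bucket" "models" {
  bucket = "ml-pipeline-models-${local.environment}"
}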

Modules for Reusability

# modules/ml-storage/main.tf
variable "project_name" {}
variable "environment" {}

resource "aws_s3_bucket" "data" {
  bucket = "${var.project_name}-${var.environment}"
}

resource "aws_s3_bucket_versioning" "data_versioning" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

output "bucket_name" {
  value = aws_s3_bucket.data.bucket
}

output "bucket_arn" {
  value = aws_s3_bucket.data.arn
}

# main.tf - Use module
module "raw_data_storage" {
  source = "./modules/ml-storage"
  
  project_name = "ml-pipeline"
  environment  = "production"
}

module "models_storage" {
  source = "./modules/ml-storage"
  
  project_name = "ml-models"
  environment  = "production"
}
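
Module outputs are read as module.<name>.<output>, which is how the root configuration or other resources consume them. For example:

# Consume module outputs in the root configuration
output "raw_data_bucket_name" {
  value = module.raw_data_storage.bucket_name
}

output "models_bucket_arn" {
  value = module.models_storage.bucket_arn
}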

โ˜๏ธ AWS CloudFormation

CloudFormation vs Terraform

| Aspect           | Terraform           | CloudFormation    |
|------------------|---------------------|-------------------|
| Cloud Support    | ✅ Multi-cloud      | AWS only          |
| Language         | HCL                 | YAML/JSON         |
| State Management | Requires state file | ✅ Managed by AWS |
| AWS Integration  | Good                | ✅ Native         |
| Community        | ✅ Larger           | AWS-focused       |

CloudFormation Template

# ml-infrastructure.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: ML Pipeline Infrastructure

Parameters:
  ProjectName:
    Type: String
    Default: ml-pipeline
    Description: Project name prefix
  
  Environment:
    Type: String
    Default: production
    AllowedValues:
      - dev
      - staging
      - production
    Description: Environment name

Resources:
  # S3 Bucket for training data
  TrainingDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub '${ProjectName}-data-${Environment}'
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - Key: Name
          Value: !Sub '${ProjectName}-data'
        - Key: Environment
          Value: !Ref Environment
  
  # S3 Bucket for models
  ModelsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub '${ProjectName}-models-${Environment}'
      VersioningConfiguration:
        Status: Enabled
  
  # IAM Role for SageMaker
  SageMakerExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${ProjectName}-sagemaker-role'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: sagemaker.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
      Policies:
        - PolicyName: S3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:PutObject
                  - s3:ListBucket
                Resource:
                  - !GetAtt TrainingDataBucket.Arn
                  - !Sub '${TrainingDataBucket.Arn}/*'
                  - !GetAtt ModelsBucket.Arn
                  - !Sub '${ModelsBucket.Arn}/*'
  
  # ECR Repository for Docker images
  MLModelRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: !Sub '${ProjectName}-models'
      ImageScanningConfiguration:
        ScanOnPush: true
      Tags:
        - Key: Environment
          Value: !Ref Environment
  
  # ECS Cluster for model serving
  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Sub '${ProjectName}-cluster-${Environment}'
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT
      DefaultCapacityProviderStrategy:
        - CapacityProvider: FARGATE
          Weight: 1
  
  # VPC for ML workloads
  MLVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub '${ProjectName}-vpc'

Outputs:
  DataBucketName:
    Description: Training data bucket
    Value: !Ref TrainingDataBucket
    Export:
      Name: !Sub '${AWS::StackName}-DataBucket'
  
  ModelsBucketName:
    Description: Models bucket
    Value: !Ref ModelsBucket
    Export:
      Name: !Sub '${AWS::StackName}-ModelsBucket'
  
  SageMakerRoleArn:
    Description: SageMaker execution role ARN
    Value: !GetAtt SageMakerExecutionRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-SageMakerRole'

# Deploy CloudFormation stack
aws cloudformation create-stack \
  --stack-name ml-infrastructure-prod \
  --template-body file://ml-infrastructure.yaml \
  --parameters ParameterKey=Environment,ParameterValue=production \
  --capabilities CAPABILITY_NAMED_IAM

# Update stack
aws cloudformation update-stack \
  --stack-name ml-infrastructure-prod \
  --template-body file://ml-infrastructure.yaml \
  --capabilities CAPABILITY_NAMED_IAM

# Delete stack
aws cloudformation delete-stack \
  --stack-name ml-infrastructure-prod

# View stack outputs
aws cloudformation describe-stacks \
  --stack-name ml-infrastructure-prod \
  --query 'Stacks[0].Outputs'
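
CloudFormation change sets play the same role as terraform plan: they preview what an update would do before you execute it. A sketch (the change set name is arbitrary):

# Preview an update with a change set
aws cloudformation create-change-set \
  --stack-name ml-infrastructure-prod \
  --change-set-name preview-update \
  --template-body file://ml-infrastructure.yaml \
  --capabilities CAPABILITY_NAMED_IAM

# Inspect the proposed changes
aws cloudformation describe-change-set \
  --stack-name ml-infrastructure-prod \
  --change-set-name preview-update

# Apply the change set
aws cloudformation execute-change-set \
  --stack-name ml-infrastructure-prod \
  --change-set-name preview-update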

๐Ÿ” Secrets Management with Vault

Why Vault?

Never hardcode secrets (API keys, passwords, credentials) in code or config files. HashiCorp Vault provides centralized secrets management with encryption, access control, and audit logs.

Install Vault

# Install Vault (macOS, via the official HashiCorp tap)
brew tap hashicorp/tap
brew install hashicorp/tap/vault

# Start Vault dev server (development only)
vault server -dev

# Point the CLI at the dev server and use the root token it prints on startup
export VAULT_ADDR='http://127.0.0.1:8200'
export VAULT_TOKEN='<root-token-from-dev-server-output>'

Store and Retrieve Secrets

# Write secrets
vault kv put secret/ml-pipeline/db \
  username=admin \
  password=secure-password-123

vault kv put secret/ml-pipeline/aws \
  access_key=AKIAIOSFODNN7EXAMPLE \
  secret_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# Read secrets
vault kv get secret/ml-pipeline/db

# Get specific field
vault kv get -field=password secret/ml-pipeline/db

# List secrets
vault kv list secret/ml-pipeline
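
Access control comes from Vault policies. A minimal sketch of a read-only policy for this project's secrets (note that KV v2 read paths include data/ and list paths use metadata/):

# Define a read-only policy for the project's secrets
vault policy write ml-pipeline-read - <<'EOF'
path "secret/data/ml-pipeline/*" {
  capabilities = ["read"]
}
path "secret/metadata/ml-pipeline/*" {
  capabilities = ["list"]
}
EOF

# Issue a token limited to that policy
vault token create -policy=ml-pipeline-read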

Use Vault in Python

"""
Access secrets from Vault in Python
"""
import hvac

# Initialize Vault client
client = hvac.Client(
    url='http://127.0.0.1:8200',
    token='your-vault-token'
)

# Read database credentials
db_secrets = client.secrets.kv.v2.read_secret_version(
    path='ml-pipeline/db'
)

db_username = db_secrets['data']['data']['username']
db_password = db_secrets['data']['data']['password']

# Use credentials
import psycopg2
conn = psycopg2.connect(
    host='ml-db.example.com',
    database='mlmetadata',
    user=db_username,
    password=db_password
)

# Read AWS credentials
aws_secrets = client.secrets.kv.v2.read_secret_version(
    path='ml-pipeline/aws'
)

import boto3
s3_client = boto3.client(
    's3',
    aws_access_key_id=aws_secrets['data']['data']['access_key'],
    aws_secret_access_key=aws_secrets['data']['data']['secret_key']
)
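
Rather than hardcoding the token as above, read the Vault address and token from the environment (or use a proper auth method). A small sketch:

"""
Initialize the Vault client from environment variables
"""
import os

import hvac

client = hvac.Client(
    url=os.environ.get("VAULT_ADDR", "http://127.0.0.1:8200"),
    token=os.environ["VAULT_TOKEN"],  # fails loudly if the token is missing
)
assert client.is_authenticated(), "Vault authentication failed"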

Vault with Terraform

# Configure Vault provider
provider "vault" {
  address = "http://127.0.0.1:8200"
  token   = var.vault_token
}

# Read secret from Vault
data "vault_generic_secret" "db_creds" {
  path = "secret/data/ml-pipeline/db"
}

# Use secret in resource
resource "aws_db_instance" "ml_metadata" {
  # ... other configuration
  
  username = data.vault_generic_secret.db_creds.data["username"]
  password = data.vault_generic_secret.db_creds.data["password"]
}

# Write secret to Vault
resource "vault_generic_secret" "api_key" {
  path = "secret/ml-pipeline/api"
  
  data_json = jsonencode({
    api_key = random_password.api_key.result
  })
}

resource "random_password" "api_key" {
  length  = 32
  special = true
}
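
The var.vault_token reference above assumes a matching declaration; marking it sensitive keeps the value out of plan output. A sketch:

variable "vault_token" {
  description = "Token used by the Vault provider"
  type        = string
  sensitive   = true
}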

AWS Secrets Manager Alternative

# Create secret in AWS Secrets Manager
aws secretsmanager create-secret \
  --name ml-pipeline/db-credentials \
  --secret-string '{"username":"admin","password":"secure-pass"}'

# Retrieve secret
aws secretsmanager get-secret-value \
  --secret-id ml-pipeline/db-credentials

# Access from Python
import boto3
import json

client = boto3.client('secretsmanager', region_name='us-west-2')

response = client.get_secret_value(SecretId='ml-pipeline/db-credentials')
secrets = json.loads(response['SecretString'])

db_username = secrets['username']
db_password = secrets['password']

✅ IaC Best Practices

📝

Version Control

Store IaC in Git. Every infrastructure change gets a code review and a commit history

🔒

Never Commit Secrets

Use .gitignore for terraform.tfstate and *.tfvars. Store secrets in Vault/Secrets Manager

📦

Remote State

Store Terraform state in S3 with DynamoDB locking for team collaboration

๐Ÿท๏ธ

Tagging

Tag all resources with project, environment, owner for cost tracking

🧪

Test Changes

Always run terraform plan before apply. Test in dev before production

📚

Use Modules

Create reusable modules for common patterns (ML storage, compute, networking)
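
To back the 'Never Commit Secrets' practice above, a typical .gitignore for a Terraform repository looks roughly like this:

# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
*.tfvars
crash.log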

Remote State Configuration

# backend.tf - Store state in S3
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "ml-pipeline/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
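
The state bucket and lock table must exist before terraform init can use this backend. One way to bootstrap them with the AWS CLI (names must match the backend block; Terraform expects the lock table's partition key to be LockID):

# Create the state bucket with versioning enabled
aws s3api create-bucket \
  --bucket my-terraform-state-bucket \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2
aws s3api put-bucket-versioning \
  --bucket my-terraform-state-bucket \
  --versioning-configuration Status=Enabled

# Create the DynamoDB lock table
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST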

โš ๏ธ Security Considerations:

  • Never hardcode secrets in Terraform files, and treat state files as sensitive (they can contain secrets in plaintext)
  • Enable state encryption (S3 bucket encryption)
  • Use IAM roles/policies for least-privilege access
  • Enable CloudTrail/audit logging for infrastructure changes
  • Regularly rotate secrets and credentials
  • Use separate AWS accounts for dev/staging/prod

🎯 Summary

You've mastered Infrastructure as Code for ML:

🌍

Terraform

Multi-cloud IaC with HCL, modules, and workspaces

โ˜๏ธ

CloudFormation

AWS-native infrastructure provisioning with YAML

🔐

Secrets Management

Vault and AWS Secrets Manager for secure credentials

♻️

Reproducibility

Create identical environments from code

📋

Version Control

Track all infrastructure changes in Git

🔄

Automation

Eliminate manual provisioning and configuration

Key Takeaways

  1. Infrastructure as Code makes ML systems reproducible and maintainable
  2. Terraform provides multi-cloud support with modules and workspaces
  3. CloudFormation offers native AWS integration with managed state
  4. Never hardcode secrets - use Vault or Secrets Manager
  5. Store Terraform state remotely with locking for team collaboration
  6. Tag all resources for cost tracking and management
  7. Test infrastructure changes in dev before production

🚀 Next Steps:

Your infrastructure is now code! Next, you'll learn model monitoring and observability - tracking model performance, detecting drift, and ensuring production models stay healthy.

Test Your Knowledge

Q1: What is the main benefit of Infrastructure as Code?

It's faster than manual provisioning
It creates reproducible, version-controlled infrastructure that can be automatically provisioned
It's required by AWS
It reduces cloud costs

Q2: What's the main advantage of Terraform over CloudFormation?

Multi-cloud support - works with AWS, Google Cloud, Azure, and many other providers
It's faster
Better AWS integration
Managed state without configuration

Q3: Where should you store sensitive credentials like database passwords?

In Terraform .tf files
In Git repository
In secrets management systems like Vault or AWS Secrets Manager
In environment variables in code

Q4: What is the purpose of Terraform state?

To store secrets
To run the infrastructure
To deploy applications
To track the current state of infrastructure and enable updates

Q5: What should you do before running 'terraform apply' in production?

Nothing, just apply
Run 'terraform plan' to preview changes and test in dev/staging first
Delete existing infrastructure
Backup your laptop