Network Isolation and NAT Instances

The GitLab Environment Toolkit supports network isolation for GitLab runners using NAT instances. This feature provides enhanced security by isolating runner traffic and enables cost optimization through the use of EC2-based NAT instances instead of managed NAT gateways.

Overview

Network isolation creates a dedicated network infrastructure for GitLab runners with the following components:

  • NAT Instances: EC2-based NAT instances that provide internet access for isolated runners
  • Route Switching: Lambda-based automation to switch between NAT instances and NAT gateways
  • Comprehensive Monitoring: CloudWatch alarms and Grafana dashboards for NAT instance health
  • Automated Health Monitoring: Lambda function to monitor NAT instance health and perform automatic failover

Benefits

  • Cost Optimization: NAT instances can be significantly cheaper than managed NAT gateways for moderate traffic volumes
  • Enhanced Security: Isolated network environment for runner workloads
  • High Availability: Automatic failover between NAT instances and NAT gateways
  • Comprehensive Monitoring: Real-time visibility into NAT instance performance and health
  • Flexible Routing: Ability to switch routing dynamically based on requirements

Required Settings

Terraform Configuration

To enable network isolation with NAT instances, add the following configuration:

# Enable network isolation for runners
runner_network_isolation = true

# Configure NAT instances
runner_nat_instance_type = "c7g.large"  # Instance type for NAT instances (ARM64-based Graviton)

Pre-provisioning Elastic IPs for Customer Whitelisting

When rolling out network isolation to production environments, you may need to provide the NAT instance IP addresses to customers in advance so they can whitelist them. The toolkit supports pre-provisioning Elastic IPs before enabling the full NAT instance infrastructure.

Step 1: Provision EIPs Only

# Pre-provision Elastic IPs without creating NAT instances
runner_nat_eips_only = true

Apply Terraform to create only the Elastic IPs:

aws-sso terraform apply

Step 2: Retrieve IP Addresses

Get the provisioned IP addresses to share with customers:

aws-sso terraform output runner_nat_eips

This outputs:

{
  "allocation_ids" = [
    "eipalloc-xxxxx",
    "eipalloc-yyyyy", 
    "eipalloc-zzzzz",
  ]
  "public_ips" = [
    "52.xx.xx.xx",
    "54.xx.xx.xx",
    "18.xx.xx.xx",
  ]
}

Share these public_ips with your customers for whitelisting.
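When sharing the IPs with customers (for example in a ticket), it can help to pull them out of the Terraform output in a machine-readable way. A minimal sketch, assuming the `runner_nat_eips` output has the JSON shape shown above (the function name here is illustrative, not part of the toolkit):

```python
import json


def get_nat_public_ips(raw_json: str) -> list[str]:
    """Parse the output of `aws-sso terraform output -json runner_nat_eips`
    and return the list of public IPs to share with customers."""
    data = json.loads(raw_json)
    return list(data["public_ips"])


# Example usage:
#   raw = subprocess.check_output(
#       ["aws-sso", "terraform", "output", "-json", "runner_nat_eips"])
#   print("\n".join(get_nat_public_ips(raw)))
```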

Step 3: Enable Full Network Isolation

Once customers have confirmed whitelisting is complete, enable the full feature:

# Remove runner_nat_eips_only (or set to false)
# runner_nat_eips_only = false

# Enable network isolation - NAT instances will use the pre-provisioned EIPs
runner_network_isolation = true
runner_nat_instance_type = "c7g.large"

Apply Terraform to create the NAT instances with the existing EIPs:

aws-sso terraform apply

Variable Validation

You cannot set both runner_nat_eips_only = true and runner_network_isolation = true at the same time. Terraform will return a validation error. This ensures a clear two-phase rollout process.

EIP Stability

The Elastic IPs remain stable across NAT instance replacements (maintenance, upgrades, failures). Customers only need to whitelist these IPs once.

Subnet Configuration

When network isolation is enabled, runner fleet instances are automatically placed in isolated private subnets created by the network isolation feature.

Important Constraint: The autoscaling_fleet_default_subnet_ids variable cannot be used when runner_network_isolation = true. Terraform will enforce this with a validation error.

# This will cause a validation error when runner_network_isolation = true
# autoscaling_fleet_default_subnet_ids = ["subnet-xxx", "subnet-yyy"]  # ❌ Not allowed

# Instead, you can specify subnets per-runner using fleet_subnet_ids
autoscaling_runner_hive_list = [
  {
    name                = "isolated-runners"
    instance_type       = "t3a.small"
    fleet_instance_type = "c5a.large"
    fleet_subnet_ids    = ["subnet-xxx", "subnet-yyy"]  # ✅ Allowed per-runner override
  }
]

Subnet Behavior with Network Isolation:

  • Default: Runners automatically use the isolated private subnets created by the network isolation feature
  • Per-runner override: Individual runners can specify fleet_subnet_ids to use custom subnets (not recommended)
  • Custom subnet requirements: Any custom subnets must be configured to route through the NAT instances for internet access
  • Subnet type: Custom subnets should be private subnets in the isolated network environment

Architecture

When network isolation is enabled, the following infrastructure is automatically created:

NAT Instances

  • EC2 instances configured as NAT devices in public subnets
  • One NAT instance per availability zone for high availability
  • Automatic IP forwarding and iptables configuration
  • CloudWatch Agent for detailed monitoring

Route Management

  • Lambda function for switching routes between NAT instances and NAT gateways
  • Supports both manual and automated route switching
  • Dry-run capability for testing route changes
  • Comprehensive logging and error handling
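At its core, switching a route table between a NAT instance and a NAT gateway comes down to an EC2 `ReplaceRoute` call on the default route, targeting either the instance's network interface or the gateway. A simplified, hypothetical sketch of that decision logic (function and parameter names are illustrative, not the toolkit's actual Lambda code):

```python
def build_route_update(action: str, route_table_id: str,
                       nat_instance_eni: str, nat_gateway_id: str,
                       dry_run: bool = False) -> dict:
    """Build kwargs for ec2.replace_route() on the default route (0.0.0.0/0),
    targeting a NAT instance ENI or a NAT gateway depending on the action."""
    kwargs = {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",
        "DryRun": dry_run,
    }
    if action == "switch_to_nat_instance":
        kwargs["NetworkInterfaceId"] = nat_instance_eni
    elif action == "switch_to_nat_gateway":
        kwargs["NatGatewayId"] = nat_gateway_id
    else:
        raise ValueError(f"unknown action: {action}")
    return kwargs


# In a Lambda handler this would then be passed to boto3:
#   boto3.client("ec2").replace_route(**build_route_update(...))
```

The `action` values mirror the payloads accepted by the toolkit's route switcher Lambda (`switch_to_nat_instance`, `switch_to_nat_gateway`), and `DryRun` corresponds to the dry-run capability described above.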

Monitoring Infrastructure

  • CloudWatch alarms for NAT instance health, CPU, network, and memory
  • Grafana dashboard with real-time NAT instance metrics
  • Custom CloudWatch metrics for route switching events
  • SNS notifications for critical alerts

Health Monitoring

  • Lambda function that monitors NAT instance health
  • Automatic failover to NAT gateways when NAT instances fail
  • Configurable health check intervals and thresholds
  • Integration with existing alerting infrastructure

Monitoring and Alerting

CloudWatch Alarms

The following alarms are automatically configured:

| Alarm | Threshold | Description |
|---|---|---|
| CPU Utilization | >80% | High CPU usage on NAT instance |
| Network In/Out | >80% of instance max | High network utilization |
| Status Check Failed | Any failure | Instance or system status check failure |
| Memory Utilization | >85% | High memory usage (requires CloudWatch Agent) |
| Route Switch Events | Any event | Informational alert for route changes |
| Route Switch Failures | Any failure | Critical alert for route switching failures |
| Lambda Errors | Any error | Route switcher Lambda function errors |
| Lambda Duration | >45 seconds | Route switcher taking too long |
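As an illustration of how one row of this table maps onto a CloudWatch alarm definition, the following sketch builds the kwargs a boto3 `put_metric_alarm()` call would take for the >80% CPU row. This is illustrative only; the toolkit actually provisions these alarms through Terraform, and the naming convention shown is an assumption:

```python
def cpu_alarm_kwargs(instance_id: str, prefix: str = "nat-instance") -> dict:
    """Build kwargs for cloudwatch.put_metric_alarm() matching the
    'CPU Utilization >80%' row of the alarm table."""
    return {
        "AlarmName": f"{prefix}-{instance_id}-cpu-high",  # naming is hypothetical
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # 5-minute datapoints
        "EvaluationPeriods": 2,     # two consecutive breaches before alarming
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```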

Grafana Dashboard

A comprehensive Grafana dashboard is automatically deployed with the following panels:

  • CPU Utilization: Real-time CPU usage across all NAT instances
  • Network Traffic: Inbound and outbound network traffic in bytes and packets
  • Status Monitoring: Color-coded table showing instance health status
  • Memory Utilization: Memory usage (requires CloudWatch Agent installation)

The dashboard automatically discovers all NAT instances and provides filtering capabilities.

To provision this dashboard, it must first be uploaded to the Grafana instance. You can do this by running the following command:

aws-sso ansible-playbook -i inventory glh.environment_toolkit.grafana --tags dashboards

Route Management

Manual Route Switching

Using Ansible Playbook

The easiest way to manage routes is using the provided Ansible playbook with extra variables:

# Get current route status
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher

# Switch to NAT instances (with dry-run)
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_instance -e dry_run_mode=true

# Switch to NAT instances (actual)
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_instance

# Switch to NAT gateways
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_gateway

Using AWS CLI Directly

You can also invoke the Lambda function directly using the AWS CLI:

# Get current route status
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "get_status"}' response.json

# Switch to NAT instances (with dry-run)
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "switch_to_nat_instance", "dry_run": true}' response.json

# Switch to NAT gateways
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "switch_to_nat_gateway"}' response.json

Automated Health Monitoring

The health monitoring Lambda function automatically:

  1. Checks NAT instance health every 1 minute (configurable via EventBridge rule)
  2. Performs connectivity tests to verify internet access
  3. Switches to NAT gateways if NAT instances fail health checks
  4. Switches back to NAT instances when they recover
  5. Sends notifications for all failover events
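The failover behavior in steps 3 and 4 can be thought of as a small state machine over consecutive health-check results. A hypothetical sketch (thresholds and names are illustrative; the actual Lambda's internals may differ):

```python
def decide_route(consecutive_failures: int, consecutive_successes: int,
                 currently_on_nat_instance: bool,
                 fail_threshold: int = 3, recover_threshold: int = 3) -> str:
    """Return the routing target ('nat_instance' or 'nat_gateway') the
    environment should use, given recent health-check history."""
    if currently_on_nat_instance and consecutive_failures >= fail_threshold:
        return "nat_gateway"   # fail over to managed NAT gateways
    if not currently_on_nat_instance and consecutive_successes >= recover_threshold:
        return "nat_instance"  # NAT instances recovered; switch back
    # Below either threshold: keep the current routing
    return "nat_instance" if currently_on_nat_instance else "nat_gateway"
```

Note that, as described under Known Issues below, the real metric-based implementation accumulates failure counts rather than tracking strictly consecutive results, which can cause surprising switches.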

Cost Considerations

NAT Instance vs NAT Gateway Cost Comparison

Note: NAT instances are deployed with 3 instances (one per AZ) for high availability.

| Component | NAT Instance (c7g.medium) | NAT Instance (c7g.large) | NAT Instance (c7gn.xlarge) | NAT Gateway |
|---|---|---|---|---|
| Hourly Cost (per instance) | ~$0.0363/hour | ~$0.0725/hour | ~$0.1450/hour | ~$0.045/hour |
| Total Hourly Cost (3 AZs) | ~$0.1089/hour | ~$0.2175/hour | ~$0.4350/hour | ~$0.135/hour |
| Network Performance | Up to 12.5 Gbps | Up to 12.5 Gbps | Up to 50 Gbps | Up to 45 Gbps |
| Data Processing | Free | Free | Free | $0.045/GB |
| Total Monthly Cost (24/7) | ~$78.39 | ~$156.60 | ~$313.20 | ~$97.20 + data costs |

Cost Examples with Traffic Volumes

Example 1: 50GB daily traffic (1.5TB/month)

| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $67.50 (1.5TB × $0.045) | $164.70 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |

Example 2: 333GB daily traffic (10TB/month)

| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $450.00 (10TB × $0.045) | $547.20 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7gn.xlarge × 3) | $313.20 | $0 | $313.20 |

Example 3: 6.7TB daily traffic (200TB/month)

| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $9,000.00 (200TB × $0.045) | $9,097.20 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7gn.xlarge × 3) | $313.20 | $0 | $313.20 |

Break-even Analysis:

  • c7g.medium (3 instances): Cheaper than NAT gateways at any traffic volume - its ~$78.39 monthly base cost is already below the gateways' ~$97.20 base cost
  • c7g.large (3 instances): Cost-effective for any traffic volume above ~44GB daily
  • c7gn.xlarge (3 instances): Cost-effective for any traffic volume above ~160GB daily

High traffic advantage:

  • At 10TB/month: Save ~$390-470/month vs NAT gateways
  • At 200TB/month: Save ~$8,784-9,019/month vs NAT gateways
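These break-even points fall out of a simple comparison of monthly base costs against the gateways' per-GB processing fee. A quick sketch using the figures from the tables above (3 AZs, $0.045/GB processing, 720 hours and 30 days per month):

```python
def monthly_base(hourly_per_instance: float, count: int = 3, hours: int = 720) -> float:
    """Monthly base cost for `count` instances/gateways running 24/7."""
    return hourly_per_instance * count * hours


def breakeven_daily_gb(instance_hourly: float, gateway_hourly: float = 0.045,
                       per_gb: float = 0.045) -> float:
    """Daily traffic (GB) above which 3 NAT instances beat 3 NAT gateways."""
    extra_base = monthly_base(instance_hourly) - monthly_base(gateway_hourly)
    if extra_base <= 0:
        return 0.0  # instances are cheaper even with zero traffic
    return extra_base / (per_gb * 30)


# breakeven_daily_gb(0.0725)  -> 44.0  (c7g.large)
# breakeven_daily_gb(0.1450)  -> 160.0 (c7gn.xlarge)
# breakeven_daily_gb(0.0363)  -> 0.0   (c7g.medium, always cheaper)
```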

Instance Type Recommendations by Traffic Volume:

Light Traffic (< 100GB daily / < 3TB monthly):

  • c7g.medium: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Most cost-effective for basic workloads
  • m7g.large: 2 vCPU, 8 GiB RAM, up to 12.5 Gbps - Better for memory-intensive NAT operations

Moderate Traffic (100GB - 1TB daily / 3TB - 30TB monthly):

  • c7g.large: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Balanced performance and cost
  • m7g.xlarge: 4 vCPU, 16 GiB RAM, up to 12.5 Gbps - Enhanced processing for complex routing
  • c8g.large: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Latest Graviton4 for optimal efficiency

High Traffic (1TB - 5TB daily / 30TB - 150TB monthly):

  • c7g.xlarge: 4 vCPU, 8 GiB RAM, up to 12.5 Gbps - More CPU for packet processing
  • c7gn.large: 2 vCPU, 4 GiB RAM, up to 25 Gbps - Enhanced networking capability
  • m8g.xlarge: 4 vCPU, 16 GiB RAM, up to 15 Gbps - Latest generation with improved performance

Very High Traffic (> 5TB daily / > 150TB monthly):

  • c7gn.xlarge: 4 vCPU, 8 GiB RAM, up to 50 Gbps - Maximum network performance
  • c7g.2xlarge: 8 vCPU, 16 GiB RAM, up to 15 Gbps - High CPU for intensive NAT operations
  • m8g.2xlarge: 8 vCPU, 32 GiB RAM, up to 15 Gbps - Maximum memory and processing power

Instance Family Characteristics:

  • c7g/c8g: Compute-optimized, best price-performance for NAT workloads
  • m7g/m8g: Memory-optimized, better for complex routing and connection tracking
  • c7gn: Network-optimized, enhanced networking up to 50 Gbps for high-throughput scenarios

Note: Costs are calculated for 3 availability zones with high availability deployment. Actual costs depend on instance type, data transfer volumes, and AWS region. NAT instances become significantly more cost-effective as traffic volumes increase due to no data processing fees.

Architecture Details

ARM64 Graviton Instances: NAT instances use ARM64-based AWS Graviton processors for optimal price-performance:

  • Supported Instance Types: c7g.*, m7g.*, c7gn.*, c8g.*, m8g.* families
  • AMI: Amazon Linux 2 ARM64 optimized for Graviton processors
  • Performance: Enhanced network performance and cost efficiency
  • Validation: Terraform enforces ARM64 Graviton instance types for optimal NAT performance

Cost Optimization Tips

  1. Choose appropriate instance type: Select instance type based on required network bandwidth. See AWS EC2 Instance Types for network performance specifications
  2. Monitor data transfer: Use CloudWatch metrics to track data usage
  3. Use Spot instances: Consider spot instances for non-critical environments
  4. Schedule instances: Stop NAT instances during off-hours if appropriate

Security Considerations

Network Security

  • NAT instances are deployed in public subnets with restrictive security groups
  • Only necessary ports are opened (HTTP/HTTPS outbound, SSH for management)
  • Source/destination checking is disabled for proper NAT functionality
  • All traffic is logged and monitored

Access Control

  • NAT instances use IAM roles with minimal required permissions
  • CloudWatch Agent permissions for metrics publishing
  • SSM access for remote management without SSH keys
  • Route management Lambda has restricted EC2 permissions

Monitoring and Compliance

  • All route changes are logged and alerted
  • Health check failures trigger immediate notifications
  • Comprehensive audit trail through CloudWatch Logs
  • Integration with existing security monitoring infrastructure

Known Issues

Health Check Metric Accumulation

When Terraform code is first applied or NAT instances are restarted, the instances may trigger multiple health check failures before becoming fully operational. These failures are counted in the CloudWatch metric used for route switching decisions.

Issue: The failure count metric does not automatically reset to zero after successful health checks, causing it to accumulate over time. This can lead to unexpected route switching behavior when the accumulated failure count reaches the threshold, even if recent health checks have been successful.

Workaround: Monitor the health check failure metrics and manually reset them if needed, or be aware that route switching may occur based on historical failures rather than current health status.

Impact: May cause unnecessary route switching from NAT instances to NAT gateways during initial deployment or after maintenance windows.

Auto Scaling Group Modification Issues

When disabling network isolation or making significant changes to the runner configuration, Auto Scaling Groups associated with the isolated runners may not be properly modified or deleted by Terraform.

Issue: Terraform may fail to update or destroy Auto Scaling Groups due to dependencies, running instances, or AWS API limitations. This can leave orphaned resources that continue to incur costs.

Workaround: Manually delete the Auto Scaling Groups through the AWS Console or CLI before running terraform destroy or when making configuration changes.

Impact: May result in orphaned AWS resources and unexpected costs if not manually cleaned up.

NAT Instance Failover Behavior

When a NAT instance in one availability zone fails, the health monitoring system switches all availability zone routes to use managed NAT gateways, not just the failed zone.

Issue: The failover mechanism operates at the environment level rather than per availability zone. If one NAT instance fails, the system switches routing for all AZs to NAT gateways instead of only switching the affected AZ.

Workaround: This is the current design behavior. Monitor individual AZ health and be aware that a single AZ failure will affect routing decisions for the entire environment.

Impact: May result in higher costs during failover periods as all traffic routes through managed NAT gateways instead of only the affected AZ.

Troubleshooting

Common Issues

Route Switching Failures

  1. Review Lambda function logs in CloudWatch
  2. Verify IAM permissions for route table modifications
  3. Check if route tables and NAT resources exist
  4. Ensure network interfaces are available and attached

High CPU or Network Utilization

  1. Consider upgrading to a larger instance type
  2. Review traffic patterns in CloudWatch metrics
  3. Check for unusual network activity or attacks
  4. Consider implementing traffic shaping or rate limiting

Diagnostic Commands

# Check NAT instance status
aws-sso aws ec2 describe-instances --filters "Name=tag:Purpose,Values=nat_instance"

# View route switcher logs
aws-sso aws logs describe-log-groups --log-group-name-prefix "/aws/lambda/route-switcher"

# Check current routes
aws-sso aws ec2 describe-route-tables --filters "Name=tag:Name,Values=*runner*"

Migration and Rollback

Enabling Network Isolation

Standard Enablement (No Pre-provisioning)

  1. Set runner_network_isolation = true in Terraform configuration
  2. Apply Terraform changes to create NAT infrastructure
  3. Existing fleet machines will start creating runners within the isolated subnets
  4. Monitor NAT instance performance and adjust instance types if needed
  5. Execute the Ansible playbook to update Grafana and upload the NAT dashboard

Production Enablement (With Customer Whitelisting)

For production environments where customers need to whitelist IPs in advance:

  1. Set runner_nat_eips_only = true in Terraform configuration
  2. Apply Terraform changes to create only the Elastic IPs
  3. Retrieve the IPs with aws-sso terraform output runner_nat_eips
  4. Share the IP addresses with customers for whitelisting
  5. Wait for customer confirmation that whitelisting is complete
  6. Remove runner_nat_eips_only and set runner_network_isolation = true
  7. Apply Terraform changes to create NAT instances (they will use the existing EIPs)
  8. Execute the Ansible playbook to update Grafana and upload the NAT dashboard

Disabling Network Isolation

  1. Set runner_network_isolation = false in Terraform configuration
  2. Apply Terraform changes to remove NAT infrastructure
  3. Runners will revert to using managed NAT gateways
  4. NAT instances and associated resources will be destroyed

Emergency Rollback

If NAT instances fail and automatic failover doesn't work:

  1. Manually invoke the route switcher Lambda to switch to NAT gateways
  2. Investigate NAT instance issues
  3. Consider temporarily disabling network isolation if issues persist

Integration with Existing Infrastructure

Network isolation integrates seamlessly with existing GitLab Environment Toolkit features:

  • Auto Scaling Groups: Runners automatically use isolated network configuration
  • Load Balancers: Application load balancers remain unaffected
  • Monitoring: NAT monitoring integrates with existing Prometheus/Grafana setup
  • Alerting: Uses existing SNS topics and Zulip notification channels

The feature is designed to be non-disruptive and can be enabled/disabled without affecting running workloads.