Network Isolation and NAT Instances¶
The GitLab Environment Toolkit supports network isolation for GitLab runners using NAT instances. This feature provides enhanced security by isolating runner traffic and enables cost optimization through the use of EC2-based NAT instances instead of managed NAT gateways.
Overview¶
Network isolation creates a dedicated network infrastructure for GitLab runners with the following components:
- NAT Instances: EC2-based NAT instances that provide internet access for isolated runners
- Route Switching: Lambda-based automation to switch between NAT instances and NAT gateways
- Comprehensive Monitoring: CloudWatch alarms and Grafana dashboards for NAT instance health
- Automated Health Monitoring: Lambda function to monitor NAT instance health and perform automatic failover
Benefits¶
- Cost Optimization: NAT instances can be significantly cheaper than managed NAT gateways for moderate traffic volumes
- Enhanced Security: Isolated network environment for runner workloads
- High Availability: Automatic failover between NAT instances and NAT gateways
- Comprehensive Monitoring: Real-time visibility into NAT instance performance and health
- Flexible Routing: Ability to switch routing dynamically based on requirements
Required Settings¶
Terraform Configuration¶
To enable network isolation with NAT instances, add the following configuration:
# Enable network isolation for runners
runner_network_isolation = true
# Configure NAT instances
runner_nat_instance_type = "c7g.large" # Instance type for NAT instances (ARM64-based Graviton)
Pre-provisioning Elastic IPs for Customer Whitelisting¶
When rolling out network isolation to production environments, you may need to provide the NAT instance IP addresses to customers in advance so they can whitelist them. The toolkit supports pre-provisioning Elastic IPs before enabling the full NAT instance infrastructure.
Step 1: Provision EIPs Only¶
# Pre-provision Elastic IPs without creating NAT instances
runner_nat_eips_only = true
Apply terraform to create only the Elastic IPs:
aws-sso terraform apply
Step 2: Retrieve IP Addresses¶
Get the provisioned IP addresses to share with customers:
aws-sso terraform output runner_nat_eips
This outputs:
{
"allocation_ids" = [
"eipalloc-xxxxx",
"eipalloc-yyyyy",
"eipalloc-zzzzz",
]
"public_ips" = [
"52.xx.xx.xx",
"54.xx.xx.xx",
"18.xx.xx.xx",
]
}
Share these public_ips with your customers for whitelisting.
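If you script the hand-off to customers, the output can also be read in its JSON form (`terraform output -json runner_nat_eips`). A minimal sketch, assuming the output shape shown above:

```python
import json

def extract_public_ips(output_json: str) -> list[str]:
    """Parse the runner_nat_eips Terraform output (JSON form) and return the public IPs."""
    return json.loads(output_json)["public_ips"]

# Sample payload mirroring the documented output shape (IPs are placeholders)
sample = '{"allocation_ids": ["eipalloc-xxxxx"], "public_ips": ["52.0.0.1", "54.0.0.2", "18.0.0.3"]}'
print("\n".join(extract_public_ips(sample)))
```

In practice you would pipe `aws-sso terraform output -json runner_nat_eips` into this function instead of the sample string.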
Step 3: Enable Full Network Isolation¶
Once customers have confirmed whitelisting is complete, enable the full feature:
# Remove runner_nat_eips_only (or set to false)
# runner_nat_eips_only = false
# Enable network isolation - NAT instances will use the pre-provisioned EIPs
runner_network_isolation = true
runner_nat_instance_type = "c7g.large"
Apply terraform to create the NAT instances with the existing EIPs:
aws-sso terraform apply
Variable Validation: You cannot set both `runner_nat_eips_only = true` and `runner_network_isolation = true` at the same time; Terraform will return a validation error. This ensures a clear two-phase rollout process.
EIP Stability: The Elastic IPs remain stable across NAT instance replacements (maintenance, upgrades, failures), so customers only need to whitelist these IPs once.
Subnet Configuration¶
When network isolation is enabled, runner fleet instances are automatically placed in isolated private subnets created by the network isolation feature.
Important Constraint: The autoscaling_fleet_default_subnet_ids variable cannot be used when runner_network_isolation = true. Terraform will enforce this with a validation error.
# This will cause a validation error when runner_network_isolation = true
# autoscaling_fleet_default_subnet_ids = ["subnet-xxx", "subnet-yyy"] # ❌ Not allowed
# Instead, you can specify subnets per-runner using fleet_subnet_ids
autoscaling_runner_hive_list = [
{
name = "isolated-runners"
instance_type = "t3a.small"
fleet_instance_type = "c5a.large"
fleet_subnet_ids = ["subnet-xxx", "subnet-yyy"] # ✅ Allowed per-runner override
}
]
Subnet Behavior with Network Isolation:
- Default: Runners automatically use the isolated private subnets created by the network isolation feature
- Per-runner override: Individual runners can specify fleet_subnet_ids to use custom subnets (not recommended)
- Custom subnet requirements: Any custom subnets must be configured to route through the NAT instances for internet access
- Subnet type: Custom subnets should be private subnets in the isolated network environment
Architecture¶
When network isolation is enabled, the following infrastructure is automatically created:
NAT Instances¶
- EC2 instances configured as NAT devices in public subnets
- One NAT instance per availability zone for high availability
- Automatic IP forwarding and iptables configuration
- CloudWatch Agent for detailed monitoring
Route Management¶
- Lambda function for switching routes between NAT instances and NAT gateways
- Supports both manual and automated route switching
- Dry-run capability for testing route changes
- Comprehensive logging and error handling
Monitoring Infrastructure¶
- CloudWatch alarms for NAT instance health, CPU, network, and memory
- Grafana dashboard with real-time NAT instance metrics
- Custom CloudWatch metrics for route switching events
- SNS notifications for critical alerts
Health Monitoring¶
- Lambda function that monitors NAT instance health
- Automatic failover to NAT gateways when NAT instances fail
- Configurable health check intervals and thresholds
- Integration with existing alerting infrastructure
Monitoring and Alerting¶
CloudWatch Alarms¶
The following alarms are automatically configured:
| Alarm | Threshold | Description |
|---|---|---|
| CPU Utilization | >80% | High CPU usage on NAT instance |
| Network In/Out | >80% of instance max | High network utilization |
| Status Check Failed | Any failure | Instance or system status check failure |
| Memory Utilization | >85% | High memory usage (requires CloudWatch Agent) |
| Route Switch Events | Any event | Informational alert for route changes |
| Route Switch Failures | Any failure | Critical alert for route switching failures |
| Lambda Errors | Any error | Route switcher Lambda function errors |
| Lambda Duration | >45 seconds | Route switcher taking too long |
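The percentage thresholds in the table can be sanity-checked against raw metric values. A small illustrative sketch; the metric names below are hypothetical shorthand, not the actual CloudWatch metric names:

```python
# Hypothetical re-statement of the alarm thresholds from the table above,
# useful for sanity-checking metric values pulled from CloudWatch.
THRESHOLDS = {
    "cpu_utilization_pct": 80.0,      # >80% CPU
    "network_utilization_pct": 80.0,  # >80% of the instance's max bandwidth
    "memory_utilization_pct": 85.0,   # >85% memory (needs CloudWatch Agent)
}

def breached(metrics: dict) -> list[str]:
    """Return the names of metrics exceeding their alarm thresholds."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

print(breached({"cpu_utilization_pct": 92.5, "memory_utilization_pct": 60.0}))
```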
Grafana Dashboard¶
A comprehensive Grafana dashboard is automatically deployed with the following panels:
- CPU Utilization: Real-time CPU usage across all NAT instances
- Network Traffic: Inbound and outbound network traffic in bytes and packets
- Status Monitoring: Color-coded table showing instance health status
- Memory Utilization: Memory usage (requires CloudWatch Agent installation)
The dashboard automatically discovers all NAT instances and provides filtering capabilities.
To provision this dashboard, it must first be uploaded to the Grafana instance by running the following command:
aws-sso ansible-playbook -i inventory glh.environment_toolkit.grafana --tags dashboards
Route Management¶
Manual Route Switching¶
Using Ansible Playbook¶
The easiest way to manage routes is using the provided Ansible playbook with extra variables:
# Get current route status
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher
# Switch to NAT instances (with dry-run)
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_instance -e dry_run_mode=true
# Switch to NAT instances (actual)
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_instance
# Switch to NAT gateways
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_gateway
Using AWS CLI Directly¶
You can also invoke the Lambda function directly using the AWS CLI:
# Get current route status
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "get_status"}' response.json
# Switch to NAT instances (with dry-run)
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "switch_to_nat_instance", "dry_run": true}' response.json
# Switch to NAT gateways
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "switch_to_nat_gateway"}' response.json
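The invocation result lands in `response.json`. A small sketch for checking the outcome programmatically; the `statusCode` field is an assumption about the response shape, so adjust it to the payload your deployment actually returns:

```python
import json

def route_switch_succeeded(response_text: str) -> bool:
    """Return True when the route switcher response reports success.

    Assumes a Lambda-style response carrying a 'statusCode' field; this is an
    assumed shape, not a documented contract of the route switcher function.
    """
    response = json.loads(response_text)
    return response.get("statusCode") == 200

# In practice: route_switch_succeeded(open("response.json").read())
print(route_switch_succeeded('{"statusCode": 200, "body": "switched"}'))
```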
Automated Health Monitoring¶
The health monitoring Lambda function automatically:
- Checks NAT instance health every 1 minute (configurable via EventBridge rule)
- Performs connectivity tests to verify internet access
- Switches to NAT gateways if NAT instances fail health checks
- Switches back to NAT instances when they recover
- Sends notifications for all failover events
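The failover behavior above can be sketched as a simple state decision. The threshold values here are illustrative only, not the deployed defaults:

```python
# Hedged sketch of the failover decision described above: enough consecutive
# failed health checks switch routing to NAT gateways, and a run of successes
# switches back. Threshold values are illustrative, not the Lambda's actual config.
FAILURE_THRESHOLD = 3
RECOVERY_THRESHOLD = 2

def next_target(current: str, consecutive_failures: int, consecutive_successes: int) -> str:
    """Return the desired route target: 'nat_instance' or 'nat_gateway'."""
    if current == "nat_instance" and consecutive_failures >= FAILURE_THRESHOLD:
        return "nat_gateway"   # fail over to managed NAT gateways
    if current == "nat_gateway" and consecutive_successes >= RECOVERY_THRESHOLD:
        return "nat_instance"  # recover back to NAT instances
    return current

print(next_target("nat_instance", consecutive_failures=3, consecutive_successes=0))
```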
Cost Considerations¶
NAT Instance vs NAT Gateway Cost Comparison¶
Note: NAT instances are deployed with 3 instances (one per AZ) for high availability.
| Component | NAT Instance (c7g.medium) | NAT Instance (c7g.large) | NAT Instance (c7gn.xlarge) | NAT Gateway |
|---|---|---|---|---|
| Hourly Cost (per instance) | ~$0.0363/hour | ~$0.0725/hour | ~$0.1450/hour | ~$0.045/hour |
| Total Hourly Cost (3 AZs) | ~$0.1089/hour | ~$0.2175/hour | ~$0.4350/hour | ~$0.135/hour |
| Network Performance | Up to 12.5 Gbps | Up to 12.5 Gbps | Up to 50 Gbps | Up to 45 Gbps |
| Data Processing | Free | Free | Free | $0.045/GB |
| Total Monthly Cost (24/7) | ~$78.39 | ~$156.60 | ~$313.20 | ~$97.20 + data costs |
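The table's totals follow from hourly rate × 3 AZs × 720 hours, plus the per-GB data processing fee for gateways. A small sketch reproducing them:

```python
HOURS_PER_MONTH = 720          # 30-day month, as used in the table above
AZ_COUNT = 3
GATEWAY_DATA_RATE = 0.045      # USD per GB processed by a NAT gateway

def monthly_cost(hourly_per_unit: float, data_gb: float = 0.0,
                 data_rate: float = 0.0) -> float:
    """Total monthly cost across 3 AZs: base cost plus any data processing fees."""
    return round(hourly_per_unit * AZ_COUNT * HOURS_PER_MONTH + data_gb * data_rate, 2)

# NAT gateways with 10TB/month of processed data
print(monthly_cost(0.045, data_gb=10_000, data_rate=GATEWAY_DATA_RATE))
# c7g.large NAT instances (no data processing fee)
print(monthly_cost(0.0725))
```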
Cost Examples with Traffic Volumes¶
Example 1: 50GB daily traffic (1.5TB/month)
| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $67.50 (1.5TB × $0.045) | $164.70 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
Example 2: 333GB daily traffic (10TB/month)
| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $450.00 (10TB × $0.045) | $547.20 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7gn.xlarge × 3) | $313.20 | $0 | $313.20 |
Example 3: 6.7TB daily traffic (200TB/month)
| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $9,000.00 (200TB × $0.045) | $9,097.20 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7gn.xlarge × 3) | $313.20 | $0 | $313.20 |
Break-even Analysis:
- c7g.medium (3 instances): Cheaper than NAT gateways at any traffic volume (its ~$78.39/month base cost is already below the gateways' ~$97.20/month)
- c7g.large (3 instances): Cost-effective for any traffic volume above ~44GB daily
- c7gn.xlarge (3 instances): Cost-effective for any traffic volume above ~160GB daily
- High traffic advantage:
- At 10TB/month: Save $390-470/month vs NAT gateways
- At 200TB/month: Save $8,784-9,019/month vs NAT gateways
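The break-even points follow from dividing the extra base cost of the NAT instances by the gateways' per-GB fee. A small sketch:

```python
GATEWAY_BASE_MONTHLY = 97.20   # 3 NAT gateways, 720h month
DATA_RATE = 0.045              # USD per GB of NAT gateway data processing
DAYS_PER_MONTH = 30

def break_even_daily_gb(instance_monthly: float) -> float:
    """Daily traffic (GB) above which 3 NAT instances beat 3 NAT gateways.

    A negative result means the instances are cheaper even with zero traffic
    (as with c7g.medium).
    """
    extra_base = instance_monthly - GATEWAY_BASE_MONTHLY
    return round(extra_base / DATA_RATE / DAYS_PER_MONTH, 1)

print(break_even_daily_gb(156.60))  # c7g.large
print(break_even_daily_gb(313.20))  # c7gn.xlarge
```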
Instance Type Recommendations by Traffic Volume:
Light Traffic (< 100GB daily / < 3TB monthly):
- c7g.medium: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Most cost-effective for basic workloads
- m7g.large: 2 vCPU, 8 GiB RAM, up to 12.5 Gbps - Better for memory-intensive NAT operations
Moderate Traffic (100GB - 1TB daily / 3TB - 30TB monthly):
- c7g.large: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Balanced performance and cost
- m7g.xlarge: 4 vCPU, 16 GiB RAM, up to 12.5 Gbps - Enhanced processing for complex routing
- c8g.large: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Latest Graviton4 for optimal efficiency
High Traffic (1TB - 5TB daily / 30TB - 150TB monthly):
- c7g.xlarge: 4 vCPU, 8 GiB RAM, up to 12.5 Gbps - More CPU for packet processing
- c7gn.large: 2 vCPU, 4 GiB RAM, up to 25 Gbps - Enhanced networking capability
- m8g.xlarge: 4 vCPU, 16 GiB RAM, up to 15 Gbps - Latest generation with improved performance
Very High Traffic (> 5TB daily / > 150TB monthly):
- c7gn.xlarge: 4 vCPU, 8 GiB RAM, up to 50 Gbps - Maximum network performance
- c7g.2xlarge: 8 vCPU, 16 GiB RAM, up to 15 Gbps - High CPU for intensive NAT operations
- m8g.2xlarge: 8 vCPU, 32 GiB RAM, up to 15 Gbps - Maximum memory and processing power
Instance Family Characteristics:
- c7g/c8g: Compute-optimized, best price-performance for NAT workloads
- m7g/m8g: Memory-optimized, better for complex routing and connection tracking
- c7gn: Network-optimized, enhanced networking up to 50 Gbps for high-throughput scenarios
Note: Costs are calculated for 3 availability zones with high availability deployment. Actual costs depend on instance type, data transfer volumes, and AWS region. NAT instances become significantly more cost-effective as traffic volumes increase due to no data processing fees.
Architecture Details¶
ARM64 Graviton Instances: NAT instances use ARM64-based AWS Graviton processors for optimal price-performance:
- Supported Instance Types: c7g, m7g, c7gn, c8g, and m8g families
- AMI: Amazon Linux 2 ARM64 optimized for Graviton processors
- Performance: Enhanced network performance and cost efficiency
- Validation: Terraform enforces ARM64 Graviton instance types for optimal NAT performance
Cost Optimization Tips¶
- Choose appropriate instance type: Select instance type based on required network bandwidth. See AWS EC2 Instance Types for network performance specifications
- Monitor data transfer: Use CloudWatch metrics to track data usage
- Use Spot instances: Consider spot instances for non-critical environments
- Schedule instances: Stop NAT instances during off-hours if appropriate
Security Considerations¶
Network Security¶
- NAT instances are deployed in public subnets with restrictive security groups
- Only necessary ports are opened (HTTP/HTTPS outbound, SSH for management)
- Source/destination checking is disabled for proper NAT functionality
- All traffic is logged and monitored
Access Control¶
- NAT instances use IAM roles with minimal required permissions
- CloudWatch Agent permissions for metrics publishing
- SSM access for remote management without SSH keys
- Route management Lambda has restricted EC2 permissions
Monitoring and Compliance¶
- All route changes are logged and alerted
- Health check failures trigger immediate notifications
- Comprehensive audit trail through CloudWatch Logs
- Integration with existing security monitoring infrastructure
Known Issues¶
Health Check Metric Accumulation¶
When Terraform code is first applied or NAT instances are restarted, the instances may trigger multiple health check failures before becoming fully operational. These failures are counted in the CloudWatch metric used for route switching decisions.
Issue: The failure count metric does not automatically reset to zero after successful health checks, causing it to accumulate over time. This can lead to unexpected route switching behavior when the accumulated failure count reaches the threshold, even if recent health checks have been successful.
Workaround: Monitor the health check failure metrics and manually reset them if needed, or be aware that route switching may occur based on historical failures rather than current health status.
Impact: May cause unnecessary route switching from NAT instances to NAT gateways during initial deployment or after maintenance windows.
Auto Scaling Group Modification Issues¶
When disabling network isolation or making significant changes to the runner configuration, Auto Scaling Groups associated with the isolated runners may not be properly modified or deleted by Terraform.
Issue: Terraform may fail to update or destroy Auto Scaling Groups due to dependencies, running instances, or AWS API limitations. This can leave orphaned resources that continue to incur costs.
Workaround: Manually delete the Auto Scaling Groups through the AWS Console or CLI before running terraform destroy or when making configuration changes.
Impact: May result in orphaned AWS resources and unexpected costs if not manually cleaned up.
NAT Instance Failover Behavior¶
When a NAT instance in one availability zone fails, the health monitoring system switches all availability zone routes to use managed NAT gateways, not just the failed zone.
Issue: The failover mechanism operates at the environment level rather than per availability zone. If one NAT instance fails, the system switches routing for all AZs to NAT gateways instead of only switching the affected AZ.
Workaround: This is the current design behavior. Monitor individual AZ health and be aware that a single AZ failure will affect routing decisions for the entire environment.
Impact: May result in higher costs during failover periods as all traffic routes through managed NAT gateways instead of only the affected AZ.
Troubleshooting¶
Common Issues¶
Route Switching Failures¶
- Review Lambda function logs in CloudWatch
- Verify IAM permissions for route table modifications
- Check if route tables and NAT resources exist
- Ensure network interfaces are available and attached
High CPU or Network Utilization¶
- Consider upgrading to a larger instance type
- Review traffic patterns in CloudWatch metrics
- Check for unusual network activity or attacks
- Consider implementing traffic shaping or rate limiting
Diagnostic Commands¶
# Check NAT instance status
aws-sso aws ec2 describe-instances --filters "Name=tag:Purpose,Values=nat_instance"
# View route switcher logs
aws-sso aws logs describe-log-groups --log-group-name-prefix "/aws/lambda/route-switcher"
# Check current routes
aws-sso aws ec2 describe-route-tables --filters "Name=tag:Name,Values=*runner*"
Migration and Rollback¶
Enabling Network Isolation¶
Standard Enablement (No Pre-provisioning)¶
- Set `runner_network_isolation = true` in the Terraform configuration
- Apply Terraform changes to create the NAT infrastructure
- Existing fleet machines will start creating runners within the isolated subnets
- Monitor NAT instance performance and adjust instance types if needed
- Execute the Ansible playbook to update Grafana and upload the NAT dashboard
Production Enablement (With Customer Whitelisting)¶
For production environments where customers need to whitelist IPs in advance:
- Set `runner_nat_eips_only = true` in the Terraform configuration
- Apply Terraform changes to create only the Elastic IPs
- Retrieve the IPs with `aws-sso terraform output runner_nat_eips`
- Share the IP addresses with customers for whitelisting
- Wait for customer confirmation that whitelisting is complete
- Remove `runner_nat_eips_only` and set `runner_network_isolation = true`
- Apply Terraform changes to create the NAT instances (they will use the existing EIPs)
- Execute the Ansible playbook to update Grafana and upload the NAT dashboard
Disabling Network Isolation¶
- Set `runner_network_isolation = false` in the Terraform configuration
- Apply Terraform changes to remove the NAT infrastructure
- Runners will revert to using managed NAT gateways
- NAT instances and associated resources will be destroyed
Emergency Rollback¶
If NAT instances fail and automatic failover doesn't work:
- Manually invoke the route switcher Lambda to switch to NAT gateways
- Investigate NAT instance issues
- Consider temporarily disabling network isolation if issues persist
Integration with Existing Infrastructure¶
Network isolation integrates seamlessly with existing GitLab Environment Toolkit features:
- Auto Scaling Groups: Runners automatically use isolated network configuration
- Load Balancers: Application load balancers remain unaffected
- Monitoring: NAT monitoring integrates with existing Prometheus/Grafana setup
- Alerting: Uses existing SNS topics and Zulip notification channels
The feature is designed to be non-disruptive and can be enabled/disabled without affecting running workloads.