Network Isolation and NAT Instances¶
The GitLab Environment Toolkit supports network isolation for GitLab runners using NAT instances. This feature provides enhanced security by isolating runner traffic and enables cost optimization through the use of EC2-based NAT instances instead of managed NAT gateways.
Overview¶
Network isolation creates a dedicated network infrastructure for GitLab runners with the following components:
- NAT Instances: EC2-based NAT instances that provide internet access for isolated runners
- Route Switching: Lambda-based automation to switch between NAT instances and NAT gateways
- Comprehensive Monitoring: CloudWatch alarms and Grafana dashboards for NAT instance health
- Automated Health Monitoring: Lambda function to monitor NAT instance health and perform automatic failover
Benefits¶
- Cost Optimization: NAT instances can be significantly cheaper than managed NAT gateways for moderate traffic volumes
- Enhanced Security: Isolated network environment for runner workloads
- High Availability: Automatic failover between NAT instances and NAT gateways
- Comprehensive Monitoring: Real-time visibility into NAT instance performance and health
- Flexible Routing: Ability to switch routing dynamically based on requirements
Required Settings¶
Terraform Configuration¶
To enable network isolation with NAT instances, add the following configuration:
# Enable network isolation for runners
runner_network_isolation = true
# Configure NAT instances
runner_nat_instance_type = "c7g.large" # Instance type for NAT instances (ARM64-based Graviton)
Pre-provisioning Elastic IPs for Customer Whitelisting¶
When rolling out network isolation to production environments, you may need to provide the NAT instance IP addresses to customers in advance so they can whitelist them. The toolkit supports pre-provisioning Elastic IPs before enabling the full NAT instance infrastructure.
Step 1: Provision EIPs Only¶
# Pre-provision Elastic IPs without creating NAT instances
runner_nat_eips_only = true
Apply terraform to create only the Elastic IPs:
aws-sso terraform apply
Step 2: Retrieve IP Addresses¶
Get the provisioned IP addresses to share with customers:
aws-sso terraform output runner_nat_eips
This outputs:
{
"allocation_ids" = [
"eipalloc-xxxxx",
"eipalloc-yyyyy",
"eipalloc-zzzzz",
]
"public_ips" = [
"52.xx.xx.xx",
"54.xx.xx.xx",
"18.xx.xx.xx",
]
}
Share these public_ips with your customers for whitelisting.
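If you script the hand-off to customers, the output can also be read in its JSON form (`terraform output -json runner_nat_eips`). A minimal sketch, assuming the output shape shown above:

```python
import json

def extract_public_ips(output_json: str) -> list[str]:
    """Parse the runner_nat_eips Terraform output (JSON form) and return the public IPs."""
    return json.loads(output_json)["public_ips"]

# Sample payload mirroring the documented output shape (IPs are placeholders)
sample = '{"allocation_ids": ["eipalloc-xxxxx"], "public_ips": ["52.0.0.1", "54.0.0.2", "18.0.0.3"]}'
print("\n".join(extract_public_ips(sample)))
```

In practice you would pipe `aws-sso terraform output -json runner_nat_eips` into this function instead of the sample string.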
Step 3: Enable Full Network Isolation¶
Once customers have confirmed whitelisting is complete, enable the full feature:
# Remove runner_nat_eips_only (or set to false)
# runner_nat_eips_only = false
# Enable network isolation - NAT instances will use the pre-provisioned EIPs
runner_network_isolation = true
runner_nat_instance_type = "c7g.large"
Apply terraform to create the NAT instances with the existing EIPs:
aws-sso terraform apply
Variable Validation: You cannot set both `runner_nat_eips_only = true` and `runner_network_isolation = true` at the same time; Terraform will return a validation error. This ensures a clear two-phase rollout process.
EIP Stability: The Elastic IPs remain stable across NAT instance replacements (maintenance, upgrades, failures), so customers only need to whitelist these IPs once.
Subnet Configuration¶
When network isolation is enabled, runner fleet instances are automatically placed in isolated private subnets created by the network isolation feature.
Important Constraint: The autoscaling_fleet_default_subnet_ids variable cannot be used when runner_network_isolation = true. Terraform will enforce this with a validation error.
# This will cause a validation error when runner_network_isolation = true
# autoscaling_fleet_default_subnet_ids = ["subnet-xxx", "subnet-yyy"] # ❌ Not allowed
# Instead, you can specify subnets per-runner using fleet_subnet_ids
autoscaling_runner_hive_list = [
{
name = "isolated-runners"
instance_type = "t3a.small"
fleet_instance_type = "c5a.large"
fleet_subnet_ids = ["subnet-xxx", "subnet-yyy"] # ✅ Allowed per-runner override
}
]
Subnet Behavior with Network Isolation:
- Default: Runners automatically use the isolated private subnets created by the network isolation feature
- Per-runner override: Individual runners can specify fleet_subnet_ids to use custom subnets (not recommended)
- Custom subnet requirements: Any custom subnets must be configured to route through the NAT instances for internet access
- Subnet type: Custom subnets should be private subnets in the isolated network environment
Architecture¶
When network isolation is enabled, the following infrastructure is automatically created:
NAT Instances¶
- EC2 instances configured as NAT devices in public subnets
- One NAT instance per availability zone for high availability
- Automatic IP forwarding and iptables configuration
- CloudWatch Agent for detailed monitoring
Route Management¶
- Lambda function for switching routes between NAT instances and NAT gateways
- Supports both manual and automated route switching
- Dry-run capability for testing route changes
- Comprehensive logging and error handling
Monitoring Infrastructure¶
- CloudWatch alarms for NAT instance health, CPU, network, and memory
- Grafana dashboard with real-time NAT instance metrics
- Custom CloudWatch metrics for route switching events
- SNS notifications for critical alerts
Health Monitoring¶
- Lambda function that monitors NAT instance health
- Automatic failover to NAT gateways when NAT instances fail
- Configurable health check intervals and thresholds
- Integration with existing alerting infrastructure
Monitoring and Alerting¶
CloudWatch Alarms¶
The following alarms are automatically configured:
| Alarm | Threshold | Description |
|---|---|---|
| CPU Utilization | >80% | High CPU usage on NAT instance |
| Network In/Out | >80% of instance max | High network utilization |
| Status Check Failed | Any failure | Instance or system status check failure |
| Memory Utilization | >85% | High memory usage (requires CloudWatch Agent) |
| Route Switch Events | Any event | Informational alert for route changes |
| Route Switch Failures | Any failure | Critical alert for route switching failures |
| Lambda Errors | Any error | Route switcher Lambda function errors |
| Lambda Duration | >45 seconds | Route switcher taking too long |
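The percentage thresholds in the table can be sanity-checked against raw metric values. A small illustrative sketch; the metric names below are hypothetical shorthand, not the actual CloudWatch metric names:

```python
# Hypothetical re-statement of the alarm thresholds from the table above,
# useful for sanity-checking metric values pulled from CloudWatch.
THRESHOLDS = {
    "cpu_utilization_pct": 80.0,      # >80% CPU
    "network_utilization_pct": 80.0,  # >80% of the instance's max bandwidth
    "memory_utilization_pct": 85.0,   # >85% memory (needs CloudWatch Agent)
}

def breached(metrics: dict) -> list[str]:
    """Return the names of metrics exceeding their alarm thresholds."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

print(breached({"cpu_utilization_pct": 92.5, "memory_utilization_pct": 60.0}))
```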
Grafana Dashboard¶
A comprehensive Grafana dashboard is automatically deployed with the following panels:
- CPU Utilization: Real-time CPU usage across all NAT instances
- Network Traffic: Inbound and outbound network traffic in bytes and packets
- Status Monitoring: Color-coded table showing instance health status
- Memory Utilization: Memory usage (requires CloudWatch Agent installation)
The dashboard automatically discovers all NAT instances and provides filtering capabilities.
To provision this dashboard, it must first be uploaded to the Grafana instance by running the following command:
aws-sso ansible-playbook -i inventory glh.environment_toolkit.grafana --tags dashboards
Route Management¶
Manual Route Switching¶
Using Ansible Playbook¶
The easiest way to manage routes is using the provided Ansible playbook with extra variables:
# Get current route status
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher
# Switch to NAT instances (with dry-run)
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_instance -e dry_run_mode=true
# Switch to NAT instances (actual)
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_instance
# Switch to NAT gateways
ansible-playbook -i inventory glh.environment_toolkit.tools.route_switcher -e action=switch_to_nat_gateway
Using AWS CLI Directly¶
You can also invoke the Lambda function directly using the AWS CLI:
# Get current route status
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "get_status"}' response.json
# Switch to NAT instances (with dry-run)
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "switch_to_nat_instance", "dry_run": true}' response.json
# Switch to NAT gateways
aws-sso aws lambda invoke --function-name <SOLUTION-PREFIX>-route-switcher --cli-binary-format raw-in-base64-out --payload '{"action": "switch_to_nat_gateway"}' response.json
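The invocation result lands in `response.json`. A small sketch for checking the outcome programmatically; the `statusCode` field is an assumption about the response shape, so adjust it to the payload your deployment actually returns:

```python
import json

def route_switch_succeeded(response_text: str) -> bool:
    """Return True when the route switcher response reports success.

    Assumes a Lambda-style response carrying a 'statusCode' field; this is an
    assumed shape, not a documented contract of the route switcher function.
    """
    response = json.loads(response_text)
    return response.get("statusCode") == 200

# In practice: route_switch_succeeded(open("response.json").read())
print(route_switch_succeeded('{"statusCode": 200, "body": "switched"}'))
```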
Automated Health Monitoring¶
The health monitoring Lambda function automatically:
- Checks NAT instance health every 1 minute (configurable via EventBridge rule)
- Performs connectivity tests to verify internet access
- Switches to NAT gateways if NAT instances fail health checks
- Switches back to NAT instances when they recover
- Sends notifications for all failover events
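The failover behavior above can be sketched as a simple state decision. The threshold values here are illustrative only, not the deployed defaults:

```python
# Hedged sketch of the failover decision described above: enough consecutive
# failed health checks switch routing to NAT gateways, and a run of successes
# switches back. Threshold values are illustrative, not the Lambda's actual config.
FAILURE_THRESHOLD = 3
RECOVERY_THRESHOLD = 2

def next_target(current: str, consecutive_failures: int, consecutive_successes: int) -> str:
    """Return the desired route target: 'nat_instance' or 'nat_gateway'."""
    if current == "nat_instance" and consecutive_failures >= FAILURE_THRESHOLD:
        return "nat_gateway"   # fail over to managed NAT gateways
    if current == "nat_gateway" and consecutive_successes >= RECOVERY_THRESHOLD:
        return "nat_instance"  # recover back to NAT instances
    return current

print(next_target("nat_instance", consecutive_failures=3, consecutive_successes=0))
```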
Cost Considerations¶
NAT Instance vs NAT Gateway Cost Comparison¶
Note: NAT instances are deployed with 3 instances (one per AZ) for high availability.
| Component | NAT Instance (c7g.medium) | NAT Instance (c7g.large) | NAT Instance (c7gn.xlarge) | NAT Gateway |
|---|---|---|---|---|
| Hourly Cost (per instance) | ~$0.0363/hour | ~$0.0725/hour | ~$0.1450/hour | ~$0.045/hour |
| Total Hourly Cost (3 AZs) | ~$0.1089/hour | ~$0.2175/hour | ~$0.4350/hour | ~$0.135/hour |
| Network Performance | Up to 12.5 Gbps | Up to 12.5 Gbps | Up to 50 Gbps | Up to 45 Gbps |
| Data Processing | Free | Free | Free | $0.045/GB |
| Total Monthly Cost (24/7) | ~$78.39 | ~$156.60 | ~$313.20 | ~$97.20 + data costs |
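The table's totals follow from hourly rate × 3 AZs × 720 hours, plus the per-GB data processing fee for gateways. A small sketch reproducing them:

```python
HOURS_PER_MONTH = 720          # 30-day month, as used in the table above
AZ_COUNT = 3
GATEWAY_DATA_RATE = 0.045      # USD per GB processed by a NAT gateway

def monthly_cost(hourly_per_unit: float, data_gb: float = 0.0,
                 data_rate: float = 0.0) -> float:
    """Total monthly cost across 3 AZs: base cost plus any data processing fees."""
    return round(hourly_per_unit * AZ_COUNT * HOURS_PER_MONTH + data_gb * data_rate, 2)

# NAT gateways with 10TB/month of processed data
print(monthly_cost(0.045, data_gb=10_000, data_rate=GATEWAY_DATA_RATE))
# c7g.large NAT instances (no data processing fee)
print(monthly_cost(0.0725))
```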
Cost Examples with Traffic Volumes¶
Example 1: 50GB daily traffic (1.5TB/month)
| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $67.50 (1.5TB × $0.045) | $164.70 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
Example 2: 333GB daily traffic (10TB/month)
| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $450.00 (10TB × $0.045) | $547.20 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7gn.xlarge × 3) | $313.20 | $0 | $313.20 |
Example 3: 6.7TB daily traffic (200TB/month)
| Solution | Base Cost | Data Processing | Total Monthly Cost |
|---|---|---|---|
| NAT Gateway (3 AZs) | $97.20 | $9,000.00 (200TB × $0.045) | $9,097.20 |
| NAT Instance (c7g.medium × 3) | $78.39 | $0 | $78.39 |
| NAT Instance (c7g.large × 3) | $156.60 | $0 | $156.60 |
| NAT Instance (c7gn.xlarge × 3) | $313.20 | $0 | $313.20 |
Break-even Analysis:
- c7g.medium (3 instances): Cheaper than NAT gateways at any traffic volume (its ~$78.39/month base cost is already below the gateways' ~$97.20/month)
- c7g.large (3 instances): Cost-effective for any traffic volume above ~44GB daily
- c7gn.xlarge (3 instances): Cost-effective for any traffic volume above ~160GB daily
- High traffic advantage:
- At 10TB/month: Save $390-470/month vs NAT gateways
- At 200TB/month: Save $8,784-9,019/month vs NAT gateways
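The break-even points follow from dividing the extra base cost of the NAT instances by the gateways' per-GB fee. A small sketch:

```python
GATEWAY_BASE_MONTHLY = 97.20   # 3 NAT gateways, 720h month
DATA_RATE = 0.045              # USD per GB of NAT gateway data processing
DAYS_PER_MONTH = 30

def break_even_daily_gb(instance_monthly: float) -> float:
    """Daily traffic (GB) above which 3 NAT instances beat 3 NAT gateways.

    A negative result means the instances are cheaper even with zero traffic
    (as with c7g.medium).
    """
    extra_base = instance_monthly - GATEWAY_BASE_MONTHLY
    return round(extra_base / DATA_RATE / DAYS_PER_MONTH, 1)

print(break_even_daily_gb(156.60))  # c7g.large
print(break_even_daily_gb(313.20))  # c7gn.xlarge
```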
Instance Type Recommendations by Traffic Volume:
Light Traffic (< 100GB daily / < 3TB monthly):
- c7g.medium: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Most cost-effective for basic workloads
- m7g.large: 2 vCPU, 8 GiB RAM, up to 12.5 Gbps - Better for memory-intensive NAT operations
Moderate Traffic (100GB - 1TB daily / 3TB - 30TB monthly):
- c7g.large: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Balanced performance and cost
- m7g.xlarge: 4 vCPU, 16 GiB RAM, up to 12.5 Gbps - Enhanced processing for complex routing
- c8g.large: 2 vCPU, 4 GiB RAM, up to 12.5 Gbps - Latest Graviton4 for optimal efficiency
High Traffic (1TB - 5TB daily / 30TB - 150TB monthly):
- c7g.xlarge: 4 vCPU, 8 GiB RAM, up to 12.5 Gbps - More CPU for packet processing
- c7gn.large: 2 vCPU, 4 GiB RAM, up to 25 Gbps - Enhanced networking capability
- m8g.xlarge: 4 vCPU, 16 GiB RAM, up to 15 Gbps - Latest generation with improved performance
Very High Traffic (> 5TB daily / > 150TB monthly):
- c7gn.xlarge: 4 vCPU, 8 GiB RAM, up to 50 Gbps - Maximum network performance
- c7g.2xlarge: 8 vCPU, 16 GiB RAM, up to 15 Gbps - High CPU for intensive NAT operations
- m8g.2xlarge: 8 vCPU, 32 GiB RAM, up to 15 Gbps - Maximum memory and processing power
Instance Family Characteristics:
- c7g/c8g: Compute-optimized, best price-performance for NAT workloads
- m7g/m8g: Memory-optimized, better for complex routing and connection tracking
- c7gn: Network-optimized, enhanced networking up to 50 Gbps for high-throughput scenarios
Note: Costs are calculated for 3 availability zones with high availability deployment. Actual costs depend on instance type, data transfer volumes, and AWS region. NAT instances become significantly more cost-effective as traffic volumes increase due to no data processing fees.
Architecture Details¶
ARM64 Graviton Instances: NAT instances use ARM64-based AWS Graviton processors for optimal price-performance:
- Supported Instance Types: c7g, m7g, c7gn, c8g, and m8g families
- AMI: Amazon Linux 2 ARM64 optimized for Graviton processors
- Performance: Enhanced network performance and cost efficiency
- Validation: Terraform enforces ARM64 Graviton instance types for optimal NAT performance
Cost Optimization Tips¶
- Choose appropriate instance type: Select instance type based on required network bandwidth. See AWS EC2 Instance Types for network performance specifications
- Monitor data transfer: Use CloudWatch metrics to track data usage
- Use Spot instances: Consider spot instances for non-critical environments
- Schedule instances: Stop NAT instances during off-hours if appropriate
Security Considerations¶
Network Security¶
- NAT instances are deployed in public subnets with restrictive security groups
- Only necessary ports are opened (HTTP/HTTPS outbound, SSH for management)
- Source/destination checking is disabled for proper NAT functionality
- All traffic is logged and monitored
Access Control¶
- NAT instances use IAM roles with minimal required permissions
- CloudWatch Agent permissions for metrics publishing
- SSM access for remote management without SSH keys
- Route management Lambda has restricted EC2 permissions
Monitoring and Compliance¶
- All route changes are logged and alerted
- Health check failures trigger immediate notifications
- Comprehensive audit trail through CloudWatch Logs
- Integration with existing security monitoring infrastructure
Known Issues¶
Health Check Metric Accumulation¶
When Terraform code is first applied or NAT instances are restarted, the instances may trigger multiple health check failures before becoming fully operational. These failures are counted in the CloudWatch metric used for route switching decisions.
Issue: The failure count metric does not automatically reset to zero after successful health checks, causing it to accumulate over time. This can lead to unexpected route switching behavior when the accumulated failure count reaches the threshold, even if recent health checks have been successful.
Workaround: Monitor the health check failure metrics and manually reset them if needed, or be aware that route switching may occur based on historical failures rather than current health status.
Impact: May cause unnecessary route switching from NAT instances to NAT gateways during initial deployment or after maintenance windows.
Auto Scaling Group Modification Issues¶
When disabling network isolation or making significant changes to the runner configuration, Auto Scaling Groups associated with the isolated runners may not be properly modified or deleted by Terraform.
Issue: Terraform may fail to update or destroy Auto Scaling Groups due to dependencies, running instances, or AWS API limitations. This can leave orphaned resources that continue to incur costs.
Workaround: Manually delete the Auto Scaling Groups through the AWS Console or CLI before running terraform destroy or when making configuration changes.
Impact: May result in orphaned AWS resources and unexpected costs if not manually cleaned up.
NAT Instance Failover Behavior¶
When a NAT instance in one availability zone fails, the health monitoring system switches all availability zone routes to use managed NAT gateways, not just the failed zone.
Issue: The failover mechanism operates at the environment level rather than per availability zone. If one NAT instance fails, the system switches routing for all AZs to NAT gateways instead of only switching the affected AZ.
Workaround: This is the current design behavior. Monitor individual AZ health and be aware that a single AZ failure will affect routing decisions for the entire environment.
Impact: May result in higher costs during failover periods as all traffic routes through managed NAT gateways instead of only the affected AZ.
Troubleshooting¶
Common Issues¶
Route Switching Failures¶
- Review Lambda function logs in CloudWatch
- Verify IAM permissions for route table modifications
- Check if route tables and NAT resources exist
- Ensure network interfaces are available and attached
High CPU or Network Utilization¶
- Consider upgrading to a larger instance type
- Review traffic patterns in CloudWatch metrics
- Check for unusual network activity or attacks
- Consider implementing traffic shaping or rate limiting
Diagnostic Commands¶
# Check NAT instance status
aws-sso aws ec2 describe-instances --filters "Name=tag:Purpose,Values=nat_instance"
# View route switcher logs
aws-sso aws logs describe-log-groups --log-group-name-prefix "/aws/lambda/route-switcher"
# Check current routes
aws-sso aws ec2 describe-route-tables --filters "Name=tag:Name,Values=*runner*"
Migration and Rollback¶
Enabling Network Isolation¶
Standard Enablement (No Pre-provisioning)¶
- Set `runner_network_isolation = true` in the Terraform configuration
- Apply Terraform changes to create the NAT infrastructure
- Existing fleet machines will start creating runners within the isolated subnets
- Monitor NAT instance performance and adjust instance types if needed
- Execute the Ansible playbook to update Grafana and upload the NAT dashboard
Production Enablement (With Customer Whitelisting)¶
For production environments where customers need to whitelist IPs in advance:
- Set `runner_nat_eips_only = true` in the Terraform configuration
- Apply Terraform changes to create only the Elastic IPs
- Retrieve the IPs with `aws-sso terraform output runner_nat_eips`
- Share the IP addresses with customers for whitelisting
- Wait for customer confirmation that whitelisting is complete
- Remove `runner_nat_eips_only` and set `runner_network_isolation = true`
- Apply Terraform changes to create the NAT instances (they will use the existing EIPs)
- Execute the Ansible playbook to update Grafana and upload the NAT dashboard
Disabling Network Isolation¶
- Set `runner_network_isolation = false` in the Terraform configuration
- Apply Terraform changes to remove the NAT infrastructure
- Runners will revert to using managed NAT gateways
- NAT instances and associated resources will be destroyed
Emergency Rollback¶
If NAT instances fail and automatic failover doesn't work:
- Manually invoke the route switcher Lambda to switch to NAT gateways
- Investigate NAT instance issues
- Consider temporarily disabling network isolation if issues persist
Integration with Existing Infrastructure¶
Network isolation integrates seamlessly with existing GitLab Environment Toolkit features:
- Auto Scaling Groups: Runners automatically use isolated network configuration
- Load Balancers: Application load balancers remain unaffected
- Monitoring: NAT monitoring integrates with existing Prometheus/Grafana setup
- Alerting: Uses existing SNS topics and Zulip notification channels
The feature is designed to be non-disruptive and can be enabled/disabled without affecting running workloads.