Changelog¶

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]¶

Added¶

Added a playbook to run post-ARM migration.
Custom KMS key for object storage encryption on Scaleway.
Added a variable rds_postgres_blue_green_update to enable blue/green updates for postgres.
Added LogCLI for querying and exploring logs in Loki.

Changed¶

Fixed permissions on SSH host keys after restoration.
Replaced Scaleway noncurrent-version-expiration with native implementation.

Upgrade instructions¶

On prod, add variable rds_postgres_blue_green_update to environment.tf and set it to true.
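
A minimal sketch of this change, assuming your solution passes variables through a module block named gitlab_cluster in environment.tf. The block name is inferred from the module.gitlab_cluster references elsewhere in this changelog; your existing source and arguments stay as they are.

```
module "gitlab_cluster" {
  # ... existing source and arguments ...

  # Enable blue/green updates for the PostgreSQL RDS instance.
  rds_postgres_blue_green_update = true
}
```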

Fixed¶

Fixed CI deployments for solutions.

3.7.0 - 2026-06-09¶

Added¶

Added backup for GitLab secrets and SSH host keys on Scaleway.
Added a variable for the OpenSearch version.
Added Alloy HTTP listen address override to allow Prometheus scraping.
Made the S3 bucket retention settings configurable by using a map.
Added transfer secrets task in omnibus role.
Added an apt update step in omnibus role.
Added an apt update step in gitlab_runner role.

Changed¶

Applied the runner's SSH keep-alive timing fix.
Runner cache retention changed from 120 to 30 days.
Image changes on Scaleway instances are now ignored to prevent accidental deletion.
autoscaling_runner_hive_list instance_type is now t3a.small.
Default OpenSearch instance type is now or2.large.
Added kernel_module_hardening tool playbook to mitigate Copy Fail and Dirty Frag (esp4, esp6, rxrpc, algif_aead) Linux LPE vulnerabilities.
Added Alloy log shipping configuration, replacing the previous promtail-based setup.
Prometheus now scrapes Alloy metrics on the configured Alloy metrics port.
Changed default Promtail ports to default Alloy ports in security_common.tf.
Added a variable to override the default AMI.
Added an ARM Debian 13 AMI.
Changed the logic that selects which AMI is used.
Changed variables to prevent ansible_ deprecation warnings.
Improved readability of a GitLab Runner restart Ansible task.
Architecture is now chosen dynamically from Ansible facts instead of being hardcoded.

Fixed¶

Fixed GitLabSearchIndexingStuck alert not showing which instance is affected by adding by (instance) grouping.
Zero downtime update playbook is now fast and usable again.

Removed¶

Removed maintenance tunnel endpoints.
Ubuntu AMIs and the Debian 10 AMI.

Upgrade instructions¶

For dev and test, change the instance type to t4g for Loki and Prometheus.
For prod, change the instance type to m7g for Loki and Prometheus.
For all environments, add the variables loki_ami and prometheus_ami and set their value to module.gitlab_cluster.aws_ami_debian_13_arm.
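
As a sketch, the AMI part of these instructions could look like the fragment below. It assumes loki_ami and prometheus_ami are set in the same place as your other solution variables. The instance type variable names are not given here, so they are left out.

```
# Fragment only: place these next to your existing Loki and Prometheus settings.
loki_ami       = module.gitlab_cluster.aws_ami_debian_13_arm
prometheus_ami = module.gitlab_cluster.aws_ami_debian_13_arm
```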

Add this to ansible/ansible.cfg:

[defaults]
callbacks_enabled = community.general.log_plays

[callback_log_plays]
log_folder = /tmp/ansible-logs/

3.6.2 - 2026-05-14¶

Changed¶

Runner helper image is now pulled from Docker Hub by default, allowing it to be served through the registry.

3.6.1 - 2026-05-11¶

Changed¶

IP address for centralized Grafana changed.

3.6.0 - 2026-04-23¶

Added¶

Added advanced search indexing alert.
Debian 13 AMI data source in Terraform
Added Alloy log shipping configuration, replacing the previous promtail-based setup.
Enabled encryption for all Scaleway buckets.
Added healthchecks and changed logic for choosing hosts.

Changed¶

cloudwatch-alarm-metrics now runs as its own user.
Removed Sidekiq from Prometheus OutOfMemory alerts.
Changed day for full backup to every Friday.
Moved all GPG keys to common config.
Changed default AMI from Debian 11 to Debian 13.
Switched to ansible_facts[] syntax for all fact access.
Switched to ec2_tags for AWS tag lookups (required for amazon.aws >=11.x).
AWS EC2 inventory plugin now uses FQCN amazon.aws.aws_ec2.
PostgreSQL db parameter renamed to login_db (required for community.postgresql >=4.x).
Ansible host labels now use inventory_hostname instead of ansible_hostname.
Prometheus config changes now trigger reload instead of restart.
Ansible callback changed to ansible.builtin.default with result_format: yaml.
gitlab_node_type in gitlab.rb template now normalizes hyphens to underscores.
Upgraded the Terraform and Python versions in the Dockerfile.
Upgraded the Python runtime for AWS Lambda from 3.10 to 3.14.
Changed default Promtail ports to default Alloy ports in security_common.tf.

Fixed¶

Grafana endpoints are now configured on the correct Global Accelerator.
Fixed copying of GitLab registry key on fresh solutions.
Loki can now better handle large search queries without running out of memory.
Opensearch throttling alert has been improved to reduce noise.

Removed¶

Grafana is no longer included in solutions but replaced with a centralized dashboard.
Removed GCP reference.
Removed hardcoded cloud provider reference.

Upgrade instructions¶

Remove your solution's tunnel endpoint from AWS-infrastructure.
S3 backup storage class STANDARD_IA was never applied due to a case mismatch ('AWS' vs 'aws').
Replace stdout_callback = yaml with result_format = yaml in your ansible/ansible.cfg.

3.5.2 - 2026-03-13¶

Fixed¶

Fixed GitLab backup script after GitLab version update.

3.5.1 - 2026-03-12¶

Fixed¶

Replaced GitLab runner apt key.

3.5.0 - 2026-03-12¶

Added¶

Added troubleshooting steps and workarounds for VPN-related health check timeouts during provisioning.
Added auto_start_stop function to development template environment for cost saving.
Added a lambda that applies ElastiCache self-service updates and notifies through Zulip via SNS.
Implemented batched CloudWatch alarm updates in autostartstop lambda to prevent API throttling.
Added ssm parameter (lock_while_running) logic to prevent autostartstop lambda from executing while parameter is true.
Scaleway is included in the environment-template.
Cronjob to clear docker cache on dedicated runners.
Web Application Firewall support for Scaleway.
Container Registry metadata database support with container_registry_metadata_database_enable (default: false).
Ansible variable container_registry_maintenance_enable to configure registry maintenance.
Endpoints to centralize Grafana data.

Changed¶

Updated onboarding documentation with explicit 1Password (op signin) steps, YubiKey prerequisites, and gitlab_version configuration.
Replaced the static Sidekiq queue check with a percentage-based one.
Changed alertmanager logic to accommodate start/stop logic for alerts.
Added variable to costsaving and changed auto_start_stop lambda enable alerts logic to fire 15 minutes after EC2 instances are started.
Upgraded Loki to version 3.6.3.
Automated deployments now use administrator permissions to prevent permission issues.
Scaleway Terraform module is now published on the registry.
Increased memory limits for Sidekiq machines with a 1:4 CPU/RAM ratio.
Changed Gitaly-managed backups of Git repositories to incremental backups, with full backups on Sundays.

Fixed¶

GitLab Registry key is now copied to Sidekiq nodes, allowing cleanup policies to work.

Removed¶

Removed all the AWS bastion code.
Added GEO blocking to WAF with waf_geo_blocking_enabled (default: true) and waf_geo_blocked_countries variables.

Upgrade instructions¶

Centralized Grafana¶

We have introduced a new centralized Grafana environment. Make sure your solution is included in scaleway-infrastructure.

GEO Blocking¶

GEO blocking is now enabled by default in the WAF. If your solution already has custom WAF blocking implemented via waf_blocking.tf or similar:

Remove redundant GEO blocking resources from your solution's terraform code:
Remove the aws_wafv2_rule_group resource that contains the geo_match_statement
Remove the associated locals block with blocked_countries if it exists
Keep any IP-based blocking resources (aws_wafv2_ip_set, IP blocking rules) if needed
Update the rule group reference if you were using waf_custom_rule_group_pre_arn or waf_custom_rule_group_post_arn:
If your custom rule group only contained GEO blocking, remove the variable assignment entirely
If it contained both GEO and IP blocking, update the rule group to only contain IP blocking
To customise the blocked countries list, set waf_geo_blocked_countries in your solution's tfvars:
```
waf_geo_blocked_countries = ["IR", "KP", "RU"]  # Your custom list
```
To disable GEO blocking entirely, set:
```
waf_geo_blocking_enabled = false
```
Run terraform plan to verify the changes before applying.

3.4.2 - 2026-03-03¶

Added¶

Added optional CloudWatch memory metrics collection for autoscaling runner fleet instances via autoscaling_fleet_enable_cloudwatch_memory_metrics variable.
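
A sketch of enabling this, assuming the variable is a boolean set on the gitlab_cluster module block. Both the type and the placement are assumptions that the entry above does not confirm.

```
module "gitlab_cluster" {
  # ... existing source and arguments ...

  # Assumed to be a boolean toggle.
  autoscaling_fleet_enable_cloudwatch_memory_metrics = true
}
```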

Changed¶

Updated apt signing key for GitLab repository.

3.4.1 - 2026-02-10¶

Changed¶

Enabled autoscaling runners over internal networks on Scaleway for when fleeting plugin is fixed.

Fixed¶

Fixed an issue where NAT instances would be recreated after a second terraform apply.

Upgrade instructions¶

Increase retention period in projects from 30/31 to 35 in environment.tf.

3.4.0 - 2026-02-02¶

Added¶

Scaleway is now a supported provider.
Added a CloudWatch alert that fires when the average number of RDS connections exceeds 80% of max connections.

Changed¶

Renamed "monitor" to "prometheus" to prevent confusion.

Upgrade instructions¶

The rename from monitor to prometheus causes a lot of resources to be renewed; this does not cause any downtime.
Manually remove the "TCP:9009" listener from the internal load balancer; Terraform can't fix this.
Add opensearch_password to sensitive_vars.yml in your solution.

3.3.3 - 2026-01-15¶

Added¶

Added a variable to allow EIP creation for the NAT instances, without creating the NAT instances themselves.

Fixed¶

Fixed the determination logic for which subnet_ids the autoscaling runner should use.

Removed¶

Removed the object lock Terraform code for the backups. We cannot use this because GitLab sends a SHA-1 hash instead of MD5.

3.3.2 - 2026-01-08¶

Fixed¶

Fixed object storage lock for buckets with customized prefix.

3.3.1 - 2025-12-01¶

Fixed¶

NAT user data script now uses dynamic interface name detection for iptables rules instead of a static value (eth0).

Changed¶

Deprecated attribute warning regarding region.name has been resolved.

3.3.0 - 2025-11-28¶

Added¶

Incremental Logging for job logs is now enabled if previously disabled.
Add pages subdomain to internal ACM certificate.
Network isolation support for GitLab runners with NAT instances as cost-effective alternative to NAT gateways.
Comprehensive monitoring and alerting for NAT instances including CloudWatch alarms and Grafana dashboards.
Lambda-based route switching between NAT instances and NAT gateways with automated health monitoring.
Enhanced instance module with native support for NAT-specific configurations (associate_public_ip_address and source_dest_check).
Updated auto-scaling-group module with improved mixed instance policy support and instance requirements compatibility.
Enhanced bucket module with additional configuration options and improved lifecycle management.

Upgrade instructions¶

Make sure to upgrade all Terraform modules. These changes are highly dependent on the latest Terraform modules.
Be sure to run the Ansible playbook to update Grafana and upload the NAT dashboard.
The Lambda invoke commands from the docs rely on the latest version of the RECT tool, because the older version didn't allow JSON payloads to be sent.
Move ALB from public to private subnet.

Changed¶

Update aws-fleeting-plugin from 1.0.0 to 1.0.1.
Move ALB from public to private subnet.
Upgrade AWS terraform provider from 5.33 to 6.20.

Fixed¶

Fixed issue with creating cloudwatch alerts after updating instance count.

Upgrade instructions¶

Replacing the ALB causes minimal downtime, but open connections are reset during the change.
Update the required providers in your solution's main.yaml to the latest versions.

3.2.2 - 2025-09-26¶

Added¶

Added variables to customize Cost Anomaly notification thresholds.

Changed¶

Changed RDS and OpenSearch volume types to decrease costs.

Fixed¶

The Mimir alertmanager_url was not set correctly in the Mimir config causing missing notifications when alerts were triggered.
Replaced Grafana APT repository key with new one.

Upgrade instructions¶

Changing volume types makes the instances unavailable while the volumes are recreated, which can take a long time. You can (temporarily) disable this change by switching the volume types back to io1 in your solution.
To push the fixed alertmanager_url, run the monitor playbook (aws-sso ansible-playbook -i inventory glh.environment_toolkit.monitor).

3.2.1 - 2025-06-26¶

Added¶

Made Prometheus metrics endpoint publicly available with secure token-based authentication.
Added cost allocation tags for resources that we bill for when usage is above contractually agreed limits.

Changed¶

The instance module (since v1.0.3) now supports adding cost allocation tags to the EBS volumes of EC2 instances.

Fixed¶

Update VPC CIDR block to fix Client IP in GitLab.
Update backup retention to sensible defaults.

Upgrade instructions¶

Check if backup retention is set in your solution. See environment-template/terraform/backups.tf for an example.

3.2.0 - 2025-04-29¶

Added¶

Support for SSH over SSM. This allows native SSH with pipelining but does not require bastion hosts.
Documentation on dealing with GitLab database preseeding.
waf_custom_rule_group_pre_arn Terraform variable to inject a Solution-managed WAF rule group at the start of the WAF ruleset. waf_custom_rule_group_post_arn does the same but appends to the end of the ruleset instead.
Added support to assign static IPs to Network Load Balancer. Set network_load_balancer_static_ips to true to enable.
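
A minimal sketch of enabling static IPs, assuming the variable is set on the gitlab_cluster module block in your solution (the placement is an assumption; the variable name and value come from the entry above):

```
module "gitlab_cluster" {
  # ... existing source and arguments ...

  # Assign static IPs to the Network Load Balancer.
  network_load_balancer_static_ips = true
}
```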

Changed¶

Moved all S3 bucket code to our bucket module.
object_storage_force_destroy is now false by default.
Enabled SSH-over-SSM for new projects by default.

Removed¶

Migration path for 3.0.0 to 3.1.0 upgrade of Monitor nodes
Removed Terraform var network_load_balancer_subnet_ids, replace with network_load_balancer_subnet_mapping.

Upgrade instructions¶

Manually empty out the ${prefix}-monitor S3 bucket. Terraform will not be able to remove it otherwise.
If you use S3 replication, don't forget to also remove the replica bucket.
If you override object_storage_buckets, you must manually create moved blocks to move them to the new locations.
Use the examples in s3_move.tf as a guide. In a solution, you must prepend all IDs with module.gitlab_cluster..
Replace network_load_balancer_subnet_ids with network_load_balancer_subnet_mapping in your terraform files.
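
For the moved blocks mentioned above, a sketch of the general shape is shown below. Both resource addresses are hypothetical placeholders and do not come from s3_move.tf; take the real addresses from that file and keep only the module.gitlab_cluster. prefix from this example.

```
# Hypothetical addresses, for illustration only.
moved {
  from = module.gitlab_cluster.aws_s3_bucket.example_old
  to   = module.gitlab_cluster.module.example_bucket.aws_s3_bucket.this
}
```

Running terraform plan afterwards should show the bucket as moved rather than destroyed and recreated.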

3.1.3 - 2025-03-11¶

Changed¶

Bumped common-config dependency to 1.0.6

3.1.2 - 2025-02-25¶

Added¶

Support for customizing the Gitaly S3 upload part size. Defaults to 25 MiB (upstream default is 5 MiB).

Changed¶

Allow the autoscaling_runner and dedicated_runner types to receive patch updates for the instance module.
Set the cloudwatch module to be auto-updated as well.

3.1.1 - 2025-02-10¶

Removed¶

Observability-related code on port 50000. It was all moved to the federated Grafana infrastructure in a previous release.

3.1.0 - 2025-02-10¶

Added¶

Enabled EBS encryption by default and set the most logical key as the default to use.
IAM password policy configuration. Configurable via iam_password_expiry_days. Defaults to 90, disable with 0.
glh.environment_toolkit.health_check playbook to run GitLab health checks.
Support for multiple Loki nodes in automatic HA mode via Gossip/Memberlist protocol.
Support for Prometheus HA deployment using Grafana Mimir.
Monitor nodes now run Grafana Mimir to provide long-term Prometheus storage in S3.
The Thanos stack is turned off automatically (will be removed next version).
Prometheus is no longer running inside Docker.
Grafana role has been updated to reflect this as well, and will install additional dashboards.
Alertmanager is also replaced by Mimir.
In HA mode (automatically configured), etcd will be installed to provide leader election for Mimir.
Enabled RDS auto minor version upgrade by default.
Support for new (17.8+) version of Gitaly storage config on Rails nodes.

Changed¶

Loki will store data in the new v13 format, starting at 2025-04-01.
Updated Ansible dependency common config to 1.0.5.
CI secure files now uses consolidated object storage configuration. This was not supported before GitLab 17.0.

Fixed¶

Runner EC2 instances now respect the default KMS key variables when set.

Removed¶

Removed rds_postgres_backup_retention_period, replace with backup_retention_period in your solution.

Upgrade instructions¶

Your runner instances will need to be re-created when they are currently not using a KMS key. This will remove all your data. If you want to keep the data and prevent Ansible re-provisioning, either run the helper script in helpers/encrypt_existing_ebs_volumes.sh or manually perform these steps:
Turn off the EC2 instance.
Create an encrypted snapshot of the current volume.
Create a new encrypted EBS volume with the correct key, based off the snapshot.
Attach this new EBS volume to the EC2 instance and start it.
Remove the old volume and the snapshot.
If you are specifying a KMS key to use as default encryption (either EBS or globally), you must ensure that you grant the auto-scaling service role permissions to use this key if you have autoscaling runners. Alternatively, simply retrieve the ARN for the alias/aws/ebs key and set that using either variable autoscaling_fleet_default_disk_kms_key_arn or by using fleet_disk_kms_key_arn, which can be specified per runner.
If you are upgrading from 3.0.x, and 2025-04-01 has passed already, set Ansible variable loki_v13_start_date to an ISO-formatted date that is a few days in the future. This marks the date Loki starts to use the new v13 data format.
If you have grafana_install_omnibus_dashboards set, it's recommended to use the values in grafana_install_external_dashboards instead. This will future-proof your selection for new external dashboards.
Thanos and the current monitoring history will be retained (until GET 3.2), but will be inaccessible by default. To access the old history, run docker compose up -d in /root as root. Manually query against it using Grafana. The address is the internal IP of the monitor node, it listens on port 19090.
Remove rds_postgres_version from environment.tf.
Replace rds_postgres_backup_retention_period with backup_retention_period.
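
The last two steps can be sketched as follows, assuming these variables are set on the gitlab_cluster module block in environment.tf. The values shown are placeholders; keep the retention value you had before.

```
module "gitlab_cluster" {
  # ... existing source and arguments ...

  # Remove these lines:
  # rds_postgres_version                 = "..."
  # rds_postgres_backup_retention_period = 7

  # Replacement (7 is a placeholder value):
  backup_retention_period = 7
}
```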

3.0.6 - 2025-02-03¶

Added¶

primary_alb_arn output variable for usage in solution code.

Fixed¶

Issues with Gitaly server side backups not being configured properly

3.0.5 - 2025-02-03¶

Added¶

gitlab_rails_exceptions metric in Promtail to keep track of exception log.
Force-prefixed by Promtail with promtail_custom_ in Prometheus output.
Alerts when the rate on promtail_custom_gitlab_rails_exceptions over 1 minute is more than 0.5 for 5 minutes.

Fixed¶

docker_proxy_host not working for Kroki and Registry mirror roles.
GitLab Omnibus installation timeouts, resolved by adding higher lock_timeout values.
Zero downtime updates break primary Rails node because of DB changes after schema cache is filled.

Changed¶

Increased installation timeout for GitLab Runner as well, as a preventative measure.
Increased grpc limits for Loki because it was too low to retrieve large logfiles.

3.0.4 - 2025-01-16¶

Added¶

New variable on Autoscaling Runner definition: runner_block_cost. Defaults to 1.

Fixed¶

Duplicate zero downtime detection logic in CI causing invalid playbook names.
Add more OIDC permissions for Global Accelerator

3.0.3 - 2024-12-19¶

Added¶

Outputs for OIDC role ARNs so solutions can add custom permissions.

Changed¶

CI codebase adapted to automatically switch to SSM usage and provide more sane deployment defaults.

Fixed¶

gitlab_object_storage_prefix not used by gitlab_runner role.

3.0.2 - 2024-12-12¶

Fixed¶

docker_proxy_host value was not propagated properly when defined.

Changed¶

Updated Ansible dependency common config to 1.0.2

3.0.1 - 2024-12-03¶

Added¶

CI step to deploy the terraform module.

Fixed¶

Bastion SSH config template breaks when localhost is in hostvars.
Render SSH config playbook breaks when localhost is in hostvars.
Fixed the "create_admin_user" ansible tool.
Callback plugin generates errors when working on an included playbook.

3.0.0 - 2024-12-03¶

Added¶

Documentation about backup and restore procedures.
Full documentation rework, with mkdocs as the builder. Also deploys versioned releases to a separate repository.
Added shared GitLab CI templates.
Default configuration for AWS Backup mirroring to separate region.
Basic code to render Graphviz diagrams based inside Terraform, outputted to the solution folder.
Cost anomaly detection.
OpenSearch update notifications through SNS.
Setting to use S3 replication for backups.
Dependency on glh.common_config Ansible Collection.
A new set of playbooks was introduced to perform various routine update tasks:
all, still deploys the entire codebase with downtime
gitlab_update, only updates Omnibus and Runner versions, with downtime
zero_downtime_all, runs the entire codebase, per node with loadbalancer (de)registration.
zero_downtime_gitlab_update, only updates Omnibus and Runner, per node with loadbalancer (de)registration.
The rate of zero downtime updates can be controlled with the serial variable.

Changed¶

Pages enablement is now controlled with Terraform variable pages_enabled. Enabled by default.
Pages services and daemons are now started on the GitLab Rails nodes; separate EC2 Pages nodes are no longer supported.
Project now uses pyproject.toml instead of setup.py and requirements.txt.
Ansible namespace has been changed to glh.environment_toolkit.
The common, common_vars, pre_common, post_configure roles were all removed in favor of local defaults.
The zero_downtime role was removed in favor of the new zero_downtime_all and zero_downtime_gitlab_update plays.
The following roles were moved to a new omnibus role that encompasses everything related to gitlab-omnibus:
praefect, gitaly, gitlab_rails, sidekiq
Extending the rails config must now be done by adding paths on localhost to:
common_custom_config_file
gitlab_rails_common_custom_config_file, which is used on gitlab_rails and sidekiq nodes.
[gitlab_node_type]_custom_config_file, such as
- sidekiq_custom_config_file
- praefect_primary_custom_config_file
- gitlab_rails_custom_config_file
Adding per-solution Ansible code is now done using:
pre_roles and post_roles arrays, which include an Ansible FQCN role pre and post GET respectively.
pre_tasks_common, pre_tasks_[group_name], post_tasks_common and post_tasks_[group_name], which can be used to include custom Ansible task files. These are not playbooks, they use include_tasks.

Moved¶

Compacted instance IAM policy ARNs into single resource.
The following Ansible code was moved to glh.common_config:
openssh-server and Unix admin user installation
Debian tweaks and Ansible requirements installation
Dependencies for Ansible versions via pip
Installation of AWS tooling such as aws-ssm-agent
Basic system initialization such as ntp, unattended upgrade, IPv6 disablement, cloud-init config
The following components were moved to glh.get_extensions because they cluttered the core codebase:
rsyslog forwarder support
amazon-cloudwatch-agent support
The following Terraform (sub)modules were moved into their own repositories:
cloudwatch_alarms
instance
auto_scaling_group
security_group
vpc_peering
bucket (new)

Removed¶

Cleanup paths from 2.1.0 for resolvconf and sshd-git.
Disabled S3 malware protection due to insane pricing.
Dependency on gitlab.gitlab_environment_toolkit.
Removed Terraform support for:
Consul
ElastiCache separate databases
Redis EC2 instances
NFS instances
Haproxy instances
Existing network infrastructure
OpenSearch EC2 instances
PgBouncer EC2 instances
Postgres EC2 instances
Praefect Postgres EC2 instances
RDS Praefect separate database
GEO support
S3 bucket replication
Separate EC2 nodes for GitLab Pages
Removed Ansible support for:
All of the Terraform objects above
Single-rails-node deployment
non-Gitaly cluster solutions
GCP, Azure, on-premise deployments
Inclusion of *.rb config on Omnibus nodes.

Upgrade instructions¶

Update the reference to the included GitLab CI file in your .gitlab-ci.yml to point to this repository. For an example, please see environment-template/.gitlab-ci.yml.jinja.

Enable backup mirroring in your solution. See the environment-template for an example.

Change any references from gitlabhost.gitlab_environment_toolkit to glh.environment_toolkit, most notably in ansible/requirements.txt. The repository path has not changed. Remove all geerlingguy roles from your ansible/galaxy-requirements.yml as well; these are no longer required.

Make sure you run pip install with the -U flag, to ensure dependencies only pinned on major/minor versions are updated properly as well.

If you want to upgrade with less downtime, run Ansible first and Terraform second. Do not change the GitLab version in this process. This is only required if you are running GitLab Pages.

2.1.8 - 2024-11-19¶

Added¶

Setting to use S3 replication for backups.

Upgrade instructions¶

Enable backup mirroring in your solution. See the environment-template for an example.

2.1.7 - 2024-11-12¶

Added¶

VPC endpoints for Secrets Manager, SNS and SSM.

Changed¶

When Global Accelerator is enabled, ports 80 and 443 forward traffic directly to the Application Load Balancer.

Fixed¶

Made the CloudWatch alert for autoscaling groups less trigger-happy.

2.1.6 - 2024-10-22¶

Added¶

Added missing IAM permission.

2.1.5 - 2024-10-22¶

Changed¶

Disabled continuous backups by default due to insane cost increase.

2.1.4 - 2024-10-09¶

Fixed¶

Workaround for issues with Gitaly server side backup due to aws_s3_endpoint not defaulting to a sane value.

2.1.3 - 2024-09-26¶

Changed¶

Remove dennis users on all nodes.

2.1.2 - 2024-09-23¶

Added¶

Added feature flag to disable GuardDuty features.

2.1.1 - 2024-08-29¶

Added¶

Allow updating only /etc/gitlab/trusted-certs by using --tags update-trusted-certs.

Fixed¶

The test for determining if a runner fleet image should be ARM or x86 was not strict enough, causing wrong output.
Purge resolvconf as well as uninstalling it to remove all leftovers.
Stop and disable systemd-resolved since it makes our networking stack non-deterministic.
Restart our networking interfaces if resolvconf or systemd-resolved states change to generate /etc/resolv.conf.
AWS Guardduty scanning on S3 buckets was not able to be deployed when the aws/s3 default KMS key is used in S3.

2.1.0 - 2024-08-27¶

Added¶

GitLab Omnibus config override stanza for Praefect nodes.
Enable 'debug_addr' for Docker Registry on Rails nodes so we can scrape the Prometheus metrics from it.
AWS GuardDuty is enabled to provide malware protection.
Ansible Callback plugin to write playbook output to jsonlines formatted files.
Added CloudWatch alerts for:
GuardDuty findings
Unhealthy nodes in target groups
Anomalous nodes in target groups
RDS + read replica
ACM certificate expiration
ElastiCache Redis cluster
Opensearch cluster
EC2 instances
EC2 autoscaling groups
Custom GitLabHost MOTD.
Patches to ensure MOTD is printed on starting an AWS-SSM shell.
Solutions now set up DNSSEC signing for the glhc.nl domain. See gitlab_dns_zone_ds_record for the DS value to use.
New Grafana Dashboard with basic metrics about GitLab Runner and Fleeting plugin.
Support for specifying a custom KMS key ARN for usage with AWS Backup via backup_kms_key_arn.
Support for mirroring AWS Backups to a user-defined additional AWS Backup Vault via backup_mirror_vault_arn.
Allow for overriding the AWS Backup cron schedule via backup_cron_schedule.
We now run RepositoryArchiveCleanUpService on each Rails node every night to clean up temporary files.
Support for using Graviton/ARM64 types in autoscaling runner fleets. The orchestrator nodes still need to be x86.
Scoped final security group to a Load Balancer instead of entire VPC range.
New Grafana dashboard with Prometheus scraping status to replace /prometheus/ maintenance endpoint.
Added AlertManager to Grafana via provisioning so alerts can be shown and silenced via Grafana.
Output variable that generates copy-pastable content to put into the aws-infrastructure repository.
Exposing a new maintenance tunnel NLB to allow access to Grafana via aws-infrastructure.
Prometheus exporter that fetches all CloudWatch Alarms with status, and AlertManager rules to trigger after 12h.

Changed¶

Tasks that installed sshd-git on gitlab-rails nodes were changed to clean up the leftovers instead.
We no longer test if sjoerd or arnoud are removed, they should have no leftover data anywhere by now.
GitLab Omnibus is no longer installed on Bastion nodes and existing installs will be removed automatically.
The security group outputs for autoscaling_fleet_node, autoscaling_runner, dedicated_runner now end in _id.
Split AWS permissions into separate files.
We now use Copier for creating environment boilerplate.
The environment-template has been inlined into this main repository.
Update AWS Fleeting plugin for autoscaling runners to version 1.0.0.
Non-working panels removed from Grafana 'Server Performance' dashboard (NFS, HAProxy, PGBouncer).
All monitoring data in Prometheus is now labeled by gitlab_node_type.
Prometheus CPU load alert no longer triggers on runners, these have a new alert that triggers at 95% load over 30m.
gitlab_shell_ssh_port is now forced to be port 22 and no longer needs to be explicitly set in a solution.
The default AMI name filters for autoscaling fleet nodes have been changed in accordance with our new naming scheme.
Updated RDS CA Certificate to the new AWS default.
Grafana now accepts auto-login via auth.proxy options. This only works in combination with aws-infrastructure.

Fixed¶

Global Accelerator health checks now succeed.
You can provision 1 opensearch node by setting opensearch_service_multi_az to false and opensearch_service_node_count to 1.
Added CloudWatch alerts to cost saving and fixed IAM permissions.
Fixed configuring host name after initial provisioning.
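
The single-node OpenSearch setup mentioned above can be sketched as follows, assuming both variables are set on the gitlab_cluster module block in your solution (the placement is an assumption; the variable names and values come from the entry above):

```
module "gitlab_cluster" {
  # ... existing source and arguments ...

  # Provision a single OpenSearch node without multi-AZ.
  opensearch_service_multi_az   = false
  opensearch_service_node_count = 1
}
```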

Removed¶

Removed preparing_your_aws_account.md, no longer makes sense since we migrated to AWS SSO.
Maintenance script on rails node now runs in a systemd timer and uses a .d style layout to allow for overrides.
Prometheus Node Exporter is now disabled in Omnibus config and is replaced with the Debian version instead.
Removed the tools.nfs_cleanup playbook: we don't support NFS anymore.
Removed the tools.migrate_to_gitlab_sshd playbook.
Removed all GitLabHost code that supports NFS nodes where possible. Overrides to disable upstream NFS remain.
Removed bastion-related listeners, target groups and security groups if no bastion nodes are created.
Removed resolvconf package.

Fixed¶

SSH host keys from gitlab-sshd were not included in AWS Backups.

Upgrade instructions¶

If you haven't run the tools.nfs_cleanup playbook during the previous update, do that before upgrading.
You can remove gitlab_shell_ssh_port from inventory/vars.yml in each solution.

It's recommended to add the following to your solution's environment.tf:

output "infrastructure_config" {
    value = module.gitlab_cluster.infrastructure_config
}

2.0.7 - 2024-07-18¶

Fixed¶

Using aws_s3_endpoint doesn't actually work because of bugs in upstream.

Changed¶

Autoscaling runners now use a static SSH keypair to reduce CPU load on the orchestration nodes.
This also allows for manual SSH connections to a fleeting node which is useful for debugging.
gitlab-runner.service now also stops gracefully on autoscaling nodes (previously only on restart).
Lowered the fleeting check interval back to its previous value, since raising it didn't solve anything.

2.0.6 - 2024-07-16¶

Fixed¶

Traffic to S3 was not properly routed to the S3 VPC Endpoint when initiated from one of the private subnets.

2.0.5 - 2024-07-16¶

Fixed¶

Our pre_common role now depends on the common_vars role, since it uses variables defined there.
Fixed variable resolving in zero downtime update.

2.0.4 - 2024-07-11¶

Added¶

Enabled autoscaling group metrics.
Add OIDC permission to enable metrics collection on auto scaling groups.
Added keepalive configuration for connection between autoscaling runner and fleet nodes.

Changed¶

Updated SSL policy on load balancers and made it configurable.

2.0.3 - 2024-06-27¶

Fixed¶

Remove deprecated Sidekiq concurrency parameter forcefully to assist with simultaneous upgrade of GET 2.0 and GL 17.0.

2.0.2 - 2024-06-27¶

Changed¶

Cherry pick fixes from 1.6.7 release into 2.0.2.

Fixed¶

New-style secrets handling was not properly included, breaking gitlab-secrets.json on all nodes.

2.0.1 - 2024-06-24¶

Changed¶

Cherry pick fixes from 1.6.6 release into 2.0.1.

2.0.0 - 2024-06-21¶

Added¶

The proxy_download feature of GitLab object storage is now controllable via gitlab_object_storage_proxy_download.
Cleanup policy on the SSM S3 bucket that deletes all files after 1 day.
Logging of all requests blocked by AWS WAF to a CloudWatch log stream. Retention can be set via: waf_log_retention.
Allow assuming of OIDC role from within the account itself by enabling gitlab_oidc_debugging_enabled.
Terraform modules are published on GitLab.
Bastion nodes now have a simplistic HTTP daemon for usage with NLB health checks that tests if OpenSSH is running.
glh-postgres-gitlab and glh-postgres-praefect CLI commands added to rails nodes for easy DB access.

Changed¶

Modified grow_filesystems.yml: added growpart for /dev/xvda to handle resizing of the physical disk.
Upstream version of GET has been updated to 3.3.0.
Various variables have been moved from common to common_vars when they were used in multiple locations.
Some variables have been moved from common to their specific role when they were only used there.
The shared role has been split up into gitlab_runner_linux and grafana_apt_repo.
Variables related to GitLab runners have been moved from common to gitlab_runner_linux.
Dependency on geerlingguy.docker has been removed and Docker is now installed by our own docker role.
Variable glh_docker_repo_url was renamed to docker_repo_url and moved to the new docker role.
Variable docker_repo_host now controls the protocol and host to use, docker_repo_url adds the path as well.
Variable s3_endpoint was renamed to aws_s3_endpoint.
Variable sns_endpoint was renamed to aws_sns_endpoint.
Disabled IPv6 by default in user_data for new instances.
Removed default Prometheus federative servers. Can be enabled at solution level.
gitlab-sshd is now used by default to provide Git operations over SSH.

Fixed¶

AWS Permission for ssm:StartSession.
Allow creation of a singular OpenSearch node by setting opensearch_service_multi_az to false.
GitLab's GPG key is now also installed from Apt proxy when enabled.

Upgrade instructions¶

Upgrade all Terraform AWS provider version constraints in your solution to version = "~> 5.33".
Remove geerlingguy.docker from galaxy-requirements.yml.
Upgrade geerlingguy.node_exporter to 2.0.1 in galaxy-requirements.yml file.
Ensure you install all dependencies afterwards (terraform init -upgrade, pip and ansible-galaxy).
Rename all options called opensearch to opensearch_service in environment.tf.
For the forced migration to gitlab-sshd, the Git over SSH services will be unavailable during the deployment.
Ensure gitlab_shell_ssh_port is not set in your environment.tf.
Before running any other Ansible playbooks, run once: tools.migrate_to_gitlab_sshd. This is non-destructive.
If you want to upgrade with less downtime, set gitlab_shell_ssh_port = 22 in environment.tf.
- Run Terraform apply as normal after following the rest of the upgrade instructions.
- Remove this option after running Ansible completely and run Terraform again.
OpenSSH services will remain on some nodes for now (even when unused), and will be removed in 2.1.0.
Rename the following variables in your Ansible Inventory vars.yml:
external_pages_url is now pages_external_url
gitlab_pages_ssl_cert_file is now pages_ssl_cert_file
gitlab_pages_ssl_key_file is now pages_ssl_key_file
gitlab_pages_custom_config_file is now pages_custom_config_file
gitlab_pages_custom_files_paths is now pages_custom_files_paths
You should run gitlabhost.gitlab_environment_toolkit.tools.nfs_cleanup playbook once to clean up NFS leftovers.
If you override glh_docker_repo_url, rename it to docker_repo_host and remove the path (/linux) from the URL.
Replace the source in environment.tf with one of the following options:

# For production solutions
source = "git.glhd.nl/glh/gitlab-environment-toolkit/aws"
version = "2.0.0"

# For development solutions
source = "git::git@git.glhd.nl:glh/ha/gitlab-environment-toolkit.git//terraform/aws?ref=main"
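
The provider constraint and the optional lower-downtime setting from the steps above can be sketched as follows. This assumes the standard registry address hashicorp/aws for the AWS provider and that gitlab_shell_ssh_port is passed as a module argument; where the terraform block lives in your solution may differ.

terraform {
    required_providers {
        aws = {
            source  = "hashicorp/aws"
            version = "~> 5.33"
        }
    }
}

module "gitlab_cluster" {
    source  = "git.glhd.nl/glh/gitlab-environment-toolkit/aws"
    version = "2.0.0"

    # ... existing arguments stay unchanged ...

    # Optional and temporary: only set this for the lower-downtime upgrade path.
    # Remove it again after Ansible has run completely, then apply Terraform once more.
    gitlab_shell_ssh_port = 22
}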

1.6.7 - 2024-06-27¶

Added¶

Added additional parameters for fleeting configuration to autoscaling runner:
delete_instances_on_shutdown
update_interval
update_interval_when_expecting
Allow configuring AWS ALB idle timeout with application_load_balancer_idle_timeout.

Fixed¶

When using the multiple tokens option for dedicated runners, an empty token is set as well.
Override the AWS WAF rule that restricts URL size when the /-/kroki endpoint is targeted.
Names for CloudWatch metrics for WAF overrides were wrong/duplicated.

1.6.6 - 2024-06-24¶

Added¶

Add setting for vm.max_map_count to autoscaling runner userdata template. This allows ElasticSearch to run in jobs.

Fixed¶

Fixed some lingering references pointing the registry mirror to port 5000.
Auto scaling group template updates were not set as the new default template to use.

1.6.5 - 2024-06-20¶

Added¶

Allow specifying throughput on EBS volumes via the instance module.

Changed¶

Set some recommended additional settings on the Auto Scaling Group used for AutoScaling Runners.

1.6.4 - 2024-06-13¶

Added¶

Logging of all requests blocked by AWS WAF to a CloudWatch log stream. Retention can be set via: waf_log_retention.

Changed¶

GitLab Runner Fleeting Plugin for AWS updated to latest upstream: version 0.5.0.

Fixed¶

More required permissions were added to the OIDC role policy files.

1.6.3 - 2024-06-04¶

Fixed¶

Loki and Promtail versions were not pinned, and thus were not upgraded either.
GLH cluster runners can now access the health check endpoints.

1.6.2 - 2024-05-30¶

Fixed¶

AWS Permission for ssm:StartSession.

1.6.1 - 2024-05-30¶

Added¶

Feature flag for OIDC integration.

Fixed¶

Added missing AWS permissions.
Health check for GitLab Pages not working when custom domains is enabled, because the protocol was set incorrectly.
Loki, Registry Mirror, Monitor and Runner nodes were not explicitly given access to the S3 KMS key.

1.6.0 - 2024-05-21¶

Added¶

Added S3 bucket creation, so SSM will work when enabled.
Amazon SSM agent is now installed by Ansible as well. Can be disabled by setting install_amazon_ssm_agent: false.
Added a maintenance job to run apt clean on the servers daily.
Terraform variable runner_cache_object_retention_period to control how long objects stay in the shared runner cache. Defaults to 120.
OIDC configuration and IAM permissions for automated deployments.
Terraform variable network_load_balancer_security_group_ids to add additional security groups to the primary NLB.
Terraform variable default_security_group_ids to add additional security groups to all EC2 nodes.
Terraform variable gitlab_rails_security_group_ids to add additional security groups to the GitLab Rails nodes.
Terraform variable sidekiq_security_group_ids to add additional security groups to the Sidekiq nodes.
Simple playbook create_admin_user to create/update admin user in GitLab.
Docker image for automated deployment in solutions.
Allow installing files into /etc/gitlab/trusted-certs by listing them in trusted_certs from within solutions.
vpc_cidr_block is now set by default in ansible/inventory/terraform_vars.yml.
GitLab secrets and SSH host keys are stored in AWS Secrets Manager.
get_license_info playbook to print EE license information from the Rails console.
Solution documentation is now stored in GET's documentation as well.
create_access_token playbook to create a PAT for a given user in the database.
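
A sketch of how the new Terraform variables listed above might be set in a solution. The security group IDs are hypothetical, the list-of-strings type is inferred from the variable names, and the retention period is assumed to be in days.

module "gitlab_cluster" {
    # ... existing arguments stay unchanged ...

    # Hypothetical security group IDs, for illustration only.
    default_security_group_ids               = ["sg-0123456789abcdef0"]
    network_load_balancer_security_group_ids = ["sg-0123456789abcdef1"]
    gitlab_rails_security_group_ids          = ["sg-0123456789abcdef2"]
    sidekiq_security_group_ids               = ["sg-0123456789abcdef3"]

    # How long objects stay in the shared runner cache (default: 120).
    runner_cache_object_retention_period = 120
}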

Changed¶

Added user 'dennis' with matching key to the common_vars role
Removed EBS snapshot configuration for Gitaly volumes.
Terraform output variable s3_bucket_arns used to contain nested lists. These are now flattened.
Limit prefix length to 14 characters in the name of the target group for Prometheus.
Limit prefix length to 12 characters in the name of the target group for Registry Mirror.
It's now possible to run the transfer_secrets playbook before initial provisioning is done.
Moved listen port on internal NLB for registry mirror to port 443.
auto_start_stop lambda function was rewritten to allow usage of the code on localhost as well.
RDS backups are now stored in AWS Backups as well.
Moved the following Ansible playbooks to the tools namespace:
create_access_token
create_admin_user
get_license_info
grow_filesystems
post_data_migration
pre_data_migration
remove_swap
render_ssh_config
transfer_secrets

Fixed¶

Registry Mirror role not working when using docker_proxy_host.

Removed¶

Removed references to AWS secrets ini file.

Upgrade instructions¶

Run at least one successful Terraform run before running Ansible, because terraform_vars.yml was changed.
Set rds_postgres_backup_retention_period to 30 in your solution if you want to keep current RDS backups.
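
For example, keeping the current RDS backups could look like this in environment.tf, assuming the variable is a direct module argument:

module "gitlab_cluster" {
    # ... existing arguments stay unchanged ...

    # Keep RDS backups for 30 days, as recommended in the upgrade instructions.
    rds_postgres_backup_retention_period = 30
}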

1.5.1 - 2024-04-09¶

Added¶

Allow disabling WAF max body size limit by setting the value of waf_body_size_restriction to 0.
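
A minimal sketch of disabling the limit, assuming waf_body_size_restriction is a Terraform module argument:

module "gitlab_cluster" {
    # ... existing arguments stay unchanged ...

    # 0 disables the WAF max body size limit.
    waf_body_size_restriction = 0
}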

1.5.0 - 2024-04-09¶

Added¶

Support installing debs via a proxy for Debian, GitLab, Grafana and Docker.io.
Support pulling Docker image via a proxy for Monitor, Kroki, Registry Mirror.
Support disabling the :cleanup Raketasks with maintenance_rails_cleanup_enabled for usage during migrations.
Variable grafana_install_omnibus_dashboards to prevent installation of dashboards located on the internet.
Variable sns_endpoint for overriding the AWS SNS API URL in order to support VPC endpoints for SNS in solutions.
Output variable prometheus_alertmanager_topic_arn for getting the full ARN for the auto-created SNS topic.
New Ansible role common_vars. Currently contains maintenance SSH related variables.
Ansible generates user_data.sh for installing and configuring SSH access in the Terraform folder, which is used when creating new EC2 instances by Terraform. This is done in the render_tfvars role/playbook.
Re-implemented logic from glh-admin-access package in Ansible under common/tasks/ssh_access.yml.
Creates system users from list maintenance_users.
Grants sudo permissions to all users in list maintenance_users.
Installs SSH authorized_keys for each user from ssh_keys_<user> variables.
Configures SSHD with our default options and to listen on port maintenance_ssh_port.
Terraform output variable s3_bucket_arns which can be used in solutions to get a list of all S3 buckets in use.
Support for having Gitaly nodes create backups directly to a (new) S3 bucket.
Terraform variable noncurrent_version_retention_period to set how long noncurrent objects are kept. Default: 5.

Changed¶

Remove geerlingguy.node_exporter and install Debian's version instead.
Remove docker-compose from geerlingguy.docker on Monitor node, replaced with docker compose via Apt install.
Move sshd-git logic and files from common role to gitlab_rails role.
GitLab Rails backups are now created with REPOSITORIES_SERVER_SIDE=true and *_CONCURRENCY=6.
Renamed Terraform variable backup_retention_period_s3 to backup_retention_period.
Add the <prefix>-backups S3 bucket to AWS Backup plan.

Fixed¶

Harmonized and changed usage of geerlingguy.docker to minimize issues when using Aptproxy.
sshd-git is no longer installed and configured on all nodes in a cluster. Leftovers are auto-removed.
Registry Mirror value for $KANIKO_MIRROR_ARGS on GitLab Runners used to contain 'https://' but this is not allowed.

Removed¶

Removed support for Ansible variable glh_apt_repo_url, and the related installation code for that Apt repository.
Removed code for installing glh-admin-access package, both in Ansible and Terraform.
Removed support for glh_apt_repo_url in Terraform aws/instance module.
Removed support for migrating from the previous Ansible based admin access to the glh-admin-access package.

Upgrade instructions¶

If your solution has backup_retention_period_s3 defined, rename that to backup_retention_period.
Note: If you need to absolutely guarantee that current data is kept, define noncurrent_version_retention_period as well, and set it to the same value as backup_retention_period. You can remove/revert this after a few days.
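
The rename and the optional safeguard from the note above can be sketched like this. The value 30 is hypothetical; use whatever your solution previously set for backup_retention_period_s3.

module "gitlab_cluster" {
    # ... existing arguments stay unchanged ...

    # Formerly backup_retention_period_s3.
    backup_retention_period = 30

    # Optional and temporary: match the backup retention so current data is kept.
    # Can be removed or reverted after a few days (the default is 5).
    noncurrent_version_retention_period = 30
}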

1.4.5 - 2024-02-29¶

Fixed¶

Fixed UptimeRobot IP ranges.

1.4.4 - 2024-02-27¶

Fixed¶

Added missing IAM policy to store logs to CloudWatch.

1.4.3 - 2024-02-22¶

Added¶

Allowed UptimeRobot access to health check endpoint.

Fixed¶

Fixed node exporter on Bastion nodes.

1.4.2 - 2024-02-20¶

Added¶

Documentation for migrating a customer's existing GitLab instance to GET.
Helper playbooks for data-migration related tasks.
Helper playbook grow_filesystems to grow filesystems on enlarged disks to their new size.
Check to ensure the Ansible inventory actually contains hosts (prevents expired AWS credential errors).
Option to store EC2 system logs in AWS CloudWatch.

Changed¶

Clone omnibus-dashboards repository with depth: 1 to speed up the process.

1.4.1 - 2024-02-05¶

Changed¶

Zero downtime update no longer includes terminated nodes.
Improved pending migrations check in zero downtime update.

1.4.0 - 2024-02-01¶

Added¶

Add override for the NLB subnets.
Add option to switch the load balancers to internal.
Support for deploying and managing AWS Global Accelerators.
Added preserve host header to ALB.
Prometheus Alertmanager is now installed on the monitor node, and sends its messages to AWS SNS.
AWS SNS topic for alerts, that is configured for our shared infrastructure (see glh/ha/aws-infrastructure).
Added s3 bucket for CI Secure Files.

Changed¶

AWS SSM update is now run weekly at 03:00 instead of during the day.
Loki server configuration optimized to allow for larger queries.

Fixed¶

Auto start/stop cost saving lambda was not allowed to start customer-managed KMS encrypted EC2 machines.

Removed¶

Removed ansible DNS configuration.
Removed entire consul support.

Upgrade instructions¶

Projects must be updated to 1.3.x before upgrading to this version due to moved Terraform resources.
Remove the groups:consul entry from ansible/inventory/aws_ec2.yml.

1.3.3 - 2024-01-30¶

Fixed¶

Explicitly disable the KAS service from listening for SSL traffic directly.

1.3.2 - 2024-01-29¶

Fixed¶

Increased timeout on Grafana dashboard clone to prevent timeout errors.
Allow outbound ICMP requests.

Removed¶

Removed glh_container_registry_enable and glh_container_registry_external_url.

Upgrade instructions¶

Remove glh_container_registry_enable and glh_container_registry_external_url, and make sure the non-glh prefixed vars are set.

1.3.1 - 2023-12-21¶

Fixed¶

Added missing egress rule to allow ALB to connect to Kroki nodes.
Recognise all 200-499 status codes as success for pages nodes, because force-auth returns a 3xx code instead of 200.
Security group definition for GitLab pages is broken when custom domains is disabled.
Handling of version requirements for gitlab-runner package.
Package of gitlab-runner cannot be updated by Ansible due to package hold.
Add a default value of false for pages_enable_custom_domains in Ansible to prevent over-reliance on Terraform.

1.3.0 - 2023-11-30¶

Added¶

It's now possible to add solution specific Terraform results to Ansible's terraform_vars.yml file.
Allow users to disable management of the 'get-terraform-state' S3 bucket: manage_get_terraform_state_bucket = false.
Support for autoscaling GitLab Runners based on AWS ASGs and the new Fleeting plugins.
The Terraform variable alternative_fleet_instance_types can now be used on autoscaling runner hive definitions to configure alternative instance types to mitigate potential capacity issues on AWS.
The Terraform variables fleet_additional_tags and autoscaling_fleet_default_additional_tags can now be used to add additional AWS tags to autoscaling runner resources and subresources.
Terraform variables autoscaling_fleet_default_disk_device_name and fleet_disk_device_name can be used to override the expected root disk name, which is required when using alternative AMIs with a different root device name.
AWS auto scaling groups and subresources for autoscaling runners get the created-by:runner-autoscale tag assigned.
A VPC Endpoint Gateway and security group rule to route S3 traffic through AWS's internal network to reduce costs.
Added support for SSM instead of SSH.
Added docs to restrict AWS permissions.
All instances are added to AWS Fleet Manager.
Garbage collection runs on primary rails node.
Basic support for enabling custom domains for GitLab Pages. Read our docs to learn about the current limitations.

Changed¶

Renamed ansible/terraform_output.yml to ansible/terraform_vars.yml.
Allow nodes with the gitlab_s3_policy to list bucket contents as well, so they can perform cleanup tasks.
Access control for GitLab pages has been set to always enabled, users can disable auth on a per-project basis.
Bastion nodes are now placed in the private subnet.
Separated internal and external certificates.
Moved all security group logic to a module to prevent duplication. No changes should occur because of this.
Renamed the registry_mirror security group to registry-mirror to match the naming of the other security groups.

Fixed¶

Using an external domain name no longer triggers AWS ACM validation errors.
Default disk device name for autoscaling runner fleet nodes now works as expected with our default runner AMI id.
Disable EC2MetaDataSSRF_QUERYARGUMENTS and GenericRFI_QUERYARGUMENTS WAF rules on /oauth/authorize endpoint.

Upgrade instructions¶

Before running terraform apply, remove the ansible/terraform_vars.yml file and update your .gitignore file (ansible/terraform_output.yml => ansible/terraform_vars.yml).
Ensure you run Ansible on your primary Rails node after updating to configure GitLab Pages OAuth in the GitLab DB.

1.2.5 - 2023-11-30¶

Added¶

Logic to copy custom files to gitlab_pages components in Ansible.

1.2.4 - 2023-11-17¶

Changed¶

Separated internal and external certificates.

1.2.3 - 2023-11-02¶

Added¶

The Ansible glh_domain variable which points to {{ prefix }}.glhc.nl by default.
The Ansible external_kas_url variable which points to {{ external_url_sanitised | replace('https', 'wss') }}/-/kubernetes-agent/ by default.
The Ansible gitlab_kas_enabled variable which defaults to true.
Ansible var glh_container_registry_enable for solutions to control value of container_registry_enable.
Ansible var glh_container_registry_external_url for solutions to control value of container_registry_external_url.

Fixed¶

The registry is now correctly enabled by default and reachable on https://registry.{{ glh_domain }}.
container_registry_enable and container_registry_external_url are now overridden in the entire dependency chain.
Use the glh_ counterparts to control their values in solutions.
GitLab KAS is now correctly configured.
Pin the version of node_exporter to prevent rate-limit issues when looking up the latest version on GitHub.

1.2.2 - 2023-10-18¶

Added¶

Playbook for configuring gitlab_nfs nodes, since it is no longer configured via the common playbook.

Fixed¶

Ensure common role is never invoked directly to prevent omnibus installation on some roles.
The backups bucket is not actually excluded where needed and ends up in a versioning and backup policy.
Add autoscaling_runner config for Prometheus, this was missing previously.

Removed¶

The common playbook has been removed, invoke the all or role-specific playbook instead.

Changed¶

Moved logic to uninstall Omnibus from Monitor role to common role, and invoke it when required.

1.2.1 - 2023-10-12¶

Fixed¶

Options regarding EC2 instance root devices were not applied properly.
Fresh cluster cannot be provisioned because of chicken-and-egg S3 problems between glh_nlb.tf and glh_dns.tf.
Fresh cluster cannot be provisioned because of chicken-and-egg S3 problems in glh_backup.tf.

1.2.0 - 2023-10-05¶

Added¶

Support for dedicated single-instance GitLab Runners with the Docker executor.
The primary_domain_name, primary_registry_domain_name & primary_pages_domain_name domains are now resolved in private DNS zones.
Missing security group references.

Fixed¶

Allow sidekiq access to OpenSearch.
Monitor nodes are now included in zero downtime update.

Changed¶

Moved Promtail role and tasks to common role to ease zero downtime updates.

1.1.0 - 2023-09-18¶

Added¶

Loki role for centralized logging.
Promtail role for sending logs to Loki.
Separate Grafana role.
Thanos install on the monitor role via Docker.
NLB/ALB configuration for accessing Grafana, Prometheus and Thanos.
Pre-installed CloudWatch datasource for Grafana.
Pre-made Grafana dashboards have been copied into our version of the stack from upstream version 2.8.5.
Ansible playbook to generate local SSH configuration for easy SSH proxying.
Validation to Ansible pre_configure playbook.

Removed¶

Monitor role no longer has Grafana installed, use the new Grafana role instead.
Grafana access via /-/grafana on the primary GitLab hostname.
Monitor role no longer has gitlab-omnibus installed, and it is automatically removed on the next Ansible run. Prometheus is run via Docker instead.

Changed¶

Running more than 1 monitor node is temporarily not supported while we figure out how to properly automate a HA setup.
Prometheus's metrics are now stored in S3 instead of on-disk.
The instance label on Prometheus metrics is now set to hostname:scrape_port. For example: glh-dev-gitlab-rails-1:8080.
The hostname label on Prometheus metrics is set to the ansible_hostname by the Prometheus server, even if the metrics already have this label set. Compared to the example above, this will be set to: glh-dev-gitlab-rails-1.
Instances are now rebooted before reconfiguration in zero downtime update.
All security.security_group_<GROUP> outputs have been changed to security.security_group_<GROUP>_id and now return the ID instead of the complete resource.

Upgrade instructions¶

Set monitor_node_count to 1 when it is set to more than 1 currently.
Manually export data from your current Grafana instance if you need to keep it; this includes users and dashboards.
Silence monitoring for your cluster if you are scraping it with federation: all metrics will be reset, which may trigger a lot of false-positive alerts.
If you encounter errors with running Ansible on existing monitor nodes, manually run gitlab-ctl stop beforehand.
If you reference a security.security_group_<GROUP>.id in your solution, replace it with security.security_group_<GROUP>_id.

1.0.0 - 2023-09-12¶

Added¶

Initial 1.0.0 release, functionally the same as 0.7.0.

0.7.0 - 2023-08-29¶

Added¶

Terraform variable reverse_az_ordering - flips the AZ choice when creating EC2 machines.
Ansible variable aws_nfs_stop_after_run - can be used to override if NFS servers are stopped or not.
Optional schedule to auto start/stop instances for cost saving.
Option to filter ingress traffic.

Fixed¶

Fixed missing dependencies in backup causing errors during setup of new cluster.
NTP package is now installed by default so Ansible runs don't fail on newly created machines.

Removed¶

The security_group_common_egress_https_cidr_blocks variable, this is replaced by http_allowed_egress_cidr_blocks.
The default security group no longer has ingress and egress rules.

Changed¶

Renamed the following egress filter variables:
security_group_common_egress_dns_cidr_blocks => dns_allowed_egress_cidr_blocks
security_group_common_egress_http_cidr_blocks => http_allowed_egress_cidr_blocks
security_group_common_egress_ntp_cidr_blocks => ntp_allowed_egress_cidr_blocks
EC2 nodes that aren't explicitly added to a security group, no longer have network access.
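
The renamed egress filter variables could be used as sketched below. The CIDR ranges are hypothetical and the list-of-strings type is inferred from the variable names; the module name gitlab_cluster is assumed for this release as well.

module "gitlab_cluster" {
    # ... existing arguments stay unchanged ...

    # Hypothetical CIDR ranges, for illustration only.
    dns_allowed_egress_cidr_blocks  = ["10.0.0.2/32"]
    http_allowed_egress_cidr_blocks = ["0.0.0.0/0"]
    ntp_allowed_egress_cidr_blocks  = ["10.0.0.3/32"]
}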

0.6.0 - 2023-08-22¶

Added¶

DNS nameservers can now be configured using the dns_nameservers Ansible variable.
NTP servers can now be configured using the ntp_servers Ansible variable.

Changed¶

Updated WAF rules for GitLab 15.0 GraphQL API calls.

0.5.1 - 2023-08-18¶

Added¶

It's now possible to SSH to nodes using their hostname (e.g. ssh <prefix>-gitlab-rails-1) from a Bastion node.

Changed¶

All nodes now have their hostname set correctly.
All nodes now reboot during a zero downtime update.

Removed¶

Swap is no longer added on new nodes.

Upgrade instructions¶

To propagate the hostname changes, each node needs to be rebooted.
To remove swap from existing nodes, run the remove_swap playbook.

0.5.0 - 2023-08-18¶

Added¶

Terraform module variables for using existing ACM certificates:
certificate_arn
pages_certificate_arn
registry_certificate_arn
Support for Kroki diagram servers.
Added security groups to Network Load Balancers.
Ansible variables are now generated by this toolkit.
The EC2 common security group now filters egress traffic with destinations outside the GitLab VPC.
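
A sketch of passing existing ACM certificates to the module using the variables listed above. The ARNs, account ID and region are hypothetical placeholders, and the module name gitlab_cluster is assumed for this release.

module "gitlab_cluster" {
    # ... existing arguments stay unchanged ...

    # Hypothetical ACM certificate ARNs, for illustration only.
    certificate_arn          = "arn:aws:acm:eu-west-1:123456789012:certificate/11111111-1111-1111-1111-111111111111"
    pages_certificate_arn    = "arn:aws:acm:eu-west-1:123456789012:certificate/22222222-2222-2222-2222-222222222222"
    registry_certificate_arn = "arn:aws:acm:eu-west-1:123456789012:certificate/33333333-3333-3333-3333-333333333333"
}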

Changed¶

Upgraded to GET 2.8.5.
All non-primary domain traffic is now redirected to the primary domains.
For example: if you set domain_name to git.example.com, git.cluster.glhc.nl is redirected there.
This does not apply to pages, because AWS ALB cannot perform substitutions on hostnames.
ALB rules were split off into their own file: glh_alb_rules.tf.
ALB rule ordering was changed so solutions can more easily inject custom rules.
In case Terraform fails to apply these with error PriorityInUse, just run Terraform again to fix this.
Restricted access to health check & metric endpoints.
A number of resources and variables have had their names changed in order to comply with code standards.

Fixed¶

Fixed invalid s3 lifecycle configuration.
Redirect pages traffic to the pages_domain_name in addition to *..
Fixed copy of SSH config.
Fixed creation of SSL certificate when no domain_name is set.
Fixed access from monitor node to pages metrics.
Grafana and Kroki now only accept traffic on the primary domain.

Upgrade instructions¶

Existing Network Load Balancers have to be deleted manually to add the security groups. Ansible needs to be run afterward.
Remove terraform/ansible_vars.tf from your project.
Due to the resource name changes, you might see a large number of Terraform changes. This is expected behaviour.

0.4.1 - 2023-08-08¶

Fixed¶

Backup S3 bucket now has one lifecycle configuration.
user_data is now an ignored change on AWS instances.

0.4.0 - 2023-08-08¶

Added¶

Ansible variable glh_apt_repo_url to configure which GitLabHost Apt repository is installed.
Added jumphost IP to Bastion SSH allow list.

Changed¶

Set "block public access" on S3 buckets to true by default for increased security.
Provisioning SSH servers is now done using a Debian package that installs via user-data set by Terraform.

Fixed¶

Moved consul server config from bastion to consul role to prevent errors during setup.

Removed¶

Ansible-based user setup, sudo configuration, SSH maintenance server configuration.

Upgrade instructions¶

Remove the following Ansible variables (including examples):
ansible_user
ansible_ssh_private_key_file
Remove the following Terraform module parameters (including examples):
ssh_allow_port_22

0.3.1 - 2023-08-07¶

Added¶

Added load balancer stickiness based on GitLab session cookie for better zero downtime update experience.

0.3.0 - 2023-07-28¶

Added¶

<PREFIX>.glhc.nl DNS record for easier management.
Added lifecycle policy to all buckets to delete noncurrent versions as soon as possible.

Changed¶

Changed terraform directory structure, "terraform/gitlab_aws_cluster" is now "terraform/aws/cluster".
The registry.<PREFIX>.glhc.nl DNS record is now always created.

Fixed¶

Shared SSH host keys between Bastion nodes before writing custom SSH config.
Fixed consul in initial setup.
The Zero Downtime Update playbook is now working properly.

Upgrade instructions¶

In your environment.tf, replace terraform/gitlab_aws_cluster with terraform/aws/cluster in the module source.

0.2.1 - 2023-06-29¶

Fixed¶

Generating a certificate now waits for validation before assigning it to the load balancer.

0.2.0 - 2023-06-27¶

Added¶

GitLab Pages is now supported.

Changed¶

Path to shared secrets is updated to reflect changes in the template.

0.1.1 - 2023-06-16¶

Added¶

Consul is now installed on bastion nodes and can be run with consul members.
Variable to set WAF body size restriction.
CI linting.
Wrote a proper README.

Upgrade instructions¶

Update ansible/inventory/aws_ec2.yml, add the following under keyed_groups:

groups:
  # Register primary bastion node as consul servers
  consul: tags.gitlab_node_level == 'bastion-primary'

0.1.0 - 2023-06-05¶

Added¶

Initial developmental release.