
Logging and monitoring

This page describes the logging and monitoring features present in GET.

Observability suite

When the Loki, Grafana, and Monitor roles are all enabled, a fairly complete stack for observing the cluster is available. Details on how to enable the roles can be found in their respective sections below.

Relevant client daemons on all nodes are automatically installed when the matching server role is activated.

Observability endpoint

We have a central proxy server that has private tunnels to all deployed solutions in the glh-ha-engineers AWS organization. Through the Grafana interface proxied by this server you can browse Prometheus metrics, Loki logs, and Grafana itself. We pre-configure Grafana dashboards in the solution so you can get started quickly. You will need a working AWS account inside the GitLabHost parent organization to use this.

To get started, simply visit HA Federated Grafana and choose your solution of choice.

Loki

Loki is a centralized logging service. Nodes in the cluster run promtail, the daemon that ships logs to the available Loki servers. By default, /var/log, /var/log/nginx, and the systemd journal are scraped. In addition, several interesting log files generated by GitLab, such as production_json.log, are scraped when detected.

When at least one Loki node is present, Ansible will install the promtail daemon on all nodes in a cluster, and configure these to upload their logs to Loki through the internal NLB, which in turn picks any healthy Loki node.

When more than one Loki node is present, the nodes are automatically configured in clustered mode using Loki's built-in memberlist protocol support. When Loki is enabled, a Grafana datasource for viewing logs stored in Loki is added automatically. The number of Loki nodes must be set to 0, 1, or a multiple of 3; this is validated automatically.
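For example, assuming the Terraform module variables shown in the Additional configuration section, a clustered Loki deployment could be requested as follows. The instance type here is purely illustrative:

```hcl
module "gitlab_cluster" {
  # [...]

  # Three nodes satisfy the "0, 1, or a multiple of 3" rule and are
  # clustered automatically via the memberlist protocol.
  loki_node_count    = 3
  loki_instance_type = "t3a.small" # illustrative; size to your log volume
}
```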

Monitor

The monitor role consists of Prometheus and a complete Thanos stack. Prometheus scrapes metrics from available daemons running on nodes in the cluster. Thanos is responsible for storing these metrics in S3, and provides de-duplication when multiple monitor nodes are present. This is required because every Prometheus node does its own scraping without considering that there might be other Prometheus nodes present.

Prometheus is configured to automatically scrape metrics from all daemons we know about in the cluster. This includes node_exporter on each node, which provides statistics on, for example, CPU and RAM usage. When Loki is enabled, all Loki and promtail daemons are scraped as well, providing metrics on the logging system itself.

Grafana

Grafana is a web interface that is mostly used to explore raw observability data and to create dashboards based on this data. We automatically add relevant data sources to Grafana on install: currently Loki, Prometheus, and AWS CloudWatch are preconfigured and ready to use on first login. We also pre-import some ready-made dashboards to make sense of the raw metric data. You can log in with username admin and the password set in Ansible.

Additional configuration

Loki

We use Grafana Loki to scrape all system logs from all nodes in a cluster. The logs are stored in an S3 bucket which is managed by Terraform.

You can learn more about how to use Loki in the Observability Section of this document.

The following relevant Terraform variables are available:

| Name | Default | Description |
|------|---------|-------------|
| loki_node_count | 0 | Number of Loki nodes to create |
| loki_instance_type | "" | Instance type of the Loki node(s) |
| loki_disk_type | gp3 | Optional |
| loki_disk_size | 50 | Optional |
| loki_disk_encrypt | true | Optional |
| loki_disk_delete_on_termination | true | Optional |
| loki_disk_kms_key_arn | null | Optional |
| loki_data_disks | [] | Optional |
| loki_iam_instance_policy_arns | [] | Optional |

Most of these settings are entirely optional, or can be configured through default variables. The following example is enough to get started:

```hcl
module "gitlab_cluster" {
  # [...]

  loki_node_count    = 1
  loki_instance_type = "t3a.small"
}
```

Grafana

We have an Ansible role for installing Grafana directly from the Grafana Labs Apt repository. Currently, zero or one Grafana node are the only valid options, since we do not have automatic clustering support yet.

You can learn more about how to use Grafana in the Observability Section of this document.

The following relevant Ansible variables are available:

| Name | Default | Description |
|------|---------|-------------|
| grafana_password | None | This value is automatically set to be the password of the admin user in Grafana. |
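As an illustration, the password can be supplied as a regular Ansible variable. The file path below is hypothetical, and in practice the value should come from Ansible Vault or another secret store rather than plain text:

```yaml
# group_vars/all.yml (path is illustrative)
# Sets the password for the Grafana admin user.
grafana_password: "a-strong-admin-password"
```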

The following relevant Terraform variables are available:

| Name | Default | Description |
|------|---------|-------------|
| grafana_node_count | 0 | Number of Grafana nodes to create. Must be 0 or 1 |
| grafana_instance_type | "" | Instance type of the Grafana node(s) |
| grafana_disk_type | gp3 | Optional |
| grafana_disk_size | 50 | Optional |
| grafana_disk_encrypt | true | Optional |
| grafana_disk_delete_on_termination | true | Optional |
| grafana_disk_kms_key_arn | null | Optional |
| grafana_data_disks | [] | Optional |
| grafana_iam_instance_policy_arns | [] | Optional |

Most of these settings are entirely optional, or can be configured through default variables. The following example is enough to get started:

```hcl
module "gitlab_cluster" {
  # [...]

  grafana_node_count    = 1
  grafana_instance_type = "t3a.small"
}
```

Monitor

The monitor role is fairly similar to upstream on the Terraform side of things, but has had a complete rewrite on the Ansible side. We do not use the bundled Prometheus daemon, nor do we rely on node_exporter from gitlab-omnibus.

Currently, only zero or one Monitor node are valid options, since we do not have automatic clustering support yet.

Apart from Prometheus, the role also runs a complete Thanos stack which stores the data in an S3 bucket that is automatically created by Terraform. Upstream usually has Grafana installed on the Monitor node, but we have split it off into a separate Grafana role.

You can learn more about Prometheus and Thanos in the Observability Section of this document.

The following Terraform variables are changed compared to upstream:

| Name | Default | Description |
|------|---------|-------------|
| monitor_node_count | 0 | Number of Monitor nodes to create. Must be 0 or 1 |
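Following the same pattern as the Loki and Grafana roles, a minimal Monitor configuration might look like the sketch below. Note that monitor_instance_type is an assumption based on the naming convention of the other roles and is not listed in the table above; verify it against the module's variables before use:

```hcl
module "gitlab_cluster" {
  # [...]

  monitor_node_count    = 1
  # Assumed by analogy with loki_instance_type / grafana_instance_type;
  # confirm the exact variable name in the module.
  monitor_instance_type = "t3a.small"
}
```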