Logging and monitoring¶
This page describes the logging and monitoring features that are present in GET.
Observability suite¶
When the Loki and Monitor roles are enabled, a fairly complete stack for observing the cluster is
available. Details on how to enable each role can be found in their respective sections.
Relevant client daemons on all nodes are automatically installed when the matching server role is activated.
Observability endpoint¶
We have a centralized Grafana solution which can be used to browse Prometheus metrics, Loki logs, and Grafana itself. We pre-configure Grafana dashboards in the solution so you can get started quickly.
To get started, visit HA Federated Grafana and select your solution.
Loki¶
Loki is a centralized logging service. Nodes in the cluster run alloy, which is the daemon that ships
logs to the available servers. By default, /var/log, /var/log/nginx and systemd-journal are scraped. In addition,
several interesting logfiles generated by GitLab are scraped when detected, like production_json.log.
When at least one Loki node is present, Ansible will install the alloy daemon on all nodes in a cluster,
and configure these to upload their logs to Loki through the internal NLB, which in turn picks any healthy Loki node.
When more than one Loki node is present, the nodes are automatically configured in clustered mode using Loki's built-in memberlist protocol support. When Loki is enabled, a Grafana datasource is automatically added for viewing the logs stored in it. The number of Loki nodes must be set to 0, 1, or a multiple of 3. This is validated automatically.
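The source does not show how the node count rule is enforced. As a minimal sketch, assuming the check were written as a Terraform `validation` block on the `loki_node_count` variable (the actual check may live elsewhere, for example in Ansible), it could look like this:

```hcl
# Hypothetical sketch: the real validation in GET may be implemented differently.
variable "loki_node_count" {
  type        = number
  default     = 0
  description = "Number of Loki nodes to create"

  validation {
    # 0 disables Loki (0 % 3 == 0), 1 is a single node,
    # and any other multiple of 3 forms a cluster.
    condition = var.loki_node_count >= 0 && (
      var.loki_node_count == 1 || var.loki_node_count % 3 == 0
    )
    error_message = "The loki_node_count value must be 0, 1, or a multiple of 3."
  }
}
```

With a rule like this, a value such as 2 or 4 would be rejected at plan time, before any nodes are created.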
Monitor¶
The monitor role consists of Prometheus and a complete Thanos stack. Prometheus scrapes metrics
from available daemons running on nodes in the cluster. Thanos is responsible for storing these metrics in S3, and
provides de-duplication when multiple monitor nodes are present. This is required because every Prometheus node does
its own scraping without considering that there might be other Prometheus nodes present.
Prometheus is configured to automatically scrape metrics from all running daemons we know about in the cluster.
This includes node_exporter on each node, which provides statistics such as CPU and RAM usage.
When Loki is enabled, all Loki and Alloy daemons are scraped as well, to provide metrics on the logging system.
Additional configuration¶
Loki¶
We use Grafana Loki to scrape all system logs from all nodes in a cluster. The logs are stored in an S3 bucket which is managed by Terraform.
You can learn more about how to use Loki in the Observability Section of this document.
The following relevant Terraform variables are available:
| Name | Default | Description |
|---|---|---|
| `loki_node_count` | `0` | Number of Loki nodes to create |
| `loki_instance_type` | `""` | Instance type of the Loki node(s) |
| `loki_disk_type` | `gp3` | Optional |
| `loki_disk_size` | `50` | Optional |
| `loki_disk_encrypt` | `true` | Optional |
| `loki_disk_delete_on_termination` | `true` | Optional |
| `loki_disk_kms_key_arn` | `null` | Optional |
| `loki_data_disks` | `[]` | Optional |
| `loki_iam_instance_policy_arns` | `[]` | Optional |
Most of these settings are entirely optional, or can be configured through default variables. The following example is enough to get started:
```hcl
module "gitlab_cluster" {
  # [...]
  loki_node_count    = 1
  loki_instance_type = "t3a.small"
}
```
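If the defaults do not fit, the optional disk variables from the table above can be set next to the required ones. The following is a sketch only: the values are illustrative assumptions, not recommendations from the source, and the unit of `loki_disk_size` is assumed rather than documented here.

```hcl
module "gitlab_cluster" {
  # [...]

  # Clustered mode: the count must be 0, 1, or a multiple of 3.
  loki_node_count    = 3
  loki_instance_type = "t3a.small"

  # Optional overrides (illustrative values, not recommendations).
  loki_disk_type = "gp3"
  loki_disk_size = 100 # unit assumed to be GB; the source does not state it
}
```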
Prometheus¶
The monitor role is fairly similar to upstream on the Terraform side of things, but has had a complete rewrite on the
Ansible side. We do not use the bundled Prometheus daemon, and we do not rely on node_exporter from gitlab-omnibus.
Currently, the only valid options are zero or one Prometheus node, since we do not have automatic clustering support yet.
Apart from Prometheus, the role also runs a complete Thanos stack, which stores the data in an S3 bucket that is automatically created by Terraform.
You can learn more about Prometheus and Thanos in the Observability Section of this document.
The following Terraform variables are changed compared to upstream:
| Name | Default | Description |
|---|---|---|
| `prometheus_node_count` | `0` | Number of Prometheus nodes to create. Must be 0 or 1 |
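A minimal sketch for enabling the role, in the same style as the Loki example. Only `prometheus_node_count` is taken from the table above. Any other settings a Prometheus node may need, such as an instance type, are not listed in this section and are left elided rather than guessed.

```hcl
module "gitlab_cluster" {
  # [...]

  # Must be 0 or 1, since automatic clustering is not supported yet.
  prometheus_node_count = 1
}
```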