Logging and monitoring¶
This page describes the logging and monitoring features present in GET.
Observability suite¶
When the Loki, Grafana, and Monitor roles are all enabled, a fairly complete stack for observing the cluster is
available. Details on how to enable each role can be found in its respective section above.
Relevant client daemons on all nodes are automatically installed when the matching server role is activated.
Observability endpoint¶
We have a central proxy server that has private tunnels to all deployed solutions in the glh-ha-engineers AWS
organization. You can use the Grafana interface proxied by this to browse Prometheus metrics, Loki logs, and Grafana
itself. We pre-configure Grafana dashboards in the solution so you can get started quickly. You will need a
working AWS account inside the GitLabHost parent organization to use this.
To get started, simply visit HA Federated Grafana and choose your solution of choice.
Loki¶
Loki is a centralized logging service. Nodes in the cluster run promtail, the daemon that ships
logs to the available Loki servers. By default, /var/log, /var/log/nginx and the systemd journal are scraped. In addition,
several interesting logfiles generated by GitLab are scraped when detected, such as production_json.log.
When at least one Loki node is present, Ansible will install the promtail daemon on all nodes in a cluster,
and configure these to upload their logs to Loki through the internal NLB, which in turn picks any healthy Loki node.
When more than one Loki node is present, they are automatically configured in clustered mode using the built-in memberlist protocol support. When enabled, a Grafana datasource is automatically added for viewing logs stored in Loki. The number of Loki nodes must be 0, 1, or a multiple of 3; this is validated automatically.
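Given the node count rule, a clustered Loki deployment starts at three nodes. As a minimal sketch (the instance type shown is only an example value, not prescribed by this document), a three-node setup could look like:

```hcl
module "gitlab_cluster" {
  # [...]
  # Must be 0, 1, or a multiple of 3. With three or more nodes,
  # Loki is automatically clustered via the memberlist protocol.
  loki_node_count    = 3
  loki_instance_type = "t3a.small" # example value
}
```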
Monitor¶
The monitor role consists of Prometheus and a complete Thanos stack. Prometheus scrapes metrics
from available daemons running on nodes in the cluster. Thanos is responsible for storing these metrics in S3, and
provides de-duplication when multiple monitor nodes are present. This is required because every Prometheus node does
its own scraping without considering that there might be other Prometheus nodes present.
Prometheus is configured to scrape all known metrics from all running daemons we know about in the cluster
automatically. This includes node_exporter on each node, which provides statistics on, for example, CPU and RAM usage.
When Loki is enabled, all Loki and Promtail daemons are scraped as well, to provide metrics on the logging system.
Grafana¶
Grafana is a web interface that is mostly used to explore raw observability data and create dashboards based
on this data. We automatically add relevant data sources to Grafana on install, currently Loki, Prometheus and AWS
CloudWatch are preconfigured and ready to use on first login. We also pre-import some ready-made dashboards to make some
sense of raw metric data. You can log in with username admin and the password as set in Ansible.
Additional configuration¶
Loki¶
We use Grafana Loki to scrape all system logs from all nodes in a cluster. The logs are stored in an S3 bucket which is managed by Terraform.
You can learn more about how to use Loki in the Observability Section of this document.
The following relevant Terraform variables are available:
| Name | Default | Description |
|---|---|---|
| `loki_node_count` | `0` | Number of Loki nodes to create |
| `loki_instance_type` | `""` | Instance type of the Loki node(s) |
| `loki_disk_type` | `gp3` | Optional |
| `loki_disk_size` | `50` | Optional |
| `loki_disk_encrypt` | `true` | Optional |
| `loki_disk_delete_on_termination` | `true` | Optional |
| `loki_disk_kms_key_arn` | `null` | Optional |
| `loki_data_disks` | `[]` | Optional |
| `loki_iam_instance_policy_arns` | `[]` | Optional |
Most of these settings are entirely optional, or can be configured through default variables. The following example is enough to get started:
```hcl
module "gitlab_cluster" {
  # [...]
  loki_node_count    = 1
  loki_instance_type = "t3a.small"
}
```
Grafana¶
We have an Ansible role for installing Grafana directly from the Grafana Labs Apt repository. Currently, only zero or one Grafana nodes are valid options, since we do not yet have automatic clustering support.
You can learn more about how to use Grafana in the Observability Section of this document.
The following relevant Ansible variables are available:
| Name | Default | Description |
|---|---|---|
| `grafana_password` | None | This value is automatically set to be the password of the admin user in Grafana. |
The following relevant Terraform variables are available:
| Name | Default | Description |
|---|---|---|
| `grafana_node_count` | `0` | Number of Grafana nodes to create. Must be 0 or 1 |
| `grafana_instance_type` | `""` | Instance type of the Grafana node(s) |
| `grafana_disk_type` | `gp3` | Optional |
| `grafana_disk_size` | `50` | Optional |
| `grafana_disk_encrypt` | `true` | Optional |
| `grafana_disk_delete_on_termination` | `true` | Optional |
| `grafana_disk_kms_key_arn` | `null` | Optional |
| `grafana_data_disks` | `[]` | Optional |
| `grafana_iam_instance_policy_arns` | `[]` | Optional |
Most of these settings are entirely optional, or can be configured through default variables. The following example is enough to get started:
```hcl
module "gitlab_cluster" {
  # [...]
  grafana_node_count    = 1
  grafana_instance_type = "t3a.small"
}
```
Monitor¶
The monitor role is fairly similar to upstream on the Terraform side of things, but has had a complete rewrite on the
Ansible side. We do not use the bundled Prometheus daemon, and we do not rely on node_exporter from gitlab-omnibus.
Currently, only zero or one Monitor nodes are valid options, since we do not yet have automatic clustering support.
Apart from Prometheus, the role also runs a complete Thanos stack which stores the data in an S3 bucket that is automatically created by Terraform. Upstream usually has Grafana installed on the Monitor node, but we have split it off into a separate Grafana role.
You can learn more about Prometheus and Thanos in the Observability Section of this document.
The following Terraform variables are changed compared to upstream:
| Name | Default | Description |
|---|---|---|
| `monitor_node_count` | `0` | Number of Monitor nodes to create. Must be 0 or 1 |
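As with the Loki and Grafana roles, a minimal sketch should be enough to get started. This assumes the upstream `monitor_instance_type` variable is available unchanged (it is not listed in the table above), and the instance type is only an example value:

```hcl
module "gitlab_cluster" {
  # [...]
  # Must be 0 or 1; automatic clustering is not yet supported.
  monitor_node_count    = 1
  monitor_instance_type = "t3a.small" # assumed upstream variable, example value
}
```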