Prometheus Alerting — most common alert rules
This tutorial lists the most common Prometheus alert rules.
Since version 2.0, Prometheus alert rules are defined in YAML files.
In this setup the rule files live in one directory, one file per rule set: /etc/prometheus/rules/{file_name}.yml
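For Prometheus to pick up files in that directory, they have to be listed under rule_files in the main configuration, and each file has to wrap its rules in a group. A minimal sketch, assuming the main configuration is /etc/prometheus/prometheus.yml and using a hypothetical group name (common-alerts):

```yaml
# /etc/prometheus/prometheus.yml (fragment; the path is an assumption)
rule_files:
  - /etc/prometheus/rules/*.yml   # glob: load every rule file in the directory
```

```yaml
# /etc/prometheus/rules/rules.yml (skeleton)
groups:
  - name: common-alerts           # hypothetical group name
    rules:
      # each "- alert: ..." entry from this tutorial goes here,
      # indented one level below "rules:"
      - alert: InstanceDown
        expr: probe_success{job="blackbox-icmp"} == 0
        for: 1m
        labels:
          severity: critical
```

The rule snippets in the rest of this page are shown without the groups wrapper, so remember to indent them under rules: when you paste them in.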
Caution:
Alert thresholds depend on the nature of your applications.
Some queries on this page use arbitrary tolerance thresholds, so tune them for your environment.
Take care with the indentation, folks. :)
The first three alert rules rely on metrics from blackbox_exporter, and the last three require node_exporter on the client machine to provide the system metrics.
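The rules select series by job label (blackbox-icmp, blackbox-ssh, blackbox-https, node_exporter), so the scrape jobs need matching names. Here is a rough sketch of two such jobs, following the relabeling pattern from the blackbox_exporter README. The target host names are hypothetical, the exporter addresses assume the default ports (9115 and 9100), and an icmp module is assumed to be defined in your blackbox.yml.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: blackbox-icmp          # must match job="blackbox-icmp" in the rule
    metrics_path: /probe
    params:
      module: [icmp]                 # assumes an "icmp" module in blackbox.yml
    static_configs:
      - targets:
          - server1.example.com      # hypothetical host to ping
    relabel_configs:
      # pass the target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # keep the probed host as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # scrape the blackbox_exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115

  - job_name: node_exporter          # must match job="node_exporter" in the rules
    static_configs:
      - targets:
          - server1.example.com:9100 # hypothetical host running node_exporter
```

The blackbox-ssh and blackbox-https jobs would look the same as blackbox-icmp, with a different module and target list.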
1. InstanceDown
- alert: InstanceDown
  # Annotations: additional informational labels to store more information
  annotations:
    title: 'Instance {{ $labels.instance }} down'
    description: '{{ $labels.instance }} has been down for more than 1 minute.'
  # Condition for alerting
  expr: probe_success{job="blackbox-icmp"} == 0
  for: 1m
  # Labels: additional labels to be attached to the alert
  labels:
    severity: 'critical'
2. SshDown
- alert: SshDown
  annotations:
    description: '{{ $labels.instance }} has not been connectable for more than 2 minutes.'
    summary: '{{ $labels.instance }} SSH is unreachable'
  expr: probe_success{job="blackbox-ssh"} == 0
  for: 2m
  labels:
    severity: 'critical'
3. HttpFailure
- alert: HttpFailure
  annotations:
    description: 'The HTTP(S) service at {{ $labels.instance }} has been failing for more than 2 minutes.'
    summary: '{{ $labels.instance }} is offline or erroring'
  expr: probe_success{job="blackbox-https"} == 0
  for: 2m
  labels:
    severity: 'critical'
4. HostOutOfMemory
- alert: HostOutOfMemory
  expr: node_memory_MemAvailable_bytes{job="node_exporter"} / node_memory_MemTotal_bytes{job="node_exporter"} * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host out of memory (instance {{ $labels.instance }})"
    description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
5. CriticalCPULoad
- alert: CriticalCPULoad
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{job="node_exporter",mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host high CPU load (instance {{ $labels.instance }})"
    description: "The CPU load reported from {{ $labels.instance }} has exceeded 80% for more than 5 minutes."
6. HostOutOfDiskSpace
- alert: HostOutOfDiskSpace
  expr: node_filesystem_free_bytes{mountpoint="/", fstype="rootfs",job="node_exporter"} / node_filesystem_size_bytes{job="node_exporter"} * 100 < 10
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Host out of disk space (instance {{ $labels.instance }})"
    description: "Disk is almost full (< 10% left)\n {{ $labels.instance_short }}\n {{ $labels.mountpoint }}\n VALUE = {{ printf \"node_filesystem_avail_bytes{mountpoint='%s'}\" .Labels.mountpoint | query | first | value | humanize1024 }}"
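One thing to watch with the disk rule: on many hosts the root filesystem reports a type such as ext4 or xfs rather than rootfs, in which case the fstype matcher selects no series and the alert can never fire. A possible variant, offered as a sketch rather than a drop-in replacement, drops the fstype matcher and uses node_filesystem_avail_bytes (space available to non-root users) so the expression agrees with the metric queried in the description:

```yaml
- alert: HostOutOfDiskSpace
  # Same 10% threshold as above; only the selectors differ.
  # Check the fstype your own hosts report before relying on this.
  expr: node_filesystem_avail_bytes{mountpoint="/", job="node_exporter"} / node_filesystem_size_bytes{mountpoint="/", job="node_exporter"} * 100 < 10
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Host out of disk space (instance {{ $labels.instance }})"
    description: "Disk is almost full (< 10% left) on {{ $labels.mountpoint }}\n VALUE = {{ $value }}"
```

You can confirm which series exist by running node_filesystem_size_bytes{mountpoint="/"} in the Prometheus expression browser and looking at the fstype label.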
Finally, you can check your rule syntax using promtool:
# promtool check rules /etc/prometheus/rules/rules.yml
Checking /etc/prometheus/rules/rules.yml
  SUCCESS: 6 rules found
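The syntax check only tells you the file parses. If you also want to confirm that a rule fires when you expect it to, promtool can run unit tests against synthetic series. Below is a minimal sketch for the InstanceDown rule. The test file name (rules_test.yml) and the instance value (host1) are made up for illustration, and it assumes rules.yml contains the InstanceDown rule exactly as written above.

```yaml
# rules_test.yml (hypothetical file name)
rule_files:
  - /etc/prometheus/rules/rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # A probe that fails at every sample from t=0 to t=4m
    input_series:
      - series: 'probe_success{job="blackbox-icmp", instance="host1"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # With "for: 1m" the alert should be firing by the 3 minute mark
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: blackbox-icmp
              instance: host1
            exp_annotations:
              title: 'Instance host1 down'
              description: 'host1 has been down for more than 1 minute.'
```

Run it with promtool test rules rules_test.yml. The expected labels and annotations have to match the rule output exactly, so adjust them if you changed the rule text.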
Hope you like the tutorial. Please let me know your feedback in the response section.
Happy learning!