Prometheus Alerting — most common alert rules

Rakesh Jain
Apr 21, 2020

This tutorial lists some of the most common Prometheus alert rules.

As of version 2.0, Prometheus alert rules are defined in YAML files.
In this tutorial the rule files live in this directory: /etc/prometheus/rules/{file_name}.yml
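
Prometheus only loads rule files that are listed under rule_files in its main configuration. As a minimal sketch, assuming the main configuration is /etc/prometheus/prometheus.yml (a path the tutorial does not state), the directory above could be referenced like this:

```yaml
# /etc/prometheus/prometheus.yml (excerpt; the file location is an assumption)
rule_files:
  # Glob patterns are allowed, so every .yml file in the directory is loaded
  - /etc/prometheus/rules/*.yml
```

Prometheus needs a reload or restart before it picks up new or changed rule files.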

Caution:
Alert thresholds depend on the nature of your applications.
Some queries on this page use arbitrary tolerance thresholds, so tune them for your environment.
Mind the indentation, folks. :)
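
Since indentation matters, it may help to see where a rule sits inside the file. Each "- alert:" entry on this page is one item of a rules list inside a group. The sketch below shows that nesting; the group name common-alerts is a hypothetical placeholder.

```yaml
groups:
  - name: common-alerts          # hypothetical group name
    rules:
      # Every rule from this page goes here as one list item
      - alert: InstanceDown
        expr: probe_success{job="blackbox-icmp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          title: 'Instance {{ $labels.instance }} down'
```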

The first three alert rules rely on blackbox_exporter, and the last three require node_exporter on the client machine to provide system metrics.
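
The blackbox rules filter on job names such as blackbox-icmp, so a scrape job with that name has to exist. The following is only a sketch of what such a job might look like. It assumes blackbox_exporter listens on its default port 9115 on the Prometheus host and that a module named icmp is defined in blackbox.yml; the target address 192.0.2.10 is a made-up example.

```yaml
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: 'blackbox-icmp'
    metrics_path: /probe
    params:
      module: [icmp]              # assumed module name from blackbox.yml
    static_configs:
      - targets:
          - 192.0.2.10            # hypothetical host to probe
    relabel_configs:
      # Pass the target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed host as the instance label used in the alerts
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to blackbox_exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

The blackbox-ssh and blackbox-https jobs would follow the same pattern with a different module.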

1. InstanceDown

- alert: InstanceDown
  # Annotations: informational fields that carry extra detail about the alert
  annotations:
    title: 'Instance {{ $labels.instance }} down'
    description: '{{ $labels.instance }} has been down for more than 1 minute.'
  # Condition for alerting
  expr: probe_success{job="blackbox-icmp"} == 0
  for: 1m
  # Labels: additional labels to be attached to the alert
  labels:
    severity: 'critical'

2. SshDown

- alert: SshDown
  annotations:
    description: '{{ $labels.instance }} has not been connectable for more than 2 minutes.'
    summary: '{{ $labels.instance }} SSH is unreachable'
  expr: probe_success{job="blackbox-ssh"} == 0
  for: 2m
  labels:
    severity: 'critical'

3. HttpFailure

- alert: HttpFailure
  annotations:
    description: 'The HTTP(S) service at {{ $labels.instance }} has been failing for more than 2 minutes.'
    summary: '{{ $labels.instance }} is offline or erroring'
  expr: probe_success{job="blackbox-https"} == 0
  for: 2m
  labels:
    severity: 'critical'

4. HostOutOfMemory

- alert: HostOutOfMemory
  expr: node_memory_MemAvailable_bytes{job="node_exporter"} / node_memory_MemTotal_bytes{job="node_exporter"} * 100 < 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host out of memory (instance {{ $labels.instance }})"
    description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

5. CriticalCPULoad

- alert: CriticalCPULoad
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{job="node_exporter",mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host high CPU load (instance {{ $labels.instance }})"
    description: "The CPU load reported from {{ $labels.instance }} has exceeded 80% for more than 5 minutes."

6. HostOutOfDiskSpace

- alert: HostOutOfDiskSpace
  expr: node_filesystem_free_bytes{mountpoint="/", fstype="rootfs",job="node_exporter"} / node_filesystem_size_bytes{job="node_exporter"} * 100 < 10
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Host out of disk space (instance {{ $labels.instance }})"
    description: "Disk is almost full (< 10% left)\n {{ $labels.instance_short }}\n {{ $labels.mountpoint }}\n VALUE = {{ printf \"node_filesystem_avail_bytes{mountpoint='%s'}\" .Labels.mountpoint | query | first | value | humanize1024 }}"
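
On many hosts the root filesystem is reported with a type such as ext4 or xfs rather than rootfs, in which case the fstype="rootfs" matcher selects nothing and the alert never fires. A possible variant, offered only as a sketch and not as the original rule, drops the fstype matcher and uses node_filesystem_avail_bytes (space available to non-root users) on both the alert and its description:

```yaml
- alert: HostOutOfDiskSpace
  # Variant: match the root mountpoint regardless of filesystem type
  expr: node_filesystem_avail_bytes{mountpoint="/",job="node_exporter"} / node_filesystem_size_bytes{mountpoint="/",job="node_exporter"} * 100 < 10
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Host out of disk space (instance {{ $labels.instance }})"
    description: "Disk is almost full (< 10% left) on {{ $labels.mountpoint }}\n VALUE = {{ $value }}"
```

It is worth running the expression in the Prometheus expression browser first to confirm it returns a series for each host.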

Finally, you can check your rule syntax with promtool:

# promtool check rules /etc/prometheus/rules/rules.yml
Checking rules/rules.yml
SUCCESS: 6 rules found
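
Beyond the syntax check, promtool can also unit-test rules with "promtool test rules". The file below is a minimal sketch for the InstanceDown rule. It assumes the rules are in rules.yml in the same directory as the test file; the file name tests.yml and the instance value 192.0.2.10 are made up for illustration.

```yaml
# tests.yml (hypothetical file name), run with: promtool test rules tests.yml
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulated probe results: the target fails at 0m, 1m, 2m and 3m
    input_series:
      - series: 'probe_success{job="blackbox-icmp", instance="192.0.2.10"}'
        values: '0 0 0 0'
    alert_rule_test:
      # By 3m the condition has held longer than the 1m "for" window
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: blackbox-icmp
              instance: 192.0.2.10
            exp_annotations:
              title: 'Instance 192.0.2.10 down'
              description: '192.0.2.10 has been down for more than 1 minute.'
```

The expected labels and annotations have to match the rule exactly, so they need updating whenever the rule text changes.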

I hope you liked the tutorial. Please share your feedback in the responses section.

Happy learning!
