In this part, we will look at how to design alerts in Prometheus. First, however, we need to understand a few concepts that are used in Prometheus:
- Metrics: Metrics are a core concept of Prometheus. We can expose them from our code, and Prometheus stores them as time series. We can then query them with PromQL, Prometheus's flexible query language.
- Labels: Labels are key-value pairs attached to a metric. A common use is to indicate which service a particular metric applies to, but labels in Prometheus are arbitrary, so they can describe much more than which service or instance exposed a metric.
In the following example, http_failure_request is the metric name; it identifies all the data points that Prometheus has collected for failed HTTP requests. The label service="productpage" restricts the metric to the series exposed by the productpage service:
# Request counter for the Product Page service (application created in Istio)
http_failure_request{service="productpage"}
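The way a label matcher narrows a metric down to specific series can be sketched in a few lines of Python. This is only an illustration of the idea, not how Prometheus implements selection; the sample values, the instance addresses, and the reviews service are hypothetical and do not come from a real deployment.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory samples: (metric name, labels, value).
Sample = Tuple[str, Dict[str, str], float]

SAMPLES: List[Sample] = [
    ("http_failure_request", {"service": "productpage", "instance": "10.0.0.1:9080"}, 12.0),
    ("http_failure_request", {"service": "reviews", "instance": "10.0.0.2:9080"}, 3.0),
    ("http_failure_request", {"service": "productpage", "instance": "10.0.0.3:9080"}, 5.0),
]


def select(samples: List[Sample], name: str, matchers: Dict[str, str]) -> List[Sample]:
    """Return samples whose name matches and whose labels equal every matcher."""
    return [
        s for s in samples
        if s[0] == name and all(s[1].get(k) == v for k, v in matchers.items())
    ]


# Rough equivalent of the selector http_failure_request{service="productpage"}
matched = select(SAMPLES, "http_failure_request", {"service": "productpage"})

# Rough equivalent of wrapping that selector in sum(...)
total = sum(value for _, _, value in matched)
```

Because the matcher only constrains the service label, both productpage instances are selected; adding an instance matcher would narrow the result further.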
Prometheus can gather metrics from services, VMs, infrastructure, or third-party applications. Each target exposes its metrics at an HTTP endpoint, /metrics by default, and Prometheus scrapes that URL. The endpoint returns the full list of metrics with their label sets and current values, without any calculation applied.
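To give a feel for what a /metrics response looks like, here is a minimal sketch that renders one metric in the Prometheus text exposition format using only the Python standard library. The render_metric helper, the help text, and the value 12 are assumptions made for illustration; a real service would normally use an official client library, and this sketch does not escape special characters in label values.

```python
from typing import Dict, List, Tuple


def render_metric(
    name: str,
    metric_type: str,
    help_text: str,
    series: List[Tuple[Dict[str, str], int]],
) -> str:
    """Render one metric family as text-exposition lines (hypothetical helper)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in series:
        # Sort labels so the output is stable between calls.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_failure_request",
    "counter",
    "Failed HTTP requests.",
    [({"service": "productpage"}, 12)],
)
print(body, end="")
```

Each sample line carries only the raw current value; rates, sums, and other calculations are left to queries at read time.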
An alert rule combines a PromQL expression with labels and annotations. The syntax for such a rule is as follows:
alert: Lots_Of_product_page_Jobs_In_Queue
expr: sum(jobs_in_queue{service="productpage"}) > 100
for: 15m
labels:
  severity: minor
annotations:
  summary: Product page queue appears to be building up (consistently more than 100 jobs waiting)
  dashboard: https://grafana.monitoring.intra/dashboard/db/productpage-overview
  impact: Product page is experiencing delays, causing orders to be marked as pending
  runbook: https://wiki-internal/runbooks/productpage-queues.html
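The for: 15m clause means the expression must stay true for 15 minutes before the alert fires; until then the alert is only pending, and a single evaluation below the threshold resets the timer. The following is a simplified model of that behaviour, not Prometheus's actual rule evaluator. The alert_states function, the 5-minute evaluation interval, and the queue sizes are assumptions chosen for illustration.

```python
from typing import List, Optional


def alert_states(
    values: List[float],
    threshold: float,
    for_seconds: int,
    interval_seconds: int,
) -> List[str]:
    """Return the alert state after each evaluation of `value > threshold`."""
    states = []
    active_since: Optional[int] = None  # time the condition first became true
    for i, value in enumerate(values):
        now = i * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held_long_enough = now - active_since >= for_seconds
            states.append("firing" if held_long_enough else "pending")
        else:
            # Condition is false: the alert resets and the timer starts over.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical queue sizes sampled every 5 minutes, with `for: 15m`.
queue_sizes = [90, 120, 130, 95, 110, 140, 150, 160]
states = alert_states(queue_sizes, threshold=100, for_seconds=900, interval_seconds=300)
print(states)
```

In this run the first breach is interrupted by the dip to 95, so the alert only fires at the last evaluation, once the queue has stayed above 100 for a full 15 minutes. This is why a for clause is useful for filtering out short spikes.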