You’ve received an alert in your Slack channel that the CPU rate for the
warehouse service is in violation.
Slack alerts include a link to the monitor that sent the alert. Clicking the link takes you to the Alert Configuration page for the monitor.
A chart displays the current metric performance. The shaded area at the top of the chart shows you the threshold that’s set for the alert. In this case, at around 10:16, one of the lines went above 75%, causing the alert to fire.
But what exactly does that line represent? Let’s take a closer look.
Alerts are based on queries to your metric data. When you expand the Query
section, you can see the metric the query runs on — in this case, a CPU utilization metric.
Instead of reporting CPU usage for every service, you can filter the query to report on a subset of the metric. In this case, it's reporting CPU utilization only for the
warehouse service. Lightstep Observability builds the available filters from the attributes in your metric data, so you can include or exclude data based on any attribute of that metric.
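As a rough sketch, filtering on an attribute works like this (the data shapes and attribute names below are hypothetical, for illustration only — not Lightstep's actual schema):

```python
# Hypothetical metric points: each carries a value plus attributes.
points = [
    {"value": 81.0, "service": "warehouse", "host": "host-1"},
    {"value": 42.0, "service": "checkout",  "host": "host-2"},
    {"value": 63.5, "service": "warehouse", "host": "host-3"},
]

# Include only points whose "service" attribute equals "warehouse".
warehouse_only = [p for p in points if p["service"] == "warehouse"]

# Excluding works the same way, with a negated predicate.
not_warehouse = [p for p in points if p["service"] != "warehouse"]
```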
You can also configure the chart to report separate values based on attributes in the metric. In this example, the chart shows separate values for each host, service, and version. Like filters, you can group by any attribute on the metric.
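Conceptually, grouping by attributes splits the data into one series per unique attribute combination — each combination becomes its own line on the chart. A minimal sketch, again with hypothetical data:

```python
from collections import defaultdict

# Hypothetical points with host/service/version attributes (illustrative names).
points = [
    {"value": 70.0, "host": "h1", "service": "warehouse", "version": "v1"},
    {"value": 80.0, "host": "h1", "service": "warehouse", "version": "v1"},
    {"value": 55.0, "host": "h2", "service": "warehouse", "version": "v2"},
]

# Group by (host, service, version): each unique key is a separate series.
series = defaultdict(list)
for p in points:
    key = (p["host"], p["service"], p["version"])
    series[key].append(p["value"])
```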
So now we know that we’re looking at the CPU rate only for the
warehouse service, and that each line in the chart represents a distinct host/service/version combination.
The Aggregate selection in the query determines the value to display for each point in time. In this example, aggregation is set to
latest, so each line shows the most recently reported CPU rate at each point in time.
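The latest aggregation described above can be sketched as follows — for any point in time, take the most recent value reported at or before that time (the sample data is hypothetical):

```python
# Hypothetical (timestamp, value) samples for one host, sorted by timestamp.
samples = [(0, 40.0), (10, 55.0), (25, 78.0)]

def latest_at(samples, t):
    """Return the most recent value reported at or before time t, or None.

    Assumes samples are sorted by timestamp ascending.
    """
    eligible = [v for (ts, v) in samples if ts <= t]
    return eligible[-1] if eligible else None
```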
Now that we understand what the chart is displaying, let’s find out why it fired.
The alert configuration determines the threshold at which the alert fires and how long the condition must hold. In this example, it’s set to send a single alert when any host’s CPU usage goes above
75% during a 30-second evaluation window. The shaded area on the chart shows where the threshold is set.
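In pseudocode terms, the check amounts to: did any host's value exceed the threshold at least once during the evaluation window? A sketch with hypothetical per-host samples:

```python
THRESHOLD = 75.0  # percent, from the alert configuration
WINDOW_SECONDS = 30  # evaluation window from the alert configuration

# Hypothetical per-host samples collected inside one evaluation window.
window_samples = {
    "host-1": [62.0, 68.0, 71.0],
    "host-2": [70.0, 76.5, 74.0],  # breaches the threshold once
}

# The alert fires if any host goes above the threshold at least once
# during the window.
should_fire = any(
    v > THRESHOLD for values in window_samples.values() for v in values
)
```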
You can also configure a monitor to send multiple alerts, set separate warning and critical thresholds, and send an alert when the metric stops reporting data.
Alert notification destination configuration
Destinations determine where to send notifications when the alert fires and when it resolves. Lightstep Observability supports PagerDuty, Slack, and other third-party apps that accept webhooks. In this case, a notification is sent to a Slack channel every 10 minutes until the alert is resolved — that is, until the metric query falls back below the threshold.
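For a webhook destination, the notification arrives as a JSON POST body. The payload below is purely illustrative — the field names are hypothetical and do not reflect Lightstep's actual notification schema:

```python
import json

# A hypothetical payload for a fired alert; all field names are
# illustrative, not Lightstep's real webhook format.
payload = {
    "monitor": "warehouse CPU rate",
    "status": "fired",          # a later update would use "resolved"
    "threshold": 75.0,
    "observed_value": 76.5,
    "destination": "#alerts",   # e.g. a Slack channel
}

# A webhook receiver would get this serialized as the POST body.
body = json.dumps(payload)
```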
So now we know, based on the alert query and configuration, that the CPU usage for one of the hosts of the
warehouse service went above
75% at least once within the 30-second evaluation window.
Now that we know the issue, we need to find out why.
What did we learn?
- Charts in alerts allow you to see current and past performance of a metric, including when the alert was triggered.
- Alerts are based on queries to your metric data. You can fine-tune the query by adding filters, using group-by to alert on specific values of a metric attribute, and aggregating the data in a way that makes sense.
- You can set multiple thresholds for an alert and configure the frequency.