You’ve received an alert in your Slack channel that the CPU rate for the
warehouse service is in violation.
Slack alerts include a link to the monitor that sent the alert. Clicking the link takes you to the Alert Configuration page for the monitor.
The alert status is OK because the metric has since dropped back below the threshold.
A chart displays the current metric performance. The shaded area at the top of the chart shows you the threshold that’s set for the alert. In this case, at around 4:15, one of the lines went above 0.01, causing the alert to fire.
But what exactly does that line represent? Let’s take a closer look.
Alerts are based on queries to your metric data. When you expand the Query section, you can see that the query is on the cpu.utilization metric and the rate time series operator is applied.
More about time series operators
Time series operators determine how a value is reported over time. The count operator reports the number of events that occurred over a period of time. The rate operator (used in this example) reports how many events occurred per unit of time. Think of a car that travels 50 miles in one hour: 50 is the count (the number of miles), and 50 mph is the rate (the number of miles per unit of time). In Lightstep, the time period used for a rate is one second.
Learn how Lightstep displays metrics.
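To make the difference between count and rate concrete, here is a minimal Python sketch. It is not Lightstep code: the function names and the sample event timestamps are hypothetical, and it only illustrates the arithmetic of a per-second rate.

```python
from typing import List


def count_in_window(events: List[float], start: float, end: float) -> int:
    """Count events whose timestamps (in seconds) fall in [start, end)."""
    return sum(1 for t in events if start <= t < end)


def rate_per_second(events: List[float], start: float, end: float) -> float:
    """Divide the count by the window length to get events per second."""
    return count_in_window(events, start, end) / (end - start)


# Hypothetical data: one event every half second for one minute.
events = [i * 0.5 for i in range(120)]

count = count_in_window(events, 0.0, 60.0)  # total events in the minute
rate = rate_per_second(events, 0.0, 60.0)   # events per second
print(count, rate)  # prints: 120 2.0
```

The same 60 seconds of data produce a count of 120 and a rate of 2 per second, just as one trip can be described as 50 miles or as 50 miles per hour.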
Instead of reporting CPU usage for every service, you can filter the query to report on a subset of a metric. In this case, it’s reporting CPU utilization only for the warehouse service. Lightstep uses the tags in your metric data to create the filters, so you can filter on any tag in that metric by specifically including or excluding data from those tags.
You can also configure the chart to report separate values based on tags in the metric. In this example, the chart shows separate values for each host. Like filters, you can group by any tag on the metric.
So now we know that we’re looking at the CPU rate only for the warehouse service, and that each line in the chart represents a different host.
The aggregation selection in the query determines the value to display for each point in time. In this example, aggregation is set as the mean (average), so each line is showing the average CPU rate for a host.
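The filter, group-by, and aggregation steps can be sketched in a few lines of Python. This is an illustration under assumptions, not how Lightstep evaluates queries: the data points, the host names, and the filter_group_mean helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical metric points: (tags, value). Real metric data also has timestamps.
points = [
    ({"service": "warehouse", "host": "host-a"}, 0.004),
    ({"service": "warehouse", "host": "host-a"}, 0.006),
    ({"service": "warehouse", "host": "host-b"}, 0.012),
    ({"service": "warehouse", "host": "host-b"}, 0.008),
    ({"service": "inventory", "host": "host-c"}, 0.050),
]


def filter_group_mean(points, filter_tag, filter_value, group_tag):
    """Keep points matching the filter, group them by a tag, and average each group."""
    groups = defaultdict(list)
    for tags, value in points:
        # Filter: include only points whose tag matches the requested value.
        if tags.get(filter_tag) != filter_value:
            continue
        # Group by: skip points that lack the grouping tag.
        if group_tag in tags:
            groups[tags[group_tag]].append(value)
    # Aggregate: one mean value per group (one line per host in the chart).
    return {group: mean(values) for group, values in groups.items()}


per_host = filter_group_mean(points, "service", "warehouse", "host")
for host, value in sorted(per_host.items()):
    print(host, value)
```

The inventory point is dropped by the filter, and the remaining warehouse points collapse into one averaged value per host.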
Lightstep supports a number of different aggregation methods.
Now that we understand what the chart is displaying, let’s find out why it fired.
Alert and Destination Configuration
The alert configuration determines the threshold for when the alert should fire and for how long. In this example, it’s set to send a single alert when any host’s CPU usage goes above 0.01 at least once during a five-minute evaluation window. The chart displays a shaded area to show where the threshold is set.
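The rule "any host above 0.01 at least once during a five-minute window" can be expressed as a short check. The sketch below is only a model of that rule, assuming one series of (timestamp, value) pairs per host. The violating_hosts function and the sample data are hypothetical and do not reflect Lightstep’s internal evaluation.

```python
from typing import Dict, List, Tuple

Series = List[Tuple[float, float]]  # (timestamp in seconds, value)


def violating_hosts(
    series_by_host: Dict[str, Series],
    now: float,
    threshold: float = 0.01,
    window_seconds: float = 300.0,
) -> List[str]:
    """Return hosts with at least one point above the threshold in the window."""
    window_start = now - window_seconds
    return sorted(
        host
        for host, series in series_by_host.items()
        if any(window_start <= ts <= now and value > threshold for ts, value in series)
    )


# Hypothetical per-host averages, one point per minute.
series_by_host = {
    "host-a": [(0, 0.004), (60, 0.005), (120, 0.006)],
    "host-b": [(0, 0.006), (60, 0.012), (120, 0.007)],
}

hosts = violating_hosts(series_by_host, now=300.0)
alert_fires = bool(hosts)  # the alert fires if any host violates
print(hosts, alert_fires)  # prints: ['host-b'] True
```

Evaluating the same data at now=400.0 returns an empty list, because the single spike on host-b has moved out of the five-minute window.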
You can configure alerts to send multiple alerts and you can change the evaluation window. You can also set separate levels of alerts (warning and critical), and send an alert when no data is reporting for the metric.
Destinations determine where to send the alert when it’s fired and resolved. Lightstep supports PagerDuty, Slack, and other third-party apps that support webhooks. In this case, the alert is sent to a Slack channel every 10 minutes until the alert is resolved or the metric query falls below the threshold.
So now we know, based on the alert query and configuration, that the average CPU usage for one of the hosts of the warehouse service went above 0.01 at least once within five minutes.
Now that we know the issue, we need to find out why.
What Did We Learn?
- Charts in alerts allow you to see current and past performance of a metric, including when the alert was triggered.
- Alerts are based on queries to your metric data. You can fine-tune the query by adding filters, using groups to alert on specific values of a metric tag, and aggregating the data in a way that makes sense.
- You can set multiple thresholds for an alert and configure the frequency.