You’ve received an alert in your Slack channel that the CPU rate for the warehouse service is in violation.

Figure: Alert in Slack for a metric violation

Slack alerts include a link to the monitor that sent the alert. Clicking the link takes you to the Alert Configuration page for the monitor.

Figure: Monitor configuration

The alert status is now OK because the metric has since dropped back below the threshold.

Alert Analysis

A chart displays the current metric performance. The shaded area at the top of the chart shows the threshold that’s set for the alert. In this case, at around 4:15, one of the lines went above 0.01, causing the alert to fire.

Figure: Threshold crossed

But what exactly does that line represent? Let’s take a closer look.

Alert Query

Alerts are based on queries to your metric data. When you expand the Query section, you can see that the query is on the cpu.utilization metric and the rate time series operator is applied.

More about time series operators

Time series operators determine how a value is reported over time. The count operator reports the number of events that occurred during a period of time. The rate operator (used in this example) reports how many events occurred per unit of time. Think of a car trip: saying the car went 50 miles in an hour gives you a count (the number of miles), while saying it went 50 miles per hour gives you a rate (the number of miles per unit of time). In Lightstep, the unit of time for a rate is one second.
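The difference between the two operators can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lightstep computes its values; the function names and the sample event timestamps are hypothetical.

```python
def count_in_window(timestamps, start, end):
    """Count: how many events fall in the half-open window [start, end)."""
    return sum(1 for t in timestamps if start <= t < end)


def rate_per_second(timestamps, start, end):
    """Rate: the same count divided by the window length in seconds."""
    return count_in_window(timestamps, start, end) / (end - start)


# Hypothetical data: one event every half second for a minute (120 events).
events = [i * 0.5 for i in range(120)]

print(count_in_window(events, 0, 60))  # 120
print(rate_per_second(events, 0, 60))  # 2.0
```

The count grows with the length of the window, while the per-second rate stays comparable across windows of different lengths.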

Learn how Lightstep displays metrics.


Instead of reporting CPU usage for every service, you can filter the query to report on a subset of a metric. In this case, it’s reporting CPU utilization only for the warehouse service. Lightstep uses the tags in your metric data to create the filters, so you can filter on any tag in that metric by specifically including or excluding data from those tags.

You can also configure the chart to report separate values based on tags in the metric. In this example, the chart shows separate values for each host. As with filters, you can group by any tag on the metric.
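As a rough mental model of filtering and grouping, the Python sketch below keeps only the data points tagged with the warehouse service and then splits them by host. The data layout, tag names (service, host), host names, and helper functions are assumptions made for illustration; they are not Lightstep's data model or API.

```python
from collections import defaultdict

# Hypothetical metric points, each carrying a value and a set of tags.
points = [
    {"ts": 0, "value": 0.004, "tags": {"service": "warehouse", "host": "host-a"}},
    {"ts": 0, "value": 0.012, "tags": {"service": "warehouse", "host": "host-b"}},
    {"ts": 0, "value": 0.030, "tags": {"service": "checkout", "host": "host-c"}},
    {"ts": 60, "value": 0.006, "tags": {"service": "warehouse", "host": "host-a"}},
]


def filter_points(points, include=None, exclude=None):
    """Keep points whose tags match every include pair and no exclude pair."""
    include = include or {}
    exclude = exclude or {}
    kept = []
    for p in points:
        tags = p["tags"]
        if any(tags.get(k) != v for k, v in include.items()):
            continue
        if any(tags.get(k) == v for k, v in exclude.items()):
            continue
        kept.append(p)
    return kept


def group_by(points, tag):
    """Split points into one list per distinct value of the given tag."""
    groups = defaultdict(list)
    for p in points:
        groups[p["tags"].get(tag)].append(p)
    return dict(groups)


warehouse = filter_points(points, include={"service": "warehouse"})
by_host = group_by(warehouse, "host")
print(sorted(by_host))  # ['host-a', 'host-b']
```

Each key in by_host corresponds to one line on the chart.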

So now we know that we’re looking at the CPU rate only for the warehouse service, and that each line in the chart represents a different host.

The aggregation selection in the query determines the value to display for each point in time. In this example, aggregation is set as the mean (average), so each line is showing the average CPU rate for a host.

Lightstep supports a number of different aggregation methods, such as count, min, max, and sum.
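To see how the aggregation choice changes the value plotted for a single point in time, here is a small Python sketch that applies each of the methods named above to the same samples. The sample values and the aggregate helper are hypothetical, and this is not Lightstep's implementation.

```python
from statistics import mean

# Map each aggregation name to a function that reduces a list of samples.
AGGREGATIONS = {
    "count": len,
    "min": min,
    "max": max,
    "sum": sum,
    "mean": mean,
}


def aggregate(values, method):
    """Reduce the samples for one point in time to a single value."""
    return AGGREGATIONS[method](values)


# Hypothetical samples reported for one host in one time bucket.
samples = [0.25, 0.5, 0.75]

for method in ("count", "min", "max", "sum", "mean"):
    print(method, aggregate(samples, method))
# count 3
# min 0.25
# max 0.75
# sum 1.5
# mean 0.5
```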

Figure: Alert query

Now that we understand what the chart is displaying, let’s find out why the alert fired.

Alert and Destination Configuration

The alert configuration determines the threshold for when the alert should fire and for how long. In this example, it’s set to send a single alert when any host’s CPU usage goes above 0.01 at least once during a five-minute evaluation window. The chart displays a shaded area to show where the threshold is set.
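The rule described here (any host above 0.01 at least once in a five-minute window) can be sketched in Python as follows. The threshold and window length come from this example; the function name, host names, and sample values are hypothetical, and the sketch does not reflect how Lightstep evaluates alerts internally.

```python
THRESHOLD = 0.01          # threshold from this example
WINDOW_SECONDS = 5 * 60   # five-minute evaluation window


def violating_hosts(series_by_host, now, threshold=THRESHOLD, window=WINDOW_SECONDS):
    """Return hosts with at least one value above the threshold in (now - window, now]."""
    hosts = []
    for host, samples in series_by_host.items():
        if any(now - window < ts <= now and value > threshold for ts, value in samples):
            hosts.append(host)
    return sorted(hosts)


# Hypothetical per-host series of (timestamp in seconds, value) pairs.
series = {
    "host-a": [(0, 0.004), (60, 0.006), (120, 0.005)],
    "host-b": [(0, 0.008), (60, 0.012), (120, 0.007)],
}

print(violating_hosts(series, now=120))  # ['host-b'], so the alert fires
print(violating_hosts(series, now=400))  # [], the spike has left the window
```

The second call shows why the status can return to OK: once the spike falls outside the evaluation window, no host is in violation.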

You can configure the monitor to send multiple alerts instead of a single one, and you can change the evaluation window. You can also set separate alert levels (warning and critical) and send an alert when no data is reporting for the metric.

Figure: Alert configuration

Destinations determine where to send the alert when it’s fired and resolved. Lightstep supports PagerDuty, Slack, and other third-party apps that support webhooks. In this case, the alert is sent to a Slack channel every 10 minutes until the alert is resolved or the metric query falls below the threshold.

Figure: Alert destination
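As a rough illustration of the repeat behavior, the Python sketch below lists when notifications would go out for an alert that fires and later resolves. The 10-minute interval comes from this example; the function name and the timestamps are hypothetical, and the exact timing Lightstep uses may differ.

```python
def notification_schedule(fired_at, resolved_at, repeat_seconds=600):
    """List (time, status) pairs: one at firing, one per repeat interval
    while the alert is unresolved, and a final one when it resolves."""
    schedule = []
    t = fired_at
    while t < resolved_at:
        schedule.append((t, "firing"))
        t += repeat_seconds
    schedule.append((resolved_at, "resolved"))
    return schedule


# Hypothetical alert that fires at t=0 and resolves 25 minutes later.
print(notification_schedule(0, 1500))
# [(0, 'firing'), (600, 'firing'), (1200, 'firing'), (1500, 'resolved')]
```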

So now we know, based on the alert query and configuration, that the average CPU usage for one of the hosts of the warehouse service went above 0.01 at least once within a five-minute window.

Now that we know what the issue is, we need to find out why it happened.

What Did We Learn?

  • Charts in alerts allow you to see current and past performance of a metric, including when the alert was triggered.
  • Alerts are based on queries to your metric data. You can fine-tune the query by adding filters, using groups to alert on specific values of a metric tag, and aggregating the data in a way that makes sense.
  • You can set multiple thresholds for an alert and configure how often notifications are sent.