You’ve received an alert in your Slack channel that the CPU rate for the warehouse service is in violation.

Slack alerts include a link to the monitor that sent the alert. Clicking the link takes you to the Alert Configuration page for the monitor.

Alert analysis

A chart displays the current metric performance. The shaded area at the top of the chart shows you the threshold that’s set for the alert. In this case, at around 10:16, one of the lines went above 75%, causing the alert to fire.

But what exactly does that line represent? Let’s take a closer look.

Alert query

Alerts are based on queries to your metric data. When you expand the Query section, you can see that the query is on the cpu.utilization metric.

Instead of reporting CPU usage for every service, you can filter the query to report on a subset of a metric. In this case, it’s reporting CPU utilization only for the warehouse service. Lightstep Observability uses the attributes in your metric data to create the filters, so you can filter on any attribute in that metric by specifically including or excluding data from those attributes.

You can also configure the chart to report separate values based on attributes in the metric. In this example, the chart shows separate values for each host, service, and version. As with filters, you can group by any attribute on the metric.

So now we know that we’re looking at the CPU rate only for the warehouse service, and that each line in the chart represents a different host/service/version.

The Aggregate selection in the query determines the value to display for each point in time. In this example, aggregation is set as latest, so each line is showing the latest CPU rate at that point in time.
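
The filter, group-by, and aggregate steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lightstep Observability evaluates queries: the sample points, the attribute values (host-a, host-b, v1), and the latest_by_group helper are all hypothetical.

```python
# Hypothetical sample points: (timestamp_seconds, cpu_percent, attributes).
points = [
    (0, 40.0, {"service": "warehouse", "host": "host-a", "version": "v1"}),
    (10, 55.0, {"service": "warehouse", "host": "host-a", "version": "v1"}),
    (5, 62.0, {"service": "warehouse", "host": "host-b", "version": "v1"}),
    (12, 78.0, {"service": "warehouse", "host": "host-b", "version": "v1"}),
    (8, 90.0, {"service": "inventory", "host": "host-c", "version": "v2"}),
]


def latest_by_group(points, include, group_by):
    """Filter by attribute values, group, and keep the latest value per group."""
    latest = {}
    for ts, value, attrs in points:
        # Filter: skip points whose attributes do not match every included value.
        if any(attrs.get(key) != wanted for key, wanted in include.items()):
            continue
        # Group: one series per unique combination of the group-by attributes.
        group = tuple(attrs.get(name) for name in group_by)
        # Aggregate (latest): keep the value with the most recent timestamp.
        if group not in latest or ts >= latest[group][0]:
            latest[group] = (ts, value)
    return {group: value for group, (_, value) in latest.items()}


series = latest_by_group(
    points,
    include={"service": "warehouse"},
    group_by=("host", "service", "version"),
)
```

Here series holds one value per host/service/version combination for the warehouse service only, which mirrors one line per group on the chart.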

Now that we understand what the chart is displaying, let’s find out why it fired.

Alert configuration

The alert configuration determines the threshold at which the alert fires and the evaluation window over which the metric is checked. In this example, it’s set to send a single alert when any host’s CPU usage goes above 75% during a 30-second evaluation window. The chart displays a shaded area to show where the threshold is set.
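
As a rough sketch of this rule, the following Python checks whether any host had at least one sample above the threshold inside the evaluation window. The violating_hosts function, the sample data, and the timestamps are hypothetical, and the "at least once within the window" reading is an assumption about how the rule is applied.

```python
def violating_hosts(samples_by_host, now, threshold=75.0, window_seconds=30):
    """Return hosts with at least one sample above the threshold in the window."""
    window_start = now - window_seconds
    return sorted(
        host
        for host, samples in samples_by_host.items()
        if any(
            window_start <= ts <= now and value > threshold
            for ts, value in samples
        )
    )


# Hypothetical (timestamp_seconds, cpu_percent) samples per host.
samples_by_host = {
    "host-a": [(100, 52.0), (110, 58.0), (120, 61.0)],
    "host-b": [(100, 70.0), (110, 77.5), (120, 73.0)],
    "host-c": [(60, 91.0), (120, 40.0)],  # spike is older than the window
}

hosts = violating_hosts(samples_by_host, now=120)
alert_fires = bool(hosts)  # a single alert fires if any host is in violation
```

In this sample data only host-b crosses 75% inside the 30-second window, so the alert fires once even though the other hosts are healthy.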

You can configure an alert to send multiple alerts instead of a single one. You can also set separate alert levels (warning and critical) and send an alert when no data is reporting for the metric.

Alert notification destination configuration

Destinations determine where to send the alert when it’s fired and resolved. Lightstep Observability supports PagerDuty, Slack, and other third-party apps that support webhooks. In this case, the alert is sent to a Slack channel every 10 minutes until the alert is resolved or the metric query falls below the threshold.
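
To make the repeat interval concrete, here is a minimal sketch that lists when notifications would go out for an alert that stays in violation. The notification_times function and the timestamps are hypothetical, and it assumes the first notification is sent at the moment the alert fires.

```python
def notification_times(fired_at, resolved_at, interval_seconds=600):
    """List the send times for repeat notifications until the alert resolves."""
    times = []
    current = fired_at
    # Assumption: notify immediately on firing, then once per interval
    # for as long as the alert is still unresolved.
    while current < resolved_at:
        times.append(current)
        current += interval_seconds
    return times


# Alert fires at t=0 seconds and resolves 25 minutes (1500 seconds) later.
sent_at = notification_times(fired_at=0, resolved_at=1500)
```

With a 10-minute interval, this alert would notify the Slack channel three times before it resolves.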

So now we know, based on the alert query and configuration, that the CPU usage for one of the hosts of the warehouse service went above 75% at least once within 30 seconds.

Now that we know the issue, we need to find out why.

What did we learn?

  • Charts in alerts allow you to see current and past performance of a metric, including when the alert was triggered.
  • Alerts are based on queries to your metric data. You can fine-tune the query by adding filters, using groups to alert on specific values of a metric attribute, and aggregating the data in a way that makes sense.
  • You can set multiple thresholds for an alert and configure the frequency.