The Lightstep Unified Query Language (UQL) allows you to retrieve metrics and spans time series data from the Lightstep database for use in dashboard charts, notebook queries, and alerts. This document is intended to help you write powerful alerting queries using UQL.
Configuring alerts in UQL
Setting up an alert in Lightstep Observability with UQL has two primary components: the query and the alert configuration. Writing the query in UQL in the editor lets you express powerful, precise queries. The alert configuration, below the query editor, is where you set the thresholds that determine when the alert fires.
Basic threshold alerts
Let’s say you want an alert that fires if disk usage for a service is above 85%. The gauge metric disk.percent_used reports disk utilization as a float between 0 and 100. Grouping by service and using the max aggregator ensures the alert fires if any service has a disk usage percentage above 85%. The final reducer (reduce 10m, mean) smooths out the data, reducing the likelihood of a flappy alert: it only alerts if the average over the last 10 minutes is above 85%.
Disk usage alert
metric disk.percent_used
| latest 30s, 30s | // this aligner is required, because it's a gauge metric
group_by [service], max
| reduce 10m, mean
To ensure the alert fires if the disk usage is above 85% over the last 10 minutes, use the UI to send a notification when any value is above 85.
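The reduce 10m, mean smoothing step is a windowed average. As a rough sketch of why it prevents flapping (plain Python, not Lightstep code; the function name and the 30-second sample rate are assumptions for illustration):

```python
# Hypothetical sketch of the `reduce 10m, mean` smoothing step: average the
# most recent 10 minutes of gauge samples before comparing against the 85%
# threshold, so one brief spike cannot fire the alert on its own.

def should_fire(samples_pct, threshold=85.0, window_points=20):
    """samples_pct: newest-last disk.percent_used values at an assumed
    30s resolution; 20 points x 30s = 10 minutes."""
    window = samples_pct[-window_points:]
    return sum(window) / len(window) > threshold
```

A single momentary spike to 99% averages out below the threshold, while a sustained ten minutes above 85% fires.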
Latency alerts
You can also define latency SLOs using UQL. If you want to be alerted when latency for the ingress operation of the api-proxy service is above 1000ms, you can write a latency spans query that filters to that operation and service and excludes errored requests.
spans latency
| delta 1h
| // look at the ingress operation for the api-proxy service and exclude "bad" requests
filter operation == "ingress" && service == "api-proxy" && http.status_class != "4xx" && error != true
| group_by [], sum
| point percentile(value, 99.0) // take the 99th percentile of latency
As with all UQL alerts, you use the UI to set the threshold so the alert sends a notification when any value is above 1000ms.
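The point percentile(value, 99.0) stage picks the 99th-percentile latency at each point in time. As an illustration of the underlying math only, here is a nearest-rank percentile in plain Python (Lightstep's exact interpolation method may differ):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are at or below it. This is a sketch of
    the idea, not Lightstep's implementation."""
    ordered = sorted(values)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = list(range(1, 101))    # 1 ms .. 100 ms
p99 = percentile(latencies_ms, 99.0)  # 99 ms for this input
```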
Error percentage alerts
When you want an alert to fire if the rate of errors for a service is above a certain threshold, you can write an error percentage alert that takes the ratio of spans for a service that have the tag error=true to all spans for that service.
with
  errors = spans count | delta | filter service == "warehouse" && error == true | group_by [], sum;
  total = spans count | delta | filter service == "warehouse" | group_by [], sum;
join errors / total * 100
As with all UQL alerts, you set the threshold using the UI.
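The join clause reduces to simple arithmetic: errored spans divided by total spans, scaled to a percentage. A minimal sketch in plain Python (the function name and the zero-window guard are assumptions for illustration; the UQL query does not need them):

```python
def error_percentage(error_count, total_count):
    """errors / total * 100, as computed by the join clause.
    The zero guard is an illustration-only assumption for windows
    with no spans at all."""
    if total_count == 0:
        return 0.0
    return error_count / total_count * 100.0
```

For example, 5 errored spans out of 200 total is an error percentage of 2.5.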
Seasonality alerts
When you have fairly predictable daily traffic patterns for an endpoint and want to be alerted if that pattern changes, you can write a “seasonality” alert. In this example, the season is short: just one day. The alert fires if the current number of requests, averaged over the last hour, differs by more than 20% from yesterday’s average over the same hour window.
with
  req = metric requests | reduce 1h, mean | group_by [], mean;
  baseline = metric requests | time_shift 1d | reduce 1h, mean | group_by [], mean;
join abs((req - baseline) / baseline) * 100
As with all UQL alerts, you set the threshold in the UI to send a notification if the percentage change is over 20%.
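The join expression abs((req - baseline) / baseline) * 100 is a symmetric percent change: how far the current hourly average has drifted from yesterday's, in either direction. A plain-Python sketch of that arithmetic (function name assumed for illustration):

```python
def pct_change_from_baseline(current, baseline):
    """abs((req - baseline) / baseline) * 100 from the join clause:
    percentage drift from yesterday's value, direction ignored."""
    return abs((current - baseline) / baseline) * 100.0
```

Both a rise from 100 requests to 125 and a drop from 100 to 75 produce 25.0, so a single "above 20" threshold catches traffic moving in either direction.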
Standard deviation alerts
Instead of alerting when the number of requests has changed by some percentage since yesterday, you can alert when the current number of requests is more than 2 standard deviations from the mean over the last day. To calculate this, you need three time series: the current number of requests, the average over the last day, and the standard deviation over the last day.
with
  average = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, mean;
  standard_dev = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, std_dev;
  actual = metric requests | delta | group_by [], sum;
join abs(actual - average) / standard_dev
Because the query takes the absolute value, you don’t need to set both an “above” and a “below” threshold: a single UI threshold of 2 makes the alert fire whenever the value is more than 2 standard deviations from the mean.
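The join here computes what statisticians call a z-score distance. A plain-Python sketch of the same arithmetic over a trailing window (the function name and the use of population standard deviation are assumptions for illustration):

```python
import statistics

def zscore_distance(actual, history):
    """abs(actual - mean) / std_dev, as in the join clause, using the
    trailing window's mean and (population) standard deviation."""
    mean = statistics.fmean(history)
    std_dev = statistics.pstdev(history)
    return abs(actual - mean) / std_dev
```

With a history of [10, 10, 14, 14] (mean 12, standard deviation 2), a current value of 18 is 3 standard deviations out and would cross a threshold of 2.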