The Lightstep Unified Query Language (UQL) allows you to retrieve metrics and spans time series data from the Lightstep database for use in dashboard charts, notebook queries, and alerts. This document is intended to help you write powerful alerting queries using UQL.
For more details on specific operations, see the UQL Reference. We also have a UQL Cheatsheet to help you build queries more generally.
Configuring alerts in UQL
There are two primary components when setting up an alert in Lightstep Observability using UQL: the query and the alert configuration. Expressing the query in UQL in the editor lets you write powerful and precise queries. The alert configuration, below the query editor, is where you set the thresholds that determine when the alert fires.
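To see that split concretely, here is a minimal sketch of the query half, using a hypothetical cpu.utilization gauge metric; the threshold itself (say, "notify when any value is above 90") lives entirely in the alert configuration below the editor, not in the query:
Minimal alert query (sketch)
metric cpu.utilization | // hypothetical gauge metric
latest 30s, 30s | // gauge metrics need an aligner such as latest
group_by [service], mean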
Examples
Basic threshold alerts
Metric value
Let’s say you want an alert that fires if disk usage for a service is above 85%. The gauge metric disk.percent_used reports disk utilization as a float between 0 and 100. By grouping by service and using the max aggregator, you ensure the alert fires if any service has a disk usage percentage above 85%. The final reducer (reduce 10m, mean) smooths out the data, reducing the likelihood of a flappy alert: it only fires if the average over the last 10 minutes is above 85%.
Disk usage alert
metric disk.percent_used |
latest 30s, 30s | // this aligner is required, because it's a gauge metric
group_by [service], max |
reduce 10m, mean
To ensure the alert fires if the disk usage is above 85% over the last 10 minutes, use the UI to send a notification when any value is above 85.
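If you'd rather scope the alert to a single service, you can add a filter stage before the group_by. A sketch, assuming a service named api-proxy:
Disk usage for one service (sketch)
metric disk.percent_used |
latest 30s, 30s |
filter service == "api-proxy" | // hypothetical service name
group_by [service], max |
reduce 10m, mean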
Latency
You can also define latency SLOs using UQL. If you want to be alerted when latency for the ingress operation of the api-proxy service is above 1000ms, you can write a spans latency query that is filtered to that operation and service and excludes errored requests.
Latency SLO
spans latency |
delta 1h |
// look at the ingress operation for the api-proxy service and exclude "bad" requests
filter operation == "ingress" && service == "api-proxy" && http.status_class != "4xx" && error != true |
group_by [], sum |
point percentile(value, 99.0) // take the 99th percentile of latency
As with all UQL alerts, you use the UI to set the threshold, sending a notification when any value is above 1000ms.
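If instead you want a single alert to watch the p99 latency of every operation on the service, one option is to group by operation rather than filter to one, so the threshold is evaluated against each operation's series. A sketch, reusing the api-proxy service from above:
Per-operation latency (sketch)
spans latency |
delta 1h |
filter service == "api-proxy" && error != true |
group_by [operation], sum |
point percentile(value, 99.0)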
Error percentage alerts
When you want an alert to fire if the rate of errors for a service is above a certain threshold, you can write an error percentage alert that takes the ratio of spans for that service tagged error=true to all spans for that service.
Error percentage
with
errors = spans count | delta | filter service == "warehouse" && error == true | group_by [], sum;
total = spans count | delta | filter service == "warehouse" | group_by [], sum;
join errors/total * 100
As with all UQL alerts, you set the threshold using the UI.
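The same pattern generalizes to alerting across several services at once: when both sides of the join are grouped by the same key, the join matches series by their group labels, so each service gets its own error-percentage series. A sketch, dropping the single-service filter:
Error percentage per service (sketch)
with
errors = spans count | delta | filter error == true | group_by [service], sum;
total = spans count | delta | group_by [service], sum;
join errors/total * 100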
Percentage change
When you have fairly predictable daily traffic patterns for an endpoint and want to be alerted if that pattern changes, you can write a “seasonality” alert. In this example, the season is short: just one day. The alert fires if the current number of requests, averaged over the last hour, differs by more than 20% from yesterday’s average over the same one-hour window.
Seasonality alert
with
req = metric requests | reduce 1h, mean | group_by [], mean;
baseline = metric requests | time_shift 1d | reduce 1h, mean | group_by [], mean;
join abs((req-baseline)/baseline) * 100
As with all UQL alerts, you set the threshold in the UI to send a notification if the percentage change is over 20%.
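If your traffic is predictable week over week rather than day over day (say, weekends look different from weekdays), the same query works with a longer season; a sketch with a seven-day shift:
Weekly seasonality (sketch)
with
req = metric requests | reduce 1h, mean | group_by [], mean;
baseline = metric requests | time_shift 7d | reduce 1h, mean | group_by [], mean;
join abs((req-baseline)/baseline) * 100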
Standard deviation
Instead of alerting when the number of requests has changed by some percentage since yesterday, you can alert when the current number of requests is more than 2 standard deviations from the mean over the last day. To calculate this, you need three time series: the current number of requests, the average over the last day, and the standard deviation over the last day.
Standard deviation
with
average = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, mean;
standard_dev = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, std_dev;
actual = metric requests | delta | group_by [], sum;
join abs(actual - average)/standard_dev
Because you’re taking the absolute value in the query, you don’t need to set both an “above” and a “below” threshold: set a single “above” threshold of 2 in the UI, and the alert fires whenever the value is more than 2 standard deviations from the mean.
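If you only care about one direction, say a drop in traffic, you can omit abs and set a single directional threshold instead; in this sketch, a value below -2 means requests are more than 2 standard deviations below the daily mean:
Directional standard deviation (sketch)
with
average = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, mean;
standard_dev = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, std_dev;
actual = metric requests | delta | group_by [], sum;
join (actual - average)/standard_dev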