The Lightstep Unified Query Language (UQL) allows you to retrieve metrics and spans time series data from the Lightstep database for use in dashboard charts, notebook queries, and alerts. This document is intended to help you write powerful alerting queries using UQL.

For more details on specific operations, see the UQL Reference. We also have a UQL Cheatsheet to help you build queries more generally.

Configuring alerts in UQL

Setting up an alert in Lightstep Observability using UQL involves two primary components: the query and the alert configuration. Expressing the query in the UQL editor lets you write powerful and precise queries. The alert configuration, below the query editor, is where you set the thresholds that determine when the alert fires.

UQL editor on alerts page

Examples

Basic threshold alerts

Metric value

Let’s say you want an alert that fires if disk usage for a service is above 85%. The gauge metric disk.percent_used reports disk utilization as a float between 0 and 100. By grouping by service and using the max aggregator, you ensure that the alert fires if any service has a disk usage percentage above 85%. The final reducer (reduce 10m, mean) smooths the data further, which makes a flappy alert less likely: the alert fires only if the average over the last 10 minutes is above 85%.

Disk usage alert

metric disk.percent_used | 
latest 30s, 30s | // this aligner is required, because it's a gauge metric
group_by [service], max | 
reduce 10m, mean

To ensure the alert fires if the disk usage is above 85% over the last 10 minutes, use the UI to send a notification when any value is above 85.
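
The effect of the final reducer can be illustrated outside of UQL. The following Python sketch is only an approximation of the pipeline under stated assumptions: the sample tuples, the service and host names, and the function name services_over_threshold are all hypothetical, and the real query is evaluated by Lightstep rather than by code like this.

```python
from collections import defaultdict
from statistics import mean

def services_over_threshold(samples, threshold=85.0):
    """Approximate the disk usage query on (timestamp, service, host, value) samples."""
    # Step 1 (group_by [service], max): keep the highest reading per service at each timestamp.
    per_service = defaultdict(dict)
    for ts, service, _host, value in samples:
        current = per_service[service].get(ts)
        per_service[service][ts] = value if current is None else max(current, value)
    # Step 2 (reduce 10m, mean): average each service's series over the window.
    averages = {svc: mean(points.values()) for svc, points in per_service.items()}
    # Step 3 (alert threshold set in the UI): keep services whose average is above it.
    return {svc for svc, avg in averages.items() if avg > threshold}

# Made-up readings: "checkout" spikes briefly, "search" stays high.
samples = [
    (0, "checkout", "host-a", 80.0),
    (0, "checkout", "host-b", 84.0),
    (30, "checkout", "host-a", 95.0),  # brief spike above 85
    (30, "checkout", "host-b", 83.0),
    (60, "checkout", "host-a", 70.0),
    (0, "search", "host-c", 90.0),
    (30, "search", "host-c", 92.0),
    (60, "search", "host-c", 91.0),
]

print(services_over_threshold(samples))
```

In this made-up data, the checkout service spikes to 95 at one timestamp, but its average over the window is 83, so only the search service (average 91) would trigger the alert.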

Latency

You can also define latency SLOs using UQL. If you want to be alerted when the 99th percentile latency of the ingress operation for the api-proxy service is above 1000ms, you can write a latency spans query that is filtered to that operation and service and that excludes errored requests.

Latency SLO

spans latency |
delta 1h | 
// look at the ingress operation for the api-proxy service and exclude "bad" requests
filter operation == "ingress" && service == "api-proxy" && http.status_class != "4xx" && error != true | 
group_by [], sum | 
point percentile(value, 99.0) // take the 99th percentile of latency

As with all UQL alerts, you use the UI to set the threshold so that the alert sends a notification when any value is above 1000ms.
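
To make the role of the filter concrete, here is a small Python sketch of the same idea on made-up span records. It is an illustration only: the record layout, the helper names (make_span, keep, p99_ms), and the nearest-rank percentile method are assumptions, and Lightstep may compute percentiles from its latency distributions differently.

```python
import math

def p99_ms(latencies_ms):
    """Nearest-rank 99th percentile; an assumed method, chosen for simplicity."""
    if not latencies_ms:
        raise ValueError("no latencies to summarize")
    ordered = sorted(latencies_ms)
    rank = math.ceil(99.0 * len(ordered) / 100)  # 1-based rank of the p99 value
    return ordered[rank - 1]

def keep(span):
    """Mirror the query's filter: right operation and service, no 4xx, no errors."""
    return (
        span["operation"] == "ingress"
        and span["service"] == "api-proxy"
        and span["http.status_class"] != "4xx"
        and not span["error"]
    )

def make_span(latency_ms, status_class="2xx", error=False):
    # Hypothetical span record; keys follow the attribute names used in the query.
    return {
        "operation": "ingress",
        "service": "api-proxy",
        "http.status_class": status_class,
        "error": error,
        "latency_ms": latency_ms,
    }

spans = (
    [make_span(120) for _ in range(980)]
    + [make_span(900) for _ in range(15)]
    + [make_span(1500) for _ in range(5)]
    + [make_span(5000, status_class="4xx") for _ in range(10)]  # excluded by the filter
)

filtered_p99 = p99_ms([s["latency_ms"] for s in spans if keep(s)])
unfiltered_p99 = p99_ms([s["latency_ms"] for s in spans])
print(filtered_p99 > 1000, unfiltered_p99 > 1000)
```

With these numbers the filtered 99th percentile is 900 ms, so the alert stays quiet, while including the slow 4xx requests would push it to 1500 ms and trip the 1000 ms threshold.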

Error percentage alerts

When you want an alert to fire if the rate of errors for a service is above a certain threshold, you can write an error percentage alert that takes the ratio of the service's spans that have the tag error=true to all spans for that service.

Error percentage

with
	errors = spans count | delta | filter service == "warehouse" && error == true | group_by [], sum;
	total = spans count | delta | filter service == "warehouse" | group_by [], sum;
join errors/total * 100

As with all UQL alerts, you set the threshold using the UI.
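
The arithmetic in the join can be checked by hand. The Python sketch below reproduces it for a single point in time; the function name error_percentage and the counts are hypothetical, and returning None when there is no traffic is a choice made for this sketch, not a statement about how Lightstep handles empty series.

```python
def error_percentage(error_count, total_count):
    """Same arithmetic as the join: errors / total * 100."""
    if total_count == 0:
        return None  # no spans in the window, so the ratio is undefined
    return error_count / total_count * 100

# Made-up counts for one evaluation window: 5 errored spans out of 80.
value = error_percentage(5, 80)
print(value)             # 6.25
print(value > 5.0)       # would a hypothetical 5% threshold be exceeded?
```

Because the result is already scaled by 100, the threshold entered in the UI would be a percentage such as 5 rather than a fraction such as 0.05.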

Percentage change

When you have fairly predictable daily traffic patterns for an endpoint and want to be alerted if that pattern changes, you can write a “seasonality” alert. In this example, the season is short: just one day. The alert fires if the current number of requests, averaged over the last hour, differs by more than 20% from yesterday’s average over the same one-hour window.

Seasonality alert

with 
 req = metric requests | reduce 1h, mean | group_by [], mean;
 baseline = metric requests | time_shift 1d | reduce 1h, mean | group_by [], mean;
join abs((req-baseline)/baseline) * 100

As with all UQL alerts, you set the threshold in the UI to send an alert if the percentage change is over 20%.
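
As a quick check of the join expression, the following Python sketch applies the same formula to made-up hourly averages. The function name percent_change and the numbers are hypothetical, and the zero-baseline guard is an assumption of this sketch.

```python
def percent_change(current, baseline):
    """Same arithmetic as the join: abs((req - baseline) / baseline) * 100."""
    if baseline == 0:
        return None  # no traffic yesterday, so the change is undefined
    return abs((current - baseline) / baseline) * 100

# Yesterday's hourly average was 1000 requests.
drop = percent_change(750, 1000)   # traffic fell by a quarter
rise = percent_change(1250, 1000)  # traffic rose by a quarter
print(drop, rise)
```

Both a drop to 750 and a rise to 1250 give 25, which is over the 20% threshold, so a single "above" threshold covers changes in either direction.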

Standard deviation

Instead of alerting when the number of requests has changed by some percentage since yesterday, you can alert when the current number of requests is more than 2 standard deviations from the mean over the last day. To calculate this, you need three time series: the current number of requests, the average over the last day, and the standard deviation over the last day.

Standard deviation

with
	average = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, mean;
	standard_dev = metric requests | delta 30s, 30s | group_by [], sum | reduce 1d, std_dev;
	actual = metric requests | delta | group_by [], sum;
join abs(actual - average)/standard_dev

Because you’re taking the absolute value in the query, you don’t need to set both an “above” and a “below” threshold. With a single “above” threshold of 2, the alert fires if the value is more than 2 standard deviations from the mean in either direction.
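
The value this query produces is the number of standard deviations between the current request count and the daily mean. A minimal Python sketch of that calculation follows; the function name deviations_from_mean and the history values are hypothetical, and using the population standard deviation (pstdev) is an assumption, since this document does not say which variant std_dev computes.

```python
from statistics import mean, pstdev

def deviations_from_mean(actual, history):
    """Same arithmetic as the join: abs(actual - average) / standard_dev."""
    average = mean(history)
    standard_dev = pstdev(history)  # assumption: population standard deviation
    if standard_dev == 0:
        return None  # a perfectly flat history has no meaningful deviation scale
    return abs(actual - average) / standard_dev

# Made-up request counts standing in for the last day of data.
history = [90, 110, 90, 110]  # mean 100, standard deviation 10

print(deviations_from_mean(125, history))  # 2.5
print(deviations_from_mean(75, history))   # 2.5
print(deviations_from_mean(115, history))  # 1.5
```

Counts of 125 and 75 both score 2.5 and would exceed a threshold of 2, while 115 scores 1.5 and would not.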