Often, incident response begins with an alert sent to the on-call team. You create alerts in Lightstep that activate when a set threshold on a Stream is crossed. Thresholds can be set for error percentage, latency, or operations per second.
More about Streams
When you create a Stream based on a query or on a specific operation, Lightstep receives data from your Satellites and stores statistics and example traces related to the Stream to ensure you always have data from 0 to p99.9, including outliers. The Stream view displays statistical time series data and example traces and stores them for as long as your Data Retention policy allows.
When an alert is triggered, a message is sent to the configured destination, like a Slack channel or PagerDuty. The message includes a link into the Stream that triggered the alert and links to example traces.
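To make the shape of such a message concrete, here is a minimal sketch of a payload in the JSON format that Slack incoming webhooks accept (a single `text` field, with links written as `<url|label>`). The function name, the alert name, and the example.com URLs are all hypothetical; this is not the message format Lightstep actually sends, only an illustration of a message that links to the Stream and to example traces.

```python
import json


def build_alert_message(alert_name, stream_url, trace_urls):
    """Build a Slack incoming-webhook body for a triggered alert.

    Hypothetical helper: the wording and layout are assumptions, not
    Lightstep's real message format.
    """
    lines = [
        f"Alert triggered: {alert_name}",
        f"<{stream_url}|Open the Stream>",
    ]
    # One link per example trace, numbered from 1.
    lines += [
        f"<{url}|Example trace {i}>"
        for i, url in enumerate(trace_urls, start=1)
    ]
    return json.dumps({"text": "\n".join(lines)})


body = build_alert_message(
    alert_name="android error percentage above 5%",      # hypothetical name
    stream_url="https://example.com/streams/android",    # placeholder URL
    trace_urls=[
        "https://example.com/traces/1",                  # placeholder URL
        "https://example.com/traces/2",                  # placeholder URL
    ],
)
print(body)
```

The sketch only builds the body. Posting it would require the webhook URL that Slack issues for the channel.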
You create an alert by defining a destination, a condition that determines when the alert will trigger, and a rule that tells Lightstep when and how to send an alert.
For this step, let’s create an alert that will trigger whenever the error percentage on the android service goes above 5%. Let’s tell Lightstep to send that alert to the on-call team’s Slack channel every 10 minutes until it’s resolved.
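One way to picture how the three pieces fit together is as a small data model. The sketch below is purely illustrative: the class and field names are assumptions chosen for this example, not Lightstep's API or configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Destination:
    integration: str  # where alerts go, e.g. "slack"
    target: str       # e.g. the channel name


@dataclass(frozen=True)
class Condition:
    stream: str       # the Stream the condition is set on
    signal: str       # e.g. "error_percentage"
    threshold: float  # the alert triggers when the signal goes above this


@dataclass(frozen=True)
class AlertingRule:
    destination: Destination
    renotify_minutes: int  # how often to resend until resolved


@dataclass(frozen=True)
class Alert:
    condition: Condition
    rules: tuple

    def is_violated(self, signal_value: float) -> bool:
        # "Above 5%" is treated as a strict comparison (an assumption).
        return signal_value > self.condition.threshold


on_call = Destination(integration="slack", target="#on-call")
android_errors = Alert(
    condition=Condition(stream="android", signal="error_percentage", threshold=5.0),
    rules=(AlertingRule(destination=on_call, renotify_minutes=10),),
)

print(android_errors.is_violated(7.2))  # True
print(android_errors.is_violated(5.0))  # False
```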
We’ll start by creating a Slack destination. In Lightstep, the Destinations tab of the Monitoring view shows all current destinations that alerts can be sent to. Click New Message Destination to create a destination for the on-call Slack channel.
In the dialog, use the dropdown to search for the channel you want to post the alerts to. In this case, we’ll search for #on-call. When you click Save, the new destination appears in the list.
Now that we have a destination to send the alert to, we can create the condition and rule. Conditions are set on a Stream, so we need to open the Stream that will use the condition. We have a Stream that monitors the android service, so we’ll use that.
When we open the Stream view, we can see that there have been errors. Good thing we’re creating an alert!
Click the Create Conditions button to create the condition and rule for the alert.
We’ll define the threshold that triggers the alert in this dialog. Choose Error Percentage for the Signal, set the Threshold to above 5%, and set the Evaluation Window to 5 minutes, meaning that the alert won’t be sent until the condition has held for 5 minutes.
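The interplay between the threshold and the evaluation window can be sketched in a few lines. This is a simplified model that assumes one error-percentage sample per minute and fires only when every sample in the window is above the threshold. Lightstep's actual evaluation may aggregate differently, and the class and function names here are hypothetical.

```python
from collections import deque


def error_percentage(errors: int, total: int) -> float:
    """Percentage of operations that ended in an error."""
    return 100.0 * errors / total if total else 0.0


class SustainedThreshold:
    """Fires once the signal has stayed above the threshold for a full window."""

    def __init__(self, threshold: float, window_minutes: int) -> None:
        self.threshold = threshold
        # Keeps only the most recent `window_minutes` pass/fail results.
        self.recent = deque(maxlen=window_minutes)

    def observe(self, value: float) -> bool:
        self.recent.append(value > self.threshold)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


check = SustainedThreshold(threshold=5.0, window_minutes=5)

# Made-up (errors, total) counts, one pair per minute.
per_minute = [(2, 100), (8, 100), (9, 100), (7, 100), (6, 100), (10, 100), (1, 100)]
fired = [check.observe(error_percentage(e, t)) for e, t in per_minute]
print(fired)
# [False, False, False, False, False, True, False]
```

In this made-up data, the error percentage first exceeds 5% in the second minute, but the condition only fires in the sixth minute, once five consecutive samples are above the threshold. A short spike would never fire.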
Now we’ll create the rules that determine where to send the alerts and how often. Click Add Alerting Rule, then enter Slack for the Integration, the #on-call channel as the Destination, and set the Interval to 10m, meaning the alert will be sent every 10 minutes until resolved. Once you click Create, the condition appears on the Stream. A dotted grey line on the Stream shows the threshold and the name of the alert that will trigger when the threshold is crossed.
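As a rough illustration of what a 10-minute interval means in practice, the sketch below lists when notifications would go out between the moment an alert triggers and the moment it resolves. It assumes the first notification is sent immediately on trigger, which the text above does not state. The function name and the timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def notification_times(triggered_at, resolved_at, interval):
    """Return the times a repeating alert would be sent before resolution."""
    times = []
    current = triggered_at  # assumption: first notification goes out at trigger time
    while current < resolved_at:
        times.append(current)
        current += interval
    return times


sent = notification_times(
    triggered_at=datetime(2024, 1, 1, 9, 0),
    resolved_at=datetime(2024, 1, 1, 9, 25),
    interval=timedelta(minutes=10),
)
print([t.strftime("%H:%M") for t in sent])
# ['09:00', '09:10', '09:20']
```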
That’s it! Now we wait for the threshold to be crossed and the alert to be sent.
Sure enough - it happened again! The on-call team can click one of the example traces to see what’s going on. It looks like it might be a 429 error code coming from the get-store-data operation.
In the next step, we’ll see how we can make it easy for the team to begin remediation by adding links from Lightstep to other tools the team uses.
What Did We Learn?
- You create alerts on Streams. Lightstep Satellites continuously monitor 100% of your telemetry data, looking for instances where defined thresholds are crossed.
- You create alerts by defining a destination for the alert, a threshold that should trigger the alert, and rules for when the alert should be sent.