Often, incident response begins with an alert sent to the on-call team. You create alerts in Lightstep that activate when a set threshold on a Stream is crossed. Thresholds can be set for error percentage, latency, or operations per second.
More about Streams
Streams allow you to proactively monitor parts of your system that are crucial to business health. You create Streams based on a query of your services, operations, and attributes. Lightstep continuously receives data from your Satellites that match the query and stores statistics and example traces to ensure you always have data from 0 to p99.9, including outliers. The Stream view displays statistical time series data and example traces and stores them for as long as your Data Retention policy allows.
When an alert is triggered, a message is sent to the configured destination, like a Slack channel or PagerDuty. The message includes a link into the Stream that triggered the alert and links to example traces.
You create an alert by defining a notification destination and a threshold that determines when the alert will trigger.
For this step, let’s create an alert that will trigger whenever the error percentage on the android service goes above 5%. Let’s tell Lightstep to send that alert to the on-call team’s Slack channel every 5 minutes until it’s resolved.
We’ll start by creating a Slack notification destination. In Lightstep, the Destinations tab of the Alerts view shows all current notification destinations that alerts can be sent to. Click New Message Destination to create a destination for the on-call Slack channel.
In the dialog, use the dropdown to search for the channel you want to post the alerts to. In this case, we’ll search for #on-call. When you click Save, the new destination appears in the list.
Now that we have a destination to send the alert to, we can create the alert on a Stream. We have a Stream that monitors the android service, so we’ll use that.
When we open the Stream view, we can see that there have been errors. Good thing we’re creating an alert!
Click the Create Alert button to create the alert.
We’ll define the threshold to send the alert in this dialog. Choose Error Percentage for the Signal, set the Threshold to be above 5%, and the Evaluation Window to be 5 minutes, meaning that the alert won’t be sent until the violation lasts for 5 minutes.
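To make the Evaluation Window concrete, here is a minimal Python sketch of the idea that an alert fires only once a violation has lasted for the whole window. It is an illustration, not Lightstep's implementation: the per-minute `Sample` records, the function names, and the rule that every sample in the window must exceed the threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical one-minute summary of a Stream (not a Lightstep type)."""
    operations: int
    errors: int


def error_percentage(sample: Sample) -> float:
    """Errors as a percentage of operations; 0.0 when there was no traffic."""
    if sample.operations == 0:
        return 0.0
    return 100.0 * sample.errors / sample.operations


def violation_sustained(samples: List[Sample],
                        threshold_pct: float = 5.0,
                        window_minutes: int = 5) -> bool:
    """True when each of the most recent `window_minutes` samples is above the threshold.

    Assumption: one sample per minute, oldest first.
    """
    if len(samples) < window_minutes:
        return False
    recent = samples[-window_minutes:]
    return all(error_percentage(s) > threshold_pct for s in recent)


# One healthy minute followed by four minutes at 8% errors: not yet sustained.
history = [Sample(100, 1)] + [Sample(100, 8)] * 4
print(violation_sustained(history))

# A fifth consecutive minute above 5% completes the 5-minute window.
history.append(Sample(100, 8))
print(violation_sustained(history))
```

With a window like this, a single brief spike above 5% would not notify anyone, which is the practical effect of the Evaluation Window setting.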
Now we’ll add a notification destination and configure where to send the alerts and how often. Click Add Notification Destinations and select Slack for the Integration, the #on-call channel as the Destination, and set the Interval to be 5m, meaning the alert will be sent every 5 minutes until resolved. Once you click Create, the alert appears on the Stream. A dotted grey line on the Stream shows the threshold and the name of the alert that will trigger when the threshold is crossed.
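The Interval setting can be pictured as a simple repeat schedule. The sketch below is hypothetical and does not call any Lightstep API: it assumes the first message goes out when the alert triggers and that a repeat follows every interval until the alert is resolved, with times expressed as minutes.

```python
from typing import List


def notification_times(triggered_at: int,
                       resolved_at: int,
                       interval_minutes: int = 5) -> List[int]:
    """Minutes at which a message would be sent under the assumptions above.

    `triggered_at` and `resolved_at` are minutes on the same clock;
    no message is sent at or after `resolved_at`.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    times = []
    t = triggered_at
    while t < resolved_at:
        times.append(t)
        t += interval_minutes
    return times


# Alert triggers at minute 0 and is resolved at minute 17.
print(notification_times(0, 17))  # [0, 5, 10, 15]
```

A shorter interval keeps the issue visible in the channel, while a longer one reduces noise for a team that is already working on the problem.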
That’s it! Now we wait for the threshold to be crossed and the alert to be sent.
Sure enough, it happened again! The on-call team can click one of the example traces to see what’s going on. Looks like it might be a 429 error code coming from the get-store-data operation on the
In the next step, we’ll see how we can make it easy for the team to begin remediation by adding links from Lightstep to other tools the team uses.
What Did We Learn?
- You create alerts on Streams. Lightstep Satellites continuously monitor 100% of your telemetry data, looking for instances where defined thresholds are crossed.
- You create alerts by defining a destination for the alert, a threshold that should trigger the alert, and rules for when the alert should be sent.