What are SLAs, SLOs, and SLIs?
It’s likely that you have set Service Level Agreements (SLAs) with your customers. These are promises that your product will meet certain performance benchmarks, such as uptime and responsiveness. If you breach one, you owe the customer some form of compensation. To meet these agreements, you need to monitor your system and receive alerts when performance reaches a point that threatens to break an SLA.
But you don’t want those alerts to fire only at the point where an SLA is already broken. Instead, you want to be alerted well before an SLA comes close to being broken. This is the role of Service Level Objectives (SLOs). They are performance targets you set internally that, when broken, alert you to the problem so that you have time to address it before an SLA is broken. Best practice is to set alerts that warn you when your app is breaking an SLO.
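The relationship between the two thresholds can be sketched in a few lines of Python. This is only an illustration, not part of Lightstep: the `AvailabilityTargets` class, the `classify` function, and the 99.5% and 99.9% values are all hypothetical, chosen to show an internal SLO sitting above the customer-facing SLA so that it is breached first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityTargets:
    """Hypothetical targets, expressed as fractions of successful requests."""
    sla: float  # promised to customers (assumed value: 0.995)
    slo: float  # stricter internal target (assumed value: 0.999)


def classify(availability: float, targets: AvailabilityTargets) -> str:
    """Report which target, if any, the measured availability has broken."""
    if availability < targets.sla:
        return "sla_breached"  # too late: the customer promise is broken
    if availability < targets.slo:
        return "slo_breached"  # early warning: act before the SLA is at risk
    return "ok"


targets = AvailabilityTargets(sla=0.995, slo=0.999)
status = classify(0.997, targets)  # below the SLO, still above the SLA
```

With these assumed values, a measured availability of 99.7% breaks the SLO but not the SLA, which is the window in which an alert is most useful.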
Service Level Indicators (SLIs) are the tools that continuously measure your app’s performance and determine when it is breaking an SLO. Lightstep offers a number of different tools to use as SLIs. They use the instrumentation in your app to collect and visualize data reporting on its performance. There are two main categories of SLI tools: those that use tracing data and those that use metrics data.
Metrics SLI Tools
Lightstep’s metrics tools ingest metrics from third parties to provide quantitative analysis of the processes running inside your system. Dashboards allow you to group charts (queries into your data) to see performance at a glance. You create charts on the dashboards using detailed queries that can include multiple metrics, and you can apply formulas to return the data you want. You can group the results for deeper understanding and choose the chart type that provides the best visualization.
When you notice any deviations in your metric charts, you can use Change Intelligence to find the root cause. Change Intelligence links your metric data with trace data to find the components whose performance also changed during the metric deviation. You can immediately see what in your system may have caused a deviation.
Tracing SLI Tools
Lightstep’s tracing tools use the tracing data from your instrumented services to provide insights into the requests running through your system. You can easily spot latency and errors caused by issues in your code. For example, you can find errors caused by too many requests, or latency caused by an operation incorrectly handling the cache. Lightstep uses Streams to monitor and visualize your tracing SLIs.
This image shows a Stream for the /api/get-profile operation on the android service. It provides a timeseries chart showing latency at different percentiles, along with operation and error rate. When you hover over the latency series, you can view individual spans as dots in a scatterplot. This makes it very easy to see the actual distribution of latency and immediately notice outliers. Dots in red are spans with errors. You can click on any of the dots to open the full trace in Trace view.
Learn more about Streams here.
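To make the latency percentiles and error rate shown on a Stream more concrete, here is a minimal Python sketch of how such values could be derived from a batch of spans. It is not how Lightstep computes them: the span dictionaries, the `latency_ms` and `error` field names, the sample data, and the choice of the nearest-rank percentile method are all assumptions made for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(spans):
    """Reduce hypothetical span records to latency percentiles and an error rate."""
    latencies = [span["latency_ms"] for span in spans]
    errors = sum(1 for span in spans if span["error"])
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(spans),
    }


# Made-up sample: nine fast spans and one slow span that also failed.
spans = [
    {"latency_ms": ms, "error": ms > 900}
    for ms in (12, 15, 18, 20, 22, 25, 30, 45, 120, 950)
]
summary = summarize(spans)
```

In this sample the median stays low while the high percentiles are dominated by the single slow span, which is the kind of outlier the scatterplot view is meant to expose.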
Seeing the data is great, but a true SLI must be able to notify you when an SLO is in danger. You can set alerts on both metrics and Streams that send notifications whenever a configured threshold has been crossed.
For metrics, you set thresholds based on a metric query.
For Streams, you set thresholds based on latency, error, or operation rate.
Destinations determine where the alerts are sent. Lightstep supports many built-in integrations as well as webhooks. Links in the alert take you into Lightstep, where you can start your investigation.
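The threshold-and-destination model described above can be sketched as follows. This is a hypothetical illustration and does not use Lightstep’s API: `AlertRule`, `evaluate`, the signal names, and the threshold values are invented, destinations are modeled as plain Python callables, and the sketch only handles "value rises above threshold" conditions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when a signal rises above its threshold."""
    name: str
    signal: str       # assumed keys: "latency_ms", "error_rate", "ops_per_sec"
    threshold: float


def evaluate(
    rules: List[AlertRule],
    measurements: Dict[str, float],
    destinations: List[Callable[[str], None]],
) -> List[str]:
    """Send a message to every destination for each rule whose threshold is crossed."""
    fired = []
    for rule in rules:
        value = measurements.get(rule.signal)
        if value is not None and value > rule.threshold:
            message = f"{rule.name}: {rule.signal}={value} exceeded {rule.threshold}"
            for send in destinations:
                send(message)
            fired.append(rule.name)
    return fired


# A list stands in for a real destination such as a webhook or chat integration.
outbox: List[str] = []
rules = [
    AlertRule(name="High latency", signal="latency_ms", threshold=500.0),
    AlertRule(name="High error rate", signal="error_rate", threshold=0.01),
]
measurements = {"latency_ms": 620.0, "error_rate": 0.004, "ops_per_sec": 85.0}
fired = evaluate(rules, measurements, [outbox.append])
```

Keeping destinations separate from rules mirrors the description above: the rule decides when to alert, and the destination decides where the notification goes.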