Lightstep’s Service Health page provides a quick performance overview for operations on your services.

All your services are listed on the left. When you select a service, sparkline charts for your Key Operations show you the latency, error rate, and operation rate over a given time range (the default is an hour). Larger charts for the same SLIs are shown on the right.

An operation is a named span that represents a piece of the system’s workflow, generated by your instrumentation. Satellites collect this data in real time from spans and send it to the Lightstep Hypothesis Engine.
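Operations correspond to the span names in your tracing instrumentation. As a rough illustration (not Lightstep’s required setup), here is how a named span might be emitted with the OpenTracing-compatible lightstep-tracer-python package; the component name, access token, and tag values are placeholders:

```python
# Illustrative sketch, not Lightstep's required setup: emitting a named span
# with the OpenTracing-compatible lightstep-tracer-python package.
import lightstep

tracer = lightstep.Tracer(
    component_name="android",            # reported as the service name
    access_token="{your_access_token}",  # placeholder: your project token
)

# The span's operation name is what appears as an operation on this page.
with tracer.start_span(operation_name="/api/update-inventory") as span:
    span.set_tag("span.kind", "server")  # a server-side, ingress-style span
    # ... handle the request ...

tracer.flush()  # make sure buffered spans reach the Satellites
```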

You notice that the customer-facing android service is suddenly experiencing an increase in latency on the /api/update-inventory operation.

The /api/get-store operation looks like it experienced an error rate spike, but that seems to have resolved, so let’s focus on /api/update-inventory.

You want to find out why, but first, let’s understand what’s shown on this page.

The first thing you might notice is that not all your operations are shown for the service. Only the Key Operations (operations whose performance is strategic to the health of your system) are listed. By default, Lightstep selects ingress operations (the first operations handling external requests from outside that service) as your Key Operations, as they are typically the “endpoints” for your service. If something is “wrong” with them, it’s likely to affect everything upstream (in this case, your customer).

OK, so you can see the performance of your Key Operations. But how do you know which ones are really experiencing issues? Not all latency regressions are actual issues; sometimes they’re just a blip. Lightstep measures and records the amount of change over time for you and surfaces that degree of change in a few ways.

First, the operations are ordered by the magnitude of change, from highest to lowest across the three measures (latency, error rate, and operation rate). Second, Lightstep displays change bars (like thermometers) representing the degree of change for each of these measures. The more yellow in the bar, the greater the change.

More About How Lightstep Measures Change

When determining what has changed in your services, Lightstep compares the baseline SLI time series to the comparison SLI time series. Those time periods are determined using the data currently visible in the charts.

You can change the amount of time displayed using the time period dropdown at the top right of the page.

The baseline and comparison time periods are determined as follows:

If there is one or more deployment markers visible:

  • For latency and error rate, Lightstep compares performance after the selected version to performance in all other versions.
  • For operation rate, Lightstep compares the rate before the deployment marker to the rate after it.

If there are no deployment markers visible:
Lightstep compares the performance of the first half of the time period to the second half.

Change is measured relative to the baseline, not in absolute terms: a change from 10ms to 500ms (50x) is ranked higher than one from 1s to 2s (2x), even though its absolute difference is smaller. The yellow bars on the sparkline chart indicate the amount of change. Lightstep measures two aspects of change: size and continuity. A full bar indicates that a large, sustained change has happened. Smaller bars indicate either a smaller change or one that did not last for the full time period.
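Lightstep’s change-detection algorithm is internal, but the ranking idea can be sketched in a few lines. This toy Python sketch uses entirely hypothetical data and a simple mean-based comparison (it ignores continuity), and it assumes the no-deployment-marker case, where the baseline is the first half of the visible window and the comparison is the second:

```python
# Toy sketch of relative-change ranking; not Lightstep's actual algorithm.
# Baseline = first half of the visible window, comparison = second half.

def relative_change(sli_points):
    """Ratio of the mean SLI in the second half to the mean in the first."""
    mid = len(sli_points) // 2
    baseline = sum(sli_points[:mid]) / mid
    comparison = sum(sli_points[mid:]) / (len(sli_points) - mid)
    return comparison / baseline  # so a 10ms -> 500ms jump scores ~50x

# Hypothetical latency series (ms) for two operations:
series = {
    "/api/update-inventory": [10, 11, 9, 10, 480, 510, 495, 500],       # ~10ms -> ~500ms
    "/api/get-store":        [1000, 1010, 990, 1000, 2000, 1990, 2010, 2000],  # ~1s -> ~2s
}

# /api/update-inventory ranks first: a ~50x change beats a ~2x change,
# even though its absolute increase (~490ms) is the smaller of the two.
for op, points in sorted(series.items(), key=lambda kv: relative_change(kv[1]), reverse=True):
    print(f"{op}: {relative_change(points):.1f}x")
```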

A yellow bar means that an SLI had an objectively large change, regardless of service or operation. Lightstep’s algorithm runs on each SLI independently. For example, when the bar displays for an operation’s latency, it means that latency has changed, not that its change was greater than that of the other SLIs.


What we know so far:

  • There’s a latency spike in the /api/update-inventory operation. This is an ingress operation on a service that is surfaced directly to the customer.
  • The change bar tells us that the latency is an anomaly; it’s not something that repeatedly happened in the recent past.
  • There wasn’t a deployment of the android service. If there had been, Lightstep would display deployment markers in the chart.

Let’s start investigating the latency by comparing performance during the regression with a time range where we know things were OK. We still don’t know where the issue is coming from (it could be this operation itself, but it might also be deep in the stack, bubbling up to this operation).

What about the Metrics shown at the top of the page?

Lightstep displays machine metrics for the service at the top of the page in an accordion (the metrics displayed depend on the language of your instrumentation). Often, the performance of your hardware contributes to latency as much as the code that’s deployed on it. By seeing how your infrastructure is performing, you can quickly deduce whether you have an infrastructure problem rather than a code problem.
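Which metrics appear depends on your tracer, but they are typically host-level signals such as CPU and memory utilization. As a rough illustration of the kind of data involved (not how Lightstep collects it), here is a hypothetical sampler using the third-party psutil library:

```python
# Illustrative only: the kind of machine metrics shown in the accordion.
# Lightstep's tracers report these automatically; this sketch just shows
# what the underlying signals represent.
import time
import psutil

while True:
    cpu_pct = psutil.cpu_percent(interval=None)  # CPU utilization (%)
    mem_pct = psutil.virtual_memory().percent    # memory in use (%)
    print(f"cpu={cpu_pct:.1f}% mem={mem_pct:.1f}%")
    time.sleep(10)  # hypothetical 10-second reporting interval
```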



What Did We Learn?

  • Lightstep shows you performance sparkline charts (latency, error rate, and operation rate) for the Key (ingress) Operations on a service.
  • Lightstep measures the amount of change for you. No need to guess whether a change is a problem.