The Service Health view on the Deployments tab shows you the latency, error rate, and operation rate of your Key Operations (operations whose performance is strategic to the health of your system) on the selected service. Lightstep considers Key Operations to be your ingress operations on the service.

Key Operations are displayed in order of magnitude of change in performance (you can change the order). You can quickly see the top performance changes and compare performance during latency or error regression spikes with normal performance.

On this page, Lightstep can also show machine metrics for the service during the reported time period.

Observe Machine Metrics

If you instrumented your app using one of the following tracers, Lightstep displays metrics for the selected service for the time period covering both the regression and baseline.

CPU Usage (%), Memory (%) and Network (bytes) are shown in the collapsed accordion panel.

Depending on your instrumentation language, other metrics may be available. To view all machine metrics in detail, expand the Metrics panel.

Hover over the chart to view details.

View Top Changes in Performance

Lightstep displays sparkline charts for Key Operations on the service. They show recent performance for Service Level Indicators (SLIs): latency, error rate, and operation rate. Shaded yellow bars to the left of the chart indicate the magnitude of the change.

If there’s a deployment marker visible, the change is measured as the difference between the selected version and all other versions. If there is no marker, the change is measured as the difference between the first and second half of the time period shown in the chart.

The operations are initially sorted by highest change during the visible time period. You can change both the time period and the sort order.

More About How Lightstep Measures Change

When determining what has changed in your services, Lightstep compares the baseline SLI timeseries to the comparison SLI timeseries. Those time periods are determined using the data currently visible in the charts.

You can change the amount of time displayed using the time period dropdown at the top right of the page.

The baseline and comparison time periods are determined as follows:

If one or more deployment markers are visible:

  • For latency and error rates, Lightstep compares the performance after the selected version to performance in all other versions.
  • For operation rate, Lightstep compares the rate before the deployment marker to after the deployment marker.

If there are no deployment markers visible:

  • Lightstep compares the performance of the first half of the time period to the second half.
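
The way the baseline and comparison periods are chosen can be illustrated with a short Python sketch. This is not Lightstep's implementation: the function name `split_periods`, the `marker_ts` parameter, and the sample data are all hypothetical, and the sketch only covers the before/after-marker rule and the first-half/second-half rule described above.

```python
from statistics import mean


def split_periods(points, marker_ts=None):
    """Return (baseline, comparison) values from (timestamp, value) pairs.

    Hypothetical helper, for illustration only:
    - with a deployment marker, split at the marker timestamp
      (the before/after rule described for operation rate);
    - without one, split the visible window into two halves.
    """
    points = sorted(points)  # order by timestamp
    if marker_ts is not None:
        baseline = [v for t, v in points if t < marker_ts]
        comparison = [v for t, v in points if t >= marker_ts]
    else:
        mid = len(points) // 2
        baseline = [v for _, v in points[:mid]]
        comparison = [v for _, v in points[mid:]]
    return baseline, comparison


# Made-up operation rate samples: (minute, operations per minute)
rate = [(0, 100), (1, 102), (2, 98), (3, 150), (4, 155), (5, 145)]

# No deployment marker visible: first half vs. second half
baseline, comparison = split_periods(rate)
print(mean(baseline), mean(comparison))
```

Passing `marker_ts=4` instead would treat everything before minute 4 as the baseline and everything from minute 4 onward as the comparison.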

Lightstep ranks changes by their relative size, not their absolute size (for example, a change from 10ms to 500ms is ranked higher than one from 1s to 2s). The yellow bars on the sparkline chart indicate the amount of change. Lightstep measures two aspects of change: size and continuity. A full bar indicates that a large, sustained change has happened. Smaller bars indicate either a smaller change or one that did not last for the full time period.
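
To make the idea of relative change and the size/continuity distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Lightstep's algorithm: `relative_change`, `bar_fill`, and the `threshold` and `full_size` parameters are hypothetical names and values chosen for the example, and the baseline is assumed to be greater than zero.

```python
def relative_change(before, after):
    """Change expressed as a multiple of the starting value."""
    return abs(after - before) / before


# Latencies in milliseconds, taken from the example in the text
changes = {
    "10ms -> 500ms": relative_change(10, 500),    # 49.0
    "1s -> 2s": relative_change(1000, 2000),      # 1.0
}
ranked = sorted(changes, key=changes.get, reverse=True)
print(ranked)  # the 10ms -> 500ms change ranks first


def bar_fill(baseline, comparison, threshold=0.5, full_size=1.0):
    """Hypothetical bar score in [0, 1] combining size and continuity.

    size:       largest relative change, capped at 1.0
    continuity: fraction of comparison samples whose relative change
                meets the (assumed) threshold
    """
    base = sum(baseline) / len(baseline)
    deltas = [relative_change(base, v) for v in comparison]
    size = min(1.0, max(deltas) / full_size)
    continuity = sum(d >= threshold for d in deltas) / len(deltas)
    return size * continuity


print(bar_fill([10, 10], [20, 20, 20, 20]))  # large and sustained: 1.0
print(bar_fill([10, 10], [20, 10, 10, 10]))  # large but brief: 0.25
```

In this sketch a brief spike and a sustained shift of the same size produce different scores, which mirrors the description of full and smaller bars above.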

The yellow bar means that an SLI had an objectively large change, regardless of service or operation. Lightstep’s algorithm runs on each SLI independently. For example, when the bar displays for an operation’s latency, that means latency has changed – not that its change was greater compared to the other SLIs.

Select an operation’s sparkline charts to view larger charts for the latency, error rate, and operation rate. Use these larger charts to start your investigation.

Change the Sparkline Charts Display

By default, the operations are sorted by the amount of detected change (largest to smallest). Use the dropdown to change the sort.

Also by default, only the latency percentile with the largest amount of change displays. You can change the sparkline charts to show more percentiles using the More ( ⋮ ) icon.

You can search for an operation using the Search field.

Change the Reporting Time Period

By default, the data shown is from the last hour. Use the dropdown to select a different time period, up to one week. Use the < > controls to move backwards and forwards through time; you can view data up to 10 days in the past.

You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.

Monitor Deployments

The Service Health view on the Deployments tab lets you see how deployments affect the performance of your services. When you implement a tag to display versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, a deployment of the inventory service occurred at around 2:20 pm. Hover over the marker in the larger charts to view details. These markers allow you to quickly correlate a deployment with a possible regression.

When you have multiple versions in a time window, you can view the performance of each deployed version. For example, in this image, multiple versions have been deployed. Hover over the chart to see the percentage of traffic in each version.

This feature requires a Satellite upgrade to the March 2020 release.

Learn more about how to use this view to monitor the health of your deployments.