Code and configuration changes are some of the most common causes of performance regressions. A dashboard viewed after a deployment may show that a latency regression occurred, but it won’t explain the root cause. You can use Lightstep not only to monitor your services after a deploy, but also to compare performance over specific time periods and then dig into details to find the differences that caused the latency.
Compare Two Time Ranges for Latency
When you notice a regression in latency, you can click the corresponding chart and immediately begin to investigate the issue.
From the Service Directory, select a service and click the Deployments tab. Sparkline charts show the top changes in latency, error rate, and operation rate for the ingress operations for that service (up to 50). The operations are initially sorted by highest change during the time period initially selected. If you’ve implemented a tag to denote new versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, for the update-inventory operation, a deploy occurred at 10:45 am. Hover over the marker to view details. These markers allow you to quickly correlate deployment with a possible regression. When a deploy only affects a segment of traffic (like a canary deployment or a blue/green deploy), Lightstep shows you the performance and amount of traffic for each version.
When you find a latency spike to investigate, select the operation’s sparkline chart and then click within the spike in the larger Latency chart to create a time range to collect data from.
Choose a spot in the middle of a spike to avoid collecting data from before the regression occurred. You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.
A shaded yellow bar shows the time range from which span data will be used for analysis of the regression.
Choose to compare that data with data from directly before the previous deploy, from an hour ago, or from a day ago, or select your own baseline window.
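The preset baselines can be thought of as the regression window moved back in time. The following is a minimal sketch of that idea, assuming the baseline has the same length as the regression window (this page does not state that). The function names and timestamps are hypothetical and are not part of any Lightstep API.

```python
from datetime import datetime, timedelta

def shifted_baseline(regression_start, regression_end, offset):
    """Return the regression window moved back by `offset` (assumed equal length)."""
    return regression_start - offset, regression_end - offset

def baseline_before_deploy(deploy_time, length):
    """Return a window of `length` that ends exactly at the deploy time."""
    return deploy_time - length, deploy_time

# Hypothetical regression window chosen from the middle of a spike.
regression = (datetime(2023, 1, 10, 10, 50), datetime(2023, 1, 10, 11, 0))
# Hypothetical deploy time, mirroring the 10:45 am deploy in the example.
deploy_time = datetime(2023, 1, 10, 10, 45)

hour_ago = shifted_baseline(*regression, timedelta(hours=1))
day_ago = shifted_baseline(*regression, timedelta(days=1))
before_deploy = baseline_before_deploy(deploy_time, regression[1] - regression[0])
```

A baseline that ends at the deploy time only contains traffic served by the previous version, which is why it is a useful comparison point for a deploy-related regression.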
When you click Select custom baseline window, click into the chart and move the blue block to choose your baseline.
The Comparison view opens to show data from both time periods.
To change either the Regression or Baseline time, click and move the shaded box. You can also zoom in on a time period by clicking and dragging over the time period you want to zoom in on.
Read on for instructions for using these tools.
Observe Machine Metrics
If you instrumented your app using one of the following tracers, Lightstep displays those metrics for the selected service for the time period covering both the regression and baseline.
The metrics reported depend on the language of your app.
To view machine metrics, expand the Metrics panel.
By default, the following metrics are shown:
- CPU Usage (%)
- Memory (%)
- Network (bytes)
Hover over the chart to view details.
Compare Latency Changes Using the Histogram
The Compare Latency Histogram is similar to the histogram shown on the Explorer view.
Learn more about histograms
Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Satellites that match the query. The bottom of the histogram shows latency time, and each blue bar represents the number of spans that fall into that latency time. Note that on this page, the histogram displays yellow bars instead of blue. You can learn more about the histogram here.
In the chart, the yellow bars represent the number of spans for the duration shown on the x-axis. The blue line overlay shows the shape of the histogram for the baseline. This overlay helps identify the scope of the performance impact: how much has latency changed and for how many requests.
By default, markers show where the 95th percentile lies. Use the dropdown to see other percentile markers.
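To make the histogram and percentile markers concrete, here is a small sketch that buckets span durations and computes a 95th percentile for two time ranges. It assumes fixed-width buckets and a nearest-rank percentile, which may differ from how Lightstep actually computes them. The sample durations are made up for illustration.

```python
import math
from collections import Counter

def percentile(durations_ms, pct):
    """Nearest-rank percentile of a list of span durations."""
    ordered = sorted(durations_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def histogram(durations_ms, bucket_ms):
    """Count spans per fixed-width bucket, keyed by the bucket's lower bound."""
    return Counter((d // bucket_ms) * bucket_ms for d in durations_ms)

# Made-up span durations (ms) for the two time ranges.
baseline = [10, 12, 15, 20, 22, 25, 30, 35, 40, 50]
regression = [12, 15, 20, 25, 40, 80, 120, 150, 180, 200]

p95_baseline = percentile(baseline, 95)
p95_regression = percentile(regression, 95)
p95_shift = p95_regression - p95_baseline

# Bucket counts play the role of the yellow bars (regression)
# and the blue overlay (baseline).
regression_bars = histogram(regression, 50)
baseline_overlay = histogram(baseline, 50)
```

Comparing the two sets of bucket counts answers both questions the overlay is meant to answer: how far the latency moved and how many requests moved with it.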
The Compare Tags table is similar to Correlations shown on the Explorer view. For the ingress operation being investigated, the table displays tags that are on spans associated with latency in the regression time range.
Learn more about Correlations
A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Lightstep helps you find attributes correlated with latency and errors automatically.
Lightstep Correlations searches over all the span data in all traces that the current query participates in and finds patterns in that data for latency and error contribution, looking at service/operation pairs and tags. This analysis allows you to identify services, operations, and tags that are meaningfully associated with observed regressions. By surfacing statistical relationships between these system attributes, you can quickly generate and validate hypotheses during a root cause investigation. You can learn more about correlations here.
For each of these tags, the table indicates the average latency of the spans containing the tag in the baseline time range (in blue), the average latency of the spans containing the same tag in the regression time range (in yellow), and the difference in duration.
By default, the table is sorted by duration change, from highest to lowest. You can change the sort by clicking on any column header.
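The table's columns can be approximated with a per-tag average over each time range. The sketch below is an illustration only: the span dictionaries, field names, and sample values are hypothetical, and for simplicity it keeps only tags that appear in both time ranges, which the real table may not do.

```python
from collections import defaultdict

def avg_latency_by_tag(spans):
    """Average span duration for every (key, value) tag pair."""
    totals = defaultdict(lambda: [0.0, 0])  # tag -> [sum of durations, span count]
    for span in spans:
        for tag in span["tags"].items():
            totals[tag][0] += span["duration_ms"]
            totals[tag][1] += 1
    return {tag: total / count for tag, (total, count) in totals.items()}

def compare_tags(baseline_spans, regression_spans):
    """Rows of (tag, baseline avg, regression avg, change), largest change first."""
    base = avg_latency_by_tag(baseline_spans)
    regr = avg_latency_by_tag(regression_spans)
    # Simplification: only tags seen in both time ranges are compared.
    rows = [(tag, base[tag], regr[tag], regr[tag] - base[tag])
            for tag in base.keys() & regr.keys()]
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Made-up spans for illustration.
baseline_spans = [
    {"duration_ms": 40, "tags": {"region": "us-east", "customer": "acme"}},
    {"duration_ms": 60, "tags": {"region": "us-west", "customer": "acme"}},
]
regression_spans = [
    {"duration_ms": 200, "tags": {"region": "us-east", "customer": "acme"}},
    {"duration_ms": 70, "tags": {"region": "us-west", "customer": "acme"}},
]

rows = compare_tags(baseline_spans, regression_spans)
```

With this sample data, the `region: us-east` tag sorts first because its average latency grew the most between the two ranges.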
Hover over a correlation to see the latency of spans with that tag overlaid on the histogram (shaded a darker yellow). For example, in this histogram, the tag service.version:v1.14.569 appears on only the spans with higher latency in the regression and may be associated with the root cause of the latency.
Click on a row in the Compare Tags table to view example traces that contain the correlated tag.
Click one of the traces to view the full trace in the Trace view. If you want to see more traces with this tag, you can filter the Trace Analysis table (below) to see only spans with that tag by clicking Filter by this tag, or you can group the current results by that tag by clicking Group by this tag.
Compare Slowest Operations Contributing to the Critical Path in a Trace
The Operation Diagram and Compare Operations table show the operations across the request path of traces in the regression. By default, the diagram shows all downstream operations from the original ingress operation and service. Yellow halos denote average latency contribution. The ingress operation you’re currently investigating has a blue animated halo.
If you want to focus the investigation to show operations up to the first operation in another service, select Direct Downstream Operations at the top right of the Operation Diagram.
The diagram edges represent relationships between spans and depend on the quality of your instrumentation. Missing spans may cause edges to appear when there is no parent-child relationship. The following can cause missing spans:
- The Satellite dropped the span (for example, your Satellite pool is not auto-scaling to keep up with traffic)
- The tracer dropped the span (for example, your application crashed or never called
- There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).
When you see missing spans, check the Reporting Status page to find the culprit.
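One way to reason about missing spans is to look for spans whose parent never arrived. The sketch below flags such orphans within a single trace. The `span_id` and `parent_id` field names and the sample trace are hypothetical and do not reflect Lightstep's data model.

```python
def find_orphans(spans):
    """Return spans that reference a parent span not present in the trace."""
    known_ids = {span["span_id"] for span in spans}
    return [
        span for span in spans
        if span["parent_id"] is not None and span["parent_id"] not in known_ids
    ]

# Made-up trace: span "c" was dropped, so "d" points at a parent that is missing.
trace = [
    {"span_id": "a", "parent_id": None, "operation": "update-inventory"},
    {"span_id": "b", "parent_id": "a", "operation": "write-cache"},
    {"span_id": "d", "parent_id": "c", "operation": "db-query"},
]

orphans = find_orphans(trace)
```

An orphan only tells you that a parent is missing, not why; the causes listed above still need to be checked separately.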
The table is sorted by the change in critical path latency between the baseline and regression, from largest to smallest. You can change the sort order by clicking on another column.
For example, in this table the write-cache operation on the inventory service contributed an average of 56.4ms in latency in the baseline time range, but contributed 191ms in latency in traces from the regression time range. Identifying the largest increase in latency can help pinpoint the root cause of the latency increase in the regression.
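The default sort order can be sketched as a ranking by change in average critical path contribution. In the example below, only the write-cache figures (56.4ms and 191ms) come from this page; the other operations and values are invented for illustration, and treating an operation that is absent from the baseline as 0ms is an assumption.

```python
def rank_by_change(baseline_ms, regression_ms):
    """Rows of (operation, baseline, regression, change), largest change first."""
    rows = []
    for operation, regr in regression_ms.items():
        # Assumption: an operation not seen in the baseline contributed 0 ms.
        base = baseline_ms.get(operation, 0.0)
        rows.append((operation, base, regr, regr - base))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows

# Average critical path contribution in ms, keyed by (service, operation).
baseline_ms = {
    ("inventory", "write-cache"): 56.4,       # from the example on this page
    ("inventory", "update-inventory"): 80.0,  # invented
    ("database", "db-query"): 30.0,           # invented
}
regression_ms = {
    ("inventory", "write-cache"): 191.0,      # from the example on this page
    ("inventory", "update-inventory"): 85.0,  # invented
    ("database", "db-query"): 28.0,           # invented
}

ranked = rank_by_change(baseline_ms, regression_ms)
```

Here write-cache ranks first with an increase of roughly 134.6ms, which matches the reading of the table described above.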
Click on a row in the Compare Operations table to view example traces that contain this operation. Clicking on a row in the table will also center the Operation Diagram on that specific operation, to help narrow your investigation on only the relevant operations.
Click one of the traces to view the full trace in the Trace view. If you want to see more traces with this operation, you can filter the Trace Analysis table (below) to see only spans from that operation by clicking Filter by this operation.
Compare Span Data Using the Trace Analysis Table
Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table shows the service, the operation reporting the span, the span’s duration, and the start time from the regression time range. You can add other columns as needed by clicking on the + icon.
You can switch between analyzing traces from the baseline or regression time periods by selecting the time range of interest. After you perform an analysis, for example a group-by, you can switch time ranges and see that same group-by aggregation on the other time range’s traces.
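Applying the same group-by to both time ranges amounts to running one aggregation over two span sets. A minimal sketch, assuming spans are plain dictionaries with hypothetical `operation` and `duration_ms` fields and made-up values:

```python
from collections import defaultdict
from statistics import mean

def group_by(spans, key):
    """Group spans by `key` and report the count and average duration per group."""
    groups = defaultdict(list)
    for span in spans:
        groups[span.get(key, "(none)")].append(span["duration_ms"])
    return {
        name: {"count": len(durations), "avg_ms": mean(durations)}
        for name, durations in groups.items()
    }

# Made-up spans for the two time ranges.
baseline_spans = [
    {"operation": "write-cache", "duration_ms": 50},
    {"operation": "write-cache", "duration_ms": 60},
    {"operation": "read-cache", "duration_ms": 10},
]
regression_spans = [
    {"operation": "write-cache", "duration_ms": 180},
    {"operation": "write-cache", "duration_ms": 200},
    {"operation": "read-cache", "duration_ms": 12},
]

# The same aggregation is applied to each time range.
baseline_groups = group_by(baseline_spans, "operation")
regression_groups = group_by(regression_spans, "operation")
```

Because the grouping key stays the same, the two results can be compared group by group, which is what switching time ranges in the table lets you do.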