Code and configuration changes are among the most common causes of performance regressions. Viewing a dashboard after a deployment may show that a regression occurred, but it won’t explain the root cause. You can use Lightstep not only to monitor your services after a deploy, but also to compare performance across specific time periods and then dig into details to find the differences that caused the issue.

When you notice a regression in latency or error rate, you can click the corresponding chart and immediately begin to investigate the issue.

Investigate a Latency Regression

When you notice a spike in latency in an operation from the Service Directory, you can use the Comparison view to compare two time ranges and then dig down to find the root cause for the difference.

To compare two time ranges for latency:

  1. From the Service Directory, select a service. Charts showing latency, error rate, and operation rate are shown for the ingress operations for that service (up to 50). If you’ve implemented a tag to denote new versions of your service (see the sketch after these steps), a deployment marker displays at the time the deployment occurred. In the example below, for the update-inventory operation, a deploy occurred at 12:45 pm. Hover over the marker to view details. These markers allow you to quickly correlate a deployment with a possible regression.

  2. When you find a latency spike to investigate, click in the chart to create a range of time to collect data from. Choose a spot in the middle of a regression to avoid collecting data from before the spike occurred. A shaded yellow bar shows the time range from which span data will be used for analysis of the regression.

    You can compare that data with data from directly before the previous deploy, from an hour ago, or from a day ago; you can also select your own baseline window.

    When you click Select custom baseline window, click into the chart and move the blue block to choose your baseline.

    The Comparison view opens to show data from both time periods.

    To change either the Regression or Baseline time, click and move the shaded box.
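
If your service doesn’t report a version tag yet, the sketch below shows one way to add one. It’s a minimal Python example using the OpenTracing API; the tag key service.version and the SERVICE_VERSION environment variable are assumptions, so use whatever key and versioning scheme your Lightstep project is configured to recognize.

```python
# Minimal sketch: stamp spans with a version tag so deploys can be correlated
# with regressions. "service.version" and SERVICE_VERSION are assumptions --
# use the tag key your Lightstep project expects and however your deploy
# pipeline exposes the version.
import os
import opentracing

SERVICE_VERSION = os.environ.get("SERVICE_VERSION", "unknown")

def update_inventory():
    tracer = opentracing.global_tracer()  # assumes a Lightstep tracer was registered globally
    with tracer.start_active_span("update-inventory") as scope:
        scope.span.set_tag("service.version", SERVICE_VERSION)
        # ... handle the request ...
```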

Read on for instructions for using these tools.

Compare Latency Changes Using the Histogram

The Compare Latency Histogram is similar to the histogram shown on the Explorer view.

Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Satellites that match the query. The x-axis shows latency, and each bar represents the number of spans that fall into that latency range. On the Explorer view the bars are blue; on this page they are yellow.

In the chart, the yellow bars represent the number of spans for the duration shown on the x-axis. The blue line overlay shows the shape of the histogram for the baseline. This overlay helps identify the scope of the performance impact: how much has latency changed and for how many requests.

By default, markers show where the 95th percentile lies. Use the dropdown to see other percentile markers.
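
As a point of reference, a percentile marker such as p95 is just a rank statistic over the span durations in the window. The sketch below shows the rough arithmetic in Python with illustrative durations; Lightstep computes this for you from the spans matching your query.

```python
# Approximate nearest-rank percentile over span durations (milliseconds).
# Sample data is illustrative only.
def percentile(durations_ms, p):
    ordered = sorted(durations_ms)
    rank = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[rank]

durations = [12, 14, 15, 18, 22, 25, 31, 48, 95, 410]
print(percentile(durations, 95))  # -> 410, the p95 marker on the histogram
```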

Compare Tags

The Compare Tags table is similar to Correlations shown on the Explorer view. For the ingress operation being investigated, the table displays tags that are associated with latency in the regression time range for all spans in the full trace.

For each of these tags, the table indicates the average latency of the spans containing the tag in the regression time range (in yellow), the average latency of the spans containing the same tag in the baseline time range (in blue), and the difference in duration.

By default, the table is sorted by duration change, from highest to lowest. You can change the sort by clicking on any column header.
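
Conceptually, the table performs a per-tag comparison of average latency between the two windows. The following Python sketch, using illustrative span dictionaries rather than Lightstep’s internal data model, shows that comparison and the default sort:

```python
# Compare average latency per tag between the regression and baseline windows.
# Spans are illustrative dicts: {"duration_ms": float, "tags": {key: value}}.
from collections import defaultdict

def avg_latency_by_tag(spans):
    totals = defaultdict(lambda: [0.0, 0])  # "key:value" -> [sum_ms, count]
    for span in spans:
        for key, value in span["tags"].items():
            bucket = totals[f"{key}:{value}"]
            bucket[0] += span["duration_ms"]
            bucket[1] += 1
    return {tag: total / count for tag, (total, count) in totals.items()}

def compare_tags(regression_spans, baseline_spans):
    regression = avg_latency_by_tag(regression_spans)
    baseline = avg_latency_by_tag(baseline_spans)
    rows = [
        (tag, regression[tag], baseline[tag], regression[tag] - baseline[tag])
        for tag in regression
        if tag in baseline
    ]
    # Default sort: duration change, highest to lowest, like the table.
    return sorted(rows, key=lambda row: row[3], reverse=True)
```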

Hover over a correlation to see the tag’s latency overlaid on the histogram. For example, in this histogram, the tag region:us-west-1 appears more frequently on spans with higher latency and may be associated with the root cause of the latency regression.

Click on a row in the Compare Tags table to view example traces that contain the correlated tag.

Click one of the traces to view the full trace in the Trace view.

Compare Slowest Operations Contributing to the Critical Path in a Trace

The Operation Diagram and Compare Operations table show the operations that contribute to latency across the request path of all the traces from the regression time range. By default, the diagram shows all downstream intra-service operations from the original ingress operation and service. Yellow halos denote average latency contribution. The ingress operation you’re currently investigating has a blue animated halo.
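
As a rough mental model of latency contribution, you can think of each span’s self time: its duration minus the time covered by its child spans. The Python sketch below uses that approximation with illustrative span dictionaries; it is not Lightstep’s actual critical-path calculation, which also accounts for concurrency and trace structure.

```python
# Approximate each operation's latency contribution as "self time":
# span duration minus the total duration of its direct children.
# Spans are illustrative dicts: span_id, parent_id, operation, duration_ms.
def self_time_by_operation(spans):
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    contributions = {}
    for span in spans:
        child_time = sum(c["duration_ms"] for c in children.get(span["span_id"], []))
        self_time = max(0.0, span["duration_ms"] - child_time)
        contributions[span["operation"]] = contributions.get(span["operation"], 0.0) + self_time
    return contributions
```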

If you want to expand the investigation to include operations from other services, select All Downstream Operations at the top right of the Operation Diagram. Depending on trace depth and instrumentation, the complete Operation Diagram may be quite large.

The diagram edges represent relationships between spans, and their accuracy depends on the quality of your instrumentation. Missing spans may cause edges to appear where there is no parent-child relationship. The following can cause missing spans:

  • The Satellite dropped the span (for example, your Satellite pool is not auto-scaling to keep up with traffic).
  • The tracer dropped the span (for example, your application crashed or never called span.flush()).
  • There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).

When you see missing spans, check the Reporting Status page to find the culprit.
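
For the span.flush() case in particular, a common fix is to flush the tracer’s buffer before the process exits. Here is a minimal sketch, assuming the Lightstep Python tracer and its flush() method; check your tracer’s documentation for the exact call.

```python
# Flush buffered spans on shutdown so they aren't dropped and don't leave
# gaps in the Operation Diagram. Constructor values are placeholders.
import atexit
import lightstep

tracer = lightstep.Tracer(
    component_name="inventory",        # service name as shown in the Service Directory
    access_token="YOUR_ACCESS_TOKEN",  # placeholder
)

# Send any spans still buffered in the tracer before the process exits.
atexit.register(tracer.flush)
```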

The table is sorted by the change in critical path latency between the baseline and regression, from largest to smallest. You can change the sort order by clicking on another column.

For example, in this table the update-inventory operation on the inventory service contributed an average of 873ms of latency in traces from the baseline time range, but 4s in traces from the regression time range (an increase of roughly 3.1 seconds). Identifying the operations with the largest increases can help pinpoint the root cause of the regression.

Click on a row in the Compare Operations table to view example traces that contain this operation. Clicking a row also centers the Operation Diagram on that specific operation, helping narrow your investigation to only the relevant operations.

Click one of the traces to view the full trace in the Trace view.

Observe Machine Metrics

If you instrumented your app using one of the following tracers, Lightstep displays those metrics for the selected service for the time period covering both the regression and baseline.

The metrics reported depend on the language of your app.

By default, the following metrics are shown:

  • CPU Usage (%)
  • Memory (%)
  • Network (bytes)

Add more charts using the dropdown.

Hover over the chart to view details.

Compare Span Data Using the Trace Analysis Table

Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table shows the service, the operation reporting the span, the span’s duration, and the start time from the regression time range. You can add other columns as needed by clicking on the + icon.

You can switch between analyzing traces from the baseline or regression time periods by selecting the time range of interest. After you perform an analysis, for example a group-by, you can switch time ranges and see that same group-by aggregation on the other time range’s traces.
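
A group-by in the Trace Analysis table is the same kind of aggregation you might write by hand over span data. The sketch below groups illustrative span dictionaries by a hypothetical customer tag; the same aggregation can then be applied to either window’s spans.

```python
# Group spans by a tag value and aggregate count and average duration,
# mirroring a group-by in the Trace Analysis table. Span dicts are illustrative.
from collections import defaultdict

def group_spans(spans, tag_key):
    groups = defaultdict(list)
    for span in spans:
        groups[span["tags"].get(tag_key, "<unset>")].append(span["duration_ms"])
    return {
        value: {"span_count": len(durations), "avg_ms": sum(durations) / len(durations)}
        for value, durations in groups.items()
    }

# Apply the same aggregation to either window's spans:
# group_spans(regression_spans, "customer")
# group_spans(baseline_spans, "customer")
```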

Investigate an Error Rate Increase

When you notice an increase in the error rate for an operation, you can use the Comparison view to compare a period of time when the error rate was stable (called the *baseline*) to a period when the rate was high (the *regression*). You then use analysis tools to identify what caused the increase in errors.

To compare two time ranges for error rate:

  1. From the Service Directory, select a service. Charts showing latency, error rate, and operation rate are shown for the ingress operations for that service (up to 50). If you’ve implemented a tag to denote new versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, for the /api/get-store operation, a deploy occurred at around 10:00 am. Hover over the marker on the error rate chart to view details. These markers allow you to quickly correlate a deployment with a spike in errors.

  2. When you find an error spike to investigate, click in the error rate chart to select your regression time period, which specifies the range of time to collect data from. Choose a spot in the middle of a regression to avoid collecting data from before the spike occurred. A shaded red bar shows the time range from which trace data will be used for analysis of the regression.

    You can compare that data with data from directly before the previous deploy, from an hour ago, or from a day ago; you can also select your own baseline window.

    When you click Custom baseline window, click into the chart and move the blue block to choose your baseline.

    The Comparison view opens to show data from both time periods. The color blue represents the baseline time window and red represents the regression time window.

    To change either the Regression or Baseline time, click and move the shaded box.

The Operation Diagram shows the operations with errors across the request path of all traces from the regression time range. By default, the diagram shows all downstream intra-service operations from the original ingress operation and service. Red halos denote error percentages—the larger the halo, the more errors present. The ingress operation you’re currently investigating has a blue animated halo.

If you want to expand the investigation to include downstream operations from other services, select All Downstream Operations at the top right of the Operation Diagram. Depending on trace depth and instrumentation, the complete Operation Diagram may be quite large.

The diagram edges represent relationships between spans, and their accuracy depends on the quality of your instrumentation. Missing spans may cause edges to appear where there is no parent-child relationship. The following can cause missing spans:

  • The Satellite dropped the span (for example, your Satellite pool is not auto-scaling to keep up with traffic).
  • The tracer dropped the span (for example, your application crashed or never called span.flush()).
  • There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).

When you see missing spans, check the Reporting Status page to find the culprit.

Hovering over an operation gives you an error count.

Compare Operation Error Rates

The Operations with Errors table shows the number of errors per minute on an operation for both the baseline and regression time periods, along with the calculated change. From this table you can see where the majority of errors originate, the change in rate, and then click through to full traces to further diagnose.

The table only shows operations that originated the errors, so there may be fewer operations in the table than in the diagram. Operations are considered to originate errors if they have no descendant spans with errors.
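
The Python sketch below illustrates both ideas with illustrative span dictionaries: an error originates at a span only if none of its descendants also report an error, and the table’s rate is that count normalized per minute of the window.

```python
# Count error-originating spans per operation and normalize to errors per minute.
# Span dicts (span_id, parent_id, operation, tags) are illustrative; the "error"
# tag follows the OpenTracing convention.
from collections import Counter

def originating_errors_per_minute(spans, window_minutes):
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    def subtree_has_error(span):
        return any(
            child["tags"].get("error") is True or subtree_has_error(child)
            for child in children.get(span["span_id"], [])
        )

    counts = Counter(
        span["operation"]
        for span in spans
        if span["tags"].get("error") is True and not subtree_has_error(span)
    )
    return {op: count / window_minutes for op, count in counts.items()}

# Run it over each window and compare, as the table does:
# originating_errors_per_minute(regression_spans, 10)
# originating_errors_per_minute(baseline_spans, 10)
```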

Click on an operation name in the table to view example traces that include errors and the operation. Clicking an operation here also causes the Operation diagram to regenerate based on traces that contain the selected operation with errors. If there are traces without errors, examples of those are shown as well.

Click an example trace to view it in the full Trace view. If you want to see more traces from this operation, you can filter the Trace Analysis table (below) to see only spans from that operation.

Correlate Tags to errors

The Tags with Errors table shows you tags that may be causing the increase in errors. It also allows you to compare the errors per minute of spans with those tags in the baseline and the regression windows. Because tags provide contextual data about a span, this table can often give you clues to what might be causing the errors.

For example, in the image below you can see that the server error type, the Too many requests error message, and the 429 error code all have the same increase in error rate, which likely means they all appear in the same spans.
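
These tags typically move together because they are set on the same span when a request fails. The sketch below shows OpenTracing-style error tagging; error and http.status_code are standard OpenTracing tag keys, while error.type and error.message are hypothetical stand-ins for whatever keys your instrumentation uses.

```python
# Tag a span when a request is rejected by a rate limiter, so the error type,
# message, and status code all appear on the same span.
import opentracing
from opentracing.ext import tags as ot_tags

def handle_rate_limited_request():
    tracer = opentracing.global_tracer()
    with tracer.start_active_span("/api/get-store") as scope:
        span = scope.span
        # ... request is rejected by the rate limiter ...
        span.set_tag(ot_tags.ERROR, True)                   # standard OpenTracing tag
        span.set_tag(ot_tags.HTTP_STATUS_CODE, 429)         # standard OpenTracing tag
        span.set_tag("error.type", "server")                # hypothetical custom key
        span.set_tag("error.message", "Too many requests")  # hypothetical custom key
```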

Click on a tag to view example traces from the diagram that include errors and that tag. Clicking here also causes the Operation Diagram to regenerate based on traces that contain the selected tag with errors.

If there are traces without errors, examples of those are shown as well.

Click an example trace to view it in the full Trace view. If you want to see more traces with this tag, you can filter the Trace Analysis table (below) to see only spans from that tag or you can group the current results by that tag.

Observe Machine Metrics

If you instrumented your app using one of the following tracers, Lightstep displays those metrics for the selected service for the time period covering both the regression and baseline.

The metrics reported depend on the language of your app.

By default, the following metrics are shown:

  • CPU Usage (%)
  • Memory (%)
  • Network (bytes)

Add more charts using the dropdown.

Hover over the chart to view details.

Compare Span Data Using the Trace Analysis Table

Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table shows the service, the operation reporting the span, the span’s duration, and the start time from the regression time range. You can add other columns as needed by clicking on the + icon.

You can switch between analyzing traces from the baseline or regression time periods by selecting the time range of interest. After you perform an analysis, for example a group-by, you can switch time ranges and see that same group-by aggregation on the other time range’s traces.