When you notice an increase in error rate on Lightstep’s Service Health view, you can use the analytical tools to find the source of the errors. This is especially helpful after deploying a service: you can confirm that the deploy introduced no error regressions and, if it did, quickly determine the source and roll back.
Compare Two Time Ranges for Error Rate
When you do notice an increase in the error rate for an operation, you can use the Comparison view to compare a point in time when the error rate was stable (the baseline) to the time when the rate was high (the regression). You then use the analysis tools to identify what caused the increase in errors.
From the Service Directory, select a service. Sparkline charts show changes in latency, error rate, and operation rate for the ingress operations for that service (up to 50). The operations are initially sorted by highest change during the time period initially selected. If you’ve implemented a tag to denote new versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, for the update-inventory operation, a deploy occurred at 10:45 am. Hover over the marker to view details. These markers allow you to quickly correlate a deployment with a possible regression. When a deploy only affects a segment of traffic (like a canary deployment or an A/B test), Lightstep shows you the performance and amount of traffic for each version.
When you find an error spike to investigate, select the operation’s sparkline chart and then click in the spike in the larger Error % chart to create a range of time to collect data from.
Choose a spot in the middle of a spike to avoid collecting data from before the regression occurred. You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.
A shaded red bar shows the time range from which span data will be used for analysis of the regression.
Choose to compare that data with data from directly before the previous deploy, an hour ago, a day ago, or you can select your own baseline window.
When you click Custom baseline window, click into the chart and move the blue block to choose your baseline.
The Comparison view opens to show data from both time periods. The color blue represents the baseline time window and red represents the regression time window.
To change either the Regression or Baseline time, click and move the shaded box. You can also zoom in on a time period by clicking and dragging over the time period you want to zoom in on.
Read on for instructions for using these tools.
Observe Machine Metrics
If you instrumented your app using one of the following tracers, Lightstep displays those metrics for the selected service for the time period covering both the regression and baseline.
The metrics reported depend on the language of your app.
To view machine metrics, expand the Metrics panel.
By default, the following metrics are shown:
- CPU Usage (%)
- Memory (%)
- Network (bytes)
Hover over the chart to view details.
See Operation Request Path
The Operation Diagram shows downstream operations from the original ingress operation and service. Red halos denote error percentages: the larger the halo, the more errors present. The ingress operation you’re currently investigating has a blue animated halo.
If you want to focus the investigation to show operations up to the first operation in another service, select Direct Downstream Operations at the top right of the Operation Diagram.
The diagram edges represent relationships between spans and are dependent on the quality of your instrumentation. Missing spans may cause edges to appear when there is no parent-child relationship. The following can cause missing spans:
- The Satellite dropped the span (for example, your Satellite pool is not auto-scaling to keep up with traffic)
- The tracer dropped the span (for example, your application crashed or never called
- There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).
When you see missing spans, check the Reporting Status page to find the culprit.
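To make the idea of a missing span concrete, the sketch below flags spans whose parent is absent from a collected trace. It is only an illustration: the span dictionaries, the field names `span_id` and `parent_id`, and the `write-cache` operation are hypothetical stand-ins for whatever your tracer exports, not a Lightstep API.

```python
def find_orphaned_spans(spans):
    """Return spans that name a parent that is not present in the trace."""
    known_ids = {span["span_id"] for span in spans}
    return [
        span
        for span in spans
        if span["parent_id"] is not None and span["parent_id"] not in known_ids
    ]


# Hypothetical trace: the span with id "b" was dropped before it was reported.
trace = [
    {"span_id": "a", "parent_id": None, "operation": "update-inventory"},
    {"span_id": "c", "parent_id": "b", "operation": "write-cache"},
]

orphans = find_orphaned_spans(trace)
print([span["operation"] for span in orphans])
```

A span reported as orphaned here points at a parent that was never received, which is the situation in which the diagram may draw an edge that does not reflect a real parent-child relationship.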
Hovering over an operation gives you an error count.
Compare Operation Error Rates
The Operations with Errors table shows the number of errors per minute on the operations in the diagram that originated an error, for both the baseline and regression time periods, along with the calculated change. From this table you can see where the majority of errors originate, the change in rate, and then click through to full traces to further diagnose.
The table only shows operations that originated the errors, so there may be fewer operations in the table than the diagram. Operations are considered to be originating errors if they have no descendant spans with errors.
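The rule for originating errors can be sketched in a few lines. This is an illustrative reading of the definition above, not Lightstep's implementation; the span fields (`span_id`, `parent_id`, `operation`, `error`) and the operation names other than update-inventory are assumptions made for the example.

```python
from collections import Counter, defaultdict


def originating_error_operations(spans):
    """Count error spans per operation, keeping only spans with no erroring descendants."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    def has_error_descendant(span):
        # Walk the subtree below the span and stop at the first error found.
        stack = list(children.get(span["span_id"], []))
        while stack:
            child = stack.pop()
            if child["error"]:
                return True
            stack.extend(children.get(child["span_id"], []))
        return False

    return Counter(
        span["operation"]
        for span in spans
        if span["error"] and not has_error_descendant(span)
    )


# Hypothetical trace: the error starts in charge-card and propagates upward.
trace = [
    {"span_id": "1", "parent_id": None, "operation": "update-inventory", "error": True},
    {"span_id": "2", "parent_id": "1", "operation": "charge-card", "error": True},
    {"span_id": "3", "parent_id": "2", "operation": "db-query", "error": False},
]

print(originating_error_operations(trace))
```

In this trace only charge-card counts as originating the error: update-inventory also reports an error, but it has an erroring descendant, so it would appear in the diagram without appearing in the table.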
Click on an operation name in the table to view example traces that include errors and the operation. Clicking an operation here also causes the Operation Diagram to regenerate based on traces that contain the selected operation with errors. If there are traces without errors, examples of those are shown as well.
Click an example trace to view it in the full Trace view. If you want to see more traces from this operation, you can filter the Trace Analysis table (below) to see only spans from that operation by clicking the Filter by this operation button.
Correlate Tags to Errors
The Tags with Errors table shows you tags that may be correlated with the increase in errors, meaning tags that appear more often on spans with errors. It also allows you to compare the errors per minute of spans with those tags in the baseline and the regression windows. Because tags provide contextual data about a span, this table can often give you clues to what might be causing the errors.
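As a rough illustration of the comparison this table makes, the sketch below counts error spans per tag in each window and converts the counts to errors per minute. It is a simplification that only compares raw rates and does not reproduce Lightstep's correlation analysis; the span shape, the tag key `http.status_code`, and the five-minute window length are assumptions for the example.

```python
from collections import Counter


def tag_errors_per_minute(spans, window_minutes):
    """Errors per minute for each (tag key, tag value) pair seen on error spans."""
    counts = Counter()
    for span in spans:
        if span["error"]:
            for key, value in span["tags"].items():
                counts[(key, value)] += 1
    return {tag: count / window_minutes for tag, count in counts.items()}


def compare_tag_error_rates(baseline_spans, regression_spans, window_minutes):
    """Per-tag error rates for both windows, plus the change between them."""
    base = tag_errors_per_minute(baseline_spans, window_minutes)
    regr = tag_errors_per_minute(regression_spans, window_minutes)
    return {
        tag: {
            "baseline": base.get(tag, 0.0),
            "regression": regr.get(tag, 0.0),
            "change": regr.get(tag, 0.0) - base.get(tag, 0.0),
        }
        for tag in set(base) | set(regr)
    }


# Hypothetical data: 5 rate-limited error spans in the baseline, 10 in the regression.
rate_limited = {"error": True, "tags": {"http.status_code": "429"}}
baseline = [rate_limited] * 5
regression = [rate_limited] * 10

rates = compare_tag_error_rates(baseline, regression, window_minutes=5)
print(rates[("http.status_code", "429")])
```

Tags whose rates rise by the same amount between the two windows are good candidates for appearing on the same spans.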
Learn more about Correlations
A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Lightstep helps you find attributes correlated with latency and errors automatically.
Lightstep Correlations searches over all the span data in all traces that the current query participates in and finds patterns in that data for latency and error contribution, looking at service/operation pairs and tags. This analysis allows you to identify services, operations, and tags that are meaningfully associated with observed regressions. Because Correlations surfaces statistical relationships between these system attributes, you can quickly generate and validate hypotheses during a root cause investigation. You can learn more about correlations here.
For example, in the image below you can see that the 429 status and error code and error type of server all have the same increase in error rate, which likely means they all appear in the same spans.
Click on a tag to view example traces from the diagram that include errors and that tag. Clicking here also causes the Operation Diagram to regenerate based on traces that contain the selected tag with errors.
If there are traces without errors, examples of those are shown as well.
Click an example trace to view it in the full Trace view. If you want to see more traces with this tag, you can filter the Trace Analysis table (below) to see only spans with that tag by clicking Filter by this tag, or you can group the current results by that tag by clicking Group by this tag.
Lightstep shows you the logs from the spans that make up the Operation Diagram. The graph shows the frequency of logs in the spans that Lightstep has analyzed. The table below shows the log messages, the number of times that message appears on spans during both the baseline and regression, and the change in frequency between the two. The service and operation names are shown at the top of the message.
Click the hamburger menu for the log message to see example traces and to copy the log message. Click an example trace to view it in the full Trace view.
You can search for messages using the Search field. The results display the accuracy of the match.
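The baseline-versus-regression comparison of log messages can be pictured as a simple frequency count. The sketch below is a minimal illustration under that assumption; the log messages are invented, and the function is hypothetical rather than part of any Lightstep API.

```python
from collections import Counter


def log_frequency_change(baseline_logs, regression_logs):
    """Count each message in both windows and sort by the largest increase."""
    base = Counter(baseline_logs)
    regr = Counter(regression_logs)
    rows = [
        {
            "message": message,
            "baseline": base[message],
            "regression": regr[message],
            "change": regr[message] - base[message],
        }
        for message in base.keys() | regr.keys()
    ]
    # Largest increase first; ties are broken alphabetically for stable output.
    return sorted(rows, key=lambda row: (-row["change"], row["message"]))


# Hypothetical log messages collected from spans in each window.
baseline_logs = ["cache miss", "cache miss", "retrying request"]
regression_logs = ["cache miss", "retrying request", "retrying request",
                   "retrying request", "rate limit exceeded"]

for row in log_frequency_change(baseline_logs, regression_logs):
    print(row)
```

Messages that are new or much more frequent in the regression window rise to the top, which is usually where an investigation starts.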
Compare Span Data Using the Trace Analysis Table
Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table shows the service, the operation reporting the span, the span’s duration, and the start time from the regression time range. You can add other columns as needed by clicking on the + icon. You can use the Filter by button to see only spans with errors (for example
You can switch between analyzing traces from the baseline or regression time periods by selecting the time range of interest. After you perform an analysis, for example a group-by, you can switch time ranges and see that same group-by aggregation on the other time range’s traces.
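Applying the same group-by to both time ranges can be sketched as one aggregation function called on two sets of spans. The example below is illustrative only: the span fields (`service`, `duration_ms`, `error`), the service names, and the chosen aggregates are assumptions, not the columns Lightstep necessarily reports.

```python
from collections import defaultdict
from statistics import mean


def group_by(spans, column):
    """Aggregate span count, average duration, and error count per value of `column`."""
    groups = defaultdict(list)
    for span in spans:
        groups[span[column]].append(span)
    return {
        key: {
            "count": len(members),
            "avg_duration_ms": mean(span["duration_ms"] for span in members),
            "errors": sum(1 for span in members if span["error"]),
        }
        for key, members in groups.items()
    }


# Hypothetical spans from each time range.
baseline_spans = [
    {"service": "inventory", "duration_ms": 10, "error": False},
    {"service": "inventory", "duration_ms": 30, "error": False},
]
regression_spans = [
    {"service": "inventory", "duration_ms": 40, "error": True},
    {"service": "inventory", "duration_ms": 80, "error": True},
    {"service": "payments", "duration_ms": 20, "error": False},
]

# The same aggregation is applied to each range so the results are comparable.
print(group_by(baseline_spans, "service"))
print(group_by(regression_spans, "service"))
```

Keeping the aggregation identical across both ranges is what makes the two result sets directly comparable.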