When you notice an increase in error rate on the Service Health view, you can use the analytical tools to find the source of errors. This is especially helpful after deploying a service to ensure there are no regressions in errors, and if there are, to quickly determine the source and roll back.
Compare two time ranges for error rate
When you notice an increase in the error rate for an operation, you can use the RCA view to compare a point in time when the error rate was stable (called the baseline) to the time when the rate was high (called the regression). You then use analysis tools to identify what caused the increase in errors.
- From the Service Directory, select a service and click the Deployments tab. Sparkline charts show the top changes in latency, error rate, and operation rate for the Key Operations for that service. By default, the operations are sorted by the largest change during the selected time period.
More About How Lightstep Observability Measures Change
When determining what has changed in your services, Lightstep Observability compares the baseline SLI timeseries to the comparison SLI timeseries. Those time periods are determined using the data currently visible in the charts.
You can change the amount of time displayed using the time period dropdown at the top right of the page.
The baseline and comparison time periods are determined as follows:
If one or more deployment markers are visible:
- For latency and error rates, Lightstep Observability compares the performance after the selected version to performance in all other versions.
- For operation rate, it compares the rate before the deployment marker to after the deployment marker.
If there are no deployment markers visible:
Lightstep Observability compares the performance of the first half of the time period to the second half.
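The window selection described above can be sketched roughly as follows. This is an illustrative Python sketch only, not Lightstep's implementation: the Window type and both function names are hypothetical, and it covers just the operation-rate case (split at the deployment marker) and the no-marker case (split the visible range in half).

```python
from datetime import datetime
from typing import NamedTuple, Tuple


class Window(NamedTuple):
    """A time range currently visible in the charts (hypothetical type)."""
    start: datetime
    end: datetime


def split_in_half(visible: Window) -> Tuple[Window, Window]:
    """No deployment marker: baseline is the first half, comparison the second."""
    midpoint = visible.start + (visible.end - visible.start) / 2
    return Window(visible.start, midpoint), Window(midpoint, visible.end)


def split_at_marker(visible: Window, marker: datetime) -> Tuple[Window, Window]:
    """Operation rate with a marker: compare before the marker to after it."""
    if not visible.start < marker < visible.end:
        raise ValueError("deployment marker must fall inside the visible range")
    return Window(visible.start, marker), Window(marker, visible.end)


visible = Window(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 13, 0))
baseline, comparison = split_in_half(visible)
print(baseline.end)  # 2024-01-01 12:30:00
```

Changing the time period dropdown changes the visible range, which in this sketch would shift both windows.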
Changes are measured in relative terms (for example, a change from 10ms to 500ms is ranked higher than one from 1s to 2s). The yellow bars on the sparkline chart indicate the amount of change. Lightstep Observability measures two aspects of change: size and continuity. A full bar indicates that a large, sustained change has happened. Smaller bars indicate either a smaller change or one that did not last for the full time period.
The yellow bar means that an SLI had an objectively large change, regardless of service or operation. Lightstep’s algorithm runs on each SLI independently. For example, when the bar displays for an operation’s latency, that means latency has changed – not that its change was greater compared to the other SLIs.
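To see why a relative measure ranks a jump from 10ms to 500ms above a jump from 1s to 2s, here is a minimal sketch. It assumes a plain fractional change as the metric, which is an assumption for illustration; the actual scoring (including how continuity is weighed) is not described here, and relative_change is a hypothetical helper.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after (1.0 means the value doubled)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before


# Latencies in milliseconds, taken from the example in the text.
changes = {
    "10ms -> 500ms": relative_change(10, 500),   # 49.0
    "1s -> 2s": relative_change(1000, 2000),     # 1.0
}

ranked = sorted(changes, key=changes.get, reverse=True)
print(ranked)  # ['10ms -> 500ms', '1s -> 2s']
```

In absolute terms the second change is larger (1000ms versus 490ms), which is why an absolute measure would rank the two the other way around.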
If you’ve implemented an attribute to denote new versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, for the update-inventory operation, a deploy occurred at 12:40 pm. Hover over the marker to view details. These markers allow you to quickly correlate deployment with a possible regression. When a deploy only affects a segment of traffic (like a canary deployment or an A/B test), Lightstep Observability shows you the performance and amount of traffic for each version.
When you find an error spike to investigate, select the operation’s sparkline chart and then click in the spike in the larger Error % chart to create a range of time to collect data from.
Choose a spot in the middle of a spike to avoid collecting data from before the regression occurred. You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.
A shaded red box shows the time range from which span data will be used for analysis of the regression.
Choose to compare that data with data from directly before the previous deploy, from an hour ago, or from a day ago, or select your own baseline window.
When you click Custom baseline window, click into the chart and move the blue box to choose your baseline.
The RCA (root cause analysis) view opens to show data from both time periods. The color blue represents the baseline time window and red represents the regression time window.
To change either the Regression or Baseline time, click and move the shaded box. You can also zoom in by clicking and dragging over the time period you want a closer look at.
Read on for instructions for using these tools.
See operation request path
The Operation Diagram shows downstream, upstream, or all operations from the operation and service you’re investigating. By default, Lightstep applies an error=true filter and builds this diagram using only spans with errors. Red halos denote error percentages. The larger the halo, the more errors are present. The operation you’re currently investigating has a blue animated halo.
Lightstep also shows you the number of logs associated with each operation, based on the filters applied. For example, when error=true is applied as a filter, these counts only include logs on spans with errors.
By default, the diagram shows you all downstream operations. You can change the diagram to show upstream operations or both downstream and upstream operations using the dropdown menu.
The diagram edges represent relationships between spans and depend on the quality of your instrumentation. Missing spans may cause edges to appear when there is no parent-child relationship. The following can cause missing spans:
- The Microsatellite dropped the span (for example, your Microsatellite pool is not auto-scaling to keep up with traffic)
- The tracer dropped the span (for example, your application crashed or never called
- There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).
When you see missing spans, check the Reporting Status page to find the culprit.
Hovering over an operation gives you the error rate and log count.
Compare operation error rates
The Operations with Errors table shows the number of errors per minute on the operations in the diagram that originated an error, for both the baseline and regression time periods, along with the calculated change. From this table you can see where the majority of errors originate, the change in rate, and then click through to full traces to further diagnose.
The table only shows operations that originated the errors, so there may be fewer operations in the table than in the diagram. Operations are considered to be originating errors if they have no descendant spans with errors.
By default, the table is sorted by the amount of change. You can change the sort order by clicking on another column. You can also filter the diagram (and the Log Event Analysis and Trace Analysis tables) to show only span data from an operation by either selecting a node in the diagram or by clicking the Filter icon on the operation’s row in the table.
Click the More ( ⋮ ) icon to view example traces. Click one of the traces to view the full trace in the Trace view. Click View in Service Directory to return to the Service Health view, where you can view the sparkline charts and any deployment markers for this service and operation.
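The "originating" rule can be illustrated with a small sketch. It assumes a simplified, hypothetical Span record with a parent pointer and an error flag (not Lightstep's data model) and keeps only error spans that have no descendant with an error.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """Simplified span record (hypothetical fields, for illustration only)."""
    span_id: str
    parent_id: Optional[str]
    operation: str
    error: bool


def originating_error_spans(spans: List[Span]) -> List[Span]:
    """Return error spans that have no descendant span with an error."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    def has_error_descendant(span: Span) -> bool:
        # Depth-first walk over everything below this span.
        stack = list(children[span.span_id])
        while stack:
            node = stack.pop()
            if node.error:
                return True
            stack.extend(children[node.span_id])
        return False

    return [s for s in spans if s.error and not has_error_descendant(s)]


trace = [
    Span("a", None, "frontend/checkout", error=True),
    Span("b", "a", "api/charge", error=True),
    Span("c", "b", "db/write", error=True),
    Span("d", "a", "cache/get", error=False),
]
originating = originating_error_spans(trace)
print(Counter(s.operation for s in originating))  # Counter({'db/write': 1})
```

Here three operations report errors, but only db/write would appear in the table, because the errors on the other two propagated up from it.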
Correlate attributes to errors
The Attributes with Errors table shows you attributes that may be correlated with the increase in errors—that is, these attributes more often appear on spans with errors. It also allows you to compare the errors per minute in the baseline and the regression windows. Because attributes provide contextual data about a span, this table can often give you clues to what might be causing the errors.
Learn more about Correlations
A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Lightstep Observability helps you find attributes correlated with latency and errors automatically.
Correlations searches over all the span data in all traces that the current query participates in and finds patterns in that data for latency and error contribution, looking at service/operation pairs and attributes. This analysis allows you to identify services, operations, and attributes that are meaningfully associated with observed regressions. By surfacing statistical relationships between these system attributes, Correlations helps you quickly generate and validate hypotheses during a root cause investigation.
For example, in the image below you can see that the 429 status and error code and error type of server all have the same increase in error rate, which likely means they all appear in the same spans.
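The comparison behind such a table can be approximated with a short sketch. It assumes each error span is represented as a dictionary of attributes and that both windows have the same length; the attribute names (http.status_code, error.type) and the function names are illustrative assumptions, and this simple count is not the Correlations algorithm itself.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Attr = Tuple[str, str]  # (attribute key, attribute value)


def errors_per_minute(error_spans: Iterable[Dict[str, str]],
                      window_minutes: float) -> Dict[Attr, float]:
    """Count how often each key/value pair appears on error spans, per minute."""
    counts: Counter = Counter()
    for attributes in error_spans:
        counts.update(attributes.items())
    return {attr: n / window_minutes for attr, n in counts.items()}


def compare_windows(baseline: List[Dict[str, str]],
                    regression: List[Dict[str, str]],
                    window_minutes: float) -> List[Tuple[Attr, float, float]]:
    """Rows of (attribute, baseline rate, regression rate), largest increase first."""
    base = errors_per_minute(baseline, window_minutes)
    regr = errors_per_minute(regression, window_minutes)
    rows = [(a, base.get(a, 0.0), regr.get(a, 0.0)) for a in set(base) | set(regr)]
    return sorted(rows, key=lambda row: row[2] - row[1], reverse=True)


throttled = {"http.status_code": "429", "error.type": "server"}
other = {"http.status_code": "500"}
baseline = [throttled] * 2 + [other]
regression = [throttled] * 20 + [other]

for attr, before, after in compare_windows(baseline, regression, window_minutes=10):
    print(attr, before, after)
```

Because the 429 status and the server error type always occur on the same spans in this sample, they show an identical increase, which mirrors the pattern described above.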
Click the Filter icon on a row in the Compare Attributes table to filter the Operation Diagram, Log Event Analysis, and Trace Analysis table and only see spans with that attribute.
Click the More ( ⋮ ) icon to view example traces. Click one of the traces to view the full trace in the Trace view. Click Group by to group the current results in the Trace Analysis table by this attribute.
View log events
These are logs of events that your tracing instrumentation has appended to the spans. If you used tracing libraries that do not support log appending, you will not see Log Event Analysis.
Lightstep Observability shows you the logs of events from the spans that make up the Operation Diagram (if you’ve filtered the diagram, the logs are filtered as well).
The graph shows the frequency of events in the spans that Lightstep Observability has analyzed. The table below shows the log messages, the number of times that message appears on spans during both the baseline and regression, and the change in frequency between the two. The service and operation names are shown at the top of the message. When patterns are detected, placeholders display a range, type, etc.
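As a rough illustration of how messages might be grouped into patterns and compared across the two windows, the sketch below replaces runs of digits with a placeholder and counts each resulting pattern. The digit-only rule, the placeholder text, and the function names are assumptions for the example; the product's actual pattern detection is not described here.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

_NUMBER = re.compile(r"\d+")


def to_pattern(message: str) -> str:
    """Collapse variable numeric parts so similar messages group together."""
    return _NUMBER.sub("<number>", message)


def frequency_change(baseline: Iterable[str],
                     regression: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Rows of (pattern, baseline count, regression count), largest increase first."""
    base = Counter(to_pattern(m) for m in baseline)
    regr = Counter(to_pattern(m) for m in regression)
    rows = [(p, base[p], regr[p]) for p in set(base) | set(regr)]
    # Sort by increase (descending), then by pattern text for a stable order.
    return sorted(rows, key=lambda row: (-(row[2] - row[1]), row[0]))


baseline_logs = ["timeout after 30 ms", "cache miss for key 7"]
regression_logs = [
    "timeout after 31 ms",
    "timeout after 45 ms",
    "timeout after 120 ms",
    "cache miss for key 9",
]

for pattern, before, after in frequency_change(baseline_logs, regression_logs):
    print(f"{pattern}: {before} -> {after}")
```

With this sample data the timeout pattern rises from one occurrence to three while the cache-miss pattern stays flat, so the timeout row sorts first.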
Hover over a row and click the More icon for a log message to see example traces and to copy the log message. Click an example trace to view it in the full Trace view.
You can search for messages using the Search field. The results display the accuracy of the match and are sorted by accuracy. Pattern placeholders are replaced with the actual text from the message. Use the Filter icon to filter the Log Event Analysis, Operation Diagram, and Trace Analysis table by the log message.
Compare span data using the Trace Analysis table
Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table uses the error=true filter to show only spans with errors. For each span you can see the service, the operation reporting the span, the span’s duration, and the start time from the regression time range. You can add other columns as needed by clicking on the + icon.
Unlike the Trace Analysis table on the Explorer view, this table shows data from spans not just in the service and operation you’re investigating, but from all spans that participate in the same trace.
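The difference in scope can be sketched as follows, assuming a hypothetical Span record that carries a trace ID: first find the traces that contain the service and operation under investigation, then keep every span from those traces.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Simplified span record (hypothetical fields, for illustration only)."""
    trace_id: str
    service: str
    operation: str


def spans_in_matching_traces(spans: List[Span], service: str,
                             operation: str) -> List[Span]:
    """Return all spans from any trace that includes the given service/operation."""
    matching_traces = {
        s.trace_id for s in spans
        if s.service == service and s.operation == operation
    }
    return [s for s in spans if s.trace_id in matching_traces]


spans = [
    Span("t1", "inventory", "update-inventory"),
    Span("t1", "database", "write"),
    Span("t2", "frontend", "render"),
]
selected = spans_in_matching_traces(spans, "inventory", "update-inventory")
print([s.operation for s in selected])  # ['update-inventory', 'write']
```

The database span is included even though it belongs to a different service, because it participates in the same trace as the investigated operation.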
You can switch between analyzing traces from the baseline or regression time periods at the top of the table. Any filtering or grouping is done for the data for both time periods.
Filter the data
By default, Lightstep Observability shows you all the data available from spans from the two time periods (baseline and regression). You can filter the data to narrow in on the cause of regression. When you apply filters, the Operation Diagram, Log Event Analysis, and Trace Analysis table all repopulate with data that matches the filters.
Multiple filter criteria are joined by the AND operator. Because a span belongs to exactly one service and operation, a second service/operation pair would match no spans. Therefore, you can only filter by one service/operation pair.
You can apply filters in a number of different ways:
From the Narrow your results button: Clicking this button brings up a query builder (similar to the one in Explorer). You can add any operation, service, or attribute that exists in the data. This button is “sticky” and always displays at the top of a component that can be filtered.
To filter by log messages, use the filter icon on the Log Event Analysis table.
From the Operation Diagram: Select a node in the diagram to filter by that service and operation.