Now that we know there’s a spike in the error rate on the /api/get-store operation, let’s find out why. We’ll compare that operation’s performance during the regression against a time range when we know it was healthy and look for differences between the two.

To start the investigation, you click into the regression in the Error % chart and choose the time range to compare it to. If you suspect a deploy caused the regression, you can compare against data from just before that deploy. In this case there are no deployment markers on the chart, so we know a deployment of this service isn’t the cause.

Check out this Learning Path specifically about performing root cause analysis after a deploy.

Looking at the chart, it seems the operation was healthy an hour ago (the error line is flat), so you can choose that as the baseline.

When you do, you’re taken to the RCA page and can now dive into the comparison of data between the two time periods.

There are a lot of tools here! Let’s see how they can help.

At the top of the page, you can see the same three charts as before (latency, error rate, and operation rate) for the operation you’re investigating, this time with the baseline time range highlighted in blue and the regression time range in yellow. You can move those boxes to change the range if you need to. Below those charts is the same Metrics accordion where you can view the machine metrics for this operation.

The tools we’ll use to find the source of regression start below the metrics, in the Analyze your regression section.

Data for these tools comes from an aggregation of spans generated by the operation during the two time periods chosen and, in some cases, also from spans generated by operations on this and other services that participated in the same requests. Lightstep’s algorithm for capturing that trace data ensures that a full spectrum of performance is collected from the Satellites.
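
To make “aggregation of spans from two time periods” concrete, here’s a purely illustrative sketch in Python. The Span shape, field names, and counts are made up and are not Lightstep’s internal data model or sampling algorithm:

```python
# Purely illustrative: a toy aggregation of spans from a baseline window and a
# regression window. The Span class and its fields are hypothetical.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Span:
    service: str
    operation: str
    has_error: bool

def error_counts(spans):
    """Count error spans per (service, operation) pair."""
    return Counter((s.service, s.operation) for s in spans if s.has_error)

baseline_spans = [Span("store-server", "get-store-data", False)] * 50
regression_spans = (
    [Span("store-server", "get-store-data", True)] * 40
    + [Span("store-server", "get-store-data", False)] * 10
)

baseline = error_counts(baseline_spans)
regression = error_counts(regression_spans)

# Operations whose error count grew between the two windows stand out.
for key, count in regression.items():
    print(key, "errors:", baseline.get(key, 0), "->", count)
```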

Filter the Data

The first thing you might notice is that the data has been filtered. Lightstep allows you to filter the data used to create the Operations diagram, Log Analysis, and Trace Analysis table to narrow down the investigation. By default, the Error RCA page filters the data to show only spans that contain errors, so you don’t have to sift through data that you know isn’t causing the regression you’re investigating.
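
For context on what an “error span” is in the first place: if your services are instrumented with OpenTelemetry (which Lightstep ingests), a span is typically flagged as an error by setting its status. A minimal sketch, with a hypothetical operation name and fetch_store() helper, and SDK/exporter setup omitted:

```python
# A minimal sketch, assuming OpenTelemetry Python instrumentation. It shows how
# a span ends up flagged as an error, which is what an errors-only filter keys
# on. The operation name and fetch_store() helper are illustrative.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("store-server")

def fetch_store(store_id):
    # Stand-in for the real downstream call; raise to simulate a failure.
    raise RuntimeError("store lookup failed")

def get_store_data(store_id):
    with tracer.start_as_current_span("get-store-data") as span:
        try:
            return fetch_store(store_id)
        except Exception as exc:
            # Record the exception and mark the span as an error so that
            # error filters and RCA tooling can find it.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```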

Use the Operation Diagram and Tables to Find the Error Source

The Operation diagram shows all the operations in the request path, starting with the operation you’re investigating (/api/get-store on the android service). The Operations with Errors table shows operations in the request’s path that originated errors. The Compare Attributes table shows attributes that most frequently appear on spans with errors.

Let’s take a closer look at the diagram. You can see the /api/get-store operation at the top (the animated halo tells you it’s the operation you’re investigating). The red halos denote the number of errors: the larger the halo, the more errors.

Most of the operations shown have the same number of errors, so it’s likely that errors are “bubbling up” from the bottom of the stack. When you drag the diagram up to view more operations, you can see that get-store-data on the store-server service seems to be where the errors originate.

And looking at the Operations with Errors table, you can see that get-store-data is the only operation listed and that it has close to 400 more errors in the regression than in the baseline.

So we can be fairly confident that the get-store-data operation is the source, but what happened? Why is it suddenly throwing errors?

Use the Compare Attributes Table to Get More Info

Below the Operations with Errors table is the Attributes with Errors table. This table shows you attributes that appear consistently on spans with errors. For each attribute, it compares the errors per minute on spans with that attribute in the baseline to the errors per minute on spans with that attribute in the regression.

This table is an excellent example of why using attributes in your instrumentation is so essential. The more attributes you attach to your spans, the better this table can help you find issues!
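
As a rough sketch of what that instrumentation can look like, assuming OpenTelemetry Python: attributes like service.version are usually set once on the Resource so they land on every span the service emits, while request-level details go on individual spans. The attribute values below are taken from this walkthrough or are hypothetical, and the exporter configuration for sending data to Lightstep is omitted:

```python
# A minimal sketch, assuming OpenTelemetry Python instrumentation.
# service.version is normally a resource attribute (applied to every span the
# service emits); per-request details are set as span attributes.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "store-server",
    "service.version": "v10.8.575",   # value taken from this walkthrough
})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer("store-server")

with tracer.start_as_current_span("get-store-data") as span:
    # Span-level attributes give the attribute comparison more to work with.
    span.set_attribute("store.id", "1234")        # hypothetical attribute
    span.set_attribute("http.status_code", 200)   # semantic-convention attribute
```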

Looking at the table, you can see that spans with the attribute service.version and a value of v10.8.575 had the largest increase in errors, and that there were no spans in the baseline with that attribute/value pair. This likely means that a deploy occurred (changing the value of the attribute), and that this deployment introduced errors. That’s a definite clue.

You can also see that 429 HTTP errors were introduced in the regression.
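
For reference, a 429 typically surfaces in trace data as an HTTP status attribute on the span. A hedged sketch of how that might be recorded; HTTP auto-instrumentation libraries usually do this for you, and the call_downstream() helper and values are illustrative:

```python
# Illustrative only: recording an HTTP status code on a span and marking the
# span as an error when the status indicates a failure (e.g. 429).
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("store-server")

def call_downstream():
    # Stand-in for a real HTTP call that is being rate limited.
    return 429

with tracer.start_as_current_span("get-store-data") as span:
    status_code = call_downstream()
    span.set_attribute("http.status_code", status_code)  # older convention name
    if status_code >= 400:
        span.set_status(Status(StatusCode.ERROR, "HTTP 429 Too Many Requests"))
```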

Let’s see if there was a deploy for the store-server service around the time the error spike occurred.

The Operations with Errors table lets you go back to the Service Health view for an operation using the More ( ⋮ ) icon. Doing this lets us verify whether there was a deployment at the time the error rate on our service increased.

Sure enough! Lightstep shows a deployment marker on the store-server service at 3:23pm, the same time we saw the error spike on the android service. And the get-store-data operation on the store-server service started experiencing a spike in errors at that point.

What we know so far:

  • The spike in errors on the android service started around 3:30 pm. There were no deployments of the android service at that time.
  • The Operations diagram shows errors bubbling up through the stack, seemingly originating from the get-store-data operation on the store-server service.
  • The Operations with Errors table tells us that the get-store-data operation is the only operation that’s originating errors.
  • Comparing attributes on spans that were part of the same requests as /api/get-store showed us that there may have been a deploy of another service. Going back to the Service Health view from the Operations with Errors table, we found that there was a deploy of the store-server service at the same time the error rate spiked.
  • Many spans had the 429 (too many requests) error. But we don’t know for sure that they are coming from the get-store-data operation.

Let’s see if we can find out whether those errors are definitely coming from the get-store-data operation.


What Did We Learn?

  • The RCA page has several useful tools that allow you to compare performance between baseline and regression time ranges.
  • The Operations diagram shows you all the operations in the request path of the operation that you’re investigating, letting you see deep into the system.
  • The Operations with Errors table shows you operations that originated errors, allowing you to rule out services that are only passing errors up the stack.
  • The Attributes with Errors table shows attributes that appear most frequently on spans with errors.