Now that we know there’s a latency problem on the /api/update-inventory operation, let’s see if we can find out why. Let’s compare the performance of that operation during the regression against a time range when we know it was healthy, and then look for differences between the two.

To start the investigation, you click into the regression in the Latency chart and choose the time range to compare it to. If you suspect a deploy caused the regression, you can choose to view data from before that deploy. In this case, there are no deployment markers on the chart, so we know a deployment of this service isn't the cause.

Check out this Learning Path specifically about performing root cause analysis after a deploy.

Looking at the chart, the operation appears healthy an hour prior (the latency line is flat), so you can choose that range.

When you do, you’re taken to the RCA page and can now dive into the comparison of data between the two time periods.

There are a lot of tools here! Let’s see how they can help.

At the top of the page, you can see the same three charts as before (latency, error rate, and operation rate) for the operation you’re investigating, this time with the baseline time range highlighted in blue and the regression time range in yellow. You can move those boxes to change the range if you need to. Below those charts is the same Metrics accordion where you can view the machine metrics for this operation.

The tools we’ll use to find the source of regression start below the metrics.

Data for these tools is taken from an aggregation of spans generated by the operation during the two time periods chosen, and in some cases, also from spans generated by operations from this and other services that participated in the same request. Lightstep’s algorithm for capturing that trace data ensures that a full spectrum of performance is collected from the Satellites.

Use the Histogram to Determine the Breadth of the Latency

First up is the histogram. The x-axis on this chart is the duration of the /api/update-inventory spans, taken from an aggregate of data during the regression. The height of the yellow bars represents how many spans took that amount of time to complete.

Consider each bar to be a bucket that holds the spans whose duration matches that point on the x-axis. So the higher the yellow bar, the more spans had that duration.
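The bucketing the histogram performs can be sketched in a few lines of Python. The durations and bucket width below are made up for illustration; Lightstep chooses its own bucket boundaries.

```python
from collections import Counter

def latency_histogram(durations_ms, bucket_width_ms=100):
    """Bucket span durations the way the histogram's x-axis does:
    each bucket holds the spans whose duration falls in that range."""
    buckets = Counter((d // bucket_width_ms) * bucket_width_ms
                      for d in durations_ms)
    return dict(sorted(buckets.items()))

# Hypothetical /api/update-inventory span durations (ms) during the regression.
regression = [120, 130, 140, 450, 470, 480, 490, 500, 510]
print(latency_histogram(regression))
# {100: 3, 400: 4, 500: 2}  <- taller "bars" at 400-500ms signal many slow spans
```

A healthy baseline would cluster in the low buckets; a regression like the one above shows extra weight at the long-duration end.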

The blue line overlaid on top is the shape of the histogram for the baseline. This overlay helps you identify the scope of the latency: you can see how much the latency has changed and for how many requests.

Taking a look at this histogram, you can see that the regression has buckets of spans at the longer-duration end of the chart, outside of the baseline overlay. So you can deduce that during the regression, /api/update-inventory wasn’t just slower overall; a significant number of spans took longer to complete.

OK, so you know that there was higher latency during the regression, and that it likely affected many of the requests. But this still doesn’t pinpoint the origin. Let’s dig further.

Use the Compare Attributes Table to Find Span Attributes Correlated with Latency

To the right of the histogram is the Compare Attributes table. This table shows you attributes that consistently appeared on spans at the higher-latency end of the histogram.

This table is an excellent example of why using attributes in your instrumentation is so essential. The more attributes you attach to your spans, the better this table can help you find issues!

This table compares the latency of spans that have a given attribute in the baseline to spans with that attribute in the regression. Unlike the histogram, this table includes data from an aggregate of spans in the full request (not just spans from the /api/update-inventory operation). Additionally, instead of showing a span’s total duration, the table shows the time the span spent in the critical path.

Critical Path?
During a request, a parent span might be waiting for a child span (or even a child of that child) to complete before it can finish. The duration shown in this table counts only the time when the span’s operation was “active,” giving you a better sense of where the issue is. Because Lightstep only considers the time when the operation is actually doing something, you won’t get false positives just because an operation was waiting around for somebody else.

More About the Critical Path

Consider this request shown in the Trace view of Lightstep. Each horizontal bar represents the full duration of an operation’s span. You can see that the spans towards the top take much longer, mostly because they’re waiting for the spans further down the stack to complete. Lightstep shows you the critical path using a black line, allowing you to see how long that operation was actually active during the request. And that’s really the only time you care about, right?
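Under some simplifying assumptions, a span’s “active” time can be approximated as its own duration minus the time covered by its (possibly overlapping) child spans. This is only a sketch of the idea; Lightstep’s actual critical-path computation walks the whole trace. The span intervals below are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into disjoint ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def active_time(span, children):
    """Approximate critical-path time: the span's duration minus
    the time it spent only waiting on its child spans."""
    start, end = span
    waiting = sum(min(e, end) - max(s, start)
                  for s, e in merge_intervals(children)
                  if s < end and e > start)
    return (end - start) - waiting

# Hypothetical parent span (0..100ms) waiting on two overlapping children.
parent = (0, 100)
children = [(10, 40), (30, 70)]
print(active_time(parent, children))  # 40 (waiting on children from 10..70)
```

A span with no children is active for its full duration; a span that spends most of its time waiting on children contributes little to the critical path, which is exactly why the table ignores that waiting time.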


Looking at the table, you can see that spans with the attribute large_batch and a value of true had a critical path duration of 1.1 seconds, and that attribute didn’t appear on spans in the baseline. That’s a definite clue.

You can also see that the attribute service.version with a value of v1.14.575 had a critical path duration of 504 milliseconds and didn’t appear in any spans from the baseline. Further down in the table, you can see that the previous version showed the same duration for the regression and baseline. So it appears some service was deployed during the regression, and that deploy may have caused the latency.
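The kind of comparison the table performs can be sketched as grouping spans by attribute value and comparing average critical-path time across the two ranges. The span data below is illustrative, mirroring the values read off the table (504ms for the new service.version, 1.1s for large_batch spans).

```python
from collections import defaultdict
from statistics import mean

def compare_attribute(spans_baseline, spans_regression, key):
    """Compare mean critical-path time per attribute value across the
    two time ranges. Each span is (attrs_dict, critical_path_ms).
    Returns {value: (baseline_mean, regression_mean)}; None means the
    value never appeared in that range."""
    def by_value(spans):
        groups = defaultdict(list)
        for attrs, cp_ms in spans:
            if key in attrs:
                groups[attrs[key]].append(cp_ms)
        return {v: mean(ms) for v, ms in groups.items()}

    base, reg = by_value(spans_baseline), by_value(spans_regression)
    return {v: (base.get(v), reg.get(v)) for v in set(base) | set(reg)}

# Hypothetical aggregated spans from the full request.
baseline = [({"service.version": "v1.14.574"}, 500)]
regression = [({"service.version": "v1.14.575"}, 504),
              ({"large_batch": True}, 1100)]
print(compare_attribute(baseline, regression, "large_batch"))
# {True: (None, 1100)}  <- appears only during the regression: a clue
```

A value that shows up only in the regression column, or whose regression mean is far above its baseline mean, is exactly the kind of attribute the table surfaces.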

What we know so far:

  • The spike in latency started around 11:45 am. There were no deployments of the android service at that time.
  • The histogram shows that many requests that included /api/update-inventory during the regression have high latency.
  • Comparing attributes on spans that were part of the same requests as the /api/update-inventory shows that there may have been a deploy of another service that caused issues with writing large batches, but we’re not yet sure which one it was.

Let’s see if we can find out which service that was!

What Did We Learn?

  • The RCA page has several useful tools that allow you to compare performance between baseline and regression time ranges.
  • The histogram shows you latency of the operation you’re investigating for both the baseline and regression, allowing you to quickly determine if the latency is widespread.
  • The Compare Attributes table correlates attributes on spans from the request with latency, surfacing attributes that more frequently appear on spans that spend more time in the critical path.