Let’s see if the other tools on the page can help us determine which service was deployed.

Use the Operation Diagram to View Operations in the Request Path

Below the histogram are the Operation diagram and the Compare Operations table. The diagram shows all the operations in the request path, starting with the operation you’re investigating (/api/update-inventory on the android service). Like the Compare Attributes table, which surfaces the attribute values most correlated with the regression, the Compare Operations table shows the operations most responsible for critical path latency.

Let’s take a closer look at the diagram. You can see the /api/update-inventory operation at the top. The animated halo lets you know that’s the operation you’re investigating. The yellow halos denote the amount of latency - the larger the halo, the more latency.

Most of the operations shown have the same amount of latency, so they’re likely not the cause. When you drag the diagram up to view more operations, you can see that one of them shows significant latency - write-cache on the inventory service. Hovering over that node shows more information about the operation. This is likely the source of the latency we’re seeing upstream.

Use the Compare Operations Table to Get More Info

Let’s see what the table tells us. As with the Compare Attributes table, the duration shown is the time the operation spent in the critical path during the regression, compared against the baseline.

It looks like critical path latency for write-cache on the inventory service grew by 115ms, and it’s the only operation with a notable change. We also know from the Compare Attributes table that the large_batch=true attribute value is likely involved in the latency. So if we can find out whether write-cache carries that attribute value, it’s highly likely to be the culprit.
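A quick aside on what “carrying that attribute value” means in practice: attributes like large_batch are set on spans by the service’s own instrumentation. The sketch below is only an illustration, assuming the inventory service uses the OpenTelemetry Python SDK; the operation and attribute names come from this walkthrough, while the tracer name, batch-size threshold, and function shape are hypothetical.

```python
from opentelemetry import trace

# Hypothetical instrumentation for the inventory service; the real code
# behind this walkthrough isn't shown, so treat this as a sketch only.
tracer = trace.get_tracer("inventory")

def write_cache(items):
    # Each call produces a "write-cache" span - the operation the
    # Compare Operations table is reporting on.
    with tracer.start_as_current_span("write-cache") as span:
        # The attribute flagged by the Compare Attributes table; the
        # 100-item threshold is an assumption for illustration.
        span.set_attribute("large_batch", len(items) > 100)
        # ... the actual cache write would happen here ...
```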

We’ll be able to determine that in the next step. For now, let’s see if there was a deploy of this service, since we also know from the attribute table that a new version was highly correlated with the latency.

The Compare Operations table lets you go back to the Service Health view for an operation using the More ( ⋮ ) icon. Doing this lets us verify whether there was a deployment at the time the latency on our service started.

Sure enough! Lightstep shows a deployment marker on the inventory service around 11:45 am, and the update-inventory operation on that service started experiencing latency at that same time.
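Where does that deployment marker come from? The walkthrough doesn’t say, but a common setup - and the assumption behind this sketch - is that the tracing backend watches the service.version resource attribute on incoming spans and draws a marker when a new value starts reporting. A minimal OpenTelemetry Python configuration for the inventory service might look like this (the version string is illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes describe the reporting service. Bumping
# service.version when a new build ships is one way a tracing backend
# can place a deployment marker at the moment the new version appears.
resource = Resource.create({
    "service.name": "inventory",
    "service.version": "2024.06.01-build.2",  # illustrative version string
})

trace.set_tracer_provider(TracerProvider(resource=resource))
```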

What we know so far:

  • The spike in latency started around 11:45 am. There were no deployments of the android service at that time, but there was a deployment of the inventory service.
  • The write-cache operation on the inventory service also started experiencing a regression at the same time.
  • The attribute large_batch=true is also likely involved with the latency.

Time to verify our hypothesis that the deploy included a change to the write-cache operation, and to see whether that operation carries the large_batch=true attribute.


What Did We Learn?

  • The Operation diagram shows you all the operations in the request path of the operation you’re investigating, letting you see deep into the system.
  • The diagram uses yellow halos to show the relative latency for an operation during the regression. When you see an operation with a large halo (and there isn’t one below it with a larger halo), you can hypothesize that it’s the latency source.
  • The Compare Operations table lets you see which operations in the path contribute to latency, both during the regression and in the baseline, so you can determine whether latency for another operation is new to the regression.