We think that the inventory service may be the issue, but we’ve only seen span data from the /api/update-inventory operation on the iOS service, so let’s dig a little deeper.
Click back on the Trace Analysis tab. On the right, Correlations are listed.
Instead of having to open a bunch of traces to find a pattern, Lightstep does that for you with Correlations. Correlations searches over all traces and finds patterns in that data that contribute to latency, allowing you to reach a data-driven hypothesis.
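As a rough mental model only (this is not Lightstep's actual algorithm), a correlation score can be thought of as how over-represented a tag is among the slow spans compared with all spans. The sketch below assumes a hypothetical span shape (`latency_ms` plus a `tags` dict) and made-up sample data; the function name and the scoring formula are illustrative assumptions.

```python
from collections import Counter


def tag_latency_correlation(spans, threshold_ms):
    """Score each (key, value) tag by how over-represented it is among slow spans.

    Returns {tag: share_among_slow - share_among_all}. Values lie in [-1, 1];
    a positive value means the tag shows up more often on slow spans.
    Illustrative scoring only, not Lightstep's implementation.
    """
    slow = [s for s in spans if s["latency_ms"] >= threshold_ms]
    if not spans or not slow:
        return {}
    all_counts = Counter(tag for s in spans for tag in s["tags"].items())
    slow_counts = Counter(tag for s in slow for tag in s["tags"].items())
    return {
        tag: slow_counts[tag] / len(slow) - count / len(spans)
        for tag, count in all_counts.items()
    }


# Hypothetical sample data, not taken from the sandbox.
spans = [
    {"latency_ms": 900, "tags": {"customer": "meowsy"}},
    {"latency_ms": 850, "tags": {"customer": "meowsy"}},
    {"latency_ms": 120, "tags": {"customer": "other"}},
    {"latency_ms": 100, "tags": {"customer": "other"}},
]
scores = tag_latency_correlation(spans, threshold_ms=500)
```

In this toy data every slow span carries the same customer tag, so that tag gets a positive score and the other one a negative score.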
Looking at Correlations, the operation on the inventory service shows a very high correlation with latency.
Click on that Correlation. The histogram overlays the latency distribution for that operation. Looks like almost every span at the higher end of latency is correlated with the
Looks like some customer tags also have a high correlation.
Click on the customer:meowsy tag to see how that correlation contributes to latency.
Wow - looks like spans with the meowsy tag value are all at the high end of latency. No wonder there were complaints!
Better check how much other customers are affected.
In the Correlations panel, click Group by customer.
A few things happened when we did this. First, note that the Show all Spans in Traces checkbox is active. This means that now the Trace Analysis table is not going to show just spans that matched our original query, but also all the other spans in the same traces.
And now the table shows aggregate data by customer. While meowsy shows the highest average latency, we can see that other customers are affected to some degree.
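Conceptually, grouping by a tag is an aggregation over spans. A minimal sketch, assuming the same hypothetical span shape (`latency_ms` and a `tags` dict) and invented sample values; the helper name and the customer `customer-b` are made up for illustration:

```python
from collections import defaultdict
from statistics import mean


def average_latency_by(spans, tag_key):
    """Group spans by the value of tag_key and return rows of
    (tag value, average latency in ms, span count), slowest group first."""
    groups = defaultdict(list)
    for span in spans:
        # Spans without the tag are collected under a placeholder label.
        groups[span["tags"].get(tag_key, "(none)")].append(span["latency_ms"])
    return sorted(
        ((value, mean(latencies), len(latencies)) for value, latencies in groups.items()),
        key=lambda row: row[1],
        reverse=True,
    )


# Hypothetical sample data, not taken from the sandbox.
spans = [
    {"latency_ms": 900, "tags": {"customer": "meowsy"}},
    {"latency_ms": 700, "tags": {"customer": "meowsy"}},
    {"latency_ms": 300, "tags": {"customer": "customer-b"}},
    {"latency_ms": 100, "tags": {"customer": "customer-b"}},
]
rows = average_latency_by(spans, "customer")
```

Sorting by average latency puts the most affected customer at the top, which is the same reading we took from the table.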
So our hypothesis at this point is that the update-inventory operation on the inventory service is the root cause of latency. We know that’s affecting the iOS service and many of our customers’ interactions with the app. Is it affecting other services as well? Let’s zero in on that operation.
In the query bar, enter a search for operation: update-inventory and click Run.
Click on Service Diagram.
Notice that because we now queried on inventory, the diagram dynamically redraws to select and center on inventory. And now along with looking downstream for issues, we can see that because the android services are upstream, they are likely experiencing latency as well.
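The upstream reasoning can be sketched as a walk over a call graph: every service that directly or indirectly calls a slow service may see that latency too. This is only an illustration of the idea, not how the Service Diagram is built; the edge list below is an assumption, and the `database` service is hypothetical.

```python
from collections import deque


def upstream_services(calls, target):
    """Return every service that directly or transitively calls target.

    calls is an iterable of (caller, callee) pairs.
    """
    # Build a reverse index: callee -> set of direct callers.
    callers = {}
    for caller, callee in calls:
        callers.setdefault(callee, set()).add(caller)

    # Breadth-first walk from the target back through its callers.
    seen = set()
    queue = deque([target])
    while queue:
        service = queue.popleft()
        for caller in callers.get(service, ()):
            if caller not in seen and caller != target:
                seen.add(caller)
                queue.append(caller)
    return seen


# Assumed call edges for illustration; "database" is a hypothetical service.
calls = [("iOS", "inventory"), ("android", "inventory"), ("inventory", "database")]
affected = upstream_services(calls, "inventory")
```

Here the walk finds the callers of inventory and leaves out the downstream service, matching the upstream view in the diagram.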
What Did We Learn?
- Correlations makes finding other services, tags, and operations that are contributing to latency easy! No need to open a bunch of traces to figure it out for ourselves.
- Adding all spans that contribute to a trace increases the ability to verify issues.
- The dynamic Service Diagram allows us to easily see downstream and upstream to find other services that may be affected.
We have a hypothesis - now let’s verify it.