We think that the inventory service may be the issue, but we’ve only seen span data from the /api/update-inventory operation on the iOS service, so let’s dig a little deeper.

  1. Click back on the Trace Analysis tab. On the right, Correlations are listed.

    Instead of having to open a bunch of traces to find a pattern, Lightstep does that for you with Correlations. Correlations searches over all traces and finds patterns in that data that contribute to latency, allowing you to reach a data-driven hypothesis.

    Looking at Correlations, the operation update-inventory on the inventory service shows a very high correlation with latency.
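To build intuition for what Correlations is doing under the hood, here’s a toy sketch: for each attribute (an operation, a tag, and so on), measure how strongly its presence is associated with span latency. The span data, field names, and the simple Pearson-correlation approach below are all invented for illustration; they are not Lightstep’s actual implementation.

```python
from statistics import mean, pstdev

# Invented span data: operation, tags, and latency per span.
spans = [
    {"operation": "update-inventory", "tags": {"customer": "meowsy"}, "latency_ms": 900},
    {"operation": "update-inventory", "tags": {"customer": "wheezer"}, "latency_ms": 750},
    {"operation": "get-catalog", "tags": {"customer": "meowsy"}, "latency_ms": 120},
    {"operation": "get-catalog", "tags": {"customer": "wheezer"}, "latency_ms": 80},
]

def latency_correlation(spans, has_attribute):
    """Pearson correlation between a 0/1 attribute indicator and latency."""
    xs = [1.0 if has_attribute(s) else 0.0 for s in spans]
    ys = [float(s["latency_ms"]) for s in spans]
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

r = latency_correlation(spans, lambda s: s["operation"] == "update-inventory")
print(round(r, 2))  # → 0.99: this operation is strongly associated with latency
```

A value near 1.0 means spans with that attribute cluster at the slow end of the distribution, which is exactly the kind of signal the Correlations panel surfaces.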

  2. Click on that Correlation. The histogram overlays the latency distribution for that operation. Looks like almost every span at the higher end of latency is correlated with the update-inventory operation.

    Looks like some customer tags also have a high correlation.

  3. Click on the customer:meowsy tag to see how that correlation contributes to latency.

    Wow - looks like spans with the meowsy tag value are all at the high end of latency. No wonder there were complaints!

    Better check how much other customers are affected.

  4. In the Correlations panel, click Group by customer.

    A few things happened when we did this. First, note that the Show all Spans in Traces checkbox is active. This means the Trace Analysis table now shows not just the spans that matched our original query, but also all the other spans in those same traces.

    And now the table shows aggregate data by customer. While meowsy shows the highest average latency, we can see that other customers are affected to some degree.

    So our hypothesis at this point is that the update-inventory operation on the inventory service is the root cause of the latency. We know it’s affecting the iOS service and many of our customers’ interactions with the app. Is it affecting other services as well? Let’s zero in on that operation.
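The group-by view above boils down to a familiar aggregation: average latency per customer tag across all spans in the matching traces. A minimal sketch, with invented span data and field names:

```python
from collections import defaultdict

# Invented span data; in practice these would be all spans in matching traces.
spans = [
    {"tags": {"customer": "meowsy"}, "latency_ms": 900},
    {"tags": {"customer": "meowsy"}, "latency_ms": 840},
    {"tags": {"customer": "wheezer"}, "latency_ms": 310},
    {"tags": {"customer": "jellybean"}, "latency_ms": 150},
]

totals = defaultdict(lambda: [0.0, 0])  # customer -> [latency sum, span count]
for span in spans:
    customer = span["tags"].get("customer", "unknown")
    totals[customer][0] += span["latency_ms"]
    totals[customer][1] += 1

averages = {c: s / n for c, (s, n) in totals.items()}
# Sort slowest-first, like the grouped Trace Analysis table.
for customer, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{customer}: {avg:.0f} ms")
```

In this toy data, meowsy tops the list but every customer shows some latency, mirroring what the grouped table told us.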

  5. In the query bar, enter a search for service: inventory and operation: update-inventory and click Run.

  6. Click on Service Diagram.

    Notice that because we now queried on the inventory service, the diagram dynamically redraws to select and center on inventory. And now, along with looking downstream for issues, we can see that the web and android services are upstream, so they are likely experiencing latency as well.
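A service diagram like this can be derived from span relationships: when a parent span in service A has a child span in service B, there is an A → B call edge. The sketch below uses a made-up topology and field names to show how upstream callers of inventory could be found from that edge set; it is only an illustration of the idea, not Lightstep’s implementation.

```python
# Invented spans: each records its service and its parent span (None = root).
spans = {
    "s1": {"service": "web", "parent": None},
    "s2": {"service": "ios", "parent": None},
    "s3": {"service": "android", "parent": None},
    "s4": {"service": "inventory", "parent": "s1"},
    "s5": {"service": "inventory", "parent": "s2"},
    "s6": {"service": "inventory", "parent": "s3"},
    "s7": {"service": "database", "parent": "s4"},
}

# A cross-service parent/child pair implies a caller -> callee edge.
edges = set()
for span in spans.values():
    if span["parent"]:
        caller = spans[span["parent"]]["service"]
        if caller != span["service"]:
            edges.add((caller, span["service"]))

def upstream(service):
    """Services that directly or transitively call `service`."""
    callers = {a for a, b in edges if b == service}
    for c in set(callers):
        callers |= upstream(c)
    return callers

print(sorted(upstream("inventory")))  # → ['android', 'ios', 'web']
```

Everything in `upstream("inventory")` is a candidate victim of inventory’s latency, which is why the redrawn diagram highlights web and android alongside iOS.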

What Did We Learn?

  • Correlations makes it easy to find services, tags, and operations that contribute to latency! No need to open a bunch of traces to figure it out for ourselves.
  • Showing all spans in matching traces, not just the spans that matched the query, makes it easier to verify how widely an issue spreads.
  • The dynamic Service Diagram allows us to easily see downstream and upstream to find other services that may be affected.

We have a hypothesis - now let’s verify it.