Now that the chaos has begun, it’s time to use Lightstep Observability to see what effect the increased latency has on the app.

  1. In Lightstep Observability, navigate back to the Explorer view.
    You can see in the histogram that there are now many more spans at the higher end of latency.

More about Explorer

When you hear that a particular service is slow, or there’s a spike in errors, or a particular customer is having an issue, your first action is to see real-time data regarding the issue. Lightstep Observability’s Explorer view allows you to query all span data from the past hour to see what’s going on.

Explorer consists of five main components. Each of these components provides a different view into the data returned by your query:

  • Query: Allows you to query all span data from the past hour. Every query you make is saved as a Snapshot, meaning you can revisit the query made at this point in time, any time in the future, and see the data just as it was. Results from the query are shown in the Latency Histogram, the Trace Analysis table, and the Service diagram.

  • Latency histogram: Shows the distribution of spans over latency periods. Spans are shown in latency buckets represented by the blue lines. Longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and towards the right, higher latency.

  • Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span. Click a span to view it in the context of the trace.

  • Correlations panel: For latency, shows services, operations, and attributes that are correlated with the latency you are seeing in your results. That is, they appear more often in spans that are experiencing higher latency. For errors, shows attributes that are more often on spans that have errors. The panel can show both positive and negative correlations, allowing you to drill down into attributes that may be contributing to latency, as well as ruling out ones that are not. Find out more about Correlations here.

  • Service diagram: Depicts the service based on your query in relation to all its dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown. Learn more about the Service diagram here.

Using these tools, it’s possible to quickly form and validate hypotheses around the current issue.
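The Correlations idea above can be sketched offline: given a set of spans with attributes and latencies, an attribute that appears more often among slow spans than among all spans is positively correlated with latency. A minimal illustration with hypothetical data (this is not the Lightstep API, just the underlying idea):

```python
# Toy latency-correlation score: frequency of an attribute among slow
# spans minus its frequency overall. Positive = skews toward high latency.
spans = [
    {"latency_ms": 40,  "attrs": {"region": "us-east", "version": "v1"}},
    {"latency_ms": 35,  "attrs": {"region": "us-west", "version": "v1"}},
    {"latency_ms": 900, "attrs": {"region": "us-east", "version": "v2"}},
    {"latency_ms": 850, "attrs": {"region": "us-west", "version": "v2"}},
]

def correlations(spans, threshold_ms):
    slow = [s for s in spans if s["latency_ms"] >= threshold_ms]

    def freq(group, key, value):
        hits = sum(1 for s in group if s["attrs"].get(key) == value)
        return hits / len(group) if group else 0.0

    pairs = {(k, v) for s in spans for k, v in s["attrs"].items()}
    return {p: freq(slow, *p) - freq(spans, *p) for p in sorted(pairs)}

scores = correlations(spans, threshold_ms=500)
print(max(scores, key=scores.get))  # ('version', 'v2') skews slow
```

Here `version=v2` appears in 100% of slow spans but only 50% of all spans, so it gets the highest score; `region` is evenly split and scores zero, effectively ruling it out.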

Learn more about Explorer.


  1. To get a better analysis of span data at the high end of latency, drag your cursor over the spans that are taking longer than . This filters the data to only those highly latent spans. In the Trace Analysis table, sort the Duration column to show the highest latency first. (Screenshot: Explorer view)

  2. Click the top span to open it in the Trace view. (Screenshot: Trace view)

    The map at the top shows the critical path (where most of the latency is found) as a black line. (Screenshot: Critical path)

  3. Click on that critical path to focus on that span’s details.

    It seems the /hipstershop.CurrencyService/Convert gRPC call is having trouble under load. (Screenshot: Trace details)
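The critical path the trace map highlights is, roughly, the chain of spans that directly accounts for the trace’s end-to-end duration. A simplified sketch of that walk over a toy trace (real critical-path analysis also accounts for gaps and overlaps between siblings; the span names here are illustrative):

```python
# Simplified critical-path walk: from the root span, repeatedly descend
# into the child that finishes last, since it dominates the parent's duration.
spans = {
    "frontend": {"parent": None,       "start": 0,  "end": 100},
    "checkout": {"parent": "frontend", "start": 5,  "end": 95},
    "currency": {"parent": "checkout", "start": 10, "end": 90},
    "cart":     {"parent": "checkout", "start": 10, "end": 30},
}

def critical_path(spans):
    children, root = {}, None
    for name, s in spans.items():
        if s["parent"] is None:
            root = name
        else:
            children.setdefault(s["parent"], []).append(name)
    path = [root]
    while children.get(path[-1]):
        # The latest-finishing child is what the parent is waiting on.
        path.append(max(children[path[-1]], key=lambda c: spans[c]["end"]))
    return path

print(critical_path(spans))  # ['frontend', 'checkout', 'currency']
```

In this toy trace, the slow currency span (not the fast cart span) lies on the critical path, mirroring what the Trace view surfaces.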

Now you know that the /hipstershop.CurrencyService/Convert call needs some work handling this kind of traffic. This is exactly the sort of failure that we want to uncover with Chaos Engineering, and we were able to use Lightstep Observability’s tracing data to automatically generate the experiment for us.
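The only reason the Convert call shows up in Explorer at all is that the service is instrumented: conceptually, a span is just a named, timed unit of work with attributes attached. A stdlib-only sketch of that idea (the demo app actually uses OpenTelemetry instrumentation; the operation name and `recorded` list here are illustrative stand-ins):

```python
import time
from contextlib import contextmanager

recorded = []  # stand-in for a tracing backend

@contextmanager
def span(operation, **attrs):
    # Time the enclosed work and record it, the way a tracer records a span.
    start = time.perf_counter()
    try:
        yield
    finally:
        recorded.append({
            "operation": operation,
            "duration_ms": (time.perf_counter() - start) * 1000,
            **attrs,
        })

with span("hipstershop.CurrencyService/Convert", currency="EUR"):
    time.sleep(0.01)  # stand-in for the real conversion work

print(recorded[0]["operation"])  # hipstershop.CurrencyService/Convert
```

Because every call is timed this way, extra latency injected by the chaos experiment shows up immediately in the span durations that Explorer queries.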

What did we learn?

  • Lightstep Observability makes it easy to find issues when your system is under load.