Support has let you know that customers are complaining about the slowness of the iOS app when trying to update their inventory. It’s your job to figure out why that’s happening. It could be the app, something in the API call…anything really. How do you know where to start?
Let’s start with the Explorer view. This is where you can run very specific queries, test hypotheses, find correlated issues, and narrow your search down to a possible root cause.
More about Explorer
When you hear that a particular service is slow, or there’s a spike in errors, or a particular customer is having an issue, your first action is to see real-time data regarding the issue. Lightstep’s Explorer view allows you to query all span data currently in the Satellites to see what’s going on.
Explorer consists of five main components. The Query component selects the data, and each of the others provides a different view into the data returned by your query:
Query: Allows you to query on all span data currently in the Satellites recall window. Every query you make is saved as a Snapshot, meaning you can revisit the query made at this point in time, any time in the future, and see the data just as it was. Results from the query are shown in the Latency Histogram, the Trace Analysis table, and the Service diagram.
Latency histogram: Shows the distribution of spans over latency periods. Spans are shown in latency buckets represented by the blue lines. Longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and towards the right, higher latency.
Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span. Click a span to view it in the context of the trace.
Correlations panel: For latency, shows services, operations, and tags that are correlated with the latency you are seeing in your results. That is, they appear more often in spans that are experiencing higher latency. For errors, shows tags that are more often on spans that have errors. The panel can show both positive and negative correlations, allowing you to drill down into attributes that may be contributing to latency, as well as ruling out ones that are not. Find out more about Correlations here.
Service diagram: Depicts the service based on your query in relation to all its dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown. Learn more about the Service diagram here.
Using these tools, it’s possible to quickly form and validate hypotheses around the current issue.
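To build some intuition for what a correlation means here, the following is a minimal sketch, not Lightstep's actual algorithm. It assumes a hypothetical list of span records (the field names `latency_ms` and `tags`, the tag values, and the 500 ms threshold are all made up for illustration) and compares how often each tag appears on slow spans versus fast ones.

```python
from collections import Counter

# Hypothetical span records; the field names and tag values are illustrative,
# not Lightstep's schema.
spans = [
    {"latency_ms": 40, "tags": {"region": "us-east", "large_batch": "false"}},
    {"latency_ms": 55, "tags": {"region": "us-west", "large_batch": "false"}},
    {"latency_ms": 900, "tags": {"region": "us-east", "large_batch": "true"}},
    {"latency_ms": 1200, "tags": {"region": "us-west", "large_batch": "true"}},
]


def tag_lift(spans, threshold_ms):
    """For each (tag, value) pair, return its share among slow spans
    minus its share among fast spans."""
    slow = [s for s in spans if s["latency_ms"] >= threshold_ms]
    fast = [s for s in spans if s["latency_ms"] < threshold_ms]

    def share(group):
        # Fraction of spans in the group that carry each (tag, value) pair.
        if not group:
            return {}
        counts = Counter(pair for s in group for pair in s["tags"].items())
        return {pair: n / len(group) for pair, n in counts.items()}

    slow_share, fast_share = share(slow), share(fast)
    pairs = set(slow_share) | set(fast_share)
    return {p: slow_share.get(p, 0.0) - fast_share.get(p, 0.0) for p in pairs}


lift = tag_lift(spans, threshold_ms=500)
```

A positive value means the tag shows up more often on slow spans (a positive correlation), a negative value means it shows up more often on fast spans, and a value near zero suggests the tag can probably be ruled out.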
Let’s get started
From the left navigation bar, click Explorer.
This page gives you a view into the span data from the recall window (working memory) of the Satellites - the part of the Lightstep architecture that collects data from the tracing instrumentation in your app.
The histogram at the top shows you current latency based on 100% of the span data sent by the application. Below the histogram, the Trace Analysis table shows exemplar spans from the histogram that represent all ranges of performance, including those that are otherwise statistically interesting.
You can run queries on this data based on your instrumented services, operations, and tags. You can use the OpenTracing tags or create your own to support any way you need to slice and dice your data. Lightstep supports unlimited cardinality!
Let’s query for spans from the update-inventory API call (since this is where customers are seeing issues).
Click in the Search bar, select Service and choose service: iOS. Then select Operation, and choose operation: /api/update-inventory, and click Run.
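Conceptually, this query is a filter over span data. The sketch below is only an illustration of that idea: the span records, their field names, and the `query_spans` helper are hypothetical and are not part of any Lightstep API. Only the service and operation values come from the step above.

```python
# Hypothetical span records; field names are illustrative only.
spans = [
    {"service": "iOS", "operation": "/api/update-inventory", "latency_ms": 820},
    {"service": "iOS", "operation": "/api/get-catalog", "latency_ms": 95},
    {"service": "android", "operation": "/api/update-inventory", "latency_ms": 140},
    {"service": "iOS", "operation": "/api/update-inventory", "latency_ms": 1310},
]


def query_spans(spans, service, operation):
    """Keep only the spans that match both the service and the operation."""
    return [
        s for s in spans
        if s["service"] == service and s["operation"] == operation
    ]


matches = query_spans(spans, service="iOS", operation="/api/update-inventory")
latencies = [s["latency_ms"] for s in matches]
```

Everything you see after you click Run (the histogram, the table, and the diagram) is built from the matching spans only.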
The latency histogram redraws to show the latency distribution for the spans that match your query. Spans are shown in latency buckets - longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and lines towards the right have higher latency.
By default, a marker shows where the 95th percentile lies (you can change that using the dropdown). You can add overlays that show the same data from one hour, one day, and one week prior.
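If you want to see how latency buckets and a 95th percentile marker relate to the raw numbers, here is a small sketch. It assumes made-up latency values, fixed-width 100 ms buckets, and the nearest-rank percentile definition. The histogram in Explorer may use a different bucket scale and percentile method.

```python
from collections import Counter

# Made-up span latencies in milliseconds, for illustration only.
latencies_ms = [12, 15, 18, 22, 30, 35, 48, 60, 95, 130,
                250, 400, 410, 900, 1500, 20, 25, 28, 33, 45]


def bucket_counts(latencies, bucket_width_ms):
    """Count spans per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter((int(v) // bucket_width_ms) * bucket_width_ms for v in latencies)
    return dict(sorted(counts.items()))


def percentile(latencies, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it. Expects a non-empty list and an integer pct."""
    ordered = sorted(latencies)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


buckets = bucket_counts(latencies_ms, bucket_width_ms=100)
p95 = percentile(latencies_ms, 95)
```

With this sample data most spans land in the lowest bucket, while the 95th percentile sits far out in the tail. That long tail is the shape to look for when customers report slowness.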
Select 1 hour prior to view an overlay of trace data (in red) from the Satellites one hour ago.
It looks like things are getting slightly worse.
Uncheck 1 hour prior.
Let’s see if we can find something downstream of the iOS service that might be causing latency.
Next to the Trace Analysis tab, click Service Diagram. The Service Diagram shows all potential code paths and dependencies based on the current query.
The Service diagram is dynamic, meaning it always shows you exactly what you’ve queried on using the same data as the Trace Analysis table. Services that may be contributing to latency are shown with a yellow halo (the size of the halo is relative to the amount of contributing latency). Services with errors show a red halo. In this case, we can see that the inventory service has high latency and may be what’s causing the slowness.
Using the Service diagram, you can determine not only the hot spots that are contributing meaningful latency but, just as important, where there are no latency issues. That keeps you from alerting and creating work for the wrong people.
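One way to think about "contributing latency" is self time: the time a span spends doing its own work rather than waiting on its children. The sketch below is an assumption about how such a number could be computed, not a description of how Lightstep sizes the halos. The trace data, field names, and the `api-server` and `auth` services are hypothetical, and the calculation assumes child spans run one after another without overlapping.

```python
from collections import defaultdict

# Hypothetical spans from a single trace; only "iOS" and "inventory" are
# service names from this walkthrough, the rest are made up.
trace = [
    {"id": "a", "parent": None, "service": "iOS", "duration_ms": 1000},
    {"id": "b", "parent": "a", "service": "api-server", "duration_ms": 900},
    {"id": "c", "parent": "b", "service": "auth", "duration_ms": 50},
    {"id": "d", "parent": "b", "service": "inventory", "duration_ms": 700},
]


def self_time_by_service(trace):
    """Sum each service's self time: span duration minus the total duration
    of its direct children (assumes children do not overlap in time)."""
    child_total = defaultdict(int)
    for span in trace:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]

    totals = defaultdict(int)
    for span in trace:
        own = max(span["duration_ms"] - child_total[span["id"]], 0)
        totals[span["service"]] += own
    return dict(totals)


contribution = self_time_by_service(trace)
slowest = max(contribution, key=contribution.get)
```

In this made-up trace, most of the second spent in the iOS request is actually spent inside the inventory service, which is the kind of downstream hot spot the diagram is meant to surface.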
What Did We Learn?
- Because Lightstep Satellites contain 100% of trace data, queries provide real-time comprehensive data telling us exactly what’s going on at that moment.
- We can quickly see where latency is happening throughout the stack using the Service Diagram. Finding culprits is quick and easy.
Let’s search further - what is making the inventory service so slow?