When you notice an issue after a deployment, you can use the Service Health view to start your investigation.
- In the Service Health view, click the area of increased latency on the update-inventory Latency chart. You can choose to compare the current performance with performance before the deploy.
You’re taken to the Comparison view. Lightstep offers a number of different tools you can use to start creating hypotheses about what is causing the regression. On this page, blue represents the time period before the deploy (the baseline) and yellow represents the time period after (the regression).
Looking at the histogram, the data from the regression period is shown in the yellow bars. The baseline data is shown as the blue line. There are a lot more yellow bars on the higher latency side of the histogram, which confirms that there’s higher latency in the regression data.
Learn more about histograms
Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Satellites that match the query. The bottom of the histogram shows latency time, and each blue line represents the number of spans that fall into that latency time.
Note that on this page, the regression data is shown in yellow bars. You can learn more about the histogram here
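To make the bucketing idea concrete, here is a minimal Python sketch of how span latencies could be grouped into histogram buckets and compared between the baseline and the regression. The latency values, the 10 ms bucket width, and the function name are all assumptions for illustration; this is not how Lightstep builds its histogram internally.

```python
from collections import Counter

BUCKET_MS = 10  # assumed bucket width, chosen only for this example


def latency_histogram(latencies_ms, bucket_ms=BUCKET_MS):
    """Count spans per latency bucket; keys are bucket lower bounds in ms."""
    counts = Counter((int(lat) // bucket_ms) * bucket_ms for lat in latencies_ms)
    return dict(sorted(counts.items()))


# Made-up span latencies (ms), not real data.
baseline = [12, 15, 18, 22, 25, 31]
regression = [14, 19, 38, 41, 44, 52]

baseline_hist = latency_histogram(baseline)
regression_hist = latency_histogram(regression)

# Print both time periods side by side, one row per bucket.
for bucket in sorted(set(baseline_hist) | set(regression_hist)):
    print(
        f"{bucket:>3}-{bucket + BUCKET_MS - 1} ms  "
        f"baseline={baseline_hist.get(bucket, 0)}  "
        f"regression={regression_hist.get(bucket, 0)}"
    )
```

In this toy data, the regression has more spans in the higher buckets, which is the same pattern the yellow bars show on the Comparison view.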
Now let’s see if we can start to figure out why the latency is higher.
Lightstep can correlate attribute values associated with the regression. In the Compare Attributes table, you can see that both the large-batch:true and service-version:v1.14.4987 attribute values have a high correlation with latency and weren’t found in the baseline data. This tells us that it’s highly likely the deploy of that version is correlated with the latency and that it seems to have introduced an issue with writing large batches of data.
Learn more about Correlations
A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Lightstep helps you find attributes correlated with latency automatically.
Lightstep Correlations searches over all the span data in all traces that the current query participates in and finds patterns in that data for latency contribution, looking at service/operation pairs and attributes. This analysis allows you to identify services, operations, and attributes that are meaningfully associated with observed latency. By surfacing statistical relationships between these system attributes, you can quickly generate and validate hypotheses during a root cause investigation.
Note that on this page, the histogram displays yellow bars instead of blue. You can learn more about the histogram here
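As a rough illustration of what "attributes associated with latency" can mean, the sketch below compares the mean latency of spans that carry an attribute value against spans that don't. This is a deliberately simple stand-in, not Lightstep's actual Correlations analysis. The span dictionaries, the latency numbers, and the `v-previous` version label are made up for the example.

```python
from statistics import mean


def tags(span):
    """Return the span's attributes as a set of 'key:value' strings."""
    return {f"{key}:{value}" for key, value in span["attributes"].items()}


def latency_gap_by_attribute(spans):
    """Mean latency of spans with an attribute value minus mean latency of spans without it."""
    all_tags = set().union(*(tags(span) for span in spans))
    gaps = {}
    for tag in all_tags:
        with_tag = [s["latency_ms"] for s in spans if tag in tags(s)]
        without_tag = [s["latency_ms"] for s in spans if tag not in tags(s)]
        if with_tag and without_tag:  # a gap needs both groups
            gaps[tag] = mean(with_tag) - mean(without_tag)
    return gaps


# Made-up spans; "v-previous" is a placeholder for the version before the deploy.
spans = [
    {"latency_ms": 20, "attributes": {"large-batch": "false", "service-version": "v-previous"}},
    {"latency_ms": 30, "attributes": {"large-batch": "false", "service-version": "v-previous"}},
    {"latency_ms": 25, "attributes": {"large-batch": "false", "service-version": "v1.14.4987"}},
    {"latency_ms": 85, "attributes": {"large-batch": "true", "service-version": "v1.14.4987"}},
]

gaps = latency_gap_by_attribute(spans)

# List attribute values from most to least associated with higher latency.
for tag, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag}: {gap:+} ms")
```

A large positive gap only suggests where to look next. As in the walkthrough, you would still confirm the hypothesis against operations and example traces.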
Let’s use the Operations map to see if there’s any one operation that’s particularly latent.
Scroll down to the Operations map.
Looking at the operations for this service, you can see that the write-cache operation has a high latency (the yellow halo is quite large). The table tells you that the latency has increased by over 27ms. That makes sense given that the large-batch:true attribute was correlated with latency.
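The per-operation comparison can be sketched in a few lines of Python: group spans by operation, average the latency in each time period, and subtract. The numbers below and the `read-cache` operation name are hypothetical; only `write-cache` comes from this walkthrough. Using the mean is also an assumption, since the table may be based on a different statistic.

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_operation(spans):
    """Average latency (ms) per operation from (operation, latency_ms) pairs."""
    grouped = defaultdict(list)
    for operation, latency_ms in spans:
        grouped[operation].append(latency_ms)
    return {operation: mean(values) for operation, values in grouped.items()}


def latency_change(baseline, regression):
    """Change in mean latency per operation seen in both time periods."""
    before = mean_latency_by_operation(baseline)
    after = mean_latency_by_operation(regression)
    return {op: after[op] - before[op] for op in after if op in before}


# Made-up (operation, latency_ms) pairs for each time period.
baseline = [("write-cache", 10), ("write-cache", 14), ("read-cache", 5)]
regression = [("write-cache", 38), ("write-cache", 42), ("read-cache", 6)]

changes = latency_change(baseline, regression)
slowest = max(changes, key=changes.get)
print(f"Largest increase: {slowest} (+{changes[slowest]} ms)")
```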
Now that we’ve narrowed the issue down to the write-cache operation, let’s take a look at a full trace.
Back in the Compare Operations table, click on the write-cache operation to see example traces.
Click on one of them to open it in the Trace view.
Learn more about Trace View
You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.
Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.
You can learn more about the Trace view here
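The parent-child hierarchy described above can be reconstructed from spans that each record their parent's ID. The following Python sketch shows one way to do that and render an indented outline. The field names (`id`, `parent_id`, `start_ms`, `duration_ms`), the timings, and the `validate-request` operation are assumptions for illustration, not Lightstep's data model.

```python
from collections import defaultdict


def render_trace(spans):
    """Return indented lines showing each span under its parent, ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            lines.append(f"{'  ' * depth}{span['operation']} ({span['duration_ms']} ms)")
            walk(span["id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


# Made-up trace: one root span with two children.
trace = [
    {"id": "a", "parent_id": None, "operation": "update-inventory", "start_ms": 0, "duration_ms": 50},
    {"id": "c", "parent_id": "a", "operation": "write-cache", "start_ms": 7, "duration_ms": 40},
    {"id": "b", "parent_id": "a", "operation": "validate-request", "start_ms": 1, "duration_ms": 5},
]

outline = render_trace(trace)
print("\n".join(outline))
```

In this toy trace, the write-cache span accounts for most of the root span's duration, which is the kind of pattern to look for in the Trace view.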
Note that the write-cache operation is indeed an issue, as it’s taking up most of the critical path’s time. It seems there might have been a change to how large batches are handled by that operation.
Time to roll back the deploy!
What Did We Learn?
- Histograms show buckets of spans organized by latency. By comparing the baseline to the regression, it’s easy to confirm that latency has increased.
- Correlations shows you spans with specific attributes that may be contributing to latency.
- The Operations map shows latency per operation.
- The Trace view contains lots of information about each span in the trace, making it easy to verify your hypothesis.