A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, LightStep helps you find attributes correlated with latency automatically.
LightStep Correlations searches all the span data in every trace that the current query participates in and finds patterns in that data that contribute to latency, looking at service/operation pairs and tags. This analysis allows you to identify services, operations, and tags that are meaningfully associated with observed latency. By surfacing statistical relationships between these system attributes, Correlations lets you quickly generate and validate hypotheses during a root cause investigation.
Correlations are shown to the right of the Trace Analysis table and reflect the current data returned by the query. If you select a range in the histogram, correlations are recalculated based on that range. You can hide and show the Correlations panel using the Show Correlations checkbox.
The correlations are sorted in descending order based on the absolute value of the correlation score.
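As a minimal sketch of that ordering, the snippet below sorts a few made-up correlations by the magnitude of their score, so a strong negative score ranks above a weaker positive one. The attribute names and score values are hypothetical and only illustrate the sort.

```python
# Hypothetical (attribute, score) pairs; names and values are invented for illustration.
correlations = [
    ("tag: subnetwork=us-west1", 0.62),
    ("service: api / operation: get-user", -0.81),
    ("tag: customer_id=BEEMO", 0.35),
]

# Sort by magnitude, strongest first, regardless of sign.
ranked = sorted(correlations, key=lambda item: abs(item[1]), reverse=True)

for name, score in ranked:
    print(f"{score:+.2f}  {name}")
```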
There are three types of correlations:
- Service/operation pair: Shows the percentage of latency contribution to the critical path.
- Tag in the trace: The tag appears on a span in the trace (but not currently in the results because it did not match the query).
- Tag in the result: The tag appears on a span that is in the result set.
Correlation scores fall between -1 and +1 and are determined as follows:
- +1: The attribute is found only in the selected latency range. The closer the score is to +1, the more likely it is that the attribute contributes to latency in the range currently selected in the histogram.
- -1: The attribute is never found in the selected latency range. The closer the score is to -1, the less likely it is that the attribute has anything to do with that latency.
- 0: The attribute is equally likely to be found inside and outside of the selected latency range and thus has no statistical relationship to the latency.
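LightStep does not document the exact formula here, so the following is only an illustrative score that has the same three anchor points. It assumes you can count how often an attribute appears inside and outside the selected range; the function name `correlation_score` and its parameters are hypothetical.

```python
def correlation_score(in_with, in_total, out_with, out_total):
    """Illustrative score in [-1, +1]; NOT LightStep's actual formula.

    in_with / in_total:   spans with the attribute / all spans inside the selection
    out_with / out_total: the same counts outside the selection
    """
    if in_total == 0 or out_total == 0:
        raise ValueError("need spans both inside and outside the selection")
    p_in = in_with / in_total      # how common the attribute is inside the range
    p_out = out_with / out_total   # how common it is outside the range
    if p_in + p_out == 0:
        return 0.0  # attribute never appears, so there is no relationship to report
    return (p_in - p_out) / (p_in + p_out)


print(correlation_score(40, 100, 0, 900))   # only inside the range -> 1.0
print(correlation_score(0, 100, 90, 900))   # never inside the range -> -1.0
print(correlation_score(10, 100, 90, 900))  # equally likely in both -> 0.0
```

Any score with these end points would behave the same way at the extremes; values in between depend on the formula chosen, so treat intermediate numbers from this sketch as illustrative only.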
Increase Correlation Accuracy
By default, the Trace Analysis table shows only spans that match the query. For more accurate correlations, and to find issues in downstream services, you need a view of the whole trace. Click Show all Spans in Traces to include all spans that contribute to the trace.
View Correlations in the Histogram to Determine Impact
Hover over a correlation to see the latency for that attribute overlaid in the histogram. For example, in this histogram, the tag subnetwork with a value of us-west1 shows a high correlation for spans in the p95 and above range. It does not seem to have much of a correlation with latency below that range.
Selecting the correlation allows you to easily refine the results in the Trace Analysis table to quickly reach hypothetical causes for latency.
Filter and Group with Correlations
Once you see attributes in traces that may correspond with latency, you’ll likely want to filter the span data to further refine your hypothesis. For example, in the above graphic, there are three attributes with high correlation scores. Before you alert teams responsible for all three, you may want to see how they correlate with each other.
Filter the Trace Analysis Table
Click on a correlation and choose Filter by to show only results that match that criterion.
Once you filter the results, you can sort by duration to see the longest spans of correlated latency. Clicking a span takes you to the Trace view, where you can see the actual latency contribution.
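If you export span data and want to reproduce this filter-then-sort step offline, a rough sketch might look like the following. The span dictionaries, their field names (`operation`, `duration_ms`, `tags`), and the helper `filter_by_tag` are assumptions for illustration, not a LightStep export format or API.

```python
# Hypothetical span records; the field names are assumptions, not a LightStep schema.
spans = [
    {"operation": "get-user", "duration_ms": 120, "tags": {"customer_id": "ACME"}},
    {"operation": "get-user", "duration_ms": 5570, "tags": {"customer_id": "BEEMO"}},
    {"operation": "get-user", "duration_ms": 840, "tags": {"customer_id": "BEEMO"}},
    {"operation": "get-user", "duration_ms": 95, "tags": {}},
]


def filter_by_tag(spans, key, value):
    """Keep only spans whose tag `key` equals `value`."""
    return [s for s in spans if s["tags"].get(key) == value]


# Filter by the correlated attribute, then sort by duration, longest first.
filtered = filter_by_tag(spans, "customer_id", "BEEMO")
longest_first = sorted(filtered, key=lambda s: s["duration_ms"], reverse=True)

for span in longest_first:
    print(span["duration_ms"], span["operation"])
```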
The Latency Contribution is the span's contribution to the critical path latency. In this case, 5.38 seconds of the 5.57-second span is on the critical path, which accounts for 70% of the trace's total critical path.
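The arithmetic behind those figures can be checked with a short calculation. It assumes the 70% is measured against the whole trace's critical path and that the displayed percentage is rounded, so the derived total is approximate.

```python
span_duration_s = 5.57        # total duration of the span
on_critical_path_s = 5.38     # portion of the span on the critical path
share_of_critical_path = 0.70 # displayed percentage, assumed to be rounded

# Share of the span itself that sits on the critical path.
share_of_span = on_critical_path_s / span_duration_s

# Implied length of the trace's total critical path (approximate).
total_critical_path_s = on_critical_path_s / share_of_critical_path

print(f"{share_of_span:.1%} of the span is on the critical path")        # 96.6%
print(f"total critical path is about {total_critical_path_s:.2f} s")     # 7.69 s
```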
Group the Trace Analysis Table
To get a comparison of how different values for a service, operation, or tag might contribute to latency, you can group results. For example, you might want to group the results by the customer_id tag to see how BEEMO, which has a high correlation to latency, compares to other customers.
Click Group by to group by the attribute in the Correlations panel.
Click on a group to see span data just for that group. LightStep shows you the average latency, the error percentage, and the number of spans in that group.
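The three per-group figures can be sketched as a simple aggregation. This is an illustration over made-up span records, not how LightStep computes them internally; the field names (`duration_ms`, `error`, `tags`), the customer ACME, and the helper `group_stats` are hypothetical.

```python
from collections import defaultdict

# Hypothetical span records; the field names are assumptions, not a LightStep schema.
spans = [
    {"duration_ms": 900, "error": True, "tags": {"customer_id": "BEEMO"}},
    {"duration_ms": 1100, "error": False, "tags": {"customer_id": "BEEMO"}},
    {"duration_ms": 100, "error": False, "tags": {"customer_id": "ACME"}},
    {"duration_ms": 200, "error": False, "tags": {"customer_id": "ACME"}},
    {"duration_ms": 300, "error": False, "tags": {"customer_id": "ACME"}},
]


def group_stats(spans, tag_key):
    """Group spans by a tag value and compute average latency, error %, and span count."""
    groups = defaultdict(list)
    for span in spans:
        groups[span["tags"].get(tag_key, "(none)")].append(span)

    stats = {}
    for value, members in groups.items():
        count = len(members)
        stats[value] = {
            "avg_latency_ms": sum(s["duration_ms"] for s in members) / count,
            "error_pct": 100.0 * sum(1 for s in members if s["error"]) / count,
            "span_count": count,
        }
    return stats


for value, row in group_stats(spans, "customer_id").items():
    print(value, row)
```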
LightStep needs a significant amount of data to create correlations. You may see errors or warnings when there is not enough data to determine them.
If no correlations are returned, no system attribute correlates strongly with the selected latency region when compared with the region outside the selection. To surface stronger correlations, try adjusting your query or histogram selection.
Less Than 50 Traces
LightStep needs at least 50 traces to compile meaningful correlations. If your query result set, or the range you selected in the histogram, has fewer than 50 traces, you get a warning.
Adjust your query or increase the selected area in the histogram to include more traces.
Based on Low Data
The Correlations panel displays a warning when the query did not return enough data to determine reliable correlations. You have at least 50 traces, so correlations can be compiled, but they may not provide a good signal.
Expand your selection to allow LightStep to analyze a larger data set.