Find correlated areas of latency and errors

A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Cloud Observability helps you find attributes correlated with latency and errors automatically.

Correlations searches over all the span data in every trace that the current query's results participate in, looking at service/operation pairs and attributes to find patterns in latency and error contribution. This analysis identifies services, operations, and attributes that are meaningfully associated with observed regressions. By surfacing these statistical relationships, Correlations helps you quickly generate and validate hypotheses during a root cause investigation.

We will be introducing new workflows to replace Explorer, and as a result, it will soon no longer be supported. Instead, use notebooks for your investigation where you can run ad-hoc queries, view data over a longer time period, and run Cloud Observability’s correlation feature. In your notebooks and dashboards, you can use the traces list panel to view a list of spans from a query and dependency maps to view a service diagram.

View correlations

Correlations are shown to the right of the Trace Analysis table and reflect the current data returned by the query. If you select a range in the histogram, correlations are redetermined based on that range. There are separate tabs for latency and error correlations.

The correlations are sorted in descending order based on the absolute value of the correlation score.

There are three types of correlations:

  • Service/operation pair (latency only): Shows the percentage of latency contribution to the critical path.
  • Attribute in the trace: The attribute appears on a span in the trace, but that span is not in the current results because it did not match the query.
  • Attribute in the result: The attribute appears on a span that is in the result set.

Correlation scores fall between -1 and +1 and are determined as follows:

  • +1: The attribute is found only in the selected latency range. The closer the score is to +1, the more likely the attribute is contributing to the latency in the currently selected range of the histogram.
  • -1: The attribute is never found in the selected latency range. The closer the score is to -1, the less likely the attribute has anything to do with the latency.
  • 0: The attribute is equally likely to be found inside and outside of the selected latency range and thus has no statistical relationship to the latency.
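
For intuition only, here is a minimal sketch of a score with those properties: the frequency of an attribute inside the selected range minus its frequency outside it. The span data shape, the function name, and the scoring formula are assumptions for illustration, not Cloud Observability's actual algorithm.

```python
# Illustrative sketch only: NOT Cloud Observability's algorithm, just a
# minimal score with the same -1..+1 behavior described above.
from collections import defaultdict

def correlation_scores(spans, selected_indices):
    """Score each attribute key/value pair by how much more often it appears
    on spans inside the selected latency range than outside it.
    spans: list of dicts with an "attributes" mapping (assumed shape).
    selected_indices: indices of spans that fall in the selected range."""
    inside = defaultdict(int)
    outside = defaultdict(int)
    n_in = n_out = 0
    for i, span in enumerate(spans):
        in_range = i in selected_indices
        n_in += in_range
        n_out += not in_range
        for key, value in span["attributes"].items():
            (inside if in_range else outside)[(key, value)] += 1
    scores = {}
    for attr in set(inside) | set(outside):
        p_in = inside[attr] / n_in if n_in else 0.0      # frequency inside the selection
        p_out = outside[attr] / n_out if n_out else 0.0  # frequency outside it
        scores[attr] = p_in - p_out  # +1: only inside, -1: only outside, 0: no relationship
    # Mirror the panel's ordering: sort by the absolute value of the score.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```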

Increase correlation accuracy

By default, the Trace Analysis table shows only spans that match the query. Correlations are more accurate when they can consider the whole trace, including downstream spans that did not match the query. Click Show all Spans in Traces to include every span that contributes to the matching traces.
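
Conceptually, this widens the data set from the spans that matched the query to every span in the traces those spans belong to. The sketch below is only an illustration of that expansion; the field names and data shapes are assumed.

```python
# Illustrative sketch: widen from query-matched spans to every span in the
# traces those spans belong to, so downstream spans can be analyzed too.
# The field name ("trace_id") and data shapes are assumptions.

def expand_to_full_traces(matched_spans, all_spans):
    """matched_spans: spans returned by the query.
    all_spans: every span available for the same time window."""
    trace_ids = {span["trace_id"] for span in matched_spans}
    return [span for span in all_spans if span["trace_id"] in trace_ids]
```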

Find latency using correlations

Correlations allow you to quickly find attributes that are contributing to latency.

View correlations in the histogram to determine impact

Hover over a correlation to see the latency for that attribute overlaid on the histogram. For example, in this histogram, the operation wait-for-client-queue on the krakend-api-gateway service shows a high correlation for spans at p95 and above. It does not show much of a correlation with lower latency.

Selecting the correlation refines the results in the Trace Analysis table so you can quickly test hypotheses about the cause of the latency.

Once you see attributes in traces that may correspond with latency or errors, you’ll likely want to filter the span data to further refine your hypothesis. For example, in this query, there are a number of attributes with high correlation scores for latency. Before you alert teams responsible for all of them, you may want to see how they correlate with each other.

Filter the Trace Analysis table

Click on a correlation and choose Filter by to show only results that match that criterion.

In this example, you might filter by the wait-for-client-queue operation on the krakend-api-gateway service to see only spans from that operation, as it's the only operation highly correlated with the latency (the others are all attributes).

Once you filter the results, you can sort by duration to find the longest correlated spans. Clicking a span takes you to the Trace view, where you can see its actual latency contribution.

The Latency Contribution is the span's contribution to the trace's critical path latency. In this case, 5.79 seconds of the 6.03-second span is on the critical path, which accounts for 92.6% of the trace's total critical path.
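
To make the arithmetic explicit (this is an interpretation of the numbers above, not an official formula): the percentage is the span's critical-path time divided by the trace's total critical-path latency, which implies a total critical path of roughly 6.25 seconds in this example.

```python
# Interpretation of the example above, not an official formula.
span_critical_path_s = 5.79   # time this span spends on the critical path
contribution_pct = 92.6       # percentage of the trace's total critical path
trace_critical_path_s = span_critical_path_s / (contribution_pct / 100)
print(f"total critical path ≈ {trace_critical_path_s:.2f} s")  # ≈ 6.25 s
```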

Group the Trace Analysis table

To compare how different values of a service, operation, or attribute might contribute to latency, you can group results. In this example, you notice that the attribute large_batch with the value true shows a high correlation, meaning that spans with that attribute value are strongly associated with the latency.

Click Group by to group by the attribute in the Correlations panel.

Click on a group to see span data just for that group. Cloud Observability shows you the average latency, the error percentage, and the number of spans in that group.
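
In data terms, grouping is an aggregation over the span set. Here is a minimal sketch of the per-group statistics the view reports (average latency, error percentage, span count); the field names and function are assumptions for illustration.

```python
# Minimal sketch of the per-group statistics shown in the UI.
# Field names ("attributes", "duration_ms", "error") are assumptions.
from collections import defaultdict

def group_stats(spans, attribute):
    """Group spans by an attribute's value and summarize each group."""
    groups = defaultdict(list)
    for span in spans:
        groups[span["attributes"].get(attribute)].append(span)
    stats = {}
    for value, members in groups.items():
        stats[value] = {
            "avg_latency_ms": sum(s["duration_ms"] for s in members) / len(members),
            "error_pct": 100 * sum(1 for s in members if s["error"]) / len(members),
            "span_count": len(members),
        }
    return stats
```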

Find the cause of errors using correlations

When you see that there are error correlations for your query, you can easily find out where the errors are coming from by filtering and/or grouping the spans in the Trace Analysis table to further refine your query.

For example, in this query on the android service, there are a number of attributes with high error correlation scores. The http.status_code attribute with a value of 400 shows the highest correlation.

Filter the Trace Analysis table

You can filter the Trace Analysis table to see only the spans that are highly correlated with errors.

Click on a correlation and choose Filter by to show only results that match that criterion.

For example, because the http.status_code attribute with a value of 400 showed the highest correlation, you might filter by that. Now you see only spans that have that attribute value. To further narrow down the results, you can group them by another attribute, operation, or service.

In the example, looking again at the Correlations panel, you see that the attribute canary with a value of true is also highly correlated with errors. To see if all the filtered spans have the attribute value canary: true, at the top of the table click Group by and choose canary.

The results only show the value of true, which means all the spans have the attribute value canary: true. Had there been spans in the filtered table with the value false, you would see that group as well.
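
Expressed as data operations, this workflow is a filter followed by a group. The sketch below mirrors it using the same assumed span shape and the hypothetical group_stats() helper from the earlier sketch; the sample data is made up for demonstration.

```python
# Illustrative filter-then-group, mirroring the UI workflow above.
# Sample spans are invented; attribute names come from the example.
spans = [
    {"attributes": {"http.status_code": 400, "canary": True}, "duration_ms": 310, "error": True},
    {"attributes": {"http.status_code": 400, "canary": True}, "duration_ms": 275, "error": True},
    {"attributes": {"http.status_code": 200, "canary": False}, "duration_ms": 90, "error": False},
]

# Step 1: filter to the spans most correlated with errors.
filtered = [s for s in spans if s["attributes"].get("http.status_code") == 400]

# Step 2: group the filtered spans by another attribute.
by_canary = group_stats(filtered, "canary")  # group_stats() sketched above

# If every filtered span carries canary: true, there is only one group.
print(by_canary)  # {True: {'avg_latency_ms': 292.5, 'error_pct': 100.0, 'span_count': 2}}
```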

Click on the group to see spans that have attribute values of canary: true and http.status_code: 400 (remember, you filtered the table previously for that attribute). The table shows you the attribute group you are looking at (in this case canary: true), the average latency for these spans, the error percentage, and the number of spans in the group.

Now you can click on a span to view a full trace and find the error. In this example, it seems to come from the get-profile-for-user operation on the profile service. Sure enough, the logs on the trace show that an invalid profile ID is causing the error.

Group the Trace Analysis table

You can also start your investigation by comparing the different values of an attribute correlated with errors by clicking Group by for an attribute in the Correlations panel.

In this example, maybe you immediately notice that the canary attribute with the value true is correlated with errors and decide to group by the canary attribute to see how many spans come from that environment.

In this case, you can see that 137 spans are from the Canary environment and they all have errors.

Click on a group to see span data just for that group. Cloud Observability shows you the average latency, the error percentage, and the number of spans in that group.

Now you can click on a span to view a full trace and find the error. As before, it seems to come from the get-profile-for-user operation on the profile service due to an invalid profile ID.

Troubleshoot correlations

Cloud Observability needs a significant amount of data to compute correlations. You may see warnings when there is not enough data to determine them.

No correlations

If no correlations are returned, Cloud Observability did not find any strong correlations in the selected region.

This means that no system attributes correlate strongly inside the selected latency range compared with the region outside the selection. Try adjusting your query or histogram selection to surface stronger correlations.

Less than 50 traces

Cloud Observability needs at least 50 traces to compile meaningful correlations. If your query's result set or the range you selected in the histogram contains fewer than 50 traces, you get a warning.

Adjust your query or increase the selected area in the histogram to include more traces.

Based on low data

The Correlations panel displays a warning when the amount of data returned by the query is not enough to determine reliable correlations. You have more than 50 traces, so correlations can be compiled, but they may not provide a strong signal.

Expand your selection to allow Cloud Observability to analyze a larger data set.

See also

Query real-time span data

Use attributes and log events to find issues fast

Updated Apr 26, 2023