Investigate a latency regression

New workflows will soon replace the RCA view, after which it will no longer be supported. Instead, use notebooks for your investigation. There you can run ad-hoc queries, view data over a longer time period, and run Cloud Observability’s correlation feature.

Code and configuration changes are some of the most common causes of performance regressions. While viewing a dashboard after a deployment may show that a latency regression occurred, it won’t explain the root cause. You can use Cloud Observability not only to monitor your services after a deploy, but also to compare performance over specific time periods and then dig into details to find the differences that caused the latency increase.

Compare two time ranges for latency

When you notice a regression in latency, you can click the corresponding chart and immediately begin to investigate the issue.

  1. From the Service Directory, select a service and click the Deployments tab. Sparkline charts show the top changes in latency, error rate, and operation rate for the Key Operations for that service. The operations are initially sorted by highest change during the selected time period.

More About How Cloud Observability Measures Change

Cloud Observability measures two aspects of change: size and continuity. A full bar indicates that a large, sustained change has happened. Smaller bars indicate either a smaller change or one that did not last for the full time period. Changes are ranked by relative size rather than absolute size: for example, a change from 10ms to 500ms ranks higher than a change from 1s to 2s.
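
To illustrate why a relative ranking puts a 10ms to 500ms shift above a 1s to 2s shift, here is a minimal sketch. The relative_change helper is hypothetical and covers only the size aspect. It is not Cloud Observability’s actual algorithm, which also weighs how long a change is sustained.

```python
def relative_change(before_ms: float, after_ms: float) -> float:
    """Return the change as a multiple of the starting value (hypothetical helper)."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (after_ms - before_ms) / before_ms


# The two changes used as the example in the text.
changes = {
    "10ms -> 500ms": relative_change(10, 500),
    "1s -> 2s": relative_change(1000, 2000),
}

# Rank by relative size, largest first.
ranked = sorted(changes, key=changes.get, reverse=True)
print(ranked)
```

The absolute increase is larger for the 1s to 2s change (1000ms versus 490ms), but relative to where each started, the 10ms to 500ms change is far bigger.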

The yellow bar means that an SLI had an objectively large change, regardless of service or operation. Cloud Observability’s algorithm runs on each SLI independently. For example, when the bar displays for an operation’s latency, that means latency has changed – not that its change was greater compared to the other SLIs.

When determining change, Cloud Observability compares the baseline SLI time series to the comparison SLI time series. Those time periods are determined using the data currently visible in the charts.

You can change the amount of time displayed using the time period dropdown at the top right of the page.

The baseline and comparison time periods are determined as follows:

If one or more deployment markers are visible:

  • For latency and error rates, Cloud Observability compares the performance after the selected version to performance in all other versions.
  • For operation rate, it compares the rate before the deployment marker to after the deployment marker.

If there are no deployment markers visible:
Cloud Observability compares the performance of the first half of the time period to the second half.
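
The time-based rules above can be sketched as follows. This is an assumption-laden illustration, not the product’s implementation: split_windows is a hypothetical name, times are plain seconds, and only the two time-based cases are covered (operation rate split at a deployment marker, and the no-marker split into halves). The version-based comparison used for latency and error rate is not modeled.

```python
from typing import Optional, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


def split_windows(start: float, end: float,
                  marker: Optional[float] = None) -> Tuple[Window, Window]:
    """Return (baseline, comparison) windows for the visible time range.

    With a deployment marker, split at the marker (the operation-rate rule).
    Without one, split the visible range into two halves.
    """
    if end <= start:
        raise ValueError("end must be after start")
    split = marker if marker is not None else start + (end - start) / 2
    if not start < split < end:
        raise ValueError("marker must fall inside the visible range")
    return (start, split), (split, end)


# One visible hour with no deployment marker: first half vs. second half.
print(split_windows(0, 3600))
# The same hour with a deployment 40 minutes in: before vs. after the marker.
print(split_windows(0, 3600, marker=2400))
```

Because both windows come from the visible range, changing the time period dropdown changes what is compared.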

If you’ve implemented an attribute to denote new versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, for the update-inventory operation, a deploy occurred at around 12:40 pm. Hover over the marker to view details. These markers allow you to quickly correlate a deployment with a possible regression. When a deploy only affects a segment of traffic (like a canary deployment or a blue/green deploy), Cloud Observability shows you the performance and amount of traffic for each version.

  2. When you find a latency spike to investigate, select the operation’s sparkline chart and then click inside the spike in the larger Latency chart to set the time range from which data is collected.

    Choose a spot in the middle of a spike to avoid collecting data from before the regression occurred. You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.

    A shaded yellow box shows the time range from which span data will be used for analysis of the regression.

  3. Choose a baseline to compare that data against: directly before the previous deploy, an hour ago, a day ago, or a custom baseline window that you select.

    When you click Select custom baseline window, click into the chart and move the blue box to choose your baseline.

    The RCA (root cause analysis) view opens to show data from both time periods.

    To change either the Regression or Baseline time, click and move the shaded box. You can also zoom in on a time period by clicking and dragging over the time period you want to zoom in on.

Read on for instructions for using these tools.

Compare latency changes using the histogram

The Compare Latency Histogram is similar to the histogram shown on the Explorer view.

Learn more about histograms

Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Microsatellites that match the query. The bottom of the histogram shows latency time, and each blue bar represents the number of spans that fall into that latency time. Note that on this page, the histogram displays yellow bars instead of blue.

In the chart, the yellow bars represent the number of spans for the duration shown on the x-axis. The blue line overlay shows the shape of the histogram for the baseline. This overlay helps identify the scope of the performance impact: how much has latency changed and for how many requests.

By default, markers show where the 95th percentile lies. Use the dropdown to see other percentile markers.
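
As a rough illustration of what a percentile marker represents, the sketch below computes the 95th percentile for a baseline and a regression sample. It assumes the nearest-rank definition of a percentile. The source does not say which method Cloud Observability uses, and the durations are made-up sample data.

```python
import math
from typing import Sequence


def percentile(durations_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the
    data at or below it (an assumed definition, for illustration only)."""
    if not durations_ms:
        raise ValueError("durations_ms is empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(durations_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up span durations: 10ms, 20ms, ..., 200ms in the baseline,
# and every request twice as slow in the regression.
baseline = list(range(10, 210, 10))
regression = [d * 2 for d in baseline]

print(percentile(baseline, 95), percentile(regression, 95))
```

Comparing the same percentile across the two ranges shows how far the tail moved, while the histogram shape shows how many requests moved with it.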

Compare attributes

The Compare Attributes table is similar to Correlations shown on the Explorer view. For the operation being investigated, the table displays attributes that are on spans associated with an increase in latency in the regression time range.

Learn more about Correlations

A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Cloud Observability helps you find attributes correlated with latency and errors automatically.

Correlations searches all the span data in every trace that contains spans matching the current query and finds patterns in that data for latency and error contribution, looking at service/operation pairs and attributes. This analysis allows you to identify services, operations, and attributes that are meaningfully associated with observed regressions. Because it surfaces statistical relationships between these system attributes, you can quickly generate and validate hypotheses during a root cause investigation.

For each of these attributes, the table shows the average latency of the operation being investigated when that attribute appears anywhere in the trace. The average latency is shown for the baseline (in blue) and the regression (in yellow), along with the difference between the two.

By default, the table is sorted by duration change, from highest to lowest. You can change the sort by clicking on any column header.
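
The table’s columns can be approximated with the sketch below. It is a simplified model under stated assumptions: each record stands for one occurrence of the investigated operation together with the attributes found in its trace, and the record layout, the compare_attributes name, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by_attribute(records):
    """Map each 'key:value' attribute to the mean duration of records carrying it."""
    buckets = defaultdict(list)
    for rec in records:
        for key, value in rec["attributes"].items():
            buckets[f"{key}:{value}"].append(rec["duration_ms"])
    return {attr: mean(vals) for attr, vals in buckets.items()}


def compare_attributes(baseline, regression):
    """Return rows of (attribute, baseline avg, regression avg, difference)."""
    base_avg = average_by_attribute(baseline)
    reg_avg = average_by_attribute(regression)
    rows = []
    for attr, reg in reg_avg.items():
        base = base_avg.get(attr)
        diff = reg - base if base is not None else None
        rows.append((attr, base, reg, diff))
    # Largest duration change first; attributes absent from the baseline go last.
    rows.sort(key=lambda r: (r[3] is None, -(r[3] or 0)))
    return rows


baseline = [
    {"duration_ms": 100, "attributes": {"region": "us-east", "customer": "acme"}},
    {"duration_ms": 120, "attributes": {"region": "us-west", "customer": "acme"}},
]
regression = [
    {"duration_ms": 300, "attributes": {"region": "us-east", "customer": "acme"}},
    {"duration_ms": 130, "attributes": {"region": "us-west", "customer": "acme"}},
]

rows = compare_attributes(baseline, regression)
for row in rows:
    print(row)
```

In this sample, region:us-east sorts first because its average latency grew the most between the two ranges.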

Hover over a correlation to highlight, in a darker yellow on the histogram, the latency of regression spans with that attribute. For example, in this histogram, the attribute large_batch:true appears only on spans with higher latency and only in the regression (there are no darker bars within the blue baseline period). This attribute is likely associated with the root cause of the latency increase.

Click on a row (or use the Filter icon) in the Compare Attributes table to filter the Operation diagram, Log Event analysis, and Trace Analysis table by that attribute.

Click the More ( ⋮ ) icon to view example traces. Click one of the traces to view the full trace in the Trace view. Click Group by to group the current results in the Trace Analysis table by this attribute.

Compare slowest operations contributing to the critical path in a trace

The Operation Diagram shows the operations across the request path of traces in the regression. By default, the diagram shows all downstream operations from the operation and service you’re investigating. Yellow halos denote average latency contribution. The operation you’re currently investigating has a blue animated halo.

You can also view upstream operations or all operations using the dropdown at the top of the Operation Diagram.

The diagram edges represent relationships between spans and depend on the quality of your instrumentation. Missing spans may cause edges to appear when there is no parent-child relationship. The following can cause missing spans:

  • The Microsatellite dropped the span (for example, your Microsatellite pool is not auto-scaling to keep up with traffic).
  • The tracer dropped the span (for example, your application crashed or never called span.flush()).
  • There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).

When you see missing spans, check the Reporting Status page to find the culprit.

The table shows the operations that contribute most to the critical path of the operation you are investigating, along with their critical path contribution. Note that this differs from an operation’s latency: the time displayed is strictly the amount of time the operation spends in the critical path.

The table is sorted by the critical path contribution in the regression, from largest to smallest.

For example, in this table the write-cache operation on the inventory service contributed an average of 58.4ms in the critical path for the baseline time range, but contributed 146ms in the critical path for the regression time range. Identifying the largest increase in latency in the critical path can help pinpoint the root cause of the latency in the regression.
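
To show how critical path contribution differs from latency, the sketch below walks a single trace backward from the end of each span and charges time to whichever operation was blocking at that moment. This is one common way to approximate a critical path, not necessarily how Cloud Observability computes it. The Span type, the read-db and audit-log operations, and all timings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    operation: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)


def critical_path(span: Span,
                  totals: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Accumulate, per operation, the time spent on the critical path."""
    if totals is None:
        totals = {}
    cursor = span.end_ms  # walk backward from the end of the span
    own = 0.0
    for child in sorted(span.children, key=lambda c: c.end_ms, reverse=True):
        # Skip children that overlap work already on the path, or that
        # finished before this span started.
        if child.end_ms > cursor or child.end_ms <= span.start_ms:
            continue
        own += cursor - child.end_ms       # gap where the parent itself was working
        critical_path(child, totals)       # the child owns its own interval
        cursor = max(child.start_ms, span.start_ms)
    own += cursor - span.start_ms
    totals[span.operation] = totals.get(span.operation, 0.0) + own
    return totals


trace = Span("update-inventory", 0, 200, [
    Span("read-db", 10, 40),
    Span("audit-log", 20, 100),    # runs in parallel with other work
    Span("write-cache", 50, 190),
])

totals = critical_path(trace)
print(totals)
```

Here audit-log has 80ms of latency but contributes nothing to the critical path, because write-cache was already blocking the request during that time.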

You can change the sort order by clicking on another column. You can also filter the diagram (and the Log Event Analysis and Trace Analysis table) either by selecting a node in the diagram or by clicking the Filter icon on a row in the table.

Click the More ( ⋮ ) icon to view example traces. Click one of the traces to view the full trace in the Trace view. Click View in Service Directory to return to the Service Health view, where you can view the sparkline charts and any deployment markers for this service and operation.

View log events

These are logs of events that your tracing instrumentation has appended to the spans. If you used tracing libraries that do not support log appending, you will not see Log Event Analysis.

Cloud Observability shows you the logs of events from the spans that make up the Operation Diagram (if you’ve filtered the diagram, the logs are filtered as well).

The graph shows the frequency of events in the spans that Cloud Observability has analyzed. The table below shows the log messages, the number of times that message appears on spans during both the baseline and regression, and the change in frequency between the two. The service and operation names are shown at the top of the message. When patterns are detected, placeholders display a range, type, etc.
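
The frequency comparison can be pictured with the sketch below. It assumes a very simple pattern detector that replaces runs of digits with a placeholder. The source does not describe how Cloud Observability detects patterns, so to_pattern, the "<num>" placeholder, and the sample messages are illustrative only.

```python
import re
from collections import Counter

NUMBER = re.compile(r"\d+")


def to_pattern(message: str) -> str:
    """Replace runs of digits with a placeholder so similar messages group together."""
    return NUMBER.sub("<num>", message)


def frequency_change(baseline_logs, regression_logs):
    """Return rows of (pattern, baseline count, regression count, change)."""
    base = Counter(to_pattern(m) for m in baseline_logs)
    reg = Counter(to_pattern(m) for m in regression_logs)
    rows = [(p, base[p], reg[p], reg[p] - base[p]) for p in set(base) | set(reg)]
    # Largest increase first; ties broken alphabetically for a stable order.
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows


baseline_logs = ["cache miss for item 42", "retry 1 of 3"]
regression_logs = [
    "cache miss for item 7",
    "cache miss for item 981",
    "retry 2 of 3",
    "batch of 500 items queued",
]

rows = frequency_change(baseline_logs, regression_logs)
for row in rows:
    print(row)
```

Messages that appear only in the regression, or much more often there, are the ones worth opening example traces for.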

Hover over a row and click the More icon for a log message to see example traces and to copy the log message. Click an example trace to view it in the full Trace view.

You can search for messages using the Search field. The results display the accuracy of the match and are sorted by accuracy. Pattern placeholders are replaced with the actual text from the message. Use the Filter icon to filter the Log Event Analysis, Operation Diagram, and Trace Analysis table by the log message.

Compare span data using the Trace Analysis table

Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table shows the service, the operation reporting the span, the span’s duration, and the start time from the regression time range. You can add other columns as needed by clicking on the + icon.

Unlike the Trace Analysis table on the Explorer view, this table shows data from spans not just in the service and operation you’re investigating, but from all spans that participate in the same trace.

You can switch between analyzing traces from the baseline or regression time periods at the top of the table. Any filtering or grouping is done for the data from both time periods.

Filter the data

By default, Cloud Observability shows you all the data available from spans from the two time periods (baseline and regression). You can filter the data to narrow in on the cause of the regression. When you apply filters, the Operation Diagram, Log Event Analysis, and Trace Analysis table all repopulate with data that matches the filters.

Multiple filter criteria are joined by the AND operator. Therefore, you can only filter by one service/operation pair.

You can apply filters in a number of different ways:

  • From the Narrow your results button: Clicking this button brings up a query builder (similar to the one in Explorer). You can add any operation, service, or attribute that exists in the data. This button is “sticky” and always displays at the top of a component that can be filtered.

    To filter by log messages, use the filter icon on the Log Event Analysis table.

  • From the Operation Diagram: Select a node in the diagram to filter by that service and operation.

Updated Apr 26, 2023