Investigate an error rate increase

We will be introducing new workflows to replace the RCA view. As a result, it will soon no longer be supported. Instead, use notebooks for your investigation where you can run ad-hoc queries, view data over a longer time period, and run Cloud Observability’s correlation feature.

When you notice an increase in error rate on the Service Health view, you can use the analytical tools to find the source of errors. This is especially helpful after deploying a service: you can verify there are no error regressions and, if there are, quickly determine the source and roll back.

Compare two time ranges for error rate

When you do notice an increase in the error rate for an operation, you can use the RCA view to compare a time when the error rate was stable (called the baseline) to a time when the rate was high (the regression). You then use the analysis tools to identify what caused the increase in errors.

  1. From the Service Directory, select a service and click the Deployments tab. Sparkline charts show the top changes in latency, error rate, and operation rate for the Key Operations for that service. The operations are initially sorted by the largest change during the selected time period.

More About How Cloud Observability Measures Change

Cloud Observability measures two aspects of change: size and continuity. A full bar indicates that a large, sustained change has happened. Smaller bars indicate either a smaller change or one that did not last for the full time period. Change is measured relative to the starting value (for example, a change from 10ms to 500ms is ranked higher than a change from 1s to 2s).
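
The exact ranking algorithm isn’t documented here, but the relative-size idea can be sketched as follows (a minimal illustration, not the actual implementation; it ignores the continuity aspect described above):

    # Hypothetical sketch: rank SLI changes by relative size rather than
    # absolute size. The real algorithm also weighs how sustained a change
    # is; this only illustrates the "relative" ranking described above.

    def relative_change(baseline_value: float, comparison_value: float) -> float:
        """Return the change as a multiple of the baseline value."""
        if baseline_value == 0:
            return float("inf") if comparison_value > 0 else 0.0
        return abs(comparison_value - baseline_value) / baseline_value

    # A 10ms -> 500ms latency change (49x) ranks above a 1s -> 2s change (1x).
    print(relative_change(10, 500))     # 49.0
    print(relative_change(1000, 2000))  # 1.0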

The yellow bar means that an SLI had an objectively large change, regardless of service or operation. Cloud Observability’s algorithm runs on each SLI independently. For example, when the bar displays for an operation’s latency, that means the latency itself has changed, not that its change was larger than the changes in the other SLIs.

When determining change, Cloud Observability compares the baseline SLI time series to the comparison SLI time series. Those time periods are determined using the data currently visible in the charts.

You can change the amount of time displayed using the time period dropdown at the top right of the page.

The baseline and comparison time periods are determined as follows:

If one or more deployment markers are visible:

  • For latency and error rates, Cloud Observability compares the performance after the selected version to performance in all other versions.
  • For operation rate, it compares the rate before the deployment marker to after the deployment marker.

If no deployment markers are visible:
Cloud Observability compares the performance of the first half of the time period to the second half.
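
As a minimal sketch of the window selection described above (the function and the time representation are assumptions for illustration, not Cloud Observability’s implementation):

    # Illustrative sketch of the window-selection rules above. Times are
    # UNIX timestamps in seconds; deploy_time is a visible deployment marker.

    def comparison_windows(view_start, view_end, deploy_time=None):
        """Return (baseline_window, comparison_window) for the visible range.

        With a visible deployment marker, the operation-rate comparison uses
        the time before the marker as the baseline and the time after it as
        the comparison. With no marker, the first half of the visible range
        is the baseline and the second half is the comparison.
        """
        if deploy_time is not None and view_start < deploy_time < view_end:
            return (view_start, deploy_time), (deploy_time, view_end)
        midpoint = (view_start + view_end) / 2
        return (view_start, midpoint), (midpoint, view_end)

    # One hour of visible data with a deploy 40 minutes in:
    print(comparison_windows(0, 3600, deploy_time=2400))
    # ((0, 2400), (2400, 3600))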

If you’ve implemented an attribute to denote new versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, for the update-inventory operation, a deploy occurred at 12:40 pm. Hover over the marker to view details. These markers allow you to quickly correlate a deployment with a possible regression. When a deploy only affects a segment of traffic (like a canary deployment or an A/B test), Cloud Observability shows you the performance and amount of traffic for each version.
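
If you use OpenTelemetry, one way to report a version attribute is on the tracer’s resource. This is a minimal sketch: service.version is the OpenTelemetry semantic convention for a service version, but check your Cloud Observability project settings for the attribute it actually uses for deployment markers.

    # Report a version attribute with the OpenTelemetry Python SDK. The
    # attribute names and values here are examples only.
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider

    resource = Resource.create({
        "service.name": "inventory",        # example service name
        "service.version": "2024-05-01.1",  # bump this on every deploy
    })
    trace.set_tracer_provider(TracerProvider(resource=resource))
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("update-inventory"):
        pass  # spans now carry the version attribute via their resource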

  2. When you find an error spike to investigate, select the operation’s sparkline chart and then click the spike in the larger Error % chart to create a range of time to collect data from.

    Choose a spot in the middle of a spike to avoid collecting data from before the regression occurred. You can also zoom in by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.

    A shaded red box shows the time range from which span data will be used for analysis of the regression.

  3. Choose to compare that data with data from directly before the previous deploy, from an hour ago, or from a day ago, or select your own baseline window.

    When you click Custom baseline window, click into the chart and move the blue box to choose your baseline.

    The RCA (root cause analysis) view opens to show data from both time periods. The color blue represents the baseline time window and red represents the regression time window.

    To change either the Regression or Baseline time, click and move the shaded box. You can also zoom in by clicking and dragging over the time period you want a closer look at.

Read on for instructions for using these tools.

See operation request path

The Operation Diagram shows downstream, upstream, or all operations from the operation and service you’re investigating. By default, Cloud Observability applies an error=true filter and builds this diagram using only spans with errors. Red halos denote error percentages. The larger the halo, the more errors are present. The operation you’re currently investigating has a blue animated halo.

Cloud Observability also shows you the number of logs associated with each operation, based on the filters applied. For example, when error=true is applied as a filter, these counts only include logs on spans with errors.

By default, the diagram shows you all downstream operations. You can change the diagram to show upstream operations or both downstream and upstream operations using the dropdown menu.

The diagram edges represent relationships between spans and depend on the quality of your instrumentation. Missing spans may cause edges to appear where there is no parent-child relationship. The following can cause missing spans:

  • The Microsatellite dropped the span (for example, your Microsatellite pool is not auto-scaling to keep up with traffic).
  • The tracer dropped the span (for example, your application crashed or never called span.flush(); see the sketch after this list).
  • There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).
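
As a minimal sketch of the flush point mentioned in the list above, using the OpenTelemetry Python SDK (the span.flush() call in the list refers to your tracer’s own API; the equivalent here is shutting down the provider, which flushes the batch processor’s queue):

    # Flush buffered spans before the process exits so the tracer does not
    # drop them. The console exporter is used here for illustration only.
    import atexit

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    # shutdown() flushes any spans still queued in the batch processor.
    atexit.register(provider.shutdown)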

When you see missing spans, check the Reporting Status page to find the culprit.

Hovering over an operation gives you the error rate and log count.

Compare operation error rates

The Operations with Errors table shows the number of errors per minute on the operations in the diagram that originated an error, for both the baseline and regression time periods, along with the calculated change. From this table you can see where the majority of errors originate, the change in rate, and then click through to full traces to further diagnose.

The table only shows operations that originated the errors, so there may be fewer operations in the table than the diagram. Operations are considered to be originating errors if they have no descendant spans with errors.
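
That rule can be sketched as follows (the Span structure is a stand-in for illustration, not a Cloud Observability API):

    # An error span "originates" the error only if none of its descendant
    # spans also have an error.
    from dataclasses import dataclass, field

    @dataclass
    class Span:
        operation: str
        error: bool = False
        children: list = field(default_factory=list)

    def has_error_descendant(span: Span) -> bool:
        return any(c.error or has_error_descendant(c) for c in span.children)

    def originating_error_operations(span: Span) -> set:
        """Collect operations whose error spans have no erroring descendants."""
        found = set()
        if span.error and not has_error_descendant(span):
            found.add(span.operation)
        for child in span.children:
            found |= originating_error_operations(child)
        return found

    # /checkout errors because its child /charge errored; only /charge
    # originates the error, so only /charge would appear in the table.
    root = Span("/checkout", error=True, children=[Span("/charge", error=True)])
    print(originating_error_operations(root))  # {'/charge'}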

By default, the table is sorted by the amount of change. You can change the sort order by clicking on another column. You can also filter the diagram (and the Logs analysis and Trace Analysis table) to show only span data from an operation by either selecting a node in the diagram or by clicking the Filter icon on the operation’s row in the table.

Click the More ( ⋮ ) icon to view example traces. Click one of the traces to view the full trace in the Trace view. Click View in Service Directory to return to the Service Health view, where you can view the sparkline charts and any deployment markers for this service and operation.

Correlate attributes to errors

The Attributes with Errors table shows you attributes that may be correlated with the increase in errors—that is, these attributes more often appear on spans with errors. It also allows you to compare the errors per minute in the baseline and the regression windows. Because attributes provide contextual data about a span, this table can often give you clues to what might be causing the errors.

Learn more about Correlations

A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Cloud Observability helps you find attributes correlated with latency and errors automatically.

Correlations searches over all the span data in all traces that the current query participates in and finds patterns in that data for latency and error contribution, looking at service/operation pairs and attributes. This analysis allows you to identify services, operations, and attributes that are meaningfully associated with observed regressions. By surfacing statistical relationships between these system attributes, you can quickly generate and validate hypotheses during a root cause investigation. See the Correlations documentation to learn more.
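
A much-simplified sketch of the underlying idea (the real analysis is more sophisticated, and the span layout below is an assumption for illustration only):

    # Compare how often spans with a given attribute value have errors
    # versus spans without it; a large positive difference suggests the
    # attribute is correlated with the errors.

    def error_rate(spans):
        return sum(s["error"] for s in spans) / len(spans) if spans else 0.0

    def correlation_score(spans, key, value):
        """Difference in error rate between spans with and without key=value."""
        with_attr = [s for s in spans if s["attributes"].get(key) == value]
        without_attr = [s for s in spans if s["attributes"].get(key) != value]
        return error_rate(with_attr) - error_rate(without_attr)

    spans = [
        {"error": True,  "attributes": {"http.status_code": 429}},
        {"error": True,  "attributes": {"http.status_code": 429}},
        {"error": False, "attributes": {"http.status_code": 200}},
        {"error": False, "attributes": {"http.status_code": 200}},
    ]
    print(correlation_score(spans, "http.status_code", 429))  # 1.0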

For example, in the image below you can see that the 429 status code and the error type of server have the same increase in error rate, which likely means they appear on the same spans.

Click the Filter icon on a row in the Compare Attributes table to filter the Operation diagram, Logs Event analysis, and Trace Analysis table and only see spans with that attribute.

Click the More ( ⋮ ) icon to view example traces. Click one of the traces to view the full trace in the Trace view. Click Group by to group the current results in the Trace Analysis table by this attribute.

View log events

These are logs of events that your tracing instrumentation has appended to the spans. If you used tracing libraries that do not support log appending, you will not see Log Event Analysis.
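
For example, with the OpenTelemetry Python SDK you can append events to a span as shown below (OpenTelemetry calls these span events; tracer setup and exporting are omitted, and the attribute names are examples only):

    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("update-inventory") as span:
        try:
            raise ValueError("inventory item not found")  # example failure
        except ValueError as exc:
            span.record_exception(exc)  # appends an exception event to the span
            span.add_event("cache miss", {"item.id": "sku-123"})  # custom event
            span.set_attribute("error", True)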

Cloud Observability shows you the logs of events from the spans that make up the Operation Diagram (if you’ve filtered the diagram, the logs are filtered as well).

The graph shows the frequency of events in the spans that Cloud Observability has analyzed. The table below shows the log messages, the number of times that message appears on spans during both the baseline and regression, and the change in frequency between the two. The service and operation names are shown at the top of the message. When patterns are detected, placeholders display a range, type, etc.

Hover over a row and click the More icon for a log message to see example traces and to copy the log message. Click an example trace to view it in the full Trace view.

You can search for messages using the Search field. The results display the accuracy of the match and are sorted by accuracy. Pattern placeholders are replaced with the actual text from the message. Use the Filter icon to filter the Log analysis, Service diagram, and Trace Analysis table by the log message.

Compare span data using the Trace Analysis table

Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table is filtered with error=true to show only spans with errors. For each span you can see the service, the operation reporting the span, the span’s duration, and the start time from the regression time range. You can add other columns as needed by clicking the + icon.

Unlike the Trace Analysis table on the Explorer view, this table shows data not just from spans in the service and operation you’re investigating, but from all spans that participate in the same traces.

You can switch between analyzing traces from the baseline or regression time periods at the top of the table. Any filtering or grouping applies to the data from both time periods.

Filter the data

By default, Cloud Observability shows you all the data available from spans from the two time periods (baseline and regression). You can filter the data to narrow in on the cause of regression. When you apply filters, the Operation diagram, Log Event analysis, and Trace Analysis tables all repopulate with data that match the filters.

Multiple filter criteria are joined by the AND operator, so you can filter by only one service/operation pair at a time.
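
A minimal sketch of that AND behavior, assuming spans are represented as flat key/value attributes (an illustration, not how Cloud Observability stores spans):

    # A span matches only if every filter criterion matches (AND semantics),
    # which is why two different service/operation filters can never both be
    # satisfied by the same span.

    def matches(span, filters):
        return all(span.get(key) == value for key, value in filters.items())

    spans = [
        {"service": "inventory", "operation": "update-inventory", "error": True},
        {"service": "checkout",  "operation": "charge",           "error": True},
    ]

    filters = {"service": "inventory", "operation": "update-inventory", "error": True}
    print([s for s in spans if matches(s, filters)])
    # Only the inventory span matches; adding a second service filter would match nothing.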

You can apply filters in a number of different ways:

  • From the Narrow your results button: Clicking this button brings up a query builder (similar to the one in Explorer). You can add any operation, service, or attribute that exists in the data. This button is “sticky” and always displays at the top of a component that can be filtered.

    To filter by log messages, use the filter icon on the Log Event Analysis table.

  • From the Operation Diagram: Select a node in the diagram to filter by that service and operation.

See also

Service health

Updated Apr 26, 2023