Now that we know there was a spike in CPU usage for one of the hosts of the warehouse service, we can use Change Intelligence to find what caused it. Change Intelligence compares your system’s performance based on span data before and during the alert and correlates those changes to the metric deviation that triggered the alert.

Start Change Intelligence

You can start your investigation using Change Intelligence right from the Alert page. Click Analyze deviation.

Change Intelligence works by comparing performance during the deviation to performance during a baseline time period. You set the deviation time period in the panel; the baseline is the period immediately before it.

You can change the deviation time window by clicking and dragging the blue box.
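To make the relationship between the two windows concrete, here is a minimal sketch of how a baseline could be derived from a chosen deviation window. It assumes the baseline has the same length as the deviation window, which this guide does not state; the function name `baseline_window` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def baseline_window(deviation_start, deviation_end):
    """Return (start, end) of a baseline that ends where the deviation begins.

    Assumption: the baseline is as long as the deviation window.
    """
    length = deviation_end - deviation_start
    return deviation_start - length, deviation_start


# Hypothetical 30-minute deviation window.
dev_start = datetime(2024, 1, 1, 12, 0)
dev_end = dev_start + timedelta(minutes=30)

base_start, base_end = baseline_window(dev_start, dev_end)
print(base_start, base_end)
```

Dragging the blue box corresponds to changing `dev_start` and `dev_end`; in this sketch the baseline moves with them.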

Click Analyze deviation to go to Change Intelligence.

Compare performance

Change Intelligence compares the performance of Key Operations on the service that sent the metrics, looking for changes in latency, operation rate, and error rate during the deviation. The operations with the biggest change between the two time windows are shown. In this case, the update-catalog operation experienced changes.
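One way to picture this comparison is to score each operation by its largest relative change across the three SLIs and sort by that score. The sketch below is illustrative only and is not the product's actual scoring method. The `OperationStats` type, the `get-catalog` operation, and all numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class OperationStats:
    latency_ms: float
    rate_per_s: float
    error_rate: float


def relative_change(before, after):
    """Absolute change relative to the baseline value."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return abs(after - before) / before


def rank_operations(baseline, deviation):
    """Order operation names by their largest SLI change, biggest first."""
    scores = {}
    for name, base in baseline.items():
        dev = deviation[name]
        scores[name] = max(
            relative_change(base.latency_ms, dev.latency_ms),
            relative_change(base.rate_per_s, dev.rate_per_s),
            relative_change(base.error_rate, dev.error_rate),
        )
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical SLI values for two operations in each window.
baseline = {
    "update-catalog": OperationStats(120.0, 50.0, 0.01),
    "get-catalog": OperationStats(80.0, 300.0, 0.01),
}
deviation = {
    "update-catalog": OperationStats(480.0, 200.0, 0.01),
    "get-catalog": OperationStats(84.0, 310.0, 0.01),
}

ranked = rank_operations(baseline, deviation)
print(ranked)
```

With these sample values, update-catalog ranks first because its latency and rate both changed far more than anything on get-catalog.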

Once Change Intelligence finds performance changes in a Key Operation, it looks at span data from that operation and analyzes the corresponding traces, searching for attributes that appear frequently on spans from services up and down the stack during the performance regression. In other words, if an attribute appears on many spans during the deviation (and on few or none from the baseline), it’s likely that something about that attribute is correlated with the issue.
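The idea can be sketched as a frequency comparison: for each attribute, measure the share of spans that carry it in each window and rank attributes by how much that share grew. This is a simplified illustration, not the product's actual correlation algorithm. Spans are modeled as plain dictionaries, and the customer names other than ProWool are hypothetical.

```python
from collections import Counter


def attribute_frequency(spans):
    """Map each (key, value) attribute to the fraction of spans carrying it."""
    counts = Counter()
    for span in spans:
        counts.update(set(span.items()))
    total = len(spans)
    return {attr: n / total for attr, n in counts.items()} if total else {}


def correlated_attributes(baseline_spans, deviation_spans):
    """Rank attributes by how much more often they appear during the deviation."""
    base = attribute_frequency(baseline_spans)
    dev = attribute_frequency(deviation_spans)
    scores = {attr: freq - base.get(attr, 0.0) for attr, freq in dev.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical span attributes sampled from each window.
baseline_spans = [
    {"customer": "Acme", "service": "iOS"},
    {"customer": "Initech", "service": "iOS"},
    {"customer": "Acme", "service": "web"},
    {"customer": "ProWool", "service": "iOS"},
]
deviation_spans = [
    {"customer": "ProWool", "service": "iOS"},
    {"customer": "ProWool", "service": "iOS"},
    {"customer": "ProWool", "service": "iOS"},
    {"customer": "Acme", "service": "web"},
]

ranking = correlated_attributes(baseline_spans, deviation_spans)
print(ranking[0])
```

In this sample, `service: iOS` is equally common in both windows, so it scores zero, while `customer: ProWool` jumps from a quarter of the spans to three quarters and tops the ranking.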

In this case, the customer:ProWool attribute is highly correlated. The row in the table shows that it appears on spans from the iOS service.

Expanding that row shows SLI charts for that span, and you can see that latency and rate both increased.

Using the Show comparison toggle, you can see that for spans that didn’t have that attribute, performance didn’t change nearly as much.
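The comparison amounts to splitting spans by whether they carry the attribute and measuring the change for each group separately. Here is a minimal sketch under the assumption that each span records a `latency_ms` value; the field name, the helper `mean_latency_ratio`, and the sample latencies are all hypothetical.

```python
from statistics import mean


def mean_latency_ratio(baseline_spans, deviation_spans, matches):
    """Ratio of mean latency (deviation / baseline) for spans where matches(span) is true.

    Returns None when either window has no matching spans.
    """
    base = [s["latency_ms"] for s in baseline_spans if matches(s)]
    dev = [s["latency_ms"] for s in deviation_spans if matches(s)]
    if not base or not dev:
        return None
    return mean(dev) / mean(base)


def is_prowool(span):
    return span.get("customer") == "ProWool"


def not_prowool(span):
    return not is_prowool(span)


# Hypothetical latencies for spans in each window.
baseline_spans = [
    {"customer": "ProWool", "latency_ms": 100},
    {"customer": "ProWool", "latency_ms": 100},
    {"customer": "Acme", "latency_ms": 100},
    {"customer": "Acme", "latency_ms": 100},
]
deviation_spans = [
    {"customer": "ProWool", "latency_ms": 400},
    {"customer": "ProWool", "latency_ms": 400},
    {"customer": "Acme", "latency_ms": 110},
    {"customer": "Acme", "latency_ms": 110},
]

with_attr = mean_latency_ratio(baseline_spans, deviation_spans, is_prowool)
without_attr = mean_latency_ratio(baseline_spans, deviation_spans, not_prowool)
print(with_attr, without_attr)
```

A large ratio for spans with the attribute next to a ratio near 1 for spans without it is the pattern the toggle makes visible.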

So now we know that requests from ProWool coming into the iOS service generated many of the traces that experienced regressions in performance during the metric deviation time period. Now that Change Intelligence has narrowed down our investigation, we can look at traces to understand why.

View traces

Clicking on a span sample during the deviation opens the corresponding trace.

The trace view shows us that there was an issue writing to the database from the cache. In this trace, the customer ProWool sent 2,000 requests to update the catalog, which overwhelmed the write-cache and database-update operations and, in turn, caused the increase in CPU utilization.

Change Intelligence found this issue!

What did we learn?

  • Change Intelligence compares performance of Key Operations on the service that sent the metric to find changes in performance between the baseline and deviation time windows.
  • Change Intelligence finds attributes up and down the stack that most often appear in traces with performance issues.
  • By correlating SLI performance to metric deviations using attributes often found during the deviation, Change Intelligence can help find the answer quickly.