Now that we know there was a spike in CPU usage for one of the hosts of the warehouse service, we can use Change Intelligence to find what caused it. Change Intelligence uses trace data to compare your system’s performance before and during the alert, then correlates those changes with the metric deviation that triggered the alert.
Start Change Intelligence
You can start your investigation using Change Intelligence right from the Alert page. Click into the chart where the threshold was crossed, and choose What caused this change?
Change Intelligence begins by setting the baseline and deviation time periods. The deviation time is set at the point where you clicked the chart and the baseline is set to be right before that.
You can change the baseline and deviation time windows by clicking and dragging.
Once the comparison time windows are set, Change Intelligence compares the performance of Key Operations on the service that sent the metrics, looking for changes in latency, operation rate, and error rate during the deviation. The operations with the biggest change between the two time windows are shown. In this case, the update-catalog and refresh-catalog operations both experienced changes in p99 latency, but the update-catalog operation also experienced a change in rate.
Let’s focus the investigation on the update-catalog operation, as it experienced the most change.
Icons display the magnitude of change.
Expanding the latency and rate sections, we can see that the changes do coincide with the changes seen in the metric deviation. They both show an increase around the same time, while the baseline remains flat.
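The window comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Change Intelligence is implemented: the `Span` type, the `window_stats` and `rank_changes` helpers, and the scoring rule (sum of absolute relative changes) are all assumptions made for the example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record used only for this sketch.
    operation: str
    start: float        # seconds on an arbitrary clock
    duration_ms: float
    error: bool = False


def percentile(values, pct):
    """Nearest-rank percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def window_stats(spans, window):
    """Per-operation p99 latency, rate, and error rate inside [start, end)."""
    start, end = window
    by_op = defaultdict(list)
    for span in spans:
        if start <= span.start < end:
            by_op[span.operation].append(span)
    return {
        op: {
            "p99_ms": percentile([s.duration_ms for s in group], 99),
            "rate": len(group) / (end - start),
            "error_rate": sum(s.error for s in group) / len(group),
        }
        for op, group in by_op.items()
    }


def relative_change(before, after):
    """Fractional change from before to after (1.0 means +100%)."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return (after - before) / before


def rank_changes(spans, baseline, deviation):
    """Rank operations seen in both windows by how much they changed."""
    base = window_stats(spans, baseline)
    dev = window_stats(spans, deviation)
    ranked = []
    for op, dev_stats in dev.items():
        if op not in base:
            continue  # no baseline to compare against
        rel = {k: relative_change(base[op][k], v) for k, v in dev_stats.items()}
        # Assumed scoring rule: total absolute relative change across signals.
        score = sum(abs(v) for v in rel.values())
        ranked.append((op, rel, score))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked
```

Under this scoring rule, an operation whose latency and rate both rise ranks above one whose latency alone rises, which mirrors why update-catalog is the operation to investigate first.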
Once Change Intelligence finds performance changes in a Key Operation, it looks at traces with that operation. It analyses the traces, searching for attributes that appear frequently on spans from services up and down the stack during the performance regression, and surfaces those attributes as “clues” to the issue. In other words, if an attribute appears on a number of traces with performance issues (and doesn’t on traces that are stable), it’s likely that something about that attribute is causing the issue.
In this case, the attribute customer:ProWool is surfaced. Expanding that section shows that it’s sent on spans from the /api/update-catalog operation on the iOS service and that it appears in over 40% of the traces during the deviation. On traces with that attribute, latency and rate both increased, while on traces without that attribute, both were fairly stable.
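The attribute analysis described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the product's actual algorithm: the `Trace` type, the `find_clues` helper, and the `min_share` threshold are hypothetical names and values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical, simplified trace record used only for this sketch.
    regressed: bool        # True if the trace shows the performance change
    attributes: frozenset  # e.g. frozenset({"customer:ProWool"})


def find_clues(traces, min_share=0.25):
    """Return attributes that are common on regressed traces and
    less common on stable ones, strongest difference first.

    Each result is (attribute, share_of_regressed, share_of_stable).
    """
    regressed = [t for t in traces if t.regressed]
    stable = [t for t in traces if not t.regressed]
    if not regressed:
        return []

    # Each attribute counts at most once per trace because attributes is a set.
    regressed_counts = Counter(a for t in regressed for a in t.attributes)
    stable_counts = Counter(a for t in stable for a in t.attributes)

    clues = []
    for attr, count in regressed_counts.items():
        regressed_share = count / len(regressed)
        stable_share = stable_counts[attr] / len(stable) if stable else 0.0
        # min_share is an assumed cutoff to ignore rare attributes.
        if regressed_share >= min_share and regressed_share > stable_share:
            clues.append((attr, regressed_share, stable_share))

    clues.sort(key=lambda c: c[1] - c[2], reverse=True)
    return clues
```

An attribute that appears on every trace, regressed or not, is filtered out because it does not distinguish the two groups, while an attribute such as customer:ProWool that shows up mainly on regressed traces is kept as a clue.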
So now we know that requests coming into the iOS service from ProWool generated many of the traces that experienced regressions in performance during the metric deviation time period. Now that Change Intelligence has narrowed down our investigation, we can look at traces to understand why.
Change Intelligence collects traces from both the baseline and deviation windows that experience the change in performance and carry the suspect attribute.
Clicking on one shows us that there was an issue writing to the database from the cache. In this trace, the customer ProWool sent 2,000 requests to update the catalog, which overwhelmed the database-update operations and, in turn, caused the increase in CPU utilization.
Change Intelligence found this issue!
What did we learn?
- Change Intelligence compares performance of Key Operations on the service that sent the metric to find changes in performance between the baseline and deviation time windows.
- Change Intelligence finds attributes up and down the stack that most often appear in traces with performance issues.
- By making correlations between SLI performance and metric deviations and correlations between that performance and frequently found attributes, Change Intelligence can find the answer to “What caused that change?”