Now that we know there was a spike in CPU usage for one of the hosts of the warehouse service, we can use Change Intelligence to find what caused it. Change Intelligence compares your system’s performance based on span data before and during the alert and correlates those changes to the metric deviation that triggered the alert.
Start Change Intelligence
You can start your investigation using Change Intelligence right from the Alert page. Click Analyze deviation.
Change Intelligence works by comparing performance during the deviation to performance during a baseline time period. You set the deviation time period in the panel (the baseline is set to be right before that).
You can change the deviation time window by clicking and dragging the blue box.
Click Analyze deviation to go to Change Intelligence.
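The relationship between the two time windows can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the `Window` type and `baseline_for` function are hypothetical names, and the default of giving the baseline the same length as the deviation is an assumption (the text above only says the baseline sits right before the deviation).

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Window(NamedTuple):
    """A half-open time range [start, end). Hypothetical helper type."""
    start: datetime
    end: datetime


def baseline_for(deviation: Window, length: Optional[timedelta] = None) -> Window:
    """Return a baseline window that ends exactly where the deviation starts."""
    if deviation.end <= deviation.start:
        raise ValueError("deviation window must have a positive length")
    # Assumption: unless told otherwise, use a baseline as long as the deviation.
    if length is None:
        length = deviation.end - deviation.start
    return Window(start=deviation.start - length, end=deviation.start)
```

Dragging the blue box corresponds to changing `deviation`; in this sketch the baseline follows it automatically because it is derived from the deviation's start time.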
Compare performance
Change Intelligence compares the performance of Key Operations on the service that sent the metrics, looking for changes in latency, operation rate, and error rate during the deviation. The operations with the biggest change between the two time windows are shown. In this case, the update-catalog operation experienced changes.
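To make the comparison concrete, here is a minimal sketch of ranking operations by how much their latency, rate, and error ratio moved between two windows. Everything in it is an assumption made for illustration: the `Span` shape, the choice of p95 as the latency statistic, and the scoring rule (largest relative change across the three measures) are not taken from Change Intelligence itself.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record."""
    operation: str
    duration_ms: float
    error: bool


def sli(spans, window_seconds):
    """Return (p95 latency in ms, spans per second, error ratio)."""
    if not spans:
        return (0.0, 0.0, 0.0)
    durations = sorted(s.duration_ms for s in spans)
    # Nearest-rank 95th percentile.
    p95 = durations[max(0, math.ceil(0.95 * len(durations)) - 1)]
    rate = len(spans) / window_seconds
    error_ratio = sum(s.error for s in spans) / len(spans)
    return (p95, rate, error_ratio)


def relative_change(base, dev):
    """Fractional change from base to dev; infinite if base is zero and dev is not."""
    if base == 0:
        return 0.0 if dev == 0 else float("inf")
    return (dev - base) / base


def rank_operations(baseline, deviation, window_seconds):
    """Sort operations by their largest relative SLI change, biggest first."""
    operations = {s.operation for s in baseline} | {s.operation for s in deviation}
    scored = []
    for op in operations:
        before = sli([s for s in baseline if s.operation == op], window_seconds)
        during = sli([s for s in deviation if s.operation == op], window_seconds)
        changes = tuple(relative_change(b, d) for b, d in zip(before, during))
        scored.append((op, changes))
    return sorted(scored, key=lambda item: max(abs(c) for c in item[1]), reverse=True)
```

Each entry in the result pairs an operation name with its relative change in latency, rate, and error ratio, so an operation such as update-catalog would rise to the top if its spans became slower or more frequent during the deviation.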
Once Change Intelligence finds performance changes in a Key Operation, it looks at span data from that operation and analyzes the corresponding traces, searching for attributes that appear frequently on spans from services up and down the stack during the performance regression. In other words, if an attribute appears on a number of spans during the deviation (and doesn’t appear on spans from the baseline), it’s likely that something about that attribute is correlated with the issue.
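The idea of an attribute that shows up during the deviation but not in the baseline can be sketched as a simple frequency comparison. This is an illustrative simplification, not the actual correlation method: the function names, the representation of a span as a plain dictionary of attributes, and the `min_lift` threshold are all assumptions.

```python
from collections import Counter


def attribute_frequency(spans):
    """Fraction of spans that carry each (key, value) attribute pair."""
    if not spans:
        return {}
    counts = Counter()
    for attributes in spans:
        counts.update(set(attributes.items()))
    return {pair: n / len(spans) for pair, n in counts.items()}


def correlated_attributes(baseline, deviation, min_lift=0.2):
    """Return attribute pairs that are notably more common during the deviation.

    The lift is the frequency in the deviation minus the frequency in the
    baseline. The 0.2 default threshold is an arbitrary, hypothetical choice.
    """
    before = attribute_frequency(baseline)
    during = attribute_frequency(deviation)
    lifts = {pair: freq - before.get(pair, 0.0) for pair, freq in during.items()}
    ranked = sorted(lifts.items(), key=lambda item: item[1], reverse=True)
    return [(pair, lift) for pair, lift in ranked if lift >= min_lift]
```

An attribute present on every span in both windows gets a lift of zero and drops out, while one that appears mostly during the deviation ranks first. That is the intuition behind the correlation described above.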
In this case, the customer:ProWool attribute is highly correlated. The row in the table shows that it appears on spans from the iOS service.
Expanding that row shows SLI charts for that span, and you can see that latency and rate both increased.
Using the Show comparison toggle, you can see that for spans that didn’t have that attribute, performance didn’t change nearly as much.
So now we know that requests from ProWool coming into the iOS service generated many of the traces that experienced regressions in performance during the metric deviation time period. Now that Change Intelligence has narrowed down our investigation, we can look at traces to understand why.
View traces
Clicking on a span sample during the deviation opens the corresponding trace.
The trace view shows us that there was an issue writing to the database from the cache. In this trace, the customer ProWool sent 2,000 requests to update the catalog, which overwhelmed the write-cache and database-update operations and, in turn, caused the increase in CPU utilization.
Change Intelligence found this issue!
What did we learn?
- Change Intelligence compares performance of Key Operations on the service that sent the metric to find changes in performance between the baseline and deviation time windows.
- Change Intelligence finds attributes up and down the stack that most often appear in traces with performance issues.
- By correlating SLI performance to metric deviations using attributes often found during the deviation, Change Intelligence can help find the answer quickly.