Once you’ve created charts in notebooks, dashboards, or metric alerts, you can use Change Intelligence to investigate what caused any deviations in those charts. For metric data, Change Intelligence looks for attributes on key operations up and down the request path of example traces to find those that correlate with the change in performance. For span data, it looks for attributes that match the chart’s query whose performance also changed during the deviation.
For metric charts, Change Intelligence looks for the service.name attribute (tag/label) in your metric data to determine the sending service. If you want to change that or add others, you need to register them.
You will get the most value out of Change Intelligence when you fully instrument your services for distributed tracing. Adding descriptive attributes to your spans ensures that Change Intelligence can correlate deviations with spans in your trace data to find what caused the change.
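To make the role of the service.name attribute concrete, here is a minimal sketch of resolving the sending service from metric data points. The point layout, the metric name, the `host` attribute, and the helper names `sending_service` and `group_by_service` are all hypothetical; only the `service.name` key comes from the text above. This is not how Lightstep Observability stores or processes metrics.

```python
from collections import defaultdict

# Hypothetical metric data points. Only the "service.name" key is taken
# from the documentation; everything else is made up for illustration.
points = [
    {"name": "cpu.utilization", "value": 0.42,
     "attributes": {"service.name": "inventory", "host": "a1"}},
    {"name": "cpu.utilization", "value": 0.91,
     "attributes": {"service.name": "inventory", "host": "a2"}},
    {"name": "cpu.utilization", "value": 0.37,
     "attributes": {"service.name": "checkout", "host": "b1"}},
    {"name": "cpu.utilization", "value": 0.55,
     "attributes": {"host": "c1"}},  # no service attribute at all
]


def sending_service(point, key="service.name"):
    """Return the sending service for a point, or None if the attribute is missing."""
    return point.get("attributes", {}).get(key)


def group_by_service(points, key="service.name"):
    """Group points by sending service; points without the attribute land under None."""
    groups = defaultdict(list)
    for point in points:
        groups[sending_service(point, key)].append(point)
    return dict(groups)


groups = group_by_service(points)
for service, members in groups.items():
    print(service, len(members))
```

Points grouped under `None` cannot be tied to a service, which is why a metric that uses a different attribute name needs that attribute registered.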
Investigate a deviation from a chart or a dashboard
You can start your investigation directly from a dashboard, or from a chart open from a dashboard, a metric alert, or a notebook.
In a dashboard or chart (open in the editor or the Alert configuration page), click Analyze deviation.
Change Intelligence works best when it can focus on a single service and its dependencies. If your dashboard or chart includes metrics coming from many services, use filters to choose one service to focus on. For a dashboard, use the global filter. For a chart, edit your query to filter the data to include one service. You need to save your chart before you can start Change Intelligence.
In the dialog, drag the blue window to the deviation in the chart and click Analyze deviation.
Change Intelligence opens with the chart you started from, showing the baseline (purple) and deviation (blue) time windows that it uses to compare performance. You can adjust the comparison by selecting a different time period or by changing the baseline and deviation time windows.
You can get a better view of the baseline and deviation time periods by expanding the accordions above the chart.
You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The dashboard or chart redraws to report on just the period you selected.
To see the original query for the chart, click View query.
The query displays in a read-only format. If any global filters have been applied, you can see those by clicking the filter dropdown.
(Metric data only) Compare the baseline and deviation data for the Key Operation with the highest magnitude of change.
Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric. It displays the operations that experienced a change in performance and lists them on the page, sorted by highest degree of change first (operations without any changes aren’t displayed).
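The ranking described above can be sketched as follows. This is an illustration under stated assumptions, not Lightstep's actual analysis: the metric names, the sample numbers, the use of relative change as the "degree of change", and every operation other than /api/update-catalog are hypothetical.

```python
def relative_change(baseline, deviation):
    """Relative change from baseline to deviation (infinite if baseline is 0 and deviation is not)."""
    if baseline == 0:
        return 0.0 if deviation == 0 else float("inf")
    return (deviation - baseline) / baseline


def rank_operations(stats):
    """Return (operation, magnitude) pairs, largest change first.

    The magnitude is the largest absolute relative change across the
    operation's metrics. Operations with no change are left out.
    """
    ranked = []
    for operation, windows in stats.items():
        magnitude = max(
            abs(relative_change(windows["baseline"][m], windows["deviation"][m]))
            for m in windows["baseline"]
        )
        if magnitude > 0:
            ranked.append((operation, magnitude))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical per-operation averages for each time window.
stats = {
    "/api/get-catalog": {
        "baseline": {"latency_ms": 80.0, "rate": 40.0, "error_rate": 0.02},
        "deviation": {"latency_ms": 100.0, "rate": 40.0, "error_rate": 0.02},
    },
    "/api/update-catalog": {
        "baseline": {"latency_ms": 100.0, "rate": 50.0, "error_rate": 0.01},
        "deviation": {"latency_ms": 250.0, "rate": 80.0, "error_rate": 0.01},
    },
    "/api/health": {
        "baseline": {"latency_ms": 5.0, "rate": 10.0, "error_rate": 0.0},
        "deviation": {"latency_ms": 5.0, "rate": 10.0, "error_rate": 0.0},
    },
}

ranked = rank_operations(stats)
for operation, magnitude in ranked:
    print(f"{operation}: {magnitude:.0%}")
```

In this sample data, /api/update-catalog comes first because its latency changed the most, and /api/health is dropped because nothing changed.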
Sparkline charts display the average latency, operation rate, and error rate for the operation, for both the baseline (purple) and deviation (blue), sorted by the magnitude of change (indicated by an icon). For latency, the percentile of latency (p99, p95, p50) with the largest change is displayed. You can expand each section for a closer view.
You can now see how changes in performance of an operation may coincide with the deviation in the metric. And you can see if those changes are also present on other Key Operations.
For metric data, expand the first attribute row in the Attributes with the most change area to view the attributes that appeared in most traces with the performance change, along with the percentage of traces where the attribute appears, in both the baseline and deviation. For span data, these are already displayed in the table.
More about the most likely cause
When Change Intelligence finds performance changes in an operation occurring at the same time as the metric deviation, it analyzes traces that include that operation. It finds attributes (on operations up and down the request path) that appear frequently on traces from the performance regression, and correlates each of those attributes with a possible root cause. In other words, if an attribute appears on many traces with performance issues (and doesn’t appear on traces that are stable), it’s likely that something about that attribute is causing the issue.
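The idea can be sketched by comparing how often each attribute appears on baseline traces versus deviation traces. This is a simplified illustration, not Lightstep's correlation algorithm: each trace is reduced to a flat dictionary of attributes, the trace counts are invented, and the `region` attribute and the Acme and Initech customers are hypothetical. Only customer:ProWool comes from the example on this page.

```python
from collections import Counter


def attribute_share(traces):
    """Fraction of traces on which each (key, value) attribute appears."""
    if not traces:
        return {}
    counts = Counter()
    for attributes in traces:
        counts.update(set(attributes.items()))
    return {attr: n / len(traces) for attr, n in counts.items()}


def rank_attributes(baseline, deviation):
    """Rows of (attribute, baseline share, deviation share, increase), largest increase first."""
    base = attribute_share(baseline)
    dev = attribute_share(deviation)
    rows = [
        (attr, base.get(attr, 0.0), dev.get(attr, 0.0),
         dev.get(attr, 0.0) - base.get(attr, 0.0))
        for attr in set(base) | set(dev)
    ]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


def make_traces(customer, count):
    """Build `count` hypothetical traces for one customer."""
    return [{"customer": customer, "region": "us-east"} for _ in range(count)]


baseline = make_traces("ProWool", 1) + make_traces("Acme", 5) + make_traces("Initech", 4)
deviation = make_traces("ProWool", 4) + make_traces("Acme", 4) + make_traces("Initech", 2)

rows = rank_attributes(baseline, deviation)
for (key, value), b, d, increase in rows:
    print(f"{key}:{value}  baseline={b:.0%}  deviation={d:.0%}  change={increase:+.0%}")
```

An attribute such as `region` that appears on every trace in both windows shows no change, so it carries no signal. The customer whose share grows during the deviation rises to the top.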
When you expand the attribute row, Change Intelligence provides a number of useful tools to help you with your analysis.
Mini service diagram (metric data only)
This map shows you the service and operation in the trace sending the attribute. In this example, the /api/update-catalog operation on the iOS service is sending the attribute.
SLIs for traces
The average latency, operation rate, and error rate are shown for the baseline compared to the deviation, for traces both with the suspect attribute and without. In this example, you can see that 40% of traces had the attribute customer:ProWool and they also had an increase in latency and operation rate.
Below that, you can compare that performance with traces (from both the baseline and deviation) that did not have that attribute. In this example, you can see that traces that did not have the customer:ProWool attribute didn’t experience the same degradation in performance.
Because traces for requests from ProWool show more latency and a higher operation rate, while traces without the attribute do not, it is likely that traces with this attribute can pinpoint the problem.
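The with/without comparison can be sketched for one SLI, average latency. The trace layout, the latency values, and the helper name `split_latency` are hypothetical and chosen only to mirror the ProWool example. The actual view also compares operation rate and error rate.

```python
from statistics import mean


def split_latency(traces, key, value):
    """Average latency (ms) for traces with and without the attribute key=value."""
    with_attr = [t["latency_ms"] for t in traces if t["attributes"].get(key) == value]
    without = [t["latency_ms"] for t in traces if t["attributes"].get(key) != value]
    return (
        mean(with_attr) if with_attr else None,
        mean(without) if without else None,
    )


def trace(customer, latency_ms):
    """Build one hypothetical trace summary."""
    return {"attributes": {"customer": customer}, "latency_ms": latency_ms}


# Invented sample data for the two time windows.
baseline = [trace("ProWool", 100.0), trace("ProWool", 110.0),
            trace("Acme", 90.0), trace("Acme", 100.0)]
deviation = [trace("ProWool", 400.0), trace("ProWool", 420.0),
             trace("Acme", 95.0), trace("Acme", 105.0)]

base_with, base_without = split_latency(baseline, "customer", "ProWool")
dev_with, dev_without = split_latency(deviation, "customer", "ProWool")

print(f"with attribute:    {base_with:.0f} ms -> {dev_with:.0f} ms")
print(f"without attribute: {base_without:.0f} ms -> {dev_without:.0f} ms")
```

When latency rises sharply only in the group that carries the attribute, as in this sample data, the attribute is a strong candidate for further investigation.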
Click View sample traces to view traces with both the suspect attribute and performance issues.
Traces are sorted by latency, and you can choose traces from both the baseline and deviation.
Clicking a trace opens it in the Trace view where you can see the critical path through the request and view details about each span in the trace.
More about Trace view
You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.
Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep Observability, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.
Learn more about Trace view.
In this example, it looks like ProWool sent over 1,000 requests and the write to the database is overwhelmed. That’s likely why the CPU metric spiked.
If the attribute in the first row did not lead to the issue, continue selecting other attributes that are correlated with performance changes.
The attributes are listed by magnitude of change, so the lower down the list, the less likely it is that you’ll find the issue.
For metric data, if the issue isn’t found from the first Key Operation, start again at Step 3 for the next Key Operation on the page.
Change the chart’s display
You can change the time period for the chart, change the baseline and deviation time windows, and turn off/on deployment markers.
Change the chart’s time period
Use the time picker to select a different time period. Use the < > controls to move backwards and forwards through time. Change Intelligence re-calculates to restrict analyses to that time period.
Change the baseline and deviation windows
By default, the baseline is set as the time directly before where you clicked in the chart. The deviation time period is set at the time period where you clicked the chart.
Change either time period by grabbing the handles on the window and dragging. Or, drag a time window to move it.
As you change the time period(s), Change Intelligence re-calculates to find the biggest changes in performance between the two time periods.
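As a rough sketch of how two adjacent time windows drive the comparison, the following code places a deviation window around a clicked point, puts a baseline window directly before it, and averages the chart values in each. The window width, the centering on the click, the half-open boundaries, and the sample CPU values are all assumptions made for illustration. The documentation does not specify them.

```python
from datetime import datetime, timedelta


def default_windows(clicked_at, width=timedelta(minutes=10)):
    """Return (baseline, deviation) windows as (start, end) pairs.

    Assumption: the deviation window is centered on the click and the
    baseline window has the same width and ends where the deviation starts.
    """
    deviation = (clicked_at - width / 2, clicked_at + width / 2)
    baseline = (deviation[0] - width, deviation[0])
    return baseline, deviation


def window_mean(series, window):
    """Mean of the values whose timestamp falls in [start, end), or None if empty."""
    start, end = window
    values = [value for timestamp, value in series if start <= timestamp < end]
    return sum(values) / len(values) if values else None


clicked = datetime(2024, 1, 1, 12, 0)
baseline, deviation = default_windows(clicked)

# Hypothetical CPU readings (percent), one every five minutes from 11:45.
first = datetime(2024, 1, 1, 11, 45)
series = [(first + timedelta(minutes=5 * i), v)
          for i, v in enumerate([30, 32, 31, 80, 85])]

print("baseline mean:", window_mean(series, baseline))
print("deviation mean:", window_mean(series, deviation))
```

Moving or resizing either window changes which samples fall into each group, which is why the analysis is recalculated after every adjustment.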
Turn deployment markers on or off (metric data only)
Lightstep Observability can determine when deploys occurred if your service instrumentation includes an attribute that holds the version of your service. When Lightstep finds that the value for that attribute has changed, it displays a deployment marker on the chart. Deployment markers can help you associate recent deployments with changes in performance.
By default, Change Intelligence displays the markers. To turn them off, click Display Settings.
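Detecting a deploy from a version attribute amounts to watching for a change in that attribute's value over time. The sketch below illustrates the idea only. The `(timestamp, version)` input shape, the version strings, and the function name `deployment_markers` are hypothetical, and this is not Lightstep's implementation.

```python
def deployment_markers(observations):
    """Return (timestamp, old_version, new_version) for each version change.

    `observations` is a time-ordered list of (timestamp, version) pairs,
    for example the value of a version attribute seen on successive spans.
    """
    markers = []
    previous = None
    for timestamp, version in observations:
        if previous is not None and version != previous:
            markers.append((timestamp, previous, version))
        previous = version
    return markers


# Hypothetical observations: seconds since the start of the chart, version seen.
observations = [
    (0, "1.4.0"),
    (60, "1.4.0"),
    (120, "1.5.0"),
    (180, "1.5.0"),
    (240, "1.5.1"),
]

markers = deployment_markers(observations)
for timestamp, old, new in markers:
    print(f"t={timestamp}s: {old} -> {new}")
```

A marker that lines up with the start of the deviation window suggests checking what changed in that release.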
Share your findings
You can share the findings with anyone with a Lightstep Observability account, and they can see and work with Change Intelligence using the exact data that you used.
You can share a specific correlation by clicking the Share button.
When users click the link, they are directed to that correlation, highlighted in blue.
Or, click the Share button at the top of the page to share a link to the entire page.