Once you’ve created charts in notebooks, dashboards, or alerts, you can use Change Intelligence to investigate what caused any deviations in those charts. For metric data, Change Intelligence looks for attributes on key operations up and down the request path of example traces to find those that correlate with the change in performance. For span data, it looks for attributes in traces that match the chart’s query whose performance also changed during the deviation.
For metric charts, Change Intelligence looks for the service.name attribute (tag/label) in your metric data to determine the sending service. If you want to use a different attribute or add others, you need to register them.
You will get the most value out of Change Intelligence when you fully instrument your services for distributed tracing. The more attributes your spans carry, the more Change Intelligence has to work with when it correlates deviations to spans in your trace data to find what caused the change.
Investigate a deviation from a chart or a dashboard
You can start your investigation directly from a dashboard, or from a chart open from a dashboard, an alert, or a notebook.
In a dashboard or chart, click Analyze deviation.
Change Intelligence works best when it can focus on a single service and its dependencies. If your dashboard or chart includes metrics coming from many services, use filters to choose one service to focus on. For a dashboard, use the global filter. For a chart, edit your query to filter the data to include one service. You need to save your chart before you can start Change Intelligence.
In the dialog, drag the blue window to the deviation in the chart and click Analyze deviation.
Change Intelligence opens with the chart you started from, showing the baseline (purple) and deviation (blue) time windows that it uses to compare performance. You can change this by selecting a different time period or changing the baseline and deviation time windows.
You can get a better view of the baseline and deviation time periods by expanding the accordions above the chart.
You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The dashboard or chart redraws to report on just the period you selected.
To see the original query for the chart, click View query.
The query displays in a read-only format. If any global filters have been applied, you can see those by clicking the filter dropdown.
(Metric data only) Compare the baseline and deviation data for the Key Operation with the highest magnitude of change.
Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric. It displays the operations that experienced a change in performance and lists them on the page, sorted by highest degree of change first (operations without any changes aren’t displayed).
Sparkline charts display the average latency, operation rate, and error rate for the operation, for both the baseline (purple) and deviation (blue), sorted by the magnitude of change (indicated by an icon). For latency, the percentile of latency (p99, p95, p50) with the largest change is displayed. You can expand each section for a closer view.
You can now see how changes in performance of an operation may coincide with the deviation in the metric. And you can see if those changes are also present on other Key Operations.
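To make the ordering concrete, here is a minimal Python sketch of ranking operations by magnitude of change and picking the latency percentile that changed the most. It is an illustration only, not Lightstep's implementation: the OperationStats type, the relative_change measure, and the sample field names are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class OperationStats:
    """Hypothetical container: latency percentiles (ms) for one operation."""
    name: str
    baseline: dict   # e.g. {"p50": 20, "p95": 80, "p99": 100}
    deviation: dict  # same keys, measured in the deviation window


def relative_change(before: float, after: float) -> float:
    """Assumed measure of magnitude: absolute change relative to the baseline."""
    if before == 0:
        return float("inf") if after != 0 else 0.0
    return abs(after - before) / before


def largest_percentile_change(op: OperationStats):
    """Return (percentile_name, change) for the percentile that moved the most."""
    changes = ((p, relative_change(op.baseline[p], op.deviation[p]))
               for p in op.baseline)
    return max(changes, key=lambda item: item[1])


def rank_operations(ops):
    """Sort operations by highest change first, dropping unchanged ones."""
    ranked = []
    for op in ops:
        percentile, change = largest_percentile_change(op)
        if change > 0:  # operations without any change are not listed
            ranked.append((op.name, percentile, change))
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked
```

Given one operation whose p99 tripled and another whose p50 rose by half, rank_operations lists the first one on top and reports p99 as its most changed percentile.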
In the Attributes with most change table, Lightstep lists the attributes that appeared in most traces with the performance change (correlated attributes). Each attribute row displays a map icon showing the relationship of the service containing the attribute to the service you queried. Hover over the icon to see a larger view of the map. The row also shows the service and operation that sent the attribute in its span, along with the percentage of traces where that attribute appears, in both the baseline and deviation.
For span data, tabs on the table allow you to view correlated attributes based on the following:
- On queried spans: Shows attributes only from spans returned by the original query
- On queried service: Shows attributes only on spans from the service in your query
- From upstream: Shows correlated attributes only on spans upstream of the service you queried on
- From downstream: Shows attributes only on spans downstream of the service you queried on
- On all spans: Shows all attributes from all spans in the corresponding traces
Only tabs that have relevant span data display.
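One way to picture the upstream and downstream tabs is to classify the spans of a single trace relative to the queried service. The Python sketch below assumes a hypothetical span shape (a dict of span ID to service name and parent span ID) and treats ancestors of the queried service's spans as upstream and their descendants as downstream. It illustrates the idea and is not how Lightstep computes the tabs.

```python
def classify_spans(spans, queried_service):
    """Group span IDs of one trace relative to the queried service.

    spans: dict of span_id -> {"service": str, "parent": span_id or None}
    (a hypothetical shape chosen for this example).
    """
    queried = {sid for sid, s in spans.items()
               if s["service"] == queried_service}

    # Upstream: ancestors of any span on the queried service.
    upstream = set()
    for sid in queried:
        parent = spans[sid]["parent"]
        while parent is not None:
            if parent not in queried:
                upstream.add(parent)
            parent = spans[parent]["parent"]

    # Downstream: spans that have a queried-service span as an ancestor.
    downstream = set()
    for sid, span in spans.items():
        if sid in queried:
            continue
        parent = span["parent"]
        while parent is not None:
            if parent in queried:
                downstream.add(sid)
                break
            parent = spans[parent]["parent"]

    return {"queried": queried, "upstream": upstream, "downstream": downstream}
```

Spans that are neither ancestors nor descendants of the queried service (for example, a sibling call made by the same parent) fall into none of the three groups and would only show up under a view of all spans.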
More about correlated attributes
Change Intelligence correlates attributes with performance change by finding attributes on operations up and down the request path that appear frequently on traces from the deviation time period. For metrics, it determines the request path by first finding performance changes in Key Operations on the service during the deviation, and then analyzing span data from those operations. If an attribute appears on many traces that have performance issues during the deviation (and not on traces that are stable), it’s likely that something about that attribute is causing the issue.
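The idea of comparing how often an attribute appears in baseline traces versus deviation traces can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Lightstep's algorithm: it assumes a trace is a list of spans, a span is a dict of string attributes, and "correlation" is just the difference in the share of traces carrying the attribute.

```python
from collections import Counter


def attribute_frequency(traces):
    """Fraction of traces in which each (key, value) attribute appears."""
    counts = Counter()
    for trace in traces:
        seen = set()
        for span in trace:          # span: dict of attribute key -> value
            seen.update(span.items())
        counts.update(seen)         # count each attribute once per trace
    total = len(traces)
    return {attr: n / total for attr, n in counts.items()} if total else {}


def correlated_attributes(baseline_traces, deviation_traces):
    """Rows of (attribute, baseline share, deviation share, difference),
    sorted so the attributes whose share changed the most come first."""
    base = attribute_frequency(baseline_traces)
    dev = attribute_frequency(deviation_traces)
    rows = []
    for attr in set(base) | set(dev):
        b = base.get(attr, 0.0)
        d = dev.get(attr, 0.0)
        rows.append((attr, b, d, d - b))
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows
```

An attribute that appears on a quarter of baseline traces but three quarters of deviation traces would rise to the top of this list, which mirrors the baseline and deviation percentages shown in each attribute row.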
When you expand the attribute row, Change Intelligence shows SLI charts that compare the baseline to the deviation for the p99 latency, operation rate and error rate percentage.
By default, the time series only display for spans that contain the attribute. Use the Show comparison toggle to compare that with spans that don’t contain the attribute. When you add those spans, you can also choose to group the data by the attribute name.
Turn off Show span samples to get a better view of the groupings.
Looking at the previous image, you can see that traces that did not have the customer:ProWool attribute didn’t experience the same degradation in latency or operation rate.
Because traces for requests from ProWool show higher latency and a higher operation rate, traces with this attribute are likely to lead you to the problem.
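The with/without comparison can be approximated offline if you have span samples at hand. The Python sketch below splits spans by whether they carry a given attribute and computes a nearest-rank p99 for each group. The span shape ("attributes" and "latency_ms" keys) and the percentile method are assumptions made for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def split_by_attribute(spans, key, value):
    """Return (latencies with the attribute, latencies without it)."""
    with_attr, without_attr = [], []
    for span in spans:
        target = with_attr if span["attributes"].get(key) == value else without_attr
        target.append(span["latency_ms"])
    return with_attr, without_attr


def compare_p99(spans, key, value):
    """p99 latency for spans with and without the attribute (None if empty)."""
    with_attr, without_attr = split_by_attribute(spans, key, value)
    return {
        "with": percentile(with_attr, 99) if with_attr else None,
        "without": percentile(without_attr, 99) if without_attr else None,
    }
```

A large gap between the two values, such as a much higher p99 for spans with customer set to ProWool, is the same signal the comparison charts make visible.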
Turn Show span samples on and click on a span sample dot to view a trace.
Clicking a trace opens it in the Trace view where you can see the critical path through the request and view details about each span in the trace.
More about Trace view
You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.
Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep Observability, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.
Learn more about Trace view.
In this example, it looks like ProWool sent over 1,000 requests and the write to the database is overwhelmed. That’s likely why the CPU metric spiked.
If the attribute in the first row did not lead to the issue, continue selecting other attributes that are correlated with performance changes.
The attributes are listed by magnitude of change, so the lower down the list, the less likely it is that you’ll find the issue.
For metric data, if the issue isn’t found from the first Key Operation, start again at Step 3 for the next Key Operation on the page.
Change the chart’s display
You can change the time period for the chart, change the baseline and deviation time windows, and turn off/on deployment markers.
Change the chart’s time period
Use the time picker to select a different time period. Use the < > controls to move backwards and forwards through time. Change Intelligence re-calculates to restrict analyses to that time period.
Change the baseline and deviation windows
By default, the deviation window is set to the time period where you clicked in the chart, and the baseline is set to the time directly before it.
Change either time period by grabbing the handles on the window and dragging. Or, drag a time window to move it.
As you change the time period(s), Change Intelligence re-calculates to find the biggest changes in performance between the two time periods.
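As a rough illustration of comparing two time windows, the Python sketch below derives a baseline window directly before a deviation window and averages the data points that fall in each. Making the baseline the same width as the deviation is an assumption for this example, as are the function names. Change Intelligence's own defaults and calculations may differ.

```python
from datetime import datetime, timedelta
from statistics import mean


def baseline_before(deviation_start, deviation_end):
    """Assumed default: a baseline of equal width ending where the deviation starts."""
    width = deviation_end - deviation_start
    return deviation_start - width, deviation_start


def window_mean(points, start, end):
    """Mean of values whose timestamp falls in [start, end); None if no points."""
    values = [value for timestamp, value in points if start <= timestamp < end]
    return mean(values) if values else None


# Example: one point per minute, with the metric jumping after ten minutes.
t0 = datetime(2024, 1, 1, 12, 0)
points = [(t0 + timedelta(minutes=i), 10 if i < 10 else 30) for i in range(20)]

deviation = (t0 + timedelta(minutes=10), t0 + timedelta(minutes=20))
baseline = baseline_before(*deviation)

baseline_avg = window_mean(points, *baseline)
deviation_avg = window_mean(points, *deviation)
```

Dragging a window in the UI corresponds to changing the start and end values here and recomputing both averages.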
Turn deployment markers on or off (metric data only)
Lightstep Observability can determine when deploys occurred if your service instrumentation includes an attribute that holds the version of your service. When Lightstep finds that the value for that attribute has changed, it displays a deployment marker on the chart. Deployment markers can help you associate recent deployments with changes in performance.
By default, Change Intelligence displays deployment markers. To turn them off, click Display Settings.
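The principle behind deployment markers, a change in the value of a version attribute over time, can be sketched as follows. The input shape (time-ordered pairs of timestamp and version string) is an assumption for this example and does not reflect how Lightstep stores or scans the data.

```python
def deployment_markers(samples):
    """Find the points where the version attribute's value changes.

    samples: time-ordered (timestamp, version) pairs (hypothetical shape).
    Returns a list of (timestamp, old_version, new_version).
    """
    markers = []
    previous = None
    for timestamp, version in samples:
        if previous is not None and version != previous:
            markers.append((timestamp, previous, version))
        previous = version
    return markers
```

Each returned tuple corresponds to one marker on the chart: the time at which a new version was first observed, along with the versions before and after.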
Share your findings
You can share correlated attribute charts with anyone with a Lightstep Observability account, and they can see and work with Change Intelligence using the exact data that you used.
You can share a specific correlation by clicking the Share button.
Or, click the Share button at the top of the page to share a link to the entire page.
Add a correlation to a notebook
During an investigation, you can add an SLI chart for a correlated attribute to a notebook, where you can run ad hoc queries, take notes, and save your analysis for use in postmortems or runbooks. Notebooks allow you to view metric and trace data from different places in Lightstep Observability together, in one place.
To add a chart to a notebook, click Add to notebook and search to choose an existing notebook or create a new notebook.
When you add to a notebook, a chart is created querying on the attribute, service, and operation from the correlation.