Once you’ve created charts in notebooks, dashboards, or alerts, you can use Change Intelligence to investigate what caused any deviations in those charts. For metric data, Change Intelligence looks at traces that include spans from key operations on the service that emitted the metric. For span data, it searches traces that match the chart’s query. In both cases, Change Intelligence surfaces the attributes that appeared in traces with performance issues occurring at the same time as the spike in the chart.

For metric charts, Change Intelligence looks for the service.name attribute (tag/label) in your metric data to determine the sending service. If you want to change that or add others, you need to register them.

You get the most value out of Change Intelligence when you fully instrument your services for distributed tracing. The more attributes your spans carry, the better Change Intelligence can correlate deviations with spans in your trace data to find what caused the change.

Investigate a deviation from a chart or a dashboard

You can start your investigation directly from a dashboard, or from a chart opened from a dashboard, an alert, or a notebook.

  1. In a dashboard or chart, click Analyze deviation.

    You can also click directly into a chart to start Change Intelligence.

    Change Intelligence works best when it can focus on a single service and its dependencies. If your dashboard or chart includes metrics coming from many services, use filters to choose one service to focus on. For a dashboard, use the global filter. For a chart, edit your query to filter the data to include one service. You need to save your chart before you can start Change Intelligence.

  2. In the side panel, drag the blue window to the deviation in the chart and click Analyze deviation.

    The side panel shows the baseline (purple) and deviation (blue) time windows that Change Intelligence uses to compare performance. You can change these by selecting a different time period or by adjusting the baseline and deviation time windows.

    You can click View full Change Intelligence to open and work with Change Intelligence in the expanded view.

  3. (Metric data only) Expand the Key Operation with the highest magnitude of change (the first one in the list).

    Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric. It displays the operations that experienced a change in performance and lists them on the page, sorted by highest degree of change first (operations without any changes aren’t displayed).

  4. In the Attributes with most change table, review the attributes that appeared in most traces with the performance change (correlated attributes). Each attribute row displays a map icon showing the relationship of the service containing the attribute to the service you queried. Hover over the icon to see a larger view of the map.

    For span data, the Filter dropdown allows you to view correlated attributes based on the following:

    • On queried spans: Shows attributes only from spans returned by the original query
    • On queried service: Shows attributes only on spans from the service in your query
    • From upstream: Shows correlated attributes only on spans upstream of the service you queried on
    • From downstream: Shows attributes only on spans downstream of the service you queried on
    • On all spans: Shows all attributes from all spans in the corresponding traces

    Only the options that have relevant span data are displayed.

    More about correlated attributes

    Change Intelligence correlates attributes with a performance change by finding attributes on operations up and down the request path that appear frequently on traces from the deviation time period. For metrics, it determines the request path by first finding performance changes in Key Operations on the service during the deviation, and then analyzing span data from those operations. If an attribute appears on many traces that have performance issues during the deviation (and doesn’t appear on traces that are stable), it’s likely that something about that attribute is causing the issue.

    When you expand the attribute row, Change Intelligence shows SLI charts that compare the baseline to the deviation for latency (p99 for spans, p95 for metrics), operation rate, and error rate percentage.

    By default, the time series only display for spans that contain the attribute. Use the Show comparison toggle to compare that with spans that don’t contain the attribute. When you add those spans, you can also choose to group the data by the attribute name.

    Turn off Show span samples to get a better view of the groupings.

    Looking at the previous image, you can see that traces that did not have the customer:ProWool attribute didn’t experience the same degradation in latency or operation rate.

    Because traces for requests from ProWool show more latency and a higher operation rate, it is likely that examining traces with this attribute will pinpoint the problem.

  5. If you want to investigate performance on a single attribute, click the Add to notebook icon. Choose a notebook (or create a new one).

    Lightstep copies the query for that attribute to the notebook.

    Alternatively, if you are looking only at attributes from spans in your query, you can choose to add the attribute as a filter or group-by to your query.

  6. Cmd+click or Ctrl+click a span sample dot to view a trace in a new browser tab.

    Opening the Trace view in a separate tab allows you to view multiple traces at once and not lose your place in Change Intelligence.

    Clicking a trace opens it in the Trace view where you can see the critical path through the request and view details about each span in the trace.

    More about Trace view

    You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.

    Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep Observability, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.

    Learn more about Trace view.

    In this example, it looks like ProWool sent over 1,000 requests and the write to the database is overwhelmed. That’s likely why the CPU metric spiked.

  7. If the attribute in the first row did not lead to the issue, continue selecting other attributes that are correlated with performance changes.

    The attributes are listed by magnitude of change, so the lower down the list, the less likely it is that you’ll find the issue.

  8. For metric data, if the issue isn’t found from the first Key Operation, start again at Step 3 for the next Key Operation on the page.
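The idea behind correlated attributes can be sketched in a few lines of Python. This is only an illustration of the principle (an attribute that appears in a larger share of deviation traces than baseline traces ranks higher), not Lightstep's actual algorithm. The data layout (a trace as a list of span attribute dictionaries) and the function names are assumptions made for this example.

```python
from collections import Counter

def attribute_frequencies(traces):
    """Return the fraction of traces in which each (key, value) attribute appears."""
    counts = Counter()
    for trace in traces:
        # Count each attribute at most once per trace, across all of its spans.
        seen = {(key, value) for span in trace for key, value in span.items()}
        counts.update(seen)
    total = len(traces) or 1  # avoid dividing by zero for an empty window
    return {attr: n / total for attr, n in counts.items()}

def rank_correlated_attributes(baseline_traces, deviation_traces):
    """Rank attributes by how much more often they appear in the deviation."""
    base = attribute_frequencies(baseline_traces)
    dev = attribute_frequencies(deviation_traces)
    changes = {
        attr: dev.get(attr, 0.0) - base.get(attr, 0.0)
        for attr in set(base) | set(dev)
    }
    # Largest increase in frequency first.
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)

# Hypothetical traces: each trace is a list of spans, each span a dict of attributes.
baseline = [
    [{"customer": "Acme", "region": "us-east"}],
    [{"customer": "Acme", "region": "us-east"}],
    [{"customer": "Acme", "region": "us-east"}],
    [{"customer": "ProWool", "region": "us-east"}],
]
deviation = [
    [{"customer": "ProWool", "region": "us-east"}],
    [{"customer": "ProWool", "region": "us-east"}],
    [{"customer": "ProWool", "region": "us-east"}],
    [{"customer": "Acme", "region": "us-east"}],
]

ranked = rank_correlated_attributes(baseline, deviation)
```

In this made-up data, `customer: ProWool` appears in 25% of baseline traces and 75% of deviation traces, so it ranks first, while `region: us-east` appears on every trace in both windows and carries no signal.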

Work with the expanded view of Change Intelligence

If you need a closer look at Change Intelligence, you can click View full Change Intelligence at any time to open it in a new page.

The expanded view offers this additional functionality:

  • View the baseline and deviation time periods by expanding the accordions above the chart.
  • View the original query for the chart by clicking View query.

    The query displays in a read-only format. If any global filters have been applied, you can see those by clicking the filter dropdown.

  • For metrics, sparkline charts display the average latency, operation rate, and error rate for the operation, for both the baseline (purple) and deviation (blue), sorted by the magnitude of change (indicated by an icon). For latency, the percentile of latency (p99, p95, p50) with the largest change is displayed. You can expand each section for a closer view.

    You can see how changes in performance of an operation may coincide with the deviation in the metric. You can also see whether those changes are present on other Key Operations.

  • Rows in the attribute table show the service and operation that sent the attribute in its span, along with the percentage of traces where that attribute appears, in both the baseline and deviation.
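To make the "percentile with the largest change" idea from the sparkline description concrete, here is a minimal Python sketch. It assumes raw latency samples in milliseconds for each window and uses a simple nearest-rank percentile. The function names and sample data are hypothetical, and the product may compute percentiles differently.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct is an integer)."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def largest_percentile_change(baseline_ms, deviation_ms, percentiles=(50, 95, 99)):
    """Return the percentile name and change with the largest absolute difference."""
    changes = {}
    for pct in percentiles:
        before = percentile(baseline_ms, pct)
        after = percentile(deviation_ms, pct)
        changes[f"p{pct}"] = after - before
    name = max(changes, key=lambda p: abs(changes[p]))
    return name, changes[name]

# Hypothetical latency samples: the deviation window has a slow tail.
baseline_latency = list(range(1, 101))               # 1..100 ms
deviation_latency = list(range(1, 99)) + [900, 1000]  # 1..98 ms plus two slow requests

result = largest_percentile_change(baseline_latency, deviation_latency)
```

Here the median and p95 are unchanged between the two windows, so only the tail (p99) reflects the degradation, which is why showing the percentile with the largest change can be more informative than always showing one fixed percentile.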

Change the chart’s display

You can change the time period for the chart, change the baseline and deviation time windows, and turn deployment markers on or off.

Change the chart’s time period

Use the time picker to select a different time period. Use the < > controls to move backwards and forwards through time. To help you explore data, the time picker saves your custom time ranges and shows them under Recently used.

When you change the time period, Change Intelligence re-calculates to restrict its analysis to that time period.

Change the baseline and deviation windows

By default, the baseline is set to the time period directly before the point where you clicked in the chart. The deviation is set to the time period where you clicked.

Change either time period by grabbing the handles on the window and dragging. Or, drag a time window to move it.

You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The dashboard or chart redraws to report on just the period you selected (expanded view only).

As you change the time period(s), Change Intelligence re-calculates to find the biggest changes in performance between the two time periods.
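As a rough mental model of the default windows, the following Python sketch places a deviation window around the clicked time and a baseline window directly before it. The window width, the centering of the deviation on the click, and the equal window sizes are all assumptions made for illustration; the text above only states that the baseline directly precedes the clicked period.

```python
from datetime import datetime, timedelta

def default_windows(clicked_at, width=timedelta(minutes=10)):
    """Return (baseline, deviation) as (start, end) tuples.

    Assumption: both windows have the same width, the deviation is centered
    on the clicked time, and the baseline ends where the deviation starts.
    """
    half = width / 2
    deviation = (clicked_at - half, clicked_at + half)
    baseline = (deviation[0] - width, deviation[0])
    return baseline, deviation

# Hypothetical click on the chart at 12:00.
clicked = datetime(2024, 1, 1, 12, 0)
baseline_window, deviation_window = default_windows(clicked)
```

Dragging a handle in the UI corresponds to changing one edge of a tuple, and dragging a whole window corresponds to shifting both edges by the same amount.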

Turn deployment markers on or off (expanded view only)

Lightstep Observability can determine when deploys occurred if your service instrumentation includes an attribute that holds the version of your service. When Lightstep finds that the value for that attribute has changed, it displays a deployment marker on the chart. Deployment markers can help you associate recent deployments with changes in performance.

By default, Change Intelligence displays deployment markers. To turn them off, click Display Settings.
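The detection described above (a marker wherever the version attribute's value changes) can be sketched as follows. The span layout and the attribute key `service.version` are assumptions made for this example. Use whichever attribute your instrumentation actually reports versions in.

```python
def deployment_markers(spans, version_key="service.version"):
    """Return (timestamp, old_version, new_version) for each version change.

    Each span is assumed to be a dict with a "timestamp" and an "attributes"
    dict. Spans without the version attribute are ignored.
    """
    markers = []
    previous = None
    for span in sorted(spans, key=lambda s: s["timestamp"]):
        version = span["attributes"].get(version_key)
        if version is None:
            continue
        if previous is not None and version != previous:
            markers.append((span["timestamp"], previous, version))
        previous = version
    return markers

# Hypothetical spans, deliberately out of order, with one version change.
example_spans = [
    {"timestamp": 30, "attributes": {"service.version": "1.1"}},
    {"timestamp": 10, "attributes": {"service.version": "1.0"}},
    {"timestamp": 20, "attributes": {"service.version": "1.0"}},
    {"timestamp": 40, "attributes": {}},
]

markers = deployment_markers(example_spans)
```

In this data the version changes from `1.0` to `1.1` at timestamp 30, so a single marker is produced there. The span without a version attribute does not create a marker.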

Share your findings (expanded view only)

You can share correlated attribute charts with anyone with a Lightstep Observability account, and they can see and work with Change Intelligence using the exact data that you used.

You can share a specific correlation by clicking the Share button.

Or, click the Share button at the top of the page to share a link to the entire page.

Add a correlation to a notebook

You can add an SLI chart for a correlated attribute to a notebook when, during an investigation, you want to run ad hoc queries, take notes, and save your analysis for use in postmortems or runbooks. Notebooks allow you to view metric and trace data from different places in Lightstep Observability together, in one place.

To add the latency SLI chart to a notebook from the panel view, click the Add to notebook icon. To add another chart, expand the row and click Add to notebook for that chart.

In the expanded view, click Add to notebook for the chart to be added.

Search to choose an existing notebook or create a new notebook.

When you add to a notebook, a chart is created querying on the attribute, service, and operation from the correlation.