Once you’ve set up Lightstep to ingest your metric data, you can use Change Intelligence to investigate what caused any deviations in those metrics. Change Intelligence determines the service that emitted a metric, searches for performance changes on Key Operations from that service at the same time as the deviation, and then uses trace data to determine what caused the change.

By default, Change Intelligence looks for the service.name tag (label) in your metric data to determine the sending service. If you want to change that or add others, you need to register them.
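
As a rough illustration of the lookup described above, the following sketch picks the sending service from a metric's labels by checking a list of registered label keys in order. The function `resolve_service`, its `registered_keys` parameter, and the `app` label are hypothetical names used only for illustration. This is not Lightstep's implementation or API.

```python
from typing import Mapping, Optional, Sequence

# Assumption: only "service.name" is checked unless other keys are registered.
DEFAULT_KEYS = ("service.name",)


def resolve_service(
    labels: Mapping[str, str],
    registered_keys: Sequence[str] = DEFAULT_KEYS,
) -> Optional[str]:
    """Return the value of the first registered label key found on the metric."""
    for key in registered_keys:
        value = labels.get(key)
        if value:
            return value
    return None  # no registered label present, so the sending service is unknown


# A metric that carries the default label.
print(resolve_service({"service.name": "iOS", "host": "a1"}))

# A metric that only carries a custom (hypothetical) label that has been registered.
print(resolve_service({"app": "inventory"}, registered_keys=("service.name", "app")))
```

Without the extra registered key, the second metric would resolve to no service at all, which is why custom labels need to be registered first.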

You get the most value from Change Intelligence when you fully instrument your services for distributed tracing. Adding descriptive attributes to your spans helps Change Intelligence correlate deviations in metrics with spans in your trace data to find what caused the change.

Investigate a Metric Deviation From a Chart or a Dashboard

You can start your investigation directly from a metric dashboard, or from a chart (even a chart on a metric alert).

  1. In a dashboard or chart (open in the editor or the Alert configuration page), click on a deviation in the time series and choose What caused this change?.

    Change Intelligence works best when it can focus on a single service and its dependencies. If your dashboard or chart includes metrics coming from many services, use filters to choose one service to focus on. For a dashboard, use the global filter. For a chart, edit your query to filter the data to include one service. You need to save your chart before you can start Change Intelligence.

    Change Intelligence opens with the chart you started from, showing the baseline (purple) and deviation (blue) time windows that it uses to compare performance. The deviation time window is determined by where you clicked in the chart. The baseline is the time period directly before that. You can change this by selecting a different time period or changing the baseline and deviation time windows.

    You can get a better view of the baseline and deviation time periods by expanding the accordions below the chart.

    By default, deployment markers display on the chart so you can see if a deploy might have occurred right before a deviation. You can turn these off.

    You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The dashboard or chart redraws to report on just the period you selected.

    To see the original query for the chart, click View query.

    The query displays in a read-only format. If any global filters have been applied, you can see those by clicking the filter dropdown.

  2. Compare the baseline and deviation data for the Key Operation with the highest magnitude of change.

    Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric. It displays the operations that experienced a change in performance and lists them on the page, sorted by highest degree of change first (operations without any changes aren’t displayed).

    Lightstep considers Key Operations to be your ingress operations on a service.

    Sparkline charts display the average latency, operation rate, and error rate for the operation, for both the baseline (purple) and deviation (blue), sorted by the magnitude of change (indicated by an icon). For latency, the percentile of latency (p99, p95, p50) with the largest change is displayed. You can expand each section for a closer view.

    You can now see how changes in performance of an operation may coincide with the deviation in the metric. You can also see if those changes are present on other Key Operations.

  3. Expand the first attribute row in the Most likely causes of performance changes area.

    For each Key Operation listed on the page, the most likely causes (the attributes that appeared in most traces with the performance change) are shown, along with the percentage of traces where the attribute appears, in both the baseline and deviation. A link opens additional probable causes.

    More about the most likely cause

    When Change Intelligence finds performance changes in an operation occurring at the same time as the metric deviation, it analyzes traces that include that operation. It finds attributes (on operations up and down the request path) that appear frequently on traces from the performance regression, and correlates each of those attributes to a possible root cause. In other words, if an attribute appears on many traces with performance issues (and doesn’t appear on traces that are stable), it’s likely that something about that attribute is causing the issue.

    When you expand the attribute row, Change Intelligence provides a number of useful tools to help you with your analysis.

    Mini service diagram
    This map shows you the service and operation in the trace sending the attribute. In this example, the /api/update-catalog operation on the iOS service is sending the customer:ProWool attribute.

    SLIs for traces
    The average latency, operation rate, and error rate are shown for the baseline compared to the deviation, for traces both with the suspect attribute and without. In this example, you can see that 40% of traces had the attribute customer:ProWool and they also had an increase in latency and operation rate.

    Below that, you can compare that performance with traces (from both the baseline and deviation) that did not have that attribute. In this example, you can see that traces that did not have the customer:ProWool attribute didn’t experience the same degradation in performance.

    Because traces for requests from ProWool are experiencing more latency and a higher rate, it is likely that traces with this attribute can pinpoint the problem.

  4. Click View sample traces to view traces with both the suspect attribute and performance issues.
    Traces are sorted by latency, and you can choose traces from both the baseline and deviation.

    Clicking a trace opens it in the Trace view where you can see the critical path through the request and view details about each span in the trace.

    More about Trace view

    You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.

    Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.

    Learn more about Trace view.

    In this example, it looks like ProWool sent over 1,000 requests and the write to the database is overwhelmed. That’s likely why the CPU metric spiked.

  5. If the attributes in the Most likely cause of performance changes area did not lead to the issue, click the more causes link to view other attributes that are correlated with performance changes.

    The attributes are listed by magnitude of change, so the lower down the list, the less likely it is that you’ll find the issue.

  6. If the issue isn’t found from the first Key Operation, repeat Steps 2-5 for the next Key Operation on the page.
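
The most-likely-cause analysis described in the steps above can be pictured with a small sketch. It computes the percentage of traces in which each attribute appears, for the baseline and for the deviation, and ranks attributes by how much that share increased. Ranking by the increase in share is an assumption made for illustration. The names `attribute_share` and `rank_attributes` and the `region:us-east` attribute are hypothetical, and Lightstep's actual analysis is not documented here.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# A trace reduced to the set of "key:value" attributes found on its spans.
Trace = Set[str]


def attribute_share(traces: List[Trace]) -> Dict[str, float]:
    """Percentage of traces in which each attribute appears."""
    total = len(traces)
    if total == 0:
        return {}
    counts = Counter(attr for trace in traces for attr in trace)
    return {attr: 100.0 * n / total for attr, n in counts.items()}


def rank_attributes(
    baseline: List[Trace], deviation: List[Trace]
) -> List[Tuple[str, float, float]]:
    """Return (attribute, baseline %, deviation %), largest increase first."""
    base = attribute_share(baseline)
    dev = attribute_share(deviation)
    rows = [(a, base.get(a, 0.0), dev.get(a, 0.0)) for a in set(base) | set(dev)]
    # Sort by increase in share (descending), then by name for a stable order.
    return sorted(rows, key=lambda r: (-(r[2] - r[1]), r[0]))


# Illustrative data only: the suspect attribute appears on 2 of 5 deviation traces.
baseline = [{"region:us-east"}] * 5
deviation = [
    {"region:us-east", "customer:ProWool"},
    {"region:us-east", "customer:ProWool"},
    {"region:us-east"},
    {"region:us-east"},
    {"region:us-east"},
]

for attr, base_pct, dev_pct in rank_attributes(baseline, deviation):
    print(f"{attr}: baseline {base_pct:.0f}%, deviation {dev_pct:.0f}%")
```

An attribute that is equally common in both windows, such as the region in this data, ranks low because its presence does not distinguish the regression from normal traffic.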

Change the Chart’s Display

You can change the time period for the chart, change the baseline and deviation time windows, and turn deployment markers on or off.

Change the Chart’s Time Period

Use the time picker to select a different time period. Use the < > controls to move backwards and forwards through time. Change Intelligence re-calculates to restrict analyses to that time period.

Change the Baseline and Deviation Windows

By default, the baseline is set as the time directly before where you clicked in the chart. The deviation time period is set at the time period where you clicked the chart.

Change either time period by grabbing the handles on the window and dragging. Or, drag a time window to move it.

As you change the time period(s), Change Intelligence re-calculates to find the biggest changes in performance between the two time periods.
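
To make the default relationship between the two windows concrete, here is a minimal sketch that derives a baseline window sitting directly before a deviation window. The `Window` type and `baseline_before` function are hypothetical, the timestamps are made-up example values, and giving the baseline the same length as the deviation is an assumption. The text above only states that the baseline directly precedes the deviation.

```python
from datetime import datetime
from typing import NamedTuple


class Window(NamedTuple):
    start: datetime
    end: datetime


def baseline_before(deviation: Window) -> Window:
    """Baseline window that ends where the deviation starts.

    Assumption: the baseline has the same length as the deviation.
    """
    length = deviation.end - deviation.start
    return Window(deviation.start - length, deviation.start)


# Example values only: a 10-minute deviation window.
deviation = Window(datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 1, 12, 10))
baseline = baseline_before(deviation)
print("baseline: ", baseline.start, "->", baseline.end)
print("deviation:", deviation.start, "->", deviation.end)
```

Dragging a handle in the UI corresponds to changing one of these start or end values, after which the comparison between the two windows is recomputed.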

Turn Deployment Markers On or Off

Lightstep can determine when deploys occurred when you implement an attribute in your service instrumentation to hold versions of your service. When Lightstep finds that the value for that attribute has changed, it displays a deployment marker on the chart. Deployment markers can help you associate recent deployments with changes in performance.

By default, Change Intelligence displays the markers. To turn them off, click Display Settings.
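
The idea behind deployment markers, a marker wherever the version attribute's value changes, can be sketched as follows. The function `deployment_markers`, the integer timestamps, and the version strings are hypothetical examples. This is a simplified picture, not how Lightstep computes markers internally.

```python
from typing import Iterable, List, Optional, Tuple


def deployment_markers(
    observations: Iterable[Tuple[int, str]]
) -> List[Tuple[int, str]]:
    """Return (timestamp, new_version) at each point where the version changes.

    `observations` holds (timestamp, version attribute value) pairs taken
    from a service's spans; they are sorted by timestamp before scanning.
    """
    markers: List[Tuple[int, str]] = []
    previous: Optional[str] = None
    for ts, version in sorted(observations):
        if previous is not None and version != previous:
            markers.append((ts, version))
        previous = version
    return markers


# Example values only: the version attribute changes once.
spans = [(100, "v1.4.0"), (160, "v1.4.0"), (220, "v1.5.0"), (280, "v1.5.0")]
print(deployment_markers(spans))  # [(220, 'v1.5.0')]
```

The first observation never produces a marker because there is no earlier value to compare it against.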

Share Your Findings

You can share the findings with anyone with a Lightstep account, and they can see and work with Change Intelligence using the exact data that you used.

You can share a specific correlation by clicking the Share button.

When users click the link, they are directed to that correlation, highlighted in blue.

Or, click the Share button at the top of the page to share a link to the entire page.