Now that you’ve sent metrics to Lightstep Observability and instrumented your app, you can use Change Intelligence to investigate what caused any deviations in those metrics. Change Intelligence determines the service that emitted a metric, searches for performance changes on Key Operations from that service at the same time as the deviation, and then uses trace data to determine what caused the change.

By default, Change Intelligence looks for an attribute (tag/label) in your metric data to determine the sending service. If your AWS metric data uses a different attribute, you need to register it.

  1. In your dashboard or chart, click on a deviation in the time series and choose What caused this change?.


    Change Intelligence works best when it can focus on a single service and its dependencies. If your dashboard or chart includes metrics coming from many services, use filters to choose one service to focus on. For a dashboard, use the global filter. For a chart, edit your query to filter the data to include one service. You need to save your chart before you can start Change Intelligence.

    Change Intelligence opens with the chart you started from, showing the baseline (purple) and deviation (blue) time windows that it uses to compare performance. The deviation time window is determined by where you clicked in the chart. The baseline is the time period directly before that.
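    Conceptually, the two windows relate to your click like this. The sketch below is an illustration only: the window length and data shapes are assumptions, not Lightstep's actual implementation.

    ```python
    from datetime import datetime, timedelta

    def investigation_windows(clicked_at: datetime,
                              window: timedelta = timedelta(minutes=10)):
        """Return (baseline, deviation) windows around the clicked point.

        The deviation window ends at the point you clicked; the baseline
        is the period of equal length directly before it. The 10-minute
        length is an illustrative assumption.
        """
        deviation = (clicked_at - window, clicked_at)
        baseline = (clicked_at - 2 * window, clicked_at - window)
        return baseline, deviation

    baseline, deviation = investigation_windows(datetime(2023, 5, 1, 12, 0))
    # The baseline ends exactly where the deviation begins.
    print(baseline, deviation)
    ```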

  2. Compare the baseline and deviation data for the Key Operation with the highest magnitude of change.

    Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric. It displays the operations that experienced a change in performance and lists them on the page, sorted by highest degree of change first (operations without any changes aren’t displayed).

Lightstep Observability considers Key Operations to be your ingress operations on a service.

Sparkline charts display the average latency, operation rate, and error rate for the operation, for both the baseline (purple) and deviation (blue), sorted by the magnitude of change (indicated by an icon). For latency, the percentile (p50, p95, or p99) with the largest change is displayed. You can expand each section for a closer view.

You can now see how changes in performance of an operation may coincide with the deviation in the metric. And you can see if those changes are also present on other Key Operations.

  1. Expand the first attribute row in the Most likely causes of performance changes area.

    For each Key Operation listed on the page, the most likely causes (the attributes that appeared in most traces with the performance change) are shown, along with the percentage of traces where the attribute appears, in both the baseline and deviation. A link opens additional probable causes.

    More about the most likely cause

    When Change Intelligence finds performance changes in an operation occurring at the same time as the metric deviation, it analyzes traces that include that operation. It finds attributes (on operations up and down the request path) that appear frequently on traces from the performance regression, correlating each attribute to a possible root cause. In other words, if an attribute appears on a number of traces with performance issues (and doesn’t appear on traces that are stable), it’s likely that something about that attribute is causing the issue.
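    The frequency comparison behind that correlation can be sketched roughly as follows. This is an illustrative approximation, not Lightstep's actual algorithm, and the trace shape (a set of "key:value" attribute strings per trace) is an assumption.

    ```python
    from collections import Counter

    def correlate_attributes(regressed_traces, stable_traces):
        """Rank attributes by how much more often they appear on
        regressed traces than on stable ones."""
        def frequency(traces):
            counts = Counter(a for trace in traces for a in set(trace))
            return {attr: n / len(traces) for attr, n in counts.items()}

        regressed = frequency(regressed_traces)
        stable = frequency(stable_traces)

        # An attribute is suspicious when it is common during the
        # regression but rare during the baseline.
        scores = {attr: freq - stable.get(attr, 0.0)
                  for attr, freq in regressed.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    regressed = [{"customer:ProWool", "region:us-east"},
                 {"customer:ProWool"},
                 {"customer:Acme", "region:us-east"}]
    stable = [{"customer:Acme", "region:us-east"},
              {"customer:Acme"}]
    print(correlate_attributes(regressed, stable)[0])
    # customer:ProWool scores highest: common in the regression,
    # absent from the stable traces
    ```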


    When you expand the attribute row, Change Intelligence provides a number of useful tools to help you with your analysis.

    Mini service diagram
    This map shows you the service and operation in the trace sending the attribute. In this example, the /api/update-catalog operation on the iOS service is sending the customer:ProWool attribute.

    SLIs for traces
    The average latency, operation rate, and error rate are shown for the baseline compared to the deviation, for traces both with the suspect attribute and without. In this example, you can see that 40% of traces had the attribute customer:ProWool and they also had an increase in latency and operation rate.

    Below that, you can compare that performance with traces (from both the baseline and deviation) that did not have that attribute. In this example, you can see that traces that did not have the customer:ProWool attribute didn’t experience the same degradation in performance.

    Because traces for requests from ProWool show higher latency and a higher operation rate, it is likely that traces with this attribute can pinpoint the problem.
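    That with/without comparison amounts to splitting a window's traces on the suspect attribute and comparing the SLIs of the two groups. A minimal sketch, assuming a simplified (latency, attributes) trace shape that is not Lightstep's actual model:

    ```python
    def split_latency(traces, attribute):
        """Compare the average latency of traces with and without an
        attribute, and report what share of traces carry it."""
        with_attr = [lat for lat, attrs in traces if attribute in attrs]
        without = [lat for lat, attrs in traces if attribute not in attrs]
        avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
        return {"with": avg(with_attr),
                "without": avg(without),
                "share_with": len(with_attr) / len(traces)}

    # Hypothetical deviation-window traces: latency in ms plus attributes.
    deviation_traces = [(900, {"customer:ProWool"}),
                        (850, {"customer:ProWool"}),
                        (120, {"customer:Acme"}),
                        (110, {"customer:Acme"}),
                        (130, {"customer:Acme"})]
    print(split_latency(deviation_traces, "customer:ProWool"))
    # 40% of traces carry the attribute, and they are far slower
    # than the traces without it
    ```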

  2. Click View sample traces to view traces with both the suspect attribute and performance issues.
    Traces are sorted by latency, and you can choose traces from both the baseline and deviation.

    Clicking a trace opens it in the Trace view where you can see the critical path through the request and view details about each span in the trace.

    More about Trace view

    You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.
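    As a rough illustration of what the critical path means here (the chain of spans that gates the request's end-to-end latency), the toy sketch below follows, at each level of the span hierarchy, the child that finishes last. The span shapes are assumptions and this is a simplification, not Lightstep's actual analysis.

    ```python
    def critical_path(root, children):
        """Walk a span tree, descending into the child span that
        finishes last at each step. Spans are assumed to be dicts
        with "name", "start", and "end" fields (times in ms)."""
        path = [root["name"]]
        kids = children.get(root["name"], [])
        while kids:
            last = max(kids, key=lambda span: span["end"])
            path.append(last["name"])
            kids = children.get(last["name"], [])
        return path

    # A tiny request: the root calls auth and db in parallel;
    # db finishes last, so it sits on the critical path.
    spans = {
        "root": {"name": "root", "start": 0,  "end": 100},
        "auth": {"name": "auth", "start": 5,  "end": 20},
        "db":   {"name": "db",   "start": 25, "end": 95},
    }
    children = {"root": [spans["auth"], spans["db"]]}
    print(critical_path(spans["root"], children))  # ['root', 'db']
    ```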

    Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep Observability, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.

    Learn more about Trace view.



    In this example, it looks like ProWool sent over 1,000 requests and the write to the database is overwhelmed. That’s likely why the CPU metric spiked!

What did we learn?

  • You can use Change Intelligence directly from a dashboard or an open chart.
  • Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric, and uses traces from those operations to find corresponding performance changes.
  • When Change Intelligence finds those performance changes, it analyzes traces that include that operation, finding attributes that appear on traces with the performance issues (and don’t appear often on traces that are stable). It surfaces the one that’s most likely pointing to the issue.
  • You can view those traces directly from Change Intelligence to pinpoint and verify the problem.