Now that you’ve sent metrics to Lightstep Observability and instrumented your app, you can use Change Intelligence to investigate what caused any deviations in those metrics. Change Intelligence determines the service that emitted a metric, searches for performance changes on Key Operations from that service at the same time as the deviation, and then uses trace data to determine what caused the change.

By default, Change Intelligence looks for the attribute (tag/label) in your metric data to determine the sending service. If your AWS metric data uses a different attribute, you need to register it.

  1. In a dashboard or chart, click Analyze deviation.

    Change Intelligence works best when it can focus on a single service and its dependencies. If your dashboard or chart includes metrics coming from many services, use filters to choose one service to focus on. For a dashboard, use the global filter. For a chart, edit your query to filter the data to include one service. You need to save your chart before you can start Change Intelligence.

  2. In the side panel, drag the blue window to the deviation in the chart and click Analyze deviation.

    The side panel shows the baseline (purple) and deviation (blue) time windows that Change Intelligence uses to compare performance. To adjust the comparison, select a different time period or change the baseline and deviation time windows.

    You can click View full Change Intelligence to open and work with Change Intelligence in the expanded view.

  3. Expand the Key Operation with the highest magnitude of change (the first one in the list).

    Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric. It lists the operations whose performance changed, sorted by degree of change with the highest first. Operations without any changes aren’t displayed.

    Lightstep lists the attributes that appeared in most traces with the performance change (correlated attributes). Each attribute row displays a map icon showing the relationship of the service containing the attribute to the service you queried. Hover over the icon to see a larger view of the map.

    More about correlated attributes

    Change Intelligence correlates attributes with a performance change by finding attributes on operations up and down the request path that appear frequently on traces from the deviation time period. For metrics, it determines the request path by first finding performance changes in Key Operations on the service during the deviation, and then analyzing span data from those operations. If an attribute appears on many traces that have performance issues during the deviation, and rarely on traces that are stable, it’s likely that something about that attribute is causing the issue.

  4. Expand the first attribute row.

    Change Intelligence shows SLI charts that compare the baseline to the deviation for the p99 latency, operation rate, and error rate percentage.

    By default, the time series display only spans that contain the attribute. Use the Show comparison toggle to compare them with spans that don’t contain the attribute. When you add those spans, you can also choose to group the data by the attribute name.

    Turn off Show span samples to get a better view of the groupings.

    (Image: Compare and group other spans)

    Looking at the previous image, you can see that traces that did not have the customer:ProWool attribute didn’t experience the same degradation in latency or operation rate.

    Because traces for requests from ProWool show higher latency and a higher operation rate, traces with this attribute are the most likely place to pinpoint the problem.

  5. Click on a span sample dot to view a trace.

    Clicking a span sample opens its trace in the Trace view, where you can see the critical path through the request and view details about each span in the trace.

    More about Trace view

    You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.

    Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep Observability, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.

    Learn more about Trace view.

    (Image: Trace view)

    In this example, it looks like ProWool sent over 1,000 requests and the database write is overwhelmed. That’s likely why the CPU metric spiked!
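
To make the baseline and deviation comparison from the steps above more concrete, here is a minimal Python sketch. It is not Lightstep's implementation: the `Span` record, the `slis` and `rank_operations` helpers, the operation names, and the use of relative p99 change as the "magnitude of change" are all assumptions made for illustration. It computes the three SLIs (p99 latency, operation rate, error rate percentage) for each window and ranks operations by how much p99 latency changed, dropping operations that did not change.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Span:
    # Hypothetical span record; field names are illustrative only.
    operation: str
    start_s: float      # span start time, in seconds
    duration_ms: float
    is_error: bool


def p99(values):
    """Nearest-rank 99th percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def slis(spans, window):
    """p99 latency, operation rate, and error percentage for one time window."""
    start, end = window
    in_window = [s for s in spans if start <= s.start_s < end]
    count = len(in_window)
    errors = sum(1 for s in in_window if s.is_error)
    return {
        "p99_ms": p99([s.duration_ms for s in in_window]),
        "rate_per_s": count / (end - start),
        "error_pct": 100.0 * errors / count if count else 0.0,
    }


def rank_operations(spans, baseline, deviation):
    """Rank operations by relative p99 change (assumed metric), largest first."""
    by_op = {}
    for s in spans:
        by_op.setdefault(s.operation, []).append(s)
    changes = []
    for op, op_spans in by_op.items():
        base = slis(op_spans, baseline)
        dev = slis(op_spans, deviation)
        if base["p99_ms"] == 0:
            continue  # no baseline data to compare against
        change = (dev["p99_ms"] - base["p99_ms"]) / base["p99_ms"]
        if change != 0:  # unchanged operations are not listed
            changes.append((op, change))
    return sorted(changes, key=lambda item: abs(item[1]), reverse=True)


# Made-up data: "write-db" slows down in the deviation window, "read-cache" does not.
spans = (
    [Span("write-db", t, 20.0, False) for t in range(0, 10)]
    + [Span("write-db", t, 80.0, False) for t in range(10, 20)]
    + [Span("read-cache", t, 5.0, False) for t in range(0, 20)]
)
ranked = rank_operations(spans, baseline=(0, 10), deviation=(10, 20))
```

With this made-up data, only `write-db` is returned, because its p99 latency in the deviation window is higher than in the baseline, while `read-cache` is unchanged and is dropped from the list.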

What did we learn?

  • You can use Change Intelligence directly from a dashboard or an open chart.
  • Based on the time windows for the baseline and deviation, Change Intelligence analyzes Key Operations on the service that emitted the metric, and uses traces from those operations to find corresponding performance changes.
  • When Change Intelligence finds those performance changes, it analyzes traces that include that operation, finding attributes that appear on traces with the performance issues (and don’t appear often on traces that are stable). It surfaces the one that’s most likely pointing to the issue.
  • You can view those traces directly from Change Intelligence to pinpoint and verify the problem.
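
The idea behind correlated attributes can be sketched in a few lines of Python. This is an illustration under assumptions, not the scoring that Change Intelligence actually uses: the `correlated_attributes` function, the representation of a trace as a plain dictionary of attributes, and every attribute value other than `customer:ProWool` are hypothetical. Each attribute is scored by how much more often it appears on traces from the deviation than on stable traces.

```python
from collections import Counter


def correlated_attributes(deviating_traces, stable_traces):
    """Score each (key, value) attribute by how much more often it appears
    on deviating traces than on stable ones; highest score first."""
    def frequencies(traces):
        counts = Counter()
        for attrs in traces:
            counts.update(attrs.items())  # each trace is a dict of attributes
        total = len(traces) or 1
        return {attr: n / total for attr, n in counts.items()}

    dev_freq = frequencies(deviating_traces)
    stable_freq = frequencies(stable_traces)
    scores = {
        attr: freq - stable_freq.get(attr, 0.0)
        for attr, freq in dev_freq.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up traces: ProWool appears on most deviating traces and few stable ones.
deviating = [
    {"customer": "ProWool", "region": "us-east"},
    {"customer": "ProWool", "region": "us-west"},
    {"customer": "ProWool", "region": "us-east"},
    {"customer": "Acme", "region": "us-east"},
]
stable = [
    {"customer": "Acme", "region": "us-east"},
    {"customer": "Acme", "region": "us-west"},
    {"customer": "Globex", "region": "us-east"},
    {"customer": "ProWool", "region": "us-west"},
]
top_attribute, top_score = correlated_attributes(deviating, stable)[0]
```

In this sample, `customer:ProWool` ranks first because it appears on three of the four deviating traces but only one of the four stable traces. A simple frequency difference was chosen to keep the sketch short; a real system would likely also weigh sample sizes and span position in the request path.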