Let’s use the tools to see where the latency is and try to find what’s causing it. We’ll start with the histogram that compares the regression latency with the baseline.

Looking at the histogram, the regression data is shown in the yellow bars and the baseline data is shown in the blue line. There are a lot more yellow bars on the higher-latency side of the histogram, which confirms that the regression data has higher latency.

Learn more about histograms

Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Satellites that match the query. The bottom of the histogram shows latency time, and each blue line represents the number of spans that fall into that latency time.
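
To make the bucketing idea concrete, here is a minimal Python sketch. It is not Lightstep's implementation: the function names, the 100 ms bucket width, and the latency values are all made up for illustration. It groups span latencies into buckets and compares how much of the baseline and the regression data lands at the high end.

```python
from collections import Counter


def latency_histogram(latencies_ms, bucket_width_ms=100):
    """Count spans per latency bucket; each key is the bucket's lower bound in ms."""
    counts = Counter(
        (int(ms) // bucket_width_ms) * bucket_width_ms for ms in latencies_ms
    )
    return dict(sorted(counts.items()))


def share_at_or_above(latencies_ms, threshold_ms):
    """Fraction of spans whose latency is at or above the threshold."""
    return sum(1 for ms in latencies_ms if ms >= threshold_ms) / len(latencies_ms)


# Hypothetical span latencies (ms); real data would come from your traces.
baseline_ms = [40, 60, 90, 120, 150, 210, 95, 80]
regression_ms = [50, 130, 480, 520, 610, 90, 450, 700]

baseline_hist = latency_histogram(baseline_ms)
regression_hist = latency_histogram(regression_ms)

print("baseline:  ", baseline_hist)
print("regression:", regression_hist)
print("share >= 400 ms:",
      share_at_or_above(baseline_ms, 400),
      share_at_or_above(regression_ms, 400))
```

With this sample data, most baseline spans fall in the lowest bucket, while more than half of the regression spans land at 400 ms or above, which is the same shift the yellow bars show in the UI.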

Note that on this page, the regression data is shown in yellow bars. You can learn more about the histogram here

Now let’s see if we can start to figure out why the latency is higher.

Lightstep can identify attribute values that are correlated with the regression.

Learn more about Correlations

A standard method of identifying the root cause of a performance regression is to manually comb through traces and search for common system attributes associated with that regression. With Correlations, Lightstep helps you find attributes correlated with latency automatically.

Lightstep Correlations search over all the span data in all traces that the current query participates in and finds patterns in that data for latency contribution, looking at service/operation pairs and attributes. This analysis allows you to identify services, operations, and attributes that are meaningfully associated with observed latency. By surfacing statistical relationships between these system attributes, you can quickly generate and validate hypotheses during a root cause investigation.
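
The source does not describe the statistic Lightstep uses, so the following Python sketch substitutes a simple stand-in: the Pearson correlation between "this span carries attribute key:value" (1 or 0) and the span's latency. The span records, the `acme` customer, and the `us-east-1` region are hypothetical values added for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def attribute_correlations(spans):
    """Score every key:value pair by its correlation with latency, highest first."""
    latencies = [s["latency_ms"] for s in spans]
    pairs = {(k, v) for s in spans for k, v in s["attributes"].items()}
    scores = {}
    for key, value in pairs:
        # 1.0 where the span has this exact attribute value, else 0.0
        has_attr = [1.0 if s["attributes"].get(key) == value else 0.0 for s in spans]
        scores[f"{key}:{value}"] = pearson(has_attr, latencies)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical spans; a real analysis would run over far more data.
spans = [
    {"latency_ms": 80, "attributes": {"region": "us-east-1", "customer": "acme"}},
    {"latency_ms": 95, "attributes": {"region": "us-east-1", "customer": "trendible"}},
    {"latency_ms": 520, "attributes": {"region": "us-west-1", "customer": "trendible"}},
    {"latency_ms": 610, "attributes": {"region": "us-west-1", "customer": "acme"}},
    {"latency_ms": 700, "attributes": {"region": "us-west-1", "customer": "trendible"}},
]

scores = attribute_correlations(spans)
for name, score in scores.items():
    print(f"{name:22s} {score:+.2f}")
```

In this toy data set, `region:us-west-1` scores highest because every slow span carries it. A score like this only suggests a hypothesis to validate; it does not prove that the attribute causes the latency.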

Note that on this page, the histogram displays yellow bars instead of blue. You can learn more about the histogram here

In the Compare Attributes table, the customer:trendible and region:us-west-1 attributes both have a high correlation with latency.

  1. Hover over both of those attributes in the table. Lightstep shades the histogram to show you where the spans that contain those attributes land.
    It looks like the region attribute is correlated with most of the spans at the higher end of the latency range, so perhaps something about the us-west-1 region is causing the latency.

    Let’s use the Operations map to see if there’s any one operation that’s particularly latent.

  2. Scroll down to the Operations map. Looking at the downstream operations for this service, you can see that the /api/update-inventory operation has high latency. The regression has caused it to increase by over 3 seconds. But is that operation causing the latency? Or is a downstream operation outside of the iOS service causing the problem?

  3. Click the dropdown and select All Downstream Operations. Now we can see that it’s actually the update-inventory operation on the inventory service that’s causing the latency.
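
The question in steps 2 and 3 (is an operation slow itself, or is it waiting on something downstream?) can be illustrated with a self-time calculation. The Python sketch below is an assumption-laden illustration, not how Lightstep builds the Operations map: the trace, the span ids, the durations, and the `write-db` operation are hypothetical, and it assumes child spans run one after another without overlapping.

```python
def self_times(spans):
    """Map span id -> duration minus the time spent in its direct children."""
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {s["id"]: s["duration_ms"] - child_total.get(s["id"], 0) for s in spans}


# Hypothetical trace: the iOS operation calls the inventory service,
# which in turn makes a short database write.
trace = [
    {"id": "a", "parent_id": None, "service": "iOS",
     "operation": "/api/update-inventory", "duration_ms": 3500},
    {"id": "b", "parent_id": "a", "service": "inventory",
     "operation": "update-inventory", "duration_ms": 3300},
    {"id": "c", "parent_id": "b", "service": "inventory",
     "operation": "write-db", "duration_ms": 100},
]

own = self_times(trace)
slowest = max(trace, key=lambda s: own[s["id"]])
print(slowest["service"], slowest["operation"], own[slowest["id"]])
```

Here the iOS operation takes 3500 ms in total but only 200 ms of that is its own work. Most of the time is spent inside update-inventory on the inventory service, which matches what the All Downstream Operations view reveals.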

Now that we’ve narrowed the issue down to the update-inventory operation on the inventory service, let’s take a look at a full trace that includes that operation.

What Did We Learn?

  • Histograms show buckets of spans organized by latency. By comparing the baseline to the regression, it’s easy to confirm that latency has increased.
  • Correlations shows you spans with specific attributes that may be contributing to latency.
  • The Operations map shows latency per operation, and you can view all downstream operations to see where that latency starts.