Code and configuration changes are some of the most common causes of performance regressions. While viewing a dashboard after a deployment may show that a regression occurred, it won't explain the root cause. You can use LightStep's Service Health for Deployments feature not only to monitor your services after a deploy, but also to compare performance over specific time periods and then dig into details to find the differences that caused the issue.
When you access the Service Directory, the Deployments tab shows you the latency, operation rate, and error rate of your ingress operations on a service (maximum number of operations shown is 50). The operations are initially sorted by request rate, in descending order.
When you notice a regression in latency, you can click on that chart and immediately begin to investigate the issue.
When you notice a spike in latency in an operation from the Service Directory, you can use the Comparison view to compare two time ranges and then dig down to find the root cause for the difference.
To compare two time ranges:
- From the Service Directory, select a service.
Charts showing latency, operation rate, and error rate are shown for the ingress operations for that service (up to 50).
If you've implemented a tag to denote new versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, for the update-inventory operation, deploys occurred at 5:00 am and 7:00 am.
Hover over the marker to view details.
These markers allow you to quickly correlate deployment with a possible regression.
- When you find a latency spike to investigate, click in the chart to create a range of time to collect data from. Choose a spot in the middle of a regression to avoid collecting data from before the spike occurred. A shaded yellow bar shows the time range from which span data will be used for analysis of the regression. Click Investigate Regression.
You're taken to the Comparison view.
- In the Latency chart, select a point in time to set as the baseline. This should be a point where latency is stable.
If the chart isn't displaying a good baseline range, use the dropdown to select a different time period to display.
To help you find a good baseline, as you move your cursor over the timeline, the values for p50, p95, and p99, as well as rate and error percentages, are shown at the top right of the charts. The shaded blue area shows the timeframe from which span data will be used for comparison of the baseline to the regression.
To change either the Regression or Baseline time, click and move the shaded box.
- Click Compare.
A number of tools display comparisons of trace data from the two time periods.
Read on for instructions for using these tools.
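As background for the deployment markers mentioned in the steps above, here is a minimal sketch of how a version tag on spans could be turned into markers. It is not LightStep's implementation. The `Span` type, its `version` field, and the `deployment_markers` function are hypothetical names chosen for illustration, and the sketch assumes a marker is placed at the first span that reports a version not seen before.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: float   # span start time, in seconds since the epoch
    version: str   # value of a (hypothetical) version tag on the span


def deployment_markers(spans):
    """Return (time, version) for the first sighting of each new version."""
    markers = []
    seen = set()
    for span in sorted(spans, key=lambda s: s.start):
        if span.version not in seen:
            if seen:  # the very first version observed is not a deploy
                markers.append((span.start, span.version))
            seen.add(span.version)
    return markers


spans = [
    Span(100, "v1"), Span(200, "v1"), Span(300, "v2"),
    Span(250, "v1"), Span(400, "v2"), Span(500, "v3"),
]
print(deployment_markers(spans))
```

Tracking the set of versions already seen keeps a rolling deploy, where old and new versions report spans side by side for a while, from producing duplicate markers.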
The Compare Latency Histogram is similar to the histogram shown on the Explorer view. It automatically queries historical span data from the baseline and regression time periods for the operation on the service.
The Latency Histogram shows the number of spans and the duration for both the regression and baseline time ranges. This helps identify the scope of the performance impact: how much has latency changed and for how many requests.
Spans are grouped into latency buckets: the height indicates the number of spans in that latency bucket while the x-axis indicates the latency range for the bucket. A blue line overlay shows the shape of the histogram for the baseline, while the yellow bars show the spans from the regression time range.
By default, markers show where the 95th percentile lies. Use the dropdown to see other percentile markers.
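To make the bucket and percentile ideas concrete, the sketch below groups span durations into latency buckets and finds a percentile marker for each time range. It is an illustration only. Fixed-width buckets and the nearest-rank percentile method are assumptions, and the function names and sample durations are invented for the example.

```python
import math


def bucket_counts(durations_ms, bucket_width_ms):
    """Count spans per fixed-width latency bucket, keyed by bucket start."""
    counts = {}
    for d in durations_ms:
        start = int(d // bucket_width_ms) * bucket_width_ms
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


def percentile(durations_ms, pct):
    """Nearest-rank percentile: smallest value with pct% of spans at or below it."""
    ordered = sorted(durations_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Invented sample durations (ms) for the two time ranges.
baseline = [80, 95, 110, 120, 130, 140, 150, 160, 180, 210]
regression = [90, 120, 150, 400, 450, 900, 950, 1200, 3100, 4000]

print(bucket_counts(baseline, 100), percentile(baseline, 95))
print(bucket_counts(regression, 1000), percentile(regression, 95))
```

Comparing the two bucket maps shows how many requests moved into slower buckets, while the percentile values show how far the tail shifted.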
The Compare Tags table is similar to Correlations shown on the Explorer view. For the ingress operation being investigated, the table displays tags that are associated with latency in the regression time range for all spans in the full trace.
For each of these tags, the table indicates the average latency of the spans containing the tag in the regression time range (in yellow), the average latency of the span containing the same tag in the baseline time range (in blue), and the difference in duration.
By default, the table is sorted by duration change, from highest to lowest. You can change the sort by clicking on any column header.
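The per-tag comparison described above can be sketched as follows. This is a simplified illustration, not the product's correlation algorithm. It assumes each span is a dictionary with `duration_ms` and `tags` keys (hypothetical field names) and only computes the average latency per tag in each time range, the difference, and the default sort.

```python
from collections import defaultdict
from statistics import mean


def avg_latency_by_tag(spans):
    """Map each "key:value" tag to the mean duration (ms) of spans carrying it."""
    by_tag = defaultdict(list)
    for span in spans:
        for key, value in span["tags"].items():
            by_tag[f"{key}:{value}"].append(span["duration_ms"])
    return {tag: mean(ds) for tag, ds in by_tag.items()}


def compare_tags(regression_spans, baseline_spans):
    """Rows of (tag, regression_avg, baseline_avg, change), largest change first."""
    reg = avg_latency_by_tag(regression_spans)
    base = avg_latency_by_tag(baseline_spans)
    rows = [(tag, reg[tag], base[tag], reg[tag] - base[tag])
            for tag in reg if tag in base]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Invented sample spans for the two time ranges.
baseline_spans = [
    {"duration_ms": 100, "tags": {"region": "us-west-1"}},
    {"duration_ms": 120, "tags": {"region": "us-east-1"}},
    {"duration_ms": 140, "tags": {"region": "us-west-1"}},
]
regression_spans = [
    {"duration_ms": 900, "tags": {"region": "us-west-1"}},
    {"duration_ms": 130, "tags": {"region": "us-east-1"}},
    {"duration_ms": 1100, "tags": {"region": "us-west-1"}},
]

for row in compare_tags(regression_spans, baseline_spans):
    print(row)
```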
Hover over a correlation to see the tag's latency overlaid on the histogram. For example, in this histogram, the tag region:us-west-1 appears more frequently on spans with higher latency and may be associated with the root cause of the latency regression.
Click on a row in the Compare Tags table to view example traces that contain the correlated tag.
Click one of the traces to view the full trace in the Trace view.
The Operation Diagram and Compare Operations table show the operations that contribute to latency across the request path of all the traces from the regression time range. By default, the diagram shows all downstream intra-service operations from the original ingress operation and service. Yellow halos denote average latency contribution. The ingress operation you're currently investigating has a blue animated halo.
If you want to expand the investigation to include operations from other services, select "All Downstream Operations" on the top right of the Operation Diagram. Depending on trace depth and instrumentation, the complete Operation Diagram may be quite large.
The diagram edges represent relationships between spans and depend on the quality of your instrumentation. Missing spans may cause edges to appear between operations that have no parent-child relationship.
The following can cause missing spans:
- The Satellite dropped the span (for example, your Satellite pool is not auto-scaling to keep up with traffic)
- The tracer dropped the span (for example, your application crashed or never called
- There is an issue with your instrumentation (context was dropped, or the service with a parent span is not instrumented).
When you see missing spans, check the Reporting Status page to find the culprit.
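If you export trace data, one way to spot a missing span is to look for spans whose parent is absent from the trace. The sketch below assumes each span is a dictionary with `span_id` and `parent_id` keys. These field names are hypothetical and not a LightStep data format.

```python
def find_orphans(spans):
    """Return spans whose parent_id names a span that is absent from the trace."""
    known_ids = {span["span_id"] for span in spans}
    return [
        span for span in spans
        if span["parent_id"] is not None and span["parent_id"] not in known_ids
    ]


# Invented trace: span "c" was dropped, so "d" points at a parent that is missing.
trace = [
    {"span_id": "a", "parent_id": None},
    {"span_id": "b", "parent_id": "a"},
    {"span_id": "d", "parent_id": "c"},
]
print(find_orphans(trace))
```

An orphan tells you a span went missing above it, but not which of the causes listed above is responsible.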
The table is sorted by the change in critical path latency between the baseline and regression, from largest to smallest. You can change the sort order by clicking on another column.
For example, in this table the update-inventory operation on the inventory service contributed an average of 873ms in latency in traces from the baseline time range, but contributed 4s in latency in traces from the regression time range. Identifying the largest increase in latency can help pinpoint the root cause of the latency increase in the regression.
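The ranking by latency change can be approximated with the sketch below. It rests on several assumptions: each time range is represented by a single trace for brevity (the table itself reports averages over many traces), child spans run one after another without overlapping, and a span's self time (its duration minus its direct children's durations) stands in for its critical path contribution. The field names and the durations of the parent spans are invented. Only the 873ms and 4s figures come from the example above.

```python
from collections import defaultdict


def self_time_by_operation(spans):
    """Sum each operation's self time: duration minus direct children's durations."""
    child_total = defaultdict(float)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["duration_ms"]
    totals = defaultdict(float)
    for span in spans:
        totals[span["operation"]] += span["duration_ms"] - child_total[span["span_id"]]
    return dict(totals)


def compare_operations(regression_trace, baseline_trace):
    """Rows of (operation, baseline_ms, regression_ms, change), largest change first."""
    reg = self_time_by_operation(regression_trace)
    base = self_time_by_operation(baseline_trace)
    rows = [(op, base.get(op, 0.0), reg.get(op, 0.0),
             reg.get(op, 0.0) - base.get(op, 0.0))
            for op in set(reg) | set(base)]
    return sorted(rows, key=lambda row: row[3], reverse=True)


baseline_trace = [
    {"span_id": "1", "parent_id": None, "operation": "/api/update-inventory", "duration_ms": 900},
    {"span_id": "2", "parent_id": "1", "operation": "update-inventory", "duration_ms": 873},
]
regression_trace = [
    {"span_id": "1", "parent_id": None, "operation": "/api/update-inventory", "duration_ms": 4030},
    {"span_id": "2", "parent_id": "1", "operation": "update-inventory", "duration_ms": 4000},
]

for row in compare_operations(regression_trace, baseline_trace):
    print(row)
```

With these numbers, the ingress operation's total duration grows by more than 3 seconds, yet almost all of the change belongs to the downstream update-inventory operation. This is why the table ranks operations by their own contribution rather than by total duration.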
Click on a row in the Compare Operations table to view example traces that contain this operation. Clicking on a row in the table will also center the Operation Diagram on that specific operation, to help narrow your investigation on only the relevant operations.
Click one of the traces to view the full trace in the Trace view.
Similar to the Trace Analysis table on the Explorer view, this Trace Analysis table shows information from the spans in both the regression and baseline time ranges. By default, the table shows the service, the operation reporting the span, the span's duration, and the start time from the regression time range. You can add other columns as needed by clicking on the + icon.
You can switch between analyzing traces from the baseline or regression time periods by selecting the time range of interest. After you perform an analysis, for example a group-by, you can switch time ranges and see that same group-by aggregation on the other time range's traces.
The Comparison view is useful for performing root cause analysis, as it lets you compare your aggregate trace data from the time of a regression with aggregate trace data from a time when the service was performing normally. This makes it easy to find what has changed and where.
Let's say you're an on-call engineer and get reports that the iOS app is slow.
Here's how you can use the Comparison view to find the issue.
- From the Service Directory page, select the
- Scroll down to view the iOS operations and see that the /api/update-inventory operation has a few latency spikes.
You see that there is actually one big spike in the p99 latency. Let's look at this more closely.
- Click on that chart in the middle of that spike to set the Regression time period and click Investigate Regression.
- Click in a stable point of the latency chart to set the Baseline time period.
- Click Compare.
The page redraws with additional analysis tools.
You can see the rise in latency in the histogram by comparing the regression data in the yellow bars against the baseline blue line. In the Compare Tags table, note that the region:us-west-1 tags both have a high correlation with latency, but the region tag is correlated with most of the higher end of latency.
- Looking at the downstream operations for this service, you can see that the /api/update-inventory operation has a high latency. The regression has caused it to increase by over 3 seconds.
Let's see if that is actually coming from a downstream operation outside of the
- Click on the dropdown to select All Downstream Operations. Now we can see that it's actually the update-inventory operation that's causing the latency.
- Click on the operation to view sample traces, then click on a trace to open it in the Trace view.
Note that the update-inventory operation is indeed an issue, as it's taking up most of the critical path's time and is contributing 99.7% of the latency. Also note that the tag customer has a value of trendible and the tag region has a value of us-west-1, both of which were discovered by Correlations, so we know we've likely found the problem.
Let's expand the log section to see if there's any good information there.
- In the sidebar, expand Logs.
Seems there's an issue with a network connection. But what caused it? Let's investigate the
- Click back on the Service Directory and choose the
Sure enough, there's a deployment marker right around the time we saw the spike in the iOS service. Looks like that deploy likely caused the network issues leading to latency!