Now that the app is running and you’ve registered a deployment marker, you can view the telemetry data it generates in Lightstep.

  1. In Lightstep Observability, click on Service Directory from the navigation bar.

  2. From the Service Directory page, select your service. If you’re using the sample app, select my-example-helm-app.

  3. Look at the service’s root operation and see that it has a latency spike after the first deployment marker. Let’s look at this more closely.

  4. Click the chart in the middle of the spike to set the Regression time period to 1 hour prior. You’re taken to the RCA (root cause analysis) view, where you can use analysis tools to determine the cause of the latency.
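At its core, the regression view compares span latencies in a window after the deploy against a baseline window before it. Here is a rough sketch of that comparison; the span data, window sizes, and the p95 statistic are illustrative assumptions, not Lightstep internals:

```python
from statistics import quantiles

# Hypothetical span records: (seconds since start, latency in ms).
# Latency jumps from 120 ms to 200 ms at the deploy.
spans = [(t, 120 + (80 if t >= 3600 else 0)) for t in range(0, 7200, 10)]

def p95(latencies):
    """95th-percentile latency of a list of values."""
    return quantiles(latencies, n=100)[94]

deploy_time = 3600  # when the deployment marker was registered
window = 3600       # compare 1-hour windows on either side

baseline = [l for t, l in spans if deploy_time - window <= t < deploy_time]
regression = [l for t, l in spans if deploy_time <= t < deploy_time + window]

print(f"baseline p95: {p95(baseline):.0f} ms")    # 120 ms
print(f"regression p95: {p95(regression):.0f} ms")  # 200 ms
```

A clear gap between the two windows, starting right at the deployment marker, is what makes the deploy the prime suspect.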

Lightstep Observability offers a number of different tools you can use to start creating hypotheses about what is causing the regression. Let’s use the tools to see where the latency is and try to find what’s causing it. We’ll start with the histogram that compares the regression latency with the baseline.

In the histogram, the regression data is shown in the yellow bars and the baseline data in the blue line. There are many more yellow bars on the higher-latency side of the histogram, which confirms that latency is higher in the regression data.

Learn more about histograms

Once you run a query, the Latency Histogram is generated from 100% of the span data collected by the Microsatellites that matches the query. The bottom of the histogram shows latency, and each blue line represents the number of spans that fall into that latency bucket.

You can learn more about the histogram here

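The bucketing behind such a histogram can be sketched in a few lines. The latencies and the fixed 50 ms bucket width below are illustrative assumptions; Lightstep’s actual bucket boundaries may differ:

```python
from collections import Counter

# Hypothetical latencies (ms) for baseline and regression spans.
baseline = [40, 55, 60, 62, 70, 75, 80, 90]
regression = [60, 75, 90, 140, 180, 210, 230, 400]

def histogram(latencies, bucket_ms=50):
    """Count spans per fixed-width latency bucket, keyed by bucket start."""
    return Counter((l // bucket_ms) * bucket_ms for l in latencies)

# The regression spans spread into higher-latency buckets.
print(sorted(histogram(baseline).items()))
print(sorted(histogram(regression).items()))
```

Plotting the two distributions on the same axis is exactly the baseline-versus-regression comparison the histogram view shows.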

Now let’s see if we can start to figure out why the latency is higher.

  1. In the Compare Operations table, notice that the root operation has a high correlation with latency.

    Now that we’ve narrowed the issue down to the root operation and we know it started right after the deploy, let’s take a look at a full trace that includes that operation.

  2. Click on any span in the Trace Analysis table at the bottom to open the trace that contains that span.

  3. Sure enough, the trace shows the root span taking up most of the critical path. Because the sample app contains only one operation, there are no other spans to compare it to, but in a typical trace the problem span will stand out.

    If you’re using the sample app, notice that there’s a link to visit the Codefresh build for this service in the Workflow section of the Details panel. You can create these links for almost anything that has a resolvable URL. In this case, you’d be able to jump directly into Codefresh to roll back the deploy.
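Lightstep’s actual correlation statistic isn’t spelled out here, but the idea behind step 1 — finding attributes whose presence coincides with higher latency — can be sketched as a simple latency “lift” per attribute value. The spans and attribute names below are made up for illustration:

```python
from statistics import mean

# Hypothetical spans: each has attributes and a latency in ms.
spans = [
    {"attrs": {"operation": "root", "region": "us-east"}, "latency_ms": 240},
    {"attrs": {"operation": "root", "region": "us-west"}, "latency_ms": 260},
    {"attrs": {"operation": "child", "region": "us-east"}, "latency_ms": 30},
    {"attrs": {"operation": "child", "region": "us-west"}, "latency_ms": 35},
]

def attribute_lift(spans, key, value):
    """Mean latency of spans with attrs[key] == value, minus the mean
    latency of spans without it. A large lift suggests a correlation."""
    with_attr = [s["latency_ms"] for s in spans if s["attrs"].get(key) == value]
    without = [s["latency_ms"] for s in spans if s["attrs"].get(key) != value]
    return mean(with_attr) - mean(without)

print(attribute_lift(spans, "operation", "root"))   # large positive lift
print(attribute_lift(spans, "region", "us-east"))   # small by comparison
```

Here `operation=root` stands out while `region` barely matters, which is the kind of signal the Compare Operations table surfaces for you.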

Learn more about Workflow Links

When investigating an incident or debugging a performance problem, rarely do you use just one tool or need information from just one source. You may need to jump between different tools, synthesize information from multiple sources, and notify other people quickly. Lightstep Observability lets you add flexible Workflow Links on the Trace View page that link to other resources, allowing access to all the info you need when you need it.

For example, say you want to jump directly to logs from the time when an issue occurred or to quickly view your playbook’s instructions when the span includes a certain error code. Lightstep Observability can construct these customized links automatically, using attributes and other metadata from a span in the trace.

You can create Workflow Links to anything, but the real power comes from using parameters directly from the span to substitute attribute values, timestamps, and other metadata.
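As a sketch of that substitution, a Workflow Link is essentially a URL template filled in with values from the span. The template syntax, attribute names, and URL below are illustrative assumptions, not Lightstep’s actual placeholder format:

```python
# Hypothetical link template with placeholders for span metadata.
template = "https://logs.example.com/search?service={service}&from={start_ts}&to={end_ts}"

# Values that would come from the selected span in the trace.
span = {"service": "my-example-helm-app", "start_ts": 1650000000, "end_ts": 1650000120}

# Substitute the span's attributes and timestamps into the template.
url = template.format(**span)
print(url)
```

Because the values come from the span itself, the same template yields a link scoped to exactly the service and time range you’re investigating.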

You can learn more about Workflow Links here


So now you can be confident that the latency is caused by the deploy.

Time to roll back the deploy!

What did we learn?

  • Histograms show buckets of spans organized by latency. By comparing the baseline to the regression, it’s easy to confirm that latency has increased.
  • Correlations show you spans with specific attributes that may be contributing to latency.
  • The Operations diagram shows latency per operation.
  • The Trace view contains tons of information about each span in the trace, making it easy to verify your hypothesis.
  • You can create Workflow Links to any external URL that may be needed when resolving regressions.