Now that you’ve instrumented your app, or reviewed the instrumentation in the Hipster Shop, it’s time to look at the data being sent into Lightstep.

  1. Run your app to generate some spans. If you’re using the Hipster Shop, order a few items.

  2. Log into Lightstep. You’re brought to the Service Directory where you see your services listed along the left. Lightstep shows latency, error rate and ops rates for the Key Operations in the currently selected service. It can take a few minutes for data to be analyzed and displayed in the Service Directory, so let’s move on to the part of Lightstep that uses real-time data.

  3. In the navigation bar, click on the Explorer icon. This page gives you a view into the span data from the recall window (working memory) of the Satellites. The histogram at the top shows you current latency based on 100% of the span data sent by the application. Below the histogram, the Trace Analysis table shows exemplar spans from the histogram that represent all ranges of performance, including those that are otherwise statistically interesting.

More about Explorer

When you hear that a particular service is slow, or there’s a spike in errors, or a particular customer is having an issue, your first action is to see real-time data regarding the issue. Lightstep’s Explorer view allows you to query all span data currently in the Satellites to see what’s going on.

Explorer consists of five main components. Each of these components provides a different view into the data returned by your query:

  • Query: Allows you to query all span data currently in the Satellites’ recall window. Every query you make is saved as a Snapshot, meaning you can revisit the query at any time in the future and see the data just as it was when you ran it. Results from the query are shown in the Latency Histogram, the Trace Analysis table, and the Service diagram.

  • Latency histogram: Shows the distribution of spans across latency ranges. Spans are grouped into latency buckets, each represented by a blue line. A longer blue line means there are more spans in that latency bucket. Lines towards the left have lower latency and lines towards the right have higher latency.

  • Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span. Click a span to view it in the context of the trace.

  • Correlations panel: For latency, shows services, operations, and tags that are correlated with the latency you are seeing in your results. That is, they appear more often in spans that are experiencing higher latency. For errors, shows tags that are more often on spans that have errors. The panel can show both positive and negative correlations, allowing you to drill down into attributes that may be contributing to latency, as well as ruling out ones that are not. Find out more about Correlations here.

  • Service diagram: Depicts the service based on your query in relation to all its dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown. Learn more about the Service diagram here.
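To make the latency buckets concrete, here is a minimal Python sketch of how span durations could be grouped into a histogram. The bucket boundaries, the sample durations, and the function names are assumptions made for illustration only; they are not Lightstep's actual bucketing scheme or API.

```python
import bisect

# Hypothetical upper bounds (in milliseconds) for each latency bucket.
BUCKET_EDGES_MS = [10, 50, 100, 500, 1000, 5000]


def latency_histogram(durations_ms, edges=BUCKET_EDGES_MS):
    """Count spans per latency bucket; the final bucket is open-ended."""
    counts = [0] * (len(edges) + 1)
    for duration in durations_ms:
        # bisect_left finds the first bucket whose upper bound is >= duration.
        counts[bisect.bisect_left(edges, duration)] += 1
    return counts


def render(counts, edges=BUCKET_EDGES_MS):
    """Return one text line per bucket; a longer bar means more spans."""
    labels = [f"<= {e} ms" for e in edges] + [f"> {edges[-1]} ms"]
    return [f"{label:>10} | {'#' * n}" for label, n in zip(labels, counts)]


if __name__ == "__main__":
    # Made-up span durations, in milliseconds.
    sample = [4, 12, 48, 95, 120, 480, 1200, 3100, 3400]
    for line in render(latency_histogram(sample)):
        print(line)
```

Buckets further down the printed list hold slower spans, which corresponds to lines further to the right in the histogram.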

Using these tools, it’s possible to quickly form and validate hypotheses around the current issue.

Learn more about Explorer.


Because we haven’t entered a query, the histogram is showing all spans from the current recall window. The x-axis shows the duration of spans and the blue lines represent the number of spans that took that amount of time. If you’re using the Hipster Shop, you can see that there are a number of spans that took over 1 second to complete. Let’s take a look at those.

  1. Click and drag across the histogram to select the spans over 1s in length. The Trace Analysis table redraws to show only spans in the selection.

  2. In the table, select Group by and choose service. This allows you to see the average time per service for those spans. Now you can see that the average latency for the cartservice service is over 3 seconds. It would probably be a good idea to see whether that service can be tuned to perform better. Now that we’ve given the Hypothesis Engine time to analyze data, let’s look at the health of the services in the Service Directory.

  3. In the navigation bar, click the Service Directory icon. Select a service whose operations are showing spikes in latency. For general investigation, it’s always good to start as close to the user as possible. For example, the Frontend service for the Hipster Shop shows some spikes on the HTTP POST operation.

    When you instrument your services to report on changes in service versions, Lightstep displays deployment markers, making it easy to see when a deploy may have affected performance.

    Let’s see if we can find out why.

  4. Click on a spike to analyze the performance and choose 1 hour prior. Lightstep shows you comparison data between the two time periods. Blue represents the baseline time period and yellow is the regression. You can see in the latency histogram that there are yellow bars (representing the number of spans in the time period on the X axis) at the higher end of latency that weren’t in the baseline (the blue line).

    To the right of the histogram, the Compare Tags table shows tags that appear more often on spans where a latency regression occurred. Tags can be an important piece of the instrumentation and add great value when diagnosing issues. In this case, there are none that highly correlate with latency (when you select a tag, Lightstep highlights the bars in the histogram that contain spans with that tag). But you may see some correlations in your data.

  5. Scroll down to the Operation Diagram. This shows you the path the request took through the system’s operations. A yellow halo denotes latency in that operation. In this image from the Hipster Shop, you can see that downstream, there’s an operation on the checkoutservice that has higher levels of latency.

    The table on the right shows the difference in operation performance between the baseline and regression, allowing you to see which operations have changed the most. You can see that the PlaceOrder operation on checkoutservice service has a regression and is contributing to latency (the yellow halo).

  6. To get a better look at what may be causing the latency, filter the Trace Analysis table below to show only spans from a suspect operation. Hover over an operation in the table - one that shows an increase in latency - and click the Filter icon to filter the Trace Analysis table.

  7. Sort the Trace Analysis table to show the duration of the span from longest to shortest.

  8. Click on a span with a long duration to open it in the Trace view. The span you selected in the table is automatically selected in the trace. In the mini-trace at the top, the black line shows the critical path through the request. Click on the largest segment of the black line to select that span in the full trace.

    In this case, there are actually two operations contributing to most of the latency - GetCart and, further down the stack, GetCartAsync. It turns out the PlaceOrder operation is slow because it’s waiting on two other operations that may actually be the root cause of the issue.

    More about Trace view

    You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.

    Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.

    Learn more about the Trace view.

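The idea of a critical path through a trace can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how Lightstep computes the black line: it assumes each span records its parent, and at every level it follows the child span that finishes last, because that is the child the parent waits on. The Span class, the span IDs, the timings, and the SiblingCall operation are all hypothetical; only the operation names HTTP POST, PlaceOrder, GetCart, and GetCartAsync come from the walkthrough above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    operation: str
    start_ms: int
    end_ms: int


def critical_chain(spans):
    """Follow, from the root down, the child span that finishes last."""
    children = {}
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)

    chain = []
    current = root
    while current is not None:
        chain.append(current)
        kids = children.get(current.span_id, [])
        # The latest-finishing child is the one holding up its parent.
        current = max(kids, key=lambda s: s.end_ms) if kids else None
    return chain


# Hypothetical trace with made-up timings.
TRACE = [
    Span("a", None, "HTTP POST", 0, 3500),
    Span("b", "a", "PlaceOrder", 50, 3450),
    Span("c", "b", "GetCart", 100, 3300),
    Span("d", "c", "GetCartAsync", 150, 3250),
    Span("e", "b", "SiblingCall", 60, 90),
]

if __name__ == "__main__":
    for span in critical_chain(TRACE):
        print(f"{span.operation}: {span.end_ms - span.start_ms} ms")
```

In this made-up trace the chain runs from HTTP POST through PlaceOrder and GetCart down to GetCartAsync, while the short SiblingCall is skipped because PlaceOrder is not waiting on it.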

What Did We Learn?

  • Lightstep ingests telemetry data, then analyzes and displays it in a variety of ways to help you troubleshoot performance.
  • You can compare a baseline and regression time period to narrow down culprits.
  • Instrumenting your app using tags and logs provides more insight into what may be causing or may be affected by issues.
  • The Trace view helps to confirm your hypothesis by showing the critical path through a request and providing tag and log data to support your findings.
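As a rough illustration of comparing a baseline and a regression time period, the Python sketch below measures how often each tag appears on spans in the two periods and ranks tags by how much more common they became. It is a simplified stand-in under stated assumptions, not Lightstep's correlation algorithm: the span dictionaries, the region tag, and both function names are hypothetical.

```python
from collections import Counter


def tag_frequency(spans):
    """Fraction of spans that carry each (key, value) tag pair."""
    counts = Counter()
    for span in spans:
        for pair in span["tags"].items():
            counts[pair] += 1
    total = len(spans)
    return {pair: n / total for pair, n in counts.items()} if total else {}


def compare_tags(baseline, regression):
    """Rank tag pairs by how much more often they appear in the regression."""
    base = tag_frequency(baseline)
    reg = tag_frequency(regression)
    diffs = {
        pair: reg.get(pair, 0.0) - base.get(pair, 0.0)
        for pair in set(base) | set(reg)
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Made-up spans: the tag key and values are purely illustrative.
BASELINE = [
    {"tags": {"region": "us-east"}},
    {"tags": {"region": "us-east"}},
    {"tags": {"region": "us-west"}},
    {"tags": {"region": "us-west"}},
]
REGRESSION = [
    {"tags": {"region": "us-east"}},
    {"tags": {"region": "us-west"}},
    {"tags": {"region": "us-west"}},
    {"tags": {"region": "us-west"}},
]

if __name__ == "__main__":
    for (key, value), change in compare_tags(BASELINE, REGRESSION):
        print(f"{key}={value}: {change:+.2f}")
```

A tag near the top of the ranking shows up on a larger share of spans in the regression period than in the baseline, which makes it a candidate worth investigating; a tag with a negative value became less common and is less likely to be involved.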