Now that you’ve instrumented your app, or reviewed the instrumentation in the Hipster Shop, it’s time to look at the data being sent to Lightstep. When you log in to Lightstep, you’re brought to the Service Directory, where your services are listed along the left. Lightstep shows latency, error rate, and operation rate for the Key Operations in the currently selected service.
It can take a few minutes for data to be analyzed and displayed in the Service Directory, so let’s move on to the part of Lightstep that uses real-time data.
View Performance in Real Time
In the navigation bar, click the Explorer icon. This page gives you a view into the span data from the recall window (working memory) of the Satellites. The histogram at the top shows current latency based on 100% of the span data sent by the application. Because we haven’t entered a query, the histogram shows all spans from the current recall window. The x-axis shows span duration, and the blue lines represent the number of spans that took that amount of time. If you’re using the Hipster Shop, you can see that a number of spans took over 1 second to complete.
It might be worth analyzing those spans to see whether their performance can be improved!
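As an illustrative sketch (not Lightstep’s actual implementation), bucketing span durations into latency ranges like the ones the histogram displays might look like this:

```python
from collections import Counter

def bucket_spans(durations_ms, bucket_edges):
    """Assign each span duration to the first latency bucket that contains it.

    durations_ms: span durations in milliseconds
    bucket_edges: sorted upper bounds for each bucket, e.g. [10, 100, 1000]
    Durations beyond the last edge fall into an overflow bucket.
    """
    counts = Counter()
    for d in durations_ms:
        for edge in bucket_edges:
            if d <= edge:
                counts[f"<={edge}ms"] += 1
                break
        else:
            # Slower than every edge: the long tail the histogram surfaces.
            counts[f">{bucket_edges[-1]}ms"] += 1
    return counts

# Spans slower than 1 second stand out in the overflow bucket.
histogram = bucket_spans([5, 42, 350, 1200, 1500], [10, 100, 1000])
```

The spans landing in the `>1000ms` bucket are the ones worth investigating first.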
More about Explorer
When you hear that a particular service is slow, or there’s a spike in errors, or a particular customer is having an issue, your first action is to see real-time data regarding the issue. Lightstep’s Explorer view allows you to query all span data currently in the Satellites to see what’s going on.
Explorer consists of four main components. Each of these components provides a different view into the data returned by your query:
Query: Allows you to query all span data currently in the Satellites’ recall window. Every query you make is saved as a Snapshot, meaning you can revisit the query made at this point in time, any time in the future, and see the data just as it was. Results from the query are shown in the Latency Histogram, the Trace Analysis table, and the Service diagram.
Latency histogram: Shows the distribution of spans over latency periods. Spans are shown in latency buckets represented by the blue lines. Longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and towards the right, higher latency.
Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span. Click a span to view it in the context of the trace.
Correlations panel: For latency, shows services, operations, and attributes that are correlated with the latency you are seeing in your results. That is, they appear more often in spans that are experiencing higher latency. For errors, shows attributes that are more often on spans that have errors. The panel can show both positive and negative correlations, allowing you to drill down into attributes that may be contributing to latency, as well as ruling out ones that are not. Find out more about Correlations here.
Service diagram: Depicts the service based on your query in relation to all its dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown. Learn more about the Service diagram here.
Using these tools, it’s possible to quickly form and validate hypotheses around the current issue.
Learn more about Explorer.
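As a simplified model of the idea behind the Correlations panel (all names here are hypothetical, and the real analysis is more sophisticated): compare how often an attribute appears on slow spans versus on all spans.

```python
def attribute_correlation(spans, attribute, latency_threshold_ms):
    """Score how strongly an attribute is associated with slow spans.

    spans: list of dicts with 'duration_ms' and 'attributes' (a dict)
    Returns the attribute's frequency among slow spans minus its
    frequency among all spans: positive values suggest the attribute
    is correlated with higher latency, negative values rule it out.
    """
    slow = [s for s in spans if s["duration_ms"] > latency_threshold_ms]
    if not slow:
        return 0.0
    freq_slow = sum(attribute in s["attributes"] for s in slow) / len(slow)
    freq_all = sum(attribute in s["attributes"] for s in spans) / len(spans)
    return freq_slow - freq_all

spans = [
    {"duration_ms": 1200, "attributes": {"cache.miss": True}},
    {"duration_ms": 1500, "attributes": {"cache.miss": True}},
    {"duration_ms": 30, "attributes": {}},
    {"duration_ms": 45, "attributes": {}},
]
# cache.miss appears on every slow span but only half of all spans.
score = attribute_correlation(spans, "cache.miss", 1000)
```

A strongly positive score for an attribute like `cache.miss` is the kind of signal that lets you drill into a likely cause of latency.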
Below the histogram, the Trace Analysis table shows exemplar spans from the histogram that represent all ranges of performance. When you click the Service Diagram tab, you see a full map of your application, showing dependencies upstream and downstream. In this diagram from the Hipster Shop, you can see that there are errors in the frontend service that should probably be investigated.
View Service Health
Now that we’ve given the Hypothesis Engine time to analyze data, let’s look at the health of the services in the Service Directory.
In the navigation bar, click the Service Directory icon.
Lightstep displays sparkline charts for Key Operations on the service. They show recent performance for Service Level Indicators (SLIs): latency, error rate, and operation rate. Shaded yellow bars to the left of the chart indicate the magnitude of the change.
When you instrument your services to report on changes in service versions, Lightstep displays deployment markers, making it easy to see when a deploy may have affected performance.
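As a toy sketch of the idea behind deployment markers (field names here are hypothetical, not Lightstep’s API): when spans report a service version attribute, a marker can be derived wherever that version changes over time.

```python
def deployment_markers(spans):
    """Yield (start_time, version) pairs wherever the reported service
    version changes, in span start-time order.

    spans: list of dicts with 'start_time' and a 'service.version' value;
    each version change becomes a marker on the performance timeline.
    """
    markers = []
    last_version = None
    for span in sorted(spans, key=lambda s: s["start_time"]):
        version = span["service.version"]
        if version != last_version:
            markers.append((span["start_time"], version))
            last_version = version
    return markers

spans = [
    {"start_time": 100, "service.version": "v1.3.0"},
    {"start_time": 160, "service.version": "v1.3.0"},
    {"start_time": 220, "service.version": "v1.4.0"},  # a deploy happened here
]
markers = deployment_markers(spans)
```

Lining such markers up against the latency and error-rate charts makes it easy to ask whether a regression began with a deploy.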
When you notice a spike in latency, error, or operation rate, you can click into the larger chart to start your investigation by comparing that spike to a time when performance was normal.
Lightstep shows you comparison data between the two time periods. Blue represents the baseline time period and yellow is the regression.
This page has a number of useful tools like metrics, attribute correlations, log analysis, and an operations diagram to help you get to the root cause of the issue.
View a Trace
Once you narrow down the issue, you can open an example trace in the Trace view. In the mini-trace at the top and the full expanded trace, the black line shows the critical path through the request. Click on any span to view its details in the panel.
More about Trace view
You use the Trace view to see a full trace from beginning to end of a request. The Trace view shows you a flame graph of the full trace (each service a different color), and below that, each span is shown in a hierarchy, allowing you to see the parent-child relationship of all the spans in the trace. Errors are shown in red.
Clicking a span shows details in the right panel, based on the span’s metadata. Whenever you view a trace in Lightstep, it’s persisted for the length of your Data Retention policy so it can be bookmarked, shared, and reviewed by your team at any time. The selected span is part of the URL, so it will remain selected when the URL is visited.
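A simplified sketch of how a critical path through a trace can be derived (real tracing tools account for overlapping children, async work, and gaps between spans): starting from the root, repeatedly descend into the child span that finishes last.

```python
def critical_path(span, children):
    """Return the chain of span names that determines the trace's
    end-to-end duration, under a simplifying assumption: at each level,
    the child that finishes last dominates the remaining time.

    span: dict with 'name', 'start', and 'end'
    children: maps a span name to a list of its child span dicts
    """
    path = [span["name"]]
    kids = children.get(span["name"], [])
    while kids:
        last = max(kids, key=lambda c: c["end"])  # latest-finishing child
        path.append(last["name"])
        kids = children.get(last["name"], [])
    return path

root = {"name": "frontend", "start": 0, "end": 100}
children = {
    "frontend": [
        {"name": "cart", "start": 5, "end": 40},
        {"name": "checkout", "start": 45, "end": 95},
    ],
    "checkout": [{"name": "payment", "start": 50, "end": 90}],
}
path = critical_path(root, children)
```

In this hypothetical trace, the critical path runs frontend → checkout → payment: speeding up `cart` would not shorten the request.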
Learn more about the Trace view.
What Did We Learn?
- Lightstep ingests telemetry data, then analyzes and displays it in a variety of ways to help you troubleshoot performance.
- You can compare a baseline and regression time period to narrow down culprits.
- Instrumenting your app with attributes and logs provides more insight into what may be causing, or may be affected by, issues.
- The Trace view helps you confirm your hypothesis by showing the critical path through a request and providing attribute and event log data to support your findings.