When you hear that a particular service is slow, that there's a spike in errors, or that a particular customer is having an issue, your first step is to look at data about the issue. LightStep's Explorer view allows you to query all span data currently in the Satellites to see what's going on.
Explorer consists of five main components. Each of these components provides a different view into the data returned by your query:
- Query: Allows you to query all span data currently in the Satellites' recall window. Every query you make is saved as a Snapshot, so you can revisit it at any time in the future and see the data just as it was. Results from the query are shown in the Latency Histogram, the Trace Analysis table, and the Service diagram.
- Latency histogram: Shows the distribution of spans across latency buckets. Longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency; lines towards the right have higher latency.
- Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span. Click a span to view it in the context of the trace.
- Correlations panel: Shows which services, operations, and tags are correlated with the latency you are seeing in your results. The panel shows both positive and negative correlations, allowing you to drill down into attributes that may be contributing to latency and to rule out those that are not. Find out more about Correlations here.
- Service diagram: Depicts the service based on your query in relation to all its dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown. Learn more about the Service diagram here.
Using these tools, it's possible to quickly form and validate hypotheses around the current issue.
You can query the span data on any combination of a service, an operation, and any number of span tags. Every time you run a query, the results are saved as a Snapshot so you can go back to data at that point in time any time in the future.
You run your query from the top of Explorer.
To run a query:
- Enter your query in the search bar by adding a service, an operation, or tags. LightStep offers suggestions as you type.
You can have only one service and one operation in your query, but you can add as many tags as you'd like.
The real-time query language supports only AND operations. A simple workaround for other conditions is to create an OpenTracing tag and apply it to all operations matching the condition of interest. You can then run a query for this tag.
- Once you've entered your query, click Run.
LightStep queries the Satellites' recall window and returns all span data that matches your query.
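The query rules above (at most one service, at most one operation, any number of tags, all combined with AND) and the derived-tag workaround can be sketched in a few lines of Python. This is only an illustration: the `Span` class, the `matches` and `mark_of_interest` helpers, the `region` and `tier` tags, and the `of_interest` tag name are all hypothetical and are not part of LightStep or OpenTracing. In instrumented code, the tag would typically be set through the tracing API instead, for example with OpenTracing's `set_tag` on a span.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a span; not LightStep's data model."""
    service: str
    operation: str
    tags: dict = field(default_factory=dict)


def matches(span, service=None, operation=None, tags=None):
    """Return True only if the span satisfies every given criterion (AND)."""
    if service is not None and span.service != service:
        return False
    if operation is not None and span.operation != operation:
        return False
    return all(span.tags.get(k) == v for k, v in (tags or {}).items())


def mark_of_interest(span):
    """Record a non-AND condition as one derived tag.

    The condition (region is "eu" OR tier is "premium") and the tag name
    "of_interest" are made up for this example.
    """
    if span.tags.get("region") == "eu" or span.tags.get("tier") == "premium":
        span.tags["of_interest"] = True


spans = [
    Span("api-server", "GET /users", {"region": "eu", "tier": "free"}),
    Span("api-server", "GET /users", {"region": "us", "tier": "premium"}),
    Span("api-server", "GET /users", {"region": "us", "tier": "free"}),
]
for s in spans:
    mark_of_interest(s)

# An AND-only query on the derived tag now expresses the OR condition.
hits = [s for s in spans if matches(s, service="api-server", tags={"of_interest": True})]
```

Because the OR logic runs when the span is tagged, the query itself stays a plain AND of one service and one tag.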
LightStep returns spans that fall into the following time range:
If the Satellite's memory holds spans but none of them fall within the time range above, no results are returned. This time period prevents older data from being returned. If the Satellite recall window is not full, LightStep uses 3 minutes as the median recall.
LightStep saves every query you make as a Snapshot. Snapshots provide a view into saved data that you can share with other LightStep users. When you share a Snapshot, the recipient can work with the data in the same way that you did. Snapshots are automatically created for you, and the data is saved for as long as your data retention policy allows. Snapshots are perfect for Slack messages, emails, post-mortem docs, and anywhere you need a definitive historical view of your span data.
To view your Snapshots:
- Click the gray dropdown that displays Today and a timestamp (this is the timestamp of your latest snapshot).
Your Snapshots are listed by date, time, and query.
- Select the Snapshot to view.
Explorer rerenders using the data from the Snapshot.
You can share a Snapshot with another LightStep user using a URL. When the user clicks the link, the same query is run using the data from the Snapshot (instead of live data).
To share a Snapshot:
- Click Share.
The URL is copied to your clipboard.
- Paste the URL wherever you want someone to access the data.
If you know you will want to revisit a query multiple times, instead of coming back to Explorer and running the query, you can create a Stream. When you create a Stream based on a query, LightStep looks at the distribution of data from the Satellites every minute and persists example traces from different buckets of distribution for that query to ensure you always have data from 0 to p99.9, including outliers. You can create dashboards from Streams, and add alerts when data in a Stream crosses a threshold.
Learn more about Streams here.
Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Satellites that matches the query. The bottom of the histogram shows latency time, and each blue line represents the number of spans that fall into that latency time.
By default, a marker shows the 95th percentile. You can change that using the Show percentile dropdown.
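To make the bucketing and the percentile marker concrete, here is a minimal sketch. It assumes fixed-width buckets and a nearest-rank percentile; how LightStep actually sizes its buckets or computes percentiles is not stated here, and the function names and sample latencies are invented for illustration.

```python
import math
from collections import Counter


def latency_histogram(latencies_ms, bucket_ms):
    """Count spans per fixed-width latency bucket, keyed by bucket start."""
    counts = Counter(int(l // bucket_ms) * bucket_ms for l in latencies_ms)
    return dict(sorted(counts.items()))


def percentile(latencies_ms, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up span latencies in milliseconds.
latencies_ms = [12, 15, 18, 22, 25, 31, 38, 44, 95, 240]

histogram = latency_histogram(latencies_ms, bucket_ms=20)
p95 = percentile(latencies_ms, 95)
```

Each key in `histogram` corresponds to one line in the chart, and the count is that line's height; `p95` is where the default marker would sit for this sample.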
You can compare the current histogram with histograms from the past by choosing 1 hour, 1 day or 1 week prior. An overlay of that prior data displays on top of the histogram. This overlay represents data from the same time window exactly 1 hour, 1 day, or 1 week ago. If selecting a historical time period results in "No matches found," then there were no spans matching the query during the historical window in time.
Seeing the overlay as soon as you run your query?
If your query returns results that differ significantly from the past, the overlay displays automatically, alerting you to a potential issue.
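The comparison window for the overlay is the query's own time window shifted back by the chosen offset. A small sketch of that arithmetic, with a hypothetical `prior_window` helper and an invented three-minute window:

```python
from datetime import datetime, timedelta

# The three comparison offsets offered for the overlay.
OFFSETS = {
    "1 hour": timedelta(hours=1),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
}


def prior_window(start, end, offset_name):
    """Shift a time window back by the named offset, keeping its length."""
    offset = OFFSETS[offset_name]
    return start - offset, end - offset


# Hypothetical query window.
start = datetime(2020, 1, 8, 12, 0)
end = datetime(2020, 1, 8, 12, 3)

week_ago = prior_window(start, end, "1 week")
```

If no spans matched the query inside the shifted window, the overlay has nothing to draw, which corresponds to the "No matches found" message.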
Now that you have an overview of latency for spans, you can get more detailed information from the Trace Analysis table.
The Trace Analysis table shows information from a sample of the spans used to create the histogram. By default, the table shows the service and operation reporting the span, the span's duration, and the start time. You can add other columns as needed.
LightStep analyzes 100% of the data used to create the histogram and then selects span data that represents all ranges of performance or is otherwise statistically interesting, ensuring outliers and other anomalies are well represented.
When you click on a span, you're taken to the Trace view where you can see the span in the context of the full trace.
You can add columns that show the span's tag data to the right of the table by clicking the + icon. As you start typing, LightStep finds tags matching your search.
You can filter the data in the Trace Analysis table in several ways.
You can filter to a certain range of latencies by clicking and dragging across that range in the histogram.
Use the Show percentile dropdown to mark where a certain percentile starts.
The Trace Analysis table refreshes to show spans only in the selected percentile range.
You can also filter to show only specific services, operations or tags.
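Conceptually, these filters narrow the table to spans inside a latency range, optionally restricted to one service. The sketch below shows that idea on made-up data; the dictionary keys and the `filter_spans` helper are assumptions for illustration, not a LightStep API.

```python
def filter_spans(spans, min_ms=None, max_ms=None, service=None):
    """Keep spans inside the latency range and, optionally, from one service."""
    kept = []
    for span in spans:
        if min_ms is not None and span["latency_ms"] < min_ms:
            continue
        if max_ms is not None and span["latency_ms"] > max_ms:
            continue
        if service is not None and span["service"] != service:
            continue
        kept.append(span)
    return kept


# Made-up spans.
spans = [
    {"service": "api-server", "latency_ms": 18},
    {"service": "api-server", "latency_ms": 95},
    {"service": "auth-service", "latency_ms": 120},
    {"service": "auth-service", "latency_ms": 240},
]

# Equivalent of dragging across 90-200 ms in the histogram.
slow = filter_spans(spans, min_ms=90, max_ms=200)

# Everything from 90 ms upward, for a single service.
slow_auth = filter_spans(spans, min_ms=90, service="auth-service")
```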
You can group your results to see interesting aggregate data about your spans. You can group by service, operation, or tag.
When you group your results, LightStep organizes the table by those groups.
Click on a group to see the spans belonging to that group. LightStep shows you the p50 latency, the error percentage, and the number of spans in that group.
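The per-group numbers can be reproduced with a short aggregation. This sketch assumes each span is a dictionary with `latency_ms` and `error` fields (hypothetical names) and uses the statistical median as p50, which averages the two middle values when a group has an even number of spans; the sample data is invented.

```python
from collections import defaultdict
from statistics import median


def group_spans(spans, key):
    """Group spans by a field and summarize each group."""
    groups = defaultdict(list)
    for span in spans:
        groups[span.get(key, "(none)")].append(span)

    summary = {}
    for name, members in groups.items():
        errors = sum(1 for s in members if s["error"])
        summary[name] = {
            "p50_ms": median(s["latency_ms"] for s in members),
            "error_pct": 100 * errors / len(members),
            "count": len(members),
        }
    return summary


# Made-up spans.
spans = [
    {"service": "api-server", "latency_ms": 40, "error": False},
    {"service": "api-server", "latency_ms": 60, "error": False},
    {"service": "auth-service", "latency_ms": 100, "error": True},
    {"service": "auth-service", "latency_ms": 120, "error": True},
    {"service": "auth-service", "latency_ms": 200, "error": False},
]

by_service = group_spans(spans, "service")

# Group names ordered from highest to lowest error percentage.
ranked = sorted(by_service, key=lambda name: by_service[name]["error_pct"], reverse=True)
```

Sorting the groups by error percentage, as in `ranked`, puts the most error-prone group at the top.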
You can also filter and group from the Correlations panel.
By default, LightStep only shows spans that match your query. But often, you'll want to see spans that participated in the same traces as the spans returned by your query. For example, if you're trying to form a hypothesis, you may want a broader view of what is going on in a trace, without having to open a bunch of traces yourself.
To see all spans that took part in a trace with your query results, select Show all Spans in Traces.
The table refreshes to show all spans that participate in the same traces as the spans in your results.
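In data terms, this option widens the result set from the spans that match the query to every span that shares a trace with one of them. A minimal sketch, assuming each span carries a `trace_id` field; the field name, helper, and sample services are illustrative only.

```python
def expand_to_traces(all_spans, matched_spans):
    """Return every span whose trace contains at least one matched span."""
    trace_ids = {s["trace_id"] for s in matched_spans}
    return [s for s in all_spans if s["trace_id"] in trace_ids]


# Made-up spans across three traces.
all_spans = [
    {"trace_id": "t1", "service": "api-server"},
    {"trace_id": "t1", "service": "auth-service"},
    {"trace_id": "t2", "service": "api-server"},
    {"trace_id": "t2", "service": "db"},
    {"trace_id": "t3", "service": "billing"},
]

# Spans matching a query for one service.
matched = [s for s in all_spans if s["service"] == "api-server"]

# Spans from other services in the same traces are now included.
expanded = expand_to_traces(all_spans, matched)
```

Trace `t3` contains no matching span, so its spans stay out of the expanded results.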
The Explorer view is useful for validating (or invalidating) hypotheses about a system regression. Let's say you're an on-call engineer and need to investigate high error rates in api-server. You suspect a downstream service caused the errors, but it is not clear which service is causing the issue or why.
Here's how you can use Explorer to debug the issue:
- Run a query for
- Check Show all Spans in Traces so you can see spans outside of the
- Group by service so you can quickly spot the service with issues.
- Sort by Error % to find the services with a high error rate.
This shows you the upstream and downstream services of api-server, sorted by error rate.
Notice that auth-service (called by api-server) has a high error rate. You're aware that there was a recent canary release for this service, so you dig deeper by doing the following:
- Filter by
- Group by
canary=true has an 86% error rate (illustrating that the canary release for auth-service is experiencing errors). You can then click the canary=true row to see example traces and debug further.
Validating hypotheses is even easier when you can see your services in a diagram that shows service relationships and performance. Read View Service Hierarchy and Performance for more information about the Service diagram, which does just that!