Explorer gives you a set of powerful analysis tools to help you better understand your system's behavior.
At the top of Explorer, there's a query builder that categorizes query terms by Service, Operation, and Tags. There is no practical limit to the number of query terms that you can include. When adding a new query term, you can scan the list of suggestions or begin typing to quickly find what you're looking for. If a term is rare enough that it does not appear in the suggestion list, you can still type it in and add it.
Your previous queries are saved locally and can be viewed by clicking on the query history dropdown.
The Latency Histogram shows the global latency distribution for the full set of spans, or timed operations, that are currently in memory across every [𝑥]PM Satellite within your environment. When reading the chart, remember that the x-axis is a log-scale measure of latency and the y-axis is a log-scale measure of frequency.
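To make the log-scale latency axis concrete, the sketch below sorts span durations into logarithmically spaced buckets. It only illustrates the idea and is not how the Satellites compute the histogram; the function names and the `buckets_per_decade` parameter are assumptions made for this example.

```python
import math
from collections import Counter


def log_bucket(latency_ms, buckets_per_decade=5):
    """Return the index of the log-scale bucket that a latency falls into."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return math.floor(math.log10(latency_ms) * buckets_per_decade)


def latency_histogram(latencies_ms, buckets_per_decade=5):
    """Return (bucket lower bound in ms, span count) pairs, lowest bucket first."""
    counts = Counter(log_bucket(v, buckets_per_decade) for v in latencies_ms)
    return [
        (10 ** (index / buckets_per_decade), count)
        for index, count in sorted(counts.items())
    ]


# Hypothetical span durations in milliseconds.
sample = [2, 5, 20, 50, 80, 300]
histogram = latency_histogram(sample, buckets_per_decade=1)
```

With one bucket per decade, each bin covers a tenfold range of latency, which is why a slow tail stays visible next to the bulk of fast spans.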
You can also click and drag along the latency axis of the histogram chart to narrow the set of example spans to only those in a specific latency range. This lets you drill down into a subsection of the distribution to analyze a particular performance problem. Together, these features let you examine latency characteristics for an entire application, whether it is a monolith or a set of microservices, or filter by any dimension, however focused or broad.
Within the Latency Histogram are checkboxes (1 hour, 1 day, 1 week) that, when clicked, overlay the average latency shape calculated over the given window of time. In this way, you can compare the shape of the Latency Histogram you are seeing right now with the historical average latency histograms that LightStep keeps continually up to date.
In cases where LightStep detects that the historical histograms for a particular query are sufficiently different from the live histogram, we will automatically show the difference by programmatically selecting the time windows as soon as the query is made.
This will help you discover latency changes for your system in a more seamless and automatic way.
Your Satellites must be upgraded to the July 2018 (or later) release to enable this feature.
Below the Latency Histogram is a table of spans that match the query that was executed above.
There are four columns in the table: Service, Operation, Duration, and Start Time. The first two columns, Service and Operation, identify the part of your system that generated the span. Duration indicates the total duration of this specific span, and Start Time indicates the time at which the span began. Spans containing an error have a red Duration bar.
The Trace Analysis table offers a number of other tools to help with root-cause analysis of system behavior.
By default, the Trace Analysis table only displays the spans that directly match the query defined in the Explorer query bar. However, those spans are typically part of larger traces.
The Show all Spans in Traces checkbox allows you to expand the result set in the Trace Analysis table to include all the spans from all the traces that matched your Explorer query, instead of just the spans that directly matched the parameters in your query.
For example, if your Explorer query was Service="ios-client" Operation="api-request", the span that is highlighted in light blue would be returned in your result.
However, if you check the Show all Spans in Traces checkbox, then you will be able to see all the spans in this trace in the Trace Analysis table.
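The difference between the default result set and the expanded one can be sketched as follows. This is a minimal illustration, assuming each span is a dictionary with hypothetical `trace_id`, `service`, and `operation` keys; the sample spans are made up, and the code is not LightStep's implementation.

```python
def matching_spans(spans, service, operation):
    """Spans that directly match the query (the default table contents)."""
    return [
        s for s in spans
        if s["service"] == service and s["operation"] == operation
    ]


def spans_in_matching_traces(spans, service, operation):
    """Every span belonging to a trace that has at least one matching span."""
    trace_ids = {s["trace_id"] for s in matching_spans(spans, service, operation)}
    return [s for s in spans if s["trace_id"] in trace_ids]


# Hypothetical spans; only trace t1 contains the queried operation.
spans = [
    {"trace_id": "t1", "service": "ios-client", "operation": "api-request"},
    {"trace_id": "t1", "service": "api-server", "operation": "handle-request"},
    {"trace_id": "t1", "service": "auth-service", "operation": "check-token"},
    {"trace_id": "t2", "service": "web-client", "operation": "page-load"},
]

direct = matching_spans(spans, "ios-client", "api-request")
expanded = spans_in_matching_traces(spans, "ios-client", "api-request")
```

Here `direct` holds only the single `ios-client` span, while `expanded` holds all three spans of trace `t1`, including the downstream services.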
The Add Column button is just to the right of the Start Time column. This button allows you to add any tag as a column in the table.
Each row will contain the value for that tag. If the span does not contain the tag, an em-dash ("—") will be displayed instead.
The Group by button allows you to aggregate system behavior based on tags, Operations, or Services. There are two ways you can Group by:
- Click the Group by button on the top of the Trace Analysis table and enter the name of the tag and hit enter; or
- Click a tag in the Correlations results and you will have the option to Group by that tag.
Group by the tag "cache_hit" to see the performance impact of cache misses.
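As a rough picture of what grouping by a tag computes, the sketch below aggregates a handful of made-up spans by their `cache_hit` value. The span layout and the choice of aggregates (count, average duration, error percentage) are assumptions made for illustration; the columns shown in the product may differ.

```python
from collections import defaultdict


def group_by_tag(spans, tag):
    """Aggregate spans by the value of one tag (None if the tag is missing)."""
    groups = defaultdict(list)
    for span in spans:
        groups[span["tags"].get(tag)].append(span)

    summary = {}
    for value, members in groups.items():
        errors = sum(1 for m in members if m["error"])
        summary[value] = {
            "count": len(members),
            "avg_duration_ms": sum(m["duration_ms"] for m in members) / len(members),
            "error_pct": 100.0 * errors / len(members),
        }
    return summary


# Hypothetical spans: cache misses are much slower than cache hits.
spans = [
    {"duration_ms": 10, "error": False, "tags": {"cache_hit": "true"}},
    {"duration_ms": 30, "error": False, "tags": {"cache_hit": "true"}},
    {"duration_ms": 200, "error": False, "tags": {"cache_hit": "false"}},
    {"duration_ms": 400, "error": True, "tags": {"cache_hit": "false"}},
]

summary = group_by_tag(spans, "cache_hit")
```

Comparing `summary["true"]` with `summary["false"]` puts the cost of a cache miss side by side with the cost of a hit.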
In much the same way, you can filter the results in the Trace Analysis table by Service, Operation, or tag:
- Click the Filter button on the top of the Trace Analysis table and enter the name of the tag and hit enter; or
- Click a Service-Operation or tag in the Correlations results and you will have the option to Filter by that Service-Operation or tag.
You can add as many filters as you need. Filters help you reduce your search space to find the trace that contains the exact set of properties you were looking for.
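Stacking filters can be thought of as applying several conditions at once. The sketch below assumes that filters combine with a logical AND, which the text above suggests but does not state outright; the span layout and the `apply_filters` helper are hypothetical.

```python
def apply_filters(spans, filters):
    """Keep only the spans that satisfy every (field, value) filter."""

    def matches(span, field, value):
        # Service and Operation are top-level fields; anything else is a tag.
        if field in ("service", "operation"):
            return span.get(field) == value
        return span.get("tags", {}).get(field) == value

    return [s for s in spans if all(matches(s, f, v) for f, v in filters)]


# Hypothetical spans.
spans = [
    {"service": "auth-service", "operation": "check-token", "tags": {"canary": "true"}},
    {"service": "auth-service", "operation": "check-token", "tags": {"canary": "false"}},
    {"service": "api-server", "operation": "handle-request", "tags": {"canary": "true"}},
]

narrowed = apply_filters(spans, [("service", "auth-service"), ("canary", "true")])
```

Each additional filter can only shrink the result, so adding filters one at a time is a safe way to home in on a specific trace.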
The Trace Analysis table is useful for validating (or invalidating) hypotheses about a system regression. For example, an on-call engineer might be investigating high error rates in api-server. They may suspect that a particular downstream service in the trace is responsible, but it is not clear which service is causing the issue or why. The following Trace Analysis features can help debug the issue further:
- Query the Explorer for
- Check Show all Spans in Traces.
- Group by
- Sort by Error %.
This will show you the upstream and downstream services of the api-server, sorted by error rate.
Notice that the auth-service (which is called by api-server) has a high error rate. You're aware that there was a recent canary release for this service, so you dig deeper by doing the following:
- Filter by
- Group by
canary=true has an 86% error rate (illustrating that the canary release for the auth-service is experiencing errors). You can then proceed to click on that canary=true row to see example traces and debug further.
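The investigation above can be sketched end to end on toy data. The exact values for the query, Group by, and Filter steps are not shown in the text, so this example assumes grouping by service first, then filtering to auth-service and grouping by a `canary` tag, which is consistent with the results described. The spans, their error rates, and the `error_pct_by` helper are all made up for illustration.

```python
from collections import defaultdict


def error_pct_by(spans, key_fn):
    """Group spans by key_fn and return (key, error %) rows, worst first."""
    totals = defaultdict(lambda: [0, 0])  # key -> [error count, span count]
    for span in spans:
        bucket = totals[key_fn(span)]
        bucket[0] += 1 if span["error"] else 0
        bucket[1] += 1
    rows = [(key, 100.0 * errs / count) for key, (errs, count) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def span(service, error, canary=None):
    """Build a hypothetical span; the canary tag is set only when given."""
    tags = {} if canary is None else {"canary": canary}
    return {"service": service, "error": error, "tags": tags}


# Toy data standing in for all spans in the traces that matched the query.
spans = [
    span("api-server", True), span("api-server", False),
    span("api-server", False), span("api-server", False),
    span("auth-service", True, "true"), span("auth-service", True, "true"),
    span("auth-service", True, "false"), span("auth-service", False, "false"),
    span("db", False), span("db", False),
]

# Assumed step: group by service and sort by error %.
by_service = error_pct_by(spans, lambda s: s["service"])

# Assumed steps: filter to auth-service, then group by the canary tag.
auth_spans = [s for s in spans if s["service"] == "auth-service"]
by_canary = error_pct_by(auth_spans, lambda s: s["tags"].get("canary"))
```

In this toy data, auth-service sorts to the top of `by_service`, and the canary group sorts to the top of `by_canary`, mirroring the shape of the investigation described above.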