We will be introducing new workflows to replace Explorer, and as a result, it will soon no longer be supported. Instead, use notebooks for your investigations, where you can run ad hoc queries, view data over a longer time period, and use Cloud Observability’s correlation feature. In your notebooks and dashboards, you can use the traces list panel to view a list of spans from a query and dependency maps to view a service diagram.
When you hear that a particular service is slow, or there’s a spike in errors, or a particular customer is having an issue, your first action is to see real-time data regarding the issue. Cloud Observability’s Explorer view allows you to query all span data from the past hour to see what’s going on.
Explorer consists of four main components: the query bar, the Latency Histogram, the Trace Analysis table, and the Correlations panel. Each provides a different view into the data returned by your query.
Using these tools, it’s possible to quickly form and validate hypotheses around the current issue.
You can query the span data on any combination of a service, an operation, and any number of span attributes. Every time you run a query, the results are saved as a Snapshot so you can go back to data at that point in time and analyze it in the Explorer view.
You run your query from the top of Explorer. You can use the Query Builder to ensure valid syntax, or you can enter the query manually.
When you run your query, Cloud Observability queries data from the past hour and returns all span data that matches your query.
To run a query using the Query Builder:
1. Click into the search bar to open the Query Builder. You can build queries for services, operations, and attributes. Use IN or NOT IN to build the query. When you click into the Service or Operation field, Cloud Observability displays valid values.
   When you add multiple values to the Operation field, spans that match either value (an OR operation) are returned.
2. To add attributes, click the Add an attribute filter button. You can add multiple attributes. Attributes are added to the query as an AND operation, meaning only spans that match all attributes are returned.
To run a query manually:
Click into the Search bar and start typing your query. If you need help building your query, check out the Query Language Cheat Sheet. Reset your query by clicking the X icon on the right.
Values must be an exact match; capitalization matters.
- Custom attributes: the attribute’s name, in quotes.
- Cloud Observability-generated attributes (for example, "lightstep.span_id" or "lightstep.trace_id").
- IDs: the ID for a span, trace, or tracer. Valid values are hex strings containing only the characters 0-9 and a-f.
Use the following syntax rules to build complex queries:
Use a comma to query multiple values for a key. Multiple values are treated as an OR operation.

service IN ("iOS", "android")
Returns spans that are from either the iOS or android service.

"customer" NOT IN ("smith", "jones")
Returns spans that do not have smith or jones as the value for the customer attribute.

Only one set of values per key is allowed.

Valid: service IN ("iOS", "android")
Invalid: service IN ("iOS", "android") AND service IN ("web")

Use AND operations to build queries with multiple key/value sets (the OR operation is not supported between keys).

service IN ("iOS", "android") AND "aws_region" IN ("us_east")
Returns spans that are in either the iOS or android service and are in the us_east AWS region.

"lightstep.trace_id" NOT IN ("edcba347fe", "abc7584def") AND "error" IN ("true")
Returns spans that are not in traces with the ID edcba347fe or abc7584def and have the value true for the error attribute.

service IN ("iOS", "android") AND operation IN ("auth", "transaction-db") AND "aws-region" NOT IN ("us-west") AND "lightstep.span_id" IN ("edcba347fe", "abc7584def")
Returns spans from the iOS or android service, for the auth or transaction-db operation, outside the us-west AWS region, with one of the two listed span IDs.
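As a mental model, the IN / NOT IN semantics above can be sketched in Python. This is an illustrative sketch only, not Cloud Observability's implementation; spans are modeled as plain dicts and a query as a list of clauses:

```python
# Illustrative sketch of the IN / NOT IN semantics (not Cloud
# Observability's implementation). A query is a list of clauses that
# are ANDed together; within a clause, multiple values are ORed.

def matches(span, query):
    """Return True if the span satisfies every clause in the query."""
    for key, op, values in query:
        value = span.get(key)
        if op == "IN" and value not in values:
            return False
        if op == "NOT IN" and value in values:
            return False
    return True

# service IN ("iOS", "android") AND "aws_region" IN ("us_east")
query = [
    ("service", "IN", {"iOS", "android"}),
    ("aws_region", "IN", {"us_east"}),
]

spans = [
    {"service": "iOS", "aws_region": "us_east"},      # matches
    {"service": "web", "aws_region": "us_east"},      # wrong service
    {"service": "android", "aws_region": "us_west"},  # wrong region
]
print([matches(s, query) for s in spans])  # [True, False, False]
```

Note how the OR happens only inside a clause (either service value is accepted), while clauses for different keys are combined with AND, mirroring the syntax rules above.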
Cloud Observability saves every query you make as a Snapshot. Snapshots provide a view into saved data that you can share with other Cloud Observability users. When you share a Snapshot, the recipient can work with the data in the same way that you did. Snapshots are automatically created for you, and the data is saved for as long as your data retention policy allows. Snapshots are perfect for Slack messages, emails, post-mortem docs, and anywhere you need a definitive historical view of your span data.
To view your Snapshots:
You can share a Snapshot with another Cloud Observability user using a URL. When the user clicks the link, the same query is run using the data from the Snapshot (instead of live data).
To share a Snapshot:
When you integrate Cloud Observability with Slack, you can share a preview of the query histogram in any channel of your workspace. Simply paste the URL from the Share button into a Slack channel. Other Slack members can see the histogram, and Cloud Observability users can click View Explorer to jump right to that page.
You can add an Explorer query to a notebook when, during an investigation, you want to run ad hoc queries, take notes, and save your analysis for use in postmortems or runbooks. Notebooks let you view logs, metrics, and traces from different places in Cloud Observability together, in one place. While Explorer queries show you the last hour of data, notebooks let you view all data in your retention window.
To add to a notebook, click Add to notebook and search to choose an existing notebook or create a new notebook.
When you add to a notebook, a panel is created using the same query. You can see the latency for multiple percentiles and view exemplar traces. The annotation is a link back to the original, so you can quickly return to the origin of your investigation.
Learn more about notebooks.
When you want to monitor the data from a query going forward, instead of coming back to Explorer and running the query, you can create a Stream. When you create a Stream, Cloud Observability looks at the data from the Microsatellites every minute and persists example traces from different buckets of the distribution for that query, ensuring you always have data from 0 to p99.9, including outliers.
Learn more about Streams.
Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Microsatellites that matches the query. The bottom of the histogram (X axis) shows latency time, and each blue line represents the number of spans that fall into that latency range (Y axis).
For example, in the histogram below, you can see that around 1k spans fall into the 4.42s-4.77s time range.
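The bucketing behind such a histogram can be sketched as follows. This assumes fixed-width bins for illustration; Cloud Observability chooses its own bin edges and widths:

```python
# Illustrative sketch of bucketing span latencies into a histogram
# (assumed fixed-width bins; Cloud Observability chooses its own bin
# edges, so this is only a mental model).
from collections import Counter

def latency_histogram(latencies_s, bin_width_s):
    """Count spans per latency bin, keyed by the bin's lower edge."""
    counts = Counter()
    for lat in latencies_s:
        lower_edge = (lat // bin_width_s) * bin_width_s
        counts[round(lower_edge, 2)] += 1
    return dict(counts)

# Four spans, 0.25s-wide bins: two land in [0.0, 0.25), one in
# [0.25, 0.5), one in [0.75, 1.0).
print(latency_histogram([0.12, 0.18, 0.31, 0.95], 0.25))
# {0.0: 2, 0.25: 1, 0.75: 1}
```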
By default, a marker shows the 95th percentile. You can change that using the Show percentile dropdown.
You can compare the current histogram with histograms from the past by choosing 1 hour, 1 day, or 1 week prior. An overlay of that prior data displays on top of the histogram. This overlay represents data from the same time window exactly 1 hour, 1 day, or 1 week ago. If selecting a historical time period results in “No matches found,” then there were no spans matching the query during that historical window.
If your query returns results that differ significantly from the past, the overlay displays automatically, alerting you to a potential issue.
Now that you have an overview of latency for spans, you can get more detailed information from the Trace Analysis table.
The Trace Analysis table shows information from a sample of spans used to create the histogram. By default, the table shows the service, the operation reporting the span, the span’s duration, and the start time. You can add other columns as needed.
Cloud Observability analyzes 100% of the data used to create the histogram and then selects span data that represents all ranges of performance or is otherwise statistically interesting, ensuring outliers and other anomalies are well represented.
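One way to picture that selection is a stratified pick over the latency-sorted spans, taking one span at each fraction of the distribution so both typical behavior and outliers survive. This is a hypothetical sketch; the product's actual selection logic is not documented here:

```python
# Hypothetical sketch of selecting representative spans across the full
# latency range (one per chosen fraction of the sorted order), so that
# median behavior and outliers both appear. Not the actual algorithm.

def representative_sample(spans, fractions=(0.0, 0.5, 0.9, 0.99, 1.0)):
    """spans: list of (span_id, latency_s) pairs. Picks one span at each
    fraction of the latency-sorted order (duplicates are possible for
    very small inputs)."""
    ordered = sorted(spans, key=lambda s: s[1])
    n = len(ordered)
    return [ordered[int(f * (n - 1))] for f in fractions]

spans = [(f"span-{i}", i / 10) for i in range(1, 101)]  # 0.1s .. 10.0s
picks = representative_sample(spans)
print([lat for _, lat in picks])  # [0.1, 5.0, 9.0, 9.9, 10.0]
```

The 1.0 fraction guarantees the slowest span (the outlier) is always included, which is the behavior the paragraph above describes.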
When you click on a span, you’re taken to the Trace view where you can see the span in the context of the full trace.
You can add columns that show the span’s attribute data to the right of the table by clicking the + icon. As you start typing, Cloud Observability finds attributes matching your search.
You can filter the data in the Trace Analysis table in several ways.
You can filter to a certain range of latencies by clicking and dragging over that range in the histogram.
Use the Show percentile dropdown to mark where a certain percentile starts.
The Trace Analysis table refreshes to show spans only in the selected percentile range.
You can also filter to show only specific services, operations or attributes.
You can group your results to see interesting aggregate data about your spans. You can group by service, operation, or attribute.
When you group your results, Cloud Observability organizes the table by those groups.
Click on a group to see the spans belonging to that group. Cloud Observability shows you the average latency, the error percentage, and the number of spans in that group.
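Those per-group aggregates can be sketched as a simple group-by over span records. This is an illustrative sketch, with spans modeled as dicts and hypothetical field names (latency_ms, error):

```python
# Illustrative sketch of grouping spans and computing the per-group
# aggregates shown in the table: average latency, error percentage,
# and span count. Field names (latency_ms, error) are hypothetical.
from collections import defaultdict

def group_stats(spans, key):
    groups = defaultdict(list)
    for span in spans:
        groups[span.get(key)].append(span)
    stats = {}
    for name, members in groups.items():
        n = len(members)
        stats[name] = {
            "avg_latency_ms": sum(s["latency_ms"] for s in members) / n,
            "error_pct": 100.0 * sum(1 for s in members if s.get("error")) / n,
            "span_count": n,
        }
    return stats

spans = [
    {"service": "auth-service", "latency_ms": 120, "error": True},
    {"service": "auth-service", "latency_ms": 80, "error": False},
    {"service": "api-server", "latency_ms": 40, "error": False},
]
print(group_stats(spans, "service")["auth-service"])
# {'avg_latency_ms': 100.0, 'error_pct': 50.0, 'span_count': 2}
```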
You can also filter and group from the Correlations panel.
By default, Cloud Observability only shows spans that match your query. For example, if you queried on a service, you’ll only see spans from that service. But often, you’ll want to see spans that participated in the same traces as the spans returned by your query. If you’re trying to form a hypothesis, you may want a broader view of what’s going on in a trace, without having to open many traces yourself.
To see all spans that took part in a trace with your query results, select Show all Spans in Traces.
The table refreshes to show spans that participate in the same traces as your result spans. You can then use the group-by or filter buttons to narrow the data while still including spans from outside your initial query.
The Explorer view is useful for validating (or invalidating) hypotheses about a system regression. Let’s say you’re an on-call engineer and need to investigate high error rates in api-server. You suspect a downstream service caused the errors, but it is not clear which service is causing the issue or why.
Here’s how you can use Explorer to debug the issue:
1. Group the results by service so you can quickly spot the service with issues.
2. Sort by Error % to find the services with a high error rate. This shows you the upstream and downstream services of api-server, sorted by error rate.
Notice that the auth-service (called by api-server) has a high error rate. You’re aware that there was a recent canary release for this service, so you dig deeper by doing the following:
Group by the canary attribute and notice that canary=true has an 86% error rate (illustrating that the canary release for the auth-service is experiencing errors). You can then click on that canary=true row to see example traces and debug further.
Validating hypotheses is even easier when you can see your services in a diagram that shows service relationships and performance. Read View Service Hierarchy and Performance for more information about the Service diagram, which does just that!
Updated Apr 26, 2023