When you hear that a particular service is slow, that there’s a spike in errors, or that a particular customer is having an issue, your first step is to look at the data related to the issue. Lightstep’s Explorer view allows you to query all span data currently in the Satellites to see what’s going on.

Explorer consists of five main components. Each of these components provides a different view into the data returned by your query:

  • Query: Lets you query all span data currently in the Satellites’ recall window. Every query you make is saved as a Snapshot, so you can revisit the query at any time in the future and see the data exactly as it was when you ran it. Results from the query are shown in the Latency Histogram, the Trace Analysis table, and the Service diagram.

  • Latency histogram: Shows the distribution of spans over latency periods. Spans are shown in latency buckets - longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and towards the right, higher latency.

  • Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span. Click a span to view it in the context of the trace.

  • Correlations panel: For latency, shows the services, operations, and tags that are correlated with the latency you are seeing in your results. For errors, shows tags correlated with spans that have errors. The panel shows both positive and negative correlations, allowing you to drill down into attributes that may be contributing to latency and to rule out ones that are not. Find out more about Correlations here.

  • Service diagram: Depicts the service based on your query in relation to all its dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown. Learn more about the Service diagram here.

Using these tools, it’s possible to quickly form and validate hypotheses around the current issue.

Query Data to Create a Snapshot

You can query the span data on any combination of a service, an operation, and any number of span tags. Every time you run a query, the results are saved as a Snapshot so you can go back to data at that point in time any time in the future.

Run a Query

You run your query from the top of Explorer. You can use the Query Builder to ensure valid syntax, or you can enter the query manually.

When you run your query, Lightstep queries the Satellites’ recall window and returns all span data that matches your query.

Lightstep returns spans that fall into the following time range:

start = query_time - median_recall
end = query_time

Spans that are in the Satellite’s memory but fall outside this time range are not returned, which prevents older data from appearing in the results. If the Satellite recall window is not full, Lightstep uses 3 minutes as the median recall.
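The time range above can be sketched in Python. This is only an illustration of the arithmetic, not Lightstep’s implementation: the names `query_time_range` and `in_range` are hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Fallback stated in the text: used when the recall window is not yet full.
DEFAULT_MEDIAN_RECALL = timedelta(minutes=3)


def query_time_range(query_time, median_recall=None):
    """Return (start, end) for a query issued at query_time.

    Passing median_recall=None models a recall window that is not full.
    """
    if median_recall is None:
        median_recall = DEFAULT_MEDIAN_RECALL
    return query_time - median_recall, query_time


def in_range(span_start, window):
    """True if a span's start time falls inside the window (inclusive, by assumption)."""
    start, end = window
    return start <= span_start <= end


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = query_time_range(now)
print(window[0].isoformat(), window[1].isoformat())
```

With the 3-minute fallback, a span that started 2 minutes before the query falls inside the range, while one that started 5 minutes before does not.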

To run a query using the Query Builder:

Click into the search bar to open the Query Builder. You can build queries for services, operations, and tags. Use IN or NOT IN to build the query. When you click into the Service or Operation field, Lightstep displays valid values.

When you add multiple values to the Operation field, spans that match either value (OR operation) are returned.

To add tags, click the Add a tag filter button. You can add multiple tags.

Tags are added to the query as an AND operation, meaning only spans that match all tags are returned.

To run a query manually:

Click into the Search bar and start typing your query. If you need help building your query, check out the Query Language Cheat Sheet. Reset your query by clicking the X icon on the right.

Values must be an exact match - capitalization matters.
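The matching rules described in this section (multiple values for one key are ORed, separate keys are ANDed, and values must match exactly, including capitalization) can be sketched in Python. This is an illustration only, not how Lightstep evaluates queries: the flat dictionary span model and the `matches` helper are hypothetical, and letting a span that lacks a tag satisfy NOT IN is an assumption.

```python
def matches(span, clauses):
    """Return True if the span satisfies every clause (AND across keys).

    clauses is a list of (key, negate, values) tuples, where negate=False
    means IN and negate=True means NOT IN.
    """
    for key, negate, values in clauses:
        # Exact, case-sensitive comparison; any listed value may match (OR).
        found = span.get(key) in values
        if found == negate:
            return False
    return True


spans = [
    {"service": "iOS", "operation": "auth", "aws-region": "east"},
    {"service": "ios", "operation": "auth", "aws-region": "east"},
    {"service": "iOS", "operation": "checkout", "aws-region": "west"},
]
clauses = [
    ("service", False, ["iOS"]),                 # service IN ("iOS")
    ("operation", False, ["auth", "checkout"]),  # either operation
    ("aws-region", True, ["west"]),              # "aws-region" NOT IN ("west")
]
result = [s for s in spans if matches(s, clauses)]
print(result)
```

Only the first span is kept: the second fails because "ios" differs from "iOS" in capitalization, and the third is excluded by the NOT IN clause.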

Supported Keys

Key: service
Value: Service’s name
Examples:
service IN ("iOS")
Returns spans from the iOS service

service NOT IN ("android")
Returns spans from every service but android

Key: operation
Value: Operation’s name
Example:
operation IN ("/api/get-profile")
Returns spans from the /api/get-profile operation

Key: "tag_name"
Value: Custom tag’s name, in quotes. For example, "customer" or "aws-region"
Example:
"aws-region" IN ("east")
Returns spans where the aws-region tag value is east

Key: "lightstep.span_id", "lightstep.trace_id", "lightstep.tracer_id"
Value: Lightstep-generated tags. The ID for a span, trace, or tracer.
Examples:
"lightstep.span_id" IN ("ad5490ghl")
Returns that specific span

"lightstep.trace_id" NOT IN ("cebr0875xl")
Returns spans in traces other than the cebr0875xl trace

"lightstep.tracer_id" NOT IN ("cebr0875xl")
Returns spans that were produced from tracers other than the cebr0875xl tracer

Querying Multiple Keys and Values

Use the following syntax rules to build complex queries:

  • Use a comma to query multiple values for a key. Multiple values are treated as OR operations.
    Example:
    service IN ("iOS", "android")
    Returns spans that are from either the iOS or android service.

    "customer" NOT IN ("smith", "jones")
    Returns spans that do not have smith or jones as the value for the customer tag.
     

  • Only one set of values per key is allowed.
    Example:
    Valid:
    service IN ("iOS", "android")

    Not valid:
    service IN ("iOS", "android") AND service IN ("web")
     

  • Use AND operations to build queries with multiple key/value sets (the OR operation is not supported).
    Example:
    service IN ("iOS", "android") AND "aws_region" IN ("us_east")
    Returns spans that are in either the iOS or android service and are in the us_east AWS region.

    "lightstep.trace_id" NOT IN ("etla347uz", "aeo7584xco") AND "error" IN ("true")
    Returns spans that are not in traces with the ID etla347uz or aeo7584xco and have the value true for the error tag.

Full Example

service IN ("iOS", "android") AND operation IN ("auth", "transaction-db") AND "aws-region" NOT IN ("us-west") AND "lightstep.span_id" IN ("ad753587das", "zb18857cat")
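If you assemble query strings in a script before pasting them into the Search bar, the syntax rules above can be captured in a small helper. The following Python sketch is hypothetical (`build_query` is not part of any Lightstep tool): it assumes that service and operation are written bare while every other key is a quoted tag name, and it does not escape quotes inside values.

```python
def build_query(clauses):
    """Build a query string from (key, operator, values) tuples.

    Enforces the rules from the text: IN or NOT IN only, one set of
    values per key, and AND between key/value sets.
    """
    builtin_keys = {"service", "operation"}
    seen = set()
    parts = []
    for key, operator, values in clauses:
        if operator not in ("IN", "NOT IN"):
            raise ValueError(f"unsupported operator: {operator}")
        if key in seen:
            raise ValueError(f"only one set of values per key: {key}")
        seen.add(key)
        # Tag names are quoted; service and operation are bare keywords.
        name = key if key in builtin_keys else f'"{key}"'
        quoted = ", ".join(f'"{v}"' for v in values)
        parts.append(f"{name} {operator} ({quoted})")
    return " AND ".join(parts)


query = build_query([
    ("service", "IN", ["iOS", "android"]),
    ("operation", "IN", ["auth", "transaction-db"]),
    ("aws-region", "NOT IN", ["us-west"]),
    ("lightstep.span_id", "IN", ["ad753587das", "zb18857cat"]),
])
print(query)
```

Rejecting a repeated key up front mirrors the "Not valid" case above, so a malformed query fails in the script rather than in Explorer.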

View Snapshots

Lightstep saves every query you make as a Snapshot. Snapshots provide a view into saved data that you can share with other Lightstep users. When you share a Snapshot, the recipient can work with the data in the same way that you did. Snapshots are automatically created for you, and the data is saved for as long as your data retention policy allows. Snapshots are perfect for Slack messages, emails, post-mortem docs, and anywhere you need a definitive historical view of your span data.

To view your Snapshots:

  1. Click the gray dropdown that displays Today and a timestamp (this is the timestamp of your latest snapshot).

    Your Snapshots are listed by date, time, and query.
  2. Select the Snapshot to view.

    Explorer rerenders using the data from the Snapshot.

Share a Snapshot

You can share a Snapshot with another Lightstep user using a URL. When the user clicks the link, the same query is run using the data from the Snapshot (instead of live data).

To share a Snapshot:

  1. Click Share.

    The URL is copied to your clipboard.
  2. Paste the URL wherever you want someone to access the data.

Share a Snapshot in Slack

When you integrate Lightstep with Slack, you can share a preview of the query histogram in any channel of your workspace. Paste the URL from the Share button into a Slack channel. Other Slack members can see the histogram, and Lightstep users can click View Explorer to jump right to that page.

Save Your Query and Monitor the Data Going Forward

If you expect to revisit a query multiple times, you can create a Stream instead of returning to Explorer and rerunning the query. When you create a Stream based on a query, Lightstep looks at the distribution of data from the Satellites every minute and persists example traces from different buckets of that distribution, ensuring you always have data from 0 to p99.9, including outliers. You can create dashboards from Streams and add alerts that fire when data in a Stream crosses a threshold.

Learn more about Streams here.

View Latency Histogram

Once you run a query, the Latency Histogram is generated from 100% of the span data collected from the Satellites that matches the query. The bottom of the histogram shows latency, and the length of each blue line represents the number of spans that fall into that latency bucket.

View Percentile Markers

By default, a marker shows the 95th percentile. You can change that using the Show percentile dropdown.
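To make the relationship between latency buckets and a percentile marker concrete, here is a minimal Python sketch. It is not Lightstep’s algorithm: fixed-width buckets and the nearest-rank percentile definition are assumptions chosen for simplicity, and the sample latencies are made up.

```python
import math
from collections import Counter


def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not latencies_ms:
        raise ValueError("no latencies")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def histogram(latencies_ms, bucket_width_ms):
    """Count spans per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = Counter(
        int(latency // bucket_width_ms) * bucket_width_ms
        for latency in latencies_ms
    )
    return dict(sorted(counts.items()))


latencies = [12, 15, 18, 22, 25, 31, 40, 48, 95, 410]
buckets = histogram(latencies, 50)
p95 = percentile(latencies, 95)
p50 = percentile(latencies, 50)
print(buckets, p50, p95)
```

In this sample most spans land in the lowest bucket, yet the 95th percentile sits at the single slow outlier, which is why a percentile marker is useful alongside the bucket counts.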

Compare with Historical Data

You can compare the current histogram with histograms from the past by choosing 1 hour, 1 day, or 1 week prior. An overlay of that prior data displays on top of the histogram. This overlay represents data from the same time window exactly 1 hour, 1 day, or 1 week ago. If selecting a historical time period results in “No matches found,” then no spans matched the query during that historical window.
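The "same time window, shifted back" idea can be expressed in a few lines of Python. The helper name `historical_window` and the `OFFSETS` table are hypothetical and only illustrate the arithmetic described above.

```python
from datetime import datetime, timedelta, timezone

# The three comparison choices named in the text.
OFFSETS = {
    "1 hour": timedelta(hours=1),
    "1 day": timedelta(days=1),
    "1 week": timedelta(weeks=1),
}


def historical_window(start, end, offset_label):
    """Shift a (start, end) window back by the chosen offset, keeping its length."""
    offset = OFFSETS[offset_label]
    return start - offset, end - offset


end = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
start = end - timedelta(minutes=3)
prior_start, prior_end = historical_window(start, end, "1 week")
print(prior_start.isoformat(), prior_end.isoformat())
```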

If your query returns results that differ significantly from the past, the overlay displays automatically, alerting you to a potential issue.

Now that you have an overview of latency for spans, you can get more detailed information from the Trace Analysis table.

Analyze Span Data

The Trace Analysis table shows information from a sample of the spans used to create the histogram. By default, the table shows the service and operation reporting the span, the span’s duration, and the start time. You can add other columns as needed.

Lightstep analyzes 100% of the data used to create the histogram and then selects span data that represents all ranges of performance or is otherwise statistically interesting, ensuring outliers and other anomalies are well represented.

When you click on a span, you’re taken to the Trace view where you can see the span in the context of the full trace.

Add More Columns to the Trace Analysis Table

You can add columns that show the span’s tag data to the right of the table by clicking the + icon. As you start typing, Lightstep finds tags matching your search.

Filter Data

You can filter the data in the Trace Analysis table in several ways.

Filter by Latency

You can filter to a range of latencies by clicking and dragging across that range in the histogram.

Use the Show percentile dropdown to mark where a certain percentile starts.

The Trace Analysis table refreshes to show spans only in the selected percentile range.

Filter by Service, Operation, or Tag

You can also filter to show only specific services, operations or tags.

Group Results

You can group your results to see interesting aggregate data about your spans. You can group by service, operation, or tag.

When you group your results, Lightstep organizes the table by those groups.

Click on a group to see the spans belonging to that group. Lightstep shows you the average latency, the error percentage, and the number of spans in that group.
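The per-group figures described above (average latency, error percentage, and span count) can be reproduced on exported span data with a short Python sketch. The span fields `latency_ms` and `error` and the helper `group_spans` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def group_spans(spans, key):
    """Group spans by a field and summarize each group."""
    groups = defaultdict(list)
    for span in spans:
        groups[span.get(key)].append(span)

    summary = {}
    for value, members in groups.items():
        count = len(members)
        errors = sum(1 for s in members if s.get("error"))
        summary[value] = {
            "count": count,
            "avg_latency_ms": sum(s["latency_ms"] for s in members) / count,
            "error_pct": 100 * errors / count,
        }
    return summary


spans = [
    {"service": "api-server", "latency_ms": 120, "error": False},
    {"service": "api-server", "latency_ms": 80, "error": False},
    {"service": "auth-service", "latency_ms": 300, "error": True},
    {"service": "auth-service", "latency_ms": 100, "error": False},
]
summary = group_spans(spans, "service")
# Sort group names by error percentage, highest first.
worst_first = sorted(summary, key=lambda k: summary[k]["error_pct"], reverse=True)
print(worst_first, summary)
```

Sorting the groups by error percentage puts the service with the most failing spans first.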

You can also filter and group from the Correlations panel.

Add All Spans from the Trace to the Table

By default, Lightstep shows only spans that match your query. Often, though, you’ll want to see the other spans that participated in the same traces as the spans returned by your query. For example, when you’re forming a hypothesis, you may want a broader view of what is going on in a trace without having to open many traces yourself.

To see all spans that took part in a trace with your query results, select Show all Spans in Traces.

The table refreshes to show spans that participate in the same traces that your results spans do.
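Conceptually, this option widens the result set from "spans that match the query" to "every span that shares a trace with a matching span". A minimal Python sketch of that expansion follows; the `trace_id` field and the helper name `spans_in_matching_traces` are hypothetical, and this is not a description of how Lightstep computes it.

```python
def spans_in_matching_traces(all_spans, matched_spans):
    """Return every span whose trace contains at least one matched span."""
    trace_ids = {s["trace_id"] for s in matched_spans}
    return [s for s in all_spans if s["trace_id"] in trace_ids]


all_spans = [
    {"trace_id": "t1", "service": "api-server"},
    {"trace_id": "t1", "service": "auth-service"},
    {"trace_id": "t2", "service": "billing"},
    {"trace_id": "t3", "service": "api-server"},
]
# Spans returned by a query for the api-server service.
matched = [s for s in all_spans if s["service"] == "api-server"]
expanded = spans_in_matching_traces(all_spans, matched)
print([(s["trace_id"], s["service"]) for s in expanded])
```

The auth-service span appears in the expanded set because it shares trace t1 with a matching span, while the billing span in trace t2 stays out.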

Example: Use Explorer to Validate a Hypothesis

The Explorer view is useful for validating (or invalidating) hypotheses about a system regression. Let’s say you’re an on-call engineer and need to investigate high error rates in api-server. You suspect a downstream service caused the errors, but it is not clear which service is causing the issue or why.

Here’s how you can use Explorer to debug the issue:

  1. Run a query for service: api-server.
  2. Check Show all Spans in Traces so you can see spans outside of the api-server service.
  3. Group by service so you can quickly spot the service with issues.
  4. Sort by Error % to find the services with a high error rate.

    This will show you the upstream and downstream services of the api-server, sorted by error rate.

    Notice that the auth-service (called by api-server) has a high error rate. You’re aware that there was a recent canary release for this service, so you dig deeper by doing the following:

  5. Filter by service: auth-service.
  6. Group by canary.

Notice that canary=true has an 86% error rate (illustrating that the canary release for the auth-service is experiencing errors). You can then click on that canary=true row to see example traces and debug further.

Validating hypotheses is even easier when you can see your services in a diagram that shows service relationships and performance. Read View Service Hierarchy and Performance for more information about the Service diagram, which does just that!