Lightstep Observability offers a way to quickly see how all your services and their operations are performing in one place: the Service Directory view. From here, you can:

  • See all your services in one place
    • Search for a service
    • Filter to view by language
    • Mark “favorite” services
  • See performance over time and the top performance changes for your Key Operations
    • View service health after deployments
    • Compare latency, error, and infrastructure regression data against baseline data
    • View and sort latency, operation rate, and error rate and dive into root cause analysis
  • View all operations (ingress and egress) on a service
    • View individual operations’ current performance, performance at a specific percentile, or performance change over a period of time
    • Filter by ingress or egress operations
  • View Streams (retained span queries)
  • View the instrumentation quality of a service

Read Quick Start: Tracing instrumentation to learn how to add instrumentation to your service so it can report to Lightstep Observability.

Access the service directory view

When you first open Lightstep Observability, you’re taken to the Service Directory. You can also access it from the navigation bar.

Find services

Your services are listed in alphabetical order. To make finding services easier, you can “favorite” a service so it always appears at the top of the list.

To find a service:

  • Use the Search box to search for services by name. As you type, Lightstep Observability filters the list of services that match your entry.
  • Filter by language. All services show the language of the associated instrumentation. Use the language dropdown at the top of the list to filter by a language.

To favorite a service:

  1. Find and select the service to favorite.
  2. Click the star next to the service’s name. The service now appears at the top of the list.

View service health, see top changes, and compare operation performance

We will be introducing new workflows to replace the Deployments tab and RCA view. As a result, they will soon no longer be supported. Instead, use notebooks for your investigation where you can run ad-hoc queries, view data over a longer time period, and run Change Intelligence.

The Service Health view on the Deployments tab shows you the latency, error rate, and operation rate of your Key Operations (operations whose performance is strategic to the health of your system) on the selected service.

Key Operations are displayed in order of magnitude of change in performance (you can change the order). You can quickly see the top performance changes and compare performance during latency or error regression spikes with normal performance.

View top changes in performance

Lightstep Observability displays sparkline charts for Key Operations on the service. They show recent performance for Service Level Indicators (SLIs): latency, error rate, and operation rate. Shaded yellow bars to the left of the chart indicate the magnitude of the change.

The operations are initially sorted by the largest change during the visible time period. You can change both the time period and the sort order.

If there’s a deployment marker visible, that change is measured as the difference between the selected version and all other versions. If there is no marker, the change is measured as the difference between the first and second half of the time period shown in the chart.

More About How Lightstep Observability Measures Change

When determining what has changed in your services, Lightstep Observability compares each SLI’s baseline time series to its comparison time series. Those time periods are determined using the data currently visible in the charts.

You can change the amount of time displayed using the time period dropdown at the top right of the page.

The baseline and comparison time periods are determined as follows:

If there is one or more deployment markers visible:

  • For latency and error rates, Lightstep Observability compares the performance after the selected version to performance in all other versions.
  • For operation rate, it compares the rate before the deployment marker to after the deployment marker.

If there are no deployment markers visible:
Lightstep Observability compares the performance of the first half of the time period to the second half.

Change is measured relatively, not absolutely: a change from 10ms to 500ms (50x) is ranked higher than one from 1s to 2s (2x). The yellow bars on the sparkline chart indicate the amount of change. Lightstep Observability measures two aspects of change: size and continuity. A full bar indicates that a large, sustained change has happened. Smaller bars indicate either a smaller change or one that did not last for the full time period.
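
As an illustration only (this is not Lightstep’s actual algorithm), the relative ranking with no deployment markers can be sketched as comparing the mean of each half of the visible window:

```python
# Illustrative sketch only -- not Lightstep's actual algorithm. It ranks
# operations by the relative change in an SLI between the first and second
# half of the visible time period, as described above.

def relative_change(series):
    """Ratio of the mean of the second half to the mean of the first half."""
    mid = len(series) // 2
    baseline = sum(series[:mid]) / mid
    comparison = sum(series[mid:]) / (len(series) - mid)
    return comparison / baseline

# Latency samples (ms) for two hypothetical operations over the same window.
ops = {
    "checkout": [10, 10, 10, 500, 500, 500],           # 10ms -> 500ms: 50x
    "search":   [1000, 1000, 1000, 2000, 2000, 2000],  # 1s -> 2s: 2x
}
ranked = sorted(ops, key=lambda op: relative_change(ops[op]), reverse=True)
# "checkout" ranks above "search" despite a smaller absolute change.
```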

The yellow bar means that an SLI had an objectively large change, regardless of service or operation. Lightstep’s algorithm runs on each SLI independently. For example, when the bar displays for an operation’s latency, that means latency has changed – not that its change was greater compared to the other SLIs.

Select an operation’s sparkline charts to view larger charts for the latency, error rate, and operation rate. You use these larger charts to start your investigation.

Change the sparkline charts display

By default, the operations are sorted by the amount of detected change (largest to smallest). Use the dropdown to change the sort.

Also by default, only the latency percentile with the largest amount of change displays. You can change the sparkline charts to show more percentiles using the More ( ⋮ ) icon.

You can search for an operation using the Search field.

Change the reporting time period

By default, the data shown is from the last 60 minutes. You can customize that time period using the dropdown. Use the < > controls to move backwards and forwards through time. You can view data from your retention window (default is three days).

You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.

Monitor deployments

The Service Health view on the Deployments tab shows how deployments affect the performance of your services. When you add an attribute that reports your service’s version, a deployment marker displays at the time the deployment occurred, both in the Service Health view and in Change Intelligence.
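
For OpenTelemetry-instrumented services, one way to report a version is through the standard resource-attribute environment variables. This is a sketch with example names and values; the attribute Lightstep actually watches for deployment markers is configured for your project:

```shell
# Example values only. service.version is the conventional OpenTelemetry
# resource attribute for tagging telemetry with the deployed version.
export OTEL_SERVICE_NAME="checkout-service"
export OTEL_RESOURCE_ATTRIBUTES="service.version=2024-05-01.build-17"
```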

These markers let you quickly correlate a deployment with a possible regression.

When you have multiple versions in a time window, you can view the performance of each deployed version. For example, in this image of the Service Health view, multiple versions have been deployed. Hover over the chart to see the percentage of traffic in each version.

Learn more about how to use this view to monitor the health of your deployments.

Compare performance from different time ranges

When you spot a latency or error rate regression, you can start an investigation by clicking the corresponding time series chart during the regression. You can choose to compare performance from before the previous deploy, an hour ago, a day ago, or select a custom baseline.

Choose a time in the middle of the regression to avoid collecting data previous to the spike.

The chart opens in the RCA (root cause analysis) view.

Latency root cause analysis

This view provides the following tools to help you with root cause analysis for latency:

  • Latency Histogram and Correlated Attributes table: Compare attributes from the baseline and regression that may be contributing to latency.
  • Operation Diagram and table: Compare operations whose performance may be affecting this operation.
  • Logs Event Analysis: See log events from spans used to create the diagram and find potential latency sources.
  • Trace Analysis table: View and group span data from the baseline and regression to narrow down your analysis.

Learn how to use these tools here.

Error rate root cause analysis

Lightstep Observability offers these tools for analyzing spikes in error rate:

  • Operation Diagram and correlated operations and attributes: Immediately see which operations have errors (based on the red halo in the diagram). Use the Operations with Errors table to see the level of regression. Use the Attributes with Errors table to view attributes that seem to be correlated with the regression.
  • Log Events Analysis table: See logs from spans used to create the diagram and find potential error sources.
  • Trace Analysis table: View and group span data from the baseline and regression to narrow down your analysis.

Learn how to use these tools here.

Add a time series to a notebook

You can add any of the time series charts to a notebook for when, during an investigation, you want to be able to run ad hoc queries, take notes, and save your analysis for use in postmortems or runbooks. Notebooks allow you to view metric and trace data from different places in Lightstep Observability together, in one place.

To add to a notebook, click Add to notebook and search to choose an existing notebook or create a new notebook.

When you add to a notebook, a chart is created using the same query. You can see the latency for multiple percentiles and view exemplar traces. The annotation is a link back to the original, so you can quickly return to the origin of your investigation.

Learn more about notebooks.

See an overview of operations and their performance

The Operations tab on the Service Directory view shows the selected service’s operations currently reporting to Lightstep Observability in alphabetical order, along with performance metrics aggregated over the past hour.

The table provides several useful performance metrics for each operation:

  • Latency Change: Change in latency between now and the time period set using the Change Since dropdown.
  • Latency: How long the operation took to complete for a given percentile, set using the Percentile dropdown.
  • Error Change: The percentage change in error rate for the time period set using the Change Since dropdown.
  • Errors: The percentage of operations that contain an error.
  • Rate Change: The percentage change in operation rate for the time period set using the Change Since dropdown.
  • Rate: The number of times the operation occurred per second.
  • View stream: Add a chart to a notebook or dashboard using the query for the operation. Click the row to view the associated Stream.
  • Create stream: Create a Stream for the operation. Creating a Stream for an operation allows you to retain the data for the associated query for longer than your retention window.
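
The three Change columns are all read the same way: the current value compared against the value at the time selected in Change Since. As an arithmetic sketch only (not Lightstep’s exact computation):

```python
# Illustrative arithmetic only: how a "change" column relates two readings.
def pct_change(past, now):
    """Percentage change from a past reading to the current one."""
    return (now - past) / past * 100.0

# e.g. operation rate went from 100 to 125 calls/sec since "Change Since":
rate_change = pct_change(100.0, 125.0)   # 25.0, i.e. +25%
```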

Add an operation to a notebook or dashboard

You can add charts that show the operation’s performance to either a notebook or a dashboard. When you add an operation to a notebook or dashboard, three charts are created: one for latency, one for error rate, and one for operation rate.

Add an operation’s query to a notebook for when, during an investigation, you want to be able to run ad hoc queries, take notes, and save your analysis for use in postmortems or runbooks. Notebooks allow you to view metric and trace data from different places in Lightstep Observability together, in one place.

Add the query to a dashboard when you want to monitor the performance over a period of time.

To add an operation’s query to a notebook or dashboard:

  1. Click View stream for the operation and choose to add to a notebook or dashboard.

    If the button says Create stream, click that to create a Stream and then click View stream.

  2. Choose an existing notebook or dashboard, or create a new one.

Three charts are created (latency, error rate, and operation rate) using a query for the operation on the service. You can then edit the queries as needed and view exemplar traces.

Search and filter operations

  • Use the Search box to search for operations by name. As you type, Lightstep Observability filters the list of operations that match your entry.
  • View only ingress or egress operations by clicking the respective tab. Ingress operations handle requests coming into the service from outside (for example, an HTTP GET endpoint). Egress operations call out to external services.

To see if other services are affecting an operation, run a query for the operation in Explorer and then use the Service Diagram to view upstream and downstream services and their performance.
You can also run that query and use the Trace Analysis table to sort, filter, and group services, operations, and attributes to find upstream or downstream issues.

View streams for an operation or service

Streams are retained span queries that continuously collect latency, error rate, and operation rate data. By default, data from span queries is persisted for three days. When you save a query as a Stream, the Microsatellites automatically collect and persist the query’s data for a longer period of time.

To view all Streams for a service, click the Streams tab. The number on the tab tells you how many Streams exist for this service.

Create a Stream from the Operations tab by clicking Create Stream for an operation.

Add a Stream’s query to a notebook or dashboard

You can add charts that show the Stream’s performance to either a notebook or a dashboard. When you add a Stream, three charts are created: one for latency, one for error rate, and one for operation rate.

Add a Stream’s query to a notebook for when, during an investigation, you want to be able to run ad hoc queries, take notes, and save your analysis for use in postmortems or runbooks. Notebooks allow you to view metric and trace data from different places in Lightstep Observability together, in one place.

Add the query to a dashboard when you want to monitor the performance over a period of time.

To add a Stream’s query to a notebook or dashboard:

  1. Click View stream for the Stream and choose to add to a notebook or dashboard.

    If the button says Create stream, click that to create a Stream and then click View stream.

  2. Choose an existing notebook or dashboard, or create a new one.

Three charts are created (latency, error rate, and operation rate) using the Stream’s query. You can then edit the queries as needed and view exemplar traces.

View a service’s dashboards

Click the Dashboards tab to view dashboards that include charts or a Stream for this service. The number on the tab tells you how many dashboards exist for this service.

Only dashboards that have charts that contain a filter for the service are shown.

Click a dashboard to view it.

Read Create and manage dashboards to learn more.

View and improve your instrumentation quality

The data you can view and use in Lightstep Observability depends on the quality of your tracing instrumentation. The better and more comprehensive your instrumentation is, the better Lightstep Observability can collect and analyze your data to provide highly actionable information.

Lightstep Observability analyzes the instrumentation on your services and determines how you can improve it to make your Lightstep Observability experience even better. It can determine whether your instrumentation:

  • Crosses services to create full traces
  • Includes interior spans to help find the critical path
  • Contains attributes to help find correlated areas of latency. If there are attributes that you’d like all services to report to Lightstep Observability (like a customer ID or Kubernetes region), you can register the corresponding attributes and Lightstep Observability will check for those when determining the IQ score.
  • Uses attributes for deployments to help monitor regressions
  • Contains hostname attributes to help find performance issues in different environments.

Click the Instrumentation Quality tab to learn how well your instrumentation measures up. The number on the tab gives your score (out of 100%).

Learn more about what your score means and how to improve it.