Lightstep's Service Directory view lets you quickly see how all your services and their operations are performing. From here, you can:

  • See all your services in one place
    • Search for a service
    • Filter to view by platform
    • Mark “favorite” services
  • See performance over time and the top performance changes for your ingress operations
    • View service health after deployments
    • Compare latency, error, and infrastructure regression data against baseline data
    • View and sort latency, operation rate, and error rate and dive into root cause analysis
  • View all operations (ingress and egress) on a service
    • View an individual operation's current performance, its performance at a specific percentile, or its change in performance over a period of time
    • Filter by ingress or egress operations
  • View and create Streams and dashboards that include that service
  • View the instrumentation quality of that service

Read Quick Start: Instrumentation to learn how to add instrumentation to your service so it can report to Lightstep.

Access the Service Directory View

When you first open Lightstep, you’re taken to the Service Directory. You can also access it from the navigation bar.

Find Services

Your services are listed in alphabetical order. To make finding services easier, you can “favorite” a service so it always appears at the top of the list.

To find a service:

  • Use the Search box to search for services by name. As you type, Lightstep filters the list to show only the services that match your entry.
  • Filter by platform. All services show the language of the associated instrumentation. Use the platform dropdown at the top of the list to filter by a language.

To favorite a service:

  1. Find and select the service to favorite.
  2. Click the star next to the service’s name. The service now appears at the top of the list.

View Service Health, See Top Changes, and Compare Ingress Operation Performance

The Service Health view on the Deployments tab shows the latency, error rate, and operation rate of the ingress operations on the selected service (up to 50 operations are shown). You can quickly see the top performance changes and compare performance during regression spikes with normal performance.

View Top Changes in Performance

Lightstep displays sparkline charts for the ingress operations on a service. These charts show recent performance in latency, error rate, and operation rate. Shaded yellow bars to the left of each chart indicate the magnitude of the change. If a deployment marker is visible, the change is measured as the difference between performance before and after the deployment. If a specific version is selected, Lightstep measures the change between the selected version and all other versions.
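As a rough illustration of the before-and-after measurement described above, the following Python sketch compares the mean of a metric on either side of a deployment timestamp. The function name, the sample format, and the use of a simple mean are assumptions made for this example; Lightstep's actual calculation is not described here.

```python
from statistics import mean

def change_around_deploy(samples, deploy_ts):
    """Compare a metric before and after a deployment.

    samples is a list of (timestamp, value) pairs. Returns a tuple
    (before_mean, after_mean, relative_change), or None if there are no
    samples on one side of the deployment. relative_change is None when
    the 'before' mean is zero.
    """
    before = [value for ts, value in samples if ts < deploy_ts]
    after = [value for ts, value in samples if ts >= deploy_ts]
    if not before or not after:
        return None
    before_mean, after_mean = mean(before), mean(after)
    relative = (after_mean - before_mean) / before_mean if before_mean else None
    return before_mean, after_mean, relative

# Hypothetical latency samples in ms, one per minute; the deploy lands at minute 3.
latency = [(0, 100.0), (1, 100.0), (2, 100.0), (3, 150.0), (4, 150.0)]
result = change_around_deploy(latency, deploy_ts=3)
print(result)
```

In this made-up data, latency rises from 100 ms to 150 ms after the deploy, a relative change of 0.5 (50%).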

Operations are initially sorted by the largest change during the selected time period, which you can change.

To the right, larger charts show the latency, error rate, and operation rate of the selected operation. Use these larger charts to start your investigation.

Change the Sparkline Charts Display

By default, the operations are sorted by the amount of detected change (largest to smallest). Use the dropdown to change the sort.

Also by default, only the latency percentile with the largest amount of change is displayed. To show all percentiles in the sparkline charts, use the hamburger icon.

You can search for an ingress operation using the Search field.

Change the Reporting Time Period

By default, the data shown is from the last 4 hours. Use the dropdown to choose a different time period, up to one week. Use the < > controls to move backward and forward through time. You can view data from up to 10 days in the past.

You can also zoom in on a time period by clicking and dragging over the time period you want a closer look at. The charts redraw to report on just the period you selected.

Monitor Deployments

The Service Health view on the Deployments tab shows how deployments affect the performance of your services. When you implement a tag to display versions of your service, a deployment marker displays at the time the deployment occurred. In the example below, a deploy of the inventory service occurred at 10:45 am. Hover over the marker in the larger charts to view details. These markers allow you to quickly correlate a deployment with a possible regression.

When you have multiple versions in a time window, you can view the performance of each deployed version. For example, in this image, multiple versions have been deployed. Mouse over the chart to see the percentage of traffic in each version.
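To make the idea of per-version traffic concrete, here is a minimal Python sketch that computes the percentage of spans reporting each version. The span structure and the `service.version` tag name are assumptions for this example; use whatever version tag your instrumentation actually reports.

```python
from collections import Counter

def traffic_share_by_version(spans, version_tag="service.version"):
    """Return {version: percent of spans} for spans carrying a version tag.

    Spans without the tag are counted under "unknown".
    """
    counts = Counter(span["tags"].get(version_tag, "unknown") for span in spans)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {version: 100.0 * n / total for version, n in counts.items()}

# Hypothetical spans from an inventory service during a rollout.
spans = [
    {"operation": "update-inventory", "tags": {"service.version": "v1.0"}},
    {"operation": "update-inventory", "tags": {"service.version": "v1.0"}},
    {"operation": "update-inventory", "tags": {"service.version": "v1.0"}},
    {"operation": "update-inventory", "tags": {"service.version": "v1.1"}},
]
share = traffic_share_by_version(spans)
print(share)
```

Here three of the four spans come from v1.0, so it carries 75% of the traffic and v1.1 carries 25%.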

This feature requires a Satellite upgrade to the March 2020 release.

Learn more about how to use this view to monitor the health of your deployments.

Compare Performance from Different Time Ranges

When you spot a latency or error rate regression, you can start an investigation by clicking the corresponding time series chart during the regression. You can choose to compare performance from before the previous deploy, an hour ago, a day ago, or select a custom baseline.

Choose a time in the middle of the regression to avoid collecting data from before the spike.

The chart opens in the Compare view.

Latency Root Cause Analysis

This view provides the following tools to help you with root cause analysis for latency:

  • Latency Histogram and Correlated Tags table: Compare tags from the baseline and regression that may be contributing to latency. Click on a tag in the table to view a full trace.
  • Operation Diagram and table: Compare operations whose performance may be affecting this operation. Click on an operation in the table to view a full trace.
  • Metrics Reporting: Depending on the language of your tracer (available for Node/JS, Python, Java, and Go), you can see basic infrastructure metrics (CPU, memory, and so on).
  • Trace Analysis table: Filter and group span data from the baseline and regression to narrow down your analysis.

Error Rate Root Cause Analysis

Lightstep offers these tools for analyzing spikes in error rate:

  • Operation Diagram and correlated operations and tags: Immediately see which operations have errors, indicated by red halos in the diagram that are dynamically sized based on the number of errors. Use the Operations with Errors table to see the level of regression. Use the Tags with Errors table to view tags that seem to be correlated with the regression.
  • Logs Analysis table: See logs from spans used to create the diagram to see potential error sources.
  • Trace Analysis table: Filter and group span data from the baseline and regression to narrow down your analysis.

Learn more about using the Comparison view to investigate latency regressions or error rate increases.

See an Overview of Operations and Their Performance

The Operations tab on the Service Directory view lists, in alphabetical order, the selected service's operations that are currently reporting to Lightstep. It also shows performance metrics aggregated over a 10-minute window that runs from 15 minutes ago to 5 minutes ago.
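The aggregation window can be sketched in a few lines of Python. The function name is hypothetical, and the sketch only illustrates the arithmetic of the window, not how Lightstep computes it internally.

```python
from datetime import datetime, timedelta, timezone

def aggregation_window(now):
    """Return (start, end) for the window from 15 minutes ago to 5 minutes ago."""
    return now - timedelta(minutes=15), now - timedelta(minutes=5)

# Fixed example time so the result is easy to check by hand.
now = datetime(2020, 3, 1, 12, 0, tzinfo=timezone.utc)
start, end = aggregation_window(now)
print(start.isoformat(), end.isoformat(), end - start)
```

For a current time of 12:00 UTC, the window runs from 11:45 to 11:55, which is 10 minutes in total.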

The table provides several useful performance metrics for each operation:

  • Latency Change: Change in latency between now and the time period set using the Change Since dropdown.
  • Latency: How long the operation took to complete for a given percentile, set using the Percentile dropdown.
  • Error Change: The percentage change in error rate for the time period set using the Change Since dropdown.
  • Errors: The percentage of operations that contain an error.
  • Rate Change: The percentage change in rate for the time period set using the Change Since dropdown.
  • Rate: The number of times the operation occurred per second.
  • Create/View Stream: View or create a Stream for this operation.
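To show how the columns above relate to raw span data, the following Python sketch derives a latency percentile, an error percentage, a per-second rate, and a percentage change. The span fields (`duration_ms`, `error`), the function names, and the nearest-rank percentile method are assumptions chosen for illustration, not Lightstep's documented implementation.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(spans, window_seconds, pct=95):
    """Summarize one operation's spans over a window of window_seconds."""
    latencies = [span["duration_ms"] for span in spans]
    errors = sum(1 for span in spans if span["error"])
    return {
        "latency_ms": percentile(latencies, pct),   # Latency at the chosen percentile
        "error_pct": 100.0 * errors / len(spans),   # Errors
        "rate_per_s": len(spans) / window_seconds,  # Rate
    }

def pct_change(current, previous):
    """Percentage change from previous to current, or None if previous is 0."""
    if previous == 0:
        return None
    return 100.0 * (current - previous) / previous

# Hypothetical spans for one operation, observed over a 2-second window.
spans = [
    {"duration_ms": 10, "error": False},
    {"duration_ms": 20, "error": False},
    {"duration_ms": 30, "error": True},
    {"duration_ms": 40, "error": False},
]
summary = summarize(spans, window_seconds=2, pct=50)
latency_change = pct_change(summary["latency_ms"], previous=16)
print(summary, latency_change)
```

The same `pct_change` helper applies to the Latency Change, Error Change, and Rate Change columns. Only the metric being compared differs.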

Search and Filter Operations

  • Use the Search box to search for operations by name. As you type, Lightstep filters the list to show only the operations that match your entry.
  • View only ingress or egress operations by clicking the respective tab. Ingress operations are the first operations on a service to handle requests from outside that service (e.g., an API HTTP GET). Egress operations are those that call out to external services.
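One way to picture the ingress and egress definitions above is to look at each span's parent and children in a trace. The Python sketch below is an interpretation of those definitions, assuming a hypothetical span format with `id`, `parent_id`, and `service` fields. It is not a description of how Lightstep classifies operations.

```python
def classify(span, spans_by_id):
    """Return the set of labels ("ingress", "egress") that apply to a span."""
    labels = set()
    # Ingress: the request arrives from outside this service (no parent,
    # or the parent span belongs to a different service).
    parent = spans_by_id.get(span["parent_id"])
    if parent is None or parent["service"] != span["service"]:
        labels.add("ingress")
    # Egress: at least one child span runs on a different service.
    children = [s for s in spans_by_id.values() if s["parent_id"] == span["id"]]
    if any(child["service"] != span["service"] for child in children):
        labels.add("egress")
    return labels

# Hypothetical trace: web -> inventory (two internal hops) -> db.
trace = [
    {"id": 1, "parent_id": None, "service": "web", "operation": "/checkout"},
    {"id": 2, "parent_id": 1, "service": "inventory", "operation": "update-inventory"},
    {"id": 3, "parent_id": 2, "service": "inventory", "operation": "read-stock"},
    {"id": 4, "parent_id": 3, "service": "db", "operation": "query"},
]
spans_by_id = {span["id"]: span for span in trace}
ingress_labels = classify(spans_by_id[2], spans_by_id)
egress_labels = classify(spans_by_id[3], spans_by_id)
print(ingress_labels, egress_labels)
```

In this example, `update-inventory` is the inventory service's ingress operation because its parent runs on the web service, and `read-stock` is an egress operation because it calls out to the db service.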

To see if other services are affecting an operation, run a query for the operation in Explorer and then use the Service Diagram to view upstream and downstream services and their performance.
You can also run that query and use the Trace Analyzer table to sort, filter, and group service, operations, and tags to find upstream or downstream issues.

Create and View Streams for an Operation or Service

Streams allow you to passively monitor performance over time. Once you create a Stream, the Satellites automatically collect and persist comprehensive span data about that operation without requiring a request from the UI for the data. Lightstep persists both statistical time series data and example traces, making Streams especially useful for tracking SLAs and alerting.

Create and View an Operation’s Stream

You can create and view Streams for an operation from the Operations tab. If a Stream exists for an operation, a View Stream button displays for that operation. Click the button to view the Stream.

If a Stream doesn’t exist yet, a Create Stream button displays. Click that button to create a Stream. The Stream is given the same name as the operation.

Best Practice: Create Streams for your ingress operations
Usually, ingress operations for a service are high level enough that they indicate performance problems within a single service, and granular enough that finding the root cause of a performance problem is straightforward.

Create and View Streams for a Service

You can make Streams not just for an operation, but for any combination of service, operation, and tag (anything you can query for in Explorer).

To view all Streams for a service, click the Streams tab. The number on the tab tells you how many Streams exist for this service.

Click a Stream to view it.

Create a Stream for this service by clicking Create Stream. You can add operations or tags to the query.

Read Monitor a Service Level Indicator with Streams to learn more about Streams.

View Dashboards

Dashboards are collections of Streams shown in one place. Click the Dashboards tab to view dashboards that include a Stream for this service. The number on the tab tells you how many dashboards exist for this service.

Click a dashboard to view it.

Read Create Dashboards from Streams to learn more.

View and Improve Your Instrumentation Quality

The data you can view and use in Lightstep depends on the quality of your tracing instrumentation. The better and more comprehensive your instrumentation is, the better Lightstep can collect and analyze your data to provide highly actionable information.

Lightstep analyzes the instrumentation on your services and determines how you can improve it to make your Lightstep experience even better. It can determine whether your instrumentation:

  • Crosses services to create full traces
  • Includes interior spans to help find the critical path
  • Contains custom tags to help find correlated areas of latency
  • Uses tags for deployments to help monitor regressions
  • Contains hostname tags to help find performance issues in different environments

Click the Instrumentation Quality tab to learn how well your instrumentation measures up. The number on the tab gives your score as a percentage (out of 100%).

Learn more about what your score means and how to fix it.