Lightstep Observability has a number of tools that help you in all your observability flows, whether it’s continual monitoring, triaging an incident, root cause analysis, viewing overall service health, or managing your team’s observability practices.

Monitoring

Monitoring your resources and transactions is a key part of observability. At a glance, you need to know whether the transactions moving through your system are performant and whether the resources those transactions consume (services, virtual memory) are healthy. Unified dashboards let you view both transactional performance (from trace data) and resource health (usually from metric data) in one place. After a deployment (even a partial one), you can use Lightstep Observability to confirm that things stay on track.

Unified dashboards

Using the unified dashboard experience, you can monitor both metric and span data charts in one place.

You create charts for a dashboard using a query builder that works for both metric and span data. Use filters and groupings to see just the data you want.

Instead of the builder, you can use the Unified Query Language (UQL) in the editor to build more fine-grained queries.

For span data, exemplars are mapped in the chart, providing direct access to traces. A table below the chart provides a quick view into the data.

You can click into a chart and immediately start your investigation using Change Intelligence.

Using Terraform? You can use the Lightstep Terraform provider to create and manage your dashboards and charts. You can also use it to export existing dashboards into the Terraform format.

Set up alerts

You create alerts by setting thresholds on a query to your metric or span data (you can set both a warning and critical threshold).
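
As an illustration of how a warning and a critical threshold divide query results into alert states, here is a minimal Python sketch. It is not Lightstep's implementation: the `classify` function, the "at or above" comparison, and the sample values are all assumptions made for the example.

```python
from typing import Optional


def classify(value: float, warning: Optional[float], critical: Optional[float]) -> str:
    """Return 'critical', 'warning', or 'ok' for an above-threshold alert.

    Assumption: a value at or above a threshold counts as a breach.
    """
    # Check the more severe threshold first so it wins when both are breached.
    if critical is not None and value >= critical:
        return "critical"
    if warning is not None and value >= warning:
        return "warning"
    return "ok"


# Hypothetical p95 latency readings in milliseconds.
readings = [120.0, 480.0, 910.0]
states = [classify(v, warning=400.0, critical=800.0) for v in readings]
print(states)  # ['ok', 'warning', 'critical']
```

Either threshold can be left unset (`None`) in this sketch, which mirrors the idea that the two thresholds are configured independently.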

Notification destinations determine where the alert should be sent. Lightstep supports PagerDuty, Slack, and BigPanda out-of-the-box. You can use webhooks to integrate with other third-party destinations.

Deployment health

After a deployment, you want to be sure that the deployed changes are healthy. Using the Service Health view, you can quickly get an overall view of service and operational health. Lightstep Observability displays deployment markers for every deployed version of a service so you can immediately notice changes to the health of the system post-deployment.

Investigate root causes

Lightstep Observability offers several ways to help you find the root cause of performance and error issues. It can correlate spikes in metric performance or errors with changes in span data that occurred at the same time to determine what caused the change in performance. Using span data, Lightstep Observability analyzes traces to determine service dependencies that may be causing latency or errors in services further up or down the stack.

Triage incidents using notebooks

When you begin an investigation, you often need to run a number of queries to reach a hypothesis about the origin of an issue. Notebooks allow you to query both your metric and span data in one place to reach that hypothesis and then share those findings with other team members.

Once you mitigate the issue, you can transfer your learnings from notebooks to begin deeper root cause analysis.

Find the cause of change

Lightstep’s Change Intelligence correlates metric and span data to help find the cause of metric deviations. It determines the service that emitted a metric, searches for performance changes on Key Operations from that service at the same time as the deviation, and then uses trace data to determine what caused the change.

You access Change Intelligence from any chart on a dashboard, notebook, or alert. A side panel displays attributes on spans that experienced a change in performance at the same time as a deviation on the chart. You can copy queries for these attributes and paste them into a notebook, where you can continue your investigation.

Learn more: Investigate a deviation

Investigate latency issues

You can use Lightstep Observability not only to monitor your services after a deploy, but also to compare performance over specific time periods and then dig into details to find differences that caused the latency changes.

From the Service Health view of the Service Directory, click a latency spike in a chart. Lightstep analyzes the change by comparing span data from the spike to data from a stable time period.

Lightstep’s root cause analysis provides a number of tools to help you understand what changed:

  • Histogram: Compares the distribution of latency between the baseline and regression so you can immediately see the number of spans that are experiencing a regression.
  • Attribute comparison: Compares the latency of spans with a specific attribute attached to them from the two time periods. You can quickly see a correlation of a particular attribute to the regression.
  • Operation diagram: Shows all operations in the request path of the operation you’re investigating. Yellow halos represent the amount of latency each operation is experiencing, so you can quickly see if one operation down the stack is causing the issue.
  • Log events: Compares the logs attached to the span data between the baseline and regression and surfaces any differences in frequency between the two.
  • Trace Analysis table: Shows detailed span information for both the baseline and regression.
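
To make the idea behind the attribute comparison concrete, the following Python sketch groups spans by an attribute value and compares mean latency between a baseline window and a regression window. It is only an approximation of the concept, not Lightstep's analysis: the span dictionaries, the `region` attribute, and the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_latency_by_attribute(spans, key):
    """Return {attribute value: mean latency in ms} for spans carrying `key`."""
    groups = defaultdict(list)
    for span in spans:
        value = span["attributes"].get(key)
        if value is not None:
            groups[value].append(span["latency_ms"])
    return {value: mean(latencies) for value, latencies in groups.items()}


def compare(baseline, regression, key):
    """Rank attribute values by how much their mean latency increased."""
    base = mean_latency_by_attribute(baseline, key)
    regr = mean_latency_by_attribute(regression, key)
    # Only values seen in both windows can be compared.
    deltas = {v: regr[v] - base[v] for v in base.keys() & regr.keys()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical spans from a stable period and from a latency spike.
baseline = [
    {"latency_ms": 100, "attributes": {"region": "us-east"}},
    {"latency_ms": 110, "attributes": {"region": "us-west"}},
    {"latency_ms": 120, "attributes": {"region": "us-east"}},
]
regression = [
    {"latency_ms": 400, "attributes": {"region": "us-east"}},
    {"latency_ms": 115, "attributes": {"region": "us-west"}},
    {"latency_ms": 420, "attributes": {"region": "us-east"}},
]

ranked = compare(baseline, regression, "region")
print(ranked)  # us-east shows the largest increase in mean latency
```

In this made-up data, spans tagged `region=us-east` slow down sharply while `us-west` barely changes, which is the kind of correlation the comparison is meant to surface.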

Investigate errors

As with latency changes, Lightstep Observability offers tools to investigate a spike in the error rate by comparing span data from the regression to a baseline.

  • Operation diagram: Shows all operations in the request path of the operation you’re investigating. Red halos represent the error percentage rate for each operation in the path, so you can quickly see if one operation down the stack is causing the issue.
  • Attribute comparison: Compares the error rate of spans with a specific attribute attached to them from the two time periods. You can quickly see a correlation of a particular attribute to the error rate.
  • Log events: Compares the logs attached to the span data between the baseline and regression and surfaces any differences in error frequency between the two.
  • Trace Analysis table: Shows detailed span information for both the baseline and regression.
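
The per-operation error percentage described above can be sketched in a few lines of Python. This is a simplified illustration under assumed data shapes, not the product's algorithm: the span dictionaries, operation names, and helper functions are hypothetical.

```python
from collections import Counter


def error_rate_by_operation(spans):
    """Return {operation: error rate in percent} for a list of span dicts."""
    totals, errors = Counter(), Counter()
    for span in spans:
        totals[span["operation"]] += 1
        if span["error"]:
            errors[span["operation"]] += 1
    return {op: 100.0 * errors[op] / totals[op] for op in totals}


def make_spans(operation, ok, failed):
    """Build hypothetical spans: `ok` successes followed by `failed` errors."""
    return ([{"operation": operation, "error": False}] * ok
            + [{"operation": operation, "error": True}] * failed)


baseline = make_spans("checkout", 2, 0) + make_spans("charge-card", 3, 1)
regression = make_spans("checkout", 2, 0) + make_spans("charge-card", 0, 4)

base_rates = error_rate_by_operation(baseline)
regr_rates = error_rate_by_operation(regression)

# Change in error rate, in percentage points, per operation.
change = {op: regr_rates[op] - base_rates.get(op, 0.0) for op in regr_rates}
print(change)  # {'checkout': 0.0, 'charge-card': 75.0}
```

Here the jump is isolated to the hypothetical `charge-card` operation, the sort of downstream culprit that a red halo in the operation diagram would point to.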

Analyze real-time span data and find correlations

With Explorer, you can query and view the last hour of span data to analyze current performance. Like the root cause analysis view, Explorer offers a number of tools that analyze the data:

  • Latency histogram: Shows the distribution of spans that match your query as latency buckets represented by the blue lines. Longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and towards the right, higher latency.
  • Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span.
  • Attribute correlations: Shows services, operations, and attributes that are correlated with the latency and errors. That is, they appear more often in spans that are experiencing these issues.
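
A latency histogram of the kind described above assigns each span to a latency bucket and counts the spans per bucket. The Python sketch below shows that bucketing with fixed boundaries. The boundaries, sample latencies, and function name are assumptions for illustration, and Explorer may choose its buckets differently.

```python
from bisect import bisect_left


def latency_histogram(latencies_ms, upper_bounds_ms):
    """Count latencies per bucket.

    Bucket i holds values <= upper_bounds_ms[i] (and greater than the
    previous bound); one extra overflow bucket holds everything larger.
    `upper_bounds_ms` must be sorted in ascending order.
    """
    counts = [0] * (len(upper_bounds_ms) + 1)
    for latency in latencies_ms:
        # bisect_left finds the first bound that is >= latency.
        counts[bisect_left(upper_bounds_ms, latency)] += 1
    return counts


# Hypothetical span latencies (ms) and bucket upper bounds (ms).
latencies = [5, 10, 42, 99, 150, 2500]
bounds = [10, 100, 1000]

counts = latency_histogram(latencies, bounds)
print(counts)  # [2, 2, 1, 1]
```

Reading the result left to right matches the chart: earlier buckets hold the faster spans, and the final overflow bucket holds the slowest ones.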

Additionally, Explorer provides the Service diagram that displays the relationship of the service in your query to all dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown.

View a full trace

From any tool you use in your investigation, you can click through to a full-stack trace of a request. A side panel provides details of each span in the trace, allowing you to view its attributes, logs, and other details. The Trace view illustrates the critical path for you (the time when an operation in a trace is actually doing something) so you can immediately see bottlenecks in the request. The Trace view is where you can prove out your hypotheses.
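
One simplified way to approximate "the time when an operation is actually doing something" is a span's self time: its duration minus the time covered by its child spans. The Python sketch below computes that for a small, made-up trace. It is a rough stand-in under stated assumptions, not the algorithm the Trace view uses, and the span fields (`id`, `parent`, `start`, `end`) are hypothetical.

```python
def merged_length(intervals):
    """Total length covered by possibly overlapping (start, end) intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous run (if any) and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the current run, so extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def self_time(span, children):
    """Duration of `span` not covered by any of its direct children."""
    # Clip each child to the parent's window before merging.
    clipped = [(max(c["start"], span["start"]), min(c["end"], span["end"]))
               for c in children]
    clipped = [(s, e) for s, e in clipped if e > s]
    return (span["end"] - span["start"]) - merged_length(clipped)


def self_times(trace):
    """Return {span name: self time} for every span in the trace."""
    children = {}
    for span in trace:
        children.setdefault(span["parent"], []).append(span)
    return {s["name"]: self_time(s, children.get(s["id"], [])) for s in trace}


# Hypothetical trace; times are milliseconds from the start of the request.
trace = [
    {"id": "a", "parent": None, "name": "GET /checkout", "start": 0, "end": 100},
    {"id": "b", "parent": "a", "name": "auth", "start": 10, "end": 30},
    {"id": "c", "parent": "a", "name": "charge-card", "start": 25, "end": 90},
    {"id": "d", "parent": "c", "name": "db-write", "start": 40, "end": 80},
]

result = self_times(trace)
print(result)
# {'GET /checkout': 20, 'auth': 20, 'charge-card': 25, 'db-write': 40}
```

In this example the hypothetical `db-write` span accounts for the most self time, so it is the first place to look for a bottleneck.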

View service health

The Service Directory view lets you see at a glance how services reporting to Lightstep Observability are performing, including changes to performance on a service’s operations.

You can also see how well a service is instrumented for tracing, and where you can make improvements.
