Lightstep Observability has a number of tools that help you in all your observability flows, whether it’s continual monitoring, triaging an incident, root cause analysis, viewing overall service health, or managing your team’s observability practices.

Monitoring

Monitoring your resources and transactions is a key part of observability. At a glance, you need to know that the transactions through your system are performant and that the resources (services, virtual memory) those transactions consume are healthy. Unified dashboards allow you to view both your transactional performance (from trace data) and your resource health (usually from metric data) in one place. And after a deployment (even a partial deploy), you can use Lightstep Observability to ensure things stay on track.
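
Both kinds of data typically reach Lightstep Observability through OpenTelemetry. As a minimal sketch (the ingest endpoint and access-token header below are placeholders to replace with your own project’s settings), a Python service might emit a span and a metric like this:

```python
# Minimal sketch: emit a span (transactional data) and a metric (resource data)
# with the OpenTelemetry Python SDK and export both over OTLP.
# The endpoint and access-token header are placeholders for your own settings.
from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

ENDPOINT = "ingest.lightstep.com:443"            # placeholder ingest endpoint
HEADERS = {"lightstep-access-token": "<token>"}  # placeholder project token
resource = Resource.create({"service.name": "checkout-service"})

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=ENDPOINT, headers=HEADERS))
)
metrics.set_meter_provider(MeterProvider(
    resource=resource,
    metric_readers=[PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=ENDPOINT, headers=HEADERS))],
))

tracer = trace.get_tracer("checkout")
requests_counter = metrics.get_meter("checkout").create_counter("checkout.requests")

with tracer.start_as_current_span("place-order") as span:
    span.set_attribute("customer.tier", "gold")  # attribute usable in filters and group-bys
    requests_counter.add(1, {"region": "us-east-1"})
```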

Unified dashboards

Using the unified dashboard experience, you can monitor both metric and span data charts in one place.

You create charts for a dashboard using a query builder that works for both metric and span data.

Use filters and groupings to see just the data you want. For span data, exemplars are mapped in the chart, providing direct access to traces. A table below the chart provides a quick view into the data.

You can click into a chart and immediately start your investigation using Change Intelligence.

Using Terraform? You can use the Lightstep Terraform provider to create and manage your dashboards and charts. You can also use it to export existing dashboards into the Terraform format.

Set up alerts

You create alerts by setting thresholds on a query of your metric or span data (you can set both a warning and a critical threshold).
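
As a rough illustration of the warning/critical pattern (this is not Lightstep’s alerting engine or API; the names and values are made up), a threshold evaluation amounts to something like:

```python
# Conceptual sketch only: evaluate the latest value returned by a query against
# a warning threshold and a critical threshold. Names and values are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    query: str                # e.g. an error-rate query over metric or span data
    warning: Optional[float]  # warn when the value crosses this threshold
    critical: Optional[float] # page when the value crosses this threshold

def evaluate(alert: Alert, latest_value: float) -> str:
    """Return the state implied by the most recent data point."""
    if alert.critical is not None and latest_value >= alert.critical:
        return "critical"
    if alert.warning is not None and latest_value >= alert.warning:
        return "warning"
    return "ok"

# Example: warn at a 2% error rate, page at 5%.
state = evaluate(Alert(query="error rate on /checkout", warning=0.02, critical=0.05),
                 latest_value=0.031)
print(state)  # -> "warning"
```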

Notification destinations determine where the alert should be sent. Lightstep supports PagerDuty, Slack, BigPanda, and Lightstep Incident Response out of the box. You can use webhooks to integrate with other third-party destinations.
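
On the receiving side, a webhook destination is just an HTTP endpoint you control. A minimal sketch, assuming a hypothetical JSON payload with name and status fields (check the webhook documentation for the actual schema your destination receives):

```python
# Minimal sketch of a third-party webhook receiver for alert notifications.
# The payload fields read here (name, status) are hypothetical.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Forward, log, or open a ticket for the notification here.
        print(f"alert={payload.get('name')} status={payload.get('status')}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertWebhookHandler).serve_forever()
```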

Deployment health

After a deployment, you want to be sure that the deployed changes are healthy. Using the Service Health view, you can quickly get an overall picture of service and operational health. Lightstep Observability displays deployment markers for every deployed version of a service, so you can immediately notice changes to the health of the system post-deployment.

Investigate root causes

Lightstep Observability has a number of different ways to help you find the root cause of performance and error issues. It can correlate spikes in metric performance or errors with changes in span data that occurred at the same time to determine what caused the change in performance. Using span data, Lightstep Observability analyzes traces to determine service dependencies that may be causing latency or errors in services further up or down the stack.

Triage incidents using notebooks

When you begin an investigation, you often need to run a number of queries to reach a hypothesis about the origin of an issue. Notebooks allow you to query both your metric and span data in one place to reach that hypothesis and then share those findings with other team members.

Once you mitigate the issue, you can transfer your learnings from notebooks to begin deeper root cause analysis.

Find the cause of change

Lightstep’s Change Intelligence correlates metric and span data to help find the cause of metric deviations. It determines the service that emitted a metric, searches for performance changes on Key Operations from that service at the same time as the deviation, and then uses trace data to determine what caused the change.

You access Change Intelligence from any metric chart on a dashboard, notebook, or alert. It displays the operations that experienced a change in performance and lists them on the page, sorted by highest degree of change first. For each operation, it shows the most likely causes (the attributes that appeared in most traces with the performance change), along with the percentage of traces where the attribute appears, in both the baseline and deviation.
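
Conceptually, that attribute comparison resembles the sketch below: for each attribute value, compare the share of traces it appears in during the deviation window against the baseline window, and rank the values whose share grew the most. The data shapes here are illustrative, not Lightstep’s internal model:

```python
# Simplified sketch of the attribute comparison idea: how often does each
# attribute value appear in deviation-window traces versus baseline traces?
from collections import Counter
from typing import Dict, List, Tuple

Trace = Dict[str, str]  # attribute name -> value, flattened per trace

def appearance_counts(traces: List[Trace]) -> Counter:
    counts = Counter()
    for trace in traces:
        for item in set(trace.items()):
            counts[item] += 1
    return counts

def likely_causes(baseline: List[Trace], deviation: List[Trace]) -> List[Tuple[str, float, float]]:
    base_counts = appearance_counts(baseline)
    dev_counts = appearance_counts(deviation)
    rows = []
    for (key, value), dev_count in dev_counts.items():
        dev_pct = dev_count / len(deviation)
        base_pct = base_counts[(key, value)] / len(baseline)
        rows.append((f"{key}={value}", base_pct, dev_pct))
    # Attributes that became much more common during the deviation rank first.
    return sorted(rows, key=lambda r: r[2] - r[1], reverse=True)

baseline = [{"region": "us-east-1"}, {"region": "us-west-2"}]
deviation = [{"region": "us-east-1", "release": "v2"}, {"region": "us-east-1", "release": "v2"}]
for attr, base_pct, dev_pct in likely_causes(baseline, deviation):
    print(f"{attr}: baseline {base_pct:.0%}, deviation {dev_pct:.0%}")
```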

Learn more: Investigate a metric deviation

Investigate latency issues

You can use Lightstep Observability not only to monitor your services after a deploy, but also to compare performance over specific time periods and then dig into details to find differences that caused the latency changes.

From the Service Health view of the Service Directory, you can click into a latency spike on a chart. Lightstep analyzes the change by comparing span data from the spike to data from a stable time period.

Lightstep’s root cause analysis provides a number of tools to help you understand what changed:

  • Histogram: Compares the distribution of latency between the baseline and regression so you can immediately see the number of spans that are experiencing a regression (see the sketch after this list).
  • Attribute comparison: Compares the latency of spans with a specific attribute attached to them from the two time periods. You can quickly see a correlation of a particular attribute to the regression.
  • Operation diagram: Shows all operations in the request path of the operation you’re investigating. Yellow halos represent the amount of latency each operation is experiencing, so you can quickly see if one operation down the stack is causing the issue.
  • Log events: Compares the logs attached to the span data between the baseline and regression and surfaces any differences in frequency between the two.
  • Trace Analysis table: Shows detailed span information for both the baseline and regression.
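
As a rough sketch of the histogram comparison referenced above (illustrative only, not Lightstep’s implementation), bucketing span latencies from the two windows side by side makes the shift visible:

```python
# Illustrative only: bucket span latencies (ms) from a baseline window and a
# regression window into shared buckets so the two distributions can be compared.
from collections import Counter

def latency_histogram(latencies_ms, bucket_width_ms=50):
    """Map each latency to the lower bound of its bucket and count spans per bucket."""
    return Counter((ms // bucket_width_ms) * bucket_width_ms for ms in latencies_ms)

baseline = [42, 55, 61, 48, 95, 70]            # hypothetical span latencies
regression = [44, 180, 210, 225, 60, 240]

base_hist, reg_hist = latency_histogram(baseline), latency_histogram(regression)
for bucket in sorted(set(base_hist) | set(reg_hist)):
    print(f"{bucket:>4}-{bucket + 49} ms  baseline={base_hist[bucket]}  regression={reg_hist[bucket]}")
```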

Investigate errors

As with a latency change, Lightstep Observability offers similar tools to investigate a spike in the error rate by comparing span data from the regression to a baseline.

  • Operation diagram: Shows all operations in the request path of the operation you’re investigating. Red halos represent the error percentage rate for each operation in the path, so you can quickly see if one operation down the stack is causing the issue.
  • Attribute comparison: Compares the error rate of spans with a specific attribute attached to them from the two time periods. You can quickly see a correlation of a particular attribute to the error rate.
  • Log events: Compares the logs attached to the span data between the baseline and regression and surfaces any differences in error frequency between the two.
  • Trace Analysis table: Shows detailed span information for both the baseline and regression.

Analyze real-time span data and find correlations

With Explorer, you can query and view the last hour of span data to analyze current performance. Like the root cause analysis view, Explorer offers a number of tools that analyze the data:

  • Latency histogram: Shows the distribution of spans that match your query as latency buckets represented by the blue lines. Longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and towards the right, higher latency.
  • Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span.
  • Attribute correlations: Shows services, operations, and attributes that are correlated with the latency and errors. That is, they appear more often in spans that are experiencing these issues.

Additionally, Explorer provides the Service diagram that displays the relationship of the service in your query to all dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown.
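
Conceptually, those relationships can be derived from parent/child links in span data. A simplified sketch with made-up span fields:

```python
# Simplified sketch: derive service-to-service edges from spans, using each
# span's parent to find the calling service. Span fields here are illustrative.
from typing import Dict, List, Set, Tuple

spans = [  # hypothetical spans: id, parent id (None for the root), owning service
    {"id": "a", "parent": None, "service": "frontend"},
    {"id": "b", "parent": "a", "service": "checkout"},
    {"id": "c", "parent": "b", "service": "payments"},
    {"id": "d", "parent": "b", "service": "inventory"},
]

def service_edges(spans: List[Dict]) -> Set[Tuple[str, str]]:
    by_id = {s["id"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span["parent"])
        if parent and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))  # caller -> callee
    return edges

print(sorted(service_edges(spans)))
# [('checkout', 'inventory'), ('checkout', 'payments'), ('frontend', 'checkout')]
```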

View a full trace

From any of the tools you use in your investigation, you can click through to a full-stack trace of a request. A side panel provides details of each span in the trace, allowing you to view its attributes, logs, and other details. The Trace view illustrates the critical path for you (the time when an operation in a trace is actually doing something) so you can immediately see bottlenecks in the request. The Trace view is where you can prove out your hypotheses.
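
One simplified way to think about where time is actually spent is each span’s self time: its duration minus the time covered by its children. The sketch below illustrates that idea with made-up spans; it is not Lightstep’s critical-path algorithm and assumes child spans don’t overlap:

```python
# Simplified illustration of "where is time actually spent": each span's self
# time is its duration minus the time covered by its child spans.
# Span fields are made up; child spans are assumed not to overlap.
from typing import Dict, List

spans = [  # hypothetical trace: id, parent, start/end in ms
    {"id": "root", "parent": None, "op": "/checkout", "start": 0, "end": 300},
    {"id": "db", "parent": "root", "op": "query-cart", "start": 20, "end": 120},
    {"id": "pay", "parent": "root", "op": "charge-card", "start": 130, "end": 280},
]

def self_times(spans: List[Dict]) -> Dict[str, int]:
    children_time: Dict[str, int] = {}
    for span in spans:
        if span["parent"] is not None:
            children_time[span["parent"]] = (
                children_time.get(span["parent"], 0) + span["end"] - span["start"])
    return {s["op"]: (s["end"] - s["start"]) - children_time.get(s["id"], 0) for s in spans}

print(self_times(spans))  # {'/checkout': 50, 'query-cart': 100, 'charge-card': 150}
```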

View service health

The Service Directory view lets you see at a glance how services reporting to Lightstep Observability are performing, including changes to performance on a service’s operations.

You can also see how well a service is instrumented for tracing, and where you can make improvements (Service IQ).
