Lightstep Observability has a number of tools that help you in all your observability flows, whether it’s continual monitoring, triaging an incident, root cause analysis, viewing overall service health, or managing your team’s observability practices.
Monitoring
Monitoring your resources and transactions is a key part of observability. At a glance, you need to know whether the transactions through your system are performant and whether the resources (services, virtual memory) that those transactions consume are healthy. Unified dashboards allow you to view both your transactional performance (from trace data) and your resource health (usually from metric data) in one place. After a deployment (even a partial one), you can use Lightstep Observability to ensure things are staying on track.
Unified dashboards
Using the unified dashboard experience, you can monitor both metric and span data charts in one place.
You create charts for a dashboard using a query builder that works for both metric and span data. Use filters and groupings to see just the data you want.
Instead of the builder, you can use the Unified Query Language (UQL) in the editor to build more fine-grained queries.
For span data, exemplars are mapped in the chart, providing direct access to traces. A table below the chart provides a quick view into the data.
You can click into a chart and immediately start your investigation using Change Intelligence.
Using Terraform? You can use the Lightstep Terraform provider to create and manage your dashboards and charts. You can also use it to export existing dashboards into the Terraform format.
Read more:
- Create and manage unified dashboards
- Create and manage charts
- Learn about UQL
- Use Change Intelligence
Set up alerts
You create alerts by setting thresholds on a query against your metric or span data. You can set both a warning threshold and a critical threshold.
Notification destinations determine where the alert should be sent. Lightstep supports PagerDuty, Slack, and BigPanda out-of-the-box. You can use webhooks to integrate with other third-party destinations.
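As a rough illustration of how a warning and a critical threshold relate, the following Python sketch classifies a single query value. The function name `classify` and the "fire when the value is above the threshold" rule are assumptions made for this example; they are not Lightstep's API or its evaluation logic.

```python
from typing import Optional


def classify(value: float,
             warning: Optional[float],
             critical: Optional[float]) -> str:
    """Classify a query value against optional 'above' thresholds.

    Hypothetical helper for illustration only. The critical threshold is
    checked first so that it takes precedence over the warning threshold.
    """
    if critical is not None and value > critical:
        return "critical"
    if warning is not None and value > warning:
        return "warning"
    return "ok"


# Example: p99 latency in milliseconds, warning at 500 ms, critical at 1000 ms.
for latency_ms in (320.0, 640.0, 1500.0):
    print(latency_ms, classify(latency_ms, warning=500.0, critical=1000.0))
```

Checking the critical threshold first means a value that crosses both thresholds is reported once, at the higher severity.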
Deployment health
After a deployment, you want to be sure that the deployed changes are healthy. Using the Service Health view, you can quickly get an overall view of service and operational health. Lightstep Observability displays deployment markers for every deployed version of a service so you can immediately notice changes to the health of the system post-deployment.
Investigate root causes
Lightstep Observability offers several ways to help you find the root cause of performance and error issues. It can correlate spikes in metric performance or errors with changes in span data that occurred at the same time to determine what caused the change in performance. Using span data, Lightstep Observability analyzes traces to determine service dependencies that may be causing latency or errors in services further up or down the stack.
Triage incidents using notebooks
When you begin an investigation, you often need to run a number of queries to reach a hypothesis about the origin of an issue. Notebooks allow you to query both your metric and span data in one place to reach that hypothesis and then share those findings with other team members.
Once you mitigate the issue, you can transfer your learnings from notebooks to begin deeper root cause analysis.
Find the cause of change
Lightstep’s Change Intelligence correlates metric and span data to help find the cause of metric deviations. It determines the service that emitted a metric, searches for performance changes on Key Operations from that service at the same time as the deviation, and then uses trace data to determine what caused the change.
You access Change Intelligence from any chart on a dashboard, notebook, or alert. A side panel displays attributes on spans that experienced a change in performance at the same time as a deviation on the chart. You can copy queries for these attributes and paste them into a notebook, where you can continue your investigation.
Learn more: Investigate a deviation
Investigate latency issues
You can use Lightstep Observability not only to monitor your services after a deploy, but also to compare performance over specific time periods and then dig into details to find differences that caused the latency changes.
From the Service Health view of the Service Directory, you click a latency spike in a chart. Lightstep analyzes the change by comparing span data from the spike to data from a stable time period.
Lightstep’s root cause analysis provides a number of tools to help you understand what changed:
- Histogram: Compares the distribution of latency between the baseline and regression so you can immediately see the number of spans that are experiencing a regression.
- Attribute comparison: Compares the latency of spans with a specific attribute attached to them from the two time periods. You can quickly see a correlation of a particular attribute to the regression.
- Operation diagram: Shows all operations in the request path of the operation you’re investigating. Yellow halos represent the amount of latency each operation is experiencing, so you can quickly see if one operation down the stack is causing the issue.
- Log events: Compares the logs attached to the span data between the baseline and regression and surfaces any differences in frequency between the two.
- Trace Analysis table: Shows detailed span information for both the baseline and regression.
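To make the histogram comparison concrete, here is a minimal Python sketch that buckets span latencies from a baseline and a regression window and reports the share of spans in each bucket. The bucket edges, sample latencies, and function names are assumptions for illustration; this is not how Lightstep computes or renders its histogram.

```python
from bisect import bisect_right


def bucket_counts(latencies_ms, edges_ms):
    """Count latencies per bucket; edges_ms must be sorted ascending.

    With edges [100, 250, 500] the buckets are
    <100, [100, 250), [250, 500), and >=500.
    """
    counts = [0] * (len(edges_ms) + 1)
    for latency in latencies_ms:
        counts[bisect_right(edges_ms, latency)] += 1
    return counts


def compare_histograms(baseline_ms, regression_ms, edges_ms):
    """Return per-bucket (baseline_share, regression_share) pairs."""
    base = bucket_counts(baseline_ms, edges_ms)
    regr = bucket_counts(regression_ms, edges_ms)
    base_total = sum(base) or 1  # avoid dividing by zero on empty input
    regr_total = sum(regr) or 1
    return [(b / base_total, r / regr_total) for b, r in zip(base, regr)]


# Hypothetical span latencies (ms) from a stable period and from the spike.
edges = [100, 250, 500]
baseline = [40, 60, 80, 120]
regression = [50, 300, 450, 700]
for bucket, (b, r) in enumerate(compare_histograms(baseline, regression, edges)):
    print(f"bucket {bucket}: baseline {b:.0%}, regression {r:.0%}")
```

Comparing shares rather than raw counts keeps the two windows comparable even when they contain different numbers of spans.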
Investigate errors
As with latency changes, Lightstep Observability offers tools to investigate a spike in the error rate by comparing span data from the regression to a baseline.
- Operation diagram: Shows all operations in the request path of the operation you’re investigating. Red halos represent the error percentage rate for each operation in the path, so you can quickly see if one operation down the stack is causing the issue.
- Attribute comparison: Compares the error rate of spans with a specific attribute attached to them from the two time periods. You can quickly see a correlation of a particular attribute to the error rate.
- Log events: Compares the logs attached to the span data between the baseline and regression and surfaces any differences in error frequency between the two.
- Trace Analysis table: Shows detailed span information for both the baseline and regression.
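The attribute comparison for errors can be sketched as follows: compute the error rate per attribute value in each window, then look at the difference. The span shape (a dict with `attributes` and `error` fields) and the function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def error_rate_by_value(spans, key):
    """Map each value of attribute `key` to its error rate (0.0 to 1.0)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for span in spans:
        value = span["attributes"].get(key)
        if value is None:
            continue  # span does not carry this attribute
        totals[value] += 1
        if span["error"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


def compare_error_rates(baseline, regression, key):
    """Return the change in error rate per value (regression minus baseline)."""
    base = error_rate_by_value(baseline, key)
    regr = error_rate_by_value(regression, key)
    return {value: regr.get(value, 0.0) - base.get(value, 0.0)
            for value in set(base) | set(regr)}


# Hypothetical spans: errors appear only for region=eu-west in the regression.
baseline = [
    {"attributes": {"region": "us-east"}, "error": False},
    {"attributes": {"region": "us-east"}, "error": False},
    {"attributes": {"region": "eu-west"}, "error": False},
    {"attributes": {"region": "eu-west"}, "error": False},
]
regression = [
    {"attributes": {"region": "us-east"}, "error": False},
    {"attributes": {"region": "us-east"}, "error": False},
    {"attributes": {"region": "eu-west"}, "error": True},
    {"attributes": {"region": "eu-west"}, "error": False},
]
print(compare_error_rates(baseline, regression, "region"))
```

A value whose error rate rises sharply between the two windows, while others stay flat, is a candidate for closer inspection.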
Analyze real-time span data and find correlations
With Explorer, you can query and view the last hour of span data to analyze current performance. Like the root cause analysis view, Explorer offers a number of tools that analyze the data:
- Latency histogram: Shows the distribution of spans that match your query as latency buckets represented by the blue lines. Longer blue lines mean there are more spans in that latency bucket. Lines towards the left have lower latency and towards the right, higher latency.
- Trace Analysis table: Shows data for spans matching your query. By default, you can see the service, operation, latency, and start time for each span.
- Attribute correlations: Shows services, operations, and attributes that are correlated with the latency and errors. That is, they appear more often in spans that are experiencing these issues.
Additionally, Explorer provides the Service diagram that displays the relationship of the service in your query to all dependencies and possible code paths, both upstream and downstream. Areas of latency (yellow) and errors (red) are shown.
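The idea behind attribute correlations (attributes that appear more often in spans experiencing issues) can be approximated with a simple prevalence difference. The sketch below is an assumption-laden illustration: the span shape, the `is_problem` predicate, and the scoring are invented for this example and do not describe Explorer's actual algorithm.

```python
from collections import Counter


def correlate(spans, is_problem):
    """Rank (key, value) attribute pairs by how much more often they
    appear in problem spans than in all other spans."""
    problem, other = Counter(), Counter()
    n_problem = n_other = 0
    for span in spans:
        pairs = set(span["attributes"].items())
        if is_problem(span):
            n_problem += 1
            problem.update(pairs)
        else:
            n_other += 1
            other.update(pairs)
    scores = {}
    for pair in set(problem) | set(other):
        share_problem = problem[pair] / n_problem if n_problem else 0.0
        share_other = other[pair] / n_other if n_other else 0.0
        scores[pair] = share_problem - share_other
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical spans: slow requests all carry version=v2.
spans = [
    {"latency_ms": 900, "attributes": {"region": "eu-west", "version": "v2"}},
    {"latency_ms": 850, "attributes": {"region": "us-east", "version": "v2"}},
    {"latency_ms": 100, "attributes": {"region": "eu-west", "version": "v1"}},
    {"latency_ms": 120, "attributes": {"region": "us-east", "version": "v1"}},
]
ranked = correlate(spans, is_problem=lambda s: s["latency_ms"] > 500)
print(ranked[0])
```

A score near 1.0 means the attribute appears almost only in problem spans; a score near 0 means it is equally common in both groups.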
View a full trace
From any tool that you use in your investigation, you can click through to a full-stack trace of a request. A side panel provides details of each span in the trace, allowing you to view its attributes, logs, and other details. The Trace view illustrates the critical path for you (the time when an operation in a trace is actually doing something) so you can immediately see bottlenecks in the request. The Trace view is where you can prove out your hypotheses.
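One simplified way to think about "the time when an operation is actually doing something" is a span's self time: its duration minus the time covered by its child spans. The sketch below computes that value from `(start, end)` intervals. It is an approximation offered for intuition under that assumption, not Lightstep's critical path calculation, and the function name is hypothetical.

```python
def self_time(span, children):
    """Return the time in `span` not covered by any child interval.

    `span` and each child are (start_ms, end_ms) tuples. Overlapping
    children are counted once, and children are clipped to the span.
    """
    start, end = span
    covered = 0
    cursor = start  # everything before cursor is already accounted for
    for child_start, child_end in sorted(children):
        child_start = max(child_start, cursor)
        child_end = min(child_end, end)
        if child_end > child_start:
            covered += child_end - child_start
            cursor = child_end
    return (end - start) - covered


# A 100 ms span with two overlapping children and one later child.
print(self_time((0, 100), [(10, 40), (30, 60), (80, 90)]))
```

In this example the children cover 10 to 60 and 80 to 90, so the parent is doing its own work for the remaining 40 ms.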
View service health
The Service Directory view lets you see at a glance how services reporting to Lightstep Observability are performing, including changes to performance on a service’s operations.
You can also see how well a service is instrumented for tracing, and where you can make improvements.