These common use cases show how Lightstep can help DevOps teams solve issues, proactively prevent them, and become part of your day-to-day workflow.
Reduce MTTR/MTTU Through Differentiated Trace Analytics
You get an alert that a high-value customer is experiencing increased end-user latency with your Web application. You need to find the issue quickly.
Creating custom attributes when you instrument your code helps you collect the information that's important to you. The example in this flow uses a custom `customer` attribute to track performance by customer.
- In Explorer, query for the customer and your web app's service to find traces that affect both.
- Click the "1 hour prior" historical layer to see if latency has changed. In this example, more spans are experiencing latency than they were an hour ago.
- Use the latency filter to further refine your search to just those spans with high latency.
- Look at the correlation calculations to immediately see that the service `inventory` and the operation `write-cache` are the likely culprits.
- In the Correlations panel, click the filter for `inventory:write-cache` so that the spans in the table come only from that operation.
- Sort the table by Duration in descending order so that the longer spans are at the top.
- Click any span to view its full trace, and use the contextual information on the right to determine the root cause.
See Error Frequency and Latency for a Single Service
You need to investigate the health of all downstream services from a service that may have issues. In this example, we have reports of the `api-server` service having issues, but can't pinpoint the problem.
To find issues more easily, use attributes from the OpenTelemetry library and create your own. The example in this flow uses error attributes to help not only identify spans with errors but also to group spans by error type.
- In Explorer, query the service with reported issues. In this example, query for the `api-server` service and click Run.
- Click the Service Diagram tab to view downstream service health. In this example, the `auth-service` is red, showing that errors are reporting downstream from the `api-server` service. You may find that downstream services have errors or are highly latent. Take note of those service names.
- Use the Trace Analysis tab and click Show all Spans in Traces to view all the spans flowing through the queried service. The original query returned only operations on the queried service; showing all spans reveals every span in the traces that the queried service participates in.
- Filter by the service you noted had issues in the Service Diagram (downstream from the original queried service), and also filter by `error=true`. Now you see only spans that originate in the suspect service and have errors.
- Add a column for an attribute that you can group the results by to narrow down the issue. You're looking for an attribute that appears on most spans, so you don't have to open individual traces. In this example, you might add a column for the attribute that records the error type.
- Group by the attribute for the column you added. Grouping lets you see the breakdown of latency averages and frequency (span count) for each value of that attribute. In this example, `TimeoutException` occurs more frequently than `RuntimeException` and is causing a higher magnitude of errors.
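Conceptually, this group-by computes a span count and average duration per attribute value. A plain-Python sketch of that aggregation, using made-up span data (the durations and error types below are invented for illustration):

```python
from collections import defaultdict

def group_spans(spans, key):
    """Group span dicts by an attribute value, reporting the span
    count and average duration for each value — roughly what the
    Explorer group-by displays."""
    groups = defaultdict(list)
    for span in spans:
        groups[span.get(key, "unset")].append(span["duration_ms"])
    return {
        value: {"count": len(durations),
                "avg_ms": sum(durations) / len(durations)}
        for value, durations in groups.items()
    }

spans = [
    {"error.type": "TimeoutException", "duration_ms": 900},
    {"error.type": "TimeoutException", "duration_ms": 1100},
    {"error.type": "TimeoutException", "duration_ms": 1000},
    {"error.type": "RuntimeException", "duration_ms": 200},
]
summary = group_spans(spans, "error.type")
```

Here `TimeoutException` dominates both in frequency (3 spans vs. 1) and average latency, which is the kind of signal the grouping surfaces.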
See Frequency and Latency by Status Code for an Operation
You want to see if particular HTTP status codes are contributing to latency for an operation on a service.
To find issues more easily, use attributes from the OpenTracing library when you instrument your code. The example in this flow uses the `http.status_code` attribute to help group spans.
- In Explorer, query the relevant service and operation. For example, the `krakend-api-gateway` service shows high levels of latency for p50 and above.
- Group by the `http.status_code` attribute. You can do this globally, or scoped to a particular attribute (such as `region`) by adding that attribute to the query to view across a section of data.
View Downstream Service Performance from a Single Ingress Operation
There was a recent deploy, and you suspect that work done downstream from the `ios-client` might be causing some of the latency you're seeing.
To find issues more easily, use attributes from the OpenTelemetry library when you instrument your code. The example in this flow uses the `ingress.operation` attribute to help find operations at the edge of a service.
- In Explorer, query the relevant service and operation, and then view the Service Diagram to see which downstream services have high latency contributions. In this example, you query the `ios-client`; in the diagram, the `api-server` service has a significant latency contribution, based on the size of the yellow halo.
- Click the Trace Analysis tab and click Show all spans to see all spans that took part in the same trace as the service you initially queried on (in this case, the `ios-client`).
- Filter by the service you noticed latency on (in this case, the `api-server` service) and by the attribute `ingress.operation: true` to filter on high-level operations.
- Group by operation so you can see the performance (error rate and average latencies) of the operations within `api-server` that are downstream from the `ios-client`.
Monitor Performance Across a Transaction Using a Dashboard
You have an important customer-facing transaction that you want to proactively monitor. The transaction consists of several different operations on multiple services.
- In Explorer, query for the highest-level (most user-facing) ingress operation.
- Click Show all Spans in Traces and then group by `operation` to quickly see all the upstream and downstream operations taking part in the same traces as the operation you queried on.
- Click individual operations to view their traces.
- Find and select an interesting span (one that you want to monitor) and click the operation name to quickly query the operation in Explorer.
This graphic shows the Trace view for spans that participate in the same trace as the `authorize-user` operation that we're interested in. The dashboard likely needs the operations downstream from there.
- In Explorer, click Create Stream to create a stream for each operation you want to monitor.
- Add all the streams you created to a dashboard and name it after this transaction.
Optimize Code from Client to the Backend
You want to find places to improve performance.
Collecting logs in your tracing instrumentation provides a deeper context to help you see all that’s going on.
- Open the Trace view for any trace showing high latency. The most common ways to access a trace are by clicking a span in Explorer or a dot on the scatterplot in a Stream.
- In the Trace view, scroll down to find concurrent calls to the same service. You can see if the calls are run serially or in parallel. You may be able to improve performance here.
- You can also view log payloads to see more information about the span.
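The serial-vs-parallel improvement the Trace view can reveal looks like this in code. A generic `asyncio` sketch, not Lightstep-specific (the service names and delays are made up):

```python
import asyncio
import time

async def call_service(name: str, delay: float) -> str:
    # Stand-in for a downstream RPC taking `delay` seconds.
    await asyncio.sleep(delay)
    return name

async def serial() -> list:
    # Calls run one after another: total time is the sum of delays.
    return [await call_service("inventory", 0.05),
            await call_service("pricing", 0.05)]

async def parallel() -> list:
    # Independent calls run concurrently: total time is the max delay.
    return await asyncio.gather(call_service("inventory", 0.05),
                                call_service("pricing", 0.05))

start = time.perf_counter()
asyncio.run(serial())
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
results = asyncio.run(parallel())
parallel_seconds = time.perf_counter() - start
```

In a trace, the serial version shows the two child spans laid end to end, while the parallel version shows them stacked over the same time window — the visual cue this step asks you to look for.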