Here are some common use cases that DevOps teams face on a regular basis. They show how LightStep can help you solve issues, proactively keep them from happening, and become part of your day-to-day workflow.

Reduce MTTR/MTTU Through Differentiated Trace Analytics

Scenario:

You get an alert that a high-value customer is experiencing increased end-user latency with your Web application. You need to find the issue quickly.

Creating custom tags when you are instrumenting your code helps you collect the information that’s important to you. The example in this flow uses a custom customer_id tag to keep track of performance by customer.
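
For reference, here's a minimal sketch (using the OpenTracing Python API) of how a custom customer_id tag might be set at instrumentation time. It assumes a concrete tracer, such as the LightStep tracer, is already registered as the global tracer; handle_checkout and its request argument are hypothetical.

```python
import opentracing

def handle_checkout(request):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span('checkout') as scope:
        # Custom tag: lets you query and filter traces by customer in Explorer.
        scope.span.set_tag('customer_id', request.headers.get('X-Customer-ID'))
        # ... handle the rest of the request ...
```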

Steps:

  1. In Explorer, query for the customer and your web app’s service to find traces that are affecting both your customer and the web app. Click the “1 hour prior” historical layer to see if latency has changed.
    In this example, more spans are experiencing latency than they were an hour ago.
  2. Use the latency filter to further refine your search to just those spans with high latency.
  3. Look at the correlation calculations to immediately see that the service api-server and the operation user-space-mapping are the likely culprits.
  4. Click on Service Diagram to see a hierarchical view of other services that may be impacted. This lets you quickly assess whether you need to involve other teams.

    In this example, you can see that there’s an error on the auth-service, so you should notify that team.

  5. Select the node for the service that was correlated with latency in the previous step, and then, on the left, scroll down to find the correlated operation.
  6. Click on any span to view contextual trace examples and accelerate root cause analysis.

See Error Frequency and Latency for a Single Service

Scenario:

You need to investigate the health of all downstream services from a service that may have issues. In this example, we have reports of the api-server service having issues, but can’t pinpoint the problem.

To find issues more easily, use tags from the OpenTracing library and create your own. The example in this flow uses error tags to help not only identify spans with errors but also to group spans by error type.
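
As a rough sketch of what that instrumentation might look like (OpenTracing Python API, global tracer assumed), the error and exception.type tags could be set as below; do_auth_check is a hypothetical call that fails for illustration.

```python
import opentracing
from opentracing.ext import tags

def do_auth_check():
    # Hypothetical downstream call that fails, for illustration only.
    raise TimeoutError("token service did not respond")

def validate_token():
    tracer = opentracing.global_tracer()
    with tracer.start_active_span('auth/validate-token') as scope:
        try:
            do_auth_check()
        except Exception as exc:
            # error=true marks the span as failed; exception.type lets you
            # group errored spans by error type in Trace Analysis.
            scope.span.set_tag(tags.ERROR, True)
            scope.span.set_tag('exception.type', type(exc).__name__)
            scope.span.log_kv({'event': 'error', 'message': str(exc)})
            raise
```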

Steps:

  1. In Explorer, query the service with reported issues. In this example, query for the service api-server and click Run.
  2. Click the Service Diagram tab to view downstream service health.

    In this example, the auth-service is red, showing that errors are being reported downstream from the api-server service. You may find that downstream services have errors or are highly latent. Take note of those service names.

  3. Use the Trace Analysis tab and click Show all Spans in Traces to view all the spans flowing through the queried service. The original query only returned operations on the queried service. By showing all spans, you can see every span that takes part in the traces that the queried service participates in.
  4. Filter by the service you noted had issues in the Service Diagram (downstream from the original queried service), and also filter by error=true. Now you see only spans that originate in the suspect service and have errors.
  5. Add a column for a tag that you can group the results by to narrow down the issue. You're looking for a tag that appears on most spans, without having to open individual traces. In this example, you might add a column for exception.type.
  6. Group by the tag for the column you added. Grouping allows you to see the breakdown of latency averages and frequency (span count) for each value of that tag. In this example, TimeoutException is more frequent than RuntimeException and is causing a higher magnitude of errors.

See Frequency and Latency by Status Code for an Operation

Scenario:

You want to see if particular HTTP status codes are contributing to latency for an operation on a service.

To find issues more easily, use tags from the OpenTracing library when you are instrumenting your code. The example in this flow uses the http.status_code tag to help group spans.
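
A minimal sketch of that instrumentation (OpenTracing Python API, global tracer assumed); handle_charge and process_charge are hypothetical.

```python
import opentracing
from opentracing.ext import tags

def process_charge(request):
    # Hypothetical business logic that returns an HTTP status code.
    return 200

def handle_charge(request):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span('api/v1/charge') as scope:
        status_code = process_charge(request)
        # Standard OpenTracing tag; lets you group spans by status code.
        scope.span.set_tag(tags.HTTP_STATUS_CODE, status_code)
        return status_code
```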

Steps:

  1. In Explorer, query the relevant service and operation. For example, the api/v1/charge operation on the api-server service shows high levels of latency for p50 and above.
  2. Group by the http.status_code tag. You can do this globally, or scope it to a particular tag (such as tenant-id or region) by adding that tag to the query so you view just a section of your data.

View Downstream Service Performance to Find a Single Ingress Operation

Scenario:

There was a recent deploy and you suspect that work done downstream from the ios-client might be causing some of the latency you are seeing.

To find issues more easily, use tags from the OpenTracing library when you are instrumenting your code. The example in this flow uses the ingress.operation tag to help find operations at the edge of a service.
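
A minimal sketch of tagging an edge operation this way (OpenTracing Python API, global tracer assumed); handle_api_request and its arguments are hypothetical.

```python
import opentracing

def handle_api_request(operation_name, request):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span(operation_name) as scope:
        # Custom tag marking this span as an edge (ingress) operation, so you
        # can later filter to high-level operations with ingress.operation: true.
        scope.span.set_tag('ingress.operation', True)
        # ... dispatch to internal handlers ...
```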

Steps:

  1. In Explorer, query the relevant service and operation and then view the Service Diagram to see which downstream services have high latency contributions. In this example, you query the ios-client service.

    The api-server has significant latency contribution, based on the size of the yellow halo.

  2. Click the Trace Analysis tab and click Show all Spans in Traces to see all spans that took part in the same trace as the service you initially queried (in this case, the ios-client service).
  3. Filter by the service you notice latency on (in this case, the api-server service) and by the tag ingress.operation: true to filter on high-level operations.
  4. Group by operation so you can see the performance (error rate and average latencies) of the operations within api-server that are downstream to ios-client.

Monitor Performance Across a Transaction Using a Dashboard

Scenario:

You have an important customer-facing transaction that you want to proactively monitor. The transaction consists of several different operations on multiple services.

Steps:

  1. In Explorer, query for the highest-level (most user-facing) ingress operation.
  2. Click Show all Spans in Traces and then group by operation to quickly see all the upstream and downstream operations taking part in the same traces as the operation you queried on.
  3. Click individual operations to view their traces.
  4. Find and select an interesting span (one that you want to monitor) and click the clock icon next to the operation to quickly query the operation in Explorer.

    This graphic shows the Trace view for spans that participate in the same trace as the api-request/charge operation that we’re interested in. The dashboard likely needs the operations downstream from api-request/charge.
  5. In Explorer, click Create Stream. Repeat the previous two steps for each operation you want to monitor.
  6. Add all the streams you created to a dashboard and name it after this transaction.

Optimize Code from Client to the Backend

Scenario:

You want to find places to improve performance.

Collecting logs in your tracing instrumentation provides deeper context and helps you see everything that's going on.
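
A minimal sketch of attaching structured logs to a span (OpenTracing Python API, global tracer assumed); fetch_profile and load_from_database are hypothetical.

```python
import opentracing

def load_from_database(user_id):
    # Hypothetical slow fallback path, for illustration only.
    return {"id": user_id}

def fetch_profile(user_id):
    tracer = opentracing.global_tracer()
    with tracer.start_active_span('fetch-profile') as scope:
        # Key-value logs attach contextual payloads to the span that show up
        # in the Trace view alongside its timing.
        scope.span.log_kv({'event': 'cache_miss', 'user_id': user_id})
        profile = load_from_database(user_id)
        scope.span.log_kv({'event': 'db_load_complete', 'user_id': user_id})
        return profile
```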

Steps:

  1. Open a trace view for any trace showing high latency.

    The most common ways to access a trace are by clicking a span in Explorer or by clicking a dot on the scatterplot in a Stream.

  2. In the Trace view, scroll down to find concurrent calls to the same service. You can see if the calls are run serially or in parallel. You may be able to improve performance here (see the sketch after these steps).

  3. You can also view log payloads to see more information about the span.
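
When the Trace view shows sibling calls that don't depend on each other running one after another, a common optimization is to issue them concurrently. Here's a minimal sketch of that change, with a hypothetical blocking client call fetch_item.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_item(item_id):
    # Hypothetical blocking call to the downstream service.
    ...

def fetch_all_serial(item_ids):
    # Each call waits for the previous one; total latency is the sum.
    return [fetch_item(i) for i in item_ids]

def fetch_all_parallel(item_ids):
    # Independent calls overlap; total latency approaches the slowest call.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch_item, item_ids))
```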