Now that you’ve improved the incident response features in Lightstep Observability, let’s see them in action, the way your on-call team will.

  1. At 4:11 pm, an alert is posted by Lightstep Observability to the #on-call channel in Slack, saying the Error Rate on the android service is over 5%.

  2. Mary on the team sees it and clicks one of the links in the message to open a trace with an example of the error.

  3. Looking at the expanded trace, she can see that the 429 errors seem to originate from the get-store-data operation on the store-server service.

  4. Mary clicks the Workflow Links tab and uses the Playbook link to open the wiki page with instructions for remediating the error.

  5. One of the Playbook steps tells her to determine whether a recent deploy may have caused the regression. Mary visits the Deployments view in Lightstep Observability for the store-server service. Sure enough, there was a deploy around 4 pm (v.10.8.249), and it has a much higher error rate than the version before it.

  6. She goes back to the Workflow Links and uses the Slack link to start a message to the store-server service owners, letting them know that their deploy may be causing errors. She pastes a link to the Trace view into the message, and because her team has integrated Slack into Lightstep Observability, the message shows details from the trace. The team can jump right into Lightstep Observability to verify the issue.

  7. The store-server team immediately rolls back the deploy. Once the rollback’s complete, the errors stop, and the on-call team gets a Slack alert that the errors are resolved.

  8. Mary verifies this by going back to the Deployments view. It looks like the error rate is dropping back to normal as the traffic switches to the new version.
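The alert in step 1 and the resolution notice in step 7 both come down to comparing an error rate against a threshold. The sketch below illustrates that idea only; it is not how Lightstep Observability evaluates alerts. The `Window` type, the function names, and the "firing"/"resolved" labels are hypothetical, and the 5% threshold is taken from step 1.

```python
from dataclasses import dataclass

# 5% threshold from step 1 of the walkthrough.
ERROR_RATE_THRESHOLD = 0.05


@dataclass
class Window:
    """Request counts for one evaluation window (hypothetical structure)."""
    total_requests: int
    error_requests: int


def error_rate(window: Window) -> float:
    """Fraction of requests in the window that ended in an error."""
    if window.total_requests == 0:
        # No traffic means nothing failed; avoid dividing by zero.
        return 0.0
    return window.error_requests / window.total_requests


def alert_state(window: Window, threshold: float = ERROR_RATE_THRESHOLD) -> str:
    """Return "firing" when the error rate is over the threshold, else "resolved"."""
    return "firing" if error_rate(window) > threshold else "resolved"
```

With this sketch, a window of 1,000 requests with 80 errors would be reported as firing, and the same window with 20 errors would be reported as resolved, which mirrors the two Slack messages the on-call team receives.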

And that’s it! From the first alert to resolution, Lightstep Observability helped Mary debug a high error rate quickly.

What did we learn?

  • Your on-call team can be notified in Slack as soon as an SLO is violated.
  • The Trace view shows metadata from the span that can help diagnose an issue.
  • Workflow Links allow your on-call team to quickly access other teams and tools to help resolve the issue.
  • The Deployments view shows performance before and after a deployment, allowing the on-call team to quickly see when a deployment causes a regression.
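The before-and-after comparison in the last point can be approximated by grouping spans by service version and comparing error rates. This is a minimal sketch under stated assumptions, not the Deployments view's actual logic: the `(version, is_error)` input shape, the function names, and the `min_increase` cutoff are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_rate_by_version(spans: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the error rate per version from (version, is_error) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for version, is_error in spans:
        totals[version] += 1
        if is_error:
            errors[version] += 1
    return {version: errors[version] / totals[version] for version in totals}


def is_regression(baseline_rate: float, candidate_rate: float,
                  min_increase: float = 0.05) -> bool:
    """Flag a deploy whose error rate exceeds the baseline by more than min_increase.

    min_increase is an illustrative cutoff, not a Lightstep setting.
    """
    return candidate_rate - baseline_rate > min_increase
```

For example, you could feed spans tagged with `v.10.8.249` and with the version before it into `error_rate_by_version`, then pass the two resulting rates to `is_regression` to decide whether the newer deploy deserves a closer look.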