Now that you’ve improved the incident response features in Lightstep Observability, let’s see them in action, as your on-call team would.
- At 4:11 pm, an alert is posted by Lightstep Observability to the #on-call channel in Slack, saying the Error Rate on the `android` service is over 5%.
- Mary on the team sees it and clicks one of the links in the message to open a trace with an example of the error.
- Looking at the expanded trace, she can see that the 429 errors seem to originate from the `get-store-data` operation on the `store-server` service.
- Mary clicks the Workflow Links tab and uses the Playbook link to open the wiki page with instructions for remediating the error.
- One of the Playbook steps tells her to determine if there has been a recent deploy that may have caused the regression. Mary visits the Deployments view in Lightstep Observability for the `store-server` service. Sure enough, it looks like there was a deploy around 4 pm (v.10.8.249), and it has a much higher error rate than the version before it.
- She goes back to the Workflow Links and uses the Slack link to start a message to the `store-server` service owners to let them know that their deploy may be causing errors. She pastes a link to the Trace view into the message, and because her team has integrated Slack into Lightstep Observability, the message shows details from the trace. The team can jump right into Lightstep Observability to verify the issue.
- The `store-server` team immediately rolls back the deploy. Once the rollback is complete, the errors stop, and the on-call team gets a Slack alert that the errors are resolved.
- Mary verifies this by going back to the Deployments view. It looks like the error rate is dropping back to normal as traffic switches to the restored version.
And that’s it! From the first alert to resolution, Lightstep Observability helped Mary debug a high error rate quickly.
What did we learn?
- Your on-call team can be notified by Slack as soon as an SLO is violated.
- The Trace view shows metadata from the span that can help diagnose an issue.
- Workflow Links allow your on-call team to quickly access other teams and tools to help resolve the issue.
- The Deployments view shows performance before and after a deployment, allowing the on-call team to quickly see when a deployment causes a regression.
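
The comparison Mary makes by eye in the Deployments view can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lightstep Observability computes anything: the `Request` record, the function names, the `"previous"` version label, and the sample traffic are all hypothetical. Only the 5% threshold and the `v.10.8.249` version come from the scenario above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One request record. Hypothetical fields, not a Lightstep schema."""
    version: str
    is_error: bool


def error_rate_by_version(requests):
    """Return {version: fraction of that version's requests that errored}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for request in requests:
        totals[request.version] += 1
        if request.is_error:
            errors[request.version] += 1
    return {version: errors[version] / totals[version] for version in totals}


def regressed_versions(rates, threshold=0.05):
    """Return the versions whose error rate is over the threshold.

    The default of 0.05 mirrors the 5% alert in the walkthrough.
    """
    return sorted(v for v, rate in rates.items() if rate > threshold)


# Made-up sample traffic: 100 requests per version.
# "previous" is a placeholder label; the scenario does not name that version.
sample = (
    [Request("previous", is_error=(i < 1)) for i in range(100)]
    + [Request("v.10.8.249", is_error=(i < 12)) for i in range(100)]
)

rates = error_rate_by_version(sample)
print(rates)
print(regressed_versions(rates))
```

With this made-up data, only `v.10.8.249` crosses the threshold, which is the same signal that sends Mary to the `store-server` service owners.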