So, you’ve got a distributed system and it’s obviously business critical. You’re not quite sure what’s going on, but you need to know. It’s not that there isn’t any observability; it’s that there is a patchwork of approaches that lets teams understand some incidents or outages, but not all of them. It’s time to implement an observability strategy to guide and inform choices and investments. Like most things, there are tradeoffs between different approaches.
The end goal of all observability approaches is to provide information that leads teams to improve customer experiences. When a change negatively impacts customers, you want to know without customers having to tell you, and you want to know which change (or changes) led to the problem so it can be reversed. There are often many systems involved, and instrumenting all of them may not be practical without first justifying the investment.
Edge Based Observability
By starting your observability improvement strategy as close to the customer as possible, you’re more likely to see what they see and understand how their experience varies by geography, the resources they’re using, or even quirks in your data architecture. An edge based approach often starts with public-facing APIs and sites, paying particular attention to the different types of resources your customers may access. Some resources will naturally be slower than others, so it’s important to avoid lumping all paths, resources, and APIs into a single service. If one of them is particularly high volume, poor customer experiences on the others can end up averaged out or hidden, even when looking at 99th or 99.9th percentile performance. If your observability platform supports it (like Lightstep does), it is also important to add dimensions that can impact performance, such as customer ID (after authentication), Kubernetes pod, and availability zone, to the traces created on your edge services or load balancers. Capturing the customer context, along with the information necessary to correlate differences in customer experience, is essential to effective edge based observability.
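To make the "hidden by volume" point concrete, here is a minimal sketch with made-up numbers. The routes, request counts, and latencies are assumptions chosen for illustration, and the percentile function uses the simple nearest-rank method rather than whatever your observability platform computes.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical traffic: a fast, high-volume route and a slow, low-volume one.
requests = [("GET /search", 50)] * 10_000 + [("POST /checkout", 2_000)] * 5

# View 1: everything lumped into a single "service".
all_latencies = [latency_ms for _, latency_ms in requests]
combined_p999 = percentile(all_latencies, 99.9)

# View 2: the same samples, split by route.
by_route = defaultdict(list)
for route, latency_ms in requests:
    by_route[route].append(latency_ms)
per_route_p99 = {route: percentile(samples, 99) for route, samples in by_route.items()}

print("combined p99.9:", combined_p999)
print("per-route p99:", per_route_p99)
```

In this sketch the combined 99.9th percentile still reports the fast route's latency, because the slow checkout requests make up less than 0.1% of traffic. Splitting by route makes the two-second checkout visible right away.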
A root span should be created for every resource and method that can have significantly different performance. Likewise, a span should be created for every request out of the edge to other resources to fulfill the request. As some downstream services are shown to be more critical or unstable, those services can also be instrumented to gain additional insight into what dimensions cause issues with their performance or stability. This allows a measured and clearly justifiable investment in observability at each step.
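The shape of that instrumentation might look something like the sketch below. It uses a tiny in-memory `Span` class written for this example, not the API of Lightstep or any real tracing library, and the route, downstream service names, and attribute names are all hypothetical.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """Toy span for illustration only; a real tracer would export these."""
    name: str
    attributes: dict
    children: list = field(default_factory=list)
    duration_ms: float = 0.0


@contextmanager
def start_span(name, parent=None, **attributes):
    span = Span(name=name, attributes=attributes)
    if parent is not None:
        parent.children.append(span)
    started = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_ms = (time.perf_counter() - started) * 1000


def handle_request(method, route, customer_id, zone):
    # Root span named by method and resource, carrying the customer context.
    with start_span(f"{method} {route}", customer_id=customer_id,
                    availability_zone=zone) as root:
        # One child span per request leaving the edge (hypothetical dependencies).
        with start_span("GET inventory-service /items/{id}", parent=root):
            pass  # the call to the downstream service would go here
        with start_span("POST pricing-service /quotes", parent=root):
            pass
    return root


root = handle_request("GET", "/orders/{id}", customer_id="c-123", zone="us-east-1a")
print(root.name, [child.name for child in root.children])
```

Naming the root span after the route template rather than the raw URL keeps the number of distinct span names small while still separating resources that perform differently.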
As with all things in software development, there are tradeoffs. While edge based observability provides “close to customer” insights immediately, implementing edge based observability quickly requires common abstractions for resources to already exist. If many different resources are handled in completely different ways, then instrumenting all of them for observability may take too much time or too many people without a clearer ROI. If there are many “edges” and none of them have significant business value or traffic, then it may be hard to justify the investment.
Service Based Observability
When a team is building and operating a particular service, it is the center of their world. By starting your observability strategy from services, you can provide actionable information to the team responsible as quickly as possible. You’ll want to start with framework based auto-instrumentation that provides a best practices baseline for resources handled by that application.
For microservices or service-oriented architectures, being able to understand the performance of different API calls by method and resource, instead of rolling everything up into “service” level observability, allows you to see how they vary independently and how differing dependencies can affect various resource usage patterns. It is also critically important to have spans that measure requests to dependencies by resource and method (or other dimensions by which they may vary). Application or library level retries should also be independent spans, so that the paths that successfully fulfill a request are distinct from the ones that fail. Even if the dependencies are third party or cloud service APIs, the same principles apply.
Service based observability gives teams a better understanding of their service without changes to any other services. However, the team members can’t see context on either side of the instrumented systems. If a team owns a suite of services, then they can consistently view requests within that suite, but requests to uninstrumented services will be black boxes that can include time spent in networking or load balancing.
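As a rough sketch of the retry point, the wrapper below records one span-like record per attempt instead of a single record for the whole call. `AttemptSpan`, `call_with_retries`, and the flaky dependency are all invented for this example; a real setup would create spans through your tracing library.

```python
from dataclasses import dataclass


@dataclass
class AttemptSpan:
    name: str
    attempt: int
    outcome: str  # "ok", or the name of the exception that ended the attempt


def call_with_retries(name, operation, max_attempts=3):
    """Run operation, recording a separate span for every attempt."""
    spans = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except Exception as exc:
            spans.append(AttemptSpan(name, attempt, type(exc).__name__))
        else:
            spans.append(AttemptSpan(name, attempt, "ok"))
            return result, spans
    return None, spans


# Hypothetical dependency that fails twice before succeeding.
responses = iter([TimeoutError("slow"), ConnectionError("reset"), "payload"])


def flaky_dependency():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result, spans = call_with_retries("GET storage-api /objects/{key}", flaky_dependency)
for span in spans:
    print(span)
```

With one span per attempt, a request that succeeded on the third try looks different from one that succeeded immediately, even though both returned the same payload to the caller.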
Mesh / RPC Library Based Observability
Observability is often introduced into a complicated environment with an uneven mandate for adoption. If there is a common way for applications or services to access each other, either through a service mesh or common RPC library, then starting with the common clients (libraries or reverse proxies) makes it easy and fast to get a baseline of the client experience of services without any changes to the services themselves. If the clients of a service are having issues with the service meeting SLOs, but the service team says that everything is fine, this approach can quickly discover the truth.
As with other approaches, there are some significant tradeoffs. While it is easy for a platform or DevOps team to make unilateral improvements to observability, service meshes often don’t have any awareness of resource types. Though distinguishing between GETs, POSTs, HEADs, and response codes is better than a single high level number, details can be lost with this approach. Common RPC libraries are more likely to have resource level information, but rolling out changes to libraries can be a long term effort compared to making a service mesh configuration change.
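A small, made-up example shows what "details can be lost" can look like. The records below are hypothetical, and the two views only approximate what a mesh (method and status code, no resource) and an RPC library (method plus resource) might be able to report.

```python
from collections import Counter

# Hypothetical request records: (method, resource, status code).
observed = (
    [("GET", "/catalog", 200)] * 900
    + [("GET", "/reports", 200)] * 50
    + [("GET", "/reports", 500)] * 50
)


def error_rates(records, key):
    """Fraction of 5xx responses per group, where key picks the grouping."""
    totals, errors = Counter(), Counter()
    for record in records:
        group = key(record)
        totals[group] += 1
        if record[2] >= 500:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Mesh-style view: the method is known, the resource is not.
mesh_view = error_rates(observed, key=lambda r: r[0])

# Library-style view: method and resource are both known.
library_view = error_rates(observed, key=lambda r: (r[0], r[1]))

print(mesh_view)
print(library_view)
```

Here the mesh-style view reports a 5% error rate for GETs, which may look tolerable, while the resource level view shows that half of all requests to the reports resource are failing.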
Depending on your organization and your role, different approaches will show results faster, justifying further investment. If you’re a member of a service team, starting with a service based observability strategy will allow you to quickly achieve better control over your releases and over the impact of releases around you. If you’re directly responsible for customer experiences, then an edge based approach to observability will show you what’s having the most impact and give you tools to get closer to the cause of the issues. If you’re on a platform team, mesh based observability can deliver insights about service behaviors to both owning and dependent service teams without code changes. Choose the approach that works for you. Eventually, all of these approaches combine to form a complete view of the complex behavior of your systems and the impact that behavior has on customer happiness.