For most successful companies, software development happens in a technically diverse context. Some places have adopted two or three primary languages, each with its own idioms. Others have stuck to one primary language, but as time goes by, different frameworks and framework versions change what a “normal” application looks like. Often there’s a combination of the two. Perhaps there are two or three generations of each language, or two generations of the “old” language and two generations of the “new” one. This situation is usually called “polyglot” (using multiple languages), but it is really both polyglot and polytemporal (from many times), and it creates significant challenges in maintaining an effective development and operations practice.
As your organization starts or updates its observability strategy, a key part of that strategy is creating and maintaining the ability to think about and compare applications “horizontally”. When telemetry for observability is unique to each application or framework, it becomes very difficult to troubleshoot across groups, have teams take responsibility for different applications, or train new developers.
When trying to understand the behavior of systems, variance from a norm is what is most interesting. However, the whole system changes only when things have gone quite wrong. Usually some aspects change first while already having customer impact, such as outlier latency, a canary deployment, or a particular failing instance. Tags are attached to metrics and traces so that variation can be found in parts of the system. That makes it possible to reduce customer impact from those parts, or from changes to the system, before the whole is affected.
Guidance for HTTP Services and Load Balancers
Telemetry for HTTP services is a combination of similarly structured distributed tracing root spans, derived metrics along common dimensions, and standard logging. HTTP status codes still form the primary dimension for understanding service behavior. However, without additional dimensions, it’s easy for key information to be drowned out by the volume and for a total outage in a particular subset of the service to go unnoticed.
The most important dimension is the software version (usually a git short hash or semantic version string) so that you know what version actually handled a request. Method and resource (for REST) or normalized URL need to be attached to every trace, metric, and log. If authentication information is available, and your observability system deals with cardinality well, then that should be included as well.
Seeing spikes in 401s or 404s for specific users can be a strong signal that there is a problem. Likewise, seeing that a new version of the software (or potentially of its configuration) is correlated with an increase in latency or errors is very important.
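As a minimal sketch of attaching these dimensions, the snippet below normalizes a URL and builds one tag set that could be shared by the trace, metric, and log for a request. The function names, the tag keys, and the rule that numeric or UUID-like path segments are IDs are all assumptions made for illustration, not part of any particular observability library.

```python
import re
from urllib.parse import urlsplit

# Assumption: path segments that are all digits or look like UUIDs are IDs.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_url(url: str) -> str:
    """Collapse ID-like path segments so similar requests share one tag value."""
    path = urlsplit(url).path  # drops scheme, host, and query string
    parts = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            parts.append("{id}")
        else:
            parts.append(segment)
    return "/".join(parts) or "/"


def request_dimensions(method, url, status, version, user=None):
    """Build the tags to attach to the trace, metric, and log for one request."""
    return {
        "version": version,  # e.g. a git short hash or semantic version
        "method": method.upper(),
        "resource": normalize_url(url),
        "status": status,
        # Only include real user IDs if your system handles the cardinality.
        "user": user if user is not None else "unauthenticated",
    }


tags = request_dimensions(
    "get", "https://api.example.com/users/42/orders/7?page=2", 404, "a1b2c3d"
)
```

Here `tags["resource"]` comes out as `/users/{id}/orders/{id}`, so every order lookup shares a single tag value regardless of which user or order was requested.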
When operating in a cloud-based environment, telemetry should include dimensions representing the service’s placement, from region and availability zone to IAM role and security group. Networking-related dimensions like region, AZ, and instance ID will help surface partial outages in cloud provider networks. IAM and security group information can surface breaking changes to security profiles or credential access.
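To illustrate why placement dimensions matter, here is a small sketch that computes an error rate grouped by one dimension. The record shape, the `az` key, the zone names, and the choice to count only 5xx responses as errors are assumptions for the example.

```python
from collections import defaultdict


def error_rate_by(records, dimension):
    """Return {dimension value: fraction of requests with a 5xx status}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        key = record.get(dimension, "unknown")
        totals[key] += 1
        if record["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical sample: two healthy zones and one zone that is fully down.
records = (
    [{"status": 200, "az": "us-east-1a"}] * 49
    + [{"status": 200, "az": "us-east-1b"}] * 49
    + [{"status": 503, "az": "us-east-1c"}] * 2
)

overall = sum(r["status"] >= 500 for r in records) / len(records)
by_az = error_rate_by(records, "az")
```

In this sample the overall error rate is only 2%, which is easy to miss, while the per-AZ view shows the third zone failing every request.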
See below for a more complete list.
A normalized interface breaks down into the following:
- Request Rate
- Normalized URL
- Response Rate
- 400s excluding above
- 502s (for load balancers)
- 503s (for load balancers)
- Error Rate (excluding some error types)
- Retryable Errors vs Non-Retryable
Key additional dimensions
- Version (Software and/or Config)
- Resource or Normalized URL
- Authenticated User / Unauthenticated
Common additional dimensions
- Availability Zone
- Instance ID
- Container / Pod ID
- IAM Role
- Security Group
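One way to sketch the status-code side of this interface is a small classifier that maps each response to a bucket and a retryable flag. The bucket names, the decision to give 401 and 404 their own buckets (because they are singled out earlier), and the set of retryable codes are assumptions; the right retry policy depends on your services.

```python
# Assumption: which codes are retryable is a policy choice, not a standard.
RETRYABLE = frozenset({429, 502, 503, 504})
# Assumption: codes that get their own bucket instead of a shared range.
OWN_BUCKET = frozenset({401, 404, 502, 503})


def classify_status(status: int) -> dict:
    """Map an HTTP status code to a metric bucket and a retryable flag."""
    if status in OWN_BUCKET:
        bucket = str(status)
    elif 400 <= status < 500:
        bucket = "4xx_other"
    elif status >= 500:
        bucket = "5xx_other"
    else:
        bucket = "non_error"
    return {"bucket": bucket, "retryable": status in RETRYABLE}
```

Emitting the bucket and the retryable flag as tags alongside version, resource, and placement keeps every service reporting the same shape, whatever language or framework it is written in.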
Guidance for Client Libraries
Telemetry for client libraries of HTTP services ends up tracking the same dimensions as the services themselves. This allows easy comparison of the client and server views of the operational state, especially between sets of dimensions like regions or AZs. The most important additional practice for tracing is to track application-level retries explicitly as separate spans and to inject each retry’s span context into its request, so that errors or latency can be correlated by specific downstream dependency. Since client libraries often have their own separate versioning, the client version should also be a tagged dimension. If there’s an issue between client versions, having that be immediately clear will save hours or days of troubleshooting time.
Often client libraries have a sense of higher-level resources even if those are not necessarily tracked in the associated APIs / services. Tagging client spans with type information and, if your system’s cardinality limits allow it, non-security-sensitive resource / object ID information can help correlate slowness to data stores or data store APIs.
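The retry guidance can be sketched without a tracing library. Each attempt below gets its own span, and that span's context is injected into the outgoing request using the W3C `traceparent` header layout. The `send` callable, the span dictionary shape, the tag names, and the retryable status set are hypothetical stand-ins for whatever your client and tracer provide.

```python
import secrets

# Assumption: statuses the client treats as worth retrying.
RETRYABLE_STATUSES = frozenset({502, 503, 504})


def new_span(trace_id, parent_id, name, tags):
    """Create a minimal span record with a fresh 8-byte span ID."""
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
        "name": name,
        "tags": dict(tags),
    }


def traceparent(span):
    """Format a W3C Trace Context header: version-traceid-parentid-flags."""
    return f"00-{span['trace_id']}-{span['span_id']}-01"


def call_with_retries(send, request, client_version, max_attempts=3):
    """Call send(request) -> status code, recording one span per attempt."""
    trace_id = secrets.token_hex(16)
    parent = new_span(trace_id, None, "client.call",
                      {"client.version": client_version})
    spans = [parent]
    status = None
    for attempt in range(1, max_attempts + 1):
        span = new_span(trace_id, parent["span_id"], "client.attempt",
                        {"client.version": client_version,
                         "retry.attempt": attempt})
        # Inject this attempt's own context, not the parent's.
        headers = {**request.get("headers", {}),
                   "traceparent": traceparent(span)}
        status = send({**request, "headers": headers})
        span["tags"]["status"] = status
        spans.append(span)
        if status not in RETRYABLE_STATUSES:
            break
    return status, spans
```

Because every attempt carries a distinct span ID downstream, the server-side span for a failed first attempt and the one for a successful retry can be told apart and matched to the client's view.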
The overall goal of a strategy for observability dimensionality is to make things that are similar look the same while at the same time providing enough additional context to correlate differences. By correlating behavior by version for servers and clients, change becomes visible in the telemetry itself. Separating out paths that are backed by different access patterns (method and resource tagging) keeps high-rate but low-value paths from obscuring information about low-rate but high-value paths. Adding information about network placement, by region, AZ, and instance, allows correlations for hardware and connectivity issues. Including information about roles and permissions helps correlate changes in cloud IAM outside of networking.
By providing standard telemetry dimensionality, developers and operators can be effective across a wider range of services in a more consistent way. Better communication and less training time make the whole software development process more efficient and effective.