Firefighting with traditional solutions is hard. Scrolling through dashboard after dashboard shows you that there is a problem, but not where it is or what caused it. Sifting through logs takes time, and finding the one relevant entry is like finding a needle in a haystack. Metrics and logs can only tell you so much: they let you know there’s an issue, but they can’t always tell you where and when it happened. That’s where distributed tracing comes in.
Distributed tracing provides a view of the life of a request as it travels across multiple hosts and services communicating over various protocols. Here’s an example of a request from a client through a load balancer into several backend systems. With distributed tracing implemented, you have a window into performance at every step in the request.
Distributed tracing relies on instrumentation of the system you’re trying to observe. You can use specifications such as OpenTelemetry to provide a consistent interface across a variety of languages to write this instrumentation code. Some systems may require custom instrumentation at the service level, while others may only need instrumentation of the framework. Often, you’ll need to use a combination of these approaches.
Before you start that instrumentation, read on to learn about the different components that make up a distributed trace, and how the data from that instrumentation makes it into Cloud Observability where you can view and work with it.
In distributed tracing, a trace is a view into a request as it moves through a distributed system. Multiple spans represent different parts of the workflow and are pieced together to create a trace. A span is a named, timed operation that represents a piece of the workflow.
The diagram above shows one trace made up of several spans.
In Cloud Observability, you view traces as a “tree” of spans that reflects the time that each span started and completed. It also shows you the relationship between spans. Here’s a simplified view of a trace, as it relates to the request above.
A trace starts with a root span where the request starts. This root span can have one or more child spans, and each one of those child spans can have child spans.
Child spans don’t always finish before their parent when the two are asynchronous. For example, an RPC call might time out, and so the parent span finishes before the “hanging” child span.
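The tree structure described above can be sketched in a few lines of plain Python. This is an illustration only, not the OpenTelemetry or Cloud Observability data model: the TimedSpan class, the render helper, the span names, and the timings are all hypothetical. It shows a root span with nested children and flags a child that finishes after its parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TimedSpan:
    """A named, timed operation. Times are milliseconds since the request began."""
    name: str
    start_ms: int
    end_ms: int
    children: List["TimedSpan"] = field(default_factory=list)

    def child(self, name: str, start_ms: int, end_ms: int) -> "TimedSpan":
        """Create a child span, attach it to this span, and return it."""
        span = TimedSpan(name, start_ms, end_ms)
        self.children.append(span)
        return span


def render(span: TimedSpan, parent: Optional[TimedSpan] = None, depth: int = 0) -> List[str]:
    """Return one indented line per span, parents before their children."""
    note = ""
    if parent is not None and span.end_ms > parent.end_ms:
        note = " (finishes after parent)"
    lines = [f"{'  ' * depth}{span.name}: {span.start_ms}-{span.end_ms} ms{note}"]
    for c in span.children:
        lines.extend(render(c, span, depth + 1))
    return lines


# Hypothetical request: the "slow rpc" child is still running when its parent ends.
root = TimedSpan("client request", 0, 120)
lb = root.child("load balancer", 5, 115)
lb.child("auth", 10, 40)
lb.child("slow rpc", 50, 140)

print("\n".join(render(root)))
# client request: 0-120 ms
#   load balancer: 5-115 ms
#     auth: 10-40 ms
#     slow rpc: 50-140 ms (finishes after parent)
```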
As you can see in the above illustration, there can be two types of child spans. A ChildOf span is one where the parent depends on that child span’s result (like the relationship of the load balancer and the auth span). Spans doing concurrent (perhaps distributed) work may all individually be the ChildOf a single parent span that merges the results for all children.
The second is the FollowsFrom relationship, where the parent span is not dependent on the child (like the auth span and the billing span). These often represent “fire-and-forget” operations, for example, an opportunistic write to cache or a message that doesn’t care about its consumer.
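One way to picture the difference between the two relationships is to record a reference type on each child span and ask which children a parent actually waits for. The sketch below is plain Python with hypothetical names (RefType, SpanRef, blocking_children), not a tracing library API. The span names follow the example above, and the "cache write" span is an added assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RefType(Enum):
    CHILD_OF = "ChildOf"          # the parent depends on the child's result
    FOLLOWS_FROM = "FollowsFrom"  # the parent does not wait for the child


@dataclass
class SpanRef:
    name: str
    parent: Optional[str] = None        # name of the parent span; None for the root
    ref_type: Optional[RefType] = None  # how this span relates to its parent


def blocking_children(spans: List[SpanRef], parent_name: str) -> List[str]:
    """Names of the child spans whose results the given parent depends on."""
    return [
        s.name
        for s in spans
        if s.parent == parent_name and s.ref_type is RefType.CHILD_OF
    ]


spans = [
    SpanRef("load balancer"),
    SpanRef("auth", parent="load balancer", ref_type=RefType.CHILD_OF),
    SpanRef("billing", parent="auth", ref_type=RefType.FOLLOWS_FROM),
    SpanRef("cache write", parent="auth", ref_type=RefType.FOLLOWS_FROM),  # hypothetical
]

print(blocking_children(spans, "load balancer"))  # ['auth']
print(blocking_children(spans, "auth"))           # []
```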
For the trace tree to be built with these relationships intact, each span needs to propagate its context to its child. The context (or trace context) contains several pieces of information that can be passed between functions inside a process or between processes over an RPC. The context tells the child span who its parent is (the parent span ID) and what trace it belongs to (the trace ID). The child span creates its own ID and then propagates both that ID (as the parent span ID) and the trace ID in the context to its own child span. There can be other components in the context, but the parent span ID and trace ID are what allow a trace tree to be built.
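The hand-off described above can be sketched as follows. This is a minimal illustration in plain Python, not a real tracer: TraceContext, TracedSpan, and start_span are hypothetical names, and the random hex IDs simply stand in for whatever ID scheme a tracing library uses.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str        # shared by every span in the trace
    parent_span_id: str  # ID of the span that passed this context on


@dataclass
class TracedSpan:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span

    def context(self) -> TraceContext:
        """Context to hand to a child: same trace ID, this span as the parent."""
        return TraceContext(self.trace_id, self.span_id)


def start_span(name: str, ctx: Optional[TraceContext] = None) -> TracedSpan:
    """Start a root span (no context) or a child span (context from the parent)."""
    span_id = secrets.token_hex(8)  # every span creates its own ID
    if ctx is None:
        return TracedSpan(name, secrets.token_hex(16), span_id, None)
    return TracedSpan(name, ctx.trace_id, span_id, ctx.parent_span_id)


root = start_span("load balancer")
child = start_span("auth", root.context())
grandchild = start_span("billing", child.context())

# All three spans share one trace ID; each child points at its direct parent.
print(grandchild.trace_id == root.trace_id)          # True
print(grandchild.parent_span_id == child.span_id)    # True
```

With only those two fields, a backend can group spans by trace ID and then link each span to its parent to rebuild the tree.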
OpenTelemetry uses request headers to propagate context across process boundaries. Tracer objects are configured with Propagator objects that support transferring the context across those boundaries. OpenTelemetry provides a default tracer for your spans, a Tracer provider capable of granting access to the tracer for your component, or both. As spans are created and completed, the Tracer dispatches them to the OpenTelemetry SDK’s Exporter, which is responsible for sending your spans to a backend system for analysis. How tracers are created and registered depends on the language. Read the OpenTelemetry spec for more info.
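As an illustration of header-based propagation, the sketch below writes and reads a header in the W3C Trace Context traceparent format (version, trace ID, parent span ID, and flags, separated by hyphens). It is a simplified stand-in written in plain Python, not the OpenTelemetry Propagator API: the inject and extract functions here are hypothetical helpers, and a real propagator also handles details such as case-insensitive header names and the tracestate header.

```python
import re
from typing import Dict, Optional, Tuple

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current span's context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Read (trace_id, parent_span_id) from incoming headers, or None if absent or invalid."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    _version, trace_id, parent_id, _flags = match.groups()
    # All-zero IDs are not valid in the W3C format.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id


# Calling side: attach the context to an outgoing request.
outgoing: Dict[str, str] = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(outgoing["traceparent"])
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# Receiving side: recover the trace ID and parent span ID for the new child span.
print(extract(outgoing))
# ('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7')
```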
A span may also have zero or more key/value attributes. Attributes allow you to create metadata about the span. For example, you might create attributes that hold a customer ID, information about the environment that the request is operating in, or an app’s release. Attributes are not time-stamped; anything tied to a specific moment belongs in a span event instead. The OpenTelemetry spec defines several standard attributes. You can also implement your own attributes.
Span events (logs in OpenTracing) contain time-stamped information. A span can have zero or more events. Each is a time-stamped event name, optionally accompanied by a structured data payload of arbitrary size.
You can add events to any span where the additional context would add value and the information included would be unique to an individual trace.
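To make the distinction between attributes and events concrete, here is a small plain-Python sketch. It is not the OpenTelemetry SDK: the AnnotatedSpan and Event classes, the attribute keys, and the event name are all hypothetical. Attributes are stored as untimed key/value pairs on the span, while each event records the moment it happened plus an optional structured payload.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Event:
    name: str
    timestamp: float  # seconds since the epoch, captured when the event is added
    payload: Dict[str, Any] = field(default_factory=dict)


@dataclass
class AnnotatedSpan:
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)  # metadata, no timestamps
    events: List[Event] = field(default_factory=list)         # time-stamped occurrences

    def set_attribute(self, key: str, value: Any) -> None:
        """Attach a piece of metadata that describes the span as a whole."""
        self.attributes[key] = value

    def add_event(self, name: str, **payload: Any) -> None:
        """Record something that happened at a specific moment during the span."""
        self.events.append(Event(name, time.time(), payload))


span = AnnotatedSpan("checkout")
span.set_attribute("customer.id", "c-123")      # hypothetical attribute keys
span.set_attribute("app.release", "2023.05.1")
span.add_event("cache.miss", key="cart:42")     # hypothetical event

print(span.attributes)
print([(e.name, e.payload) for e in span.events])
```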
Once you’ve done your instrumentation, you instantiate tracers that know how to create the spans and their associated attributes, events, and context. The instrumentation collects 100% of the span data and sends it to the Cloud Observability Microsatellites. The Microsatellites then send any data that serves as an example of application errors, high latency, or other interesting events in real time to the Cloud Observability SaaS platform, which pieces together the spans into traces. You use the Cloud Observability web application to view the actual traces, along with all the associated metadata from attributes and events. Read How Cloud Observability Works for more info.
Of course, that’s not all there is to distributed tracing or Cloud Observability! Here are more resources that can help you get started.
Updated May 23, 2023