Firefighting with traditional solutions is hard. Scrolling through multitudes of dashboards only shows you that there is a problem, not where it is or what caused it. Sifting through logs takes time, and it’s hard to find that needle in a haystack. Metrics and logs can only tell you so much: they let you know there’s an issue, but they can’t always tell you where and when it happened. That’s where distributed tracing comes in.

Distributed tracing provides a view of the life of a request as it travels across multiple hosts and services communicating over various protocols. Here’s an example of a request from a client through a load balancer into several backend systems. With distributed tracing implemented, you have a window into performance at every step in the request.

Distributed tracing relies on instrumentation of the system you’re trying to observe. You can use specifications such as OpenTracing or OpenTelemetry to provide a consistent interface across a variety of languages to write this instrumentation code. Some systems may require custom instrumentation at the service level, while others may only need instrumentation of the framework. Often, you’ll need to use a combination of these approaches.

Before you start that instrumentation, read on to learn about the different components that make up a distributed trace, and how the data from that instrumentation makes its way into Lightstep, where you can view and work with it.

Spans and Traces

In distributed tracing, a trace is a view into a request as it moves through a distributed system. Multiple spans represent different parts of the workflow and are pieced together to create a trace. A span is a named, timed operation that represents a piece of the workflow.

In Lightstep, you view traces as a “tree” of spans that reflects the time that each span started and completed. It also shows you the relationship between spans. Here’s a simplified view of a trace, as it relates to the request above.

A trace starts with a root span where the request starts. This root span can have one or more child spans, and each one of those child spans can have child spans.

Child spans don’t always finish before their parent when the two are asynchronous. For example, an RPC call might time out, and so the parent span finishes before the “hanging” child span.
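To make the tree structure concrete, here is a minimal sketch in plain Java. It does not use the OpenTracing or OpenTelemetry APIs: `SketchSpan`, `TraceTreeSketch`, and the span names and timings are all hypothetical, chosen only to mirror the load balancer, auth, and billing example. It shows a root span with nested children, and a “hanging” child that finishes after its parent.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal span: a named, timed operation that can have child spans. */
class SketchSpan {
    final String name;
    final long startMillis;
    final long finishMillis;
    final List<SketchSpan> children = new ArrayList<>();

    SketchSpan(String name, long startMillis, long finishMillis) {
        this.name = name;
        this.startMillis = startMillis;
        this.finishMillis = finishMillis;
    }

    /** Attaches a child span and returns it so calls can be nested. */
    SketchSpan addChild(SketchSpan child) {
        children.add(child);
        return child;
    }

    /** True when this span was still running after the given parent finished. */
    boolean outlives(SketchSpan parent) {
        return finishMillis > parent.finishMillis;
    }

    /** Renders this span and its descendants as an indented tree, one span per line. */
    String render(int depth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(name).append(" [").append(startMillis).append("ms - ")
           .append(finishMillis).append("ms]\n");
        for (SketchSpan child : children) {
            out.append(child.render(depth + 1));
        }
        return out.toString();
    }
}

class TraceTreeSketch {
    /** Builds an illustrative trace; the names and timings are made up. */
    static SketchSpan buildExampleTrace() {
        SketchSpan root = new SketchSpan("load-balancer", 0, 120);
        SketchSpan auth = root.addChild(new SketchSpan("auth", 10, 60));
        // This child is asynchronous and is still running after its parent ends.
        auth.addChild(new SketchSpan("billing", 50, 150));
        return root;
    }

    public static void main(String[] args) {
        System.out.print(buildExampleTrace().render(0));
    }
}
```

Running `main` prints the trace as an indented tree, which is roughly the shape a trace view gives you: nesting shows the parent and child relationships, and the timestamps show when each span started and completed.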

As you can see in the above illustration, there can be two types of child spans. The first is the ChildOf relationship, where the parent depends on the child span’s result (like the relationship of the load balancer and the auth span). Spans doing concurrent (perhaps distributed) work may each be a ChildOf a single parent span that merges the results of all its children.

The second is the FollowsFrom relationship, where the parent span is not dependent on the child (like the auth span and the billing span). These often represent “fire-and-forget” operations, for example, an opportunistic write to cache or a message that doesn’t care about its consumer.
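The two relationship types can be sketched as a tag on the link between a span and its parent. The example below is plain Java with hypothetical names (`SketchReference`, `LinkedSpan`, `ReferenceSketch`); it is not the OpenTracing API, which expresses the same idea through the `References.CHILD_OF` and `References.FOLLOWS_FROM` constants passed to a span builder. The `dependencies` helper is an assumption made for illustration: it lists only the children whose result the parent waits for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** The two relationship types described above. */
enum SketchReference { CHILD_OF, FOLLOWS_FROM }

/** Hypothetical span that records how it relates to its parent. */
class LinkedSpan {
    final String name;
    final LinkedSpan parent;          // null for the root span
    final SketchReference reference;  // null for the root span

    LinkedSpan(String name, LinkedSpan parent, SketchReference reference) {
        this.name = name;
        this.parent = parent;
        this.reference = reference;
    }
}

class ReferenceSketch {
    /** Names of the spans whose result the given parent depends on (ChildOf only). */
    static List<String> dependencies(LinkedSpan parent, List<LinkedSpan> spans) {
        List<String> names = new ArrayList<>();
        for (LinkedSpan span : spans) {
            if (span.parent == parent && span.reference == SketchReference.CHILD_OF) {
                names.add(span.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        LinkedSpan loadBalancer = new LinkedSpan("load-balancer", null, null);
        LinkedSpan auth = new LinkedSpan("auth", loadBalancer, SketchReference.CHILD_OF);
        LinkedSpan billing = new LinkedSpan("billing", auth, SketchReference.FOLLOWS_FROM);
        List<LinkedSpan> all = Arrays.asList(loadBalancer, auth, billing);

        // The load balancer waits on auth, so auth is listed as a dependency.
        System.out.println(dependencies(loadBalancer, all));
        // Billing only follows from auth, so auth has no dependencies here.
        System.out.println(dependencies(auth, all));
    }
}
```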

SpanContext

For the trace tree to be built with these relationships intact, each span needs to propagate its SpanContext to its child. SpanContext tells the child span who its parent is (parent span ID) and what trace it belongs to (trace ID). The child span creates its own ID and then propagates both that ID (as the parent span ID) and the trace ID in the SpanContext to its own child span.

There can be other components in SpanContext, but the parent span ID and trace ID are what allow a trace tree to be built. Read the OpenTracing and OpenTelemetry specs for more info.
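As a rough illustration of that hand-off, here is a minimal sketch in plain Java. `SketchSpanContext` and its methods are hypothetical and carry only the two IDs discussed here; real SpanContext implementations hold more, and their ID formats differ. The point is only that the trace ID stays constant while each span’s own ID becomes the parent span ID of the next span.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical, minimal span context: just the IDs needed to rebuild the trace tree. */
final class SketchSpanContext {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the root span

    private SketchSpanContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    /** Starts a new trace: fresh trace ID, fresh span ID, no parent. */
    static SketchSpanContext newRoot() {
        return new SketchSpanContext(randomId(), randomId(), null);
    }

    /** Creates the context for a child: same trace ID, new span ID, this span as parent. */
    SketchSpanContext createChild() {
        return new SketchSpanContext(traceId, randomId(), spanId);
    }

    // Random hex string; an assumption for the sketch, not a real ID format.
    private static String randomId() {
        return Long.toHexString(ThreadLocalRandom.current().nextLong());
    }
}

class SpanContextSketch {
    public static void main(String[] args) {
        SketchSpanContext root = SketchSpanContext.newRoot();
        SketchSpanContext child = root.createChild();
        SketchSpanContext grandchild = child.createChild();

        // All three share one trace ID; each parent ID points one level up.
        System.out.println(root.traceId.equals(grandchild.traceId));
        System.out.println(child.spanId.equals(grandchild.parentSpanId));
    }
}
```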

Tags/Attributes

A span may also have zero or more key/value tags (known as attributes in OpenTelemetry). Tags allow you to create metadata about the span. For example, you might create tags that hold a customer ID, or information about the environment that the request is operating in, or an app’s release. Tags do not reflect any time-based event (logs in OpenTracing and events in OpenTelemetry handle events). The OpenTracing and OpenTelemetry specs define several standard tags. For example, here are the tags available using the Java-based tracer. You can also implement your own tags.

Span Logs/Events

Span logs (events in OpenTelemetry) contain time-stamped information. A span can have zero or more logs. Each is a time-stamped event name, optionally accompanied by a structured data payload of arbitrary size.

You can add logs to any span where the additional context would add value and the information included would be unique to an individual trace.
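The difference between tags and logs is easier to see side by side. The sketch below is plain Java with hypothetical names (`AnnotatedSpan`, `LogEntry`, `AnnotatedSpanSketch`) and made-up tag keys and values; it is not a tracer API. Tags are untimed key/value metadata about the whole span, while each log entry carries its own timestamp and an optional structured payload.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical span that holds untimed tags and time-stamped log entries. */
class AnnotatedSpan {
    /** One time-stamped event with an optional structured payload. */
    static final class LogEntry {
        final long timestampMillis;
        final String event;
        final Map<String, Object> payload;

        LogEntry(long timestampMillis, String event, Map<String, Object> payload) {
            this.timestampMillis = timestampMillis;
            this.event = event;
            this.payload = payload;
        }
    }

    final String operationName;
    final Map<String, Object> tags = new LinkedHashMap<>();
    final List<LogEntry> logs = new ArrayList<>();

    AnnotatedSpan(String operationName) {
        this.operationName = operationName;
    }

    /** Adds metadata that describes the span as a whole. */
    AnnotatedSpan setTag(String key, Object value) {
        tags.put(key, value);
        return this;
    }

    /** Records something that happened at a specific moment during the span. */
    AnnotatedSpan log(long timestampMillis, String event, Map<String, Object> payload) {
        logs.add(new LogEntry(timestampMillis, event, payload));
        return this;
    }
}

class AnnotatedSpanSketch {
    /** Builds an example span; all keys, values, and timestamps are illustrative. */
    static AnnotatedSpan buildExample() {
        Map<String, Object> retryDetails = new LinkedHashMap<>();
        retryDetails.put("attempt", 2);
        retryDetails.put("reason", "timeout");

        return new AnnotatedSpan("charge-card")
                .setTag("customer.id", "c-1234")
                .setTag("environment", "staging")
                .setTag("release", "1.4.2")
                .log(1000L, "retry", retryDetails)
                .log(1250L, "charge.accepted", Collections.<String, Object>emptyMap());
    }

    public static void main(String[] args) {
        AnnotatedSpan span = buildExample();
        System.out.println(span.tags);
        for (AnnotatedSpan.LogEntry entry : span.logs) {
            System.out.println(entry.timestampMillis + " " + entry.event + " " + entry.payload);
        }
    }
}
```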

Propagating Across the Wire

In OpenTracing, you propagate span context between services using the tracer’s inject and extract methods. When creating a request, you inject the span context into the RPC; when receiving that request, you extract the span context. Here’s an example of injecting parent span context into a carrier using message headers.

Injection:

public class TracingMessageProducer extends ForwardingMessageProducer {
    void startTracing(final Message message) {
        // ...
        final SpanId parent = Tracing.peekOrCreate();
        final SpanId spanId = parent.createChild();

        addToHeaders(message, TracingHeaders.TRACE_ID, spanId.getTraceId().toString());
        addToHeaders(message, TracingHeaders.SPAN_ID, spanId.getSpanId());
        addToHeaders(message, TracingHeaders.PARENT_SPAN_ID, spanId.getParentId());
        // ...
    }
}

You then extract the context from the message headers:

Extract:

public class TracingMessageConsumer extends ForwardingMessageConsumer {
    void startTracing(final Message message) {
        // Rebuild the parent's span ID from the headers set by the producer.
        final SpanId parentSpanId = SpanId.create(message.getStringProperty(TracingHeaders.TRACE_ID, null),
                                    message.getStringProperty(TracingHeaders.SPAN_ID, null),
                                    message.getStringProperty(TracingHeaders.PARENT_SPAN_ID, null));
        // ...
    }
}

And then create a new child span:

public class TracingMessageConsumer extends ForwardingMessageConsumer {
    void startTracing(final Message message) {
        final SpanId parentSpanId = SpanId.create(message.getStringProperty(TracingHeaders.TRACE_ID, null),
                                    message.getStringProperty(TracingHeaders.SPAN_ID, null),
                                    message.getStringProperty(TracingHeaders.PARENT_SPAN_ID, null));
        final SpanId spanId = parentSpanId.createChild();
        final SpanId previousSpanId = Tracing.push(spanId);
        Tracing.setOperation(OPERATION_NAME);
        Tracing.setAttribute(TracingLogEntryKeys.EXCHANGE, message.getExchangeName());
        Tracing.setAttribute(TracingLogEntryKeys.ROUTING_KEY, message.getRoutingKey());
        Tracing.push(previousSpanId);
        // ...
    }
}

Read Instrument Your Code to learn more details about instrumentation, such as how to prioritize what to trace.

OpenTelemetry uses headers to propagate context. See the individual language Get Started guides for more info.
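For a sense of what header-based propagation looks like, here is a minimal sketch of the W3C Trace Context `traceparent` header, the format OpenTelemetry propagates by default: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and 2-hex-character flags, joined by hyphens. The class and method names (`TraceParentSketch`, `inject`, `extract`) are hypothetical and written in plain Java rather than against the OpenTelemetry SDK. The sketch also simplifies validation: it accepts only version `00` and does not reject all-zero IDs, which the specification treats as invalid.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that writes and reads a W3C-style traceparent header. */
class TraceParentSketch {
    static final String HEADER = "traceparent";

    // version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** Writes the caller's trace ID and span ID into the outgoing headers. */
    static void inject(String traceId, String spanId, boolean sampled, Map<String, String> headers) {
        headers.put(HEADER, "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00"));
    }

    /** Returns {traceId, parentSpanId, flags}, or null if the header is missing or malformed. */
    static String[] extract(Map<String, String> headers) {
        String value = headers.get(HEADER);
        if (value == null) {
            return null;
        }
        Matcher matcher = FORMAT.matcher(value);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        // Client side: attach the current span's context to the outgoing request.
        inject("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true, headers);
        System.out.println(headers.get(HEADER));

        // Server side: recover the trace ID and the parent span ID for the new child span.
        String[] context = extract(headers);
        System.out.println("trace=" + context[0] + " parent=" + context[1]);
    }
}
```

In practice, you would let the propagator that ships with your language’s SDK handle this rather than formatting headers by hand.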

Sending Span Data to Lightstep with Tracers

Once you’ve done your instrumentation, you instantiate tracers that know how to create the spans and their associated tags, logs, and context. Lightstep tracers collect 100% of that data and send it to the Lightstep Satellites, which piece together the spans into traces. The Satellites then send any data that serves as examples of application errors, high latency, or other interesting events in real time to the Lightstep Hypothesis Engine. You use the Lightstep web application to view the actual traces, along with all the associated metadata from tags and logs. Read How Lightstep Works for more info.
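Conceptually, piecing spans into traces means grouping reported spans by trace ID and then linking each span to its parent by parent span ID. The sketch below illustrates only that idea in plain Java. It is not how Lightstep Satellites are implemented, and `ReportedSpan`, `TraceAssemblerSketch`, and the sample IDs are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical record of a finished span as a tracer might report it. */
class ReportedSpan {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for a root span
    final String name;

    ReportedSpan(String traceId, String spanId, String parentSpanId, String name) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.name = name;
    }
}

class TraceAssemblerSketch {
    /** Groups spans that arrived in any order into one list per trace ID. */
    static Map<String, List<ReportedSpan>> groupByTrace(List<ReportedSpan> spans) {
        Map<String, List<ReportedSpan>> byTrace = new LinkedHashMap<>();
        for (ReportedSpan span : spans) {
            byTrace.computeIfAbsent(span.traceId, id -> new ArrayList<>()).add(span);
        }
        return byTrace;
    }

    /** Names of the spans in one trace whose parent is the given span ID (null finds roots). */
    static List<String> childrenOf(List<ReportedSpan> trace, String parentSpanId) {
        List<String> names = new ArrayList<>();
        for (ReportedSpan span : trace) {
            if (Objects.equals(span.parentSpanId, parentSpanId)) {
                names.add(span.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Spans from two traces, interleaved as they might arrive from different services.
        List<ReportedSpan> reported = Arrays.asList(
                new ReportedSpan("t1", "b", "a", "auth"),
                new ReportedSpan("t2", "x", null, "health-check"),
                new ReportedSpan("t1", "a", null, "load-balancer"),
                new ReportedSpan("t1", "c", "b", "billing"));

        Map<String, List<ReportedSpan>> traces = groupByTrace(reported);
        List<ReportedSpan> first = traces.get("t1");
        System.out.println(childrenOf(first, null)); // root span(s) of trace t1
        System.out.println(childrenOf(first, "a"));  // children of the load balancer span
    }
}
```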

Of course, that’s not all there is to distributed tracing or Lightstep! Here are more resources that can help you get started.

OpenTracing Instrumentation

  • OpenTracing: Learn more about how OpenTracing works to create spans and their data.
  • Quick Starts: For languages where it’s available, auto-install OpenTracing instrumentation for your framework, configure your tracer, and see requests travel through the framework. For other languages, use the API to create a single span: instantiate a tracer, add instrumentation to a small piece of code, and then view that span in Lightstep.
  • Use tracing libraries other than Lightstep’s: If you’ve already instrumented your system, but were using another tracing library, no worries! Lightstep can ingest that instrumentation with almost no additional code needed.

OpenTelemetry Instrumentation

Lightstep

  • Understand Lightstep’s architecture: Read how Lightstep uses tracers, Satellites, and the Hypothesis Engine to deliver visualized data to the Lightstep UI.
  • Learn how to use Lightstep: Once you’ve instrumented your system, learn how to analyze and resolve issues and understand the value Lightstep brings to your organization.