While distributed tracing can give you a great picture of the health and efficiency of your system, the information you get is only as good as the information you provide through instrumentation of your app. Too little and you may not find the actual root cause of an issue. Too much, or focusing too far down in the stack, and you'll end up with noise that distracts from the real issues.
So where to start and how to determine the right level of instrumentation?
Maybe you want to ensure your most valuable business operations have full tracing coverage so you can monitor them and find issues quickly. Or you want to ensure your most frequently called API is always performant. Or maybe you know you have a latency issue with a particular request and you need to dig in and find the cause. These scenarios all call for instrumentation that traverses the full stack, giving you a view into a request as it travels through your system.
The following dimensions will help you think about the relative priority of instrumentation targets:
- Impact on the endpoint or on the services highly involved in your important transactions: The closer instrumentation is to your business value, the more meaningful the resulting performance and reliability data will be. Instrument enough of these code components to create a trace along the critical path of your high-value transactions.
- Widely used routing and communication packages: Homegrown RPC subsystems and routing layers reveal a great deal about application semantics and also play a role in propagation across process boundaries.
- Known areas of unpredictable latency or reliability: Adding instrumentation where you know there may be issues helps to explain and model the variability.
- Known bottlenecks: Having instrumentation for database calls, inter-region network activity, and other common areas of bottleneck results in a quicker mean time to resolution when issues arise.
Now that you have an idea of where to start, you can think about how.
While you could instrument every service, chances are you and your team don't have the bandwidth to make that happen. Instead, start at the framework with components that add the tracing logic for you. Or if you use Istio, auto-instrument your service mesh. You'll immediately see traces from service to service. At that point, you can prioritize areas where you'd like to see more detail and add instrumentation there.
OpenTracing offers a number of plugins that instrument common libraries used in most distributed systems. The registry contains plugins that instrument requests and responses for a number of protocols (like gRPC), frameworks (like Spring), and servers (like Couchbase). When you install a plugin, tracing is in place for a broad range of execution paths.
Even easier is using the OpenTracing Java Special Agent. This agent automatically connects third-party libraries you already have in your system to available OpenTracing plugins, adding the same instrumentation code you would otherwise write by hand.
With your framework instrumented, you can immediately see traces in LightStep, but they likely won't hit every call in the request flow. Now it's time to add instrumentation to your services.
While frameworks can get you most of the way there, you likely have high-value business operations in your system that you want to be sure are running as efficiently as possible. If you find that the library plugins don't give you the details you need at certain service points, you can add them using the OpenTracing APIs. Try Quick Start: Use OpenTracing to Instrument Your Code in the language of your choice to get a feel for it and then move on to Add Spans to Create Traces to start connecting everything together.
Once you've decided where to start adding more detailed tracing inside your services, you need to know how to join all those spans together into meaningful traces through your system.
Traces are all about following transactions through an application, not simply monitoring individual processes or operations. To deliver the goal of an end-to-end trace of your highest-priority operations, you will need to join those operations' spans together by propagating trace span context along with the transaction. You will want to connect the dots both within processes and between processes.
Spans have a parent/child relationship with other spans in the trace, creating the flow from the root span at the beginning of the request, to the lowest level span at the end of the request, before the flow returns back to the root. Each span (except the root) carries a reference to its parent span. This reference (along with others) is carried in the context of the process.
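The parent/child chain described above can be sketched with plain Go values. This is a toy model of the relationship, not the actual opentracing-go types:

```go
package main

import "fmt"

// span is a toy model of a tracing span: each one (except the root)
// holds a reference to its parent, which is how the trace tree forms.
type span struct {
	operation string
	parent    *span
}

// depth walks parent references back to the root, mirroring how a
// tracer reassembles the request flow from span relationships.
func (s *span) depth() int {
	d := 0
	for p := s.parent; p != nil; p = p.parent {
		d++
	}
	return d
}

func main() {
	root := &span{operation: "http.request"}                // root span: no parent
	handler := &span{operation: "handler", parent: root}    // child of the root
	query := &span{operation: "db.query", parent: handler}  // lowest-level span

	fmt.Println(query.depth()) // prints 2: two hops back to the root
}
```

In a real tracer the parent reference lives inside the span context rather than in a raw pointer, but the tree it produces is exactly this shape.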
Many languages and frameworks provide some form of context object that can be used to propagate the hierarchical relationship and other identifiers (for example, a request, transaction, or operation ID). Go has context.Context, for instance, and Django has template contexts. Where possible, use the approach that is idiomatic for the platform to propagate context throughout a process.
Here's an example of using context.Context in Go to propagate transaction context from parent to child spans.

```go
// Start a new span as a child of the span in the current context.
// Returns the span and a new context.Context containing the current span.
span, ctx := opentracing.StartSpanFromContext(ctx, "my_operation_name")
```
The Inject and Extract features serialize and deserialize the span's context. This context can be sent over the wire, for example as HTTP headers. Use these methods to propagate your span context when crossing process boundaries.
Sometimes you may not be able to use Extract. When that's the case, LightStep offers an additional correlation mechanism called Trace Assembly Tags.
If you already propagate a de facto transaction ID (perhaps referred to as a trace_id, etc.) around your distributed system, LightStep can use it to assemble distributed traces. To take advantage of this feature, add an OpenTracing tag to your spans with a key string prefixed by "guid:" and the transaction or correlation ID as the value. For instance, given a requestId member variable, you might write:

```go
// Set a guid tag on an opentracing.Span object using the SetTag method and by
// giving the key a name that starts with "guid:".
//
// LightStep will automatically include any other span (throughout the
// distributed system) that uses this same tag key and value as part of the
// same distributed trace.
span.SetTag("guid:request_id", requestId)
```
"guid:" tags allow you to reuse existing propagation mechanisms to assemble traces in LightStep, and in so doing can greatly reduce the initial integration time for codebases and systems that have not yet instrumented with OpenTracing.
"guid:" trace assembly tags are useful in production, conventional OpenTracing
Extract propagation actually encodes more information about span relationships and provides LightStep with more information to use when visualizing or analyzing traces. Trace assembly tags are far better than nothing, but conventional OpenTracing propagation is still a best practice when it's a convenient option.
Spans may have multiple "guid:" tags, and traces are assembled by taking the union of these assembly hints and any conventional OpenTracing span-to-span references.
Once you have some instrumentation in place, be sure to check out its IQ Score. LightStep can analyze your instrumentation and recommend ways to improve it. Watch your score go up as you continue to add tracing capabilities to your system.