Observability gives you the information you need about the health and efficiency of your system. However, systems are large and complex, so the question is: what do you measure, and where do you start?

OpenTelemetry (currently in Beta, with GA expected in November 2020) is the unified initiative that takes the best of both OpenTracing and OpenCensus forward. Think of OpenTelemetry as the next evolution of OpenTracing and OpenCensus. If OpenTelemetry doesn’t currently support the language you need, you can use OpenTracing for now and then move to OpenTelemetry when it’s ready. If you use OpenTracing in some services and OpenTelemetry in other services, traces will be correctly connected as long as you use B3 context propagation in all services.
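
To make the propagation requirement concrete, here is a minimal sketch of how B3 multi-header context could be read from an incoming request and written to an outgoing one. The header names come from the B3 propagation format. The helper names (`extract`, `inject`, `new_id`) and the choice to mark a brand-new trace as sampled are assumptions made for illustration; in practice the OpenTracing or OpenTelemetry propagator for your language does this for you.

```python
import secrets

# B3 multi-header names, as defined by the B3 propagation format.
TRACE_ID = "X-B3-TraceId"
SPAN_ID = "X-B3-SpanId"
PARENT_SPAN_ID = "X-B3-ParentSpanId"
SAMPLED = "X-B3-Sampled"


def new_id(num_bytes):
    """Return a random lowercase-hex identifier of the given byte length."""
    return secrets.token_hex(num_bytes)


def extract(headers):
    """Read the caller's trace context from incoming headers, or return None."""
    lowered = {key.lower(): value for key, value in headers.items()}
    trace_id = lowered.get(TRACE_ID.lower())
    span_id = lowered.get(SPAN_ID.lower())
    if not trace_id or not span_id:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": lowered.get(SAMPLED.lower()),  # None if the caller sent none
    }


def inject(parent, headers):
    """Add B3 headers for an outgoing call made on behalf of `parent`."""
    if parent is None:
        # No incoming context: start a new trace (sampling it is an assumption).
        trace_id, sampled = new_id(16), "1"
    else:
        # Continue the caller's trace and record the caller's span as parent.
        trace_id, sampled = parent["trace_id"], parent["sampled"]
        headers[PARENT_SPAN_ID] = parent["span_id"]
    headers[TRACE_ID] = trace_id
    headers[SPAN_ID] = new_id(8)
    if sampled is not None:
        headers[SAMPLED] = sampled
    return headers
```

Because every service forwards the same trace ID and records the caller's span ID as its parent, spans produced by different tracing libraries can still be stitched into one trace.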

Prioritize Where to Add Tracing

Maybe you want to ensure your most valuable business operations have full coverage so you can monitor them and find issues quickly. Or you want to ensure your most frequently called API is always performant. Or maybe you know you have a latency issue with a particular request and you need to dig in and find the cause. These scenarios all call for instrumentation that traverses the full stack, giving you a view into a request as it travels through your system.

Here are some common use cases to prioritize:

  • API calls and operations: those that most directly impact your business's bottom line (whether that’s revenue or customer satisfaction)
  • Known performance bottlenecks: frequently called operations that slow down the system and that no one quite knows how to fix
  • Understanding system behavior: parts of the system where you need concrete data about how requests are actually handled

When translating these priorities to code changes, it’s helpful to consider the following:

  • Existing instrumentation: Take advantage of instrumentation you already have
  • Business impact: Use the “80/20” rule: identify the 20% of the code that matters most to the business and add instrumentation there.
  • System coverage and “nexus points”: Look to instrument or enable instrumentation on centralized internal communications libraries or external routing and communication packages like:
    • gRPC
    • Elasticsearch
    • Kafka
    • MongoDB

    These centralized communication hubs reveal a great deal about how the application behaves, for example in the Lightstep Service Diagram.

  • Known areas of unpredictable latency or reliability: Adding instrumentation where you know there may be issues helps to explain and model the variability.
  • Known bottlenecks: Instrumenting database calls, inter-region network activity, and other common bottlenecks shortens mean time to resolution when issues arise.
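
As an illustration of why nexus points pay off, the sketch below instruments a single shared client once so that every caller is covered. Everything here is hypothetical: `traced`, `RECORDED_SPANS`, and `InternalRpcClient` are stand-ins invented for this example, and the in-memory list takes the place of a real tracer that would export spans.

```python
import functools
import time

# Hypothetical in-memory sink; a real tracer would export spans instead.
RECORDED_SPANS = []


def traced(operation):
    """Wrap a function so every call is recorded as a span-like dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {"operation": operation, "error": False}
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                span["error"] = True
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                span["duration_s"] = time.perf_counter() - start
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


class InternalRpcClient:
    """Hypothetical shared client that services use for outbound calls."""

    @traced("rpc.call")
    def call(self, service, method, payload):
        # Placeholder for the real network call.
        return {"service": service, "method": method, "echo": payload}
```

One decorator on the shared client covers every service that uses it, which is usually cheaper than instrumenting each call site separately.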

Auto-Instrument at the Framework Level

Start at the framework with installers that add the tracing logic for you. You can get fairly wide coverage without touching existing code. Auto-installers are available for many languages.

If you’ve already instrumented using OpenTelemetry Collectors, it’s easy to get that instrumentation into Lightstep.

If OpenTelemetry doesn’t currently support the language you need, you can use OpenTracing until it does. Check out our Quick Starts for your language.

If you use Istio and Envoy, auto-instrument your service mesh.

With your framework instrumented, you can immediately see traces in Lightstep. To get a finer-grained view into details important to your business, you can add manual instrumentation to supplement the baseline auto-instrumentation.

Measure Your Code Coverage

Once you have some instrumentation in place, be sure to check out its IQ Score. Lightstep can analyze your instrumentation and recommend ways to improve it. Watch your score go up as you continue to add tracing capabilities to your system.

Much of the IQ score is based on the presence of specific attributes that Lightstep needs for efficient issue mitigation. If there is metadata that you’d like all services to report to Lightstep (like a customer ID or Kubernetes region), you can register the corresponding attributes and Lightstep will check for those when determining the IQ score.
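
To get a feel for what checking for registered attributes involves, the following sketch computes, for each required attribute, the fraction of spans that report it. This is not Lightstep's IQ scoring algorithm, which the text above does not specify. The function name, the span layout, and the attribute names `customer.id` and `k8s.region` are assumptions chosen for illustration.

```python
def attribute_coverage(spans, required):
    """Return, per required attribute, the fraction of spans that carry it."""
    if not spans:
        return {name: 0.0 for name in required}
    return {
        name: sum(1 for span in spans if name in span.get("attributes", {}))
        / len(spans)
        for name in required
    }


# Hypothetical sample of spans collected from a few services.
sample_spans = [
    {"name": "checkout", "attributes": {"customer.id": "c-1", "k8s.region": "us-east"}},
    {"name": "charge-card", "attributes": {"customer.id": "c-1"}},
    {"name": "send-receipt", "attributes": {"customer.id": "c-1"}},
    {"name": "cache-lookup", "attributes": {}},
]

coverage = attribute_coverage(sample_spans, ["customer.id", "k8s.region"])
```

A low fraction for an attribute points to the services that still need to report it.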

Add Instrumentation Directly to Your Services

Once you’ve measured the coverage you get from instrumenting the framework, you’ll likely find specific places in your system that you need to better understand. At this point you’ll want to turn to the OpenTelemetry SDKs and APIs.

Continue adding spans to those areas and recheck your IQ Score until you are satisfied with the coverage. Be sure to add attributes, events, and metrics to get the full breadth of observability.