Tracing Instrumentation Best Practices

Observability gives you the information you need about the health and efficiency of your system. However, systems are large and complex, so the question becomes: what should you measure, and where should you start?

Cloud Observability supports OpenTelemetry as the way to get telemetry data (traces, logs, and metrics) from your app as requests travel through its many services and other infrastructure. If OpenTelemetry doesn’t currently support the language you need, you can use OpenTracing for now and then move to OpenTelemetry when it’s ready. If you use OpenTracing in some services and OpenTelemetry in other services, traces will be correctly connected as long as you use B3 context propagation in all services.
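To make the B3 requirement concrete, here is a minimal, standard-library-only sketch of what B3 multi-header propagation carries between services. The header names come from the B3 specification; the `SpanContext` type and the `inject_b3`, `extract_b3`, and `child_of` helpers are hypothetical names used for illustration and are not part of OpenTelemetry or OpenTracing. In practice you would configure the B3 propagator that ships with your tracing library rather than write this by hand.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SpanContext:
    """Identifiers that must cross a service boundary (illustrative type)."""
    trace_id: str  # hex string shared by every span in the trace
    span_id: str   # hex string unique to this span
    sampled: bool


def inject_b3(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers (B3 multi-header format)."""
    headers["X-B3-TraceId"] = ctx.trace_id
    headers["X-B3-SpanId"] = ctx.span_id
    headers["X-B3-Sampled"] = "1" if ctx.sampled else "0"


def extract_b3(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers, or None if absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if not trace_id or not span_id:
        return None
    # Simplification: a missing sampling header is treated as "not sampled".
    return SpanContext(trace_id, span_id, lowered.get("x-b3-sampled") == "1")


def child_of(parent: SpanContext) -> SpanContext:
    """Continue the caller's trace: same trace ID, fresh span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

If every service injects and extracts the same headers, a span created with `child_of(extract_b3(request_headers))` keeps the caller's trace ID, which is what lets spans from OpenTracing and OpenTelemetry services join into one trace.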

Prioritize where to add tracing

Maybe you want to ensure your most valuable business operations have full coverage so you can monitor them and find issues quickly. Or you want to ensure your most frequently called API is always performant. Or maybe you know you have a latency issue with a particular request and you need to dig in and find the cause. These scenarios all call for instrumentation that traverses the full stack, giving you a view into a request as it travels through your system.

Here are some common use cases to prioritize:

  • API calls and operations: those that most directly impact your business’s bottom line (whether that’s revenue or customer satisfaction)
  • Known performance bottlenecks: operations that are called most frequently and are slowing down the system, where no one quite knows how to address the issue
  • Understanding system behavior: parts of the system where you need concrete data about how requests are actually handled

When translating these priorities to code changes, it’s helpful to consider the following:

  • Business impact: Use the “80/20” rule - consider the 20% of the code that is most important for the business. Add extra instrumentation there.
  • System coverage and “nexus points”: Look to instrument or enable instrumentation on centralized internal communications libraries or external routing and communication packages like:
    • gRPC
    • HTTP/HTTPS
    • Elasticsearch
    • Kafka
    • MongoDB

      These centralized communication hubs reveal a great deal about how the application behaves, for example when you view them in the Cloud Observability Service Diagram.

  • Known areas of unpredictable latency or reliability: Adding instrumentation where you know there may be issues helps to explain and model the variability.
  • Known bottlenecks: Having instrumentation for database calls, inter-region network activity, and other common areas of bottleneck results in a quicker mean time to resolution when issues arise.
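One way to read “nexus points” in code: wrap the shared helper that many call sites go through, so each call produces a span without changing the callers. The sketch below uses only the standard library. `traced`, `SPANS`, and `fetch_user` are hypothetical names, and a real setup would create spans through the OpenTelemetry API or an existing instrumentation package instead of appending dictionaries to a list.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# Stand-in for a tracer's exporter: finished spans collect here (illustration only).
SPANS: List[Dict[str, Any]] = []


def traced(operation: str) -> Callable:
    """Decorator that records one span per call to the wrapped function."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            span: Dict[str, Any] = {"operation": operation, "error": False}
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                span["error"] = True  # flag failures so they are easy to find later
                raise
            finally:
                # Runs on success and on failure, so every call is recorded.
                span["duration_s"] = time.perf_counter() - start
                SPANS.append(span)
        return wrapper
    return decorator


@traced("db.fetch_user")
def fetch_user(user_id: int) -> Dict[str, Any]:
    """Hypothetical shared database helper used by many call sites."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return {"id": user_id}
```

Because the decorator sits on the shared helper, every existing caller of `fetch_user` is covered at once, which is the appeal of instrumenting communication libraries first.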

Auto-instrument at the framework level

Start at the framework with installers that add the tracing logic for you. You can get fairly wide coverage without touching existing code. Auto-installers are available for many languages.

If you’ve already instrumented using OpenTelemetry Collectors, it’s easy to get that instrumentation into Cloud Observability.

As noted above, if OpenTelemetry doesn’t currently support the language you need, you can start with OpenTracing and move to OpenTelemetry later. For now, check out our Quickstarts for your language.

If you use Istio and Envoy, auto-instrument your service mesh.

With your framework instrumented, you can immediately see traces in Cloud Observability. To get a finer-grained view into details important to your business, add manual instrumentation to supplement the baseline auto-instrumentation.

Measure your code coverage

Once you have some instrumentation in place, be sure to check its IQ score. Cloud Observability can analyze your instrumentation and recommend ways to improve it. Watch your score go up as you continue to add tracing capabilities to your system.

Much of the IQ score is based on the presence of specific attributes that Cloud Observability needs for efficient issue mitigation. If there is metadata that you’d like all services to report to Cloud Observability (like a customer ID or Kubernetes region), you can register the corresponding attributes and Cloud Observability will check for those when determining the IQ score.
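Before relying on the score, it can help to check locally that your spans carry the attributes you plan to register. The following is a rough sketch of such a check and is not how Cloud Observability computes the IQ score. The span shape (a dictionary with `service` and `attributes` keys), the `missing_attributes` function, and the attribute names `customer.id` and `k8s.region` are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Set


def missing_attributes(
    spans: Iterable[Dict[str, Any]], required: Set[str]
) -> Dict[str, List[str]]:
    """Map each service to the required attribute keys that some of its spans lack."""
    gaps: Dict[str, Set[str]] = {}
    for span in spans:
        absent = required - set(span.get("attributes", {}))
        if absent:
            gaps.setdefault(span["service"], set()).update(absent)
    # Sort the keys so the report is stable from run to run.
    return {service: sorted(keys) for service, keys in gaps.items()}


# Example input: two services, one of which never reports any attributes.
example_spans = [
    {"service": "checkout", "attributes": {"customer.id": "42", "k8s.region": "us-east"}},
    {"service": "checkout", "attributes": {"customer.id": "43"}},
    {"service": "inventory", "attributes": {}},
]
report = missing_attributes(example_spans, {"customer.id", "k8s.region"})
```

Here `report` lists `k8s.region` for `checkout` (one of its spans omits it) and both keys for `inventory`, which points to where instrumentation still needs attention.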

Add instrumentation directly to your services

Once you’ve measured the coverage you get from instrumenting the framework, you’ll likely find specific places in your system that you need to better understand. At this point you’ll want to turn to the OpenTelemetry SDKs and APIs.

Continue adding spans to those areas and repeat the IQ test until you are satisfied with the coverage. Be sure to add attributes and events to get the full breadth of observability.
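As a minimal sketch of manual instrumentation, the function below wraps one business operation in a span and adds attributes and events to it. It calls only `start_as_current_span`, `set_attribute`, and `add_event`, which are methods on the OpenTelemetry Python tracer and span objects. The `charge_customer` function, the span name, and the attribute and event names are hypothetical. The tracer is passed in as a parameter so the example does not depend on any particular SDK setup.

```python
from typing import Any


def charge_customer(tracer: Any, customer_id: str, amount_cents: int) -> str:
    """Hypothetical business operation wrapped in a manually created span.

    `tracer` is expected to behave like an OpenTelemetry Tracer, for example
    the object returned by opentelemetry.trace.get_tracer(__name__).
    """
    with tracer.start_as_current_span("charge_customer") as span:
        # Attributes describe the whole operation and make spans searchable.
        span.set_attribute("customer.id", customer_id)
        span.set_attribute("charge.amount_cents", amount_cents)

        if amount_cents <= 0:
            # Events mark a point in time inside the span.
            span.add_event("charge.rejected", attributes={"reason": "non-positive amount"})
            return "rejected"

        span.add_event("charge.authorized")
        return "ok"
```

With the `opentelemetry-api` package installed, a call would look like `charge_customer(trace.get_tracer(__name__), "c-1", 500)` after `from opentelemetry import trace`. Without a configured SDK the API falls back to a no-op tracer, so the function still runs but exports nothing.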

See also

Use attributes and log events to find issues fast

Updated Nov 1, 2019