While distributed tracing can give you a great picture of the health and efficiency of your system, the information you get is only as good as the information you provide through instrumentation of your app. Too little and you may not find the actual root cause of an issue. Too much, or focusing too far down in the stack, and you'll end up with noise that distracts from the real issues.
So where do you start, and how do you determine the right level of instrumentation?
Maybe you want to ensure your most valuable business operations have full tracing coverage so you can monitor them and find issues quickly. Or you want to ensure your most frequently called API is always performant. Or maybe you know you have a latency issue with a particular request and you need to dig in and find the cause. These scenarios all call for instrumentation that traverses the full stack, giving you a view into a request as it travels through your system.
The following dimensions will help you think about the relative priority of instrumentation targets:
- Impact on endpoints or services that are heavily involved in your important transactions: The closer instrumentation is to your business value, the more meaningful the resulting performance and reliability data will be. Instrument enough of these code components to create a trace along the critical path of your high-value transactions.
- Widely used routing and communication packages: Homegrown RPC subsystems and routing layers reveal a great deal about application semantics and also play a role in propagation across process boundaries.
- Known areas of unpredictable latency or reliability: Adding instrumentation where you know there may be issues helps to explain and model the variability.
- Known bottlenecks: Having instrumentation for database calls, inter-region network activity, and other common areas of bottleneck results in a quicker mean time to resolution when issues arise.
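To illustrate why routing and communication layers matter for propagation across process boundaries, here is a minimal sketch that uses only the Python standard library. The header names and the `inject`, `extract`, and `handle_request` helpers are hypothetical and exist only for this example; a real tracer defines its own wire format and does this work for you.

```python
import uuid

# Hypothetical header names for this sketch; real tracers define their own.
TRACE_HEADER = "x-example-trace-id"
PARENT_HEADER = "x-example-parent-span-id"


def new_id():
    """Return a random 16-hex-character identifier."""
    return uuid.uuid4().hex[:16]


def inject(trace_id, span_id, headers):
    """Client side: copy the current trace context into outgoing headers."""
    headers[TRACE_HEADER] = trace_id
    headers[PARENT_HEADER] = span_id
    return headers


def extract(headers):
    """Server side: read the caller's context, or start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or new_id()
    parent_id = headers.get(PARENT_HEADER)  # None means this is a root span
    return trace_id, parent_id


def handle_request(headers):
    """Pretend server handler that continues the caller's trace."""
    trace_id, parent_id = extract(headers)
    return {"trace_id": trace_id, "span_id": new_id(), "parent_id": parent_id}


# The client creates a span, then calls the server with the context attached.
client_span = {"trace_id": new_id(), "span_id": new_id(), "parent_id": None}
outgoing = inject(client_span["trace_id"], client_span["span_id"], {})
server_span = handle_request(outgoing)
```

Because the server span reuses the caller's trace ID and records the caller's span as its parent, the two spans can be stitched into one trace. If the communication layer drops those headers, the trace breaks at that boundary, which is why these packages are high-priority instrumentation targets.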
Now that you have an idea of where to start, you can think about how.
Already using Jaeger or Zipkin for tracing?
While you could instrument every service, chances are you and your team don't have the bandwidth to make that happen. Instead, start at the framework with components that add the tracing logic for you. You can get fairly wide coverage without touching existing code.
If you use Istio, auto-instrument your service mesh. You'll immediately see traces from service to service. At that point, you can prioritize areas where you'd like to see more detail and add instrumentation there.
With your framework instrumented, you can immediately see traces in LightStep, but they may not provide the granularity you'd like on your more impactful business operations. To get finer-grained control, you can manually add instrumentation to interior calls in your services.
While frameworks can get you most of the way there, you likely have high-value business operations in your system that you want to be sure are running as efficiently as possible. If you find that auto-instrumentation doesn't give you details you need at certain service points, you can use the OpenTracing APIs. Try Quick Start: Use OpenTracing to Instrument Your Code in the language of your choice to get a feel for it and then move on to Add Spans to Create Traces to start connecting everything together.
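As a rough picture of what manual instrumentation of an interior call looks like, the sketch below wraps a business operation and the database call inside it in spans. It deliberately does not use the OpenTracing API: the `span` helper, the `finished_spans` list, and the `checkout` and `load_cart` functions are all hypothetical stand-ins built on the Python standard library, and the sketch assumes single-threaded code.

```python
import time
from contextlib import contextmanager

finished_spans = []   # spans are appended here as they finish
_active = []          # stack of currently open spans (single-threaded sketch)


@contextmanager
def span(name, **tags):
    """Record how long the wrapped block takes, linked to its parent span."""
    record = {
        "name": name,
        "parent": _active[-1]["name"] if _active else None,
        "tags": dict(tags),
    }
    _active.append(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["tags"]["error"] = True  # mark failed operations
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        _active.pop()
        finished_spans.append(record)


def load_cart(user_id):
    # Interior call: a known bottleneck such as a database query.
    with span("db.load_cart", user_id=user_id):
        return ["book", "pen"]  # stand-in for real query results


def checkout(user_id):
    # High-value business operation: the root span of this trace.
    with span("checkout", user_id=user_id) as root:
        items = load_cart(user_id)
        root["tags"]["item_count"] = len(items)
        return items


checkout(42)
```

After the call, `finished_spans` holds the child span for the database call followed by the `checkout` root span, each with its duration and tags. A real tracing library follows the same pattern of naming an operation, tagging it, and nesting it under its parent, and additionally reports the spans to a backend.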
Once you have some instrumentation in place, be sure to check out its IQ Score. LightStep can analyze your instrumentation and recommend ways to improve it. Watch your score go up as you continue to add tracing capabilities to your system.