Observability gives you the information you need about the health and efficiency of your system. But systems are large and complex, so what should you measure, and where should you start?
Prioritize Where to Add Tracing
Maybe you want to ensure your most valuable business operations have full coverage so you can monitor them and find issues quickly. Or you want to ensure your most frequently called API is always performant. Or maybe you know you have a latency issue with a particular request and you need to dig in and find the cause. These scenarios all call for instrumentation that traverses the full stack, giving you a view into a request as it travels through your system.
Here are some common use cases to prioritize:
- API calls and operations: those that most directly impact your business's bottom line (whether that’s revenue or customer satisfaction)
- Known performance bottlenecks: frequently called operations that slow down the system and that no one quite knows how to fix
- Understanding system behavior: parts of the system where you need concrete data about how requests are actually handled
When translating these priorities to code changes, it’s helpful to consider the following:
- Existing instrumentation: Take advantage of instrumentation you already have.
- Business impact: Apply the “80/20” rule. Consider the 20% of the code that is most important for the business and add instrumentation there.
- System coverage and “nexus points”: Look to instrument, or enable instrumentation on, centralized internal communications libraries and external routing and communication packages. These centralized communication hubs reveal a great deal about how the application behaves, for example in the Lightstep Service Diagram.
- Known areas of unpredictable latency or reliability: Adding instrumentation where you know there may be issues helps to explain and model the variability.
- Known bottlenecks: Having instrumentation for database calls, inter-region network activity, and other common areas of bottleneck results in a quicker mean time to resolution when issues arise.
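One way to make the business-impact guideline concrete is to rank operations by a measure you already track and pick the smallest set that accounts for most of it. The sketch below is only an illustration: the function name `pick_instrumentation_targets`, the operation names, and the revenue figures are all hypothetical, and you could substitute call volume or error counts for revenue.

```python
from typing import Dict, List


def pick_instrumentation_targets(impact: Dict[str, float], coverage: float = 0.8) -> List[str]:
    """Return the highest-impact operations that together reach `coverage` of the total."""
    total = sum(impact.values())
    if total <= 0:
        return []
    targets: List[str] = []
    running = 0.0
    # Walk operations from most to least impactful and stop once enough is covered.
    for name, value in sorted(impact.items(), key=lambda kv: kv[1], reverse=True):
        if running >= coverage * total:
            break
        targets.append(name)
        running += value
    return targets


# Hypothetical operations with hypothetical weekly revenue attributed to each.
weekly_revenue = {
    "POST /checkout": 52000.0,
    "GET /search": 31000.0,
    "GET /product": 9000.0,
    "POST /review": 5000.0,
    "GET /help": 3000.0,
}

print(pick_instrumentation_targets(weekly_revenue))
```

With these made-up numbers, two of the five operations cover more than 80% of the revenue, so they would be the first candidates for full-stack tracing.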
Auto-Instrument at the Framework Level
Start at the framework with installers that add the tracing logic for you. You can get fairly wide coverage without touching existing code. Auto-installers are available for many languages. You can use either our OpenTracing-based installers or OpenTelemetry.
OpenTelemetry is the unified initiative that carries the best of both OpenTracing and OpenCensus forward. Think of it as the evolution of those two projects.
If you use Istio and Envoy, auto-instrument your service mesh.
With your framework instrumented, you can immediately see traces in Lightstep. To get a finer-grained view into the details important to your business, add manual instrumentation to supplement the baseline auto-instrumentation.
Measure Your Code Coverage
Once you have some instrumentation in place, be sure to check out its IQ Score. Lightstep can analyze your instrumentation and recommend ways to improve it. Watch your score go up as you continue to add tracing capabilities to your system.
Add Instrumentation Directly to Your Services
Once you’ve measured the coverage you get from instrumenting the framework, you’ll likely find specific places in your system that you need to understand better. At this point, turn to the OpenTelemetry or OpenTracing APIs. Continue adding spans to those areas and repeat our IQ Test until you are satisfied with the coverage. Be sure to add tags, logs, and metrics to get the full breadth of observability.
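To show the general shape of manual instrumentation, here is a dependency-free sketch of wrapping a unit of work in a span, attaching tags and logs, and nesting a child span. It is not the OpenTelemetry or OpenTracing API: `Span`, `start_span`, `FINISHED_SPANS`, and `charge_card` are hypothetical stand-ins, and the module-level "current span" assumes single-threaded code. In a real service you would use the tracer provided by whichever API you chose.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List, Optional


class Span:
    """Hypothetical stand-in for a tracing span: a named, timed unit of work."""

    def __init__(self, name: str, parent: Optional["Span"] = None) -> None:
        self.name = name
        self.parent = parent
        self.tags: Dict[str, Any] = {}
        self.logs: List[str] = []
        self.duration_s: Optional[float] = None

    def set_tag(self, key: str, value: Any) -> None:
        self.tags[key] = value

    def log(self, message: str) -> None:
        self.logs.append(message)


# In-memory stand-in for a tracer backend; a real tracer would export spans.
FINISHED_SPANS: List[Span] = []
_current: Optional[Span] = None  # assumption: single-threaded code


@contextmanager
def start_span(name: str) -> Iterator[Span]:
    """Open a span as a child of the current one and record it when it ends."""
    global _current
    span = Span(name, parent=_current)
    previous, _current = _current, span
    started = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        # Mark failures so they are easy to find later.
        span.set_tag("error", True)
        span.log(f"exception: {exc!r}")
        raise
    finally:
        span.duration_s = time.perf_counter() - started
        _current = previous
        FINISHED_SPANS.append(span)


def charge_card(order_id: str, amount_cents: int) -> str:
    """Hypothetical business operation with manual spans, tags, and logs."""
    with start_span("charge_card") as span:
        span.set_tag("order.id", order_id)
        span.set_tag("amount.cents", amount_cents)
        with start_span("call_payment_gateway") as child:
            child.log("request sent to gateway")
            receipt = f"receipt-{order_id}"
        span.log("charge succeeded")
        return receipt
```

Calling `charge_card("42", 1999)` leaves two entries in `FINISHED_SPANS`, the child first and then its parent. That nesting is what lets a trace show where the time inside an operation went.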