Coding For Observability

The single constant in software development is change. New code needs to be written, old code needs to be refactored, features need to be developed or refined. Then, the changes need to land in production, or be passed into your customers’ hands. For many, this means deployment to a production environment (or environments) and the age-old question “does it work?”

Observability aims to provide you with an understanding of the system by using its outputs–telemetry–to answer questions about the system’s current state and function. 

Fundamentally, the two questions most often raised after a software change are “did it break something?” and “is it working?”. Unfortunately, while they may be nearly universal, those questions end up being too vague to answer quickly and meaningfully; they are often based on a set of poorly understood assumptions about both the change and the system. Waiting until your system is on fire to try to establish expectations about its behavior is not fun.

Documenting the system behaviors that are expected to change along with a particular code change is a helpful step, as it allows both the development team and any reviewers to think more effectively about code changes in context. This also provides an opportunity to add observability for anything the change is expected to affect, and to validate assumptions about the current behavior of the system.
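In practice this can be as lightweight as a short block in the change description. The template below is purely illustrative; the exact format matters far less than the habit:

```
Expected to change:
  - <signal>: <current behavior> -> <expected behavior after deploy>
Expected to stay the same:
  - <signal>: <current behavior>
Telemetry to add first:
  - <any signal above that isn’t currently being collected>
```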

As an example, consider a system that reads photos from a cloud object store, performs some simple transformations (Transformations A and B) on them, and stores the transformed photos in a different object store along with metadata. Now suppose a new feature request means that a third transformation (C) will need to be performed. To make that change safely, one would first need to know how long each of the current transformations takes, how often they happen, how often they fail, and the total time from a photo becoming available to all the transformations and metadata extraction being complete.

With that baseline, the developer can form concrete expectations: adding a new transformation should increase the total time needed to complete the transformations, which likely means a longer gap between the time the original photo is retrieved from the cloud and the time it is uploaded to the new object store. Perhaps most importantly, the count of transformations of the new type, C, should be zero before the change is deployed and nonzero after, while the counts for the existing types, A and B, should stay the same.
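To make that concrete, here is a minimal sketch of what that instrumentation might look like, using the prometheus_client Python library. The transformation functions and the upload step are hypothetical placeholders standing in for the real work:

```python
from prometheus_client import Counter, Histogram

TRANSFORM_COUNT = Counter(
    "photo_transforms_total",
    "Transformations performed, by type and outcome",
    ["transform", "outcome"],
)
TRANSFORM_SECONDS = Histogram(
    "photo_transform_duration_seconds",
    "Time spent in each transformation, by type",
    ["transform"],
)
PIPELINE_SECONDS = Histogram(
    "photo_pipeline_duration_seconds",
    "Total time from photo retrieval to final upload",
)

# Hypothetical placeholders; the real functions would do image work and I/O.
def transform_a(photo): return photo
def transform_b(photo): return photo
def upload_with_metadata(photo): pass

def run_transform(name, fn, photo):
    # Count and time every transformation so its rate, latency, and failure
    # rate are known before a change and comparable after it.
    with TRANSFORM_SECONDS.labels(transform=name).time():
        try:
            result = fn(photo)
        except Exception:
            TRANSFORM_COUNT.labels(transform=name, outcome="error").inc()
            raise
    TRANSFORM_COUNT.labels(transform=name, outcome="ok").inc()
    return result

def process_photo(photo):
    with PIPELINE_SECONDS.time():
        photo = run_transform("A", transform_a, photo)
        photo = run_transform("B", transform_b, photo)
        # When transformation C lands, wrap it the same way: its count
        # should go from zero to nonzero while A and B stay level.
        upload_with_metadata(photo)
```

With these counters and histograms exported, the rates, latencies, and failure counts described above can be read directly off a dashboard before the change ever ships.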

With these expectations in place, if the change is deployed and the total time spent on photo transformation suddenly drops, there’s probably something wrong…and you should roll back immediately. Likewise, if the counts for transformations A and B stop being consistent with their values before the change, you should roll back.
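That comparison can even be automated as part of the rollout. The sketch below assumes a hypothetical get_rate() helper that queries your metrics backend (for example, via a query against the counters above); the 10% tolerance is likewise just an illustration:

```python
def verify_deploy(get_rate, baseline):
    """Check post-deploy metrics against pre-deploy expectations."""
    # Expectation 1: the new transformation is actually running.
    assert get_rate("C") > 0, "transformation C never ran; roll back"
    # Expectation 2: existing transformations are unaffected (within 10%).
    for name in ("A", "B"):
        before, after = baseline[name], get_rate(name)
        assert abs(after - before) <= 0.10 * before, (
            f"transformation {name} rate moved from {before} to {after}; "
            "roll back"
        )
```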

When practicing this approach to software development, expectations will be met on the first try more and more often, because assumptions will have been expressed, qualified, and validated. Failure cases will have been considered and accounted for. Systems will be observable and observed.

Conclusion

Coding for Observability boils down to this basic guidance: if you’re changing part of the system that you don’t currently have trustworthy observations for, add telemetry for that first. With information about how the system is currently functioning, document specific expectations for how the system will change when the code change is deployed. After the code is deployed, verify those expectations were met. If they weren’t, roll back and check where the misunderstanding came from. With consistent practice, production deployments become simple observed experiments instead of chaotic and stressful guessing games.

Updated Feb 14, 2020