Observability in Software Development

Because you can't know all of your software’s failure modes in advance, you want to be able to figure out what’s going on just by looking at the outputs: you want observability.

Observability is the degree to which the internal states of a system can be determined from its external outputs.

“Observability” is the hot new tech buzzword. But is this actually a new concept, separate from monitoring? Or is it just a fancy new term? Today, we’ll be explaining observability: what it is, how it differs from monitoring and alerting, and why you should care.

One of the benefits of working with older technologies was the limited set of defined failure modes. Yes, things broke, but you would pretty much know what broke at any given time, or you could find out quickly, because a lot of older systems failed in pretty much the same three ways over and over again.

As systems became more complex, the possible failures became more abundant. To try to fix this problem, we created monitoring tools to help us figure out what was going on in the guts of our software. We kept track of our application performance with monitoring data collection and time series analytics. This process was manageable for a while, but it quickly got out of hand.

Modern systems — with everything turning into open-source cloud-native microservices running on Kubernetes clusters — are extraordinarily complex. Further, they’re now being developed faster than ever. Between CI/CD, DevOps, agile development, and progressive delivery, the software delivery train is speeding up.

With these complex, distributed systems being developed at the speed of light, the possible failure modes are multiplying. When something fails, it’s no longer obvious what caused it. We cannot keep up with this by simply building better applications. Nothing is perfect, everything fails at some point, and the best thing we can do as developers is to make sure that when our software fails, it’s as easy as possible for us to fix it.

Unfortunately, many modern developers don’t know what their software’s failure modes are. In many cases, there are just too many. Further, sometimes we don’t even know that we don’t know. And this is dangerous. Unknown unknowns mean you won’t put effort into fixing the problem, because you don’t know it exists.

Standard monitoring — the kind that happens after release — cannot fix this problem. It can only track known unknowns. Tracking KPIs is only as useful as the KPIs themselves are relevant to the failure they’re trying to detect. Reporting performance is only as useful as that reporting accurately represents the internal state of your system. Your monitoring is only as useful as your system is monitor-able.

This concept of monitor-able-ness has a name: observability.

Implementing Observability

The key tools for implementing observability are metrics, logs, and tracing.

Metrics are a central part of any monitoring process, but even when you have the right ones, you’re limited by hindsight: people decide on metrics based on failures they’ve already found and fixed in the past. But there may be unknown unknowns: failures you haven’t seen before, and therefore can’t anticipate. Preemptively checking your metrics to find patterns is an option, but this isn’t a replacement for being able to come back quickly from a failure. In short, metrics are necessary, but not sufficient.
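As a rough illustration of what "constantly tracked" metrics can look like in code, here is a minimal in-memory sketch. The `Metrics` class, the metric names, and `handle_request` are all hypothetical names invented for this example; a real system would normally export these values to a metrics backend rather than keep them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """In-memory metric store: increasing counters plus raw timings."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even when the wrapped block raises.
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def handle_request(payload):
    """Hypothetical request handler instrumented with counters and a timer."""
    metrics.incr("requests_total")
    with metrics.timer("request_seconds"):
        if "user" not in payload:
            metrics.incr("requests_failed")
            raise ValueError("missing user")
        return {"ok": True}
```

Note that `requests_failed` only counts the one failure the author thought of in advance. Any other way the handler can break shows up, at best, as a slower `request_seconds`, which is the limitation described above.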

While metrics should be constantly tracked, you only look at logs when your metrics are showing something strange you’d like to investigate. They’re more specific and detailed than metrics, and they exist to show you what happened in each event. Having understandable, queryable, comprehensive logs is a significant component of what separates the observable from the non-observable system.
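One common way to make logs queryable is to emit each event as a structured record instead of free text. The sketch below uses Python's standard `logging` module with a small JSON formatter. The `JsonFormatter` class, the `context` key, the `checkout` logger name, and the order fields are assumptions made up for this example, not part of any particular product.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record; the
        # `context` key is a convention of this sketch, not of `logging`.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(stream):
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = make_logger(stream)
log.info("payment declined",
         extra={"context": {"order_id": 1234, "reason": "card_expired"}})
log.info("payment accepted", extra={"context": {"order_id": 1235}})

# "Querying" here is simply filtering parsed lines by field.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
declined = [e for e in events if e.get("reason") == "card_expired"]
```

Because every event carries named fields, investigating a strange metric becomes a filter over `order_id` or `reason` rather than a text search through prose messages.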

Tracing is really just a type of logging that’s designed to record the flow of a program’s execution. Typically, tracing is more granular than standard logging: while logs may say that a program installation failed, a trace will show you the specific exception that was thrown and when during the runtime it happened. Tracing is frequently used to detect latency issues or find out which of many microservices is not working. It’s especially useful for error detection in distributed systems, to such an extent that this use case has its own name: distributed tracing.
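To make the installation example concrete, here is a minimal tracing sketch built only on the Python standard library. The `span` helper, the record fields, and the `install` function are hypothetical and exist only for illustration. Production systems would typically use a dedicated tracing library, and in a distributed setup the trace ID would also have to be passed between services.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar

_current_span = ContextVar("current_span", default=None)
finished_spans = []  # stand-in for a trace backend


@contextmanager
def span(name):
    """Record one step of execution, linked to whichever span encloses it."""
    parent = _current_span.get()
    record = {
        "name": name,
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "error": None,
    }
    token = _current_span.set(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)  # keep the specific exception
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        _current_span.reset(token)
        finished_spans.append(record)


def install():
    """Hypothetical installer whose second step fails."""
    with span("install"):
        with span("download"):
            pass
        with span("unpack"):
            raise OSError("disk full")
```

After a failed `install()` call, a plain log might only say that installation failed, while `finished_spans` shows that `download` succeeded, that `unpack` raised `OSError('disk full')`, and how long each step took.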

The biggest problem with all logging, including tracing, is that the volume of data to store becomes prohibitive, fast. Sampling is one option, as implemented in Google’s Dapper project, but it’s not a perfect solution. For one thing, sampling is not simple: different logs may need to be sampled in different ways, and the sampling strategy will need to change over time. For another, sampling is too rigid for some use cases. So while it may be tempting to be like Google, using Google’s development processes is only reasonable for companies on the same order of magnitude as Google; if you’re smaller, you have much more flexibility.
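For a sense of what a sampling decision involves, here is one simple possibility: keep a fixed fraction of traces, chosen by hashing the trace ID, and always keep failures. This is an illustrative sketch, not the scheme Dapper uses. The `keep_trace` function and the "always keep errors" rule are assumptions made for this example.

```python
import hashlib


def keep_trace(trace_id, sample_rate, is_error=False):
    """Decide whether to store a trace.

    Hashing the trace ID makes the decision deterministic, so every service
    that sees the same ID reaches the same verdict without coordination.
    """
    if is_error:
        return True  # assumption of this sketch: failures are always kept
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64
```

Even this small function hints at the difficulty described above: `sample_rate` is a single global knob, while in practice different kinds of events may deserve different rates, and those rates may need to change as traffic grows.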

Different companies implement observability differently. Some track dozens of metrics and some track only a few; some keep all their logs and some down-sample them aggressively. Which solution works for you depends heavily on your company, your system, and your resources. But one thing is clear: observability is a real thing, it’s important, and systems that implement it from the get-go will be uniquely situated to spring back quickly from failure when it happens.
