(Instead of "elite", let's say "excellent"?) Elite teams are made up of normal engineers who: take pride in their craft, care about their users, have time to fix and iterate on their work, communicate with each other, invest in incremental improvements, and are empowered to do their jobs.
changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”
and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
party, and describes the health of the system and/or its components in aggregate. Observability: Describes the world from the perspective of the software, as it performs each request. Software explaining itself back to you from the inside.
open-ended • Based on arbitrarily-wide structured events with span support (sketched below) • No indexes, schemas, or predefined structure • About understanding unknown-unknowns with no prior knowledge • About systems, not code. Where in the system is the code you need to fix? • Young. Early. There is much still to be discovered. • Aligned with the user's experience.
of a metrics store • Comprised of pillars (this is shitty vendorspeak) • Achievable with preaggregation. • Achievable without sampling (or infinite money) (at scale) • About the health of the backend or services. • Achievable without instrumentation • Doable without tracing. • Or exclusively about tracing.
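To make "arbitrarily-wide structured events with span support" concrete, here is a minimal sketch in Python. Everything in it (the emit() sink, the do_work callback, the field names) is invented for illustration; the point is one wide event per request, with no predefined schema and no pre-aggregation.

```python
import json, time, uuid

def emit(event):
    # Illustrative sink: a real system would ship this to an event store;
    # here we just print one JSON blob per request.
    print(json.dumps(event))

def handle_request(path, user_id, app_id, do_work):
    # One wide, structured event per request: every field we might later
    # want to slice by, high-cardinality or not.
    event = {
        "trace_id": str(uuid.uuid4()),   # span support: child spans reference this id
        "timestamp": time.time(),
        "endpoint": path,
        "user_id": user_id,              # high-cardinality fields are welcome
        "app_id": app_id,
        "region": "us-east-1b",
        "build_id": "abc123",
        "feature_flags": {"sequential_db_queries": False},
        "db_query_count": 0,
    }
    start = time.monotonic()
    try:
        event["status"] = do_work(event)  # the work function annotates the event as it goes
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        emit(event)                       # no pre-aggregation, no predefined schema

handle_request("/photos/archive", "user-42", "app-7", do_work=lambda ev: 200)
```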
rolled out a build with a perf regression, or maybe some app instances are down. DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention. Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why. “Photos are loading slowly for some people. Why?” These are known-unknowns. Monitor for them.
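For contrast, a toy sketch of what "monitor for them" looks like when the questions and thresholds are known in advance. The metric names and thresholds are made up for illustration:

```python
def check_known_unknowns(metrics, page):
    # Classic monitoring: predefined questions with predefined thresholds.
    # This only catches failure modes someone anticipated in advance.
    if metrics["error_rate"] > 0.01:
        page("error rate above 1% -- bad deploy? instances down?")
    if metrics["p95_latency_ms"] > 500:
        page("p95 latency above 500ms -- slow query? lock contention?")
    if metrics["healthy_app_instances"] < 3:
        page("fewer than 3 healthy app instances")

check_known_unknowns(
    {"error_rate": 0.002, "p95_latency_ms": 820, "healthy_app_instances": 5},
    page=lambda msg: print("PAGE:", msg),
)
```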
storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model." "Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly" "Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake." Monitor for .... ???
and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays." “All twenty microservices have 10% of available nodes enter a crash loop about five times a day, at unpredictable intervals. They have nothing in common and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time. We have tried replacing the instances." “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”
notifications have been down for days. This seems impossible, since we share a queue with them." “Disney is complaining that once in a while, but not always, they don’t see the profile photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed.” “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long time to track down which app or user is generating disproportionate pressure on shared system components.” “We run a platform, and it’s hard to programmatically distinguish between errors that users are inflicting on themselves and problems in our code, since they all manifest as errors or timeouts."
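Stories like these can't be turned into checks ahead of time; you find them by slicing raw events along whatever dimension turns out to matter. A rough sketch of that kind of after-the-fact group-by, assuming wide events like the ones sketched earlier (field names are illustrative, not prescribed):

```python
from collections import defaultdict
from statistics import median

def breakdown(events, by, value="duration_ms"):
    # Ad-hoc group-by over raw events: pick any field after the fact,
    # with no index or schema decided in advance.
    groups = defaultdict(list)
    for ev in events:
        groups[ev.get(by)].append(ev[value])
    return {key: (len(vals), median(vals)) for key, vals in groups.items()}

# Which availability zone, language pack, or flag explains the slowness?
# Each question is asked only after the symptom shows up, not predefined.
events = [
    {"az": "us-east-1b", "lang": "fr", "debug_mode": True,  "duration_ms": 2400},
    {"az": "us-east-1a", "lang": "en", "debug_mode": False, "duration_ms": 110},
    {"az": "us-east-1b", "lang": "en", "debug_mode": False, "duration_ms": 2600},
]
print(breakdown(events, by="az"))          # us-east-1b requests are ~20x slower
print(breakdown(events, by="debug_mode"))  # slice again along a different dimension
```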
and mostly predictable failures • Many monitoring checks • Many paging alerts • "Flip a switch" to deploy • Failures to be prevented • Production is to be feared • Debug by intuition and scar tissue of past outages • Canned dashboards • Deploys are scary • Masochistic on-call culture (technical aspects, cultural associations)
• Diversity of service types • Unknown-unknowns; every alert is novel • Rich, flexible instrumentation • Few paging alerts • Deployment is like baking • Failures are your friend • Production is where your users live • Debug methodically by examining the evidence • Events and full context, not metrics • Deploys are opportunities • Humane on-call culture
we have are unreliable reports. Our tools were designed for a predictable world. As soon as we know the question, we usually know the answer too. We have tools that help us ask and answer questions, especially if we define them in advance.
and reason about them -- if we try, we'll be outcompeted by teams who use proper tools. Our systems are emergent and unpredictable. We need more than just your logical brain; we need your full creative self.
not ignored • Oncall is not excessively stressful • Staff turnover is low; no burnout • Outages are frequent. • Spurious alerts • Alert fatigue • Troubleshooting is unpredictable/hard • Repair is unpredictable/time-consuming • Some critical members get fried. O11y gives you context and helps you resolve incidents swiftly
support • Debugging is intuitive • No cascading failures • Customer support costs are high • High % of engineering time on bugs • Fear around the deploy process • Long time to find and repro bugs • Unpredictable time to solve problems • Low confidence in code when shipped. O11y lets you watch deploys, find bugs early
in prod immediately • Code paths turned on/off easily (sketched below) • Deploy/rollback are fast. O11y helps you manage your complex build pipeline as well as deploys, so you can ship swiftly and with confidence • Releases are infrequent • Need lots of human intervention • Many changes ship at once • Releases are order-dependent • Sales has to gate releases on promise train • People avoid doing deploys at times
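A hedged sketch of "code paths turned on/off easily": a feature flag guarding a new code path, with the flag state recorded on the request event so the rollout can be watched (and instantly rolled back) in production. The flag name and loader functions are hypothetical:

```python
FLAGS = {"new_photo_loader": False}   # illustrative in-memory flag store

def old_loader(user_id): return f"old-path:{user_id}"
def new_loader(user_id): return f"new-path:{user_id}"

def load_photos(user_id, event):
    # Ship the new path dark, flip it on without a deploy, and record the
    # flag state on the request event so the rollout can be watched in prod.
    use_new = FLAGS["new_photo_loader"]
    event["feature_flags"] = {"new_photo_loader": use_new}
    return new_loader(user_id) if use_new else old_loader(user_id)

event = {}
print(load_photos("user-42", event), event)
FLAGS["new_photo_loader"] = True      # "flip a switch" -- instant enable or rollback
print(load_photos("user-42", event), event)
```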
actual goals • Bugs and reliability are tractable • Easy to find code to fix • Answer any question w/o shipping new code. O11y helps you do the right work at the right time • Waste time rebuilding and refactoring • Teams are distracted by fixing the wrong thing or the wrong way • Uncontrollable ripple effects from a local change • "haunted graveyard" where people are afraid to make changes
Easy access to KPIs for devs • Feature flagging • PMs have useful view of customers • Teams share view of reality. O11y grounds you in reality. • Product doesn't have their finger on the pulse • Devs feel their work doesn't have impact • Features get scope creep • PMF not achieved