should you care? How can you convince others to care? How does a team develop a high-performing practice? Isn't this supposed to be a talk about observability? (Yes!)
you stand. • How frequently do you deploy? • How long does it take for each deploy to go live? • How many of your deploys fail? • How long does it typically take to recover?
their work, communicate with each other, invest in incremental improvements, and are empowered to do their jobs. (Instead of "elite", let's say "excellent"?) Elite teams are made up of normal engineers who: take pride in their craft, care about their users, have time to fix and iterate
changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”
well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals." — wikipedia … translate??!?
your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output? Can you answer new questions without shipping new code?
and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
party, and describes the health of the system and/or its components in aggregate. Observability Describes the world from the perspective of the software, as it performs each request. Softwareexplaining itself back to you from the inside.
are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too.
open-ended • Based on arbitrarily-wide structured events with span support • No indexes, schemas, or predefined structure • About understanding unknown-unknowns with no prior knowledge • About systems, not code. Where in the system is the code you need to fix? • Young. Early. There is much still to be discovered. • Aligned with the user's experience.
of a metrics store • Comprised of pillars (this is shitty vendorspeak) • Achievable with preaggregation. • Achievable without sampling (or infinite money) (at scale) • About the health of the backend or services. • Achievable without instrumentation • Doable without tracing. • Or exclusively about tracing.
rolled out a build with a perf regression, or maybe some app instances are down. DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention. Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why. “Photos are loading slowly for some people. Why?” These are known-unknowns. Monitor for them.
storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model." "Canadian users who are using the French language pack on the iPad running iOS 9, are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly" "Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But flag should be renamed for clarity sake." Monitor for .... ???
and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays." “All twenty microservices have 10% of available nodes enter a crash loop about five times a day, at unpredictable intervals. They have nothing in common and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time. We have tried replacing the instances." “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”
notifications have been down for days. This seems impossible, since we share a queue with them." “Disney is complaining that once in a while, but not always, they don’t see the profile photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed.” “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long time to track down which app or user is generating disproportionate pressure on shared system components. “We run a platform, and it’s hard to programmatically distinguish between errors that users are inflicting on themselves and problems in our code, since they all manifest as errors or timeouts."
and mostly predictable failures • Many monitoring checks • Many paging alerts • "Flip a switch" to deploy • Failures to be prevented • Production is to be feared • Debug by intuition and scar tissue of past outages • Canned dashboards • Deploys are scary • Masochistic on-call culture technical aspects, cultural associations
• Diversity of service types • Unknown-unknowns; every alert is novel • Rich, flexible instrumentation • Few paging alerts • Deployment is like baking • Failures are your friend • Production is where your users live • Debug methodically by examining the evidence • Events and full context, not metrics • Deploys are opportunities • Humane on-call culture
won't be built and run by burned out, exhausted people, or command-and-control teams just following orders. It can't be done. they've become too complicated. too hard.
we have are unreliable reports. Our tools were designed for a predictable world. As soon as we know the question, we usually know the answer too. We have tools that help us ask and answer questions, esp if we define them in advance.
and reason about them -- if we try, we'll be outcompeted by teams who use proper tools. Our systems are emergent and unpredictable. We need more than just your logical brain; we need your full creative self.
not ignored • Oncall is not excessively stressful • Staff turnover is low; no burnout • Outages are frequent. • Spurious alerts • Alert fatigue • Troubleshooting is unpredictable/hard • Repair is unpredictable/time-consuming • Some critical members get fried O11y gives you context and helps you resolve incidents swiftly
support • Debugging is intuitive • No cascading failures • Customer support costs are high • High % of engineering time on bugs • Fear around deploys process • Long time to find and repro bugs • Unpredictable time to solve problems • Low confidence in code when shipped O11y lets you watch deploys, find bugs early
in prod immediately • Code paths turned on/off easily • Deploy/rollback are fast O11y helps you manage your complex build pipeline as well as deploys, so you can ship swiftly and with confidence • Releases are infrequent • Need lots of human intervention • Many changes ship at once • Releases are order-dependent • Sales has to gate releases on promise trai • People avoid doing deploys at times
actual goals • Bugs and reliability are tractable • Easy to find code to fix • Answer any question w/o shipping new code O11y helps you do the right work at the right time • Waste time rebuilding and refactoring • Teams are distracted by fixing the wrong thing or the wrong way • Uncontrollable ripple effects from a local change • "haunted graveyard" where people are afraid to make changes
Easy access to KPIs for devs • Feature flagging • PMs have useful view of customers • Teams share view of reality O11y grounds you in reality. • Product doesn't have their finger on pulse • Devs feel their work doesn't have impact • Features get scope creep • PMF not achieved
encourage curiosity, emphasize ownership. don't punish. get up to your elbows in prod EVERY DAY practice many small failures practice, practice, practice senior engineers : amplify hidden costs