social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology” (Wikipedia)
If you change the tools people use, you can change how they behave and even who they are.
long does it take for code to go live?
3. How many of your deploys fail?
4. How long does it take to recover from an outage?
5. How often are you paged outside work hours?
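A minimal sketch of how the questions above could be answered from deploy and paging records. The `Deploy` record, its field names, and the 9-to-17 weekday definition of "work hours" are all assumptions made for illustration, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    # Hypothetical record shape; adapt to whatever your deploy tooling logs.
    merged_at: datetime                      # change merged to main
    live_at: datetime                        # change serving production traffic
    failed: bool = False                     # deploy failed or was rolled back
    recovered_at: Optional[datetime] = None  # service restored after a failure


def mean(deltas: List[timedelta]) -> timedelta:
    """Average of a list of durations; zero if the list is empty."""
    return sum(deltas, timedelta()) / len(deltas) if deltas else timedelta()


def summarize(deploys: List[Deploy], pages: List[datetime],
              work_start: int = 9, work_end: int = 17) -> dict:
    """Answer the visible questions: lead time, failure rate, recovery, off-hours pages."""
    failures = [d for d in deploys if d.failed]
    return {
        "mean_time_to_live": mean([d.live_at - d.merged_at for d in deploys]),
        "deploy_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mean_time_to_recover": mean(
            [d.recovered_at - d.live_at for d in failures if d.recovered_at]
        ),
        # Assumption: weekends and anything outside work_start..work_end is off-hours.
        "off_hours_pages": sum(
            1 for p in pages
            if p.weekday() >= 5 or not (work_start <= p.hour < work_end)
        ),
    }
```

Tracking these as trends over time is likely more useful than any single snapshot.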
best engineers, and you’ll get the best team”
Hire people who share your values and have the needed skills, and then the work of building a team can begin.
well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals." (Wikipedia)
asking questions from the outside?
Can you debug your code and its behavior using its output?
Can you answer new questions without shipping new code?
o11y for software engineers:
and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
Exploratory, open-ended investigation based on raw events
• Service Level Objectives. No preaggregation.
• Based on arbitrarily-wide structured events with span support
• No indexes, schemas, or predefined structure
• Bundling the full context of the request across network hops
• Metrics != observability. Unstructured logs != observability.
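A minimal sketch of what "arbitrarily-wide structured events with span support" might look like in code: one event per request per service, accumulating context as the request runs and emitted once at the end. The `WideEvent` class, the `handle_request` handler, and all field names are hypothetical, not a real vendor SDK.

```python
import json
import time
import uuid


class WideEvent:
    """One structured event per unit of work; fields can be added freely."""

    def __init__(self, service, trace_id=None, parent_id=None):
        self.fields = {
            "service": service,
            # trace_id is passed along across network hops so events can be joined.
            "trace_id": trace_id or uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent_id,
        }
        self._start = time.monotonic()

    def add(self, **fields):
        # No schema: any key/value that might help a future question is welcome.
        self.fields.update(fields)

    def send(self, emit=print):
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 3)
        emit(json.dumps(self.fields, default=str))


def handle_request(user_id, cart_items, trace_id=None, parent_id=None, emit=print):
    """Hypothetical checkout handler instrumented with a single wide event."""
    event = WideEvent("checkout", trace_id=trace_id, parent_id=parent_id)
    event.add(user_id=user_id, cart_size=len(cart_items))
    try:
        total = sum(price for _, price in cart_items)
        event.add(total_cents=total, status=200)
        return total
    except Exception as exc:
        event.add(status=500, error=repr(exc))
        raise
    finally:
        # Emit exactly one raw event, with no preaggregation.
        event.send(emit)
```

Because each event keeps its raw fields, new questions can be asked later by slicing on any of them, without shipping new code.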
and technical debt
4. Predictable releases
5. Understand user behavior
https://www.honeycomb.io/wp-content/uploads/2019/06/Framework-for-an-Observability-Maturity-Model.pdf
Observability Maturity Model
… find your weakest category, and tackle that first
we keep shipping things anyway
Our tools have rewarded guessing over debugging
And vendors have happily misled you for $$$$
It’s time to fix this problem.
are unreliable symptoms or reports.
Complexity is exploding everywhere, but our tools were designed for predictable worlds.
As soon as we know the question, we usually know the answer too.
it instrumenting two steps in front of you as you build
never accept a PR unless you can explain it if it breaks
watch your code go out as it deploys
is it working as intended?
does anything look weird?
look through the lens of your instrumentation
mostly predictable failures
• Many monitoring checks/paging alerts
• "Flip a switch" to deploy, changes are big bang and binary (all on/all off)
• Failures to be prevented
• Production is to be feared
• Debug by intuition and scar tissue of past outages
• Canned dashboards, runbooks, playbooks
• Deploys are scary
• Masochistic on-call culture
sociotechnical causes & effects
Unknown-unknowns dominate
• Every alert is a novel question
• Rich, flexible instrumentation
• Few paging alerts, tied to SLOs and keying off user pain
• A deploy is just the beginning of gaining confidence in your code
• Failures are your friend
• Production is where your users live, you should be in there too, watching them every day
• Debug methodically by examining the evidence and following the clues
• Inspect the full context of the event
• Deploys are opportunities
• On-call must be sustainable, humane
sociotechnical causes & effects
Microservices
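One way to read "few paging alerts, tied to SLOs and keying off user pain" is to page only when users are failing fast enough to threaten the objective. The sketch below uses the error-budget "burn rate" idea from common SRE practice, which is an assumption brought in for illustration and not stated in the slides. The function names, the 99.9% target, and the threshold of 10 are hypothetical.

```python
def burn_rate(good: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_rate = 1 - good / total   # fraction of user requests that failed
    budget = 1 - slo_target         # fraction of failures the SLO tolerates
    return error_rate / budget


def should_page(good: int, total: int,
                slo_target: float = 0.999, threshold: float = 10.0) -> bool:
    """Page a human only when user pain is burning budget far faster than planned."""
    return burn_rate(good, total, slo_target) >= threshold
```

With a 99% target, 2,000 failures in 10,000 requests burns budget at roughly 20 times the sustainable rate and would page, while 10 failures in 10,000 would not.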
won't be built and run by burned-out, exhausted people, or command-and-control teams just following orders. It can't be done. They've become too complicated. Too hard.
and reason about them -- if we try, we'll be outcompeted by teams who use proper tools. Our systems are emergent and unpredictable. We need more than just your logical brain; we need your full creative self.