What got you here won't get you there (CodeFreeze 2020)

How your team can become a high-performing team by embracing observability

Charity Majors

January 16, 2020

Transcript

  1. 1.

    What got you here won't get you there. How your team can become a ✨high-performing✨ team by embracing observability. Observability and the Glorious Future
  2. 3.

    What does it mean to be a high-performing team? Why should you care? How can you convince others to care? How does a team develop a high-performing practice? Isn't this supposed to be a talk about observability? (Yes!)
  3. 4.

    Why are computers hard?
    • Because we don't understand them
    • And we keep shipping things anyway
    • You never learned to debug with science
    • Vendors have happily misled you for $$$$
  4. 6.

    You only need to track ✨four things✨ to see where you stand.
    • How frequently do you deploy?
    • How long does it take for each deploy to go live?
    • How many of your deploys fail?
    • How long does it typically take to recover?
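
As a rough illustration of the four measurements above, the sketch below computes them from a list of deploy records. The `Deploy` record, its field names, and the `four_metrics` helper are assumptions invented for this example; they do not come from the talk or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    # Hypothetical record shape; adapt it to whatever your pipeline logs.
    merged_at: datetime                      # change was merged
    live_at: datetime                        # deploy finished rolling out
    failed: bool = False                     # deploy caused a failure
    recovered_at: Optional[datetime] = None  # service was restored


def four_metrics(deploys, window_days):
    """Summarize deploy frequency, time to go live, failure rate, recovery time."""
    if not deploys:
        raise ValueError("need at least one deploy")
    failures = [d for d in deploys if d.failed]
    recoveries = [d.recovered_at - d.live_at
                  for d in failures if d.recovered_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_time_to_live": median([d.live_at - d.merged_at for d in deploys]),
        "failure_rate": len(failures) / len(deploys),
        "median_time_to_recover": median(recoveries) if recoveries else None,
    }


if __name__ == "__main__":
    t0 = datetime(2020, 1, 16, 9, 0)
    history = [
        Deploy(t0, t0 + timedelta(minutes=10)),
        Deploy(t0, t0 + timedelta(minutes=20)),
        Deploy(t0, t0 + timedelta(minutes=30), failed=True,
               recovered_at=t0 + timedelta(minutes=45)),
        Deploy(t0, t0 + timedelta(minutes=40)),
    ]
    print(four_metrics(history, window_days=2))
```

Medians are used here only because they are less sensitive to one unusually slow deploy; the talk does not prescribe a particular statistic.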
  5. 9.

    Excellent teams are made up of engineers who care about their work, communicate with each other, invest in incremental improvements, and are empowered to do their jobs. (Instead of "elite", let's say "excellent"?)
    Elite teams are made up of normal engineers who: take pride in their craft, care about their users, have time to fix and iterate
  6. 10.

    What you need is production excellence. This work begins with observability. Happier customers, happier teams.
  7. 11.

    Every engineering org has two constituencies:
    1. make your users happy
    2. make your team happy
  8. 13.

    Complexity is soaring
    • Ephemeral and dynamic
    • Far-flung and loosely coupled
    • Partitioned, sharded
    • Distributed and replicated
    • Containers, schedulers
    • Service registries
    • Polyglot persistence strategies
    • Autoscaled, multiple failover
    • Emergent behaviors
    • ... etc
  9. 20.

    “Monitoring is dead.”
    “Monitoring systems have not changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”
    @grepory, Monitorama 2016
  10. 21.

    Observability
    “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” (Wikipedia)
    … translate??!?
  11. 22.

    Observability for software engineers:
    • Can you understand what’s happening inside your systems, just by asking questions from the outside?
    • Can you debug your code and its behavior using its output?
    • Can you answer new questions without shipping new code?
  12. 23.

    You have an observable system when your team can quickly and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
  13. 24.

    Monitoring: represents the world from the perspective of a third party, and describes the health of the system and/or its components in aggregate.
    Observability: describes the world from the perspective of the software, as it performs each request. Software explaining itself back to you from the inside.
  14. 25.

    We don’t *know* what the questions are; all we have are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too.
  15. 27.

    Observability is...
    • High cardinality
    • High dimensionality
    • Exploratory, open-ended
    • Based on arbitrarily-wide structured events with span support
    • No indexes, schemas, or predefined structure
    • About understanding unknown-unknowns with no prior knowledge
    • About systems, not code. Where in the system is the code you need to fix?
    • Young. Early. There is much still to be discovered.
    • Aligned with the user's experience.
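
To make "arbitrarily-wide structured events" a little more concrete, here is a minimal sketch in which each request produces one dictionary with as many fields as you care to attach, and a question is just a group-by over any of those fields. The field names (`user_id`, `build_id`, `feature_flag.sequential_queries`) and the `breakdown` helper are illustrative assumptions, not part of the talk or of any specific product.

```python
from collections import defaultdict
from statistics import mean


def breakdown(events, dimension, measure="duration_ms"):
    """Average a measure grouped by any field that appears on the events."""
    groups = defaultdict(list)
    for event in events:
        # Events are schemaless: a missing field is grouped under None.
        groups[event.get(dimension)].append(event[measure])
    return {value: mean(samples) for value, samples in groups.items()}


# One wide event per request; every field name here is illustrative.
events = [
    {"user_id": "u1", "build_id": "b41", "endpoint": "/photos", "duration_ms": 100},
    {"user_id": "u2", "build_id": "b41", "endpoint": "/photos", "duration_ms": 120},
    {"user_id": "u3", "build_id": "b42", "endpoint": "/photos", "duration_ms": 900,
     "feature_flag.sequential_queries": True},
    {"user_id": "u4", "build_id": "b42", "endpoint": "/login", "duration_ms": 80},
]

if __name__ == "__main__":
    print(breakdown(events, "build_id"))
    print(breakdown(events, "feature_flag.sequential_queries"))
```

Because nothing is aggregated ahead of time, a new question (say, breaking down by a flag nobody thought about when the code shipped) needs no new instrumentation, only a different `dimension` argument.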
  16. 28.

    Observability is not...
    • Able to be built on top of a metrics store
    • Comprised of pillars (this is shitty vendorspeak)
    • Achievable with preaggregation
    • Achievable without sampling (or infinite money) (at scale)
    • About the health of the backend or services
    • Achievable without instrumentation
    • Doable without tracing
    • Or exclusively about tracing
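
The talk does not say how to sample, so the following is only one simple scheme, sketched under stated assumptions: keep every error, keep every Nth success, and record a `sample_rate` on each kept event so counts can be re-weighted later. The `sample` and `estimated_count` helpers and the `error` field are hypothetical; real systems often sample randomly or adjust rates dynamically instead of counting.

```python
def sample(events, keep_one_in):
    """Keep every error and every Nth success, tagging each with its weight."""
    kept = []
    successes_seen = 0
    for event in events:
        if event["error"]:
            # Rare, interesting events are always kept and stand for themselves.
            kept.append({**event, "sample_rate": 1})
        else:
            if successes_seen % keep_one_in == 0:
                # This event stands in for keep_one_in similar events.
                kept.append({**event, "sample_rate": keep_one_in})
            successes_seen += 1
    return kept


def estimated_count(kept, error):
    """Estimate the original number of events by summing sample rates."""
    return sum(e["sample_rate"] for e in kept if e["error"] == error)


if __name__ == "__main__":
    traffic = [{"error": False}] * 1000 + [{"error": True}] * 7
    kept = sample(traffic, keep_one_in=10)
    print(len(kept), estimated_count(kept, False), estimated_count(kept, True))
```

Storing the rate on the event itself is the design choice worth noting: events sampled at different rates can then be mixed in the same query and still produce sensible totals.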
  17. 29.

    LAMP stack
    “Photos are loading slowly for some people. Why?”
    • The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down.
    • DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention.
    • Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why.
    These are known-unknowns. Monitor for them.
  18. 30.

    Distributed systems
    "Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model."
    "Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly."
    "Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake."
    Monitor for ... ???
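
A small worked example shows why the first scenario above is hard to catch with aggregates. Assume, purely for illustration, 20 hosts with 5 requests each, where one host is degraded and serves requests 20x slower (2000 ms instead of 100 ms). The host names and timings are made up.

```python
from collections import defaultdict
from statistics import mean, median

# Synthetic data: 1 host in 20 is degraded and 20x slower.
hosts = [f"host-{i:02d}" for i in range(20)]
DEGRADED = "host-07"
requests = [
    {"host": h, "duration_ms": 2000 if h == DEGRADED else 100}
    for h in hosts
    for _ in range(5)
]

# Aggregate view: what a fleet-wide dashboard would show.
durations = [r["duration_ms"] for r in requests]
overall_mean = mean(durations)
overall_median = median(durations)

# Per-request view: break the same data down by a high-cardinality field.
by_host = defaultdict(list)
for r in requests:
    by_host[r["host"]].append(r["duration_ms"])
slowest_host = max(by_host, key=lambda h: mean(by_host[h]))

if __name__ == "__main__":
    print(overall_mean, overall_median, slowest_host)
```

The fleet-wide median stays at 100 ms and the mean is 195 ms, so the aggregate looks only mildly unhealthy while 5% of requests are 20x slower; grouping by host surfaces the outlier immediately.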
  19. 31.

    Distributed systems
    "I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays."
    "All twenty microservices have 10% of available nodes enter a crash loop about five times a day, at unpredictable intervals. They have nothing in common and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time. We have tried replacing the instances."
    "Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us."
  20. 32.

    Distributed systems
    "Users in Romania are complaining that all push notifications have been down for days. This seems impossible, since we share a queue with them."
    "Disney is complaining that once in a while, but not always, they don’t see the profile photo they expected to see; they see someone else’s photo! When they refresh, it’s fixed."
    "Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long time to track down which app or user is generating disproportionate pressure on shared system components."
    "We run a platform, and it’s hard to programmatically distinguish between errors that users are inflicting on themselves and problems in our code, since they all manifest as errors or timeouts."
  21. 33.

    Distributed systems
    These are all unknown-unknowns, which may never have happened before and may never happen again. (Welcome to distributed systems.)
  22. 34.

    LAMP stack: technical aspects, cultural associations
    • THE database
    • THE application
    • Known-unknowns and mostly predictable failures
    • Many monitoring checks
    • Many paging alerts
    • "Flip a switch" to deploy
    • Failures to be prevented
    • Production is to be feared
    • Debug by intuition and scar tissue of past outages
    • Canned dashboards
    • Deploys are scary
    • Masochistic on-call culture
  23. 35.

    LAMP stack
    • Dev/Ops
    • Fragile, forbidding edifice
    • "Glass Castle"
    We have built our systems like glass castles: fragile and forbidding, hostile to exploration and experimentation.
  24. 36.

    Distributed systems: technical aspects, cultural associations
    • Many storage systems
    • Diversity of service types
    • Unknown-unknowns; every alert is novel
    • Rich, flexible instrumentation
    • Few paging alerts
    • Deployment is like baking
    • Failures are your friend
    • Production is where your users live
    • Debug methodically by examining the evidence
    • Events and full context, not metrics
    • Deploys are opportunities
    • Humane on-call culture
  25. 37.

    Distributed systems: best practices
    • Software ownership: you build it, you run it
    • Robust, resilient, built for experimentation and delight
    • Human scale, safety measures baked in
  26. 38.

    Here's the dirty little secret. The next generation of systems won't be built and run by burned out, exhausted people, or command-and-control teams just following orders. It can't be done. They've become too complicated, too hard.
  27. 39.

    We don’t know what the questions actually are, though; all we have are unreliable reports. Our tools were designed for a predictable world. As soon as we know the question, we usually know the answer too. We have tools that help us ask and answer questions, especially if we define them in advance.
  28. 40.

    We can no longer fit these systems in our heads and reason about them; if we try, we'll be outcompeted by teams who use proper tools. Our systems are emergent and unpredictable. We need more than just your logical brain; we need your full creative self.
  29. 41.

    How observability leads to high-performing teams.
    • Resiliency
    • High quality code
    • Predictable releases
    • Manageable complexity and tech debt
    • User behavior
    https://www.honeycomb.io/blog/toward-a-maturity-model-for-observability/
  30. 42.

    Resiliency
    If you're doing well:
    • System uptime meets your goals
    • Alerts are not ignored
    • Oncall is not excessively stressful
    • Staff turnover is low; no burnout
    If you're doing poorly:
    • Outages are frequent
    • Spurious alerts
    • Alert fatigue
    • Troubleshooting is unpredictable/hard
    • Repair is unpredictable/time-consuming
    • Some critical members get fried
    O11y gives you context and helps you resolve incidents swiftly
  31. 43.

    High-quality code
    If you're doing well:
    • Code is stable
    • Customer happiness, not support
    • Debugging is intuitive
    • No cascading failures
    If you're doing poorly:
    • Customer support costs are high
    • High % of engineering time on bugs
    • Fear around the deploy process
    • Long time to find and repro bugs
    • Unpredictable time to solve problems
    • Low confidence in code when shipped
    O11y lets you watch deploys, find bugs early
  32. 44.

    Predictable releases
    If you're doing well:
    • Release cadence matches goals
    • Code goes in prod immediately
    • Code paths turned on/off easily
    • Deploy/rollback are fast
    If you're doing poorly:
    • Releases are infrequent
    • Need lots of human intervention
    • Many changes ship at once
    • Releases are order-dependent
    • Sales has to gate releases on promise trai
    • People avoid doing deploys at times
    O11y helps you manage your complex build pipeline as well as deploys, so you can ship swiftly and with confidence
  33. 45.

    Manageable complexity and tech debt
    If you're doing well:
    • Spend your time on actual goals
    • Bugs and reliability are tractable
    • Easy to find code to fix
    • Answer any question w/o shipping new code
    If you're doing poorly:
    • Waste time rebuilding and refactoring
    • Teams are distracted by fixing the wrong thing or the wrong way
    • Uncontrollable ripple effects from a local change
    • "haunted graveyard" where people are afraid to make changes
    O11y helps you do the right work at the right time
  34. 46.

    Understand user behavior
    If you're doing well:
    • Instrumentation is easy to add
    • Easy access to KPIs for devs
    • Feature flagging
    • PMs have useful view of customers
    • Teams share view of reality
    If you're doing poorly:
    • Product doesn't have their finger on the pulse
    • Devs feel their work doesn't have impact
    • Features get scope creep
    • PMF not achieved
    O11y grounds you in reality.
  35. 50.

    p.s. o11y is also a prerequisite for other modern things, like ...
    • chaos engineering
    • sane deploys
    • testing in production
    and other modern best practices.
  36. 52.

    on call will be shared by everyone who writes code.
    on call must not be miserable.
    (on call will be less like a heart attack, more like dentist visits, or gym appts)
  37. 55.

    build your devs a playground ... but build guard rails
    encourage curiosity, emphasize ownership. don't punish.
    get up to your elbows in prod EVERY DAY
    practice many small failures
    practice, practice, practice
    senior engineers: amplify hidden costs
  38. 57.

    Every engineering org has two constituencies:
    1. make your users happy
    2. make your team happy
  39. 58.

    Corollary:
    1. nines don't matter if users aren't happy
    2. great teams build high-quality systems