Slide 1

Slide 1 text

Observability and the Glorious Future. What got you here won't get you there: how your team can become a ✨high-performing✨ team by embracing observability.

Slide 2

Slide 2 text

@mipsytipsy engineer/cofounder/CTO https://charity.wtf “the only good diff is a red diff”

Slide 3

Slide 3 text

What does it mean to be a high-performing team? Why should you care? How can you convince others to care? How does a team develop a high-performing practice? Isn't this supposed to be a talk about observability? (Yes!)

Slide 4

Slide 4 text

Why are computers hard? • Because we don't understand them • And we keep shipping things anyway • You never learned to debug with science • Vendors have happily misled you for $$$$

Slide 5

Slide 5 text

We have ✨science✨ now! What does it mean to be a high-performing team?

Slide 6

Slide 6 text

You only need to track ✨four things✨ to see where you stand. • How frequently do you deploy? • How long does it take for each deploy to go live? • How many of your deploys fail? • How long does it typically take to recover?
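The four questions above can be computed mechanically from a deploy log. A minimal sketch, assuming a hypothetical record format of (commit time, deploy time, failed?, minutes to recover); the field names and data are illustrative, not from any real pipeline:

```python
from datetime import datetime

# Hypothetical deploy records: (commit_time, deploy_time, failed, recovery_minutes)
deploys = [
    (datetime(2023, 5, 1, 9), datetime(2023, 5, 1, 11), False, None),
    (datetime(2023, 5, 2, 10), datetime(2023, 5, 2, 12), True, 45),
    (datetime(2023, 5, 3, 9), datetime(2023, 5, 3, 10), False, None),
    (datetime(2023, 5, 4, 14), datetime(2023, 5, 4, 15), False, None),
]
days_observed = 4

# 1. How frequently do you deploy? (deploys per day)
frequency = len(deploys) / days_observed

# 2. How long does each deploy take to go live? (mean hours, commit -> live)
lead_hours = sum((d[1] - d[0]).total_seconds() / 3600 for d in deploys) / len(deploys)

# 3. How many of your deploys fail? (change failure rate)
failure_rate = sum(1 for d in deploys if d[2]) / len(deploys)

# 4. How long does it typically take to recover? (mean minutes, failed deploys only)
recoveries = [d[3] for d in deploys if d[2]]
mttr_minutes = sum(recoveries) / len(recoveries)

print(frequency, lead_hours, failure_rate, mttr_minutes)
```

These are the four DORA-style keys; tracking them from real deploy metadata is usually a small script like this, not a product purchase.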

Slide 7

Slide 7 text

It really, really, really, really, really pays off to be a high performer. Really.

Slide 8

Slide 8 text

Elite teams are made up of all ex-Facebook, ex-Google, MIT grads...

Slide 9

Slide 9 text

(Instead of "elite", let's say "excellent"?) Elite teams are made up of normal engineers who take pride in their craft, care about their users, and have time to fix and iterate. Excellent teams are made up of engineers who care about their work, communicate with each other, invest in incremental improvements, and are empowered to do their jobs.

Slide 10

Slide 10 text

What you need is production excellence. This work begins with observability. Happier customers, happier teams.

Slide 11

Slide 11 text

Every engineering org has two constituencies: 1. make your users happy 2. make your team happy

Slide 12

Slide 12 text

The world is changing fast.

Slide 13

Slide 13 text

Complexity is soaring • Ephemeral and dynamic • Far-flung and loosely coupled • Partitioned, sharded • Distributed and replicated • Containers, schedulers • Service registries • Polyglot persistence strategies • Autoscaled, multiple failover • Emergent behaviors • ... etc

Slide 14

Slide 14 text

Architectural complexity: 2003 vs. 2013

Slide 15

Slide 15 text

We are bad at understanding our systems.

Slide 16

Slide 16 text

Tools for understanding them: monitoring is for known-unknowns; observability is for unknown-unknowns.

Slide 17

Slide 17 text

("understand" lol)

Slide 18

Slide 18 text

Observability is NOT the same as monitoring.

Slide 19

Slide 19 text

What's the difference between monitoring and observability?

Slide 20

Slide 20 text

@grepory, Monitorama 2016 “Monitoring is dead.” “Monitoring systems have not changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”

Slide 21

Slide 21 text

Observability “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals." — Wikipedia … translate??!?

Slide 22

Slide 22 text

Observability for software engineers: Can you understand what’s happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output? Can you answer new questions without shipping new code?

Slide 23

Slide 23 text

You have an observable system when your team can quickly and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
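"Answer new questions without shipping new code" can be made concrete: if every request emits a wide structured event, a brand-new question is just a new slice over fields you already captured. A minimal sketch with made-up events and field names (the question "which iOS version sees slow photo loads?" is hypothetical):

```python
from collections import defaultdict

# Hypothetical stored request events, one wide structured event per request.
events = [
    {"endpoint": "/photos", "os": "iOS 9", "duration_ms": 900},
    {"endpoint": "/photos", "os": "iOS 10", "duration_ms": 120},
    {"endpoint": "/photos", "os": "iOS 9", "duration_ms": 1100},
    {"endpoint": "/login", "os": "iOS 9", "duration_ms": 80},
]

# A new question, asked after the fact: group photo-load latency by OS version.
# No new instrumentation was shipped; "os" was already on every event.
by_os = defaultdict(list)
for e in events:
    if e["endpoint"] == "/photos":
        by_os[e["os"]].append(e["duration_ms"])

avg = {os: sum(v) / len(v) for os, v in by_os.items()}
print(avg)  # reveals the slow cohort without touching production code
```

The point is not the toy query engine; it is that the dimension you need tomorrow must already be on the event today, which is why events are wide.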

Slide 24

Slide 24 text

Monitoring represents the world from the perspective of a third party, and describes the health of the system and/or its components in aggregate. Observability describes the world from the perspective of the software, as it performs each request. Software explaining itself back to you from the inside.

Slide 25

Slide 25 text

We don’t *know* what the questions are, all we have are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too.

Slide 26

Slide 26 text

Many catastrophic states exist at any given time. Your system is never entirely ‘up’

Slide 27

Slide 27 text

Observability is... • High cardinality • High dimensionality • Exploratory, open-ended • Based on arbitrarily-wide structured events with span support • No indexes, schemas, or predefined structure • About understanding unknown-unknowns with no prior knowledge • About systems, not code. Where in the system is the code you need to fix? • Young. Early. There is much still to be discovered. • Aligned with the user's experience.
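What an "arbitrarily-wide structured event with span support" looks like in practice: one blob per request, carrying trace/span identity plus as many high-cardinality dimensions as you can afford to attach. A minimal sketch; the handler, field names, and build id are hypothetical, and `print` stands in for shipping to whatever event store you use:

```python
import json
import time
import uuid

def handle_request(user_id, endpoint):
    """Hypothetical request handler that emits ONE wide structured event
    per request, with span fields so events can be stitched into traces."""
    event = {
        "trace.trace_id": str(uuid.uuid4()),
        "trace.span_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "endpoint": endpoint,
        # High-cardinality fields are the point: user ids, build ids, etc.
        "user_id": user_id,
        "build_id": "2023-05-01-abcdef",  # hypothetical build identifier
        "region": "us-east-1",
    }
    start = time.monotonic()
    result = {"status": 200}  # ... real request work would happen here ...
    event["duration_ms"] = (time.monotonic() - start) * 1000
    event["status"] = result["status"]
    print(json.dumps(event))  # stand-in for shipping to an event store
    return result
```

Because the event is schemaless from the writer's side, adding a new dimension is one line in the handler, not a migration.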

Slide 28

Slide 28 text

Observability is not... • Able to be built on top of a metrics store • Comprised of pillars (this is shitty vendorspeak) • Achievable with preaggregation. • Achievable without sampling (or infinite money) (at scale) • About the health of the backend or services. • Achievable without instrumentation • Doable without tracing. • Or exclusively about tracing.
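On "not achievable without sampling (or infinite money) (at scale)": the standard trick is deterministic head sampling keyed on the trace id, so every event in a trace gets the same keep/drop verdict, and kept events record their sample rate so counts can be reconstructed. A minimal sketch, not any particular vendor's implementation:

```python
import hashlib

def keep(trace_id, rate=20):
    """Deterministic head sampling: keep roughly 1-in-`rate` traces.
    Hashing the trace id (rather than rolling dice per event) means a
    trace is either fully kept or fully dropped, never shredded."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return digest % rate == 0

def record(event, rate=20):
    """Attach the sample rate to surviving events, so each kept event
    can later 'stand for' `rate` real events when counting."""
    if keep(event["trace_id"], rate):
        return {**event, "sample_rate": rate}
    return None
```

Dynamic sampling (higher rates for boring 200s, keep every error) builds on the same idea by choosing `rate` per event class.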

Slide 29

Slide 29 text

LAMP stack: “Photos are loading slowly for some people. Why?” • The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down. • DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention. • Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why. These are known-unknowns. Monitor for them.
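Monitoring for known-unknowns is exactly this shape: a predefined metric, a predefined threshold, a page when it crosses. A minimal sketch; the metric names and values are invented for illustration:

```python
def check(name, value, threshold):
    """Classic monitoring check: page if a known metric crosses a
    threshold someone chose in advance."""
    if value > threshold:
        return f"ALERT: {name}={value}"
    return f"OK: {name}={value}"

# Hypothetical known failure modes of a LAMP stack, one check each.
checks = [
    ("app_tier_utilization_pct", 97, 90),  # app tier capacity exceeded?
    ("db_query_p95_ms", 40, 250),          # queries slower than normal?
    ("error_rate_pct", 0.2, 1.0),          # errors high?
]

for name, value, threshold in checks:
    print(check(name, value, threshold))
```

This works precisely because someone could enumerate the failure modes in advance; the distributed-systems examples on the next slides are the cases where nobody could have.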

Slide 30

Slide 30 text

Distributed systems "Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model." "Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly." "Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake." Monitor for .... ???

Slide 31

Slide 31 text

Distributed systems "I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays." “All twenty microservices have 10% of available nodes enter a crash loop about five times a day, at unpredictable intervals. They have nothing in common and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time. We have tried replacing the instances." “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”

Slide 32

Slide 32 text

Distributed systems “Users in Romania are complaining that all push notifications have been down for days. This seems impossible, since we share a queue with them." “Disney is complaining that once in a while, but not always, they don’t see the profile photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed.” “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long time to track down which app or user is generating disproportionate pressure on shared system components.” “We run a platform, and it’s hard to programmatically distinguish between errors that users are inflicting on themselves and problems in our code, since they all manifest as errors or timeouts."

Slide 33

Slide 33 text

Distributed systems These are all unknown-unknowns, which may never have happened before or happen again. (welcome to distributed systems)

Slide 34

Slide 34 text

LAMP stack: technical aspects, cultural associations • THE database • THE application • Known-unknowns and mostly predictable failures • Many monitoring checks • Many paging alerts • "Flip a switch" to deploy • Failures are to be prevented • Production is to be feared • Debug by intuition and scar tissue of past outages • Canned dashboards • Deploys are scary • Masochistic on-call culture

Slide 35

Slide 35 text

LAMP stack • Dev/Ops • Fragile, forbidding edifice • "Glass Castle" We have built our systems like glass castles: fragile and forbidding, hostile to exploration and experimentation.

Slide 36

Slide 36 text

Distributed systems: technical aspects, cultural associations • Many storage systems • Diversity of service types • Unknown-unknowns; every alert is novel • Rich, flexible instrumentation • Few paging alerts • Deployment is like baking • Failures are your friend • Production is where your users live • Debug methodically by examining the evidence • Events and full context, not metrics • Deploys are opportunities • Humane on-call culture

Slide 37

Slide 37 text

best practices • Software ownership -- you build it, you run it • Robust, resilient, built for experimentation and delight • Human scale, safety measures baked in Distributed systems

Slide 38

Slide 38 text

Here's the dirty little secret. The next generation of systems won't be built and run by burned-out, exhausted people, or command-and-control teams just following orders. It can't be done. They've become too complicated. Too hard.

Slide 39

Slide 39 text

We have tools that help us ask and answer questions, especially if we define them in advance. But we don’t know what the questions actually are; all we have are unreliable reports. Our tools were designed for a predictable world. As soon as we know the question, we usually know the answer too.

Slide 40

Slide 40 text

We can no longer fit these systems in our heads and reason about them -- if we try, we'll be outcompeted by teams who use proper tools. Our systems are emergent and unpredictable. We need more than just your logical brain; we need your full creative self.

Slide 41

Slide 41 text

How observability leads to high-performing teams: resiliency, high-quality code, predictable releases, manageable complexity and tech debt, understanding user behavior. https://www.honeycomb.io/blog/toward-a-maturity-model-for-observability/

Slide 42

Slide 42 text

Resiliency. O11y gives you context and helps you resolve incidents swiftly. With it: • System uptime meets your goals • Alerts are not ignored • On-call is not excessively stressful • Staff turnover is low; no burnout. Without it: • Outages are frequent • Spurious alerts • Alert fatigue • Troubleshooting is unpredictable/hard • Repair is unpredictable/time-consuming • Some critical members get fried

Slide 43

Slide 43 text

High-quality code. O11y lets you watch deploys and find bugs early. With it: • Code is stable • Customer happiness, not support • Debugging is intuitive • No cascading failures. Without it: • Customer support costs are high • High % of engineering time spent on bugs • Fear around the deploy process • Long time to find and reproduce bugs • Unpredictable time to solve problems • Low confidence in code when shipped

Slide 44

Slide 44 text

Predictable releases. O11y helps you manage your complex build pipeline as well as deploys, so you can ship swiftly and with confidence. With it: • Release cadence matches goals • Code goes into prod immediately • Code paths can be turned on/off easily • Deploy/rollback are fast. Without it: • Releases are infrequent • Need lots of human intervention • Many changes ship at once • Releases are order-dependent • Sales has to gate releases on promise trai • People avoid doing deploys at certain times

Slide 45

Slide 45 text

Manageable complexity and tech debt. O11y helps you do the right work at the right time. With it: • Spend your time on actual goals • Bugs and reliability are tractable • Easy to find the code to fix • Answer any question w/o shipping new code. Without it: • Waste time rebuilding and refactoring • Teams are distracted by fixing the wrong thing, or fixing it the wrong way • Uncontrollable ripple effects from a local change • A "haunted graveyard" where people are afraid to make changes

Slide 46

Slide 46 text

Understand user behavior. O11y grounds you in reality. With it: • Instrumentation is easy to add • Easy access to KPIs for devs • Feature flagging • PMs have a useful view of customers • Teams share a view of reality. Without it: • Product doesn't have its finger on the pulse • Devs feel their work doesn't have impact • Features get scope creep • Product-market fit not achieved

Slide 47

Slide 47 text

"But I don't have time to invest in observability..." You can't afford not to.

Slide 48

Slide 48 text

You can't afford not to.

Slide 49

Slide 49 text

Eng quality of life is linked to high performing teams and resilient systems.

Slide 50

Slide 50 text

p.s. o11y is also a prerequisite for other modern best practices, like ... chaos engineering, sane deploys, testing in production.

Slide 51

Slide 51 text

where are we going?

Slide 52

Slide 52 text

on call will be shared by everyone who writes code. on call must not be miserable. (on call will be less like a heart attack, more like dentist visits, or gym appts)

Slide 53

Slide 53 text

serverless was a harbinger deploy less is coming

Slide 54

Slide 54 text

invest in your deploys, democratize access to data don't be scared by regulations

Slide 55

Slide 55 text

build your devs a playground ... but build guard rails. encourage curiosity, emphasize ownership. don't punish. get up to your elbows in prod EVERY DAY. practice many small failures. practice, practice, practice. senior engineers: amplify hidden costs

Slide 56

Slide 56 text

The world is changing fast.

Slide 57

Slide 57 text

Every engineering org has two constituencies: 1. make your users happy 2. make your team happy

Slide 58

Slide 58 text

1. nines don't matter if users aren't happy 2. great teams build high-quality systems Corollary:

Slide 59

Slide 59 text

we have an opportunity here to make things better let's do it <3

Slide 60

Slide 60 text

Charity Majors @mipsytipsy