(Instead of "elite", let's say "excellent"?) Elite teams are made up of normal engineers who: take pride in their craft, care about their users, have time to fix and iterate on their work, communicate with each other, invest in incremental improvements, and are empowered to do their jobs.
changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”
and reliably track down any new problem with no prior knowledge. For software engineers, this means being able to reason about your code, identify and fix bugs, and understand user experiences and behaviors ... via your instrumentation.
party, and describes the health of the system and/or its components in aggregate. Observability: Describes the world from the perspective of the software, as it performs each request. Software explaining itself back to you from the inside.
open-ended • Based on arbitrarily-wide structured events with span support (sketched below) • No indexes, schemas, or predefined structure • About understanding unknown-unknowns with no prior knowledge • About systems, not code. Where in the system is the code you need to fix? • Young. Early. There is much still to be discovered. • Aligned with the user's experience.
of a metrics store • Comprised of pillars (this is shitty vendorspeak) • Achievable with preaggregation. • Achievable without sampling (or infinite money) (at scale) • About the health of the backend or services. • Achievable without instrumentation • Doable without tracing. • Or exclusively about tracing.
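To make "arbitrarily-wide structured events with span support" concrete, here is a minimal sketch in Python. Everything in it (the emit() sink, the do_work callback, the field names) is invented for illustration; the point is one wide event per request, with no predefined schema and no pre-aggregation.

```python
import json, time, uuid

def emit(event):
    # Illustrative sink: a real system would ship this to an event store;
    # here we just print one JSON blob per request.
    print(json.dumps(event))

def handle_request(path, user_id, app_id, do_work):
    # One wide, structured event per request: every field we might later
    # want to slice by, high-cardinality or not.
    event = {
        "trace_id": str(uuid.uuid4()),   # span support: child spans reference this id
        "timestamp": time.time(),
        "endpoint": path,
        "user_id": user_id,              # high-cardinality fields are welcome
        "app_id": app_id,
        "region": "us-east-1b",
        "build_id": "abc123",
        "feature_flags": {"sequential_db_queries": False},
        "db_query_count": 0,
    }
    start = time.monotonic()
    try:
        event["status"] = do_work(event)  # the work function annotates the event as it goes
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        emit(event)                       # no pre-aggregation, no predefined schema

handle_request("/photos/archive", "user-42", "app-7", do_work=lambda ev: 200)
```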
rolled out a build with a perf regression, or maybe some app instances are down. DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention. Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why. “Photos are loading slowly for some people. Why?” These are known-unknowns. Monitor for them.
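For contrast, a toy sketch of what "monitor for them" looks like when the questions and thresholds are known in advance. The metric names and thresholds are made up for illustration:

```python
def check_known_unknowns(metrics, page):
    # Classic monitoring: predefined questions with predefined thresholds.
    # This only catches failure modes someone anticipated in advance.
    if metrics["error_rate"] > 0.01:
        page("error rate above 1% -- bad deploy? instances down?")
    if metrics["p95_latency_ms"] > 500:
        page("p95 latency above 500ms -- slow query? lock contention?")
    if metrics["healthy_app_instances"] < 3:
        page("fewer than 3 healthy app instances")

check_known_unknowns(
    {"error_rate": 0.002, "p95_latency_ms": 820, "healthy_app_instances": 5},
    page=lambda msg: print("PAGE:", msg),
)
```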
storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model." "Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly" "Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake." Monitor for .... ???
and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays." “All twenty microservices have 10% of available nodes enter a crash loop about five times a day, at unpredictable intervals. They have nothing in common and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time. We have tried replacing the instances." “Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”
notifications have been down for days. This seems impossible, since we share a queue with them." “Disney is complaining that once in a while, but not always, they don’t see the profile photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed.” “Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long time to track down which app or user is generating disproportionate pressure on shared system components.” “We run a platform, and it’s hard to programmatically distinguish between errors that users are inflicting on themselves and problems in our code, since they all manifest as errors or timeouts."
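Stories like these can't be turned into checks ahead of time; you find them by slicing raw events along whatever dimension turns out to matter. A rough sketch of that kind of after-the-fact group-by, assuming wide events like the ones sketched earlier (field names are illustrative, not prescribed):

```python
from collections import defaultdict
from statistics import median

def breakdown(events, by, value="duration_ms"):
    # Ad-hoc group-by over raw events: pick any field after the fact,
    # with no index or schema decided in advance.
    groups = defaultdict(list)
    for ev in events:
        groups[ev.get(by)].append(ev[value])
    return {key: (len(vals), median(vals)) for key, vals in groups.items()}

# Which availability zone, language pack, or flag explains the slowness?
# Each question is asked only after the symptom shows up, not predefined.
events = [
    {"az": "us-east-1b", "lang": "fr", "debug_mode": True,  "duration_ms": 2400},
    {"az": "us-east-1a", "lang": "en", "debug_mode": False, "duration_ms": 110},
    {"az": "us-east-1b", "lang": "en", "debug_mode": False, "duration_ms": 2600},
]
print(breakdown(events, by="az"))          # us-east-1b requests are ~20x slower
print(breakdown(events, by="debug_mode"))  # slice again along a different dimension
```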
and mostly predictable failures • Many monitoring checks • Many paging alerts • "Flip a switch" to deploy • Failures to be prevented • Production is to be feared • Debug by intuition and scar tissue of past outages • Canned dashboards • Deploys are scary • Masochistic on-call culture (technical aspects, cultural associations)
• Diversity of service types • Unknown-unknowns; every alert is novel • Rich, flexible instrumentation • Few paging alerts • Deployment is like baking • Failures are your friend • Production is where your users live • Debug methodically by examining the evidence • Events and full context, not metrics • Deploys are opportunities • Humane on-call culture
we have are unreliable reports. Our tools were designed for a predictable world. As soon as we know the question, we usually know the answer too. We have tools that help us ask and answer questions, especially if we define them in advance.
and reason about them -- if we try, we'll be outcompeted by teams who use proper tools. Our systems are emergent and unpredictable. We need more than just your logical brain; we need your full creative self.
not ignored • Oncall is not excessively stressful • Staff turnover is low; no burnout • Outages are frequent. • Spurious alerts • Alert fatigue • Troubleshooting is unpredictable/hard • Repair is unpredictable/time-consuming • Some critical members get fried. O11y gives you context and helps you resolve incidents swiftly
support • Debugging is intuitive • No cascading failures • Customer support costs are high • High % of engineering time on bugs • Fear around the deploy process • Long time to find and repro bugs • Unpredictable time to solve problems • Low confidence in code when shipped. O11y lets you watch deploys, find bugs early
in prod immediately • Code paths turned on/off easily (sketched below) • Deploy/rollback are fast. O11y helps you manage your complex build pipeline as well as deploys, so you can ship swiftly and with confidence • Releases are infrequent • Need lots of human intervention • Many changes ship at once • Releases are order-dependent • Sales has to gate releases on promise train • People avoid doing deploys at times
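A hedged sketch of "code paths turned on/off easily": a feature flag guarding a new code path, with the flag state recorded on the request event so the rollout can be watched (and instantly rolled back) in production. The flag name and loader functions are hypothetical:

```python
FLAGS = {"new_photo_loader": False}   # illustrative in-memory flag store

def old_loader(user_id): return f"old-path:{user_id}"
def new_loader(user_id): return f"new-path:{user_id}"

def load_photos(user_id, event):
    # Ship the new path dark, flip it on without a deploy, and record the
    # flag state on the request event so the rollout can be watched in prod.
    use_new = FLAGS["new_photo_loader"]
    event["feature_flags"] = {"new_photo_loader": use_new}
    return new_loader(user_id) if use_new else old_loader(user_id)

event = {}
print(load_photos("user-42", event), event)
FLAGS["new_photo_loader"] = True      # "flip a switch" -- instant enable or rollback
print(load_photos("user-42", event), event)
```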
actual goals • Bugs and reliability are tractable • Easy to find code to fix • Answer any question w/o shipping new code. O11y helps you do the right work at the right time • Waste time rebuilding and refactoring • Teams are distracted by fixing the wrong thing or the wrong way • Uncontrollable ripple effects from a local change • "haunted graveyard" where people are afraid to make changes
Easy access to KPIs for devs • Feature flagging • PMs have useful view of customers • Teams share view of reality. O11y grounds you in reality. • Product doesn't have their finger on the pulse • Devs feel their work doesn't have impact • Features get scope creep • PMF not achieved