Slide 1

Slide 1 text

Charity Majors @mipsytipsy Observability & Complex Systems What got you here won't get you there, and other terrifying true tales from the computing frontier

Slide 2

Slide 2 text

@mipsytipsy engineer/cofounder/CEO https://charity.wtf “the only good diff is a red diff”

Slide 3

Slide 3 text

A short partial list of things I would like to touch on...
"chaos engineering": you must be this tall to ride this ride. (are you? how do you evaluate this?)
observability
business intelligence, aka why nothing we are doing is remotely new
why tools create silos
the implications of democratizing access to data, particularly for levels and career progressions
how deploys must change
the misallocation of internal tooling energy away from deploy software
why you need to test in prod
why you need a canary (probably)
when to know you need a canary

Slide 4

Slide 4 text

"chaos engineering" you must be this tall to ride this ride. (are you? how do you evaluate this?) business intelligence, aka why nothing we are doing is remotely new why tools create silos the implications of democratizing access to data particularly for levels and career progressions how deploys must change the mis allocation of internal tooling energy away rom deploy software why you need to test in prod why you need a canary (probably) when to know you need a canary why you definitely need feature flags, no matter what test doesn't mean what you think it means continued ...

Slide 5

Slide 5 text

the future of development is observability-driven development. "O-D-D yeah YOU KNOW ME"
why we have to stop leaning on intuition and tribal knowledge before it is too late
why AIOps is stupid and doomed
why the team is your best source of wisdom
why wisdom is not truth
why ops needs to learn about design principles, stat
why vendors are rushing to co-opt the observability message before you notice they don't actually fulfill the demands, and why this makes me Very Stabby
cont'd ... just a brief outline

Slide 6

Slide 6 text

"How did we get here?"

Slide 7

Slide 7 text

The trifecta:
Monitoring (time series databases, dashboards, 'metric' tools)
Logs (messy-ass strings, really)
More recently, APM and tracing.

Slide 8

Slide 8 text

"What do we need to get where we're going?"

Slide 9

Slide 9 text

Our idea of what the software development lifecycle even looks like is overdue for an upgrade in the era of distributed systems.

Slide 10

Slide 10 text

Deploying code is not a binary switch. Deploying code is a process of increasing your confidence in your code.
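A minimal, hypothetical sketch of that idea in Python (the flag name, percentages, and helper functions are all invented for illustration, not any real service): the code is shipped everywhere, but only a small, adjustable cohort of users exercises the new path while you watch it behave in production.

import hashlib

# Hypothetical flag table: start the new code path at 5% of users and
# ratchet it up as confidence grows. Names are illustrative only.
ROLLOUT_PERCENT = {"new_photo_pipeline": 5}

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user into the flag's rollout percentage."""
    pct = ROLLOUT_PERCENT.get(flag, 0)
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < pct

def load_photos_v1(user_id: str) -> list:
    return []  # stand-in for the existing, trusted code path

def load_photos_v2(user_id: str) -> list:
    return []  # stand-in for the new code path under observation

def load_photos(user_id: str) -> list:
    # The "deploy" is done, but the release is gradual: only the rollout
    # cohort exercises the new path while we watch it run in production.
    if is_enabled("new_photo_pipeline", user_id):
        return load_photos_v2(user_id)
    return load_photos_v1(user_id)

Raising ROLLOUT_PERCENT from 5 toward 100 is the act of gaining confidence; rolling back is just setting it to 0, with no redeploy.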

Slide 11

Slide 11 text

Development → deploy → Production

Slide 12

Slide 12 text

Observability Development Production

Slide 13

Slide 13 text

Observability Development Production

Slide 14

Slide 14 text

why now?

Slide 15

Slide 15 text

“Complexity is increasing” - Science

Slide 16

Slide 16 text

Architectural complexity: LAMP stack, 2005 vs. Parse, 2015

Slide 17

Slide 17 text

monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems

Slide 18

Slide 18 text

We are all distributed systems engineers now.
The unknowns outstrip the knowns.
Why does this matter more and more?

Slide 19

Slide 19 text

Distributed systems are particularly hostile to being cloned or imitated (or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)

Slide 20

Slide 20 text

Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless. this is a black hole for engineering time

Slide 21

Slide 21 text

Operational literacy is not a nice-to-have.

Slide 22

Slide 22 text

Without observability, you don't have "chaos engineering". You just have chaos. So what is observability?

Slide 23

Slide 23 text

Observability is NOT the same as monitoring.

Slide 24

Slide 24 text

@grepory, Monitorama 2016: “Monitoring is dead.” “Monitoring systems have not changed significantly in 20 years and have fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components, while the core functionality of monitoring systems has stagnated.”

Slide 25

Slide 25 text

Observability “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals." — wikipedia … translate??!?

Slide 26

Slide 26 text

Observability... for software engineers:
Can you understand what’s happening inside your systems, just by asking questions from the outside?
Can you debug your code and its behavior using its output?
Can you answer new questions without shipping new code?

Slide 27

Slide 27 text

Monitoring: represents the world from the perspective of a third party, and describes the health of the system and/or its components in aggregate.
Observability: describes the world from the first-person perspective of the software, executing each request. Software explaining itself from the inside out.
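To make that "first-person perspective" concrete, here is a minimal sketch in plain Python (no particular vendor SDK assumed; every field name is invented for illustration) of a request handler emitting one wide, structured event that describes its own execution from the inside:

import json
import time
import uuid

def handle_request(user_id: str, endpoint: str) -> None:
    # One wide, structured event per request: the software narrates its own
    # execution rather than being summarized from the outside in aggregate.
    event = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,                    # high-cardinality, kept on purpose
        "endpoint": endpoint,
        "build_id": "hypothetical-build-123",  # which deploy served this request
    }
    start = time.monotonic()
    try:
        event["rows_examined"] = do_the_work(user_id)  # any detail you might ask about later
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event))               # stand-in for shipping to an event store

def do_the_work(user_id: str) -> int:
    return 42  # stand-in for the real handler body

handle_request("u_9418821", "/photos")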

Slide 28

Slide 28 text

We don’t *know* what the questions are; all we have are unreliable symptoms or reports. Complexity is exploding everywhere, but our tools are designed for a predictable world. As soon as we know the question, we usually know the answer too.

Slide 29

Slide 29 text

Welcome to distributed systems. it’s probably fine. (it might be fine?)

Slide 30

Slide 30 text

Many catastrophic states exist at any given time. Your system is never entirely ‘up’

Slide 31

Slide 31 text

Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless. this is a black hole for engineering time

Slide 32

Slide 32 text

You do it. You have to do it. Do it well.

Slide 33

Slide 33 text

Let’s try some examples! Can you quickly and reliably track down problems like these?

Slide 34

Slide 34 text

Monitoring (old-school LAMP stack)
“Photos are loading slowly for some people. Why?” Monitor these things:
The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down.
DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention.
Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why.

Slide 35

Slide 35 text

Monitoring?!? (microservices)
“Photos are loading slowly for some people. Why?” ... wtf do i ‘monitor’ for?!
Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model.
Canadian users who are using the French language pack on an iPad running iOS 9 are hitting a firmware condition which makes it fail to save to local cache … which is why it FEELS like photos are loading slowly.
Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake.

Slide 36

Slide 36 text

Observability (microservices): problems & symptoms
“I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks, but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays.”
“All twenty app microservices have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.”
“Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”

Slide 37

Slide 37 text

Observability (microservices): still more symptoms
“Several users in Romania and Eastern Europe are complaining that all push notifications have been down for them … for days.”
“Disney is complaining that once in a while, but not always, they don’t see the photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed. Actually, we’ve had a few other people report this too, we just didn’t believe them.”
“Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long long time to track down which app or user is generating disproportionate pressure on shared components of our system (esp databases). It’s different every time.”
“We run a platform, and it’s hard to programmatically distinguish between problems that users are inflicting on themselves and problems in our own code, since they all manifest as the same errors or timeouts.”

Slide 38

Slide 38 text

These are all unknown-unknowns that may have never happened before, and may never happen again. (They are also the overwhelming majority of what you have to care about for the rest of your life.)

Slide 39

Slide 39 text

Three principles of software ownership:
They who write the code
can and should deploy their code
and watch it run in production.
(**and be on call for it)

Slide 40

Slide 40 text

When healthy teams with good cultural values and leadership alignment try to adopt software ownership and fail, the cause is usually an observability gap.

Slide 41

Slide 41 text

Software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world. Tighten feedback loops. Give developers the observability tooling they need to become fluent in production and to debug their own systems. We aren’t “writing code”. We are “building systems”.

Slide 42

Slide 42 text

Observability for SWEs and the Future™:
well-instrumented
high cardinality
high dimensionality
event-driven
structured
well-owned
sampled
tested in prod.
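On "sampled": a small illustrative sketch, again assuming no specific SDK, of the usual shape of dynamic sampling: keep every error and slow request, keep only a fraction of the boring successes, and record the sample rate so counts can be re-weighted at query time.

import random

def should_keep(event: dict) -> tuple[bool, int]:
    """Decide whether to keep one structured event, and at what sample rate."""
    # Always keep the interesting traffic: errors and slow requests.
    if event.get("status", 200) >= 500 or event.get("duration_ms", 0) > 1000:
        return True, 1
    # Keep roughly 1 in 20 healthy, fast requests.
    rate = 20
    return random.randint(1, rate) == 1, rate

# Invented example event, just to exercise the function.
event = {"status": 200, "duration_ms": 12, "user_id": "u_123456"}
keep, rate = should_keep(event)
if keep:
    event["sample_rate"] = rate  # one stored event stands in for `rate` real ones
    # send(event)  <- hypothetical: ship to wherever your structured events live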

Slide 43

Slide 43 text

Watch it run in production. Accept no substitute. Get used to observing your systems when they AREN’T on fire

Slide 44

Slide 44 text

Real data. Real users. Real traffic. Real scale. Real concurrency. Real network. Real deploys. Real unpredictabilities.

Slide 45

Slide 45 text

You care about each and every tree, not the forest. "The health of the system no longer really matters" -- me

Slide 46

Slide 46 text

Zero users care what the “system” health is All users care about THEIR experience. Nines don’t matter if users aren’t happy. Nines don’t matter if users aren’t happy. Nines don’t matter if users aren’t happy. Nines don’t matter if users aren’t happy. Nines don’t matter if users aren’t happy.

Slide 47

Slide 47 text

Observability for SWEs and the Future™:
well-instrumented
high cardinality
high dimensionality
event-driven
structured
well-owned
sampled
tested in prod.

Slide 48

Slide 48 text

You win … Drastically fewer paging alerts!

Slide 49

Slide 49 text

Charity Majors @mipsytipsy

Slide 50

Slide 50 text

SREcon
Charity Majors @mipsytipsy

Slide 51

Slide 51 text

You must be able to break down by one user in millions, and THEN by anything/everything else.
High cardinality is not a nice-to-have.
‘Platform problems’ are now everybody’s problems.
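As an illustration of "break down by one in millions, THEN by anything else", here is a toy sketch with invented data (a real system would run this as a query over stored events rather than in-process Python): group raw events by a high-cardinality key, then by a second dimension, to answer a question no pre-aggregated dashboard anticipated.

from collections import defaultdict

# A handful of invented events, standing in for what the system emitted.
events = [
    {"user_id": "u_9418821", "endpoint": "/photos", "duration_ms": 3400},
    {"user_id": "u_9418821", "endpoint": "/feed",   "duration_ms": 90},
    {"user_id": "u_0000137", "endpoint": "/photos", "duration_ms": 45},
]

# Break down by ONE user out of millions, THEN by a second dimension.
by_user = defaultdict(lambda: defaultdict(list))
for e in events:
    by_user[e["user_id"]][e["endpoint"]].append(e["duration_ms"])

# "Why are photos slow for THIS user?"
for endpoint, durations in sorted(by_user["u_9418821"].items()):
    print(endpoint, "worst:", max(durations), "ms")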