A Dark and Stormy Night

Slide 1

Slide 1 text

A DARK AND STORMY NIGHT: TALES OF OPERABILITY ANTI-PATTERNS

Slide 2

Slide 2 text

Kiran Bhattaram @kiranb

Slide 3

Slide 3 text

BULWER-LYTTON
It was a dark and stormy night; the rain fell in torrents — except at occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness.

Slide 4

Slide 4 text

DEFINITIONS
What is operability?
▸ The ability to keep a system in a safe and reliable functioning condition, according to pre-defined operational requirements.

Slide 5

Slide 5 text

DEFINITIONS
Characteristics of operability
▸ safety & reliability
▸ scalability
▸ grace under pressure
▸ ease of upgrades
▸ observability
▸ usability
▸ cultural practices around incidents
▸ AND MORE

Slide 6

Slide 6 text

DEFINITIONS
Characteristics of an operable system
▸ Converge towards a stable state.
▸ Give operators visibility and tools.
▸ Designed to be usable and unsurprising.

Slide 7

Slide 7 text

DEFINITIONS
Agenda
▸ Robustness
▸ Observability
▸ Usability
▸ Review!

Slide 8

Slide 8 text

1. ROBUSTNESS

Slide 9

Slide 9 text

STORY 1: THE TALE OF THE SYSTEM THAT COULDN’T GIVE ANYTHING UP

Slide 10

Slide 10 text

ROBUSTNESS
Define your critical path.

Slide 11

Slide 11 text

ROBUSTNESS
Harvest, Yield, and Scalable Tolerant Systems
Yield = successful requests / total requests (!= uptime)
Harvest = data available / total data
▸ dropping requests lowers yield
▸ degrading responses lowers harvest
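A minimal sketch of the two ratios on this slide, to make the distinction concrete. The function names and the sample numbers are illustrative assumptions, not something from the talk or the paper.

```python
def yield_fraction(successful_requests: int, total_requests: int) -> float:
    """Fraction of incoming requests that were answered successfully."""
    if total_requests == 0:
        return 1.0  # assumption: with no traffic, nothing was dropped
    return successful_requests / total_requests


def harvest_fraction(data_available: int, total_data: int) -> float:
    """Fraction of the full data set reflected in a response."""
    if total_data == 0:
        return 1.0  # assumption: an empty data set is trivially complete
    return data_available / total_data


# Dropping 50 of 1000 requests lowers yield but leaves harvest untouched.
print(yield_fraction(950, 1000))  # 0.95

# Answering from 3 of 4 shards keeps yield up but lowers harvest.
print(harvest_fraction(3, 4))  # 0.75
```

A service can be "up" the whole time and still have poor yield, which is why the slide separates yield from uptime.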

Slide 12

Slide 12 text

ROBUSTNESS
Controlling yield: load shedding upstream requests
▸ categories of load shedders:
  ▸ # of requests
  ▸ # of concurrent requests (protect against the long tail)
  ▸ overall fleet utilization (keep x% of workers for core traffic)
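The concurrent-request and reserved-capacity categories can be sketched together as below. This is a simplified, single-process illustration; the class name `ConcurrencyShedder`, its parameters, and the `handle` helper are hypothetical, and a real fleet-wide shedder would track utilization across workers.

```python
import threading


class ConcurrencyShedder:
    """Sheds requests once too many are in flight.

    A number of slots is reserved so that critical traffic can still be
    admitted after non-critical traffic has started to be shed.
    """

    def __init__(self, max_in_flight: int, reserved_for_critical: int = 0) -> None:
        self._max = max_in_flight
        self._non_critical_max = max_in_flight - reserved_for_critical
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical: bool = False) -> bool:
        limit = self._max if critical else self._non_critical_max
        with self._lock:
            if self._in_flight >= limit:
                return False  # shed: the caller should fail fast
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(shedder: ConcurrencyShedder, work, critical: bool = False):
    """Runs work() if admitted, otherwise returns a 'shed' marker."""
    if not shedder.try_acquire(critical):
        return "shed"
    try:
        return work()
    finally:
        shedder.release()
```

Capping in-flight requests rather than request rate is what protects against the long tail: a few slow requests occupy slots, and new work is rejected instead of queueing behind them.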

Slide 13

Slide 13 text

ROBUSTNESS
Controlling harvest: circuit breakers
▸ stop calling a dependency if it seems down!
▸ what do you return?
  ▸ cached data
  ▸ nil
  ▸ or propagate the error upstream
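A minimal circuit breaker sketch, assuming a consecutive-failure threshold and a fixed cool-down before one trial call is allowed through. The class name, parameter names, and the injectable `clock` are illustrative assumptions, not an API from the talk.

```python
import time


class CircuitBreaker:
    """Stops calling a dependency after repeated consecutive failures."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        """Returns fn(), or fallback() if the circuit is open or fn() fails."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                return fallback()  # open: do not touch the dependency
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None
        return result
```

The `fallback` callable is where the slide's three choices live: it can return cached data, return `None`, or raise so the error propagates upstream.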

Slide 14

Slide 14 text

ROBUSTNESS
Controlling harvest: circuit breakers & compartmentalization
http://idighardware.com/2013/10/fire-doors-everything-you-always-wanted-to-know-but-were-afraid-to-ask/

Slide 15

Slide 15 text

ROBUSTNESS
Putting it all together: giving things up
▸ Combine harvest/yield degradation in different ways to protect the critical path
▸ Monitor any degradation!
▸ Dark launch your rate limiters to check what they’d block.
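One way to dark launch a rate limiter is to run the real decision logic but only record the result until enforcement is switched on. The sketch below assumes a simple per-client counter per window; `DarkLaunchLimiter`, `enforce`, and `would_block` are hypothetical names.

```python
import logging

logger = logging.getLogger("rate_limiter")


class DarkLaunchLimiter:
    """Counts requests per client; blocks only when enforce is True."""

    def __init__(self, limit_per_window: int, enforce: bool = False) -> None:
        self.limit_per_window = limit_per_window
        self.enforce = enforce
        self.would_block = 0  # how many requests the limiter flagged
        self._counts = {}

    def allow(self, client_id: str) -> bool:
        self._counts[client_id] = self._counts.get(client_id, 0) + 1
        over_limit = self._counts[client_id] > self.limit_per_window
        if over_limit:
            self.would_block += 1
            logger.warning("over limit client=%s enforce=%s", client_id, self.enforce)
        # In dark mode the request is always allowed; only the metric changes.
        return not (over_limit and self.enforce)

    def reset_window(self) -> None:
        """Called by the owner at the start of each window."""
        self._counts.clear()


dark = DarkLaunchLimiter(limit_per_window=2)
print([dark.allow("client-a") for _ in range(3)], dark.would_block)
# [True, True, True] 1
```

Watching `would_block` in production before flipping `enforce` shows which clients would have been rejected, which covers the "monitor any degradation" point as well.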

Slide 16

Slide 16 text

ROBUSTNESS
Robustness, in review
▸ know how the system sheds load
▸ know how it reacts to downstream failures
Converge to a stable state.

Slide 17

Slide 17 text

2. OBSERVABILITY

Slide 18

Slide 18 text

STORY 2: THE TALE OF THE FRACTAL QUEUE

Slide 19

Slide 19 text

OBSERVABILITY
Instrument EVERYTHING
▸ especially with queues
▸ percentiles, not averages
▸ don’t intermingle logs (keep a searchable trace ID on requests)
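A small worked example of why percentiles are preferred over averages. The latency sample is made up for illustration, and `percentile` is a hypothetical nearest-rank helper rather than a library function.

```python
import statistics


def percentile(values, pct: int):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -((-pct * len(ordered)) // 100))  # integer ceiling
    return ordered[rank - 1]


# 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [20] * 98 + [2000, 3000]

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(mean_ms, p50_ms, p99_ms)  # 69.6 20 2000
```

The mean (69.6 ms) describes no request that actually happened: the typical request took 20 ms and the slowest took seconds. Reporting p50 and p99 side by side keeps both facts visible.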

Slide 20

Slide 20 text

OBSERVABILITY
Over-collect data, but build dashboards carefully
▸ work metrics
  ▸ is the system doing the thing it’s supposed to?
▸ resource metrics
  ▸ how are the components of the system behaving?
▸ build your dashboard with work metrics first.

Slide 21

Slide 21 text

STORY 4: THE TALE OF THE 64 ALERT WEEK

Slide 22

Slide 22 text

OBSERVABILITY
Don’t normalize deviance

Slide 23

Slide 23 text

OBSERVABILITY
Knowing what to alert on
▸ Monitor the alert volume of your system!
▸ Pages should be actionable and represent user pain.

Slide 24

Slide 24 text

OBSERVABILITY
Observability: what we learned
▸ Kiran has a special vendetta against unmonitored queues.
▸ Building good dashboards: work metrics & resource metrics.
▸ Monitor alert volume, too!

Slide 25

Slide 25 text

3. USABILITY

Slide 26

Slide 26 text

USABILITY
A quick side note: Nielsen Heuristics
1. Visibility of system status
2. Match between system and the real world
3. User control and freedom
4. Consistency and standards
5. Error prevention
6. Recognition vs. recall
7. Flexibility and efficiency of use
8. Aesthetic and minimalist design
9. Help users recognize, diagnose, and recover from errors
10. Help and documentation
Highlighted on the slide: 1, 3, 5, 6, 9

Slide 27

Slide 27 text

Story 5: the tale of the special snowflake service

Slide 28

Slide 28 text

USABILITY
Heuristic 4. Consistency and Standards
▸ pattern-matching across similar systems is really valuable!
▸ Choose boring technology: spend your innovation tokens wisely!

Slide 29

Slide 29 text

USABILITY
Heuristic 3. User control and freedom
▸ Tooling is a part of the service!
▸ relatedly, deploy mechanisms affect availability!
▸ Give operators the ability to change operational parameters.

Slide 30

Slide 30 text

STORY 6: THE TALE OF THE OPS SPELL BOOK

Slide 31

Slide 31 text

USABILITY
Heuristic 6. Recognition vs. recall
▸ Keep checklists minimal and heavily automated.
▸ long flowcharts in a runbook are :(
▸ relatedly: scripting user communications is helpful.

Slide 32

Slide 32 text

USABILITY
Heuristic 1. Visibility of system status
▸ which of these are changes to production?
  ▸ config changes
  ▸ deploys
  ▸ utility script runs
  ▸ failovers
  ▸ adding/decreasing capacity

Slide 33

Slide 33 text

STORY 7: THE TALE OF THE AMBIGUOUS ERROR MESSAGE

Slide 34

Slide 34 text

USABILITY
Heuristic 9. Help users recognize, diagnose, and recover from errors
▸ error messages are a crucial part of your interface
▸ Writing a good alert message:
  ▸ express it in plain language, precisely indicate the problem, and constructively suggest a solution (runbooks!)
  ▸ (ex.) CRITICAL: Served 5% 5xx results in the last 5 minutes!
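A sketch of an alert formatter that follows the three parts named on this slide: plain language, a precise statement of the problem, and a suggested next step. The function name, its parameters, the threshold value, and the runbook URL are all hypothetical.

```python
def format_alert(severity: str, service: str, error_rate: float,
                 window_minutes: int, threshold: float, runbook_url: str) -> str:
    """Builds a one-line alert that states the problem and links a runbook."""
    return (
        f"{severity}: {service} served {error_rate:.1%} 5xx responses "
        f"in the last {window_minutes} minutes (threshold {threshold:.1%}). "
        f"Runbook: {runbook_url}"
    )


message = format_alert(
    severity="CRITICAL",
    service="api",
    error_rate=0.05,
    window_minutes=5,
    threshold=0.01,
    runbook_url="https://runbooks.example.com/api-5xx",  # placeholder URL
)
print(message)
```

Including the threshold next to the observed value lets the on-call engineer judge how far out of bounds the system is without opening a dashboard first.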

Slide 35

Slide 35 text

USABILITY
Usability, in review
▸ Operational experience matters! Consider:
  ▸ whether the system follows general conventions.
  ▸ how it alerts operators to errors clearly and unambiguously.
  ▸ how minimal and usable the tooling is.

Slide 36

Slide 36 text

REVIEW
Review
▸ Robustness
  ▸ Does your system converge to a stable state?
▸ Observability
  ▸ Can you infer what the internal state of the system looks like?
▸ Usability
  ▸ Do your operators have control over the state of the system? Do you adhere to general standards?

Slide 37

Slide 37 text

STORY THE LAST: THE TALE OF THE SAD QUEUE :(

Slide 38

Slide 38 text

STORY THE LAST: A DARK AND STORMY NIGHT

Slide 39

Slide 39 text

REVIEW
Resources
▸ Harvest, Yield, and Scalable Tolerant Systems (Brewer & Fox)
▸ How Complex Systems Fail (Cook)
▸ "Going solid": a model of system dynamics and consequences for patient safety (Cook)
▸ Nielsen’s Usability Heuristics
▸ Choose Boring Technology (Dan McKinley)
▸ Site Reliability Engineering: How Google Runs Production Systems
▸ Stripe’s (upcoming) rate limiting blog post
▸ Collection of postmortems (Dan Luu)
▸ Release It! (Michael Nygard)

Slide 40

Slide 40 text

REVIEW
On Designing and Deploying Internet-Scale Services, James Hamilton
▸ list of best practices, from design, to upgrades, to incident response

Slide 41

Slide 41 text

THANKS!
Thanks to Ines Sombra, Charity Majors, Alyssa Frazee, Rachel Sanders, and Andy Bonventre for review!

Slide 42

Slide 42 text

APPENDIX: STUFF I COULDN’T GET TO

Slide 43

Slide 43 text

OBSERVABILITY
Decouple deploys from releases
▸ get a minimal version doing dark reads into production ASAP
▸ corollary: have good kill switches!
▸ Know what rollbacks look like
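A minimal sketch of a dark read guarded by a kill switch, assuming the new code path is deployed but its result is never returned to users. `KillSwitches`, `read_profile`, and the switch name `dark_read_new_store` are hypothetical; a production switch store would be changeable at runtime without a deploy.

```python
class KillSwitches:
    """In-memory stand-in for a runtime-configurable switch store."""

    def __init__(self) -> None:
        self._enabled = {}

    def set(self, name: str, enabled: bool) -> None:
        self._enabled[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._enabled.get(name, False)  # default off: deployed, not released


def read_profile(user_id, switches, read_old, read_new, mismatches):
    """Serves from the old path; optionally shadows the read on the new path."""
    result = read_old(user_id)
    if switches.is_enabled("dark_read_new_store"):
        try:
            if read_new(user_id) != result:
                mismatches.append(user_id)  # record the difference for later review
        except Exception:
            mismatches.append(user_id)  # a failing dark read must never reach users
    return result
```

Turning the switch off is the rollback: the new code stays deployed but stops running, which is what decoupling the deploy from the release buys.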

Slide 44

Slide 44 text

OBSERVABILITY
Collect operational metrics in this shadow phase
▸ Gain historical knowledge of what the system’s healthy state looks like.
▸ Tweak your alerts and SLAs.
▸ Gameday the system! Write runbooks!