Slide 1

Slide 1 text

OMG Strangeloop 2015!!!
Architectural Patterns of Resilient Distributed Systems

Slide 2

Slide 2 text

Ines Sombra @Randommood ines@fastly.com

Slide 3

Slide 3 text

Globally distributed & highly available

Slide 4

Slide 4 text

Today’s Journey
- Why care?
- Resilience literature
- Resilience in industry
- Conclusions

Slide 5

Slide 5 text

OBLIGATORY DISCLAIMER SLIDE
All from a practitioner’s perspective!
Things you may see in this talk:
- Pugs
- Fast talking
- Life pondering
- Un-tweetable moments
- Rantifestos
- What surprised me this year
- Wedding factoids and trivia

Slide 6

Slide 6 text

Why Resilience?

Slide 7

Slide 7 text

How can I make a system more resilient?

Slide 8

Slide 8 text

Resilience is the ability of a system to adapt or keep working when challenges occur

Slide 9

Slide 9 text

Defining Resilience
- Fault-tolerance
- Evolvability
- Scalability
- Failure isolation
- Complexity management

Slide 10

Slide 10 text

It’s what really matters

Slide 11

Slide 11 text

Resilience in Literature

Slide 12

Slide 12 text

Harvest & Yield Model

Slide 13

Slide 13 text

Yield
Fraction of successfully answered queries
- Close to uptime but more useful because it directly maps to user experience (uptime misses this)
- Focus on yield rather than uptime

Slide 14

Slide 14 text

Harvest
Fraction of the complete result
Diagram labels: Server A, Server B, Server C; Baby, Animals, Cute
From Coda Hale’s “You can’t sacrifice partition tolerance”

Slide 15

Slide 15 text

Harvest
Fraction of the complete result
Diagram labels: Server A, Server B, Server C; Baby, Animals, Cute; one server marked X; 66% harvest
From Coda Hale’s “You can’t sacrifice partition tolerance”
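A minimal sketch of how the 66% figure arises, assuming the diagram reads as a term-partitioned index where Server A owns "baby", Server B owns "animals", and Server C (the failed one) owns "cute". That mapping, the `SERVERS` table, and the `query` helper are assumptions for illustration, not something the slide spells out.

```python
# Hypothetical term-partitioned index: each server owns the results for one term.
SERVERS = {
    "A": {"term": "baby", "up": True},
    "B": {"term": "animals", "up": True},
    "C": {"term": "cute", "up": False},  # assumed to be the server marked X
}


def query(terms, servers):
    """Return the terms that could be answered and the harvest fraction."""
    owners = {s["term"]: s for s in servers.values()}
    # Degrade gracefully: answer with whatever the reachable servers hold.
    answered = [t for t in terms if t in owners and owners[t]["up"]]
    harvest = len(answered) / len(terms)
    return answered, harvest


answered, harvest = query(["cute", "baby", "animals"], SERVERS)
print(answered, round(harvest, 2))
```

The query still completes, so yield is unaffected; only the completeness of the answer (two of three terms) drops.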

Slide 16

Slide 16 text

#1: Probabilistic Availability
- Graceful harvest degradation under faults
- Randomness to make the worst-case & average-case the same
- Replication of high-priority data for greater harvest control
- Degrading results based on client capability

Slide 17

Slide 17 text

#2: Decomposition & Orthogonality
- Decompose into subsystems that are independently intolerant to harvest degradation, but the application can continue if they fail
- Provide strong consistency only for the subsystems that need it
- Orthogonal mechanisms (state vs functionality)

Slide 18

Slide 18 text

“If your system favors yield or harvest is an outcome of its design” (Fox & Brewer)

Slide 19

Slide 19 text

Cook & Rasmussen model

Slide 20

Slide 20 text

Cook & Rasmussen diagram: economic failure boundary, unacceptable workload boundary, accident boundary, operating point

Slide 21

Slide 21 text

Cook & Rasmussen diagram: economic failure boundary, unacceptable workload boundary, accident boundary

Slide 22

Slide 22 text

Cook & Rasmussen diagram: economic failure boundary, unacceptable workload boundary, accident boundary, pressure towards efficiency

Slide 23

Slide 23 text

Cook & Rasmussen diagram: economic failure boundary, unacceptable workload boundary, accident boundary, pressure towards efficiency, reduction of effort

Slide 24

Slide 24 text

Cook & Rasmussen diagram: economic failure boundary, unacceptable workload boundary, accident boundary, pressure towards efficiency, reduction of effort. Incident!

Slide 25

Slide 25 text

Cook & Rasmussen diagram: economic failure boundary, unacceptable workload boundary, accident boundary, pressure towards efficiency, reduction of effort, safety campaign

Slide 26

Slide 26 text

Cook & Rasmussen diagram: economic failure boundary, unacceptable workload boundary, accident boundary, pressure towards efficiency, reduction of effort, safety campaign, marginal boundary, error margin

Slide 27

Slide 27 text

Flirting with the margin (R. I. Cook, 2004)
Diagram labels: error margin, original marginal boundary, acceptable operating point, accident boundary

Slide 33

Slide 33 text

Flirting with the margin (R. I. Cook, 2004)
Diagram labels: accident boundary, new marginal boundary!

Slide 35

Slide 35 text

Insights from Cook’s model
- Engineering resilience requires a model of safety based on monitoring, responding, adapting, and learning
- System safety is about what can happen, where the operating point actually is, and what we do under pressure
- Resilience is operator community focused

Slide 36

Slide 36 text

Engineering system resilience
- Build support for continuous maintenance
- Reveal control of system to operators
- Know it’s going to get moved, replaced, and used in ways you did not intend
- Think about configurations as interfaces

Slide 41

Slide 41 text

Borrill's model

Slide 42

Slide 42 text

A system’s complexity
Plot: probability of failure vs. rank, with regions for traditional engineering, reactive ops, and unk-unk
Unk-unk: cascading or catastrophic failures & you don’t know where they will come from! Same area as the other 2 combined

Slide 43

Slide 43 text

Failure areas need different strategies
Plot: probability of failure vs. rank, with regions for traditional engineering, reactive ops, and unk-unk

Slide 44

Slide 44 text

Failure areas need different strategies
Plot: probability of failure vs. rank, with regions for traditional engineering, reactive ops, and unk-unk
Kingsbury

Slide 45

Slide 45 text

Failure areas need different strategies
Plot: probability of failure vs. rank, with regions for traditional engineering, reactive ops, and unk-unk
Kingsbury VS

Slide 46

Slide 46 text

Failure areas need different strategies
Plot: probability of failure vs. rank, with regions for traditional engineering, reactive ops, and unk-unk
Kingsbury VS Alvaro

Slide 47

Slide 47 text

Strategies to build resilience
Failure areas: Classical engineering, Reactive Operations, Unknown-Unknown
Strategies: Code standards, Programming patterns, Testing (full system!), Metrics & monitoring, Convergence to good state, Hazard inventories, Redundancies, Feature flags, Dark deploys, Runbooks & docs, Canaries, System verification, Formal methods, Fault injection
The goal is to build failure domain independence

Slide 48

Slide 48 text

“Thinking about building system resilience using a single discipline is insufficient. We need different strategies” (Borrill)

Slide 49

Slide 49 text

Wedding Trivia!!!

Slide 50

Slide 50 text

Resilience in Industry

Slide 51

Slide 51 text

Now with sparkles! ✨ ✨

Slide 52

Slide 52 text

- API inherently more vulnerable to any system failures or latencies in the stack
- Without fault tolerance: 30 dependencies with 99.99% uptime each could result in 2+ hours of downtime per month!
- Leveraged client libraries
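The "2+ hours" figure can be checked with a few lines of arithmetic. This sketch assumes every request needs all 30 dependencies, that their failures are independent, and that a month is 30 days; those assumptions are mine, not stated on the slide.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def combined_availability(per_dependency, count):
    """Availability of a request that needs every dependency to be up,
    assuming the dependencies fail independently."""
    return per_dependency ** count


availability = combined_availability(0.9999, 30)
downtime_hours = (1 - availability) * HOURS_PER_MONTH
print(f"availability={availability:.4%}, downtime={downtime_hours:.2f} h/month")
```

Under these assumptions the combined availability falls to roughly 99.7%, which works out to a little over two hours of downtime per month.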

Slide 53

Slide 53 text

Netflix’s resilient patterns
- Aggressive network timeouts & retries. Use of semaphores
- Separate threads on per-dependency thread pools
- Circuit-breakers to relieve pressure in underlying systems
- Exceptions cause app to shed load until things are healthy
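A minimal sketch of the circuit-breaker idea named above. It is not Netflix's implementation; the `CircuitBreaker` class, its thresholds, and the injectable `clock` are hypothetical choices made to keep the example small and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Shed load: fail fast instead of piling onto a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is whatever function talks to the dependency, and treat `CircuitOpenError` as a signal to serve a fallback.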

Slide 54

Slide 54 text

We went on a diet just like you!

Slide 56

Slide 56 text

Key insights from Chubby
- Library vs service? Service and client library control + storage of small data files with restricted operations
- Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems

Slide 57

Slide 57 text

Key insights from Chubby
- Centralized services are hard to construct, but you can dedicate effort to architecting them well and making them failure-tolerant
- Restricting user behavior increased resilience
- Consumers of your service are part of your UNK-UNK scenarios

Slide 58

Slide 58 text

And the family arrives!

Slide 59

Slide 59 text

Key insights from Truce
Tyler McMullen, Bruce Spang
- Evolution of our purging system from v1 to v3
- Used Bimodal Multicast (Gossip protocol) to provide extremely fast purging speed
- Design concerns & system evolution

Slide 60

Slide 60 text

Key insights from NetSys
João Taveira Araújo, looking suave
- Existing best practices won’t save you
- Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches
- Do immediate or gradual host failure & recovery
- Watch João’s talk

Slide 63

Slide 63 text

But wait a minute!
- So we have a myriad of systems at different stages of evolution
- Resilient systems like Varnish, Powderhorn, and Faild have taught us many lessons
- But some applications have availability problems. Why?

Slide 64

Slide 64 text

Everyone okay?

Slide 65

Slide 65 text

Resilient architectural patterns

Slide 66

Slide 66 text

Redundancies are key
- Redundancies of resources, execution paths, checks, replication of data, replay of messages, anti-entropy build resilience
- Gossip / epidemic protocols too
- Capacity planning matters
- Optimizations can make your system less resilient!
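To illustrate why gossip counts as a redundancy mechanism, here is a small simulation of plain push gossip. It is a sketch only and does not model Bimodal Multicast or any production system; the `push_gossip` function, its fan-out, and the node count are hypothetical.

```python
import random


def push_gossip(n_nodes, fanout=2, seed=0, max_rounds=1000):
    """Simulate push gossip. Returns (rounds used, number of nodes reached)."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the message
    rounds = 0
    while len(informed) < n_nodes and rounds < max_rounds:
        rounds += 1
        newly = set()
        for _ in informed:
            # Each informed node pushes to `fanout` peers chosen at random.
            # Many pushes land on already-informed nodes; that overlap is the
            # redundancy that lets delivery survive lost messages or dead peers.
            newly.update(rng.randrange(n_nodes) for _ in range(fanout))
        informed |= newly
    return rounds, len(informed)


rounds, reached = push_gossip(64)
print(f"reached {reached} nodes in {rounds} rounds")
```

No single node or path is required for delivery, which is the property the slide is pointing at; the cost is the duplicate traffic.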

Slide 67

Slide 67 text

Operations matter
- Unawareness of proximity to error boundary means we are always guessing
- Complex operations make systems less resilient & more incident-prone
- You design operability too!

Slide 68

Slide 68 text

Not all complexity is bad
- Complexity, if it increases safety, is actually good
- Adding resilience may come at the cost of other desired goals (e.g. performance, simplicity, cost)

Slide 69

Slide 69 text

Leverage Engineering best practices
- Resiliency and testing are correlated. TEST!
- Versioning from the start: provide an upgrade path from day 1
- Upgrades & evolvability of systems are still tricky. Mixed-mode operations need to be common
- Re-examine the way we prototype systems

Slide 70

Slide 70 text

Bringing it together

Slide 71

Slide 71 text

tl;dr
WHILE IN DESIGN
- Are we favoring harvest or yield?
- Orthogonality & decomposition FTW
- Do we have enough redundancies in place?
- Are we resilient to our dependencies?
OPERABILITY
- Am I providing enough control to my operators?
- Would I want to be on call for this?
- Rank your services: what can be dropped, killed, deferred?
- Monitoring and alerting in place?
UNK-UNK
- The existence of this stresses diligence on the other two areas
- Have we done everything we can?
- Abandon hope and resort to human sacrifices
Theory matters!

Slide 72

Slide 72 text

tl;dr
WHILE IN DESIGN
- Test dependency failures
- Code reviews != tests. Have both
- Distrust client behavior, even if they are internal
- Version (APIs, protocols, disk formats) from the start. Support mixed-mode operations
- Checksum all the things
- Error handling, circuit breakers, backpressure, leases, timeouts
IMPROVING OPERABILITY
- Automation shortcuts taken while in a rush will come back to haunt you
- Release stability is often tied to system stability. Iron out your deploy process
- Link alerts to playbooks
- Consolidate system configuration (data bags, config file, etc)
Operators determine resilience
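As one way to picture "version from the start" together with "checksum all the things", here is a sketch of a record format that carries both a version byte and a CRC32 of its payload. The layout, `FORMAT_VERSION`, and the `encode`/`decode` helpers are hypothetical, chosen only for illustration.

```python
import struct
import zlib

FORMAT_VERSION = 1             # hypothetical on-disk/wire format version
HEADER = struct.Struct(">BI")  # 1-byte version, 4-byte CRC32 of the payload


def encode(payload: bytes) -> bytes:
    """Prefix the payload with its format version and checksum."""
    return HEADER.pack(FORMAT_VERSION, zlib.crc32(payload)) + payload


def decode(record: bytes) -> bytes:
    """Validate version and checksum, then return the payload."""
    version, checksum = HEADER.unpack_from(record)
    if version != FORMAT_VERSION:
        # A reader supporting mixed-mode operation would dispatch to a
        # per-version decoder here instead of rejecting the record.
        raise ValueError(f"unsupported format version {version}")
    payload = record[HEADER.size:]
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: record is corrupt")
    return payload
```

Because the version travels with every record, old and new readers can coexist during an upgrade, and the checksum turns silent corruption into an explicit error.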

Slide 73

Slide 73 text

TODAY’S RANTIFESTO
We can’t recover from lack of design. Not minding harvest/yield means we sign up for a redesign the moment we finish coding.

Slide 74

Slide 74 text

Thank you!
github.com/Randommood/Strangeloop2015
Special thanks to Paul Borrill, Jordan West, Caitie McCaffrey, Camille Fournier, Mike O'Neill, Neha Narula, Joao Taveira, Tyler McMullen, Zac Duncan, Nathan Taylor, Ian Fung, Armon Dadgar, Peter Alvaro, Peter Bailis, Bruce Spang, Matt Whiteley, Alex Rasmussen, Aysulu Greenberg, Elaine Greenberg, and Greg Bako.