Slide 1

Slide 1 text

Architectural Patterns of Resilient Distributed Systems
Full Stack Fest 2016

Slide 2

Slide 2 text

Ines Sombra @Randommood

Slide 3

Slide 3 text

Globally distributed & highly available

Slide 4

Slide 4 text

Today’s Journey
1. Motivation
2. Resilience in literature
3. Resilience in industry
4. Conclusions

Slide 5

Slide 5 text

Resilience is the ability of a system to adapt or keep working when challenges occur

Slide 6

Slide 6 text

Defining Resilience
- Fault-tolerance
- Evolvability
- Scalability
- Failure isolation
- Complexity management

Slide 7

Slide 7 text

It’s what really matters

Slide 8

Slide 8 text

How can we construct more resilient systems?

Slide 9

Slide 9 text

Resilience in Literature

Slide 10

Slide 10 text

Harvest & Yield Model

Slide 11

Slide 11 text

Yield
- Fraction of successfully answered queries
- Close to uptime but more useful because it directly maps to user experience (uptime misses this)
- Focus on yield rather than uptime

Slide 12

Slide 12 text

Harvest
- Fraction of the complete result
Diagram (from Coda Hale’s “You can’t sacrifice partition tolerance”): Server A, Server B, Server C; labels: Baby, Animals, Cute

Slide 13

Slide 13 text

Harvest
- Fraction of the complete result
Diagram (from Coda Hale’s “You can’t sacrifice partition tolerance”): Server A, Server B, Server C; labels: Baby, Animals, Cute; one server is marked X, giving 66% harvest
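The two metrics can be sketched as simple ratios. This is a minimal illustration, not code from the talk: the function names and the traffic numbers are hypothetical, and it assumes every server holds an equal share of the complete result.

```python
def yield_fraction(answered: int, offered: int) -> float:
    """Yield: fraction of offered queries that were answered."""
    if offered == 0:
        return 1.0  # assumption: no offered queries counts as full yield
    return answered / offered


def harvest_fraction(shards_reached: int, shards_total: int) -> float:
    """Harvest: fraction of the complete data reflected in a response."""
    if shards_total <= 0:
        raise ValueError("shards_total must be positive")
    return shards_reached / shards_total


# Three servers each hold part of the result; one is unreachable.
harvest = harvest_fraction(shards_reached=2, shards_total=3)

# Hypothetical traffic numbers, for illustration only.
yield_ = yield_fraction(answered=990, offered=1000)

print(f"harvest={harvest:.1%} yield={yield_:.1%}")
```

With one of three servers down, harvest drops to two thirds while yield can stay high, because every query is still answered with a partial result.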

Slide 14

Slide 14 text

#1: Probabilistic Availability
- Graceful harvest degradation under faults
- Randomness to make the worst-case & average-case the same
- Replication of high-priority data for greater harvest control
- Degrading results based on client capability
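Graceful harvest degradation can be sketched as a scatter-gather query that skips unreachable shards instead of failing the whole request. This is a hedged sketch: `query_all`, `healthy`, `down`, and the shard contents are all made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Shard = Callable[[str], List[str]]


def query_all(shards: Dict[str, Shard], term: str) -> Tuple[List[str], float]:
    """Query every shard; skip the ones that fail instead of failing the request."""
    results: List[str] = []
    reached = 0
    for shard in shards.values():
        try:
            results.extend(shard(term))
            reached += 1
        except ConnectionError:
            # A faulty shard lowers harvest, but the request is still answered.
            continue
    harvest = reached / len(shards) if shards else 0.0
    return results, harvest


def healthy(items: List[str]) -> Shard:
    """Build a hypothetical shard that does a substring match over its items."""
    return lambda term: [item for item in items if term in item]


def down(term: str) -> List[str]:
    """A hypothetical shard that is currently unreachable."""
    raise ConnectionError("shard unreachable")


shards = {
    "a": healthy(["cute kitten", "cute puppy"]),
    "b": down,
    "c": healthy(["cute piglet"]),
}
results, harvest = query_all(shards, "cute")
```

Returning the harvest figure alongside the results lets the caller decide whether a partial answer is good enough for this client.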

Slide 15

Slide 15 text

#2: Decomposition & Orthogonality
- Decompose into subsystems that are each intolerant to harvest degradation, so that your application can continue even if one of them fails
- Only provide strong consistency for the subsystems that need it
- Orthogonal mechanisms (state vs functionality)

Slide 16

Slide 16 text

“Whether your system favors yield or harvest is an outcome of its design” ~ Fox & Brewer

Slide 17

Slide 17 text

Cook & Rasmussen model

Slide 18

Slide 18 text

Cook & Rasmussen (diagram): economic failure boundary, unacceptable workload boundary, accident boundary, operating point

Slide 19

Slide 19 text

Cook & Rasmussen (diagram): economic failure boundary, unacceptable workload boundary, accident boundary

Slide 20

Slide 20 text

Cook & Rasmussen (diagram): economic failure boundary, unacceptable workload boundary, accident boundary; pressure towards efficiency

Slide 21

Slide 21 text

Cook & Rasmussen (diagram): economic failure boundary, unacceptable workload boundary, accident boundary; pressure towards efficiency, reduction of effort

Slide 22

Slide 22 text

Cook & Rasmussen (diagram): economic failure boundary, unacceptable workload boundary, accident boundary; pressure towards efficiency, reduction of effort; incident!

Slide 23

Slide 23 text

Cook & Rasmussen (diagram): economic failure boundary, unacceptable workload boundary, accident boundary; pressure towards efficiency, reduction of effort; safety campaign

Slide 24

Slide 24 text

Cook & Rasmussen (diagram): economic failure boundary, unacceptable workload boundary, accident boundary; pressure towards efficiency, reduction of effort; safety campaign; marginal boundary, error margin

Slide 25

Slide 25 text

Flirting with the margin (R.I. Cook, 2004)
Diagram: error margin, original marginal boundary, acceptable operating point, accident boundary

Slide 31

Slide 31 text

Flirting with the margin (R.I. Cook, 2004)
Diagram: accident boundary, new marginal boundary!

Slide 33

Slide 33 text

Insights from Cook’s model
- Engineering resilience requires a model of safety based on monitoring, responding, adapting, and learning
- System safety is about what can happen, where the operating point actually is, and what we do under pressure

Slide 34

Slide 34 text

Engineering system resilience
- Build support for continuous maintenance
- Reveal control of system to operators
- Know it’s going to get moved, replaced, and used in ways you did not intend
- Think about configurations as interfaces

Slide 35

Slide 35 text

Borrill's model

Slide 36

Slide 36 text

A system’s complexity
Graph: probability of failure against rank, with three areas: traditional engineering, reactive ops, unk-unk
Annotations:
- Cascading or catastrophic failures & you don’t know where they will come from!
- Same area as other 2 combined

Slide 37

Slide 37 text

Failure areas need different strategies
Graph: probability of failure against rank, with three areas: traditional engineering, reactive ops, unk-unk
Kingsbury vs Alvaro

Slide 38

Slide 38 text

Strategies to build resilience
Areas: Classical engineering, Reactive Operations, Unknown-Unknown
- Code standards
- Programming patterns
- Full system testing
- Metrics & monitoring are MVP
- Convergence to good state
- Hazard inventories
- Redundancies
- Feature flags
- Dark deploys
- Runbooks & docs
- Canaries
- System verification
- Formal methods
- Fault injection
The goal is to build failure domain independence

Slide 39

Slide 39 text

“Thinking about building system resilience using a single discipline is insufficient. We need different strategies” ~ Borrill

Slide 40

Slide 40 text

Resilience in Industry

Slide 42

Slide 42 text

Key insights from Chubby
- Library vs service? Service and client library control + storage of small data files with restricted operations
- Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems

Slide 43

Slide 43 text

Key insights from Chubby
- Centralized services are hard to construct but you can dedicate effort into architecting them well and making them failure-tolerant
- Restricting user behavior increased resilience
- Consumers of your service are part of your UNK-UNK scenarios

Slide 44

Slide 44 text

Key patterns
- Aggressive network timeouts & retries. Use of semaphores. Separate threads on per-dependency thread pools
- Circuit-breakers to relieve pressure in underlying systems
- Exceptions cause app to shed load until things are healthy
A lot more in the resource repo!
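The circuit-breaker pattern named above can be sketched as a small wrapper around calls to a dependency. This is a minimal single-threaded sketch under stated assumptions, not the implementation the slide refers to: the class name, the thresholds, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: shed load instead of piling onto a struggling dependency.
                raise CircuitOpenError("dependency assumed unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency in its own breaker, for example `breaker.call(lambda: fetch(url))` with a hypothetical `fetch`, and treat `CircuitOpenError` as a signal to fall back or degrade. A production version would also need locking and a timeout on the wrapped call.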

Slide 45

Slide 45 text

System intuition

Slide 46

Slide 46 text

Powderhorn insights
- Evolution of our purging system from v1 to v3
- Used Bimodal Multicast (Gossip protocol) to provide extremely fast purging speed
Watch their talk! Tyler McMullen, Bruce Spang

Slide 47

Slide 47 text

NetSys patterns
- Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches
- Do immediate or gradual host failure & recovery
Watch Joao’s talk

Slide 49

Slide 49 text

ImageOpto insights
- A stateless system is nice but figuring out the request cycle can be tricky
- Dependencies are hard: customer setup, caching layer, & libraries. We have to be resilient to all of them
Diagram: CDN, Image Opto, Origin

Slide 50

Slide 50 text

ImageOpto insights
- Design error types & their handling carefully
- Failure detection & system operability are ongoing concerns
- Mixed-mode & versioning of data structures
- Validation & system adaptability
Diagram: Origin, CDN, IO, with one link marked X

Slide 51

Slide 51 text

Resilient architectural patterns

Slide 52

Slide 52 text

Redundancies are key
- Redundancies of resources, execution paths, checks, replication of data, replay of messages, and anti-entropy build resilience
- Gossip/epidemic protocols
- Capacity planning matters
- Optimizations can make your system less resilient!
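To give a feel for why gossip/epidemic protocols add redundancy, here is a toy simulation of plain push gossip. It is not Bimodal Multicast and not the purging system mentioned earlier. The function name, parameters, and round-based model are assumptions made for illustration.

```python
import random
from typing import Set


def gossip_rounds(n_nodes: int, fanout: int, seed: int = 0,
                  max_rounds: int = 1000) -> int:
    """Return the number of rounds until every node has heard the message."""
    if n_nodes < 1 or fanout < 1:
        raise ValueError("n_nodes and fanout must be at least 1")
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    informed: Set[int] = {0}   # node 0 originates the message
    rounds = 0
    while len(informed) < n_nodes:
        if rounds >= max_rounds:
            raise RuntimeError("gossip did not converge")
        newly: Set[int] = set()
        for _ in informed:
            # Each informed node pushes to `fanout` peers chosen at random.
            newly.update(rng.randrange(n_nodes) for _ in range(fanout))
        informed |= newly
        rounds += 1
    return rounds


rounds = gossip_rounds(n_nodes=1000, fanout=3)
print(f"all 1000 nodes informed after {rounds} rounds")
```

Because every informed node keeps re-sending, many nodes receive the message more than once. That duplication is the redundancy: losing a few messages or nodes does not stop the spread.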

Slide 53

Slide 53 text

Operations matter
- Unawareness of proximity to error boundary means we are always guessing
- Complex operations make systems less resilient & more incident-prone
- You design operability too!

Slide 54

Slide 54 text

Not all complexity is bad
- Complexity, if it increases safety, is actually good
- Adding resilience may come at the cost of other desired goals (e.g. performance, simplicity, cost, etc.)

Slide 55

Slide 55 text

Leverage Engineering Best Practices
- Resiliency and testing are correlated. TEST!
- Versioning from the start: provide an upgrade path from day 1
- Upgrades & evolvability of systems are still tricky. Mixed-mode operations need to be common
- Re-examine the way we prototype systems

Slide 57

Slide 57 text

tl;dr

IN DESIGN
- Are we favoring harvest or yield?
- Orthogonality & decomposition FTW
- Do we have enough redundancies in place?
- Are we resilient to our dependencies?
- Theory matters!

OPERABILITY
- Am I providing enough control to my operators?
- Would I want to be on call for this?
- Rank your services: what can be dropped, killed, deferred?
- Monitoring and alerting in place?

UNK-UNK
- The existence of this stresses diligence on the other two areas
- Have we done everything we can?
- Abandon hope and resort to human sacrifices

Slide 58

Slide 58 text

tl;dr

IN DESIGN
- Test dependency failures
- Code reviews != tests. Have both
- Distrust client behavior, even if the clients are internal
- Version (APIs, protocols, disk formats) from the start
- Checksum all the things
- Error handling, circuit breakers, backpressure, leases, timeouts

OPERABILITY
- Automation shortcuts taken while rushed will come back to haunt you
- Release stability is often tied to system stability. Iron out your deploy process
- Link alerts to playbooks
- Consolidate system configuration (data bags, config file, etc.)
- Operators determine resilience
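Two of the design items above, versioning formats from the start and checksumming, can be combined in one small record format. This is a sketch under assumptions: the `version:crc32:body` layout, the `FORMAT_VERSION` constant, and the function names are invented for illustration.

```python
import json
import zlib
from typing import Any, Dict

FORMAT_VERSION = 1  # hypothetical format version, bumped on incompatible changes


def encode(payload: Dict[str, Any]) -> bytes:
    """Serialize a payload as b'<version>:<crc32 hex>:<json body>'."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    header = f"{FORMAT_VERSION}:{zlib.crc32(body):08x}:".encode("ascii")
    return header + body


def decode(record: bytes) -> Dict[str, Any]:
    """Parse a record, rejecting unknown versions and corrupted bodies."""
    version, checksum, body = record.split(b":", 2)
    if int(version.decode("ascii")) != FORMAT_VERSION:
        # A real reader would dispatch to an older decoder here (mixed mode).
        raise ValueError("unsupported format version")
    if zlib.crc32(body) != int(checksum.decode("ascii"), 16):
        raise ValueError("checksum mismatch: record is corrupt")
    return json.loads(body)


record = encode({"service": "purge", "ttl": 30})
assert decode(record) == {"service": "purge", "ttl": 30}
```

Putting the version first means a reader can decide how to parse the rest before touching it, which is what makes an upgrade path possible later.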

Slide 59

Slide 59 text

“We can’t recover from lack of design. Not minding harvest/yield means we sign up for a redesign the moment we finish coding” ~ Me last year

Slide 60

Slide 60 text

“Good design is hard. Unknowns are hard to predict. Let the tenets we discussed today guide your redesigns.” ~ Me today

Slide 61

Slide 61 text

~ THANK YOU ~
github.com/Randommood/FullStackFest2016