Architectural Patterns of Resilient Distributed Systems

Slide 1

Slide 1 text

OMG Strangeloop 2015!! Architectural Patterns of Resilient Distributed Systems

Slide 2

Slide 2 text

Ines Sombra @Randommood [email protected]

Slide 3

Slide 3 text

Globally distributed & highly available

Slide 4

Slide 4 text

Today’s Journey: Why care? Resilience literature. Resilience in industry. Conclusions.

Slide 5

Slide 5 text

OBLIGATORY DISCLAIMER SLIDE. All from a practitioner’s perspective! Things you may see in this talk: pugs, fast talking, life pondering, un-tweetable moments, rantifestos, what surprised me this year, wedding factoids and trivia.

Slide 6

Slide 6 text

Why Resilience?

Slide 7

Slide 7 text

How can I make a system more resilient?

Slide 8

Slide 8 text

Resilience is the ability of a system to adapt or keep working when challenges occur.

Slide 9

Slide 9 text

Defining Resilience: fault-tolerance, evolvability, scalability, failure isolation, complexity management.

Slide 10

Slide 10 text

It’s what really matters

Slide 11

Slide 11 text

Resilience in Literature

Slide 12

Slide 12 text

Harvest & Yield Model

Slide 13

Slide 13 text

Yield: the fraction of successfully answered queries. Close to uptime, but more useful because it maps directly to user experience (uptime misses this). Focus on yield rather than uptime.
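A minimal sketch of the yield definition above, in Python. The function name and the query counts are made up for illustration. It shows why two outages with the same duration (same uptime) can have very different yield, depending on how much traffic they hit.

```python
def yield_fraction(answered: int, offered: int) -> float:
    """Yield = successfully answered queries / queries offered."""
    if offered == 0:
        return 1.0  # assumption: with no traffic, nothing was dropped
    return answered / offered


# Hypothetical day with 1,000,000 queries and one outage of fixed length.
total = 1_000_000
peak_outage = yield_fraction(total - 5_000, total)   # outage hit peak traffic
quiet_outage = yield_fraction(total - 50, total)     # same outage, off-peak

# Uptime is identical in both cases; yield is not.
print(peak_outage, quiet_outage)
```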

Slide 14

Slide 14 text

Harvest: the fraction of the complete result. Diagram from Coda Hale’s “You can’t sacrifice partition tolerance”. Labels: Server A, Server B, Server C; Baby, Animals, Cute.

Slide 15

Slide 15 text

Harvest: the fraction of the complete result. Diagram from Coda Hale’s “You can’t sacrifice partition tolerance”. Labels: Server A, Server B, Server C; Baby, Animals, Cute; X; 66% harvest.
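The 66% figure can be reproduced with a small scatter-gather sketch in Python. This is an illustration, not code from the talk or from Coda Hale's post. It assumes the X marks one unreachable server out of three and that data is spread evenly, so harvest is approximated by the fraction of servers that answered. The function name and the sample documents are hypothetical.

```python
def query_shards(shards, term):
    """Ask every shard for documents containing `term`.

    Returns (results, harvest). A shard given as None stands in for an
    unreachable server; it is skipped instead of failing the whole query.
    """
    results, reachable = [], 0
    for shard in shards:
        if shard is None:
            continue
        reachable += 1
        results.extend(doc for doc in shard if term in doc)
    # Assumption: evenly spread data, so reachable shards ~ fraction of data.
    harvest = reachable / len(shards) if shards else 0.0
    return results, harvest


server_a = ["cute baby animals", "cute pugs"]
server_b = None  # down
server_c = ["cute otters", "tax forms"]

results, harvest = query_shards([server_a, server_b, server_c], "cute")
print(results, f"{harvest:.0%} harvest")
```

The query still answers (yield is preserved) but with a partial result (harvest drops to two thirds).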

Slide 16

Slide 16 text

#1: Probabilistic Availability. Graceful harvest degradation under faults. Randomness to make the worst case and the average case the same. Replication of high-priority data for greater harvest control. Degrading results based on client capability.
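Two of the strategies above, random placement and replication of high-priority data, can be sketched in Python as follows. All names are hypothetical and the replication rule (one extra copy on the next shard) is an assumption chosen for brevity. Hash-based placement spreads keys pseudo-randomly, so losing any one shard costs roughly the same fraction of data, while replicated keys survive a single-shard failure.

```python
import hashlib


def shard_for(key: str, n_shards: int) -> int:
    """Hash-based placement: spreads keys pseudo-randomly over shards."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def placements(key: str, n_shards: int, high_priority: bool) -> set:
    """Shards holding `key`; high-priority keys get a second copy."""
    primary = shard_for(key, n_shards)
    if high_priority and n_shards > 1:
        return {primary, (primary + 1) % n_shards}
    return {primary}


def harvest_after_failure(keys, n_shards, failed, high_priority=frozenset()):
    """Fraction of keys still readable when shard `failed` is down."""
    if not keys:
        return 1.0
    alive = sum(
        1 for k in keys
        if placements(k, n_shards, k in high_priority) - {failed}
    )
    return alive / len(keys)


keys = [f"doc-{i}" for i in range(1000)]
vip = set(keys[:100])  # hypothetical high-priority subset
print(harvest_after_failure(keys, 4, failed=0, high_priority=vip))
```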

Slide 17

Slide 17 text

#2: Decomposition & Orthogonality. Decompose the application into subsystems that are each independently intolerant to harvest degradation, such that the application as a whole can continue if one of them fails. You can then provide strong consistency only for the subsystems that need it. Orthogonal mechanisms (state vs. functionality).
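A minimal Python sketch of the decomposition idea above, with hypothetical subsystem names. Each subsystem either answers fully or fails. A failed optional subsystem is dropped from the response, while a failed required one still fails the request.

```python
def render_page(subsystems):
    """Call each subsystem; return (page, names of failed optional parts).

    `subsystems` maps a name to (fetch_function, required_flag).
    """
    page, failed = {}, []
    for name, (fetch, required) in subsystems.items():
        try:
            page[name] = fetch()
        except Exception:
            if required:
                raise            # core subsystem: no degraded answer allowed
            failed.append(name)  # optional subsystem: page renders without it
    return page, failed


def cart():
    return ["pug calendar"]


def recommendations():
    raise TimeoutError("recommendation service down")


page, failed = render_page({
    "cart": (cart, True),
    "recommendations": (recommendations, False),
})
print(page, failed)
```

Only the required subsystem needs the stronger guarantees here; the optional one can be built with cheaper, more available mechanisms.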

Slide 18

Slide 18 text

“If your system favors yield or harvest is an outcome of its design” Fox & Brewer

Slide 19

Slide 19 text

Cook & Rasmussen model

Slide 20

Slide 20 text

Cook & Rasmussen diagram. Labels: economic failure boundary; unacceptable workload boundary; accident boundary; operating point.

Slide 21

Slide 21 text

Cook & Rasmussen diagram. Labels: economic failure boundary; unacceptable workload boundary; accident boundary.

Slide 22

Slide 22 text

Cook & Rasmussen diagram. Labels: economic failure boundary; unacceptable workload boundary; accident boundary; pressure towards efficiency.

Slide 23

Slide 23 text

Cook & Rasmussen diagram. Labels: economic failure boundary; unacceptable workload boundary; accident boundary; pressure towards efficiency; reduction of effort.

Slide 24

Slide 24 text

Cook & Rasmussen diagram. Labels: economic failure boundary; unacceptable workload boundary; accident boundary; pressure towards efficiency; reduction of effort; incident!

Slide 25

Slide 25 text

Cook & Rasmussen diagram. Labels: economic failure boundary; unacceptable workload boundary; accident boundary; pressure towards efficiency; reduction of effort; safety campaign.

Slide 26

Slide 26 text

Cook & Rasmussen diagram. Labels: economic failure boundary; unacceptable workload boundary; accident boundary; pressure towards efficiency; reduction of effort; error margin; marginal boundary; safety campaign.