YOW2016 - Speaker Deck

Slide 1

Slide 1 text

Architectural Patterns of Resilient Distributed Systems YOW 2016

Slide 2

Slide 2 text

Ines Sombra @Randommood

Slide 3

Slide 3 text

Globally distributed & highly available

Slide 4

Slide 4 text

Today’s Journey
1. Motivation: Why Ines cares & you should too
2. Resilience in literature: Foundational knowledge
3. Resilience in industry: What are others doing?
4. Conclusions: Tie it all together

Slide 5

Slide 5 text

Resilience is the ability of a system to adapt or keep working when challenges occur

Slide 6

Slide 6 text

Defining Resilience
Fault-tolerance
Evolvability
Scalability
Failure isolation
Complexity management

Slide 7

Slide 7 text

How can we construct more resilient systems?

Slide 8

Slide 8 text

It’s what really matters

Slide 9

Slide 9 text

Slide 10

Slide 10 text

The Team

Slide 11

Slide 11 text

3000 × 2000 px 361 KB

Slide 12

Slide 12 text

Trim all edges by 25%: http://www.fastly.io/image.jpg?trim=0.25
Crop the image square and resize the width to 200px: http://www.fastly.io/image.jpg?crop=1:1&width=200
1000 × 667 px, 92 KB
200 × 200 px, 9 KB
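The two URLs above differ only in their query parameters, so they can be assembled from a base URL plus a set of transformations. The following is a minimal sketch, assuming nothing beyond the `trim`, `crop`, and `width` parameters shown on the slide; the helper name `image_url` is hypothetical and is not part of any Fastly client library.

```python
from urllib.parse import urlencode

# Base URL taken from the slide.
BASE_URL = "http://www.fastly.io/image.jpg"


def image_url(base: str, **params) -> str:
    """Append image-transformation parameters to a base URL (hypothetical helper)."""
    if not params:
        return base
    # Keep ":" unescaped so a ratio such as crop=1:1 stays readable.
    return f"{base}?{urlencode(params, safe=':')}"


trimmed = image_url(BASE_URL, trim=0.25)
square_thumb = image_url(BASE_URL, crop="1:1", width=200)
print(trimmed)
print(square_thumb)
```

Keeping the transformation in the query string means the pristine image keeps a single stable URL, and every variant is derived from it.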

Slide 13

Slide 13 text

ImageOpto 101 (diagram: Origin, CDN, and Image Opto nodes)

Slide 14

Slide 14 text

ImageOpto 101 (diagram: Origin, CDN, and Image Opto nodes)

Slide 15

Slide 15 text

ImageOpto 101 (diagram: Origin, CDN, and Image Opto nodes)

Slide 16

Slide 16 text

ImageOpto 101 (diagram: Origin, CDN, and Image Opto nodes)

Slide 17

Slide 17 text

ImageOpto 101 (diagram: Origin, CDN, and Image Opto nodes)

Slide 18

Slide 18 text

POP

Slide 19

Slide 19 text

Resilience in Literature

Slide 20

Slide 20 text

Harvest & Yield Model

Slide 21

Slide 21 text

Yield
Fraction of successfully answered queries
Focus on yield rather than uptime (think Amazon during Xmas)
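To see why yield says more than uptime, compare the same one-second outage at peak and off-peak. The sketch below uses made-up traffic figures (`TOTAL_QUERIES`, `PEAK_QPS`, and `OFF_PEAK_QPS` are assumptions, not numbers from the talk): uptime is identical in both cases, but the fraction of queries answered is not.

```python
SECONDS_PER_DAY = 86_400

# Hypothetical traffic figures, chosen only for illustration.
TOTAL_QUERIES = 500_000_000  # queries offered over the whole day
PEAK_QPS = 50_000            # queries per second at peak
OFF_PEAK_QPS = 100           # queries per second off-peak


def uptime(seconds_up: int, seconds_total: int) -> float:
    """Fraction of time the system was up."""
    return seconds_up / seconds_total


def yield_fraction(answered: int, offered: int) -> float:
    """Fraction of offered queries that were successfully answered."""
    return answered / offered


# A one-second outage costs the same uptime whenever it happens...
day_uptime = uptime(SECONDS_PER_DAY - 1, SECONDS_PER_DAY)

# ...but it drops far more queries at peak than off-peak.
peak_outage_yield = yield_fraction(TOTAL_QUERIES - PEAK_QPS, TOTAL_QUERIES)
off_peak_outage_yield = yield_fraction(TOTAL_QUERIES - OFF_PEAK_QPS, TOTAL_QUERIES)

print(f"uptime (either case):  {day_uptime:.6f}")
print(f"yield, peak outage:    {peak_outage_yield:.6f}")
print(f"yield, off-peak outage: {off_peak_outage_yield:.6f}")
```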

Slide 22

Slide 22 text

Harvest
Fraction of the complete result
(diagram: Server A, Server B, Server C; Baby, Animals, Cute)
From Coda Hale’s “You can’t sacrifice partition tolerance”

Slide 23

Slide 23 text

100% harvest

Slide 24

Slide 24 text

Harvest
Fraction of the complete result
(diagram: Server A, Server B, Server C; Baby, Animals, Cute; one server marked X, 66% harvest)
From Coda Hale’s “You can’t sacrifice partition tolerance”

Slide 25

Slide 25 text

☹ 66% harvest

Slide 26

Slide 26 text

Harvest
Fraction of the complete result
(diagram: Server A, Server B, Server C; Baby, Animals, Cute; two servers marked X, 33% harvest)
From Coda Hale’s “You can’t sacrifice partition tolerance”

Slide 27

Slide 27 text

33% harvest
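The three harvest slides can be reproduced with a small scatter-gather sketch. The shard layout below is an assumption made for illustration (one server per search term, with invented document IDs); the point is only that the query is still answered when servers fail, while the fraction of the complete result shrinks.

```python
from typing import Dict, List, Set, Tuple

# Assumed layout: each server holds the documents for one term.
SHARDS: Dict[str, List[str]] = {
    "Server A": ["baby-1", "baby-2"],
    "Server B": ["animals-1", "animals-2"],
    "Server C": ["cute-1", "cute-2"],
}


def search(down: Set[str]) -> Tuple[List[str], float]:
    """Query every shard, skip the failed ones, and report the harvest."""
    results: List[str] = []
    answered = 0
    for server, docs in SHARDS.items():
        if server in down:
            continue  # a failed shard contributes nothing to the result
        results.extend(docs)
        answered += 1
    harvest = answered / len(SHARDS)
    return results, harvest


for down in (set(), {"Server C"}, {"Server B", "Server C"}):
    results, harvest = search(down)
    print(f"down={sorted(down)} results={len(results)} harvest={harvest:.2f}")
```

With one of three servers down the harvest is 2/3, which the slides show truncated to 66%.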

Slide 28

Slide 28 text

#1: Probabilistic Availability
Randomness to make the worst-case & average-case the same
Replication of high-priority data for greater harvest control
Degrading results based on client capability
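A rough sketch of the first two ideas, under stated assumptions: keys are spread across nodes by a hash (standing in for random placement), and a small set of high-priority keys is copied to every node. The node names, key names, and helper functions are all hypothetical. With this placement, losing any single node costs roughly the same share of the data, and the high-priority keys stay reachable.

```python
import hashlib
from typing import Dict, Iterable, List, Set

NODES: List[str] = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster


def home_node(key: str) -> str:
    """Pick a node pseudo-randomly from the key's hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def place(keys: Iterable[str], high_priority: Set[str]) -> Dict[str, Set[str]]:
    """Assign each key to one node; copy high-priority keys to every node."""
    placement: Dict[str, Set[str]] = {node: set() for node in NODES}
    for key in keys:
        if key in high_priority:
            for node in NODES:
                placement[node].add(key)
        else:
            placement[home_node(key)].add(key)
    return placement


def harvest_after_failure(placement: Dict[str, Set[str]], failed: str, total: int) -> float:
    """Fraction of all keys still reachable once one node is lost."""
    reachable: Set[str] = set()
    for node, keys in placement.items():
        if node != failed:
            reachable |= keys
    return len(reachable) / total


keys = [f"record-{i}" for i in range(10_000)]
vip = {"record-0", "record-1", "record-2"}
placement = place(keys, vip)
harvests = {node: harvest_after_failure(placement, node, len(keys)) for node in NODES}
for node, harvest in harvests.items():
    print(f"lose {node}: harvest {harvest:.3f}")
```

The trade-off is storage: every replicated key costs one extra copy per node, which is why replication is reserved for the high-priority data.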

Slide 29

Slide 29 text

#2: Decomposition & Orthogonality
Break into subsystems
Only provide strong consistency for the subsystems that need it
Use orthogonal mechanisms

Slide 30

Slide 30 text

“Whether your system favors yield or harvest is an outcome of its design” ~ Fox & Brewer

Slide 31

Slide 31 text

Harvest & Yield applied
ImageOpto favors harvest
Consistent hashing based on pristine image
Replication to secondary nodes
Orthogonality on the CDN side
(diagram: Origin, CDN, IO, X)
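Consistent hashing on the pristine image with replication to a secondary node could look roughly like the sketch below. This is not ImageOpto's actual implementation: the node names, the virtual-node count, and the choice to hash the URL without its query string are all assumptions made for illustration.

```python
import bisect
import hashlib
from typing import List

NODES = ["io-1", "io-2", "io-3", "io-4"]  # hypothetical Image Opto nodes


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

    def nodes_for(self, key: str, replicas: int = 2) -> List[str]:
        """Return the primary node followed by distinct secondary nodes."""
        wanted = min(replicas, self._node_count)
        start = bisect.bisect(self._points, self._hash(key))
        chosen: List[str] = []
        # Walk clockwise from the key's position, collecting distinct nodes.
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == wanted:
                    break
        return chosen


ring = HashRing(NODES)
pristine = "http://www.fastly.io/image.jpg"
variants = [pristine + "?trim=0.25", pristine + "?crop=1:1&width=200"]
# Hash on the pristine image (URL without transformations), so every
# variant of the same image is routed to the same nodes.
owners = [ring.nodes_for(url.split("?", 1)[0]) for url in variants]
print(owners)
```

If the primary node is removed from the ring, the walk from the same key position reaches the old secondary first, so requests fall over to a node that already holds a replica.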

Slide 32

Slide 32 text

Cook & Rasmussen model

Slide 33

Slide 33 text

Cook & Rasmussen
(diagram labels: economic failure boundary, unacceptable workload boundary, accident boundary, marginal boundary, error margin, pressure towards efficiency, reduction of effort, safety campaign, operating point, incident!)

Slide 34

Slide 34 text

Flirting with the margin
(diagram labels: error margin, original marginal boundary, new marginal boundary!, acceptable operating point, accident boundary)
R.I. Cook, 2004

Slide 35

Slide 35 text

Insights from Cook’s model
Engineering resilience requires a model of safety based on: monitoring, responding, adapting, and learning
System safety is about what can happen, where the operating point actually is, and what we do under pressure

Slide 36

Slide 36 text

Engineering system resilience
Build support for continuous maintenance
Resilience is operator community focused
Know it’s going to get moved, replaced, and used in ways you did not intend

Slide 37

Slide 37 text

Cook & Rasmussen applied
Unexpected use-cases
Acceptable workload boundary influenced a redesign
Use response to incidents as educational opportunities
(diagram: Origin, CDN, IO)

Slide 38

Slide 38 text

Borrill's model

Slide 39

Slide 39 text

A system’s complexity
(chart: probability of failure vs. rank; regions: classical engineering, reactive ops, unk-unk)
Unk-unk: cascading or catastrophic failures & you don’t know where they will come from! Same area as the other 2 combined

Slide 40

Slide 40 text

Failure areas need different strategies
(chart: probability of failure vs. rank; regions: classical engineering, reactive ops, unk-unk)

Slide 41

Slide 41 text

“Thinking about building system resilience using a single discipline is insufficient. We need different strategies” ~ Borrill

Slide 42

Slide 42 text

Strategies to build resilience
Areas: Classical engineering, Reactive Operations, Unknown-Unknown
Strategies: Code standards, Programming patterns, Full system testing, Metrics & monitoring, Convergence to good state, Hazard inventories, Redundancies, Feature flags, Dark deploys, Runbooks & docs, Canaries, System verification, Formal methods, Fault injection

Slide 43

Slide 43 text

Strategies to build resilience
Areas: Classical engineering, Reactive Operations, Unknown-Unknown
Strategies: System verification, Formal methods, Fault injection, Code standards, Programming patterns, Full system testing, Metrics & monitoring, Convergence to good state, Hazard inventories, Redundancies, Feature flags, Dark deploys, Runbooks & docs, Canaries

Slide 44

Slide 44 text

Resilience in Industry

Slide 45

Slide 45 text


Slide 46

Slide 46 text

Key insights from Chubby
Library vs service? Service and client library control + storage of small data files with restricted operations
Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems

Slide 47

Slide 47 text

Key insights from Chubby
Centralized services are hard to construct but you can dedicate effort into architecting them well and making them failure-tolerant
Restricting user behavior increased resilience
Consumers of your service are part of your UNK-UNK scenarios

Slide 48

Slide 48 text

ImageOpto insights
Dependencies are hard: customer setup, customer inputs, caching layer, & libraries. We have to be resilient to all of them
Unk-Unks also lie in hidden dependencies (reduce as many of them as possible)

Slide 49

Slide 49 text


Slide 50

Slide 50 text

“Ship something out earlier with a limited API. Continuously invest in design of functionality and operability” ~ Me today

Slide 51

Slide 51 text

In design
What compromises does your system make as things go bad?
Resilient systems are designed for high yield & variable harvest

Slide 52

Slide 52 text

Operations matter
Unawareness of proximity to error boundary means we are always guessing
Complex operations make systems less resilient & more incident-prone
You design operability too!

Slide 53

Slide 53 text

Not all complexity is bad
Adding resilience may come at the cost of other desired goals (e.g. time, performance, simplicity, cost)
Redundancies help

Slide 54

Slide 54 text

tl;dr
IN DESIGN: Are we favoring harvest or yield? Are we resilient to our dependencies? Use orthogonality & decomposition. Theory matters!
OPERABILITY: Am I providing enough control to my operators? Operators impact resilience. Narrowing your API helps
UNK-UNK: The existence of this stresses diligence on the other two areas
The goal is to build failure domain independence

Slide 55

Slide 55 text

github.com/Randommood/YOW2016 ~ THANK YOU ~