Agile, Rugged, and Lean - The Paper Edition

Slide 1

Slide 1 text

Keynote ✨ ✨ ✨ ✨

Slide 2

Slide 2 text

The Paper Edition! Agile, Lean, Rugged

Slide 3

Slide 3 text

First, introductions

Slide 4

Slide 4 text

@Randommood Ines Sombra

Slide 5

Slide 5 text

@adriancolyer Adrian Colyer

Slide 6

Slide 6 text

The Rules: Only 5 minutes per paper. Foundation & Frontier. A challenge! No cheating!

Slide 7

Slide 7 text

A paper tour of Agile

Slide 8

Slide 8 text

Foundation

Slide 9

Slide 9 text

We disdain old software

Slide 10

Slide 10 text

“The only systems that don’t get changed are those that are so bad nobody wants to use them”

Slide 11

Slide 11 text

When software gets older

Slide 12

Slide 12 text

Preventative medicine: design for change; embrace modularity & information hiding; stress clarity & documentation; amputate disease-ridden parts; plan for eventual replacement.

Slide 13

Slide 13 text

Frontier

Slide 14

Slide 14 text

What do we want? We want agile development, testing and verification, and delivery, and we want agility of operations too!

Slide 15

Slide 15 text

Facebook Scuba: data lives in the server’s heap

Slide 16

Slide 16 text

The problem with state: Restarting a database clears its memory. Reading 120GB of data from disk takes about 3 hours per server (8 per machine). Even with orchestrated restarts & partial queries, it takes a total of ~12 hours to restart a fleet. Operationally expensive & slow!

Slide 17

Slide 17 text

“When we shutdown a server for a planned upgrade, we know that the memory state is good… so we decided to decouple the memory’s lifetime from the process’s lifetime”
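The idea in the quote above can be sketched with Python's standard `multiprocessing.shared_memory` module. This is only an illustration of keeping state in a named segment that outlives the object that wrote it, not the mechanism Scuba itself uses. The function names and the `restart_demo_` naming scheme are hypothetical, and both functions run in one process here, standing in for the old and the new server process. It assumes a POSIX system; across real process exits, Python's resource tracker may clean up the segment, so a production design would need more care.

```python
import os
from multiprocessing import shared_memory


def segment_name() -> str:
    """Hypothetical naming scheme: one named segment per demo run."""
    return f"restart_demo_{os.getpid()}"


def save_before_shutdown(name: str, state: bytes) -> int:
    """Copy in-memory state into a named shared-memory segment.

    close() only detaches this handle; the segment itself is not
    destroyed, so its lifetime is no longer tied to this handle.
    """
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(state))
    shm.buf[:len(state)] = state
    shm.close()
    return len(state)


def restore_after_start(name: str, length: int) -> bytes:
    """Attach to the existing segment, copy the state back, then free it."""
    shm = shared_memory.SharedMemory(name=name)
    state = bytes(shm.buf[:length])
    shm.close()
    shm.unlink()  # release the segment once the state is back on the heap
    return state
```

Copying from memory to memory avoids re-reading the data from disk, which is where the slow restart described earlier spends its time.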

Slide 18

Slide 18 text

2-3 minutes per server. Fleet restarts take < 1 hour now!

Slide 19

Slide 19 text

A paper tour of Lean

Slide 20

Slide 20 text

Foundation

Slide 21

Slide 21 text

Which system is better?

Slide 22

Slide 22 text

Which system is better?

Slide 23

Slide 23 text

Single-minded pursuit of scalability is the wrong goal

Slide 24

Slide 24 text

Common wisdom: effective scaling is evidence of solid system building. Why does this happen? McSherry et al.: any system can scale arbitrarily well with a sufficient lack of care in its implementation.

Slide 25

Slide 25 text

COST: Configuration that Outperforms a Single Thread. The COST of a system is the hardware platform (number of cores) required before the platform outperforms a competent single-threaded implementation.
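Read as a procedure, the COST definition on this slide amounts to finding the smallest core count at which the system beats the single-threaded baseline. A minimal sketch follows, assuming runtimes have already been measured; the function name, argument names, and the numbers in the usage note are hypothetical, not figures from the paper.

```python
from typing import Dict, Optional


def cost(single_thread_seconds: float,
         runtime_by_cores: Dict[int, float]) -> Optional[int]:
    """Return the smallest measured core count that beats the baseline.

    runtime_by_cores maps a core count to the system's measured runtime
    at that scale. None means no measured configuration outperforms the
    single-threaded implementation.
    """
    for cores in sorted(runtime_by_cores):
        if runtime_by_cores[cores] < single_thread_seconds:
            return cores
    return None
```

For example, with a hypothetical 300-second single-threaded baseline and measured runtimes of `{1: 1200.0, 16: 500.0, 64: 350.0, 128: 250.0}`, the function returns 128: the system scales, but only pays off at 128 cores.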

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

“If you’re building a system, make sure it’s better than your laptop. If you’re using a system, make sure it’s better than your laptop” McSherry

Slide 28

Slide 28 text

Frontier

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Sampling works!

Slide 32

Slide 32 text

Error bounds & confidence
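One way to picture sampling with error bounds: answer an aggregate query from a random sample and report a margin alongside the estimate. The sketch below uses a normal-approximation confidence interval for a mean, which is an assumption on my part since the slides do not name an estimator; the function name and the `z` and `seed` parameters are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple


def approximate_mean(values: Sequence[float], sample_size: int,
                     z: float = 1.96, seed: int = 0) -> Tuple[float, float]:
    """Estimate the mean from a random sample.

    Returns (estimate, margin). With z = 1.96 the interval
    estimate +/- margin is an approximate 95% confidence interval.
    Requires sample_size >= 2.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    margin = z * math.sqrt(variance / sample_size)
    return mean, margin
```

The margin shrinks with the square root of the sample size, so a small fraction of the data is often enough when an approximate answer will do.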

Slide 33

Slide 33 text

Don’t ask wasteful questions

Slide 34

Slide 34 text

A paper tour of Rugged

Slide 35

Slide 35 text

Foundation

Slide 36

Slide 36 text

Ruggedness as availability: strategies to enhance ruggedness in the presence of failures; a better way to think about system availability.

Slide 37

Slide 37 text

Harvest: fraction of the complete result. Yield: fraction of answered queries.

Slide 38

Slide 38 text

Yield as response ruggedness: Yield (% of requests answered successfully) is close to uptime but more useful because it directly maps to user experience. Failures during high & low traffic generate different yields; uptime misses this. Focus on yield rather than uptime.
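A small worked example makes the difference between uptime and yield concrete. This is a sketch with hypothetical traffic numbers: the same one-hour outage gives identical uptime but very different yield depending on when it happens.

```python
from typing import List


def uptime(hours_up: List[bool]) -> float:
    """Fraction of time the service was up (time-based)."""
    return sum(hours_up) / len(hours_up)


def request_yield(requests_per_hour: List[int], hours_up: List[bool]) -> float:
    """Fraction of requests answered (request-based).

    Simplification: every request in an 'up' hour is answered and
    every request in a 'down' hour is lost.
    """
    answered = sum(r for r, up in zip(requests_per_hour, hours_up) if up)
    return answered / sum(requests_per_hour)


# Hypothetical traffic: three quiet hours and one peak hour.
traffic = [100, 100, 100, 9700]
quiet_outage = [False, True, True, True]  # down during a quiet hour
peak_outage = [True, True, True, False]   # down during the peak hour
```

Both outages give an uptime of 0.75, but the quiet-hour outage yields 0.99 while the peak-hour outage yields 0.03.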

Slide 39

Slide 39 text

Harvest as quality of response. From Coda Hale’s “You can’t sacrifice partition tolerance”. [Figure: Server A, Server B, Server C; search terms Baby, Animals, Cute]

Slide 40

Slide 40 text

Harvest as quality of response. From Coda Hale’s “You can’t sacrifice partition tolerance”. [Figure: Server A, Server B, Server C; search terms Baby, Animals, Cute; one server unavailable (X): 66% harvest]
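The 66% figure on this slide follows from treating harvest as the fraction of the data that is reachable when answering a query. A minimal sketch, assuming each of the three servers holds an equal share of the data (the slide does not state this explicitly); the function name is hypothetical.

```python
from typing import Dict


def harvest(server_up: Dict[str, bool]) -> float:
    """Fraction of the complete result that can be returned.

    Assumes every server holds an equal share of the data.
    """
    return sum(server_up.values()) / len(server_up)


# One of three servers is unreachable; the query is still answered,
# but from two thirds of the data.
degraded = harvest({"Server A": True, "Server B": True, "Server C": False})
```

Here `degraded` is two thirds, which the slide reports as 66%. The query still counts toward yield because it was answered; only the completeness of the answer drops.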

Slide 41

Slide 41 text

#1: Probabilistic Availability. Graceful harvest degradation under faults; randomness to make the worst-case & average-case the same; replication of high-priority data for greater harvest control; degrading results based on client capability.

Slide 42

Slide 42 text

#2: Decomposition & Orthogonality. Decompose into subsystems that are independently intolerant to harvest degradation (they fail by reducing yield), but the app can continue if they fail. Only provide strong consistency for the subsystems that need it. Orthogonal mechanisms (state vs functionality).

Slide 43

Slide 43 text

Frontier

Slide 44

Slide 44 text

Ruggedness via verification
Formal Methods: human-assisted proofs, safety critical (TLA+, Coq, Isabelle); model checking, properties + transitions (SPIN, TLA+); lightweight FM, best of both worlds (Alloy, SAT).
Testing: top-down, fault injectors and input generators; bottom-up, lineage-driven fault injectors; white / black box, what we know (or not) about the system.

Slide 45

Slide 45 text

MOLLY: Lineage-Driven Fault Injection. Reasons backwards from correct system outcomes & determines if a failure could have prevented them. MOLLY only injects the failures it can prove might affect an outcome.
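The backwards reasoning can be illustrated with a toy model; this is not MOLLY's actual implementation, which is considerably more involved. Assume the lineage of a good outcome is given as a list of "supports", each a set of events (for example message deliveries) that together were sufficient to produce it. A combination of faults is only worth injecting if it breaks every support. All names below are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def fault_sets_worth_injecting(supports: List[Set[str]],
                               max_faults: int) -> List[Set[str]]:
    """Return minimal sets of events whose failure breaks every support.

    A fault set that leaves at least one support intact cannot prevent
    the outcome, so it is never proposed for injection.
    """
    events = sorted(set().union(*supports))
    found: List[Set[str]] = []
    for size in range(1, max_faults + 1):
        for combo in combinations(events, size):
            chosen = set(combo)
            if any(smaller <= chosen for smaller in found):
                continue  # a smaller fault set already covers this one
            if all(chosen & support for support in supports):
                found.append(chosen)
    return found


# Hypothetical lineage: C receives the data either via B or directly from A.
lineage = [{"A->B", "B->C"}, {"A->C"}]
```

With this lineage no single fault can prevent the outcome, so nothing is proposed for `max_faults=1`. With `max_faults=2` the candidates are `{"A->B", "A->C"}` and `{"A->C", "B->C"}`, while `{"A->B", "B->C"}` is skipped because the direct delivery would still succeed.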

Slide 46

Slide 46 text

Ruggedness with MOLLY. “Without explicitly forcing a system to fail, you have no confidence that it will operate correctly in failure modes” (Caitie McCaffrey’s pearls of wisdom). [Figure labels: Verifier, Programmer]

Slide 47

Slide 47 text

MOLLY helps us understand failure

Slide 48

Slide 48 text

“Presents a middle ground between pragmatism and formalism, dictated by the importance of verifying fault tolerance in spite of the complexity of the space of faults”

Slide 49

Slide 49 text

Now let’s wrap things up

Slide 50

Slide 50 text

tl;dr - Foundations
Agile: Designing for change is designing for success.
Lean: A scalable system may not be a lean system. Pursuing scalability out of context can be COSTly.
Rugged: Think about availability in terms of yield and harvest. Graceful degradation is a design outcome.

Slide 51

Slide 51 text

tl;dr - Frontiers
Agile: State can be challenging. Saving state in shared memory allows us to restart DB processes faster.
Lean: Asking the wrong question is wasteful. Think about what is truly needed. Use approximations.
Rugged: Reasoning backwards from correct system output helps us determine the execution failures that prevent it from happening.

Slide 52

Slide 52 text

Join your local PWL and read The Morning Paper! github.com/Randommood/GotoLondon2015. Papers are a lot of fun!

Slide 53

Slide 53 text

✨ ✨ DRANKS!