Slide 1

Slide 1 text

The Verification of a Distributed System: A Practitioner’s Guide to Increasing Confidence in System Correctness

Slide 2

Slide 2 text

Distributed Systems Engineer Caitie McCaffrey caitiem.com @caitie

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

LESLIE LAMPORT “A Distributed System is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

Slide 5

Slide 5 text

Service Service Service We Are All Building Distributed Systems

Slide 6

Slide 6 text

Twitter Services

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Overview
• Formal Verification: Provably Correct Systems
• Testing in the Wild: Increase Confidence in System Correctness
• Research: A New Hope

Slide 9

Slide 9 text

References

Slide 10

Slide 10 text

Formal Verification: Provably Correct

Slide 11

Slide 11 text

Formal Specifications: Written description of what a system is supposed to do (TLA+, Coq)

Slide 12

Slide 12 text

Hour Clock Specification (TLA+)
---------------------- MODULE HourClock ----------------------
EXTENDS Naturals
VARIABLE hr
HCini == hr \in (1 .. 12)
HCnxt == hr' = IF hr # 12 THEN hr + 1 ELSE 1
HC == HCini /\ [][HCnxt]_hr
--------------------------------------------------------------
THEOREM HC => []HCini
==============================================================
Leslie Lamport, Specifying Systems
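
The theorem in the spec says that every behavior allowed by HC keeps hr within 1..12. As a rough illustration of what a model checker does with such a spec, the sketch below enumerates the reachable states in Python and checks the invariant on each one. It is not TLC and not part of the talk; the function names are made up for this example.

```python
# Explicit-state exploration of the HourClock spec (illustrative sketch only).

def initial_states():
    # HCini: hr \in (1 .. 12)
    return set(range(1, 13))

def next_state(hr):
    # HCnxt: hr' = IF hr # 12 THEN hr + 1 ELSE 1
    return hr + 1 if hr != 12 else 1

def invariant(hr):
    # []HCini: hr must always be a value in 1..12
    return 1 <= hr <= 12

def reachable_states():
    """Collect every state reachable from the initial states via HCnxt.

    Stuttering steps (allowed by [][HCnxt]_hr) leave hr unchanged,
    so they add no new states and are not modeled separately.
    """
    seen = set()
    frontier = list(initial_states())
    while frontier:
        hr = frontier.pop()
        if hr in seen:
            continue
        seen.add(hr)
        frontier.append(next_state(hr))
    return seen

states = reachable_states()
violations = [hr for hr in states if not invariant(hr)]
print("reachable:", sorted(states), "violations:", violations)
```

For a state space this small, exhaustive enumeration is trivial; the value of a tool like TLC is doing the same kind of search on specs with far more states.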

Slide 13

Slide 13 text

Use of Formal Methods at Amazon Web Services TLA+

Slide 14

Slide 14 text

Use of Formal Methods at Amazon Web Services (TLA+)
“Formal Methods Have Been a Big Success”
• S3 & 10+ Core Pieces of Infrastructure Verified
• 2 Serious Bugs Found
• Increased Confidence to make Optimizations

Slide 15

Slide 15 text

Leslie Lamport, Specifying Systems “It’s a good idea to understand a system before building it, so it’s a good idea to write a specification of a system before implementing it” TLA+

Slide 16

Slide 16 text

“Formal methods deal with models of systems, not the systems themselves” Use of Formal Methods at Amazon Web Services

Slide 17

Slide 17 text

Program Extraction

Slide 18

Slide 18 text

POPL 2016 “Our Verified Implementation is extracted to OCaml & runs on real networks” Program Extraction COQ

Slide 19

Slide 19 text

POPL 2016 “We have developed & checked our framework in Coq, extracted it to OCaml, and built executable stores” Program Extraction COQ

Slide 20

Slide 20 text

Distributed Systems Testing in the Wild “Seems Pretty Legit”

Slide 21

Slide 21 text

Unit Tests Testing of Individual Software Components or Modules

Slide 22

Slide 22 text

Simple Testing Can Prevent Most Critical Failures

Slide 23

Slide 23 text

77% of Production failures can be reproduced by a Unit Test Simple Testing can Prevent Most Critical Failures

Slide 24

Slide 24 text

Testing error handling code could have prevented 58% of catastrophic failures Simple Testing can Prevent Most Critical Failures

Slide 25

Slide 25 text

35% of Catastrophic Failures
• Error Handling Code is simply empty or only contains a Log statement
• Error Handler aborts cluster on an overly general exception
• Error Handler contains comments like FIXME or TODO
Simple Testing can Prevent Most Critical Failures
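
A unit test that deliberately drives the error path is enough to catch the patterns above. The following is a minimal sketch in Python; `replicate`, `PeerUnavailable`, and the two peer classes are hypothetical names invented for this example, not code from the talk or the paper.

```python
class PeerUnavailable(Exception):
    """Hypothetical error raised when a peer cannot be reached."""

class FailingPeer:
    """Test double whose send() always fails."""
    def send(self, record):
        raise PeerUnavailable("peer down")

class HealthyPeer:
    """Test double that records what it was sent."""
    def __init__(self):
        self.received = []

    def send(self, record):
        self.received.append(record)

def replicate(record, peers):
    """Send record to every peer and return the peers that failed.

    The handler catches only the specific exception it knows how to
    handle and reports the failure to the caller instead of swallowing it.
    """
    failed = []
    for peer in peers:
        try:
            peer.send(record)
        except PeerUnavailable:
            failed.append(peer)
    return failed

def test_replicate_reports_failed_peers():
    healthy, failing = HealthyPeer(), FailingPeer()
    failed = replicate("r1", [healthy, failing])
    assert failed == [failing]          # the failure is surfaced, not ignored
    assert healthy.received == ["r1"]   # healthy peers are still served

test_replicate_reports_failed_peers()
```

The test double makes the failure deterministic, so the error handler is exercised on every run rather than only when production misbehaves.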

Slide 26

Slide 26 text

Scala Types Are Not Testing A Short Counter Example

Slide 27

Slide 27 text

TCP Doesn’t Care About Your Type System

Slide 28

Slide 28 text

Integration Tests Testing of integrated modules to verify combined functionality

Slide 29

Slide 29 text

Three nodes or fewer can reproduce 98% of failures Simple Testing can Prevent Most Critical Failures

Slide 30

Slide 30 text

Property Based Testing

Slide 31

Slide 31 text

QuickCheck: Haskell & Erlang
ScalaCheck: Scala & Java
Languages with QuickCheck Ports: C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl, Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, Standard ML, Swift

Slide 32

Slide 32 text

ScalaCheck Examples
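
A property-based test states a property that should hold for all inputs and checks it against many generated ones. The sketch below is hand-rolled Python using only the standard library, not ScalaCheck or QuickCheck; the grow-only counter `merge` and the helper names are assumptions chosen for illustration.

```python
import random

def merge(a, b):
    """Merge two grow-only counters (node id -> count) by per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def buggy_merge(a, b):
    """Deliberately wrong merge (adds counts) used to show a failing property."""
    return {n: a.get(n, 0) + b.get(n, 0) for n in a.keys() | b.keys()}

def random_counter(rng):
    """Generate a random counter over up to three nodes."""
    return {n: rng.randint(0, 100) for n in ("n1", "n2", "n3") if rng.random() < 0.7}

def check_property(prop, trials=500, seed=0):
    """Run prop on random inputs; return the first counterexample, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c = (random_counter(rng) for _ in range(3))
        if not prop(a, b, c):
            return (a, b, c)
    return None

def commutative(a, b, c):
    return merge(a, b) == merge(b, a)

def associative(a, b, c):
    return merge(merge(a, b), c) == merge(a, merge(b, c))

def idempotent(a, b, c):
    return merge(a, a) == a

def buggy_idempotent(a, b, c):
    return buggy_merge(a, a) == a

for prop in (commutative, associative, idempotent, buggy_idempotent):
    print(prop.__name__, "counterexample:", check_property(prop))
```

Libraries in the QuickCheck family additionally shrink a failing input to a smaller counterexample, which this sketch does not attempt.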

Slide 33

Slide 33 text

Fault Injection Introducing faults into the system under test

Slide 34

Slide 34 text

“Without explicitly forcing a system to fail, it is unreasonable to have any confidence it will operate correctly in failure modes” -The Verification of a Distributed System
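
One lightweight way to force failures in a test is to wrap a dependency so that some calls fail on purpose. The sketch below assumes a hypothetical retrying client (`call_with_retries`) and a made-up `FaultInjector` wrapper; it illustrates the idea only and does not show how any particular fault-injection tool works.

```python
import random

class InjectedFault(Exception):
    """Raised by the injector in place of a real network error."""

class FaultInjector:
    """Wraps a callable and makes a configurable fraction of calls fail."""
    def __init__(self, target, failure_rate, seed=0):
        self.target = target
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)
        self.injected = 0  # number of faults injected so far

    def __call__(self, *args):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFault("injected failure")
        return self.target(*args)

def call_with_retries(fn, arg, attempts=3):
    """Hypothetical client: retry on failure, re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn(arg)
        except InjectedFault:
            if attempt == attempts - 1:
                raise

def lookup(key):
    """Stand-in for a remote call."""
    return key.upper()

never_failing = FaultInjector(lookup, failure_rate=0.0)
always_failing = FaultInjector(lookup, failure_rate=1.0)

print(call_with_retries(never_failing, "a"))
try:
    call_with_retries(always_failing, "a")
except InjectedFault:
    print("gave up after", always_failing.injected, "injected faults")
```

Intermediate failure rates with a fixed seed give reproducible runs in which some calls fail and others succeed, which is useful for exercising retry and timeout logic.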

Slide 35

Slide 35 text

Netflix Simian Army
• Chaos Monkey: kills instances
• Latency Monkey: induces artificial latency
• Chaos Gorilla: simulates outage of an entire availability zone

Slide 36

Slide 36 text

JEPSEN
Fault Injection Tool that simulates network partitions in the system under test
Kyle has used this tool to show us that many of the Distributed Systems we know seem stable but are really just this. (cut to tire fire photo)
credit: @aphyr

Slide 38

Slide 38 text

CAUTION: Passing Tests Does Not Ensure Correctness

Slide 39

Slide 39 text

GAME DAYS: Breaking your services on purpose
Resilience Engineering: Learning to Embrace Failure

Slide 40

Slide 40 text

How to Run a GameDay
1. Notify Engineering Teams that Failure is Coming
2. Induce Failures
3. Monitor Systems Under Test
4. Observing Only Team Monitors Recovery Processes & Systems, Files Bugs
5. Prioritize Bugs & Get Buy-In Across Teams
Resilience Engineering: Learning to Embrace Failure

Slide 41

Slide 41 text

Game Day at Stripe “During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster” Game Day Exercises at Stripe: Learning from `kill -9`

Slide 42

Slide 42 text

Some thoughts on TESTING IN PRODUCTION

Slide 43

Slide 43 text

Monitoring is not Testing

Slide 44

Slide 44 text

CANARIES “Verification” in production
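
A canary check typically compares a small slice of production traffic on the new build against the existing fleet before rolling out further. The sketch below is one simple way to express that comparison in Python; the function names and all thresholds (`max_ratio`, `tolerance`, `min_requests`) are assumptions for illustration, not values from the talk.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0

def canary_is_healthy(baseline, canary, max_ratio=1.5, tolerance=0.001,
                      min_requests=100):
    """Decide whether the canary looks no worse than the baseline.

    baseline and canary are (errors, requests) tuples. The canary passes
    only if it has served enough traffic to judge and its error rate stays
    within max_ratio times the baseline rate plus a small tolerance.
    """
    base_errors, base_requests = baseline
    canary_errors, canary_requests = canary
    if canary_requests < min_requests:
        return False  # not enough data: do not promote yet
    allowed = error_rate(base_errors, base_requests) * max_ratio + tolerance
    return error_rate(canary_errors, canary_requests) <= allowed

baseline = (10, 10_000)                        # 0.1% errors on the current build
print(canary_is_healthy(baseline, (1, 1_000)))   # similar error rate
print(canary_is_healthy(baseline, (50, 1_000)))  # much higher error rate
print(canary_is_healthy(baseline, (0, 10)))      # too little traffic to judge
```

A check like this only observes the failures that real traffic happens to trigger, which is why the slide puts “Verification” in quotes.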

Slide 45

Slide 45 text

Verification in the Wild
• Unit & Integration Tests
• Property Based Testing
• Fault Injection
• Canaries

Slide 46

Slide 46 text

Research: Improving the Verification of Distributed Systems
• Lineage Driven Fault Injection
• ‘Cause I’m Strong Enough: Reasoning about Consistency Choices in Distributed Systems
• IronFleet: Proving Practical Distributed Systems Correct

Slide 47

Slide 47 text

‘Cause I’m Strong Enough POPL 2016

Slide 48

Slide 48 text

Bank Application
Bank Account must be > 0
Deposit Money
Withdraw Money
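
The bank example shows why some operations need coordination while others do not. In the Python sketch below, two replicas each approve a withdrawal against the same stale balance, and the combined effect breaks the invariant; serializing the withdrawals preserves it. The function names are hypothetical, and the sketch reads the slide's rule as "the balance must never drop below zero", which is an assumption.

```python
def withdraw_allowed(balance, amount):
    """Local precondition: the withdrawal must not overdraw the account."""
    return balance - amount >= 0

def run_without_coordination(initial, amounts):
    """Each replica checks its withdrawal against the same stale balance;
    every locally accepted withdrawal is then applied at all replicas."""
    accepted = [a for a in amounts if withdraw_allowed(initial, a)]
    return initial - sum(accepted)

def run_with_coordination(initial, amounts):
    """Withdrawals are serialized, so each check sees earlier effects."""
    balance = initial
    for amount in amounts:
        if withdraw_allowed(balance, amount):
            balance -= amount
    return balance

def run_deposits(initial, amounts):
    """Deposits only increase the balance, so they are safe in any order."""
    return initial + sum(amounts)

print(run_without_coordination(100, [80, 80]))  # both approved: invariant broken
print(run_with_coordination(100, [80, 80]))     # second withdrawal rejected
print(run_deposits(100, [80, 80]))              # no coordination needed
```

Deposits can run concurrently without coordination, while concurrent withdrawals cannot, so only the withdrawal path needs the stronger consistency.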

Slide 49

Slide 49 text

‘Cause I’m Strong Enough: Reasoning About Consistency Choices in Distributed Systems

Slide 50

Slide 50 text

Conclusion
• Use Formal Verification on Critical Components
• Unit Tests & Integration Tests find a multitude of Errors
• Increase Confidence via Property Testing & Fault Injection

Slide 51

Slide 51 text

Camille Fournier “Enjoy the ride, have fun, and test your freaking code”

Slide 52

Slide 52 text

Thank You: Peter Alvaro, Kyle Kingsbury, Christopher Meiklejohn, Alex Rasmussen, Ines Sombra, Nathan Taylor, Alvaro Videla

Slide 53

Slide 53 text

Questions? @caitie
Resources: https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem