Slide 1

Slide 1 text

The Verification of a Distributed System: A Practitioner’s Guide to Increasing Confidence in System Correctness

Slide 2

Slide 2 text

Distributed Systems Engineer Caitie McCaffrey caitiem.com @caitie

Slide 3

Slide 3 text

Increase Confidence by Exploring More of the Input State Space All Inputs

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

LESLIE LAMPORT “A Distributed System is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”

Slide 6

Slide 6 text

Service Service Service We Are All Building Distributed Systems

Slide 7

Slide 7 text

Twitter Services

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Overview
• Formal Verification: Provably Correct Systems
• Testing in the Wild: Increase Confidence in System Correctness
• Research: A New Hope

Slide 10

Slide 10 text

References

Slide 11

Slide 11 text

Provably Correct Formal Verification

Slide 12

Slide 12 text

Formal Specifications Written description of what a system is supposed to do TLA+ Coq

Slide 13

Slide 13 text

All Inputs State Space Explored by Formal Verification

Slide 14

Slide 14 text

All Inputs Formal Verification State Space Explored by Formal Verification

Slide 15

Slide 15 text

Hour Clock Specification
---------------------- MODULE HourClock ----------------------
EXTENDS Naturals
VARIABLE hr
HCini == hr \in (1 .. 12)
HCnxt == hr' = IF hr # 12 THEN hr + 1 ELSE 1
HC == HCini /\ [][HCnxt]_hr
--------------------------------------------------------------
THEOREM HC => []HCini
==============================================================
Leslie Lamport, Specifying Systems TLA+
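The theorem in the HourClock spec says that every reachable state still satisfies the initial predicate. As a rough illustration of what a model checker does with such a spec, here is a minimal Python sketch that enumerates the reachable states by brute force. The function names (`hc_ini`, `hc_nxt`, `reachable_states`, `check_invariant`) are hypothetical transcriptions of the TLA+ definitions, not part of any TLA+ tooling.

```python
# Hypothetical Python transcription of the HourClock spec.
# This is a brute-force sketch, not how the TLA+ tools are implemented.

def hc_ini(hr):
    """HCini: hr is an integer hour between 1 and 12."""
    return hr in range(1, 13)


def hc_nxt(hr):
    """HCnxt: the next value of hr (12 wraps around to 1)."""
    return hr + 1 if hr != 12 else 1


def reachable_states():
    """Explore every state reachable from any state satisfying HCini."""
    frontier = list(range(1, 13))  # all initial states
    seen = set()
    while frontier:
        hr = frontier.pop()
        if hr in seen:
            continue
        seen.add(hr)
        # Stuttering steps (hr unchanged) add no new states, so only
        # the HCnxt successor needs to be explored.
        frontier.append(hc_nxt(hr))
    return seen


def check_invariant():
    """THEOREM HC => []HCini, checked over all reachable states."""
    return all(hc_ini(hr) for hr in reachable_states())
```

Calling `check_invariant()` explores all twelve states. Real systems have state spaces far too large to enumerate this naively, which is where dedicated model checkers earn their keep.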

Slide 16

Slide 16 text

Use of Formal Methods at Amazon Web Services TLA+

Slide 17

Slide 17 text

“Formal Methods Have Been a Big Success”
• S3 & 10+ Core Pieces of Infrastructure Verified
• 2 Serious Bugs Found
• Increased Confidence to make Optimizations
Use of Formal Methods at Amazon Web Services TLA+

Slide 18

Slide 18 text

Leslie Lamport, Specifying Systems “It’s a good idea to understand a system before building it, so it’s a good idea to write a specification of a system before implementing it” TLA+

Slide 19

Slide 19 text

“Formal methods deal with models of systems, not the systems themselves” Use of Formal Methods at Amazon Web Services

Slide 20

Slide 20 text

Program Extraction

Slide 21

Slide 21 text

POPL 2016 “Our Verified Implementation is extracted to OCaml & runs on real networks” Program Extraction COQ

Slide 22

Slide 22 text

POPL 2016 “We have developed & checked our framework in Coq, extracted it to OCaml, and built executable stores” Program Extraction COQ

Slide 23

Slide 23 text

Distributed Systems Testing in the Wild “Seems Pretty Legit”

Slide 24

Slide 24 text

Unit Tests Testing of Individual Software Components or Modules

Slide 25

Slide 25 text

Simple Testing Can Prevent Most Critical Failures

Slide 26

Slide 26 text

77% of Production failures can be reproduced by a Unit Test Simple Testing can Prevent Most Critical Failures

Slide 27

Slide 27 text

Testing error handling code could have prevented 58% of catastrophic failures Simple Testing can Prevent Most Critical Failures

Slide 28

Slide 28 text

35% of Catastrophic Failures
• Error Handling Code is simply empty or only contains a Log statement
• Error Handler aborts cluster on an overly general exception
• Error Handler contains comments like FIXME or TODO
Simple Testing can Prevent Most Critical Failures
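The error-handling findings above suggest a cheap habit: write unit tests that force the failure path to run. The sketch below is a minimal Python illustration. `ReplicaUnavailable` and `read_with_fallback` are hypothetical names invented for this example. The first test forces the handler to execute, and the second checks that the handler is not overly general.

```python
import unittest


class ReplicaUnavailable(Exception):
    """Hypothetical error raised when a replica cannot be reached."""


def read_with_fallback(primary, fallback):
    """Read from primary; on the one expected error, use the fallback."""
    try:
        return primary()
    except ReplicaUnavailable:
        # Handle only the error we expect. Anything else propagates,
        # rather than being swallowed by an overly general handler.
        return fallback()


class ReadWithFallbackTest(unittest.TestCase):
    def test_falls_back_when_primary_unavailable(self):
        def failing_primary():
            raise ReplicaUnavailable("primary down")

        value = read_with_fallback(failing_primary, lambda: "stale")
        self.assertEqual(value, "stale")

    def test_unexpected_errors_are_not_swallowed(self):
        def buggy_primary():
            raise ValueError("bug")

        with self.assertRaises(ValueError):
            read_with_fallback(buggy_primary, lambda: "stale")
```

Saved as a module, these tests can be run with `python -m unittest`.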

Slide 29

Slide 29 text

Scala Types Are Not Testing A Short Counter Example

Slide 30

Slide 30 text

TCP Doesn’t Care About Your Type System

Slide 31

Slide 31 text

Integration Tests Testing of integrated modules to verify combined functionality

Slide 32

Slide 32 text

Three nodes or fewer can reproduce 98% of failures Simple Testing can Prevent Most Critical Failures

Slide 33

Slide 33 text

Property Based Testing

Slide 34

Slide 34 text

QuickCheck: Haskell & Erlang
ScalaCheck: Scala & Java
Languages with QuickCheck Ports: C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl, Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, Standard ML, Swift

Slide 35

Slide 35 text

ScalaCheck Examples
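The slide's own ScalaCheck examples are not preserved in this text. As a stand-in, here is a minimal sketch of the same idea in Python: state a property that should hold for all inputs, then check it against many randomly generated inputs. The `for_all` helper is hand-rolled for illustration and is not the API of ScalaCheck, QuickCheck, or any Python library. The run-length codec is a hypothetical system under test.

```python
import random


def encode(values):
    """Hypothetical run-length encoder: [1, 1, 2] -> [(1, 2), (2, 1)]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def decode(runs):
    """Inverse of encode: expand each (value, count) pair."""
    return [v for v, n in runs for _ in range(n)]


def for_all(generator, prop, trials=200, seed=0):
    """Check prop on random inputs; return a counterexample or None."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        value = generator(rng)
        if not prop(value):
            return value
    return None


def small_int_lists(rng):
    """Generator: short lists over a tiny alphabet, so runs are common."""
    return [rng.randint(0, 3) for _ in range(rng.randint(0, 20))]


def round_trips(values):
    """Property: decoding an encoded list gives back the original."""
    return decode(encode(values)) == values
```

`for_all(small_int_lists, round_trips)` returns `None` when no counterexample is found. Real property-testing libraries add input shrinking and richer generators on top of this basic loop.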

Slide 36

Slide 36 text

Fault Injection Introducing faults into the system under test

Slide 37

Slide 37 text

-The Verification of a Distributed System “Without explicitly forcing a system to fail, it is unreasonable to have any confidence it will operate correctly in failure modes”
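To make the idea of explicitly forcing failures concrete, here is a minimal in-process sketch in Python. `FaultInjector`, `InjectedFault`, and `call_with_retries` are hypothetical names for this example. Real fault injection tools operate on instances, networks, or whole zones rather than on a single function call.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector instead of calling the real dependency."""


class FaultInjector:
    """Wraps a callable and makes a configurable fraction of calls fail."""

    def __init__(self, target, failure_rate, seed=0):
        self.target = target
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so runs are repeatable
        self.injected = 0

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFault("injected failure")
        return self.target(*args, **kwargs)


def call_with_retries(operation, attempts):
    """Code under test: retry an operation a bounded number of times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except InjectedFault as error:
            last_error = error
    raise last_error
```

Wrapping a dependency in `FaultInjector` lets a test exercise both the retry path (a moderate `failure_rate`) and the give-up path (`failure_rate=1.0`) on demand.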

Slide 38

Slide 38 text

Netflix Simian Army
• Chaos Monkey: kills instances
• Latency Monkey: artificial latency induced
• Chaos Gorilla: simulates outage of entire availability zone

Slide 39

Slide 39 text

JEPSEN: Fault Injection Tool that simulates network partitions in the system under test. Credit: @aphyr. Kyle has used this tool to show us that many of the Distributed Systems we know seem stable but are really just this. (cut to tire fire photo)

Slide 40

Slide 40 text

JEPSEN: Fault Injection Tool that simulates network partitions in the system under test. Credit: @aphyr. Kyle has used this tool to show us that many of the Distributed Systems we know seem stable but are really just this. (cut to tire fire photo)

Slide 41

Slide 41 text

CAUTION: Passing Tests Does Not Ensure Correctness

Slide 42

Slide 42 text

All Inputs State Space Explored Unit Tests

Slide 43

Slide 43 text

All Inputs State Space Explored Unit Tests Integration Tests

Slide 44

Slide 44 text

All Inputs State Space Explored Unit Tests Integration Tests Property Tests

Slide 45

Slide 45 text

All Inputs State Space Explored Unit Tests Integration Tests Property Tests Fault Injection Tests

Slide 46

Slide 46 text

GAME DAYS: Breaking your services on purpose. Resilience Engineering: Learning to Embrace Failure

Slide 47

Slide 47 text

How to Run a GameDay
1. Notify Engineering Teams that Failure is Coming
2. Induce Failures
3. Monitor Systems Under Test
4. Observing Only Team Monitors Recovery Processes & Systems, Files Bugs
5. Prioritize Bugs & Get Buy-In Across Teams
Resilience Engineering: Learning to Embrace Failure

Slide 48

Slide 48 text

Game Day at Stripe “During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster” Game Day Exercises at Stripe: Learning from `kill -9`

Slide 49

Slide 49 text

Some thoughts on TESTING IN PRODUCTION

Slide 50

Slide 50 text

Monitoring is not Testing

Slide 51

Slide 51 text

CANARIES “Verification” in production
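A canary release sends a small share of production traffic to a new build and compares its behavior against the existing fleet before rolling out further. The Python sketch below shows one very simple comparison. The `tolerance` threshold and the choice of error rate as the only signal are assumptions for illustration; real canary analysis usually looks at several metrics.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def canary_is_healthy(baseline, canary, tolerance=0.01):
    """Compare canary and baseline error rates.

    baseline and canary are (errors, requests) tuples. tolerance is a
    hypothetical allowed increase in error rate before rollout stops.
    """
    return error_rate(*canary) <= error_rate(*baseline) + tolerance
```

For example, `canary_is_healthy((10, 10000), (50, 1000))` rejects a canary whose error rate is 5% against a 0.1% baseline.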

Slide 52

Slide 52 text

Verifying System Complexity [Figure: Probability of failure vs. Rank; regions: Classical engineering, Reactive ops, unk-unk; labels: Catastrophic Failures] Architectural Patterns of Resilient Distributed Systems

Slide 53

Slide 53 text

Verifying System Complexity [Figure: Probability of failure vs. Rank; regions: Classical engineering, Reactive ops, unk-unk; labels: Unit & Integration Tests] Architectural Patterns of Resilient Distributed Systems

Slide 54

Slide 54 text

Verifying System Complexity [Figure: Probability of failure vs. Rank; regions: Classical engineering, Reactive ops, unk-unk; labels: Unit & Integration Tests, Property Based Testing] Architectural Patterns of Resilient Distributed Systems

Slide 55

Slide 55 text

Verifying System Complexity [Figure: Probability of failure vs. Rank; regions: Classical engineering, Reactive ops, unk-unk; labels: Unit & Integration Tests, Property Based Testing, Fault Injection Testing] Architectural Patterns of Resilient Distributed Systems

Slide 56

Slide 56 text

Verifying System Complexity [Figure: Probability of failure vs. Rank; regions: Classical engineering, Reactive ops, unk-unk; labels: Unit & Integration Tests, Property Based Testing, Fault Injection Testing, Canaries, Game Days, Monitoring] Architectural Patterns of Resilient Distributed Systems

Slide 57

Slide 57 text

Research: Improving the Verification of Distributed Systems
• Lineage Driven Fault Injection
• ‘Cause I’m Strong Enough: Reasoning about Consistency Choices in Distributed Systems
• IronFleet: Proving Practical Distributed Systems Correct

Slide 58

Slide 58 text

‘Cause I’m Strong Enough POPL 2016

Slide 59

Slide 59 text

Bank Application: Bank Account must be > 0. Operations: Deposit Money, Withdraw Money
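The bank example is a useful way to see why some operations need coordination. The Python sketch below is a toy illustration, not the analysis from the paper. `Replica` and `sync` are hypothetical names, and the replication scheme (exchanging pending deltas) is an assumption. Each replica checks the slide's invariant against its local balance only, so two withdrawals that each look safe can jointly break the invariant, while deposits never can.

```python
class Replica:
    """One replica of the account; it validates against local state only."""

    def __init__(self, balance):
        self.balance = balance
        self.pending = []  # deltas not yet shared with the other replica

    def deposit(self, amount):
        self.balance += amount
        self.pending.append(amount)

    def withdraw(self, amount):
        # Local check of the invariant from the slide (balance > 0).
        # It cannot see withdrawals accepted concurrently elsewhere.
        if self.balance - amount <= 0:
            return False
        self.balance -= amount
        self.pending.append(-amount)
        return True


def sync(a, b):
    """Exchange pending deltas so both replicas converge."""
    a_deltas, b_deltas = a.pending, b.pending
    a.pending, b.pending = [], []
    a.balance += sum(b_deltas)
    b.balance += sum(a_deltas)


def concurrent_withdrawals_demo():
    """Both withdrawals pass locally; the merged balance is negative."""
    a, b = Replica(100), Replica(100)
    accepted = a.withdraw(80) and b.withdraw(80)
    sync(a, b)
    return accepted, a.balance, b.balance
```

In this sketch concurrent deposits are harmless without coordination, while concurrent withdrawals are the operation that would need a stronger consistency choice.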

Slide 60

Slide 60 text

‘Cause I’m Strong Enough: Reasoning About Consistency Choices in Distributed Systems

Slide 61

Slide 61 text

Conclusion
• Use Formal Verification on Critical Components
• Unit Tests & Integration Tests find a multitude of Errors
• Increase Confidence via Property Testing & Fault Injection

Slide 62

Slide 62 text

Camille Fournier “Enjoy the ride, have fun, and test your freaking code”

Slide 63

Slide 63 text

Thank You: Peter Alvaro, Kyle Kingsbury, Christopher Meiklejohn, Alex Rasmussen, Ines Sombra, Nathan Taylor, Alvaro Videla

Slide 64

Slide 64 text

Questions? @caitie Resources: https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem