The Verification of a Distributed System

Caitie McCaffrey

December 08, 2016

Transcript

  1. The Verification of a Distributed System: A Practitioner’s Guide to Increasing Confidence in System Correctness
  2. Caitie McCaffrey, Distributed Systems Engineer. caitiem.com, @caitie

  3. Increase Confidence by Exploring More of the Input State Space (diagram: All Inputs)
  5. Leslie Lamport: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
  6. We Are All Building Distributed Systems (diagram: Service, Service, Service)

  7. Twitter Services

  9. Overview. Formal Verification: Provably Correct Systems. Testing in the Wild: Increase Confidence in System Correctness. Research: A New Hope
  10. References

  11. Formal Verification: Provably Correct
  12. Formal Specifications: a written description of what a system is supposed to do. Examples: TLA+, Coq
  13. State Space Explored by Formal Verification (diagram: All Inputs)
  14. State Space Explored by Formal Verification (diagram: All Inputs, with the Formal Verification region marked)

  15. Hour Clock Specification (TLA+). Source: Leslie Lamport, Specifying Systems

        ---------------------- MODULE HourClock ----------------------
        EXTENDS Naturals
        VARIABLE hr
        HCini == hr \in (1 .. 12)
        HCnxt == hr' = IF hr # 12 THEN hr + 1 ELSE 1
        HC == HCini /\ [][HCnxt]_hr
        --------------------------------------------------------------
        THEOREM HC => []HCini
        ==============================================================
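The HourClock theorem says that every state reachable under the specification keeps `hr` between 1 and 12. As an illustration only, here is a minimal Python sketch of what a model checker does with such a spec: enumerate the initial states, follow the next-state relation, and check the invariant on everything reached. The function names (`hc_init`, `hc_next`, `reachable_states`) are made up for this sketch; it is not TLA+ tooling and proves nothing beyond this tiny state space.

```python
def hc_init():
    """HCini: hr may start at any value in 1..12."""
    return set(range(1, 13))


def hc_next(hr):
    """HCnxt: hr' = IF hr # 12 THEN hr + 1 ELSE 1."""
    return hr + 1 if hr != 12 else 1


def invariant(hr):
    """The property claimed by THEOREM HC => []HCini."""
    return 1 <= hr <= 12


def reachable_states():
    """Worklist exploration of every state reachable from hc_init()."""
    seen = set()
    frontier = list(hc_init())
    while frontier:
        hr = frontier.pop()
        if hr in seen:
            continue
        seen.add(hr)
        frontier.append(hc_next(hr))
    return seen


states = reachable_states()
violations = [hr for hr in states if not invariant(hr)]
```

An empty `violations` list means the invariant holds on every reachable state of this model. Real specifications have far larger state spaces, which is where dedicated tools earn their keep.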
  16. Use of Formal Methods at Amazon Web Services (TLA+)
  17. “Formal Methods Have Been a Big Success”: S3 & 10+ core pieces of infrastructure verified; 2 serious bugs found; increased confidence to make optimizations. Source: Use of Formal Methods at Amazon Web Services (TLA+)
  18. Leslie Lamport, Specifying Systems (TLA+): “It’s a good idea to understand a system before building it, so it’s a good idea to write a specification of a system before implementing it”
  19. “Formal methods deal with models of systems, not the systems themselves”. Source: Use of Formal Methods at Amazon Web Services
  20. Program Extraction

  21. Program Extraction (Coq), POPL 2016: “Our Verified Implementation is extracted to OCaml & runs on real networks”
  22. Program Extraction (Coq), POPL 2016: “We have developed & checked our framework in Coq, extracted it to OCaml, and built executable stores”
  23. Distributed Systems Testing in the Wild “Seems Pretty Legit”

  24. Unit Tests: testing of individual software components or modules

  25. Simple Testing Can Prevent Most Critical Failures

  26. 77% of production failures can be reproduced by a unit test. Source: Simple Testing Can Prevent Most Critical Failures
  27. Testing error handling code could have prevented 58% of catastrophic failures. Source: Simple Testing Can Prevent Most Critical Failures
  28. 35% of catastrophic failures: error handling code is simply empty or only contains a log statement; error handler aborts the cluster on an overly general exception; error handler contains comments like FIXME or TODO. Source: Simple Testing Can Prevent Most Critical Failures
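The error handling failures listed above are the kind a small unit test can catch. The following Python sketch is an assumption-laden illustration, not code from the talk or the cited paper: `replicate`, `ReplicationError`, and `failing_send` are hypothetical names. It shows a handler that catches only the specific failure it understands and surfaces it, plus a test double that forces the error path to run.

```python
class ReplicationError(Exception):
    """Raised when a write could not be replicated (hypothetical type)."""


def replicate(record, send):
    """Send `record` to a peer; surface failures instead of swallowing them."""
    try:
        send(record)
    except ConnectionError as exc:
        # Catch only the failure we know how to handle, and keep its cause.
        raise ReplicationError(f"could not replicate {record!r}") from exc
    return True


def failing_send(record):
    """Test double that simulates a dropped connection."""
    raise ConnectionError("peer unreachable")


def error_path_is_exercised():
    """Unit test body: the handler must raise, not log and continue."""
    try:
        replicate({"id": 1}, failing_send)
    except ReplicationError as err:
        return isinstance(err.__cause__, ConnectionError)
    return False
```

Passing the transport in as `send` is a design choice made here so the failure can be injected without a network.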
  29. Types Are Not Testing: A Short Counter Example (Scala)

  30. TCP Doesn’t Care About Your Type System

  31. Integration Tests: testing of integrated modules to verify combined functionality
  32. Three nodes or fewer can reproduce 98% of failures. Source: Simple Testing Can Prevent Most Critical Failures
  33. Property Based Testing

  34. QuickCheck: Haskell & Erlang. ScalaCheck: Scala & Java. Languages with QuickCheck ports: C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl, Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, Standard ML, Swift
  35. ScalaCheck Examples
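The slide's own ScalaCheck examples did not survive in this transcript. As a stand-in, here is a hand-rolled Python sketch of the idea behind property-based testing: state a property, generate many random inputs, and report the first counterexample. It uses only the standard library rather than ScalaCheck or a QuickCheck port, and `merge`, `random_counters`, and `check_property` are hypothetical names chosen for the illustration.

```python
import random


def merge(a, b):
    """Merge two per-node counter maps by taking the max for each node."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def random_counters(rng):
    """Generator: a random map from a few node names to small counts."""
    nodes = ["n1", "n2", "n3"]
    return {n: rng.randint(0, 100) for n in nodes if rng.random() < 0.7}


def check_property(prop, trials=500, seed=0):
    """Run `prop` on random input pairs; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = random_counters(rng), random_counters(rng)
        if not prop(a, b):
            return (a, b)
    return None


def commutative(a, b):
    return merge(a, b) == merge(b, a)


def idempotent(a, b):
    return merge(a, a) == a


def left_biased(a, b):
    # Deliberately false property, to show a counterexample being found.
    return merge(a, b) == a
```

`check_property(commutative)` and `check_property(idempotent)` should return `None`, while `check_property(left_biased)` should return a failing input pair. Real libraries add shrinking, which reduces a failing input to a minimal one; this sketch omits it.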

  36. Fault Injection: introducing faults into the system under test
  37. “Without explicitly forcing a system to fail, it is unreasonable to have any confidence it will operate correctly in failure modes” (The Verification of a Distributed System)
  38. Netflix Simian Army. Chaos Monkey: kills instances. Latency Monkey: induces artificial latency. Chaos Gorilla: simulates outage of an entire availability zone.
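Fault injection as described in the slides above can be tried at very small scale inside a test. The Python sketch below is an illustration under stated assumptions, unrelated to how the Simian Army is implemented: `with_faults` wraps a dependency so it fails with a chosen probability, and `call_with_retry` is a hypothetical client whose failure handling is the thing under test.

```python
import random


class InjectedFault(Exception):
    """Failure raised on purpose by the fault injector."""


def with_faults(call, failure_rate, rng):
    """Wrap `call` so that it fails with probability `failure_rate`."""
    def wrapped(*args):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return call(*args)
    return wrapped


def call_with_retry(call, arg, attempts=5):
    """Client under test: retry up to `attempts` times, then give up."""
    last = None
    for _ in range(attempts):
        try:
            return call(arg)
        except InjectedFault as exc:
            last = exc
    raise last


def lookup(key):
    """Stand-in for a remote dependency."""
    return f"value-for-{key}"


always_down = with_faults(lookup, 1.0, random.Random(1))
never_down = with_faults(lookup, 0.0, random.Random(1))
```

With `never_down` the client returns normally; with `always_down` it exhausts its retries and raises `InjectedFault`, which is exactly the failure mode a test should force and observe.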
  39. Jepsen (credit: @aphyr): fault injection tool that simulates network partitions in the system under test. Kyle has used this tool to show us that many of the distributed systems we know seem stable but are really just this. (cut to tire fire photo)
  41. CAUTION: Passing Tests Does Not Ensure Correctness

  42. State Space Explored (diagram: All Inputs; covered by: Unit Tests)
  43. State Space Explored (diagram: All Inputs; covered by: Unit Tests, Integration Tests)
  44. State Space Explored (diagram: All Inputs; covered by: Unit Tests, Integration Tests, Property Tests)
  45. State Space Explored (diagram: All Inputs; covered by: Unit Tests, Integration Tests, Property Tests, Fault Injection Tests)
  46. Game Days: breaking your services on purpose. Source: Resilience Engineering: Learning to Embrace Failure
  47. How to Run a GameDay: (1) Notify engineering teams that failure is coming. (2) Induce failures. (3) Monitor systems under test. (4) Observing-only team monitors recovery processes & systems, files bugs. (5) Prioritize bugs & get buy-in across teams. Source: Resilience Engineering: Learning to Embrace Failure
  48. Game Day at Stripe: “During a recent game day, we tested failing over a Redis cluster by running kill -9 on its primary node, and ended up losing all data in the cluster”. Source: Game Day Exercises at Stripe: Learning from `kill -9`
  49. Some Thoughts on Testing in Production

  50. Monitoring Is Not Testing

  51. Canaries: “Verification” in production

  52. Verifying System Complexity (chart: probability of failure vs. rank; regions: classical engineering, reactive ops, unk-unk; marked: Catastrophic Failures). Source: Architectural Patterns of Resilient Distributed Systems
  53. Verifying System Complexity (same chart; marked: Unit & Integration Tests)
  54. Verifying System Complexity (same chart; marked: Unit & Integration Tests, Property Based Testing)
  55. Verifying System Complexity (same chart; marked: Unit & Integration Tests, Property Based Testing, Fault Injection Testing)
  56. Verifying System Complexity (same chart; marked: Unit & Integration Tests, Property Based Testing, Fault Injection Testing, Canaries, Game Days, Monitoring)
  57. Research: Improving the Verification of Distributed Systems. Lineage Driven Fault Injection; ‘Cause I’m Strong Enough: Reasoning about Consistency Choices in Distributed Systems; IronFleet: Proving Practical Distributed Systems Correct
  58. ‘Cause I’m Strong Enough (POPL 2016)

  59. Bank Application: bank account must be > 0. Operations: Deposit Money, Withdraw Money
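The bank example is a standard way to show why some operations need coordination while others do not. The Python sketch below is an illustration, not the analysis from the paper: it assumes the invariant is that the balance never goes negative (the slide writes it as `> 0`), and the replica variables and helper names are hypothetical. Each replica checks the invariant against its own local view, which is not enough once withdrawals happen concurrently.

```python
def can_withdraw(balance, amount):
    """Local precondition check performed by a single replica."""
    return balance - amount >= 0


def apply_ops(balance, ops):
    """Apply a list of signed deltas (deposits > 0, withdrawals < 0)."""
    return balance + sum(ops)


initial = 100

# Two replicas each see a balance of 100 and, without coordinating,
# each accepts a withdrawal of 80 because its local check passes.
replica_a_ops = [-80] if can_withdraw(initial, 80) else []
replica_b_ops = [-80] if can_withdraw(initial, 80) else []

# Once the replicas exchange operations, both withdrawals apply.
merged = apply_ops(initial, replica_a_ops + replica_b_ops)
invariant_holds = merged >= 0

# Deposits only increase the balance, so concurrent deposits cannot
# break the invariant and need no coordination.
deposits_only = apply_ops(initial, [50, 25])
```

Here `merged` ends up negative even though every local check passed, while the deposit-only history stays valid. That asymmetry is the intuition for coordinating withdrawals but not deposits.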
  60. ‘Cause I’m Strong Enough: Reasoning About Consistency Choices in Distributed Systems
  61. Conclusion: Use formal verification on critical components. Unit tests & integration tests find a multitude of errors. Increase confidence via property testing & fault injection.
  62. Camille Fournier: “Enjoy the ride, have fun, and test your freaking code”
  63. Thank You: Peter Alvaro, Kyle Kingsbury, Christopher Meiklejohn, Alex Rasmussen, Ines Sombra, Nathan Taylor, Alvaro Videla
  64. Questions? @caitie. Resources: https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem