Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The Verification of a Distributed System

The Verification of a Distributed System

Caitie McCaffrey

December 08, 2016
Tweet

More Decks by Caitie McCaffrey

Other Decks in Technology

Transcript

  1. The Verification of a
    Distributed System
    A Practitioner’s Guide to Increasing Confidence in System Correctness

    View full-size slide

  2. Distributed Systems Engineer
    Caitie McCaffrey
    caitiem.com
    @caitie

    View full-size slide

  3. Increase Confidence by Exploring More
    of the Input State Space
    All Inputs

    View full-size slide

  4. LESLIE LAMPORT
    “A Distributed System is one in which
    the failure of a computer you didn’t even
    know existed can render your own
    computer unusable”

    View full-size slide

  5. Service Service
    Service
    We Are All
    Building
    Distributed
    Systems

    View full-size slide

  6. Twitter
    Services

    View full-size slide

  7. Overview
    Formal Verification
    Provably Correct Systems
    Testing in the Wild
    Increase Confidence in System Correctness
    Research
    A New Hope

    View full-size slide

  8. Provably
    Correct
    Formal
    Verification

    View full-size slide

  9. Formal Specifications
    Written description of what a system is supposed to do
    TLA+ Coq

    View full-size slide

  10. All Inputs
    State Space Explored by Formal Verification

    View full-size slide

  11. All Inputs
    Formal Verification
    State Space Explored by Formal Verification

    View full-size slide

  12. Hour Clock Specification
    ————————————— MODULE HourClock ————————————————
    EXTENDS Naturals
    VARIABLE hr
    HCini == hr \in (1 .. 12)
    HCnxt == hr’ = IF hr # 12 THEN hr + 1 ELSE 1
    HC == HCini /\ [][HCnxt] _hr
    ————————————————————————————————————————————-
    THEOREM HC => []HCini
    =============================================
    Leslie Lamport, Specifying Systems
    TLA+

    View full-size slide

  13. Use of
    Formal
    Methods at
    Amazon
    Web Services
    TLA+

    View full-size slide

  14. “Formal Methods Have Been a Big Success”
    S3 & 10+ Core Pieces of
    Infrastructure Verified
    2 Serious Bugs Found
    Increased Confidence to make
    Optimizations
    Use of Formal Methods at Amazon Web Services
    TLA+

    View full-size slide

  15. Leslie Lamport, Specifying Systems
    “Its a good idea to understand a system
    before building it, so its a good idea to write
    a specification of a system before
    implementing it”
    TLA+

    View full-size slide

  16. “Formal methods deal with
    models of systems, not the
    systems themselves”
    Use of Formal Methods at Amazon Web Services

    View full-size slide

  17. Program Extraction

    View full-size slide

  18. POPL
    2016
    “Our Verified
    Implementation is
    extracted to OCaml
    & runs on real
    networks”
    Program Extraction
    COQ

    View full-size slide

  19. POPL
    2016
    “We have developed &
    checked our framework
    in Coq, extracted it to
    OCaml, and built
    executable stores”
    Program Extraction
    COQ

    View full-size slide

  20. Distributed
    Systems Testing
    in the Wild
    “Seems Pretty Legit”

    View full-size slide

  21. Unit Tests
    Testing of Individual Software
    Components or Modules

    View full-size slide

  22. Simple
    Testing Can
    Prevent
    Most Critical
    Failures

    View full-size slide

  23. 77% of Production failures
    can be reproduced by a
    Unit Test
    Simple Testing can Prevent Most Critical Failures

    View full-size slide

  24. Testing error handling
    code could have
    prevented 58% of
    catastrophic failures
    Simple Testing can Prevent Most Critical Failures

    View full-size slide

  25. Error Handling Code is simply empty or
    only contains a Log statement
    Error Handler aborts cluster on an overly
    general exception
    Error Handler contains comments like
    FIXME or TODO
    35% of
    Catastrophic
    Failures
    Simple Testing can Prevent Most Critical Failures

    View full-size slide

  26. Scala
    Types
    Are Not
    Testing
    A Short Counter Example

    View full-size slide

  27. TCP Doesn’t Care About
    Your Type System

    View full-size slide

  28. Integration Tests
    Testing of integrated modules to
    verify combined functionality

    View full-size slide

  29. Three nodes or
    less can reproduce
    98% of failures
    Simple Testing can Prevent Most Critical Failures

    View full-size slide

  30. Property Based
    Testing

    View full-size slide

  31. QuickCheck ScalaCheck
    Haskell
    Erlang
    Scala
    Java
    & &
    C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl,
    Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, StandardML , Swift
    Languages with Quick Check Ports:

    View full-size slide

  32. ScalaCheck Examples

    View full-size slide

  33. Fault Injection
    Introducing faults into the system under test

    View full-size slide

  34. -The Verification of a Distributed System
    “Without explicitly forcing a system
    to fail, it is unreasonable to have any
    confidence it will operate correctly
    in failure modes”

    View full-size slide

  35. Netflix Simian Army
    • Chaos Monkey: kills instances
    • Latency Monkey: artificial latency
    induced
    • Chaos Gorilla: simulates outage
    of entire availability zone.

    View full-size slide

  36. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View full-size slide

  37. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View full-size slide

  38. CAUTION: Passing Tests
    Does Not Ensure Correctness

    View full-size slide

  39. All Inputs
    State Space
    Explored
    Unit Tests

    View full-size slide

  40. All Inputs
    State Space
    Explored
    Unit Tests
    Integration
    Tests

    View full-size slide

  41. All Inputs
    State Space
    Explored
    Unit Tests
    Integration
    Tests
    Property
    Tests

    View full-size slide

  42. All Inputs
    State Space
    Explored
    Unit Tests
    Integration
    Tests
    Property
    Tests
    Fault
    Injection
    Tests

    View full-size slide

  43. GAME DAYS
    Resilience Engineering: Learning to Embrace Failure
    Breaking your services on purpose

    View full-size slide

  44. How to Run a GameDay
    1. Notify Engineering Teams that Failure is Coming
    2. Induce Failures
    3. Monitor Systems Under Test
    4. Observing Only Team Monitors Recovery Processes
    & Systems, Files Bugs
    5. Prioritize Bugs & Get Buy-In Across Teams
    Resilience Engineering: Learning to Embrace Failure

    View full-size slide

  45. Game Day at Stripe
    “During a recent game day, we
    tested failing over a Redis cluster by
    running kill -9 on its primary
    node, and ended up losing all
    data in the cluster”
    Game Day Exercises at Stripe: Learning from `kill -9`

    View full-size slide

  46. TESTING IN
    PRODUCTION
    Some thoughts on

    View full-size slide

  47. Monitoring
    Testing
    is not

    View full-size slide

  48. CANARIES
    “Verification” in production

    View full-size slide

  49. Probability
    of failure
    Rank
    Catastrophic
    Failures
    Classical

    engineering
    Reactive

    ops
    unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View full-size slide

  50. Probability
    of failure
    Rank
    Unit & Integration Tests
    Classical

    engineering
    Reactive

    ops
    unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View full-size slide

  51. Probability
    of failure
    Rank
    Unit & Integration Tests
    Property Based Testing
    Classical

    engineering
    Reactive

    ops unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View full-size slide

  52. Probability
    of failure
    Rank
    Unit & Integration Tests
    Property Based Testing
    Fault Injection Testing
    Classical

    engineering
    Reactive

    ops unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View full-size slide

  53. Probability
    of failure
    Rank
    Unit & Integration Tests
    Property Based Testing
    Fault Injection Testing
    Canaries
    Game Days
    Monitoring
    Classical

    engineering
    Reactive

    ops unk-unk
    Verifying System Complexity
    Architectural Patterns of Resilient Distributed Systems

    View full-size slide

  54. Research
    Improving the Verification
    of Distributed Systems
    Lineage Driven Fault Injection
    ‘Cause I’m Strong Enough:
    Reasoning about Consistency Choices in Distributed Systems
    IronFleet:
    Proving Practical Distributed Systems Correct

    View full-size slide

  55. ‘Cause I’m
    Strong Enough
    POPL
    2016

    View full-size slide

  56. Bank Application
    Bank Account must be > 0
    Deposit Money
    Withdrawal Money

    View full-size slide

  57. ‘Cause I’m Strong Enough: Reasoning About Consistency Choices in Distributed Systems

    View full-size slide

  58. Conclusion
    Use Formal Verification on
    Critical Components
    Unit Tests & Integration Tests find a
    multitude of Errors
    Increase Confidence via Property
    Testing & Fault Injection

    View full-size slide

  59. Camille Fournier
    “Enjoy the ride, have fun, and
    test your freaking code”

    View full-size slide

  60. Thank You
    Peter Alvaro
    Kyle Kingsbury
    Christopher
    Meiklejohn
    Alex Rasmussen
    Ines Sombra
    Nathan Taylor
    Alvaro Videla

    View full-size slide

  61. Questions
    @caitie
    https://github.com/CaitieM20/
    Talks/tree/master/
    TheVerificationOfADistributedSystem
    Resources:

    View full-size slide