Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The Verification of a Distributed System

The Verification of a Distributed System

Caitie McCaffrey

May 24, 2016
Tweet

More Decks by Caitie McCaffrey

Other Decks in Technology

Transcript

  1. The Verification of a
    Distributed System
    A Practitioner’s Guide to Increasing Confidence in System Correctness

    View Slide

  2. Distributed Systems Engineer
    Caitie McCaffrey
    caitiem.com
    @caitie

    View Slide

  3. LESLIE LAMPORT
    “A Distributed System is one in which
    the failure of a computer you didn’t even
    know existed can render your own
    computer unusable”

    View Slide

  4. Service Service
    Service
    We Are All
    Building
    Distributed
    Systems

    View Slide

  5. Twitter
    Services

    View Slide

  6. View Slide

  7. Overview
    Formal Verification
    Provably Correct Systems
    Testing in the Wild
    Increase Confidence in System Correctness
    Research
    A New Hope

    View Slide

  8. References

    View Slide

  9. Safety Liveness
    &

    View Slide

  10. Formal
    Verification

    View Slide

  11. Provably
    Correct
    Formal
    Verification

    View Slide

  12. Formal Specifications
    Written description of what a system is supposed to do
    TLA+ Coq

    View Slide

  13. Leslie Lamport, Specifying Systems
    “Its a good idea to understand a system
    before building it, so its a good idea to write
    a specification of a system before
    implementing it”
    TLA+

    View Slide

  14. Hour Clock Specification
    ————————————— MODULE HourClock ————————————————
    EXTENDS Naturals
    VARIABLE hr
    HCini == hr \in (1 .. 12)
    HCnxt == hr’ = IF hr # 12 THEN hr + 1 ELSE 1
    HC == HCini /\ [][HCnxt] _hr
    ————————————————————————————————————————————-
    THEOREM HC => []HCini
    =============================================
    Leslie Lamport, Specifying Systems
    TLA+

    View Slide

  15. Use of
    Formal
    Methods at
    Amazon
    Web Services
    TLA+

    View Slide

  16. “Formal Methods Have Been a Big Success”
    S3 & 10+ Core Pieces of
    Infrastructure Verified
    2 Serious Bugs Found
    Increased Confidence to make
    Optimizations
    Use of Formal Methods at Amazon Web Services
    TLA+

    View Slide

  17. “Formal methods deal with
    models of systems, not the
    systems themselves”
    Use of Formal Methods at Amazon Web Services

    View Slide

  18. Distributed
    Systems Testing
    in the Wild

    View Slide

  19. Distributed
    Systems Testing
    in the Wild
    “Seems Pretty Legit”

    View Slide

  20. Unit Tests
    Testing of Individual Software
    Components or Modules

    View Slide

  21. Scala
    Types
    Are Not
    Testing
    A Short Counter Example

    View Slide

  22. TCP Doesn’t Care About
    Your Type System

    View Slide

  23. Integration Tests
    Testing of integrated modules to
    verify combined functionality

    View Slide

  24. Simple
    Testing Can
    Prevent
    Most Critical
    Failures

    View Slide

  25. Three nodes or
    less can reproduce
    98% of failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  26. Testing error handling
    code could have
    prevented 58% of
    catastrophic failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  27. Error Handling Code is simply empty or
    only contains a Log statement
    Error Handler aborts cluster on an overly
    general exception
    Error Handler contains comments like
    FIXME or TODO
    35% of
    Catastrophic
    Failures
    Simple Testing can Prevent Most Critical Failures

    View Slide

  28. Property Based
    Testing

    View Slide

  29. QuickCheck ScalaCheck
    Haskell
    Erlang
    Scala
    Java
    & &
    C, C++, Clojure, Common Lisp, Elm, F#, C#, Go, JavaScript, Node.js, Objective-C, OCaml, Perl,
    Prolog, PHP, Python, R, Ruby, Rust, Scheme, Smalltalk, StandardML , Swift
    Languages with Quick Check Ports:

    View Slide

  30. ScalaCheck Examples

    View Slide

  31. Fault Injection
    Introducing faults into the system under test

    View Slide

  32. -The Verification of a Distributed System
    “Without explicitly forcing a system
    to fail, it is unreasonable to have any
    confidence it will operate correctly
    in failure modes”

    View Slide

  33. Netflix Simian Army
    • Chaos Monkey: kills instances
    • Latency Monkey: artificial latency
    induced
    • Chaos Gorilla: simulates outage
    of entire availability zone.

    View Slide

  34. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View Slide

  35. Kyle has used this tool to show us that many of the
    Distributed Systems we know seem stable
    but are really just this. (cut to tire fire photo)
    JEPSEN
    credit: @aphyr
    Fault Injection Tool
    that simulates
    network
    partitions in the
    system under test

    View Slide

  36. CAUTION: Passing Tests
    Does Not Ensure Correctness

    View Slide

  37. GAME DAYS
    Resilience Engineering: Learning to Embrace Failure
    Breaking your services on purpose

    View Slide

  38. How to Run a GameDay
    1. Notify Engineering Teams that Failure is Coming
    2. Induce Failures
    3. Monitor Systems Under Test
    4. Observing Only Team Monitors Recovery Processes
    & Systems, Files Bugs
    5. Prioritize Bugs & Get Buy-In Across Teams
    Resilience Engineering: Learning to Embrace Failure

    View Slide

  39. Game Day at Stripe
    “During a recent game day, we
    tested failing over a Redis cluster by
    running kill -9 on its primary
    node, and ended up losing all
    data in the cluster”
    Game Day Exercises at Stripe: Learning from `kill -9`

    View Slide

  40. TESTING IN
    PRODUCTION
    Some thoughts on

    View Slide

  41. Monitoring
    Testing
    is not

    View Slide

  42. CANARIES
    “Verification” in production

    View Slide

  43. Verification
    Wild
    in the wild
    Unit & Integration Tests
    Property Based Testing
    Fault Injection
    Canaries

    View Slide

  44. Research
    Improving the Verification
    of Distributed Systems
    Lineage Driven Fault Injection
    ‘Cause I’m Strong Enough:
    Reasoning about Consistency Choices in Distributed Systems
    IronFleet:
    Proving Practical Distributed Systems Correct
    Towards Property Based
    Consistency Verification

    View Slide

  45. Lineage-driven
    Fault Injection

    View Slide

  46. Lineage-driven
    Fault Injection
    Molly In Action

    View Slide

  47. Netflix & Molly
    Distributed Tracing + FIT To
    construct call graphs
    Metric Systems to Determine
    if Call was a Success
    Used FIT to Inject Failures
    determined by Molly
    “Monkeys in Lab Coats”: Applied Failure Testing Research at Netflix

    View Slide

  48. Conclusion
    Use Formal Verification on
    Critical Components
    Unit Tests & Integration Tests find a
    multitude of Errors
    Increase Confidence via Property
    Testing & Fault Injection

    View Slide

  49. Camille Fournier
    “Enjoy the ride, have fun, and
    test your freaking code”

    View Slide

  50. Thank You
    Peter Alvaro
    Kyle Kingsbury
    Christopher
    Meiklejohn
    Alex Rasmussen
    Ines Sombra
    Nathan Taylor
    Alvaro Videla

    View Slide

  51. Questions
    @caitie
    http://github.com/CaitieM20/
    TheVerificationOfDistributedSystem
    Resources:

    View Slide