Architectural Patterns of Resilient Distributed Systems

Ines Sombra
September 26, 2015

Transcript

  1. OMG Strangeloop 2015!!
    Architectural Patterns of Resilient Distributed Systems

  2. Globally distributed & highly available

  3. Today’s Journey
    Why care?
    Resilience literature
    Resilience in industry
    Conclusions
    @randommood

  4. OBLIGATORY DISCLAIMER SLIDE

    All from a
    practitioner’s
    perspective!
    @randommood
    Things you may see in this talk
    Pugs
    Fast talking
    Life pondering
    Un-tweetable moments
    Rantifestos
    What surprised me this year
    Wedding factoids and trivia

  5. Why Resilience?

  6. How can I
    make a
    system
    more
    resilient?
    @randommood

  7. @randommood
    Resilience is the ability
    of a system to adapt or
    keep working when
    challenges occur

  8. Defining Resilience
    Fault-tolerance
    Evolvability
    Scalability
    Failure isolation
    Complexity
    management
    @randommood

  9. It’s
    what
    really
    matters
    @randommood

  10. Resilience in
    Literature

  11. Harvest & Yield Model

  12. @randommood
    Yield
    Fraction of successfully answered queries
    Close to uptime but more useful because
    it directly maps to user experience
    (uptime misses this)
    Focus on yield rather than uptime

  13. @randommood
    Harvest: fraction of the complete result
    From Coda Hale’s “You can’t sacrifice partition tolerance”
    [Figure labels: Server A, Server B, Server C; Cute, Baby, Animals]

  14. @randommood
    Harvest: fraction of the complete result
    From Coda Hale’s “You can’t sacrifice partition tolerance”
    [Figure labels: Server A, Server B, Server C, one marked X; Cute, Baby, Animals]
    66% harvest
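The two metrics defined on the slides above can be sketched as simple ratios. This is only an illustration: `QueryOutcome`, `yield_fraction`, and `harvest_fraction` are hypothetical names, and the sample data is made up to mirror the three-server picture (one server lost, so two thirds of the result survives).

```python
from dataclasses import dataclass


@dataclass
class QueryOutcome:
    answered: bool       # did the client get any response at all?
    shards_total: int    # shards holding data relevant to the query
    shards_reached: int  # shards that actually contributed to the answer


def yield_fraction(outcomes):
    """Yield: fraction of queries that were answered."""
    if not outcomes:
        return 1.0
    return sum(o.answered for o in outcomes) / len(outcomes)


def harvest_fraction(outcome):
    """Harvest: fraction of the complete result reflected in one answer."""
    if not outcome.answered or outcome.shards_total == 0:
        return 0.0
    return outcome.shards_reached / outcome.shards_total


outcomes = [
    QueryOutcome(True, 3, 3),   # complete answer
    QueryOutcome(True, 3, 2),   # one of three servers was down
    QueryOutcome(False, 3, 0),  # request failed outright
    QueryOutcome(True, 3, 3),
]
print(yield_fraction(outcomes))                 # 0.75
print(round(harvest_fraction(outcomes[1]), 2))  # 0.67
```

Note that the second query still counts toward yield even though its harvest is reduced; uptime alone would hide both numbers.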

  15. @randommood
    #1: Probabilistic Availability
    Graceful harvest degradation under faults
    Randomness to make the worst-case &
    average-case the same
    Replication of high-priority data for greater
    harvest control
    Degrading results based on client capability
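One way to picture graceful harvest degradation is a scatter-gather query that returns whatever the reachable shards produced, together with the harvest it achieved, instead of failing the whole request. The sketch below assumes shards are plain callables; `scatter_gather` and the three `server_*` functions are hypothetical stand-ins, not anything from the talk.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def scatter_gather(shards, query, timeout_s=1.0):
    """Ask every shard; return partial results plus the harvest achieved."""
    if not shards:
        return [], 1.0
    pool = ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(shard, query) for shard in shards]
    done, _pending = wait(futures, timeout=timeout_s)
    results, answered = [], 0
    for future in futures:  # iterate in submission order for stable output
        if future in done and future.exception() is None:
            results.extend(future.result())
            answered += 1
    pool.shutdown(wait=False)  # do not block the caller on slow shards
    return results, answered / len(shards)


def server_a(query):
    return ["kitten"]


def server_b(query):
    raise ConnectionError("server B is unreachable")


def server_c(query):
    return ["puppy"]


results, harvest = scatter_gather([server_a, server_b, server_c], "cute baby animals")
print(results, round(harvest, 2))  # ['kitten', 'puppy'] 0.67
```

The caller still gets an answer (yield is preserved) and can decide, based on the reported harvest, whether the degraded result is good enough for this client.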

  16. @randommood
    #2: Decomposition & Orthogonality
    Decomposing into subsystems independently
    intolerant to harvest degradation but the
    application can continue if they fail
    You can only provide strong consistency for the
    subsystems that need it
    Orthogonal mechanisms (state vs functionality)

  17. @randommood
    “If your system favors
    yield or harvest is an
    outcome of its design”
    Fox & Brewer

  18. Cook & Rasmussen model

  19. Cook & Rasmussen
    [Figure labels: Economic failure boundary; Unacceptable workload boundary; Accident boundary; Operating point]

  20. Cook & Rasmussen
    [Figure labels: Economic failure boundary; Unacceptable workload boundary; Accident boundary]

  21. Cook & Rasmussen
    [Figure labels: Economic failure boundary; Unacceptable workload boundary; Accident boundary; Pressure towards efficiency]

  22. Cook & Rasmussen
    [Figure labels: Economic failure boundary; Unacceptable workload boundary; Accident boundary; Pressure towards efficiency; Reduction of effort]

  23. Cook & Rasmussen
    [Figure labels: Economic failure boundary; Unacceptable workload boundary; Accident boundary; Pressure towards efficiency; Reduction of effort; Incident!]

  24. Cook & Rasmussen
    [Figure labels: Economic failure boundary; Unacceptable workload boundary; Accident boundary; Pressure towards efficiency; Reduction of effort; Safety campaign]

  25. Cook & Rasmussen
    [Figure labels: Economic failure boundary; Unacceptable workload boundary; Accident boundary; Pressure towards efficiency; Reduction of effort; Error margin; Marginal boundary; Safety campaign]

  26. Flirting with the margin
    [Figure labels: Error margin; Original marginal boundary; Acceptable operating point; Accident boundary]
    R.I. Cook - 2004

  32. Flirting with the margin
    [Figure labels: Accident boundary; New marginal boundary!]
    R.I. Cook - 2004

  34. @randommood
    Insights from Cook’s model
    Engineering resilience requires a model of
    safety based on: monitoring, responding,
    adapting, and learning
    System safety is about what can happen,
    where the operating point actually is, and
    what we do under pressure
    Resilience is operator community focused

  35. @randommood
    Engineering system resilience
    Build support for continuous maintenance
    Reveal control of system to operators
    Know it’s going to get moved, replaced, and
    used in ways you did not intend
    Think about configurations as interfaces

  36. Borrill's model

  37. A system’s complexity
    @randommood
    [Figure axes: Probability of failure, Rank; areas: Traditional engineering, Reactive ops, Unk-unk]
    Cascading or catastrophic failures & you don’t know where they will come from!
    Same area as other 2 combined

  38. Failure areas need != strategies
    @randommood
    [Figure axes: Probability of failure, Rank; areas: Traditional engineering, Reactive ops, Unk-unk]

  39. Failure areas need != strategies
    @randommood
    [Figure axes: Probability of failure, Rank; areas: Traditional engineering, Reactive ops, Unk-unk]
    Kingsbury

  40. Failure areas need != strategies
    @randommood
    [Figure axes: Probability of failure, Rank; areas: Traditional engineering, Reactive ops, Unk-unk]
    Kingsbury VS

  41. Failure areas need != strategies
    @randommood
    [Figure axes: Probability of failure, Rank; areas: Traditional engineering, Reactive ops, Unk-unk]
    Kingsbury VS Alvaro

  42. Strategies to build resilience
    Columns: Classical engineering, Reactive Operations, Unknown-Unknown
    Code standards
    Programming patterns
    Testing (full system!)
    Metrics & monitoring
    Convergence to good state
    Hazard inventories
    Redundancies
    Feature flags
    Dark deploys
    Runbooks & docs
    Canaries
    System verification
    Formal methods
    Fault injection
    The goal is to build failure domain independence

  43. @randommood
    “Thinking about building
    system resilience using a
    single discipline is
    insufficient. We need
    different strategies”
    Borrill

  44. Wedding Trivia!!!
    @randommood

  45. Resilience in
    Industry

  46. @randommood
    Now with
    sparkles!


  47. @randommood
    API inherently more vulnerable
    to any system failures or
    latencies in the stack
    Without fault tolerance: 30
    dependencies, each with 99.99% uptime,
    could result in 2+ hours of
    downtime per month!
    Leveraged client libraries
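The "2+ hours" figure can be checked with a few lines of arithmetic. This sketch assumes the dependencies fail independently, that every request needs all of them, and a 30-day month; the function names are hypothetical.

```python
def combined_availability(dependencies, availability_each):
    """Availability when every one of the dependencies must be up."""
    return availability_each ** dependencies


def downtime_hours_per_month(availability, days=30):
    """Expected unavailable hours over a month of the given length."""
    return (1 - availability) * days * 24


combined = combined_availability(30, 0.9999)
print(f"{combined:.4%}")                             # 99.7004%
print(f"{downtime_hours_per_month(combined):.2f} h") # 2.16 h
```

Thirty "four nines" dependencies multiply down to roughly 99.7%, which is why the slide treats fault tolerance at the API layer as mandatory rather than optional.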

  48. @randommood
    Netflix’s resilient patterns
    Aggressive network timeouts &
    retries. Use of Semaphores.
    Separate threads on per-dependency
    thread pools
    Circuit-breakers to relieve
    pressure in underlying systems
    Exceptions cause app to shed
    load until things are healthy
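A circuit breaker of the kind listed above can be sketched in a few lines. This is a minimal, single-threaded illustration and not Netflix's implementation or any library's API: `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical, and the clock is injectable only so the example runs without sleeping.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast, dependency is unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock for the demo
breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])


def flaky():
    raise TimeoutError("dependency timed out")


for _ in range(2):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass

try:
    breaker.call(flaky)
except CircuitOpenError as exc:
    print("rejected:", exc)

now[0] = 11.0
print(breaker.call(lambda: "ok"))  # trial call succeeds and closes the circuit
```

While the circuit is open the caller gets an immediate error instead of waiting on a timeout, which is what relieves pressure on the struggling dependency. A production version would also need locking and per-dependency instances.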

  49. @randommood
    We went on a diet
    just like you!

  50. @randommood
    Key insights from Chubby
    Library vs service? Service and client library
    control + storage of small data files with
    restricted operations
    Engineers don’t plan for: availability,
    consensus, primary elections, failures, their
    own bugs, operability, or the future. They also
    don’t understand Distributed Systems

  51. @randommood
    Key insights from Chubby
    Centralized services are hard to construct but
    you can dedicate effort into architecting them
    well and making them failure-tolerant
    Restricting user behavior increased resilience
    Consumers of your service are part of your
    UNK-UNK scenarios

  52. @randommood
    And the family arrives!

  53. @randommood
    Key insights from Truce
    Evolution of our purging
    system from v1 to v3
    Used Bimodal Multicast
    (Gossip protocol) to
    provide extremely fast
    purging speed
    Design concerns & system
    evolution
    Tyler McMullen, Bruce Spang

  54. @randommood
    Key insights from NetSys
    Existing best practices won’t save you
    João Taveira Araújo looking suave
    Faild allows us to fail & recover hosts via
    MAC-swapping and ECMP on switches
    Do immediate or gradual host failure & recovery
    Watch João’s talk

  57. @randommood
    But wait a minute!
    So we have a myriad of systems at
    different stages of evolution
    Resilient systems like Varnish, Powderhorn,
    and Faild have taught us many lessons, but
    some applications have availability
    problems. Why?

  58. @randommood
    Everyone
    okay?

  59. Resilient
    architectural
    patterns

  60. @randommood
    Redundancies are key
    Redundancies of resources,
    execution paths, checks,
    replication of data, replay of
    messages, anti-entropy build
    resilience
    Gossip / epidemic protocols too
    Capacity planning matters
    Optimizations
    can make your
    system less
    resilient!
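To see why gossip counts as a redundancy mechanism, here is a toy simulation of plain push gossip with lossy links. It is a generic sketch and not the Bimodal Multicast protocol mentioned earlier in the deck; `push_gossip_rounds`, the fanout, and the loss rate are all illustrative assumptions.

```python
import random


def push_gossip_rounds(n_nodes, fanout, rng, loss=0.2, max_rounds=1000):
    """Rounds until every node has heard a rumor that starts at node 0.

    Each round, every informed node pushes the rumor to `fanout` random
    peers, and each push is dropped with probability `loss`.
    Returns None if the rumor has not reached everyone by `max_rounds`.
    """
    informed = {0}
    for round_no in range(1, max_rounds + 1):
        newly_informed = set()
        for _node in informed:
            for peer in rng.sample(range(n_nodes), fanout):
                if rng.random() >= loss:  # message survived the lossy link
                    newly_informed.add(peer)
        informed |= newly_informed
        if len(informed) == n_nodes:
            return round_no
    return None


rounds = push_gossip_rounds(n_nodes=50, fanout=2, rng=random.Random(7))
print("rounds to reach all 50 nodes:", rounds)
```

Because every node keeps re-sending to random peers, a dropped message is usually covered by another path in a later round. The price is duplicate traffic, which is the kind of "inefficiency" an optimization pass might be tempted to remove.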

  61. @randommood
    Operations matter
    Unawareness of proximity to
    error boundary means we are
    always guessing
    Complex operations make
    systems less resilient & more
    incident-prone
    You design operability too!

  62. @randommood
    Not all complexity is bad
    Complexity, if it increases
    safety, is actually good
    Adding resilience may
    come at the cost of
    other desired goals
    (e.g. performance,
    simplicity, cost, etc.)

  63. @randommood
    Leverage Engineering best practices
    Resiliency and testing are correlated. TEST!
    Versioning from the start - provide an upgrade
    path from day 1
    Upgrades & evolvability of systems are still tricky.
    Mixed-mode operations need to be common
    Re-examine the way we prototype systems

  64. Bringing it together

  65. tl;dr
    WHILE IN DESIGN
    Are we favoring harvest or yield?
    Orthogonality & decomposition FTW
    Do we have enough redundancies in place?
    Are we resilient to our dependencies?
    OPERABILITY
    Am I providing enough control to my operators?
    Would I want to be on call for this?
    Rank your services: what can be dropped, killed, deferred?
    Monitoring and alerting in place?
    UNK-UNK
    The existence of this stresses diligence on the other two areas
    Have we done everything we can?
    Abandon hope and resort to human sacrifices
    Theory matters!
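The "rank your services" question can be made concrete with a tiny admission policy. Everything here is hypothetical: the `Rank` levels, the `admit` function, and the 0.6 / 0.8 utilisation thresholds are placeholders that a real system would tune from its own measurements.

```python
from enum import IntEnum


class Rank(IntEnum):
    CRITICAL = 0    # must always be served
    DEFERRABLE = 1  # may be queued and retried later
    DROPPABLE = 2   # may be dropped outright under load


def admit(rank, load):
    """Decide what to do with a request given utilisation in [0, 1]."""
    if rank is Rank.CRITICAL:
        return "serve"
    if rank is Rank.DEFERRABLE:
        return "defer" if load >= 0.8 else "serve"
    return "drop" if load >= 0.6 else "serve"


for load in (0.5, 0.7, 0.9):
    print(load, [admit(rank, load) for rank in Rank])
```

Writing the ranking down as code forces the design conversation the slide asks for: someone has to decide, before the incident, which work is allowed to disappear.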

  66. tl;dr
    WHILE IN DESIGN
    Test dependency failures
    Code reviews != tests. Have both
    Distrust client behavior, even if they are internal
    Version (APIs, protocols, disk formats) from the start. Support mixed-mode operations.
    Checksum all the things
    Error handling, circuit breakers, backpressure, leases, timeouts
    IMPROVING OPERABILITY
    Automation shortcuts taken while in a rush will come back to haunt you
    Release stability is often tied to system stability. Iron out your deploy process
    Link alerts to playbooks
    Consolidate system configuration (data bags, config file, etc.)
    Operators determine resilience
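"Version from the start" and "checksum all the things" fit together naturally in a record format. The sketch below is one possible layout, assumed purely for illustration: a 1-byte format version and a CRC32 in front of a JSON payload, with a hypothetical `ttl_s` field that version 1 records lack.

```python
import json
import struct
import zlib

FORMAT_VERSION = 2
HEADER = struct.Struct(">BI")  # 1-byte version, 4-byte CRC32 of the payload


def encode(record):
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return HEADER.pack(FORMAT_VERSION, zlib.crc32(payload)) + payload


def decode(blob):
    version, expected_crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if zlib.crc32(payload) != expected_crc:
        raise ValueError("checksum mismatch: record is corrupt")
    record = json.loads(payload)
    if version == 1:
        # Mixed-mode support: upgrade old records on read.
        record.setdefault("ttl_s", 0)
    elif version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    return record


blob = encode({"key": "a", "ttl_s": 60})
print(decode(blob))

# A reader on the new code can still consume a record written by version 1.
old_payload = json.dumps({"key": "b"}).encode("utf-8")
old_blob = HEADER.pack(1, zlib.crc32(old_payload)) + old_payload
print(decode(old_blob))

# Flipping one byte of the payload is caught by the checksum.
corrupt = blob[:-1] + bytes([blob[-1] ^ 0xFF])
try:
    decode(corrupt)
except ValueError as exc:
    print("rejected:", exc)
```

Carrying the version in every record is what makes a rolling upgrade possible: old and new nodes can run side by side while both formats are in flight.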

  67. @randommood
    TODAY’S RANTIFESTO
    We can’t recover from lack of
    design. Not minding harvest/yield
    means we sign up for a redesign
    the moment we finish coding.

  68. Thank you!
    github.com/Randommood/Strangeloop2015
    Special thanks to
    Paul Borrill, Jordan West, Caitie
    McCaffrey, Camille Fournier, Mike
    O'Neill, Neha Narula, Joao Taveira,
    Tyler McMullen, Zac Duncan,
    Nathan Taylor, Ian Fung, Armon
    Dadgar, Peter Alvaro, Peter Bailis,
    Bruce Spang, Matt Whiteley, Alex
    Rasmussen, Aysulu Greenberg,
    Elaine Greenberg, and Greg Bako.
