
Full Stack Fest: Architectural Patterns of Resilient Distributed Systems

Ines Sombra
September 05, 2016


Transcript

  1. Architectural Patterns of Resilient Distributed Systems
    Full Stack Fest 2016

  2. Ines Sombra
    @Randommood

  3. Globally distributed & highly available

  4. Today’s Journey
    Forest Company
    1. Motivation
    2. Resilience in literature
    3. Resilience in industry
    4. Conclusions

  5. Resilience is the ability of a system to adapt or keep working when challenges occur

  6. Defining Resilience
    Fault-tolerance
    Evolvability
    Scalability
    Failure isolation
    Complexity management

  7. It’s what really matters

  8. How can we construct more resilient systems?

  9. Resilience in Literature

  10. Harvest & Yield Model

  11. Yield
    Fraction of successfully answered queries
    Close to uptime but more useful because it directly maps to user experience (uptime misses this)
    Focus on yield rather than uptime
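
The difference between uptime and yield can be sketched with a little arithmetic. The traffic numbers below are hypothetical, and the sketch assumes that every query offered during a down minute fails; the function names are illustrative only.

```python
from typing import List


def uptime(up: List[bool]) -> float:
    """Fraction of minutes in which the service was up."""
    return sum(up) / len(up)


def yield_fraction(offered: List[int], up: List[bool]) -> float:
    """Fraction of offered queries that were answered.

    Simplifying assumption: every query offered during a down minute fails
    and every query offered during an up minute succeeds.
    """
    answered = sum(q for q, is_up in zip(offered, up) if is_up)
    return answered / sum(offered)


# Hypothetical hour of traffic: 59 quiet minutes and one peak minute.
offered = [10] * 59 + [1000]
quiet_outage = [False] + [True] * 59  # down during the first quiet minute
peak_outage = [True] * 59 + [False]   # down during the peak minute

for name, up in [("quiet outage", quiet_outage), ("peak outage", peak_outage)]:
    print(f"{name}: uptime={uptime(up):.3f} yield={yield_fraction(offered, up):.3f}")
```

Both scenarios have the same uptime (59 of 60 minutes), but the quiet outage answers 1580 of 1590 queries while the peak outage answers only 590 of 1590, which is the user-facing difference that uptime alone hides.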

  12. Harvest
    Fraction of the complete result
    Diagram labels: Server A, Server B, Server C, Baby Animals, Cute
    From Coda Hale’s “You can’t sacrifice partition tolerance”

  13. Harvest
    Fraction of the complete result
    Diagram labels: Server A, Server B, Server C, Baby Animals, Cute, X
    66% harvest
    From Coda Hale’s “You can’t sacrifice partition tolerance”
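
A minimal sketch of a query that degrades harvest instead of failing outright. The placement of one word per server is an assumption made for illustration, as is the simplification that every shard holds an equal share of the data; `query` is a hypothetical helper, not code from the talk.

```python
from typing import Dict, List, Optional, Tuple


def query(shards: Dict[str, Optional[List[str]]]) -> Tuple[List[str], float]:
    """Gather results from every reachable shard and report the harvest.

    A shard mapped to None is treated as unreachable. Harvest is computed as
    reachable shards / total shards, assuming equally sized shards.
    """
    results: List[str] = []
    reachable = 0
    for name in sorted(shards):
        data = shards[name]
        if data is None:
            continue  # degrade the answer instead of failing the whole query
        reachable += 1
        results.extend(data)
    return results, reachable / len(shards)


# Hypothetical placement: server C is unreachable.
shards = {"A": ["baby"], "B": ["animals"], "C": None}
results, harvest = query(shards)
print(results, harvest)
```

With one of three shards unreachable the harvest is 2/3, which the slide shows as 66%, while the query itself still returns an answer.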

  14. #1: Probabilistic Availability
    Graceful harvest degradation under faults
    Randomness to make the worst-case & average-case the same
    Replication of high-priority data for greater harvest control
    Degrading results based on client capability

  15. #2: Decomposition & Orthogonality
    Decomposing into subsystems that are independently intolerant to harvest degradation, while your application can continue if they fail
    Only provide strong consistency for the subsystems that need it
    Orthogonal mechanisms (state vs functionality)

  16. Whether your system favors yield or harvest is an outcome of its design

    ~ Fox & Brewer

  17. Cook & Rasmussen model

  18. Cook & Rasmussen
    Diagram labels: Economic failure boundary, Unacceptable workload boundary, Accident boundary, Operating point

  19. Cook & Rasmussen
    Diagram labels: Economic failure boundary, Unacceptable workload boundary, Accident boundary

  20. Cook & Rasmussen
    Diagram labels: Economic failure boundary, Unacceptable workload boundary, Accident boundary, Pressure towards efficiency

  21. Cook & Rasmussen
    Diagram labels: Economic failure boundary, Unacceptable workload boundary, Accident boundary, Pressure towards efficiency, Reduction of effort

  22. Cook & Rasmussen
    Diagram labels: Economic failure boundary, Unacceptable workload boundary, Accident boundary, Pressure towards efficiency, Reduction of effort, Incident!

  23. Cook & Rasmussen
    Diagram labels: Economic failure boundary, Unacceptable workload boundary, Accident boundary, Pressure towards efficiency, Reduction of effort, Safety Campaign

  24. Cook & Rasmussen
    Diagram labels: Economic failure boundary, Unacceptable workload boundary, Accident boundary, Pressure towards efficiency, Reduction of effort, error margin, Marginal boundary, Safety Campaign

  25-30. R.I.Cook - 2004
    Diagram labels: error margin, Original marginal boundary, Acceptable operating point, Accident boundary, Flirting with the margin

  31-32. R.I.Cook - 2004
    Diagram labels: Accident boundary, New marginal boundary!, Flirting with the margin

  33. Insights from Cook’s model
    Engineering resilience requires a model of safety based on: monitoring, responding, adapting, and learning
    System safety is about what can happen, where the operating point actually is, and what we do under pressure

  34. Engineering system resilience
    Build support for continuous maintenance
    Reveal control of system to operators
    Know it’s going to get moved, replaced, and used in ways you did not intend
    Think about configurations as interfaces
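
One way to read "configurations as interfaces" is to validate settings the way a public API validates its arguments. The sketch below is an illustration under that assumption; `ServiceConfig`, `load_config`, and both setting names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical settings, checked like arguments to a public API."""
    request_timeout_s: float = 1.0
    max_retries: int = 2

    def __post_init__(self):
        if self.request_timeout_s <= 0:
            raise ValueError("request_timeout_s must be positive")
        if not 0 <= self.max_retries <= 10:
            raise ValueError("max_retries must be between 0 and 10")


def load_config(raw: dict) -> ServiceConfig:
    """Reject unknown keys instead of silently ignoring a typo."""
    unknown = set(raw) - set(ServiceConfig.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ServiceConfig(**raw)


config = load_config({"max_retries": 3})
print(config)
```

Rejecting unknown keys and out-of-range values at load time gives operators an early, explicit error rather than surprising behavior later.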

  35. Borrill's model

  36. A system’s complexity
    Plot axes: Probability of failure vs Rank
    Plot regions: Traditional engineering, Reactive ops, unk-unk
    Cascading or catastrophic failures & you don’t know where they will come from!
    Same area as other 2 combined

  37. Failure areas need != strategies
    Plot axes: Probability of failure vs Rank
    Plot regions: Traditional engineering, Reactive ops, unk-unk
    Kingsbury VS Alvaro

  38. Strategies to build resilience
    Columns: Classical engineering, Reactive Operations, Unknown-Unknown
    Code standards
    Programming patterns
    Full system testing
    Metrics & monitoring are MVP
    Convergence to good state
    Hazard inventories
    Redundancies
    Feature flags
    Dark deploys
    Runbooks & docs
    Canaries
    System verification
    Formal methods
    Fault injection
    The goal is to build failure domain independence

  39. Thinking about building system resilience using a single discipline is insufficient. We need different strategies

    ~ Borrill

  40. Resilience in Industry

  41. View Slide

  42. Key insights from Chubby
    Library vs service? Service and client library control + storage of small data files with restricted operations
    Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems

  43. Key insights from Chubby
    Centralized services are hard to construct but you can dedicate effort into architecting them well and making them failure-tolerant
    Restricting user behavior increased resilience
    Consumers of your service are part of your UNK-UNK scenarios

  44. Key patterns
    Aggressive network timeouts & retries. Use of Semaphores.
    Separate threads on per-dependency thread pools
    Circuit-breakers to relieve pressure in underlying systems
    Exceptions cause app to shed load until things are healthy
    A lot more in the resource repo!
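
The circuit-breaker and load-shedding bullets above could be sketched roughly as follows. This is a minimal illustration, not the implementation the slide refers to; `CircuitBreaker`, `CircuitOpenError`, `max_failures`, and `reset_after` are hypothetical names, and the sketch is single-threaded.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: shed load instead of adding pressure downstream.
                raise CircuitOpenError("dependency is being given time to recover")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    raise TimeoutError("hypothetical dependency timed out")


breaker = CircuitBreaker(max_failures=2, reset_after=5.0)
for attempt in range(4):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        print(attempt, "rejected fast, load shed")
    except TimeoutError:
        print(attempt, "tried and failed")
```

In this run the first two attempts reach the failing dependency; after that the breaker is open and later attempts are rejected immediately until the cool-down passes.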

  45. System intuition

  46. Powderhorn insights
    Evolution of our purging system from v1 to v3
    Used Bimodal Multicast (Gossip protocol) to provide extremely fast purging speed
    Watch their talk!
    Tyler McMullen, Bruce Spang
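
To give a feel for why gossip spreads a message quickly, here is a toy simulation of plain push gossip. It is not Bimodal Multicast (which pairs a best-effort multicast with gossip-based repair) and not the purging system described on the slide; `gossip_rounds`, `fanout`, and the node count are assumptions for illustration.

```python
import random


def gossip_rounds(n_nodes: int, fanout: int = 2, seed: int = 0) -> int:
    """Simulate push gossip; return rounds until every node has the message."""
    rng = random.Random(seed)
    informed = {0}  # node 0 originates the message (for example, a purge)
    rounds = 0
    while len(informed) < n_nodes:
        newly_informed = set()
        for _ in informed:
            # Each informed node pushes to `fanout` peers chosen at random;
            # a peer may already be informed, which wastes that push.
            newly_informed.update(rng.sample(range(n_nodes), fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


for n in (10, 100, 1000):
    print(n, "nodes reached in", gossip_rounds(n), "rounds")
```

The number of rounds typically grows roughly with the logarithm of the cluster size, which is what makes epidemic dissemination attractive for fast fan-out.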

  47-48. NetSys patterns
    Faild allows us to fail & recover hosts via MAC-swapping and ECMP on switches
    Do immediate or gradual host failure & recovery
    Watch Joao’s talk

  49. ImageOpto insights
    A stateless system is nice but figuring out the request cycle can be tricky
    Dependencies are hard: customer setup, caching layer, & libraries - we have to be resilient to all of them
    Diagram labels: CDN, Image Opto, Origin

  50. ImageOpto insights
    Design error types & their handling carefully
    Failure detection & system operability are ongoing concerns
    Mixed-mode & versioning of data structures
    Validation & system adaptability
    Diagram labels: Origin, CDN, IO, X

  51. Resilient architectural patterns

  52. Redundancies are key
    Redundancies of resources, execution paths, checks, replication of data, replay of messages, anti-entropy build resilience
    Gossip / epidemic protocols
    Capacity planning matters
    Optimizations can make your system less resilient!

  53. Operations matter
    Unawareness of proximity to error boundary means we are always guessing
    Complex operations make systems less resilient & more incident-prone
    You design operability too!

  54. Not all complexity is bad
    Complexity, if it increases safety, is actually good
    Adding resilience may come at the cost of other desired goals (e.g. performance, simplicity, cost, etc)

  55. Leverage Engineering Best Practices
    Resiliency and testing are correlated. TEST!
    Versioning from the start - provide an upgrade path from day 1
    Upgrades & evolvability of systems are still tricky. Mixed-mode operations need to be common
    Re-examine the way we prototype systems

  56. View Slide

  57. tl;dr
    IN DESIGN
      Are we favoring harvest or yield?
      Orthogonality & decomposition FTW
      Do we have enough redundancies in place?
      Are we resilient to our dependencies?
      Theory matters!
    OPERABILITY
      Am I providing enough control to my operators?
      Would I want to be on call for this?
      Rank your services: what can be dropped, killed, deferred?
      Monitoring and alerting in place?
    UNK-UNK
      The existence of this stresses diligence on the other two areas
      Have we done everything we can?
      Abandon hope and resort to human sacrifices

  58. tl;dr
    IN DESIGN
      Test dependency failures
      Code reviews != tests. Have both
      Distrust client behavior, even if they are internal
      Version (APIs, protocols, disk formats) from the start
      Checksum all the things
      Error handling, circuit breakers, backpressure, leases, timeouts
    OPERABILITY
      Automation shortcuts taken while rushed will come back to haunt you
      Release stability is often tied to system stability. Iron out your deploy process
      Link alerts to playbooks
      Consolidate system configuration (data bags, config file, etc)
      Operators determine resilience
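
The "version from the start" and "checksum all the things" advice can be combined in a small record format. The layout below (one version byte, a CRC32, then the payload) is a hypothetical example chosen for brevity, not a format from the talk.

```python
import struct
import zlib

FORMAT_VERSION = 1
# Hypothetical header: 1-byte format version, 4-byte CRC32, network byte order.
HEADER = struct.Struct("!BI")


def encode(payload: bytes) -> bytes:
    """Prefix the payload with its format version and checksum."""
    return HEADER.pack(FORMAT_VERSION, zlib.crc32(payload)) + payload


def decode(record: bytes) -> bytes:
    """Return the payload, refusing unknown versions and corrupt data."""
    if len(record) < HEADER.size:
        raise ValueError("record is shorter than its header")
    version, checksum = HEADER.unpack_from(record)
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    payload = record[HEADER.size:]
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: record is corrupt")
    return payload


record = encode(b"purge /images/cat.png")
print(decode(record))
```

Because the version is part of every record from day one, a later reader can branch on it during mixed-mode operation instead of guessing which layout it is looking at.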

  59. We can’t recover from lack of design. Not minding harvest/yield means we sign up for a redesign the moment we finish coding

    ~ Me last year

  60. Good design is hard. Unknowns are hard to predict. Let the tenets we discussed today guide your redesigns.

    ~ Me today

  61. ~ THANK YOU ~
    github.com/Randommood/FullStackFest2016