
YOW2016

Ines Sombra
December 01, 2016


Slides for the "Architectural patterns of resilient distributed systems" talk given at YOW 2016 - http://yowconference.com.au

References at: https://github.com/Randommood/YOW2016


Transcript

  1. Architectural Patterns of Resilient Distributed Systems YOW 2016

  2. Ines Sombra @Randommood

  3. Globally distributed & highly available

  4. Today’s Journey. Forest Company. 1: Motivation (why Ines cares & you should too). 2: Resilience in literature (foundational knowledge). 3: Resilience in industry (what are others doing?). 4: Conclusions (tie it all together).
  5. Resilience is the ability of a system to adapt or keep working when challenges occur
  6. Defining Resilience: fault-tolerance, evolvability, scalability, failure isolation, complexity management

  7. How can we construct more resilient systems?

  8. It’s what really matters

  9. !

  10. The Team

  11. 3000 × 2000 px 361 KB

  12. Trim all edges by 25%: http://www.fastly.io/image.jpg?trim=0.25
    Crop the image square and resize the width to 200px: http://www.fastly.io/image.jpg?crop=1:1&width=200
    1000 × 667 px 92 KB; 200 × 200 px 9 KB
  13. ImageOpto 101 (diagram: CDN, Origin, five Image Opto nodes)

  14. ImageOpto 101 (diagram: Origin, five Image Opto nodes, CDN)

  15. ImageOpto 101 (diagram: Origin, five Image Opto nodes, CDN)

  16. ImageOpto 101 (diagram: Origin, five Image Opto nodes, CDN)

  17. ImageOpto 101 (diagram: Origin, five Image Opto nodes, CDN)

  18. POP

  19. Resilience in Literature

  20. Harvest & Yield Model

  21. Yield: fraction of successfully answered queries. Focus on yield rather than uptime (think Amazon during Xmas).
  22. Harvest: fraction of the complete result. From Coda Hale’s “You can’t sacrifice partition tolerance”: Server A, Server B, Server C; Baby, Animals, Cute.
  23. 100% harvest

  24. Harvest: fraction of the complete result. From Coda Hale’s “You can’t sacrifice partition tolerance”: Server A, Server B, Server C; Baby, Animals, Cute; one X; 66% harvest.
  25. ☹ 66% harvest

  26. Harvest: fraction of the complete result. From Coda Hale’s “You can’t sacrifice partition tolerance”: Server A, Server B, Server C; Baby, Animals, Cute; two X marks; 33% harvest.
  27. 33% harvest

  28. #1: Probabilistic Availability. Randomness to make the worst-case & average-case the same. Replication of high-priority data for greater harvest control. Degrading results based on client capability.
  29. #2: Decomposition & Orthogonality. Break into subsystems. Only provide strong consistency for the subsystems that need it. Use orthogonal mechanisms.
  30. “If your system favors yield or harvest is an outcome of its design” ~ Fox & Brewer
  31. Harvest & Yield applied. ImageOpto favors harvest. Consistent hashing based on pristine image. Replication to secondary nodes. Orthogonality in CDN side. (Diagram: Origin, CDN, IO, X)
  32. Cook & Rasmussen model

  33. Cook & Rasmussen (diagram labels): economic failure boundary; unacceptable workload boundary; accident boundary; pressure towards efficiency; reduction of effort; error margin; marginal boundary; safety campaign; incident!; operating point
  34. Flirting with the margin (R.I.Cook - 2004; diagram labels): error margin; original marginal boundary; acceptable operating point; accident boundary; new marginal boundary!
  35. Insights from Cook’s model. Engineering resilience requires a model of safety based on: mentoring, responding, adapting, and learning. System safety is about what can happen, where the operating point actually is, and what we do under pressure.
  36. Engineering system resilience. Build support for continuous maintenance. Resilience is operator community focused. Know it’s going to get moved, replaced, and used in ways you did not intend.
  37. Cook & Rasmussen applied. Unexpected use-cases. Acceptable workload boundary influenced a redesign. Use response to incidents as educational opportunities. (Diagram: Origin, CDN, IO)
  38. Borrill's model

  39. A system’s complexity (plot of probability of failure against rank; regions: classical engineering, reactive ops, unk-unk). Unk-unk: cascading or catastrophic failures & you don’t know where they will come from! Same area as other 2 combined.
  40. Failure areas need != strategies (plot of probability of failure against rank; regions: classical engineering, reactive ops, unk-unk)
  41. “Thinking about building system resilience using a single discipline is insufficient. We need different strategies” ~ Borrill
  42. Strategies to build resilience. Categories: Classical engineering, Reactive Operations, Unknown-Unknown. Strategies: code standards; programming patterns; full system testing; metrics & monitoring; convergence to good state; hazard inventories; redundancies; feature flags; dark deploys; runbooks & docs; canaries; system verification; formal methods; fault injection.
  43. Strategies to build resilience. Categories: Classical engineering, Reactive Operations, Unknown-Unknown. Strategies: system verification; formal methods; fault injection; code standards; programming patterns; full system testing; metrics & monitoring; convergence to good state; hazard inventories; redundancies; feature flags; dark deploys; runbooks & docs; canaries.
  44. Resilience in Industry

  45. None
  46. Key insights from Chubby. Library vs service? Service and client library control + storage of small data files with restricted operations. Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand Distributed Systems.
  47. Key insights from Chubby. Centralized services are hard to construct but you can dedicate effort into architecting them well and making them failure-tolerant. Restricting user behavior increased resilience. Consumers of your service are part of your UNK-UNK scenarios.
  48. ImageOpto insights. Dependencies are hard: customer setup, customer inputs, caching layer, & libraries - we have to be resilient to all of them. Unk-Unks also lie in hidden dependencies (reduce as many of them as possible).
  49. None
  50. “Ship something out earlier with a limited API. Continuously invest in design of functionality and operability” ~ Me today
  51. In design. What compromises does your system make as things go bad? Resilient systems are designed for high yield & variable harvest.
  52. Operations matter. Unawareness of proximity to error boundary means we are always guessing. Complex operations make systems less resilient & more incident-prone. You design operability too!
  53. Not all complexity is bad. Adding resilience may come at the cost of other desired goals (e.g. time, performance, simplicity, cost, etc). Redundancies help.
  54. tl;dr
    IN DESIGN: Are we favoring harvest or yield? Are we resilient to our dependencies? Use orthogonality & decomposition. Theory matters!
    OPERABILITY: Am I providing enough control to my operators? Operators impact resilience. Narrowing your API helps.
    UNK-UNK: The existence of this stresses diligence on the other two areas.
    The goal is to build failure domain independence.
  55. github.com/Randommood/YOW2016 ~ THANK YOU ~
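
The harvest and yield definitions from slides 21 to 27 can be made concrete with a small sketch. This is only an illustration under stated assumptions, not code from the talk: the names `QueryResult`, `harvest`, `yield_fraction`, and `run_query` are hypothetical, and the three servers A, B, and C are assumed to each hold an equal share of the complete result, as the slide diagram suggests.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryResult:
    answered: bool        # True if the client got any response at all
    shards_reached: int   # servers that contributed data to the response
    shards_total: int     # servers that together hold the complete result


def harvest(result: QueryResult) -> float:
    """Fraction of the complete result reflected in one response."""
    if not result.answered:
        return 0.0
    return result.shards_reached / result.shards_total


def yield_fraction(results: List[QueryResult]) -> float:
    """Fraction of queries that were successfully answered."""
    if not results:
        return 0.0
    return sum(r.answered for r in results) / len(results)


def run_query(servers_up: Dict[str, bool]) -> QueryResult:
    """Hypothetical query path: answer with whatever servers are reachable.

    Returning a partial result keeps yield high at the cost of harvest.
    """
    reached = sum(servers_up.values())
    return QueryResult(
        answered=reached > 0,
        shards_reached=reached,
        shards_total=len(servers_up),
    )


all_up = run_query({"A": True, "B": True, "C": True})
one_down = run_query({"A": True, "B": True, "C": False})
two_down = run_query({"A": True, "B": False, "C": False})
none_up = run_query({"A": False, "B": False, "C": False})

print(harvest(all_up), harvest(one_down), harvest(two_down))
print(yield_fraction([all_up, one_down, two_down, none_up]))
```

In this sketch, losing one or two servers lowers harvest (to roughly 66% and 33%) while the queries still count toward yield; only the query with no reachable server reduces yield.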