
Resilient Software Design


Lessons learned from Michael Nygard's "Release It!"

Swanand Pagnis

January 21, 2015



  1. Resilient Software Design: Building software that doesn't give up!

  2. 1. What's all this about?
     2. Anti-patterns: What not to do!
     3. Patterns: Make your life easier.
  3. Find the odd one out:
     1. :cloud_factory
     2. "CloudFactory"
     3. CloudFactory

  4. What will this code print?

         if fork
           puts "I won the lottery!"
         else
           puts "I am bankrupt!"
         end
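The quiz turns on how Kernel#fork behaves: it returns the child's PID (truthy) in the parent and nil in the child, so both branches run, each in its own process. The sketch below uses a hypothetical helper, `fork_outcomes`, that sends each process's message through a pipe so the parent can collect both. It assumes a platform that supports fork (not Windows).

```ruby
# Hypothetical helper: runs the slide's if/else across a fork and collects
# what each process would have printed. Requires fork (unavailable on Windows).
def fork_outcomes
  reader, writer = IO.pipe

  if (pid = fork)
    # Parent: fork returned the child's PID, which is truthy.
    writer.puts "I won the lottery!"
    writer.close
    Process.wait(pid) # wait for the child so its message is in the pipe
    lines = reader.read.lines.map(&:chomp)
    reader.close
    lines
  else
    # Child: fork returned nil, so the else branch runs here.
    reader.close
    writer.puts "I am bankrupt!"
    writer.close
    exit!(0) # leave immediately, skipping at_exit hooks
  end
end
```

Sorting the result of `fork_outcomes` yields both messages; without sorting, their order depends on process scheduling.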
  5. What’s all this about?

  6. Stability

  7. It Just Works™

  8. It Should Work™

  9. Bad things happen

  10. Good things happen as well

  11. But those are rare!

  12. Anything that can go wrong will go wrong. - Murphy's Law

  13. Developers think positive

  14. Too much so!

  15. Be negative.

  16. Our goal is to build software

  17. Our goal is also to minimise pain

  18. Our goal is also to save money

  19. Resilient software saves money by not breaking when it is needed most

  20. Resilient software saves money by using optimum infrastructure

  21. Resilient software saves money by keeping developers happy

  22. Anti-patterns

  23. 1. Integration points

  24. Integration is not what you think™

  25. Your database is an integration point

  26. Third-party services are integration points

  27. Your cache layer is an integration point

  29. Networks fail more often

  30. Socket-based protocols have a special way of failing

  31. A refused connection is bad.

  32. A hung connection is worse.
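To see why a hung connection is worse than a refused one, here is a small sketch using only Ruby's standard socket library against two local endpoints: a port with nothing listening, which the OS refuses at once, and a listener that never accepts or replies, where the client only gives up at its own deadline. The method name `probe` and the 0.5-second deadline are illustrative assumptions.

```ruby
require "socket"

# Hypothetical probe: connect, then wait at most `deadline` seconds for data.
# Returns :refused, :hung, or :responded.
def probe(port, deadline: 0.5)
  sock = TCPSocket.new("127.0.0.1", port)
  begin
    # IO.select returns nil if nothing arrives before the deadline.
    IO.select([sock], nil, nil, deadline) ? :responded : :hung
  ensure
    sock.close
  end
rescue Errno::ECONNREFUSED
  # Nothing is listening: the OS tells us right away.
  :refused
end

# A port with no listener: grab a free port, then close it.
closed = TCPServer.new("127.0.0.1", 0)
closed_port = closed.addr[1]
closed.close

# A listener that never accepts or writes; the kernel still completes the handshake.
silent = TCPServer.new("127.0.0.1", 0)

p probe(closed_port)    # refused immediately
p probe(silent.addr[1]) # waits the full deadline, then reports :hung
```

Without the deadline, the second probe would block indefinitely, tying up whatever thread made the call.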

  33. Micro-services that talk to each other will stop talking abruptly

  34. 2. Unbalanced Capacities

  35. Especially applicable to micro-services

  36. 3. Slow responses

  37. 4. Unbounded Result Sets

  38. Major anti-pattern, overlooked by many

  39. What is the size of an HTTP cookie?
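One common defence against unbounded result sets is to cap every query and walk large results in fixed-size batches. In the sketch below, an in-memory array is a hypothetical stand-in for a table, `fetch_page` mimics a keyset-paginated query (roughly `WHERE id > ? ORDER BY id LIMIT ?`), and `MAX_PAGE_SIZE` is an illustrative limit, not a recommended value.

```ruby
# Hypothetical stand-in for a database table: rows sorted by id.
ROWS = (1..10_000).map { |i| { id: i, name: "user-#{i}" } }

# Never let a caller pull more than this many rows in one query.
MAX_PAGE_SIZE = 500

# Mimics: SELECT * FROM rows WHERE id > after_id ORDER BY id LIMIT limit
def fetch_page(after_id:, limit:)
  limit = [limit, MAX_PAGE_SIZE].min
  # Binary search plays the role of an index seek on the sorted ids.
  start = ROWS.bsearch_index { |row| row[:id] > after_id } || ROWS.size
  ROWS[start, limit]
end

# Visit every row in bounded batches instead of loading everything at once.
def each_row_in_batches(batch_size = 100)
  return enum_for(__method__, batch_size) unless block_given?

  last_id = 0
  loop do
    page = fetch_page(after_id: last_id, limit: batch_size)
    break if page.empty?

    page.each { |row| yield row }
    last_id = page.last[:id]
  end
end
```

Keyset pagination (remembering the last id) keeps each query bounded and avoids the growing cost of large OFFSET values.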

  40. Patterns

  41. 1. Use timeouts!

  42. What is the default timeout on Ruby’s net/http?

  43. Now and forever, networks will always be unreliable. - Michael T. Nygard
  44. Every network call in your system must have a timeout

  45. This includes database calls

  46. This includes API calls

  47. This includes cache lookups

  48. What to do when the timeout occurs depends on where it occurred
  49. Highly context-specific, so the dev team should make that call
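As one illustration of "it depends where it occurred": a timed-out cache lookup can usually fall back to the source of truth, whereas a timed-out payment call usually should not be silently ignored. The sketch below covers only the cache case; `CacheTimeout`, `FlakyCache`, and `fetch_with_fallback` are hypothetical names, not part of any real cache library.

```ruby
# Hypothetical error a cache client might raise when a lookup exceeds its timeout.
class CacheTimeout < StandardError; end

# Hypothetical cache client that always times out, to exercise the fallback path.
class FlakyCache
  def read(_key)
    raise CacheTimeout, "cache lookup timed out"
  end
end

# A cache timeout is usually safe to absorb: compute the value from the
# source of truth instead of failing the whole request.
def fetch_with_fallback(cache, key)
  cache.read(key) || yield # a miss (nil) also falls through to the block
rescue CacheTimeout
  yield
end

value = fetch_with_fallback(FlakyCache.new, "user:42") { "loaded from database" }
```

The same rescue around a payment or order call would usually be wrong; there the team might prefer to surface the error or queue the work for a retry.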

  50. ProTip: Do not use Ruby’s “timeout” module

  51. Instead, depend on libraries for the timeout

  52. If you’re a library author, just use net/http’s timeout
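Here is a minimal sketch of depending on the library's own timeouts with Ruby's standard Net::HTTP. The connect and read timeouts are set explicitly on the client rather than left to the library defaults, and Net::OpenTimeout / Net::ReadTimeout are handled at the call site. The helper name `get_with_timeouts`, its nil-on-timeout convention, and the 1-second and 2-second values are illustrative assumptions.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: every request gets explicit connect and read timeouts.
# Returns the response, or nil if the call timed out.
def get_with_timeouts(url, open_timeout: 1, read_timeout: 2)
  uri = URI(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # seconds allowed to establish the connection
  http.read_timeout = read_timeout # seconds allowed to wait for each read
  http.request(Net::HTTP::Get.new(uri))
rescue Net::OpenTimeout, Net::ReadTimeout
  # What to do here is context-specific; this sketch lets the caller decide.
  nil
end
```

Because the timeout lives inside Net::HTTP's own socket handling, there is no need to wrap the call in Ruby's Timeout module.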

  53. 2. Circuit Breaker
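Roughly, the circuit breaker idea from Nygard's book works like this. After a run of consecutive failures the breaker "opens" and rejects calls immediately. After a cool-down it lets one trial call through ("half-open"), closing again on success and reopening on failure. The class below is a minimal, single-threaded sketch. `CircuitBreaker`, `CircuitOpenError`, and the threshold and cool-down parameters are hypothetical, not the API of any particular gem. The injectable clock exists only to make it testable.

```ruby
# Raised instead of calling the protected resource while the circuit is open.
class CircuitOpenError < StandardError; end

# Minimal circuit breaker sketch (single-threaded, not thread-safe).
class CircuitBreaker
  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 5.0,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @failures = 0
    @opened_at = nil
    @state = :closed
  end

  def call
    if @state == :open
      # Still cooling down: fail fast without touching the resource.
      raise CircuitOpenError, "circuit open" if @clock.call - @opened_at < @reset_timeout

      @state = :half_open # cool-down over: allow one trial call
    end

    result = yield
    reset!
    result
  rescue CircuitOpenError
    raise # rejections are not counted as failures
  rescue StandardError
    record_failure!
    raise
  end

  private

  def reset!
    @failures = 0
    @opened_at = nil
    @state = :closed
  end

  def record_failure!
    @failures += 1
    # A failed trial call, or too many failures in a row, opens the circuit.
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open
      @opened_at = @clock.call
    end
  end
end
```

A production breaker would also need thread safety and, typically, metrics. shopify/semian, mentioned at the end of the deck, is one Ruby library in this space.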

  55. 3. Bulkheads

  56. A ship is divided into several watertight compartments

  57. If there is a leak in one section, water doesn't flood into the other sections
  58. Same principle!

  59. Use resource pools
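A bulkhead built from a resource pool can be sketched with Ruby's standard SizedQueue. A fixed number of slots caps how many callers can use one dependency at once, and a caller that cannot get a slot is rejected rather than left waiting. `Bulkhead` and `BulkheadFullError` are hypothetical names. Real pools, such as database connection pools, often wait for a slot with a timeout instead of rejecting immediately.

```ruby
# Raised when every slot in the bulkhead is already in use.
class BulkheadFullError < StandardError; end

# Caps concurrent use of one dependency so it cannot tie up every thread.
class Bulkhead
  def initialize(size)
    @slots = SizedQueue.new(size)
    size.times { @slots << :slot }
  end

  def call
    slot = begin
      @slots.pop(true) # non-blocking: raises ThreadError when no slot is free
    rescue ThreadError
      raise BulkheadFullError, "no free slots"
    end

    begin
      yield
    ensure
      @slots << slot # always hand the slot back
    end
  end
end
```

Giving each dependency its own Bulkhead means a slow payment API can exhaust only its own slots, not the threads serving everything else.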

  60. Use rate limiting
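Rate limiting is often implemented as a token bucket. Tokens refill at a steady rate up to a fixed capacity, and each request spends one, so short bursts up to the capacity are allowed but the long-run rate is capped. `TokenBucket` below is a minimal, single-process sketch with hypothetical names. The injectable clock is there only to make it testable.

```ruby
# Minimal token-bucket rate limiter (single process, not thread-safe).
class TokenBucket
  def initialize(rate:, capacity:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @rate = rate.to_f         # tokens added per second
    @capacity = capacity.to_f # maximum burst size
    @tokens = @capacity       # start full
    @clock = clock
    @last = @clock.call
  end

  # Spends a token and returns true if one is available; otherwise false.
  def allow?
    refill
    return false if @tokens < 1

    @tokens -= 1
    true
  end

  private

  # Add tokens for the time elapsed since the last check, up to capacity.
  def refill
    now = @clock.call
    @tokens = [@capacity, @tokens + (now - @last) * @rate].min
    @last = now
  end
end
```

Limiting across several processes would need shared state, for example in a central store. This sketch covers only one process.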

  61. Consideration: Capacity

  62. Bulkheading often conflicts with increasing or variable capacity

  63. Consideration: Performance

  64. Bulkheading often results in slightly reduced performance

  65. It’s worth it, trust me™

  66. 4. Fail fast

  67. Remember guard clauses in Ruby?

  68. For example:

         class Event
           def closest_event
             return unless self.location
             # …
             # …
           end
         end
  69. Same principle!
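At the service level, failing fast means checking everything you can up front (valid input, an available dependency, a closed circuit) and refusing immediately, before starting slow or expensive work. The sketch below is illustrative: `ReportGenerator`, `FailFastError`, and the `dependency_up` check are hypothetical names. The point is only the ordering, with guard clauses first and expensive work last.

```ruby
# Raised when a request cannot possibly succeed, before any work starts.
class FailFastError < StandardError; end

# Hypothetical report generator: cheap checks first, expensive work last.
class ReportGenerator
  def initialize(dependency_up: -> { true })
    @dependency_up = dependency_up
  end

  def generate(user_id:, month:)
    # Guard clauses: reject bad input and unavailable dependencies right away.
    raise FailFastError, "user_id is required" if user_id.nil?
    raise FailFastError, "month must be 1..12" unless (1..12).cover?(month)
    raise FailFastError, "reporting database unavailable" unless @dependency_up.call

    expensive_build(user_id, month)
  end

  private

  # Stand-in for slow work (queries, rendering) started only once checks pass.
  def expensive_build(user_id, month)
    "report for user #{user_id}, month #{month}"
  end
end
```

The `dependency_up` check could be backed by something like a circuit breaker's state, so an open circuit is turned into an immediate refusal.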

  70. Any Ruby libraries?

  71. Not a lot :(

  72. shopify/semian

  73. Thank you!

  74. Questions?

  75. Swanand Pagnis Principal Engineer @ First

  76. Swanand Pagnis @_swanand on Twitter

  77. Swanand Pagnis @swanandp on GitHub