THINK BIG: CHAOS TESTING A MONOLITH

Chaos Conf
September 26, 2019

Caroline Dickey, Mailchimp

For many companies, chaos engineering means testing dependencies between services or killing container instances. Those approaches don't work if your company's main product is a 23-million-line PHP monolith running on bare metal. This talk explores chaos testing when best practices aren't an option.

Transcript

  1. Think Big: How to Chaos Test a Monolith
    September 26, 2019
    Caroline Dickey, Site Reliability Engineer, @CarolineEDickey
  2. Agenda
    - What’s a monolith?
    - “We aren’t ready for chaos engineering”
    - Three approaches to testing a monolith
    - Three use cases for chaos engineering
  3. [Diagram: Monolithic Architecture (UI, Business Logic, Data Access Layer) vs. Microservice Architecture (UI, Microservices)]
  13. Some people don’t like monoliths.
  14. 22,977,131
  20. Our customers depend on us
    - They don’t care about our tech
    - They’ve trusted us - let’s not let them down
    - Chaos engineering can help
  21. “I don’t have a monolith. Why should I care?” - You, maybe
  23. “We aren’t ready for chaos engineering”
  25. There will always be things that we can’t control in software engineering. [Diagram labels: Can’t control, Can control]
  26. We can’t control bad pushes
  27. We can’t control database maintenance events
  28. We can’t control backhoes
  29. But there are also plenty of potential problem areas that we can do something about. [Diagram labels: Can’t control, Can control]
  30. We can control built-in redundancy
  31. We can control error handling for failed dependencies
  32. We can control our monitoring and alerting
  33. Chaos engineering is all about validating our assumptions about these things. [Diagram labels: Can’t control, Can control]
  34. But also, making this bigger. [Diagram labels: Can’t control, Can control, Things we know about]
  36. Three approaches to testing a monolith
    - Use an architecture diagram
    - Validate changes
    - Test your dependencies
  37. [Architecture diagram labels: User, Secondary LB (warm), Primary LB, Secondary, CDN, App Servers, Primary]
  38. Scenario: Load balancer failover
  39. [Architecture diagram labels: User, Secondary LB (warm), Primary LB, Secondary, CDN, App Servers, Primary; one component is marked ✕]
  47. Scenario: Make the database read-only
  50. Outcome: Unexpected SQL error due to missing error handling in a legacy class
  52. Pro-tip: market chaos engineering internally with an email newsletter!
  54. Scenario: New filesystem on application servers
  55. Definitions
    - Ceph: A versatile, distributed storage system that can be used for many different types of storage services.
    - Ceph-fuse: A FUSE (File system in USErspace) client for the Ceph distributed file system. It will mount a Ceph file system.
  59. Outcome: NO alerting, and a homegrown Python script shows up, attempts to remount, and fails.
  61. Scenario: New caching library
  62. Memcached: Memcached is a distributed memory caching system. It speeds up websites backed by large, dynamic databases by storing database objects in memory. https://www.cloudways.com/blog/memcached-with-php/
  63. We have Memcached caches on all of our application servers, which interact with each other to access cached data.
  64. We made a fix to a new Memcached client library to prevent timeouts if a server is unreachable.
  69. Scenario: Internal dependencies* (*sometimes known as services/microservices)
  71. Requestmapper is a service that maps URLs from one format (pretty campaign URLs or custom landing pages) to their internal format.
  78. Pro-tip: market chaos engineering internally with an internal blog post!
  79. Scenario: 3rd party API calls
  84. Four approaches to testing a monolith (originally three)
    - Use an architecture diagram
    - Validate changes
    - Test your dependencies
    - Carefully
  85. Should you chaos test in production?
  86. Our approach
    - Default to testing in stage/dev
    - Over-communicate about GameDays
    - If anyone feels uncomfortable, we don’t proceed
    - Don’t automate tests on anything that isn’t inherently resilient (like the monolith)
  89. Our approach
    - Default to testing in stage/dev
    - Over-communicate about GameDays
    - If anyone feels uncomfortable, we don’t proceed
    - Build confidence by testing in production incrementally
  91. Three use cases for chaos engineering
    - Training
    - Post-mortem counterpart
    - Application performance
  92. Scenario: Team training
  102. Incidents happen
    - Blameless post-mortems ✔
    - Explore human factors
    - Root Cause Analysis can be limiting
    - Use GameDays to fill in the gaps
  103. Scenario: Recreate incidents
  109. [Diagram labels: Chaos, Performance, Training, Post-Mortem, Load Testing, Disaster Recovery, Incident Simulation, Support Tickets, Cross-Team Eng., Performance Research]
  110. Application Performance GameDay: A time-boxed opportunity to dive deeply into a specific topic, system, or set of tickets. Pulling from:
    - Engineering reports
    - Support tickets
  114. Conclusions
    - Chaos engineering can help make any application more resilient, regardless of architecture.
    - If you don’t know where to start, try validating isolated parts of your infrastructure, application, or data storage.
    - The best time to find a vulnerability is before an incident. The second best time is after an incident so that it never happens again.
    - Chaos engineering is an effective tool for sharing knowledge and building empathy.
  115. Conclusions
    - Chaos engineering can help make any application more resilient, regardless of architecture.
    - If you don’t know where to start, try looking at an architecture diagram or identifying some changes about to be released.
    - The best time to find a vulnerability is before an incident. The second best time is after an incident so that it never happens again.
    - Chaos engineering is an effective tool for sharing knowledge and building empathy.
  118. Thank you! Caroline Dickey, caroline@mailchimp.com, @CarolineEDickey
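
The "Make the database read-only" scenario (slides 47 to 50) surfaced an unexpected SQL error caused by missing error handling in a legacy class. The following is a minimal sketch of that kind of experiment, not the approach used in the talk: it uses Python with the standard-library `sqlite3` module as a stand-in for the real PHP monolith and its database. The `campaigns` table and the `save_campaign` and `run_experiment` functions are hypothetical names chosen for illustration.

```python
import os
import sqlite3
import tempfile


def save_campaign(conn, name):
    """Try to store a campaign; degrade gracefully if the database rejects writes."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO campaigns (name) VALUES (?)", (name,))
        return "saved"
    except sqlite3.OperationalError as exc:
        # A read-only database surfaces here. Return a friendly status
        # instead of letting a raw SQL error reach the user.
        return f"temporarily read-only: {exc}"


def run_experiment():
    """Create a small database, reopen it read-only, and attempt a write."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.db")

        # Steady state: writes succeed.
        rw = sqlite3.connect(path)
        rw.execute("CREATE TABLE campaigns (name TEXT)")
        rw.commit()
        before = save_campaign(rw, "spring-newsletter")
        rw.close()

        # Inject the failure: reopen the same file in read-only mode.
        ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        during = save_campaign(ro, "summer-newsletter")
        # Reads should keep working while writes are rejected.
        rows = ro.execute("SELECT COUNT(*) FROM campaigns").fetchone()[0]
        ro.close()
        return before, during, rows


if __name__ == "__main__":
    print(run_experiment())
```

The point of the sketch is the hypothesis being validated: when the database stops accepting writes, every write path should fail in a handled way while reads continue. A code path without the `except` branch is the kind of gap the GameDay exposed.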