THINK BIG: CHAOS TESTING A MONOLITH

Chaos Conf
September 26, 2019

Caroline Dickey, Mailchimp

For many companies, Chaos Engineering means testing dependencies between services or killing container instances. Those approaches don't work if your company's main product is a 23 million line PHP monolith running on bare metal. This talk explores chaos testing when best practices aren't an option.

Transcript

  1. Think Big: How to Chaos Test a Monolith. September 26, 2019. Caroline Dickey, Site Reliability Engineer, @CarolineEDickey
  2. Agenda: What’s a monolith?; “We aren’t ready for chaos engineering”; Three approaches to testing a monolith; Three use cases for chaos engineering
  3. Our customers depend on us. They don’t care about our tech. They’ve trusted us; let’s not let them down. Chaos engineering can help.
  6. There will always be things that we can’t control in software engineering. [Diagram: “Can’t control” vs. “Can control”]
  7. But there are also plenty of potential problem areas that we can do something about.
  8. Chaos engineering is all about validating our assumptions about these things.
  10. Ceph: A versatile, distributed storage system that can be used for many different types of storage services. Ceph-fuse: A FUSE (Filesystem in Userspace) client for the Ceph distributed file system. It mounts a Ceph file system.
  11. Outcome: NO alerting, and a homegrown Python script shows up, attempts to remount, and fails.
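The outcome on this slide (a lost Ceph mount with no alerting) suggests one simple check a GameDay could validate afterwards. The following is a minimal sketch, not the script mentioned in the talk: the function names, the `alert` callback, and the `/mnt/cephfs` path in the usage note are all hypothetical, and only the standard library's `os.path.ismount` is assumed.

```python
import os
from typing import Callable, Iterable, List


def find_missing_mounts(
    mount_points: Iterable[str],
    is_mounted: Callable[[str], bool] = os.path.ismount,
) -> List[str]:
    """Return the mount points that are not currently mounted."""
    return [path for path in mount_points if not is_mounted(path)]


def check_and_alert(
    mount_points: Iterable[str],
    alert: Callable[[str], None],
    is_mounted: Callable[[str], bool] = os.path.ismount,
) -> bool:
    """Call alert(message) once per missing mount; return True if all are mounted."""
    missing = find_missing_mounts(mount_points, is_mounted)
    for path in missing:
        # The alert callback is a placeholder for whatever paging or
        # monitoring hook is actually in use.
        alert(f"mount missing: {path}")
    return not missing
```

For example, `check_and_alert(["/mnt/cephfs"], print)` would print a message if that (hypothetical) path is not a mount point. Passing `is_mounted` as a parameter keeps the check testable without unmounting anything.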
  13. Memcached: Memcached is a distributed memory caching system. It speeds up websites backed by large dynamic databases by storing database objects in memory. Source: https://www.cloudways.com/blog/memcached-with-php/
  14. We have Memcached caches on all of our application servers, which interact with each other to access cached data.
  15. We made a fix to a new Memcached client library to prevent timeouts if a server is unreachable.
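The behavior this slide describes, a cache read that does not hang or fail when a cache server is unreachable, can be illustrated with a small sketch. This is not the fix made to the client library in the talk; it is an assumed, simplified pattern in which connection errors and timeouts are treated as cache misses. `UnreachableCache`, `get_with_fallback`, and the `load` callback are hypothetical names, and the cache object is assumed only to expose a `get(key)` method.

```python
from typing import Any, Callable, Optional


class UnreachableCache:
    """Hypothetical stand-in for a cache client whose server is down."""

    def get(self, key: str) -> Optional[Any]:
        raise TimeoutError("cache server unreachable")


def get_with_fallback(cache: Any, key: str, load: Callable[[str], Any]) -> Any:
    """Read from the cache, treating connection errors and timeouts as a miss."""
    try:
        value = cache.get(key)
    except OSError:
        # TimeoutError and ConnectionError are both subclasses of OSError,
        # so an unreachable server is handled like an empty cache.
        value = None
    if value is None:
        # Fall back to the slower source of truth (for example, a database).
        value = load(key)
    return value
```

A chaos experiment for this pattern would make one cache server unreachable and confirm that requests still succeed through the fallback path, within an acceptable latency.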
  16. Requestmapper is a service that maps URLs from one format (Pretty Campaign URLs or custom landing pages) to their internal format.
  18. Four approaches to testing a monolith (the slide title shows “Three Four”): Use an architecture diagram; Validate changes; Test your dependencies; Carefully
  19. Our approach: Default to testing in stage/dev. Over-communicate about GameDays. If anyone feels uncomfortable, we don’t proceed. Don’t automate tests on anything that isn’t inherently resilient (like the monolith).
  22. Our approach: Default to testing in stage/dev. Over-communicate about GameDays. If anyone feels uncomfortable, we don’t proceed. Build confidence by testing in production incrementally.
  24. Incidents happen. Blameless post-mortems ✔. Explore human factors. Root Cause Analysis can be limiting. Use GameDays to fill in the gaps.
  25. [Diagram labels: Chaos; Performance; Training; Post-Mortem; Load Testing; Disaster Recovery; Incident Simulation; Support Tickets; Cross-Team Eng.; Performance Research]
  26. Application Performance GameDay: A time-boxed opportunity to dive deeply into a specific topic, system, or set of tickets. Pulling from: engineering reports; support tickets.
  29. Conclusions: Chaos engineering can help make any application more resilient, regardless of architecture. If you don’t know where to start, try validating isolated parts of your infrastructure, application, or data storage. The best time to find a vulnerability is before an incident. The second best time is after an incident so that it never happens again. Chaos engineering is an effective tool for sharing knowledge and building empathy.
  30. Conclusions: Chaos engineering can help make any application more resilient, regardless of architecture. If you don’t know where to start, try looking at an architecture diagram or identifying some changes about to be released. The best time to find a vulnerability is before an incident. The second best time is after an incident so that it never happens again. Chaos engineering is an effective tool for sharing knowledge and building empathy.