Adopt chaos engineering techniques in your daily work

What happens when your database server runs out of disk space? Will your customers still be able to make purchases in your webshop when your web server is at maximum capacity? Are your surveys still valid after a bot has filled out your forms a couple of thousand times?

These are just a few of the many things that can and will go wrong in your production environments. Are you confident your systems will still deliver value to your customers when the worst possible thing happens? The only way to know for sure is to adopt chaos engineering techniques. As popularised by Netflix with their open-sourced Chaos Monkey and Simian Army tools, we should put our systems under constant stress to ensure that we can face disruptive disasters at any given time.

In this talk I walk through some of the disasters we faced in the past decade and how we learned to build resilience by design into all of our projects. We'll also share with you our learnings and our successes when Armageddon takes place. It will be an exciting experience that lets you become Dr. Evil in your own company.



February 03, 2020


  1. Michelangelo van Dam (@DragonBe) Adopt chaos engineering techniques in your daily work
  2. Michelangelo van Dam

    I'm a senior #php architect, co-founder and #ceo of @in2itvof, #community leader at @phpbenelux, coach at @CoderDojoBelgium, #digitalnomad, likes #coffee.
  3. What could possibly go wrong?

  4. Chaos Engineering: Brief intro to the world of chaos engineering

    “Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.” Source: Wikipedia
  6. Chaos Engineering: Brief intro to the world of chaos engineering

    “Chaos engineering is experimenting in production to withstand turbulent and unexpected conditions.” Source: Wikipedia
  7. Chaos Engineering: In other words...

    Doom 5 artwork, courtesy of id Software
  8. Our hard lessons

  9. Sep 2009: Power surge destroys hardware

    Not too much power please

    Facts
    • There was a brief power outage
    • When the power came back on, there was a surge in voltage
    • The surge immediately destroyed all hardware

    Constraints
    • All devices had retail-grade power sockets with surge protection
    • The UPS system protected the server, not other devices

    Solution
    • Rewired the office with built-in, professional power surge protection
    • Upgraded the UPS to support static hardware (printer, phone, monitors, NAS devices)
  10. Apr 2011: AWS goes down & everyone else

    Putting all eggs in different baskets ended up in the same basket after all

    Facts
    • AWS had issues and was down for a whole week
    • With AWS, many other services were not accessible either
    • We used these different services to prevent exactly this from happening

    Constraints
    • No information given by Amazon
    • No information given by our service providers

    Solution
    • We replicated critical services on-prem
  11. Mar 2012: Server down & inaccessible

    Running out of wiggle space

    Facts
    • Server was under heavy load
    • Logs were filling up disk space quickly
    • Alert of < 20% disk space came too late
    • We weren’t able to SSH into the server: no space left

    Constraints
    • Logging was on the same server as the application
    • Monitoring was on a different server, but behind the same switch

    Solution
    • Set up a centralized logging server
    • Moved monitoring to a different hosting provider
    • Added mountable storage features to our servers
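
The disk-space alert in the 2012 incident above fired too late to act on. A minimal sketch of a tiered check that warns earlier is shown here. The 40% and 20% thresholds are illustrative assumptions (reading the slide's "< 20%" as free space), and `classify_free_space` and `check_path` are hypothetical names, not tooling from the talk.

```python
import shutil

# Illustrative thresholds (assumptions, not values from the talk):
# warn well before the critical level so there is still time to react.
WARNING_FREE_RATIO = 0.40
CRITICAL_FREE_RATIO = 0.20


def classify_free_space(total_bytes: int, free_bytes: int) -> str:
    """Return 'ok', 'warning' or 'critical' for the given disk figures."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_ratio = free_bytes / total_bytes
    if free_ratio < CRITICAL_FREE_RATIO:
        return "critical"
    if free_ratio < WARNING_FREE_RATIO:
        return "warning"
    return "ok"


def check_path(path: str = "/") -> str:
    """Classify the free space of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return classify_free_space(usage.total, usage.free)


if __name__ == "__main__":
    print(check_path("/"))
```

In line with the lessons on the slide, a check like this would run from a monitoring host outside the affected server and network segment, so that a full disk cannot silence its own alarm.
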
  12. Feb 2013: Internet outage by national ISP

    Being offline in an online world

    Facts
    • No internet for 14 hours
    • Deadline to deliver
    • QA and deploy tools were online

    Constraints
    • Major ISP in Belgium
    • Alternative locations had no internet too

    Solution
    • Mobile hotspot in the office from competing provider
  13. Feb 2020: Covid-19 Pandemic (Coronavirus)

    Stay in your home!

    Facts
    • Viral infection that sickens and kills people globally
    • China, Korea and Italy on lock-down
    • Belgian government advises maintaining very good hygiene and avoiding unnecessary human contact
    • Events all over the world are cancelled

    Constraints
    • Businesses are not ready for a WFH workforce
    • Events still depend on face-to-face interactions, not for sessions but for sponsors

    Solution
    • Adopt a remote-first mentality in your company, school, hobby group and event organization
  14. We needed to prepare for chaos

  15. Inspired by Netflix: Chaos Monkey & Chaos Engineering

    Don’t wait for chaos, create it
    Observe, learn and remediate
  16. Four areas of building resilience

    What are the areas where chaos can disrupt our operations?
    • Networks
    • Infrastructure
    • Applications
    • Humans
  17. What can go wrong with networks?

    • No internet connection: Crucial
    • No IP available: Crucial
    • Latency and timeouts: Moderate
    • Wrong or bad encryption/certificates: Major
    • Man-in-the-Middle attacks: Major
  18. What can go wrong with infrastructure?

    • Hardware failure: Crucial
    • Infrastructure down/not responsive: Crucial
    • Resource overload: Moderate
    • Bad configuration (public vs private): Major
    • Active hacking and exploits: Major
  19. What can go wrong with applications?

    • Tight coupling of services: Crucial
    • Bad coding practices: Crucial
    • Bugs and vulnerabilities: Moderate
    • Unable to disable compute-intensive parts: Major
    • Active hacking and exploits: Major
  20. What are our human challenges?

    • Bus factor: Crucial
    • Single Point of Failure / Bottleneck: Crucial
    • Diseases and strikes: Major
    • Insider threats / (un)voluntary data breach: Major
    • Staffing & Contracting: Moderate
  21. Our remediation efforts

  22. Monitor everything: If it moves, you track it

    “Knowing is good, but knowing everything is better.” Dave Eggers, quote from his book “The Circle”
  23. Monitoring elements

    Disk space, CPU, Memory, Provision time, Deployment time, Network throughput, Total logins, Successful logins, Failed logins, Server requests, Page load, Data load, Queue size, Cache ratio, DB Query times, Active sessions, Total sessions, Global ping times, Alerts, HTTP Responses, Much more...
  24. One of our dashboards

  25. Principles of Chaos Engineering

    1. Build a Hypothesis around Steady State Behavior
    2. Vary Real-world Events
    3. Run Experiments in Production
    4. Automate Experiments to Run Continuously
    5. Minimize Blast Radius
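
The principles above can be read as a loop: confirm the steady state, inject one fault, observe, and always roll back. A minimal sketch of that loop follows, assuming a toy in-memory service. `run_experiment`, `ExperimentResult` and `ReplicatedService` are hypothetical names for illustration, not tools from the talk, and a real experiment would probe live metrics instead of an object in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_during_fault: bool
    steady_after_rollback: bool

    @property
    def hypothesis_held(self) -> bool:
        return self.steady_during_fault and self.steady_after_rollback


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Check steady state, inject one fault, check again, always roll back."""
    if not steady_state():
        # Never experiment on a system that is already unhealthy.
        raise RuntimeError("system is not in steady state; aborting")
    inject_fault()
    try:
        during = steady_state()
    finally:
        rollback()  # keeps the blast radius small even if the probe raises
    return ExperimentResult(during, steady_state())


class ReplicatedService:
    """Toy stand-in for a service with several replicas (hypothetical)."""

    def __init__(self, replicas: int) -> None:
        self.healthy = replicas

    def kill_one(self) -> None:
        self.healthy = max(0, self.healthy - 1)

    def restore_one(self) -> None:
        self.healthy += 1

    def serves_traffic(self) -> bool:
        return self.healthy >= 1


if __name__ == "__main__":
    service = ReplicatedService(replicas=2)
    result = run_experiment(
        service.serves_traffic, service.kill_one, service.restore_one
    )
    print(result.hypothesis_held)
```

With two replicas the hypothesis "the service keeps serving traffic when one replica dies" holds. With a single replica it fails during the fault, which is exactly the kind of weakness the experiment is meant to expose.
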
  26. Hypotheses for networks

    • We work online using services, therefore our internet connection must be up
    • We need an IP to connect, so we must have a fixed IP from our service provider and on our office network
    • When the network/internet is slow, we should be able to work offline and synchronize in the background
    • We should automate the process of creating and updating certificates or secrets on the fly using a battle-hardened solution
    • We should verify each connection to ensure that no MITM attacks can occur
  27. Hypotheses for infrastructure

    • We want our work to run on immutable infrastructure, allowing us to toss out broken components and rebuild quickly on new ones
    • We want to build our work on PaaS and SaaS so it can run on any service provider that offers these services, allowing us to switch traffic from a non-responsive service to another, preferably in an automated way
    • We want to auto-scale based on service load and usage history
    • We provide our infrastructure as code, where configuration settings are reviewed and policies are applied
    • We monitor our chosen technology for security notifications and updates so changes are applied instantly in an automated and tested fashion
  28. Hypotheses for applications

    • We want to build cloud-native, loosely coupled applications with a single-purpose architecture
    • We want to intercept bad code early in the development process with static and dynamic code analysers
    • We want to minimise bugs and vulnerabilities by having automated unit, integration and security testing executed during development
    • We want to be able to keep unfinished features disabled during development, with the possibility to gradually enable them when finished, and to disable compute-intensive features during heavy load by using feature toggles
    • We want to attack our applications constantly with automated penetration testing tools to find and fix attack vectors in our application
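
The feature-toggle hypothesis above combines two decisions: is the feature switched on at all, and is it too expensive to run under the current load? A minimal in-memory sketch follows. The `FeatureToggles` class, the feature names and the 0.8 load limit are assumptions for illustration. Gradual rollout is not covered here.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureToggles:
    """In-memory feature toggles (hypothetical sketch)."""

    enabled: Set[str] = field(default_factory=set)
    compute_intensive: Set[str] = field(default_factory=set)
    load_limit: float = 0.8  # assumed fraction of capacity, not from the talk

    def is_active(self, feature: str, current_load: float) -> bool:
        if feature not in self.enabled:
            return False  # unfinished or switched-off feature
        if feature in self.compute_intensive and current_load >= self.load_limit:
            return False  # shed expensive work under heavy load
        return True


if __name__ == "__main__":
    toggles = FeatureToggles(
        enabled={"checkout", "recommendations"},
        compute_intensive={"recommendations"},
    )
    print(toggles.is_active("recommendations", current_load=0.5))
    print(toggles.is_active("recommendations", current_load=0.9))
```

Keeping the load decision inside the toggle means callers ask one question per feature, and a chaos experiment can raise `current_load` artificially to verify that the expensive parts really switch off.
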
  29. Hypotheses for humans

    • We want to ensure staff, suppliers and clients can work with us digitally first
    • We automate recurring processes to free up time for valuable work and learning
    • We prevent direct and indirect access to sensitive data (UI/UX changes)
    • We want to remove bottlenecks and single points of failure (SPOFs) through majority voting mechanisms
    • We want to share knowledge internally to eliminate the bus factor
  30. How do we test network failures

    Becoming evil at disrupting networks

    • Unplugging the master power from ISP
    • Configure route table to non-routable gateways
    • Set DNS to SYN-flood the network
    • Wiresharking the internal and external network
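
The slide above disrupts real networks. A software-level stand-in, useful before touching cables or routing tables, is to wrap a network call so that it fails on demand and then check that the caller copes. This is a different technique from the physical tests on the slide. `with_faults`, `call_with_retries` and `InjectedFault` are hypothetical names for this sketch.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(ConnectionError):
    """Raised instead of calling the real function to simulate an outage."""


def with_faults(
    func: Callable[..., T], failure_rate: float, rng: random.Random
) -> Callable[..., T]:
    """Wrap `func` so that a share of calls fails with InjectedFault."""

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected network failure")
        return func(*args, **kwargs)

    return wrapper


def call_with_retries(func: Callable[[], T], attempts: int) -> T:
    """Call `func` up to `attempts` times, re-raising the last connection error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    flaky_ping = with_faults(lambda: "pong", failure_rate=0.5, rng=random.Random())
    try:
        print(call_with_retries(flaky_ping, attempts=5))
    except InjectedFault:
        print("service unreachable after 5 attempts")
```

Passing the random generator in explicitly makes an experiment repeatable: a seeded `random.Random` replays the same sequence of failures on every run.
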
  31. How do we test infrastructure failures

    Have you turned it off and on again? No, just off

    • Switching off the power to devices
    • Change credentials for services
    • Turning off or destroying services
    • Removing payment instructions from service providers
    • Changing configurations in production to make services public
  32. How do we test application failures

    Giving it all that you got

    • Putting applications under constant stress
    • Providing wrong values for forms and API calls
    • Disabling JavaScript and CSS
    • Constantly running automated penetration tests
    • Switching configurations (db connection becomes storage connection)
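
"Providing wrong values for forms and API calls" can be automated by feeding a handler a fixed list of hostile inputs and recording everything it accepts or crashes on. The sketch below assumes a hypothetical `parse_quantity` form handler that accepts whole numbers from 1 to 100. The list of hostile values is illustrative and far from exhaustive.

```python
from typing import Any, Callable, List, Tuple

# Illustrative hostile inputs (an assumption, not a list from the talk).
HOSTILE_VALUES: List[Any] = [
    None, "", " ", "-1", "0", "1e309", "9" * 400, "abc",
    "<script>", "1; DROP TABLE orders", 3.5, [], {}, True,
]


def parse_quantity(raw: Any) -> int:
    """Hypothetical form handler: accept whole numbers from 1 to 100."""
    if isinstance(raw, bool) or not isinstance(raw, (int, str)):
        raise ValueError("quantity must be an integer or a numeric string")
    text = str(raw).strip()
    if not text.isascii() or not text.isdigit():
        raise ValueError("quantity must contain digits only")
    if len(text) > 3:
        raise ValueError("quantity is too large")
    value = int(text)
    if not 1 <= value <= 100:
        raise ValueError("quantity must be between 1 and 100")
    return value


def find_unhandled_inputs(handler: Callable[[Any], Any]) -> List[Tuple[Any, str]]:
    """Return hostile values that the handler accepted or crashed on."""
    problems = []
    for value in HOSTILE_VALUES:
        try:
            handler(value)
        except ValueError:
            continue  # rejected cleanly, which is the desired outcome
        except Exception as error:  # anything else counts as an unhandled crash
            problems.append((value, repr(error)))
        else:
            problems.append((value, "accepted"))
    return problems


if __name__ == "__main__":
    print(find_unhandled_inputs(parse_quantity))
```

Running the same check against a naive handler such as the built-in `int` reports several problems, for example `None` crashing with a `TypeError` and `"-1"` being accepted.
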
  33. How do we challenge the human aspect

    One can only be diverse and welcoming if you have tested it

    • Social Engineering Tests
    • Table-top Exercises / Business DnD Game nights
    • Extract data from systems & score their sensitivity
    • Internal Workshop & Certification Programs
    • Switching roles
  34. What about data?

  35. Assume breach, always!

    Adopt a zero-trust data policy

    Zombie Attack, courtesy of Perth Zombie Apocalypse Simulation
  36. Less is more

    The less data you have, the better you are protected against loss or corruption of that data

    Empty vault, courtesy of Hang The Bankers
  37. Tools and references

  38. A look into our chaos kitchen

    • GAUNTLT: BE MEAN TO YOUR CODE AND LIKE IT
    • PHP Mutation Testing Framework
  39. Open source tools we use

  40. Resources for more information

  41. Questions?