Adopt chaos engineering techniques in your daily work

February 03, 2020

What happens when your database server runs out of disk space? Will your customers still be able to purchase on your webshop when your web server is at max capacity? Are your surveys still valid after a bot has filled out your forms a couple of thousand times?

These are just a few of the many things that can and will go wrong in your production environments. Are you confident your systems will still deliver value to your customers when the worst possible thing happens? The only way to know for sure is to adopt chaos engineering techniques. As Netflix popularised with its open-source Chaos Monkey and Simian Army tools, we should put our systems under constant stress to ensure that we can face disruptive disasters at any given time.

In this talk I walk through some of the disasters we faced in the past decade and how we learned to build resilience by design into all of our projects. I'll also share what we learned and where we succeeded when Armageddon struck. It will be an exciting experience that turns you into Dr. Evil in your own company.


  1. Michelangelo van Dam (@DragonBe)

    I'm a senior #php architect, co-founder and #ceo of @in2itvof, #community leader at @phpbenelux, coach at @CoderDojoBelgium, #digitalnomad, likes #coffee.
  2. Chaos Engineering: Brief intro to the world of chaos engineering

    “Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.” Source: Wikipedia
  3. Chaos Engineering: Brief intro to the world of chaos engineering

    “Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.” Source: Wikipedia
  4. Chaos Engineering: Brief intro to the world of chaos engineering

    Chaos engineering is experimenting in production to withstand turbulent and unexpected conditions. Source: Wikipedia
  5. Sep 2009: Power surge destroys hardware

    Not too much power please
    Facts:
    • There was a brief power outage
    • When the power came back on, there was a surge in voltage
    • The surge immediately destroyed all hardware
    Constraints:
    • All devices had retail-grade power sockets with surge protection
    • The UPS system protected the server, not other devices
    Solution:
    • Rewired the office with built-in, professional power surge protection
    • Upgraded the UPS to support static hardware (printer, phone, monitors, NAS devices)
  6. Apr 2011: AWS goes down & everyone else

    Putting all eggs in different baskets ended up in the same basket after all
    Facts:
    • AWS had issues and was down for a whole week
    • With AWS down, many other services were not accessible either
    • We used these different services to prevent exactly this from happening
    Constraints:
    • No information given by Amazon
    • No information given by our service providers
    Solution:
    • We replicated critical services on-prem
  7. Mar 2012: Server down & inaccessible

    Running out of wiggle space
    Facts:
    • Server was under heavy load
    • Logs were filling up disk space quickly
    • Alert of < 20% disk space came too late
    • Weren’t able to SSH into the server: no space left
    Constraints:
    • Logging was on the same server as the application
    • Monitoring was on a different server, but behind the same switch
    Solution:
    • Set up a centralized logging server
    • Moved monitoring to a different hosting provider
    • Added mountable storage features to our servers
  8. Feb 2013: Internet outage by national ISP

    Being offline in an online world
    Facts:
    • No internet for 14 hours
    • Deadline to deliver
    • QA and deploy tools were online
    Constraints:
    • Major ISP in Belgium
    • Alternative locations had no internet too
    Solution:
    • Mobile hotspot in the office from competing provider
  9. Feb 2020: Covid-19 Pandemic (Coronavirus)

    Stay in your home!
    Facts:
    • Viral infection that sickens and kills people globally
    • China, Korea and Italy in lockdown
    • Belgian government advises maintaining very good hygiene and avoiding unnecessary human contact
    • Events all over the world are cancelled
    Constraints:
    • Businesses are not ready for a WFH workforce
    • Events still depend on face-to-face interactions, not for sessions but for sponsors
    Solution:
    • Adopt a remote-first mentality in your company, school, hobby group and event organization
  10. 2021: Supply-Chain Attacks (Still ongoing)

    Poisoning the well ☠
    Facts:
    • Successful compromise of source code used by many other applications (SolarWinds, Codecov)
    • Failed attempts to breach open source languages (PHP, Python, Ruby)
    • Active phishing campaigns ongoing
    Constraints:
    • A breach in one component impacts all dependent projects
    • Hard to detect because it is provided by a trusted source and signatures match up
    Solution:
    • MFA for all code repositories required
    • Commit signing with certificate or GPG key on hardware token (YubiKey)
    • Library management, monitoring and alerting
  11. 2021: Floods & wildfires (Still ongoing)

    Effects of climate change will affect everyone 🌍
    Facts:
    • Parts of eastern Belgium, the Netherlands and western Germany were flooded, including some data centers
    • Wildfires in Eastern Europe, the US and Canada cause issues for networks and data centers
    Constraints:
    • No access to data and services because the DC is not reachable
    • No backups accessible because they were kept in the DC
    • Rebuilding infrastructure is hard when not using infrastructure as code
    Solution:
    • Multi-regional replication of applications
    • Backups also kept at a different DC
    • Ensure you can recreate your infrastructure and applications in an automated fashion
  12. Inspired by Netflix: Chaos Monkey & Chaos Engineering

    Don’t wait for chaos, create it. Observe, learn and remediate.
  13. Four areas of building resilience

    What are the areas where chaos can disrupt our operations?
    • Networks
    • Infrastructure
    • Applications
    • Humans
  14. What can go wrong with networks?

    • No internet connection (Crucial)
    • No IP available (Crucial)
    • Latency and timeouts (Moderate)
    • Wrong or bad encryption/certificates (Major)
    • Man-in-the-Middle attacks (Major)
  15. What can go wrong with infrastructure?

    • Hardware failure (Crucial)
    • Infrastructure down/not responsive (Crucial)
    • Resource overload (Moderate)
    • Bad configuration (public vs private) (Major)
    • Active hacking and exploits (Major)
  16. What can go wrong with applications?

    • Tight coupling of services (Crucial)
    • Bad coding practices (Crucial)
    • Bugs and vulnerabilities (Moderate)
    • Unable to disable compute-intensive parts (Major)
    • Active hacking and exploits (Major)
  17. What are our human challenges?

    • Bus factor (Crucial)
    • Single Point of Failure / Bottleneck (Crucial)
    • Diseases and strikes (Major)
    • Insider threats / (un)voluntary data breach (Major)
    • Staffing & Contracting (Moderate)
  18. Monitor everything: If it moves, you track it

    “Knowing is good, but knowing everything is better.” Dave Eggers, from his book “The Circle”
  19. Monitoring elements

    Disk space, CPU, memory, provision time, deployment time, network throughput, total logins, successful logins, failed logins, server requests, page load, data load, queue size, cache ratio, DB query times, active sessions, total sessions, global ping times, alerts, HTTP responses, and much more...
  20. Principles of Chaos Engineering

    1. Build a Hypothesis around Steady State Behavior
    2. Vary Real-world Events
    3. Run Experiments in Production
    4. Automate Experiments to Run Continuously
    5. Minimize Blast Radius
    Source: https://principlesofchaos.org/
  21. How do we test network failures?

    Becoming evil at disrupting networks 👹
    • Unplugging the master power from ISP connection
    • Configure route table to non-routable gateways
    • Set DNS to Syn-flood the network
    • Create a bad SSL/TLS certificate
  22. How do we test infrastructure failures?

    Have you turned it off and on again? No, just off 🚦
    • Switching off the power to devices
    • Change credentials for services
    • Turning off or destroying services
    • Removing payment instructions from service providers
    • Changing configurations in production to make services public
  23. How do we test application failures?

    Giving it all that you got
    • Putting applications under constant stress
    • Providing wrong values for forms and API calls
    • Disabling JavaScript and CSS
    • Constantly running automated penetration tests
    • Switching configurations (db connection becomes storage connection)
  24. How do we challenge the human aspect?

    One can only be diverse and welcoming if you have tested it
    • Social Engineering Tests
    • Table-top Exercises / Business DnD Game nights
    • Extract data from systems & score their sensitivity
    • Internal Workshop & Certification Programs
    • Switching roles
  25. Assume breach, always!

    Adopt a zero-trust data policy. Image: Zombie Attack, courtesy of Perth Zombie Apocalypse Simulation
  26. Less is more

    The less data you have, the better you are protected against loss or corruption of that data. Image: Empty vault, courtesy of Hang The Bankers
  27. A look into our chaos kitchen

    • Gauntlt: Be mean to your code and like it
    • Put your app under constant stress
    • PHPever PHP Mutation Testing Framework
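
The March 2012 incident in the transcript, where a fixed "less than 20% disk space" alert came too late, suggests alerting on how fast a disk is filling rather than only on a static threshold. Below is a minimal sketch in Python, assuming two free-space samples taken a known number of seconds apart. The function names and the four-hour horizon are illustrative assumptions, not something taken from the talk.

```python
import shutil


def seconds_until_full(free_before, free_after, interval_seconds):
    """Estimate seconds until the disk is full from two free-space samples.

    Returns None when free space is not shrinking.
    """
    consumed = free_before - free_after
    if consumed <= 0:
        return None
    rate = consumed / interval_seconds  # bytes consumed per second
    return free_after / rate


def should_alert(free_before, free_after, interval_seconds,
                 horizon_seconds=4 * 3600):
    """Alert when the disk is projected to fill within the horizon.

    The default horizon of four hours is an arbitrary example value.
    """
    remaining = seconds_until_full(free_before, free_after, interval_seconds)
    return remaining is not None and remaining <= horizon_seconds


def free_bytes(path="/"):
    """Read the current free space using the standard library."""
    return shutil.disk_usage(path).free
```

A monitoring job could call `free_bytes()` once a minute and pass consecutive readings to `should_alert()`. As the incident shows, that job is most useful when it runs on a host that does not share the disk, or the switch, with the server it watches.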
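
The principles listed in the transcript can be read as a loop: confirm the steady state, inject a real-world event, check the steady state again, and always roll back to keep the blast radius small. The following Python sketch illustrates that loop under stated assumptions. `run_experiment`, `ExperimentResult` and the in-memory `Service` stand-in are hypothetical names invented for this example; they do not come from the talk or from Chaos Monkey.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    steady_before: bool
    steady_during: bool

    @property
    def hypothesis_held(self):
        # The hypothesis holds only if the system was healthy before
        # the failure and stayed healthy while it was active.
        return self.steady_before and self.steady_during


def run_experiment(steady_state, inject, rollback):
    """Run one chaos experiment.

    steady_state: callable returning True when the system behaves normally.
    inject: callable that introduces the failure.
    rollback: callable that removes it; it always runs after injection.
    """
    if not steady_state():
        # Do not inject failure into a system that is already unhealthy.
        return ExperimentResult(False, False)
    inject()
    try:
        during = steady_state()
    finally:
        rollback()
    return ExperimentResult(True, during)


class Service:
    """Hypothetical service with a primary and an optional fallback backend."""

    def __init__(self, has_fallback):
        self.primary_up = True
        self.has_fallback = has_fallback

    def healthy(self):
        return self.primary_up or self.has_fallback

    def kill_primary(self):
        self.primary_up = False

    def restore_primary(self):
        self.primary_up = True
```

Calling `run_experiment(svc.healthy, svc.kill_primary, svc.restore_primary)` on a `Service(has_fallback=True)` reports that the hypothesis held, while the same call on a `Service(has_fallback=False)` reports that it did not. In both cases the primary is restored afterwards. Scheduling such a call repeatedly would correspond to the "automate experiments to run continuously" principle.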