Adopt chaos engineering techniques in your daily work

What happens when your database server runs out of disk space? Will your customers still be able to make purchases in your webshop when your web server is at maximum capacity? Are your surveys still valid after a bot has filled out your forms a couple of thousand times?

These are just a few of the many things that can and will go wrong in your production environments. Are you confident your systems are still delivering value to your customers when the worst possible thing happens? The only way to know for sure is to adopt chaos engineering techniques. As popularised by Netflix with their open-sourced Chaos Monkey and Simian Army tools, we should put our systems under constant stress to ensure that we can face disruptive disasters at any given time.

In this talk I walk through some of the disasters we faced over the past decade and how we learned to build resilience by design into all of our projects. I'll also share our lessons and our successes from when Armageddon took place. It will be an exciting experience that turns you into the Dr. Evil of your own company.



February 03, 2020


    Michelangelo van Dam (@DragonBe): What could possibly go wrong?

    Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions. Source: Wikipedia
    Doom 5 Artwork, courtesy of id Software

    Sep 2009: Power surge destroys hardware

Facts: There was a brief power outage. When the power came back on, there was a surge in voltage that immediately destroyed all hardware.

Constraints: All devices had retail-grade power sockets with surge protection. The UPS protected only the server, not the other devices.

Solution: Rewired the office with built-in, professional power surge protection. Upgraded the UPS to also cover static hardware (printer, phone, monitors, NAS devices).
    Apr 2011: AWS goes down, and so does everyone else

Facts: AWS had issues and was down for a whole week. Many other services became inaccessible along with it. We had spread our work across these different services precisely to prevent this from happening.

Constraints: No information given by Amazon. No information given by our service providers.

Solution: We replicated critical services on-prem.
    Mar 2012: Server down & inaccessible

Facts: Server was under heavy load. Logs were filling up disk space quickly. The alert for < 20% disk space came too late. We weren't able to SSH into the server: no space left.

Constraints: Logging was on the same server as the application. Monitoring was on a different server, but behind the same switch.

Solution: Set up a centralized logging server. Moved monitoring to a different hosting provider. Added mountable storage features to our servers.
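
The slide notes that the alert for less than 20% free disk space came too late. One possible remedy, sketched here under assumptions, is a two-stage threshold that warns well before the critical level. The function names and the 40% warning level are illustrative choices, not values from the talk; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def disk_alert_level(free_fraction, warn_below=0.40, critical_below=0.20):
    """Map the fraction of free disk space to an alert level.

    The 0.40 warning threshold is an assumed value; 0.20 mirrors the
    "< 20%" alert mentioned on the slide.
    """
    if not 0.0 <= free_fraction <= 1.0:
        raise ValueError("free_fraction must be between 0 and 1")
    if free_fraction < critical_below:
        return "critical"
    if free_fraction < warn_below:
        return "warning"
    return "ok"


def check_path(path="/"):
    """Read real usage for a mount point and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_alert_level(usage.free / usage.total)


if __name__ == "__main__":
    print(check_path("/"))
```

Running a check like this from the separately hosted monitoring server, rather than from the application server itself, matches the solution described above.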
    Feb 2013: Internet outage by national ISP

Facts: No internet for 14 hours. Deadline to deliver. QA and deploy tools were online.

Constraints: Major ISP in Belgium. Alternative locations had no internet either.

Solution: Mobile hotspot in the office from a competing provider.
    Feb 2020: Covid-19 Pandemic (Coronavirus)

Facts: Viral infection that sickens and kills people globally. China, Korea and Italy on lock-down. The Belgian government advises maintaining very good hygiene and avoiding unnecessary human contact.

Constraints: Events all over the world are cancelled. Businesses are not ready for a WFH workforce. Events still depend on face-to-face interactions, not for sessions but for sponsors.

Solution: Adopt a remote-first mentality in your company, school, hobby group and event organization.
    Chaos Monkey & Chaos Engineering: Don't wait for chaos, create it. Observe, learn and remediate.

    Four areas of building resilience: Networks, Infrastructure, Applications, Humans
    What can go wrong with networks?

No internet connection (Crucial)
No IP available (Crucial)
Latency and timeouts (Moderate)
Wrong or bad encryption/certificates (Major)
Man-in-the-Middle attacks (Major)
    What can go wrong with infrastructure?

Hardware failure (Crucial)
Infrastructure down/not responsive (Crucial)
Resource overload (Moderate)
Bad configuration (public vs private) (Major)
Active hacking and exploits (Major)
    What can go wrong with applications?

Tight coupling of services (Crucial)
Bad coding practices (Crucial)
Bugs and vulnerabilities (Moderate)
Unable to disable compute-intensive parts (Major)
Active hacking and exploits (Major)
    What are our human challenges?

Bus factor (Crucial)
Single Point of Failure / Bottleneck (Crucial)
Diseases and strikes (Major)
Insider threats / (in)voluntary data breach (Major)
Staffing & Contracting (Moderate)
    "Knowing is good, but knowing everything is better." - Dave Eggers, quote from his book "The Circle"

    Monitoring elements: Disk space, CPU, Memory, Provision time, Deployment time, Network throughput, Total logins, Successful logins, Failed logins, Server requests, Page load, Data load, Queue size, Cache ratio, DB Query times, Active sessions, Total sessions, Global ping times, Alerts, HTTP Responses, and much more...
    One of our dashboards (slide image)

    Principles of Chaos Engineering:
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
5. Minimize Blast Radius

Source:
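
The principles above can be read as the outline of a single experiment: confirm the steady state, introduce a real-world event, check whether the steady state still holds, and always undo the change to keep the blast radius small. The following is a minimal sketch of that loop; `run_experiment` and its three callables are hypothetical names introduced for illustration, not part of Chaos Monkey or any other tool.

```python
def run_experiment(steady_state, inject_failure, rollback):
    """Run one chaos experiment and report whether the hypothesis held.

    steady_state:   callable returning True while the system behaves normally
    inject_failure: callable that introduces the real-world event
    rollback:       callable that undoes the injection (limits blast radius)
    """
    if not steady_state():
        # Never experiment on a system that is already unhealthy.
        return {"ran": False, "hypothesis_held": None}
    try:
        inject_failure()
        held = steady_state()
    finally:
        rollback()  # always restore, even if the injection or the probe raised
    return {"ran": True, "hypothesis_held": held}


if __name__ == "__main__":
    # Toy system: the service is healthy as long as one replica is up.
    replicas = {"a": True, "b": True}
    result = run_experiment(
        steady_state=lambda: any(replicas.values()),
        inject_failure=lambda: replicas.update(a=False),
        rollback=lambda: replicas.update(a=True),
    )
    print(result)
```

Scheduling such a function to run repeatedly would cover the "automate experiments to run continuously" principle; that part is left out of the sketch.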
    Hypotheses for networks:
• We work online using services, therefore our internet connection must be up
• We need an IP to connect, so we must have a fixed IP from our service provider and on our office network
• When the network/internet is slow, we should be able to work offline and synchronize in the background
• We should automate the process of creating and updating certificates or secrets on the fly using a battle-hardened solution
• We should verify each connection to ensure that no MITM attacks can occur
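
The hypothesis about working offline and synchronizing in the background can be illustrated with a small local queue. This is a sketch under assumptions: the `Outbox` class and the `send` callable are hypothetical, and `send` is assumed to raise `OSError` (which `ConnectionError` and `TimeoutError` inherit from) when the network is unavailable.

```python
from collections import deque


class Outbox:
    """Queue changes locally while offline and flush them when the link is back."""

    def __init__(self, send):
        self._send = send        # callable assumed to raise OSError on network failure
        self._pending = deque()

    def submit(self, item):
        """Queue an item and try to deliver everything queued so far."""
        self._pending.append(item)
        return self.flush()

    def flush(self):
        """Send queued items in order; stop at the first failure. Returns count sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except OSError:
                break            # still offline: keep the item for the next flush
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self):
        return len(self._pending)
```

A background timer or a connectivity event could call `flush()` periodically, so work done offline is delivered in its original order once the connection returns.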
    Hypotheses for infrastructure:
• We want our work to run on immutable infrastructure allowing us to toss out broken components and rebuild quickly on new ones
• We want to build our work on PaaS and SaaS so it can run on any service provider that offers these services allowing us to switch traffic from a non-responsive service to another, preferably in an automated way
• We want to auto-scale based on service load and usage history
• We provision our infrastructure as code, so that configuration settings are reviewed and policies are applied
• We monitor our chosen technology for security notifications and updates so changes are applied instantly in an automated and tested fashion
    Hypotheses for applications:
• We want to build cloud-native, loosely coupled applications with a single-purpose architecture
• We want to intercept bad code early in the development process with static and dynamic code analysers
• We want to minimise bugs and vulnerabilities by having automated unit, integration and security testing executed during development
• We want feature toggles that keep unfinished features disabled during development, let us enable them gradually when finished, and let us disable compute-intensive features during heavy load
• We want to attack our applications constantly with automated penetration testing tools to find and fix attack vectors in our application
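
The feature-toggle hypothesis above combines two ideas: unfinished features stay off until explicitly enabled, and compute-intensive features are shed under heavy load. A minimal sketch of both follows. The `FeatureToggles` class, the feature names in the usage note and the 0.8 load threshold are assumptions for illustration; gradual percentage-based rollout is not covered here.

```python
class FeatureToggles:
    """In-memory feature toggles with simple load shedding."""

    def __init__(self, load_threshold=0.8):
        self._features = {}
        self._load_threshold = load_threshold  # assumed value: 80% load

    def register(self, name, enabled=False, compute_intensive=False):
        """Declare a feature; it is disabled unless stated otherwise."""
        self._features[name] = {
            "enabled": enabled,
            "compute_intensive": compute_intensive,
        }

    def enable(self, name):
        self._features[name]["enabled"] = True

    def is_active(self, name, current_load=0.0):
        """Decide whether a feature should run at the given load (0.0 to 1.0)."""
        feature = self._features.get(name)
        if feature is None or not feature["enabled"]:
            return False  # unknown or unfinished features stay off
        if feature["compute_intensive"] and current_load >= self._load_threshold:
            return False  # shed expensive work under heavy load
        return True
```

For example, registering a hypothetical `recommendations` feature with `compute_intensive=True` means `is_active("recommendations", current_load=0.9)` returns `False` while a cheap feature such as checkout keeps running.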
    Hypotheses for humans:
• We want to ensure staff, suppliers and clients can work with us digitally first
• We automate recurring processes to free up time for valuable work and learning
• We prevent direct and indirect access to sensitive data (UI/UX changes)
• We want to remove bottlenecks and single points of failure (SPOFs) through majority-voting mechanisms
• We want to share knowledge internally to eliminate the bus factor
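
One way to read the majority-voting hypothesis is that no single person should be able to approve or block a decision alone. The sketch below shows a strict-majority check; `is_approved` and the names in the usage note are hypothetical, and the talk does not specify how votes are actually collected.

```python
def is_approved(votes, eligible):
    """Return True when a strict majority of eligible voters approved.

    votes:    mapping of voter name -> True (approve) or False (reject)
    eligible: collection of names allowed to vote
    """
    eligible = set(eligible)
    approvals = sum(
        1 for voter, vote in votes.items() if voter in eligible and vote
    )
    # Strict majority: more than half of all eligible voters, not just of those who voted.
    return approvals > len(eligible) // 2
```

With three eligible reviewers, two approvals are enough, so a change can proceed even when one reviewer is ill or unavailable.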
    How do we test network failures:
• Unplugging the master power from ISP
• Configure route table to non-routable gateways
• Set DNS to
• Syn-flood the network
• Wiresharking the internal and external network
    How do we test infrastructure failures:
• Switching off the power to devices
• Change credentials for services
• Turning off or destroying services
• Removing payment instructions from service providers
• Changing configurations in production to make services public
    How do we test application failures:
• Putting applications under constant stress
• Providing wrong values for forms and API calls
• Disabling JavaScript and CSS
• Constantly running automated penetration tests
• Switching configurations (db connection becomes storage connection)
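
Providing wrong values for forms and API calls can be automated with a small list of hostile inputs that is replayed against each handler. The sketch below assumes a hypothetical `parse_quantity` form handler and treats `ValueError` as the only acceptable way to reject input; both are illustrative choices, not details from the talk.

```python
def parse_quantity(raw):
    """Hypothetical form handler: accept an order quantity from 1 to 100."""
    if not isinstance(raw, str):
        raise ValueError("quantity must be submitted as text")
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        raise ValueError("quantity must be a whole number")
    value = int(text)
    if not 1 <= value <= 100:
        raise ValueError("quantity out of range")
    return value


HOSTILE_INPUTS = [
    None, "", " ", "-1", "0", "101", "1e3",
    "\u0661\u0662",            # Arabic-Indic digits: isdigit() is True, but not ASCII
    "1; DROP TABLE orders",
    "9" * 50,
    3.5, [], {},
]


def unexpected_failures(handler, inputs):
    """Return inputs that were accepted or that raised something other than ValueError."""
    problems = []
    for raw in inputs:
        try:
            handler(raw)
        except ValueError:
            continue  # rejected cleanly, as intended
        except Exception as exc:
            problems.append((raw, repr(exc)))
        else:
            problems.append((raw, "accepted"))
    return problems
```

An empty result from `unexpected_failures(parse_quantity, HOSTILE_INPUTS)` means every hostile value was rejected cleanly; any entry in the list points at a value that slipped through or crashed the handler.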
    How do we challenge the human aspect:
• Social Engineering Tests
• Table-top Exercises / Business DnD
• Game nights
• Extract data from systems & score their sensitivity
• Internal Workshop & Certification Programs
• Switching roles
    Adopt a zero-trust data policy

Zombie Attack, courtesy of Perth Zombie Apocalypse Simulation
    The less data you have, the better you are protected against loss or corruption of that data

Empty vault, courtesy of Hang The Bankers
    A look into our chaos kitchen:
GAUNTLT - BE MEAN TO YOUR CODE AND LIKE IT
PHP Mutation Testing Framework
