Adopt chaos engineering techniques in your daily work

Michelangelo van Dam (@DragonBe)

Michelangelo van Dam
I'm a senior #php architect, co-founder and #ceo of @in2itvof, #community leader at @phpbenelux, coach at @CoderDojoBelgium, #digitalnomad, likes #coffee.

What could possibly go wrong?

Chaos Engineering
Brief intro to the world of chaos engineering
“Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.”
Source: Wikipedia
In short: chaos engineering is experimenting in production to withstand turbulent and unexpected conditions.

Chaos Engineering
In other words...
(Image: Doom 5 artwork, courtesy of id Software)

Our hard lessons

Sep 2009: Power surge destroys hardware
Not too much power please
Facts
• There was a brief power outage
• When the power came back on, there was a surge in voltage
• The surge immediately destroyed all hardware
Constraints
• All devices had retail-grade power sockets with surge protection
• The UPS system protected the server, not other devices
Solution
• Rewired the office with built-in, professional power surge protection
• Upgraded the UPS to support static hardware (printer, phone, monitors, NAS devices)

Apr 2011: AWS goes down & everyone else
Putting all eggs in different baskets ended up in the same basket after all
Facts
• AWS had issues and was down for a whole week
• With AWS down, many other services were not accessible either
• We used these different services precisely to prevent this from happening
Constraints
• No information given by Amazon
• No information given by our service providers
Solution
• We replicated critical services on-prem

Mar 2012: Server down & inaccessible
Running out of wiggle space
Facts
• Server was under heavy load
• Logs were filling up disk space quickly
• The alert for < 20% disk space came too late
• We weren't able to SSH into the server: no space left
Constraints
• Logging was on the same server as the application
• Monitoring was on a different server, but behind the same switch
Solution
• Set up a centralized logging server
• Moved monitoring to a different hosting provider
• Added mountable storage features to our servers
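The "alert came too late" lesson can be illustrated with a small sketch. It assumes that a fixed percentage threshold alone is not enough when logs grow quickly, so it also estimates how many hours remain until the disk is full. The function names, the 24-hour window and the idea of passing in a measured growth rate are illustrative assumptions, not the setup described in the talk.

```python
import shutil


def hours_until_full(free_bytes, growth_bytes_per_hour):
    """Estimate the hours left before the disk is full at the current growth rate."""
    if growth_bytes_per_hour <= 0:
        return float("inf")  # not growing, so it never fills up
    return free_bytes / growth_bytes_per_hour


def should_alert(total_bytes, free_bytes, growth_bytes_per_hour,
                 min_free_ratio=0.20, min_hours_left=24.0):
    """Alert on a low free-space ratio OR on a disk that is filling up fast."""
    low_ratio = free_bytes / total_bytes < min_free_ratio
    filling_fast = hours_until_full(free_bytes, growth_bytes_per_hour) < min_hours_left
    return low_ratio or filling_fast


def check_path(path, growth_bytes_per_hour):
    """Apply the check to a real mount point (the growth rate is measured elsewhere)."""
    usage = shutil.disk_usage(path)
    return should_alert(usage.total, usage.free, growth_bytes_per_hour)
```

With half the disk still free but logs growing by 10% of the disk per hour, `should_alert(100, 50, 10)` already fires, whereas a plain 20% threshold would stay silent for another three hours.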

Feb 2013: Internet outage by national ISP
Being offline in an online world
Facts
• No internet for 14 hours
• Deadline to deliver
• QA and deploy tools were online
Constraints
• Major ISP in Belgium
• Alternative locations had no internet too
Solution
• Mobile hotspot in the office from competing provider

Feb 2020: Covid-19 Pandemic (Coronavirus)
Stay in your home!
Facts
• Viral infection that sickens and kills people globally
• China, Korea and Italy in lockdown
• Belgian government advises maintaining good hygiene and avoiding unnecessary human contact
• Events all over the world are cancelled
Constraints
• Businesses are not ready for a WFH workforce
• Events still depend on face-to-face interactions, not for sessions but for sponsors
Solution
• Adopt a remote-first mentality in your company, school, hobby group and event organization

2021: Supply-Chain Attacks (Still ongoing)
Poisoning the well ☠
Facts
• Successful compromise of source code used by many other applications (SolarWinds, Codecov)
• Failed attempts to breach open source languages (PHP, Python, Ruby)
• Active phishing campaigns ongoing
Constraints
• A breach in one component impacts all dependent projects
• Hard to detect because it is provided by a trusted source and signatures match up
Solution
• MFA for all code repositories required
• Commit signing with certificate or GPG key on hardware token (YubiKey)
• Library management, monitoring and alerting

2021: Floods & wildfires (Still ongoing)
Effects of climate change will affect everyone 🌍
Facts
• Parts of eastern Belgium, the Netherlands and western Germany were flooded, including some data centers
• Wildfires in Eastern Europe, the US and Canada cause issues for networks and data centers
Constraints
• No access to data and services because the DC is not reachable
• No backups accessible because they were kept in the same DC
• Rebuilding infrastructure is hard when not using infrastructure as code
Solution
• Multi-regional replication of applications
• Backups also kept at a different DC
• Ensure you can recreate your infrastructure and applications in an automated fashion

We needed to prepare for chaos

Inspired by Netflix
Chaos Monkey & Chaos Engineering
Don’t wait for chaos, create it
Observe, learn and remediate

Four areas of building resilience
What are the areas where chaos can disrupt our operations?
• Networks
• Infrastructure
• Applications
• Humans

What can go wrong with networks?
• No internet connection: Crucial
• No IP available: Crucial
• Latency and timeouts: Moderate
• Wrong or bad encryption/certificates: Major
• Man-in-the-Middle attacks: Major

What can go wrong with infrastructure?
• Hardware failure: Crucial
• Infrastructure down/not responsive: Crucial
• Resource overload: Moderate
• Bad configuration (public vs private): Major
• Active hacking and exploits: Major

What can go wrong with applications?
• Tight coupling of services: Crucial
• Bad coding practices: Crucial
• Bugs and vulnerabilities: Moderate
• Unable to disable compute-intensive parts: Major
• Active hacking and exploits: Major

What are our human challenges?
• Bus factor: Crucial
• Single Point of Failure / Bottleneck: Crucial
• Diseases and strikes: Major
• Insider threats / (un)voluntary data breach: Major
• Staffing & Contracting: Moderate

Our remediation efforts

Monitor everything
If it moves, you track it
“Knowing is good, but knowing everything is better.”
Dave Eggers, quote from his book “The Circle”

Monitoring elements
• Disk space
• CPU
• Memory
• Provision time
• Deployment time
• Network throughput
• Total logins
• Successful logins
• Failed logins
• Server requests
• Page load
• Data load
• Queue size
• Cache ratio
• DB query times
• Active sessions
• Total sessions
• Global ping times
• Alerts
• HTTP responses
• Much more...

One of our dashboards

Principles of Chaos Engineering
1. Build a Hypothesis around Steady State Behavior
2. Vary Real-world Events
3. Run Experiments in Production
4. Automate Experiments to Run Continuously
5. Minimize Blast Radius
Source: https://principlesofchaos.org/
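As a rough illustration of how the first and last principles fit together, the sketch below checks a steady-state hypothesis, injects one fault, checks the hypothesis again and always rolls the fault back. `Service`, `kill_one_replica` and `run_experiment` are hypothetical names made up for this example; a real experiment would probe live metrics instead of a counter.

```python
from contextlib import contextmanager


class Service:
    """Hypothetical stand-in for a system under test."""

    def __init__(self, replicas):
        self.replicas = replicas

    def healthy(self):
        # Steady state for this sketch: at least one replica is serving.
        return self.replicas >= 1


@contextmanager
def kill_one_replica(service):
    """Inject a fault, and always undo it to keep the blast radius small."""
    service.replicas -= 1
    try:
        yield
    finally:
        service.replicas += 1


def run_experiment(steady_state, fault):
    """Verify steady state, inject the fault, then verify steady state again."""
    if not steady_state():
        return "aborted: system not in steady state before injection"
    with fault:
        held = steady_state()
    return "hypothesis held" if held else "hypothesis refuted"
```

For a service with two replicas, `run_experiment(svc.healthy, kill_one_replica(svc))` reports that the hypothesis held; with a single replica it is refuted, which is the kind of finding an experiment is meant to surface before a real outage does.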

How do we test network failures
Becoming evil at disrupting networks 👹
• Unplugging the master power from ISP connection
• Configure route table to non-routable gateways
• Set DNS to 127.0.0.1
• Syn-flood the network
• Create a bad SSL/TLS certificate
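Pointing DNS at 127.0.0.1 breaks name resolution for a whole machine. A much smaller, process-local variant of the same idea is sketched below: it temporarily replaces Python's `socket.getaddrinfo` so every lookup fails, which lets you see how a single application copes. This is an assumption-laden illustration rather than the technique from the talk; `broken_dns` and `resolve_or_fallback` are hypothetical helpers.

```python
import contextlib
import socket


@contextlib.contextmanager
def broken_dns():
    """Make every DNS lookup in this process fail until the block exits."""
    original = socket.getaddrinfo

    def failing_getaddrinfo(*args, **kwargs):
        raise socket.gaierror(socket.EAI_NONAME, "simulated DNS failure")

    socket.getaddrinfo = failing_getaddrinfo
    try:
        yield
    finally:
        socket.getaddrinfo = original  # always restore the real resolver


def resolve_or_fallback(host, fallback="unresolved"):
    """Example of code under test: degrade gracefully when DNS is down."""
    try:
        info = socket.getaddrinfo(host, 443)
        return info[0][4][0]  # first resolved address
    except socket.gaierror:
        return fallback
```

Inside `with broken_dns():`, `resolve_or_fallback("example.com")` returns the fallback value instead of raising, and the real resolver is back as soon as the block ends.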

How do we test infrastructure failures
Have you turned it off and on again? No, just off 🚦
• Switching off the power to devices
• Change credentials for services
• Turning off or destroying services
• Removing payment instructions from service providers
• Changing configurations in production to make services public

How do we test application failures
Giving it all that you got
• Putting applications under constant stress
• Providing wrong values for forms and API calls
• Disabling JavaScript and CSS
• Constantly running automated penetration tests
• Switching configurations (db connection becomes storage connection)
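"Providing wrong values for forms and API calls" can be approximated in a few lines. The sketch below feeds a list of hostile values into an input handler and records every case where the handler crashes instead of rejecting the value cleanly. `parse_quantity`, `HOSTILE_VALUES` and `fuzz` are hypothetical names for this example, and treating `ValueError` as the only acceptable rejection is an assumption.

```python
HOSTILE_VALUES = [
    "", "   ", "-1", "0", "1e3", "9" * 100, "１２",  # odd but string-typed input
    "1; DROP TABLE users", "\x00",
    None, 3.5, [],                                   # wrong types altogether
]


def parse_quantity(raw):
    """Hypothetical form handler: accept only ASCII digits in the range 1..1000."""
    if not isinstance(raw, str):
        raise ValueError("quantity must be a string")
    text = raw.strip()
    if not text.isascii() or not text.isdigit():
        raise ValueError("quantity must contain digits only")
    value = int(text)
    if not 1 <= value <= 1000:
        raise ValueError("quantity out of range")
    return value


def fuzz(handler, values):
    """Return (value, exception name) for every input that crashed the handler."""
    crashes = []
    for value in values:
        try:
            handler(value)
        except ValueError:
            continue  # a clean rejection is the desired behaviour
        except Exception as exc:
            crashes.append((value, type(exc).__name__))
    return crashes
```

Running `fuzz(parse_quantity, HOSTILE_VALUES)` yields an empty list, while a naive handler such as `lambda raw: int(raw.strip())` crashes on the non-string inputs.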

How do we challenge the human aspect
One can only be diverse and welcoming if you have tested it
• Social Engineering Tests
• Table-top Exercises / Business DnD Game nights
• Extract data from systems & score their sensitivity
• Internal Workshop & Certification Programs
• Switching roles

What about data?

Assume breach, always!
Adopt a zero-trust data policy
(Image: Zombie Attack, courtesy of Perth Zombie Apocalypse Simulation)

Less is more
The less data you have, the better you are protected against loss or corruption of that data
(Image: Empty vault, courtesy of Hang The Bankers)

Tools and references

A look into our chaos kitchen
• Gauntlt: be mean to your code and like it
• Put your app under constant stress
• PHPever PHP Mutation Testing Framework

Open source tools we use
• Phabricator
• OWASP Zed Attack Proxy

Resources for more information

Questions?
