Oops, I Broke It Again - Practicing Safe Chaos in the Cloud

Oops, I Broke It Again: Practicing Safe Chaos in the Cloud
DevOpsDays Istanbul, 2025
Sena Yakut, 2025

About Me!
Sena Yakut, Cloud Security Architect @CyberWhiz

Cloud Systems Are Complex, And That’s Okay
o Cloud environments are dynamic and interconnected.
o Failures are expected in complex systems.
o Small misconfigurations can cause big issues.
o Security is always a challenge (I hope it’s not).

So… What Is Security Chaos Engineering in the Cloud?

o Chaos isn’t the enemy, blind spots are.
o Chaos isn’t random → it’s controlled experimentation.
o The goal: prove your security actually works under stress.

Fail Intentionally and Wisely
Okay, but how do we do this safely?

Challenges of Security Chaos Engineering
o Cultural resistance: teams fear “breaking” security systems.
o Defining safe boundaries: chaos must never become real risk.
o Tooling gaps: fewer automation tools for security chaos vs. reliability chaos.
o Complex validation: proving improvement after chaos is tricky.
o Time & ownership: unclear who owns “safe breaking” in the org.
The hardest part isn’t breaking systems → it’s convincing people it’s worth doing.

Overcoming the Challenges
o Start small and safe → test in staging or a production-identical environment, not real production.
o Use automation and stop conditions to control chaos safely.
o Make it cross-team → involve DevSecOps, SREs, and Security together.
o Treat experiments as learning, not blame.
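One way to picture "automation and stop conditions" is a small runner that injects the fault, polls a stop condition, and always rolls back. This is a minimal sketch, not a tool from the talk: `run_experiment` and its `inject`, `rollback`, and `should_stop` callables are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def run_experiment(inject: Callable[[], None],
                   rollback: Callable[[], None],
                   should_stop: Callable[[], bool],
                   max_seconds: float = 60.0,
                   poll_seconds: float = 1.0) -> str:
    """Inject a fault, watch the stop condition, and always roll back."""
    deadline = time.monotonic() + max_seconds
    inject()
    try:
        while time.monotonic() < deadline:
            if should_stop():
                return "aborted"       # stop condition tripped
            time.sleep(poll_seconds)
        return "completed"             # ran for the full time window
    finally:
        rollback()                     # runs on abort, completion, or error
```

The `finally` block is the point of the design: the environment is restored whether the experiment finishes, is aborted, or raises.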

The Case of the Silent Alert
A cloud team wants to ensure that their security incident detection pipeline (security logs → security alerts → remediations) truly works when something goes wrong. They decide to run a security chaos experiment to simulate a failure.

Phases of Security Chaos Engineering
Steady → Hypothesis → Run Exp → Verify → Improve

The Case of the Silent Alert - 1
Steady
The security monitoring system runs smoothly. Dashboards show “All Systems Healthy.” This is the steady state: everything appears normal. The silence of alerts is comforting… but unverified.

The Case of the Silent Alert - 2
Hypothesis
When _________ happens, ________ system will notify the team within _______ and the application’s metric _________ will remain at ________.
o Brainstorming
o Threat modeling
o Think through all the security-based scenarios
o “What if?”

2 Hypothesis
1. WAF Alerting
When malicious requests (e.g., SQLi or bot traffic) arrive, the WAF and CloudWatch system will notify the team within 1 minute, and the application’s metric AllowedRequests/BlockedRequests ratio will remain within the safe threshold (<10% malicious allowed).

2 Hypothesis
2. GuardDuty + Remediation (Malware)
When GuardDuty detects malware or a compromised instance, the automated remediation Lambda will notify the team within 2 minutes, and the application’s metric instance network connections will remain at zero (isolated state).
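The fill-in-the-blank template can also be kept as data, so each hypothesis carries its own time limit and expected metric value for later checks. A minimal sketch, assuming a hypothetical `Hypothesis` class; the field names are illustrative and the values are taken from the two hypotheses above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    event: str            # what goes wrong, written as a clause
    notifier: str         # the system expected to notify the team
    notify_within_s: int  # notification deadline in seconds
    metric: str           # application metric to watch
    expected: str         # value the metric should remain at

    def statement(self) -> str:
        return (f"When {self.event}, {self.notifier} will notify the team "
                f"within {self.notify_within_s} seconds, and the application's "
                f"metric {self.metric} will remain at {self.expected}.")


waf = Hypothesis(
    event="malicious requests (e.g., SQLi or bot traffic) arrive",
    notifier="the WAF and CloudWatch system",
    notify_within_s=60,
    metric="AllowedRequests/BlockedRequests ratio",
    expected="a safe threshold (<10% malicious allowed)",
)

guardduty = Hypothesis(
    event="GuardDuty detects malware or a compromised instance",
    notifier="the automated remediation Lambda",
    notify_within_s=120,
    metric="instance network connections",
    expected="zero (isolated state)",
)

print(waf.statement())
print(guardduty.statement())
```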

3 Run Exp
1. WAF Alerting Test
o Send simulated SQLi/bot traffic to test rules.
o Watch BlockedRequests and alert timing.
o Ensure app stays stable, alerts arrive in <1 min.
o Stop test if latency spikes.
Keep it safe, controlled, and well-logged. Chaos ≠ production outage!
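The WAF test steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's tooling: `run_waf_probe` and `SQLI_PROBES` are hypothetical names, the `send` callable is supplied by the caller and should point only at a test endpoint, blocked requests are assumed to return HTTP 403 (the default AWS WAF block response), and per-request latency stands in for the "latency spike" stop condition.

```python
import time
from typing import Callable, Iterable

# Well-known SQLi-style probe strings, for illustration only.
SQLI_PROBES = ["' OR '1'='1", "1; DROP TABLE users--", "' UNION SELECT NULL--"]


def run_waf_probe(send: Callable[[str], int],
                  payloads: Iterable[str],
                  max_latency_s: float = 2.0) -> dict:
    """Send each payload via `send` (returns an HTTP status code).

    Counts 403 responses as blocked and everything else as allowed.
    Stops early if a single request takes longer than max_latency_s.
    """
    result = {"blocked": 0, "allowed": 0, "stopped_early": False}
    for payload in payloads:
        started = time.monotonic()
        status = send(payload)
        latency = time.monotonic() - started
        result["blocked" if status == 403 else "allowed"] += 1
        if latency > max_latency_s:
            result["stopped_early"] = True   # stop condition: latency spike
            break
    return result
```

Passing `send` in as a parameter keeps the probe independent of any HTTP library and makes it easy to dry-run against a fake before touching a real endpoint.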

3 Run Exp
2. GuardDuty + Remediation Test
o Simulate malware behavior (safe trigger or EICAR).
o Check GuardDuty → EventBridge → Lambda → SOC flow.
o Confirm instance isolation & alert delivery (<2 min).
o Use test-only instance, rollback after.
Keep it safe, controlled, and well-logged. Chaos ≠ production outage!
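The Lambda step of the GuardDuty → EventBridge → Lambda → SOC flow might look something like the sketch below. It is an assumption about how such a remediation could be written, not the code used in the talk: `remediate`, `ISOLATION_SG_ID`, and `SOC_TOPIC_ARN` are hypothetical, and the clients are passed in so the logic can be exercised with fakes. In a real Lambda they would typically be `boto3.client("ec2")` and `boto3.client("sns")`.

```python
import json
from typing import Any, Optional

# Hypothetical identifiers, for illustration only.
ISOLATION_SG_ID = "sg-0123456789abcdef0"   # a security group with no rules
SOC_TOPIC_ARN = "arn:aws:sns:eu-west-1:111122223333:soc-alerts"


def remediate(event: dict, ec2: Any, sns: Any) -> Optional[str]:
    """Isolate the instance named in a GuardDuty finding and notify the SOC."""
    detail = event.get("detail", {})
    instance_id = (detail.get("resource", {})
                         .get("instanceDetails", {})
                         .get("instanceId"))
    if not instance_id:
        return None                        # finding is not about an EC2 instance

    # Replace all security groups with the isolation group.
    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  Groups=[ISOLATION_SG_ID])

    # Tell the SOC what was isolated and why.
    sns.publish(TopicArn=SOC_TOPIC_ARN,
                Subject="GuardDuty remediation: instance isolated",
                Message=json.dumps({"instance": instance_id,
                                    "finding": detail.get("type")}))
    return instance_id
```

For the rollback step, the original security groups would need to be recorded before the swap; that part is left out here to keep the sketch short.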

4 Verify
1. WAF Alerting
o After a burst of SQL injection traffic, the WAF detected and blocked 95% of malicious requests within 1 minute.
o However, the CloudWatch alarm for “BlockedRequests > Threshold” didn’t trigger because the metric filter name was outdated.
o During the test, the application stayed healthy → no user-facing downtime.
Analyze, document, collect lessons learned.
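The verification above can be expressed as a check against the hypothesis thresholds. A minimal sketch with a hypothetical `verify_waf_hypothesis` helper; it assumes the "<10% malicious allowed" threshold means the share of malicious requests that got through, and that a missing alert is recorded as `None`.

```python
from typing import List, Optional


def verify_waf_hypothesis(blocked: int,
                          allowed: int,
                          alert_delay_s: Optional[float],
                          max_allowed_ratio: float = 0.10,
                          max_alert_delay_s: float = 60.0) -> List[str]:
    """Return the list of ways the observation contradicts the hypothesis."""
    total = blocked + allowed
    if total == 0:
        return ["no malicious requests were observed"]

    failures = []
    allowed_ratio = allowed / total
    if allowed_ratio >= max_allowed_ratio:
        failures.append(f"{allowed_ratio:.0%} of malicious requests were allowed")
    if alert_delay_s is None:
        failures.append("alert never fired")
    elif alert_delay_s > max_alert_delay_s:
        failures.append(f"alert arrived after {alert_delay_s:.0f}s")
    return failures


# Numbers from the slide: 95% blocked, but the CloudWatch alarm stayed silent.
print(verify_waf_hypothesis(blocked=95, allowed=5, alert_delay_s=None))
# prints ['alert never fired']
```

The blocking half of the hypothesis holds while the notification half fails, which is exactly the silent-alert gap the experiment was meant to expose.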

4 Verify
2. GuardDuty + Remediation (Malware)
o After a simulated malware file was placed on an EC2 instance, GuardDuty generated a finding within 2 minutes and triggered the remediation Lambda.
o Network activity dropped to zero, confirming isolation worked.
o However, isolation caused performance issues.
Analyze, document, collect lessons learned.

5 Improve
o Summarize what broke, what worked, and who felt it.
o Fix gaps: alerts, permissions, automation.
o Re-run experiments after improvements.
o Apply lessons across other systems.
o Automate where humans slowed down.
o Turn chaos reports into new playbooks.
Resources: AWS Incident Response Playbook Samples; Forming a Chaos Engineering Team

Final Words – Chaos, But Make It Fun
o Chaos isn’t scary when you own the experiment.
o Every “oops” in testing saves a “WTF” in production.
o Build chaos muscles. Laugh at your alerts.
o If your system never breaks… are you sure it’s running?

Thank you!
