Oops, I Broke It Again - Practicing Safe Chaos in the Cloud

Sena Yakut

November 02, 2025
Transcript

  1. Oops, I Broke It Again: Practicing Safe Chaos in the Cloud
    DevOpsDays Istanbul, 2025
    Sena Yakut, 2025
  2. Cloud Systems Are Complex, And That’s Okay
    o Cloud environments are dynamic and interconnected.
    o Failures are expected in complex systems.
    o Small misconfigurations can cause big issues.
    o Security is always a challenge (I hope it’s not).
  3. So… What Is Security Chaos Engineering in the Cloud?
    o Chaos isn’t the enemy, blind spots are.
    o Chaos isn’t random → it’s controlled experimentation.
    o The goal: prove your security actually works under stress.
  4. So… What Is Security Chaos Engineering in the Cloud?
    Fail Intentionally and Wisely
    Okay, but how do we do this safely?
  5. Challenges of Security Chaos Engineering
    o Cultural resistance: teams fear “breaking” security systems.
    o Defining safe boundaries: chaos must never become real risk.
    o Tooling gaps: fewer automation tools for security chaos vs. reliability chaos.
    o Complex validation: proving improvement after chaos is tricky.
    o Time & ownership: unclear who owns “safe breaking” in the org.
    The hardest part isn’t breaking systems → it’s convincing people it’s worth doing.
  6. Overcoming the Challenges
    o Start small and safe → test in staging or a production-identical environment, not real production.
    o Use automation and stop conditions to control chaos safely.
    o Make it cross-team → involve DevSecOps, SREs, and Security together.
    o Treat experiments as learning, not blame.
  7. The Case of the Silent Alert
    A cloud team wants to ensure that their security incident detection pipeline (security logs → security alerts → remediations) truly works when something goes wrong. They decide to run a security chaos experiment to simulate a failure.
  8. The Case of the Silent Alert - 1 Steady State
    The security monitoring system runs smoothly. Dashboards show “All Systems Healthy.” This is the steady state: everything appears normal. The silence of alerts is comforting… but unverified.
  9. The Case of the Silent Alert - 2 Hypothesis
    When _________ happens, ________ system will notify the team within _______ and the application’s metric _________ will remain at ________.
    o Brainstorming
    o Threat modeling
    o Think through all the security-based scenarios
    o “What if?”
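
The fill-in-the-blank template above can be captured as a small data structure so that every experiment states its hypothesis the same way. This is only a sketch: the class name `Hypothesis`, its field names, and the choice to express the deadline in seconds are illustrative assumptions, not something taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hypothesis:
    """One filled-in copy of the fill-in-the-blank template (hypothetical type)."""
    trigger: str            # "When ___ happens"
    notifier: str           # "___ system will notify the team"
    notify_within_s: float  # "within ___", expressed in seconds
    metric: str             # "the application's metric ___"
    expected: str           # "will remain at ___"

    def statement(self) -> str:
        """Render the hypothesis as a sentence in the template's shape."""
        return (
            f"When {self.trigger} happens, {self.notifier} system will notify "
            f"the team within {self.notify_within_s:g} seconds, and the "
            f"application's metric {self.metric} will remain at {self.expected}."
        )

    def alert_in_time(self, observed_delay_s: Optional[float]) -> bool:
        """True only if an alert arrived, and arrived inside the deadline.

        None means no alert was observed at all, which falsifies the hypothesis.
        """
        return (
            observed_delay_s is not None
            and observed_delay_s <= self.notify_within_s
        )
```

Treating "no alert at all" as an explicit `None` keeps a silent pipeline from being mistaken for a fast one.
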
  10. The Case of the Silent Alert - 2 Hypothesis
    1. WAF Alerting
    When malicious requests (e.g., SQLi or bot traffic) occur, the WAF and CloudWatch system will notify the team within 1 minute, and the application’s AllowedRequests/BlockedRequests ratio will remain at a safe threshold (<10% malicious allowed).
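
As a rough illustration of how this hypothesis could be evaluated, the sketch below reads "<10% malicious allowed" as the share of simulated malicious requests that the WAF let through. That reading, the function names, and the idea that the counts come from the experiment's own request log are assumptions; the slide does not spell out how the ratio is computed.

```python
from typing import Optional


def malicious_allowed_fraction(allowed: int, blocked: int) -> float:
    """Fraction of the simulated malicious requests that the WAF let through."""
    total = allowed + blocked
    if total == 0:
        raise ValueError("no simulated requests were counted")
    return allowed / total


def waf_hypothesis_holds(
    allowed: int,
    blocked: int,
    alert_delay_s: Optional[float],
    max_allowed_fraction: float = 0.10,  # "<10% malicious allowed"
    max_alert_delay_s: float = 60.0,     # "within 1 minute"
) -> bool:
    """Both halves must hold: traffic was blocked AND the team was told in time."""
    blocked_enough = (
        malicious_allowed_fraction(allowed, blocked) < max_allowed_fraction
    )
    # None means no alert ever arrived.
    alerted = alert_delay_s is not None and alert_delay_s <= max_alert_delay_s
    return blocked_enough and alerted
```

For example, 5 allowed and 95 blocked requests give a fraction of 0.05, which satisfies the blocking half, but the hypothesis as a whole still fails if `alert_delay_s` is `None`.
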
  11. The Case of the Silent Alert - 2 Hypothesis
    2. GuardDuty + Remediation (Malware)
    When GuardDuty detects malware or a compromised instance, the automated remediation Lambda will notify the team within 2 minutes, and the application’s metric instance network connections will remain at zero (isolated state).
  12. The Case of the Silent Alert - 3 Run Experiment
    1. WAF Alerting Test
    o Send simulated SQLi/bot traffic to test rules.
    o Watch BlockedRequests and alert timing.
    o Ensure the app stays stable and alerts arrive in <1 min.
    o Stop the test if latency spikes.
    Keep it safe, controlled, and well-logged. Chaos ≠ production outage!
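
The "stop the test if latency spikes" rule can be built into the experiment loop itself. The sketch below is one possible shape, not the speaker's tooling: `run_waf_experiment`, the injected `send` callable (which would wrap a real HTTP client against a test endpoint), and the injectable `clock` are all hypothetical names chosen so the loop can be exercised without any network access.

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple

log = logging.getLogger("chaos.waf")


def run_waf_experiment(
    payloads: Iterable[str],
    send: Callable[[str], int],
    max_latency_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[Tuple[str, int, float]], bool]:
    """Send payloads one at a time and abort on the first latency spike.

    Returns (results, stopped_early); each result is
    (payload, status_code, latency_seconds).
    """
    results: List[Tuple[str, int, float]] = []
    for payload in payloads:
        started = clock()
        status = send(payload)  # e.g. 403 when the WAF blocks the request
        latency = clock() - started
        results.append((payload, status, latency))
        log.info("payload=%r status=%s latency=%.3fs", payload, status, latency)
        if latency > max_latency_s:  # stop condition
            log.warning(
                "latency %.3fs exceeded %.3fs, stopping experiment",
                latency, max_latency_s,
            )
            return results, True
    return results, False
```

Logging every request keeps the run well-documented, and returning `stopped_early` separately lets the report distinguish an aborted run from a completed one.
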
  13. The Case of the Silent Alert - 3 Run Experiment
    2. GuardDuty + Remediation Test
    o Simulate malware behavior (safe trigger or EICAR).
    o Check GuardDuty → EventBridge → Lambda → SOC flow.
    o Confirm instance isolation & alert delivery (<2 min).
    o Use a test-only instance, roll back after.
    Keep it safe, controlled, and well-logged. Chaos ≠ production outage!
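
To make the GuardDuty → EventBridge → Lambda step more concrete, here is a minimal sketch of the decision logic such a remediation function might contain. It assumes the standard shape of a GuardDuty finding delivered through EventBridge (`source`, `detail-type`, and an instance ID under `detail.resource.instanceDetails`). The function `plan_remediation`, the action names, and the `test_instances` allowlist are hypothetical, and the actual isolation and notification calls are deliberately left to the caller.

```python
from typing import Any, Dict, Set

# EventBridge rule pattern that matches GuardDuty findings.
GUARDDUTY_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
}


def plan_remediation(
    event: Dict[str, Any], test_instances: Set[str]
) -> Dict[str, str]:
    """Decide what to do with one EventBridge event.

    Pure function: it returns a plan and performs no AWS calls itself.
    """
    if event.get("source") != "aws.guardduty":
        return {"action": "ignore", "reason": "not a GuardDuty event"}

    detail = event.get("detail", {})
    instance_id = (
        detail.get("resource", {}).get("instanceDetails", {}).get("instanceId")
    )
    if not instance_id:
        return {"action": "ignore", "reason": "finding has no EC2 instance"}

    finding_type = detail.get("type", "unknown")

    # Experiment-mode guard (an assumption of this sketch): only isolate
    # instances that were explicitly set aside for the chaos experiment.
    if instance_id not in test_instances:
        return {
            "action": "notify_only",
            "instance_id": instance_id,
            "finding_type": finding_type,
        }

    return {
        "action": "isolate_and_notify",
        "instance_id": instance_id,
        "finding_type": finding_type,
    }
```

Keeping the decision separate from the side effects means the plan can be checked against recorded events before the experiment ever touches an instance.
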
  14. The Case of the Silent Alert - 4 Verify
    1. WAF Alerting
    o After a burst of SQL injection traffic, the WAF detected and blocked 95% of malicious requests within 1 minute.
    o However, the CloudWatch alarm for “BlockedRequests > Threshold” didn’t trigger because the metric filter name was outdated.
    o During the test, the application stayed healthy → no user-facing downtime.
    Analyze, document, collect lessons learned.
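
An alarm that watches a metric nobody emits stays silent forever, which is the failure this experiment exposed. One way to catch that kind of drift before the next run is to compare the metrics that alarms reference with the metrics that actually exist. The sketch below assumes both lists have already been collected (for instance from CloudWatch's DescribeAlarms and ListMetrics operations); `find_orphaned_alarms` and the alarm names in the usage note are made up for illustration.

```python
from typing import Dict, List, Set, Tuple

MetricKey = Tuple[str, str]  # (namespace, metric name)


def find_orphaned_alarms(
    alarms: Dict[str, MetricKey],
    emitted_metrics: Set[MetricKey],
) -> List[str]:
    """Return the names of alarms whose metric is not emitted by anything."""
    return sorted(
        name
        for name, metric in alarms.items()
        if metric not in emitted_metrics
    )
```

With an alarm `waf-legacy` pointing at `("Custom/WAF", "BlockedRequestsOld")` and only `("AWS/WAFV2", "BlockedRequests")` being emitted, the function would report `waf-legacy` as orphaned.
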
  15. The Case of the Silent Alert - 4 Verify
    2. GuardDuty + Remediation (Malware)
    o After a simulated malware file was placed on an EC2 instance, GuardDuty generated a finding within 2 minutes and triggered the remediation Lambda.
    o Network activity dropped to zero, confirming isolation worked.
    o But isolation causes performance issues.
    Analyze, document, collect lessons learned.
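
Claims such as "within 2 minutes" are easier to verify when the experiment records a timestamp at each stage. The following sketch computes time-to-detect and time-to-isolate from three ISO 8601 timestamps. The function names, the default 120-second budget, and the decision to apply that budget to the whole path from injection to isolation are assumptions of this example; any timestamps passed in would come from the experiment's own log.

```python
from datetime import datetime
from typing import Dict, Union


def elapsed_seconds(start_iso: str, end_iso: str) -> float:
    """Seconds between two ISO 8601 timestamps (both with a UTC offset)."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds()


def check_timeline(
    injected_at: str,
    finding_at: str,
    isolated_at: str,
    budget_s: float = 120.0,
) -> Dict[str, Union[float, bool]]:
    """Break the end-to-end delay into detection and isolation phases."""
    detect = elapsed_seconds(injected_at, finding_at)
    isolate = elapsed_seconds(finding_at, isolated_at)
    return {
        "time_to_detect_s": detect,
        "time_to_isolate_s": isolate,
        "within_budget": (detect + isolate) <= budget_s,
    }
```

Splitting the total into phases shows whether a missed deadline is a detection problem or a remediation problem.
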
  16. The Case of the Silent Alert - 5 Improve
    o Summarize what broke, what worked, and who felt it.
    o Fix gaps: alerts, permissions, automation.
    o Re-run experiments after improvements.
    o Apply lessons across other systems.
    o Automate where humans slowed down.
    o Turn chaos reports into new playbooks.
    AWS Incident Response Playbook Samples
    Forming a Chaos Engineering Team
  17. Final Words – Chaos, But Make It Fun
    o Chaos isn’t scary when you own the experiment.
    o Every “oops” in testing saves a “WTF” in production.
    o Build chaos muscles. Laugh at your alerts.
    o If your system never breaks… are you sure it’s running?