Game Days: Failing for Fun and Profit

Andy Fleener
November 12, 2017

Talk from Chaos Day Twin Cities 11/10/2017

What happens when *that* doesn't work? This is my favorite question of all time. Complex systems are inherently unsafe and are in a constant state of failure. True system resiliency comes not from preventing failures, but from ensuring the system has the capacity *to* fail. At SportsEngine we've embraced failure; it's a critical part of how we learn and grow. It's become part of our culture: by running Game Days we've found a way to gamify failure.
In this talk I'll discuss how we run Game Days, the interesting outcomes we've seen by injecting failure into our systems, and the improvements to both computer systems and human systems.

Transcript

  1. @andyfleener ABOUT ME ▸ Platform Operations Manager ▸ Ruby Developer for 10 years ▸ In Operations for last 5 years ▸ New View Safety Nerd
  2. WHAT’S SYSTEMS THINKING? ▸ It can be very counterintuitive ▸ Success and Failure are not easily defined ▸ Emergent properties of a system are the feedback required to close the loop (Creative Commons Image: Liam Ross)
  3. A PROGRAM DESIGNED TO INCREASE RESILIENCE BY PURPOSELY INJECTING MAJOR FAILURES INTO CRITICAL SYSTEMS SEMI-REGULARLY TO DISCOVER FLAWS AND SUBTLE DEPENDENCIES. ACM Queue, Volume 10, Issue 9, “Resilience Engineering: Learning to Embrace Failure” (September 13, 2012)
  4. CHANGE A BASIC ASSUMPTION AND YOU HAVE CHANGED THE SYSTEM ITSELF. Eli Goldratt, THEORY OF CONSTRAINTS
  5. THE VALUE OF GAME DAYS ▸ Find latent failures ▸ Practice Incident Response ▸ Learn about your systems
  6. IT IS SAID THAT IF YOU KNOW YOUR ENEMIES AND KNOW YOURSELF, YOU WILL NOT BE IMPERILED IN A HUNDRED BATTLES. Sun Tzu, ART OF WAR
  7. THIS CONCEPT IS OLD ▸ Security teams have been doing “Red Team” exercises for decades ▸ Military organizations have been running war games since the 1800s, including one that predicted how the Japanese would attack Pearl Harbor 9 years before it happened
  8. GAME DAYS AS WE KNOW THEM ▸ Amazon and the “Master of Disaster,” Jesse Robbins ▸ Game Days at Etsy ▸ PagerDuty and Failure Fridays
  9. GAME DAYS AT SPORTSENGINE: A HISTORY ▸ Started doing Game Days in 2013 ▸ The first Game Day was just 3 Operations Engineers busting our staging environment ▸ We’ve been running them quarterly-ish since ▸ Our last Game Day crossed 4 teams, with an attacking team of 5 and a responding team of 6
  10. ADVICE FOR THE RED TEAM ▸ Have more games prepared than you think you’ll actually get to ▸ Some games won’t have the impact you expect ▸ Fake it until you make it (alerts, monitors, support requests) ▸ Advanced technique: Apply constraints to responders
  11. USE TOOLS THAT EXIST ▸ Toxiproxy by Shopify ▸ Comcast by tylertreat ▸ Chaos Monkey by Netflix (Creative Commons Image: Toms River Fire Dept)
  12. ADVICE FOR THE BLUE TEAM ▸ Treat your response like it’s PRODUCTION ▸ Use this as a way to train new and junior engineers on your world-class Incident Response ▸ Time the response to create urgency
  13. THE FAKE DDOS ATTACK ▸ A classic real-world concern ▸ We’ve done this game multiple times in different ways ▸ The easiest way is to leave a “D” off (a DoS from a single source) ▸ It might be harder than you think ▸ Fake this by outlawing an IP block as a mitigation technique
  14. THE MISCONFIGURED NETWORK ▸ This is a super easy game: adjust a firewall rule in a critical location ▸ Another classic that happens throughout the internet ▸ This can be an easy way to see the devastating effects of high network latency
  15. THE SSL NEGOTIATION FAILURE ▸ Changed permissions on the SSL cert files ▸ Caused a weird negotiation failure state that was hard to debug ▸ This actually happened later due to a failed Chef configuration ▸ Big win because we understood the behavior when it happened
  16. THE LATENT BUG BOMB ▸ These are fun and super common real-world scenarios ▸ This is the perfect way to get a dev team involved ▸ I’ve done things like add command injection endpoints ▸ These are most effective when they target the biggest blast radius
  17. THE FORK BOMB ▸ This is a great one to run if you want to seriously trash some servers ▸ Fork bombs are super easy to write ▸ You can write a fork bomb in any language ▸ Here’s Ruby: loop { fork { load(__FILE__) } }
  18. ADDING CAPACITY TO FAIL IS THE KEY TO CREATING NOT JUST RELIABLE SYSTEMS BUT RESILIENT ONES
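
Slide 13 mentions an IP block as a mitigation in the fake DDoS game. The following is a minimal Ruby sketch of what checking traffic against a blocked range could look like, using only the standard library's IPAddr. It is an illustration, not something shown in the talk: the names BLOCKED_RANGES, blocked?, and apply_blocklist are hypothetical, and the addresses come from reserved documentation ranges rather than any real incident.

```ruby
require "ipaddr"

# Hypothetical blocklist for a Game Day. 203.0.113.0/24 is a reserved
# documentation range (TEST-NET-3), used here as a stand-in attacker block.
BLOCKED_RANGES = [IPAddr.new("203.0.113.0/24")].freeze

# Returns true when the client address falls inside any blocked range.
def blocked?(client_ip, ranges = BLOCKED_RANGES)
  addr = IPAddr.new(client_ip)
  ranges.any? { |range| range.include?(addr) }
end

# Splits a list of request source addresses into dropped and allowed,
# which is roughly what responders would want to verify after a block.
def apply_blocklist(source_ips, ranges = BLOCKED_RANGES)
  dropped, allowed = source_ips.partition { |ip| blocked?(ip, ranges) }
  { dropped: dropped, allowed: allowed }
end
```

A responder could feed a sample of source addresses from access logs into apply_blocklist to confirm that the block catches the simulated attacker without dropping legitimate traffic.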
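
Slide 15 describes a game where changed permissions on certificate files led to a failure that was hard to debug. One way to make that kind of drift visible is a small check that compares each file's permission bits to an expected mode. This is a sketch under assumptions: the talk does not say which permissions were expected or changed, so the helper name permission_drift, the modes 0644 and 0600, and the temporary file standing in for a certificate are all hypothetical.

```ruby
require "tempfile"

# Returns a hash of path => actual mode (as an octal string) for every
# file whose permission bits differ from expected_mode.
def permission_drift(paths, expected_mode)
  paths.each_with_object({}) do |path, drift|
    actual = File.stat(path).mode & 0o777
    drift[path] = format("%04o", actual) unless actual == expected_mode
  end
end

# Demo against a throwaway file standing in for a certificate.
cert = Tempfile.new("example-cert")
File.chmod(0o644, cert.path)
before = permission_drift([cert.path], 0o644) # expected mode, so no drift
File.chmod(0o600, cert.path)                  # the "game": change permissions
after = permission_drift([cert.path], 0o644)  # drift is now reported
puts "before: #{before.inspect}"
puts "after:  #{after.inspect}"
cert.close!
```

The check reads the mode bits directly instead of testing readability, because a process running as root can usually read a file regardless of its mode and would hide the change.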