Slide 1

Slide 1 text

GAME DAYS: FAILING FOR FUN AND PROFIT
BY ANDY FLEENER (@andyfleener)

Slide 2

Slide 2 text

HI, I’M ANDY

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

ABOUT ME
▸ Platform Operations Manager
▸ Ruby Developer for 10 years
▸ In Operations for the last 5 years
▸ New View Safety Nerd

Slide 5

Slide 5 text

WHAT’S A “NEW VIEW SAFETY NERD”?

Slide 6

Slide 6 text

BAD IDEA: BLAME AND PUNISH
GOOD IDEA: LEARN AND GROW

Slide 7

Slide 7 text

SYSTEMS THINKING!

Slide 8

Slide 8 text

WHAT’S SYSTEMS THINKING?
▸ It can be very counterintuitive
▸ Success and Failure are not easily defined
▸ Emergent properties of a system are the feedback required to close the loop
Creative Commons Image: Liam Ross

Slide 9

Slide 9 text

COMPLEX SYSTEMS ARE PRETTY DOPE
Creative Commons Image: John Smith

Slide 10

Slide 10 text

#REALTALK: I ENJOY BREAKING THINGS

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

WHICH IS GOOD BECAUSE

Slide 13

Slide 13 text

LIVE LOOK AT THE INTERNET

Slide 14

Slide 14 text

HERE’S THE BILLIONS AND BILLIONS OF E-COINS QUESTION

Slide 15

Slide 15 text

IF THE SYSTEM IS UNSAFE, HOW DO WE KEEP IT RUNNING?

Slide 16

Slide 16 text

FAILURE INJECTION!

Slide 17

Slide 17 text

ENTER GAME DAYS

Slide 18

Slide 18 text

“A PROGRAM DESIGNED TO INCREASE RESILIENCE BY PURPOSELY INJECTING MAJOR FAILURES INTO CRITICAL SYSTEMS SEMI-REGULARLY TO DISCOVER FLAWS AND SUBTLE DEPENDENCIES.”
ACM Queue, Volume 10, Issue 9, “Resilience Engineering: Learning to Embrace Failure” (September 13, 2012)

Slide 19

Slide 19 text

“CHANGE A BASIC ASSUMPTION AND YOU HAVE CHANGED THE SYSTEM ITSELF.”
Eli Goldratt, THEORY OF CONSTRAINTS

Slide 20

Slide 20 text

IT’S THE SYSTEM, MAN

Slide 21

Slide 21 text

THE VALUE OF GAME DAYS
▸ Find latent failures
▸ Practice Incident Response
▸ Learn about your systems

Slide 22

Slide 22 text

GAME DAYS: A BRIEF INCOMPLETE HISTORY

Slide 23

Slide 23 text

“IT IS SAID THAT IF YOU KNOW YOUR ENEMIES AND KNOW YOURSELF, YOU WILL NOT BE IMPERILED IN A HUNDRED BATTLES.”
Sun Tzu, ART OF WAR

Slide 24

Slide 24 text

THIS CONCEPT IS OLD
▸ Security teams have been doing “Red Team” exercises for decades
▸ Military organizations have been doing war games since the 1800s, including predicting how the Japanese would attack Pearl Harbor 9 years before it happened

Slide 25

Slide 25 text

THE GAME DAY AS WE KNOW IT
▸ Amazon and the “Master of Disaster,” Jesse Robbins
▸ Game Days at Etsy
▸ PagerDuty and Failure Fridays

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

WHAT ABOUT SPORTSENGINE?

Slide 28

Slide 28 text

GAME DAYS AT SPORTSENGINE: A HISTORY
▸ Started doing Game Days in 2013
▸ The first Game Day was just 3 Operations Engineers busting our staging environment
▸ We’ve been running them quarterly-ish since
▸ Our last Game Day spanned 4 teams, with an attacking team of 5 and a responding team of 6

Slide 29

Slide 29 text

LOGISTICS

Slide 30

Slide 30 text

PICK A RED TEAM

Slide 31

Slide 31 text

ENSURE YOU HAVE A BLUE TEAM

Slide 32

Slide 32 text

TREAT GAMES AS PRODUCTION INCIDENTS BY FOLLOWING THE FULL INCIDENT RESPONSE LIFECYCLE

Slide 33

Slide 33 text

GET MULTIPLE TEAMS INVOLVED!

Slide 34

Slide 34 text

OPERATIONAL CONCERNS ARE BUSINESS CONCERNS

Slide 35

Slide 35 text

TAKE IT SERIOUSLY, BUT DON’T FORGET TO HAVE FUN

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

ADVICE FOR THE RED TEAM
▸ Have more games prepared than you think you’ll actually get to
▸ Some games won’t have the impact you expect
▸ Fake it until you make it (alerts, monitors, support requests)
▸ Advanced technique: Apply constraints to responders

Slide 38

Slide 38 text

USE TOOLS THAT EXIST
▸ Toxiproxy by Shopify
▸ Comcast by tylertreat
▸ Chaos Monkey by Netflix
Creative Commons Image: Toms River Fire Dept

Slide 39

Slide 39 text

ADVICE FOR THE BLUE TEAM
▸ Treat your response like it’s PRODUCTION
▸ Use this as a way to train New/Junior Engineers on your world-class Incident Response
▸ Time the response to create urgency
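
One way to time the response is a small stopwatch that the game facilitator updates as the blue team hits each milestone. The sketch below is only an illustration: the `GameClock` class and the milestone names (`:detected`, `:mitigated`, `:resolved`) are hypothetical and do not come from the talk. The clock is injectable so the example runs without waiting in real time.

```ruby
# Hypothetical stopwatch for a single game: records how long after the
# start of the game the blue team reached each milestone.
class GameClock
  MILESTONES = %i[detected mitigated resolved].freeze

  # The clock is a callable returning seconds; defaults to a monotonic clock.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @started_at = @clock.call
    @marks = {}
  end

  # Record a milestone once; repeated calls keep the first reading.
  def mark(milestone)
    raise ArgumentError, "unknown milestone: #{milestone}" unless MILESTONES.include?(milestone)
    @marks[milestone] ||= @clock.call - @started_at
  end

  # Seconds from game start to each recorded milestone.
  def report
    @marks.transform_values { |seconds| seconds.round(1) }
  end
end

# Usage with a fake clock that advances 90 seconds on every reading.
ticks = [0, 90, 180, 270].each
game = GameClock.new(clock: -> { ticks.next })
game.mark(:detected)
game.mark(:mitigated)
report = game.report
```

Reading the elapsed times out loud during the game, or posting them in the incident channel, is one way to create the urgency the slide asks for.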

Slide 40

Slide 40 text

EXAMPLE GAME CASE STUDIES

Slide 41

Slide 41 text

THE FAKE DDOS ATTACK
▸ A classic real-world concern
▸ We’ve done this game multiple times in different ways
▸ The easiest way is to leave a “D” off (a plain DoS)
▸ It might be harder than you think
▸ Fake this by outlawing an IP block as a mitigation technique

Slide 42

Slide 42 text

THE MISCONFIGURED NETWORK
▸ This is a super easy game: adjust a firewall rule in a critical location
▸ Another classic that happens throughout the internet
▸ This can be an easy way to see the devastating effects of high network latency
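
Why latency is so damaging can be shown with back-of-the-envelope arithmetic. The sketch below assumes a simple model that is not from the talk: a request makes a fixed number of sequential calls to a dependency, and every call pays the injected latency. The function names and all numbers are hypothetical and purely illustrative.

```ruby
# Hypothetical model: each request makes several sequential calls to a
# dependency, and each call pays the base round trip plus injected latency.
def request_time_ms(sequential_calls:, base_rtt_ms:, injected_ms: 0)
  sequential_calls * (base_rtt_ms + injected_ms)
end

# Requests per second a single worker can serve back to back.
def worker_throughput(request_ms)
  1000.0 / request_ms
end

# Illustrative numbers only: 20 sequential calls at a 1 ms round trip,
# then the same workload with 100 ms of injected latency per call.
healthy  = request_time_ms(sequential_calls: 20, base_rtt_ms: 1)
degraded = request_time_ms(sequential_calls: 20, base_rtt_ms: 1, injected_ms: 100)

puts "healthy:  #{healthy} ms per request, #{worker_throughput(healthy).round(2)} req/s per worker"
puts "degraded: #{degraded} ms per request, #{worker_throughput(degraded).round(2)} req/s per worker"
```

Under these assumptions a modest per-call delay multiplies across every sequential call, which is why a single firewall or routing change can stall a whole worker pool.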

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

THE SSL NEGOTIATION FAILURE
▸ Changed permissions on the SSL cert files
▸ Caused a weird negotiation failure state that was hard to debug
▸ This actually happened later due to a failed Chef configuration
▸ Big win because we understood the behavior when it happened
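
A game like this is easier to run safely when the change reverts itself. The sketch below is a guess at the shape of such a helper, not the code used at SportsEngine: `with_broken_permissions` is a hypothetical name, the talk does not say which mode the files were changed to (`0o000` is an assumption), and the example operates on a throwaway temp file standing in for a cert.

```ruby
require "tempfile"

# Hypothetical helper: strip permissions from a file, yield to the caller,
# then restore the original mode even if the block raises.
def with_broken_permissions(path, broken_mode: 0o000)
  original_mode = File.stat(path).mode & 0o777
  File.chmod(broken_mode, path)
  yield
ensure
  File.chmod(original_mode, path) if original_mode
end

# Usage against a temp file that stands in for a cert (never a real one here).
cert = Tempfile.new(["staging-cert", ".pem"])
File.chmod(0o644, cert.path)

mode_during = nil
with_broken_permissions(cert.path) do
  # While the block runs, this is where the blue team would be responding.
  mode_during = File.stat(cert.path).mode & 0o777
end
mode_after = File.stat(cert.path).mode & 0o777
cert.close!
```

The `ensure` clause is the design choice that matters: if the game is aborted partway through, the file still returns to its original mode.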

Slide 45

Slide 45 text

THE LATENT BUG BOMB
▸ These are fun and super common real-world scenarios
▸ This is the perfect way to get a dev team involved
▸ I’ve done things like add command injection endpoints
▸ These are most effective when you find the biggest blast radius

Slide 46

Slide 46 text

THE FORK BOMB
▸ This is a great one to run if you want to seriously trash some servers
▸ Fork bombs are super easy to write
▸ You can write a fork bomb in any language
▸ Here’s Ruby: loop { fork { load(__FILE__) } }

Slide 47

Slide 47 text

AS YOU CAN TELL, I ENJOY BEING ON THE RED TEAM

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

THE PRIME DIRECTIVE: LEARN

Slide 51

Slide 51 text

BY LEARNING, OUR SYSTEM CAN CONTINUE TO GROW AND IMPROVE

Slide 52

Slide 52 text

WE’RE ACTIVELY ADDING CAPACITY TO THE SYSTEM TO FAIL

Slide 53

Slide 53 text

ADDING CAPACITY TO FAIL IS THE KEY TO CREATING NOT JUST RELIABLE SYSTEMS BUT RESILIENT ONES

Slide 54

Slide 54 text

TOOLS AND TOOLING BREAKOUT SESSION

Slide 55

Slide 55 text

STAY AWESOME AND THANKS!