Slide 1

Slide 1 text

CHAOS ENGINEERING BOOTCAMP
TAMMY BUTOW, DROPBOX
VELOCITY SAN JOSE 2017

Slide 2

Slide 2 text

TAMMY BUTOW
ENGINEERING MANAGER, DROPBOX
@TAMMYBUTOW
CAUSING CHAOS IN PROD SINCE 2009

Slide 3

Slide 3 text

CASEY ROSENTHAL
ENGINEERING MANAGER, NETFLIX
@CASEYROSENTHAL
ASSISTING & ANSWERING YOUR CHAOS QUESTIONS

Slide 4

Slide 4 text

THE CHAOS BOOTCAMP
+ LAYING THE FOUNDATION (9:00 - 10:30)
+ MORNING BREAK (10:30 - 11:00)
+ CHAOS TOOLS (11:00 - 11:30)
+ ADVANCED TOPICS + Q & A (11:30 - 12:30)

Slide 5

Slide 5 text

THANKS TO
• DROPBOX
• NETFLIX
• DIGITALOCEAN
• GOOGLE
• AMAZON
• NATIONAL AUSTRALIA BANK
• DATADOG

Slide 6

Slide 6 text

PART I: LAYING THE FOUNDATION

Slide 7

Slide 7 text

WHAT IS CHAOS ENGINEERING?
CHAOS ENGINEERING IS THE DISCIPLINE OF EXPERIMENTING ON A DISTRIBUTED SYSTEM IN ORDER TO BUILD CONFIDENCE IN THE SYSTEM’S CAPABILITY TO WITHSTAND TURBULENT CONDITIONS IN PRODUCTION.

Slide 8

Slide 8 text

CHAOS ENGINEERING CAN BE THOUGHT OF AS THE FACILITATION OF EXPERIMENTS TO UNCOVER SYSTEMIC WEAKNESSES.

Slide 9

Slide 9 text

PRINCIPLES OF CHAOS ENGINEERING
1. DEFINE STEADY STATE
2. HYPOTHESIZE STEADY STATE WILL CONTINUE
3. INTRODUCE VARIABLES THAT REFLECT REAL WORLD EVENTS
4. TRY TO DISPROVE THE HYPOTHESIS
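The four principles above can be read as a small experiment loop. The following is a minimal sketch under stated assumptions, not tooling shown in the bootcamp: `measure_error_rate`, `inject_failure`, `run_experiment`, and the `FAKE_ERROR_RATE` variable are hypothetical placeholders, and the metric is assumed to be a whole-number percentage.

```shell
# Minimal sketch of the four principles as an experiment loop.
# Every function name and variable here is a hypothetical placeholder.

# 1. Define steady state as a single number (here: error rate in whole percent).
#    Replace this stub with a real query against your monitoring system.
measure_error_rate() {
  echo "${FAKE_ERROR_RATE:-1}"
}

# 3. Introduce a variable that reflects a real-world event.
#    Replace this stub with a real, reversible fault injection.
inject_failure() {
  echo "injecting failure (stub)" >&2
}

# 2 and 4. Hypothesize that steady state continues, then try to disprove it.
# Prints "holds" if the metric rose by at most $1 points, "disproved" otherwise.
run_experiment() {
  local tolerance="$1"
  local before after
  before="$(measure_error_rate)"
  inject_failure
  after="$(measure_error_rate)"
  if [ $(( after - before )) -le "$tolerance" ]; then
    echo "holds"
  else
    echo "disproved"
  fi
}
```

A call such as `run_experiment 5` treats a rise of more than five percentage points as evidence against the hypothesis. Keeping the measurement and the injection in separate functions makes it easy to swap in a real metric query and a real fault later.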

Slide 10

Slide 10 text

WHY DO DISTRIBUTED SYSTEMS NEED CHAOS?
DISTRIBUTED SYSTEMS HAVE MANY PARTS. HARDWARE AND FIRMWARE FAILURES ARE COMMON. OUR SYSTEMS AND COMPANIES SCALE RAPIDLY.
HOW DO YOU BUILD A RESILIENT SYSTEM WHILE YOU SCALE?
WE USE CHAOS!

Slide 11

Slide 11 text

FULL-STACK CHAOS INJECTION
YOU CAN INJECT CHAOS AT ANY LAYER TO INCREASE SYSTEM RESILIENCE AND SYSTEM KNOWLEDGE.
LAYERS: CACHING, HARDWARE, DATABASE, APPLICATION, RACK

Slide 12

Slide 12 text

WHO USES CHAOS ENGINEERING?
1. NETFLIX
2. DROPBOX
3. GOOGLE
4. NATIONAL AUSTRALIA BANK
5. JET

Slide 13

Slide 13 text

WHAT ARE COMMON EXCUSES TO NOT USE CHAOS ENGINEERING?
NO EXCUSES. GET READY FOR CHAOS.

Slide 14

Slide 14 text

HANDS-ON TUTORIAL (LET’S JUMP IN!)
NOW IT IS TIME TO CREATE CHAOS. WE WILL ALL BE DOING A HANDS-ON ACTIVITY WHERE WE INJECT FAILURE.

Slide 15

Slide 15 text

TIME TO USE YOUR SERVER
EVERYONE HAS A DIGITALOCEAN SERVER, USERNAME AND PASSWORD.
1. LOG IN WITH TERMINAL
2. VISIT YOUR IP IN YOUR BROWSER

Slide 16

Slide 16 text

YOU MUST BE MEASURING METRICS AND REPORTING ON THEM TO IMPROVE YOUR SYSTEM RESILIENCE.

Slide 17

Slide 17 text

CHAOS WITHOUT MONITORING IS FUTILE

Slide 18

Slide 18 text

THE LACK OF PROPER MONITORING IS NOT USUALLY THE SOLE CAUSE OF A PROBLEM, BUT IT IS OFTEN A SERIOUS CONTRIBUTING FACTOR. AN EXAMPLE IS THE NORTHEAST BLACKOUT OF 2003.
COMMON ISSUES INCLUDE:
+ HAVING THE WRONG TEAM DEBUG
+ NOT ESCALATING
+ NOT HAVING A BACKUP ON-CALL

Slide 20

Slide 20 text

A RACE CONDITION IN THE CONTROL SOFTWARE STALLED THE ALARM SYSTEM. THE LACK OF ALARMS LEFT OPERATORS UNAWARE OF THE NEED TO RE-DISTRIBUTE POWER AFTER OVERLOADED TRANSMISSION LINES HIT UNPRUNED FOLIAGE.

Slide 21

Slide 21 text

WHAT SHOULD YOU MEASURE?
1. AVAILABILITY: 500s
2. SERVICE SPECIFIC KPIs
3. SYSTEM METRICS: CPU, IO, DISK
4. CUSTOMER COMPLAINTS
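As one illustration of the availability item, the share of HTTP 5xx responses can be computed from a web server access log. This is a minimal sketch, assuming the common or combined log format in which the status code is the ninth whitespace-separated field; the function name `error_rate_5xx` is hypothetical, and the bootcamp itself relied on Datadog rather than log parsing.

```shell
# Sketch: percentage of HTTP 5xx responses in an access log read from stdin.
# Assumes the common/combined log format, where the status code is field 9.
# Lines whose ninth field is not a three-digit number are ignored.

error_rate_5xx() {
  awk '
    $9 ~ /^[0-9][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * errors / total
    }
  '
}
```

It could be run as `error_rate_5xx < access.log`, where `access.log` stands in for wherever your server writes its log. Recording this number before and during an experiment gives the steady-state comparison the earlier principles call for.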

Slide 22

Slide 22 text

CASE STUDY: KUBERNETES SOCK SHOP
1. UNDERSTAND SYSTEM
2. DETERMINE SLAs/SLOs/KPIs
3. SET UP MONITORING
4. INJECT CHAOS
5. MEASURE RESULTS
6. LEARN
7. INCREASE SYSTEM RESILIENCE

Slide 23

Slide 23 text

LUCKY YOU. YOUR MONITORING IS ALREADY UP.
1. DATADOG IS UP AND READY
2. THE AGENT IS ALREADY REPORTING METRICS FOR YOU!

Slide 25

Slide 25 text

CHAOS TYPES
[DIAGRAM LABELS: KNOWN, UNKNOWN, UNKNOWN, KNOWN]

Slide 26

Slide 26 text

LET’S INJECT KNOWN CHAOS
1. CHOOSE A SIMIAN ARMY SCRIPT
$ cd ~/SimianArmy/src/main/resources/scripts

Slide 27

Slide 27 text

LET’S INJECT KNOWN CHAOS
1. CHOOSE A SIMIAN ARMY SCRIPT
$ cd ~/SimianArmy/src/main/resources/scripts
chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ls
burncpu.sh  faildynamodb.sh  filldisk.sh           networklatency.sh
burnio.sh   failec2.sh       killprocesses.sh      networkloss.sh
faildns.sh  fails3.sh        networkcorruption.sh  nullroute.sh

Slide 28

Slide 28 text

$ vim burncpu.sh
#!/bin/bash
# Script for BurnCpu Chaos Monkey
cat << EOF > /tmp/infiniteburn.sh
#!/bin/bash
while true; do openssl speed; done
EOF
# 32 parallel 100% CPU tasks should hit even the biggest EC2 instances
for i in {1..32}
do
    nohup /bin/bash /tmp/infiniteburn.sh &
done
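The script above runs until its processes are killed. For experiments outside a throwaway server, a time-boxed variant can be easier to reason about. The following is a minimal sketch, not part of Simian Army: the function name `burn_cpu`, its arguments, and its defaults are hypothetical, and it assumes `timeout` from GNU coreutils is installed.

```shell
# Sketch: a time-boxed variant of the CPU burn idea.
# Unlike the script above, every worker stops on its own after a deadline.
# burn_cpu, its arguments, and its defaults are hypothetical.

burn_cpu() {
  local workers="${1:-2}"   # number of parallel busy loops
  local seconds="${2:-10}"  # how long each worker runs
  local i
  for (( i = 0; i < workers; i++ )); do
    # timeout(1) ends the busy loop once $seconds have elapsed.
    timeout "$seconds" bash -c 'while :; do :; done' &
  done
  wait  # block until every background worker has exited
}
```

For example, `burn_cpu 4 30` keeps four cores busy for about thirty seconds and then returns. Bounding the blast radius in time is a design choice: if the terminal session is lost, the load still goes away by itself.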

Slide 29

Slide 29 text

LET’S INJECT KNOWN CHAOS
chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ls
burncpu.sh  faildynamodb.sh  filldisk.sh           networklatency.sh
burnio.sh   failec2.sh       killprocesses.sh      networkloss.sh
faildns.sh  fails3.sh        networkcorruption.sh  nullroute.sh
chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ chmod +x burncpu.sh
chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ./burncpu.sh
nohup: nohup: nohup: appending output to 'nohup.out'
nohup: nohup: nohup: appending output to 'nohup.out'
nohup: nohup: nohup: nohup: nohup: appending output to 'nohup.out'
appending output to 'nohup.out'

Slide 30

Slide 30 text

CHAOS IN TOP

Slide 31

Slide 31 text

CHAOS IN DATADOG

Slide 32

Slide 32 text

LET’S STOP THE KNOWN CHAOS
1. KILL WHAT I RAN AS CHAOS USER
$ pkill -u chaos

Slide 33

Slide 33 text

NO MORE CHAOS IN TOP

Slide 34

Slide 34 text

DATADOG MONITORING

Slide 35

Slide 35 text

WHAT KIND OF CHAOS DO WE INJECT AT DROPBOX?
1. WE KILL MYSQL PRIMARY
2. WE KILL MYSQL REPLICA
3. WE KILL THE MYSQL PROXY

Slide 36

Slide 36 text

HOW DO WE MAKE MYSQL RESILIENT TO KILLS?
WE USE SEMI SYNC AND GROUP REPLICATION, AND WE CREATED A TOOL CALLED AUTO REPLACE TO DO CLONES AND PROMOTIONS.

Slide 37

Slide 37 text

CHAOS CREATES RESILIENCE

Slide 38

Slide 38 text

INJECT CHAOS IN YOUR SYSTEM

Slide 39

Slide 39 text

LET’S INJECT KNOWN CHAOS
chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ls
burncpu.sh  faildynamodb.sh  filldisk.sh           networklatency.sh
burnio.sh   failec2.sh       killprocesses.sh      networkloss.sh
faildns.sh  fails3.sh        networkcorruption.sh  nullroute.sh
chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ chmod +x burncpu.sh
chaos@kube-tammy:~/SimianArmy/src/main/resources/scripts$ ./burncpu.sh
nohup: nohup: nohup: appending output to 'nohup.out'
nohup: nohup: nohup: appending output to 'nohup.out'
nohup: nohup: nohup: nohup: nohup: appending output to 'nohup.out'
appending output to 'nohup.out'

Slide 40

Slide 40 text

WHAT TYPES OF CHAOS DID YOU INJECT?
WHAT WAS YOUR HYPOTHESIS?

Slide 41

Slide 41 text

30 MIN MORNING TEA BREAK
10:30 - 11:00
THANKS TO GOOGLE!

Slide 42

Slide 42 text

PART II: CHAOS TOOLS

Slide 43

Slide 43 text

WHAT TYPES OF CHAOS DID YOU INJECT?
WHAT WAS YOUR HYPOTHESIS?

Slide 44

Slide 44 text

SOME CHAOS CASE STUDIES...

Slide 45

Slide 45 text

LET’S GO BACK IN TIME TO LOOK AT THE WORST OUTAGE STORIES THAT LED TO THE INTRODUCTION OF CHAOS ENGINEERING.

Slide 46

Slide 46 text

CHAOS @ DROPBOX
DROPBOX’S WORST OUTAGE EVER
SOME MASTER-REPLICA PAIRS WERE IMPACTED, WHICH RESULTED IN THE SITE GOING DOWN.
https://blogs.dropbox.com/tech/2014/01/outage-post-mortem/

Slide 47

Slide 47 text

NOW WE HAVE CHAOS @ DROPBOX
1. CHAOS DAYS
2. RACK SHUTDOWN
3. SERVICE DRTs

Slide 48

Slide 48 text

QUICK THOUGHTS...
+ SO MANY WORST OUTAGE STORIES INVOLVE THE DATABASE.
+ I LEAD DATABASES AT DROPBOX & WE DO CHAOS.
+ FEAR WILL NOT HELP YOU SURVIVE “THE WORST OUTAGE”.
+ DO YOU TEST YOUR ALERTS & MONITORING? WE DO.
+ HOW VALUABLE IS A POSTMORTEM IF YOU DON’T HAVE ACTION ITEMS AND DO THEM? NOT VERY.

Slide 49

Slide 49 text

CHAOS @ UBER
UBER’S WORST OUTAGE EVER:
1. MASTER LOG REPLICATION TO S3 FAILED
2. LOGS BACKED UP ON PRIMARY
3. ALERTS FIRED TO AN ENGINEER BUT WERE IGNORED
4. DISK FILLED UP ON DATABASE PRIMARY
5. ENGINEER DELETED UNARCHIVED WAL FILES
6. ERROR IN CONFIG PREVENTED PROMOTION
SOURCE: Matt Ranney, UBER, YOW 2015

Slide 51

Slide 51 text

CHAOS @ UBER
+ UBER BUILT UDESTROY TO SIMULATE FAILURES.
+ DIDN’T USE NETFLIX SIMIAN ARMY AS IT WAS AWS-CENTRIC.
+ ENGINEERS AT UBER DON’T LIKE FAILURE TESTING (ESP. DATABASES). THIS IS DUE TO THEIR WORST OUTAGE EVER.
SOURCE: Matt Ranney, UBER, YOW 2015

Slide 52

Slide 52 text

CHAOS @ NETFLIX
SIMIAN ARMY CONSISTS OF SERVICES (MONKEYS) IN THE CLOUD FOR GENERATING VARIOUS KINDS OF FAILURES, DETECTING ABNORMAL CONDITIONS, AND TESTING THE ABILITY TO SURVIVE THEM. THE GOAL IS TO KEEP THE CLOUD SAFE, SECURE AND HIGHLY AVAILABLE.
+ CHAOS MONKEY
+ JANITOR MONKEY
+ CONFORMITY MONKEY

Slide 53

Slide 53 text

CHAOS @ GITLAB
GITLAB’S WORST OUTAGE EVER… KEEPS REPEATING
1. ACCIDENTAL REMOVAL OF DATA FROM PRIMARY DATABASE
2. DATABASE OUTAGE DUE TO PROJECT_AUTHORIZATIONS HAVING TOO MUCH BLOAT
3. CI DISTRIBUTED HEAVY POLLING AND EXCESSIVE ROW LOCKING FOR SECONDS TAKES GITLAB.COM DOWN
4. SCARY DATABASE SPIKES
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Slide 54

Slide 54 text

CHAOS @ GITLAB
GITLAB IS NOT YET DOING CHAOS ENGINEERING. IT SHOULD BE, FOR SURE.

Slide 55

Slide 55 text

CHAOS @ GOOGLE
GOOGLE RUNS DRTs (DISASTER RECOVERY TESTS) AND HAS DONE SO FOR MANY YEARS.

Slide 56

Slide 56 text

http://www.businessinsider.com/profile-of-google-disaster-recovery-testing-boss-kripa-krishnan-2016-8

Slide 57

Slide 57 text

CHAOS @ TYPESAFE
“RESILIENCE HAS TO BE DESIGNED. HAS TO BE TESTED. IT’S NOT SOMETHING THAT HAPPENS AROUND A TABLE AS A SLEW OF EXCEPTIONAL ENGINEERS ARCHITECT THE PERFECT SYSTEM. PERFECTION COMES THROUGH REPEATEDLY TRYING TO BREAK THE SYSTEM”
- VICTOR KLANG, TYPESAFE

Slide 58

Slide 58 text

CHAOS @ BUILDKITE
DECIDED TO REDUCE DATABASE CAPACITY IN AWS. RESULTED IN AN OUTAGE AT 3:21AM. PAGERDUTY WAS MISCONFIGURED AND PHONES WERE ON SILENT.
NOBODY WOKE UP DURING THE 4 HOUR OUTAGE...

Slide 59

Slide 59 text

OH NOES!

Slide 60

Slide 60 text

CHAOS @ STRIPE
“A DATABASE INDEX OPERATION RESULTED IN 90 MINUTES OF INCREASINGLY DEGRADED AVAILABILITY FOR THE STRIPE API AND DASHBOARD. IN AGGREGATE, ABOUT TWO THIRDS OF ALL API OPERATIONS FAILED DURING THIS WINDOW.”
https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc

Slide 61

Slide 61 text

HAVE I CONVINCED YOU?
INTRODUCING CHAOS IN A CONTROLLED WAY WILL RESULT IN ENGINEERS BUILDING INCREASINGLY RESILIENT SYSTEMS.

Slide 62

Slide 62 text

OUTAGES HAPPEN.
THERE ARE MANY MORE YOU CAN READ ABOUT HERE:
https://github.com/danluu/post-mortems

Slide 63

Slide 63 text

CHAOS MONKEY
YOU SET IT UP AS A CRON JOB THAT CALLS CHAOS MONKEY ONCE A WEEKDAY TO CREATE A SCHEDULE OF TERMINATIONS.
HAS BEEN AROUND FOR MANY YEARS! USED AT BANKS, E-COMMERCE STORES, TECH COMPANIES + MORE
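The weekday cron setup described above could look roughly like the entry printed by the helper below. This is a sketch under stated assumptions: the helper name `chaosmonkey_cron_entry`, the install path, the log path, and the 12:00 run time are hypothetical, and the `schedule` subcommand reflects how the Chaos Monkey documentation describes the scheduler, so verify it against the version you deploy.

```shell
# Sketch: print a cron.d entry that runs the Chaos Monkey scheduler once per weekday.
# The default paths and the 12:00 run time are assumptions for illustration only.

chaosmonkey_cron_entry() {
  local bin="${1:-/apps/chaosmonkey/chaosmonkey}"
  local log="${2:-/var/log/chaosmonkey-schedule.log}"
  # cron.d fields: minute hour day-of-month month day-of-week user command
  # "1-5" in the day-of-week field means Monday through Friday.
  printf '0 12 * * 1-5 root %s schedule >> %s 2>&1\n' "$bin" "$log"
}
```

The output is meant to be reviewed and then placed in a file under `/etc/cron.d/`. Printing the entry instead of installing it keeps the sketch free of side effects.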

Slide 64

Slide 64 text

https://medium.com/continuous-delivery-scale/running-chaos-monkey-on-spinnaker-google-compute-engine-gce-155dc52f20ef

Slide 65

Slide 65 text

https://netflix.github.io/chaosmonkey/

Slide 66

Slide 66 text

https://www.spinnaker.io/

Slide 67

Slide 67 text

SPINNAKER + CHAOS MONKEY
DEMO TIME

Slide 68

Slide 68 text

https://s3.amazonaws.com/quickstart-reference/spinnaker/latest/doc/spinnaker-on-the-aws-cloud.pdf

Slide 73

Slide 73 text

DEMO TIME

Slide 76

Slide 76 text

https://blog.spinnaker.io/running-chaos-monkey-on-spinnaker-google-compute-engine-gce-155dc52f20ef

Slide 78

Slide 78 text

CHAOS KONG TAKES DOWN AN ENTIRE AWS REGION. NETFLIX CREATED IT BECAUSE AWS HAD NOT YET BUILT THE ABILITY TO TEST THIS.
AWS REGION OUTAGES DO HAPPEN!

Slide 79

Slide 79 text

CHAOS FOR KUBERNETES
ASOBTI, AN ENGINEER @ BOX, CREATED https://github.com/asobti/kube-monkey
IT RANDOMLY DELETES KUBERNETES PODS IN THE CLUSTER, ENCOURAGING AND VALIDATING THE DEPLOYMENT OF FAILURE-RESILIENT SYSTEMS.
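The idea of randomly deleting pods can be illustrated with plain `kubectl`. This is a minimal sketch and not kube-monkey itself, which runs inside the cluster with its own scheduling and opt-in rules. The function names `pick_random` and `kill_random_pod` and the `DRY_RUN` switch are hypothetical, and `shuf` is assumed to be available from GNU coreutils.

```shell
# Sketch of random pod deletion with kubectl. This is NOT kube-monkey.
# Function names and the DRY_RUN switch are hypothetical.

# Pick one line at random from stdin.
pick_random() {
  shuf -n 1
}

# Delete one random pod in the given namespace.
# Defaults to a dry run that only prints the command it would execute.
kill_random_pod() {
  local namespace="$1"
  local pod
  # "-o name" prints one resource per line in the form pod/<name>.
  pod="$(kubectl get pods -n "$namespace" -o name | pick_random)"
  [ -n "$pod" ] || { echo "no pods found in $namespace" >&2; return 1; }
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: kubectl delete -n $namespace $pod"
  else
    kubectl delete -n "$namespace" "$pod"
  fi
}
```

Running `kill_random_pod sock-shop` only prints the chosen pod, and `DRY_RUN=0 kill_random_pod sock-shop` actually deletes it; the namespace name is an example. Defaulting to a dry run keeps an accidental invocation harmless.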

Slide 80

Slide 80 text

SIMIAN ARMY
A SUITE OF TOOLS FOR KEEPING YOUR CLOUD OPERATING IN TOP FORM. CHAOS MONKEY IS THE FIRST MEMBER. OTHER SIMIANS INCLUDE JANITOR MONKEY & CONFORMITY MONKEY.
https://github.com/Netflix/SimianArmy

Slide 81

Slide 81 text

GREMLIN INC
GREMLIN PROVIDES “FAILURE AS A SERVICE”. IT FINDS WEAKNESSES IN YOUR SYSTEM BEFORE THEY END UP IN THE NEWS.
LIKE A VACCINATION, IT SAFELY INJECTS HARM INTO YOUR SYSTEM TO BUILD IMMUNITY TO FAILURE.
https://gremlininc.com/

Slide 82

Slide 82 text

MONITORING + ALERTING
• DATADOG
• GRAFANA
• PAGERDUTY

Slide 83

Slide 83 text

PART III: ADVANCED TOPICS AND Q&A

Slide 84

Slide 84 text

CHAOS ENGINEERING FOR DATABASES
GOOD TO USE:
• MYSQL
• ORCHESTRATOR
• GROUP REPLICATION
• SEMI SYNC
https://github.com/github/orchestrator

Slide 85

Slide 85 text

https://github.com/github/orchestrator
Authored by Shlomi Noach at GitHub. Previously at Booking.com and Outbrain.

Slide 86

Slide 86 text

CHAOS ENGINEERING WITH GO
THINK ABOUT WHAT FAILURE YOU CAN INJECT AND THEN CATCH.
WE DO THIS WITH MAGIC POCKET AT DROPBOX.

Slide 87

Slide 87 text

GO CLIENT TO THE CHAOS MONKEY REST API
THIS PROJECT WAS STARTED FOR THE PURPOSE OF CONTROLLED FAILURE INJECTION DURING GAME DAYS.
https://github.com/mlafeldt/chaosmonkey
go get -u github.com/mlafeldt/chaosmonkey/lib

Slide 88

Slide 88 text

VIZCERAL
A TOOL FOR “INTUITION ENGINEERING” TO HELP YOU VISUALIZE YOUR NETWORK AND TRAFFIC.
CREATED BY NETFLIX.
https://github.com/Netflix/vizceral

Slide 89

Slide 89 text

VIZCERAL BY @JRSQUARED

Slide 90

Slide 90 text

http://locust.io/

Slide 91

Slide 91 text

https://github.com/strepsirrhini-army/chaos-lemur

Slide 92

Slide 92 text

INDUSTRY + ACADEMIA COLLABORATION

Slide 93

Slide 93 text

DISORDERLY LABS

Slide 94

Slide 94 text

DISORDERLY LABS
https://people.ucsc.edu/~palvaro/molly.pdf

Slide 95

Slide 95 text

DISORDERLY LABS
https://people.ucsc.edu/~palvaro/socc16.pdf

Slide 96

Slide 96 text

WHERE CAN YOU LEARN MORE?
https://chaos-slack.herokuapp.com/

Slide 97

Slide 97 text

JOIN THE CHAOS COMMUNITY
http://chaos.community/

Slide 98

Slide 98 text

YOUR TOOL HERE!
LOOK FORWARD TO SEEING YOU AT CHAOS COMMUNITY DAYS AND HEARING FROM YOU IN THE SLACK COMMUNITY AND ON THE MAILING LISTS.

Slide 99

Slide 99 text

CHAOS ENGINEERING BOOTCAMP
TAMMY & CASEY
VELOCITY SAN JOSE 2017

Slide 100

Slide 100 text

THANKS FOR ATTENDING THE:
CHAOS ENGINEERING BOOTCAMP
VELOCITY SAN JOSE 2017