Chaos U - Planning Your Chaos Day

Tammy Bütow

April 13, 2018

Transcript

  1. TAMMY BUTOW PRINCIPAL SRE, GREMLIN

  2. TAMMY BUTOW PRINCIPAL SRE, GREMLIN PLANNING YOUR OWN CHAOS DAY

  3. @TAMMYBUTOW Breaking things in production on purpose since ’09
    • Failure Fridays at Gremlin
    • Disaster Recovery at one of Australia’s biggest banks
    • Database & Cache Chaos Engineering at Dropbox
    • SEV Repro & Tank at DigitalOcean
    Principal SRE @GremlinInc. Co-Founder @girlgeekacademy. Prev @DigitalOcean @Dropbox @NAB @QUT. Australian.

  4. WHOIS GREMLIN

  5. None
  6. PLANNING YOUR OWN CHAOS DAY @TAMMYBUTOW

  7. WHAT IS A CHAOS DAY?

  8. CHAOS DAY: Dedicated team day focused on using chaos engineering to reveal weaknesses in your system. @TAMMYBUTOW

  9. CHAOS ENGINEERING: Thoughtful, planned experiments designed to reveal weaknesses in your system. @TAMMYBUTOW

  10. CHAOS ENGINEERING: @TAMMYBUTOW

  11. BUSINESS CONTINUITY PLAN: You already do experiments in production: disaster recovery testing. Chaos Engineering is focused on making these experiments automated and continuous. @TAMMYBUTOW

  12. INSPIRATION FOR CHAOS DAYS: @TAMMYBUTOW
    • GameDays
    • Capture The Flag
    • Hack Days & Hack Weeks

  13. GAMEDAYS: @TAMMYBUTOW

  14. HACK DAYS & HACK WEEKS: @TAMMYBUTOW

  15. CAPTURE THE FLAG: @TAMMYBUTOW

  16. WHY RUN A CHAOS DAY?

  17. THERE ARE A VARIETY OF BENEFITS… @TAMMYBUTOW

  18. FOR A VARIETY OF PEOPLE…. @TAMMYBUTOW

  19. @TAMMYBUTOW

  20. CHAOS DAY SOCIODYNAMICS @TAMMYBUTOW
    • Influencers. Who: Engineers On-Call. Topics to expect: How to practice CE. Useful topics: Continuous Chaos.
    • Waverers. Who: Engineering Managers. Topics to expect: What CE is. Useful topics: The cost of downtime.
    • Passives. Who: Engineering Directors / VPs. Topics to expect: Why practice CE. Useful topics: Incident & on-call reduction.
    • Moaners. Who: Specific Individuals. Topics to expect: “I’m too busy”. Useful topics: We don’t learn by “always doing things the way we’ve always done them”.
    • Opponents. Who: Customer Support. Topics to expect: “There’s already so much chaos”. Useful topics: Impact of SEVs and incidents on the business and teams.
    • Fanatics. Who: Specific Individuals. Topics to expect: “I believe in unit tests”. Useful topics: CE is unit tests for alerting and monitoring.
    • Skeptics. Who: Specific Individuals. Topics to expect: “We won’t get value from this”. Useful topics: Defence protection & training.
    • Mutineers. Who: Specific Individuals. Topics to expect: “We don’t need to do this”. Useful topics: Data on top 5 most unreliable services & focus on resilience.

  21. @TAMMYBUTOW
    • WHAT ARE THEIR MAJOR CHALLENGES?
    • WHAT WILL THEY WIN OR LOSE BY COLLABORATING WITH YOU?
    • WHAT IS THEIR INFLUENCE ON OTHER STAKEHOLDERS?

  22. LET’S TALK ABOUT A FEW CHAOS DAY BENEFITS IN MORE DETAIL… @TAMMYBUTOW

  23. UP-SKILL YOUR TEAM ON BUILDING SOFTWARE WITH FAILURE IN MIND @TAMMYBUTOW
    * EVERYONE IN ENGINEERING HAS A LEARNING BUDGET; THIS IS REAL-WORLD EDUCATION!
    ** OFTEN LEARNING BUDGETS GO UNSPENT & ARE BETWEEN $1k AND $10k+ PER PERSON PER YEAR.

  24. GAIN A DEEPER UNDERSTANDING OF YOUR CURRENT SYSTEM WEAKNESSES @TAMMYBUTOW

  25. REDUCE THE SEVERITY AND FREQUENCY OF INCIDENTS @TAMMYBUTOW
    10x REDUCTION IN INCIDENTS

  26. PREVENT LOSS CAUSED BY OUTAGES @TAMMYBUTOW

  27. LEARN FROM FAILURE @TAMMYBUTOW

  28. HOW DO YOU PLAN A CHAOS DAY?

  29. START EARLY. THE GOAL: PLAN, CREATE & HOST AN IMPACTFUL CHAOS DAY. @TAMMYBUTOW

  30. YOUR CHAOS DAY COULD BE: @TAMMYBUTOW
    • An on-site
    • An off-site
    • During a company retreat

  31. CHAOS DAY COUNTDOWN: 90 DAYS

  32. ARE YOU READY FOR A CHAOS DAY? @TAMMYBUTOW

  33. CHAOS DAY PREREQUISITES: @TAMMYBUTOW
    • Know your top 5 critical systems
    • Have monitoring & alerting
    • Measure the cost of downtime

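
The slides do not say how to measure the cost of downtime. One rough way to put a number on it is sketched below, assuming the cost is lost revenue plus the time of the people responding. The class name, field names and every figure are hypothetical placeholders, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEstimate:
    revenue_per_hour: float      # hypothetical: revenue the service earns per hour
    engineers_responding: int    # hypothetical: people pulled into the incident
    engineer_hourly_cost: float  # hypothetical: loaded hourly cost per engineer

    def cost(self, outage_minutes: float) -> float:
        """Estimated cost of an outage: lost revenue plus responder time."""
        hours = outage_minutes / 60
        lost_revenue = self.revenue_per_hour * hours
        response_cost = self.engineers_responding * self.engineer_hourly_cost * hours
        return lost_revenue + response_cost


# Placeholder figures only; substitute numbers from your own business.
estimate = DowntimeEstimate(
    revenue_per_hour=12_000,
    engineers_responding=4,
    engineer_hourly_cost=100,
)
print(estimate.cost(outage_minutes=30))  # 6200.0
```

Even a coarse estimate like this gives the stakeholders on the sociodynamics slide a concrete figure to react to.
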
  34. WHO WILL ATTEND YOUR CHAOS DAY? @TAMMYBUTOW

  35. WHAT IS THE FOCUS OF YOUR CHAOS DAY? @TAMMYBUTOW

  36. HOW WILL YOU MEASURE SUCCESS? @TAMMYBUTOW

  37. WHAT IS YOUR BUDGET? @TAMMYBUTOW

  38. WHERE WILL IT BE? @TAMMYBUTOW

  39. CHAOS DAY COUNTDOWN @TAMMYBUTOW
    Milestones: 90 DAYS, 60 DAYS, 30 DAYS, 0 DAYS
    • DETERMINE ATTENDEE AVAILABILITY FOR CHAOS DAY
    • LOCK-IN CHAOS DAY VENUE
    • CHAOS DAY PLACEHOLDER INVITES
    • AGENDA & CHAOS DAY PRE-READ INFO
    • CREATE CHAOS DAY CREW
    • CHAOS DAY

  40. CHAOS DAY CREW

  41. CHAOS DAY CREW @TAMMYBUTOW
    • VP Engineering / CTO / COO
    • Executive Assistant
    • Engineering Director / Manager
    • Principal / Staff Engineer

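  42. CHAOS DAY CREW @TAMMYBUTOW
    RACI: Responsible, Accountable, Consulted, Informed
    • Objectives: Exec A, Exec Assistant I, Principal Engineer R, Engineering Leader C
    • Budget: Exec R, Exec Assistant A, Principal Engineer C, Engineering Leader I
    • Attendee List & Availability: Exec C, Exec Assistant I, Principal Engineer A, Engineering Leader R
    • Venue: Exec C, Exec Assistant R, Principal Engineer I, Engineering Leader A
    • Invitations & Agenda: Exec I, Exec Assistant R, Principal Engineer C, Engineering Leader A
    • Accoutrements: Exec I, Exec Assistant R, Principal Engineer C, Engineering Leader A
    • Chaos Engineering Experiments: Exec C, Exec Assistant I, Principal Engineer R, Engineering Leader A
    • Extra impact: Exec C, Exec Assistant I, Principal Engineer A, Engineering Leader R
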
  43. CHAOS DAY PLANNING OBJECTIVES

  44. CHAOS DAY PLAN OBJECTIVES @TAMMYBUTOW
    1. Make chaos engineering familiar
    2. Identify your key stakeholders
    3. Create the right story for your stakeholders

  45. CHAOS DAY BUDGET

  46. HOW MUCH WILL THE CHAOS DAY COST? @TAMMYBUTOW

  47. CHAOS DAY ATTENDEE AVAILABILITY

  48. ATTENDEE AVAILABILITY https://doodle.com/ @TAMMYBUTOW

  49. CHAOS DAY VENUE

  50. RESEARCH VENUES @TAMMYBUTOW

  51. CHAOS DAY PLACEHOLDER INVITES

  52. PLACEHOLDER INVITES Book the day into your attendees’ calendars. Don’t give much away… @TAMMYBUTOW

  53. CHAOS DAY PRE-READ INFORMATION

  54. SHARE CHAOS DAY PRE-READ INFORMATION WITH ATTENDEES @TAMMYBUTOW

  55. CHAOS DAY PRE-READ PACK https://github.com/gremlininc/chaos-day/ @TAMMYBUTOW

  56. CHAOS DAY AGENDA

  57. AGENDA @TAMMYBUTOW

  58. @TAMMYBUTOW Chaos Day Agenda:
    • Start Time (11am)
    • Whiteboarding & debate on assumptions
    • Lunch (midday)
    • Test cases and scoping
    • Execution
    • Recap / Review / Feedback
    • Close (4pm)

  59. WHAT CE EXPERIMENTS SHOULD YOU PERFORM FIRST? @TAMMYBUTOW

  60. WHAT ARE YOUR TOP 5 MOST CRITICAL SERVICES? @TAMMYBUTOW

  61. SELECT YOUR TARGET @TAMMYBUTOW

  62. STAGE THEN PRODUCTION @TAMMYBUTOW

  63. WHITEBOARDING @TAMMYBUTOW
    * With so many great minds present, it’s the perfect time to whiteboard the system’s architecture

  64. EXPERIMENT SCOPING @TAMMYBUTOW

  65. GREMLIN EXPERIMENTS @TAMMYBUTOW
    Type of Attack: Attack (✓ = Gremlin Support, April 2018)
    • Resource: CPU ✓, Disk ✓, IO ✓, Memory ✓
    • State: Process Killer ✓, Shutdown ✓, Time Travel ✓
    • Network: Blackhole ✓, DNS ✓, Latency ✓, Packet Loss ✓

  66. GREMLIN SYSCHECK @TAMMYBUTOW

  67. GREMLIN SYSCHECK @TAMMYBUTOW

  68. @TAMMYBUTOW

  69. CHAOS ENGINEERING HYPOTHESIS TAMMY BUTOW @TAMMYBUTOW
    • Calls to DynamoDB will time out after 1500ms
    • This will cause elevated 500 status codes in API
    • The UI will degrade gracefully

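
The three-part hypothesis can be written down as something checkable before the experiment runs. The sketch below is a self-contained simulation, not the system from the talk: it makes no real DynamoDB or Gremlin calls, and `call_datastore`, `api_handler`, `render_ui` and `check_hypothesis` are hypothetical stand-ins. Only the 1500ms timeout and the expected 500 responses come from the slide.

```python
from typing import Tuple

DYNAMODB_TIMEOUT_MS = 1500  # timeout stated in the hypothesis


class DatastoreTimeout(Exception):
    """Raised when the simulated datastore call exceeds the timeout."""


def call_datastore(injected_latency_ms: int) -> dict:
    # Hypothetical stand-in for a DynamoDB read; no real AWS call is made.
    if injected_latency_ms > DYNAMODB_TIMEOUT_MS:
        raise DatastoreTimeout(f"no response within {DYNAMODB_TIMEOUT_MS}ms")
    return {"items": ["a", "b"]}


def api_handler(injected_latency_ms: int) -> Tuple[int, dict]:
    # Hypothesis, part 2: a datastore timeout surfaces as a 500 from the API.
    try:
        return 200, call_datastore(injected_latency_ms)
    except DatastoreTimeout as exc:
        return 500, {"error": str(exc)}


def render_ui(status: int, body: dict) -> str:
    # Hypothesis, part 3: the UI shows a fallback message instead of breaking.
    if status == 200:
        return f"Showing {len(body['items'])} items"
    return "Some content is temporarily unavailable"


def check_hypothesis(injected_latency_ms: int) -> dict:
    """Run one simulated request and report what the API and UI did."""
    status, body = api_handler(injected_latency_ms)
    return {"api_status": status, "page": render_ui(status, body)}


print(check_hypothesis(injected_latency_ms=2000))  # latency attack active
print(check_hypothesis(injected_latency_ms=200))   # normal conditions
```

On a real Chaos Day the latency would come from a network attack, and the status codes and page behaviour would be read from monitoring rather than returned by a function.
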
  70. EXPERIMENT @TAMMYBUTOW

  71. ANALYSE RESULTS @TAMMYBUTOW

  72. ELEVATED 500 RESPONSES @TAMMYBUTOW

  73. GRACEFUL DEGRADATION @TAMMYBUTOW

  74. HAVE AN ABORT PLAN @TAMMYBUTOW
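
An abort plan is easiest to follow when the stop condition is decided before the experiment starts. The sketch below assumes one possible condition, an error-rate ceiling checked once per monitoring interval. The function name, the 5% threshold and the `halt_attack` callback are hypothetical; in practice the callback would be whatever mechanism your tooling provides for halting an attack.

```python
from typing import Callable, Iterable


def run_experiment(
    error_rates: Iterable[float],
    halt_attack: Callable[[], None],
    max_error_rate: float = 0.05,  # hypothetical abort threshold (5%)
) -> str:
    """Check one error-rate sample per interval; halt as soon as the limit is crossed."""
    for interval, rate in enumerate(error_rates, start=1):
        if rate > max_error_rate:
            halt_attack()
            return f"aborted at interval {interval}"
    halt_attack()  # stop the attack when the experiment window ends
    return "completed"


# Example with made-up samples: the third reading crosses the threshold.
halted = []
result = run_experiment([0.01, 0.02, 0.09, 0.01], halt_attack=lambda: halted.append(True))
print(result)  # aborted at interval 3
```

The halt callback is invoked on both paths, so the attack never outlives the experiment even when nothing goes wrong.
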

  75. RECAP & REVIEW YOUR CHAOS DAY @TAMMYBUTOW

  76. TURN YOUR CHAOS DAY EXPERIMENTS INTO CONTINUOUS CHAOS @TAMMYBUTOW

  77. ESTABLISH CHAOS CREW FOR YOUR NEXT CHAOS DAY @TAMMYBUTOW

  78. ACCOUTREMENTS

  79. CHAOS DAY FOOD & DRINKS @TAMMYBUTOW

  80. DO YOU WANT A THEME FOR YOUR CHAOS DAY? @TAMMYBUTOW

  81. @TAMMYBUTOW THEMES AT HACK WEEKS / HACK DAYS “CARNIVAL THEME”

  82. @TAMMYBUTOW

  83. @TAMMYBUTOW

  84. @TAMMYBUTOW

  85. @TAMMYBUTOW

  86. EXTRA IMPACT

  87. CHAOS DAY GOODIE BAGS @TAMMYBUTOW
    * INCLUDE A BOOK & TREATS

  88. CHAOS DAY COUNTDOWN: 1 DAY

  89. SEND REMINDERS @TAMMYBUTOW

  90. TELL EVERYONE TO BRING THEIR LAPTOP @TAMMYBUTOW

  91. CHAOS DAY COUNTDOWN: 0 DAYS

  92. WELCOME TO YOUR CHAOS DAY

  93. THANK YOU PRINCIPAL SRE, GREMLIN @TAMMYBUTOW GREMLIN.COM

  94. NOW FOR SOMETHING INTERACTIVE

  95. EXERCISE 1: WHAT SHOULD YOU EXPERIMENT WITH ON YOUR CHAOS DAY?

  96. @TAMMYBUTOW PRINCIPAL SRE, GREMLIN. @EUGENEZWU1 DIRECTOR, CUSTOMER SUCCESS, GREMLIN.

  97. WHAT ARE YOUR TOP 5 MOST CRITICAL SERVICES? @TAMMYBUTOW

  98. @TAMMYBUTOW

  99. STAGE THEN PRODUCTION @TAMMYBUTOW

  100. IN GROUPS OF 3-5: DISCUSS WHAT YOUR TOP 5 CRITICAL SERVICES ARE @TAMMYBUTOW

  101. IN YOUR GROUPS: SELECT ONE SERVICE TO PERFORM EXPERIMENTS ON FOR YOUR CHAOS DAY @TAMMYBUTOW

  102. SHARE: YOUR TOP 5 CRITICAL SERVICES. WHICH SERVICE DID YOU SELECT FOR CHAOS DAY? @TAMMYBUTOW

  103. THANK YOU! @TAMMYBUTOW AND @EUGENEZWU1 GREMLIN.COM