Chaos U - Planning Your Chaos Day

Tammy Bryant Butow

April 13, 2018

Transcript

  1. @TAMMYBUTOW
    Breaking things in production on purpose since ’09:
    • Failure Fridays at Gremlin
    • Disaster Recovery at one of Australia’s biggest banks
    • Database & Cache Chaos Engineering at Dropbox
    • SEV Repro & Tank at DigitalOcean
    Principal SRE @GremlinInc. Co-Founder @girlgeekacademy. Prev @DigitalOcean @Dropbox @NAB @QUT. Australian.
  2. CHAOS DAY: A dedicated team day focused on using chaos engineering to reveal weaknesses in your system.
  3. BUSINESS CONTINUITY PLAN: You already do experiments in production: disaster recovery testing.
    Chaos Engineering is focused on making these experiments automated and continuous.
  4. INSPIRATION FOR CHAOS DAYS:
    • GameDays
    • Capture The Flag
    • Hack Days & Hack Weeks
  5. CHAOS DAY SOCIODYNAMICS

    |             | Who                         | Topics to expect               | Useful Topics |
    |-------------|-----------------------------|--------------------------------|---------------|
    | Influencers | Engineers On-Call           | How to practice CE             | Continuous Chaos |
    | Waverers    | Engineering Managers        | What CE is                     | The cost of downtime |
    | Passives    | Engineering Directors / VPs | Why practice CE                | Incident & on-call reduction |
    | Moaners     | Specific Individuals        | “I’m too busy”                 | We don’t learn by “always doing things the way we’ve always done them” |
    | Opponents   | Customer Support            | “There’s already so much chaos” | Impact of SEVs and incidents on the business and teams |
    | Fanatics    | Specific Individuals        | “I believe in unit tests”      | CE is unit tests for alerting and monitoring |
    | Skeptics    | Specific Individuals        | “We won’t get value from this” | Defence protection & training |
    | Mutineers   | Specific Individuals        | “We don’t need to do this”     | Data on top 5 most unreliable services & focus on resilience |
  6. • WHAT ARE THEIR MAJOR CHALLENGES?
    • WHAT WILL THEY WIN OR LOSE BY COLLABORATING WITH YOU?
    • WHAT IS THEIR INFLUENCE ON OTHER STAKEHOLDERS?
  7. UP-SKILL YOUR TEAM ON BUILDING SOFTWARE WITH FAILURE IN MIND
    * EVERYONE IN ENGINEERING HAS A LEARNING BUDGET; THIS IS REAL-WORLD EDUCATION!
    ** LEARNING BUDGETS OFTEN GO UNSPENT & ARE BETWEEN $1k AND $10k+ PER PERSON PER YEAR.
  8. YOUR CHAOS DAY COULD BE:
    • An on-site
    • An off-site
    • During a company retreat
  9. CHAOS DAY PREREQUISITES:
    • Know your top 5 critical systems
    • Have monitoring & alerting
    • Measure the cost of downtime
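
One rough way to put a number on "the cost of downtime" is a back-of-the-envelope estimate. The sketch below is an assumption, not something from the talk: the function name, the two cost components (lost revenue plus responder time), and every figure in the example are hypothetical placeholders to swap for your own data.

```python
def downtime_cost(minutes, revenue_per_hour, responders, loaded_hourly_rate):
    """Rough cost of one outage: lost revenue plus the time of the people responding.

    All inputs are hypothetical; use figures from your own business.
    """
    hours = minutes / 60
    lost_revenue = hours * revenue_per_hour
    response_cost = hours * responders * loaded_hourly_rate
    return lost_revenue + response_cost


# Example with made-up numbers: a 30-minute outage, $10,000/hour revenue,
# 4 responders at a $100/hour loaded rate.
cost = downtime_cost(minutes=30, revenue_per_hour=10_000,
                     responders=4, loaded_hourly_rate=100)
print(cost)  # 5200.0
```

Running this per critical system gives a simple ranking of where a Chaos Day experiment is likely to pay off first.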
  10. CHAOS DAY COUNTDOWN
    Milestones: 90 DAYS, 60 DAYS, 30 DAYS, 0 DAYS
    • DETERMINE ATTENDEE AVAILABILITY FOR CHAOS DAY
    • LOCK-IN CHAOS DAY VENUE
    • CHAOS DAY PLACEHOLDER INVITES
    • AGENDA & CHAOS DAY PRE-READ INFO
    • CREATE CHAOS DAY CREW
    • CHAOS DAY
  11. CHAOS DAY CREW
    • VP Engineering / CTO / COO
    • Executive Assistant
    • Engineering Director / Manager
    • Principal / Staff Engineer
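  12. CHAOS DAY CREW

    |                               | EXEC | EXEC ASSISTANT | PRINCIPAL ENGINEER | ENGINEERING LEADER |
    |-------------------------------|------|----------------|--------------------|--------------------|
    | Objectives                    | A    | I              | R                  | C                  |
    | Budget                        | R    | A              | C                  | I                  |
    | Attendee List & Availability  | C    | I              | A                  | R                  |
    | Venue                         | C    | R              | I                  | A                  |
    | Invitations & Agenda          | I    | R              | C                  | A                  |
    | Accoutrements                 | I    | R              | C                  | A                  |
    | Chaos Engineering Experiments | C    | I              | R                  | A                  |
    | Extra impact                  | C    | I              | A                  | R                  |

    RACI: Responsible, Accountable, Consulted, Informed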
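
The RACI matrix on this slide can be kept as plain data and sanity-checked. The sketch below assumes the letters in each row follow the column order of the slide header (Exec, Exec Assistant, Principal Engineer, Engineering Leader); the helper names `role_for` and `invalid_tasks` are hypothetical.

```python
# Column order is assumed to match the slide header.
ROLES = ["Exec", "Exec Assistant", "Principal Engineer", "Engineering Leader"]

# One letter per role, in ROLES order, as transcribed from the slide.
RACI = {
    "Objectives": "AIRC",
    "Budget": "RACI",
    "Attendee List & Availability": "CIAR",
    "Venue": "CRIA",
    "Invitations & Agenda": "IRCA",
    "Accoutrements": "IRCA",
    "Chaos Engineering Experiments": "CIRA",
    "Extra impact": "CIAR",
}


def role_for(task, letter):
    """Return the role holding the given RACI letter for a task."""
    return ROLES[RACI[task].index(letter)]


def invalid_tasks():
    """Return tasks whose row is not exactly one each of R, A, C and I."""
    return [task for task, codes in RACI.items()
            if sorted(codes) != sorted("RACI")]


print(role_for("Venue", "R"))  # Exec Assistant
print(invalid_tasks())         # []
```

A check like this catches the most common RACI mistake when adapting the matrix: a task with no single Accountable owner.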
  13. CHAOS DAY PLAN OBJECTIVES
    1. Make chaos engineering familiar
    2. Identify your key stakeholders
    3. Create the right story for your stakeholders
  14. Chaos Day Agenda:
    • Start Time (11am)
    • Whiteboarding & debate on assumptions
    • Lunch (midday)
    • Test cases and scoping
    • Execution
    • Recap / Review / Feedback
    • Close (4pm)
  15. WHITEBOARDING
    * With so many great minds present, it’s the perfect time to whiteboard the system’s architecture
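  16. GREMLIN EXPERIMENTS

    | Type of Attack | Attack         | Gremlin Support (April 2018) |
    |----------------|----------------|------------------------------|
    | Resource       | CPU            | ✓                            |
    | Resource       | Disk           | ✓                            |
    | Resource       | IO             | ✓                            |
    | Resource       | Memory         | ✓                            |
    | State          | Process Killer | ✓                            |
    | State          | Shutdown       | ✓                            |
    | State          | Time Travel    | ✓                            |
    | Network        | Blackhole      | ✓                            |
    | Network        | DNS            | ✓                            |
    | Network        | Latency        | ✓                            |
    | Network        | Packet Loss    | ✓                            |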
  17. CHAOS ENGINEERING HYPOTHESIS
    • Calls to DynamoDB will time out after 1500ms
    • This will cause elevated 500 status codes in the API
    • The UI will degrade gracefully
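
The three-part hypothesis above can be written down as checkable expectations before the experiment runs. The sketch below is a toy model only: it makes no real DynamoDB or Gremlin calls, and `api_status`, `render_ui`, and `run_experiment` are hypothetical stand-ins for your own API handler, UI, and latency attack. Only the 1500ms timeout comes from the slide.

```python
TIMEOUT_MS = 1500  # timeout stated in the hypothesis


def api_status(dynamodb_latency_ms):
    """Hypothetical API handler: a DynamoDB call slower than the timeout yields a 500."""
    return 500 if dynamodb_latency_ms > TIMEOUT_MS else 200


def render_ui(status):
    """Hypothetical UI: falls back to cached content instead of an error page."""
    return "live data" if status == 200 else "cached data (degraded)"


def run_experiment(injected_latency_ms):
    """Simulate one request while latency is injected on the DynamoDB path."""
    status = api_status(injected_latency_ms)
    return {"status": status, "ui": render_ui(status)}


baseline = run_experiment(injected_latency_ms=0)
attack = run_experiment(injected_latency_ms=2000)

# Each bullet of the hypothesis becomes one explicit expectation.
assert baseline["status"] == 200
assert attack["status"] == 500                   # elevated 500s in the API
assert attack["ui"] == "cached data (degraded)"  # UI degrades gracefully
print(baseline, attack)
```

In a real experiment the injected latency would come from a network latency attack and the assertions would read from your monitoring, but stating the expectations up front keeps the recap focused on where reality differed from the hypothesis.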