Definitive Guide to GameDays - Road to Resilient Services

A GameDay is a dedicated time to intentionally create failure scenarios in a safe environment. Regularly running GameDays is an effective Chaos Engineering practice for testing the resiliency of your services, validating technical intricacies, and surfacing conversations around observability and incident management. GameDays can also expose blind spots that appear when systems operate under suboptimal conditions. In this talk, Ho Ming shares what it takes to run successful GameDays.

Ho Ming Li

May 15, 2018
Transcript

  1. Definitive Guide to GameDays: Road to Resilient Services

    Ho Ming Li, @HoReaL, Solutions Architect @ Gremlin
  2. Ho Ming Li, Solutions Architect, @HoReaL

    SW Development, Release Engineering, QA Engineering, Professional Services, Enterprise Support, Solutions Architecture
  3. What is Chaos Engineering?

  4. Thoughtful, planned, experiments designed to reveal weaknesses in your services

  5. Business Continuity Plan Disaster Recovery Failover

  6. Unplug Cables

  7. Chaos Experiment = Micro Exercise

  8. See ACTUAL behaviours of your service under failure scenarios

  9. What is a GameDay?

  10. Dedicated time for teams to collaboratively focus on using Chaos

    Engineering practices to reveal weaknesses in your services
  11. Goal: Build (& Operate) Resilient Services

  12. Run GameDays

  13. Goal: Run your own GameDay

  14. Why Run a GameDay?

  15. Why do I want to run a GameDay?

  16. Business Owners Cost of Downtime

  17. Managers Retain Talent

  18. Engineers Please stop paging me!

  19. It’s a Team Game

  20. Managing sensitive projects: a lateral approach - Olivier d'Herbemont, Bruno

    César, Tom Curtin, Pascal Etcheber
  21. Who | Topics to expect | Useful Topics

    Influencers: Engineers On-Call | How to practice CE | Continuous Chaos
    Waverers: Senior Management | What is CE | Cost of Downtime
    Passives: Engineering Managers | Why practice CE | Incident & on-call reduction
    Moaners: Specific Individuals | “I’m too busy” | We don’t learn by “always doing things the way we’ve always done them”
    Opponents: Support | “There’s already so much chaos” | Impact of incident management
    Fanatics: Specific Individuals | “I believe in unit tests” | CE is unit tests for alerting and monitoring
    Skeptics: Specific Individuals | “We won’t get value from this” | Defence protection & training
    Mutineers: Specific Individuals | “We don’t need to do this” | Data on top 5 most unreliable services & focus on resilience
  22. WHAT ARE THEIR MAJOR CHALLENGES? What could they gain by

    collaborating with you?
  23. “Sharpen your Saw”

  24. Reduce Incident Severity and Frequency 10x

  26. Planning a GameDay

  27. Who’s in?

  28. It’s a Team Game

  29. THE Chaos Crew to make it happen

    Executive: CTO / VP Engineering | Budget / Objectives
    Executive Assistant / Organizer | Invitations / Coordination
    Engineering Director / Manager | Prioritization / E. Availability
    Engineers / Subject Matter Expert | Architecture / Experiments
    New Hires / Interns | Learning / New Perspectives
  30. When & Where? Date, Time, Venue, Environment

  31. Service about to Launch? Major re-architecture? Wait till known issues

    are fixed?
  32. Other Priorities? Never a good time “Sharpen your Saw” Start

    Finding Time Now
  33. Better In-Person

  34. Choose an Environment ? Staging Production

  35. Start in Staging Mature to Production

  36. lim delta → 0 PRODUCTION + delta

  37. Plan Experiments Run Experiments Analyze Results

  38. Do This Pre-GameDay

  39. Be Thoughtful

  40. Pick 1 Service

  41. What are your top 5 critical services? What was your previous outage?

  42. Define Experiments

  43. Whiteboard Service Architecture List Internal/External Dependencies

  44. What could go wrong?

  45. How to (re)Create Failure Scenario?

  46. Scoping Think about the Blast Radius

  47. Start small THEN dial

  48. DON’T START HERE

  49. START HERE

  53. Safety First At what point do you stop the experiment?

    Think about the Customers
  54. Be Thoughtful

  55. Experiments should validate...

  56. Application Retries, Timeouts, Fallbacks

  57. Monitoring Metrics, Log Events don’t forget KPIs

  58. Alerting Paging System

  59. People Incident Response

  60. Think End-to-End

  61. Don’t Over-complicate Experiments

  62. Be Thoughtful

  63. Write it all down!

  64. Plan Experiments Run Experiments Analyze Results

  65. It’s GameDay!

  66. It’s a Team Game

  67. Communicate Review Plan Execute

  68. Lean In

  69. Take Time

  70. How long did it take to launch? How long till

    service recovers? How much time left before Fallback fails?
  71. Write it all down!

  72. Plan Experiments Run Experiments Analyze Results

  73. Validation / Discovery 80/20

  75. How do you ensure that your service Stays Resilient Over

    Time?
  76. Met Expectations? Automate Experiments Re-Run Experiments

  77. Plan your Next GameDay

  78. GameDay Starter Pack https://www.gremlin.com/gameday/

  79. The End Beginning

  80. Recap

  81. “Sharpen your Saw”

  82. It’s a Team Game

  83. Be Thoughtful

  84. Safely, Thoughtfully, Collaboratively Break Things On Purpose Go Run Your

    Own GameDay!
  85. Break things together! Join us. Learn from us. Teach us.

    Chaos Engineering Community Slack (https://tinyurl.com/chaoseng)
  86. Thank You! Ho Ming Li @HoReaL Solutions Architect, Gremlin Chaos

    Engineering Community Member
  87. - End of Deck -
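
Slides 46 to 53 advise scoping the blast radius, starting small before dialing up, and deciding in advance at what point to stop the experiment; slides 63 and 71 add "Write it all down!". The following is a minimal Python sketch of that loop, not anything shown in the talk. Every name in it (`ExperimentLog`, `run_escalating_experiment`, `inject_failure`, `error_rate`, `halt`, `abort_threshold`) is hypothetical, and the failure injection and error-rate measurement are simulated callbacks rather than calls into any real chaos tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExperimentLog:
    """Written record of each step (hypothetical helper for "Write it all down!")."""
    entries: List[str] = field(default_factory=list)

    def note(self, message: str) -> None:
        self.entries.append(message)


def run_escalating_experiment(
    blast_radii: List[int],
    inject_failure: Callable[[int], None],
    error_rate: Callable[[], float],
    halt: Callable[[], None],
    abort_threshold: float,
    log: ExperimentLog,
) -> bool:
    """Widen the blast radius one step at a time.

    After each step the observed error rate is compared with the abort
    threshold agreed on before the GameDay. The first step that exceeds it
    halts the experiment. Returns True if every step stayed within the
    threshold, False if the experiment was aborted.
    """
    for radius in blast_radii:
        inject_failure(radius)
        observed = error_rate()
        log.note(f"radius={radius}% error_rate={observed:.2%}")
        if observed > abort_threshold:
            halt()  # stop injecting failure before widening any further
            log.note(f"aborted at radius={radius}%")
            return False
    halt()  # always clean up, even after a fully successful run
    log.note("completed all steps")
    return True


if __name__ == "__main__":
    # Simulated system (assumption): the error rate grows with the blast radius.
    state = {"radius": 0}
    log = ExperimentLog()
    ok = run_escalating_experiment(
        blast_radii=[1, 5, 25],
        inject_failure=lambda r: state.update(radius=r),
        error_rate=lambda: state["radius"] / 100 * 0.5,
        halt=lambda: state.update(radius=0),
        abort_threshold=0.05,
        log=log,
    )
    print("all steps within threshold:", ok)
    print("\n".join(log.entries))
```

In this simulated run the 1% and 5% steps stay under the 5% error threshold and the 25% step exceeds it, so the run halts there and the log records where it stopped. In a real GameDay the callbacks would wrap whatever injection tool and monitoring query the team actually uses, and the threshold would come from the plan written before the day.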