A Journey Planning my First GameDay

I begin with a pragmatic approach to running experiments in production using the scientific method. I explain the principles and methodologies used by top tech companies such as Netflix, AWS and Twilio. In the central part, I show how I designed, planned, scheduled and ran my first chaos GameDay. Finally, with this background, I introduce the topic of postmortems and a new app to support a GameDay.

Yury Nino

May 26, 2019
Transcript

  1. Nice to meet you. YURY NIÑO, Software Engineer and Chaos Engineer Advocate. Loves building software applications, solving resilience issues and teaching. Passionate about reading, writing and cycling.
  2. Chaos Engineering. It is deliberately inducing stress or faults into software and/or hardware as a way of learning or verifying things about systems in production. https://www.gremlin.com
  3. Chaos GameDays are events hosted to conduct chaos experiments in order to validate or invalidate hypotheses about a system’s resilience. Jesse Robbins, SRE, Amazon
  4. GameDays can Transform our Teams. Even though they are not real, they make engineers gain confidence. https://tech.target.com
  5. Plan
     • Create an agenda for the GameDay.
     • Create a public #gamedays Slack channel.
     • Create a calendar invite for the GameDay.
     • Fill in a chaos experiment form.
     • Create a Datadog dashboard for the GameDay.
     • Ensure the team has access to the software!
  6. Plan
     Application Name: Finer
     Observability: DataDog
     Hypothesis: Circuit Breaker works
     Environment: My Home
     Results:
     Duration: 5 - 10 seconds
     Load: 1 request
     Actions:
  7. Execution: The Game starts! The Master of Disaster secretly decides on a type of failure, then declares “start of incident” and attacks!
  8. Execution ... One member of the team acts as First On-Call and attempts to see, triage, and mitigate the impact of the failure.
  9. Execution ... The team understands, analyzes and solves the issue. It happens in less than 75% of the time. The Master of Disaster will reverse the failure and the team proceeds to do a postmortem.
  10. Postmortem
     Application Name: Finer
     Observability: DataDog
     Hypothesis: Circuit Breaker works. Facing latencies > 5 seconds between dashboard_api and smart_api to open the circuit.
     Environment: My Home
     Duration: 5 - 10 seconds
     Load: 1 request
     Results: Issue #4356
     Actions: Configure the proper Hystrix parameters according to the results. Implement a fallback.
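
The hypothesis in the postmortem (the circuit opens when latency between dashboard_api and smart_api exceeds 5 seconds, and a fallback takes over) can be illustrated with a small sketch. This is not the deck's implementation and not Hystrix itself: the `CircuitBreaker` class, the `FakeClock` helper, the 30-second reset window and the "cached data" fallback are all hypothetical names and values chosen for illustration. Unlike a real circuit breaker with timeouts, this toy version only measures latency after the call returns.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    """Toy breaker (hypothetical): opens when a call fails or is too slow."""
    latency_threshold_s: float = 5.0   # from the slide: latencies > 5 seconds
    reset_after_s: float = 30.0        # assumption: not stated in the deck
    clock: Callable[[], float] = time.monotonic
    opened_at: Optional[float] = None  # None means the circuit is closed

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            self.opened_at = None      # reset window elapsed: allow calls again
            return False
        return True

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.is_open():
            return fallback()          # short-circuit: do not touch the dependency
        start = self.clock()
        try:
            result = primary()
        except Exception:
            self.opened_at = self.clock()
            return fallback()
        if self.clock() - start > self.latency_threshold_s:
            self.opened_at = self.clock()   # too slow: open the circuit
            return fallback()
        return result


class FakeClock:
    """Controllable clock so the experiment runs without real waiting."""
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def run_experiment(injected_latency_s: float = 6.0):
    """Inject latency into a stand-in for smart_api and observe the breaker."""
    clock = FakeClock()
    breaker = CircuitBreaker(clock=clock)

    def smart_api() -> str:
        clock.advance(injected_latency_s)   # the simulated attack
        return "live data"

    def fallback() -> str:
        return "cached data"

    first = breaker.call(smart_api, fallback)
    second = breaker.call(smart_api, fallback)
    return first, second, breaker.is_open()
```

Calling `run_experiment()` with the default 6 seconds of injected latency exercises the hypothesis: the first call exceeds the threshold and opens the circuit, and the second call is served by the fallback without reaching the slow dependency. With a latency below the threshold, for example `run_experiment(1.0)`, the circuit stays closed and both calls return the live result.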