Finding Chaos Antipatterns in Postmortems

Presented at DevOpsDays Medellín

Yury Nino

July 30, 2020

Transcript

  1. DevOpsDays Medellín 2020: The world's leading DevOps conference comes to Medellín! July 30 to 31, 2020, Latin America. Welcome!
  2. #DevOpsDaysMDE: Join in on social media: @DevopsdaysMed, Devopsdaysmed2020, DevOpsDaysMedellin.
  3. CHAOS ANTIPATTERNS: Finding Commonalities in Incidents to Automate Postmortems
  4. Topics to be covered
    • What is an Antipattern?
    • Identifying Antipatterns
      ◦ Classical Outages
    • Postmortems
      ◦ Foundations
      ◦ Motivations
    • Automating Postmortems
    • Chaos Engineering
      ◦ Learning from postmortems
      ◦ Chaos GameDays
  5. Each outage is unique! But there are antipatterns, and sometimes we can use those to create defenses! (Inspired by Laura Nolan's quotes.)
    Outages can be routine, not incidents! The antipatterns we perceive are determined by the stories we live. (Inspired by Laura Nolan's quotes.)
  6. Antipattern: October 27, 1980, ARPANET. A malfunctioning IMP corrupted routing data; software recomputed checksums, propagating bad data with good checksums; incorrect sequence numbers caused buffers to fill; full buffers caused loss of keepalive packets; and nodes took themselves off the network! http://www.faqs.org/rfcs/rfc789.html
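The "bad data with good checksums" step can be illustrated with a minimal sketch. It is a simplification, not the actual IMP checksum scheme: `zlib.crc32` stands in for the checksum, and the packet layout and payload strings are hypothetical.

```python
import zlib

def make_packet(payload: bytes) -> dict:
    """Attach a checksum computed over the payload as it is right now."""
    return {"payload": payload, "checksum": zlib.crc32(payload)}

def is_valid(packet: dict) -> bool:
    """Check that the payload still matches the checksum it carries."""
    return zlib.crc32(packet["payload"]) == packet["checksum"]

# A routing update as originally sent (hypothetical contents).
original = make_packet(b"routing update seq=8")

# The payload is corrupted in memory after it was accepted.
corrupted_payload = b"routing update seq=9"

# Case 1: the original checksum travels with the data, so the damage is detectable.
stale = {"payload": corrupted_payload, "checksum": original["checksum"]}

# Case 2: the checksum is recomputed before forwarding, so the bad data looks good.
recomputed = make_packet(corrupted_payload)

print(is_valid(stale))       # False
print(is_valid(recomputed))  # True
```

The second case is the antipattern: a checksum recomputed over already-corrupted data vouches for the corruption instead of exposing it.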
  7. At 10:25 PM PDT on June 4, loss of power at an AWS Sydney facility resulting from severe weather in that area led to disruption to a significant number of instances in an Availability Zone. Due to the signature of the power loss, power isolation breakers did not engage, resulting in backup energy reserves draining into the degraded power grid.
  8. • Total site outage for 11 hours.
    • One of several MongoDB shards outgrew its RAM, hitting a performance cliff.
    • Backlog of queries.
    • Resharding while at full capacity is hard.
  9. AntiPatterns, like their pattern counterparts, define an industry vocabulary for the common processes and implementations. AntiPatterns highlight the most common problems that face the software industry and provide the tools to recognize them and to determine their causes.
    Is this TALK about Antipatterns?
  10. What is common here?
    • Hardware issues.
    • Configuration errors.
    • Security vulnerabilities.
    • Hitting limits.
    • Conflicts.
    • Out of sync times.
    • Slowness.
    • Automation interactions.
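As a rough sketch of how these categories could be used to tag incident text automatically: the category names come from the slide, while the keyword lists and the `tag_antipatterns` helper are illustrative assumptions, not something the talk specifies.

```python
# Category names are from the slide; the keywords are hypothetical examples.
ANTIPATTERN_KEYWORDS = {
    "Hardware issues": ["power", "disk", "hardware", "breaker"],
    "Configuration errors": ["config", "misconfigur"],
    "Security vulnerabilities": ["vulnerab", "exploit", "cve"],
    "Hitting limits": ["limit", "capacity", "outgrew", "quota"],
    "Conflicts": ["conflict", "collision"],
    "Out of sync times": ["clock", "ntp", "leap second"],
    "Slowness": ["slow", "latency", "backlog"],
    "Automation interactions": ["automat", "failover"],
}

def tag_antipatterns(text):
    """Return the categories whose keywords appear in the incident text."""
    lowered = text.lower()
    return [
        category
        for category, keywords in ANTIPATTERN_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]

# The outage summary from slide 8, used as sample input.
summary = (
    "One of several MongoDB shards outgrew its RAM, hitting a performance cliff. "
    "Backlog of queries. Resharding while at full capacity is hard."
)
print(tag_antipatterns(summary))  # ['Hitting limits', 'Slowness']
```

Plain substring matching is crude and will mislabel some text; it is only meant to show the shape of the idea, with a human reviewing the suggested tags.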
  11. A postmortem, written after a critical incident, documents what happened and helps prevent repeat outages. It provides a summary of events, recording how the response was handled and what resolution steps were taken.
  12. We do not write Postmortems!!! A survey of 45 software engineers showed that postmortems are not a common practice: 27 (60%) answered No and 18 (40%) answered Yes.
  13. Conducting a postmortem is an expensive and highly time-consuming task. We are looking to automate this exhausting work, using the ANTIPATTERNS identified after reviewing a compilation of classical postmortems.
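One possible starting point for this automation is generating a postmortem draft from structured incident data. The sketch below is an assumption about what that could look like: the `Incident` fields, the `render_postmortem` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Hypothetical record of the facts a postmortem needs."""
    title: str
    summary: str
    timeline: list = field(default_factory=list)          # (time, event) pairs
    resolution_steps: list = field(default_factory=list)
    antipatterns: list = field(default_factory=list)

def render_postmortem(incident):
    """Render a plain-text postmortem draft for humans to review and complete."""
    lines = [f"Postmortem: {incident.title}", "", "Summary", incident.summary]
    lines += ["", "Timeline"]
    lines += [f"- {time}: {event}" for time, event in incident.timeline]
    lines += ["", "Resolution steps"]
    lines += [f"- {step}" for step in incident.resolution_steps]
    lines += ["", "Antipatterns"]
    lines += [f"- {name}" for name in incident.antipatterns]
    return "\n".join(lines)

# Sample values are placeholders, not data from a real incident report.
incident = Incident(
    title="Total site outage",
    summary="A database shard outgrew its RAM and queries backed up.",
    timeline=[("T+0h", "Site becomes unavailable"), ("T+11h", "Site restored")],
    resolution_steps=["(placeholder) describe the fix here"],
    antipatterns=["Hitting limits", "Slowness"],
)
print(render_postmortem(incident))
```

The draft only saves the mechanical part of the work; the analysis and the lessons learned still have to come from the people involved.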
  14. Benefits of writing Postmortems. "That may be true for you, but not for others."
    • Engaging in a structured, collaborative process allows everyone to contribute what they learned and can build resiliency.
    • Friends with Benefits: an opportunity to rebuild confidence in people who were closely involved in the incident!
    • Partners, customers, and end users may also want to know what happened and what steps you have taken to improve their experience.
  15. Failures are an inevitable part of making software products and services; however, it is not necessary to repeat the mistakes of the past!!!
  16. Chaos Engineering is the discipline of experimenting in production on a distributed system in order to reveal its weaknesses and to build confidence in its resilience. https://principlesofchaos.org/
  17. Combine with GameDays!!! GameDays were created by Jesse Robbins, inspired by his experience and training as a firefighter. A Chaos GameDay is an event hosted to conduct chaos experiments that validate or invalidate a resilience hypothesis.
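To make "validate or invalidate a resilience hypothesis" concrete, here is a toy sketch of one experiment. Everything in it is an assumption for illustration: `ToyService` is a stand-in for a real system, and the 0.99 success-rate threshold is an arbitrary example of a steady-state hypothesis.

```python
class ToyService:
    """Stand-in for a real system: a request succeeds if any replica is healthy."""

    def __init__(self, replicas):
        self.healthy = {name: True for name in replicas}

    def kill(self, name):
        self.healthy[name] = False

    def handle_request(self):
        return any(self.healthy.values())

def success_rate(service, requests=100):
    """Fraction of requests that succeed against the service."""
    ok = sum(1 for _ in range(requests) if service.handle_request())
    return ok / requests

def run_experiment(service, inject, threshold=0.99):
    """Measure steady state, inject a failure, then re-check the hypothesis."""
    baseline = success_rate(service)
    inject(service)
    observed = success_rate(service)
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": observed >= threshold,
    }

# Hypothesis: losing one of three replicas does not hurt the success rate.
result = run_experiment(ToyService(["a", "b", "c"]), lambda s: s.kill("a"))
print(result)  # {'baseline': 1.0, 'observed': 1.0, 'hypothesis_holds': True}
```

In a real GameDay the injection and the measurements would target actual infrastructure, and an invalidated hypothesis would feed directly into a postmortem-style writeup.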
  18. Questions?
  19. Thank you very much!