Chaos Engineering - Here We Go

Lothar Wieske
September 13, 2019

Trivadis Tech-Event 2019 (25th anniversary)
Mövenpick Hotel Regensdorf, September 13, 2019

Friday, September 13, 2019

Transcript

  1. BASEL | BERN | BRUGG | BUKAREST | DÜSSELDORF | FRANKFURT A.M. | FREIBURG I.BR. | GENF | HAMBURG | KOPENHAGEN | LAUSANNE | MANNHEIM | MÜNCHEN | STUTTGART | WIEN | ZÜRICH

    Chaos Engineering. Lothar Wieske. TechEvent Zürich, 13.09.2019
  2. Chaos Engineering - Here We Go. Lothar. news.trivadis.com/blog @lwieske
  3. Lothar

    I am a solutions architect and digital disruptor. Since 2009, I have worked at the intersection of cloud and analytics. Digital disruption is reaching ever more sectors, and I want to understand its technological, societal and economic impacts. Before 2009, I managed large project budgets, later became an architect, built a digital radiology system and migrated Miles & More. @lwieske news.trivadis.com/blog
  4. Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.
  5. Cloud Changed the Way Netflix Runs the Company

    • 2012: Netflix Open Sourced Chaos Monkey.
    • 2016: Netflix Completed Transition To a 100% AWS Infrastructure.
  6. Netflix Handled Amazon Maintenance Update

    • Amazon performed a major maintenance update at the end of September 2014 in order to patch a security vulnerability in the Xen hypervisor, affecting about 10% of its global fleet of cloud servers.
    • Netflix has a long history of using its Simian Army (Chaos Monkey, Gorilla and Kong) to force reboots of its servers in order to see how the overall system reacts and what can be done to improve resilience. The problem this time was that the operation would affect some of the database servers, more exactly 218 Cassandra nodes. It is one thing to perform a live restart of a server streaming a video; it is a lot more difficult to do the same to a stateful database.
    • Out of our 2700+ production Cassandra nodes, 218 were rebooted.
    • 22 Cassandra nodes were on hardware that did not reboot successfully.
    • They were detected and replaced with minimal human intervention.
    • Netflix experienced 0 downtime that weekend.
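To put the figures on this slide in proportion, the short calculation below derives the shares from the three numbers quoted. It is only a sketch: the slide says "2700+" nodes, so 2700 is used here as an assumed lower bound, which makes the fleet-wide shares upper bounds. The variable names are illustrative.

```python
# Figures quoted on the slide. TOTAL_NODES is an assumption:
# the slide says "2700+", so 2700 is a lower bound for the fleet size.
TOTAL_NODES = 2700
REBOOTED = 218
FAILED_REBOOT = 22

# Share of the Cassandra fleet that was rebooted (upper bound).
rebooted_share = REBOOTED / TOTAL_NODES

# Share of the rebooted nodes whose hardware did not come back.
failed_share_of_rebooted = FAILED_REBOOT / REBOOTED

# Share of the whole fleet that had to be replaced (upper bound).
failed_share_of_fleet = FAILED_REBOOT / TOTAL_NODES

print(f"rebooted:            at most {rebooted_share:.1%} of the fleet")
print(f"failed after reboot: {failed_share_of_rebooted:.1%} of rebooted nodes")
print(f"replaced:            at most {failed_share_of_fleet:.1%} of the fleet")
```

Roughly one in ten rebooted nodes had to be replaced, yet this amounted to under one percent of the fleet.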
  7. PRINCIPLES OF CHAOS ENGINEERING

    • The following principles describe an ideal application of Chaos Engineering, applied to the processes of experimentation described above. The degree to which these principles are pursued strongly correlates to the confidence we can have in a distributed system at scale.
    • Build a Hypothesis around Steady State Behavior
    • Vary Real-world Events
    • Run Experiments in Production
    • Automate Experiments to Run Continuously
    • Minimize Blast Radius
    • Experimenting in production has the potential to cause unnecessary customer pain. While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure that the fallout from experiments is minimized and contained.
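The principles above can be sketched as a toy experiment loop. This is a minimal simulation under stated assumptions, not a real chaos tool: `Node`, `success_rate`, `run_experiment`, the 10% blast radius and the 0.85 threshold are all hypothetical names and values chosen for illustration. It states a steady-state hypothesis, injects a failure into a bounded share of nodes, checks the hypothesis, and rolls back.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def success_rate(nodes):
    """Steady-state metric: fraction of nodes able to serve requests."""
    return sum(n.healthy for n in nodes) / len(nodes)


def run_experiment(nodes, blast_radius, threshold, rng):
    """Inject failures into a bounded share of nodes and test the hypothesis.

    blast_radius: maximum fraction of nodes the experiment may touch.
    threshold: minimum success rate that still counts as steady state.
    """
    baseline = success_rate(nodes)
    if baseline < threshold:
        # Do not experiment on a system that is already outside steady state.
        return {"ran": False, "baseline": baseline}

    limit = int(len(nodes) * blast_radius)  # cap on affected nodes
    victims = rng.sample(nodes, limit)
    for node in victims:
        node.healthy = False  # simulated real-world event: instance loss

    observed = success_rate(nodes)

    # Roll back so the fallout of the experiment stays contained.
    for node in victims:
        node.healthy = True

    return {
        "ran": True,
        "baseline": baseline,
        "observed": observed,
        "affected": len(victims),
        "hypothesis_holds": observed >= threshold,
    }


nodes = [Node(f"node-{i}") for i in range(20)]
result = run_experiment(
    nodes, blast_radius=0.10, threshold=0.85, rng=random.Random(42)
)
print(result)
```

Running such a function on a schedule would correspond to the "automate experiments" principle. In a real system the metric would come from production monitoring instead of a flag on an in-memory object.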
  8. Chaos Engineering Is Not Just Tools.

    • Culture Is Part Of Your System.
    • Complexity Is Part Of Your System.
    • Testing In Production? Yes You Can!
    • You Should Chaos Engineer Everything
    • Cloud and Microservices – Among Others
  9. Integration Workshops | Orientation Workshops | Elaboration Workshops | Conception Workshops

    Cloud Native Leadership | Cloud Native Apps | Cloud Native Architectures | Teams & Skills | DevOps | Cloud Native Data | Cloud Native Journey | Cloud Native Landscape Walkthrough | Cloud Native Security | Cloud Native Lighthouse
  10. Session Feedback – now

    • Please use the Trivadis Events mobile app to give feedback on each session
    • Use "My schedule" if you have registered for a session
    • Otherwise use "Agenda" and the search function
    • If the mobile app does not work (or if you have a Windows smartphone / Desktop), use your smartphone browser
    • URL: http://trivadis.quickmobileplatform.eu/
    • User name: <your_loginname> (such as “svv”)
    • Password: sent by e-mail...