
ChaosSpark.pdf

Yury Nino
February 27, 2020



Transcript

  1. Yury Niño Roa, DevOps Engineer | Chaos Engineering Advocate. @yurynino. Systems Engineer. Software Engineering Specialist. Co-organizer of GDG Bogotá, GDG Cloud Bogotá, and Women Techmakers Bogotá. “A single point of failure can trigger a chain reaction across the value chain, including suppliers to the final customer, and cause a severe business interruption” (Dimitar Pachov)
  2. Agenda: Why does the world need High Availability? How to achieve High Availability? Apache Spark is Highly Available. The promise: Resilience Patterns. Chaos Engineering. Chaos Principles … Testing High Availability with Chaos!
  3. Modern applications are distributed, HA and resilient by default! However … 38% of customers at traditional banks experienced disruption to their service every year, compared with 21% of challenger bank customers. The FCA says bank outages have risen 138% in the past year in the world!
  4. Apache …
    • Spark is an analytics engine for large-scale data processing.
    • Spark is available for both batch and streaming data.
    • Spark lets you write applications in Java, Scala, Python, and SQL.
    • Spark makes it easy to build parallel apps.
    • Spark combines SQL, streaming, and complex analytics.
    • Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
  5. Single-Node Recovery with LFS: If you just want to be able to restart the Master if it goes down, Filesystem mode can take care of it. When Applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master.
  6. Standby Master with ZooKeeper: ZooKeeper provides leader election and state storage, so you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling.
  7. Chaos Engineering: It is the discipline of experimenting in production on a distributed system in order to reveal its weaknesses and to build confidence in its resilience. https://principlesofchaos.org/
  8. History of Chaos Engineering: 2008 Chaos Engineering began at Netflix. 2010 Chaos Monkey was launched. 2014 The role of Chaos Engineer was created. 2018 A lot of resources for Chaos Engineering. Kolton Andrus
  9. Who is a Chaos Engineer? What my mom thinks I do. What my friends think I do. What software engineers think I do. What I really do: help service owners increase their resilience through education, tools and encouragement.
  10. Experiment: Hypothesis. Validate that there is no interruption in computing metrics when the different Spark components fail. To simulate such failures, we employed a whack-a-mole approach and killed the various Spark components.
  11. If we want banking systems that are distributed, highly available, reliable and resilient, we must be reliable and resilient too! Take care of yourself!
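
Slides 5 and 6 describe the two recovery modes of the Spark standalone Master. As a minimal sketch, the Python below assembles the `SPARK_DAEMON_JAVA_OPTS` value (typically set in `conf/spark-env.sh`) for each mode from the `spark.deploy.*` properties. The helper name `recovery_opts`, the recovery directory, and the ZooKeeper host names are hypothetical and were not part of the talk.

```python
# Sketch: build the SPARK_DAEMON_JAVA_OPTS value for the two Master recovery
# modes described on slides 5 and 6. Paths and host names are hypothetical.


def recovery_opts(mode, **settings):
    """Return a SPARK_DAEMON_JAVA_OPTS string for the given recovery mode."""
    props = {"spark.deploy.recoveryMode": mode}
    if mode == "FILESYSTEM":
        # Directory where the Master persists application and worker state.
        props["spark.deploy.recoveryDirectory"] = settings["directory"]
    elif mode == "ZOOKEEPER":
        # Comma-separated ZooKeeper host:port list shared by all Masters.
        props["spark.deploy.zookeeper.url"] = settings["url"]
        # ZooKeeper node under which recovery state is stored.
        props["spark.deploy.zookeeper.dir"] = settings.get("zk_dir", "/spark")
    else:
        raise ValueError(f"unsupported recovery mode: {mode}")
    return " ".join(f"-D{key}={value}" for key, value in props.items())


# Single-node recovery (slide 5): hypothetical local directory.
filesystem_opts = recovery_opts("FILESYSTEM", directory="/var/spark/recovery")

# Standby Masters (slide 6): hypothetical three-node ZooKeeper ensemble.
zookeeper_opts = recovery_opts("ZOOKEEPER", url="zk1:2181,zk2:2181,zk3:2181")

print(f'SPARK_DAEMON_JAVA_OPTS="{filesystem_opts}"')
print(f'SPARK_DAEMON_JAVA_OPTS="{zookeeper_opts}"')
```

In a standby setup, every Master would be started with the same ZooKeeper options so that they all take part in the same leader election.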