ChaosSpark.pdf

Testing High Availability in Apache Spark with Chaos Engineering
February 27th, 2020, Bogotá, Colombia

Yury Niño Roa
DevOps Engineer | Chaos Engineering Advocate
@yurynino
Systems Engineer. Software Engineering Specialist. Co-organizer of GDG Bogotá, GDG Cloud Bogotá and Women Techmakers Bogotá.

“A single point of failure can trigger a chain reaction across the value chain, including suppliers to the final customer, and cause a severe business interruption.” Dimitar Pachov

How many of you have seen these messages when interacting with your bank?

Or these messages?

Agenda
• Why does the world need High Availability?
• How to achieve High Availability?
• Apache Spark is Highly Available
• The promise: Resilience Patterns
• Chaos Engineering
• Chaos Principles
• …
• Testing High Availability with Chaos!

Are you clear on the difference between Fault Tolerance, High Availability and Disaster Recovery?

High Availability Principles
• Component Redundancy.
• No Single Points of Failure.
• Failure Detection and Response.

Why do banks need High Availability?

Modern applications are distributed, highly available and resilient by default! However …

38% of customers at traditional banks experience a disruption to their service every year, compared with 21% of challenger bank customers. The FCA says bank outages have risen 138% in the past year in the world!

Banking Software Systems must be distributed, reliable and resilient!

How to achieve that?

Banks are data-driven organizations that place emphasis on quality! Take advantage of this!

Apache …
• Spark is an analytics engine for large-scale data processing.
• Spark is available for both batch and streaming data.
• Spark lets you write applications in Java, Scala, Python, and SQL.
• Spark makes it easy to build parallel apps.
• Spark combines SQL, streaming, and complex analytics.
• Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

Apache Spark is composed of …

Apache Spark Cluster Mode

Apache Spark is Highly Available

Apache Spark provides HA through Standby Masters or Single-Node Recovery

Single-Node Recovery with Local File System (LFS)
If you just want to be able to restart the Master if it goes down, Filesystem mode can take care of it. When Applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master.
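A minimal sketch of how Filesystem mode might be switched on for a standalone Master. The property names `spark.deploy.recoveryMode` and `spark.deploy.recoveryDirectory` come from the Spark standalone documentation; the helper function and the directory path are hypothetical, and the printed line is assumed to go into `conf/spark-env.sh`.

```python
def filesystem_recovery_opts(recovery_dir: str) -> str:
    """Build JVM options that put a standalone Master in FILESYSTEM recovery mode."""
    props = {
        "spark.deploy.recoveryMode": "FILESYSTEM",
        "spark.deploy.recoveryDirectory": recovery_dir,
    }
    return " ".join(f"-D{key}={value}" for key, value in props.items())


# Hypothetical directory (assumption); the Master process must be able to write to it.
opts = filesystem_recovery_opts("/var/spark/recovery")
print(f'export SPARK_DAEMON_JAVA_OPTS="{opts}"')
```

With this in place, a restarted Master reads the registered Applications and Workers back from the directory instead of starting empty.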

Standby Master with ZooKeeper
With ZooKeeper providing leader election and state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling.
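The ZooKeeper setup can be sketched the same way. The properties `spark.deploy.recoveryMode`, `spark.deploy.zookeeper.url` and `spark.deploy.zookeeper.dir` and the comma-separated `spark://host1:port1,host2:port2` master URL are taken from the Spark standalone documentation; the helper functions and all host names below are hypothetical.

```python
from typing import Sequence


def zookeeper_recovery_opts(zk_url: str, zk_dir: str = "/spark") -> str:
    """Build JVM options that put a standalone Master in ZOOKEEPER recovery mode."""
    props = {
        "spark.deploy.recoveryMode": "ZOOKEEPER",
        "spark.deploy.zookeeper.url": zk_url,
        "spark.deploy.zookeeper.dir": zk_dir,
    }
    return " ".join(f"-D{key}={value}" for key, value in props.items())


def master_url(masters: Sequence[str]) -> str:
    """List every Master so Applications and Workers can locate the current leader."""
    return "spark://" + ",".join(masters)


# Hypothetical hosts (assumption): three ZooKeeper nodes and two Spark Masters.
opts = zookeeper_recovery_opts("zk1:2181,zk2:2181,zk3:2181")
url = master_url(["master1:7077", "master2:7077"])
print(f'export SPARK_DAEMON_JAVA_OPTS="{opts}"')
print(url)
```

Every Master is started with the same options, and applications are pointed at the combined URL rather than at a single Master.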

Standby Master with ZooKeeper

The Promise: Resilience Patterns

Resilience Patterns
• Circuit Breaker Pattern
• Bulkhead Pattern
• Leader Election
• Compensating Transactions
• Health Endpoint Monitoring

Leader Election

How to test what happens when the Master breaks: Chaos Engineering

Chaos Engineering
It is the discipline of experimenting in production on a distributed system in order to reveal its weaknesses and to build confidence in its resilience capability.
https://principlesofchaos.org/

History of Chaos Engineering
• 2008: Chaos Engineering began at Netflix.
• 2010: Chaos Monkey was launched.
• 2014: The role of Chaos Engineer was created.
• 2018: A lot of resources for Chaos Engineering.
Kolton Andrus

Who is a Chaos Engineer?
• What my mom thinks I do
• What my friends think I do
• What software engineers think I do
• What I really do
Help service owners to increase their resilience through education, tools and encouragement.

Who is doing Chaos Engineering?

Chaos Engineering Principles
• Hypothesize about Steady State
• Run Experiments
• Vary Real-World Events
• Automate Experiments
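One way to read these principles as code is a small experiment loop: check the steady state, inject an event, check again, then roll back. The sketch below is an illustration only; the `Experiment` class, the `run` function and the two-replica toy system are all hypothetical and do not belong to any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # hypothesis: True while the system is healthy
    inject: Callable[[], None]        # the real-world event being simulated
    rollback: Callable[[], None]      # undo the injected event


def run(experiment: Experiment) -> bool:
    """Return True if the steady-state hypothesis still holds during the failure."""
    if not experiment.steady_state():
        raise RuntimeError("system is not in steady state; refusing to inject a failure")
    experiment.inject()
    try:
        return experiment.steady_state()
    finally:
        experiment.rollback()  # always restore the system, even if the check raises


# Toy system (assumption): a service is healthy while at least one replica is up.
replicas = {"a": True, "b": True}
experiment = Experiment(
    name="kill one replica",
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
hypothesis_held = run(experiment)
print(f"{experiment.name}: hypothesis held = {hypothesis_held}")
```

Varying real-world events then means building several `Experiment` objects with different `inject` functions, and automating means running them on a schedule.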

Testing with Chaos

Netflix Chaos Experiment with Spark

Experiment: Hypothesis
Validate that there is no interruption in computing metrics when the different Spark components fail. To simulate such failures, we employed a whack-a-mole approach and killed the various Spark components.
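The whack-a-mole idea can be illustrated with a toy, in-memory model. This is not the tooling used in the experiment and it does not touch a real Spark cluster: `ToyCluster`, its component names and the "metrics can be computed" rule (a leader Master plus at least one live Worker) are all assumptions made for the sketch.

```python
import random


class ToyCluster:
    """In-memory stand-in for a standalone cluster with standby Masters (hypothetical)."""

    def __init__(self, masters, workers):
        self.masters = list(masters)
        self.workers = list(workers)
        self.alive = {name: True for name in self.masters + self.workers}
        self.leader = self.masters[0]

    def _elect(self):
        # Promote the first live Master, or leave the cluster leaderless.
        live = [m for m in self.masters if self.alive[m]]
        self.leader = live[0] if live else None

    def kill(self, name):
        self.alive[name] = False
        if name == self.leader:
            self._elect()

    def restart(self, name):
        self.alive[name] = True
        if self.leader is None and name in self.masters:
            self._elect()

    def can_compute_metrics(self):
        return self.leader is not None and any(self.alive[w] for w in self.workers)


def whack_a_mole(cluster, rounds, rng):
    """Kill one random component per round and count rounds where metrics stop."""
    interruptions = 0
    for _ in range(rounds):
        victim = rng.choice(cluster.masters + cluster.workers)
        cluster.kill(victim)
        if not cluster.can_compute_metrics():
            interruptions += 1
        cluster.restart(victim)
    return interruptions


cluster = ToyCluster(["master-1", "master-2"], ["worker-1", "worker-2"])
interruptions = whack_a_mole(cluster, rounds=20, rng=random.Random(42))
print(f"interrupted rounds: {interruptions}")
```

In this model the hypothesis holds whenever every component type has a redundant peer; with a single Master and a single Worker, every kill interrupts the computation.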

Experiment: Running

Netflix Chaos Results with Spark

How to begin ...
• https://chaosengineering.slack.com
• https://github.com/dastergon/awesome-chaos-engineering
• https://www.infoq.com/chaos-engineering
@yurynino

If we want banking systems that are distributed, highly available, reliable and resilient, we must be reliable and resilient ourselves! Take care of yourself!

Thanks for coming!!! @yurynino
