Slide 1

Slide 1 text

Chaos Engineering in 5 Minutes
Pavlos Ratis (@dastergon)

Slide 2

Slide 2 text

N-Tier Architecture
Credits: https://appybistro.com/2015/07/27/salad-lovers-multi-layered-creamy-salad/

Slide 3

Slide 3 text

Outage
CC BY-SA 3.0 - Sylvain Pedneault - Self-photographed

Slide 4

Slide 4 text

¯\_(ツ)_/¯

Slide 5

Slide 5 text

Failure is inevitable
Public Domain - https://en.wikipedia.org/wiki/Montparnasse_derailment

Slide 6

Slide 6 text

Systems are fragile
Courtesy of SpaceX

Slide 7

Slide 7 text

How to make them more reliable?

Slide 8

Slide 8 text

Testing

Slide 9

Slide 9 text

“Testing shows the presence, not the absence of bugs”
– Edsger W. Dijkstra
CC BY-SA 3.0 - https://en.wikiquote.org/wiki/Edsger_W._Dijkstra#Quotes_about_Dijkstra

Slide 10

Slide 10 text

Credits: http://corgibytes.com/blog/2016/03/28/pyramid-of-tests/

Slide 11

Slide 11 text

BUT

Slide 12

Slide 12 text

We cannot anticipate all failures
https://www.theguardian.com/technology/2014/aug/14/google-undersea-fibre-optic-cables-shark-attacks
https://www.theregister.co.uk/2017/03/01/aws_s3_outage/

Slide 13

Slide 13 text

Resiliency (noun) - re·sil·ien·cy
• The ability to become strong, healthy, or successful again after something bad happens
• The ability of something to return to its original shape after it has been pulled, stretched, pressed, bent, etc.

Slide 14

Slide 14 text

Fault Injection
“Fault injection is important to evaluating the dependability of computer systems.”
– Hsueh, M.C., Tsai, T.K. and Iyer, R.K., 1997. Fault injection techniques and tools. Computer, 30(4), pp.75-82.
Public Domain - https://en.wikipedia.org/wiki/NASA

Slide 15

Slide 15 text

This is NOT Rocket Science!

Slide 16

Slide 16 text

ENTER CHAOS ENGINEERING

Slide 17

Slide 17 text

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
– Principles of Chaos Engineering (http://principlesofchaos.org/)

Slide 18

Slide 18 text

Chaos as in…
• Killing random cloud VM instances or containers
• Killing random Kubernetes pods
• Killing MySQL Master or Slaves
• Introducing extra latency or packet loss between microservices
• Killing a critical supporting service (e.g. a logging server) while serving traffic
• “Unplugging” a whole datacenter or availability zone
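
To make the "extra latency" and "killing a service" items concrete, here is a minimal Python sketch of fault injection at the function-call level. It is only an illustration, not how Chaos Monkey or Gremlin work: the name `inject_chaos` and its parameters (`latency_s`, `failure_rate`, `rng`, `sleep`) are hypothetical and do not come from any of the tools in this deck.

```python
import random
import time


def inject_chaos(func, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap `func` so that each call is delayed and may fail at random.

    Hypothetical helper for illustration only:
    - latency_s: extra delay added before every call (simulated slow network)
    - failure_rate: probability in [0, 1] that a call raises ConnectionError
    - rng / sleep: injectable so that experiments can be made repeatable
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulated extra latency between services
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")  # simulated dead dependency
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    """Hypothetical downstream call that normally always succeeds."""
    return {"id": user_id, "name": "example"}
```

Wrapping a call as `flaky = inject_chaos(fetch_profile, latency_s=0.2, failure_rate=0.1)` lets you observe whether callers time out, retry, or degrade gracefully when the dependency misbehaves.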

Slide 19

Slide 19 text

Process
• Define a steady state hypothesis (e.g. responses are HTTP 2xx & 3xx)
• Run experiments in the production environment (e.g. destroy a critical service) *
• Try to disprove the hypothesis
* Minimize the blast radius
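
The steps above can be sketched as a small experiment loop. This is a toy model under stated assumptions, not a production tool: `ToyService`, `steady_state_ok` and `run_experiment` are hypothetical names, and the "service" is an in-memory object rather than a real deployment.

```python
from dataclasses import dataclass


@dataclass
class ToyService:
    """Hypothetical stand-in for a service backed by several replicas."""
    healthy: int = 3

    def handle_request(self) -> int:
        # HTTP-like status: 200 while at least one replica is alive, else 503.
        return 200 if self.healthy > 0 else 503


def steady_state_ok(service, samples=100, threshold=0.99):
    """Steady state hypothesis: at least `threshold` of requests return 2xx or 3xx."""
    ok = sum(1 for _ in range(samples) if 200 <= service.handle_request() < 400)
    return ok / samples >= threshold


def run_experiment(service, kill_count):
    """Check steady state, inject a fault, and report whether the hypothesis survived."""
    if not steady_state_ok(service):
        raise RuntimeError("system is not in steady state; aborting experiment")
    original = service.healthy
    service.healthy = max(0, service.healthy - kill_count)  # inject: kill replicas
    try:
        return steady_state_ok(service)
    finally:
        service.healthy = original  # roll back, keeping the blast radius small
```

In this model, `run_experiment(ToyService(), kill_count=1)` leaves the hypothesis intact, while killing all three replicas disproves it, which is the kind of weakness an experiment is meant to surface.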

Slide 20

Slide 20 text

Confidence
Courtesy of SpaceX

Slide 21

Slide 21 text

Credits: https://www.netflix.com

Slide 22

Slide 22 text

The Tools
• https://github.com/Netflix/SimianArmy
• https://github.com/Netflix/chaosmonkey
• https://www.gremlin.com/

Slide 23

Slide 23 text

Summary
Chaos Engineering…
• Uncovers the weaknesses of your system.
• Builds confidence in your infrastructure.

Slide 24

Slide 24 text

More Resources
(Shameless plug) GitHub: dastergon/awesome-chaos-engineering
Thank you!
