approach to resilience testing of distributed software systems
• Chaos Experiment
  - define a "normal/steady" state of the system (e.g. by monitoring a set of system and business metrics)
  - pseudo-randomly inject faults (e.g. by terminating VMs, killing containers or changing network behavior)
  - try to discover system weaknesses by looking for deviations from the expected, steady-state behavior
The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system.
http://principlesofchaos.org/
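The steady-state check at the heart of such an experiment can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the helper name `within_steady_state`, the metric values and the 10% tolerance are made up for the example, and a positive baseline is assumed.

```shell
# Hypothetical helper: succeeds (exit 0) when an observed metric stays within
# a tolerance (in percent) of the baseline measured before injecting faults.
# Assumes a positive baseline value.
within_steady_state() {
  baseline=$1 observed=$2 tolerance_pct=$3
  awk -v b="$baseline" -v o="$observed" -v t="$tolerance_pct" 'BEGIN {
    d = o - b; if (d < 0) d = -d      # absolute deviation from the baseline
    exit !(d <= b * t / 100)          # exit status 0 means still steady
  }'
}

# Example with made-up numbers: baseline of 200 requests/s, 10% tolerance.
if within_steady_state 200 185 10; then
  echo "steady state held"
else
  echo "weakness found: metric left the steady-state band"
fi
```

In a real experiment the observed value would come from the monitoring system after each fault injection, and a failed check is the signal worth investigating.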
(warthog) from Disney’s animated film The Lion King
2. In Swahili, pumbaa means “to be foolish, silly, weak-minded, careless, negligent”
3. It's also an open source Chaos Testing tool for Docker containers
1. https://github.com/gaia-adm/pumba
2. Linux, Windows, macOS, Docker
injecting different failures
• The "victim" container can be specified by name(s) or by regex
• Random selection is also supported (with the `--random` flag)
• A repeatable time interval and a duration can be defined to better control the chaos
• Pumba can disturb a single Docker host, a Swarm cluster, or a Kubernetes cluster
kill (send a termination or other signal to) the main process within a Docker container
3. remove "victim" containers, with their links and volumes
4. pause all processes within a "victim" Docker container for a specified time
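As a rough illustration of how victim selection and these commands combine on one command line, the sketch below prints a few Pumba invocations instead of running them. The `run` helper, the `DRY_RUN` variable and the container names (`victim1`, names starting with `test`) are assumptions made for this example. The flags follow the Pumba README as far as known; verify them with `pumba --help` for the installed version.

```shell
# Hypothetical helper: print the command by default; set DRY_RUN=0 to execute
# it for real (needs pumba and access to the Docker daemon).
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Every 30 seconds, kill one randomly chosen container whose name starts with "test".
run pumba --random --interval 30s kill --signal SIGKILL "re2:^test"

# Send SIGTERM to the main process of a single named container.
run pumba kill --signal SIGTERM victim1

# Freeze all processes in the container for 30 seconds, then resume them.
run pumba pause --duration 30s victim1

# Remove the container together with its volumes.
run pumba rm --volumes victim1
```

Printing first and executing only on request keeps the blast radius visible before any container is touched.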
at container level (filter by IP too)
2. delay egress traffic for the specified containers
3. add packet loss based on different probability loss models (2-, 3- and 4-state Markov, Gilbert, Simple Gilbert and Bernoulli)
4. rate-limit egress traffic for the specified containers
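To give a feel for why a state-based loss model differs from independent (Bernoulli) loss, here is a toy simulation of the Simple Gilbert model: a two-state chain where packets are delivered in the good state and dropped in the bad state. It is a sketch for intuition only, not netem's or Pumba's implementation; the function name `simple_gilbert` and the probabilities used are assumptions chosen for the example.

```shell
# simple_gilbert P R N [SEED]
#   P = probability of moving from the good state to the bad state
#   R = probability of moving from the bad state back to the good state
#   N = number of packets to simulate
# Prints one symbol per packet: "." = delivered, "x" = lost.
simple_gilbert() {
  awk -v p="$1" -v r="$2" -v n="$3" -v seed="${4:-1}" 'BEGIN {
    srand(seed); bad = 0
    for (i = 0; i < n; i++) {
      if (bad) { if (rand() < r) bad = 0 }   # leave the loss burst
      else     { if (rand() < p) bad = 1 }   # enter a loss burst
      printf "%s", (bad ? "x" : ".")
    }
    print ""
  }'
}

# Rare but bursty loss: losses tend to appear in runs rather than in isolation.
simple_gilbert 0.05 0.5 60
```

With a small `P` and a moderate `R`, the output shows clusters of `x`, which is the behavior that makes these models more realistic than a fixed per-packet loss percentage.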
on (default) network device of Docker container for 5 minutes
$ pumba netem --duration 5m delay --time 3000 mydb

# add a delay of 3000ms ± 30ms,
# with the next random element depending 20% on the last one,
# for all outgoing packets on device eth1 of all Docker containers
# whose names start with hp, for 5 minutes
$ pumba netem --duration 5m --interface eth1 delay \
    --time 3000 --jitter 30 --correlation 20 re2:^hp

# add a delay of 3000ms ± 40ms, where the variation in delay
# is described by a normal distribution,
# for all outgoing packets on the main network device of a randomly
# chosen Docker container from the specified list, for 5 minutes
$ pumba --random netem --duration 5m delay --time 3000 \
    --jitter 40 --distribution normal \
    container1 container2 container3
a native framework for routing, bridging, firewalling, address translation and much else.
• Before a packet leaves the output interface, it passes through Linux Traffic Control (tc). This component is a powerful tool for scheduling, shaping, classifying and prioritizing traffic.
• The basic component of Linux Traffic Control is the queuing discipline (qdisc). The simplest implementation of a qdisc is first in, first out (FIFO). There are others too.
• The network emulation (netem) project adds queuing disciplines that emulate wide area network properties such as latency, jitter, loss, duplication, corruption and reordering.
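For orientation, this is roughly what attaching a netem queuing discipline looks like with plain `tc`. The sketch only prints the commands by default, since running them needs root privileges and changes real network behavior. The `run` helper, the `DRY_RUN` variable and the interface name `eth0` are assumptions for the example.

```shell
# Hypothetical helper: print the command by default; set DRY_RUN=0 to execute
# it for real (needs root and the iproute2 tc utility).
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Attach netem as the root qdisc of eth0: 3000ms delay, 30ms jitter,
# each delay value depending 20% on the previous one.
run tc qdisc add dev eth0 root netem delay 3000ms 30ms 20%

# Inspect the active queuing discipline.
run tc qdisc show dev eth0

# Remove the emulation again so the interface returns to its default qdisc.
run tc qdisc del dev eth0 root netem
```

The delay, jitter and correlation values mirror the Pumba delay example, which makes it easy to compare the two command lines.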