Culture from Chaos

Doug Barth
November 05, 2015

This talk explores the cultural benefits we can gain from chaos engineering if we avoid automating our failure injection early on.

This talk was given at the first Chaos Community Day, which was organized by Netflix and held at Uber's office.

Transcript

  1. 4.

    Automate Experiments to Run Continuously

    Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.

    CULTURE FROM CHAOS, @dougbarth, 11/4/15
  2. 7.

    An overview

    - 1 hour meeting
    - Agenda preannounced
    - Get through as much as possible
  3. 11.

    Network isolation

        iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
        iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
        iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
        iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
  4. 12.

    Network latency

        tc qdisc add dev eth0 root netem delay 500ms 100ms loss 15%
  5. 14.

    Optimized for learning

    - Tailor tests for the system
    - Adjust the tests on the fly based on system feedback
  6. 15.

    Knowledge sharing

    - Reducing bus factor
    - Senior engineers teaching juniors
    - Learn communication tricks
  7. 16.

    System design

    - Failover is not an option
    - Stateless app servers
    - Multi-master datastores
    - 1% suboptimal routing
    - Capacity planning
  8. 17.

    Incident response training

    - Declare an IC
    - Have them choose SMEs (e.g., hands on keyboards)
    - Learn how to work as a group
    - Set up suboptimal scenarios: conference calls
  9. 18.

    Incident triage

    - Set up a response team
    - Introduce an unannounced failure
    - Get response team to organize and resolve the failure
  10. 19.

    Use Experiments for Learning

    Experiments that aren’t introducing new insights should be automated and used to monitor the ongoing health of the system. New experiments should be devised to continue to push the bounds of the system.
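
The network isolation and network latency slides (11 and 12) show the commands that inject each fault, but not how to remove them once the experiment is over. The sketch below is one possible way to pair each fault with its cleanup; it is not taken from the talk. The function names, the `DRY_RUN`, `PORTS`, and `DEV` variables, and the dry-run behavior are all assumptions made for illustration. The ports and the `eth0` interface come from the slides (9160 and 7000 are Cassandra's default Thrift and inter-node ports, which suggests, but does not confirm, that the target was a Cassandra node).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not from the talk: wraps the slide commands so each
# fault can be applied and later reverted. DRY_RUN defaults to 1, which only
# prints the commands instead of running them.

DRY_RUN="${DRY_RUN:-1}"
PORTS="${PORTS:-9160 7000}"   # ports taken from the slides
DEV="${DEV:-eth0}"            # interface taken from the slides

# Print the command in dry-run mode; otherwise execute it (requires root).
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Drop inbound traffic to, and outbound traffic from, each port.
isolate_start() {
  local port
  for port in $PORTS; do
    run iptables -I INPUT 1 -p tcp --dport "$port" -j DROP
    run iptables -I OUTPUT 1 -p tcp --sport "$port" -j DROP
  done
}

# Delete the same rules by rule specification.
isolate_stop() {
  local port
  for port in $PORTS; do
    run iptables -D INPUT -p tcp --dport "$port" -j DROP
    run iptables -D OUTPUT -p tcp --sport "$port" -j DROP
  done
}

# Add 500ms of delay with 100ms of jitter and 15% packet loss.
latency_start() {
  run tc qdisc add dev "$DEV" root netem delay 500ms 100ms loss 15%
}

# Remove the root qdisc, which restores the interface default.
latency_stop() {
  run tc qdisc del dev "$DEV" root
}
```

Sourcing the file and calling `isolate_start` prints the four `iptables` commands for review; running with `DRY_RUN=0` as root would actually apply them, and `isolate_stop` or `latency_stop` undoes the change. Keeping the steps as separate, manually invoked functions fits the talk's point about adjusting tests on the fly rather than automating them early.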