Chaos Engineering - Google Next Community Summit 2018 - Tammy Butow

Tammy Bütow

July 23, 2018

Transcript

  1. 2.

    HELLO!
    @tammybutow
    Principal SRE @ Gremlin
    Previously: SRE Manager @ Dropbox
    Databases, Block Storage, Code Workflows.
    Used Chaos Engineering to get a 10x reduction in incidents!
  2. 3.

    GREMLIN
    • Gremlin launched in Dec 2017. github.com/gremlin
    • We are practitioners of Chaos Engineering. We break things on purpose!
    • We build software that helps engineers build more reliable systems through failure injection.
    gremlin.com @tammybutow #GoogleNext18
  3. 4.

    CHAOS ENGINEERING
    “Thoughtful planned experiments designed to reveal the weaknesses in our systems.”
    - KOLTON ANDRUS, CEO @ GREMLIN, PREVIOUSLY ENGINEER @ NETFLIX & AMAZON
  4. 5.

    FULL-STACK CHAOS ENGINEERING
    • You can inject chaos at any layer of your stack to increase system resilience.
    • Injecting failure will also train your engineering teams for on-call.
    • Include engineers, engineering managers, designers, PMs, TPMs, VPs and more!
    Layers: API, APPLICATION, CACHING, DATABASE, OPERATING SYSTEM, HARDWARE, RACK, NETWORK / POWER
  5. 7.

    CONTROLLED CHAOS ENGINEERING
    • Eventually systems will break for you in many undesired ways.
    • Be proactive and break them first on purpose with controlled chaos.
    • Advanced Chaos Engineering involves doing Chaos Engineering with CI/CD.
  6. 8.

    HOW TO RUN A CHAOS EXPERIMENT
    1. Form a hypothesis
    2. Baseline your metrics
    3. Consider the blast radius
    4. Run your Chaos Engineering experiment
    5. Measure the results of your experiment
    6. Find & fix issues or scale the experiment
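
The steps on this slide can be read as a small control loop. The following is a minimal sketch, not Gremlin's implementation: `run_experiment`, `ExperimentResult`, `max_degradation`, and the in-memory `state` dictionary are all hypothetical names introduced for illustration, and the hypothesis is assumed to take the form "the metric degrades by no more than a fixed amount".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    observed: float
    hypothesis_held: bool


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    max_degradation: float,
) -> ExperimentResult:
    """Baseline, inject failure, measure, and always roll back."""
    baseline = measure()      # step 2: baseline your metrics
    try:
        inject()              # step 4: run the experiment
        observed = measure()  # step 5: measure the results
    finally:
        rollback()            # undo the injection even if measuring fails
    # Step 1 (assumed form): the metric degrades by at most max_degradation.
    held = (observed - baseline) <= max_degradation
    return ExperimentResult(baseline, observed, held)


# A fake in-memory "service" stands in for real monitoring here.
state = {"error_rate": 0.01}
result = run_experiment(
    measure=lambda: state["error_rate"],
    inject=lambda: state.update(error_rate=0.04),
    rollback=lambda: state.update(error_rate=0.01),
    max_degradation=0.05,
)
print(result.hypothesis_held)
```

If the hypothesis does not hold, step 6 says to find and fix the issue; if it does, the same loop can be rerun with a larger blast radius.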
  7. 9.

    HOW TO CHOOSE A CHAOS EXPERIMENT
    1. Identify your top 5 critical services
    2. Choose one of these services (e.g. Kafka)
    3. Whiteboard the service with your team
    4. Select the experiment: resource/state/network
    5. Determine the scope: number of machines/impact
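
One way to record the outcome of steps 2, 4 and 5 is as a small plan object that caps how many machines an experiment may touch. This is only a sketch: `ExperimentPlan`, `Kind`, `pick_targets`, and the host names are hypothetical and do not come from the talk.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    RESOURCE = "resource"
    STATE = "state"
    NETWORK = "network"


@dataclass(frozen=True)
class ExperimentPlan:
    service: str
    kind: Kind
    max_hosts: int  # scope: upper bound on machines touched


def pick_targets(plan, hosts, rng):
    """Return at most plan.max_hosts hosts, chosen at random."""
    count = min(plan.max_hosts, len(hosts))
    return rng.sample(hosts, count)


plan = ExperimentPlan(service="kafka", kind=Kind.NETWORK, max_hosts=2)
hosts = ["kafka-1", "kafka-2", "kafka-3", "kafka-4", "kafka-5"]  # hypothetical names
targets = pick_targets(plan, hosts, random.Random(0))
print(len(targets))
```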
  8. 11.

    WHAT SHOULD YOU MEASURE?
    • Availability: 500s
    • Service specific KPIs
    • System metrics: CPU, IO, DISK
    • Customer complaints
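
As an illustration of the availability metric, the sketch below computes the share of requests that did not return an HTTP 5xx status. It assumes "500s" on the slide means 5xx responses; the `availability` function and the sample data are made up for this example.

```python
def availability(status_codes):
    """Fraction of requests that did not fail with an HTTP 5xx status."""
    if not status_codes:
        raise ValueError("need at least one request to compute availability")
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return 1 - failures / len(status_codes)


# Sample data: 97 successful requests and 3 server errors.
codes = [200] * 97 + [500, 502, 503]
print(f"{availability(codes):.2%}")
```

Comparing this number before and during an experiment gives the baseline and the measured result for the same metric.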
  9. 12.

    POD/CONTAINER CHAOS ENGINEERING ⚡
    • Resources: CPU, DISK, IO & Memory
    • State: Processes, Shutdown & Time Travel
    • Network: Blackhole, DNS, Latency & Packet Loss
  10. 14.

    HELLO WORLD OF CHAOS ENGINEERING
    chaos $ cd scripts
    chaos $ ./burncpu.sh
    github.com/tammybutow/chaosengineeringbootcamp
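
The contents of `burncpu.sh` are not shown in the transcript. As a rough stand-in, the sketch below keeps one CPU core busy in Python for a bounded time; the `burn` function and its deadline are assumptions made here so the experiment stops on its own.

```python
import time


def burn(seconds: float) -> int:
    """Spin in a tight loop until the deadline; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


# Keep one core busy for half a second, then stop automatically.
count = burn(0.5)
print(f"completed {count} loop iterations")
```

To load several cores, one such process can be started per core. The explicit deadline keeps the blast radius small compared with a loop that runs until it is killed.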
  11. 15.

    RESOURCE CHAOS ENGINEERING
    • We can increase CPU, Disk, Memory & IO consumption
    • Good to catch problems before they turn into high severity incidents and downtime for customers.
    • Chaos Engineering enables you to proactively monitor your monitoring for issues.
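
Memory consumption can be raised in a similarly bounded way. The sketch below is an assumption-laden illustration, not a tool from the talk: `consume_memory` is a hypothetical helper that allocates a fixed number of 1 MiB buffers, holds them briefly so dashboards and alerts can react, and then releases them.

```python
import time

MIB = 1024 * 1024


def consume_memory(megabytes: int, hold_seconds: float) -> int:
    """Allocate `megabytes` MiB, hold it for a while, then release it."""
    # Fill each buffer with non-zero bytes so the pages are actually written.
    chunks = [bytearray(b"x" * MIB) for _ in range(megabytes)]
    time.sleep(hold_seconds)
    held = sum(len(chunk) for chunk in chunks)
    del chunks  # drop the references so the memory can be reclaimed
    return held


held_bytes = consume_memory(megabytes=8, hold_seconds=0.1)
print(held_bytes // MIB, "MiB held")
```

Watching whether memory alerts fire while the buffers are held is one way to "monitor your monitoring".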
  12. 16.

    STATE CHAOS: PROCESS & TIME
    There are many ways to perform process chaos engineering experiments:
    • Kill one process
    • Loop kill a process
    • Spawn a new process
    • Fork bomb
    You can also do Time Travel Chaos Engineering!
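
The first three process experiments can be tried safely against a throwaway child process. In this sketch the "victim" is a Python child that only sleeps, and `spawn_victim`, `kill_once`, and `loop_kill` are hypothetical helper names; a real experiment would target a process of the service under test.

```python
import subprocess
import sys


def spawn_victim():
    """Spawn a new throwaway child process that just sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_once(proc):
    """Kill one process and return its exit code."""
    proc.kill()
    return proc.wait(timeout=10)


def loop_kill(rounds):
    """Start and kill the victim repeatedly, as if a supervisor kept restarting it."""
    codes = []
    for _ in range(rounds):
        codes.append(kill_once(spawn_victim()))
    return codes


codes = loop_kill(3)
print(codes)
```

The fork bomb is deliberately left out of the sketch, since it can take down the machine it runs on.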
  13. 17.

    STATE CHAOS: PODS/CONTAINERS
    • Kill self
    • Kill a container from the host
    • Use one container to kill another container
    • Use one container to kill several containers
    • Use several containers to kill several containers
  14. 18.

    NETWORKING CHAOS: DNS
    • Perform regular DNS failover
    • Ensure you can handle DNS outages without impacting customers
    • Use Chaos Engineering to ensure your team are trained to handle DNS issues
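
Handling a DNS outage can be rehearsed in application code by swapping in a resolver that always fails. The sketch below assumes a design where the client falls back to a static map of known addresses; `resolve`, `broken_resolver`, the host name, and the IP address are all hypothetical.

```python
import socket


def resolve(hostname, resolver=socket.gethostbyname, fallback=None):
    """Resolve a hostname, falling back to a static map if DNS fails."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        if fallback is not None and hostname in fallback:
            return fallback[hostname]
        raise


def broken_resolver(hostname):
    """Stand-in for a DNS outage: every lookup fails."""
    raise socket.gaierror("simulated DNS outage")


known_hosts = {"db.internal.example": "10.0.0.5"}  # made-up fallback entry
address = resolve("db.internal.example", resolver=broken_resolver, fallback=known_hosts)
print(address)
```

Injecting the resolver as a parameter keeps the experiment local to the code under test, so nothing on the host's real DNS configuration is touched.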
  15. 19.

    MOAR NETWORKING CHAOS
    • Latency: Inject latency into egress network traffic.
    • Packet Loss: Induce packet loss into egress network traffic.
    • Blackhole: Drops network traffic.
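
The three network attacks can be approximated at the application level by wrapping an outgoing call. This is only a simulation under stated assumptions: the talk describes injection into real egress traffic, whereas `with_chaos`, `Dropped`, and the `echo` function below are hypothetical stand-ins that never touch the network.

```python
import random
import time


class Dropped(Exception):
    """Raised when the simulated network drops a call."""


def with_chaos(send, latency_s=0.0, loss_rate=0.0, rng=None):
    """Wrap `send` so calls are delayed and randomly dropped."""
    if rng is None:
        rng = random.Random()

    def chaotic_send(payload):
        if rng.random() < loss_rate:  # packet loss; a rate of 1.0 is a blackhole
            raise Dropped("simulated packet loss")
        time.sleep(latency_s)         # injected latency
        return send(payload)

    return chaotic_send


def echo(payload):
    """Stand-in for a real outgoing request."""
    return payload


slow = with_chaos(echo, latency_s=0.05)
blackhole = with_chaos(echo, loss_rate=1.0)

print(slow("ping"))
try:
    blackhole("ping")
except Dropped as exc:
    print("dropped:", exc)
```

Callers wrapped this way show whether timeouts and retries behave as the hypothesis predicts before the same experiment is run on real traffic.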
  16. 20.

    THANKS!
    @tammybutow
    Principal SRE @ Gremlin
    Join us in the Chaos Slack: gremlin.com/slack
    Start breaking things: gremlin.com/community
    Come to chaosconf.io