Chaos Engineering - Google Next Community Summit 2018 - Tammy Butow

Tammy Bryant Butow

July 23, 2018

Transcript

  1. HELLO!

    @tammybutow
    Principal SRE @ Gremlin
    Previously: SRE Manager @ Dropbox. Databases, Block Storage, Code Workflows.
    Used Chaos Engineering to get a 10x reduction in incidents!
  2. GREMLIN

    • Gremlin launched in Dec 2017. github.com/gremlin
    • We are practitioners of Chaos Engineering. We break things on purpose!
    • We build software that helps engineers build more reliable systems through failure injection.

    gremlin.com @tammybutow #GoogleNext18
  3. CHAOS ENGINEERING

    “Thoughtful planned experiments designed to reveal the weaknesses in our systems.”
    - KOLTON ANDRUS, CEO @ GREMLIN, PREVIOUSLY ENGINEER @ NETFLIX & AMAZON
  4. FULL-STACK CHAOS ENGINEERING

    • You can inject chaos at any layer of your stack to increase system resilience.
    • Injecting failure will also train your engineering teams for on-call.
    • Include engineers, engineering managers, designers, PMs, TPMs, VPs and more!

    Layers: API, APPLICATION, CACHING, DATABASE, OPERATING SYSTEM, HARDWARE, RACK, NETWORK / POWER
  5. CONTROLLED CHAOS ENGINEERING

    • Eventually systems will break for you in many undesired ways.
    • Be proactive and break them first on purpose with controlled chaos.
    • Advanced Chaos Engineering involves doing Chaos Engineering with CI/CD.
  6. HOW TO RUN A CHAOS EXPERIMENT

    1. Form a hypothesis
    2. Baseline your metrics
    3. Consider the blast radius
    4. Run your Chaos Engineering experiment
    5. Measure the results of your experiment
    6. Find & fix issues or scale the experiment
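
The six steps above can be sketched as a small experiment loop. This is only an illustration, not Gremlin's implementation: `run_experiment`, `ExperimentResult`, and the `measure`, `inject` and `halt` callables are hypothetical names, and the hypothesis is assumed to take the form "the metric rises by at most `max_allowed_increase`".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    observed: float
    hypothesis_held: bool
    next_step: str


def run_experiment(
    measure: Callable[[], float],      # hypothetical: reads one metric, e.g. error rate
    inject: Callable[[int], None],     # hypothetical: starts the failure on N targets
    halt: Callable[[], None],          # hypothetical: stops the failure injection
    blast_radius: int,
    max_allowed_increase: float,
) -> ExperimentResult:
    """Run one chaos experiment following the six steps from the slide."""
    # Step 1: the hypothesis is "the metric rises by at most max_allowed_increase".
    # Step 2: baseline the metric before injecting anything.
    baseline = measure()
    # Step 3: the blast radius (number of targets) is fixed before the run.
    try:
        # Step 4: run the experiment against that many targets.
        inject(blast_radius)
        # Step 5: measure the result while the failure is active.
        observed = measure()
    finally:
        # Always halt the injection, even if measuring raised an exception.
        halt()
    held = (observed - baseline) <= max_allowed_increase
    # Step 6: fix the issues found, or scale the experiment up.
    next_step = "scale the experiment" if held else "find and fix issues"
    return ExperimentResult(baseline, observed, held, next_step)
```

Passing the three callables in keeps the sketch independent of any particular monitoring or failure-injection tool, and the `finally` block means the injection is halted even when the measurement fails.
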
  7. HOW TO CHOOSE A CHAOS EXPERIMENT

    1. Identify your top 5 critical services
    2. Choose one of these services (e.g. Kafka)
    3. Whiteboard the service with your team
    4. Select the experiment: resource/state/network
    5. Determine the scope: number of machines/impact
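
One way to write down the outcome of these choices is a small, validated record. The sketch below is an assumption about how a team might capture it; `ExperimentPlan` and its field names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

# The three experiment families named on the slide.
CATEGORIES = ("resource", "state", "network")


@dataclass(frozen=True)
class ExperimentPlan:
    service: str      # one of the team's critical services, e.g. "kafka"
    category: str     # resource, state or network
    attack: str       # hypothetical attack name, e.g. "cpu"
    machines: int     # scope: how many machines are targeted
    fleet_size: int   # total machines running the service

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not 1 <= self.machines <= self.fleet_size:
            raise ValueError("machines must be between 1 and fleet_size")

    @property
    def scope_percent(self) -> float:
        """Share of the fleet the experiment touches, as a percentage."""
        return 100.0 * self.machines / self.fleet_size
```

For example, `ExperimentPlan("kafka", "resource", "cpu", machines=1, fleet_size=20)` describes a CPU experiment on one machine out of twenty.
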
  8. WHAT SHOULD YOU MEASURE?

    • Availability: 500s
    • Service specific KPIs
    • System metrics: CPU, IO, DISK
    • Customer complaints
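
As a minimal sketch of the availability measurement, the function below computes the fraction of requests that did not return a 5xx status. It assumes the status codes have already been collected from logs or a load balancer; `availability` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Return the fraction of requests that did not end in a 5xx response."""
    # Count server errors (True) and everything else (False).
    counts = Counter(500 <= code <= 599 for code in status_codes)
    total = counts[True] + counts[False]
    if total == 0:
        raise ValueError("no requests observed")
    return counts[False] / total
```

Comparing this value before and during an experiment gives a simple pass/fail signal for the hypothesis.
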
  9. POD/CONTAINER CHAOS ENGINEERING ⚡

    • Resources: CPU, DISK, IO & Memory
    • State: Processes, Shutdown & Time Travel
    • Network: Blackhole, DNS, Latency & Packet Loss
  10. HELLO WORLD OF CHAOS ENGINEERING

    chaos $ cd scripts
    chaos $ ./burncpu.sh

    github.com/tammybutow/chaosengineeringbootcamp
  11. RESOURCE CHAOS ENGINEERING

    • We can increase CPU, Disk, Memory & IO consumption.
    • Good to catch problems before they turn into high severity incidents and downtime for customers.
    • Chaos Engineering enables you to proactively monitor your monitoring for issues.
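
A minimal sketch of a CPU resource experiment is a time-boxed busy loop. This is not the content of the bootcamp's `burncpu.sh` script, which is not shown in the deck; `burn_cpu` is a hypothetical helper that saturates a single core and then stops on its own.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Keep one core busy for roughly `seconds`, then stop on its own."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # busy work; no sleep, so the core stays saturated
    return iterations
```

The built-in deadline bounds the experiment, so a forgotten run cannot consume CPU indefinitely. Loading several cores would need one such loop per process.
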
  12. STATE CHAOS: PROCESS & TIME

    There are many ways to perform process chaos engineering experiments:
    • Kill one process
    • Loop kill a process
    • Spawn a new process
    • Fork bomb

    You can also do Time Travel Chaos Engineering!
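
The first two process experiments can be sketched with Python's standard `subprocess` module. The example is deliberately safe: it only kills throwaway child processes that it starts itself. `spawn_victim`, `kill_process` and `loop_kill` are hypothetical helper names, and `loop_kill` assumes that "loop kill" means killing the target again each time it comes back.

```python
import subprocess
import sys
from typing import List


def spawn_victim() -> subprocess.Popen:
    """Start a throwaway child process that just sleeps (the kill target)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Kill the process and return its exit code."""
    proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
    return proc.wait(timeout=timeout)


def loop_kill(rounds: int) -> List[int]:
    """Spawn and kill the target repeatedly, imitating a restart-and-kill loop."""
    return [kill_process(spawn_victim()) for _ in range(rounds)]
```

On POSIX systems a process killed this way reports a negative exit code equal to the signal number, which is one way to confirm that it died from the kill rather than exiting normally.
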
  13. STATE CHAOS: PODS/CONTAINERS

    • Kill self
    • Kill a container from the host
    • Use one container to kill another container
    • Use one container to kill several containers
    • Use several containers to kill several containers
  14. NETWORKING CHAOS: DNS

    • Perform regular DNS failover
    • Ensure you can handle DNS outages without impacting customers
    • Use Chaos Engineering to ensure your team are trained to handle DNS issues
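
To rehearse a DNS failover without touching real infrastructure, one option is to make the resolver pluggable and swap in a failing one. The sketch below assumes resolvers are plain callables that return an address string or raise `OSError`; `resolve_with_failover` and `broken_resolver` are hypothetical names.

```python
from typing import Callable, Sequence

Resolver = Callable[[str], str]


def resolve_with_failover(hostname: str, resolvers: Sequence[Resolver]) -> str:
    """Try each resolver in order and return the first address obtained."""
    errors = []
    for resolver in resolvers:
        try:
            return resolver(hostname)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            errors.append(exc)
    raise OSError(f"all {len(resolvers)} resolvers failed for {hostname!r}: {errors}")


def broken_resolver(hostname: str) -> str:
    """Stand-in for a DNS provider outage injected by the experiment."""
    raise OSError("simulated DNS outage")
```

Placing `broken_resolver` first in the list simulates losing the primary provider, and the call should still succeed through the secondary.
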
  15. MOAR NETWORKING CHAOS

    • Latency: Inject latency into egress network traffic.
    • Packet Loss: Induce packet loss into egress network traffic.
    • Blackhole: Drops network traffic.
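
The three network failure modes can be illustrated in-process by wrapping an outgoing send function. This is an application-level simulation under stated assumptions, not how network-level tools shape real traffic; `make_faulty_sender` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, Optional


def make_faulty_sender(
    send: Callable[[bytes], None],
    latency_s: float = 0.0,
    loss_rate: float = 0.0,
    blackhole: bool = False,
    rng: Optional[random.Random] = None,
) -> Callable[[bytes], bool]:
    """Wrap an egress `send` function with latency, packet loss or a blackhole."""
    rng = rng or random.Random()

    def faulty_send(payload: bytes) -> bool:
        if blackhole:
            return False              # blackhole: drop everything
        if rng.random() < loss_rate:
            return False              # packet loss: drop a random share
        if latency_s > 0:
            time.sleep(latency_s)     # latency: delay before sending
        send(payload)
        return True                   # the payload was handed to `send`

    return faulty_send
```

Passing a seeded `random.Random` as `rng` makes the loss pattern repeatable between runs, which helps when comparing results against the baseline.
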
  16. THANKS!

    @tammybutow
    Principal SRE @ Gremlin

    Join us in the Chaos Slack: gremlin.com/slack
    Start breaking things: gremlin.com/community
    Come to chaosconf.io