Chaos Engineering - Google Next Community Summit 2018 - Tammy Butow

Slide 1

Slide 1 text

CHAOS ENGINEERING @tammybutow ⚡ Break things on purpose

Slide 2

Slide 2 text

HELLO!
@tammybutow
Principal SRE @ Gremlin
Previously: SRE Manager @ Dropbox
Databases, Block Storage, Code Workflows.
Used Chaos Engineering to get a 10x reduction in incidents!

Slide 3

Slide 3 text

GREMLIN
• Gremlin launched in Dec 2017. github.com/gremlin
• We are practitioners of Chaos Engineering. We break things on purpose!
• We build software that helps engineers build more reliable systems through failure injection.
gremlin.com
@tammybutow #GoogleNext18

Slide 4

Slide 4 text

CHAOS ENGINEERING
“Thoughtful planned experiments designed to reveal the weaknesses in our systems.”
- KOLTON ANDRUS, CEO @ GREMLIN, PREVIOUSLY ENGINEER @ NETFLIX & AMAZON

Slide 5

Slide 5 text

FULL-STACK CHAOS ENGINEERING
• You can inject chaos at any layer of your stack to increase system resilience.
• Injecting failure will also train your engineering teams for on-call.
• Include engineers, engineering managers, designers, PMs, TPMs, VPs and more!
Layers: API, APPLICATION, CACHING, DATABASE, OPERATING SYSTEM, HARDWARE, RACK, NETWORK / POWER

Slide 6

Slide 6 text


Slide 7

Slide 7 text

CONTROLLED CHAOS ENGINEERING
• Eventually your systems will break in many undesired ways.
• Be proactive and break them first, on purpose, with controlled chaos.
• Advanced Chaos Engineering involves running chaos experiments as part of CI/CD.

Slide 8

Slide 8 text

HOW TO RUN A CHAOS EXPERIMENT
1. Form a hypothesis
2. Baseline your metrics
3. Consider the blast radius
4. Run your Chaos Engineering experiment
5. Measure the results of your experiment
6. Find & fix issues or scale the experiment
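The six steps above can be sketched as a small harness. This is only an illustration under stated assumptions, not Gremlin's API: `run_experiment`, `Result`, and the three callables passed in are hypothetical names, and the metric is assumed to be any number where higher is worse (for example an error rate). Limiting the blast radius (step 3) is left to whatever `inject` targets, and deciding whether to fix or scale up (step 6) is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    baseline: float
    observed: float
    hypothesis_held: bool


def run_experiment(
    measure: Callable[[], float],   # returns the metric, e.g. an error rate
    inject: Callable[[], None],     # starts the failure (hypothetical hook)
    halt: Callable[[], None],       # stops the failure and restores state
    max_degradation: float,         # step 1: hypothesis, metric rises by at most this
) -> Result:
    baseline = measure()            # step 2: baseline your metrics
    inject()                        # step 4: run the experiment
    try:
        observed = measure()        # step 5: measure the results
    finally:
        halt()                      # always stop the failure, even if measuring fails
    held = (observed - baseline) <= max_degradation
    return Result(baseline, observed, held)
```

A caller would typically start with a small target set and rerun with a larger one only when `hypothesis_held` is true.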

Slide 9

Slide 9 text

HOW TO CHOOSE A CHAOS EXPERIMENT
1. Identify your top 5 critical services
2. Choose one of these services (e.g. Kafka)
3. Whiteboard the service with your team
4. Select the experiment: resource/state/network
5. Determine the scope: number of machines/impact
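Step 5, determining the scope, often comes down to picking a small share of the machines behind a service. A minimal sketch of that selection follows; `pick_targets` and the host names are hypothetical, and a real setup would pull hosts from an inventory or orchestrator.

```python
import math
import random


def pick_targets(hosts, fraction, seed=None):
    """Return a random subset covering `fraction` of the hosts (at least one)."""
    hosts = list(hosts)
    if not hosts:
        return []
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, math.ceil(len(hosts) * fraction))
    # A seed makes the choice repeatable, which helps when rerunning an experiment.
    return random.Random(seed).sample(hosts, count)


# Hypothetical fleet of ten Kafka brokers; start with 10% of them.
brokers = [f"kafka-{i}" for i in range(10)]
targets = pick_targets(brokers, 0.1, seed=1)
```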

Slide 10

Slide 10 text

EXAMPLE SYSTEM: KUBERNETES RETAIL STORE

Slide 11

Slide 11 text

WHAT SHOULD YOU MEASURE?
• Availability: rate of HTTP 500 responses
• Service-specific KPIs
• System metrics: CPU, IO, disk
• Customer complaints
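As one way to turn the availability bullet into a number, the sketch below computes the share of requests that did not return an HTTP 5xx status. The function name and the sample status codes are made up for illustration; in practice the codes would come from load balancer or access logs.

```python
def availability(status_codes):
    """Fraction of requests that did not return an HTTP 5xx status."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return 1 - errors / len(codes)


# Made-up sample: one server error out of four requests.
before = availability([200, 200, 200, 200])
during = availability([200, 200, 500, 200])
```

Comparing the value before and during an experiment gives a simple check on the hypothesis.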

Slide 12

Slide 12 text

POD/CONTAINER CHAOS ENGINEERING ⚡
• Resources: CPU, Disk, IO & Memory
• State: Processes, Shutdown & Time Travel
• Network: Blackhole, DNS, Latency & Packet Loss

Slide 13

Slide 13 text

K8S CHAOS ENGINEERING BOOTCAMP
github.com/tammybutow/chaosengineeringbootcamp

Slide 14

Slide 14 text

HELLO WORLD OF CHAOS ENGINEERING
chaos $ cd scripts
chaos $ ./burncpu.sh
github.com/tammybutow/chaosengineeringbootcamp
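The slide does not show what `burncpu.sh` contains. As a rough idea of what a CPU burn does, here is a minimal Python sketch (not the bootcamp's script): one busy loop keeps one core occupied, and one process per core loads the whole machine. The function names are hypothetical.

```python
import multiprocessing
import time


def burn_one_core(duration_s):
    """Busy-loop for roughly duration_s seconds; returns the iteration count."""
    deadline = time.monotonic() + duration_s
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work that keeps one core busy
    return iterations


def burn_cpu(duration_s, workers=None):
    """Run one busy loop per worker process (defaults to one per CPU)."""
    if workers is None:
        workers = multiprocessing.cpu_count()
    procs = [
        multiprocessing.Process(target=burn_one_core, args=(duration_s,))
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

When run as a script, `burn_cpu` should be called under an `if __name__ == "__main__":` guard so that platforms which spawn worker processes do not re-execute it. Keeping the duration short and bounded is a simple way to limit the blast radius.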

Slide 15

Slide 15 text

RESOURCE CHAOS ENGINEERING
• We can increase CPU, Disk, Memory & IO consumption.
• This helps catch problems before they turn into high-severity incidents and downtime for customers.
• Chaos Engineering lets you proactively check that your monitoring detects these issues.

Slide 16

Slide 16 text

STATE CHAOS: PROCESS & TIME
There are many ways to perform process chaos engineering experiments:
• Kill one process
• Loop kill a process
• Spawn a new process
• Fork bomb
You can also do Time Travel Chaos Engineering!
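The first two process experiments can be tried safely against a throwaway child process. In the sketch below, `spawn_victim`, `kill_process`, and `loop_kill` are hypothetical helpers and the sleeping child stands in for a real service. A real loop kill would repeatedly kill a process that a supervisor keeps restarting; here the loop starts each new victim itself.

```python
import subprocess
import sys


def spawn_victim():
    """Start a throwaway child that just sleeps (stand-in for a real service)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc, grace_s=2.0):
    """Ask the process to exit, then force-kill it if it has not exited in time."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX
        proc.wait()
    return proc.returncode


def loop_kill(rounds):
    """Start and kill the victim `rounds` times, returning each exit code."""
    return [kill_process(spawn_victim()) for _ in range(rounds)]
```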

Slide 17

Slide 17 text

STATE CHAOS: PODS/CONTAINERS
• Kill self
• Kill a container from the host
• Use one container to kill another container
• Use one container to kill several containers
• Use several containers to kill several containers

Slide 18

Slide 18 text

NETWORKING CHAOS: DNS
• Perform regular DNS failover
• Ensure you can handle DNS outages without impacting customers
• Use Chaos Engineering to ensure your team are trained to handle DNS issues
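One client-side way to ride out a DNS outage is to fall back to the last address that resolved successfully. The sketch below is an assumption about how an application might do this, not something the talk prescribes. `resolve_with_fallback` is a hypothetical helper, and the `resolver` parameter exists so that an outage can be simulated without touching real DNS.

```python
import socket


def resolve_with_fallback(host, last_known_good, resolver=socket.gethostbyname):
    """Resolve host; on DNS failure fall back to the last address that worked."""
    try:
        address = resolver(host)
    except socket.gaierror:
        if host in last_known_good:
            return last_known_good[host]  # serve the stale answer during the outage
        raise                             # nothing cached, surface the failure
    last_known_good[host] = address       # remember the fresh answer
    return address


def simulated_outage(host):
    """Stand-in resolver that fails the way a DNS outage would."""
    raise socket.gaierror("simulated DNS outage")
```

A stale address can itself be wrong after a failover, so a cache like this would normally carry an expiry.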

Slide 19

Slide 19 text

MOAR NETWORKING CHAOS
• Latency: Inject latency into egress network traffic.
• Packet Loss: Induce packet loss in egress network traffic.
• Blackhole: Drop network traffic.
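The attacks above act on network traffic itself. As an application-level stand-in for trying them in a test, a call can be wrapped so that it is delayed, randomly dropped, or always dropped. `with_network_chaos` is a hypothetical helper, and a real blackhole usually shows up as a timeout rather than the immediate error raised here.

```python
import random
import time


def with_network_chaos(call, latency_s=0.0, loss_rate=0.0, blackhole=False, rng=None):
    """Wrap `call` so each invocation is delayed, randomly dropped, or always dropped."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # Blackhole drops everything; loss_rate drops that fraction of calls.
        if blackhole or rng.random() < loss_rate:
            raise ConnectionError("simulated dropped traffic")
        time.sleep(latency_s)  # injected latency before the real call
        return call(*args, **kwargs)

    return wrapped
```

Wrapping a client call this way shows whether the caller's timeouts and retries behave as expected before the same failure is injected at the network layer.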

Slide 20

Slide 20 text

THANKS!
@tammybutow
Principal SRE @ Gremlin
Join us in the Chaos Slack: gremlin.com/slack
Start breaking things: gremlin.com/community
Come to chaosconf.io