Failure Friday: Start Injecting Failure Today!

Failure Friday! DEVOPSDAYS LONDON 2013

OPERATIONS ENGINEER Doug Barth [email protected] @dougbarth

American
Just George Washington riding an eagle and holding an RPG.

Failure Friday “Do not fear failure” by Tomasz Stasiuk

What is PagerDuty?

911 dispatch for IT 999 for you guys :)

Designed for reliability
• Downstream providers fail
• 3 phone providers
• 3 email providers
• 6 SMS providers
• PagerDuty providers fail
• 2 cloud providers
• 3 data centers

Hung up on details
• Bugs in exceptional code paths
• Systems not recovering as quickly as expected
• What is normal when things are abnormal?

Simian Army “wp7wallpaper_Evil_monkey_09” by skyler817
• Chaos Monkey
• Latency Monkey
• Chaos Gorilla
• AWS only
• Requires ASGs

Keep it simple “KISS Band Member Cupcakes” by Clever Cupcakes

Process “How to Draw an Owl” by Chester

Get buy in “Angry Boss” by Kaushal Karkhanis

Schedule
• 1 hour recurring meeting
• Developers & Operations
• List of attacks and identify victim
• Finish as much as possible

Before starting
• Disable cron jobs & CM system
• Announce the start
• Open up relevant dashboards
• Leave alarms enabled
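The "disable cron jobs & CM system" step can be sketched as a small pause/resume helper. The talk does not name the services involved, so `FF_SERVICES`, `DRY_RUN`, and the example names `cron` and `chef-client` are assumptions to be replaced with your own scheduler and CM agent. Monitoring agents should stay off the list, since alarms are meant to remain enabled.

```shell
#!/bin/sh
# Sketch only: pause and resume background automation around a session.
# FF_SERVICES and DRY_RUN are hypothetical names, not from the talk.
# "cron" and "chef-client" are placeholder service names.
FF_SERVICES="${FF_SERVICES:-cron chef-client}"
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands instead of running them

ff_services() {
    # $1 = stop | start
    for svc in $FF_SERVICES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "service $svc $1"
        else
            service "$svc" "$1"
        fi
    done
}

# Before the session:  ff_services stop
# After the all clear: ff_services start
```

Defaulting to a dry run makes it easy to review the exact commands in the chat room before anyone runs them with `DRY_RUN=0`.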

Attacks
• Test a single host and then DC
• 5 minutes
• Return to a working state
• Stop if things break
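One way to enforce the 5-minute window and the return to a working state is a wrapper that injects a failure, waits, and then always reverts. This is a minimal sketch: `run_attack` and `FF_DURATION` are hypothetical names, and the inject and revert commands are whatever the team has agreed on for that attack.

```shell
#!/bin/sh
# Sketch only: run an attack for a fixed window, then revert it.
# run_attack and FF_DURATION are hypothetical names, not from the talk.
FF_DURATION="${FF_DURATION:-300}"   # seconds; the slide suggests 5 minutes

run_attack() {
    # $1 = command that injects the failure, $2 = command that undoes it
    inject="$1"
    revert="$2"
    echo "$(date -u +%H:%M:%S) start: $inject"
    if ! sh -c "$inject"; then
        echo "inject failed, nothing to revert" >&2
        return 1
    fi
    # If the operator aborts with Ctrl-C ("stop if things break"), revert first
    trap 'sh -c "$revert"; exit 130' INT TERM
    sleep "$FF_DURATION"
    trap - INT TERM
    sh -c "$revert"
    echo "$(date -u +%H:%M:%S) reverted: $revert"
}

# Example (as root, on the chosen victim host):
#   run_attack "service cassandra stop" "service cassandra start"
```

The printed start and revert times can be pasted straight into the session log.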

Keep a log
• Keep track of actions taken
• Times are super important
• Also track discoveries and TODOs
• Share dashboards/metrics
• Chat rooms make this easy
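If the chat room is not enough, the log can also be kept in a plain file. The helper below is a sketch that stamps every entry with a UTC time so it can be lined up against graphs later. `ff_log`, `FF_LOG`, and the entry types are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only: append UTC-timestamped entries to a session log file.
# ff_log and FF_LOG are hypothetical names, not from the talk.
FF_LOG="${FF_LOG:-./failure-friday.log}"

ff_log() {
    # $1 = entry type (e.g. ACTION, DISCOVERY, TODO); the rest is free text
    kind="$1"
    shift
    printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$FF_LOG"
}

# Example:
#   ff_log ACTION "stopped cassandra on the victim host"
#   ff_log TODO "alert did not fire, investigate"
```

Tagging TODO entries makes it simple to grep them out at the end and move them to the issue tracker.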

Graphs are awesome

Finishing up
• Sound the all clear
• Enable crons & CM
• Move TODOs to issue tracker

Attack Strategies “Unicorn Attack!” by Sam Howzit

Process Failure
    service cassandra stop

Reboot hosts
    shutdown -r now

Network Isolation
    iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
    iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
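The slide shows only the rules that create the isolation. Returning to a working state means deleting the same rules, which `iptables -D` does when given an identical match. The sketch below generates both sets so they cannot drift apart. `isolation_rules` and `PORTS` are hypothetical names, and the helper only prints commands.

```shell
#!/bin/sh
# Sketch only: print matching add/remove rules for the ports on the slide.
# isolation_rules and PORTS are hypothetical names, not from the talk.
PORTS="${PORTS:-9160 7000}"

isolation_rules() {
    # $1 = add | remove
    for port in $PORTS; do
        if [ "$1" = "add" ]; then
            echo "iptables -I INPUT 1 -p tcp --dport $port -j DROP"
            echo "iptables -I OUTPUT 1 -p tcp --sport $port -j DROP"
        else
            # -D with the same match deletes the rule wherever it sits in the chain
            echo "iptables -D INPUT -p tcp --dport $port -j DROP"
            echo "iptables -D OUTPUT -p tcp --sport $port -j DROP"
        fi
    done
}

# Apply (as root):  isolation_rules add | sh
# Undo afterwards:  isolation_rules remove | sh
```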

Slow node
    tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
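The netem command above adds a 500ms delay with 100ms of jitter and drops 5% of packets on eth0. The slide does not show the cleanup, so the sketch below pairs it with the matching `tc qdisc del`. `slow_node` and `DEV` are hypothetical names, and the helper only prints commands.

```shell
#!/bin/sh
# Sketch only: print the tc commands to slow a node down and to restore it.
# slow_node and DEV are hypothetical names, not from the talk.
DEV="${DEV:-eth0}"

slow_node() {
    case "$1" in
        on)   echo "tc qdisc add dev $DEV root netem delay 500ms 100ms loss 5%" ;;
        off)  echo "tc qdisc del dev $DEV root netem" ;;
        show) echo "tc qdisc show dev $DEV" ;;
        *)    echo "usage: slow_node on|off|show" >&2; return 2 ;;
    esac
}

# Apply (as root):  slow_node on | sh
# Verify:           slow_node show | sh
# Undo afterwards:  slow_node off | sh
```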

Benefits “Results Reader Board” by Rosa Say

Issues fixed
• Aggressive restarts by monit
• Large files on ext3 volumes
• Failing to restart due to bad /etc/fstab file
• High latency from network isolated cache
• Low capacity with a lost DC
• Missing alerts/metrics

Cultural impact
• Knowledge sharing
• Highlights untestable systems
• Keeps failure handling on everyone’s mind

Software Engineer Ken Rose
“FF ramps up new on-call people quickly. If we waited for errors to actually happen, it might be a while before a new on-call person actually sees a certain type of error (e.g., db host down). FF brings those types of errors to the surface so everyone involved can see what should happen.”

Future plans “Robot Swordsman Fight.” by Patrick Gage Kelley

Break more things
• Start testing whole DC outages
• Break multiple services at once
• Distribute failure testing to teams
• Automate

Summary
• Failures will happen
• Proactively test failure handling now
• Choose something easy: app server, cache
• Automate later

http://pagerduty.com/jobs Thank you. OPERATIONS ENGINEER Doug Barth [email protected] @dougbarth
