Failure Friday: Start Injecting Failure Today!
Doug Barth
September 12, 2014
DevOpsDays Toronto 2014
Video: http://vimeo.com/107528697
Transcript
9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!
Dev Ops
DevOps Engineer
(image: “DO NOT FEAR FAILURE” BY TOMASZ STASIUK)
How is babby PagerDuty formed?
Designed for reliability. Downstream providers fail: 3 phone providers, 3 email providers, 6 SMS providers. PagerDuty providers fail: 2 cloud providers, 3 data centers
Hung up on details: bugs in exceptional code paths; systems not recovering as quickly as expected; what is normal when things are abnormal?
Simian Army: Chaos Monkey, Latency Monkey, Chaos Gorilla, Chaos Kong (image: “WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817)
Keep it simple (image: “KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES)
Process (image: “HOW TO DRAW AN OWL” BY CHESTER)
Get buy-in (image: “ANGRY BOSS” BY KAUSHAL KARKHANIS)
Schedule: 1 hour recurring meeting; developers & operations; list of attacks and identify victim; finish as much as possible
Before starting: disable cron jobs & CM system; announce the start; open up relevant dashboards; leave alarms enabled
Attacks: test a single host and then DC; 5 minutes; return to a working state; stop if things break
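The attack routine above (inject a failure, hold it for about 5 minutes, return to a working state, stop early if things break) could be wrapped in a small helper. This is only a sketch, not the tooling used in the talk: `run_attack` and the function names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# run_attack INJECT_FN REVERT_FN [SECONDS]
# Hypothetical helper: runs INJECT_FN, waits, then always runs REVERT_FN.
# The body is a subshell so the traps do not leak into the calling shell.
run_attack() (
  inject="$1"
  revert="$2"
  duration="${3:-300}"        # the slide suggests 5 minutes per attack

  trap '"$revert"' EXIT       # always return to a working state
  trap 'exit 130' INT TERM    # Ctrl-C: stop early if things break

  "$inject"
  sleep "$duration"
)

# Example (requires root; both function names are illustrative):
# stop_cassandra()  { service cassandra stop; }
# start_cassandra() { service cassandra start; }
# run_attack stop_cassandra start_cassandra 300
```

Tying the revert step to the shell's EXIT trap means an aborted attack still gets cleaned up, which matches the "stop if things break" rule.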
Keep a log: keep track of actions taken; times are super important; also track discoveries and TODOs; share dashboards/metrics; chat rooms make this easy
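The talk keeps this log in a chat room. Where a chat room is not available, a timestamped local log is one possible substitute. The sketch below assumes a hypothetical `log_event` helper and a `failure-friday.log` file name; neither comes from the slides.

```shell
#!/usr/bin/env bash
# Hypothetical log location; override with LOG_FILE=/path/to/file.
LOG_FILE="${LOG_FILE:-failure-friday.log}"

# Print an entry prefixed with a UTC timestamp and append it to the log.
log_event() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" | tee -a "$LOG_FILE"
}

# Example:
# log_event "ACTION stopped cassandra on one host"
# log_event "TODO alert did not fire for the stopped node"
```

UTC timestamps make it easier to line entries up with dashboards afterwards.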
Graphs are awesome
Finishing up: sound the all clear; enable crons & CM; move TODOs to issue tracker
Attack Strategies (image: “UNICORN ATTACK!” BY SAM HOWZIT)
service cassandra stop
shutdown -r now
iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
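The slide shows only how to insert the DROP rules. Returning to a working state also requires deleting them. The sketch below generates both sets of commands for the two ports on the slide (9160 and 7000); the function names and the `APPLY` switch are assumptions added for illustration. It prints the commands by default and runs them only when `APPLY=1` is set, which requires root.

```shell
#!/usr/bin/env bash
# Ports taken from the slide; everything else here is illustrative.
PORTS="9160 7000"

# Emit one iptables command per port and direction.
# $1 is -I (insert the DROP rule) or -D (delete it again).
cassandra_rules() {
  local action="$1" pos="" port
  # Only -I takes a rule position; -D matches on the rule specification.
  if [ "$action" = "-I" ]; then pos=" 1"; fi
  for port in $PORTS; do
    echo "iptables $action INPUT$pos -p tcp --dport $port -j DROP"
    echo "iptables $action OUTPUT$pos -p tcp --sport $port -j DROP"
  done
}

# Dry run by default: print the commands. Set APPLY=1 (as root) to run them.
run_rules() {
  cassandra_rules "$1" | while read -r cmd; do
    if [ "${APPLY:-0}" = "1" ]; then $cmd; else echo "$cmd"; fi
  done
}

run_rules -I   # start the attack
run_rules -D   # return to a working state
```

Generating the insert and delete commands from the same function keeps the two sets in step, so a rule added during the attack is not left behind.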
tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
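The netem rule adds 500ms of delay with 100ms of jitter and 5% packet loss on `eth0`. A matching `tc qdisc del` removes it afterwards. The sketch below pairs the two; the `netem_cmd` helper and the overridable `DEV` variable are assumptions, and it only prints the commands so that it can be reviewed before being run as root.

```shell
#!/usr/bin/env bash
# Interface from the slide; override with DEV=<interface> if needed.
DEV="${DEV:-eth0}"

# Print the tc command that starts or stops the latency/loss attack.
netem_cmd() {
  case "$1" in
    start) echo "tc qdisc add dev $DEV root netem delay 500ms 100ms loss 5%" ;;
    stop)  echo "tc qdisc del dev $DEV root netem" ;;
    *)     echo "usage: netem_cmd start|stop" >&2; return 1 ;;
  esac
}

netem_cmd start
netem_cmd stop
```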
(image: “RESULTS READER BOARD” BY ROSA SAY)
Issues fixed: aggressive restarts by monit; large files on ext3 volumes; failing to restart due to bad /etc/fstab file; high latency from network isolated cache; low capacity with a lost DC; missing alerts/metrics
Cultural impact: knowledge sharing; highlights untestable systems; keeps failure handling on everyone’s mind
Future plans (image: “ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY)
Break more things: start testing whole DC outages; break multiple services at once; distribute failure testing to teams; automate
Summary: failures will happen; proactively test failure handling now; choose something easy (app server, cache); automate later
Thank you. pagerduty.com/jobs