Failure Friday: Start Injecting Failure Today!
Doug Barth
September 12, 2014
DevOpsDays Toronto 2014
Video:
http://vimeo.com/107528697
Transcript
9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!

Dev Ops

DevOps Engineer

Image: "Do Not Fear Failure" by Tomasz Stasiuk

How is babby PagerDuty formed?

Designed for reliability
- Downstream providers fail
  - 3 phone providers
  - 3 email providers
  - 6 SMS providers
- PagerDuty providers fail
  - 2 cloud providers
  - 3 data centers

Hung up on details
- Bugs in exceptional code paths
- Systems not recovering as quickly as expected
- What is normal when things are abnormal?

Simian Army
- Chaos Monkey
- Latency Monkey
- Chaos Gorilla
- Chaos Kong
Image: "WP7WALLPAPER_EVIL_MONKEY_09" by Skyler817

Keep it simple
Image: "KISS Band Member Cupcakes" by Clever Cupcakes

Process
Image: "How to Draw an Owl" by Chester

Get buy-in
Image: "Angry Boss" by Kaushal Karkhanis

Schedule
- 1 hour recurring meeting
- Developers & Operations
- List of attacks and identify victim
- Finish as much as possible

Before starting
- Disable cron jobs & CM system
- Announce the start
- Open up relevant dashboards
- Leave alarms enabled

Attacks
- Test a single host and then DC
- 5 minutes
- Return to a working state
- Stop if things break
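
The attack loop on this slide (inject a failure, hold it for 5 minutes, return to a working state) could be wrapped roughly as follows. This is a minimal sketch, not something shown in the talk: `run`, `run_attack`, `ATTACK_SECONDS` and `DRY_RUN` are hypothetical names, and the script only prints commands unless `DRY_RUN=0` is set. Deciding to stop early when things break is left to the operator.

```shell
# Sketch of one attack: inject, hold, then revert.
# All names here are hypothetical; nothing below comes from the slides
# except the 5 minute hold time.

ATTACK_SECONDS="${ATTACK_SECONDS:-300}"   # the slide's "5 minutes"
DRY_RUN="${DRY_RUN:-1}"                   # default: print commands only

# Print the command in dry-run mode; otherwise execute it with sh.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $1"
  else
    sh -c "$1"
  fi
}

# run_attack "<inject command>" "<revert command>"
run_attack() {
  inject="$1"
  revert="$2"
  if ! run "$inject"; then
    # The injection itself failed: try to restore and report the failure.
    echo "inject failed; reverting"
    run "$revert"
    return 1
  fi
  # Hold the failure in place (skipped in dry-run mode).
  [ "$DRY_RUN" = "1" ] || sleep "$ATTACK_SECONDS"
  run "$revert"
}
```

For example, `run_attack "service cassandra stop" "service cassandra start"` prints both commands in dry-run mode. Running it against a single host first and a whole DC afterwards matches the order on the slide.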

Keep a log
- Keep track of actions taken
- Times are super important
- Also track discoveries and TODOs
- Share dashboards/metrics
- Chat rooms make this easy
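
The slide recommends a chat room for the log. Where one is not available, a timestamped file gives the same record of actions, discoveries and TODOs. The helper below is an illustrative sketch only: `ff_log`, `ff_todos` and `FF_LOG_FILE` are hypothetical names and the entry format is an assumption.

```shell
# Hypothetical log helper; the file name and entry format are assumptions.
FF_LOG_FILE="${FF_LOG_FILE:-failure-friday.log}"

# ff_log KIND message...
# Appends one entry prefixed with a UTC timestamp and a kind tag
# (for example ACTION, DISCOVERY or TODO) and echoes it to the terminal.
ff_log() {
  kind="$1"
  shift
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" \
    | tee -a "$FF_LOG_FILE"
}

# List the TODO entries, e.g. to copy them into the issue tracker afterwards.
ff_todos() {
  grep -F '[TODO]' "$FF_LOG_FILE"
}
```

A session might call `ff_log ACTION "stopped cassandra on host1"` before each step and `ff_log TODO "..."` whenever a gap is found, then run `ff_todos` at the end.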

Graphs are awesome

Finishing up
- Sound the all clear
- Enable crons & CM
- Move TODOs to issue tracker

Attack Strategies
Image: "Unicorn Attack!" by Sam Howzit

service cassandra stop

shutdown -r now

iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
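
The four rules on this slide drop traffic on ports 9160 and 7000, Cassandra's default Thrift client and inter-node ports. The slide does not show how the rules are removed afterwards, so the following is a sketch under that assumption: a hypothetical `isolation_rules` function that prints either the insert rules from the slide or the matching `iptables -D` delete rules. It only prints commands; nothing is applied unless the output is piped to a root shell.

```shell
# Ports taken from the slide: 9160 (Thrift clients) and 7000 (inter-node).
PORTS="${PORTS:-9160 7000}"

# isolation_rules insert|delete
# Prints the iptables commands that isolate the node ("insert") or
# remove the same rules again ("delete"). The function name is hypothetical.
isolation_rules() {
  case "$1" in
    insert) op="-I"; pos=" 1" ;;   # insert at position 1, as on the slide
    delete) op="-D"; pos="" ;;     # delete by matching rule specification
    *) echo "usage: isolation_rules insert|delete" >&2; return 2 ;;
  esac
  for port in $PORTS; do
    echo "iptables $op INPUT$pos -p tcp --dport $port -j DROP"
  done
  for port in $PORTS; do
    echo "iptables $op OUTPUT$pos -p tcp --sport $port -j DROP"
  done
}
```

One way to use it is `isolation_rules insert | sudo sh` to start the attack and `isolation_rules delete | sudo sh` to return to a working state.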

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
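
The netem command on this slide adds 500ms of delay with 100ms of jitter and 5% packet loss to everything leaving `eth0`. The slide does not show the cleanup step; a plausible counterpart is deleting the root qdisc. The sketch below uses a hypothetical `netem_cmd` helper that prints either command, so it is safe to run as is.

```shell
# netem_cmd add|del [device]
# Prints the tc command instead of running it. The helper name and the
# optional device argument are assumptions; the netem parameters are the
# ones from the slide.
netem_cmd() {
  dev="${2:-eth0}"
  case "$1" in
    add) echo "tc qdisc add dev $dev root netem delay 500ms 100ms loss 5%" ;;
    del) echo "tc qdisc del dev $dev root netem" ;;
    *) echo "usage: netem_cmd add|del [device]" >&2; return 2 ;;
  esac
}
```

As with the firewall rules, `netem_cmd add | sudo sh` would start the degradation and `netem_cmd del | sudo sh` would end it.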

Image: "Results Reader Board" by Rosa Say

Issues fixed
- Aggressive restarts by monit
- Large files on ext3 volumes
- Failing to restart due to bad /etc/fstab file
- High latency from network isolated cache
- Low capacity with a lost DC
- Missing alerts/metrics

Cultural impact
- Knowledge sharing
- Highlights untestable systems
- Keeps failure handling on everyone's mind

Future plans
Image: "Robot Swordsman Fight." by Patrick Gage Kelley

Break more things
- Start testing whole DC outages
- Break multiple services at once
- Distribute failure testing to teams
- Automate

Summary
- Failures will happen
- Proactively test failure handling now
- Choose something easy: app server, cache
- Automate later

Thank you.
pagerduty.com/jobs