Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Failure Friday: Start Injecting Failure Today!
Search
Doug Barth
September 12, 2014
Technology
0
140
Failure Friday: Start Injecting Failure Today!
DevOpsDays Toronto 2014
Video:
http://vimeo.com/107528697
Doug Barth
September 12, 2014
Tweet
Share
More Decks by Doug Barth
See All by Doug Barth
Zero Trust Networks: In Theory and In Practice
dougbarth
1
550
Culture from Chaos
dougbarth
0
120
IPsec mesh network: Perfect for the cloud?
dougbarth
1
1.6k
Failure Friday: Start Injecting Failure Today! (DevOpsDays Austin 2015)
dougbarth
0
760
Making PagerDuty More Reliable Using XtraDB Cluster
dougbarth
0
220
Ensuring Success During Disaster - SRECon 2015
dougbarth
0
290
Minimum Viable Ops
dougbarth
0
110
Policy as Code
dougbarth
1
320
Living without scheduled maintenance
dougbarth
2
1.1k
Other Decks in Technology
See All in Technology
スケールアップ企業でQA組織が機能し続けるための組織設計と仕組み〜ボトムアップとトップダウンを両輪としたアプローチ〜
tarappo
4
330
スピンアウト講座06_認証系(API-OAuth-MCP)入門
overflowinc
0
890
Copilot 宇宙へ 〜生成AIで「専門データの壁」を壊す方法〜
nakasho
0
150
イベントで大活躍する電子ペーパー名札を作る(その2) 〜 M5PaperとM5PaperS3 〜 / IoTLT @ JLCPCB オープンハードカンファレンス
you
PRO
0
190
How to install a gem
indirect
0
230
AlloyDB 奮闘記
hatappi
0
200
TypeScript 7.0の現在地と備え方
uhyo
7
2k
PostgreSQL 18のNOT ENFORCEDな制約とDEFERRABLEの関係
yahonda
0
100
新規事業×QAの挑戦:不確実性を乗りこなす!フェーズごとに求められるQAの役割変革
hacomono
PRO
0
170
SSoT(Single Source of Truth)で「壊して再生」する設計
kawauso
1
290
Laravelで学ぶOAuthとOpenID Connectの基礎と実装
kyoshidaxx
4
1.7k
JEDAI認定プログラム JEDAI Order 2026 受賞者一覧 / JEDAI Order 2026 Winners
databricksjapan
0
250
Featured
See All Featured
Efficient Content Optimization with Google Search Console & Apps Script
katarinadahlin
PRO
1
430
Optimising Largest Contentful Paint
csswizardry
37
3.6k
The #1 spot is gone: here's how to win anyway
tamaranovitovic
2
990
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.7k
The B2B funnel & how to create a winning content strategy
katarinadahlin
PRO
1
310
AI Search: Implications for SEO and How to Move Forward - #ShenzhenSEOConference
aleyda
1
1.2k
Max Prin - Stacking Signals: How International SEO Comes Together (And Falls Apart)
techseoconnect
PRO
0
120
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
47
8k
Unsuck your backbone
ammeep
672
58k
Git: the NoSQL Database
bkeepers
PRO
432
67k
What the history of the web can teach us about the future of AI
inesmontani
PRO
1
480
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
9
1.2k
Transcript
9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!
9/15/14 FAILURE FRIDAY! Dev Ops
9/15/14 FAILURE FRIDAY! DevOps Engineer
9/15/14 “DO NOT FEAR FAILURE” BY TOMASZ STASIUK
9/15/14 FAILURE FRIDAY! How is babby PagerDuty formed?
9/15/14 FAILURE FRIDAY!
9/15/14 Designed for reliability FAILURE FRIDAY! Downstream providers fail 3
phone providers 3 email providers 6 SMS providers PagerDuty providers fail 2 cloud providers 3 data centers
9/15/14 Hung up on details FAILURE FRIDAY! Bugs in exceptional
code paths Systems not recovering as quickly as expected What is normal when things are abnormal?
9/15/14 FAILURE FRIDAY!
9/15/14 Simian Army FAILURE FRIDAY! Chaos Monkey Latency Monkey Chaos
Gorilla Chaos Kong “WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817
9/15/14 Keep it simple FAILURE FRIDAY! “KISS BAND MEMBER CUPCAKES”
BY CLEVER CUPCAKES
9/15/14 Process FAILURE FRIDAY! “HOW TO DRAW AN OWL” BY
CHESTER
9/15/14 Get buy in FAILURE FRIDAY! “ANGRY BOSS” BY KAUSHAL
KARKHANIS
9/15/14 Schedule FAILURE FRIDAY! 1 hour recurring meeting Developers &
Operations List of attacks and identify victim Finish as much as possible
9/15/14 Before starting FAILURE FRIDAY! Disable cron jobs & CM
system Announce the start Open up relevant dashboards Leave alarms enabled
9/15/14 Attacks FAILURE FRIDAY! Test a single host and then
DC 5 minutes Return to a working state Stop if things break
9/15/14 Keep a log FAILURE FRIDAY! Keep track of actions
taken Times are super important Also track discoveries and TODOs Share dashboards/metrics Chat rooms make this easy
9/15/14 Graphs are awesome FAILURE FRIDAY!
9/15/14 Finishing up FAILURE FRIDAY! Sound the all clear Enable
crons & CM Move TODOs to issue tracker
9/15/14 Attack Strategies FAILURE FRIDAY! “UNICORN ATTACK!” BY SAM HOWZIT
9/15/14 FAILURE FRIDAY! SERVICE STOP CASSANDRA
9/15/14 FAILURE FRIDAY! SHUTDOWN -R NOW
9/15/14 FAILURE FRIDAY! IPTABLES -I INPUT 1 -P TCP --DPORT
9160 -J DROP IPTABLES -I INPUT 1 -P TCP --DPORT 7000 -J DROP ! IPTABLES -I OUTPUT 1 -P TCP --SPORT 9160 -J DROP IPTABLES -I OUTPUT 1 -P TCP --SPORT 7000 -J DROP
9/15/14 FAILURE FRIDAY! TC QDISC ADD DEV ETH0 ROOT NETEM
DELAY 500MS 100MS LOSS 5%
9/15/14 “RESULTS READER BOARD” BY ROSA SAY
9/15/14 Issues fixed FAILURE FRIDAY! Aggressive restarts by monit Large
files on ext3 volumes Failing to restart due to bad /etc/fstab file High latency from network isolated cache Low capacity with a lost DC Missing alerts/metrics
9/15/14 Cultural impact FAILURE FRIDAY! Knowledge sharing Highlights untestable systems
Keeps failure handling on everyone’s mind
9/15/14 Future plans “ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY
9/15/14 Break more things FAILURE FRIDAY! Start testing whole DC
outages Break multiple services at once Distribute failure testing to teams Automate
9/15/14 Break more things FAILURE FRIDAY! Start testing whole DC
outages Break multiple services at once Distribute failure testing to teams Automate
9/15/14 Summary FAILURE FRIDAY! Failures will happen Proactively test failure
handling now Choose something easy: app server, cache Automate later
9/15/14 pagerduty.com/jobs Thank you.