Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Failure Friday: Start Injecting Failure Today!
Search
Doug Barth
September 12, 2014
Technology
0
120
Failure Friday: Start Injecting Failure Today!
DevOpsDays Toronto 2014
Video:
http://vimeo.com/107528697
Doug Barth
September 12, 2014
Tweet
Share
More Decks by Doug Barth
See All by Doug Barth
Zero Trust Networks: In Theory and In Practice
dougbarth
1
390
Culture from Chaos
dougbarth
0
82
IPsec mesh network: Perfect for the cloud?
dougbarth
1
1.4k
Failure Friday: Start Injecting Failure Today! (DevOpsDays Austin 2015)
dougbarth
0
530
Making PagerDuty More Reliable Using XtraDB Cluster
dougbarth
0
190
Ensuring Success During Disaster - SRECon 2015
dougbarth
0
210
Minimum Viable Ops
dougbarth
0
74
Policy as Code
dougbarth
1
280
Living without scheduled maintenance
dougbarth
2
1.1k
Other Decks in Technology
See All in Technology
Algyan イベント振り返り
linyixian
0
200
Hands-on / Kaname Frusawa / Cloud Compare Users Meetup 2024 at University of Tokyo on April 17
paraworld
2
480
本当のAWS基礎
toru_kubota
0
360
Databricks:『生成AI World Cup』のご案内
databricksjapan
2
160
Databricks における 『MLOps』
databricksjapan
2
160
長期運用プロジェクトでのMySQLからTiDB移行の検証
colopl
2
800
自動生成を活用した、運用保守コストを抑える Error/Alert/Runbook の一元集約管理 / Centralized management of Error/Alert/Runbook to minimize operational costs using automated code generation
biwashi
12
2.3k
LLM とプロンプトエンジニアリング/チューターをビルドする / LLM and Prompt Engineering and Building Tutors
ks91
PRO
0
250
4年前、あるじゃん老害エンジニアLT合戦に登壇、米国西海岸コンピュータ歴史博物館体験記の続編
toshi_atsumi
0
220
AOAI をきっかけに 社内の Azure 管理を見直した話
recruitengineers
PRO
1
180
開発生産性向上サービスを作るFindyが自分たちで開発生産性を爆上げした組織づくりの歩み / Findy's path to boosting its own development productivity 2024-04-17
ma3tk
3
470
NgRx Signal Store
rainerhahnekamp
0
140
Featured
See All Featured
From Idea to $5000 a Month in 5 Months
shpigford
377
45k
GitHub's CSS Performance
jonrohan
1024
450k
How GitHub Uses GitHub to Build GitHub
holman
468
290k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
154
14k
Infographics Made Easy
chrislema
238
18k
Git: the NoSQL Database
bkeepers
PRO
422
63k
Principles of Awesome APIs and How to Build Them.
keavy
120
16k
Design by the Numbers
sachag
274
18k
10 Git Anti Patterns You Should be Aware of
lemiorhan
647
58k
YesSQL, Process and Tooling at Scale
rocio
163
13k
It's Worth the Effort
3n
180
27k
Building Applications with DynamoDB
mza
88
5.6k
Transcript
9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!
9/15/14 FAILURE FRIDAY! Dev Ops
9/15/14 FAILURE FRIDAY! DevOps Engineer
9/15/14 “DO NOT FEAR FAILURE” BY TOMASZ STASIUK
9/15/14 FAILURE FRIDAY! How is babby PagerDuty formed?
9/15/14 FAILURE FRIDAY!
9/15/14 Designed for reliability FAILURE FRIDAY! Downstream providers fail 3
phone providers 3 email providers 6 SMS providers PagerDuty providers fail 2 cloud providers 3 data centers
9/15/14 Hung up on details FAILURE FRIDAY! Bugs in exceptional
code paths Systems not recovering as quickly as expected What is normal when things are abnormal?
9/15/14 FAILURE FRIDAY!
9/15/14 Simian Army FAILURE FRIDAY! Chaos Monkey Latency Monkey Chaos
Gorilla Chaos Kong “WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817
9/15/14 Keep it simple FAILURE FRIDAY! “KISS BAND MEMBER CUPCAKES”
BY CLEVER CUPCAKES
9/15/14 Process FAILURE FRIDAY! “HOW TO DRAW AN OWL” BY
CHESTER
9/15/14 Get buy in FAILURE FRIDAY! “ANGRY BOSS” BY KAUSHAL
KARKHANIS
9/15/14 Schedule FAILURE FRIDAY! 1 hour recurring meeting Developers &
Operations List of attacks and identify victim Finish as much as possible
9/15/14 Before starting FAILURE FRIDAY! Disable cron jobs & CM
system Announce the start Open up relevant dashboards Leave alarms enabled
9/15/14 Attacks FAILURE FRIDAY! Test a single host and then
DC 5 minutes Return to a working state Stop if things break
9/15/14 Keep a log FAILURE FRIDAY! Keep track of actions
taken Times are super important Also track discoveries and TODOs Share dashboards/metrics Chat rooms make this easy
9/15/14 Graphs are awesome FAILURE FRIDAY!
9/15/14 Finishing up FAILURE FRIDAY! Sound the all clear Enable
crons & CM Move TODOs to issue tracker
9/15/14 Attack Strategies FAILURE FRIDAY! “UNICORN ATTACK!” BY SAM HOWZIT
9/15/14 FAILURE FRIDAY! SERVICE STOP CASSANDRA
9/15/14 FAILURE FRIDAY! SHUTDOWN -R NOW
9/15/14 FAILURE FRIDAY! IPTABLES -I INPUT 1 -P TCP --DPORT
9160 -J DROP IPTABLES -I INPUT 1 -P TCP --DPORT 7000 -J DROP ! IPTABLES -I OUTPUT 1 -P TCP --SPORT 9160 -J DROP IPTABLES -I OUTPUT 1 -P TCP --SPORT 7000 -J DROP
9/15/14 FAILURE FRIDAY! TC QDISC ADD DEV ETH0 ROOT NETEM
DELAY 500MS 100MS LOSS 5%
9/15/14 “RESULTS READER BOARD” BY ROSA SAY
9/15/14 Issues fixed FAILURE FRIDAY! Aggressive restarts by monit Large
files on ext3 volumes Failing to restart due to bad /etc/fstab file High latency from network isolated cache Low capacity with a lost DC Missing alerts/metrics
9/15/14 Cultural impact FAILURE FRIDAY! Knowledge sharing Highlights untestable systems
Keeps failure handling on everyone’s mind
9/15/14 Future plans “ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY
9/15/14 Break more things FAILURE FRIDAY! Start testing whole DC
outages Break multiple services at once Distribute failure testing to teams Automate
9/15/14 Break more things FAILURE FRIDAY! Start testing whole DC
outages Break multiple services at once Distribute failure testing to teams Automate
9/15/14 Summary FAILURE FRIDAY! Failures will happen Proactively test failure
handling now Choose something easy: app server, cache Automate later
9/15/14 pagerduty.com/jobs Thank you.