Failure Friday: Start Injecting Failure Today!
Doug Barth
September 12, 2014
DevOpsDays Toronto 2014
Video: http://vimeo.com/107528697
Transcript
9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!
Dev Ops
DevOps Engineer
“DO NOT FEAR FAILURE” BY TOMASZ STASIUK
How is babby PagerDuty formed?
Designed for reliability
- Downstream providers fail
  - 3 phone providers
  - 3 email providers
  - 6 SMS providers
- PagerDuty providers fail
  - 2 cloud providers
  - 3 data centers
Hung up on details
- Bugs in exceptional code paths
- Systems not recovering as quickly as expected
- What is normal when things are abnormal?
Simian Army
- Chaos Monkey
- Latency Monkey
- Chaos Gorilla
- Chaos Kong
“WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817
Keep it simple
“KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES
Process
“HOW TO DRAW AN OWL” BY CHESTER
Get buy in
“ANGRY BOSS” BY KAUSHAL KARKHANIS
Schedule
- 1 hour recurring meeting
- Developers & Operations
- List of attacks and identify victim
- Finish as much as possible
Before starting
- Disable cron jobs & CM system
- Announce the start
- Open up relevant dashboards
- Leave alarms enabled
Attacks
- Test a single host and then DC
- 5 minutes
- Return to a working state
- Stop if things break
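The attack loop on this slide (inject a failure, hold it for 5 minutes, return to a working state) could be scripted roughly as below. This is a minimal sketch, not the process used in the talk: `run_attack`, its two arguments, and `ATTACK_SECONDS` are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch of one timed attack: inject a failure, hold it, then revert.
# run_attack and ATTACK_SECONDS are hypothetical names, not from the talk.

ATTACK_SECONDS="${ATTACK_SECONDS:-300}"   # the slide's 5 minutes

run_attack() {
  local inject_cmd="$1" revert_cmd="$2" status=0

  # Print timestamps so the actions can be pasted into the shared log.
  echo "$(date -u +%H:%M:%S) inject: ${inject_cmd}"
  if ! eval "${inject_cmd}"; then
    status=1                      # injection failed; skip the wait
  else
    sleep "${ATTACK_SECONDS}"     # hold the failure in place
  fi

  # Always attempt to return to a working state.
  echo "$(date -u +%H:%M:%S) revert: ${revert_cmd}"
  eval "${revert_cmd}" || status=1
  return "${status}"
}
```

A call might look like `run_attack 'service cassandra stop' 'service cassandra start'`, where the start command is an assumed inverse of the stop shown later in the deck. The sketch does not cover "stop if things break": an operator who interrupts the script during the wait still has to run the revert command by hand.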
Keep a log
- Keep track of actions taken
- Times are super important
- Also track discoveries and TODOs
- Share dashboards/metrics
- Chat rooms make this easy
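The talk recommends a chat room for the log. Where a plain file is preferred, a timestamped helper along these lines would capture the same three kinds of entry (actions, discoveries, TODOs). `ff_log` and `FF_LOG_FILE` are hypothetical names, and the entry format is an assumption rather than anything shown in the deck.

```shell
#!/usr/bin/env bash
# Sketch of a timestamped exercise log.
# ff_log and FF_LOG_FILE are hypothetical names, not from the talk.

FF_LOG_FILE="${FF_LOG_FILE:-failure-friday.log}"

ff_log() {
  # First argument is the entry kind (e.g. ACTION, DISCOVERY, TODO);
  # the rest is free text.
  local kind="$1"
  shift
  # UTC timestamps keep entries comparable with dashboards and graphs.
  printf '%s [%s] %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${kind}" "$*" >> "${FF_LOG_FILE}"
}
```

Entries tagged `TODO` can then be pulled out with `grep '\[TODO\]'` when moving follow-ups to the issue tracker.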
Graphs are awesome
Finishing up
- Sound the all clear
- Enable crons & CM
- Move TODOs to issue tracker
Attack Strategies
“UNICORN ATTACK!” BY SAM HOWZIT
service cassandra stop
shutdown -r now
iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
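The iptables slide shows only the injection half of the attack. Returning to a working state means deleting the same rules, which `iptables -D` does when given a matching rule specification. The sketch below prints both halves rather than running them, so the commands can be reviewed first. `cassandra_fw_rules` is a hypothetical helper name; the ports 9160 and 7000 are the ones on the slide.

```shell
#!/usr/bin/env bash
# Sketch: print the iptables commands that isolate a node on the two
# ports from the slide, and the commands that undo the isolation.
# cassandra_fw_rules is a hypothetical helper; it only prints commands.

cassandra_fw_rules() {
  local action="$1" port
  case "${action}" in
    block)
      # Insert DROP rules at the top of each chain, as on the slide.
      for port in 9160 7000; do
        echo "iptables -I INPUT 1 -p tcp --dport ${port} -j DROP"
      done
      for port in 9160 7000; do
        echo "iptables -I OUTPUT 1 -p tcp --sport ${port} -j DROP"
      done
      ;;
    unblock)
      # Delete the same rules by specification to restore traffic.
      for port in 9160 7000; do
        echo "iptables -D INPUT -p tcp --dport ${port} -j DROP"
      done
      for port in 9160 7000; do
        echo "iptables -D OUTPUT -p tcp --sport ${port} -j DROP"
      done
      ;;
    *)
      echo "usage: cassandra_fw_rules block|unblock" >&2
      return 2
      ;;
  esac
}
```

After review, the output can be piped to a root shell, for example `cassandra_fw_rules block | sudo sh`, and later `cassandra_fw_rules unblock | sudo sh`.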
tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
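In the netem command, `delay 500ms 100ms` adds 500 ms of latency with 100 ms of jitter, and `loss 5%` drops 5% of packets. As with the firewall rules, the slide shows only the injection; the qdisc has to be deleted to return to a working state. The sketch below prints the add, inspect, and remove commands. `netem_cmd` is a hypothetical helper name, and the interface defaults to the slide's `eth0`.

```shell
#!/usr/bin/env bash
# Sketch: print the tc commands to add, inspect, and remove the netem
# qdisc from the slide. netem_cmd is a hypothetical helper; it only
# prints commands and does not run tc.

netem_cmd() {
  local action="$1" dev="${2:-eth0}"
  case "${action}" in
    add)
      # 500ms delay with 100ms jitter, plus 5% packet loss.
      echo "tc qdisc add dev ${dev} root netem delay 500ms 100ms loss 5%"
      ;;
    show)
      echo "tc qdisc show dev ${dev}"
      ;;
    del)
      # Removing the root netem qdisc restores normal networking.
      echo "tc qdisc del dev ${dev} root netem"
      ;;
    *)
      echo "usage: netem_cmd add|show|del [device]" >&2
      return 2
      ;;
  esac
}
```

Running `netem_cmd del | sudo sh` at the end of the attack removes the impairment from the default interface.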
“RESULTS READER BOARD” BY ROSA SAY
Issues fixed
- Aggressive restarts by monit
- Large files on ext3 volumes
- Failing to restart due to bad /etc/fstab file
- High latency from network isolated cache
- Low capacity with a lost DC
- Missing alerts/metrics
Cultural impact
- Knowledge sharing
- Highlights untestable systems
- Keeps failure handling on everyone’s mind
Future plans
“ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY
Break more things
- Start testing whole DC outages
- Break multiple services at once
- Distribute failure testing to teams
- Automate
Summary
- Failures will happen
- Proactively test failure handling now
- Choose something easy: app server, cache
- Automate later
Thank you.
pagerduty.com/jobs