Slide 1
3/20/15 Ensuring success during disaster SRECON 2015
Slide 2
@dougbarth
Slide 4
Dev Ops Consider killing this
Slide 5
How is babby PagerDuty formed?
Slide 7
Reliability
Slide 8
Agenda
Slide 9
DR vs. HA
Data DR
Failover
Active/Active
Legacy Systems
Q&A
Slide 10
Disaster Recovery
Slide 11
A plan for surviving rare failure events that threaten our ability to continue operating
Slide 12
High availability
Slide 13
DR              HA
Rare failures   Common failures
Slow recovery   Fast recovery
Slide 14
Latency
Slide 15
DR needs to exist
Slide 16
DR needs to be tested
Slide 17
DR needs to be tested for correctness
Slide 18
DR needs to be tested for capacity
Slide 19
DR needs to be tested for execution
Slide 20
Data DR
Slide 21
You can never have too many copies
Slide 22
Lose a disk? Lose a server?
Slide 23
[Diagram: Secondary, Primary, Secondary]
Slide 24
Data corruption?
Slide 25
Backups
Slide 26
Test restorations
Slide 27
DROP TABLE USERS;
DELETE FROM USERS;
Slide 28
Delayed secondary
[Diagram: Primary, Secondary (delayed 2 hrs)]
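The idea behind a delayed secondary can be sketched in a few lines of Python. This is a toy model, not any particular database's replication: the `Event` and `DelayedSecondary` names, the timestamps, and the `paused` flag are all assumptions made for illustration. Only the 2-hour delay and the `DROP TABLE USERS;` statement come from the slides. The point it shows is that a destructive statement sits unapplied for the length of the delay, which gives an operator time to pause the secondary before the damage reaches it.

```python
from dataclasses import dataclass, field

DELAY_SECONDS = 2 * 60 * 60  # 2 hrs, as on the slide


@dataclass
class Event:
    committed_at: float  # seconds since some epoch, as seen on the primary
    statement: str


@dataclass
class DelayedSecondary:
    delay: float = DELAY_SECONDS
    applied: list = field(default_factory=list)
    paused: bool = False  # hypothetical operator switch

    def apply_due(self, log, now):
        """Apply, in log order, every unapplied event that is at least `delay` old."""
        if self.paused:
            return
        for event in log[len(self.applied):]:
            if now - event.committed_at < self.delay:
                break  # this event and everything after it is still too fresh
            self.applied.append(event.statement)


# Hypothetical replication log: a normal write, then a mistake three hours later.
log = [
    Event(0, "INSERT INTO USERS VALUES (1);"),
    Event(3 * 3600, "DROP TABLE USERS;"),
]

secondary = DelayedSecondary()
secondary.apply_due(log, now=3 * 3600 + 1800)  # 30 minutes after the mistake
applied_before_pause = list(secondary.applied)

secondary.paused = True                        # operator notices and stops replication
secondary.apply_due(log, now=6 * 3600)         # the delay has elapsed, but nothing is applied
```

After the pause, the secondary still holds the data as it was before the `DROP TABLE`, so it can be used as a recovery source. A secondary with no delay would have applied the statement almost immediately.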
Slide 29
Failover
Slide 30
Primary DC & Secondary DC
Slide 31
The flip
Slide 32
Testing == outage
Slide 33
So it goes untested
Slide 34
Tested infrequently
Slide 35
[Diagram: Flip test, Flip test, Breaking change, Failure window]
Slide 36
Forgotten roles
Slide 37
Low capacity
Slide 38
Expensive
Slide 39
Active/Active
Slide 40
PagerDuty Active/Active
Slide 41
Multiple regions
Use all the AZs!
Slide 43
Constant validation
Slide 44
Failure Friday
https://www.pagerduty.com/blog/failure-friday-at-pagerduty/
Slide 45
Trade latency for reliability
Slide 46
30ms RTT
Slide 47
Legacy Systems
Slide 48
Older systems still use failover
Slide 49
Correctness Capacity Execution
Slide 50
Correctness Capacity Execution
Slide 51
1% of requests go to DR site
Slide 52
Finds issues immediately
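One way to send a small, steady share of traffic to the DR site is to hash each request into a bucket and route the lowest buckets to DR. The slides do not say how the split is actually implemented, so treat the following as a minimal sketch under that assumption: `pick_site`, the request IDs, and the 10,000-bucket scheme are hypothetical, and only the 1% figure is from the talk.

```python
import hashlib

DR_PERCENT = 1.0  # from the slide: 1% of requests go to the DR site
BUCKETS = 10_000  # hypothetical granularity: one bucket is 0.01% of traffic


def pick_site(request_id: str, dr_percent: float = DR_PERCENT) -> str:
    """Deterministically map a request ID to "dr" or "prod"."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    return "dr" if bucket < dr_percent * BUCKETS / 100 else "prod"


# Route a batch of made-up request IDs and measure the share that lands on DR.
counts = {"dr": 0, "prod": 0}
for i in range(100_000):
    counts[pick_site(f"req-{i}")] += 1
dr_share = counts["dr"] / 100_000
```

Hashing rather than picking at random keeps the decision repeatable for a given request ID, which makes a failure seen on the DR path easier to reproduce. Because the DR site serves real traffic all the time, a broken deploy or missing dependency there shows up as errors right away instead of during a disaster.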
Slide 53
Requires DR to use the production DB
[Diagram: DR, Prod, App x6]
Slide 54
30ms per DB call
Slide 55
Trade latency to test correctness
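The cost of this trade can be estimated with simple arithmetic. The sketch below assumes that a request served from DR makes its DB calls one after another and that each call pays the full 30ms round trip to the production DB. The 10 calls per request is a made-up number for illustration. The 30ms and 1% figures are from the slides.

```python
RTT_MS = 30.0       # from the slides: 30ms per DB call from the DR site
DR_FRACTION = 0.01  # from the slides: 1% of requests go to the DR site


def added_latency_ms(db_calls: int, rtt_ms: float = RTT_MS) -> float:
    """Extra latency for one DR-served request making `db_calls` sequential DB calls."""
    return db_calls * rtt_ms


def mean_added_latency_ms(db_calls: int,
                          dr_fraction: float = DR_FRACTION,
                          rtt_ms: float = RTT_MS) -> float:
    """Extra latency averaged over all requests, since only a fraction go to DR."""
    return dr_fraction * added_latency_ms(db_calls, rtt_ms)


CALLS_PER_REQUEST = 10  # hypothetical workload
per_dr_request = added_latency_ms(CALLS_PER_REQUEST)
across_all_traffic = mean_added_latency_ms(CALLS_PER_REQUEST)
```

Under these assumptions a DR-served request is about 300ms slower, but averaged over all traffic the penalty is about 3ms, because only 1 request in 100 pays it.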
Slide 56
Correctness Capacity Execution
Slide 57
DR is a mirror of production
Slide 59
dr-compare.rb
Slide 60
Checks that DR footprint is the same as production
Slide 61
Surfaces missing roles
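The kind of check described here can be sketched as a comparison of host counts per role. This is not the actual `dr-compare.rb`, whose contents are not shown in the slides. The `compare_footprints` function, the role names, and the inventory numbers below are all hypothetical, and a real script would presumably pull the inventories from a configuration management or cloud API rather than from literals.

```python
def compare_footprints(prod: dict, dr: dict):
    """Compare host counts per role.

    Returns (missing, undersized):
      missing    - roles that exist in production but have no hosts in DR
      undersized - roles present in DR with fewer hosts than production,
                   mapped to (prod_count, dr_count)
    """
    missing = sorted(role for role in prod if dr.get(role, 0) == 0)
    undersized = {
        role: (count, dr[role])
        for role, count in sorted(prod.items())
        if 0 < dr.get(role, 0) < count
    }
    return missing, undersized


# Hypothetical inventories: role name -> number of hosts.
prod = {"web": 6, "worker": 4, "cache": 2}
dr = {"web": 6, "worker": 2}

missing, undersized = compare_footprints(prod, dr)
```

Here `missing` is `["cache"]` (a forgotten role) and `undersized` is `{"worker": (4, 2)}` (a capacity gap). Running a check like this on a schedule turns "DR is a mirror of production" from an intention into something that fails loudly when it stops being true.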
Slide 62
Correctness Capacity Execution
Slide 63
The flip
Slide 64
Production impact
Slide 65
Scale up staging
Slide 66
Practice flip there
Slide 67
Tested infrequently
Slide 68
PagerDuty Active/Active
Slide 69
Correctness Capacity Execution
Slide 70
pagerduty.com/jobs
[email protected]
Slide 71
[email protected]
Questions?