Disaster Recovery: A Process, Not a Tool
Richard Yen
June 09, 2026
As presented at PGDay Boston 2026
Transcript
Disaster Recovery: A Process, Not a Tool
June 9, 2026
Richard Yen
The Changed Landscape
• 99.99% uptime
• p95, p99 metrics
• Status pages
• Social media affects reputation
Agenda
1. Where We Are
2. Where We Need to Be
3. How We’ll Get There
4. Some Stories Along the Way
Where We Are
A disaster is any sustained event that compromises the system’s availability, correctness, or business trust.
How DR is Usually Done
1. Prepare
2. Prevent
“An ounce of prevention is worth a pound of cure”
Disaster Recovery is the act of restoring business operations
Where We Need to Be
Postgres Makes Recovery Easy
• pg_dump/pg_restore
• pg_basebackup
• pg_stat_replication
• pg_stat_activity
• Point-In-Time Recovery
• repmgr/efm
• Third-party backup tools
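As a small illustration of what pg_stat_replication can tell you, the sketch below turns the textual LSN columns (sent_lsn, replay_lsn) into byte positions and reports how far each standby is behind. The standby names and LSN values are made-up sample data, not output from a real server, and the helper names are hypothetical; on a live system pg_wal_lsn_diff() does the same subtraction in SQL.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to an absolute byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits,
    a slash, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent to a standby that it has not yet replayed."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)


# Hypothetical rows, shaped like a SELECT from pg_stat_replication.
rows = [
    {"application_name": "standby_a", "sent_lsn": "0/3000000", "replay_lsn": "0/3000000"},
    {"application_name": "standby_b", "sent_lsn": "0/3000000", "replay_lsn": "0/2000000"},
]

for row in rows:
    lag = replay_lag_bytes(row["sent_lsn"], row["replay_lsn"])
    print(f"{row['application_name']}: {lag} bytes behind")
```

A standby that is consistently far behind is a hint that failing over to it may cost more data than the agreed RPO allows.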
RPO & RTO
It’s going to cost you.
Talk to your leadership, and you’ll discover how much it’s really worth to them.
RPO
1. 24-hour RPO: $
2. 15-minute RPO: $$
3. Near-zero RPO: $$$
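The RPO tiers above can be sanity-checked with simple arithmetic: the worst-case data loss is roughly the longest gap between copies of your data leaving the primary. The sketch below is a simplified model under that assumption; the function names and the example intervals (nightly backup, WAL archived every five minutes) are illustrative and not taken from the talk.

```python
from datetime import timedelta
from typing import Optional


def worst_case_data_loss(backup_interval: timedelta,
                         wal_archive_interval: Optional[timedelta] = None) -> timedelta:
    """Upper bound on lost work if the primary is destroyed at the worst moment.

    Simplified assumption: with backups only, you can lose a full backup
    interval; with WAL archiving, you can lose at most one archive interval.
    """
    if wal_archive_interval is None:
        return backup_interval
    return min(backup_interval, wal_archive_interval)


def meets_rpo(rpo_target: timedelta,
              backup_interval: timedelta,
              wal_archive_interval: Optional[timedelta] = None) -> bool:
    """True if the modeled worst-case loss fits within the RPO target."""
    return worst_case_data_loss(backup_interval, wal_archive_interval) <= rpo_target


nightly = timedelta(hours=24)
print(meets_rpo(timedelta(hours=24), nightly))                         # nightly backup vs 24-hour RPO
print(meets_rpo(timedelta(minutes=15), nightly))                       # nightly backup vs 15-minute RPO
print(meets_rpo(timedelta(minutes=15), nightly, timedelta(minutes=5))) # add 5-minute WAL archiving
```

This model does not cover the near-zero tier, which typically depends on synchronous replication rather than on backup or archive intervals.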
RTO is your team’s ability to execute the DR plan
How We’ll Get There
3 Layers of DR Planning
1. Infrastructure failure
2. Procedural failure
3. Human failure
Recovery is not always about failing over
Runbook Engineering Should Assume
1. Stress
2. Chaos
3. Confusion
4. Exhaustion
5. Ambiguity
Runbook Engineering: Anti-patterns
1. Wiki pages
2. Stale documents
3. Unclear owner
4. Vague instructions
Runbook Engineering: Non-Technical Essentials
1. Incident Commander
2. Communications Owner
3. Notification Cadence
4. Escalation Chain
5. Risk Authorization
Runbook Validation
1. Can a new engineer follow it?
2. Does it assume access?
3. Are commands and names current?
4. Does it get regular playtime?
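One of the checks above, "Are commands and names current?", is easy to automate in part. The sketch below lists which commands named in a runbook cannot be found on the PATH of the machine where recovery would run. The function name and the command list are hypothetical examples; a real check would read the commands from your own runbook.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def find_missing_commands(
    commands: Iterable[str],
    resolver: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the commands that the resolver cannot locate.

    By default the resolver is shutil.which, which searches PATH and
    returns None when a command is not found.
    """
    return [cmd for cmd in commands if resolver(cmd) is None]


# Hypothetical list of commands that a runbook tells the operator to run.
runbook_commands = ["pg_basebackup", "pg_restore", "pg_dump"]

missing = find_missing_commands(runbook_commands)
print("missing from PATH:", missing or "none")
```

Running this on the recovery host itself also gives a partial answer to "Does it assume access?", since it shows what the on-call engineer can actually execute there.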
Runbook Validation: Level Up Your Ability
1. Prove that your runbook works
2. Reduce the time it takes to complete
3. Simulate failure
4. Test with unavailable human resources
This is how you reduce RTO
Validation Metrics
1. Did recovery succeed?
2. How long did each section take?
3. What vagueness needs to be clarified?
4. Identify documentation gaps
5. Be Encouraging! Go out for dinner!
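To answer "How long did each section take?" during a drill, a small timer that records one duration per runbook section is enough. The sketch below is one possible shape for it; the class name, the section names, and the injected fake clock (used here so the example is repeatable) are all assumptions for illustration.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Records how long each named runbook section takes, in seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def section(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the section raises, so failed steps are timed too.
            self.durations[name] = self._clock() - start

    def total(self):
        return sum(self.durations.values())

    def slowest(self):
        # Raises ValueError if no section has been recorded yet.
        return max(self.durations, key=self.durations.get)


# Fake clock readings in seconds; a real drill would use the default clock.
ticks = iter([0, 120, 120, 900, 900, 960])
timer = DrillTimer(clock=lambda: next(ticks))

with timer.section("declare incident"):
    pass  # the operator performs the runbook step here
with timer.section("restore base backup"):
    pass
with timer.section("verify and reopen traffic"):
    pass

print(timer.durations)
print("total seconds:", timer.total())
print("slowest section:", timer.slowest())
```

Comparing the total against the RTO target, drill after drill, shows whether runbook changes are actually shortening recovery, and the slowest section points at where to invest next.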
Don’t Blame, or You’ll Feel Lame
1. Communication is key
2. People hide when they feel shame
3. When people don’t feel safe to ask, they guess
4. Guessing hurts your RTO
Make your RPO worth it by investing in your RTO