Disaster Recovery: A Process, Not a Tool
Richard Yen
June 09, 2026
As presented at PGDay Boston 2026
Transcript
Disaster Recovery: A Process, not a Tool
June 9, 2026
Richard Yen
The Changed Landscape
• 99.99% Uptime
• p95, p99 metrics
• Status pages
• Social media affects reputation
Agenda
1. Where We Are
2. Where We Need to Be
3. How We’ll Get There
4. Some Stories Along the Way
Where We Are
A disaster is any sustained event that compromises the system’s availability, correctness, or business trust
How DR is Usually Done
1. Prepare
2. Prevent
“An ounce of prevention is worth a pound of cure”
Disaster Recovery is the act of restoring business operations
Where We Need to Be
Postgres Makes Recovery Easy
• pg_dump/pg_restore
• pg_basebackup
• pg_stat_replication
• pg_stat_activity
• Point-In-Time Recovery
• repmgr/efm
• Third-party backup tools
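To make the pg_stat_replication item concrete: the view reports WAL positions such as sent_lsn and replay_lsn as LSNs written in the form X/Y, where X is the high 32 bits and Y the low 32 bits, both in hexadecimal. The gap between the two is a rough measure of how far a standby is behind. The Python below is a minimal sketch and not something shown in the talk; the sample rows and standby names are made up, and in practice the values would come from querying the view.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replay_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent by the primary but not yet replayed on the standby."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)


# Hypothetical rows, shaped like a subset of pg_stat_replication columns.
rows = [
    {"application_name": "standby_a", "sent_lsn": "0/3000148", "replay_lsn": "0/3000060"},
    {"application_name": "standby_b", "sent_lsn": "1/0", "replay_lsn": "0/FFFFFF00"},
]

for row in rows:
    lag = replay_lag_bytes(row["sent_lsn"], row["replay_lsn"])
    print(f'{row["application_name"]}: {lag} bytes behind')
```

A check like this only tells you the standby is current; it says nothing about whether the team can promote it under pressure, which is the point the rest of the deck makes.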
RPO & RTO
It’s going to cost you
Talk to your leadership, and you’ll discover how much it’s really worth to them
RPO
1. 24-hour RPO -- $
2. 15-minute RPO -- $$
3. Near-zero RPO -- $$$
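The three tiers above can be read as a worst-case data-loss window: everything written since the last durable capture is at risk. The sketch below is an illustration under stated assumptions, not the speaker's method. The strategy names and their intervals are hypothetical stand-ins for the tiers, "near-zero" is treated as exactly zero, and shipping_delay is an invented parameter for the time a capture needs to leave the failing host.

```python
from datetime import timedelta

# Hypothetical strategies and capture intervals, chosen only to mirror the three tiers.
STRATEGIES = {
    "nightly pg_dump": timedelta(hours=24),
    "WAL archived every 15 minutes": timedelta(minutes=15),
    "synchronous replication": timedelta(0),
}


def worst_case_loss(capture_interval, shipping_delay=timedelta(0)):
    """Upper bound on lost work: time since the last capture plus time to ship it off-host."""
    return capture_interval + shipping_delay


def strategies_meeting(rpo, shipping_delay=timedelta(0)):
    """Names of the strategies whose worst-case loss fits inside the RPO target."""
    return [
        name
        for name, interval in STRATEGIES.items()
        if worst_case_loss(interval, shipping_delay) <= rpo
    ]


print(strategies_meeting(timedelta(hours=1)))
```

Adding even a small shipping delay can push a strategy out of its tier, which is one reason the stated RPO and the achieved RPO tend to differ until they are measured.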
RTO is your team’s ability to execute the DR plan
How We’ll Get There
3 Layers of DR Planning
1. Infrastructure failure
2. Procedural failure
3. Human failure
Recovery is not always about failing over
Runbook Engineering Should Assume
1. Stress
2. Chaos
3. Confusion
4. Exhaustion
5. Ambiguity
Runbook Engineering: Anti-patterns
1. Wiki Pages
2. Stale documents
3. Unclear owner
4. Vague instructions
Runbook Engineering: Non-Technical Essentials
1. Incident Commander
2. Communications Owner
3. Notification Cadence
4. Escalation Chain
5. Risk Authorization
Runbook Validation
1. Can a new engineer follow it?
2. Does it assume access?
3. Are commands and names current?
4. Does it get regular playtime?
Runbook Validation: Level Up Your Ability
1. Prove that your Runbook works
2. Reduce the time it takes to complete
3. Simulate failure
4. Test with unavailable human resources
This is how you reduce RTO
Validation Metrics
1. Did recovery succeed?
2. How long did each section take?
3. What vagueness needs to be clarified?
4. Identify documentation gaps
5. Be Encouraging! Go out for dinner!
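The second metric, how long each section took, is easy to capture during a drill if the runbook steps are wrapped in a timer. The following Python is a rough sketch of that idea and not a tool from the talk; DrillLog and the section names are hypothetical, and the scripted clock exists only to keep the example deterministic.

```python
import time
from contextlib import contextmanager


class DrillLog:
    """Records how long each runbook section takes during a recovery drill."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.sections = []  # (name, seconds) pairs in execution order

    @contextmanager
    def section(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised, so failed steps are timed too.
            self.sections.append((name, self._clock() - start))

    def total_seconds(self):
        return sum(seconds for _, seconds in self.sections)

    def slowest(self):
        return max(self.sections, key=lambda item: item[1])


# Scripted clock readings in seconds; a real drill would use the default clock.
fake_clock = iter([0, 120, 120, 900]).__next__
log = DrillLog(clock=fake_clock)

with log.section("declare incident and page owner"):
    pass  # the real runbook step would run here
with log.section("restore base backup"):
    pass

print(log.total_seconds(), log.slowest())
```

The total is the measured RTO for that drill, and the slowest section is the obvious place to look first when trying to shorten it.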
Don’t Blame, or You’ll Feel Lame
1. Communication is Key
2. People hide when they feel shame
3. When people don’t feel safe to ask, they guess
4. Guessing hurts your RTO
Make your RPO worth it by investing in your RTO
© Copyright Microsoft Corporation. All rights reserved.