
Disaster Recovery: A Process, Not a Tool



As presented at PGDay Boston 2026


Richard Yen

June 09, 2026


Transcript

  1. Agenda
    1. Where We Are
    2. Where We Need to Be
    3. How We’ll Get There
    4. Some Stories Along the Way
  2. A disaster is any sustained event that compromises the system’s availability, correctness, or business trust
  3. How DR is Usually Done
    1. Prepare
    2. Prevent
    “An ounce of prevention is worth a pound of cure”
  4. Postgres Makes Recovery Easy
    • pg_dump/pg_restore
    • pg_basebackup
    • pg_stat_replication
    • pg_stat_activity
    • Point-In-Time Recovery
    • repmgr/efm
    • Third-party backup tools
  5. RPO & RTO
    Talk to your leadership, and you’ll discover how much it’s really worth to them
  6. RPO
    1. 24-hour RPO -- $
    2. 15-minute RPO -- $$
    3. Near-zero RPO -- $$$
  7. 3 Layers of DR Planning
    1. Infrastructure failure
    2. Procedural failure
    3. Human failure
    Recovery is not always about failing over
  8. Runbook Engineering: Non-Technical Essentials
    1. Incident Commander
    2. Communications Owner
    3. Notification Cadence
    4. Escalation Chain
    5. Risk Authorization
  9. Runbook Validation
    1. Can a new engineer follow it?
    2. Does it assume access?
    3. Are commands and names current?
    4. Does it get regular playtime?
  11. Runbook Validation: Level Up Your Ability
    1. Prove that your Runbook works
    2. Reduce the time it takes to complete
    3. Simulate failure
    4. Test with unavailable human resources
    This is how you reduce RTO
  13. Validation Metrics
    1. Did recovery succeed?
    2. How long did each section take?
    3. What vagueness needs to be clarified?
    4. Identify documentation gaps
    5. Be Encouraging! Go out for dinner!
  14. Don’t Blame, or You’ll Feel Lame
    1. Communication is Key
    2. People hide when they feel shame
    3. When people don’t feel safe to ask, they guess
    4. Guessing hurts your RTO
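
The validation metrics in the slides (did recovery succeed, how long did each section take, what gaps were found) could be recorded after each drill with something like the sketch below. It is only an illustration, not something from the talk: the `SectionResult` type, the `summarize_drill` helper, the section names, the durations, and the one-hour RTO target are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SectionResult:
    """Outcome of one runbook section during a recovery drill (hypothetical type)."""
    name: str
    duration: timedelta
    succeeded: bool
    notes: str = ""  # vague steps or documentation gaps found while running it


def summarize_drill(sections, rto_target):
    """Summarize a drill: total time, success, RTO check, slowest section, gaps."""
    total = sum((s.duration for s in sections), timedelta())
    recovered = all(s.succeeded for s in sections)
    slowest = max(sections, key=lambda s: s.duration)
    return {
        "total": total,
        "recovered": recovered,
        "within_rto": recovered and total <= rto_target,
        "slowest": slowest.name,
        "gaps": [s.notes for s in sections if s.notes],
    }


# Made-up drill data, for illustration only.
drill = [
    SectionResult("Declare incident", timedelta(minutes=5), True),
    SectionResult("Restore base backup", timedelta(minutes=40), True,
                  "restore host name was outdated"),
    SectionResult("Replay WAL to target time", timedelta(minutes=20), True),
]
summary = summarize_drill(drill, rto_target=timedelta(hours=1))
print(summary)
```

Comparing these summaries from one drill to the next shows which section to shorten first and whether the runbook is converging on the agreed RTO.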