Distributed Point-in-Time Recovery with Postgres | PGConf.Russia 2018 | Eren Basak

024d6a0dd14fb31c804969a57a06dfbe?s=47 Citus Data
February 06, 2018

Distributed Point-in-Time Recovery with Postgres | PGConf.Russia 2018 | Eren Basak

Postgres has a nice feature called Point-in-time Recovery (PITR) that would allow you to go back in time. In this talk, we will discuss what are the use-cases of PITR, how to prepare your database for PITR by setting good base backup and WAL shipping setups, with some examples. We will expand the discussion with how to achieve PITR if you have a distributed and sharded Postgres setup by mentioning challenges such as clock differences and ways to overcome them, such as two-phase commit and pg_create_restore_point.

024d6a0dd14fb31c804969a57a06dfbe?s=128

Citus Data

February 06, 2018
Tweet

Transcript

  1. Distributed Point-in-Time Recovery with Postgres Eren Başak Cloud Software Engineer

    Citus Data PGConf.Russia 2018
  2. • What is Point-in-Time Recovery • How to do point-in-time

    recovery • Distributed Point-in-time-Recovery • Citus Cloud way of doing PITR Overview 2 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018
  3. • Point-in-the-past copy of an existing database • “I want

    02:45 pm yesterday copy of my database” • A fork Point-in-Time Recovery 3 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 regular database PITR database PITR
  4. • DB Admin mistakes (DROP a wrong column) • User

    deletes data by mistake “I ran our unit tests against the production database” • Want an independent copy of the production database • Playground for data analysts • Understand the impact of bigger changes (a new index) Point-in-Time Recovery 4 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 Benefits and Use Cases
  5. • Periodic base backups Run pg_basebackup and archive the backups

    • Wal Archiving archive_command = ‘cp %p “/somewhere/reliable/%f”’ How to? 5 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 Prerequisites
  6. 1. Determine a PITR target Timestamp or named restore points

    2. Restore a proper backup 3. Prepare recovery.conf How to? 6 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 Recovery Steps
  7. 1. Restore Command to fetch necessary WAL files restore_command =

    'cp “/somewhere/reliable/%f” %p' 2. Recovery Target a. Named Restore Point: recovery_target_name = ‘my-restore-point’ b. Time: recovery_target_date = ‘2018-01-24 06:37:00 +0300’ 3. Other Settings a. recovery_target_inclusive = true|false b. standby_mode = true|false c. recovery_target_action= shutdown|pause|promote How to? 7 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 recovery.conf
  8. 1. recovery.conf -> recovery.done after promotion 2. SELECT pg_is_in_recovery() 3.

    SELECT pg_last_xact_replay_timestamp() How to? 8 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 Monitoring the progress
  9. 1. Multiple PostgreSQL servers working together a. Citus b. Postgres-XL

    c. Application level sharding 2. PITR of all servers at once 3. May need to update metadata pg_dist_node for Citus Distributed PITR 9 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018
  10. 1. Use a suitable target time All servers should have

    backups before the selected time 2. PITR all servers to the target time Distributed PITR 10 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 Simple approach
  11. 1. Clock differences A distributed transaction has completed in one

    node but not started at another node 2. Ongoing transactions What if one transaction is aborted while others are ongoing Distributed PITR 11 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 Simple approach: Problems
  12. 1. Periodically create distributed restore points a. Block all writes

    / take locks b. Run pg_create_restore_point() on all servers c. Store restore point name on somewhere else 2. Pick a suitable distributed restore point 3. Execute PITR with distributed restore point name Distributed PITR 12 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018 Distributed Restore Points
  13. Citus is an extension to scale-out Postgres. Citus Cloud: Managed

    Citus offering from Citus Data • Nodes are on AWS EC2 • Daily backups of all servers to S3 • WAL archival to S3 • Backups are stored for 7 days • Using WAL-E (and soon WAL-G) PITR at Citus Cloud 13 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018
  14. 1. User selects a target time and instance types. Temporary/non-production

    PITR clusters don’t need to be as beefy as the production cluster. 2. Citus Cloud creates a new cluster. 3. Restore backups for each server. 4. Update Coordinator Metadata after PITR is complete. PITR at Citus Cloud 14 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018
  15. citus_create_restore_point() 1. Open connections from coordinator to workers 2. Send

    BEGIN commands 3. Block distributed transactions by locking metadata 4. Run pg_create_restore_point() on the coordinator 5. Send pg_create_restore_point() commands PITR at Citus Cloud 15 Eren Başak | Citus Data | PGConf.Russia 2018 | February 2018
  16. PITR at Citus Cloud 16 Eren Başak | Citus Data

    | PGConf.Russia 2018 | February 2018 Serve r 1 Serve r 2 Serve r 3 backu ps WAL AWS S3 wal- e Production Cluster Normal State of a Citus Cluster
  17. PITR at Citus Cloud 17 Eren Başak | Citus Data

    | PGConf.Russia 2018 | February 2018 Serve r 1 Serve r 2 Serve r 3 backu ps WAL AWS S3 wal- e Production Cluster During PITR Serve r 1 Serve r 2 Serve r 3 wal- e Fork Cluster
  18. PITR at Citus Cloud 18 Eren Başak | Citus Data

    | PGConf.Russia 2018 | February 2018 Serve r 1 Serve r 2 Serve r 3 backu ps WAL AWS S3 wal- e Production Cluster After PITR is completed Serve r 1 Serve r 2 Serve r 3 wal- e Fork Cluster
  19. © 2017 Citus Data. All right reserved. eren@citusdata.com Thank You!

    Eren Başak www.citusdata.com @aamederen