Tutorial on Modern PostgreSQL High Availability

OnGres
April 16, 2018

It’s 3am. Your phone rings. PostgreSQL is down, you need to promote a replica to master. Why the h**l isn’t this automatic?

If this has ever happened to you, you want automatic High Availability (HA). Don’t miss this tutorial! We will enter the world of Modern PostgreSQL HA.

Good news: there are several new, “modern” solutions for PostgreSQL HA. However, picking one is not easy. Most require non-trivial setups, and there are many small caveats around HA: how to provide entry points to the application, HA correctness, HA vs. read scaling, external dependencies, interaction with cloud environments, and so forth.

----------------

This presentation is a 3-hour tutorial given at PostgreSQL conferences (https://postgresconf.org/conferences/2018/program/proposals/tutorial-on-modern-postgresql-high-availability). It is accompanied by a set of lab exercises; code and instructions are available at https://github.com/ongres/pg_modern_ha_tutorial_exercises

Transcript

  1. ALVARO HERNANDEZ (@ahachete), CEO
    • DBA and Java Software Developer
    • OnGres and ToroDB Founder
    • Well-known member of the PostgreSQL Community
    • World-Class Database Expert (30+ talks in the last 2 years)
  2. PGConf.NY 2018 What HA is… NOT
    • Easy
    • A mechanism to improve performance
    • A risk-free or data-loss-free silver bullet
    • Totally Available (in the CAP sense): either CP, or eventually consistent with high availability
    • Something integrated into PostgreSQL core
  3. PGConf.NY 2018 What I don’t call “Modern” HA: HA built on disk replication or OS virtualization
    • Aurora is modern, but opaque
    • Hardware-assisted, DRBD, iSCSI, etc.: a lot of limitations
    • Typically limited to only 2 nodes
    • Hot standby (waste of resources)
  4. PGConf.NY 2018 What I don’t call “Modern” HA: solutions not based on consensus algorithms
    • HA is a distributed “agreement” problem
    • Only well-known consensus algorithms (Paxos, Raft, ZAB) are proven to work
    • Solutions based on home-grown algorithms would probably fail
  5. PGConf.NY 2018 Some open-source “Modern” HA solutions
    • Patroni (forked from Governor): https://github.com/zalando/patroni (Python)
    • Stolon: https://github.com/sorintlab/stolon (Go)
  6. PGConf.NY 2018 Architecture of typical “modern” HA solutions
    [Diagram: a distributed consensus cluster (etcd, Consul, ZooKeeper) spanning three locations (NOC 1/AZ 1, NOC 2/AZ 2, NOC 3/AZ 3); each location runs a consensus node plus an HA agent controlling a PostgreSQL instance; instances replicate via streaming replication (plus WAL shipping?); clients connect through an entry point (load balancer, DNS, proxy); agents interact with the consensus layer via feedback/scripting]
  7. PGConf.NY 2018 Current limitations / problems
    • Involved setup
    • Invasive (the agent controls PG, even its config)
    • No entry point included (Stolon does, but performance? testing?)
    • Consensus is external (yet another moving part)
    • Hard automatic deployments (complexity)
    • Tested, but more testing is welcome
    [Image: Stolon architecture]
  8. PGConf.NY 2018 Replication options (I)
    • Asynchronous (default):
      ✓ data-loss risk
      ✓ may require pg_rewind
    • Synchronous:
      ✓ lower performance (pick wisely!)
      ✓ may lead to cluster unavailability
  9. PGConf.NY 2018 Replication options (II)
    • Synchronous + asynchronous:
      ✓ Multiple synchronous standbys, PG 9.6:
        synchronous_standby_names = 'FIRST x (node_1, node_2, …, node_N)'
      ✓ Quorum commit, PG 10:
        synchronous_standby_names = 'ANY x (node_1, node_2, …, node_N)'
  10. PGConf.NY 2018 Replication options (III)
    • If you use synchronous replication, you may also want (9.6+):
        synchronous_commit = 'remote_apply'
      (otherwise you get no data loss, but not causal reads; see the sketch below)
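    A minimal sketch (not from the slides; node names are placeholders) of applying the quorum-commit and remote_apply settings from the two slides above, via SQL on a PG 10 primary:
      -- Require any 2 of the 3 named standbys to confirm each commit
      ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (node_1, node_2, node_3)';
      -- Wait until standbys have applied the WAL, enabling causal reads on replicas
      ALTER SYSTEM SET synchronous_commit = 'remote_apply';
      SELECT pg_reload_conf();  -- both settings take effect on reload, no restart needed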
  11. PGConf.NY 2018 The entry point (I)
    How do PostgreSQL clients connect to the master? PostgreSQL does not provide protocol-level support for advertising the master and/or propagating the cluster topology. MongoDB does: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst
  12. PGConf.NY 2018 The entry point (II)
    • Client libraries in PG 10 add support for multiple hosts/ports:
        postgresql://host1:port1,host2:port2/
        host=host1,host2 port=port1,port2
    • As well as selecting a read-write-only or any server:
        postgresql://...?target_session_attrs=read-write|any
  13. PGConf.NY 2018 The entry point (III)
    • Similarly, JDBC introduced support for multiple hosts and read-write/read-only host selection (compatible with earlier PG versions too; see the note below):
        jdbc:postgresql://node1,node2,node3,...,nodeN/db?targetServerType=master
        jdbc:postgresql://node1,node2,...,nodeN/db?targetServerType=preferSlave&loadBalanceHosts=true|false
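    For reference (not on the slides): both libpq’s target_session_attrs=read-write and JDBC’s targetServerType decide whether a host is writable by checking, after connecting, whether the server reports transactions as read-only; the equivalent manual check in SQL is:
      SHOW transaction_read_only;  -- 'off' on a writable master, 'on' on a hot standby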
  14. PGConf.NY 2018 The entry point (IV)
    • HAProxy, pgbouncer, cloud load balancers, DNS
    • A bit DIY: how does the entry point learn of changes to the master state? Typically scripting is required
    • Patroni healthcheck URLs return 200|503 (see the note below):
        GET|OPTIONS http://ip:8008/(master|replica)
    • Stolon provides a proxy as the entry point. Performance?
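    The role check behind such healthchecks can be reproduced directly in SQL (an illustration, not from the slides):
      SELECT pg_is_in_recovery();  -- false on the master, true on a replica
      SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();  -- on a PG 10 standby: WAL received vs replayed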
  15. PGConf.NY 2018 The entry point (V)
    • Consul is a distributed consensus database that also provides authoritative DNS support
    • It could be used to consistently propagate master state changes via DNS
    • No changes required on clients
    • Consul DNS is not used by the open-source HA solutions
  16. PGConf.NY 2018 Stolon Architecture
    • Clients connect to one or more proxies, which speak the PG protocol. At least 1.
    • Keepers are PostgreSQL managers: initdb, start, stop, configure, promote, pg_rewind. One per PG instance.
    • Sentinels are services that act to make the components converge to the clusterview. At least 1.
    [Image: Stolon architecture]
  17. PGConf.NY 2018 Stolon entry point?
    • A single proxy turns into a SPOF.
    • With multiple proxies, how are they addressed by clients?
    • Ideally, with another layer: a load balancer on top of them.
    [Diagram: clients connect to an HAProxy layer, which balances across the Stolon proxies]
  18. PGConf.NY 2018 Measuring Stolon Proxy Performance
    • pgbench on m5.4xlarge (16 cores, 64GB RAM)
    • shared_buffers=16GB, max_connections=1000
    • pgbench scale=1,000 (16GB), on a ramdisk
    • Client connections: 1, 10, 50, 100, 200, 500, 1000
    • Each test run for 3 minutes
  19. PGConf.NY 2018 TPS (Transactions Per Second) - pgbench
    [Chart: TPS of direct PostgreSQL connections vs the Stolon proxy, at 1 to 1000 clients; direct connections peak near 29,000 TPS while the proxy peaks near 17,000 TPS, and both drop sharply at 1000 clients]
  20. PGConf.NY 2018 Latency - pgbench
    [Chart: latency (ms) of direct PostgreSQL connections vs the Stolon proxy, at 1 to 1000 clients]
  21. PGConf.NY 2018 Stolon interesting facts
    • Supports “standby cluster” (a cluster with 0 masters, replicating in a cascading topology). Ideal for migrations to Stolon.
    • You can bypass the proxy and connect directly to PG. “At your own risk”.
    • The proxy(ies) send all queries to the master. Read-only scaling is DIY.
    • Does not support PG semi-synchronous modes. If you go full sync, the cluster may hang if just 1 node fails.
  22. PGConf.NY 2018 Difficulties in automation / deployment
    • You may not be able to use existing automation “roles” for PostgreSQL, as the HA agents “take over” it and control even the initdb process
    • Many moving parts, many automations needed
    • Is provisioning done by the HA solution or by AutoScaling?
    • You may need to rely on fixed private IPs or use an internal load balancer
  23. PGConf.NY 2018 STONITH
    • The master gets partitioned away from the cluster. But… clients are still connected and writing!
    • You need to fence/STONITH the “old” master.
    • Architecture/environment/cloud dependent (scripting again…)
    • Possible unreplicated transactions due to asynchronous mode (data loss)
    • Best done through a proxy (Stolon)
  24. PGConf.NY 2018 Application retry logic
    • The application needs to be programmed to retry connections, as they may fail and be restarted
    • Transactions may be aborted when a connection fails; extend the retry logic to transactions too
    • On PG >= 10 you may check the status of a transaction (expanded in the note below):
        BEGIN; SELECT txid_current(); …; END;
        SELECT txid_status(txid);
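    A concrete version of the pattern above (an illustrative sketch, not from the slides; the table is hypothetical and 12345 stands for whatever txid_current() returned):
      BEGIN;
      SELECT txid_current();                                      -- e.g. 12345; store it client-side
      UPDATE accounts SET balance = balance - 10 WHERE id = 1;    -- hypothetical work
      COMMIT;
      -- If the connection dropped before COMMIT was acknowledged, reconnect and ask:
      SELECT txid_status(12345);                                  -- 'committed', 'aborted' or 'in progress'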
  25. PGConf.NY 2018 HA replicas vs scale read-only replicas
    • They are not the same!
    • HA replicas should have the same configuration and the same hardware as the master
    • Fine-tune, if desired, scale read-only replicas for specific workloads, but never promote them!
    • The application should be aware of read-only replicas’ stale reads (use synchronous_commit=remote_apply or read-your-own-writes from the master; see the note below)
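    One illustrative way (not from the slides) to watch how far behind the read-only replicas are, from the master on PG 10:
      SELECT application_name, sync_state, write_lag, flush_lag, replay_lag
      FROM pg_stat_replication;  -- per-standby replication lag as seen by the master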
  26. PGConf.NY 2018 What about logical replication?
    • Logical decoding (the basic infrastructure) only since 9.4; logical replication (patch required, external) since 9.6; in core only since 10
    • Does not replicate DDL
    • Replicates at a per-database, not per-cluster, level (see the note below)
    • Does not prevent writes on replicas
    You cannot do HA with logical replication; use streaming replication
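    For context (a sketch, not from the slides; database, publication and user names are placeholders), PG 10 in-core logical replication is configured per database, which is exactly why it cannot cover a whole cluster:
      -- On the publisher, inside one specific database
      CREATE PUBLICATION app_pub FOR ALL TABLES;
      -- On the subscriber, again per database; DDL and other databases are not replicated
      CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=master dbname=appdb user=repuser'
        PUBLICATION app_pub;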
  27. PGConf.NY 2018 K8S-based PostgreSQL HA solutions
    • Kubernetes provides service-independent support for implementing HA services.
    • It includes generic TCP proxy services and leader election.
    • However, it is based on persistent volumes (PVs) and StatefulSets (formerly PetSets), not Streaming Replication. PVs may be tied to a single datacenter (AZ).
    • Patroni and Stolon may use K8S, but with Streaming Replication. K8S is “just” used as the distributed consensus database (no etcd or Consul needed).
  28. PGConf.NY 2018 Kubernetes Operator
    • Kubernetes also offers operators (https://coreos.com/blog/introducing-operators.html), a pattern to systematize operational actions. This can be used to greatly simplify PostgreSQL deployments and even HA: “An Operator is an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user”
    • CrunchyData and Patroni (incubating) have PG operators
  29. PGConf.NY 2018 Multi-datacenter HA: Hydra (OnGres + Groupon)
    • Uses Consul KV and DNS Query APIs to provide consistent, multi-data-center entry points
    • Failover within a single data center, followed by the instances in the other data centers
    • Switchover between full data centers
    • Automatic election of the most recent replica
    • Async REST API for management
  30. PGConf.NY 2018 Multi-datacenter HA: Hydra (OnGres + Groupon)
    [Diagram: two data centers (SAC1 and SNC1), each with a Consul cluster and PostgreSQL nodes running a Hydra agent and a Consul agent; the PG master replicates to local replicas and cascades replication to the other data center; GDS clients resolve the entry point via Consul DNS queries; the Consul clusters gossip asynchronously; a Hydra server exposes a REST API for management/GUI]
  31. PGConf.NY 2018 General considerations
    • Everything should be done with Deployment Manager (DM)
    • Several deployment steps, each one generates a DM
    • A DM is never deleted automatically
    • Image generation / adaptation to the case
  32. Fundamental Concepts: Automating deployment of HA in GCP, general structure
    ๏ gcp
      • deploy
      • infrastructure
        ‣ deployment-manager
      • vm-images
        ‣ ansible
  33. PGConf.NY 2018 General structure (II)
    [Diagram: deployment steps (VPC, xx cluster, xx image generation, load balancer, health checker) mapped to their tooling: Deployment Manager YAMLs, ansible-helper helpers and Ansible, plus image steps for ubuntu, etcd, postgres and patroni]
  34. PGConf.NY 2018 Deployment steps
    1. VPC
    2. Subnets
    3. Base image (OS + basics)
    4. DCDB image
    5. DCDB cluster
    6. S3/GS Bucket (WAL-E)
    7. PostgreSQL image
    8. PostgreSQL cluster
    9. PostgreSQL load balancer
    (each step is either a Deployment Manager deployment or an image generation)
  35. PGConf.NY 2018 Images
    • Image generation + image instantiation
    • Reason: control software updates, avoid incompatibilities
    • How do we tailor them? Via metadata:
        instanceMetadata:
          - key: etcd-cluster
            value: {{ properties["etcdCluster"] }}
          - key: startup-script
            value: /usr/local/bin/fix_etcd_service.sh
  36. PGConf.NY 2018 Images (II)
        #!/bin/bash
        # Read GCE instance metadata (generic path, and key/value attributes)
        function instance_metadata() {
            curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/$1
        }
        function instance_metadata_kv() {
            instance_metadata attributes/$1
        }
        name=`instance_metadata_kv template-name`
        my_ip=`instance_metadata network-interfaces/0/ip`
        etcd_cluster=`instance_metadata_kv etcd-cluster`
        # Substitute the placeholders in the etcd systemd unit, then (re)start etcd
        sed -i "s+__NAME__+$name+" /etc/systemd/system/etcd.service
        sed -i "s+__MY_IP__+$my_ip+" /etc/systemd/system/etcd.service
        sed -i "s+__ETCD_INITIAL_CLUSTER__+$etcd_cluster+" /etc/systemd/system/etcd.service
        systemctl daemon-reload
        service etcd restart
  37. PGConf.NY 2018 Images (III)
        [Unit]
        Description=Etcd Service

        [Service]
        PermissionsStartOnly=true
        Environment=ETCD_NAME=__NAME__
        Environment=ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
        Environment=ETCD_LISTEN_PEER_URLS=http://__MY_IP__:2380
        Environment=ETCD_LISTEN_CLIENT_URLS=http://__MY_IP__:2379
        Environment=ETCD_ADVERTISE_CLIENT_URLS=http://__MY_IP__:2379
        Environment=ETCD_ADVERTISE_PEER_URLS=http://__MY_IP__:2380
        Environment=ETCD_DATA_DIR=/var/lib/etcd
        Environment=ETCD_INITIAL_CLUSTER_STATE=new
        Environment=ETCD_INITIAL_CLUSTER=__ETCD_INITIAL_CLUSTER__
        ExecStart={{ etcd_install_path }}/{{ etcd_etcd_binary_file_name }} \
            --name "$ETCD_NAME" \
            --initial-advertise-peer-urls "$ETCD_ADVERTISE_PEER_URLS" \
            --listen-peer-urls "$ETCD_LISTEN_PEER_URLS" \
            --listen-client-urls "$ETCD_LISTEN_CLIENT_URLS" \
            --advertise-client-urls "$ETCD_ADVERTISE_CLIENT_URLS" \
            --initial-cluster-token "$ETCD_INITIAL_CLUSTER_TOKEN" \
            --initial-cluster "$ETCD_INITIAL_CLUSTER" \
            --initial-cluster-state "$ETCD_INITIAL_CLUSTER_STATE"
        Restart=always
        RestartSec=3s

        [Install]
        WantedBy=multi-user.target