Slide 1

Slide 1 text

Modern PostgreSQL High Availability

Slide 2

Slide 2 text

ALVARO HERNANDEZ (@ahachete): CEO; DBA and Java Software Developer; OnGres and ToroDB Founder; Well-Known member of the PostgreSQL Community; World-Class Database Expert (+30 talks in the last 2 years)

Slide 3

Slide 3 text

PABLO GONZALEZ (@dovaleac, pgdoval): DEV working @ OnGres (www.ongres.com); Software Developer; PostgreSQL support

Slide 4

Slide 4 text

PGConf.NY 2018 Introduction

Slide 5

Slide 5 text

PGConf.NY 2018 What HA is… NOT
• Easy
• A mechanism to improve performance
• A risk-free or data-loss-free silver bullet
• Totally Available (in the CAP sense). Either CP or Eventually Consistent with High Availability
• Something integrated into core

Slide 6

Slide 6 text

PGConf.NY 2018 What HA is… NOT SR != HA (Streaming Replication)

Slide 7

Slide 7 text

PGConf.NY 2018 What I don’t call “Modern” HA: HA built on disk replication or OS virtualization
• Aurora is modern, but opaque
• Hardware-assisted, DRBD, iSCSI, etc.: a lot of limitations
• Typically limited to only 2 nodes
• Hot standby (waste of resources)

Slide 8

Slide 8 text

PGConf.NY 2018 What I don’t call “Modern” HA: Solutions not based on consensus algorithms
• HA is a distributed “agreement” problem
• Only well-known consensus algorithms (Paxos, Raft, ZAB) are proven to work
• Solutions based on home-grown algorithms would probably fail

Slide 9

Slide 9 text

PGConf.NY 2018 Some open-source “Modern” HA solutions
• Patroni (forked from Governor)
  https://github.com/zalando/patroni
  Python
• Stolon
  https://github.com/sorintlab/stolon
  Go

Slide 10

Slide 10 text

PGConf.NY 2018 Architecture of typical "modern" HA solutions
[diagram: an entry point (load balancer, DNS, proxy) in front of three nodes/AZs; each node runs an HA agent plus PostgreSQL (with its disk) and a node of the distributed consensus cluster (etcd, Consul, ZooKeeper); streaming replication (+ WAL shipping?) between the PostgreSQL instances; feedback/scripting between the HA agents and the consensus cluster]

Slide 11

Slide 11 text

PGConf.NY 2018 Current limitations / problems
• Involved setup
• Invasive (agent controls PG, even its config)
• No entry point included (Stolon does, but performance? testing?)
• Consensus is external (yet another moving part)
• Hard to deploy automatically (complexity)
• Tested, but more testing is welcome

Slide 12

Slide 12 text

PGConf.NY 2018 HA caveats / recommendations / best practices

Slide 13

Slide 13 text

PGConf.NY 2018 Replication options (I)
• Asynchronous (default):
  ✓ data-loss risk
  ✓ may require pg_rewind (see the sketch below)
• Synchronous:
  ✓ lower performance (pick wisely!)
  ✓ may lead to cluster unavailability
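
A minimal sketch of that pg_rewind step, assuming an asynchronous failover after which the old master has diverged (the host name and data directory are placeholders; pg_rewind needs wal_log_hints=on or data checksums enabled beforehand):

# Resynchronize the failed ex-master against the newly promoted master
pg_rewind \
  --target-pgdata=/var/lib/postgresql/10/main \
  --source-server='host=new-master.example.com port=5432 user=postgres dbname=postgres'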

Slide 14

Slide 14 text

PGConf.NY 2018 Replication options (II)
• Synchronous + asynchronous:
  ✓ Multiple synchronous, PG 9.6:
    synchronous_standby_names = 'FIRST x (node_1, node_2, …, node_N)'
  ✓ Quorum commit, PG 10:
    synchronous_standby_names = 'ANY x (node_1, node_2, …, node_N)'
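
For illustration only (the node names and quorum size are placeholders), the PG 10 quorum-commit setting can be applied at runtime, since synchronous_standby_names only needs a configuration reload:

# Require an ACK from any 2 of the 3 listed standbys before a commit returns
psql -U postgres -c "ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (node_1, node_2, node_3)';"
psql -U postgres -c "SELECT pg_reload_conf();"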

Slide 15

Slide 15 text

PGConf.NY 2018 Replication options (III)
• If you include synchronous, you may also want (9.6):
  synchronous_commit = 'remote_apply'
  (otherwise you achieve no-data-loss, but not causal reads)
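
A sketch of turning that on cluster-wide; commits then wait until synchronous standbys have applied the WAL, not just flushed it:

# synchronous_commit can also be set per session or per transaction with SET
psql -U postgres -c "ALTER SYSTEM SET synchronous_commit = 'remote_apply';"
psql -U postgres -c "SELECT pg_reload_conf();"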

Slide 16

Slide 16 text

PGConf.NY 2018 The entry point (I) How do PostgreSQL clients connect to the master? PostgreSQL does not provide support in the protocol for advertising the master and/or propagating the cluster topology. MongoDB does: https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst

Slide 17

Slide 17 text

PGConf.NY 2018 The entry point (II)
• Client libraries in PG 10 add support for multiple hosts/ports:
  host=host1,host2 port=port1,port2
  postgresql://host1:port1,host2:port2/
• As well as read-write-only or any server:
  postgresql://...?target_session_attrs=read-write|any
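
A sketch with hypothetical host names: libpq tries the listed hosts in order and, with target_session_attrs=read-write, only settles on a server that accepts writes (i.e. the master):

psql 'postgresql://node1:5432,node2:5432,node3:5432/mydb?target_session_attrs=read-write'
# Equivalent keyword/value form (a single port applies to all hosts)
psql "host=node1,node2,node3 port=5432 dbname=mydb target_session_attrs=read-write"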

Slide 18

Slide 18 text

PGConf.NY 2018 The entry point (III)
• Similarly, JDBC introduced support for multiple hosts and read-write/read-only host selection (compatible with earlier PG versions too):
  jdbc:postgresql://node1,node2,node3,...,nodeN/db?targetServerType=master
  jdbc:postgresql://node1,node2,...,nodeN/db?targetServerType=preferSlave&loadBalanceHosts=true|false

Slide 19

Slide 19 text

PGConf.NY 2018 The entry point (IV)
• HAProxy, pgbouncer, cloud load balancers, DNS
• A bit DIY: how does the entry point learn about changes to the master state? Typically scripting is required
• Patroni healthcheck URLs return 200|503:
  GET|OPTIONS http://ip:8008/(master|replica)
• Stolon provides a proxy as entry point. Performance?
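
A sketch of that scripting glue, using the Patroni endpoints quoted above (the IP is a placeholder); an external checker, or a load balancer's HTTP health check, keeps only the backend that answers 200 for the desired role:

# 200 -> this node currently holds the role, 503 -> it does not
curl -s -o /dev/null -w '%{http_code}\n' http://10.0.0.11:8008/master
curl -s -o /dev/null -w '%{http_code}\n' http://10.0.0.11:8008/replica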

Slide 20

Slide 20 text

PGConf.NY 2018 The entry point (V)
• Consul is a distributed consensus database that also provides authoritative DNS support
• It could be used to consistently propagate master state changes via DNS
• No changes required on clients
• Consul DNS is not used by the open-source HA solutions
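
For illustration, assuming the HA agents register the current master as a Consul service named pg-master (a hypothetical name), any client or resolver can look it up through Consul's DNS interface (default port 8600):

dig @127.0.0.1 -p 8600 +short pg-master.service.consul
# Returns the IP of whichever node is currently registered as master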

Slide 21

Slide 21 text

PGConf.NY 2018 Stolon overview

Slide 22

Slide 22 text

PGConf.NY 2018 Stolon Architecture
• Clients connect to one or more proxies, which speak the PG protocol. At least 1.
• Keepers are PostgreSQL managers: initdb, start, stop, configure, promote, pg_rewind. One per PostgreSQL instance.
• Sentinels are services that make the components converge to the cluster view. At least one.
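
A rough sketch of how the three component types are started, following Stolon's simple-cluster example (flags may vary across versions; the cluster name, store backend/endpoints, addresses and passwords are placeholders):

# One-off: initialize the cluster data in the store
stolonctl --cluster-name pgcluster --store-backend etcdv3 init
# Keeper: manages one local PostgreSQL instance
stolon-keeper --cluster-name pgcluster --store-backend etcdv3 \
  --uid keeper0 --data-dir /var/lib/stolon/postgres0 \
  --pg-listen-address 10.0.0.11 \
  --pg-repl-username repluser --pg-repl-password replpassword \
  --pg-su-password supassword
# Sentinel: computes the cluster view and drives convergence
stolon-sentinel --cluster-name pgcluster --store-backend etcdv3
# Proxy: PostgreSQL-protocol entry point that always routes to the master
stolon-proxy --cluster-name pgcluster --store-backend etcdv3 --port 25432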

Slide 23

Slide 23 text

PGConf.NY 2018 Stolon entry point?
• A single proxy turns into a SPOF.
• With multiple proxies, how are they addressed by clients?
• Ideally, with another layer: a load balancer on top of them.
[diagram: clients, a layer of HAProxy load balancers, and the Stolon proxies behind them]
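
One possible sketch of that extra layer (all addresses, ports and timeouts are placeholders): a TCP-mode HAProxy spreading client connections over the Stolon proxies:

cat > /etc/haproxy/haproxy.cfg <<'EOF'
defaults
    mode tcp
    timeout connect 5s
    timeout client  1h
    timeout server  1h

listen stolon-proxies
    bind *:5432
    # Backends are Stolon proxies, not PostgreSQL instances
    server proxy1 10.0.0.11:25432 check
    server proxy2 10.0.0.12:25432 check
    server proxy3 10.0.0.13:25432 check
EOF
systemctl reload haproxy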

Slide 24

Slide 24 text

PGConf.NY 2018 Measuring Stolon Proxy Performance
• pgbench on m5.4xlarge (16 cores, 64GB RAM)
• shared_buffers=16GB, max_connections=1000
• pgbench scale=1,000 (16GB), on a ramdisk
• client connections: 1, 10, 50, 100, 200, 500, 1000
• Each test run for 3 minutes
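
Roughly how such a run can be reproduced (host names are placeholders; the flags follow the parameters above):

# Initialize a scale-1,000 dataset (~16GB)
pgbench -i -s 1000 -h pg-node -U postgres bench
# 3-minute runs per client count: direct vs through the Stolon proxy
for clients in 1 10 50 100 200 500 1000; do
  pgbench -c "$clients" -j 16 -T 180 -h pg-node -U postgres bench
  pgbench -c "$clients" -j 16 -T 180 -h stolon-proxy -p 25432 -U postgres bench
done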

Slide 25

Slide 25 text

PGConf.NY 2018 TPS (Transactions Per Second) - pgbench: Direct PostgreSQL connections vs Stolon proxy
[bar chart: tps (0 to 30,000) vs # clients (1, 10, 50, 100, 200, 500, 1000), two series: direct and proxy; measured values range roughly from ~1,700 to ~29,000 tps]

Slide 26

Slide 26 text

PGConf.NY 2018 Latency - pgbench: Direct PostgreSQL connections vs Stolon proxy
[chart: latency (ms, 0 to 80) vs # clients (1, 10, 50, 100, 200, 500, 1000), two series: direct and proxy]

Slide 27

Slide 27 text

PGConf.NY 2018 Stolon interesting facts
• Supports “standby cluster” (a cluster with 0 masters, replicating in a cascading topology). Ideal for migrations to Stolon.
• You can bypass the proxy and connect directly to PostgreSQL. “At your own risk”.
• Proxy(ies) send(s) all queries to the master. Read-only scaling is DIY.
• Does not support PG semi-synchronous modes. If you go full sync, the cluster may hang if just 1 node fails.

Slide 28

Slide 28 text

PGConf.NY 2018 Other HA gotchas

Slide 29

Slide 29 text

PGConf.NY 2018 Difficulties in automation / deployment
• You may not be able to use already-created automation “roles” for PostgreSQL, as they “take over” it and even control the initdb process
• Many moving parts, many automations needed
• Is provisioning done by the HA solution or by AutoScaling?
• You may need to rely on fixed private IPs or use an internal load balancer

Slide 30

Slide 30 text

PGConf.NY 2018 STONITH
• Master gets partitioned away from the cluster. But… clients are still connected and writing!
• You need to fence/STONITH the “old” master.
• Architecture/environment/cloud dependent (scripting again…)
• Possible unreplicated transactions due to asynchronous mode (data loss)
• Best done through a proxy (Stolon)

Slide 31

Slide 31 text

PGConf.NY 2018 Application retry logic
• The application needs to be programmed to retry connections, as they may fail and be restarted
• Transactions may be aborted when a connection fails; extend the retry logic to transactions too
• On PG >= 10 you may check the status of a tx:
  BEGIN; SELECT txid_current(); …; END;
  SELECT txid_status(txid);
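
A sketch of that check (the table, values and remembered txid are hypothetical): record txid_current() inside the transaction, and if the connection drops before the COMMIT acknowledgement, ask a fresh session what happened to it:

psql -h node1 -U app mydb <<'SQL'
BEGIN;
SELECT txid_current();            -- the client keeps this value
INSERT INTO payments VALUES (42);
END;
SQL

# From a new connection, substituting the remembered txid:
psql -h node1 -U app mydb -c "SELECT txid_status(1234567);"
# -> 'committed', 'aborted' or 'in progress'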

Slide 32

Slide 32 text

PGConf.NY 2018 HA replicas vs scale read-only replicas
• They are not the same!
• HA replicas should have the same configuration and the same hardware as the master
• Fine-tune, if desired, scale read-only replicas for specific workloads --but never promote them!
• The application should be aware of read-only replicas’ stale reads (use synchronous_commit=remote_apply or read-your-own-writes from the master)

Slide 33

Slide 33 text

PGConf.NY 2018 What about logical replication?
• Logical decoding (basic infrastructure) only since 9.4; logical replication external (patch required) since 9.6, in core only since 10
• Does not replicate DDL
• Replicates on a per-database, not per-cluster, level
• Does not prevent writes on replicas
You cannot do HA with logical replication; use streaming replication

Slide 34

Slide 34 text

PGConf.NY 2018 Kubernetes (K8S)

Slide 35

Slide 35 text

PGConf.NY 2018 K8S-based PostgreSQL HA solutions
• Kubernetes provides service-independent support for implementing HA services.
• It includes generic TCP proxy services and leader election.
• However, it is based on persistent volumes (PVs) and statefulsets (petsets), not Streaming Replication. PVs may be tied to a single datacenter (AZ).
• Patroni and Stolon may use K8S, but using Streaming Replication. K8S is “just” used as the Distributed Consensus Database (no etcd or Consul needed).

Slide 36

Slide 36 text

PGConf.NY 2018 Kubernetes Operator
• Kubernetes also offers operators (https://coreos.com/blog/introducing-operators.html), a pattern to systematize operational actions. This can be used to greatly simplify PostgreSQL deployments and even HA: “An Operator is an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user”
• CrunchyData and Patroni (incubating) have PostgreSQL operators

Slide 37

Slide 37 text

PGConf.NY 2018 CrunchyData operator

Slide 38

Slide 38 text

PGConf.NY 2018 CrunchyData operator

Slide 39

Slide 39 text

PGConf.NY 2018 Multi-data center HA

Slide 40

Slide 40 text

PGConf.NY 2018 Multi-datacenter HA: Hydra (OnGres + Groupon)
• Uses Consul KV and DNS Query APIs to provide consistent, multi-data-center entry points
• Failover within a single data center, then followed by the instances in other data centers
• Switchover between full data centers
• Automatic election of the most recent replica
• Async REST API for management

Slide 41

Slide 41 text

PGConf.NY 2018 Multi-datacenter HA: Hydra (OnGres + Groupon)
[diagram: two data centers (SAC1, SNC1), each with a Consul cluster and PostgreSQL nodes running a Hydra agent plus a Consul agent; the PG master in one DC with cascading replication within and across data centers; GDS clients resolve the master via Consul DNS queries; a Hydra Server exposes a REST API and management GUI; Consul gossip (async) links the data centers]

Slide 42

Slide 42 text

PGConf.NY 2018 Automating deployment of HA in GCP

Slide 43

Slide 43 text

PGConf.NY 2018 General considerations
• Everything should be done with Deployment Manager (DM)
• Several deployment steps, each one generating a DM deployment (see the example below)
• A DM deployment is never deleted automatically
• Image generation / adaptation to each use case
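
For illustration (the deployment and config names are placeholders), each step roughly corresponds to creating one Deployment Manager deployment, which then stays around until deleted explicitly:

gcloud deployment-manager deployments create pg-cluster-step5 \
  --config deployment-manager/pg-cluster.yaml
# Inspect or remove a deployment later by name
gcloud deployment-manager deployments describe pg-cluster-step5
gcloud deployment-manager deployments delete pg-cluster-step5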

Slide 44

Slide 44 text

Automating deployment of HA in GCP: General structure
๏ gcp
  • deploy
  • infrastructure
    ‣ deployment-manager
  • vm-images
    ‣ ansible

Slide 45

Slide 45 text

PGConf.NY 2018 General structure (II)
[diagram: deployment steps split between Deployment Manager (VPC, clusters, load balancer, health checker) and image generation via an ansible-helper (ubuntu, etcd, postgres, patroni); supporting pieces: helpers, DM YAMLs, Ansible]

Slide 46

Slide 46 text

PGConf.NY 2018 Deployment steps
1. VPC
2. Subnets
3. Base image (OS + basic)
4. DCDB image
5. DCDB cluster
6. S3/GS Bucket (WAL-E)
7. PostgreSQL image
8. PostgreSQL cluster
9. PostgreSQL load balancer
(steps are handled either by Deployment Manager or by image generation)

Slide 47

Slide 47 text

PGConf.NY 2018 Images
• Image generation + image instantiation
• Reason: control software updates, avoid incompatibilities
• How do we tailor them? Via metadata:
  instanceMetadata:
  - key: etcd-cluster
    value: {{ properties["etcdCluster"] }}
  - key: startup-script
    value: /usr/local/bin/fix_etcd_service.sh

Slide 48

Slide 48 text

PGConf.NY 2018 Images (II)
#!/bin/bash
# Helpers to read GCE instance metadata (generic paths and custom attributes)
function instance_metadata() {
  curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/$1
}
function instance_metadata_kv() {
  instance_metadata attributes/$1
}

name=`instance_metadata_kv template-name`
my_ip=`instance_metadata network-interfaces/0/ip`
etcd_cluster=`instance_metadata_kv etcd-cluster`

# Fill in the etcd unit template with this instance's values and restart etcd
sed -i "s+__NAME__+$name+" /etc/systemd/system/etcd.service
sed -i "s+__MY_IP__+$my_ip+" /etc/systemd/system/etcd.service
sed -i "s+__ETCD_INITIAL_CLUSTER__+$etcd_cluster+" /etc/systemd/system/etcd.service
systemctl daemon-reload
service etcd restart

Slide 49

Slide 49 text

PGConf.NY 2018 Images (III)
[Unit]
Description=Etcd Service

[Service]
PermissionsStartOnly=true
Environment=ETCD_NAME=__NAME__
Environment=ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
Environment=ETCD_LISTEN_PEER_URLS=http://__MY_IP__:2380
Environment=ETCD_LISTEN_CLIENT_URLS=http://__MY_IP__:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=http://__MY_IP__:2379
Environment=ETCD_ADVERTISE_PEER_URLS=http://__MY_IP__:2380
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_INITIAL_CLUSTER_STATE=new
Environment=ETCD_INITIAL_CLUSTER=__ETCD_INITIAL_CLUSTER__
ExecStart={{ etcd_install_path }}/{{ etcd_etcd_binary_file_name }} --name "$ETCD_NAME" --initial-advertise-peer-urls "$ETCD_ADVERTISE_PEER_URLS" --listen-peer-urls "$ETCD_LISTEN_PEER_URLS" --listen-client-urls "$ETCD_LISTEN_CLIENT_URLS" --advertise-client-urls "$ETCD_ADVERTISE_CLIENT_URLS" --initial-cluster-token "$ETCD_INITIAL_CLUSTER_TOKEN" --initial-cluster "$ETCD_INITIAL_CLUSTER" --initial-cluster-state "$ETCD_INITIAL_CLUSTER_STATE"
Restart=always
RestartSec=3s

[Install]
WantedBy=multi-user.target

Slide 50

Slide 50 text

www.ongres.com info@ongres.com Questions? @ongresinc