PGDay.IT 2018


PostgreSQL database as a service on Kubernetes


Oleksii Kliukin

June 29, 2018


About me

• Oleksii Kliukin
• Database Engineer @ Zalando
• PostgreSQL contributor
• Berlin PostgreSQL meetup organizer
• Twitter: @hintbits
SELECT topics FROM agenda;

• A brief history of PostgreSQL at Zalando
• Live DEMO (what can possibly go wrong?)
• How to stop worrying (and embrace Patroni)
• Kubernetes: the real thing
• Kubernetes-native Patroni
• What is in the name: Postgres Operator
• In Italy since 2013
• 2018: Czech Republic, Ireland
• HQ in Berlin
• Tech hubs: Dortmund, Helsinki, Dublin and Lisbon
PostgreSQL at Zalando as of May 2017

• > 300 databases in the data centers
• > 170 databases on AWS, managed by the DB team
• > 250 databases in other Kubernetes clusters
• > 165 databases run in the ACID team's Kubernetes cluster
Let's start with names

[Diagram: a PostgreSQL (HA) cluster consists of a primary and standbys connected by streaming replication; a single PostgreSQL instance contains the databases template0, template1 and postgres]
Vintage (DC) and modern (AWS) PostgreSQL environments

[Diagram: two data centers (DC1, DC2) shipping WAL to an NFS archive with a one-hour delay, alongside an AWS VPC]
Spilo Docker image at Zalando

• PGDATA on an external volume (EBS or i3/c5 NVMe)
• Environment-variable based configuration
• One container per EC2 instance
• PostgreSQL versions from 9.4 up to 10
• Plenty of extensions (all contrib, PostGIS, TimescaleDB, PL/V8, pg_cron, etc.)
• Additional tools (PgBouncer, PgQ)
• Extremely lightweight (69 MB)
[Diagram: a CloudFormation stack with an auto-scaling group spanning availability zones A, B and C; each EC2 instance has a root volume and a data volume and runs a master or replica DB inside a cluster security group (ports 5432 and 8008); an Elastic IP (db.zalando) points at the master, while an Elastic Load Balancer with its own security group (db-repl.zalando) routes to replicas via GET /replica on the Patroni API; backups and WAL go to an S3 bucket; user data supplies the Docker image, backup schedule, superuser password, replication password and Postgres parameters; cluster state is kept in Etcd]
What is Patroni

• Automatic failover solution for PostgreSQL streaming replication
• A daemon that manages one PostgreSQL instance
• Keeps the state of the cluster in a DCS (Etcd, ZooKeeper, Consul, Kubernetes), also referred to as a consistency layer
• For new instances, decides whether to initialize a new cluster or join an existing one
• For running instances, executes promotion/demotion when necessary
• A number of additional related functions (global configuration, scheduled actions, pause mode, pg_rewind support, etc.)
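The daemon is driven by a single YAML file. A minimal sketch for one member backed by Etcd could look like the following; the host names, credentials and paths are placeholders, not values from the talk:

```yaml
scope: batman            # cluster name, shared by all members
name: postgresql0        # unique name of this member

restapi:
  listen: 0.0.0.0:8008
  connect_address: 127.0.0.1:8008

etcd:
  host: 127.0.0.1:2379   # the DCS that keeps the cluster state

bootstrap:
  dcs:                   # global configuration, stored once in the DCS
    ttl: 30
    loop_wait: 10
    retry_timeout: 10

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 127.0.0.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: secret
```

The ttl/loop_wait/retry_timeout values under bootstrap.dcs match the ones visible in the cluster-config annotation shown later in the talk.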
What Patroni is not

• Not an arbiter for the whole HA cluster
• Not a swiss-army knife for Postgres maintenance
• Not a substitute for proper monitoring
• Not a tool to use if you don't understand how Etcd (or whichever DCS you use) works
• Not a silver bullet (but it tries to balance ease of use with extensibility)
• Not just an internal Zalando project (IBM Compose, Red Hat and many other companies use it)
Why distributed consistency?

[Diagram: several primary candidates racing against an Etcd cluster, each trying to take the leader key; only one can succeed]
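The race above can be sketched with a toy in-memory store. The names (ToyDCS, try_take_leader) are illustrative, not Patroni's actual API; the point is that an atomic compare-and-set in a consistent store guarantees at most one candidate acquires the leader key.

```python
# Toy illustration of leader election via atomic compare-and-set,
# the primitive Patroni relies on in Etcd/Consul/Kubernetes.
# All names here are illustrative, not Patroni's real API.

class ToyDCS:
    """A minimal in-memory stand-in for a consistent key-value store."""
    def __init__(self):
        self._data = {}

    def compare_and_set(self, key, expected, new):
        # Atomic in a real DCS; single-threaded here, so trivially atomic.
        if self._data.get(key) == expected:
            self._data[key] = new
            return True
        return False

def try_take_leader(dcs, candidate):
    # Only succeeds if nobody holds the leader key yet.
    return dcs.compare_and_set("/service/batman/leader", None, candidate)

dcs = ToyDCS()
results = {c: try_take_leader(dcs, c)
           for c in ["postgresql0", "postgresql1", "postgresql2"]}
winners = [c for c, ok in results.items() if ok]
print(winners)  # exactly one candidate acquires the lock
```

In a real deployment the key also carries a TTL, so a crashed leader's claim expires and the race starts again.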
What is Kubernetes?

• A set of open-source components running on one or more servers
• A container orchestration system
• An abstraction layer over your real or virtualized hardware
• An "infrastructure as code" system
• Automatic resource allocation
• The next step after Ansible/Chef/Puppet
What Kubernetes is not

• An operating system
• A magical way to make your infrastructure scalable
• An excuse to fire your devops (someone has to configure it)
• A good solution for running 2-3 servers
Terminology: traditional DC compared to Kubernetes

Traditional infrastructure → Kubernetes
• Physical server → Node
• Virtual machine → Pod
• Individual application → Container
• NAS/SAN → Persistent Volumes
• Load balancer → Service/Endpoint
• Application registry/hardware information → Labels
• Password files → Secrets
Declarative resource description (manifest)

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
Building a PostgreSQL cluster on Kubernetes

• A statefulset to bind pods with persistent volumes and provide auto-recovery
• A service to route client connections
• Spilo as a docker container (Patroni + PostgreSQL) for HA
• Secrets to store database user passwords
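A skeleton of the statefulset piece might look like this; the image name, labels and sizes are placeholders, not the ones Zalando uses:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: acid-minimal-cluster
spec:
  serviceName: acid-minimal-cluster
  replicas: 2
  selector:
    matchLabels:
      application: spilo
  template:
    metadata:
      labels:
        application: spilo
    spec:
      containers:
      - name: postgres
        image: spilo:latest        # placeholder for the Spilo image
        ports:
        - containerPort: 5432      # PostgreSQL
        - containerPort: 8008      # Patroni REST API
        volumeMounts:
        - name: pgdata
          mountPath: /home/postgres/pgdata
  volumeClaimTemplates:            # one persistent volume per pod
  - metadata:
      name: pgdata
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

If a pod dies, the statefulset recreates it with the same name and re-attaches the same persistent volume, which is what makes auto-recovery work.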
Manual deployment of an HA PostgreSQL cluster on Kubernetes

• At least four long YAML manifests to write
• Different parts of the PostgreSQL configuration spread over multiple manifests
• No easy way to work with a cluster as a whole (update, delete)
• Manual generation of DB objects, i.e. users, and their passwords
Initial approach to automation: Helm

• A template for your manifests
• Only one place to fill in deployment-related values
• Requires running a special pod (tiller) in your Kubernetes cluster
Kubernetes operator pattern

• Implement a controller application to act on custom resources
• CRDs (custom resource definitions) to describe a domain-specific object (i.e. a Postgres cluster)
• Encapsulates the knowledge of a human operating the service
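As a sketch, the CRD such an operator registers could be declared as below; the group and names follow the acid.zalan.do naming the Zalando operator uses, but treat the exact fields as illustrative:

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  # name must be <plural>.<group>
  name: postgresqls.acid.zalan.do
spec:
  group: acid.zalan.do
  version: v1
  scope: Namespaced        # clusters live inside a namespace
  names:
    kind: postgresql
    singular: postgresql
    plural: postgresqls
    shortNames:
    - pg
```

Once this is registered, `kubectl get postgresqls` (or `kubectl get pg`) works like any built-in resource, and the controller watches those objects for changes.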
Zalando Postgres operator

• Defines a custom Postgres resource
• Watches instances of Postgres, creates/updates/deletes corresponding Kubernetes objects
• Allows updating resources of a running cluster (memory, CPU, volumes) and postgres configuration
• Creates databases and users, and automatically generates passwords
• Auto-repair and smart rolling updates (switchover to replicas before updating the master)
Simple Postgres manifest

apiVersion: ""
kind: postgresql
metadata:
  name: acid-minimal-cluster
spec:
  teamId: "ACID"
  volume:
    size: 1Gi
  numberOfInstances: 2
  users:
    # database owner
    zalando:
    - superuser
    - createdb
    # role for application foo
    foo_user:
  # databases: name -> owner
  databases:
    foo: zalando
  postgresql:
    version: "10"
Just a piece of cake

• Operator starts pods with the Spilo docker image
• Operator provides environment variables to Spilo
• Operator makes sure all Kubernetes objects are in sync
• Spilo generates the Patroni configuration
• Patroni creates roles and applies the PostgreSQL configuration
• Patroni makes sure there is only one master
• Patroni uses Kubernetes for cluster state and the leader lock
• Patroni changes service endpoints on failover
[Diagram: a DB deployer submits a cluster manifest; the operator pod (configured via the operator config map and infrastructure roles) watches it and creates a stateful set of Spilo pods running Patroni, the cluster secrets, and the endpoint/service through which client applications connect]
Patroni with Etcd in Kubernetes

• External dependency: Etcd
• Etcd should always be available
• Rock-solid Etcd vs a Kubernetes cluster with frequent upgrades
• There is already an Etcd deployed for Kubernetes

$ etcdctl ls --recursive --sort -p /service/batman
/service/batman/config
/service/batman/history
/service/batman/initialize
/service/batman/leader
/service/batman/members/
/service/batman/members/postgresql0
/service/batman/members/postgresql1
/service/batman/optime/
/service/batman/optime/leader

$ etcdctl get /service/batman/leader
postgresql1

$ etcdctl get /service/batman/members/postgresql1
{"conn_url":"postgres://","api_url":"","state":"running","role":"master","xlog_location":50476648,"timeline":2}

$ etcdctl get /service/batman/history
[[1,50331744,"no recovery target specified","2018-01-18T16:04:46+01:00"]]
Kubernetes-native Patroni

• Use Kubernetes as a DCS
• Patroni keys as Kubernetes metadata (pods, configmaps or endpoints)
• ResourceVersions for compare-and-set
• "Soft" TTL
Patroni metadata in Kubernetes objects

$ kubectl get pods acid-minimal-cluster-0 -o jsonpath='{.metadata.annotations}'
map[status:{"conn_url":"postgres://","api_url":"","state":"running","role":"replica","xlog_location":26809088,"timeline":1}]

$ kubectl get endpoints acid-minimal-cluster -o jsonpath='{.metadata.annotations}'
map[acquireTime:2018-06-14T10:52:14.617442+00:00 leader:acid-minimal-cluster-1 optime:26809520 renewTime:2018-06-14T10:52:45.735291+00:00 transitions:2 ttl:30]

$ kubectl get endpoints acid-minimal-cluster-config -o jsonpath='{.metadata.annotations.config}'
{"loop_wait":10,"maximum_lag_on_failover":33554432,"postgresql":{"parameters":{"archive_mode":"on","archive_timeout":"1800s","hot_standby":"on","max_replication_slots":5,"max_wal_senders":5,"wal_level":"hot_standby","wal_log_hints":"on"},"use_pg_rewind":true,"use_slots":true},"retry_timeout":10,"ttl":30}

$ kubectl get endpoints acid-minimal-cluster-config -o jsonpath='{.metadata.annotations.initialize}'
6566889706167685175
Should you run your PostgreSQL clusters on Kubernetes?

Strong interest in the community:
• Zalando Postgres Operator
• CrunchyData Postgres Operator
• Red Hat Project Atomic
• KubeDB
• Project Habitat
Operational challenges

• Kubernetes cluster autoscaler vs multiple AWS AZs and persistent volumes
• Frequent Kubernetes cluster upgrades
• Switching between multiple independent Kubernetes clusters
• Clients outside of the DB Kubernetes cluster
• Too many databases in one cluster: AWS request limit exceeded
• Resource management (OOM and noisy neighbour issues)
• Docker PID reuse
Dealing with Kubernetes updates

• Detect the to-be-decommissioned node by the lack of the ready label and the SchedulingDisabled status
• Move all master pods:
  ◦ move replicas to an already updated node
  ◦ fail over to those replicas
• Pod Disruption Budget to prevent killing nodes with at least one primary
• Anti-affinity to prevent scheduling pods on "not-ready" nodes
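The Pod Disruption Budget mentioned above can be expressed as a manifest along these lines; the spilo-role label is assumed here as the marker of the primary pod:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
spec:
  # never allow a voluntary eviction to take the last matching pod
  minAvailable: 1
  selector:
    matchLabels:
      spilo-role: master   # assumed label marking the primary pod
```

With this in place, a node drain blocks on the primary pod, giving the operator time to switch over to a replica before the node goes away.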
Why not AWS RDS or Aurora PostgreSQL?

Not an easy answer :) Full control:
• Independent of the cloud provider
• Real super user available
• Custom extensions, PAM
• Streaming/WAL replication in and out
• Local storage (NVMe SSDs) not supported on RDS
Costs? Cost of development? ...
PostgreSQL monitoring with bg_mon

• PostgreSQL allows custom child processes called background workers
• Background workers can attach to the database
• Background workers can do everything else
• Let's make (an open-source) one that emits top-line DB statistics via a REST API
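Loading such a worker is a matter of preloading its shared library and restarting Postgres. A sketch of the relevant postgresql.conf lines follows; the bg_mon settings shown are assumptions about its defaults, not taken from the talk:

```ini
# postgresql.conf
shared_preload_libraries = 'bg_mon'   # register the background worker at startup

# assumed bg_mon settings; adjust address/port to taste
bg_mon.listen_address = '0.0.0.0'
bg_mon.port = 8080
```

After a restart the worker serves its statistics over HTTP on the configured port.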
OAUTH2 PAM authentication

• PAM module written in C
• Open-source:
• Equivalent of arbitrarily long, automatically generated, auto-expiring passwords
• Can supply arbitrary key=value pairs to check in the OAuth response (i.e. realm=/employees)
OAUTH2 PAM authentication

Operator configuration:
  pam_configuration: uid realm=/employees
  pam_role_name: users

The operator sets the PAM_OAUTH2 Spilo environment variable and adds a line to pg_hba.conf:
  hostssl all +users all pam

Spilo writes /etc/pam.d/postgresql using the PAM_OAUTH2 value.
Made possible by great people inside and outside of Zalando

• Patroni and Spilo: Alexander Kukushkin, Ants Aasma, Feike Steenbergen, Josh Berkus
• Postgres Operator: Murat Kabilov, Sergey Dudoladov, Manuel Gómez
• PAM Oauth2: Alexander Kukushkin
• Putting it all together in a sane way: Jan Mußler