Slide 1

Slide 1 text

PostgreSQL as a Distributed Computing Platform Marco Slot

Slide 2

Slide 2 text

Small data architecture

Slide 3

Slide 3 text

Big data architecture

Slide 4

Slide 4 text

Records Data warehouse Real-time analytics Big data architecture using postgres Messaging

Slide 5

Slide 5 text

PostgreSQL is a perfect building block for distributed systems

Slide 6

Slide 6 text

Features! PostgreSQL contains many useful features for building a distributed system: ● Well-defined protocol, libpq ● Crash safety ● Concurrent execution ● Transactions ● Access controls ● 2PC ● Replication ● Custom functions ● …

Slide 7

Slide 7 text

Extensions! Built-in / contrib: ● postgres_fdw ● dblink RPC! ● plpgsql Third-party open source: ● pglogical ● pg_cron ● citus

Slide 8

Slide 8 text

Extensions! Built-in / contrib: ● postgres_fdw ● dblink RPC! ● plpgsql Third-party open source: ● pglogical ● pg_cron ● pg_paxos ● citus Yours!

Slide 9

Slide 9 text

dblink Run queries on remote postgres server SELECT dblink_connect(node_id, format('host=%s port=%s dbname=postgres', node_name, node_port)) FROM nodes; SELECT dblink_send_query(node_id, $$SELECT pg_database_size('postgres')$$) FROM nodes; SELECT sum(size::bigint) FROM nodes, dblink_get_result(nodes.node_id) AS r(size text); SELECT dblink_disconnect(node_id) FROM nodes;

Slide 10

Slide 10 text

RPC using dblink For every postgres function, we can create a client-side stub using dblink. CREATE FUNCTION func(input text) ... CREATE FUNCTION remote_func(host text, port int, input text) RETURNS text LANGUAGE sql AS $function$ SELECT res FROM dblink( format('host=%s port=%s', host, port), format('SELECT * FROM func(%L)', input)) AS res(output text); $function$;

Slide 11

Slide 11 text

PL/pgSQL Procedural language for Postgres: CREATE FUNCTION distributed_database_size(dbname text) RETURNS bigint LANGUAGE plpgsql AS $function$ DECLARE total_size bigint; BEGIN PERFORM dblink_send_query(node_id, format('SELECT pg_database_size(%L)', dbname) FROM nodes; SELECT sum(size::bigint) INTO total_size FROM nodes, dblink_get_result(nodes.node_id) AS r(size text); RETURN total_size END; $function$;

Slide 12

Slide 12 text

Distributed system in progress... With these extensions, we can already create a simple distributed computing system. Nodes Nodes Nodes Nodes Parallel operation using dblink SELECT transform_data() Data 1 Data 2 Data 3 postgres_fdw?

Slide 13

Slide 13 text

pglogical / logical replication Asynchronously replicate changes to another database. Nodes Nodes Nodes Nodes

Slide 14

Slide 14 text

pg_paxos Consistently replicate changes between databases. Nodes Nodes Nodes

Slide 15

Slide 15 text

pg_cron Cron-based job scheduler for postgres: CREATE EXTENSION pg_cron; SELECT cron.schedule('* * * * */10', 'SELECT transform_data()'); Internally uses libpq, meaning it can also schedule jobs on other nodes. pg_cron provides a way for nodes to act autonomously

Slide 16

Slide 16 text

Citus Transparently shards tables across multiple nodes Coordinator E1 E4 E2 E5 E2 E5 Events create_distributed_table('events', 'event_id');

Slide 17

Slide 17 text

Citus MX Nodes can have the distributed tables too Coordinator E1 E4 E2 E5 E2 E5 Events Events Events Events

Slide 18

Slide 18 text

How to build a distributed system using only PostgreSQL & extensions?

Slide 19

Slide 19 text

Building a streaming publish-subscribe system Producers Postgres nodes Consumers topic: adclick

Slide 20

Slide 20 text

Storage nodes E1 E4 E2 E5 E2 E5 Events Events Events Coordinator Events CREATE TABLE Use Citus to create a distributed table

Slide 21

Slide 21 text

Distributed Table Creation $ psql -h coordinator CREATE TABLE events ( event_id bigserial, ingest_time timestamptz default now(), topic_name text not null, payload jsonb ); SELECT create_distributed_table('events', 'event_id'); $ psql -h any-node INSERT INTO events (topic_name, payload) VALUES ('adclick','{...}');

Slide 22

Slide 22 text

Sharding strategy Shard is chosen by hashing the value in the partition column. Application-defined: ● stream_id text not null Optimise data distribution: ● event_id bigserial Optimise ingest capacity and availability: ● sid int default pick_local_value()

Slide 23

Slide 23 text

Producers connect to a random node and perform COPY or INSERT into events Producers E1 E4 E2 E5 E2 E5 Events Events Events COPY / INSERT

Slide 24

Slide 24 text

Consumers in a group together consume events at least / exactly once. Consumers E1 E4 E2 E5 E2 E5 topic: adclick% Consumer group

Slide 25

Slide 25 text

Consumers obtain leases for consuming a shard. Lease are kept in a separate table on each node: CREATE TABLE leases ( consumer_group text not null, shard_id bigint not null, owner text, new_owner text, last_heartbeat timestamptz, PRIMARY KEY (consumer_group, shard_id) ); Consumer leases

Slide 26

Slide 26 text

Consumers obtain leases for consuming a shard. SELECT * FROM claim_lease('click-analytics', 'node-2', 102008); Under the covers: Insert a new lease or set new_owner to steal lease. CREATE FUNCTION claim_lease(group_name text, node_name text, shard_id int) … INSERT INTO leases (consumer_group, shard_id, owner, last_heartbeat) VALUES (group_name, shard, node_name, now()) ON CONFLICT (consumer_group, shard_id) DO UPDATE SET new_owner = node_name WHERE leases.new_owner IS NULL; Consumer leases

Slide 27

Slide 27 text

Distributing leases across consumers Distributed algorithm for distributing leases across nodes SELECT * FROM obtain_leases('click-analytics', 'node-2') -- gets all available lease tables -- claim all unclaimed shards -- claim random shards until #claims >= #shards/#consumers Not perfect, but ensures all shards are consumed with load balancing (unless C>S)

Slide 28

Slide 28 text

Consumers E1 E4 E2 E5 E2 E5 leases First consumer consumes all obtain_leases leases leases

Slide 29

Slide 29 text

Consumers E1 E4 E2 E5 E2 E5 First consumer consumes all leases leases leases

Slide 30

Slide 30 text

Consumers E1 E4 E2 E5 E2 E5 Second consumer steals leases from first consumer obtain_leases leases leases leases

Slide 31

Slide 31 text

Consumers E1 E4 E2 E5 E2 E5 Second consumer steals leases from first consumer

Slide 32

Slide 32 text

Consuming events Consumer wants to receive all events once. Several options: ● SQL level ● Logical decoding utility functions ● Use a replication connection ● PG10 logical replication / pglogical

Slide 33

Slide 33 text

Consuming events Get a batch of events from a shard: SELECT * FROM poll_events('click-analytics', 'node-2', 102008, 'adclick', ''); -- Check if node has the lease Set owner = new_owner if new_owner is set -- Get all pending events (pg_logical_slot_peek_changes) -- Progress the replication slot (pg_logical_slot_get_changes) -- Return remaining events if still owner

Slide 34

Slide 34 text

Consumer loop E1 E4 E2 E5 E2 E5 1. Call poll_events for each leased shard 2. Process events from each batch 3. Repeat with event IDs of last event in each batch poll_events

Slide 35

Slide 35 text

Failure handling Producer / consumer fails to connect to storage node: → Connect to different node Storage node fails: → Use pick_local_value() for partition column, failover to hot standby Consumer fails to consume batch → Events are repeated until confirmed Consumer fails and does not come back → Consumers periodically call obtain_leases → Old leases expire

Slide 36

Slide 36 text

Use pg_cron to periodically expire leases on coordinator: SELECT cron.schedule('* * * * *', 'SELECT expire_leases()'); CREATE FUNCTION expire_leases() ... UPDATE leases SET owner = new_owner, last_heartbeat = now() WHERE last_heartbeat < now() - interval '2 minutes' AND new_owner IS NOT NULL Maintenance: Lease expiration

Slide 37

Slide 37 text

Use pg_cron to periodically expire leases on coordinator: $ psql -h coordinator SELECT cron.schedule('* * * * *', 'SELECT expire_events()'); CREATE FUNCTION expire_events() ... PERFORM master_modify_multiple_shards($$ DELETE FROM events WHERE ingest_time < now() - interval '1 day' $$); Maintenance: Delete old events

Slide 38

Slide 38 text

Prototyped a functional, highly available publish-subscribe systems in https://goo.gl/R1suAo ~300 lines of code

Slide 39

Slide 39 text

Demo

Slide 40

Slide 40 text

Records Data warehouse Real-time analytics Big data architecture using postgres Messaging

Slide 41

Slide 41 text

Questions? [email protected]