PostgreSQL as a Distributed Computing Platform

Marco Slot
September 08, 2017

One of the unique things about postgres is that extensions can add new functionality that falls outside the scope of a SQL database. Several postgres extensions add the ability to perform commands or query data on other servers. These extensions can be combined in interesting ways to form advanced distributed systems on top of postgres.

In this talk, I explore how extensions such as dblink, postgres_fdw, pglogical, pg_cron, and Citus together with PL/pgSQL can be used as building blocks for distributed systems. I also give a demonstration of using PostgreSQL as a distributed computing platform, including a Kafka-like distributed queue built on top of postgres.

Transcript

  1. PostgreSQL as a
    Distributed Computing Platform
    Marco Slot

  2. Small data architecture

  3. Big data architecture

  4. Big data architecture using postgres
    [diagram: records flowing into messaging, a data warehouse, and
    real-time analytics, all built on postgres]

  5. PostgreSQL is a perfect building block
    for distributed systems

  6. Features!
    PostgreSQL contains many useful features for building a distributed system:
    ● Well-defined protocol, libpq
    ● Crash safety
    ● Concurrent execution
    ● Transactions
    ● Access controls
    ● 2PC
    ● Replication
    ● Custom functions
    ● …

  7. Extensions!
    Built-in / contrib:
    ● postgres_fdw
    ● dblink RPC!
    ● plpgsql
    Third-party open source:
    ● pglogical
    ● pg_cron
    ● citus

  8. Extensions!
    Built-in / contrib:
    ● postgres_fdw
    ● dblink RPC!
    ● plpgsql
    Third-party open source:
    ● pglogical
    ● pg_cron
    ● pg_paxos
    ● citus
    Yours!

  9. dblink
    Run queries on remote postgres server
    SELECT dblink_connect(node_id,
    format('host=%s port=%s dbname=postgres', node_name, node_port))
    FROM nodes;
    SELECT dblink_send_query(node_id, $$SELECT pg_database_size('postgres')$$)
    FROM nodes;
    SELECT sum(size::bigint)
    FROM nodes, dblink_get_result(nodes.node_id) AS r(size text);
    SELECT dblink_disconnect(node_id)
    FROM nodes;

  10. RPC using dblink
    For every postgres function, we can create a client-side stub using dblink.
    CREATE FUNCTION func(input text)
    ...
    CREATE FUNCTION remote_func(host text, port int, input text) RETURNS text
    LANGUAGE sql AS $function$
    SELECT output FROM dblink(
    format('host=%s port=%s', host, port),
    format('SELECT * FROM func(%L)', input))
    AS res(output text);
    $function$;
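    Calling the stub then looks like an ordinary local function call. A usage
    sketch (the host name is a placeholder, not from the deck):

```sql
-- Hypothetical usage of the remote_func stub defined above;
-- 'node-1' and 5432 stand in for a real worker's address.
SELECT remote_func('node-1', 5432, 'some input');
```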

  11. PL/pgSQL
    Procedural language for Postgres:
    CREATE FUNCTION distributed_database_size(dbname text)
    RETURNS bigint LANGUAGE plpgsql AS $function$
    DECLARE
    total_size bigint;
    BEGIN
    PERFORM dblink_send_query(node_id, format('SELECT pg_database_size(%L)', dbname))
    FROM nodes;
    SELECT sum(size::bigint) INTO total_size
    FROM nodes, dblink_get_result(nodes.node_id) AS r(size text);
    RETURN total_size;
    END;
    $function$;

  12. Distributed system in progress...
    With these extensions, we can already create a simple distributed computing
    system.
    [diagram: a coordinator runs SELECT transform_data() in parallel across
    worker nodes via dblink; each node holds a slice of the data (Data 1,
    Data 2, Data 3); postgres_fdw as a possible alternative]

  13. pglogical / logical replication
    Asynchronously replicate changes to another database.
    [diagram: changes flowing asynchronously from one node to other nodes]
  14. pg_paxos
    Consistently replicate changes between databases.
    [diagram: nodes agreeing on changes via consensus]
  15. pg_cron
    Cron-based job scheduler for postgres:
    CREATE EXTENSION pg_cron;
    SELECT cron.schedule('*/10 * * * *', 'SELECT transform_data()'); -- every 10 minutes
    Internally uses libpq, meaning it can also schedule jobs on other nodes.
    pg_cron provides a way for nodes to act autonomously.

  16. Citus
    Transparently shards tables across multiple nodes:
    SELECT create_distributed_table('events', 'event_id');
    [diagram: coordinator with the events table split into shards
    (E1, E2, E4, E5) placed across worker nodes]

  17. Citus MX
    Nodes can have the distributed tables too
    [diagram: the coordinator and every worker node each expose the events
    table, with shards E1–E5 spread across the workers]

  18. How to build a distributed system
    using only PostgreSQL & extensions?

  19. Building a streaming publish-subscribe system
    [diagram: producers publish events for a topic (e.g. adclick) to
    postgres nodes; consumers subscribe to the topic]

  20. Storage nodes
    Use Citus to create a distributed table
    [diagram: CREATE TABLE on the coordinator distributes the events table
    as shards (E1–E5) across the storage nodes]

  21. Distributed Table Creation
    $ psql -h coordinator
    CREATE TABLE events (
    event_id bigserial,
    ingest_time timestamptz default now(),
    topic_name text not null,
    payload jsonb
    );
    SELECT create_distributed_table('events', 'event_id');
    $ psql -h any-node
    INSERT INTO events (topic_name, payload) VALUES ('adclick','{...}');

  22. Sharding strategy
    Shard is chosen by hashing the value in the partition column.
    Application-defined:
    ● stream_id text not null
    Optimise data distribution:
    ● event_id bigserial
    Optimise ingest capacity and availability:
    ● sid int default pick_local_value()
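    pick_local_value() is not defined in the deck. One possible sketch, under the
    assumption that each node precomputes partition-column values whose hash maps
    to shards placed on that node (table and function names are hypothetical):

```sql
-- Hypothetical sketch of pick_local_value(): each node keeps a small local
-- table of integers known to hash to locally placed shards, so an INSERT
-- through any node lands on a shard stored on that same node.
CREATE TABLE local_shard_values (value int);
-- (populated once per node, e.g. by probing the hash of candidate values)

CREATE FUNCTION pick_local_value() RETURNS int
LANGUAGE sql AS $function$
  SELECT value FROM local_shard_values ORDER BY random() LIMIT 1;
$function$;
```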

  23. Producers connect to a random node and perform COPY or INSERT into events
    [diagram: producers issuing COPY / INSERT against any storage node]

  24. Consumers in a group together consume events at least / exactly once.
    [diagram: a consumer group subscribed to topic adclick% reading shards
    from the storage nodes]

  25. Consumers obtain leases for consuming a shard.
    Leases are kept in a separate table on each node:
    CREATE TABLE leases (
    consumer_group text not null,
    shard_id bigint not null,
    owner text,
    new_owner text,
    last_heartbeat timestamptz,
    PRIMARY KEY (consumer_group, shard_id)
    );
    Consumer leases

  26. Consumers obtain leases for consuming a shard.
    SELECT * FROM claim_lease('click-analytics', 'node-2', 102008);
    Under the covers: insert a new lease, or set new_owner to steal the lease.
    CREATE FUNCTION claim_lease(group_name text, node_name text, shard int)
    RETURNS void LANGUAGE sql AS $function$
    INSERT INTO leases (consumer_group, shard_id, owner, last_heartbeat)
    VALUES (group_name, shard, node_name, now())
    ON CONFLICT (consumer_group, shard_id) DO UPDATE
    SET new_owner = node_name
    WHERE leases.new_owner IS NULL;
    $function$;
    Consumer leases

  27. Distributing leases across consumers
    Distributed algorithm for spreading leases across consumers:
    SELECT * FROM obtain_leases('click-analytics', 'node-2');
    -- get all available lease tables
    -- claim all unclaimed shards
    -- claim random shards until #claims >= #shards/#consumers
    Not perfect, but ensures every shard is consumed, with load balancing
    (unless there are more consumers than shards).
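    The steps above could be sketched in PL/pgSQL along these lines. This is a
    sketch under assumptions, not the deck's actual code: it presumes a
    shard_ids() helper (RETURNS SETOF int) listing the shards of the events
    table, plus the claim_lease function from the previous slide.

```sql
-- Hypothetical sketch of obtain_leases, assuming a shard_ids() helper and
-- the claim_lease function shown earlier.
CREATE FUNCTION obtain_leases(group_name text, node_name text)
RETURNS void LANGUAGE plpgsql AS $function$
DECLARE
  fair_share int;
BEGIN
  -- claim every shard that has no owner yet
  PERFORM claim_lease(group_name, node_name, s)
  FROM shard_ids() s
  LEFT JOIN leases l ON (l.consumer_group = group_name AND l.shard_id = s)
  WHERE l.owner IS NULL;

  -- each consumer should hold roughly #shards / #consumers leases
  SELECT ceil(count(*)::numeric / greatest(count(DISTINCT owner), 1))::int
  INTO fair_share
  FROM leases WHERE consumer_group = group_name;

  -- steal random shards until this node's claims reach its fair share
  PERFORM claim_lease(group_name, node_name, shard_id::int)
  FROM (
    SELECT shard_id FROM leases
    WHERE consumer_group = group_name
      AND owner IS DISTINCT FROM node_name
      AND new_owner IS NULL
    ORDER BY random()
    LIMIT greatest(fair_share - (SELECT count(*) FROM leases
                                 WHERE consumer_group = group_name
                                   AND owner = node_name), 0)
  ) stealable;
END;
$function$;
```

    Stealing only marks new_owner, so the current owner keeps consuming until it
    next observes the pending handover.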

  28. First consumer consumes all
    [diagram: a single consumer calls obtain_leases and claims the lease
    for every shard]

  29. First consumer consumes all
    [diagram: the single consumer now holds leases on every shard]

  30. Second consumer steals leases from first consumer
    [diagram: a second consumer calls obtain_leases and marks itself as
    new_owner on part of the shards]

  31. Second consumer steals leases from first consumer
    [diagram: the leases are now balanced across both consumers]

  32. Consuming events
    Consumer wants to receive all events once.
    Several options:
    ● SQL level
    ● Logical decoding utility functions
    ● Use a replication connection
    ● PG10 logical replication / pglogical
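    The SQL-level option is the simplest of these: poll the table directly and
    remember the last event ID processed. A minimal sketch (the topic and the
    resume position are placeholders):

```sql
-- Hypothetical SQL-level consumption: fetch the next batch for a topic,
-- resuming after the last event ID this consumer has processed.
SELECT event_id, payload
FROM events
WHERE topic_name = 'adclick'
  AND event_id > 1234  -- last event ID seen by this consumer
ORDER BY event_id
LIMIT 1000;
```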

  33. Consuming events
    Get a batch of events from a shard:
    SELECT * FROM poll_events('click-analytics', 'node-2', 102008, 'adclick', '');
    -- Check if node has the lease
    -- Set owner = new_owner if new_owner is set
    -- Get all pending events (pg_logical_slot_peek_changes)
    -- Progress the replication slot (pg_logical_slot_get_changes)
    -- Return remaining events if still owner
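    The logical-decoding steps can be exercised directly with the built-in slot
    functions; a fragment, assuming a per-shard slot named events_102008 (the
    slot name is an assumption, not from the deck):

```sql
-- Peek at up to 1000 pending changes without consuming them;
-- pg_logical_slot_get_changes would advance the slot instead.
SELECT data
FROM pg_logical_slot_peek_changes('events_102008', NULL, 1000);
```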

  34. Consumer loop
    1. Call poll_events for each leased shard
    2. Process events from each batch
    3. Repeat with the event ID of the last event in each batch
    [diagram: consumer repeatedly calling poll_events against the storage nodes]

  35. Failure handling
    Producer / consumer fails to connect to storage node:
    → Connect to different node
    Storage node fails:
    → Use pick_local_value() for partition column, failover to hot standby
    Consumer fails to consume a batch:
    → Events are repeated until confirmed
    Consumer fails and does not come back:
    → Consumers periodically call obtain_leases
    → Old leases expire

  36. Use pg_cron to periodically expire leases on coordinator:
    SELECT cron.schedule('* * * * *', 'SELECT expire_leases()');
    CREATE FUNCTION expire_leases()
    RETURNS void LANGUAGE sql AS $function$
    UPDATE leases
    SET owner = new_owner, last_heartbeat = now()
    WHERE last_heartbeat < now() - interval '2 minutes'
    AND new_owner IS NOT NULL;
    $function$;
    Maintenance: Lease expiration

  37. Use pg_cron to periodically delete old events on the coordinator:
    $ psql -h coordinator
    SELECT cron.schedule('* * * * *', 'SELECT expire_events()');
    CREATE FUNCTION expire_events()
    RETURNS void LANGUAGE plpgsql AS $function$
    BEGIN
    PERFORM master_modify_multiple_shards($$
    DELETE FROM events
    WHERE ingest_time < now() - interval '1 day'
    $$);
    END;
    $function$;
    Maintenance: Delete old events

  38. Prototyped a functional, highly available publish-subscribe system in
    ~300 lines of code:
    https://goo.gl/R1suAo

  39. Big data architecture using postgres
    [diagram: records flowing into messaging, a data warehouse, and
    real-time analytics, all built on postgres]
