Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Challenges of Distributing Postgres:
 A Citus Story (QConSF)

Ozgun Erdogan
November 20, 2017

Challenges of Distributing Postgres:
 A Citus Story (QConSF)

Set theory forms the basis for relational algebra and relational databases, and SQL is the lingua franca of modern RDBMS’s. Even with all the attention given to NoSQL in recent years, the lion share of database usage remains relational. But until recently, nearly all relational database solutions have been limited to the resources of a single node. Not anymore.

This talk is about my team’s journey tackling the challenges of distributing SQL. Specifically in the context of my favorite (open source) database: Postgres. I believe that too many developers spend too much time worrying about scaling their databases. So at Citus Data, we created an extension to Postgres that enables developers to scale out compute, memory, and storage by distributing queries across a cluster of nodes.

This talk describes the distributed systems challenges we faced at Citus in scaling out Postgres—and how we addressed them. I’ll talk about how we use PostgreSQL’s extension APIs to parallelize queries in a distributed cluster. I’ll cover the architecture of a distributed query planner and specifically how the join order planner has to choose between broadcast, co-located, and repartition joins in order to minimize network I/O. And if there’s time, I’ll walk through the dynamic executor logic that we built. The end result: a distributed database and a lot less time spent worrying about scale.

Ozgun Erdogan

November 20, 2017
Tweet

More Decks by Ozgun Erdogan

Other Decks in Technology

Transcript

  1. Challenges of Distributing
    Postgres:

    A Citus Story
    Ozgun Erdogan
    QCon San Francisco | November 2017

    View full-size slide

  2. Developers Love Postgres
    PostgreSQL
    MySQL
    MongoDB
    SQL Server +
    Oracle
    0%
    18%
    35%
    53%
    70%
    Mar-11 Sept-11 Mar-12 Sept-12 Mar-13 Sept-13 Mar-14 Sept-14 Mar-15 Sept-15 Mar-16
    RDBMS: PostgreSQL, MySQL, Microsoft SQL Server, Oracle
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  3. I love Postgres, too
    3 Ozgun Erdogan | QCon San Francisco 2017
    Ozgun Erdogan
    CTO of Citus Data
    Distributed Systems
    Distributed Databases
    Formerly of Amazon
    Love drinking margaritas

    View full-size slide

  4. Our mission at Citus Data
    5 Ozgun Erdogan | QCon San Francisco 2017
    Make it so SaaS businesses

    never have to worry about
    scaling their database again

    View full-size slide

  5. What is the Citus database?
    1.Scales out PostgreSQL
    2.Extension to PostgreSQL
    3.Available in 3 Ways
    Ozgun Erdogan | QCon San Francisco 2017
    • Using sharding & replication
    • Query engine parallelizes SQL queries across many nodes
    • Using PostgreSQL extension APIs

    View full-size slide

  6. Citus, Packaged Three Ways
    Ozgun Erdogan | QCon San Francisco 2017
    Open

    Source
    Enterprise

    Software
    Fully-Managed

    Database as a Service
    github.com/citusdata/citus

    View full-size slide

  7. Simplified Citus Architecture

    View full-size slide

  8. 3 Challenges Distributing Postgres
    1. PostgreSQL and High Availability
    2. PostgreSQL is huge. How to keep up with it
    3. Distributed transactions
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  9. PostgreSQL &

    High Availability (HA)
    Designing for a Cloud-native world
    1

    View full-size slide

  10. Why is High Availability hard?
    PostgreSQL replication uses one primary &
    multiple secondary nodes. Two challenges:
    1. Most Postgres clients aren’t smart. When the
    primary fails, they retry the same IP.
    2. Postgres replicates entire state. This makes it
    resource intensive to reconstruct new nodes from a
    primary.
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  11. Database Failures Should Be Transparent
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  12. Database Failures Shouldn’t Be a Big Deal
    1. PostgreSQL streaming replication to replicate from
    primary to secondary. Back up to S3.
    2. Volume level replication to replicate to secondary’s
    volume. Back up to S3.
    3. Incremental backups to S3. Reconstruct secondary
    nodes from S3.
    Ozgun Erdogan | QCon San Francisco 2017
    3 Methods for HA & Backups in Postgres

    View full-size slide

  13. Postgres - Streaming Replication (1)
    Write-ahead logs
    (streaming repl.)
    Table foo
    Primary –
    PostgreSQL
    streaming repl.
    Table bar
    WAL logs
    Table foo
    Table bar
    WAL logs
    Secondary –
    PostgreSQL
    streaming repl.
    Monitoring Agents -
    streaming repl.
    setup & auto failover
    S3 / Blob Storage
    (Encrypted)
    Backup
    Process
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  14. Postgres – AWS RDS & Azure (2)
    Postgres
    Primary
    Monitoring Agents
    (Auto node failover)
    Persistent Volume
    Postgres
    Standby
    S3 / Blob Storage
    (Encrypted)
    Table foo
    Table bar
    WAL logs
    Table foo
    Table bar
    WAL logs
    Backup process
    Backup
    Process
    Persistent Volume
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  15. Postgres – Reconstruct from WAL (3)
    Postgres
    Primary
    Monitoring Agents
    (Auto node failover)
    Persistent Volume
    Postgres
    Secondary
    Backup
    Process
    S3 / Blob Storage
    (Encrypted)
    Table foo
    Table bar
    WAL logs
    Persistent Volume
    Table foo
    Table bar
    WAL logs
    Backup process
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  16. WHO DOES THIS? PRIMARY BENEFITS
    Streaming Replication
    (local / ephemeral disk)
    On-prem

    Manual EC2
    Simple to set up
    Direct I/O: High I/O & large storage
    Disk Mirroring
    RDS

    Azure Preview
    Works for MySQL and PostgreSQL

    Data durability in cloud environments
    Reconstruct from WAL
    Heroku

    Citus Cloud
    Enables Fork and PITR
    Node reconstruction in background
    (Data durability in cloud environments)
    How do these approaches compare?
    17 Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  17. Summary
    • In PostgreSQL, a database node’s state gets
    replicated in its entirety. The replication can be set up
    in three ways.
    • Reconstructing a secondary node from S3 makes
    bringing up or shooting down nodes easy.
    • When you shard your database, the state you need to
    replicate per node becomes smaller.
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  18. PostgreSQL is huge
    How do you scale all features?
    2

    View full-size slide

  19. 3 ways to build a distributed database
    1. Build a distributed database from scratch
    2. Middleware sharding (mimic the parser)
    3. Fork your favorite database (like PostgreSQL)
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  20. Example Transaction Block
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  21. Postgres Features, Tools & Frameworks
    • PostgreSQL manual (US Letter)
    • Clients for diff programming
    languages
    • ORMs, libraries, GUIs
    • Tools (dump, restore, analyze)
    • New features
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  22. At First, Forked PostgreSQL with Style
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  23. Two Stage Query Optimization
    1. Plan to minimize network I/O
    2. Nodes talk to each other using SQL over libpq
    3. Learned to cooperate with planner / executor bit by bit
    (Volcano style executor)
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  24. Citus Architecture (Simplified)
    25
    SELECT avg(revenue)
    FROM sales
    Coordinator
    SELECT sum(revenue), count(revenue)
    FROM table_1001
    SELECT sum … FROM table_1003
    Worker node 1
    Table metadata
    Table_1001
    Table_1003
    SELECT sum … FROM table_1002
    SELECT sum … FROM table_1004
    Worker node 2
    Table_1002
    Table_1004
    Worker node N
    .
    .
    .
    .
    .
    .
    Each node PostgreSQL with Citus installed
    1 shard = 1 PostgreSQL table
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  25. Unfork Citus using Extension APIs
    CREATE EXTENSION citus;
    • System catalogs – Distributed metadata
    • Planner hook – Insert, Update, Delete, Select
    • Executor hook – Insert, Update, Delete, Select
    • Utility hook – Alter Table, Create Index, Vacuum, etc.
    • Transaction & resources handling – file descriptors, etc.
    • Background worker process – Maintenance processes
    (distributed deadlock detection, task tracker, etc.)
    • Logical decoding – Online data migrations
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  26. PostgreSQL has transactions.
    How to handle distributed transactions
    3

    View full-size slide

  27. BEGIN
    INSERT
    UPDATE
    SELECT
    COMMIT
    ROLLBA
    CK

    View full-size slide

  28. Consistency in Distributed Databases
    1. 2PC: All participating nodes need to be up
    2. Paxos: Achieves consensus with quorum
    3. Raft: More understandable alternative to
    Paxos
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  29. Concurrency in Distributed Databases
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  30. What is a Lock?
    • Protects against concurrent modifications.
    • Locks are released at the end of a
    transaction.

    View full-size slide

  31. Transactions Block on 1st Conflicting Lock

    View full-size slide

  32. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  33. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  34. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  35. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  36. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  37. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  38. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  39. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  40. Transactions and Concurrency
    • Transactions that don’t modify the same
    row can run concurrently.

    View full-size slide

  41. Distributed transactions are a complex topic
    • Most articles on distributed transactions focus on data
    consistency.
    • Data consistency is only one side of the coin. If you’re
    using a relational database, your application benefits
    from another key feature: deadlock detection.
    • https://www.citusdata.com/blog/2017/08/31/
    databases-and-distributed-deadlocks-a-faq
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  42. So now what? We talked about 3
    challenges distributing Postgres
    1. PostgreSQL, Replication, High Availability
    2. Tradeoffs in building a distributed database—
    and how we chose PostgreSQL’s extension
    APIs
    3. Distributed deadlock detection & distributed
    transactions
    Ozgun Erdogan | QCon San Francisco 2017

    View full-size slide

  43. 45
    Solving a very complex problem

    View full-size slide

  44. 46
    “SQL is hard, not impossible, to scale”

    View full-size slide

  45. © 2017 Citus Data. All right reserved.
    [email protected]
    Questions?
    @citusdata
    Ozgun Erdogan
    www.citusdata.com

    View full-size slide