Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Parallelism in PostgreSQL 15

Thomas Munro
April 20, 2023

Parallelism in PostgreSQL 15

A short talk given at Citus Con 2023 about some concepts and problems in parallel query execution.

Video: https://youtu.be/3v-cthowG10

Thomas Munro

April 20, 2023


  1. Thomas Munro | Citus Con: An Event for Postgres 2023

    fosstodon.org/@tmunro @MengTangmu Parallelism in PostgreSQL 15
  2. $ whoami Thomas Munro • PostgreSQL user/hacker/committer • Work relevant

    to this talk: parallel hash joins and various parallel infrastructure • Other interests include: database/kernel interfaces (memory, I/O, threads, IPC), portability, modernisation • Based in New Zealand (highest number of PostgreSQL committers in the world… per capita)
  3. What do I mean by parallelism? For the purposes of

    this talk… • Parallelism within a CPU thread (pipelined execution) • Vectorisation (SIMD) • I/O parallelism • Storage/compute separation with fi lter pushdown (Aurora? Neon?) • Distribution across multiple networked computers (FDW, Citus, Greenplum, …) • Multiple CPU cores inside one computer executing one query:
 Single image OS, shared memory and storage
  4. Arrival of parallelism in relational databases 7.1 Oracle (Predessor) DB2

    4.1 POSTGRES, PostgreSQL 9.6 6.0 Informix Sybase 11.5 SQL Server 7.0 Ingres 2006 QUEL SQL + Parallelism + Parallelism '80s VMS, Sequent '90s (SMP) VMS, UN*X, NT '00s Multicore '10s Multicore ubiquitous '60s, '70s 
  5. Not always as easy as it sounds • Some problems

    are “embarrassingly” parallelisable • Painting a fence: just add more people • Adding up the sums of a large pile of receipts: subtotals, then total • Some problems are inherently trickier • Finding matching pairs of numbered socks from two large randomised piles: Sort into number order and merge? Partition into buckets by number ranges?
  6. Cost-based optimiser • What are all the ways we could

    access the base relations? • Which is the best join order? • Which is the best join strategy for each join? • … and more • Consider many paths, estimate the cost (execution time) for each and make the cheapest into the plan
  7. Parallelism = more paths • Parallel Sequential Scan, Parallel Index

    Scan -> scatter “partial” relation to processes on demand • Some executor plan nodes also have “Parallel” variants that communicate between processes • Gather -> collects tuples into one process • Partial/Finalize Aggregate -> perform two-phase aggregation, for example summing subtotals computed by worker processes
  8. Counting in parallel Seq Scan Parallel Seq Scan Partial Aggregate

    Partial Aggregate Partial Aggregate Gather Finalize Aggregate Parallel Seq Scan Parallel Seq Scan Scatter Gather cituscon=> explain (costs off) select count(*) from t; QUERY PLAN --------------------- Aggregate -> Seq Scan on t 
 cituscon=> explain (costs off) select count(*) from t; QUERY PLAN ------------------------------------------ Finalize Aggregate -> Gather Workers Planned: 2 -> Partial Aggregate -> Parallel Seq Scan on t
  9. parallel_setup_cost=1000 13316 └─ postgres -D /data/clusters/main 
 25002 ├─ postgres:

    parallel worker for PID 13337 
 25001 ├─ postgres: parallel worker for PID 13337 13441 ├─ postgres: fred salesdb [local] idle 13437 ├─ postgres: fred salesdb [local] idle 13337 ├─ postgres: fred salesdb [local] SELECT 13323 ├─ postgres: logical replication launcher 13321 ├─ postgres: autovacuum launcher 13320 ├─ postgres: walwriter 13319 ├─ postgres: background writer 13318 └─ postgres: checkpointer The cost to start up and shut down shared memory and worker processes. 5ms-50ms depending on OS. (Yeah, ideally we would use threads. One day!)
  10. parallel_tuple_cost=0.1 SELECT * FROM very_large_table WHERE …; We don’t want

    to use parallelism if a huge number of tuples will be transferred back into the leader process after a light amount of work in the worker processes.
  11. Rule-based parallel degree • Number of workers to consider is

    based on the “driving” table: • ALTER TABLE … SET (parallel_workers = N) • SET min_parallel_table_scan_size = ‘8MB’
 8MB → 1 worker
 24MB → 2 workers
 72MB → 3 workers
 x → log(x / min_parallel_table_scan_size) / log(3) + 1 workers • SET min_parallel_index_scan_size = ‘512kB’ • SET max_parallel_workers_per_gather = 2
  12. Nest Loop Join Index Scan Parallel Seq Scan Nest Loop

    Join Index Scan Nest Loop Join Index Scan “Parallel- oblivious” Parallel-aware, scans large table, determines parallel degree
  13. Gather { Scatter { Gather Parallel Seq Scan Nest Loop

    Join Index Scan Time Indexes are already ef fi ciently shared between backends
  14. Hash table Hash table Gather { } Duplicated e ff

    ort Scatter { Gather Parallel Seq Scan Hash Join Hash Private hash table Seq Scan Build Probe Time Probe Build Build Build Probe Probe
  15. Hash table Hash table Gather { Scatter { Gather Parallel

    Seq Scan Hash Join Hash Private hash table Parallel Seq Scan We cannot join arbitrarily chosen blocks from two relations. The results would be nonsense!
  16. Gather { Scatter { Gather }Scatter }Gather Parallel Seq Scan

    Parallel Hash Join Parallel Hash Parallel Seq Scan Shared hash table Build Probe Time Probe Build Build Build Probe Probe
  17. Hash join memory struggles • work_mem (4MB) × hash_mem_multiplier (2)

    × processes • If the data does not fi t, it will fi rst be broken up into separate partitions written out to disk fi rst, to be processed one at a time • Problem: when work_mem is small, the set of partitions grows very large, and the number of fi les and per-partition bu ff ers grows very large; perhaps by more than can be saved! This is an open problem.
  18. Some things that prevent 
 parallelism • FULL OUTER JOINs

    (expected in PostgreSQL 16) • Many FDWs don’t support parallelism • Cursors • max_rows (set by GUIs like DbVisualizer) • Queries that write or lock rows • Functions not marked PARALLEL SAFE
  19. Major future projects • Write queries (INSERT, UPDATE, DELETE) •

    Management of total resource usage across whole system (memory, CPU, I/O)? • Alternative strategies for hash join memory management • Threads rather than processes • Replacing the recursive volcano executor design • More general partition-based parallelism