Slide 1

Parallelism in PostgreSQL 15
Thomas Munro | Citus Con: An Event for Postgres 2023
fosstodon.org/@tmunro | @MengTangmu

Slide 2

$ whoami
Thomas Munro
• PostgreSQL user/hacker/committer
• Work relevant to this talk: parallel hash joins and various parallel infrastructure
• Other interests include: database/kernel interfaces (memory, I/O, threads, IPC), portability, modernisation
• Based in New Zealand (highest number of PostgreSQL committers in the world… per capita)

Slide 3

What do I mean by parallelism? For the purposes of this talk…
• Parallelism within a CPU thread (pipelined execution)
• Vectorisation (SIMD)
• I/O parallelism
• Storage/compute separation with filter pushdown (Aurora? Neon?)
• Distribution across multiple networked computers (FDW, Citus, Greenplum, …)
• Multiple CPU cores inside one computer executing one query: single-image OS, shared memory and storage

Slide 4

Arrival of parallelism in relational databases
[Timeline figure: mainframes in the '60s and '70s; '80s (VMS, Sequent); '90s SMP (VMS, UN*X, NT); '00s multicore; '10s multicore ubiquitous. QUEL and SQL systems gain parallelism along the way: Oracle 7.1, DB2 4.1, Informix 6.0, Sybase 11.5, SQL Server 7.0, Ingres 2006, and PostgreSQL 9.6 (POSTGRES being its predecessor).]

Slide 5

“The free lunch is over*”
*Herb Sutter, writing in Dr. Dobb's Journal, 2005
[Chart: https://github.com/karlrupp/microprocessor-trend-data (Creative Commons Attribution 4.0 International Public License)]

Slide 6

Not always as easy as it sounds
• Some problems are “embarrassingly” parallelisable
• Painting a fence: just add more people
• Adding up the sums of a large pile of receipts: subtotals, then total
• Some problems are inherently trickier
• Finding matching pairs of numbered socks from two large randomised piles: sort into number order and merge? Partition into buckets by number ranges?

Slide 7

How do we make plans anyway?

Slide 8

[Diagram: SELECT … → 🎛 Planner → Executor]

Slide 9

Cost-based optimiser
• What are all the ways we could access the base relations?
• Which is the best join order?
• Which is the best join strategy for each join?
• … and more
• Consider many paths, estimate the cost (execution time) for each and make the cheapest into the plan (illustration below)
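For instance, EXPLAIN shows the path the planner made into the plan, together with its cost estimates (table t and the numbers here are purely illustrative; real costs depend on your data and statistics):

cituscon=> explain select * from t where id = 42;
                          QUERY PLAN
---------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.43..8.45 rows=1 width=8)
   Index Cond: (id = 42)

Here an index scan path beat a sequential scan path, because fetching one row via the index was estimated to be far cheaper than reading the whole table.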

Slide 10

Parallelism = more paths
• Parallel Sequential Scan, Parallel Index Scan → scatter “partial” relation to processes on demand
• Some executor plan nodes also have “Parallel” variants that communicate between processes
• Gather → collects tuples into one process
• Partial/Finalize Aggregate → perform two-phase aggregation, for example summing subtotals computed by worker processes

Slide 11

Counting in parallel
[Diagram: a single-process Seq Scan + Aggregate versus scatter/gather: a Parallel Seq Scan feeds a Partial Aggregate in each process, and Gather collects the subtotals into a Finalize Aggregate]

cituscon=> explain (costs off) select count(*) from t;
      QUERY PLAN
---------------------
 Aggregate
   ->  Seq Scan on t

cituscon=> explain (costs off) select count(*) from t;
                QUERY PLAN
------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Seq Scan on t

Slide 12

How do we control parallelism?

Slide 13

parallel_setup_cost=1000
The cost to start up and shut down shared memory and worker processes. 5ms-50ms depending on OS. (Yeah, ideally we would use threads. One day!)

13316 └─ postgres -D /data/clusters/main
25002    ├─ postgres: parallel worker for PID 13337
25001    ├─ postgres: parallel worker for PID 13337
13441    ├─ postgres: fred salesdb [local] idle
13437    ├─ postgres: fred salesdb [local] idle
13337    ├─ postgres: fred salesdb [local] SELECT
13323    ├─ postgres: logical replication launcher
13321    ├─ postgres: autovacuum launcher
13320    ├─ postgres: walwriter
13319    ├─ postgres: background writer
13318    └─ postgres: checkpointer

Slide 14

parallel_tuple_cost=0.1
SELECT * FROM very_large_table WHERE …;
We don’t want to use parallelism if a huge number of tuples will be transferred back to the leader process after only a small amount of work in the worker processes.
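Rough arithmetic to see why (row counts hypothetical): a query returning 10 million rows through Gather adds 10,000,000 × 0.1 = 1,000,000 to the parallel path's estimated cost, which can easily outweigh what the workers save on the scan; a count(*) returning a single row pays almost nothing for tuple transfer.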

Slide 15

Rule-based parallel degree
• Number of workers to consider is based on the “driving” table:
• ALTER TABLE … SET (parallel_workers = N)
• SET min_parallel_table_scan_size = ‘8MB’
  8MB → 1 worker
  24MB → 2 workers
  72MB → 3 workers
  x → log(x / min_parallel_table_scan_size) / log(3) + 1 workers
• SET min_parallel_index_scan_size = ‘512kB’
• SET max_parallel_workers_per_gather = 2
(worked example below)
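Checking the formula against the numbers above, and overriding it (table name t is hypothetical):

-- 72MB / 8MB = 9, and log(9)/log(3) = 2, so 2 + 1 = 3 workers
-- Skip the size-based rule for one table:
ALTER TABLE t SET (parallel_workers = 8);
-- ... but the planner still respects the per-Gather cap:
SET max_parallel_workers_per_gather = 8;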

Slide 16

[Plan diagram: each process runs a Nested Loop Join whose inner Index Scan is “parallel-oblivious”; the outer Parallel Seq Scan is parallel-aware, scans the large table, and determines the parallel degree]

Slide 17

Selected plan types

Slide 18

[Diagram: Gather over Nested Loop Join with an Index Scan inner side and a Parallel Seq Scan outer side; timeline showing scatter and gather]
Indexes are already efficiently shared between backends
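As a sketch of what that plan looks like (tables t and s are hypothetical), the inner Index Scan runs unmodified in every process while the outer Parallel Seq Scan divides up the blocks:

cituscon=> explain (costs off) select * from s join t on t.id = s.t_id;
               QUERY PLAN
-----------------------------------------
 Gather
   Workers Planned: 2
   ->  Nested Loop
         ->  Parallel Seq Scan on s
         ->  Index Scan using t_pkey on t
               Index Cond: (id = s.t_id)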

Slide 19

[Diagram: Gather over Hash Join; each process builds its own private hash table from a full Seq Scan, then probes it with tuples from a Parallel Seq Scan. Timeline: every process repeats the build phase before probing, which is duplicated effort]

Slide 20

[Diagram: as before, but with a Parallel Seq Scan feeding each private hash table]
We cannot join arbitrarily chosen blocks from two relations. The results would be nonsense!

Slide 21

[Diagram: Gather over Parallel Hash Join; a Parallel Hash over a Parallel Seq Scan builds one shared hash table, which all processes build and then probe cooperatively. Timeline: build once in parallel, then probe in parallel]
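In EXPLAIN this shows up as Parallel Hash under the join (tables t and s are hypothetical; this is the plan shape PostgreSQL produces when both sides can be scanned in parallel):

cituscon=> explain (costs off) select count(*) from t join s using (id);
                   QUERY PLAN
-------------------------------------------------
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Hash Join
                     Hash Cond: (t.id = s.id)
                     ->  Parallel Seq Scan on t
                     ->  Parallel Hash
                           ->  Parallel Seq Scan on s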

Slide 22

What can go wrong?

Slide 23

Hash join memory struggles
• work_mem (4MB) × hash_mem_multiplier (2) × processes
• If the data does not fit, it is first broken up into separate partitions written out to disk, to be processed one at a time
• Problem: when work_mem is small, the set of partitions grows very large, and the number of files and per-partition buffers grows very large, perhaps costing more memory than partitioning can save! This is an open problem.
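Making the first bullet concrete (defaults shown above; worker count hypothetical): with two workers plus the leader, the budget for one shared Parallel Hash table is 4MB × 2 × 3 = 24MB; anything bigger spills into partitions on disk.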

Slide 24

Some things that prevent parallelism
• FULL OUTER JOINs (expected in PostgreSQL 16)
• Many FDWs don’t support parallelism
• Cursors
• max_rows (set by GUIs like DbVisualizer)
• Queries that write or lock rows
• Functions not marked PARALLEL SAFE (example below)
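On that last point, user-defined functions default to PARALLEL UNSAFE, so you have to declare safety yourself; a minimal sketch (the function is hypothetical):

CREATE FUNCTION discount(price numeric) RETURNS numeric
    LANGUAGE sql IMMUTABLE PARALLEL SAFE
    AS $$ SELECT price * 0.9 $$;

Only declare PARALLEL SAFE if the function really is safe to run in a worker process.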

Slide 25

Major future projects
• Write queries (INSERT, UPDATE, DELETE)
• Management of total resource usage across the whole system (memory, CPU, I/O)?
• Alternative strategies for hash join memory management
• Threads rather than processes
• Replacing the recursive volcano executor design
• More general partition-based parallelism

Slide 26

Thomas Munro
Citus Con: An Event for Postgres 2023
fosstodon.org/@tmunro | @MengTangmu
fin