pg_stats: How Postgres Internal Stats Work

pg_stats How Postgres Internal Stats Work June 2026 Richard Yen
POSETTE 2026

SELECT * FROM customers WHERE state = ' CA' ;

CREATE TABLE customers ( id bigserial PRIMARY KEY, city text NOT NULL, state text NOT NULL, signup_date date NOT NULL ); -- Insert 1,000,000 rows

QUERY PLAN ----------------------------------------------------------------- Seq Scan on customers (cost=0.00..19682.66 rows=173829 width=26) (actual time=0.025..120.574 rows=172001 loops=1) Filter: (state = 'CA'::text) Rows Removed by Filter: 827972 Buffers: shared hit=4601 read=2582 Planning: Buffers: shared hit=139 Planning Time: 0.371 ms Execution Time: 128.136 ms (8 rows)

Query Plans are made by the Query Planner

Your Data Summary Read by the Planner customers pg_statistic (pg_stats)

pg_stats is a VIEW of pg_statistic

pg_stats is a VIEW of pg_statistic Human-Readable NOT Human-Readable

Postgres Trivia! SELECT relname , relkind FROM pg_class WHERE relname
LIKE ' pg%' AND relkind in (' v','r ') ORDER BY relkind ;

ANALYZE customers; (Builds the summary in pg_statistic)

A Simple Example

CREATE TABLE customers ( id bigserial PRIMARY KEY, city text
NOT NULL, state text NOT NULL, signup_date date NOT NULL ); -- Insert 1,000,000 rows ANALYZE customers;

SELECT attname,n_distinct,null_frac,correlation FROM pg_stats WHERE tablename = 'customers' ORDER BY
attname ; attname | n_distinct | null_frac | correlation ------------- +------------ +----------- +-------------- city | 10106 | 0 | 0.0021338463 id | - 1 | 0 | 1 signup_date | 1822 | 0 | 1 state | 50 | 0 | 0.06440461 (4 rows)

SELECT attname,n_distinct,null_frac,correlation FROM pg_stats WHERE tablename = 'customers' ORDER BY
attname ; attname | n_distinct | null_frac | correlation ------------- +------------ +----------- +-------------- city | 10106 | 0 | 0.0021338463 id | - 1 | 0 | 1 signup_date | 1822 | 0 | 1 state | 50 | 0 | 0.06440461 (4 rows) Range -1 to +1 Affects index scan selection

SELECT unnest( most_common_vals ::text::text[]) AS state, unnest( most_common_freqs ) AS
frequency FROM pg_stats WHERE tablename = 'customers' AND attname = 'state' LIMIT 5; state | frequency ------- +------------- CA | 0.17403333 TX | 0.1165 NY | 0.08586667 FL | 0.0666 IL | 0.0474 (5 rows)

random_page_cost seq_page_cost cpu_tuple_cost Plans are generated based on cost calculation

Cost calculation affects: Scan Type • Sequential Scan • Index
Scan • Bitmap Heap Scan Join Type • Nested Loop • Hash Join • Merge Join

EXPLAIN ANALYZE SELECT * FROM customers WHERE state = 'CA';
-- Seq Scan on customers (rows ~180000) EXPLAIN ANALYZE SELECT * FROM customers WHERE state = 'WY'; -- Index Scan using customers_state_idx (rows=400)

-- Seq Scan on customers (rows ~180000) QUERY PLAN ------------------------------------------------------- Seq Scan on customers (cost=0.00..19682.66 rows=174029 ...) (actual time=0.042..50.257 rows=172001 loops=1) Filter: (state = 'CA'::text) Rows Removed by Filter: 827972 Planning Time: 0.222 ms Execution Time: 55.811 ms (8 rows)

EXPLAIN ANALYZE SELECT * FROM customers WHERE state = 'WY';
-- Index Scan using customers_state_idx (rows ~4000) QUERY PLAN --------------------------------------------------- Index Scan using customers_state_idx on customers (cost=0.42..13116.39 rows=4233 ...) (actual time=0.045..21.238 rows=4300 loops=1) Index Cond: (state = 'WY'::text) Planning Time: 0.051 ms Execution Time: 21.470 ms (5 rows)

SET enable_seqscan = off;

-- Index Scan on customers (rows ~180000) QUERY PLAN --------------------------------------------------- Index Scan using customers_state_idx on customers (cost=0.42..32172.73 rows=170529 width=26) (actual time=0.053..75.656 rows=172001 loops=1) Index Cond: (state = 'CA'::text) Buffers: shared hit=288 read=7043 Planning Time: 0.080 ms Execution Time: 80.221 ms (5 rows)

EXPLAIN SELECT * FROM customers WHERE state = 'CA'; --
Index Scan on customers (rows ~180000) QUERY PLAN --------------------------------------------------- Index Scan using customers_state_idx on customers (cost=0.42.. 32172.73 rows=170529 width=26) (actual time=0.053..75.656 rows=172001 loops=1) Index Cond: (state = 'CA'::text) Buffers: shared hit=288 read=7043 Planning Time: 0.080 ms Execution Time: 80.221 ms (5 rows)

MCV helps select seq scan when data is skewed

Histograms

SELECT (unnest( histogram_bounds ::text::date[]))::date AS bucket_bound FROM pg_stats WHERE tablename
= 'customers' AND attname = ' signup_date ' LIMIT 8; --- First 8 histogram boundaries for signup_date --- (default target = 100 buckets, each ~1% of rows) bucket_bound -------------- 2018 - 01 - 01 2018 - 03 - 01 2018 - 04 - 28 2018 - 07 - 04 2018 - 09 - 05 2018 - 10 - 28 2018 - 12 - 27 2019 - 02 - 14 (8 rows)

ALTER TABLE customers ALTER COLUMN signup_date SET STATISTICS 1000;

= 'customers' AND attname = ' signup_date ' LIMIT 8; --- First 8 histogram boundaries for signup_date --- (now 1000 buckets, each ~0.1% of rows) bucket_bound -------------- 2018 - 01 - 01 2018 - 01 - 03 2018 - 01 - 05 2018 - 01 - 07 2018 - 01 - 10 2018 - 01 - 12 2018 - 01 - 14 2018 - 01 - 16 (8 rows)

= 'customers' AND attname = ' signup_date ' LIMIT 8; --- First 8 histogram boundaries for signup_date --- (now 1000 buckets , each ~0.1% of rows) bucket_bound -------------- 2018 - 01 - 01 2018 - 01 - 03 2018 - 01 - 05 2018 - 01 - 07 2018 - 01 - 10 2018 - 01 - 12 2018 - 01 - 14 2018 - 01 - 16 (8 rows) bucket_bound -------------- 2018 - 01 - 01 2018 - 03 - 01 2018 - 04 - 28 2018 - 07 - 04 2018 - 09 - 05 2018 - 10 - 28 2018 - 12 - 27 2019 - 02 - 14 (8 rows) Beware of Trade - offs!

Now, about correlations!

EXPLAIN ANALYZE SELECT * FROM customers WHERE city = 'Cheyenne'
AND state = 'WY'; ... Index Scan on customers (cost=... rows=8 width=...) (actual rows=4012)

AND state = 'WY’; ... Index Scan on customers (cost=... rows=8 width=...) (actual rows=4012 )

𝑃(city = Cheyenne AND state = WY) = 𝑃(city) ×
𝑃(state)

CREATE STATISTICS customers_city_state (dependencies, ndistinct ) ON city, state FROM
customers; ANALYZE customers;

CREATE STATISTICS customers_city_state ( dependencies, ndistinct ) ON city, state
FROM customers; ANALYZE customers; Statistic Types

AND state = 'WY'; ... Index Scan on customers (cost=... rows=4087 width=...) (actual rows=4012)

Query Planning checklist 1. Compare estimated vs actual rows. Run
EXPLAIN ANALYZE. Look for the deepest node where the estimate disagrees with reality. 2. Check pg_stats for that column. Look at n_distinct , the MCV list. Do they match what you know about your data? 3. If the stats look stale, run ANALYZE. Especially after a big batch load, a migration, or a partition swap. Autovacuum may not have caught up.

Query Planning checklist (cont’d) 4. If estimates are still off
on a single column, raise the statistics target. ALTER COLUMN ... SET STATISTICS 1000 . 5. If the bad estimate involves two columns in the same WHERE clause, consider correlation. Create extended statistics on the pair. 6. Only after all of that should you consider rewriting the query.

The Query Planner is only as smart as the statistics
you feed it.

pg_stats: How Postgres Internal Stats Work

pg_stats: How Postgres Internal Stats Work

More Decks by Richard Yen

Other Decks in Technology

Featured

Transcript