Slide 1

Postgres at any scale
Craig Kerstiens @ Citus Data

Slide 2

Who am I?
• Craig Kerstiens
• Run fully managed database-as-a-service
• Curate Postgres Weekly
• Blog at craigkerstiens.com
• Previously Heroku, Truviso, Accenture

Slide 3

Itty bitty little data

Slide 4

Database is just a hash

Slide 5

Why Postgres?

Slide 6

“It’s the emacs of databases” @craigkerstiens

Slide 7

Postgres
• Datatypes
• GIS
• Index types
• Full text search
• JSONB

Slide 8

Datatypes
• Varchar(255)
• Timestamp vs. Timestamptz
• JSON vs. JSONB
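A sketch of what those type choices look like in practice (the table and column names here are hypothetical, invented for illustration):

```sql
-- Hypothetical table illustrating the datatype advice above.
CREATE TABLE events (
  id       bigserial PRIMARY KEY,
  name     varchar(255),          -- 255 is an arbitrary limit; plain text often serves as well
  occurred timestamptz NOT NULL,  -- timestamptz records a point in time; timestamp silently drops the zone
  payload  jsonb                  -- jsonb is binary and indexable; json preserves the exact input text
);
```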

Slide 9

Constraints
• Django/SQLAlchemy are good by default
• Is this string really under 255 characters?
• Is the IP address really an IP address?
• Does the foreign key exist?
• And if you delete a user, do you clean up the rest of their records?
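A sketch of letting the database enforce what the ORM assumes (schema is hypothetical):

```sql
CREATE TABLE users (
  id    bigserial PRIMARY KEY,
  email varchar(255) NOT NULL CHECK (position('@' in email) > 0),  -- length and shape enforced in the DB
  ip    inet                                                       -- a real IP address, validated by the type
);

CREATE TABLE posts (
  id      bigserial PRIMARY KEY,
  user_id bigint NOT NULL REFERENCES users (id) ON DELETE CASCADE  -- deleting a user cleans up their posts
);
```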

Slide 10

Master your tools
psql is great, leverage it
psqlrc (\x auto, \timing, saving history)
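A ~/.psqlrc along these lines wires up the settings the slide names (a sketch; adjust to taste):

```
\set QUIET 1
-- expanded display only when output is too wide for the terminal
\x auto
-- report execution time for every query
\timing
-- one history file per database, skipping duplicate entries
\set HISTFILE ~/.psql_history- :DBNAME
\set HISTCONTROL ignoredups
\unset QUIET
```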

Slide 11

Backups
Backups let you (1) recover data, but also (2) verify the quality of your data.
Backups don't exist unless you test them.

Slide 12

Medium

Slide 13

Configure your system
• Set up your logs
• Configure your memory
• Tweak vacuum
• Adjust checkpoints
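Those four bullets map onto a handful of postgresql.conf knobs. The values below are placeholders to show which settings are involved, not tuned recommendations:

```
# logs
log_min_duration_statement = 50    # log statements slower than 50 ms
# memory
shared_buffers = 4GB               # ~25% of RAM is a common starting point
work_mem = 64MB
# vacuum
autovacuum_vacuum_scale_factor = 0.1
# checkpoints
checkpoint_timeout = 10min
max_wal_size = 2GB
```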

Slide 14

Don’t learn it all
http://pgtune.leopard.in.ua/

Slide 15

Cache rules everything around me

Slide 16

Cache hit rate

SELECT 'index hit rate' AS name,
       (sum(idx_blks_hit) - sum(idx_blks_read)) / sum(idx_blks_hit + idx_blks_read) AS ratio
FROM pg_statio_user_indexes
UNION ALL
SELECT 'cache hit rate' AS name,
       CASE sum(heap_blks_hit)
         WHEN 0 THEN 'NaN'::numeric
         ELSE to_char((sum(heap_blks_hit) - sum(heap_blks_read)) / sum(heap_blks_hit + heap_blks_read), '99.99')::numeric
       END AS ratio
FROM pg_statio_user_tables;

Slide 17

Cache hit rate

      name      | ratio
----------------+-------
 cache hit rate |  0.99

Slide 18

Index hit rate

SELECT relname,
       100 * idx_scan / (seq_scan + idx_scan) AS percent_of_times_index_used,
       n_live_tup AS rows_in_table
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;

Slide 19

Index hit rate

       relname       | percent_of_times_index_used | rows_in_table
---------------------+-----------------------------+---------------
 events              |                           0 |        669917
 app_infos_user_info |                           0 |        198218
 app_infos           |                          50 |        175640
 user_info           |                           3 |         46718
 rollouts            |                           0 |         34078
 favorites           |                           0 |          3059

Slide 20

Rough guidelines
Cache hit rate >= 99%
Index hit rate >= 95% on tables with > 10,000 rows

Slide 21

Understanding Specific Query Performance

Slide 22

Understanding Specific Query Performance

SELECT last_name
FROM employees
WHERE salary >= 50000;

Slide 23

Understanding Specific Query Performance

EXPLAIN SELECT last_name FROM employees WHERE salary >= 50000;
                         QUERY PLAN
--------------------------------------------------
 Seq Scan on employees (cost=0.00..35811.00 rows=1 width=6)
   Filter: (salary >= 50000)
(3 rows)

Slide 24

Understanding Specific Query Performance

EXPLAIN ANALYZE SELECT last_name FROM employees WHERE salary >= 50000;
                         QUERY PLAN
--------------------------------------------------
 Seq Scan on employees (cost=0.00..35811.00 rows=1 width=6)
   (actual time=2.401..295.247 rows=1428 loops=1)
   Filter: (salary >= 50000)
 Total runtime: 295.379 ms
(3 rows)

Slide 25

Rough guidelines
Page response times < 100 ms
Common queries < 10 ms
Rare queries < 100 ms

Slide 26

Understanding Specific Query Performance

EXPLAIN ANALYZE SELECT last_name FROM employees WHERE salary >= 50000;
                         QUERY PLAN
--------------------------------------------------
 Seq Scan on employees (cost=0.00..35811.00 rows=1 width=6)
   (actual time=2.401..295.247 rows=1428 loops=1)
   Filter: (salary >= 50000)
 Total runtime: 295.379 ms
(3 rows)

Slide 27

Indexes

CREATE INDEX idx_emps ON employees (salary);

Slide 28

Indexes

EXPLAIN ANALYZE SELECT last_name FROM employees WHERE salary >= 50000;
                         QUERY PLAN
--------------------------------------------------
 Index Scan using idx_emps on employees (cost=0.00..8.49 rows=1 width=6)
   (actual time=0.047..1.603 rows=1428 loops=1)
   Index Cond: (salary >= 50000)
 Total runtime: 1.771 ms
(3 rows)

Slide 29

pg_stat_statements

$ select * from pg_stat_statements where query ~ 'from users where email';

userid              | 16384
dbid                | 16388
query               | select * from users where email = ?;
calls               | 2
total_time          | 0.000268
rows                | 2
shared_blks_hit     | 16
shared_blks_read    | 0
shared_blks_dirtied | 0
shared_blks_written | 0
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
...

Slide 30

pg_stat_statements

SELECT (total_time / 1000 / 60) AS total,
       (total_time / calls) AS avg,
       query
FROM pg_stat_statements
ORDER BY 1 DESC
LIMIT 100;

Slide 31

pg_stat_statements

 total  |  avg  | query
--------+-------+-------------------------
 295.76 | 10.13 | SELECT id FROM users...
 219.13 | 80.24 | SELECT * FROM ...
(2 rows)

Slide 32

Indexes

Slide 33

Indexes
B-Tree
Generalized Inverted Index (GIN)
Generalized Search Tree (GIST)
Space Partitioned GIST (SP-GIST)
Block Range Index (BRIN)

Slide 34

Which do you use?

Slide 35

B-Tree
You usually want this

Slide 36

Generalized Inverted Index (GIN)
Use with multiple values in 1 column
Array/hstore

Slide 37

Generalized Search Tree (GIST)
Full text search
Shapes

Slide 38

Indexes
B-Tree
Generalized Inverted Index (GIN)
Generalized Search Tree (GIST)
Space Partitioned GIST (SP-GIST)
Block Range Index (BRIN)

Slide 39

Other index tips

Slide 40

Be specific
• Composite indexes
• Functional indexes
• Conditional indexes
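A sketch of each kind (the index, table, and column names are hypothetical):

```sql
CREATE INDEX idx_orders_user_created ON orders (user_id, created_at);  -- composite: matches two-column lookups
CREATE INDEX idx_users_lower_email ON users (lower(email));            -- functional: indexes an expression
CREATE INDEX idx_orders_pending ON orders (created_at)
  WHERE state = 'pending';                                             -- conditional (partial): indexes only some rows
```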

Slide 41

Faster writes? Faster reads?

Slide 42

Every write you make
1. Figure out query plan
2. Write to disk
3. Wait for ack
4. Return

Slide 43

Every write you make
1. Figure out query plan
2. Write to disk
3. Wait for ack
4. Update the index
5. Return

Slide 44

Every write you make
1. Figure out query plan
2. Write to disk
3. Wait for ack
4. Update the index
5. Update the other index
6. Return

Slide 45

Every write you make
1. Figure out query plan
2. Write to disk
3. Wait for ack
4. Update the index
5. Update the other index
6. Update the other other index
7. Return

Slide 46

Unused indexes?

SELECT schemaname || '.' || relname AS table,
       indexrelname AS index,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
       idx_scan AS index_scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique
  AND idx_scan < 50
  AND pg_relation_size(relid) > 5 * 8192
ORDER BY pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
         pg_relation_size(i.indexrelid) DESC;

Slide 47

Unused indexes

       table       |            index            | index_size | index_scans
-------------------+-----------------------------+------------+-------------
 public.formations | index_formations_on_user_id | 40 GB      |          30
 public.resources  | slack_pool_created_at_idx   | 10 GB      |          40
(2 rows)

Slide 48

Migrations
NOT NULL is good, except in a migration

Slide 49

Proper migrations
1 becomes 3:
1. Allow nulls, but set a default value
2. Gradually backfill old data
3. Add your constraint
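A sketch of the three steps in SQL (table and column names are hypothetical):

```sql
-- Step 1: add the column nullable, with a default for new rows
ALTER TABLE users ADD COLUMN plan text DEFAULT 'free';

-- Step 2: backfill old rows gradually, in batches, outside the deploy
UPDATE users SET plan = 'free' WHERE plan IS NULL;

-- Step 3: only once the backfill is done, add the constraint
ALTER TABLE users ALTER COLUMN plan SET NOT NULL;
```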

Slide 50

Large scale

Slide 51

Index creation
CREATE INDEX CONCURRENTLY ...
Roughly 2-3x slower
Doesn’t lock the table
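For example (index name hypothetical). One caveat worth knowing: if a concurrent build fails, it leaves an invalid index behind that you must drop before retrying:

```sql
CREATE INDEX CONCURRENTLY idx_emps_salary ON employees (salary);

-- If the build failed, find any leftover invalid indexes:
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;
```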

Slide 52

Connection pooling
Application/framework layer
Stand-alone daemon

Slide 53

Pooler options
pgBouncer
pgpool
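As a sketch, a minimal pgbouncer.ini looks like this (database name and sizes are illustrative, not recommendations):

```
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction     ; hand the server connection back after each transaction
max_client_conn = 500
default_pool_size = 20
```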

Slide 54

Scaling cache

Slide 55

Scaling cache

Slide 56

Offload read traffic to a replica

Slide 57

Replication
slony
londiste
bucardo
pgpool
wal-e
barman

Slide 58

Replication
slony
londiste
bucardo
pgpool
wal-e/wal-g
barman

Slide 59

Sharding
• Split up large tables
• Split up data by customer
• 1 database per customer
• 1 schema per customer
• Shard within your application

Slide 60

Sharding
• Scalable options
• Split out large tables
• Multi-tenant application level sharding

Slide 61

Citus

Slide 62

Citus

Slide 63

Logical backup
pg_dump
Can be human readable, is portable
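A sketch of the two common pg_dump variants (database names are placeholders; this assumes a running server you can connect to):

```shell
# Plain SQL: the human-readable, portable variant
pg_dump mydb > mydb.sql

# Custom format: compressed, restorable selectively and in parallel
pg_dump -Fc mydb > mydb.dump
pg_restore -j 4 -d mydb_restored mydb.dump
```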

Slide 64

Physical backup
The bytes on disk
Base backup

Slide 65

Logical vs. physical

Logical                   | Physical
--------------------------+------------------------
Good across architectures | More initial setup
Good for portability      | Less portability
Has load on DB            | Limited load on system
Works < 50 GB             | Use above 50 GB

Slide 66

Bloat

Slide 67

Bloat

WITH constants AS (
  SELECT current_setting('block_size')::numeric AS bs, 23 AS hdr, 4 AS ma
), bloat_info AS (
  SELECT
    ma, bs, schemaname, tablename,
    (datawidth + (hdr + ma - (CASE WHEN hdr%ma = 0 THEN ma ELSE hdr%ma END)))::numeric AS datahdr,
    (maxfracsum * (nullhdr + ma - (CASE WHEN nullhdr%ma = 0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
  FROM (
    SELECT
      schemaname, tablename, hdr, ma, bs,
      SUM((1 - null_frac) * avg_width) AS datawidth,
      MAX(null_frac) AS maxfracsum,
      hdr + (
        SELECT 1 + count(*) / 8
        FROM pg_stats s2
        WHERE null_frac <> 0
          AND s2.schemaname = s.schemaname
          AND s2.tablename = s.tablename
      ) AS nullhdr
    FROM pg_stats s, constants
    GROUP BY 1, 2, 3, 4, 5
  ) AS foo
), table_bloat AS (
  SELECT
    schemaname, tablename, cc.relpages, bs,
    CEIL((cc.reltuples * ((datahdr + ma -
      (CASE WHEN datahdr%ma = 0 THEN ma ELSE datahdr%ma END)) + nullhdr2 + 4)) / (bs - 20::float)) AS otta
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid
    AND nn.nspname = bloat_info.schemaname
    AND nn.nspname <> 'information_schema'
), index_bloat AS (
  SELECT
    schemaname, tablename, bs,
    COALESCE(c2.relname, '?') AS iname,
    COALESCE(c2.reltuples, 0) AS ituples,
    COALESCE(c2.relpages, 0) AS ipages,
    COALESCE(CEIL((c2.reltuples * (datahdr - 12)) / (bs - 20::float)), 0) AS iotta -- very rough approximation, assumes all cols
  FROM bloat_info
  JOIN pg_class cc ON cc.relname = bloat_info.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid
    AND nn.nspname = bloat_info.schemaname
    AND nn.nspname <> 'information_schema'
  JOIN pg_index i ON indrelid = cc.oid
  JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
  type, schemaname, object_name, bloat, pg_size_pretty(raw_waste) AS waste
FROM (
  SELECT
    'table' AS type, schemaname, tablename AS object_name,
    ROUND(CASE WHEN otta = 0 THEN 0.0 ELSE table_bloat.relpages / otta::numeric END, 1) AS bloat,
    CASE WHEN relpages < otta THEN '0' ELSE (bs * (table_bloat.relpages - otta)::bigint)::bigint END AS raw_waste
  FROM table_bloat
  UNION
  SELECT
    'index' AS type, schemaname, tablename || '::' || iname AS object_name,
    ROUND(CASE WHEN iotta = 0 OR ipages = 0 THEN 0.0 ELSE ipages / iotta::numeric END, 1) AS bloat,
    CASE WHEN ipages < iotta THEN '0' ELSE (bs * (ipages - iotta))::bigint END AS raw_waste
  FROM index_bloat
) bloat_summary
ORDER BY raw_waste DESC, bloat DESC;

Slide 68

Bloat

 type  | schemaname |             object_name              | bloat | waste
-------+------------+--------------------------------------+-------+--------
 index | public     | pg_depend::pg_depend_reference_index |   2.1 | 336 kB
 index | public     | pg_depend::pg_depend_depender_index  |   1.9 | 272 kB
 table | public     | users                                |   1.3 | 256 kB
…

Slide 69

Recap

Slide 70

Small
• Leverage datatypes, it’s not a dumb hash
• Test your backups
• Take time to master your tools

Slide 71

Medium
• Tune Postgres well
• Watch cache hit ratio
• Be generous with indexes

Slide 72

Large
• Move away from pg_dump
• Set up connection pooling
• If you need to shard, invest in the right way or use a tool to help

Slide 73

Thanks! Questions?