Postgres – A Data Platform

Postgres: A Data Platform craigkerstiens

Who am I Product @ Heroku craigkerstiens.com postgresweekly.com postgresguide.com craigkerstiens

Postgres – TLDR; Datatypes Conditional Indexes Transactional DDL Foreign Data
Wrappers Concurrent Index Creation Extensions Common Table Expressions Fast Column Addition Listen/Notify Table Inheritance Per Transaction sync replication Window functions NoSQL inside SQL Momentum craigkerstiens

Rough outline Context Day to day Datatypes Performance Expanding postgres

1989 • Gas was $1.09 • Movie ticket $2.75 •
Microsoft releases Windows 1.0 • Postgres • Project called Postgres is released • Roots in Ingres (post-ingres) Context | Day to day | Datatypes | Performance | Expanding Postgres

1996 It might help to explain that the pronunciation is
"post-gres" or! "post-gres-cue-ell", not "post-gray-something".! ! I heard people making this same mistake in presentations at this! past weekend's Postgres Anniversary Conference :-( Arguably,! the 1996 decision to call it PostgreSQL instead of reverting to! plain Postgres was the single worst mistake this project ever made.! It seems far too late to change now, though.! ! ! ! ! regards, tom lane! Context | Day to day | Datatypes | Performance | Expanding Postgres

MVCC Immutable structure ! Reads don’t block writes, each transaction
sees what the state was when it started ! Context | Day to day | Datatypes | Performance | Expanding Postgres

Under the covers A giant append only log ! Context
| Day to day | Datatypes | Performance | Expanding Postgres Writing data adds to the log Updating data adds to the log Deleting data adds to the log

A tangent Context | Day to day | Datatypes |
Performance | Expanding Postgres

What is a relational DB • Stores the data and
how its related • Data is in a flat two dimensional space • Has relationships between the data ! • Really though it’s math Context | Day to day | Datatypes | Performance | Expanding Postgres

What is SQL • SQL means SQL to access it
• SQL also implies relational • A car has an owner • An owner has an address Context | Day to day | Datatypes | Performance | Expanding Postgres

NoSQL is overloaded It can relate to CAP but doesn’t
necessarily • Consistency • Availability • Partition Tolerance ! • It can relate to SQL or user experience Context | Day to day | Datatypes | Performance | Expanding Postgres

Databases broken down • Relational databases • Key-Value stores •
Document databases • Text search solutions • Distributed data stores • Time series Context | Day to day | Datatypes | Performance | Expanding Postgres

Day to Day

Postgres bag of tricks • My editor of choice •
Writing better SQL • Misc tips/tricks Context | Day to day | Datatypes | Performance | Expanding Postgres

PSQL \e - opens your default $EDITOR \d - describe
something \dt - list all tables \x auto - pretty results \x help - help .psqlrc - really geek out (pro tip - name your queries) Context | Day to day | Datatypes | Performance | Expanding Postgres

Formatting SQL SELECT foo FROM bar a, baz b WHERE
a.this = b.that AND condition2 = false GROUP BY 1 ORDER BY 1 ASC; Context | Day to day | Datatypes | Performance | Expanding Postgres

More readable SQL

Common Table Expressions

An example ! --- Calculates the projects per each user
tasks_per_project_per_user AS ( SELECT user_id, project_id, count(*) as task_count FROM tasks GROUP BY user_id, project_id ), ! --- Gets user ids that have over 50% of tasks assigned overloaded_users AS ( SELECT tasks_per_project_per_user.user_id, ! FROM tasks_per_project_per_user, total_tasks_per_project WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2) ) ! SELECT email, task_list, title FROM users_tasks, overloaded_users WHERE users_tasks.user_id = overloaded_users.user_id --- Initial query to grab project title and tasks per user WITH users_tasks AS ( SELECT users.id as user_id, users.email, array_agg(tasks.name) as task_list, projects.title FROM users, tasks, project WHERE users.id = tasks.user_id projects.title = tasks.project_id GROUP BY users.email, projects.title ), ! --- Calculates the total tasks per each project total_tasks_per_project AS ( SELECT project_id, count(*) as task_count FROM tasks GROUP BY project_id ), Context | Day to day | Datatypes | Performance | Expanding Postgres

An example --- Initial query to grab project title and
tasks per user WITH users_tasks AS ( SELECT users.id as user_id, users.email, array_agg(tasks.name) as task_list, projects.title FROM users, tasks, project WHERE users.id = tasks.user_id projects.title = tasks.project_id GROUP BY users.email, projects.title ), Context | Day to day | Datatypes | Performance | Expanding Postgres

An example --- Calculates the total tasks per each project
total_tasks_per_project AS ( SELECT project_id, count(*) as task_count FROM tasks GROUP BY project_id ), Context | Day to day | Datatypes | Performance | Expanding Postgres

An example --- Calculates the projects per each user tasks_per_project_per_user
AS ( SELECT user_id, project_id, count(*) as task_count FROM tasks GROUP BY user_id, project_id ), Context | Day to day | Datatypes | Performance | Expanding Postgres

An example --- Gets user ids that have over 50%
of tasks assigned overloaded_users AS ( SELECT tasks_per_project_per_user.user_id, ! FROM tasks_per_project_per_user, total_tasks_per_project WHERE tasks_per_project_per_user.task_count > (total_tasks_per_project / 2) ) Context | Day to day | Datatypes | Performance | Expanding Postgres

An example SELECT email, task_list, title FROM users_tasks, overloaded_users WHERE
users_tasks.user_id = overloaded_users.user_id Context | Day to day | Datatypes | Performance | Expanding Postgres

Window Functions

An Example SELECT last_name, salary, department, rank() OVER (PARTITION BY
department ORDER BY salary DESC) FROM employees; ! ! ! last_name salary department rank Jones 45000 Accounting 1 Williams 37000 Accounting 2 Smith 55000 Sales 1 Adams 50000 Sales 2 Johnson 40000 Marketing 1

Extensions

Extensions hstore citext db_link uuid_ossp cube PostGIS intarray pg_crypto isn
table_func unaccent dict_int dict_xsyn earth_distance pg_stat_tuple ltree pg_row_locks trigram

Datatypes

The list integer float serial money character bytea timestamp timestamp
with tz date time interval Context | Day to day | Datatypes | Performance | Expanding Postgres boolean enums point line box path polygon circle inet cidr bit tsvector tsquery UUID XML JSON array rangetypes

Money • Don’t use it • Created when the world
was flat • Money only knows a single currency Context | Day to day | Datatypes | Performance | Expanding Postgres

Serial Context | Day to day | Datatypes | Performance
| Expanding Postgres

Serial • Serial has a limit • UUID scales •
UUID doesn’t leak your primary keys • uuid_ossp extension UUID Context | Day to day | Datatypes | Performance | Expanding Postgres

Shapes • Basic ones for a shortcut • earth_distance !
• PostGIS for robustness Context | Day to day | Datatypes | Performance | Expanding Postgres

XML Just kidding

JSON vs. hStore • NoSQL often means schema-less ! •
Adding columns is annoying • Just add data Context | Day to day | Datatypes | Performance | Expanding Postgres

hStore • Key value store directly in Postgres ! •
Can filter where a key does/doesn’t exist • Can filter for specific values of keys Context | Day to day | Datatypes | Performance | Expanding Postgres

JSON Postgres 9.2 was the JSON release Marketing Postgres 9.3
started to become usable operators, basic indexing, etc. Postgres 9.4 Context | Day to day | Datatypes | Performance | Expanding Postgres

JSON Context | Day to day | Datatypes | Performance

JSONB Context | Day to day | Datatypes | Performance

Others • Timestamps with timezone • Intervals (now() - ‘1
hour’::interval) • Array • Range types Context | Day to day | Datatypes | Performance | Expanding Postgres

Performance

Mo cache, Less problems

Cache SELECT 'index hit rate' as name, (sum(idx_blks_hit) - sum(idx_blks_read))
/ sum(idx_blks_hit + idx_blks_read) as ratio FROM pg_statio_user_indexes union all SELECT 'cache hit rate' as name, case sum(idx_blks_hit) when 0 then 'NaN'::numeric else to_char((sum(idx_blks_hit) - sum(idx_blks_read)) / sum(idx_blks_hit + idx_blks_read), '99.99')::numeric end as ratio FROM pg_statio_user_indexes) Context | Day to day | Datatypes | Performance | Expanding Postgres

Index Hit Rate SELECT relname, 100 * idx_scan / (seq_scan
+ idx_scan), n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC; Context | Day to day | Datatypes | Performance | Expanding Postgres

Rule of thumb Cache > 99% Index hit rate >
95% where rows > 10000 Context | Day to day | Datatypes | Performance | Expanding Postgres

Speciﬁc Query Performance Context | Day to day | Datatypes
| Performance | Expanding Postgres

An example SELECT last_name FROM employees WHERE salary >= 50000;

Context | Day to day | Datatypes | Performance |
Expanding Postgres # EXPLAIN SELECT last_name FROM employees WHERE salary >= 50000; ! QUERY PLAN -------------------------------------------------- Seq Scan on employees (cost=0.00..35811.00 rows=1 width=6) Filter: (salary >= 50000) (3 rows) Query plans

Query plans # EXPLAIN SELECT last_name FROM employees WHERE salary
>= 50000; ! QUERY PLAN -------------------------------------------------- Seq Scan on employees (cost=0.00..35811.00 rows=1 width=6) Filter: (salary >= 50000) (3 rows) Context | Day to day | Datatypes | Performance | Expanding Postgres

Query plans # EXPLAIN ANALYZE SELECT last_name FROM employees WHERE
salary >= 50000; QUERY PLAN -------------------------------------------------- Seq Scan on employees (cost=0.00..35811.00 rows=1 width=6) (actual time=2.401..295.247 rows=1428 loops=1) Filter: (salary >= 50000) Total runtime: 295.379 (3 rows) ! Filter: (salary >= 50000) (3 rows) Context | Day to day | Datatypes | Performance | Expanding Postgres

Indexes fix it # CREATE INDEX idx_emps ON employees (salary);

EXPLAIN ANALYZE SELECT last_name FROM employees WHERE salary >= 50000;
QUERY PLAN -------------------------------------------------- Index Scan using idx_emps on employees (cost=0.00..8.49 rows=1 width=6) (actual time = 0.047..1.603 rows=1428 loops=1) Index Cond: (salary >= 50000) Total runtime: 1.771 ms (3 rows) Context | Day to day | Datatypes | Performance | Expanding Postgres Indexes fix it

pg_stat_statments $ select * from pg_stat_statements where query ~ 'from
users where email'; ! ! userid │ 16384 dbid │ 16388 query │ select * from users where email = ?; calls │ 2 total_time │ 0.000268 rows │ 2 shared_blks_hit │ 16 shared_blks_read │ 0 shared_blks_dirtied │ 0 shared_blks_written │ 0 local_blks_hit │ 0 local_blks_read │ 0 local_blks_dirtied │ 0 local_blks_written │ 0 ... Context | Day to day | Datatypes | Performance | Expanding Postgres

pg_stat_statements SELECT (total_time / 1000 / 60) as total, (total_time/calls)
as avg, query FROM pg_stat_statements ORDER BY 1 DESC LIMIT 100; Context | Day to day | Datatypes | Performance | Expanding Postgres

Indexes B-Tree Generalized Inverted Index (GIN) Generalized Search Tree (GIST)
K Nearest Neighbors (KNN) Space Partitioned GIST (SP-GIST) Context | Day to day | Datatypes | Performance | Expanding Postgres

Indexes Which do I use? Context | Day to day
| Datatypes | Performance | Expanding Postgres craigkerstiens

BTree If you don’t know better, you’re using this. This
is usually okay. Context | Day to day | Datatypes | Performance | Expanding Postgres

Generalized Inverted Index (GIN) Multiple values in a single column.
! Rule of thumb is Array/hStore Context | Day to day | Datatypes | Performance | Expanding Postgres

Generalized Search Tree (GiST) Values that span across a boundary
! Rule of thumb is full text and shapes Context | Day to day | Datatypes | Performance | Expanding Postgres

Indexes B-Tree Generalized Inverted Index (GIN) Generalized Search Tree (GIST)
K Nearest Neighbors (KNN) Space Partitioned GIST (SP-GIST) VODKA (coming soon) Context | Day to day | Datatypes | Performance | Expanding Postgres

There’s more Context | Day to day | Datatypes |
Performance | Expanding Postgres

Conditional & Functional Conditional ! CREATE INDEX … WHERE deleted_at
is null ! Functional ! CREATE INDEX … LOWER(firstname) Context | Day to day | Datatypes | Performance | Expanding Postgres

Backups Logical ! Portable Has load on DB ! <
50 GB Context | Day to day | Datatypes | Performance | Expanding Postgres Phsyical ! More setup Limited load on DB ! > 50 GB

Replication Slony Londiste Bucardo PgPool WAL-e Barman Context | Day
to day | Datatypes | Performance | Expanding Postgres

Expanding Postgres

Databases broken down • Relational databases • Key-Value stores •
Document databases • Text search solutions • Distributed data stores • Time series Context | Day to day | Datatypes | Performance | Expanding Postgres

What is Postgres Context | Day to day | Datatypes
| Performance | Expanding Postgres

Foreign Data Wrappers • Query from Postgres 1 to Postgres
2 • Query other things from Postgres • Redis • Mongo • LDAP • Twitter? Context | Day to day | Datatypes | Performance | Expanding Postgres

PL’s Context | Day to day | Datatypes | Performance

PLV8 Context | Day to day | Datatypes | Performance

Postgres – TLDR; Datatypes Conditional Indexes Transactional DDL Foreign Data
Wrappers Concurrent Index Creation Extensions Common Table Expressions Fast Column Addition Listen/Notify Table Inheritance Per Transaction sync replication Window functions NoSQL inside SQL Momentum craigkerstiens

Fin. craigkerstiens

Postgres – A Data Platform

Postgres – A Data Platform

More Decks by Craig Kerstiens

Featured

Transcript