Upgrade to Pro — share decks privately, control downloads, hide ads and more …

DB for your next project? Why, Postgres, of course

DB for your next project? Why, Postgres, of course

This talk highlights nice sides of Postgres which IMO make it an excellent choice for your next project - aside from money. I'm talking about NoSQL support, datatypes, logging, error handling, procedures in languages devs know as well as about Postgres internals: it's query path, architecture and processes.

Tomasz Borek

October 11, 2016
Tweet

More Decks by Tomasz Borek

Other Decks in Technology

Transcript

  1. What will I tell you? • Colourful history of PostgreSQL

    – So, DB wars • Chosen features • Architecture and internals • Query path and optimization (no hinting) • Multithreading (very briefly, too little time)
  2. My Faves • Error reporting / logging • PL/xSQL –

    feel free to use Perl, Python, Ruby, Java, LISP... • XML and JSON handling • Foreign Data Wrappers (FDW) • Windowing functions • Common table expressions (CTE) and recursive queries • Power of Indexes
  3. Logging, ‘gotchas’ • Default is to stderr only • Set

    on CLI or in config, not through sets • Where is it? • How to log queries… or turning log_collector on
  4. Where is it? • Default – data/pg_log • Launchers can

    set it (Mac Homebrew/plist) • Version and config dependent
  5. Logging, turn it on • Default is to stderr only

    • In PG: logging_collector = on log_filename = strftime-patterned filename [log_destination = [stderr|syslog|csvlog] ] log_statement = [none|ddl|mod|all] / / all log_min_error_statement = ERROR log_line_prefix = '%t %c %u ' # time sessionid user
  6. PL/pgSQL • Stored procedure dilemma – Where to keep your

    logic? – How your logic is NOT in your SCM
  7. PL/pgSQL • Stored procedure dilemma – Where to keep your

    logic? – How your logic is NOT in your SCM • Over dozen of options: – Perl, Python, Ruby, – pgSQL, Java, – TCL, LISP…
  8. PL/pgSQL • Stored procedure dilemma – Where to keep your

    logic? – How your logic is NOT in your SCM • Over dozen of options: – Perl, Python, Ruby, – pgSQL, Java, – TCL, LISP… • DevOps, SysAdmins, DBAs… ETLs etc.
  9. PL/pgSQL • Stored procedure dilemma – Where to keep your

    logic? – How your logic is NOT in your SCM • Over dozen of options: – Perl, Python, Ruby, – pgSQL, Java, – TCL, LISP… • DevOps, SysAdmins, DBAs… ETLs etc.
  10. Perl function example CREATE FUNCTION perl_max (integer, integer) RETURNS integer

    AS $$ my ($x, $y) = @_; if (not defined $x) { return undef if not defined $y; return $y; } return $x if not defined $y; return $x if $x > $y; return $y; $$ LANGUAGE plperl;
  11. XML or JSON support • Parsing and retrieving XML (functions)

    • Valid JSON checks (type) • Careful with encoding! – PG allows only one server encoding per database – Specify it to UTF-8 or weep • Document database instead of OO or rel – JSON, JSONB, HSTORE – noSQL fun welcome!
  12. HSTORE? CREATE TABLE example ( id serial PRIMARY KEY, data

    hstore); INSERT INTO example (data) VALUES ('name => "John Smith", age => 28, gender => "M"'), ('name => "Jane Smith", age => 24');
  13. HSTORE? CREATE TABLE example ( id serial PRIMARY KEY, data

    hstore); INSERT INTO example (data) VALUES ('name => "John Smith", age => 28, gender => "M"'), ('name => "Jane Smith", age => 24'); SELECT id, data->'name' FROM example; SELECT id, data->'age' FROM example WHERE data->'age' >= '25';
  14. XML functions example XMLROOT ( XMLELEMENT ( NAME gazonk, XMLATTRIBUTES

    ( ’val’ AS name, 1 + 1 AS num ), XMLELEMENT ( NAME qux, ’foo’ ) ), VERSION ’1.0’, STANDALONE YES ) <?xml version=’1.0’ standalone=’yes’ ?> <gazonk name=’val’ num=’2’> <qux>foo</qux> </gazonk> xml '<foo>bar</foo>' '<foo>bar</foo>'::xml
  15. Foreign Data Wrappers (FDW) • Stop ETL, start FDW •

    Read AND write • FS, Mongo, Hadoop, Redis… • You can write your own!
  16. Windowing functions • Replacement for procedures (somewhat) • In a

    nutshell: – Take row, – find related rows, – compute things over related rows, – return result along with the row • Ranking, averaging, growth per time... http://www.craigkerstiens.com/2014/02/26/Tracking-MoM-growth-in-SQL/ https://www.postgresql.org/docs/9.1/static/tutorial-window.html
  17. Index power • Geo and spherical indexes • Partial indexes

    (email like @company.com) • Function indexes • JSON(B) has index support • You may create your own index
  18. Check out processes • pgrep -l postgres • htop >

    filter: postgres • Whatever you like / use usually • Careful with kill -9 on connections – kill -15 better
  19. Parser • Syntax checks, like FRIM is not a keyword

    – SELECT * FRIM myTable; • Catalog lookup – MyTable may not exist • In the end query tree is built – Query tokenization: SELECT (keyword) employeeName (field id) count (function call)...
  20. Planner • Where Planner Tree is built • Where best

    execution is decided upon – Seq or index scan? Index or bitmap index? – Which join order? – Which join strategy (nested, hashed, merge)? – Inner or outer? – Aggregation: plain, hashed, sorted… • Heuristic, if finding all plans too costly
  21. Example to explain EXPLAIN EXPLAIN SELECT * FROM tenk1; QUERY

    PLAN ------------------------------------------------------------ Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
  22. Explaining EXPLAIN - what EXPLAIN SELECT * FROM tenk1; QUERY

    PLAN ------------------------------------------------------------ Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244) • Startup cost – time before output phase begins • Total cost – in page fetches, may change, assumed to run node to completion • Rows – estimated number to scan (but LIMIT etc.) • Estimated average width of output from that node (in bytes)
  23. Explaining EXPLAIN - how EXPLAIN SELECT * FROM tenk1; QUERY

    PLAN ------------------------------------------------------------ Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244) SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1'; / /358|10k • No WHERE, no index • Cost = disk pages read * seq page cost + rows scanned * cpu tuple cost • 358 * 1.0 + 10000 * 0.01 = 458 // default values
  24. Analyzing EXPLAIN ANALYZE EXPLAIN ANALYZE SELECT * FROM tenk1 t1,

    tenk2 t2 WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1) -> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1) Recheck Cond: (unique1 < 10) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1) Index Cond: (unique1 < 10) -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10) Index Cond: (unique2 = t1.unique2) Planning time: 0.181 ms Execution time: 0.501 ms • Actually runs the query • More info: actual times, rows removed by filter, sort method used, disk/memory used...
  25. Analyzing EXPLAIN ANALYZE EXPLAIN ANALYZE SELECT * FROM tenk1 t1,

    tenk2 t2 WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1) -> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1) Recheck Cond: (unique1 < 10) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1) Index Cond: (unique1 < 10) -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10) Index Cond: (unique2 = t1.unique2) Planning time: 0.181 ms Execution time: 0.501 ms
  26. Analyzing EXPLAIN ANALYZE EXPLAIN ANALYZE SELECT * FROM tenk1 t1,

    tenk2 t2 WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1) -> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1) Recheck Cond: (unique1 < 10) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1) Index Cond: (unique1 < 10) -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10) Index Cond: (unique2 = t1.unique2) Planning time: 0.181 ms Execution time: 0.501 ms
  27. Analyzing EXPLAIN ANALYZE EXPLAIN ANALYZE SELECT * FROM tenk1 t1,

    tenk2 t2 WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=4.65..118.62 rows=10 width=488) (actual time=0.128..0.377 rows=10 loops=1) -> Bitmap Heap Scan on tenk1 t1 (cost=4.36..39.47 rows=10 width=244) (actual time=0.057..0.121 rows=10 loops=1) Recheck Cond: (unique1 < 10) -> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.36 rows=10 width=0) (actual time=0.024..0.024 rows=10 loops=1) Index Cond: (unique1 < 10) -> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.91 rows=1 width=244) (actual time=0.021..0.022 rows=1 loops=10) Index Cond: (unique2 = t1.unique2) Planning time: 0.181 ms Execution time: 0.501 ms
  28. Battle-tested • Could mature since 1987 • Comes in many

    flavours (forks) • Largest cluster – PBs in Yahoo • Skype, NASA, Instagram • Stable: – Many years on one version – Good version support – Every year something new – Follows ANSI SQL standards https://www.postgresql.org/about/users/
  29. Great features • Java, Perl, Python for stored procedures •

    Add CTEs and FDWs => great ETL or µservice • Handles XMLs and JSONs • Error reporting / logging • MVCC built-in • Windowing functions • ...
  30. Solid internals • Well-thought out processes • Built-in security (dozen

    of solutions) • WAL, stats collector, vacuum • Good rule engine and clear query optimization – No hinting will bother some people • Plethora of data types
  31. Disadvantages • More like Python then Perl/PHP • Some learning

    curve • Some say: – replication(‘s performance) • I can’t think of more, doesn’t mean none are present :-)