Upgrade to Pro — share decks privately, control downloads, hide ads and more …

PostgreSQL Online Troubleshooting

PostgreSQL Online Troubleshooting

Slide from my and Ilya Kosmodemyansky talk at RITfest 2015 Moscow, Russia

Alexey Lesovsky

May 29, 2015
Tweet

More Decks by Alexey Lesovsky

Other Decks in Education

Transcript

  1. case 1: Bad release. Overview. - Symptoms: - significant load

    increase, - slowing operations. - Often unpredictable: - we don't know where the problem occurs. - Emergency: - problem must be found and resolved ASAP.
  2. case 1: Bad release. Troubleshoot. - Outside the database -

    top, sysstat, etc... - nagios/zabbix/... - Inside the database - pgbadger/loganalyze/etc... - pg_stat_statements
  3. case 1: Bad release. Outside the database. - top: -

    cpu usage, load average, swapping, iowait. - sysstat: - disk utilization (iostat), - resource consumption (sar).
  4. case 1: Bad release. Outside the database. - Okmeter: -

    online monitoring service, - rich feature/plugin set, - postgresql good support.
  5. case 1: Bad release. Inside the database. - Log analyze

    (pgBadger) - huge logs - read log before report creating - a lot of time need tobuild report - pg_stat_statements (contrib) - small storage footprint, - quick and flexible reports.
  6. case 1: Bad release. Inside the database. - query_stat_total.sql -

    https://github.com/PostgreSQL-Consulting/pg- utils
  7. case 1: Bad release. Query #1. SELECT p.id, p.rating FROM

    posts p LEFT JOIN complaints com ON (com.post_id = p.id AND com.user_id= ?) WHERE p.is_deleted IS FALSE AND com.is_hide IS NOT TRUE AND p.type_id != ? ORDER BY p.rating DESC LIMIT ?;
  8. case 1: Bad release. Query #1. JOIN -> (NOT) EXISTS

    SELECT p.id, p.rating FROM posts p WHERE p.is_deleted IS FALSE AND p.type_id != ? AND NOT EXISTS (SELECT 1 FROM complaints com WHERE com.post_id = p.id AND user_id = ? AND is_hide = true) ORDER BY p.rating DESC LIMIT ?;
  9. case 1: Bad release. Query #2. SELECT * FROM tags

    WHERE (tags.title ilike ?) Trigram Index. CREATE INDEX tags_title_trigram_key on tags using gin(title gin_trgm_ops);
  10. case 1: Bad release. Query #3. SELECT post.* FROM post

    JOIN domain ON post.domain_id = domain.id LEFT OUTER JOIN domain_acl ON domain_acl.domain_id = domain.id AND domain_acl.user_id = ? WHERE post.deleted = ? AND post.domain_id IN (?, ?, ?, ?, ?, ?, ?, ?) AND ((domain.flags & ?) = ? OR (domain_acl.acl & ?) = ?) AND post.id NOT IN (?, ?) ORDER BY post.last_activity DESC LIMIT ? OFFSET ?
  11. case 1: Bad release. Query #3. Index Only Scan SELECT

    * FROM post WHERE id IN (SELECT post.id FROM post JOIN domain ON post.domain_id = domain.id LEFT OUTER JOIN domain_acl ON domain_acl.domain_id = domain.id AND domain_acl.user_id = ? WHERE post.deleted = ? AND post.domain_id IN (?, ?, ?, ?, ?, ?, ?, ?) AND ((domain.flags & ?) = ? OR (domain_acl.acl & ?) = ?) AND post.id NOT IN (?, ?) ORDER BY post.last_activity DESC LIMIT ? OFFSET ?) order by post.last_activity DESC CREATE INDEX post_domain_id_last_activity_id_deleted_partial ON post USING btree (domain_id, last_activity, id, deleted) where deleted = 0;
  12. case 1: Bad release. Query #4. SELECT * FROM "group"

    WHERE ("group".group_vislvl_content >= ?) AND (group_main_domain_id IS NULL OR group_main_domain_id IN(?,?)) AND ("group".obj_pics_count + "group".group_persons_count + "group".group_blog_posts_count + "group".group_wiki_count >= ?) AND "group".group_is_demo = ? AND "group".obj_status_did = ? ORDER BY "group".group_persons_count desc, "group".obj_created asc LIMIT ?;
  13. case 1: Bad release. Query #4. Partial Index CREATE INDEX

    group_special2_key ON "group" USING btree (group_persons_count DESC, obj_created) WHERE ("group".obj_pics_count + "group".group_persons_count + "group".group_blog_posts_count + "group".group_wiki_count >= 1);
  14. case 2: More app servers... We need more... - project

    grow - load increasing - add more app servers - more apps -> more db connections
  15. case 2: More app servers... We need more... - too

    much db connections are bad - high resource contention - os overhead (memory, locks, forks)
  16. case 2: More app servers... We need more... - pgbouncer

    - lightweight connection pooler - stable, simple, fast (libevent) - use pgbouncer between apps and database
  17. case 2: More app servers... We need more... - simple

    test: without pgbouncer - pgbench -C -c 32 -T 300 -U postgres shopdb transaction type: TPC-B (sort of), scaling factor: 128, query mode: simple, number of clients: 32, number of threads: 1, duration: 300 s number of transactions actually processed: 253628 latency average: 37.851 ms tps = 845.403711 (including connections establishing) tps = 15320.442789 (excluding connections establishing)
  18. case 2: More app servers... We need more... - simple

    test: with pgbouncer - pgbench -C -c 32 -T 300 -U postgres shopdb transaction type: TPC-B (sort of), scaling factor: 128, query mode: simple, number of clients: 32, number of threads: 1, duration: 300 s number of transactions actually processed: 2689931 latency average: 3.569 ms tps = 8966.389025 (including connections establishing) tps = 19225.431659 (excluding connections establishing)
  19. case 2: More app servers... We need more... - total:

    300 seconds with 32 clients on 8-core server - latency: 37.8ms vs. 3.5ms - total transactions: 253628 vs. 2689931 - tps: 15320 vs. 19225