Upgrade to PRO for Only $50/Year—Limited-Time Offer! 🔥

How We Made PostgreSQL Fitter, Happier, More Pr...

How We Made PostgreSQL Fitter, Happier, More Productive

This talk will walk through a series of improvements made to our PostgreSQL 10 databases to improve performance, increase operational efficiency, and tune parameters. Some of the techniques were to add monitoring, remove and replace indexes, remove bloat, and split tables into a new database without application downtime.

Andrew Atkinson

November 24, 2021
Tweet

Resources

PGConf NYC 2021 - How we made PG Fitter, Happier, More Productive by Andrew Atkinson

https://www.youtube.com/watch?v=ijYha2bBink&list=PL9-zCXZQFyvqhkrefUXAfC4FYntQX9SC9&index=9

This talk will walk through a series of improvements made to our PostgreSQL 10 databases to improve performance, increase operational efficiency, and tune parameters. Some of the techniques were to add monitoring, remove and replace indexes, remove bloat, and split tables into a new database without application downtime.

More Decks by Andrew Atkinson

Other Decks in Technology

Transcript

  1. How We Made PG Fitter, Happier, More Productive Andrew Atkinson

    bit.ly/pg-fitter- happier PostgreSQL Fitter, Happier 2021 Slide 1 of 72
  2. Fountain Fountain is a High Volume Hiring Platform NOTE This

    talk is not about Fountain Hiring DBAs, Engineers right now! http://get.fountain.com/careers PostgreSQL Fitter, Happier 2021 Slide 4 of 72
  3. Context and Inspiration The platform experienced a 10x usage increase

    during the Covid-19 pandemic We had no dedicated DBA There was an opportunity to rapidly learn to better utilize PG PostgreSQL Fitter, Happier 2021 Slide 5 of 72
  4. Case Study: Scaling and Optimizing PostgreSQL Share our challenges, findings

    and some successful results Connect our challenges with common scaling and optimization issues Backend powered by Rails and PostgreSQL 10 on AWS RDS Single writer primary with async replication PostgreSQL Fitter, Happier 2021 Slide 6 of 72
  5. Database Connections Problems As web app and DB activity load

    increased, performance degraded Web app servers and background process workers couldn't get connections ...Is the server running on host example and accepting TCP/IP connections?... Could not add more servers to handle load due to lack of database connections PostgreSQL Fitter, Happier 2021 Slide 9 of 72
  6. Connections Learnings Connections are costly! PostgreSQL forks a new process

    for each connection Connections are a limited resource! Connections may be active , idle , idle in transaction . Too many idle connections can be a problem Connections in transaction but idle that are old may be a problem PostgreSQL Fitter, Happier 2021 Slide 10 of 72
  7. Connection Poolers pgbouncer Can be monitored by pgdash Pgpool-II RDS

    Proxy PostgreSQL Fitter, Happier 2021 Slide 12 of 72
  8. Connection Pooler Benefits Connection pooling enables many more client transactions

    with a fixed number of server connections Safely add more app servers and background workers that require connections Improve performance More efficiently utilize connections with transactions Eliminate overhead of connection establishment PostgreSQL Fitter, Happier 2021 Slide 14 of 72
  9. Added RDS Proxy RDS Proxy meant our small team didn't

    have to manage the infrastructure We no longer ran out of connections or saw degraded performance RDS Proxy was a big win for system stability! PostgreSQL Fitter, Happier 2021 Slide 15 of 72
  10. Excessive Disk Space Observations Tables and indexes had a very

    high bloat percentage No table or index bloat reductions had been performed Non-optimized indexes were taking up more space than necessary PostgreSQL Fitter, Happier 2021 Slide 17 of 72
  11. Space Reduction for Indexes Bloated indexes can be rebuilt without

    bloat Indexes can be reduced in size by limiting the fields and rows Unnecessary indexes can be removed entirely PostgreSQL Fitter, Happier 2021 Slide 19 of 72
  12. Non-Optimized Indexes We had over 100 unused indexes among 75

    tables Some bloated indexes were over 100 GB in size Estimated bloat percentage >= 80% PostgreSQL Fitter, Happier 2021 Slide 21 of 72
  13. Rebuild Indexes Used pg_repack with PG 10 to rebuild indexes

    one by one On PG >= 12 REINDEX CONCURRENTLY is available Rebuilt over 30 indexes reclaiming 230 GB of space PostgreSQL Fitter, Happier 2021 Slide 22 of 72
  14. Removed Unused Indexes Find indexes with zero scans Confirm no

    usages across the app, cron jobs, background jobs etc. In Rails create migration to drop the index ( DROP INDEX index_name ) Removed over 100 indexes and 300 GB of space! PostgreSQL Fitter, Happier 2021 Slide 24 of 72
  15. Relocated Database Tables We had 6 tables of metrics style

    data This data was typically not presented in the app High writes, insert only Around 700GB of data PostgreSQL Fitter, Happier 2021 Slide 25 of 72
  16. No Downtime Relocation Process 1. Change queries that use fields

    from tables being moved Typically involved splitting into more queries 2. Create new replica 3. Test and release queries against replica 4. Send writes to old and new. Writes will fail on read until promoted 5. Promote replica 6. Stop writing to old location and remove redundant obsolete tables. PostgreSQL Fitter, Happier 2021 Slide 26 of 72
  17. Space Reduction Technique Size Initial DB Size 2.5 TB Relocated

    tables 700 GB Removed unused indexes 300 GB Removed index bloat 200 GB New DB Size 1.2 TB PostgreSQL Fitter, Happier 2021 Slide 27 of 72
  18. Performance Challenges 10x increase in RPM Autovacuum not running at

    all Autovacuum running for a very long time Slow queries Non-optimized indexes PostgreSQL Fitter, Happier 2021 Slide 29 of 72
  19. Scaled Up Database Server (Vertically) Scaled Up Database Server 96

    vCPUs 768 GB memory Provisioned IOPS PostgreSQL Fitter, Happier 2021 Slide 30 of 72
  20. Why Tune Autovacuum? Remove dead rows for future inserts and

    updates quicker Reduce excessive disk space growth by avoiding inefficient disk use Ensure data statistics are refreshed as quickly as possible Reduce autovacuum run times PostgreSQL Fitter, Happier 2021 Slide 31 of 72
  21. Autovacuum Parameters Run vacuum in proportion to UPDATE and DELETE

    workload Param (PG 10) Original Tuned autovacuum_vacuum_scale_factor 20% 1% (tuned for big tables) autovacuum_vacuum_cost_limit 200 2000 autovacuum_vacuum_cost_delay 20ms 2ms maintenance_work_mem Sized by RDS autovacuum_max_workers PostgreSQL Fitter, Happier 2021 Slide 33 of 72
  22. Query Performance Query load can be categorized as write or

    read Slow queries contribute to system load Super slow reads on replica can affect primary Bloat can affect query performance PostgreSQL Fitter, Happier 2021 Slide 34 of 72
  23. Relocated Read Queries to Replica High proportion of reads (SELECT)

    to writes (INSERT, UPDATE, DELETE) Relocated queries to replica to reduce load on primary Additional replicas can be added to distribute read load PostgreSQL Fitter, Happier 2021 Slide 36 of 72
  24. Query Relocation Process Identify slow queries in APM or PgHero

    Determine if they are read only Wrap query in a code block that connects to the replica Fall back to the primary if not found Fall back to the primary if replication lag is too high PostgreSQL Fitter, Happier 2021 Slide 37 of 72
  25. Bloat and Query Performance Very high bloat can contribute to

    poor or inaccurate query plans Bloat causes unnecessary scans through pages Bloat can slow down index only scans PostgreSQL Fitter, Happier 2021 Slide 38 of 72
  26. Fast Database Migrations Use strong_migrations and code review to detect

    slow migrations Avoid common problems with large tables, like adding a column with a default PostgreSQL Fitter, Happier 2021 Slide 39 of 72
  27. Write Rate Performance and Indexes Indexes add some maintenance overhead

    for writes Remove unused, duplicate or invalid indexes Use PgHero or this SQL query to find unused indexes Eliminate redundant indexes e.g. querying a when indexes exist for a & b and a H.O.T. updates require non-indexed columns PostgreSQL Fitter, Happier 2021 Slide 41 of 72
  28. Query Statistics Enabled pg_stat_statements to collect query statistics Viewed total_time

    and calls statistics with PgHero Slow queries > 20ms flagged https://wiki.postgresql.org/wiki/Monitoring PostgreSQL Fitter, Happier 2021 Slide 42 of 72
  29. Query Optimization Process Overview Use EXPLAIN to look at query

    execution plan estimates Use EXPLAIN ANALYZE and look at where significant time is spent Look for techniques like adding an index to improve query performance Focus on highest impact queries Test with similar hardware and with similar data (very important) PostgreSQL Fitter, Happier 2021 Slide 43 of 72
  30. -- -- Query: Drivers sorted by average rating and trip

    count -- rideshare_development> explain SELECT CONCAT(d.first_name, ' ', d.last_name) AS driver_name, AVG(t.rating) AS avg_rating, COUNT(t.rating) AS trip_count FROM trips t JOIN users d ON t.driver_id = d.id GROUP BY t.driver_id, d.first_name, d.last_name ORDER BY COUNT(t.rating) DESC; PostgreSQL Fitter, Happier 2021 Slide 45 of 72
  31. +---------------------------------------------------------------------------------+ | QUERY PLAN | |---------------------------------------------------------------------------------| | Sort (cost=231.61..236.11 rows=1800

    width=90) | | Sort Key: (count(t.rating)) DESC | | -> HashAggregate (cost=107.28..134.28 rows=1800 width=90) | | Group Key: t.driver_id, d.first_name, d.last_name | | -> Hash Join (cost=3.35..69.78 rows=3000 width=22) | | Hash Cond: (t.driver_id = d.id) | | -> Seq Scan on trips t (cost=0.00..58.00 rows=3000 width=8) | | -> Hash (cost=2.60..2.60 rows=60 width=22) | | -> Seq Scan on users d (cost=0.00..2.60 rows=60 width=22) | +---------------------------------------------------------------------------------+ EXPLAIN Time: 0.027s PostgreSQL Fitter, Happier 2021 Slide 46 of 72
  32. Query Optimization Tips Look for Index Only Scan opportunities Query

    fewer fields (avoid select * ) Eliminate Heap Fetch by limiting fields Consider a Covering index by using INCLUDE (b) to include b payload column (11+) Avoid ORDER BY and DISTINCT if possible, or take advantage of sorted property of indexes PostgreSQL Fitter, Happier 2021 Slide 47 of 72
  33. Indexes Not Used A sequential scan may be more efficient

    than an index scan Query execution planner will look at cardinality (unique values) and estimate row counts and selectivity In a test database, sequential scans can be disabled (NOT FOR PROD) to confirm indexes are used rideshare_development> SET enable_seqscan = OFF; SET PostgreSQL Fitter, Happier 2021 Slide 48 of 72
  34. +---------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | |---------------------------------------------------------------------------------------------------------------| | Sort (cost=300.56..305.06 rows=1800

    width=90) | | Sort Key: (count(t.rating)) DESC | | -> HashAggregate (cost=176.23..203.23 rows=1800 width=90) | | Group Key: t.driver_id, d.first_name, d.last_name | | -> Hash Join (cost=15.07..138.73 rows=3000 width=22) | | Hash Cond: (t.driver_id = d.id) | | -> Index Scan using index_trips_on_driver_id on trips t (cost=0.28..115.51 rows=3000 width=8) | | -> Hash (cost=14.04..14.04 rows=60 width=22) | | -> Index Scan using users_pkey on users d (cost=0.14..14.04 rows=60 width=22) | +---------------------------------------------------------------------------------------------------------------+ EXPLAIN Time: 0.022s PostgreSQL Fitter, Happier 2021 Slide 49 of 72
  35. Partial Indexes Partial indexes avoid indexing common values Write rate

    benefit as new rows likely will not match predicate PostgreSQL Fitter, Happier 2021 Slide 50 of 72
  36. Partial Index Example 1 of 2 Query trips with no

    rating 3000 trips, 750 without a rating -> Bitmap Index Scan on index_trips_on_rating (cost=0.00..13.90 rows=750 width=0) Index Cond: (rating IS NULL) PostgreSQL Fitter, Happier 2021 Slide 51 of 72
  37. Partial Index Example 2 of 2 create index index_trips_on_rating_partial ON

    trips (rating) where rating IS NULL; Bitmap Index Scan on index_trips_on_rating_partial (cost=0.00..11.90 rows=750 width=0) Index Cond: (rating IS NULL) Index size Partial index size 40 kB 16 kB 60% decrease PostgreSQL Fitter, Happier 2021 Slide 52 of 72
  38. Partial Indexes Results Created replacement partial indexes for several existing

    indexes Two examples of biggest size reductions were 82% and 97% smaller On large tables this is significant PostgreSQL Fitter, Happier 2021 Slide 53 of 72
  39. Reduced Columns in Multi-Column Indexes Remove columns that are not

    needed Replace 3 columns with 2, or 2 with 1 PostgreSQL can combine multiple single column indexes! PostgreSQL Fitter, Happier 2021 Slide 54 of 72
  40. Slow Queries Results Fixed more than 10 slow queries identified

    by PgHero PostgreSQL Fitter, Happier 2021 Slide 55 of 72
  41. Performance Analysis Tools https://wiki.postgresql.org/wiki/Performance_Analysis_Tools Percona pg_stat_monitor - PostgreSQL Statistics Collector

    (extension for 11, 12, 13) pganalyze Index Advisor - Index recommendations! PostgreSQL Fitter, Happier 2021 Slide 58 of 72
  42. Query Errors Queries sometimes never finish Checking out a connection

    from the connection pool may never finish Acquiring a lock may never finish A transaction may never commit or rollback PostgreSQL Fitter, Happier 2021 Slide 60 of 72
  43. Statement Timeout Sets a maximum time a query can run

    before being canceled Parameter Example value statement_timeout 4-5 seconds PostgreSQL Fitter, Happier 2021 Slide 62 of 72
  44. Checkout Timeout Sets a maximum time to wait for a

    connection to be available Parameter Example value checkout_timeout 4-5 seconds PostgreSQL Fitter, Happier 2021 Slide 63 of 72
  45. Lock Timeout Sets a maximum time to wait to acquire

    a lock SHOW lock_timeout SET LOCAL lock_timeout = '5s' PostgreSQL Fitter, Happier 2021 Slide 64 of 72
  46. Idle Transactions Active Record manages its own idle transactions in

    connection pool Sets a maximum time a connection can be idle before being disconnected Idle transactions can prevent vacuum from running Parameter Default idle_timeout 300s (Rails default) idle_in_transaction_session_timeout 24 hours? PostgreSQL Fitter, Happier 2021 Slide 65 of 72
  47. Replication Errors Replication lag that gets too high can stop

    replication entirely Rebuilding indexes is IO intensive which will cause a replication lag spike Replication can conflict with vacuum on primary PostgreSQL Fitter, Happier 2021 Slide 66 of 72
  48. Special Recognition! Past team for opportunities to learn Special thanks

    to David Practice runs: Jamie, current team, Lukas, Andrew, and Craig PostgreSQL community: Documentation, IRC, Slack PostgreSQL Fitter, Happier 2021 Slide 68 of 72
  49. Connections and Space Summary Added a connection pooler to increase

    performance and enable additional scale Re-created 30 indexes to eliminate bloat Removed more than 100 unused indexes Distributed write load by relocating metrics tables to their own database Tuned Autovacuum to run more frequently and for longer PostgreSQL Fitter, Happier 2021 Slide 69 of 72
  50. Performance and Errors Summary Added a dashboard to show slow

    queries and fixed the top 10 Created awareness of index maintenance cost especially for high write tables Moved more than 10 queries to a read replica Added timeouts to increase resiliency from errors related to queries, database connections, and locks PostgreSQL Fitter, Happier 2021 Slide 70 of 72