Postgres 10, Performance and You

Postgres 10 is out this year, with a whole host of features you won't want to miss. Join the Heroku data team as we take a deep dive into parallel queries, native JSON indexes, and other performance-packed features in PostgreSQL. Whether you've got no idea what Postgres version you're using or you had a bowl of devops for dinner, you won't want to miss this talk.

Gabe Enslein

April 18, 2018

Transcript

  1. PostgreSQL 10,
    Performance and You
    Gabriel Enslein

  2. Who here uses Postgres?

  3. Why PostgreSQL is great
    Native JSON support
    Native Hash support
    Strict Data Integrity commitment and enforcement
    Continued excellence in subqueries
    Numerous extensions for data flexibility
    SQL standards supported natively

  4. How PostgreSQL 10 got even better
    ● Native Table partitioning
    ● Hash Index resiliency
    ● Native Full-Text Search for JSON and JSONB
    ● Parallel Query capabilities for Indexes, Outer Joins
    ● Query Analyzer accuracy with parallel changes

  5. Terms/Phrases in the talk you’ll hear
    ● Bloat
    ● Vacuum
    ● Sequential scan
    ● Replication
    ● Pruning
    ● Statistics
    If you want to learn more, see me after

  6. Who is this Gabriel character?
    Joined Heroku Data late 2016
    Sinatra and RoR API Services
    Focused on Postgres High Availability and data growth
    PostgreSQL, MySQL, SQL Server, Cassandra,
    Couchbase…(please no more)

  7. Self-Help/Life Coaching is my passion on the side

  8. Super Secret Project: Social interest platform
    Each person as a Customer:
    ● Customers will have profiles
    ● Each customer can create posts from areas of interest
    ○ Favorite Quotes
    ○ TV Shows
    ○ Movies
    ○ Foods
    ○ Etc.
    ● Imagine millions of users and hundreds of thousands of posts per day

  9. Great idea to spread uplifting thoughts on the Internet!
    class CreateQuotes < ActiveRecord::Migration[5.1]
      def change
        create_table :quotes do |t|
          t.integer :character_id
          t.text :quote
          t.timestamps
        end
      end
    end
    Sharing Inspirational Quotes

  10. Each row size: 0.5 kB
    Our DB has 16 GB of RAM
    Most quotes are never updated after their first week posted
    We usually only want quotes posted or updated in the last week
    Uh-oh, we have 750 million quotes and counting

  11. Quotes table size: 375 GB (750 million rows × 0.5 kB each)
    Even 1/20 of the table (~19 GB) is more memory than we have
    Here is what Postgres will do:
    ● It pulls each record from disk and attempts to put it into memory.
    ● Postgres does this because memory is fast, and if you needed a record once, you might need it again soon.
    Uh-oh, we have 750 million quotes and counting

  12. ● What happens if you’re putting things into memory that you don’t use?
    ○ Postgres will use temp disk space to manage large queries
    ● It caches each whole record, then creates pointers to disk for data overflow
    ● Postgres can stop accepting connections if we run too many of these queries at
    the same time
    But the dataset is too big for memory; what happens now?

  13. Who likes fast queries?
    Who likes to keep the database up?
    Who likes having database connections?
    But the dataset is too big for memory; what happens now?

  14. Great idea to spread uplifting thoughts on the Internet!
    class CreateQuotes < ActiveRecord::Migration[5.1]
      def up
        # Postgres 10 requires the partition key at CREATE TABLE time
        # (there is no ALTER TABLE ... PARTITION BY), so the parent
        # table is created in raw SQL.
        execute <<-SQL
          CREATE TABLE quotes (
            id bigserial,
            character_id integer,
            quote text,
            created_at timestamp NOT NULL,
            updated_at timestamp NOT NULL
          ) PARTITION BY RANGE (updated_at);
        SQL
      end

      def down
        drop_table :quotes
      end
    end
    Sharing Inspirational Quotes: Native Partitioning!

  15. Native Table Partitioning
    Entirely avoid the VACUUM overhead caused by a bulk DELETE: dropping a whole partition removes its rows without creating bloat
    When the most heavily accessed rows of a table are in a single partition or a small number of partitions, lookups are faster
    Seldom-used data can be migrated to cheaper, slower storage media
    Why use native partitioning?

  16. Native Table Partitioning
    Query performance can improve dramatically
    Performance improves via a sequential scan of a single partition instead of index lookups and random-access reads across the whole table
    Bulk loads and deletes can be accomplished by adding or removing partitions
    Native partitioning is integrated into Postgres internal operations now
    pg_partman is still supported, but it isn't built into 10's new optimizations
    Why use native partitioning?
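    A minimal sketch of day-to-day operations on the partitioned quotes table (the partition name and dates below are hypothetical):

    -- Each range of updated_at gets its own child table:
    CREATE TABLE quotes_2018_04 PARTITION OF quotes
      FOR VALUES FROM ('2018-04-01') TO ('2018-05-01');

    -- A bulk delete of stale quotes is just dropping a partition;
    -- no row-by-row DELETE, no VACUUM debt:
    DROP TABLE quotes_2018_04;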

  17. Partitioning isn't natively supported in ORMs
    Some gems have attempted to integrate partitioning into ActiveRecord
    Partitions can be created from subsections of the original table's partition key (month, year, quote text, etc.)
    Subpartitions can be created on those partitions as well
    Native Table Partitioning
    Noteworthy items

  18. Market Pivot: Home Delivery

  19. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
        end
      end
    end
    Competing with delivery businesses
    We will compete with home delivery companies, direct to customers from the platform

  20. Customer shipping addresses
    But there are problems!
    ● Our delivery drivers have a hard time finding addresses
    ● Addresses aren’t uniform across North America
    ● Customer satisfaction is very important; special instructions should be considered
    Competing with delivery businesses

  21. Customer shipping addresses
    How can we store all of this seemingly unstructured data?
    Introducing hstore
    ● hstore is not new
    ● it's a key/value store in Postgres
    Let's implement the feature with hstore to track extra delivery info
    Competing with delivery businesses

  22. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        enable_extension "hstore"
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
          # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
          t.hstore :usage
        end
      end
    end
    Competing with delivery businesses

  23. Customer shipping addresses
    Wait, why is the address lookup so slow?
    The addresses table has 250 million records
    ● Customers have multiple addresses
    ● People in apartments move frequently but don't remove their old addresses
    Our drivers don't want to wait on slow updates out on the road
    Luckily we can add an index and really improve lookups!
    Competing with delivery businesses

  24. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        enable_extension "hstore"
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
          # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
          t.hstore :usage
        end
        execute "CREATE INDEX addresses_usage ON addresses USING hash (usage)"
      end
    end
    Competing with delivery businesses

  25. But wait...
    Hash Index resiliency Postgres 9.x
    ERROR: could not read block 32570 of relation base/16390/2663:
    read only 0 of 8192 bytes
    Customer shipping addresses
    It broke though, what happened?

  26. Hash Index resiliency Postgres 9.x

  27. Hash Index resiliency Postgres 9.x
    Why the resiliency issues with hash indexes?
    ● Not treated as first-class citizens
    ● Not WAL-logged, so not replicated and not crash-safe
    ● Hard to maintain
    ● Misleading failure modes (XX001, XX002)
    ● Performance was generally poor under heavy load

  28. Hash Index resiliency Postgres 9.x
    If customers add/update addresses frequently, the hash index becomes stale
    Replicas/followers won't have the index at all
    The index has to be rebuilt if the database crashes (OS goes down, power outage, etc.)
    Customer shipping addresses

  29. Hash Index resiliency Postgres 9.x
    Recreating large hash indexes can be time-consuming
    Be careful not to rely on hash indexes; severely outdated indexes can imitate corruption (XX001, XX002)
    What's the solution?!
    Customer shipping addresses

  30. Now first-class citizens
    Crash safe
    Replicated
    Reduced locking during bucket splits
    Faster lookups
    More even index growth
    Single-page pruning
    Hash Index resiliency Postgres 10
    Crash Safe, Replicable Hash Indexes

  31. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        enable_extension "hstore"
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
          # Unstructured data, differing between locales (US vs. EU vs. CA, etc.)
          t.hstore :usage
        end
        execute "CREATE INDEX addresses_usage ON addresses USING hash (usage)"
      end
    end
    Hash Index resiliency Postgres 10

  32. Hash Index resiliency Postgres 10
    Customer shipping addresses
    The usage index will persist after a crash or failure
    Faster lookups on the hstore index for whole values
    Reduced locking to alleviate concurrency issues
    - multiple edits to a single address in quick succession are more likely to succeed
    Replicas/followers will have the indexes
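    A sketch of a lookup the hash index can serve. Hash indexes only accelerate whole-value equality, so key-level filters (say, on usage -> 'type') would need a gin/gist or expression index instead; the literal below is hypothetical:

    SELECT *
    FROM addresses
    WHERE usage = 'primary => true, type => business'::hstore;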

  33. New Feature: Curated
    Media Content

  34. Curated Content: Characters of Interest
    Interest Outlet: Our favorite Characters
    Hundreds of thousands of TV Shows and Movies
    Each show or movie can have many characters (some have hundreds)
    Varying amounts of information may be acquired over time about each
    character or show

  35. Curated Content: Characters of Interest
    Specific information for each character of interest:
    ● Genres
    ● Catchphrases
    ● Famous accessories (items, weapons, clothes)
    We need an unstructured way of keeping that data colocated
    Interest Outlet: Our favorite Characters

  36. Storing JSON in Postgres
    Curated Content: Characters of Interest
    class CreateCharacters < ActiveRecord::Migration[5.1]
      def change
        create_table :characters do |t|
          t.integer :media_id
          t.text :name
          t.jsonb :metadata
        end
      end
    end

  37. Curated Content: Characters of Interest
    Storing JSON in Postgres
    Millions of characters, and JSON blobs are much bigger than single-line quotes
    The dynamic, unstructured nature means dynamic sizes for each record
    How can we keep unstructured data without having to parse it all?

  38. Curated Content: Characters of Interest
    Storing JSON in Postgres
    Multiple Tables?
    Many Relationships?
    Partitions?
    Is there something better?

  39. Full text JSON Searching Postgres 10
    ● Store whole request bodies for inspection
    ● Native indexing in Postgres
    ● Allows for more NoSQL style operations
    ○ dynamic schema flexibility
    ● Development agility in initial stages
    Desire for searching JSON

  40. Full text JSON Searching Postgres 10
    Curated Content: Characters of Interest
    class CreateCharacters < ActiveRecord::Migration[5.1]
      def change
        create_table :characters do |t|
          t.integer :media_id
          t.text :name
          t.jsonb :metadata
        end
        execute <<-SQL
          CREATE INDEX characters_metadata ON characters
            USING gin (to_tsvector('english', metadata));
        SQL
      end
    end
    Most importantly: unstructured data flexibility!
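    A sketch of a query the GIN index above can serve: in Postgres 10, to_tsvector accepts jsonb directly, and the WHERE expression matches the index expression ('hammer' is a hypothetical search term):

    SELECT name
    FROM characters
    WHERE to_tsvector('english', metadata) @@ to_tsquery('english', 'hammer');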

  41. Benefits of JSON Full Text Searching
    All data stays colocated
    Indexes can help filter data based on existing or missing JSON keys
    More perspectives of character behavior from the existing data
    Genres, strengths, weaknesses, skills (Hammers anyone?)
    Full text JSON Searching Postgres 10
    Curated Content: Characters of Interest

  42. Revisiting Quotes: Another feature request
    People want to be re-inspired: "On this day…"
    Replay a stream of older Quote posts
    The stream should include all posters I follow or like, not just my posts
    Even getting last month's posts seems slow though; there are so many
    I want Throwback Quotes

  43. Postgres 10 introduces Parallel Query support for more query types
    Bitmap Heap Scan
    Gather Merge
    Merge Joins
    Revisiting Quotes: Another feature request
    I want Throwback Quotes

  44. Postgres 9.6 Parallel Query
    Why sequential scan?
    - The main table of a query has to be scanned
    - The scan has to be breakable into parts for the workers
    - Postgres 9.6 only supported this directly on tables
    - That meant parallel sequential scans, which can still be very slow
    Parallel Sequential Scans

  45. Postgres 10 Parallel Query optimization
    Improvements made
    - One process now scans the index of the main table
    - It builds the list of heap pages to check in memory
    - Checking the heap pages can then be split across parallel workers
    - Loads less data, requires less caching
    Parallel Bitmap Heap Scans
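    A hedged way to see this at home: with an index on updated_at, a query like the sketch below may plan a Parallel Bitmap Heap Scan in Postgres 10 (the planner decides; plan shape depends on data and settings):

    SET max_parallel_workers_per_gather = 4;
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM quotes
    WHERE updated_at >= now() - interval '1 month';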

  46. Postgres 9.6 Parallel Query
    What is the Gather mechanism?
    - Postgres 9.6 introduced it as a way to collect results from workers
    - Results were collected in arbitrary orderings from all workers
    Gather: Worker result collection

  47. Postgres 9.6 Parallel Query
    What is the Gather mechanism?
    - If workers are producing sorted output, this adds overhead
    - Sorting parallel chunks, merging them in arbitrary order, then sorting again; seems good, right?
    - What about really large result sets? (Remember, we have lots of quotes)
    Gather: Worker result collection

  48. Postgres 9.6 Parallel Query
    Gather: Worker result collection
    What are my primes?

  49. Postgres 9.6 Parallel Query
    Gather: Worker result collection
    What are my primes?
    - 17 is the first node visited
    - Order is based on visit point and total items (17, 19, 11, 2, 23, 31)
    - Another pass is required to sort after gathering matching results
    - Workers return results as fast as possible, with no location context

  50. Postgres 10 Parallel Query
    What is Gather Merge?
    - Speeds up the gathering of parallel worker results
    - Sorts are taken into account during parallelization
    - Removes much of the overhead of sorting after the fact
    - Especially beneficial when the result set is large
    Gather Merge: Optimized Worker result collection
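    A sketch of a query where Gather Merge pays off: each worker can emit rows already sorted, and the leader merges the sorted streams instead of re-sorting everything (character_id 42 is a hypothetical filter; the plan is up to the planner):

    EXPLAIN ANALYZE
    SELECT *
    FROM quotes
    WHERE character_id = 42
    ORDER BY created_at DESC;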

  51. Postgres 10 Parallel Query
    Gather Merge: Optimized Worker result collection
    Revisit: What are my primes?

  52. Postgres 10 Parallel Query
    Gather Merge: Optimized Worker result collection
    - 17 is the first node visited
    - Direction and location are considered when workers collect results
    - B-tree construction naturally considers leaf vs. root node locations (insertion balancing)
    Revisit: What are my primes?

  53. Postgres 10 Parallel Query
    Gather Merge: Optimized Worker result collection
    - Tree collapsing now happens in parallel
    - Ordering is more intuitive and requires less re-sorting
    Revisit: What are my primes?

  54. Postgres 9.6 Parallel Query
    What is Quotes JOIN Customers actually doing?
    - Postgres 9.6 only allowed parallelization of nested loops and hash joins
    - Parallel portions of the plan could not feed merge joins further up
    - Indexes and ordering all added overhead instead of speeding up requests
    Merge Join: All followed Posters’ Quotes

  55. Postgres 10 Parallel Query
    What's the difference in the Parallel Merge Join?
    - Merge joins can now be performed in all parallel portions of the work
    - Outer sides of JOINs are parallelized more effectively
    - Combined with parallel index scans: higher efficiency
    - The algorithm yields natural ordering by default, without specifying ORDER BY
    Parallel Merge Join: Optimized ordering
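    A sketch of a join Postgres 10 can now parallelize with a merge join; the follows table is hypothetical, standing in for the "posters I follow" relationship:

    EXPLAIN ANALYZE
    SELECT q.*
    FROM quotes q
    JOIN follows f ON f.followed_id = q.character_id
    WHERE f.follower_id = 7
    ORDER BY q.created_at DESC;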

  56. Postgres 10 Query Analyzer
    Engineering Support: Filing and tracking issues
    Postgres 9.6 Query Analyzer vs. Postgres 10
    Issues table -> 1.8M records, 495 MB

  57. Postgres 9.6 Query Analyzer accuracy
    EXPLAIN ANALYZE
    SELECT "issues".*
    FROM "issues"
    WHERE "issues"."repo_id" = 2151
    AND "issues"."state" = 'open'
    ORDER BY created_at DESC LIMIT 20 OFFSET 0;
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------
    Limit (cost=4267.01..4267.02 rows=20 width=1233) (actual time=935.987..935.992 rows=20 loops=1)
    -> Sort (cost=4267.01..4268.21 rows=2384 width=1233) (actual time=935.985..935.988 rows=20 loops=1)
    Sort Key: created_at DESC
    Sort Method: top-N heapsort Memory: 31kB
    -> Index Scan using index_issues_on_repo_id_and_id on issues (cost=0.08..4254.33 rows=2384 width=1233) (actual time=2.877..934.092 rows=2159 loops=1)
    Index Cond: (repo_id = 2151)
    Planning time: 7.061 ms
    Execution time: 936.042 ms

  58. Postgres 10 Query Analyzer accuracy
    EXPLAIN ANALYZE
    SELECT "issues".*
    FROM "issues"
    WHERE "issues"."repo_id" = 2151
    AND "issues"."state" = 'open'
    ORDER BY created_at DESC LIMIT 20 OFFSET 0;
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------
    Limit (cost=3985.46..3985.47 rows=20 width=1232) (actual time=6.768..6.771 rows=20 loops=1)
    -> Sort (cost=3985.46..3986.58 rows=2222 width=1232) (actual time=6.767..6.768 rows=20 loops=1)
    Sort Key: created_at DESC
    Sort Method: top-N heapsort Memory: 39kB
    -> Index Scan using index_issues_on_repo_id_and_id on issues (cost=0.08..3973.64 rows=2222 width=1232) (actual time=0.062..5.952 rows=1980 loops=1)
    Index Cond: (repo_id = 2151)
    Planning time: 0.745 ms
    Execution time: 6.815 ms

  59. Planning time: 7.061 ms
    Execution time: 936.042 ms
    vs.
    Planning time: 0.745 ms
    Execution time: 6.815 ms

  60. QUERY ANALYTICS:
    ANOTHER TALK

  61. Gotchas with PostgreSQL 10
    Rails 4 doesn't support PostgreSQL 10, only up to PostgreSQL 9.6
    Partitioning is not native to any Ruby ORM
    Postgres-specific index types (JSON, Hash) are not native to any Ruby ORM
    Analytics for queries and indexes have to be done in raw SQL
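    For example, checking which indexes actually get used means querying Postgres's statistics views by hand (a sketch using the built-in pg_stat_user_indexes view):

    SELECT indexrelname, idx_scan, idx_tup_read
    FROM pg_stat_user_indexes
    ORDER BY idx_scan DESC;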

  62. Heroku Postgres improvements released
    Added support for numerous native extensions
    Managed Postgres Credentials
    Released new plans with dedicated hardware
    Improved Postgres CLI analytics, built on Postgres's improved query analytics
    Heroku Postgres Connection Pooling

  63. Heroku Postgres extensions enabled
    Made a list, vetted it twice!
    Top 5 Favorites added
    lo
    sslinfo
    tcn
    timetravel
    address_standardizer
    For the complete list, check out the Heroku Postgres Extensions article

  64. Postgres permissions are difficult to reason about
    Managed credentials create a simplified view of Postgres roles
    Still allows for customizable permissions on roles and tables
    Heroku Postgres Managed Credentials
    What are managed credentials?

  65. Heroku Postgres Managed Credentials
    Footguns considered, guard rails added
    ● Special characters (No one likes Bobby Tables)
    ● Name lengths
    ● Best practices published
    What are managed credentials?

  66. Heroku Postgres Managed Credentials
    Separation of concerns
    Principle of least privilege
    Better control of external access:
    ● E.g. Don’t want third-party analytics reading PII data
    Why use managed credentials?
    For more info, check out the Heroku Postgres Credentials Devcenter Article

  67. Heroku PGX Plans Released
    Benefits to using PGX
    Optimized for heavier I/O work
    Better database and disk caching
    More stable performance
    Increased accuracy performing query analytics
    Enhanced CPU capability for optimizing Parallel Queries

  68. Heroku PGX Plans Released
    Dedicated Provisioned Disk I/O
    Doubled CPU Cores at tier-4 and up
    RAM sizes doubled at tier-4 and up
    Increased disk sizes at every tier
    Prices stayed the same or lowered!
    Big Takeaways
    Read more about the
    Heroku Postgres PGX
    plans on our blog post

  69. Heroku Postgres CLI analytics
    Database tuning and monitoring still vital
    - https://devcenter.heroku.com/articles/heroku-postgres-database-tuning
    Extended CLI analytics improved from Postgres 10 internals changes
    - https://github.com/heroku/heroku-pg-extras
    heroku pg:diagnose #and more!

  70. Heroku Connection Pooling: PgBouncer Beta
    ● Manages connections for your Postgres Database
    ○ Reduces need to open new connections
    ● Lives on the Database Server
    ● Uses transaction-level connection pooling
    What’s PgBouncer?

  71. Heroku Connection Pooling: PgBouncer Beta
    ● Better visibility into connection statistics
    ● Exposes guard rails to protect your database
    ● Excellent for high-throughput asynchronous workloads
    Why use Connection pooling?

  72. Heroku Connection Pooling: PgBouncer Beta
    On-server connection pooling is only available on Postgres 9.6 and 10
    - Read more about it: Postgres connection pooling
    - Still in public beta, more features to come!
    - Default direct-to-database connections remain available after enabling
    Still offering the client-side buildpack
    - Client-side PgBouncer article

  73. Questions welcome!
    Come by the Booth!
