Postgres 10, Performance and You

Postgres 10 is out this year, with a whole host of features you won't want to miss. Join the Heroku data team as we take a deep dive into parallel queries, native JSON indexes, and other performance-packed features in PostgreSQL. Whether you've got no idea what Postgres version you're using or you had a bowl of devops for dinner, you won't want to miss this talk.

Gabe Enslein

April 18, 2018

Transcript

  1. PostgreSQL 10,
    Performance and You
    Gabriel Enslein

  2. Who here uses Postgres?

  3. Why PostgreSQL is great
    Native JSON support
    Native Hash support
    Strict Data Integrity commitment and enforcement
    Continued excellence in subqueries
    Numerous extensions for data flexibility
    SQL standards supported natively

  4. How PostgreSQL 10 got even better
    ● Native Table partitioning
    ● Hash Index resiliency
    ● Native Full-Text Search for JSON and JSONB
    ● Parallel Query capabilities for Indexes, Outer Joins
    ● Query Analyzer accuracy with parallel changes

  5. Terms/Phrases in the talk you’ll hear
    ● Bloat
    ● Vacuum
    ● Sequential scan
    ● Replication
    ● Pruning
    ● Statistics
    If you want to learn more, see me after

  6. Who is this Gabriel character?
    Joined Heroku Data late 2016
    Sinatra and RoR API Services
    Focused on Postgres High Availability and data growth
    PostgreSQL, MySQL, SQL Server, Cassandra,
    Couchbase…(please no more)

  7. Self-Help/Life Coaching is my passion on the side

  8. Super Secret Project: Social interest platform
    Each person as a Customer:
    ● Customers will have profiles
    ● Each customer can create posts from areas of interest
    ○ Favorite Quotes
    ○ TV Shows
    ○ Movies
    ○ Foods
    ○ Etc.
    ● Imagine millions of users and hundreds of thousands of posts per day

  9. Great idea to spread uplifting thoughts on the Internet!
    class CreateQuotes < ActiveRecord::Migration[5.1]
      def change
        create_table :quotes do |t|
          t.integer :character_id
          t.text :quote
          t.timestamps
        end
      end
    end
    Sharing Inspirational Quotes

  10. Each row size: 0.5 kB
    Our DB has 16 GB of RAM
    Most quotes are never updated after their first week posted
    We usually only want quotes posted or updated in the last week
    Uh-oh, we have 750 million quotes and counting

  11. Quotes table size: 375 GB (750 million rows × 0.5 kB each)
    Even 1/20 of the table (~19 GB) is more memory than we have
    Here is what Postgres will do:
    ● It pulls each record from disk and attempts to put it into memory.
    ● Postgres does this because memory is fast, and if you needed a record once, you might need it again soon.
    Uh-oh, we have 750 million quotes and counting

  12. ● What happens if you’re putting things into memory that you don’t use?
    ○ Postgres will use temp disk space to manage large queries
    ● It caches each whole record, then creates pointers to disk for data overflow
    ● Postgres can stop accepting connections if we run too many of these queries at
    the same time
    But the dataset is too big for memory; what happens now?

  13. Who likes fast queries?
    Who likes to keep the database up?
    Who likes having database connections?
    But the dataset is too big for memory; what happens now?

  14. Great idea to spread uplifting thoughts on the Internet!
    class CreateQuotes < ActiveRecord::Migration[5.1]
      def up
        # Postgres 10 requires the partition key at CREATE TABLE time
        # (there is no ALTER TABLE ... PARTITION BY), so the parent
        # table is created in raw SQL.
        execute <<-SQL
          CREATE TABLE quotes (
            id bigserial,
            character_id integer,
            quote text,
            created_at timestamp NOT NULL,
            updated_at timestamp NOT NULL
          ) PARTITION BY RANGE (updated_at);
        SQL
      end

      def down
        drop_table :quotes
      end
    end
    Sharing Inspirational Quotes: Native Partitioning!

  15. Native Table Partitioning
    Entirely avoid the VACUUM overhead caused by a bulk DELETE: dropping a whole partition removes its rows without creating bloat
    When the most heavily accessed rows of a table are in a single partition or a small number of partitions, lookups are faster
    Seldom-used data can be migrated to cheaper, slower storage media
    Why use native partitioning?

  16. Native Table Partitioning
    Query performance can improve dramatically
    Performance improves via a sequential scan of a single partition instead of index lookups and random-access reads across the whole table
    Bulk loads and deletes can be accomplished by adding or removing partitions
    Native partitioning is integrated into Postgres internal operations now
    pg_partman is still supported, but it isn't built into 10's new optimizations
    Why use native partitioning?
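    A minimal sketch of day-to-day operations on the partitioned quotes table (the partition name and dates below are hypothetical):

    -- Each range of updated_at gets its own child table:
    CREATE TABLE quotes_2018_04 PARTITION OF quotes
      FOR VALUES FROM ('2018-04-01') TO ('2018-05-01');

    -- A bulk delete of stale quotes is just dropping a partition;
    -- no row-by-row DELETE, no VACUUM debt:
    DROP TABLE quotes_2018_04;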

  17. Partitioning isn't natively supported in ORMs
    Some gems have attempted to integrate partitioning into ActiveRecord
    Partitions can be created from subsections of the original table's partition key (month, year, quote text, etc.)
    Subpartitions can be created on those partitions as well
    Native Table Partitioning
    Noteworthy items

  18. Market Pivot: Home Delivery

  19. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
        end
      end
    end
    Competing with delivery businesses
    We will compete with home delivery companies, direct to customers from the platform

  20. Customer shipping addresses
    But there are problems!
    ● Our delivery drivers have a hard time finding addresses
    ● Addresses aren’t uniform across North America
    ● Customer satisfaction is very important; special instructions should be considered
    Competing with delivery businesses

  21. Customer shipping addresses
    How can we store all of this seemingly unstructured data?
    Introducing hstore
    ● hstore is not new
    ● it's a key/value store in Postgres
    Let's implement the feature with hstore to track extra delivery info
    Competing with delivery businesses

  22. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        enable_extension "hstore"
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
          # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
          t.hstore :usage
        end
      end
    end
    Competing with delivery businesses

  23. Customer shipping addresses
    Wait, why is the address lookup so slow?
    The addresses table has 250 million records
    ● Customers have multiple addresses
    ● People in apartments move frequently but don't remove their old addresses
    Our drivers don't want to wait on slow updates out on the road
    Luckily we can add an index and really improve lookups!
    Competing with delivery businesses

  24. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        enable_extension "hstore"
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
          # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
          t.hstore :usage
        end
        execute "CREATE INDEX addresses_usage ON addresses USING hash (usage)"
      end
    end
    Competing with delivery businesses

  25. But wait...
    Hash Index resiliency Postgres 9.x
    ERROR: could not read block 32570 of relation base/16390/2663:
    read only 0 of 8192 bytes
    Customer shipping addresses
    It broke though, what happened?

  26. Hash Index resiliency Postgres 9.x

  27. Hash Index resiliency Postgres 9.x
    Why the resiliency issues with hash indexes?
    ● Not treated as first-class citizens
    ● Not WAL-logged, so not replicated and not crash-safe
    ● Hard to maintain
    ● Misleading failure modes (XX001, XX002)
    ● Performance was generally poor under heavy load

  28. Hash Index resiliency Postgres 9.x
    If customers add/update addresses frequently, the hash index becomes stale
    Replicas/followers won't have the index at all
    The index has to be rebuilt if the database crashes (OS goes down, power outage, etc.)
    Customer shipping addresses

  29. Hash Index resiliency Postgres 9.x
    Recreating large hash indexes can be time-consuming
    Be careful not to rely on hash indexes; severely outdated indexes can imitate corruption (XX001, XX002)
    What's the solution?!
    Customer shipping addresses

  30. Now first-class citizens
    Crash safe
    Replicated
    Reduced locking during bucket splits
    Faster lookups
    More even index growth
    Single-page pruning
    Hash Index resiliency Postgres 10
    Crash Safe, Replicable Hash Indexes

  31. Customer shipping addresses
    class CreateAddresses < ActiveRecord::Migration[5.1]
      def change
        enable_extension "hstore"
        create_table :addresses do |t|
          t.integer :customer_id
          t.text :street_address
          t.text :city
          t.text :state
          t.text :zip_code
          t.text :country_code
          # Unstructured data, differing between locales (US vs. EU vs. CA, etc.)
          t.hstore :usage
        end
        execute "CREATE INDEX addresses_usage ON addresses USING hash (usage)"
      end
    end
    Hash Index resiliency Postgres 10

  32. Hash Index resiliency Postgres 10
    Customer shipping addresses
    The usage index will persist after a crash or failure
    Faster lookups on the hstore index for whole values
    Reduced locking to alleviate concurrency issues
    - multiple edits to a single address in quick succession are more likely to succeed
    Replicas/followers will have the indexes
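    A sketch of a lookup the hash index can serve. Hash indexes only accelerate whole-value equality, so key-level filters (say, on usage -> 'type') would need a gin/gist or expression index instead; the literal below is hypothetical:

    SELECT *
    FROM addresses
    WHERE usage = 'primary => true, type => business'::hstore;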

  33. New Feature: Curated
    Media Content

  34. Curated Content: Characters of Interest
    Interest Outlet: Our favorite Characters
    Hundreds of thousands of TV Shows and Movies
    Each show or movie can have many characters (some have hundreds)
    Varying amounts of information may be acquired over time about each
    character or show

  35. Curated Content: Characters of Interest
    Specific information for each character of interest:
    ● Genres
    ● Catchphrases
    ● Famous accessories (items, weapons, clothes)
    We need an unstructured way of keeping that data colocated
    Interest Outlet: Our favorite Characters

  36. Storing JSON in Postgres
    Curated Content: Characters of Interest
    class CreateCharacters < ActiveRecord::Migration[5.1]
      def change
        create_table :characters do |t|
          t.integer :media_id
          t.text :name
          t.jsonb :metadata
        end
      end
    end

  37. Curated Content: Characters of Interest
    Storing JSON in Postgres
    Millions of characters, and JSON blobs are much bigger than single-line quotes
    The dynamic, unstructured nature means dynamic sizes for each record
    How can we keep unstructured data without having to parse it all?

  38. Curated Content: Characters of Interest
    Storing JSON in Postgres
    Multiple Tables?
    Many Relationships?
    Partitions?
    Is there something better?

  39. Full text JSON Searching Postgres 10
    ● Store whole request bodies for inspection
    ● Native indexing in Postgres
    ● Allows for more NoSQL style operations
    ○ dynamic schema flexibility
    ● Development agility in initial stages
    Desire for searching JSON

  40. Full text JSON Searching Postgres 10
    Curated Content: Characters of Interest
    class CreateCharacters < ActiveRecord::Migration[5.1]
      def change
        create_table :characters do |t|
          t.integer :media_id
          t.text :name
          t.jsonb :metadata
        end
        execute <<-SQL
          CREATE INDEX characters_metadata ON characters
            USING gin (to_tsvector('english', metadata));
        SQL
      end
    end
    Most importantly: unstructured data flexibility!
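    A sketch of a query the GIN index above can serve: in Postgres 10, to_tsvector accepts jsonb directly, and the WHERE expression matches the index expression ('hammer' is a hypothetical search term):

    SELECT name
    FROM characters
    WHERE to_tsvector('english', metadata) @@ to_tsquery('english', 'hammer');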

  41. Benefits of JSON Full Text Searching
    All data stays colocated
    Indexes can help filter data based on existing or missing JSON keys
    More perspectives of character behavior from the existing data
    Genres, strengths, weaknesses, skills (Hammers anyone?)
    Full text JSON Searching Postgres 10
    Curated Content: Characters of Interest

  42. Revisiting Quotes: Another feature request
    People want to be re-inspired: "On this day…"
    Replay a stream of older Quote posts
    The stream should include all posters I follow or like, not just my posts
    Even getting last month's posts seems slow though; there are so many
    I want Throwback Quotes

  43. Postgres 10 introduces Parallel Query support for more query types
    Bitmap Heap Scan
    Gather Merge
    Merge Joins
    Revisiting Quotes: Another feature request
    I want Throwback Quotes

  44. Postgres 9.6 Parallel Query
    Why sequential scan?
    - The main table of a query has to be scanned
    - The scan has to be breakable into parts for the workers
    - Postgres 9.6 only supported this directly on tables
    - That meant parallel sequential scans, which can still be very slow
    Parallel Sequential Scans

  45. Postgres 10 Parallel Query optimization
    Improvements made
    - One process now scans the index of the main table
    - It builds the list of heap pages to check in memory
    - Checking the heap pages can then be split across parallel workers
    - Loads less data, requires less caching
    Parallel Bitmap Heap Scans
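    A hedged way to see this at home: with an index on updated_at, a query like the sketch below may plan a Parallel Bitmap Heap Scan in Postgres 10 (the planner decides; plan shape depends on data and settings):

    SET max_parallel_workers_per_gather = 4;
    EXPLAIN ANALYZE
    SELECT count(*)
    FROM quotes
    WHERE updated_at >= now() - interval '1 month';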

  46. Postgres 9.6 Parallel Query
    What is the Gather mechanism?
    - Postgres 9.6 introduced it as a way to collect results from workers
    - Results were collected in arbitrary orderings from all workers
    Gather: Worker result collection

  47. Postgres 9.6 Parallel Query
    What is the Gather mechanism?
    - If workers are producing sorted output, this adds overhead
    - Sorting parallel chunks, merging them in arbitrary order, then sorting again; seems good, right?
    - What about really large result sets? (Remember, we have lots of quotes)
    Gather: Worker result collection

  48. Postgres 9.6 Parallel Query
    Gather: Worker result collection
    What are my primes?

  49. Postgres 9.6 Parallel Query
    Gather: Worker result collection
    What are my primes?
    - 17 is the first node visited
    - Order is based on visit point and total items (17, 19, 11, 2, 23, 31)
    - Another pass is required to sort after gathering matching results
    - Workers return results as fast as possible, with no location context

  50. Postgres 10 Parallel Query
    What is Gather Merge?
    - Speeds up the gathering of parallel worker results
    - Sorts are taken into account during parallelization
    - Removes much of the overhead of sorting after the fact
    - Especially beneficial when the result set is large
    Gather Merge: Optimized Worker result collection
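    A sketch of a query where Gather Merge pays off: each worker can emit rows already sorted, and the leader merges the sorted streams instead of re-sorting everything (character_id 42 is a hypothetical filter; the plan is up to the planner):

    EXPLAIN ANALYZE
    SELECT *
    FROM quotes
    WHERE character_id = 42
    ORDER BY created_at DESC;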

  51. Postgres 10 Parallel Query
    Gather Merge: Optimized Worker result collection
    Revisit: What are my primes?

  52. Postgres 10 Parallel Query
    Gather Merge: Optimized Worker result collection
    - 17 is the first node visited
    - Direction and location are considered when workers collect results
    - B-tree construction naturally considers leaf vs. root node locations (insertion balancing)
    Revisit: What are my primes?

  53. Postgres 10 Parallel Query
    Gather Merge: Optimized Worker result collection
    - Tree collapsing now happens in parallel
    - Ordering is more intuitive and requires less re-sorting
    Revisit: What are my primes?

  54. Postgres 9.6 Parallel Query
    What is Quotes JOIN Customers actually doing?
    - Postgres 9.6 only allowed parallelization of nested loops and hash joins
    - Parallel portions of the plan could not feed merge joins further up
    - Indexes and ordering all added overhead instead of speeding up requests
    Merge Join: All followed Posters’ Quotes

  55. Postgres 10 Parallel Query
    What's the difference in the Parallel Merge Join?
    - Merge joins can now be performed in all parallel portions of the work
    - Outer sides of JOINs are parallelized more effectively
    - Combined with parallel index scans: higher efficiency
    - The algorithm yields natural ordering by default, without specifying ORDER BY
    Parallel Merge Join: Optimized ordering
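    A sketch of a join Postgres 10 can now parallelize with a merge join; the follows table is hypothetical, standing in for the "posters I follow" relationship:

    EXPLAIN ANALYZE
    SELECT q.*
    FROM quotes q
    JOIN follows f ON f.followed_id = q.character_id
    WHERE f.follower_id = 7
    ORDER BY q.created_at DESC;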

  56. Postgres 10 Query Analyzer
    Engineering Support: Filing and tracking issues
    Postgres 9.6 Query Analyzer vs. Postgres 10
    Issues table -> 1.8M records, 495 MB

  57. Postgres 9.6 Query Analyzer accuracy
    EXPLAIN ANALYZE
    SELECT "issues".*
    FROM "issues"
    WHERE "issues"."repo_id" = 2151
    AND "issues"."state" = 'open'
    ORDER BY created_at DESC LIMIT 20 OFFSET 0;
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------
    Limit (cost=4267.01..4267.02 rows=20 width=1233) (actual time=935.987..935.992 rows=20 loops=1)
    -> Sort (cost=4267.01..4268.21 rows=2384 width=1233) (actual time=935.985..935.988 rows=20 loops=1)
    Sort Key: created_at DESC
    Sort Method: top-N heapsort Memory: 31kB
    -> Index Scan using index_issues_on_repo_id_and_id on issues (cost=0.08..4254.33 rows=2384 width=1233) (actual time=2.877..934.092 rows=2159 loops=1)
    Index Cond: (repo_id = 2151)
    Planning time: 7.061 ms
    Execution time: 936.042 ms

  58. Postgres 10 Query Analyzer accuracy
    EXPLAIN ANALYZE
    SELECT "issues".*
    FROM "issues"
    WHERE "issues"."repo_id" = 2151
    AND "issues"."state" = 'open'
    ORDER BY created_at DESC LIMIT 20 OFFSET 0;
    QUERY PLAN
    -------------------------------------------------------------------------------------------------------------
    Limit (cost=3985.46..3985.47 rows=20 width=1232) (actual time=6.768..6.771 rows=20 loops=1)
    -> Sort (cost=3985.46..3986.58 rows=2222 width=1232) (actual time=6.767..6.768 rows=20 loops=1)
    Sort Key: created_at DESC
    Sort Method: top-N heapsort Memory: 39kB
    -> Index Scan using index_issues_on_repo_id_and_id on issues (cost=0.08..3973.64 rows=2222 width=1232) (actual time=0.062..5.952 rows=1980 loops=1)
    Index Cond: (repo_id = 2151)
    Planning time: 0.745 ms
    Execution time: 6.815 ms

  59. Planning time: 7.061 ms
    Execution time: 936.042 ms
    vs.
    Planning time: 0.745 ms
    Execution time: 6.815 ms

  60. QUERY ANALYTICS:
    ANOTHER TALK

  61. Gotchas with PostgreSQL 10
    Rails 4 doesn't support PostgreSQL 10, only up to PostgreSQL 9.6
    Partitioning is not native to any Ruby ORM
    Postgres-specific index types (JSON, Hash) are not native to any Ruby ORM
    Analytics for queries and indexes have to be done in raw SQL
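    For example, checking which indexes actually get used means querying Postgres's statistics views by hand (a sketch using the built-in pg_stat_user_indexes view):

    SELECT indexrelname, idx_scan, idx_tup_read
    FROM pg_stat_user_indexes
    ORDER BY idx_scan DESC;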

  62. Heroku Postgres improvements released
    Added support for numerous native extensions
    Managed Postgres Credentials
    Released new plans with dedicated hardware
    Improved Postgres CLI analytics, built on Postgres's improved query analytics
    Heroku Postgres Connection Pooling

  63. Heroku Postgres extensions enabled
    Made a list, vetted it twice!
    Top 5 Favorites added
    lo
    sslinfo
    tcn
    timetravel
    address_standardizer
    For the complete list, check out the Heroku Postgres Extensions article

  64. Postgres permissions are difficult to reason about
    Managed credentials create a simplified view of Postgres roles
    Still allows for customizable permissions on roles and tables
    Heroku Postgres Managed Credentials
    What are managed credentials?

  65. Heroku Postgres Managed Credentials
    Footguns considered, guard rails added
    ● Special characters (No one likes Bobby Tables)
    ● Name lengths
    ● Best practices published
    What are managed credentials?

  66. Heroku Postgres Managed Credentials
    Separation of concerns
    Principle of least privilege
    Better control of external access:
    ● E.g. Don’t want third-party analytics reading PII data
    Why use managed credentials?
    For more info, check out the Heroku Postgres Credentials Devcenter Article

  67. Heroku PGX Plans Released
    Benefits to using PGX
    Optimized for heavier I/O work
    Better database and disk caching
    More stable performance
    Increased accuracy performing query analytics
    Enhanced CPU capability for optimizing Parallel Queries

  68. Heroku PGX Plans Released
    Dedicated Provisioned Disk I/O
    Doubled CPU Cores at tier-4 and up
    RAM sizes doubled at tier-4 and up
    Increased disk sizes at every tier
    Prices stayed the same or lowered!
    Big Takeaways
    Read more about the
    Heroku Postgres PGX
    plans on our blog post

  69. Heroku Postgres CLI analytics
    Database tuning and monitoring still vital
    - https://devcenter.heroku.com/articles/heroku-postgres-database-tuning
    Extended CLI analytics improved from Postgres 10 internals changes
    - https://github.com/heroku/heroku-pg-extras
    heroku pg:diagnose #and more!

  70. Heroku Connection Pooling: PgBouncer Beta
    ● Manages connections for your Postgres Database
    ○ Reduces need to open new connections
    ● Lives on the Database Server
    ● Uses transaction-level connection pooling
    What’s PgBouncer?

  71. Heroku Connection Pooling: PgBouncer Beta
    ● Better visibility into connection statistics
    ● Exposes guard rails to protect your database
    ● Excellent for high-throughput asynchronous workloads
    Why use Connection pooling?

  72. Heroku Connection Pooling: PgBouncer Beta
    On-server connection pooling is only available on Postgres 9.6 and 10
    - Read more about it: Postgres connection pooling
    - Still in public beta, more features to come!
    - Default direct-to-database connections remain available after enabling
    Still offering the client-side buildpack
    - Client-side PgBouncer article

  73. Questions welcome!
    Come by the Booth!
