Slide 1

PostgreSQL 10, Performance and You
Gabriel Enslein

Slide 2

Who here uses Postgres?

Slide 3

Why PostgreSQL is great
● Native JSON support
● Native hash support
● Strict data integrity commitment and enforcement
● Continued excellence in subqueries
● Numerous extensions for data flexibility
● Standards built in natively

Slide 4

How PostgreSQL 10 got even better
● Native table partitioning
● Hash index resiliency
● Native full-text search for JSON and JSONB
● Parallel query capabilities for indexes and outer joins
● Query analyzer accuracy with parallel changes

Slide 5

Terms/phrases you'll hear in this talk
● Bloat
● Vacuum
● Sequential scan
● Replication
● Pruning
● Statistics
If you want to learn more, see me after.

Slide 6

Who is this Gabriel character?
Joined Heroku Data in late 2016
Sinatra and RoR API services
Focused on Postgres high availability and data growth
PostgreSQL, MySQL, SQL Server, Cassandra, Couchbase…(please no more)

Slide 7

Self-Help/Life Coaching is my passion on the side

Slide 8

Super Secret Project: Social interest platform
Each person is a customer:
● Customers will have profiles
● Each customer can post from areas of interest
  ○ Favorite quotes
  ○ TV shows
  ○ Movies
  ○ Foods
  ○ Etc.
● Imagine millions of users and hundreds of thousands of posts per day

Slide 9

Sharing Inspirational Quotes
Great idea to spread uplifting thoughts on the Internet!

ActiveRecord.migration do
  change do
    create_table :quotes do
      primary_key :id
      integer :character_id
      text :quote
      timestamps
    end
  end
end

Slide 10

Each row size: 0.5 kB
Our DB has 16 GB of RAM
Most quotes are never updated after their first week
We usually only want quotes posted or updated in the last week
Uh-oh, we have 750 million quotes and counting

Slide 11

Uh-oh, we have 750 million quotes and counting
Quotes table size: 375 GB (750 million rows × 0.5 kB)
Even 1/20 of the table is far more than the memory we have
Here is what Postgres will do:
● It pulls each record from disk and attempts to put it into memory.
● Postgres does this because memory is fast, and if you needed a record once, you might need it again soon.
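You don't have to take the table size on faith; Postgres can report it directly. A minimal sketch, assuming the quotes table from our migration:

-- Report the on-disk size of the quotes table, including indexes and TOAST
SELECT pg_size_pretty(pg_total_relation_size('quotes'));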

Slide 12

But the dataset is too big for memory: what happens now?
● What happens if you're putting things into memory that you don't use?
  ○ Postgres will use temp disk space to manage large queries
● It caches each whole record, then creates pointers to disk for data overflow
● Postgres can stop accepting connections if we run too many of these queries at the same time

Slide 13

But the dataset is too big for memory: what happens now?
Who likes fast queries?
Who likes to keep the database up?
Who likes having database connections?

Slide 14

Sharing Inspirational Quotes: Native Partitioning!
Great idea to spread uplifting thoughts on the Internet!

ActiveRecord.migration do
  change do
    create_table :quotes do
      primary_key :id
      integer :character_id
      text :quote
      timestamps
    end
    execute "ALTER TABLE quotes PARTITION BY RANGE (updated_at)"
  end
end
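The ALTER TABLE above is shorthand for the idea; in PostgreSQL 10 itself the partition key must be declared when the table is created, and each partition is a table of its own. A minimal SQL sketch (column names follow the migration; the partition name is hypothetical):

-- Declare the partition key at CREATE TABLE time (required in PostgreSQL 10)
CREATE TABLE quotes (
  id           bigserial,
  character_id integer,
  quote        text,
  created_at   timestamptz,
  updated_at   timestamptz
) PARTITION BY RANGE (updated_at);

-- Each partition is its own table covering a slice of the key's range
CREATE TABLE quotes_2017_10 PARTITION OF quotes
  FOR VALUES FROM ('2017-10-01') TO ('2017-11-01');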

Slide 15

Native Table Partitioning
Why use native partitioning?
Entirely avoid the VACUUM overhead caused by a bulk DELETE (drop the partition instead, as sketched below)
Faster lookups when the most heavily accessed rows of a table live in a single partition or a small number of partitions
Seldom-used data can be migrated to cheaper, slower storage media
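For instance, retiring a month of old quotes becomes a metadata operation instead of millions of row deletes. A sketch, assuming monthly partitions named as above:

-- Remove an old month instantly: no row-by-row DELETE, no VACUUM debt
DROP TABLE quotes_2016_01;

-- Or detach it, keeping the data around to archive on cheaper storage
ALTER TABLE quotes DETACH PARTITION quotes_2016_01;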

Slide 16

Native Table Partitioning
Why use native partitioning?
Query performance can improve dramatically: a sequential scan of one partition replaces index lookups and random-access reads across the whole table
Bulk loads and deletes can be accomplished by adding or removing partitions
Native partitioning is integrated into Postgres's internal operations now
pg_partman is still supported, but it's not built into 10's new optimizations
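The planner can also skip partitions whose ranges cannot match the query, so our "last week" workload touches one or two partitions instead of 375 GB. An illustrative sketch:

-- Constraint exclusion prunes partitions outside the WHERE range
EXPLAIN SELECT * FROM quotes
WHERE updated_at >= '2017-10-01';
-- Only partitions whose ranges overlap the predicate appear in the plan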

Slide 17

Native Table Partitioning
Noteworthy items
Partitioning isn’t natively supported in ORMs
Several gems attempt to integrate partitioning into ActiveRecord
Tables can be created from subsections of the original partitioning key (month, year, quote text, etc.)
Subpartitions can be made on the tables from the original partition as well

Slide 18

Market Pivot: Home Delivery

Slide 19

Competing with delivery businesses
We will compete with home delivery companies
Direct to customers from the platform

Customer shipping addresses:
ActiveRecord.migration do
  change do
    create_table :addresses do
      primary_key :id
      integer :customer_id
      text :street_address
      text :city
      text :state
      text :zip_code
      text :country_code
    end
  end
end

Slide 20

Competing with delivery businesses
Customer shipping addresses
But there are problems!
● Our delivery drivers have a hard time finding addresses
● Addresses aren't uniform across North America
● Customer satisfaction is very important, so special instructions should be considered

Slide 21

Competing with delivery businesses
Customer shipping addresses
How can we store all of this seemingly unstructured data?
Introducing hstore
● hstore is not new
● It's a key/value store in Postgres
Let's implement the feature with hstore to track extra delivery info
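hstore ships as a contrib extension and comes with its own operators. A minimal sketch of enabling it and reading keys back (column and key names follow the slides):

-- hstore is an extension; enable it once per database
CREATE EXTENSION IF NOT EXISTS hstore;

-- '->' pulls a value out by key (values come back as text)
SELECT street_address, usage -> 'type' AS address_type
FROM addresses
WHERE usage -> 'primary' = 'true';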

Slide 22

Competing with delivery businesses
Customer shipping addresses

ActiveRecord.migration do
  change do
    create_table :addresses do
      primary_key :id
      integer :customer_id
      text :street_address
      text :city
      text :state
      text :zip_code
      text :country_code
      hstore :usage
      # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
    end
  end
end

Slide 23

Competing with delivery businesses
Customer shipping addresses
Wait, why is the address lookup so slow?
The addresses table has 250 million records
● Customers have multiple addresses
● People in apartments move frequently but don't remove old addresses
Our drivers don't want to wait on slow updates out on the road
Luckily we can add an index and really improve lookups!

Slide 24

Competing with delivery businesses
Customer shipping addresses

ActiveRecord.migration do
  change do
    create_table :addresses do
      primary_key :id
      integer :customer_id
      text :street_address
      text :city
      text :state
      text :zip_code
      text :country_code
      hstore :usage
      # Usage hash example: { primary: true/false, type: 'home/business/apartment' }
    end
    execute "CREATE INDEX addresses_usage ON addresses USING HASH(usage)"
  end
end

Slide 25

Customer shipping addresses
It broke though, what happened?
But wait... Hash Index resiliency, Postgres 9.x:

ERROR: could not read block 32570 of relation base/16390/2663: read only 0 of 8192 bytes

Slide 26

Hash Index resiliency Postgres 9.x

Slide 27

Hash Index resiliency Postgres 9.x
Why the resiliency issues with hash indexes?
● Not treated as first-class citizens
● Not replicated
● Hard to maintain
● Misleading failure modes (XX001, XX002)
● Performance was generally poor under heavy load

Slide 28

Hash Index resiliency Postgres 9.x
Customer shipping addresses
If customers add/update addresses frequently, the hash index becomes stale
Replicas/followers won't have the index at all
The index has to be rebuilt if the database crashes (OS goes down, power outage, etc.)

Slide 29

Hash Index resiliency Postgres 9.x
Customer shipping addresses
Recreating large hash indexes can be time-consuming
Be careful not to rely on hash indexes: severely outdated indexes can mimic corruption (XX001, XX002)
What's the solution?!

Slide 30

Hash Index resiliency Postgres 10
Crash-Safe, Replicable Hash Indexes
● Now first-class citizens
● Crash safe
● Replicated
● Reduced locking during bucket splits
● Faster lookups
● More even index growth
● Single-page pruning

Slide 31

Hash Index resiliency Postgres 10
Customer shipping addresses

ActiveRecord.migration do
  change do
    create_table :addresses do
      primary_key :id
      integer :customer_id
      text :street_address
      text :city
      text :state
      text :zip_code
      text :country_code
      hstore :usage
      # Unstructured data, differing between locales (US vs. EU vs. CA, etc.)
    end
    execute "CREATE INDEX addresses_usage ON addresses USING HASH(usage)"
  end
end
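One caveat worth hedging: a hash index only answers equality on the indexed value, so the index above helps whole-value lookups on the hstore column. A sketch (the hstore literal is hypothetical):

-- A hash index supports only '='; this lookup can use addresses_usage
SELECT id, street_address
FROM addresses
WHERE usage = 'primary=>true, type=>home'::hstore;

For per-key lookups, a btree index on an expression like (usage -> 'type') may be a better fit.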

Slide 32

Hash Index resiliency Postgres 10
Customer shipping addresses
The usage index will persist after a crash or failure
Faster lookups on the hstore index for the values
Reduced locking improves concurrency: multiple edits to a single address in quick succession are more likely to succeed
Replicas/followers will have the indexes

Slide 33

New Feature: Curated Media Content

Slide 34

Curated Content: Characters of Interest
Interest Outlet: Our favorite characters
Hundreds of thousands of TV shows and movies
Each show or movie can have many characters (some have hundreds)
Varying amounts of information may be acquired over time about each character or show

Slide 35

Curated Content: Characters of Interest
Interest Outlet: Our favorite characters
Specific information for each character of interest:
● Genres
● Catchphrases
● Famous accessories (items, weapons, clothes)
We need an unstructured way of keeping that data colocated

Slide 36

Curated Content: Characters of Interest
Storing JSON in Postgres

ActiveRecord.migration do
  change do
    create_table :characters do
      primary_key :id
      integer :media_id
      text :name
      jsonb :metadata
    end
  end
end

Slide 37

Curated Content: Characters of Interest
Storing JSON in Postgres
Millions of characters, and JSON blobs are much bigger than single-line quotes
The dynamic, unstructured nature means dynamic sizes for each record
How can we keep unstructured data without having to parse it all?

Slide 38

Curated Content: Characters of Interest
Storing JSON in Postgres
Multiple tables? Many relationships? Partitions?
Is there something better?

Slide 39

Full text JSON Searching Postgres 10
Desire for searching JSON
● Store whole request bodies for inspection
● Native indexing in Postgres
● Allows for more NoSQL-style operations
  ○ Dynamic schema flexibility
● Development agility in initial stages

Slide 40

Full text JSON Searching Postgres 10
Curated Content: Characters of Interest
Most importantly: unstructured data flexibility!

ActiveRecord.migration do
  change do
    create_table :characters do
      primary_key :id
      integer :media_id
      text :name
      jsonb :metadata
    end
    execute <<-SQL
      CREATE INDEX characters_metadata ON characters
        USING gin (to_tsvector('english', metadata))
    SQL
  end
end
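Postgres 10's to_tsvector accepts json/jsonb directly, and a query that repeats the index expression can use the GIN index above. A minimal sketch (the search term is hypothetical):

-- Full-text match over the whole JSON document; matches the indexed expression
SELECT name
FROM characters
WHERE to_tsvector('english', metadata) @@ to_tsquery('english', 'hammer');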

Slide 41

Full text JSON Searching Postgres 10
Curated Content: Characters of Interest
Benefits of JSON full-text searching
All data stays colocated
Indexes can help filter data based on existing or missing JSON keys (see the sketch below)
More perspectives on character behavior from the existing data
Genres, strengths, weaknesses, skills (hammers, anyone?)
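Key-existence filtering uses jsonb's own operators rather than full-text search. A sketch with a separate, hypothetical GIN index on the raw jsonb column:

-- A plain GIN index on the jsonb column supports containment and key-existence
CREATE INDEX characters_metadata_keys ON characters USING gin (metadata);

-- '?' tests for a top-level key: characters with any catchphrases recorded
SELECT name FROM characters WHERE metadata ? 'catchphrases';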

Slide 42

Revisiting Quotes: Another feature request
I want Throwback Quotes
People want to be re-inspired: "On this day…"
Replay a stream of older quote posts
The stream should include all posters I follow or like, not just my posts
Even getting last month's posts seems slow though; there are so many

Slide 43

Revisiting Quotes: Another feature request
I want Throwback Quotes
Postgres 10 introduces Parallel Query support for more query types:
● Bitmap Heap Scan
● Gather Merge
● Merge Joins
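How much parallelism the planner will consider is controlled by a few settings. A sketch for experimenting (the value shown is an assumption, not a recommendation):

-- Workers the planner may assign under each Gather / Gather Merge node
SHOW max_parallel_workers_per_gather;

-- Raise it for the current session to encourage parallel plans while testing
SET max_parallel_workers_per_gather = 4;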

Slide 44

Postgres 9.6 Parallel Query
Parallel Sequential Scans
Why sequential scan?
- The main table of a query has to be scanned
- The scan has to be breakable into parts for the workers
- Postgres 9.6 only supported this directly on tables
- This created sequential scans, which can be very slow

Slide 45

Postgres 10 Parallel Query optimization
Parallel Bitmap Heap Scans
Improvements made
- One process scans the index of the main table
- It builds the list of heap pages to check in memory
- Checking those heap pages can then be split across parallel workers
- Loads less data and requires less caching
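The resulting plan shape looks roughly like this for a selective filter over our quotes table (the node names are real PG 10 nodes; the index name is hypothetical and costs are omitted):

EXPLAIN SELECT count(*) FROM quotes WHERE character_id = 42;
-- Finalize Aggregate
--   -> Gather
--        Workers Planned: 2
--        -> Partial Aggregate
--             -> Parallel Bitmap Heap Scan on quotes
--                  Recheck Cond: (character_id = 42)
--                  -> Bitmap Index Scan on quotes_character_id_idx
--                       Index Cond: (character_id = 42)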

Slide 46

Postgres 9.6 Parallel Query
Gather: Worker result collection
What is the Gather mechanism?
- Postgres 9.6 introduced it as a way to collect results from workers
- Results were collected in arbitrary order from all workers

Slide 47

Postgres 9.6 Parallel Query
Gather: Worker result collection
What is the Gather mechanism?
- If workers are producing sorted output, this adds overhead
- Sorting parallel chunks, merging them randomly, then sorting again. Seems good, right?
- What about really large result sets? (Remember, we have lots of quotes)

Slide 48

Postgres 9.6 Parallel Query
Gather: Worker result collection
What are my primes?

Slide 49

Postgres 9.6 Parallel Query
Gather: Worker result collection
What are my primes?
- 17 is the first node visited
- Order is based on visit point and total items (17, 19, 11, 2, 23, 31)
- Another pass is required to sort after gathering matching results
- Workers returned results as fast as possible, with no location context

Slide 50

Postgres 10 Parallel Query
Gather Merge: Optimized worker result collection
What is Gather Merge?
- Speeds up the gathering of parallel worker results
- Sorts are taken into account during parallelization
- Removes much of the overhead of re-sorting after the fact
- Especially beneficial when the result set is large

Slide 51

Postgres 10 Parallel Query
Gather Merge: Optimized worker result collection
Revisit: What are my primes?

Slide 52

Postgres 10 Parallel Query
Gather Merge: Optimized worker result collection
Revisit: What are my primes?
- 17 is the first node visited
- Direction and location are considered when workers collect results
- B-tree construction naturally considers leaf vs. root node locations (insertion balancing)

Slide 53

Postgres 10 Parallel Query
Gather Merge: Optimized worker result collection
Revisit: What are my primes?
- Tree collapsing happens in parallel now
- Order will be more intuitive and require less re-sorting

Slide 54

Postgres 9.6 Parallel Query
Merge Join: All followed posters' Quotes
What is Quotes JOIN Customers actually doing?
- Postgres 9.6 only allowed parallelization of nested loops and hash joins
- Merge joins further up the plan cannot run in the parallel portions
- Indexes and ordering all add overhead instead of speeding up requests

Slide 55

Postgres 10 Parallel Query
Parallel Merge Join: Optimized ordering
What's the difference with the parallel merge join?
- Merge joins can now run in all parallel portions of the work
- Outer sides of JOINs are parallelized more effectively
- Combined with parallel index scans for higher efficiency
- Natural ordering comes from the algorithm by default, without specifying ORDER BY
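Put together, Gather Merge can combine already-sorted worker streams from a parallel merge join without a top-level re-sort. An illustrative sketch against our tables (the join key, the customers.name column, and the plan shape are assumptions):

EXPLAIN SELECT q.id, c.name
FROM quotes q JOIN customers c ON c.id = q.customer_id
ORDER BY q.customer_id;
-- Gather Merge                 -- merges sorted worker streams, keeping order
--   Workers Planned: 2
--   -> Merge Join              -- can now sit inside the parallel portion
--        Merge Cond: (q.customer_id = c.id)
--        ...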

Slide 56

Postgres 10 Query Analyzer
Engineering Support: Filing and tracking issues
Postgres 9.6 query analyzer vs. Postgres 10
Issues table: 1.8M records, 495 MB

Slide 57

Postgres 9.6 Query Analyzer accuracy

EXPLAIN ANALYZE SELECT "issues".* FROM "issues"
WHERE "issues"."repo_id" = 2151 AND "issues"."state" = 'open'
ORDER BY created_at DESC LIMIT 20 OFFSET 0;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=4267.01..4267.02 rows=20 width=1233) (actual time=935.987..935.992 rows=20 loops=1)
  -> Sort (cost=4267.01..4268.21 rows=2384 width=1233) (actual time=935.985..935.988 rows=20 loops=1)
       Sort Key: created_at DESC
       Sort Method: top-N heapsort  Memory: 31kB
       -> Index Scan using index_issues_on_repo_id_and_id on issues (cost=0.08..4254.33 rows=2384 width=1233) (actual time=2.877..934.092 rows=2159 loops=1)
            Index Cond: (repo_id = 2151)
Planning time: 7.061 ms
Execution time: 936.042 ms

Slide 58

Postgres 10 Query Analyzer accuracy

EXPLAIN ANALYZE SELECT "issues".* FROM "issues"
WHERE "issues"."repo_id" = 2151 AND "issues"."state" = 'open'
ORDER BY created_at DESC LIMIT 20 OFFSET 0;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=3985.46..3985.47 rows=20 width=1232) (actual time=6.768..6.771 rows=20 loops=1)
  -> Sort (cost=3985.46..3986.58 rows=2222 width=1232) (actual time=6.767..6.768 rows=20 loops=1)
       Sort Key: created_at DESC
       Sort Method: top-N heapsort  Memory: 39kB
       -> Index Scan using index_issues_on_repo_id_and_id on issues (cost=0.08..3973.64 rows=2222 width=1232) (actual time=0.062..5.952 rows=1980 loops=1)
            Index Cond: (repo_id = 2151)
Planning time: 0.745 ms
Execution time: 6.815 ms

Slide 59

Planning time: 7.061 ms
Execution time: 936.042 ms

vs.

Planning time: 0.745 ms
Execution time: 6.815 ms

Slide 60

QUERY ANALYTICS: ANOTHER TALK

Slide 61

Gotchas with PostgreSQL 10
Rails 4 doesn't support PostgreSQL 10, only up to PostgreSQL 9.6
Partitioning is not native to any Ruby ORM
Postgres-specific index types (JSON, Hash) are not native to any Ruby ORM
Analytics for queries and indexes have to be done in raw SQL

Slide 62

Heroku Postgres improvements released
Added support for numerous native extensions
Managed Postgres credentials
Released new plans with dedicated hardware
Improved Postgres CLI analytics, built on Postgres's improved query analytics
Heroku Postgres connection pooling

Slide 63

Heroku Postgres extensions enabled
Made a list, vetted it twice!
Top 5 favorites added:
● lo
● sslinfo
● tcn
● timetravel
● address_standardizer
For the complete list, check out the Heroku Postgres Extensions article

Slide 64

Heroku Postgres Managed Credentials
What are managed credentials?
Postgres permissions are difficult to reason about
Managed credentials create a simpler view of Postgres roles
They still allow customizable permissions on roles and tables

Slide 65

Heroku Postgres Managed Credentials
What are managed credentials?
Footguns considered, guard rails added:
● Special characters (no one likes Bobby Tables)
● Name lengths
● Best practices published

Slide 66

Heroku Postgres Managed Credentials
Why use managed credentials?
Separation of concerns
Principle of least privilege
Better control of external access:
● E.g. you don't want third-party analytics reading PII data
For more info, check out the Heroku Postgres Credentials Dev Center article

Slide 67

Heroku PGX Plans Released
Benefits of using PGX:
● Optimized for heavier I/O work
● Better database and disk caching
● More stable performance
● Increased accuracy when performing query analytics
● Enhanced CPU capability for optimizing parallel queries

Slide 68

Heroku PGX Plans Released
Big takeaways:
● Dedicated provisioned disk I/O
● CPU cores doubled at tier 4 and up
● RAM sizes doubled at tier 4 and up
● Increased disk sizes at every tier
● Prices stayed the same or went down!
Read more about the Heroku Postgres PGX plans on our blog post

Slide 69

Heroku Postgres CLI analytics
Database tuning and monitoring are still vital
- https://devcenter.heroku.com/articles/heroku-postgres-database-tuning
Extended CLI analytics improved by Postgres 10 internals changes
- https://github.com/heroku/heroku-pg-extras

heroku pg:diagnose # and more!

Slide 70

Heroku Connection Pooling: PgBouncer Beta
What's PgBouncer?
● Manages connections for your Postgres database
  ○ Reduces the need to open new connections
● Lives on the database server
● Uses transaction-level connection pooling

Slide 71

Heroku Connection Pooling: PgBouncer Beta
Why use connection pooling?
● Better visibility into connection statistics
● Exposes guard rails to protect your database
● Excellent for high-throughput asynchronous workloads

Slide 72

Heroku Connection Pooling: PgBouncer Beta
On-server connection pooling is only available on Postgres 9.6 and 10
- Read more about it: Postgres connection pooling
- Still in public beta, more features to come!
- Default direct-to-DB connections are still available after enabling
Still offering the client-side buildpack
- Client-side PgBouncer article

Slide 73

Questions welcome! Come by the Booth!