GB • Cache • Caches query results, query plans, and relations • Cache hit rate should be over 99% • pg:cache-hit:
  table hit rate | 0.93700270695348766263   (NOT IDEAL)
  index hit rate | 0.76517846526624000529   (AWFUL)
and index) • Unused indexes • Bloat factor (wasted space from deleted/updated rows and failed transactions) - should be < 10 • Fix and Challenge • Reduce unnecessary storage • Increase the cache • Cache storage is expensive • Deciding cache size is an art and must be done on the production system
pg:table-size: failures table | 5832 MB • pg:index-size: failures_pkey | 89 MB • Has the cache hit rate improved? If not, increase storage • Upgrade the Heroku db to the next tier (standard-2) • Overall index hit rate: 0.9865709375927209 • Overall table hit rate: 0.99908830338836100280 • The same hit rates can also be computed directly in SQL, as sketched below
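A rough SQL equivalent of what pg:cache-hit reports (a sketch; it reads the standard pg_statio_user_tables and pg_statio_user_indexes statistics views):

  -- table (heap) cache hit rate
  SELECT sum(heap_blks_hit) / nullif(sum(heap_blks_hit + heap_blks_read), 0) AS table_hit_rate
    FROM pg_statio_user_tables;

  -- index cache hit rate
  SELECT sum(idx_blks_hit) / nullif(sum(idx_blks_hit + idx_blks_read), 0) AS index_hit_rate
    FROM pg_statio_user_indexes;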
The Postgres query optimizer decides, based in part on what is cached, the most efficient way to scan. • An index scan is not always the best choice • An index scan requires more I/O: reading the index, then reading the table • A seq scan is good when the result set is large (roughly 10% of the table or more)
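A sketch of how to see which plan the optimizer picks (the follows table is from these notes; the literal values and the existence of an index on account_id are assumptions):

  -- selective predicate: the planner usually picks an index scan if a suitable index exists
  EXPLAIN ANALYZE SELECT * FROM follows WHERE account_id = 42;

  -- predicate matching a large fraction of rows: a seq scan is usually cheaper
  EXPLAIN ANALYZE SELECT * FROM follows WHERE active = true;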
balanced structure, good for most cases (such as equality and ranges) and all data types • Partial • add_index :statuses, :liked, name: "index_statuses_on_disliked", where: "(liked IS FALSE)" • Smaller and faster, since storage is limited to rows matching the where clause • Commonly applied to boolean fields in where clauses • Expression • Useful for queries that match some function of the data, such as a case-insensitive match on login email • create index users_lower_email on users(lower(email)) • Multi-column • Combines multiple columns (foo, bar); used for where foo = 'fooz' and bar = 'baz'
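The same index types written out as plain SQL (a sketch; the statuses and users examples are taken from above, the things table with foo/bar is hypothetical):

  -- partial index: only rows where liked IS FALSE are indexed
  CREATE INDEX index_statuses_on_disliked ON statuses (liked) WHERE liked IS FALSE;

  -- expression index: matches e.g. SELECT * FROM users WHERE lower(email) = lower('Foo@Example.com')
  CREATE INDEX users_lower_email ON users (lower(email));

  -- multi-column index: matches WHERE foo = 'fooz' AND bar = 'baz'
  CREATE INDEX index_things_on_foo_and_bar ON things (foo, bar);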
panel • Slowest query for facebook-envoy-production:
  qry:            SELECT COUNT(*) FROM "follows" WHERE "follows"."account_id" = ? AND "follows"."active" = ?
  exec_time:      467:27:38.383773
  prop_exec_time: 76.3%
  ncalls:         5,469,425
  sync_io_time:   55:44:28.64694
! sync_io_time should be 0, meaning no wait.
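One possible fix for the query above (a sketch, not from the notes): a multi-column index covering both columns, built CONCURRENTLY so writes to follows are not blocked while it builds:

  CREATE INDEX CONCURRENTLY index_follows_on_account_id_and_active
    ON follows (account_id, active);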
Building and rebuilding an index has a time cost (sync_io_time) and a space cost • If the query runs in < 2 ms, that's reasonable and we can stop optimizing it. • N + 1 queries reduce throughput (see the sketch below)
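What N + 1 looks like at the SQL level, and the batched alternative (a sketch; the ids are made up):

  -- N + 1: the app issues one query per id in a loop
  SELECT * FROM follows WHERE account_id = 1;
  SELECT * FROM follows WHERE account_id = 2;
  -- ... N - 2 more round trips ...

  -- one batched query instead
  SELECT * FROM follows WHERE account_id IN (1, 2, 3);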
right index type • Reducing filtering • Seq vs. index scan • N + 1 queries • Always run 'explain analyze' against production data locally • Sanity-check: observe improvements with New Relic and Librato
120 • Each connection costs 7 MB of memory (part of our allocation) • pg:info shows current open connections • linksmith failing to close connections caused linktivator's db connections to climb to 495. • Temporary fix: restart linksmith. • Permanent fix: rescue and close db connections. • More permanent fix: shut down linksmith in favor of bit.ly
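Open connections can also be inspected directly (a sketch using the standard pg_stat_activity view; whether pg:info reads the same view is an assumption):

  -- total open connections
  SELECT count(*) FROM pg_stat_activity;

  -- break them down by client to spot the one leaking connections (e.g. linksmith)
  SELECT application_name, state, count(*)
    FROM pg_stat_activity
   GROUP BY application_name, state
   ORDER BY count(*) DESC;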
• Always choose pt-backups with 1-month data retention • Think about adding an index when writing joins; use explain analyze to compare the results before/after against production data locally (see the sketch below) • Turn up mq workers when db optimization is completed, monitoring db load to make sure it's <= 1 • Turn up scheduler frequency (without running into rate limits) for more responsive audience engagement data (can we achieve a 4hr delay instead of 24hr?)
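One way to do the before/after comparison locally (a sketch; the index name and literal values are hypothetical): CREATE INDEX is transactional in Postgres, so a candidate index can be tried against a copy of production data and then rolled back:

  BEGIN;
  CREATE INDEX index_follows_on_account_id_and_active ON follows (account_id, active);
  EXPLAIN ANALYZE
    SELECT COUNT(*) FROM follows WHERE account_id = 42 AND active = true;
  ROLLBACK;  -- discards the index, nothing persists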