Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Making Data Dance

Making Data Dance

Rails and the ActiveRecord gem are really handy tools to get work done, and do it quickly. They aren't a silver bullet, though. Sometimes the database is the best place to run a complicated query. Especially when you have big transactional or time-series data. Databases are also incredibly powerful, and can do remarkable things at scale. Why not take advantage of that power?

This talk discusses some of the things that you can do with PostgreSQL that will both blow your mind and also help you transform data into knowledge.

3a1f18e5445bd821f290ed68c1f2d925?s=128

Barrett Clark

April 23, 2015
Tweet

Transcript

  1. Making Data Dance @barrettclark

  2. Harness Heretical Antipatterns For Fun And Profit

  3. My Resume

  4. My Resume } DATA + +

  5. ActiveRecord

  6. Raw SQL • Complicated joins • Deeper database functionality

  7. But database portability?!

  8. https://flic.kr/p/eg4hih

  9. It just doesn’t work like that. Database Portability Pipe Dream

  10. You chose the ___ database for a reason Use it.

  11. Caveats • 500ms • Database connection pool

  12. Let’s get our hands dirty

  13. Window Functions

  14. –PostgreSQL 9.4.0 Documentation A window function performs a calculation across

    a set of table rows that are somehow related to the current row.
  15. –Barrett Clark Fold data from other rows into this row.

  16. Window Function Greatest Hits •lead() •lag() •first_value() •last_value() •row_number()

  17. id fruit 1 apple 2 apple 3 apple 4 pear

    5 apple 6 pear 7 pear 8 pear 9 pear 10 banana
  18. SELECT *, LEAD(id) OVER(), LAG(id) OVER() FROM fruits; id |

    fruit | lead | lag ----+--------+------+----- 1 | apple | 2 | 2 | apple | 3 | 1 3 | apple | 4 | 2 4 | pear | 5 | 3 5 | apple | 6 | 4 6 | pear | 7 | 5 7 | pear | 8 | 6 8 | pear | 9 | 7 9 | pear | 10 | 8 10 | banana | | 9 (10 rows) http://sqlfiddle.com/#!15/230c3b/2
  19. –PostgreSQL 9.4.0 Documentation A window function performs a calculation across

    a set of table rows that are somehow related to the current row.
  20. https://flic.kr/p/7KGqEY

  21. id | fruit | first_value ----+--------+------------- 1 | apple |

    1 2 | apple | 1 3 | apple | 1 5 | apple | 1 10 | banana | 10 4 | pear | 4 6 | pear | 4 7 | pear | 4 8 | pear | 4 9 | pear | 4 (10 rows) SELECT id, fruit, FIRST_VALUE(id) OVER(PARTITION BY fruit ORDER BY id) FROM fruits; http://sqlfiddle.com/#!15/230c3b
  22. SELECT *, FIRST_VALUE(id) OVER(PARTITION BY fruit ORDER BY id), ROW_NUMBER()

    OVER(PARTITION BY fruit ORDER BY id) FROM fruits; id | fruit | first_value | row_number ----+--------+-------------+------------ 1 | apple | 1 | 1 2 | apple | 1 | 2 3 | apple | 1 | 3 5 | apple | 1 | 4 10 | banana | 10 | 1 4 | pear | 4 | 1 6 | pear | 4 | 2 7 | pear | 4 | 3 8 | pear | 4 | 4 9 | pear | 4 | 5 (10 rows) http://sqlfiddle.com/#!15/230c3b
  23. Practical Example Chat App

  24. id | message | room_id | user_id ----+-----------------------------+---------+--------- 1 |

    Leaving for the airport now | 1 | 1337 2 | Flight just landed! | 1 | 1337 3 | waiting on an uber | 2 | 1337 4 | omw y'all! | 1 | 1337 9 | hello, world | 1 | 27 (5 rows) SELECT id, message, room_id, user_id FROM messages ORDER BY id;
  25. SELECT id, message, room_id, user_id, LEAD(user_id) OVER(PARTITION BY room_id ORDER

    BY room_id, id) AS next_user_id FROM messages ORDER BY 3, 1; Look ahead Defines the groups of rows to evaluate against the current row
  26. id | message | room_id | user_id | next_user_id ----+-----------------------------+---------+---------+--------------

    1 | Leaving for the airport now | 1 | 1337 | 1337 2 | Flight just landed! | 1 | 1337 | 1337 4 | omw y'all! | 1 | 1337 | 27 9 | hello, world | 1 | 27 | 3 | waiting on an uber | 2 | 1337 | (5 rows) SELECT id, message, room_id, user_id, LEAD(user_id) OVER(PARTITION BY room_id ORDER BY room_id, id) AS next_user_id FROM messages ORDER BY 3, 1;
  27. 3 1 2 4 5 ProTip™ SELECT id, message, room_id,

    user_id, LEAD(user_id) OVER(PARTITION BY room_id ORDER BY room_id, id) AS next_user_id FROM messages ORDER BY 3, 1;
  28. class Message < ActiveRecord::Base def self.window_example sql = <<-SQL SELECT

    id, message, room_id, user_id, LEAD(user_id) OVER(PARTITION BY room_id ORDER BY id) AS next_user_id FROM messages ORDER BY 3, 1; SQL connection.execute(sql) end end Returns array of Message objects with additional calculated fields
  29. Too anti pattern for you?

  30. module PGConnection def conn config = YAML.load_file(File.open('config/database.yml'))['development'] @conn ||= PG.connect(

    :dbname => config['database'], :user => config['username'], :password => config['password'], :host => config['host'] || 'localhost' ) end end class Antipattern extend PGConnection def self.window_example sql = <<-SQL SELECT id, message, room_id, user_id, LEAD(user_id) OVER(PARTITION BY room_id ORDER BY id) AS next_user_id FROM messages ORDER BY 3, 1; SQL conn.exec(sql).values end end Returns array of arrays of values
  31. Models do not have to be backed by ActiveRecord. Or

    even persisted in the database.
  32. Questions We Can Now Answer • When did something change?

    • When did someone leave a place? • How long did each thing last? • How long did people stay in which places? • Enter / Exit events
  33. Subqueries

  34. Subquery • Subquery to filter and group • Subquery for

    a field value • Subquery in a join
  35. Base Query

  36. “For a given phone, pull in the next minor field”

    SELECT id, phone, major, minor, reading_timestamp, LEAD(minor) OVER ( PARTITION BY phone ORDER BY phone, reading_timestamp ) AS next_minor FROM beacon_readings ORDER BY 2, 5;
  37. When did something change?

  38. SELECT major, minor, next_minor, COUNT(*) FROM ( SELECT id, phone,

    major, minor, reading_timestamp, LEAD(minor) OVER ( PARTITION BY phone ORDER BY phone, reading_timestamp) AS next_minor FROM beacon_readings ORDER BY 2, 5 ) AS beacon_readings_lead WHERE major = 1 AND minor != next_minor GROUP BY major, minor, next_minor ORDER BY major, minor, next_minor;
  39. Yo dawg, I heard you like queries

  40. SELECT major, minor, next_minor, COUNT(*) FROM ( SELECT id, phone,

    major, minor, reading_timestamp, LEAD(minor) OVER ( PARTITION BY phone ORDER BY phone, reading_timestamp) AS next_minor FROM beacon_readings ORDER BY 2, 5 ) AS beacon_readings_lead WHERE major = 1 AND minor != next_minor GROUP BY major, minor, next_minor ORDER BY major, minor, next_minor;
  41. major | minor | next_minor | count -------+-------+------------+------- 1 |

    0 | 1 | 18 1 | 0 | 2 | 3 1 | 0 | 3 | 23 1 | 0 | 4 | 7 1 | 1 | 0 | 19 1 | 1 | 2 | 59 1 | 1 | 3 | 158 1 | 1 | 4 | 44 1 | 2 | 0 | 5 1 | 2 | 1 | 59 1 | 2 | 3 | 85 1 | 2 | 4 | 40 1 | 3 | 0 | 21 1 | 3 | 1 | 154 1 | 3 | 2 | 85 1 | 3 | 4 | 52 1 | 4 | 0 | 6 1 | 4 | 1 | 49 1 | 4 | 2 | 42 1 | 4 | 3 | 45 (20 rows)
  42. None
  43. Refactor

  44. https://flic.kr/p/9tT9Yh

  45. CTE Common Table Expression

  46. SELECT major, minor, next_minor, COUNT(*) FROM ( SELECT id, phone,

    major, minor, reading_timestamp, LEAD(minor) OVER ( PARTITION BY phone ORDER BY phone, reading_timestamp) AS next_minor FROM beacon_readings ORDER BY 2, 5 ) AS beacon_readings_lead WHERE major = 1 AND minor != next_minor GROUP BY major, minor, next_minor ORDER BY major, minor, next_minor;
  47. WITH beacon_readings_lead_lag AS ( SELECT id, phone, major, minor, reading_timestamp,

    LEAD(minor) OVER ( PARTITION BY phone ORDER BY phone, reading_timestamp ) AS next_minor FROM beacon_readings ORDER BY 2, 5 ) SELECT major, minor, next_minor, COUNT(*) FROM beacon_readings_lead_lag WHERE major = 1 AND minor != next_minor GROUP BY major, minor, next_minor ORDER BY major, minor, next_minor;
  48. WITH interval_query AS ( SELECT (ts ||' hour')::INTERVAL AS hour_interval

    FROM generate_series(0,23) AS ts ), time_series AS ( SELECT DATE_TRUNC('hour', NOW()) + INTERVAL '60 min' * ROUND(DATE_PART('minute', NOW()) / 60.0) - interval_query.hour_interval AS start_time FROM interval_query ), time_intervals AS ( SELECT start_time, start_time + '1 hour'::INTERVAL AS end_time FROM time_series ORDER BY start_time ), reading_counts AS ( SELECT f.start_time, f.end_time, br.minor, COUNT(DISTINCT br.phone) readings FROM beacon_readings br RIGHT JOIN time_intervals f ON br.reading_timestamp >= f.start_time AND br.reading_timestamp < f.end_time AND br.major = 1 GROUP BY f.start_time, f.end_time, br.minor ORDER BY f.start_time, br.minor ) SELECT * FROM reading_counts;
  49. None
  50. Repetition Repetition RepetitionRepetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition

    Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition
  51. View Stored Query

  52. CREATE OR REPLACE VIEW beacon_readings_lead_view AS SELECT id, phone, major,

    minor, reading_timestamp, LEAD(minor) OVER ( PARTITION BY phone ORDER BY phone, reading_timestamp ) AS next_minor FROM beacon_readings ORDER BY 2, 5 ;
  53. class CreateLeadView < ActiveRecord::Migration def up execute <<-SQL.strip_heredoc CREATE OR

    REPLACE VIEW beacon_readings_lead_view AS SELECT ID, phone, major, minor, reading_timestamp, LEAD(minor) OVER ( PARTITION BY phone ORDER BY phone, reading_timestamp ) AS next_minor FROM beacon_readings ORDER BY phone, reading_timestamp; SQL end def down sql = "DROP VIEW IF EXISTS beacon_readings_lead_view;" execute(sql) end end
  54. SELECT major, minor, next_minor, COUNT(*) FROM beacon_readings_lead_view WHERE major =

    1 AND minor != next_minor GROUP BY major, minor, next_minor ORDER BY major, minor, next_minor;
  55. Business Logic In The Database?!?

  56. YOU Get to decide where your code runs

  57. https://flic.kr/p/2MW7SY

  58. My query is too slow

  59. Do the heavy lifting in a separate process

  60. Materialized View Stored Query And Data

  61. CREATE MATERIALIZED VIEW beacon_readings_lead_lag_mv AS SELECT * FROM beacon_readings_lead_lag_view WHERE

    reading_timestamp >= NOW() - '1 day'::INTERVAL; CREATE INDEX index_beacon_readings_mv_on_major ON beacon_readings_lead_lag_mv USING btree(major ASC NULLS LAST); CREATE INDEX index_beacon_readings_mv_on_phone ON beacon_readings_lead_lag_mv USING btree(phone COLLATE "default" ASC NULLS LAST); CREATE INDEX index_beacon_readings_mv_on_reading_timestamp ON beacon_readings_lead_lag_mv USING btree(reading_timestamp ASC NULLS LAST);
  62. DATA https://flic.kr/p/khS6jn

  63. REFRESH MATERIALIZED VIEW CONCURRENTLY beacon_readings_lead_lag_mv; New in 9.4!

  64. namespace :db do namespace :heroku do # https://devcenter.heroku.com/articles/scheduler desc 'Update

    the materialized view(s)' task :update_materialized_view => :environment do sql = 'REFRESH MATERIALIZED VIEW beacon_readings_lead_lag_mv;' ActiveRecord::Base.connection.execute(sql) end end end
  65. namespace :db do namespace :heroku do desc "Load the PROD

    database from Heroku to the local dev database" task :load => :environment do if Rails.env == 'development' Bundler.with_clean_env do config = Rails.configuration.database_configuration[Rails.env] system <<-CMD heroku pgbackups:capture --expire curl -o latest.dump `heroku pgbackups:url` pg_restore --verbose --clean --no-acl --no-owner -h localhost \ -U #{config["username"]} -d #{config["database"]} latest.dump rm -rf latest.dump CMD end end end end end https://gist.github.com/barrettclark/c94467e3872d16b3f8b0
  66. https://flic.kr/p/bVUL7

  67. Subquery For A Field Value

  68. SELECT id, phone, ( SELECT MIN(id) FROM beacon_readings_lead_lag_mv WHERE major

    = br1.major AND minor = 0 AND phone = br1.phone AND id >= br1.id ) AS session_close_id, minor, next_beacon_minor, reading_timestamp FROM beacon_readings_lead_lag_mv br1 WHERE major = 2 AND minor != next_beacon_minor ORDER BY phone, reading_timestamp;
  69. SELECT id, phone, ( SELECT MIN(id) FROM beacon_readings_lead_lag_mv WHERE major

    = br1.major AND minor = 0 AND phone = br1.phone AND id >= br1.id ) AS session_close_id, minor, next_beacon_minor, reading_timestamp FROM beacon_readings_lead_lag_mv br1 WHERE major = 2 AND minor != next_beacon_minor ORDER BY phone, reading_timestamp;
  70. None
  71. Other Useful Things

  72. Datatypes • Array • DateRange and TSRange • JSON •

    JSONB (9.4) • UUID http://edgeguides.rubyonrails.org/active_record_postgresql.html
  73. Future You Will Thank You

  74. Postgres • Heroku’s Postgres.app (ships with PostGIS baked in) •

    Current Version = 9.4 • Database Tools • pgAdmin3 (free) • Navicat (several offerings, free trials) • Visual Queryplan: http://explain.depesz.com/
  75. None
  76. Recap

  77. It’s OK to write SQL

  78. Refactor • Common Table Expression (CTE) • View • Materialized

    View
  79. Not saying do everything in the database

  80. PostgreSQL is awesome

  81. Thank You.

  82. Barrett Clark @barrettclark