Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data @ Kickstarter: A Guided Tour

Data @ Kickstarter: A Guided Tour

A high level presentation about how we're working with data at Kickstarter. Includes details on our events stack, some ETL, Redshift queries, and our process of handling and processing data requests internally.

Fred Benenson

June 06, 2014
Tweet

More Decks by Fred Benenson

Other Decks in Technology

Transcript

  1. Some Kickstarter Stats • $1.15+ billion pledged • 63,000+ Successful

    Projects • 6.35 million backers • 1.9 million repeat backers • 15.7 million pledges • $1-2 million dollars a day pledged • 43-44% of projects reach their goal • 65 $1+ million projects
  2. Some Kickstarter Data Stats • Rails App hosted entirely on

    Amazon Web Services • 18 app servers • db.m2.4xlarge RDS master databases • 4-5 Read Replicas • Hundreds of million of rows in MySQL Databases • Generating 7-10 million page-view level events per-day • Billions of rows of analytics data • Running alongside a combination elasticsearch, Redis, and Resque servers • Graphite & Statsd for operations stats 3
  3. Interesting things we’re doing that we’re not going to talk

    about tonight • Machine Learning • The Art of Instrumentation • DevOps Data (STATSD, etc.)
  4. The Data Sources Kickstarter MySQL RDS #1 Kickstarter MySQL RDS

    #2 Mixpanel Mixpanel Download to EC2 DynamoDB
  5. The Pipeline Kickstarter MySQL RDS #1 Kickstarter MySQL RDS #2

    S3 Daily Buckets sqoop on EMR Mixpanel Download to EC2 EMR Streaming to clean CSVs EMR to transform JSON to CSV EMR Streaming to clean CSVs sqoop on EMR AWS Data Pipeline DynamoDB EMR to transform DynamoDB to CSV
  6. … into Redshift Kickstarter MySQL RDS #1 Kickstarter MySQL RDS

    #2 S3 Daily Buckets sqoop on EMR Mixpanel Download to EC2 EMR Streaming to clean CSVs psql client AWS Redshift EMR to transform JSON to CSV EMR Streaming to clean CSVs sqoop on EMR AWS Data Pipeline DynamoDB EMR to transform DynamoDB to CSV
  7. … what else is in Redshift? • 2010 US Census

    Data snapshots across all our databases • Denormalized table views • An integers table • Exploded MaxMind GeoIP table of IP ↔️ Locations (billions of rows)
  8. Trello Lists • Welcome / Read This First / Prompts

    / Examples • Doing • Shipped per Week (e.g. “Shipped May 12 - May 16”) • Per-department Lists (e.g. Community, Marketing, etc.) • Backlog • A/B Tests (Running, TODO, etc.)
  9. Help for Data Requests Prompts • What’s your question? •

    Why is this important? • When do you need this done?
  10. Help for Data Requests Prompts • What’s your question? •

    Why is this important? • When do you need this done? Examples • Least amount of time/ resources • Moderate amount of time/ resources • Most amount of time/ resources
  11. An Example Card • Assign members to the card •

    Always expect a deadline • Labels are ¯\_(π)_/¯ useful • Encourage replying to the prompts to help colleagues think about the request • URGENT cards should include URGENT in their title, highly discouraged
  12. • Data goes to shared Google Docs spreadsheet • SQL

    / Queries go to GitHub (either repo, or sometimes just a gist), with permalinked SHA1 URL • R goes to GitHub • Final graphs are attached and kept updated • In-depth reports are produced with knitr Reproducibility is Next to Godliness
  13. What are the most common 
 aspect ratios for project

    videos? • Aspect ratio = width / height • 311 distinct aspect ratios • Most common: • 1.77777 (16:9) • 1.33333 (4:3) • 1.78343
  14. Internal Newsletter: METADATA • One Liner stats • Bigger Questions

    • External Data Readings (from Twitter, Hacker News, etc.) • Consistent source of inspiration for the us and the rest of the team
  15. Internal Documentation • A useful place to store snippets of

    commonly used code (e.g. timezone conversion in SQL) • Required reading for onboarding new data hires, helpful for engineers too • Kickstarter SQL Style Guide
  16. Internal Tools • Invest the time in parsing out what

    the common questions are • … and build the tools to help your team answer them themselves • Include disclaimers and ask people to gut-check the results
  17. Mixpanel • Pricey, but product managers love it • Helpful

    for looking at real-time product interaction • Nightly export of raw JSON is excellent, was a deciding factor when choosing to go with Mixpanel • current_user and current_project global objects & their wrappers are great
  18. Query Study: Live Projects • Nightly denormalization checking how many

    projects are live:
 
 SELECT COUNT(*) FROM projects WHERE state = ‘live’ • Simple but not trivial to backfill old dates • Historical, so no possible to split out by category, project size, etc.
  19. Desert time is for thinking about queries • How could

    we arbitrarily re-generate the # of live projects per-day in Redshift? • What if each project had a vote for the day that it was live? • Something like:
 
 date, project_id
 2009-09-19, 1151
 2009-09-20, 1151
 2009-09-21, 1151
 ...
 2009-10-19, 1151
  20. GROUP BY to the rescue SELECT 
 date,
 COUNT(DISTINCT project_id)

    as live_project_count
 FROM 
 (
 2009-09-19, 1151
 2009-09-20, 1151
 2009-09-21, 1151
 ...
 2009-10-19, 1151
 2009-10-20, 2351
 ...
 2014-05–26, 12381281
 2014-05–27, 12381281
 ) 
 GROUP BY date

  21. The Trick • “Explode” the # of days a project

    is live and tag it with those dates • generate_sequence() in Postgres • ??? in Redshift • Enter: the integers table
  22. The Actual Query SELECT id, state, CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS

    launched_at, CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at, CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline FROM ksr_projects)
  23. The Actual Query SELECT id, state, CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS

    launched_at, CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at, CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline FROM ksr_projects) SELECT DATE(launched_at) + n AS date_live, 
 id AS project_id FROM ( ! WHERE n <= ( DATE( CASE WHEN state_changed_at < deadline AND state != 'live' THEN state_changed_at ELSE deadline END ) - DATE(launched_at) ) AND launched_at IS NOT NULL GROUP BY id, date_live, launched_at, deadline ORDER BY id, date_live) , integers
  24. The Actual Query SELECT id, state, CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS

    launched_at, CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at, CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline FROM ksr_projects) SELECT DATE(launched_at) + n AS date_live, 
 id AS project_id FROM ( ! WHERE n <= ( DATE( CASE WHEN state_changed_at < deadline AND state != 'live' THEN state_changed_at ELSE deadline END ) - DATE(launched_at) ) AND launched_at IS NOT NULL GROUP BY id, date_live, launched_at, deadline ORDER BY id, date_live) , integers 2009-09-19, 1151
 2009-09-20, 1151
 2009-09-21, 1151
 ...
 2009-10-19, 1151
 2009-10-20, 2351
 ...
 2014-05–26, 12381281
 2014-05–27, 12381281 ⤵️
  25. The Actual Query SELECT id, state, CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS

    launched_at, CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at, CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline FROM ksr_projects) SELECT DATE(launched_at) + n AS date_live, 
 id AS project_id FROM ( ! WHERE n <= ( DATE( CASE WHEN state_changed_at < deadline AND state != 'live' THEN state_changed_at ELSE deadline END ) - DATE(launched_at) ) AND launched_at IS NOT NULL GROUP BY id, date_live, launched_at, deadline ORDER BY id, date_live) , integers SELECT date_live, COUNT(DISTINCT project_id) AS live_projects FROM ( GROUP BY date_live; 2009-09-19, 1151
 2009-09-20, 1151
 2009-09-21, 1151
 ...
 2009-10-19, 1151
 2009-10-20, 2351
 ...
 2014-05–26, 12381281
 2014-05–27, 12381281 ⤵️
  26. How does the query compare to our historical record? Pretty

    close! Except there’s a gap. ! Bonus points if you know why.
  27. Bonus Query: Window Functions SELECT event1, AVG(time2 - time1) AS

    average_time1_time2_difference, event2, AVG(time3 - time2) AS average_time2_time3_difference, event3, COUNT(*) AS count FROM ( SELECT event, "time", LEAD(event, 1) OVER (PARTITION BY distinct_id ORDER BY "time") AS event1, LEAD("time", 1) OVER (PARTITION BY distinct_id ORDER BY "time") AS time1, LEAD(event, 2) OVER (PARTITION BY distinct_id ORDER BY "time") AS event2, LEAD("time", 2) OVER (PARTITION BY distinct_id ORDER BY "time") AS time2, LEAD(event, 3) OVER (PARTITION BY distinct_id ORDER BY "time") AS event3, LEAD("time", 3) OVER (PARTITION BY distinct_id ORDER BY "time") AS time3 FROM mixpanel_events ) WHERE event1 = 'New user' GROUP BY event1, event2, event3 ORDER BY count DESC LIMIT 1000; WARNING: 12 hour query over billions of rows What are the two most common events that are triggered after New Users signup?
  28. • Extending our Logistic Regression framework • Realtime / Streaming

    events data into Redshift • Data driven Trust & Safety risk analysis • A/B Testing • Automated reporting with knitr, R, Redshift • More long-term research projects! Stuff We’re Working on Now