A high-level presentation about how we work with data at Kickstarter. Includes details on our events stack, some ETL, Redshift queries, and our process for handling data requests internally.
Some Kickstarter Stats • $1.15+ billion pledged • 63,000+ Successful Projects • 6.35 million backers • 1.9 million repeat backers • 15.7 million pledges • $1-2 million pledged per day • 43-44% of projects reach their goal • 65 projects that have raised $1 million or more
Some Kickstarter Data Stats • Rails app hosted entirely on Amazon Web Services • 18 app servers • db.m2.4xlarge RDS master databases • 4-5 read replicas • Hundreds of millions of rows in MySQL databases • Generating 7-10 million page-view-level events per day • Billions of rows of analytics data • Running alongside a combination of Elasticsearch, Redis, and Resque servers • Graphite & StatsD for operations stats
The Pipeline • Kickstarter MySQL RDS #1 and #2 → sqoop on EMR → EMR Streaming to clean CSVs → S3 daily buckets • Mixpanel → download to EC2 → EMR to transform JSON to CSV → S3 daily buckets • DynamoDB → AWS Data Pipeline → EMR to transform DynamoDB to CSV → S3 daily buckets
… into Redshift • The same sources (MySQL RDS via sqoop, Mixpanel via EC2 download, DynamoDB via AWS Data Pipeline) flow through EMR into the S3 daily buckets, which are then loaded into AWS Redshift via the psql client
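To make that last step concrete, here is a minimal sketch of the kind of load the psql client would run against Redshift; the bucket path, file options, and credential placeholders are assumptions for illustration, not our actual configuration:

-- Hypothetical COPY of one day's cleaned CSVs from S3 into a Redshift table.
-- Bucket path and credentials are placeholders.
COPY ksr_projects
FROM 's3://example-daily-bucket/2014-05-27/ksr_projects/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV
GZIP;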
… what else is in Redshift? • 2010 US Census data • Snapshots across all our databases • Denormalized table views • An integers table • An exploded MaxMind GeoIP table of IP ↔️ location mappings (billions of rows)
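Since the integers table comes up again later, here is a hedged sketch of one way such a table could be populated in Redshift; the digits/integers names and the 0-9,999 range are assumptions for illustration:

-- Sketch: a ten-row digits table, cross-joined with itself to produce 0..9999.
CREATE TABLE digits (n INT);
INSERT INTO digits VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);

CREATE TABLE integers AS
SELECT a.n + (b.n * 10) + (c.n * 100) + (d.n * 1000) AS n
FROM digits a, digits b, digits c, digits d;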
Help for Data Requests • Prompts: What's your question? Why is this important? When do you need this done? • Examples: least, moderate, or most amount of time/resources
An Example Card • Assign members to the card • Always expect a deadline • Labels are ¯\_(ツ)_/¯ useful • Encourage replying to the prompts to help colleagues think about the request • URGENT cards should include URGENT in their title, but they are highly discouraged
Reproducibility is Next to Godliness • Data goes to a shared Google Docs spreadsheet • SQL / queries go to GitHub (either a repo, or sometimes just a gist), with a permalinked SHA1 URL • R goes to GitHub • Final graphs are attached and kept updated • In-depth reports are produced with knitr
What are the most common aspect ratios for project videos? • Aspect ratio = width / height • 311 distinct aspect ratios • Most common: • 1.77777 (16:9) • 1.33333 (4:3) • 1.78343
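For context, a hedged sketch of the sort of query behind those numbers; the project_videos table and its width/height columns are illustrative assumptions, not our actual schema:

-- Sketch: count project videos per rounded aspect ratio.
-- Table and column names are assumed for illustration.
SELECT ROUND(CAST(width AS DECIMAL(12, 6)) / height, 5) AS aspect_ratio,
       COUNT(*) AS videos
FROM project_videos
WHERE height > 0
GROUP BY 1
ORDER BY videos DESC
LIMIT 10;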
Internal Newsletter: METADATA • One-liner stats • Bigger questions • External data readings (from Twitter, Hacker News, etc.) • A consistent source of inspiration for us and the rest of the team
Internal Documentation • A useful place to store snippets of commonly used code (e.g. timezone conversion in SQL) • Required reading for onboarding new data hires, helpful for engineers too • Kickstarter SQL Style Guide
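For example, the timezone-conversion snippet might look like this in Redshift SQL (a sketch using the ksr_projects table that appears later; the exact stored snippet may differ):

-- Sketch: convert a UTC timestamp column to Eastern time in Redshift.
SELECT CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS launched_at_et
FROM ksr_projects
LIMIT 10;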
Internal Tools • Invest the time in parsing out what the common questions are • … and build the tools to help your team answer them themselves • Include disclaimers and ask people to gut-check the results
Mixpanel • Pricey, but product managers love it • Helpful for looking at real-time product interaction • The nightly export of raw JSON is excellent and was a deciding factor in choosing Mixpanel • current_user and current_project global objects & their wrappers are great
Query Study: Live Projects • Nightly denormalization checking how many projects are live:
SELECT COUNT(*) FROM projects WHERE state = 'live' • Simple, but not trivial to backfill for old dates • Historical only, so it's not possible to split out by category, project size, etc.
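A hedged sketch of what that nightly denormalization could look like; the live_project_counts snapshot table is hypothetical, shown only to illustrate the approach:

-- Sketch: append today's live-project count to a hypothetical snapshot table.
INSERT INTO live_project_counts (snapshot_date, live_projects)
SELECT CURRENT_DATE, COUNT(*)
FROM projects
WHERE state = 'live';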
Dessert time is for thinking about queries • How could we arbitrarily re-generate the # of live projects per day in Redshift? • What if each project had a vote for each day that it was live? • Something like:
GROUP BY to the rescue

SELECT date, COUNT(DISTINCT project_id) AS live_project_count
FROM (
    -- one (date, project_id) row per day each project was live, e.g.:
    -- 2009-09-19, 1151
    -- 2009-09-20, 1151
    -- 2009-09-21, 1151
    -- ...
    -- 2009-10-19, 1151
    -- 2009-10-20, 2351
    -- ...
    -- 2014-05-26, 12381281
    -- 2014-05-27, 12381281
)
GROUP BY date
The Trick • “Explode” the # of days a project is live and tag each row with one of those dates • generate_series() in Postgres • ??? in Redshift • Enter: the integers table
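For comparison, a hedged sketch of that explosion in plain Postgres using generate_series(); it is simplified (a fixed 365-day upper bound is assumed, and none of the state/deadline handling from the real query below is included):

-- Sketch (Postgres): one row per project per day it was live.
SELECT launched_at::date + n AS date_live,
       id AS project_id
FROM projects,
     generate_series(0, 365) AS n
WHERE n <= deadline::date - launched_at::date
  AND launched_at IS NOT NULL;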
The Actual Query

SELECT date_live,
       COUNT(DISTINCT project_id) AS live_projects
FROM (
    SELECT DATE(launched_at) + n AS date_live,
           id AS project_id
    FROM (
        SELECT id,
               state,
               CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS launched_at,
               CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at,
               CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline
        FROM ksr_projects
    ), integers
    WHERE n <= (
        DATE(
            CASE WHEN state_changed_at < deadline AND state != 'live'
                 THEN state_changed_at
                 ELSE deadline
            END
        ) - DATE(launched_at)
    )
    AND launched_at IS NOT NULL
    GROUP BY id, date_live, launched_at, deadline
    ORDER BY id, date_live
)
GROUP BY date_live;

The inner query produces one (date_live, project_id) row per day each project was live (the 2009-09-19, 1151 … rows from the previous slide); the outer GROUP BY collapses them into a per-day count of live projects.
Bonus Query: Window Functions • What are the two most common events triggered after new users sign up?

SELECT event1,
       AVG(time2 - time1) AS average_time1_time2_difference,
       event2,
       AVG(time3 - time2) AS average_time2_time3_difference,
       event3,
       COUNT(*) AS count
FROM (
    SELECT event,
           "time",
           LEAD(event, 1) OVER (PARTITION BY distinct_id ORDER BY "time") AS event1,
           LEAD("time", 1) OVER (PARTITION BY distinct_id ORDER BY "time") AS time1,
           LEAD(event, 2) OVER (PARTITION BY distinct_id ORDER BY "time") AS event2,
           LEAD("time", 2) OVER (PARTITION BY distinct_id ORDER BY "time") AS time2,
           LEAD(event, 3) OVER (PARTITION BY distinct_id ORDER BY "time") AS event3,
           LEAD("time", 3) OVER (PARTITION BY distinct_id ORDER BY "time") AS time3
    FROM mixpanel_events
)
WHERE event1 = 'New user'
GROUP BY event1, event2, event3
ORDER BY count DESC
LIMIT 1000;

WARNING: a 12-hour query over billions of rows
Stuff We're Working on Now • Extending our logistic regression framework • Real-time / streaming events data into Redshift • Data-driven Trust & Safety risk analysis • A/B testing • Automated reporting with knitr, R, and Redshift • More long-term research projects!