Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data @ Kickstarter: A Guided Tour

Data @ Kickstarter: A Guided Tour

A high level presentation about how we're working with data at Kickstarter. Includes details on our events stack, some ETL, Redshift queries, and our process of handling and processing data requests internally.

Fred Benenson

June 06, 2014
Tweet

More Decks by Fred Benenson

Other Decks in Technology

Transcript

  1. Data @ Kickstarter: A Guided Tour
    [email protected] / [email protected]
    @fredbenenson / @jsalfen

    View Slide

  2. Some Kickstarter Stats

    $1.15+ billion pledged

    63,000+ Successful Projects

    6.35 million backers

    1.9 million repeat backers

    15.7 million pledges

    $1-2 million dollars a day pledged

    43-44% of projects reach their goal

    65 $1+ million projects

    View Slide

  3. Some Kickstarter Data Stats

    Rails App hosted entirely on Amazon Web Services

    18 app servers

    db.m2.4xlarge RDS master databases

    4-5 Read Replicas

    Hundreds of million of rows in MySQL Databases

    Generating 7-10 million page-view level events per-day

    Billions of rows of analytics data

    Running alongside a combination elasticsearch, Redis, and Resque servers

    Graphite & Statsd for operations stats
    3

    View Slide

  4. Interesting things we’re doing
    that we’re not going to talk about tonight

    Machine Learning

    The Art of Instrumentation

    DevOps Data (STATSD, etc.)

    View Slide

  5. The Stack

    View Slide

  6. The Data Sources
    Kickstarter
    MySQL RDS #1
    Kickstarter
    MySQL RDS #2
    Mixpanel
    Mixpanel Download
    to EC2
    DynamoDB

    View Slide

  7. The Pipeline
    Kickstarter
    MySQL RDS #1
    Kickstarter
    MySQL RDS #2
    S3 Daily Buckets
    sqoop on EMR
    Mixpanel Download
    to EC2
    EMR Streaming
    to clean CSVs
    EMR to transform
    JSON to CSV
    EMR Streaming
    to clean CSVs
    sqoop on EMR
    AWS Data Pipeline
    DynamoDB
    EMR to transform
    DynamoDB to CSV

    View Slide

  8. … into Redshift
    Kickstarter
    MySQL RDS #1
    Kickstarter
    MySQL RDS #2
    S3 Daily Buckets
    sqoop on EMR
    Mixpanel Download
    to EC2
    EMR Streaming
    to clean CSVs
    psql
    client
    AWS Redshift
    EMR to transform
    JSON to CSV
    EMR Streaming
    to clean CSVs
    sqoop on EMR
    AWS Data Pipeline
    DynamoDB
    EMR to transform
    DynamoDB to CSV

    View Slide

  9. tl;dr: Get data into Redshift

    View Slide

  10. … what else is in Redshift?

    2010 US Census Data snapshots across all our databases

    Denormalized table views

    An integers table

    Exploded MaxMind GeoIP table of IP ↔️ Locations (billions of rows)

    View Slide

  11. The Craft

    View Slide

  12. The Data Trello

    View Slide

  13. Trello Lists

    Welcome / Read This First / Prompts / Examples

    Doing

    Shipped per Week (e.g. “Shipped May 12 - May 16”)

    Per-department Lists (e.g. Community, Marketing, etc.)

    Backlog

    A/B Tests (Running, TODO, etc.)

    View Slide

  14. More on our blog…

    View Slide

  15. Help for Data Requests
    Prompts

    What’s your question?

    Why is this important?

    When do you need this
    done?

    View Slide

  16. Help for Data Requests
    Prompts

    What’s your question?

    Why is this important?

    When do you need this
    done?
    Examples

    Least amount of time/
    resources

    Moderate amount of time/
    resources

    Most amount of time/
    resources

    View Slide

  17. An Example Card

    Assign members to the card

    Always expect a deadline

    Labels are ¯\_(π)_/¯ useful

    Encourage replying to the prompts to
    help colleagues think about the
    request

    URGENT cards should include
    URGENT in their title, highly
    discouraged

    View Slide


  18. Data goes to shared Google Docs
    spreadsheet

    SQL / Queries go to GitHub (either
    repo, or sometimes just a gist), with
    permalinked SHA1 URL

    R goes to GitHub

    Final graphs are attached and kept
    updated

    In-depth reports are produced with
    knitr
    Reproducibility is Next to Godliness

    View Slide

  19. GitHub / datastarter

    View Slide

  20. One query per file, proper formatting

    View Slide

  21. Flexible and Transparent

    Ad hoc requests

    Engineering work

    In-depth research

    View Slide

  22. What are the most common 

    aspect ratios for project videos?

    Aspect ratio = width / height

    311 distinct aspect ratios

    Most common:

    1.77777 (16:9)

    1.33333 (4:3)

    1.78343

    View Slide

  23. Project video aspect ratio per month

    View Slide

  24. Internal Newsletter: METADATA

    One Liner stats

    Bigger Questions

    External Data Readings (from Twitter,
    Hacker News, etc.)

    Consistent source of inspiration for the
    us and the rest of the team

    View Slide

  25. Internal Documentation

    A useful place to store snippets of
    commonly used code (e.g. timezone
    conversion in SQL)

    Required reading for onboarding new
    data hires, helpful for engineers too

    Kickstarter SQL Style Guide

    View Slide

  26. Internal Tools

    Invest the time in parsing out what the
    common questions are

    … and build the tools to help your
    team answer them themselves

    Include disclaimers and ask people to
    gut-check the results

    View Slide

  27. Mixpanel

    Pricey, but product managers love it

    Helpful for looking at real-time product
    interaction

    Nightly export of raw JSON is
    excellent, was a deciding factor when
    choosing to go with Mixpanel

    current_user and
    current_project global objects &
    their wrappers are great

    View Slide

  28. Query Study: Live Projects

    Nightly denormalization checking how many projects are live:


    SELECT COUNT(*) FROM projects WHERE state = ‘live’

    Simple but not trivial to backfill old dates

    Historical, so no possible to split out by category, project size, etc.

    View Slide

  29. Desert time is for thinking about queries

    How could we arbitrarily re-generate the
    # of live projects per-day in Redshift?

    What if each project had a vote for the
    day that it was live?

    Something like:


    date, project_id

    2009-09-19, 1151

    2009-09-20, 1151

    2009-09-21, 1151

    ...

    2009-10-19, 1151

    View Slide

  30. GROUP BY to the rescue
    SELECT 

    date,

    COUNT(DISTINCT project_id) as live_project_count

    FROM 

    (

    2009-09-19, 1151

    2009-09-20, 1151

    2009-09-21, 1151

    ...

    2009-10-19, 1151

    2009-10-20, 2351

    ...

    2014-05–26, 12381281

    2014-05–27, 12381281

    ) 

    GROUP BY date


    View Slide

  31. The Trick

    “Explode” the # of days a project is live and tag it with
    those dates

    generate_sequence() in Postgres

    ??? in Redshift

    Enter: the integers table

    View Slide

  32. The Actual Query

    View Slide

  33. The Actual Query
    SELECT
    id,
    state,
    CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS launched_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline
    FROM
    ksr_projects)

    View Slide

  34. The Actual Query
    SELECT
    id,
    state,
    CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS launched_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline
    FROM
    ksr_projects)
    SELECT
    DATE(launched_at) + n AS date_live, 

    id AS project_id
    FROM (
    !
    WHERE
    n <= (
    DATE(
    CASE
    WHEN state_changed_at < deadline AND state != 'live'
    THEN state_changed_at
    ELSE deadline END
    ) - DATE(launched_at)
    )
    AND launched_at IS NOT NULL
    GROUP BY id, date_live, launched_at, deadline
    ORDER BY id, date_live)
    , integers

    View Slide

  35. The Actual Query
    SELECT
    id,
    state,
    CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS launched_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline
    FROM
    ksr_projects)
    SELECT
    DATE(launched_at) + n AS date_live, 

    id AS project_id
    FROM (
    !
    WHERE
    n <= (
    DATE(
    CASE
    WHEN state_changed_at < deadline AND state != 'live'
    THEN state_changed_at
    ELSE deadline END
    ) - DATE(launched_at)
    )
    AND launched_at IS NOT NULL
    GROUP BY id, date_live, launched_at, deadline
    ORDER BY id, date_live)
    , integers
    2009-09-19, 1151

    2009-09-20, 1151

    2009-09-21, 1151

    ...

    2009-10-19, 1151

    2009-10-20, 2351

    ...

    2014-05–26, 12381281

    2014-05–27, 12381281
    ⤵️

    View Slide

  36. The Actual Query
    SELECT
    id,
    state,
    CONVERT_TIMEZONE('UTC', 'America/New_York', launched_at) AS launched_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', state_changed_at) AS state_changed_at,
    CONVERT_TIMEZONE('UTC', 'America/New_York', deadline) AS deadline
    FROM
    ksr_projects)
    SELECT
    DATE(launched_at) + n AS date_live, 

    id AS project_id
    FROM (
    !
    WHERE
    n <= (
    DATE(
    CASE
    WHEN state_changed_at < deadline AND state != 'live'
    THEN state_changed_at
    ELSE deadline END
    ) - DATE(launched_at)
    )
    AND launched_at IS NOT NULL
    GROUP BY id, date_live, launched_at, deadline
    ORDER BY id, date_live)
    , integers
    SELECT
    date_live,
    COUNT(DISTINCT project_id) AS live_projects
    FROM (
    GROUP BY date_live;
    2009-09-19, 1151

    2009-09-20, 1151

    2009-09-21, 1151

    ...

    2009-10-19, 1151

    2009-10-20, 2351

    ...

    2014-05–26, 12381281

    2014-05–27, 12381281
    ⤵️

    View Slide

  37. How does the query compare to
    our historical record?
    Pretty close! Except there’s a gap.
    !
    Bonus points if you know why.

    View Slide

  38. Bonus Query: Window Functions
    SELECT
    event1,
    AVG(time2 - time1) AS average_time1_time2_difference,
    event2,
    AVG(time3 - time2) AS average_time2_time3_difference,
    event3,
    COUNT(*) AS count
    FROM (
    SELECT
    event,
    "time",
    LEAD(event, 1) OVER (PARTITION BY distinct_id ORDER BY "time") AS event1,
    LEAD("time", 1) OVER (PARTITION BY distinct_id ORDER BY "time") AS time1,
    LEAD(event, 2) OVER (PARTITION BY distinct_id ORDER BY "time") AS event2,
    LEAD("time", 2) OVER (PARTITION BY distinct_id ORDER BY "time") AS time2,
    LEAD(event, 3) OVER (PARTITION BY distinct_id ORDER BY "time") AS event3,
    LEAD("time", 3) OVER (PARTITION BY distinct_id ORDER BY "time") AS time3
    FROM mixpanel_events
    )
    WHERE
    event1 = 'New user'
    GROUP BY event1, event2, event3
    ORDER BY count DESC
    LIMIT 1000;
    WARNING: 12 hour query over billions of rows
    What are the two most
    common events that are
    triggered after New Users
    signup?

    View Slide


  39. Extending our Logistic Regression framework

    Realtime / Streaming events data into Redshift

    Data driven Trust & Safety risk analysis

    A/B Testing

    Automated reporting with knitr, R, Redshift

    More long-term research projects!
    Stuff We’re Working on Now

    View Slide

  40. We’re hiring!

    View Slide