
Introduction to Data Science and Analysis

Thomson Nguyen
December 17, 2012


This was a talk I gave at the General Assembly in SF.


Transcript

  1. Thomson Nguyen
    Data Scientist, Causes
    INTRO TO DATA SCIENCE
    AND ANALYSIS


  2. WHAT THIS TALK ISN’T 2
    ‣ A technical introduction to R or its methods
    ‣ Machine learning recipes that data scientists use
    ‣ A surefire method to land a job as a data scientist


  3. WHAT THIS TALK IS 3
    ‣ An overview of what a data-driven company is
    ‣ A look into a workflow of how a data scientist works
    ‣ An example case study of how data science adds value
    ‣ Resources on where to go after this talk


  4. AGENDA
    ‣ The Data Pyramid
    ‣ A Data Science Workflow
    ‣ Case Study: Causes and petition signing
    ‣ Pair programming on a Data Science problem
    ‣ Further resources
    4


  5. INTRO TO DATA SCIENCE AND ANALYTICS
    THE DATA PYRAMID 5
    OR, THE ROAD TO BEING A DATA-DRIVEN COMPANY


  6. THE DATA PYRAMID 6


  7. THE DATA PYRAMID 7
    Defining Key Performance Metrics
    • Find out what makes a company successful
    • Align the entire company's goals and roadmap around these KPIs
    • Examples: daily revenue, DAU/MAU, average revenue per user, time spent on site, site uptime, etc.


  8. THE DATA PYRAMID 8
    Data Warehousing
    • Taking production data from disparate sources and replicating it into a central database
    • Schema is planned and arranged according to the KPIs defined earlier
    • Single source of truth for all analysis and queries (i.e., no one is querying production)


  9. THE DATA PYRAMID 9
    Analytics and BI
    • From a data warehouse, we can create KPI reports for nontechnical stakeholders (marketing, C-levels, board members, etc.)
    • We can also analyze certain metrics and tables and make hypotheses about specific trends in our data.


  10. THE DATA PYRAMID 10
    Data Science
    • From our hypotheses we can create features and variables, and using machine learning we can write algorithms and models to:
    • Explain the past, or
    • Predict the future.
    • These models can be user-facing (product features) or internal.


  11. THE DATA PYRAMID - USER-FACING EXAMPLE 11
    Source: amazon.com


  12. THE DATA PYRAMID - USER-FACING EXAMPLE 12
    Source: amazon.com


  13. THE DATA PYRAMID - USER-FACING EXAMPLE 13
    Source: linkedin.com


  14. THE DATA PYRAMID - USER-FACING EXAMPLE 14
    Source: linkedin.com


  15. INTRO TO DATA SCIENCE AND ANALYTICS
    THE DATA SCIENCE WORKFLOW
    15
    OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE


  16. INTRO TO DATA SCIENCE AND ANALYTICS
    A DATA SCIENCE WORKFLOW
    16
    OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE


  17. A DATA SCIENCE WORKFLOW 17
    BEN FRY, FATHOM INFORMATION DESIGN
    • Acquire - Where does the data come from?
    • Parse - How do we preprocess the data?
    • Filter - How do we transform the data?
    • Mine - What machine learning algorithms can we use?
    • Represent - How can we visualize our results?
    • Refine - How can we take feedback to make it more accurate?
    • Interact - How will this make the end-user’s life better?
    Source: CMU


  18. A DATA SCIENCE WORKFLOW 18
    BEN FRY, FATHOM INFORMATION DESIGN
    Source: CMU


  19. A DATA SCIENCE WORKFLOW 19
    BEN FRY, FATHOM INFORMATION DESIGN
    Source: CMU


  20. A DATA SCIENCE WORKFLOW 20
    BEN FRY, FATHOM INFORMATION DESIGN
    Source: CMU


  21. INTRO TO DATA SCIENCE AND ANALYTICS
    CASE STUDY
    21
    PREDICTING THE PROBABILITY OF SIGNING A PETITION


  22. CASE STUDY - CAUSES 22
    KEY CHALLENGE / QUESTION
    Can we create a model that predicts the probability that a user on our website will sign a petition? Can we use this model to predict the probability for future actions (petitions and pledges)?
    SUMMARY
    We want to use our massive amounts of data to create a probability that someone will sign a petition or take a pledge on causes.com. We'll be looking at a specific petition on Causes as our target petition. Our dataset will consist of users who have either signed the petition or have not signed it, along with their features (age, gender, number of actions taken, etc.).


  23. CASE STUDY - CAUSES 23


  24. CASE STUDY - ACQUIRE 24
    (diagram: Signatures, Users, Locations, and FB Likes feeding into SQL databases)
    ‣ Our user data and FB likes are gathered from FBConnect, which is required for all Causes users
    ‣ Location data is gathered from a user's IP address while they are on our site.
    ‣ When a user signs the petition, we record it in a database consisting of action_credits.


  25. CASE STUDY - PARSE 25
    (diagram: SQL databases → ETL scripts → Analytics DB / Data Warehouse)
    ‣ Our production data is stored in horizontally sharded MySQL databases
    ‣ We extract the relevant data from these individual production databases,
    ‣ then transform the data to fit a certain schema,
    ‣ and finally we load the data into a central analytics DB.
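The extract-transform-load loop on this slide can be sketched in a few lines. This is only a minimal illustration of the pattern, using SQLite as a stand-in for the MySQL shards and a hypothetical `action_credits` table; it is not Causes' actual pipeline:

```python
# ETL sketch: extract rows from each production shard, transform them to a
# common schema, and load them into one central analytics database.
# Paths, table names, and columns are illustrative.
import sqlite3

def etl(shard_paths, analytics_path):
    analytics = sqlite3.connect(analytics_path)
    analytics.execute(
        "CREATE TABLE IF NOT EXISTS signatures (user_id INTEGER, action_id INTEGER)"
    )
    for path in shard_paths:
        shard = sqlite3.connect(path)
        # Extract: pull the relevant rows from this production shard.
        rows = shard.execute(
            "SELECT user_id, action_id FROM action_credits"
        ).fetchall()
        shard.close()
        # Transform + load: here the schemas already match, so we just
        # insert the rows into the central analytics table.
        analytics.executemany("INSERT INTO signatures VALUES (?, ?)", rows)
    analytics.commit()
    return analytics
```

In a real warehouse the "transform" step would reshape each shard's rows to the schema planned around the KPIs, but the extract/transform/load structure is the same.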


  26. CASE STUDY - FILTER 26
    PRIMER ON THE NORMAL DISTRIBUTION
    ‣ A probability distribution that plots all of its values in a symmetrical fashion
    ‣ Most of the results are situated around the distribution's mean (the center)
    ‣ Values are equally likely to plot above or below the mean.
    ‣ Values cluster close to the mean and tail off symmetrically away from it.


  27. CASE STUDY - FILTER 27
    ‣ We'll want to normalize certain features (that is, transform variables so they approximate a normal distribution)
    ‣ The top two plots show the age distribution of Causes users and its Q-Q plot
    ‣ Taking log(age) got us closer to a normal distribution (bottom two plots).
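The log-transform step above can be illustrated with synthetic right-skewed "age"-like data. This sketch uses sample skewness as a rough stand-in for eyeballing the Q-Q plots on the slide; the data and threshold are made up for the example:

```python
# Log-transforming a right-skewed variable pulls it toward normality.
# Synthetic ages drawn from a lognormal, so log(age) is exactly normal.
import math
import random

random.seed(0)
ages = [random.lognormvariate(3.3, 0.4) for _ in range(1000)]

def skewness(xs):
    """Sample skewness: third standardized moment (0 for a normal)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)

raw_skew = skewness(ages)                        # noticeably positive
log_skew = skewness([math.log(a) for a in ages])  # close to zero
```

A Q-Q plot of `log(ages)` against normal quantiles would likewise fall much closer to the diagonal than a Q-Q plot of the raw ages.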


  28. CASE STUDY - MINE 28
    id user_id target_user_id action_id age is_female is_vet is_south created_at signed_petition
    1 13215827 NULL 5390823 19 0 1 1 28-nov-2012 1
    2 1614524 120583933 5390823 25 1 1 0 28-nov-2012 1
    3 11058392 NULL 6729301 29 1 0 0 28-nov-2012 0
    4 9529371 168920339 8950132 28 0 0 1 28-nov-2012 0
    5 10385293 NULL 5390823 24 1 1 1 28-nov-2012 1
    ‣ This is a subset of the total number of columns we have in our dataset.
    ‣ The actual dataset has approximately 110 columns, each representing either a demographic feature (age, gender, location) or a psychographic/behavioral feature (is a veteran, number of petitions signed on Causes, number of related FB likes)


  29. CASE STUDY - MINE 29
    id user_id target_user_id action_id age is_female is_vet is_south created_at signed_petition
    1 13215827 NULL 5390823 19 0 1 1 28-nov-2012 1
    2 1614524 120583933 5390823 25 1 1 0 28-nov-2012 1
    3 11058392 NULL 6729301 29 1 0 0 28-nov-2012 0
    4 9529371 168920339 8950132 28 0 0 1 28-nov-2012 0
    5 10385293 NULL 5390823 24 1 1 1 28-nov-2012 1
    ‣ This is a subset of the total number of columns we have in our dataset.
    ‣ The actual dataset has approximately 110 columns, each representing either a demographic feature (age, gender, location) or a psychographic/behavioral feature (is a veteran, number of petitions signed on Causes, number of related FB likes). For this example, we'll only use the features above.
    ‣ Our response variable (the feature we're predicting) is whether or not the user signed the actual petition.


  30. CASE STUDY - MINE 30
    We'll be using a logistic regression to calculate this probability. In R:
    action_data = read.csv("actions.csv")
    logit_model = glm(signed_petition ~
        age + is_female + is_vet + is_south,
      data = action_data, family = "binomial")


  31. CASE STUDY - REPRESENT 31
    The condensed output in R looks like this:
    ## Call:
    ## glm(signed_petition ~ age + is_female + is_vet + is_south,
    ## data = action_data, family = "binomial")
    ## Coefficients:
    ## Estimate Std. Error z value Pr(>|z|)
    ## (Intercept) 0.01998 1.13995 -3.50 0.00047 ***
    ## age 0.00226 0.00109 2.07 0.03847 *
    ## is_female -0.10404 0.33182 2.42 0.01539 **
    ## is_vet 0.67544 0.31649 -2.13 0.03283 *
    ## is_south 0.34020 0.34531 -3.88 0.00010 ***
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##


  32. CASE STUDY - REPRESENT 32
    Here's a distribution of the regression, drilled down by age (plot spans ages 20 to 80).


  33. CASE STUDY - INTERACT 33
    id user_id signed_petition predicted_probability
    1 13215827 1 0.852
    2 1614524 1 0.931
    3 11058392 0 0.394
    4 9529371 0 0.485
    5 10385293 1 0.729
    ‣ Our result is a predicted probability associated with each user_id in our dataset
    ‣ We can use these probabilities to try to create a user-facing product that will result in more signatures and more site activity.
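The scoring step that produces a per-user probability can be sketched as applying the fitted model's logistic link to each row of features. The coefficients and users below are illustrative placeholders, not the fitted values from the R output above:

```python
# Score each user with a fitted logistic regression:
# probability = 1 / (1 + exp(-(intercept + sum(coef_i * feature_i)))).
# Coefficients here are made-up stand-ins for the fitted model.
import math

coef = {"intercept": -1.0, "age": 0.02, "is_female": 0.3,
        "is_vet": 0.7, "is_south": 0.3}

def sign_probability(user):
    z = coef["intercept"] + sum(
        coef[k] * user[k] for k in ("age", "is_female", "is_vet", "is_south")
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link maps z to (0, 1)

users = [
    {"user_id": 13215827, "age": 19, "is_female": 0, "is_vet": 1, "is_south": 1},
    {"user_id": 11058392, "age": 29, "is_female": 1, "is_vet": 0, "is_south": 0},
]
scores = {u["user_id"]: sign_probability(u) for u in users}
```

In R the same step is just `predict(logit_model, newdata, type = "response")`; the point is that every user gets a number in (0, 1) that downstream products can rank on.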


  34. CASE STUDY - INTERACT 34
    People with high probabilities will receive military-themed actions in their inbox:


  35. CASE STUDY - REFINE 35
    Based on the results of these e-mail blasts, we can refine and improve our logistic regression as necessary.
    And the cycle continues until we've achieved some set benchmark for success.


  36. CASE STUDY - REPRESENT 36


  37. CASE STUDY - REPRESENT 36


  38. CASE STUDY - REPRESENT 36


  39. DEMO 37
    • Hospitalization in the United States:
    • ~71 million people were admitted to the hospital last year
    • Of these 71 million people, 11 million admissions were classified as "unnecessary," resulting in $30bn in unnecessary expenditure.
    • The majority (83%) of these admissions were made by GPs and Managed Care Organizations.


  40. DEMO 38
    • Hospitalization in the United States:
    • ~71 million people were admitted to the hospital last year
    • Of these 71 million people, 11 million admissions were classified as "unnecessary," resulting in $30bn in unnecessary expenditure.
    • The majority (83%) of these admissions were made by GPs and Managed Care Organizations.
    Is there a data-driven approach that GPs can use to assist their diagnoses and decrease false positives?


  41. INTRO TO DATA SCIENCE AND ANALYTICS
    FURTHER RESOURCES
    39
    WHERE TO GO FROM HERE


  42. FURTHER RESOURCES - BOOKS 40


  43. FURTHER RESOURCES - BOOKS 41


  44. FURTHER RESOURCES - BOOKS 42


  45. FURTHER RESOURCES - KAGGLE 43


  46. Q&A
    DATA SCIENCE AND ANALYTICS 44


  47. INTRO TO DATA SCIENCE AND ANALYTICS
    THANKS!
    45
    E-MAIL ME QUESTIONS: [email protected]
    OR SEND ME A TWEET: @ITSTHOMSON
