Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Introduction to Data Science and Analysis

Thomson Nguyen
December 17, 2012

Introduction to Data Science and Analysis

This was a talk I gave at the General Assembly in SF.

Thomson Nguyen

December 17, 2012
Tweet

More Decks by Thomson Nguyen

Other Decks in Research

Transcript

  1. WHAT THIS TALK ISN’T 2 ‣ A technical introduction to

    R or its methods ‣ Machine learning recipes that data scientists use ‣ A surefire method to land a job as a data scientist
  2. WHAT THIS TALK IS 3 ‣ An overview of what

    a data-driven company is ‣ A look into a workflow of how a data scientist works ‣ An example case study of how data science adds value ‣ Resources on where to go after this talk
  3. AGENDA ‣ The Data Pyramid ‣ A Data Science Workflow

    ‣ Case Study: Causes and petition signing ‣ Pair programming on a Data Science problem ‣ Further resources 4
  4. INTRO TO DATA SCIENCE AND ANALYTICS THE DATA PYRAMID 5

    OR, THE ROAD TO BEING A DATA-DRIVEN COMPANY
  5. THE DATA PYRAMID 7 Defining Key Performance Metrics • Find

    out what makes a company successful • Align entire company goals and roadmap around these KPIs • Examples: Daily revenue, DAU/MAUs, Average Revenue Per User, Time spent on site, Site Uptime, etc.
  6. THE DATA PYRAMID 8 Data Warehousing • Taking production data

    from disparate sources and replicating it into a central database • Schema is planned and arranged according to KPIs defined earlier • Single source of truth for all analysis and queries (i.e., no one is using production) Defining Key Performance Metrics
  7. THE DATA PYRAMID 9 Data Warehousing • From a data

    warehouse, we can create KPI reports for nontechnical stakeholders (marketing, C-levels, board members, etc.) • We can also analyze certain metrics and tables and make hypotheses about specific trends in our data. Defining Key Performance Metrics Analytics and BI
  8. THE DATA PYRAMID 10 Data Warehousing • We can create

    features and variables from our hypotheses and using machine learning, we can write algorithms and models to: • Explain the past, or • Predict the future. • These models can be user- facing (product features), or internal. Defining Key Performance Metrics Analytics and BI Data Science
  9. INTRO TO DATA SCIENCE AND ANALYTICS THE DATA SCIENCE WORKFLOW

    15 OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE
  10. INTRO TO DATA SCIENCE AND ANALYTICS THE DATA SCIENCE WORKFLOW

    16 A OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE
  11. A DATA SCIENCE WORKFLOW 17 BEN FRY, FATHOM INFORMATION DESIGN

    • Acquire - Where does the data come from? • Parse - How do we preprocess the data? • Filter - How do we transform the data? • Mine - What machine learning algorithms can we use? • Represent - How can we visualize our results? • Refine - How can we take feedback to make it more accurate? • Interact - How will this make the end-user’s life better? Source: CMU
  12. INTRO TO DATA SCIENCE AND ANALYTICS CASE STUDY 21 PREDICTING

    THE PROBABILITY OF SIGNING A PETITION
  13. SUMMARY KEY CHALLENGE / QUESTION CASE STUDY Can we create

    a model that predicts the probability that a user on our website will sign a petition? Can we use this model to predict the probability for future actions (petitions and pledges?) We want to use our massive amounts of data to create a probability that someone will sign a petition or take a pledge on causes.com. We’ll be looking at a specific petition on Causes as our target petition. Our dataset will consist of users who have either signed the petition or have not signed the petition, along with their features (age, gender, number of actions taken, etc.) CAUSES 22
  14. CASE STUDY - ACQUIRE 24 Signatures Users Locations FB Likes

    SQL databases Data Warehousing Defining Key Performance Metrics Analytics and BI Data Science ‣ Our user data and FB likes are gathered from FBConnect, which is required for all Causes users ‣ Location data is gathered from a user’s IP address while they are on our site. ‣ When a user signs the petition, we record it in a database consisting of action_credits. 60.34.92.5
  15. CASE STUDY - PARSE 25 Signatures Users Locations FB Likes

    SQL databases ETL scripts Analytics DB (Data Warehouse) Data Warehousing Defining Key Performance Metrics Analytics and BI Data Science ‣ Our production data is stored in horizontally sharded databases in MySQL ‣ We extract the relevant data from these individual production databases, ‣ then transform the data to fit a certain schema, ‣ and finally we load the data into a central analytics DB.
  16. CASE STUDY - FILTER 26 ‣ A probability distribution that

    plots all of its values in a symmetrical fashion ‣ Most of the results are situated around the probability's mean (the center) ‣ Values are equally likely to plot either above or below the mean. ‣ Grouping takes place at values that are close to the mean and then tails off symmetrically away from the mean. PRIMER ON NORMAL DISTRIBUTION
  17. CASE STUDY - FILTER 27 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science ‣ We’ll want to normalize certain features (that is, transform variables to represent a normal distribution) ‣ The top two plots represent Causes users and a QQ-distribution ‣ We took the log(age) and got closer to a normal distribution (bottom two plots).
  18. CASE STUDY - MINE 28 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science id user_id target_user_id action_id age is_female is_vet is_south created_at signed_petition 1 13215827 NULL 5390823 19 0 1 1 28-nov-2012 1 2 1614524 120583933 5390823 25 1 1 0 28-nov-2012 1 3 11058392 NULL 6729301 29 1 0 0 28-nov-2012 0 4 9529371 168920339 8950132 28 0 0 1 28-nov-2012 0 5 10385293 NULL 5390823 24 1 1 1 28-nov-2012 1 ‣ This is a subset of the total number of columns we have in our dataset. ‣ The actual dataset has approximately 110 columns, representing either a demographic feature (age, gender, location), or a psychographic/behavioral feature (is a veteran, number of petitions signed on Causes, number of related FB likes)
  19. CASE STUDY - MINE 29 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science id user_id target_user_id action_id age is_female is_vet is_south created_at signed_petition 1 13215827 NULL 5390823 19 0 1 1 28-nov-2012 1 2 1614524 120583933 5390823 25 1 1 0 28-nov-2012 1 3 11058392 NULL 6729301 29 1 0 0 28-nov-2012 0 4 9529371 168920339 8950132 28 0 0 1 28-nov-2012 0 5 10385293 NULL 5390823 24 1 1 1 28-nov-2012 1 ‣ This is a subset of the total number of columns we have in our dataset. ‣ The actual dataset has approximately 110 columns, representing either a demographic feature (age, gender, location), or a psychographic/behavioral feature (is a veteran, number of petitions signed on Causes, number of related FB likes). For this example, we’ll only use the above features. ‣ Our response variable (the feature we’re predicting) is whether or not they signed the actual petition.
  20. CASE STUDY - MINE 30 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science We’ll be using a logistic regression to calculate this probability: action_data = read.csv(“actions.csv”) logit_model = glm(signed_petition ~ age + is_female is_vet + is_south, data = action_data, family = "binomial") In R:
  21. CASE STUDY - REPRESENT 31 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science The condensed output in R looks like this: ## Call: ## glm(signed_petition ~ age + is_female + is_vet + is_south, ## data = action_data, family = "binomial") ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.01998 1.13995 -3.50 0.00047 *** ## age 0.00226 0.00109 2.07 0.03847 * ## is_female -0.10404 0.33182 2.42 0.01539 ** ## is_vet 0.67544 0.31649 -2.13 0.03283 * ## is_south 0.34020 0.34531 -3.88 0.00010 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ##
  22. CASE STUDY - REPRESENT 32 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science Here’s a distribution of the regression, drilled down by age: 20 30 40 50 60 70 80
  23. CASE STUDY - INTERACT 33 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science id user_id signed_ petition predicted_ probability 1 13215827 1 0.852 2 1614524 1 0.931 3 11058392 0 0.394 4 9529371 0 0.485 5 10385293 1 0.729 ‣ Our result is an associated probability with each user_id in our dataset ‣ We can use these probabilities to try and create a user facing product that will result in more signatures, and more site activity.
  24. CASE STUDY - INTERACT 34 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science People with high probabilities will received military-themed actions in their inbox:
  25. CASE STUDY - REFINE 35 Data Warehousing Defining Key Performance

    Metrics Analytics and BI Data Science Based on the results of these e-mail blasts, we can refine and improve our logistic regression as necessary: And the cycle continues until we’ve achieved some set benchmark for success.
  26. DEMO 37 • Hospitalization in the United States: •~71 million

    people were admitted into the hospital last year • Of these 71 million people, 11 million of them were classified as “unnecessary”, resulting in $30bn in unnecessary expenditure. •The majority (83%) of these admissions were made by GPs and Managed Care Organizations.
  27. DEMO 38 • Hospitalization in the United States: •~71 million

    people were admitted into the hospital last year • Of these 71 million people, 11 million of them were classified as “unnecessary”, resulting in $30bn in unnecessary expenditure. •The majority (83%) of these admissions were made by GPs and Managed Care Organizations. Is there a data-driven approach that GPs can use to assist their diagnoses and decrease false positives?