Slide 1

Slide 1 text

INTRO TO DATA SCIENCE AND ANALYSIS
Thomson Nguyen, Data Scientist, Causes

Slide 2

Slide 2 text

WHAT THIS TALK ISN’T
‣ A technical introduction to R or its methods
‣ Machine learning recipes that data scientists use
‣ A surefire method to land a job as a data scientist

Slide 3

Slide 3 text

WHAT THIS TALK IS
‣ An overview of what a data-driven company is
‣ A look into a workflow of how a data scientist works
‣ An example case study of how data science adds value
‣ Resources on where to go after this talk

Slide 4

Slide 4 text

AGENDA
‣ The Data Pyramid
‣ A Data Science Workflow
‣ Case Study: Causes and petition signing
‣ Pair programming on a Data Science problem
‣ Further resources

Slide 5

Slide 5 text

INTRO TO DATA SCIENCE AND ANALYTICS
THE DATA PYRAMID
OR, THE ROAD TO BEING A DATA-DRIVEN COMPANY

Slide 6

Slide 6 text

THE DATA PYRAMID

Slide 7

Slide 7 text

THE DATA PYRAMID
Defining Key Performance Metrics
• Find out what makes a company successful
• Align entire company goals and roadmap around these KPIs
• Examples: daily revenue, DAU/MAU, average revenue per user, time spent on site, site uptime, etc.
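To make the KPI examples concrete, here is a minimal sketch in R of how DAU, MAU, and average revenue per user might be computed from an event log. The `events` data frame, its column names, and all of its values are invented for illustration, and the exact definition of each metric will differ from company to company.

```r
# Hypothetical event log: one row per user visit (all values invented).
events <- data.frame(
  user_id = c(1, 2, 1, 3, 2, 1),
  day     = as.Date(c("2012-11-01", "2012-11-01", "2012-11-02",
                      "2012-11-02", "2012-11-15", "2012-11-15")),
  revenue = c(0, 5, 0, 10, 0, 5)
)

# DAU: number of distinct users active on each day.
dau <- tapply(events$user_id, events$day, function(u) length(unique(u)))

# MAU: number of distinct users active at any point in the month.
mau <- length(unique(events$user_id))

# Stickiness: average DAU as a share of MAU.
stickiness <- mean(dau) / mau

# ARPU: total revenue divided by the number of active users.
arpu <- sum(events$revenue) / mau
```

Once definitions like these are agreed on, they can drive the warehouse schema and the reports built on top of it.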

Slide 8

Slide 8 text

THE DATA PYRAMID
Data Warehousing
• Taking production data from disparate sources and replicating it into a central database
• Schema is planned and arranged according to the KPIs defined earlier
• Single source of truth for all analysis and queries (i.e., no one is querying production)
Pyramid levels so far: Defining Key Performance Metrics, Data Warehousing

Slide 9

Slide 9 text

THE DATA PYRAMID
Analytics and BI
• From a data warehouse, we can create KPI reports for nontechnical stakeholders (marketing, C-levels, board members, etc.)
• We can also analyze specific metrics and tables and form hypotheses about trends in our data.
Pyramid levels so far: Defining Key Performance Metrics, Data Warehousing, Analytics and BI

Slide 10

Slide 10 text

THE DATA PYRAMID
Data Science
• We can create features and variables from our hypotheses and, using machine learning, write algorithms and models to:
  • Explain the past, or
  • Predict the future.
• These models can be user-facing (product features) or internal.
Pyramid levels: Defining Key Performance Metrics, Data Warehousing, Analytics and BI, Data Science

Slide 11

Slide 11 text

THE DATA PYRAMID - USER-FACING EXAMPLE
Source: amazon.com

Slide 12

Slide 12 text

THE DATA PYRAMID - USER-FACING EXAMPLE
Source: amazon.com

Slide 13

Slide 13 text

THE DATA PYRAMID - USER-FACING EXAMPLE
Source: linkedin.com

Slide 14

Slide 14 text

THE DATA PYRAMID - USER-FACING EXAMPLE
Source: linkedin.com

Slide 15

Slide 15 text

THE DATA SCIENCE WORKFLOW
OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE

Slide 16

Slide 16 text

A DATA SCIENCE WORKFLOW
OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE

Slide 17

Slide 17 text

A DATA SCIENCE WORKFLOW
BEN FRY, FATHOM INFORMATION DESIGN
• Acquire - Where does the data come from?
• Parse - How do we preprocess the data?
• Filter - How do we transform the data?
• Mine - What machine learning algorithms can we use?
• Represent - How can we visualize our results?
• Refine - How can we take feedback to make it more accurate?
• Interact - How will this make the end-user’s life better?
Source: CMU

Slide 18

Slide 18 text

A DATA SCIENCE WORKFLOW
BEN FRY, FATHOM INFORMATION DESIGN
Source: CMU

Slide 19

Slide 19 text

A DATA SCIENCE WORKFLOW
BEN FRY, FATHOM INFORMATION DESIGN
Source: CMU

Slide 20

Slide 20 text

A DATA SCIENCE WORKFLOW
BEN FRY, FATHOM INFORMATION DESIGN
Source: CMU

Slide 21

Slide 21 text

CASE STUDY
PREDICTING THE PROBABILITY OF SIGNING A PETITION

Slide 22

Slide 22 text

CASE STUDY - CAUSES
KEY CHALLENGE / QUESTION
Can we create a model that predicts the probability that a user on our website will sign a petition? Can we use this model to predict the probability of future actions (petitions and pledges)?
SUMMARY
We want to use our massive amounts of data to estimate the probability that someone will sign a petition or take a pledge on causes.com. We’ll be looking at a specific petition on Causes as our target petition. Our dataset will consist of users who either have or have not signed the petition, along with their features (age, gender, number of actions taken, etc.).

Slide 23

Slide 23 text

CASE STUDY - CAUSES

Slide 24

Slide 24 text

CASE STUDY - ACQUIRE
[Diagram: Signatures, Users, Locations, and FB Likes feed into SQL databases; the figure is labeled with the IP address 60.34.92.5]
‣ Our user data and FB likes are gathered from FBConnect, which is required for all Causes users.
‣ Location data is gathered from a user’s IP address while they are on our site.
‣ When a user signs the petition, we record it in a database consisting of action_credits.

Slide 25

Slide 25 text

CASE STUDY - PARSE
[Diagram: Signatures, Users, Locations, and FB Likes in SQL databases, then ETL scripts, then the Analytics DB (Data Warehouse)]
‣ Our production data is stored in horizontally sharded MySQL databases.
‣ We extract the relevant data from these individual production databases,
‣ then transform the data to fit a certain schema,
‣ and finally load the data into a central analytics DB.
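The extract, transform, and load steps can be sketched in R as follows. This is only an illustration: in-memory data frames stand in for the MySQL shards, a temporary CSV file stands in for the analytics DB, and every table, column name, and value is an assumption rather than the actual Causes schema.

```r
# Extract: two hypothetical shards of a production users table, stood in for
# by in-memory data frames (a real ETL script would query each MySQL shard).
shard_a <- data.frame(user_id = c(1, 2), birth_year = c(1993, 1987),
                      gender = c("M", "F"), stringsAsFactors = FALSE)
shard_b <- data.frame(user_id = c(3, 4), birth_year = c(1983, 1984),
                      gender = c("F", "M"), stringsAsFactors = FALSE)
extracted <- rbind(shard_a, shard_b)

# Transform: reshape into the analytics schema (column names are assumptions).
transformed <- data.frame(
  user_id   = extracted$user_id,
  age       = 2012 - extracted$birth_year,
  is_female = as.integer(extracted$gender == "F")
)

# Load: write to the central store; a CSV file stands in for the analytics DB.
warehouse_path <- tempfile(fileext = ".csv")
write.csv(transformed, warehouse_path, row.names = FALSE)

# Analysts then read from the warehouse, never from production.
warehouse <- read.csv(warehouse_path)
```

Keeping the transform step separate from extraction makes it easier to change the analytics schema without touching how the shards are read.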

Slide 26

Slide 26 text

CASE STUDY - FILTER
PRIMER ON NORMAL DISTRIBUTION
‣ A normal distribution is a probability distribution whose values are spread symmetrically around the mean.
‣ Most values are situated around the mean (the center).
‣ A value is equally likely to fall above or below the mean.
‣ Values group close to the mean, and the distribution tails off symmetrically away from it.
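These properties can be checked numerically. The sketch below uses the standard normal distribution (mean 0, standard deviation 1) and base R's `pnorm` and `dnorm`; the choice of 1.5 as the comparison point is arbitrary.

```r
# Standard normal distribution: mean 0, standard deviation 1.
# pnorm(q) gives the probability that a value falls at or below q.

# Symmetry: half of the probability lies below the mean.
p_below_mean <- pnorm(0)

# Clustering: share of values within 1, 2, and 3 standard deviations.
within_sd <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))

# Symmetric tails: the density is the same at equal distances from the mean.
tails_match <- all.equal(dnorm(-1.5), dnorm(1.5))

print(round(within_sd, 3))
```

Roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.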

Slide 27

Slide 27 text

CASE STUDY - FILTER
‣ We’ll want to normalize certain features (that is, transform variables so that their distribution is closer to normal).
‣ The top two plots show the distribution for Causes users and the corresponding Q-Q plot.
‣ We took log(age) and got closer to a normal distribution (bottom two plots).
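A rough sketch of the same idea in R, using simulated data rather than real Causes users: the ages are drawn from a log-normal distribution purely as an assumption, so that the raw values are right-skewed. Skewness and the correlation of the Q-Q points are used here as simple numeric stand-ins for reading the plots by eye.

```r
set.seed(42)  # make the simulation reproducible

# Simulated right-skewed ages (log-normal); not real Causes data.
age <- rlnorm(5000, meanlog = log(30), sdlog = 0.4)

# Sample skewness: 0 for a symmetric distribution, positive for a right skew.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(age)
skew_log <- skewness(log(age))

# Correlation between sample quantiles and normal quantiles: the closer to 1,
# the closer the Q-Q points lie to a straight line.
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq_raw <- qq_cor(age)
qq_log <- qq_cor(log(age))
```

Calling `qqnorm(age)` followed by `qqline(age)` without `plot.it = FALSE` draws the Q-Q plot itself.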

Slide 29

Slide 29 text

CASE STUDY - MINE
id  user_id   target_user_id  action_id  age  is_female  is_vet  is_south  created_at   signed_petition
1   13215827  NULL            5390823    19   0          1       1         28-nov-2012  1
2   1614524   120583933       5390823    25   1          1       0         28-nov-2012  1
3   11058392  NULL            6729301    29   1          0       0         28-nov-2012  0
4   9529371   168920339       8950132    28   0          0       1         28-nov-2012  0
5   10385293  NULL            5390823    24   1          1       1         28-nov-2012  1
‣ This is a subset of the columns we have in our dataset.
‣ The actual dataset has approximately 110 columns, each representing either a demographic feature (age, gender, location) or a psychographic/behavioral feature (is a veteran, number of petitions signed on Causes, number of related FB likes). For this example, we’ll only use the features above.
‣ Our response variable (the value we’re predicting) is whether or not the user signed the petition.

Slide 30

Slide 30 text

CASE STUDY - MINE
We’ll be using a logistic regression to calculate this probability. In R:

    # Load the dataset of users and their features
    action_data <- read.csv("actions.csv")
    # Fit a logistic regression of signing on the four features
    logit_model <- glm(signed_petition ~ age + is_female + is_vet + is_south,
                       data = action_data, family = "binomial")

Slide 31

Slide 31 text

CASE STUDY - REPRESENT
The condensed output in R looks like this:

## Call:
## glm(signed_petition ~ age + is_female + is_vet + is_south,
##     data = action_data, family = "binomial")
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.01998    1.13995   -3.50  0.00047 ***
## age          0.00226    0.00109    2.07  0.03847 *
## is_female   -0.10404    0.33182    2.42  0.01539 **
## is_vet       0.67544    0.31649   -2.13  0.03283 *
## is_south     0.34020    0.34531   -3.88  0.00010 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
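A logistic regression models the log-odds of signing as a linear combination of the features, and the inverse logit turns that into a probability. As a minimal sketch, the snippet below plugs the Estimate column from the output above into that formula for one user. The user's feature values are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the Estimate column of the output above.
coefs <- c("(Intercept)" = 0.01998, age = 0.00226, is_female = -0.10404,
           is_vet = 0.67544, is_south = 0.34020)

# Hypothetical user: a 25-year-old female veteran who does not live in the
# South. The leading 1 multiplies the intercept.
user <- c(1, age = 25, is_female = 1, is_vet = 1, is_south = 0)

# Linear predictor (log-odds), then the inverse logit to get a probability.
log_odds <- sum(coefs * user)
prob <- 1 / (1 + exp(-log_odds))

# exp(coefficient) is the multiplicative change in the odds of signing for a
# one-unit increase in that feature, holding the others fixed.
odds_ratios <- exp(coefs[-1])
```

A positive estimate gives an odds ratio above 1 (the feature is associated with higher odds of signing), and a negative estimate gives one below 1.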

Slide 32

Slide 32 text

CASE STUDY - REPRESENT
Here’s a distribution of the regression, drilled down by age:
[Plot: horizontal axis with ticks from 20 to 80]

Slide 33

Slide 33 text

CASE STUDY - INTERACT
id  user_id   signed_petition  predicted_probability
1   13215827  1                0.852
2   1614524   1                0.931
3   11058392  0                0.394
4   9529371   0                0.485
5   10385293  1                0.729
‣ Our result is a predicted probability for each user_id in our dataset.
‣ We can use these probabilities to try to create a user-facing product that will result in more signatures and more site activity.

Slide 34

Slide 34 text

CASE STUDY - INTERACT
People with high probabilities will receive military-themed actions in their inbox.
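One way to go from a fitted model to a list of e-mail recipients is sketched below. Since actions.csv is not available here, the data is simulated: the sample size, the assumed "true" signing behaviour, and the 0.6 cutoff are all invented for illustration and are not the values used at Causes.

```r
set.seed(1)  # reproducible simulation

# Simulated stand-in for actions.csv (all values invented).
n <- 1000
action_data <- data.frame(
  user_id   = seq_len(n),
  age       = sample(18:80, n, replace = TRUE),
  is_female = rbinom(n, 1, 0.5),
  is_vet    = rbinom(n, 1, 0.3),
  is_south  = rbinom(n, 1, 0.4)
)
# Assumed "true" signing behaviour, used only to generate example labels.
true_log_odds <- -1 + 0.01 * action_data$age + 1.2 * action_data$is_vet
action_data$signed_petition <- rbinom(n, 1, plogis(true_log_odds))

logit_model <- glm(signed_petition ~ age + is_female + is_vet + is_south,
                   data = action_data, family = "binomial")

# type = "response" returns probabilities rather than log-odds.
action_data$predicted_probability <- predict(logit_model, type = "response")

# Pick users above an arbitrary cutoff as e-mail recipients.
cutoff <- 0.6
recipients <- action_data$user_id[action_data$predicted_probability > cutoff]
```

In practice the cutoff would be tuned against the cost of sending an e-mail and the value of an extra signature.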

Slide 35

Slide 35 text

CASE STUDY - REFINE
Based on the results of these e-mail blasts, we can refine and improve our logistic regression as necessary. The cycle continues until we’ve achieved some set benchmark for success.
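A benchmark for success needs a metric. As a minimal sketch, the snippet below scores the five sample rows from the predicted-probability table using accuracy and log loss. The 0.30 benchmark is a hypothetical value, and in a real refinement cycle the outcomes would come from new e-mail blast results rather than from the rows the model was shown.

```r
# Outcomes and predicted probabilities from the five sample rows shown earlier.
signed_petition       <- c(1, 1, 0, 0, 1)
predicted_probability <- c(0.852, 0.931, 0.394, 0.485, 0.729)

# Accuracy at a 0.5 cutoff: share of users whose predicted class matches.
predicted_class <- as.integer(predicted_probability > 0.5)
accuracy <- mean(predicted_class == signed_petition)

# Log loss penalizes confident wrong predictions; lower is better.
log_loss <- -mean(signed_petition * log(predicted_probability) +
                  (1 - signed_petition) * log(1 - predicted_probability))

# Hypothetical benchmark: keep refining until log loss drops below it.
benchmark <- 0.30
needs_refinement <- log_loss > benchmark
```

Log loss is used alongside accuracy because it rewards well-calibrated probabilities, which matters when the probabilities themselves drive who gets an e-mail.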

Slide 36

Slide 36 text

CASE STUDY - REPRESENT

Slide 40

Slide 40 text

DEMO
• Hospitalization in the United States:
  • ~71 million people were admitted to the hospital last year.
  • Of these 71 million admissions, 11 million were classified as “unnecessary”, resulting in $30bn in unnecessary expenditure.
  • The majority (83%) of these admissions were made by GPs and Managed Care Organizations.
Is there a data-driven approach that GPs can use to assist their diagnoses and decrease false positives?
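The scale of the problem follows from simple arithmetic on the figures above. The variable names below are chosen for illustration only.

```r
admissions_total       <- 71e6   # ~71 million admissions
admissions_unnecessary <- 11e6   # classified as "unnecessary"
cost_unnecessary       <- 30e9   # $30bn in unnecessary expenditure

# Share of all admissions that were classified as unnecessary.
share_unnecessary <- admissions_unnecessary / admissions_total

# Average cost attributed to each unnecessary admission.
cost_per_unnecessary_admission <- cost_unnecessary / admissions_unnecessary
```

That works out to roughly 15% of admissions, at an average of about $2,700 each.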

Slide 41

Slide 41 text

FURTHER RESOURCES
WHERE TO GO FROM HERE

Slide 42

Slide 42 text

FURTHER RESOURCES - BOOKS

Slide 43

Slide 43 text

FURTHER RESOURCES - BOOKS

Slide 44

Slide 44 text

FURTHER RESOURCES - BOOKS

Slide 45

Slide 45 text

FURTHER RESOURCES - KAGGLE

Slide 46

Slide 46 text

Q&A
DATA SCIENCE AND ANALYTICS

Slide 47

Slide 47 text

THANKS!
E-MAIL ME QUESTIONS: THOMSON@CANTAB.NET
OR SEND ME A TWEET: @ITSTHOMSON