
Introduction to Data Science and Analysis

Thomson Nguyen
December 17, 2012


This was a talk I gave at the General Assembly in SF.


Transcript

  1. Thomson Nguyen
    Data Scientist, Causes
    INTRO TO DATA SCIENCE
    AND ANALYSIS


  2. WHAT THIS TALK ISN’T 2
    ‣ A technical introduction to R or its methods
    ‣ Machine learning recipes that data scientists use
    ‣ A surefire method to land a job as a data scientist


  3. WHAT THIS TALK IS 3
    ‣ An overview of what a data-driven company is
    ‣ A look into a workflow of how a data scientist works
    ‣ An example case study of how data science adds value
    ‣ Resources on where to go after this talk


  4. AGENDA
    ‣ The Data Pyramid
    ‣ A Data Science Workflow
    ‣ Case Study: Causes and petition signing
    ‣ Pair programming on a Data Science problem
    ‣ Further resources
    4


  5. INTRO TO DATA SCIENCE AND ANALYTICS
    THE DATA PYRAMID 5
    OR, THE ROAD TO BEING A DATA-DRIVEN COMPANY


  6. THE DATA PYRAMID 6


  7. THE DATA PYRAMID 7
    Defining Key Performance Metrics
    • Find out what makes a company successful
    • Align the entire company's goals and roadmap around these KPIs
    • Examples: daily revenue, DAU/MAU, average revenue per user, time spent on site, site uptime, etc.


  8. THE DATA PYRAMID 8
    Data Warehousing
    • Taking production data from disparate sources and replicating it into a central database
    • Schema is planned and arranged according to the KPIs defined earlier
    • Single source of truth for all analysis and queries (i.e., no one is querying production)


  9. THE DATA PYRAMID 9
    Analytics and BI
    • From a data warehouse, we can create KPI reports for nontechnical stakeholders (marketing, C-levels, board members, etc.)
    • We can also analyze certain metrics and tables and make hypotheses about specific trends in our data.


  10. THE DATA PYRAMID 10
    Data Science
    • From our hypotheses we can create features and variables, and using machine learning we can write algorithms and models to:
    • Explain the past, or
    • Predict the future.
    • These models can be user-facing (product features) or internal.


  11. THE DATA PYRAMID - USER-FACING EXAMPLE 11
    Source: amazon.com


  12. THE DATA PYRAMID - USER-FACING EXAMPLE 12
    Source: amazon.com


  13. THE DATA PYRAMID - USER-FACING EXAMPLE 13
    Source: linkedin.com


  14. THE DATA PYRAMID - USER-FACING EXAMPLE 14
    Source: linkedin.com


  15. INTRO TO DATA SCIENCE AND ANALYTICS
    THE DATA SCIENCE WORKFLOW
    15
    OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE


  16. INTRO TO DATA SCIENCE AND ANALYTICS
    A DATA SCIENCE WORKFLOW
    16
    OR, HOW TO APPROACH PROBLEMS IN DATA SCIENCE


  17. A DATA SCIENCE WORKFLOW 17
    BEN FRY, FATHOM INFORMATION DESIGN
    • Acquire - Where does the data come from?
    • Parse - How do we preprocess the data?
    • Filter - How do we transform the data?
    • Mine - What machine learning algorithms can we use?
    • Represent - How can we visualize our results?
    • Refine - How can we take feedback to make it more accurate?
    • Interact - How will this make the end-user’s life better?
    Source: CMU


  18. A DATA SCIENCE WORKFLOW 18
    BEN FRY, FATHOM INFORMATION DESIGN
    Source: CMU


  19. A DATA SCIENCE WORKFLOW 19
    BEN FRY, FATHOM INFORMATION DESIGN
    Source: CMU


  20. A DATA SCIENCE WORKFLOW 20
    BEN FRY, FATHOM INFORMATION DESIGN
    Source: CMU


  21. INTRO TO DATA SCIENCE AND ANALYTICS
    CASE STUDY
    21
    PREDICTING THE PROBABILITY OF SIGNING A PETITION


  22. CASE STUDY - CAUSES 22
    KEY CHALLENGE / QUESTION
    Can we create a model that predicts the probability that a user on our website will sign a petition? Can we use this model to predict the probability for future actions (petitions and pledges)?
    SUMMARY
    We want to use our massive amounts of data to create a probability that someone will sign a petition or take a pledge on causes.com. We'll be looking at a specific petition on Causes as our target petition. Our dataset will consist of users who have either signed the petition or have not signed it, along with their features (age, gender, number of actions taken, etc.).


  23. CASE STUDY - CAUSES 23


  24. CASE STUDY - ACQUIRE 24
    (diagram: Signatures, Users, Locations, and FB Likes feeding into SQL databases)
    ‣ Our user data and FB likes are gathered from FBConnect, which is required for all Causes users
    ‣ Location data is gathered from a user's IP address while they are on our site.
    ‣ When a user signs the petition, we record it in a database consisting of action_credits.


  25. CASE STUDY - PARSE 25
    (diagram: SQL databases → ETL scripts → Analytics DB / Data Warehouse)
    ‣ Our production data is stored in horizontally sharded MySQL databases
    ‣ We extract the relevant data from these individual production databases,
    ‣ then transform the data to fit a certain schema,
    ‣ and finally we load the data into a central analytics DB.
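The extract-transform-load loop on this slide can be sketched in a few lines. This is only a minimal illustration of the pattern, using SQLite as a stand-in for the MySQL shards and a hypothetical `action_credits` table; it is not Causes' actual pipeline:

```python
# ETL sketch: extract rows from each production shard, transform them to a
# common schema, and load them into one central analytics database.
# Paths, table names, and columns are illustrative.
import sqlite3

def etl(shard_paths, analytics_path):
    analytics = sqlite3.connect(analytics_path)
    analytics.execute(
        "CREATE TABLE IF NOT EXISTS signatures (user_id INTEGER, action_id INTEGER)"
    )
    for path in shard_paths:
        shard = sqlite3.connect(path)
        # Extract: pull the relevant rows from this production shard.
        rows = shard.execute(
            "SELECT user_id, action_id FROM action_credits"
        ).fetchall()
        shard.close()
        # Transform + load: here the schemas already match, so we just
        # insert the rows into the central analytics table.
        analytics.executemany("INSERT INTO signatures VALUES (?, ?)", rows)
    analytics.commit()
    return analytics
```

In a real warehouse the "transform" step would reshape each shard's rows to the schema planned around the KPIs, but the extract/transform/load structure is the same.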


  26. CASE STUDY - FILTER 26
    PRIMER ON THE NORMAL DISTRIBUTION
    ‣ A probability distribution that plots all of its values in a symmetrical fashion
    ‣ Most of the results are situated around the distribution's mean (the center)
    ‣ Values are equally likely to plot above or below the mean.
    ‣ Values cluster close to the mean and tail off symmetrically away from it.


  27. CASE STUDY - FILTER 27
    ‣ We'll want to normalize certain features (that is, transform variables so they approximate a normal distribution)
    ‣ The top two plots show the age distribution of Causes users and its Q-Q plot
    ‣ Taking log(age) got us closer to a normal distribution (bottom two plots).
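The log-transform step above can be illustrated with synthetic right-skewed "age"-like data. This sketch uses sample skewness as a rough stand-in for eyeballing the Q-Q plots on the slide; the data and threshold are made up for the example:

```python
# Log-transforming a right-skewed variable pulls it toward normality.
# Synthetic ages drawn from a lognormal, so log(age) is exactly normal.
import math
import random

random.seed(0)
ages = [random.lognormvariate(3.3, 0.4) for _ in range(1000)]

def skewness(xs):
    """Sample skewness: third standardized moment (0 for a normal)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)

raw_skew = skewness(ages)                        # noticeably positive
log_skew = skewness([math.log(a) for a in ages])  # close to zero
```

A Q-Q plot of `log(ages)` against normal quantiles would likewise fall much closer to the diagonal than a Q-Q plot of the raw ages.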


  28. CASE STUDY - MINE 28
    id user_id target_user_id action_id age is_female is_vet is_south created_at signed_petition
    1 13215827 NULL 5390823 19 0 1 1 28-nov-2012 1
    2 1614524 120583933 5390823 25 1 1 0 28-nov-2012 1
    3 11058392 NULL 6729301 29 1 0 0 28-nov-2012 0
    4 9529371 168920339 8950132 28 0 0 1 28-nov-2012 0
    5 10385293 NULL 5390823 24 1 1 1 28-nov-2012 1
    ‣ This is a subset of the total number of columns we have in our dataset.
    ‣ The actual dataset has approximately 110 columns, each representing either a demographic feature (age, gender, location) or a psychographic/behavioral feature (is a veteran, number of petitions signed on Causes, number of related FB likes)


  29. CASE STUDY - MINE 29
    id user_id target_user_id action_id age is_female is_vet is_south created_at signed_petition
    1 13215827 NULL 5390823 19 0 1 1 28-nov-2012 1
    2 1614524 120583933 5390823 25 1 1 0 28-nov-2012 1
    3 11058392 NULL 6729301 29 1 0 0 28-nov-2012 0
    4 9529371 168920339 8950132 28 0 0 1 28-nov-2012 0
    5 10385293 NULL 5390823 24 1 1 1 28-nov-2012 1
    ‣ This is a subset of the total number of columns we have in our dataset.
    ‣ The actual dataset has approximately 110 columns, each representing either a demographic feature (age, gender, location) or a psychographic/behavioral feature (is a veteran, number of petitions signed on Causes, number of related FB likes). For this example, we'll only use the features above.
    ‣ Our response variable (the feature we're predicting) is whether or not the user signed the actual petition.


  30. CASE STUDY - MINE 30
    We'll be using a logistic regression to calculate this probability. In R:
    action_data = read.csv("actions.csv")
    logit_model = glm(signed_petition ~
        age + is_female + is_vet + is_south,
      data = action_data, family = "binomial")


  31. CASE STUDY - REPRESENT 31
    The condensed output in R looks like this:
    ## Call:
    ## glm(signed_petition ~ age + is_female + is_vet + is_south,
    ## data = action_data, family = "binomial")
    ## Coefficients:
    ## Estimate Std. Error z value Pr(>|z|)
    ## (Intercept) 0.01998 1.13995 -3.50 0.00047 ***
    ## age 0.00226 0.00109 2.07 0.03847 *
    ## is_female -0.10404 0.33182 2.42 0.01539 **
    ## is_vet 0.67544 0.31649 -2.13 0.03283 *
    ## is_south 0.34020 0.34531 -3.88 0.00010 ***
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##


  32. CASE STUDY - REPRESENT 32
    Here's a distribution of the regression, drilled down by age (plot spans ages 20 to 80).


  33. CASE STUDY - INTERACT 33
    id user_id signed_petition predicted_probability
    1 13215827 1 0.852
    2 1614524 1 0.931
    3 11058392 0 0.394
    4 9529371 0 0.485
    5 10385293 1 0.729
    ‣ Our result is a predicted probability associated with each user_id in our dataset
    ‣ We can use these probabilities to try to create a user-facing product that will result in more signatures and more site activity.
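The scoring step that produces a per-user probability can be sketched as applying the fitted model's logistic link to each row of features. The coefficients and users below are illustrative placeholders, not the fitted values from the R output above:

```python
# Score each user with a fitted logistic regression:
# probability = 1 / (1 + exp(-(intercept + sum(coef_i * feature_i)))).
# Coefficients here are made-up stand-ins for the fitted model.
import math

coef = {"intercept": -1.0, "age": 0.02, "is_female": 0.3,
        "is_vet": 0.7, "is_south": 0.3}

def sign_probability(user):
    z = coef["intercept"] + sum(
        coef[k] * user[k] for k in ("age", "is_female", "is_vet", "is_south")
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link maps z to (0, 1)

users = [
    {"user_id": 13215827, "age": 19, "is_female": 0, "is_vet": 1, "is_south": 1},
    {"user_id": 11058392, "age": 29, "is_female": 1, "is_vet": 0, "is_south": 0},
]
scores = {u["user_id"]: sign_probability(u) for u in users}
```

In R the same step is just `predict(logit_model, newdata, type = "response")`; the point is that every user gets a number in (0, 1) that downstream products can rank on.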


  34. CASE STUDY - INTERACT 34
    People with high probabilities will receive military-themed actions in their inbox:


  35. CASE STUDY - REFINE 35
    Based on the results of these e-mail blasts, we can refine and improve our logistic regression as necessary.
    And the cycle continues until we've achieved some set benchmark for success.


  36. CASE STUDY - REPRESENT 36


  37. CASE STUDY - REPRESENT 36


  38. CASE STUDY - REPRESENT 36


  39. DEMO 37
    • Hospitalization in the United States:
    • ~71 million people were admitted to the hospital last year
    • Of these 71 million people, 11 million admissions were classified as "unnecessary," resulting in $30bn in unnecessary expenditure.
    • The majority (83%) of these admissions were made by GPs and Managed Care Organizations.


  40. DEMO 38
    • Hospitalization in the United States:
    • ~71 million people were admitted to the hospital last year
    • Of these 71 million people, 11 million admissions were classified as "unnecessary," resulting in $30bn in unnecessary expenditure.
    • The majority (83%) of these admissions were made by GPs and Managed Care Organizations.
    Is there a data-driven approach that GPs can use to assist their diagnoses and decrease false positives?


  41. INTRO TO DATA SCIENCE AND ANALYTICS
    FURTHER RESOURCES
    39
    WHERE TO GO FROM HERE


  42. FURTHER RESOURCES - BOOKS 40


  43. FURTHER RESOURCES - BOOKS 41


  44. FURTHER RESOURCES - BOOKS 42


  45. FURTHER RESOURCES - KAGGLE 43


  46. Q&A
    DATA SCIENCE AND ANALYTICS 44


  47. INTRO TO DATA SCIENCE AND ANALYTICS
    THANKS!
    45
    E-MAIL ME QUESTIONS: [email protected]
    OR SEND ME A TWEET: @ITSTHOMSON
