Slide 1

Slide 1 text

Exploratory Seminar: Multiple Linear Regression

Slide 2

Slide 2 text

EXPLORATORY

Slide 3

Slide 3 text

Speaker: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)
Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Slide 4

Slide 4 text

Mission Make Data Science Available for Everyone

Slide 5

Slide 5 text

The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

Slide 6

Slide 6 text

Democratization of Data Science
First Wave (1976): Theme: Monetization. Users: Statisticians.
Second Wave (2000): Theme: Commoditization. Users: Data Scientists.
Third Wave (2016): Theme: Democratization. Users: Business Users.
Other rows on the slide: Algorithms, Experience, Tools (values: Proprietary, Open Source, UI & Programming, Programming, Open Source, UI & Automation).

Slide 7

Slide 7 text

Exploratory Data Analysis: Questions; Communication (Dashboard, Note, Slides); Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning)

Slide 8

Slide 8 text

Questions; Communication (Dashboard, Note, Slides); Data Access; Data Wrangling; Visualization; Analytics (Statistics / Machine Learning)

Slide 9

Slide 9 text

Exploratory Seminar: Multiple Linear Regression

Slide 10

Slide 10 text

Linear Regression: An old and basic regression algorithm, but because of its simplicity it is still one of the most commonly used statistical (or machine) learning algorithms.

Slide 11

Slide 11 text

Data

Slide 12

Slide 12 text

Employee Data

Slide 13

Slide 13 text

Monthly Income

Slide 14

Slide 14 text

Linear Regression Basics

Slide 15

Slide 15 text

[Scatter plot: Monthly Income (y-axis, 5000 to 25000) vs. Working Years (x-axis, 0 to 40)]

Slide 16

Slide 16 text

We want to find a simple pattern that can explain both the data we have and the data we don't have at hand.

Slide 17

Slide 17 text

[Scatter plot: Salary (y-axis, ticks 500 to 2000) vs. Working Years (x-axis, 0 to 40)]

Slide 18

Slide 18 text

Draw a line that minimizes the distance between the actual values and the line.
[Plot: Monthly Income (5000 to 25000) vs. Working Years (0 to 40)]
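As a minimal sketch of this idea, the Python snippet below fits a line by ordinary least squares, which minimizes the sum of squared vertical distances between the actual values and the line. The data points and the helper name `sum_squared_distance` are made up for illustration and are not taken from the employee data.

```python
import numpy as np

# Hypothetical data (made up for illustration): working years and monthly income.
working_years = np.array([0, 5, 10, 20, 30, 40], dtype=float)
monthly_income = np.array([5200, 7300, 10100, 14800, 20300, 24900], dtype=float)

# np.polyfit with deg=1 fits a straight line by least squares,
# i.e. it minimizes the sum of squared vertical distances to the line.
slope, intercept = np.polyfit(working_years, monthly_income, deg=1)


def sum_squared_distance(a, b):
    """Sum of squared vertical distances between the data and the line y = a*x + b."""
    return float(np.sum((monthly_income - (a * working_years + b)) ** 2))


best = sum_squared_distance(slope, intercept)
print(f"Monthly Income = {slope:.1f} * Working Years + {intercept:.1f}")
```

Any other slope or intercept gives a larger value of `sum_squared_distance` on the same data.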

Slide 19

Slide 19 text

Monthly Income = 500 * Working Years + 5000
[Plot: Monthly Income (5000 to 25000) vs. Working Years (0 to 40) with the fitted line]

Slide 20

Slide 20 text

Slope: 500
Monthly Income = 500 * Working Years + 5000

Slide 21

Slide 21 text

Y Intercept: 5000
Monthly Income = 500 * Working Years + 5000

Slide 22

Slide 22 text

The Linear Regression algorithm finds these parameters based on the given data and builds a model.
Model: Monthly Income = 500 * Working Years + 5000

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Wait…

Slide 26

Slide 26 text

• Other variables are correlated (e.g. Age vs. Working Years).
• If one variable changes, another variable would also change at the same time.
• How can we know the independent effect that comes only from Working Years?

Slide 27

Slide 27 text

Monthly Income = 500 * Working Years + 5000
[Plot: Monthly Income vs. Working Years]

Slide 28

Slide 28 text

Are there any variables that are correlated to Working Years?

Slide 29

Slide 29 text

Working Years and Job Level are correlated.
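One way to check such a relationship is the Pearson correlation coefficient. The sketch below uses made-up values for Working Years and Job Level, not the seminar's employee data.

```python
import numpy as np

# Hypothetical values (made up): job level tends to rise with working years.
working_years = np.array([1, 3, 5, 8, 10, 15, 20, 25, 30, 35], dtype=float)
job_level = np.array([1, 1, 1, 2, 2, 3, 3, 4, 4, 5], dtype=float)

# Pearson correlation coefficient: values near +1 mean a strong positive linear relationship.
correlation = np.corrcoef(working_years, job_level)[0, 1]
print(f"Correlation between Working Years and Job Level: {correlation:.2f}")
```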

Slide 30

Slide 30 text

[Diagram: Working Years, Job Level, and Monthly Income. Working Years and Job Level are marked as correlated.]

Slide 31

Slide 31 text

If Working Years increases, Job Level increases, too.

Slide 32

Slide 32 text

Maybe Job Level is the one having an effect on Monthly Income?

Slide 33

Slide 33 text

Or is Working Years the one having an effect on Monthly Income?

Slide 34

Slide 34 text

Or are both Job Level and Working Years having an effect on Monthly Income?

Slide 35

Slide 35 text

Let's investigate! What effect does 1 additional year of Working Years have on Monthly Income?

Slide 36

Slide 36 text

Compare people with 10 years and people with 11 years of Working Years.

Slide 37

Slide 37 text

Compare the averages of the two groups.
10 Years: Avg 8,000
11 Years: Avg 10,000

Slide 38

Slide 38 text

Here’s a problem…

Slide 39

Slide 39 text

But people in the two groups have various Job Levels (Job Level: 1, Job Level: 2, Job Level: 3).

Slide 40

Slide 40 text

10 Years: Job Level 1, 2, 3
11 Years: Job Level 1, 2, 3

Slide 41

Slide 41 text

Is the difference really coming from Working Years?
10 Years: Avg 8,000
11 Years: Avg 10,000

Slide 42

Slide 42 text

Or maybe it's because of the difference in Job Level?
10 Years: Avg 8,000
11 Years: Avg 10,000

Slide 43

Slide 43 text

How can we see the effect of Working Years alone?

Slide 44

Slide 44 text

Compare people with 10 years and people with 11 years, but with the same Job Level.
10 Years: Job Level 1
11 Years: Job Level 1

Slide 45

Slide 45 text

Compare the average Monthly Incomes of the two groups.
10 Years: Avg 8,000
11 Years: Avg 8,500

Slide 46

Slide 46 text

This difference should be coming from the difference in Working Years, NOT from Job Level.
10 Years: Avg 8,000
11 Years: Avg 8,500
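The comparison can be sketched in a few lines of Python. The employee records and the helper `average_income` below are hypothetical, chosen only so that the raw difference between the two groups is larger than the difference within a single Job Level.

```python
from statistics import mean

# Hypothetical employees: (working_years, job_level, monthly_income). Values are made up.
employees = [
    (10, 1, 6000), (10, 1, 6000), (10, 2, 9000),
    (11, 1, 6500), (11, 2, 9500), (11, 2, 9500),
]


def average_income(years, level=None):
    """Average income for a Working Years group, optionally within one Job Level."""
    return mean(
        income
        for y, job_level, income in employees
        if y == years and (level is None or job_level == level)
    )


# Raw comparison: the two groups also differ in their mix of Job Levels.
raw_difference = average_income(11) - average_income(10)

# Same-level comparison: Job Level is held constant at 1.
same_level_difference = average_income(11, level=1) - average_income(10, level=1)

print(raw_difference, same_level_difference)
```

In this made-up data, part of the raw gap comes from the 11-year group having more people at Job Level 2; holding Job Level constant removes that part.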

Slide 47

Slide 47 text

In order to see one variable's independent effect on Monthly Income…

Slide 48

Slide 48 text

Change only one variable, but hold the other variables constant.
Job Level: 1 -> 2
Working Years: 10 -> 10 (Constant)
Monthly Income: Effect?

Slide 49

Slide 49 text

Change only one variable, but hold the other variables constant.
Working Years: 10 -> 11
Job Level: 1 -> 1 (Constant)
Monthly Income: Effect?

Slide 50

Slide 50 text

Here comes the Multiple Linear Regression!

Slide 51

Slide 51 text

• Interpretation of Multiple Linear Regression
• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models

Slide 52

Slide 52 text

Revisit: Interpretation of Coefficient

Slide 53

Slide 53 text

Simple Linear Regression: y = a * x + b
A one-point increase in x is expected to change y by a.

Slide 54

Slide 54 text

Simple Linear Regression: Monthly Income = 500 * Working Years + 5000
A one-year increase in Working Years is expected to increase Monthly Income by $500.

Slide 55

Slide 55 text

Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b
A one-point increase in x1 is expected to change y by a1 when all other variables stay the same.
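The reasoning behind this reading can be written out as a short derivation. It uses the same symbols as the formula above; the labels "before" and "after" are added here only for illustration.

```latex
\begin{align*}
y_{\text{before}} &= a_1 x_1 + a_2 x_2 + b \\
y_{\text{after}}  &= a_1 (x_1 + 1) + a_2 x_2 + b \\
y_{\text{after}} - y_{\text{before}} &= a_1
\end{align*}
```

Because x2 is the same in both lines, its term cancels and only a1 remains.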

Slide 56

Slide 56 text

Multiple Linear Regression: Monthly Income = 500 * Working Years + 600 * Job Level + 5000
A one-year increase in Working Years is expected to increase Monthly Income by $500 if Job Level stays the same.

Slide 57

Slide 57 text

Let’s unpack it…

Slide 58

Slide 58 text

If you work for just 1 year… (Working Years: 1, Job Level: 1)
Monthly Income = 500 * Working Years + 600 * Job Level + 5000
6100 = 500 * 1 + 600 * 1 + 5000

Slide 59

Slide 59 text

If you work for just 2 years but stay at the same job level… (Working Years: 2, Job Level: 1)
Monthly Income = 500 * Working Years + 600 * Job Level + 5000
6600 = 500 * 2 + 600 * 1 + 5000

Slide 60

Slide 60 text

Working Years: 1 (1 year): 6100 = 500 * 1 + 600 * 1 + 5000
Working Years: 2 (2 years): 6600 = 500 * 2 + 600 * 1 + 5000
6600 - 6100 = $500 increase!

Slide 61

Slide 61 text

Working Years: 1 (1 year): 6100 = 500 * 1 + 600 * 1 + 5000
Working Years: 2 (2 years): 6600 = 500 * 2 + 600 * 1 + 5000
6600 - 6100 = $500 increase!
This difference is coming from the Working Years term (500 * 2 vs. 500 * 1).
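The same arithmetic as a small Python sketch. The function name `monthly_income` is chosen here for illustration; the coefficients are the example values from the slides.

```python
def monthly_income(working_years, job_level):
    """Example model from the slides: 500 * Working Years + 600 * Job Level + 5000."""
    return 500 * working_years + 600 * job_level + 5000


income_1_year = monthly_income(working_years=1, job_level=1)
income_2_years = monthly_income(working_years=2, job_level=1)

# Job Level is held at 1 in both calls, so the difference is the Working Years coefficient.
increase = income_2_years - income_1_year
print(income_1_year, income_2_years, increase)
```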

Slide 62

Slide 62 text

1 Year: 6,100
2 Years: 6,600

Slide 63

Slide 63 text

1 Year: 6,100
2 Years: 6,600
Monthly Income = 500 * Working Years + 600 * Job Level + 5000

Slide 64

Slide 64 text

Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b
A one-point increase in x1 is expected to change y by a1 when all other variables stay the same.

Slide 65

Slide 65 text

Working Years & Job Levels

Slide 66

Slide 66 text

Job Level

Slide 67

Slide 67 text

Working Years

Slide 68

Slide 68 text

Working Years + Job Level

Slide 69

Slide 69 text

Monthly Income = 46 * Working Years + 3788 * Job Level + 5000

Slide 70

Slide 70 text

Monthly Income = 46 * Working Years + 3788 * Job Level + 5000
A one-year increase in Working Years is expected to increase Monthly Income by $46 if Job Level stays the same.

Slide 71

Slide 71 text

Monthly Income = 46 * Working Years + 3788 * Job Level + 5000
A one-level increase in Job Level is expected to increase Monthly Income by $3788 if Working Years stays the same.
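The following sketch shows why the Working Years coefficient can shrink once Job Level is added to the model. The data is synthetic: it is generated from the formula above plus random noise, with Job Level made to rise with Working Years. The generating process and the helper `fit` are assumptions for illustration, not the seminar's employee data or Exploratory's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (an assumption for illustration): Job Level rises with Working Years.
working_years = rng.uniform(0, 40, size=n)
job_level = np.clip(np.round(1 + working_years / 10 + rng.normal(0, 0.7, size=n)), 1, 5)
# Income is generated from the slide's formula plus random noise.
monthly_income = 46 * working_years + 3788 * job_level + 5000 + rng.normal(0, 300, size=n)


def fit(predictors, y):
    """Least-squares fit; returns the coefficients followed by the intercept."""
    X = np.column_stack(predictors + [np.ones(len(y))])
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients


simple = fit([working_years], monthly_income)
multiple = fit([working_years, job_level], monthly_income)

print("Working Years only:        slope =", round(simple[0], 1))
print("Working Years + Job Level: slopes =", np.round(multiple[:2], 1))
```

With Working Years alone, the slope also absorbs the effect of the correlated Job Level. With both variables in the model, each coefficient is estimated with the other held constant.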

Slide 72

Slide 72 text

Both Working Years and Job Level have effects on Monthly Income.

Slide 73

Slide 73 text

• Interpretation of Multiple Linear Regression
• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models

Slide 74

Slide 74 text

How about including all the variables?

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

Which variables are more important than the others?

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

A higher coefficient doesn't mean the variable is more important.

Slide 79

Slide 79 text

Because the units are different.

Slide 80

Slide 80 text

One unit in Year: 1 Year
One unit in Job Level: 1 Level
One unit in Job Role: Sales Executive -> Sales Rep

Slide 81

Slide 81 text

Still, we want to compare which variables have more effect!

Slide 82

Slide 82 text

Which variables are more important?
• Standardize the variables
• Relative Importance with R Squared

Slide 83

Slide 83 text

Which variables are more important?
• Standardize the variables
• Relative Importance with R Squared
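A minimal sketch of the first option, standardizing the predictors so that one unit means one standard deviation for every variable. The data is synthetic and the helpers `standardize` and `coefficients` are illustrative names, not Exploratory functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic predictors on very different scales (made up for illustration).
working_years = rng.uniform(0, 40, size=n)             # spans tens of years
job_level = rng.integers(1, 6, size=n).astype(float)   # takes values 1..5
monthly_income = 150 * working_years + 1000 * job_level + rng.normal(0, 200, size=n)


def standardize(x):
    """Rescale to mean 0 and standard deviation 1 (z-score)."""
    return (x - x.mean()) / x.std()


def coefficients(predictors, y):
    """Least-squares coefficients of the predictors (intercept dropped)."""
    X = np.column_stack(predictors + [np.ones(len(y))])
    return np.linalg.lstsq(X, y, rcond=None)[0][:-1]


raw = coefficients([working_years, job_level], monthly_income)
standardized = coefficients(
    [standardize(working_years), standardize(job_level)], monthly_income
)

print("Raw coefficients:         ", np.round(raw, 1))
print("Standardized coefficients:", np.round(standardized, 1))
```

In this synthetic data the raw coefficient of Job Level is the larger one, but after standardizing, Working Years has the larger coefficient because it varies over a much wider range.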

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

No content

Slide 87

Slide 87 text

But, it might not be appropriate…
• The variance might vary among the variables.
• The underlying distributions vary among the variables.
• It is harder to interpret when Categorical variables are in the mix.

Slide 88

Slide 88 text

Job Role (Categorical)

Slide 89

Slide 89 text

Which variables are more important?
• Standardize the variables
• Relative Importance with R Squared

Slide 90

Slide 90 text

Relative Importance shows which variables are more important based on their contribution to R Squared.

Slide 91

Slide 91 text

R Squared? Let’s revisit!

Slide 92

Slide 92 text

The part between the prediction (Model) and the dot (Actual) is not explained by the model. The part between the prediction (Model) and the Mean is explained by the model.
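In code, R Squared is commonly computed as one minus the ratio of the unexplained variation to the total variation around the mean. The actual and predicted values below are made up for illustration.

```python
import numpy as np

# Hypothetical actual values and model predictions (made up for illustration).
actual = np.array([6000, 8000, 9000, 12000, 15000], dtype=float)
predicted = np.array([6500, 7500, 9500, 11500, 15000], dtype=float)

mean = actual.mean()

# Unexplained part: squared distances between the predictions and the actual dots.
unexplained = np.sum((actual - predicted) ** 2)
# Total variation: squared distances between the mean and the actual dots.
total = np.sum((actual - mean) ** 2)

r_squared = 1 - unexplained / total
print(f"R Squared = {r_squared:.2f}")
```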

Slide 93

Slide 93 text

[Plot: Monthly Income (5000 to 25000) vs. Working Years (0 to 40)]

Slide 94

Slide 94 text

[Plot: Monthly Income (5000 to 25000) vs. Working Years (0 to 40) with the Mean (Average) line; scale from 0% to 100%, with 60% marked]

Slide 95

Slide 95 text

Various Methods to Calculate Importance
• First Variable
• Last Variable
• Lindeman, Merenda, and Gold

Slide 96

Slide 96 text

First Variable Method: How much is R Squared for each variable?
Model    R Squared
A        0.8
B        0.2
C        0.1

Slide 97

Slide 97 text

Last Variable Method: How much does a variable contribute?
Baseline Model    Model Without    Contribution
A + B + C         B + C            0.9 - 0.1 = 0.8
A + B + C         A + C            0.9 - 0.7 = 0.2
A + B + C         A + B            0.9 - 0.8 = 0.1
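The First Variable and Last Variable methods can be sketched as follows. The predictors A, B, C and the target are synthetic, and `r_squared` is a helper written for this example, so the numbers will not match the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Synthetic predictors A, B, C and a target (made up for illustration).
predictors = {name: rng.normal(size=n) for name in "ABC"}
y = 3 * predictors["A"] + 2 * predictors["B"] + 0.5 * predictors["C"] + rng.normal(size=n)


def r_squared(names):
    """R Squared of a least-squares model that uses the named predictors."""
    X = np.column_stack([predictors[k] for k in names] + [np.ones(n)])
    residuals = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)


all_names = list(predictors)

# First Variable: R Squared of a model that contains only that variable.
first = {k: r_squared([k]) for k in all_names}

# Last Variable: drop in R Squared when that variable is removed from the full model.
full = r_squared(all_names)
last = {k: full - r_squared([m for m in all_names if m != k]) for k in all_names}

print("First Variable:", {k: round(v, 2) for k, v in first.items()})
print("Last Variable: ", {k: round(v, 2) for k, v in last.items()})
```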

Slide 98

Slide 98 text

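Lindeman, Merenda, and Gold Method: How much does a variable increase R Squared?
Importance for A: compare the R Squared of each model without A against the same model with A.
Without A    With A
(none)       A
B            B + A
C            C + A
B + C        B + C + A
R Squared increases shown on the slide: 0.8, 0.7, 0.75, 0.75. Average: 0.75.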
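A sketch of this averaging idea is shown below. It uses the usual formulation of the Lindeman, Merenda, and Gold method, which averages each variable's R Squared increase over every order in which the variables can be added; this may weight the comparisons differently from the simple average on the slide. The data and the helper `r_squared` are made up for illustration.

```python
from itertools import permutations

import numpy as np

rng = np.random.default_rng(3)
n = 300

# Synthetic predictors (made up for illustration); B is correlated with A.
A = rng.normal(size=n)
B = 0.6 * A + rng.normal(size=n)
C = rng.normal(size=n)
predictors = {"A": A, "B": B, "C": C}
y = 2 * A + 1 * B + 0.5 * C + rng.normal(size=n)


def r_squared(names):
    """R Squared of a least-squares model with the named predictors (0 for none)."""
    X = np.column_stack([predictors[k] for k in names] + [np.ones(n)])
    residuals = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)


names = list(predictors)
orderings = list(permutations(names))
importance = {k: 0.0 for k in names}

# For every ordering, add the variables one at a time and credit each variable
# with the R Squared increase it brings; then average over all orderings.
for ordering in orderings:
    included = []
    for k in ordering:
        before = r_squared(included)
        included.append(k)
        importance[k] += (r_squared(included) - before) / len(orderings)

print({k: round(v, 3) for k, v in importance.items()})
```

A useful property of this formulation is that the importance values add up to the R Squared of the full model.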

Slide 99

Slide 99 text

• Interpretation of Multiple Linear Regression
• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models

Slide 100

Slide 100 text

All variables

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

Build a model only with Job Level, Job Role, Working Years, & Age.

Slide 103

Slide 103 text

All variables vs. only 4 variables: R Squared decreased, but this is expected.

Slide 104

Slide 104 text

All variables vs. only 4 variables: Adjusted R Squared increased!

Slide 105

Slide 105 text

R-Squared vs. Adjusted R-Squared

Slide 106

Slide 106 text

R Squared
• The value of R Squared never decreases as more predictors are added, regardless of whether the added predictor improves the model's predictive power.
• It tends to give the wrong impression that the model is getting better, since the value goes up (or at least stays the same) whenever a new predictor is added.

Slide 107

Slide 107 text

Adjusted R Squared
• Adjusted R Squared increases only when an added predictor actually helps improve the model's quality in explainability or prediction.
• It stays the same, or even decreases, when variables that are not helpful are added as predictors.
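The sketch below uses the standard adjusted R Squared formula, 1 - (1 - R Squared) * (n - 1) / (n - p - 1), where n is the number of rows and p the number of predictors. The data is synthetic, and the 20 extra predictors are pure noise added only to show how the two measures react.

```python
import numpy as np


def adjusted_r_squared(r_squared, n_rows, n_predictors):
    """Adjusted R Squared: R Squared penalized for the number of predictors."""
    return 1 - (1 - r_squared) * (n_rows - 1) / (n_rows - n_predictors - 1)


rng = np.random.default_rng(4)
n = 200

# Synthetic data (made up): one useful predictor and 20 columns unrelated to y.
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
noise_predictors = rng.normal(size=(n, 20))


def r_squared(X):
    """R Squared of a least-squares model on the given predictor columns."""
    X = np.column_stack([X, np.ones(n)])
    residuals = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)


r2_small = r_squared(x)
r2_large = r_squared(np.column_stack([x, noise_predictors]))

print("R Squared:         ", round(r2_small, 4), "->", round(r2_large, 4))
print(
    "Adjusted R Squared:",
    round(adjusted_r_squared(r2_small, n, 1), 4),
    "->",
    round(adjusted_r_squared(r2_large, n, 21), 4),
)
```

R Squared cannot go down when the noise columns are added. Adjusted R Squared applies a penalty for each extra predictor, so it typically does not reward them.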

Slide 108

Slide 108 text

• Interpretation of Multiple Linear Regression
• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models

Slide 109

Slide 109 text

Only Job Level, Job Role, Total Working Years, Age

Slide 110

Slide 110 text

• Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
• Are those effects all significant for all the Job Roles?
• For which Job Roles does the model explain the variance of Monthly Income better?

Slide 111

Slide 111 text

Create Multiple Models!

Slide 112

Slide 112 text

with Repeat By!

Slide 113

Slide 113 text

Repeat By

Slide 114

Slide 114 text

Build a Model: Data -> Model

Slide 115

Slide 115 text

Build Multiple Models with Repeat By: Data -> Repeat By -> one Data subset and one Model per group

Slide 116

Slide 116 text

Repeat by Job Roles: Data -> Repeat By -> one Data subset and one Model per Job Role (HR, Research Director, Sales Rep)
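Repeat By is a feature of the Exploratory UI. As a rough Python analogue, and not Exploratory's implementation, the sketch below splits made-up rows by Job Role and fits one simple linear model per group.

```python
from collections import defaultdict

import numpy as np

# Hypothetical rows: (job_role, job_level, monthly_income). Values are made up.
rows = [
    ("HR", 1, 3000), ("HR", 2, 6000), ("HR", 3, 9000),
    ("Sales Rep", 1, 2500), ("Sales Rep", 2, 4000), ("Sales Rep", 3, 5500),
]

# Split the data by Job Role, in the spirit of "Repeat By".
groups = defaultdict(list)
for role, level, income in rows:
    groups[role].append((level, income))

# Fit one model per group: Monthly Income = slope * Job Level + intercept.
models = {}
for role, data in groups.items():
    levels, incomes = np.array(data, dtype=float).T
    slope, intercept = np.polyfit(levels, incomes, deg=1)
    models[role] = (slope, intercept)

for role, (slope, intercept) in models.items():
    print(f"{role}: Monthly Income = {slope:.0f} * Job Level + {intercept:.0f}")
```

Each group gets its own coefficients, so the effect of Job Level can be compared across Job Roles.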

Slide 117

Slide 117 text

• Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
• Are those effects all significant for all the Job Roles?
• For which Job Roles does the model explain the variance of Monthly Income better?

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

A one-level increase in Job Level is associated with an increase of about $3,000 in Monthly Income for some job roles (e.g. Healthcare Rep, HR, Mfg. Director, etc.).

Slide 120

Slide 120 text

A one-level increase in Job Level is associated with an increase of less than $2,000 in Monthly Income for other job roles (e.g. Lab Technician, Sales Rep).

Slide 121

Slide 121 text

Build models with the standardized variables.

Slide 122

Slide 122 text

• Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
• Are those effects all significant for all the Job Roles?
• For which Job Roles does the model explain the variance of Monthly Income better?

Slide 123

Slide 123 text

There is not enough evidence that Working Years would increase Monthly Income for some job roles like HR, Lab Technician, and Research Director.

Slide 124

Slide 124 text

• Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
• Are those effects all significant for all the Job Roles?
• For which Job Roles does the model explain the variance of Monthly Income better?

Slide 125

Slide 125 text

Monthly Salary for job roles like Research Director, HR, and Manager can be explained by this model very well.

Slide 126

Slide 126 text

But Monthly Salary for other job roles like Sales Rep and Lab Technician cannot be explained by this model very well.

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

Q & A

Slide 129

Slide 129 text

Contact
Email: [email protected]
Home Page: https://exploratory.io
Twitter: @KanAugust
Online Seminar: https://exploratory.io/online-seminar