Kan Nishida
August 21, 2019

# Exploratory: Linear Regression Part 2 - Multiple Regression & Variable Importance

This is a follow-up to the previous session, “Introduction to Linear Regression Part 1 - Basic”.

In this session, Kan will introduce more advanced topics such as Multiple Regression, Collinearity, and Variable Importance. He will also demonstrate how you can build multiple Linear Regression models for multiple groups and how you can use this technique to take your analysis one step deeper.

## Transcript

3. ### Speaker: Kan Nishida, co-founder/CEO, Exploratory

Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust

5. ### The Third Wave

Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
6. ### Democratization of Data Science

| | First Wave | Second Wave | Third Wave |
|---|---|---|---|
| Year | 1976 | 2000 | 2016 |
| Algorithms | Proprietary | Open Source | Open Source |
| Experience / Tools | UI & Programming | Programming | UI & Automation |
| Theme | Monetization | Commoditization | Democratization |
| Users | Statisticians | Data Scientists | Business Users |
7. ### Exploratory Data Analysis

Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)
8. ### Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)

10. ### Linear Regression

An old and basic regression algorithm, but due to its simplicity it is still one of the most commonly used Statistical (or Machine) Learning algorithms.

15. ### [Scatter plot: Monthly Income (5000 to 25000) vs. Working Years (0 to 40)]
16. ### We want to find a simple pattern that can explain both the given data and the data we don’t have at hand.

18. ### Draw a line that makes the distance between the actual values and the line minimal.
19. ### Monthly Income = 500 * Working Years + 5000
20. ### Slope: 500

Monthly Income = 500 * Working Years + 5000
21. ### Y Intercept: 5000

Monthly Income = 500 * Working Years + 5000
22. ### Model

The Linear Regression algorithm finds these parameters based on the given data and builds a model.

Monthly Income = 500 * Working Years + 5000
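
To make "finds these parameters" concrete, here is a minimal sketch of ordinary least squares for a single predictor. The function name `fit_line` and the data points are hypothetical; the points are generated to lie exactly on the slide's line so the fit recovers the slope of 500 and the intercept of 5000.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: points placed exactly on the slide's line.
working_years = [0, 10, 20, 30, 40]
monthly_income = [500 * w + 5000 for w in working_years]

slope, intercept = fit_line(working_years, monthly_income)
print(slope, intercept)
```

With real, noisy data the points would not sit on the line, and the same formulas would return the line that minimizes the squared vertical distances.
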

26. ### • Other variables are correlated (e.g. Age vs. Working Years).

• If one variable changes, another variable would also change at the same time.
• How can we know the independent effect that comes only from Working Years?
27. ### Monthly Income = 500 * Working Years + 5000

[Plot: Monthly Income vs. Working Years]

31. ### If Working Years increases, Job Level increases, too.

[Diagram: Working Years, Job Level, Monthly Income]
32. ### Maybe Job Level is the one having an effect on Monthly Income?
33. ### Or Working Years is the one having an effect on Monthly Income?
34. ### Or both Job Level and Working Years are having an effect on Monthly Income?
35. ### Let’s investigate! What effect does 1 additional year of Working Years have on Monthly Income?
36. ### Compare people with 10 years and people with 11 years.
37. ### Compare the averages of the two groups.

10 Years: Avg 8,000. 11 Years: Avg 10,000.

39. ### But people in the two groups have various Job Levels.

Job Level: 1, Job Level: 2, Job Level: 3
40. ### 10 Years: Job Level 1, 2, 3. 11 Years: Job Level 1, 2, 3.
41. ### Is the difference really coming from Working Years?

10 Years: Avg 8,000. 11 Years: Avg 10,000.
42. ### Or maybe it’s because of the difference in Job Level?

10 Years: Avg 8,000. 11 Years: Avg 10,000.

44. ### Compare people with 10 years and people with 11 years, but with the same Job Level.

10 Years: Job Level 1. 11 Years: Job Level 1.
45. ### Compare the average Monthly Incomes of the two groups.

10 Years: Avg 8,000. 11 Years: Avg 8,500.
46. ### This difference should be coming from the difference in Working Years, NOT from Job Level.

10 Years: Avg 8,000. 11 Years: Avg 8,500.
47. ### In order to see one variable’s independent effect on Monthly Income…

[Diagram: Working Years, Job Level, Monthly Income]
48. ### Change only one variable, but hold the other variables constant.

Job Level: 1 -> 2. Working Years: 10 -> 10 (constant). Effect on Monthly Income?
49. ### Change only one variable, but hold the other variables constant.

Working Years: 10 -> 11. Job Level: 1 -> 1 (constant). Effect on Monthly Income?
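
The comparison described above can be sketched in a few lines. The records below are made up for illustration (they are not the deck's data), and `average_income` is a hypothetical helper: it averages Monthly Income for a given Working Years group, optionally restricted to one Job Level.

```python
from statistics import mean

# Hypothetical records: (working_years, job_level, monthly_income).
people = [
    (10, 1, 6000), (10, 1, 6000), (10, 1, 6000), (10, 2, 9000),
    (11, 1, 6500), (11, 2, 9500), (11, 2, 9500), (11, 3, 12500),
]


def average_income(rows, years, level=None):
    """Average income for one Working Years group, optionally at one Job Level."""
    incomes = [
        income
        for y, lvl, income in rows
        if y == years and (level is None or lvl == level)
    ]
    return mean(incomes)


# Naive comparison: Job Levels are mixed differently in the two groups.
naive_diff = average_income(people, 11) - average_income(people, 10)

# Controlled comparison: same Job Level, only Working Years changes.
same_level_diff = (
    average_income(people, 11, level=1) - average_income(people, 10, level=1)
)

print(naive_diff, same_level_diff)
```

In this toy data the naive difference is much larger than the same-level difference, because the 11-year group happens to contain more people at higher Job Levels.
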

51. ### • Interpretation of Multiple Linear Regression

• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models

53. ### Simple Linear Regression: y = a * x + b

A one-unit increase in x is expected to change y by a.
54. ### Simple Linear Regression: Monthly Income = 500 * Working Years + 5000

A one-year increase in Working Years is expected to increase Monthly Income by \$500.
55. ### Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b

A one-unit increase in x1 is expected to change y by a1, when all other variables stay the same.
56. ### Multiple Linear Regression: Monthly Income = 500 * Working Years + 600 * Job Level + 5000

A one-year increase in Working Years is expected to increase Monthly Income by \$500, if Job Level stays the same.

58. ### If you work for just 1 year…

Monthly Income = 500 * Working Years + 600 * Job Level + 5000

Working Years: 1, Job Level: 1

6100 = 500 * 1 + 600 * 1 + 5000
59. ### If you work for 2 years but stay at the same job level…

Monthly Income = 500 * Working Years + 600 * Job Level + 5000

Working Years: 2, Job Level: 1

6600 = 500 * 2 + 600 * 1 + 5000
60. ### \$500 increase!

Working Years: 1 (1 year): 6100 = 500 * 1 + 600 * 1 + 5000

Working Years: 2 (2 years): 6600 = 500 * 2 + 600 * 1 + 5000
61. ### \$500 increase! This difference is coming from the Working Years term.

Working Years: 1 (1 year): 6100 = 500 * 1 + 600 * 1 + 5000

Working Years: 2 (2 years): 6600 = 500 * 2 + 600 * 1 + 5000
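
The arithmetic on these slides can be checked with a small sketch. The function name `predict_monthly_income` is hypothetical; the coefficients 500, 600, and 5000 are the ones from the example equation.

```python
def predict_monthly_income(working_years, job_level):
    """Example model from the slides: 500 * Working Years + 600 * Job Level + 5000."""
    return 500 * working_years + 600 * job_level + 5000


one_year = predict_monthly_income(working_years=1, job_level=1)
two_years = predict_monthly_income(working_years=2, job_level=1)

# Job Level is held at 1, so the whole difference comes from Working Years.
print(one_year, two_years, two_years - one_year)
```

Changing Job Level from 1 to 2 while holding Working Years fixed would, by the same logic, change the prediction by the Job Level coefficient, 600.
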

63. ### Monthly Income = 500 * Working Years + 600 * Job Level + 5000

1 Year: 6,100. 2 Years: 6,600.
64. ### Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b

A one-unit increase in x1 is expected to change y by a1, when all other variables stay the same.

69. ### Monthly Income = 46 * Working Years + 3788 * Job Level + 5000
70. ### A one-year increase in Working Years is expected to increase Monthly Income by \$46, if Job Level stays the same.

Monthly Income = 46 * Working Years + 3788 * Job Level + 5000
71. ### A one-level increase in Job Level is expected to increase Monthly Income by \$3788, if Working Years stays the same.

Monthly Income = 46 * Working Years + 3788 * Job Level + 5000
72. ### Both Working Years and Job Level have effects on Monthly Income.

[Diagram: Working Years, Job Level, Monthly Income]
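
To illustrate why the Working Years coefficient can shrink once Job Level is added, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the data is synthetic, generated exactly from the equation above with Working Years and Job Level deliberately correlated, and `solve` and `fit` are hypothetical helpers that solve the least-squares normal equations. It is not how Exploratory fits its models.

```python
def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(rows, ys):
    """Least squares via the normal equations. Returns [intercept, coef1, ...]."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic data: Job Level tends to rise with Working Years.
years = [1, 3, 5, 8, 10, 12, 15, 20]
levels = [1, 1, 2, 2, 3, 3, 4, 5]
income = [46 * w + 3788 * l + 5000 for w, l in zip(years, levels)]

simple = fit([[w] for w in years], income)                    # Working Years only
multiple = fit([[w, l] for w, l in zip(years, levels)], income)

print("Working Years only:", simple)
print("Working Years + Job Level:", multiple)
```

With Working Years alone, its coefficient absorbs part of the Job Level effect and comes out far larger than 46. With both predictors, the fit recovers roughly 46 and 3788.
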
73. ### • Interpretation of Multiple Linear Regression

• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models


80. ### One unit in Year: 1 Year

One unit in Job Level: 1 Level

One unit in Job Role: Sales Executive -> Sales Rep

82. ### Which variables are more important?

• Standardize the variables
• Relative Importance with R Squared
83. ### Which variables are more important?

• Standardize the variables
• Relative Importance with R Squared
87. ### But, it might not be appropriate…

• The variance might vary among the variables.
• The underlying distributions vary among the variables.
• It is harder to interpret when Categorical variables are in the mix.
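
For reference, "standardize the variables" is commonly taken to mean converting each variable to z-scores (subtract the mean, divide by the standard deviation) before fitting, so that coefficients are expressed per standard deviation rather than per year or per level. The slides do not show the exact procedure, so the sketch below is an assumption; `standardize` is a hypothetical helper and the values are made up.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert values to z-scores: (value - mean) / standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return [(v - m) / s for v in values]


# Hypothetical values on very different scales.
working_years = [1, 3, 5, 8, 10, 12, 15, 20]
job_level = [1, 1, 2, 2, 3, 3, 4, 5]

z_years = standardize(working_years)
z_levels = standardize(job_level)

# After standardizing, both variables have mean 0 and standard deviation 1.
print(round(mean(z_years), 6), round(pstdev(z_years), 6))
print(round(mean(z_levels), 6), round(pstdev(z_levels), 6))
```

This puts the variables on a common scale, but it does not remove the concerns listed above, such as differing distributions or categorical predictors.
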

89. ### Which variables are more important?

• Standardize the variables
• Relative Importance with R Squared
90. ### Shows which variables are more important based on their contribution to R Squared.

92. ### The part between the prediction and the dot (the actual value) is not explained by the model.

The part between the prediction and the mean is explained by the model.

[Diagram labels: Mean, Model, Actual]
93. ### [Plot: Monthly Income (5000 to 25000) vs. Working Years (0 to 40)]
94. ### [Plot: Monthly Income vs. Working Years, with a scale from 0% at the Mean (Average) to 100%, and 60% marked]
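
The explained and unexplained parts described above correspond to the usual definition of R Squared: one minus the unexplained variation divided by the total variation around the mean. The sketch below assumes that standard definition; `r_squared` is a hypothetical helper and the numbers are made up.

```python
def r_squared(actual, predicted):
    """R Squared = 1 - (unexplained variation / total variation around the mean)."""
    mean_actual = sum(actual) / len(actual)
    total = sum((a - mean_actual) ** 2 for a in actual)
    unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - unexplained / total


# Hypothetical actual incomes and model predictions.
actual = [6000, 9000, 12000, 13000]
predicted = [7000, 9000, 11000, 13000]

model_r2 = r_squared(actual, predicted)

# Predicting the mean for everyone explains nothing: R Squared is 0.
mean_only = [sum(actual) / len(actual)] * len(actual)
baseline_r2 = r_squared(actual, mean_only)

print(model_r2, baseline_r2)
```

A model that predicts every value exactly would score 1 (100%), and a model no better than the mean scores 0 (0%).
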
95. ### Various Methods to Calculate Importance

• First Variable
• Last Variable
• Lindeman, Merenda, and Gold
96. ### First Variable Method: How much is R Squared for each variable?

| Model | R Squared |
|---|---|
| A | 0.8 |
| B | 0.2 |
| C | 0.1 |
97. ### Last Variable Method: How much does a variable contribute?

| Baseline | Model Without | Contribution |
|---|---|---|
| A + B + C | B + C | 0.9 - 0.1 = 0.8 |
| A + B + C | A + C | 0.9 - 0.7 = 0.2 |
| A + B + C | A + B | 0.9 - 0.8 = 0.1 |
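
The Last Variable calculation is just a subtraction per variable. A minimal sketch using the R Squared values from the slide (the variable names `full_r2` and `r2_without` are hypothetical):

```python
# R Squared of the full model A + B + C, from the slide.
full_r2 = 0.9

# R Squared of the model with one variable left out, from the slide.
r2_without = {"A": 0.1, "B": 0.7, "C": 0.8}

# Contribution of each variable = full model minus the model without it.
contribution = {
    variable: round(full_r2 - r2, 10)  # round to hide floating-point noise
    for variable, r2 in r2_without.items()
}

print(contribution)
```
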
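98. ### Lindeman Merenda Gold Method: How much does a variable increase R Squared?

Importance for A: compare each model without A against the same model with A.

| Without A | With A |
|---|---|
| (none) | A |
| B | B + A |
| C | C + A |
| B + C | B + C + A |

The R Squared increases shown are 0.8, 0.7, 0.75, and 0.75. Average: 0.75.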
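
As a rough sketch of the idea, the Lindeman, Merenda, and Gold approach is usually described as averaging each variable's R Squared increase over all orderings in which the variables can be added. Note that averaging over orderings weights the subsets by size, which differs slightly from a plain average over the listed comparisons. The R Squared table below is hypothetical (it is not the slide's data), and `lmg_importance` is a made-up name.

```python
from itertools import permutations

# Hypothetical R Squared for every subset of predictors A, B, C.
r2 = {
    frozenset(): 0.0,
    frozenset("A"): 0.6,
    frozenset("B"): 0.3,
    frozenset("C"): 0.1,
    frozenset("AB"): 0.8,
    frozenset("AC"): 0.65,
    frozenset("BC"): 0.35,
    frozenset("ABC"): 0.9,
}


def lmg_importance(variables, r2_table):
    """Average each variable's R Squared increase over all entry orderings."""
    totals = {v: 0.0 for v in variables}
    orderings = list(permutations(variables))
    for order in orderings:
        included = frozenset()
        for v in order:
            with_v = included | {v}
            # Increase in R Squared from adding v to what is already included.
            totals[v] += r2_table[with_v] - r2_table[included]
            included = with_v
    return {v: total / len(orderings) for v, total in totals.items()}


importance = lmg_importance("ABC", r2)
print(importance)
```

A convenient property of this averaging is that the importances add up to the R Squared of the full model.
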
99. ### • Interpretation of Multiple Linear Regression

• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models

102. ### Build a model only with Job Level, Job Role, Working Years, & Age.
103. ### R Squared decreased, but this is expected.

[Comparison: all variables vs. only 4 variables]

106. ### R Squared

• The value of R Squared increases as more predictors are added, regardless of whether the added predictor helps improve the model’s predictive power.
• It tends to give the wrong impression that the model is getting better, since the value never decreases when a new predictor is added.
107. ### Adjusted R Squared

• Adjusted R Squared increases only when an added predictor actually helps improve the model’s quality in explainability or prediction.
• It stays the same, or even decreases, when variables that are not helpful are added as predictors.
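
The slides do not give the formula, so the sketch below assumes the standard definition of Adjusted R Squared, which penalizes R Squared by the number of predictors relative to the number of rows. The function name and the example numbers are hypothetical.

```python
def adjusted_r_squared(r2, n_rows, n_predictors):
    """Standard adjustment: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


# Hypothetical case: a 5th predictor nudges R Squared from 0.800 to 0.801.
before = adjusted_r_squared(0.800, n_rows=100, n_predictors=4)
after = adjusted_r_squared(0.801, n_rows=100, n_predictors=5)

# R Squared went up, but Adjusted R Squared goes down.
print(round(before, 4), round(after, 4))
```

Here the tiny gain in R Squared does not outweigh the penalty for the extra predictor, so Adjusted R Squared drops, which is the behavior described above for unhelpful variables.
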
108. ### • Interpretation of Multiple Linear Regression

• Variable Importance
• Variable Selection - R-Squared, Adjusted R-Squared
• Building Multiple Linear Regression Models

110. ### • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

• Are those effects all significant for all the Job Roles?
• For which Job Roles can the model better explain the variance of Monthly Income?

115. ### Build Multiple Models with Repeat By

[Diagram: Repeat By splits the Data into several subsets, and a separate Model is built for each]
116. ### Repeat by Job Roles

[Diagram: Repeat By splits the Data by Job Role (HR, Research Director, Sales Rep), with one Model per role]
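
Outside of Exploratory's UI, the same idea can be sketched as "group the rows, then fit one model per group". The code below is only an illustration under stated assumptions: the rows are made up, `fit_line` is a hypothetical single-predictor least-squares helper, and it uses Job Level as the only predictor to keep the example short.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical rows: (job_role, job_level, monthly_income).
rows = [
    ("HR", 1, 3000), ("HR", 2, 6000), ("HR", 3, 9000),
    ("Sales Rep", 1, 2500), ("Sales Rep", 2, 4000), ("Sales Rep", 3, 5500),
]

# Step 1: split the data by Job Role.
groups = defaultdict(list)
for role, level, income in rows:
    groups[role].append((level, income))

# Step 2: build a separate model for each Job Role.
models = {}
for role, pairs in groups.items():
    levels = [level for level, _ in pairs]
    incomes = [income for _, income in pairs]
    models[role] = fit_line(levels, incomes)

for role, (slope, intercept) in models.items():
    print(role, slope, intercept)
```

Comparing the per-role slopes side by side is what makes it possible to ask whether a variable has a similar effect across all Job Roles.
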
117. ### • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

• Are those effects all significant for all the Job Roles?
• For which Job Roles can the model better explain the variance of Monthly Income?

119. ### A one-level increase in Job Level increases Monthly Income by about \$3000 for some job roles (e.g. Healthcare Rep, HR, Mfg. Director, etc.)
120. ### A one-level increase in Job Level increases Monthly Income by less than about \$2000 for other job roles (e.g. Lab Technician, Sales Rep)

122. ### • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

• Are those effects all significant for all the Job Roles?
• For which Job Roles can the model better explain the variance of Monthly Income?
123. ### There is not enough evidence that Working Years increases Monthly Income for some job roles, like HR, Lab Technician, and Research Director.
124. ### • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

• Are those effects all significant for all the Job Roles?
• For which Job Roles can the model better explain the variance of Monthly Income?
125. ### Monthly Income for job roles like Research Director, HR, and Manager can be explained by this model very well.
126. ### But Monthly Income for other job roles, like Sales Rep and Lab Technician, cannot be explained by this model very well.