Exploratory: Linear Regression Part 2 - Multiple Regression & Variable Importance

This session follows up on the previous session, “Introduction to Linear Regression Part 1 - Basic”.

In this session, Kan will introduce more advanced topics such as Multiple Regression, Collinearity, and Variable Importance. He will also demonstrate how you can build multiple Linear Regression models for multiple groups and how you can use this technique to take your analysis one step deeper.

Kan Nishida

August 21, 2019

Transcript

  1. Exploratory Seminar: Multiple Linear Regression

  2. EXPLORATORY

  3. Speaker: Kan Nishida (@KanAugust), co-founder/CEO, Exploratory

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  4. Mission Make Data Science Available for Everyone

  5. The Third Wave

    Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  6. Democratization of Data Science

    [Table: First Wave (1976), Second Wave (2000), and Third Wave (2016) compared by Algorithms, Experience, Tools, Theme, and Users. Themes: Monetization, Commoditization, Democratization. Users: Statisticians, Data Scientists, Business Users. Other cells: Proprietary, Open Source, Programming, UI & Automation.]
  7. Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling Visualization

    Analytics (Statistics / Machine Learning) Exploratory Data Analysis
  8. Questions Communication (Dashboard, Note, Slides) Data Access Data Wrangling Visualization

    Analytics (Statistics / Machine Learning)
  9. Exploratory Seminar: Multiple Linear Regression

  10. Linear Regression

    An old and basic regression algorithm, but thanks to its simplicity it is still one of the most commonly used statistical (or machine) learning algorithms.
  11. Data

  12. Employee Data

  13. Monthly Income

  14. Linear Regression Basics

  15. [Scatter plot: Monthly Income (5,000 to 25,000) vs. Working Years (0 to 40)]

  16. We want to find a simple pattern that can explain both the given data and the data we don’t have at hand.
  17. [Scatter plot: Salary vs. Working Years (0 to 40)]

  18. Draw a line that minimizes the distance between the actual values and the line.

    [Plot: Monthly Income vs. Working Years with a fitted line]
  19. [Plot: Monthly Income vs. Working Years with a fitted line]

    Monthly Income = 500 * Working Years + 5000
  20. Slope: 500

    Monthly Income = 500 * Working Years + 5000
  21. Y Intercept: 5000

    Monthly Income = 500 * Working Years + 5000
  22. The Linear Regression algorithm finds these parameters based on the given data and builds a model.

    Model: Monthly Income = 500 * Working Years + 5000
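
As a rough illustration of how the slope and intercept can be found from data, here is a minimal least-squares sketch in plain Python. The function name `fit_simple_regression` and the data points are assumptions made for this example (the points are generated to lie exactly on the slide's line); they are not the employee data used in the seminar.

```python
from statistics import mean


def fit_simple_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data generated from the slide's model.
working_years = [0, 5, 10, 20, 30, 40]
monthly_income = [500 * x + 5000 for x in working_years]

slope, intercept = fit_simple_regression(working_years, monthly_income)
print(slope, intercept)
```

Because the hypothetical points sit exactly on the line, the fit should recover a slope close to 500 and an intercept close to 5000. With real, noisy data the line would only minimize the squared vertical distances.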
  23. None
  24. None
  25. Wait…

  26. • Other variables are correlated (e.g. Age vs. Working Years).

    • If one variable changes, another variable also changes at the same time.
    • How can we know the independent effect that comes only from Working Years?
  27. [Plot: Monthly Income vs. Working Years]

    Monthly Income = 500 * Working Years + 5000
  28. Are there any variables that are correlated to Working Years?

  29. Working Years and Job Level are correlated.

  30. [Diagram: Working Years, Job Level, and Monthly Income; Working Years and Job Level are marked as correlated]

  31. If Working Years increases, Job Level increases, too.

  32. Maybe Job Level is the one having an effect on Monthly Income?

  33. Or is Working Years the one having an effect on Monthly Income?

  34. Or are both Job Level and Working Years having an effect on Monthly Income?
  35. Let’s investigate! What effect does 1 additional year of Working Years have on Monthly Income?

  36. Compare people with 10 years and people with 11 years of Working Years.

  37. Compare the averages of the two groups.

    10 Years: Avg 8,000. 11 Years: Avg 10,000.
  38. Here’s a problem…

  39. But people in the two groups have various Job Levels (Job Level: 1, Job Level: 2, Job Level: 3).

  40. 10 Years: Job Level 1, 2, 3. 11 Years: Job Level 1, 2, 3.

  41. Is the difference really coming from Working Years?

    10 Years: Avg 8,000. 11 Years: Avg 10,000.
  42. Or maybe it’s because of the difference in Job Level?

    10 Years: Avg 8,000. 11 Years: Avg 10,000.
  43. How can we see the effect of Working Years alone?

  44. Compare people with 10 years and people with 11 years, but with the same Job Level.

    10 Years: Job Level 1. 11 Years: Job Level 1.
  45. Compare the average Monthly Incomes of the two groups.

    10 Years: Avg 8,000. 11 Years: Avg 8,500.
  46. This difference should be coming from the difference in Working Years, NOT from Job Level.

    10 Years: Avg 8,000. 11 Years: Avg 8,500.
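
The comparison described above can be sketched in a few lines of Python: first compare the two Working Years groups as they are, then compare only people at the same Job Level. The employee rows and the helper `average_income_by_years` are hypothetical, chosen only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical employees: (working_years, job_level, monthly_income).
employees = [
    (10, 1, 7900), (10, 1, 8100), (10, 2, 9400),
    (11, 1, 8400), (11, 1, 8600), (11, 2, 9900), (11, 3, 13100),
]


def average_income_by_years(rows, job_level=None):
    """Average Monthly Income per Working Years group, optionally for one Job Level only."""
    groups = defaultdict(list)
    for years, level, income in rows:
        if job_level is None or level == job_level:
            groups[years].append(income)
    return {years: mean(incomes) for years, incomes in groups.items()}


naive = average_income_by_years(employees)                    # Job Levels mixed together
controlled = average_income_by_years(employees, job_level=1)  # Job Level held at 1

print(naive[11] - naive[10])
print(controlled[11] - controlled[10])
```

In this made-up data the naive gap between the two groups is much larger than the gap among Job Level 1 employees (500), because the 11-year group contains more people at higher Job Levels.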
  47. To see one variable’s independent effect on Monthly Income…

    [Diagram: Working Years, Job Level, and Monthly Income]
  48. Change only one variable, but hold the other variables constant.

    Job Level: 1 -> 2. Working Years: 10 -> 10 (constant). Effect on Monthly Income?
  49. Change only one variable, but hold the other variables constant.

    Working Years: 10 -> 11. Job Level: 1 -> 1 (constant). Effect on Monthly Income?
  50. Here comes the Multiple Linear Regression!

  51. • Interpretation of Multiple Linear Regression • Variable Importance •

    Variable Selection - R-Squared, Adjusted R-Squared • Building Multiple Linear Regression Models
  52. Revisit: Interpretation of Coefficient

  53. Simple Linear Regression: y = a * x + b

    A one-point increase in x is expected to change y by a.
  54. Simple Linear Regression: Monthly Income = 500 * Working Years + 5000

    A one-year increase in Working Years is expected to increase Monthly Income by $500.
  55. Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b

    A one-point increase in x1 is expected to change y by a1 when all other variables stay the same.
  56. Multiple Linear Regression: Monthly Income = 500 * Working Years + 600 * Job Level + 5000

    A one-year increase in Working Years is expected to increase Monthly Income by $500 if Job Level stays the same.
  57. Let’s unpack it…

  58. If you work for just 1 year… (Working Years: 1, Job Level: 1)

    Monthly Income = 500 * Working Years + 600 * Job Level + 5000
    6100 = 500 * 1 + 600 * 1 + 5000
  59. If you work for 2 years but stay at the same job level… (Working Years: 2, Job Level: 1)

    Monthly Income = 500 * Working Years + 600 * Job Level + 5000
    6600 = 500 * 2 + 600 * 1 + 5000
  60. Working Years: 1 vs. Working Years: 2

    6100 = 500 * 1 + 600 * 1 + 5000
    6600 = 500 * 2 + 600 * 1 + 5000
    $500 increase!
  61. Working Years: 1 vs. Working Years: 2

    6100 = 500 * 1 + 600 * 1 + 5000
    6600 = 500 * 2 + 600 * 1 + 5000
    $500 increase! This difference is coming from the Working Years term (500 * 1 vs. 500 * 2).
  62. 1 Year: 6,100. 2 Years: 6,600.

  63. 1 Year: 6,100. 2 Years: 6,600.

    Monthly Income = 500 * Working Years + 600 * Job Level + 5000
  64. Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b

    A one-point increase in x1 is expected to change y by a1 when all other variables stay the same.
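
The "hold the other variable constant" reading can be checked directly against the example model from the slides. The function name `predict_monthly_income` is an assumption for this sketch; the coefficients are the ones shown above.

```python
def predict_monthly_income(working_years, job_level):
    """Example model from the slides: 500 * Working Years + 600 * Job Level + 5000."""
    return 500 * working_years + 600 * job_level + 5000


income_1_year = predict_monthly_income(working_years=1, job_level=1)   # 6100
income_2_years = predict_monthly_income(working_years=2, job_level=1)  # 6600

# Only Working Years changed, so the difference equals its coefficient.
effect_of_one_year = income_2_years - income_1_year  # 500

# Only Job Level changed, so the difference equals its coefficient.
effect_of_one_level = predict_monthly_income(1, 2) - predict_monthly_income(1, 1)  # 600

print(effect_of_one_year, effect_of_one_level)
```

Changing one input by one unit while freezing the other always moves the prediction by exactly that input's coefficient, which is the interpretation stated on the slide.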
  65. Working Years & Job Levels

  66. Job Level

  67. Working Years

  68. Working Years + Job Level

  69. Monthly Income = 46 * Working Years + 3788 * Job Level + 5000

  70. A one-year increase in Working Years is expected to increase Monthly Income by $46 if Job Level stays the same.

    Monthly Income = 46 * Working Years + 3788 * Job Level + 5000
  71. A one-level increase in Job Level is expected to increase Monthly Income by $3788 if Working Years stays the same.

    Monthly Income = 46 * Working Years + 3788 * Job Level + 5000
  72. Both Working Years and Job Level have effects on Monthly Income.

    [Diagram: Working Years, Job Level, and Monthly Income]
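
To see why the Working Years coefficient shrinks once Job Level enters the model, the sketch below fits both a one-predictor and a two-predictor model to the same hypothetical data. The data are generated from the slide's equation with a Job Level that tends to rise with Working Years; the helpers `solve_linear_system` and `fit_ols` are illustrative names, solving the least-squares normal equations in plain Python.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, ys):
    """Least-squares fit; returns [intercept, coef_1, coef_2, ...]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: Job Level is correlated with Working Years.
working_years = [1, 3, 5, 8, 10, 12, 15, 20, 25, 30]
job_level = [1, 1, 2, 2, 3, 2, 3, 4, 4, 5]
monthly_income = [46 * y + 3788 * l + 5000 for y, l in zip(working_years, job_level)]

simple = fit_ols([(y,) for y in working_years], monthly_income)
multiple = fit_ols(list(zip(working_years, job_level)), monthly_income)

print("Working Years only:", simple[1])
print("Working Years + Job Level:", multiple[1], multiple[2])
```

With Working Years as the only predictor, its slope also absorbs the effect of the correlated Job Level and comes out far larger than 46. With both predictors in the model, the fit should recover coefficients close to 46 and 3788.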
  73. • Interpretation of Multiple Linear Regression • Variable Importance •

    Variable Selection - R-Squared, Adjusted R-Squared • Building Multiple Linear Regression Models
  74. How about including all the variables?

  75. None
  76. Which variables are more important than the others?

  77. None
  78. A higher coefficient doesn’t mean a variable is more important.

  79. Because the units are different.

  80. One unit in Year: 1 Year. One unit in Job Level: 1 Level. One unit in Job Role: Sales Executive -> Sales Rep.

  81. Still, we want to compare which variables have more effect!

  82. Which variables are more important? • Standardize the variables •

    Relative Importance with R Squared
  83. Which variables are more important? • Standardize the variables •

    Relative Importance with R Squared
  84. None
  85. None
  86. None
  87. But standardizing might not be appropriate…

    • The variance might vary among the variables.
    • The underlying distribution varies among the variables.
    • It is harder to interpret when Categorical variables are in the mix.
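
One way to picture standardization: for a least-squares model with an intercept, the coefficient obtained after standardizing a predictor and the outcome equals the raw coefficient multiplied by sd(predictor) / sd(outcome). The sketch below applies that conversion to the coefficients 46 and 3788 from the earlier slide. The data are hypothetical, generated from that same equation, and `standardized_coefficient` is an illustrative name.

```python
from statistics import stdev

# Hypothetical data generated from the slide's model.
working_years = [1, 3, 5, 8, 10, 12, 15, 20, 25, 30]
job_level = [1, 1, 2, 2, 3, 2, 3, 4, 4, 5]
monthly_income = [46 * y + 3788 * l + 5000 for y, l in zip(working_years, job_level)]


def standardized_coefficient(raw_coefficient, predictor, outcome):
    """Convert a raw coefficient to 'standard deviations of y per standard deviation of x'."""
    return raw_coefficient * stdev(predictor) / stdev(outcome)


std_years = standardized_coefficient(46, working_years, monthly_income)
std_level = standardized_coefficient(3788, job_level, monthly_income)

print("raw ratio:", 3788 / 46)
print("standardized ratio:", std_level / std_years)
```

After standardizing, both coefficients are on the same scale, so the gap between them is no longer driven by the fact that one unit means "1 year" for one variable and "1 level" for the other.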
  88. Job Role (Categorical)

  89. Which variables are more important? • Standardize the variables •

    Relative Importance with R Squared
  90. Relative Importance shows which variables are more important based on their contribution to R Squared.
  91. R Squared? Let’s revisit!

  92. [Plot: Actual values, Model prediction, and Mean]

    The part between the prediction and the mean is explained by the model. The part between the prediction and the dot (the actual value) is not explained by the model.
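
The explained and unexplained parts can be turned into a number with the usual definition R Squared = 1 - (unexplained variation / total variation around the mean). The sketch below is a minimal version; the actual values are hypothetical, and the predictions come from the example model Monthly Income = 500 * Working Years + 5000.

```python
def r_squared(actual, predicted):
    """R Squared = 1 - (sum of squared residuals / sum of squared deviations from the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)             # distance from the mean
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # not explained by the model
    return 1 - ss_residual / ss_total


# Hypothetical actual incomes scattered around the example line.
working_years = [0, 10, 20, 30, 40]
actual_income = [5500, 9500, 15500, 19500, 25500]
predicted_income = [500 * x + 5000 for x in working_years]

print(r_squared(actual_income, predicted_income))
```

A model that predicts every point exactly gives 1, and a model that only ever predicts the mean gives 0.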
  93. [Scatter plot: Monthly Income (5,000 to 25,000) vs. Working Years (0 to 40)]

  94. [Plot: Monthly Income vs. Working Years with the Mean (Average) line; 0%, 60%, and 100% marked on the vertical scale]
  95. Various Methods to Calculate Importance

    • First Variable
    • Last Variable
    • Lindeman, Merenda, and Gold
  96. First Variable Method: How much is R Squared for each variable?

    Model A: R Squared 0.8. Model B: R Squared 0.2. Model C: R Squared 0.1.
  97. Last Variable Method: How much does a variable contribute?

    Contribution = R Squared of the baseline model (A + B + C) - R Squared of the model without the variable.
    A: (A + B + C) - (B + C) = 0.9 - 0.1 = 0.8
    B: (A + B + C) - (A + C) = 0.9 - 0.7 = 0.2
    C: (A + B + C) - (A + B) = 0.9 - 0.8 = 0.1
  98. Lindeman Merenda Gold Method: How much does a variable increase R Squared?

    Compare each model without A to the same model with A: A, B + A, C + A, B + C + A.
    R Squared increases: 0.8, 0.7, 0.75, 0.75. Importance for A = Average = 0.75.
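
The three methods can be compared side by side once R Squared is known for every subset of predictors. The sketch below uses a hypothetical R Squared table for three predictors A, B, and C (the single-variable and full-model values echo the slides, while the two-variable values are assumed and chosen so that R Squared never decreases when a predictor is added). For the Lindeman, Merenda, and Gold method it averages each variable's gain over all orderings of the predictors, which is how that method is commonly defined and which weights the subsets slightly differently from a plain average of the rows.

```python
from itertools import permutations

# Hypothetical R Squared for every subset of the predictors A, B, C.
R2 = {
    frozenset(): 0.0,
    frozenset("A"): 0.8, frozenset("B"): 0.2, frozenset("C"): 0.1,
    frozenset("AB"): 0.85, frozenset("AC"): 0.82, frozenset("BC"): 0.25,
    frozenset("ABC"): 0.9,
}
VARIABLES = "ABC"


def first_variable(var):
    """R Squared of the model that contains only this variable."""
    return R2[frozenset(var)]


def last_variable(var):
    """Drop in R Squared when this variable is removed from the full model."""
    everything = frozenset(VARIABLES)
    return R2[everything] - R2[everything - {var}]


def lmg(var):
    """Average gain in R Squared from adding this variable, over all orderings."""
    gains = []
    for order in permutations(VARIABLES):
        before = frozenset(order[: order.index(var)])
        gains.append(R2[before | {var}] - R2[before])
    return sum(gains) / len(gains)


for v in VARIABLES:
    print(v, first_variable(v), round(last_variable(v), 4), round(lmg(v), 4))
```

A useful property of the ordering-based average is that the importances of all variables add up to the R Squared of the full model, so each one can be read as a share of the explained variance.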
  99. • Interpretation of Multiple Linear Regression • Variable Importance •

    Variable Selection - R-Squared, Adjusted R-Squared • Building Multiple Linear Regression Models
  100. All variables

  101. None
  102. Build a model only with Job Level, Job Role, Working

    Years, & Age.
  103. R Squared decreased, but this is expected. [Comparison: all variables vs. only 4 variables]

  104. Adjusted R Squared increased! [Comparison: all variables vs. only 4 variables]

  105. R-Squared vs. Adjusted R-Squared

  106. R Squared

    • The value of R Squared increases as more predictors are added, regardless of whether the added predictor helps improve the model’s predictive power.
    • It tends to give the wrong impression that the model is getting better, since the value never decreases when a new predictor is added.
  107. Adjusted R Squared

    • Adjusted R Squared increases only when an added predictor actually helps improve the model’s quality in explainability or prediction.
    • It stays the same, or even decreases, when variables that are not helpful are added as predictors.
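
The penalty behind Adjusted R Squared can be written with the standard formula 1 - (1 - R Squared) * (n - 1) / (n - p - 1), where n is the number of rows and p the number of predictors. The sketch below applies it to two hypothetical models; the row count, predictor counts, and R Squared values are made up for illustration and are not the seminar's results.

```python
def adjusted_r_squared(r_squared, n_rows, n_predictors):
    """Adjusted R Squared: penalizes R Squared for the number of predictors used."""
    return 1 - (1 - r_squared) * (n_rows - 1) / (n_rows - n_predictors - 1)


# Hypothetical comparison on 100 rows.
all_variables = adjusted_r_squared(r_squared=0.90, n_rows=100, n_predictors=30)
four_variables = adjusted_r_squared(r_squared=0.89, n_rows=100, n_predictors=4)

print(all_variables, four_variables)
```

In this made-up case the smaller model has a slightly lower R Squared but a higher Adjusted R Squared, which is the pattern described on the earlier comparison slides.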
  108. • Interpretation of Multiple Linear Regression • Variable Importance •

    Variable Selection - R-Squared, Adjusted R-Squared • Building Multiple Linear Regression Models
  109. Only Job Level, Job Role, Total Working Years, Age

  110. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

    • Are those effects significant for all the Job Roles?
    • For which Job Roles does the model explain the variance of Monthly Income better?
  111. Create Multiple Models!

  112. with Repeat By!

  113. Repeat By

  114. Build a Model: Data -> Model

  115. Build Multiple Models with Repeat By: Data -> Repeat By -> one Model for each group of Data

  116. Repeat by Job Roles (HR, Research Director, Sales Rep): Data -> Repeat By -> one Model for each Job Role
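
Outside of Exploratory, the idea behind Repeat By can be approximated by splitting the data by group and fitting one model per group. The sketch below is plain Python and does not reflect how Exploratory implements the feature; the rows, the helper names, and the use of a single predictor (Job Level) are assumptions made to keep the example short.

```python
from collections import defaultdict
from statistics import mean


def fit_simple_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fit_by_group(rows):
    """Fit one model per Job Role from rows of (job_role, job_level, monthly_income)."""
    groups = defaultdict(list)
    for role, level, income in rows:
        groups[role].append((level, income))
    return {
        role: fit_simple_regression([p[0] for p in pairs], [p[1] for p in pairs])
        for role, pairs in groups.items()
    }


# Hypothetical rows for two Job Roles.
rows = [
    ("HR", 1, 3000), ("HR", 2, 6000), ("HR", 3, 9000),
    ("Sales Rep", 1, 2500), ("Sales Rep", 2, 4300), ("Sales Rep", 3, 6100),
]

models = fit_by_group(rows)
for role, (slope, intercept) in models.items():
    print(role, slope, intercept)
```

Each Job Role gets its own slope, so the coefficients can then be compared across roles.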
  117. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

    • Are those effects significant for all the Job Roles?
    • For which Job Roles does the model explain the variance of Monthly Income better?
  118. None

  119. A one-level increase in Job Level increases Monthly Income by about $3000 for some job roles (e.g. Healthcare Rep, HR, Mfg. Director, etc.)

  120. A one-level increase in Job Level increases Monthly Income by less than about $2000 for other job roles (e.g. Lab Technician, Sales Rep)
  121. Build models with the standardized variables.

  122. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

    • Are those effects significant for all the Job Roles?
    • For which Job Roles does the model explain the variance of Monthly Income better?
  123. There is not enough evidence that Working Years increases Monthly Income for some job roles like HR, Lab Technician, and Research Director.

  124. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?

    • Are those effects significant for all the Job Roles?
    • For which Job Roles does the model explain the variance of Monthly Income better?
  125. Monthly Salary for job roles like Research Director, HR, and Manager can be explained by this model very well.

  126. But Monthly Salary for other job roles like Sales Rep and Lab Technician cannot be explained by this model very well.
  127. None
  128. Q & A

  129. Contact

    Email: kan@exploratory.io
    Home Page: https://exploratory.io
    Twitter: @KanAugust
    Online Seminar: https://exploratory.io/online-seminar