Exploratory: Linear Regression Part 2 - Multiple Regression & Variable Importance

Kan Nishida
August 21, 2019

This is a follow-up to the previous session, “Introduction to Linear Regression Part 1 - Basic”.

In this session, Kan will introduce more advanced topics such as Multiple Regression, Collinearity, and Variable Importance. He will also demonstrate how you can build multiple Linear Regression models for multiple groups and how you can use this technique to take your analysis one step deeper.

Transcript

  1. Speaker: Kan Nishida (@KanAugust), co-founder/CEO, Exploratory. Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  3. Democratization of Data Science. First Wave (1976): Monetization, Statisticians. Second Wave (2000): Commoditization, Data Scientists. Third Wave (2016): Democratization, Business Users. Other slide labels (Algorithms, Experience, Tools): Proprietary, Open Source, UI & Programming, Programming, Open Source, UI & Automation.
  4. Exploratory Data Analysis: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides).
  5. Linear Regression: An old and basic regression algorithm, but thanks to its simplicity it is still one of the most commonly used Statistical (or Machine) Learning algorithms.
  6. We want to find a simple pattern that can explain both the given data and the data we don’t have at hand.
  7. Draw a line that minimizes the distance between the actual values and the line. [Plot: Monthly Income vs. Working Years]
  8. Monthly Income = 500 * Working Years + 5000 [Plot: Monthly Income vs. Working Years]
  9. Y Intercept: 5000. Monthly Income = 500 * Working Years + 5000
  10. The Linear Regression algorithm finds these parameters from the given data and builds a model. Model: Monthly Income = 500 * Working Years + 5000
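
As a minimal sketch of how such parameters can be found, the block below fits a line by ordinary least squares (minimizing squared vertical distances, which is an assumption about what "distance" means on the slides). The function name and the data are hypothetical; the data is generated exactly from the slide's model, so the fit simply recovers 500 and 5000.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, generated from the slide's model without noise.
working_years = [0, 5, 10, 20, 30, 40]
monthly_income = [500 * w + 5000 for w in working_years]

slope, intercept = fit_simple_linear_regression(working_years, monthly_income)
print(f"Monthly Income = {slope:.0f} * Working Years + {intercept:.0f}")
```

With real, noisy data the recovered slope and intercept would only approximate the underlying relationship.
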
  11. • Other variables are correlated (e.g. Age vs. Working Years).
      • If one variable changes, another variable also changes at the same time.
      • How can we know the independent effect that comes only from Working Years?
  12. Maybe Job Level is the one having an effect on Monthly Income? [Diagram: Working Years, Job Level, Monthly Income]
  13. Or, Working Years is the one having an effect on Monthly Income? [Diagram: Working Years, Job Level, Monthly Income]
  14. Or, both Job Level and Working Years are having an effect on Monthly Income? [Diagram: Working Years, Job Level, Monthly Income]
  15. But people in the two groups have various Job Levels (Job Level: 1, Job Level: 2, Job Level: 3).
  16. Or, maybe it’s because of the difference in Job Level? (10 Years: Avg. 8,000; 11 Years: Avg. 10,000)
  17. Compare people with 10 years and people with 11 years, but with the same Job Level (Job Level: 1).
  18. Compare the average Monthly Incomes of the two groups (10 Years: Avg. 8,000; 11 Years: Avg. 8,500).
  19. This difference should be coming from the difference in Working Years, NOT from Job Level (10 Years: Avg. 8,000; 11 Years: Avg. 8,500).
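
The comparison above can be sketched in a few lines. The records below are hypothetical toy data, chosen only so that the group averages match the numbers on the slides (8,000 vs. 10,000 across all Job Levels, 8,000 vs. 8,500 within Job Level 1); the helper names are illustrative.

```python
from collections import defaultdict

# Hypothetical toy records: (working_years, job_level, monthly_income).
records = [
    (10, 1, 7800), (10, 1, 8200),
    (11, 1, 8300), (11, 1, 8700),
    (11, 2, 11400), (11, 2, 11600),
]


def average_income(rows):
    return sum(income for _, _, income in rows) / len(rows)


def group_by_years(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


# Naive comparison: mixes the effect of Working Years with Job Level.
all_levels = group_by_years(records)
naive_gap = average_income(all_levels[11]) - average_income(all_levels[10])

# Controlled comparison: only people with the same Job Level (1).
level_1_only = group_by_years([r for r in records if r[1] == 1])
controlled_gap = average_income(level_1_only[11]) - average_income(level_1_only[10])

print(naive_gap, controlled_gap)
```

The gap shrinks once Job Level is held constant, which is the intuition that Multiple Linear Regression formalizes.
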
  20. In order to see one variable’s independent effect on Monthly Income… [Diagram: Working Years, Job Level, Monthly Income]
  21. Change only one variable, but hold the other variables constant. Job Level: 1 -> 2; Working Years: 10 -> 10 (constant). Effect on Monthly Income?
  22. Change only one variable, but hold the other variables constant. Working Years: 10 -> 11; Job Level: 1 -> 1 (constant). Effect on Monthly Income?
  23. • Interpretation of Multiple Linear Regression
      • Variable Importance
      • Variable Selection - R-Squared, Adjusted R-Squared
      • Building Multiple Linear Regression Models
  24. Simple Linear Regression: y = a * x + b. A one-point increase in x is expected to change y by a.
  25. Simple Linear Regression: Monthly Income = 500 * Working Years + 5000. A one-year increase in Working Years is expected to increase Monthly Income by $500.
  26. Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b. A one-point increase in x1 is expected to change y by a1, when all other variables stay the same.
  27. Multiple Linear Regression: Monthly Income = 500 * Working Years + 600 * Job Level + 5000. A one-year increase in Working Years is expected to increase Monthly Income by $500, if Job Level stays the same.
  28. Monthly Income = 500 * Working Years + 600 * Job Level + 5000. If you work for just 1 year (Working Years: 1, Job Level: 1): 6100 = 500 * 1 + 600 * 1 + 5000
  29. Monthly Income = 500 * Working Years + 600 * Job Level + 5000. If you work for 2 years but stay at the same job level (Working Years: 2, Job Level: 1): 6600 = 500 * 2 + 600 * 1 + 5000
  30. Working Years: 1: 6100 = 500 * 1 + 600 * 1 + 5000. Working Years: 2: 6600 = 500 * 2 + 600 * 1 + 5000. From 1 year to 2 years: 6100 -> 6600, a $500 increase!
  31. Working Years: 1: 6100 = 500 * 1 + 600 * 1 + 5000. Working Years: 2: 6600 = 500 * 2 + 600 * 1 + 5000. From 1 year to 2 years: 6100 -> 6600, a $500 increase! This difference is coming from here!
  32. 1 Year: 6,100; 2 Years: 6,600. Monthly Income = 500 * Working Years + 600 * Job Level + 5000
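
The arithmetic on these slides can be checked directly. This is a small sketch using the slide's example coefficients; the function name is hypothetical.

```python
def predict_monthly_income(working_years, job_level):
    """Apply the example model from the slides."""
    return 500 * working_years + 600 * job_level + 5000


income_1_year = predict_monthly_income(working_years=1, job_level=1)
income_2_years = predict_monthly_income(working_years=2, job_level=1)

# Job Level is held at 1, so the difference comes only from Working Years.
print(income_1_year, income_2_years, income_2_years - income_1_year)
# prints: 6100 6600 500
```
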
  33. Multiple Linear Regression: y = a1 * x1 + a2 * x2 + b. A one-point increase in x1 is expected to change y by a1, when all other variables stay the same.
  34. Monthly Income = 46 * Working Years + 3788 * Job Level + 5000. A one-year increase in Working Years is expected to increase Monthly Income by $46, if Job Level stays the same.
  35. Monthly Income = 46 * Working Years + 3788 * Job Level + 5000. A one-level increase in Job Level is expected to increase Monthly Income by $3788, if Working Years stays the same.
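
For readers who want to see where such coefficients come from, here is a minimal pure-Python sketch of fitting a multiple linear regression by solving the least-squares normal equations. The helper names and the data are hypothetical; the data is generated without noise from the slide's model, so the fit should recover roughly 46, 3788, and 5000. A real analysis would normally use a statistics library instead of a hand-written solver.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(rows, ys):
    """Return least-squares coefficients, one per predictor, intercept last."""
    design = [list(r) + [1.0] for r in rows]  # extra column for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (working_years, job_level) pairs; incomes follow the slide's model.
predictors = [(1, 1), (3, 1), (5, 2), (8, 2), (12, 3), (20, 3), (25, 4), (30, 5)]
incomes = [46 * w + 3788 * j + 5000 for w, j in predictors]

coef_years, coef_level, intercept = fit_multiple_regression(predictors, incomes)
print(round(coef_years, 3), round(coef_level, 3), round(intercept, 3))
```

Each coefficient is the expected change in Monthly Income for a one-unit change in that predictor while the other predictor is held constant.
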
  36. Both Working Years and Job Level have effects on Monthly Income. [Diagram: Working Years, Job Level, Monthly Income]
  37. • Interpretation of Multiple Linear Regression
      • Variable Importance
      • Variable Selection - R-Squared, Adjusted R-Squared
      • Building Multiple Linear Regression Models
  38. One unit in Year: 1 Year. One unit in Job Level: 1 Level. One unit in Job Role: Sales Executive -> Sales Rep.
  39. But, it might not be appropriate…
      • The variance might vary among the variables.
      • The underlying distributions vary among the variables.
      • It is harder to interpret when Categorical variables are in the mix.
  40. The part between the prediction and the dot (the actual value) is not explained by the model. The part between the prediction and the mean is explained by the model. [Figure labels: Mean, Model, Actual]
  41. [Plot: Monthly Income vs. Working Years]
  42. [Plot: Monthly Income vs. Working Years, with the Mean (Average) line and a scale marked 0%, 60%, 100%]
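
The "explained vs. not explained" idea is usually summarized as R Squared. Below is a minimal sketch using the common definition 1 - (unexplained variation / total variation around the mean); the function name and the numbers are hypothetical and are not taken from the slides.

```python
def r_squared(actual, predicted):
    """Share of the variation around the mean that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    # Total variation: distance between each actual value and the mean.
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    # Unexplained variation: distance between each actual value and its prediction.
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical actual incomes and model predictions.
actual = [6000, 8000, 10000, 12000, 14000]
predicted = [7000, 8000, 10000, 12000, 13000]

model_r2 = r_squared(actual, predicted)
# Always predicting the mean explains nothing, so R Squared is 0.
mean_only_r2 = r_squared(actual, [10000] * len(actual))
print(model_r2, mean_only_r2)
```
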
  43. Various Methods to Calculate Importance
      • First Variable
      • Last Variable
      • Lindeman, Merenda, and Gold
  44. First Variable Method: How much is R Squared for each variable? Model A: 0.8; Model B: 0.2; Model C: 0.1.
  45. Last Variable Method: How much does a variable contribute? Contribution = Baseline Model - Model Without the variable. A: (A + B + C) - (B + C) = 0.9 - 0.1 = 0.8. B: (A + B + C) - (A + C) = 0.9 - 0.7 = 0.2. C: (A + B + C) - (A + B) = 0.9 - 0.8 = 0.1.
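
The Last Variable arithmetic can be written as a short sketch. The R Squared values are the ones shown on the slide; the variable names are hypothetical, and the results are rounded to hide floating-point noise.

```python
# R Squared of the baseline model that uses A + B + C (value from the slide).
baseline_r2 = 0.9

# R Squared of each model with one variable left out (values from the slide).
r2_without = {"A": 0.1, "B": 0.7, "C": 0.8}

# Contribution of a variable = how much R Squared drops when it is removed.
contribution = {
    variable: round(baseline_r2 - r2, 2) for variable, r2 in r2_without.items()
}
print(contribution)
# prints: {'A': 0.8, 'B': 0.2, 'C': 0.1}
```
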
  46. Lindeman, Merenda, and Gold Method: How much does a variable increase R Squared? Importance for A: compare R Squared With A and Without A (A vs. no variables, B + A vs. B, C + A vs. C, B + C + A vs. B + C), then take the Average. Values on the slide: 0.8, 0.7, 0.75, 0.75; Average: 0.75.
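
A minimal sketch of the Lindeman, Merenda, and Gold idea, assuming the usual formulation: average the increase in R Squared that a variable brings over every possible order in which the variables could be added. The R Squared table below is hypothetical (it is not the slide's data), and the function name is illustrative.

```python
from itertools import permutations

# Hypothetical R Squared for every subset of the predictors A, B, C.
r2_by_subset = {
    frozenset(): 0.0,
    frozenset("A"): 0.60,
    frozenset("B"): 0.30,
    frozenset("C"): 0.10,
    frozenset("AB"): 0.80,
    frozenset("AC"): 0.65,
    frozenset("BC"): 0.35,
    frozenset("ABC"): 0.90,
}


def lmg_importance(variables, r2):
    """Average each variable's R Squared increase over all orderings."""
    totals = {v: 0.0 for v in variables}
    orderings = list(permutations(variables))
    for ordering in orderings:
        included = frozenset()
        for v in ordering:
            with_v = included | {v}
            totals[v] += r2[with_v] - r2[included]  # increase from adding v
            included = with_v
    return {v: total / len(orderings) for v, total in totals.items()}


importance = lmg_importance(["A", "B", "C"], r2_by_subset)
print({v: round(score, 4) for v, score in importance.items()})
```

One property of this approach is that the importances add up to the R Squared of the full model (0.9 in this toy table).
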
  47. • Interpretation of Multiple Linear Regression
      • Variable Importance
      • Variable Selection - R-Squared, Adjusted R-Squared
      • Building Multiple Linear Regression Models
  48. R Squared
      • The value of R Squared never decreases, and typically increases, as more predictors are added, regardless of whether the added predictor helps improve the model’s predictive power.
      • This tends to give the wrong impression that the model is getting better, since the value goes up whenever a new predictor is added.
  49. Adjusted R Squared
      • Adjusted R Squared increases only when an added predictor actually helps improve the model’s quality in explainability or prediction.
      • It stays the same, or even decreases, when variables that are not helpful are added as predictors.
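
The usual formula behind this behavior is Adjusted R Squared = 1 - (1 - R Squared) * (n - 1) / (n - p - 1), where n is the number of rows and p the number of predictors. The sketch below applies it to hypothetical numbers (50 rows, and an added predictor that barely raises R Squared); none of these values come from the slides.

```python
def adjusted_r_squared(r2, n_rows, n_predictors):
    """Penalize R Squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


# Hypothetical: 2 predictors give R Squared 0.800 on 50 rows.
adj_two_predictors = adjusted_r_squared(0.800, n_rows=50, n_predictors=2)

# Hypothetical: an unhelpful 3rd predictor nudges R Squared up to 0.801.
adj_three_predictors = adjusted_r_squared(0.801, n_rows=50, n_predictors=3)

print(round(adj_two_predictors, 4), round(adj_three_predictors, 4))
```

In this example R Squared goes up slightly while Adjusted R Squared goes down, because the tiny improvement does not outweigh the penalty for the extra predictor.
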
  50. • Interpretation of Multiple Linear Regression
      • Variable Importance
      • Variable Selection - R-Squared, Adjusted R-Squared
      • Building Multiple Linear Regression Models
  51. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
      • Are those effects all significant for all the Job Roles?
      • For which Job Roles can the model better explain the variance of Monthly Income?
  52. Repeat by Job Roles (e.g. HR, Research Director, Sales Rep): the Data is split by Job Role, and a separate Model is built from each group’s Data.
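
The "repeat by group" idea can be sketched as: split the rows by Job Role, then fit one model per group. The block below is a simplified illustration, not how Exploratory implements it; it uses a single predictor (Job Level) and hypothetical data generated from made-up per-role models.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares (slope, intercept) for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical rows: (job_role, job_level, monthly_income).
rows = [
    ("HR", 1, 5000), ("HR", 2, 8000), ("HR", 3, 11000),
    ("Sales Rep", 1, 2500), ("Sales Rep", 2, 4000), ("Sales Rep", 3, 5500),
]

# Step 1: split the data by Job Role.
data_by_role = defaultdict(list)
for role, level, income in rows:
    data_by_role[role].append((level, income))

# Step 2: build one model per Job Role.
models = {}
for role, pairs in data_by_role.items():
    levels = [level for level, _ in pairs]
    incomes = [income for _, income in pairs]
    models[role] = fit_line(levels, incomes)

for role, (slope, intercept) in models.items():
    print(f"{role}: Monthly Income = {slope:.0f} * Job Level + {intercept:.0f}")
```

Comparing the per-role coefficients side by side is what lets you ask whether a variable has a similar effect for every Job Role.
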
  53. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
      • Are those effects all significant for all the Job Roles?
      • For which Job Roles can the model better explain the variance of Monthly Income?
  55. A one-level increase in Job Level increases Monthly Income by about $3000 for some job roles (e.g. Healthcare Rep, HR, Mfg. Director).
  56. A one-level increase in Job Level increases Monthly Income by less than about $2000 for other job roles (e.g. Lab Technician, Sales Rep).
  57. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
      • Are those effects all significant for all the Job Roles?
      • For which Job Roles can the model better explain the variance of Monthly Income?
  58. There is not enough evidence that Working Years increases Monthly Income for some job roles, such as HR, Lab Technician, and Research Director.
  59. • Do the variables have similar effects (coefficients) on Monthly Income for all the Job Roles?
      • Are those effects all significant for all the Job Roles?
      • For which Job Roles can the model better explain the variance of Monthly Income?
  60. Monthly Salary for job roles like Research Director, HR, and Manager can be explained by this model very well.
  61. But Monthly Salary for other job roles, like Sales Rep and Lab Technician, cannot be explained by this model very well.