
Exploratory: An Introduction to Linear Regression

Kan Nishida
August 14, 2019

The Linear Regression algorithm is considered a basic algorithm, yet it is still one of the most popular algorithms in the world of data science because of its simplicity and its applicability to many use cases.

Kan will introduce the basics of the Linear Regression algorithm and how to gain useful insights from the prediction model it builds.

Transcript

  1. Speaker: Kan Nishida, co-founder/CEO of Exploratory. Summary: At the beginning of 2016, he launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust
  2. Data Science is not just for Engineers and Statisticians. Exploratory

    makes it possible for Everyone to do Data Science. The Third Wave
  3. Democratization of Data Science. First Wave (1976): Monetization, Proprietary tools, Programming, Statisticians. Second Wave (2000): Commoditization, Open Source, Programming, Data Scientists. Third Wave (2016): Democratization, Open Source UI & Automation, Business Users.
  4. Exploratory Data Analysis: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides).
  5. Linear Regression: an old and basic regression algorithm, but due to its simplicity it is still one of the most commonly used Statistical (or Machine) Learning algorithms.
  6. Correlation: a relationship where changes in one variable happen together with changes in another variable, following a certain rule.
  7. Correlation (Age vs. Monthly Income): the bigger the Age is, the bigger the Monthly Income is.
  8. Variance: how much would the income be in this company? (Monthly Income ranges from $1,000 to $20,000.)
  9. If we can find a correlation between Monthly Income and Working Years… (chart: Working Years vs. Monthly Income)
  10. If Working Years is 20 years, Monthly Income would be around $15,000. (chart: Working Years vs. Monthly Income)
  11. Correlation reduces the Uncertainty caused by Variance. (chart: Working Years vs. Monthly Income)
  12. If we can find strong correlations, it makes it easier

    to explain how Monthly Income changes and to predict what Monthly Income will be.
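
As a minimal sketch of this idea in Python (the numbers and column names below are made up for illustration), the Pearson correlation coefficient measures how closely two variables move together:

```python
import pandas as pd

# Made-up employee data, for illustration only
df = pd.DataFrame({
    "WorkingYears":  [1, 3, 5, 10, 15, 20, 25, 30],
    "MonthlyIncome": [5600, 6400, 7800, 10100, 12500, 15200, 17600, 20300],
})

# Pearson correlation: close to 1 means the two variables move together linearly
print(df["WorkingYears"].corr(df["MonthlyIncome"]))
```
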
  13. Correlation is not equal to Causation. Causation is a special

    type of Correlation. If we can confirm a given Correlation is Causation, then we can control the outcome.
  14. Given that we find a correlation between two variables… • How much of a change in Monthly Income can we expect from a change in Working Years? • If there is an effect, how big is it? Is it strong enough that we should pay attention to it? • How much of the variance can it explain?
  15. We want to find a simple pattern that can explain both the given data and the data we don't have at hand.
  16. Draw a line so that the distance between the actual values and the line is minimal.
  17. Monthly Income = 500 * Working Years + 5000
  18. Y Intercept = 5000: Monthly Income = 500 * Working Years + 5000
  19. The Linear Regression algorithm finds these parameters based on the given data and builds a model. Model: Monthly Income = 500 * Working Years + 5000
  20. Simple Linear Regression: the target value (y) can be estimated from a single variable (x). y = a * x + b. Monthly Income = 500 * Working Years + 5000
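
A minimal sketch of fitting a Simple Linear Regression in Python with statsmodels (one possible tool choice; the data is synthetic and generated to follow the formula above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data around Monthly Income = 500 * Working Years + 5000, plus noise
rng = np.random.default_rng(0)
years = rng.uniform(0, 30, 200)
income = 500 * years + 5000 + rng.normal(0, 1000, 200)
df = pd.DataFrame({"WorkingYears": years, "MonthlyIncome": income})

# Ordinary least squares estimates the intercept (b) and the slope (a)
model = smf.ols("MonthlyIncome ~ WorkingYears", data=df).fit()
print(model.params)  # intercept close to 5000, WorkingYears coefficient close to 500
```
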
  21. Multiple Linear Regression: there can be multiple predictor variables. y = a1 * x1 + a2 * x2 + b. Monthly Income = 500 * Working Years + 600 * Job Level + 5000
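
The same sketch extended to Multiple Linear Regression, with a second synthetic predictor JobLevel; listing both predictors in the formula gives one coefficient per variable:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: MonthlyIncome = 500 * WorkingYears + 600 * JobLevel + 5000, plus noise
rng = np.random.default_rng(1)
n = 300
years = rng.uniform(0, 30, n)
level = rng.integers(1, 6, n)
income = 500 * years + 600 * level + 5000 + rng.normal(0, 800, n)
df = pd.DataFrame({"WorkingYears": years, "JobLevel": level, "MonthlyIncome": income})

# Each coefficient is the effect of that variable with the other one in the model
fit = smf.ols("MonthlyIncome ~ WorkingYears + JobLevel", data=df).fit()
print(fit.params)  # coefficients close to 500 and 600, intercept close to 5000
```
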
  22. Residuals: with real-world data, the model never predicts perfectly.
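
A tiny worked example of residuals, with made-up points and the line from the slides; a residual is simply the actual value minus the value the line predicts:

```python
import numpy as np

# Made-up observations and the model Monthly Income = 500 * Working Years + 5000
years = np.array([0, 10, 20, 30])
actual_income = np.array([5200, 9600, 15600, 19400])
predicted_income = 500 * years + 5000

residuals = actual_income - predicted_income
print(residuals)  # [ 200 -400  600 -600] -- with real data, never all zero
```
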
  25. Slope: 500. Monthly Income = 500 * Working Years + 5000. A one-year increase in Working Years will increase Monthly Income by $500.
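
Plugging a few values into the formula is a minimal way to see what a slope of 500 means in practice:

```python
# Monthly Income = 500 * Working Years + 5000, evaluated for 0 through 4 working years
for years in range(5):
    print(years, 500 * years + 5000)  # 5000, 5500, 6000, 6500, 7000 -- each extra year adds $500
```
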
  26. Slope: 1000. Monthly Income = 1000 * Working Years + 5000. A one-year increase in Working Years will increase Monthly Income by $1,000.
  27. Slope: -500. Monthly Income = -500 * Working Years + 6500. A one-year increase in Working Years will decrease Monthly Income by $500.
  28. Slope: 0. Monthly Income = 0 * Working Years + 5500. Regardless of the values in Working Years, Monthly Income is always $5,500.
  29. Slope: 0. Monthly Income = 0 * Working Years + 5500. Working Years and Monthly Income are independent.
  30. Slope: 468. MonthlyIncome = 468 * TotalWorkingYears + 1228
  31. MonthlyIncome = 468 * TotalWorkingYears + 1228. At TotalWorkingYears = 1, the predicted MonthlyIncome is 468 * 1 + 1228 = 1696.
  33. As part of the model building, categorical variables are expanded to multiple columns so that each category has its own column with values being either 0 or 1.

      Name   Department          Name   Sales  HR
      Peter  Sales               Peter  1      0
      Maria  HR                  Maria  0      1
      Jane   Sales               Jane   1      0
      Kan    R&D                 Kan    0      0
  34. If Department is Sales, then Sales is 1 and HR is 0.
  35. If Department is R&D (the Base Level), both Sales and HR are 0.
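
A minimal sketch of this expansion in Python with pandas (one possible way to do it), dropping the base level R&D so that it becomes the all-zero case as described above:

```python
import pandas as pd

# The example data from the slides
df = pd.DataFrame({
    "Name":       ["Peter", "Maria", "Jane", "Kan"],
    "Department": ["Sales", "HR", "Sales", "R&D"],
})

# One 0/1 column per category; drop the base level (R&D) so it is represented by all zeros
dummies = pd.get_dummies(df["Department"]).drop(columns=["R&D"]).astype(int)
print(pd.concat([df["Name"], dummies], axis=1))
#     Name  HR  Sales
# 0  Peter   0      1
# 1  Maria   1      0
# 2   Jane   0      1
# 3    Kan   0      0
```
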
  37. Statistical Test on the Coefficient: • Null Hypothesis: a given variable has nothing to do with the changes in the target variable. • We can use the P Value as a guide to decide whether we can reject the Null Hypothesis.
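
A sketch of this test in Python with statsmodels, using synthetic data; the p-value reported for each coefficient tests the null hypothesis that the coefficient is 0:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data where Working Years really does drive Monthly Income
rng = np.random.default_rng(2)
years = rng.uniform(0, 30, 200)
income = 500 * years + 5000 + rng.normal(0, 1000, 200)
df = pd.DataFrame({"WorkingYears": years, "MonthlyIncome": income})

fit = smf.ols("MonthlyIncome ~ WorkingYears", data=df).fit()
print(fit.pvalues["WorkingYears"])  # essentially 0 here, so we reject the null hypothesis
```
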
  38. The P Value is 1% (0.01). Since it would be very rare to observe this effect by random chance, we can accept that Monthly Income and Working Years are correlated.
  39. The P Value is 51% (0.51). Since it is common to observe this effect by random chance, we can't conclude that Monthly Income and Working Years are correlated.
  40. The Coefficient is 467, and the Confidence Interval is between 448 and 487. The true coefficient should be included in this range with 95% probability, so the coefficient is most likely not 0.
  41. If the coefficient were 0: Slope: 0. Monthly Income = 0 * Working Years + 5500. Regardless of the values in Working Years, Monthly Income is always $5,500.
  42. The Coefficient is -9, but the Confidence Interval is between -39 and 19. The true coefficient should be included in this range with 95% probability, so the coefficient could be 0.
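
A sketch of inspecting the 95% confidence intervals with statsmodels, using synthetic data with one predictor that matters (WorkingYears) and one that does not (Distance, matching the deck's example):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# WorkingYears affects income; Distance does not
rng = np.random.default_rng(3)
n = 300
years = rng.uniform(0, 30, n)
distance = rng.uniform(1, 30, n)
income = 468 * years + 1228 + rng.normal(0, 1000, n)
df = pd.DataFrame({"WorkingYears": years, "Distance": distance, "MonthlyIncome": income})

fit = smf.ols("MonthlyIncome ~ WorkingYears + Distance", data=df).fit()
# 95% intervals: WorkingYears should exclude 0; Distance will typically include 0
print(fit.conf_int(alpha=0.05))
```
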
  43. Slope: 0. Monthly Income = 0 * Distance + 6593. Regardless of the values in Distance, Monthly Income is always $6,593.
  44. R Squared: • How well does the model perform compared to a null model? • It can be between 0 and 1, with 1 being the highest.
  45. Let's take a look at the High and Low scenarios: when R Squared is high… and when R Squared is low…
  46. The part between the prediction and the dot is not explained by the model. The part between the prediction and the mean is explained by the model. (Chart: x vs. y, with the Model line and the Mean.)
  47. Only a small part of the variability of the data is explained by the model. (Chart: x vs. y, with the Model line and the Mean.)
  48. The part between the prediction and the dot is not explained by the model. The part between the prediction and the mean is explained by the model. (Chart: x vs. y, with the Model line and the Mean.)
  49. R Squared is the ratio of the variability of the target variable values that is explained by the model.
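
A tiny numeric sketch of that ratio, using made-up points and the line from the earlier slides:

```python
import numpy as np

# Made-up observations and predictions from Monthly Income = 500 * Working Years + 5000
years = np.array([0, 10, 20, 30])
actual = np.array([5200, 9600, 15600, 19400])
predicted = 500 * years + 5000

ss_residual = np.sum((actual - predicted) ** 2)   # variability the model leaves unexplained
ss_total = np.sum((actual - actual.mean()) ** 2)  # total variability around the mean
print(1 - ss_residual / ss_total)                 # about 0.99: the line explains most of it
```
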
  50. (Chart: Working Years vs. Monthly Income)
  51. (Chart: Working Years vs. Monthly Income, showing the Average line and the 60% / 100% split of the variance.)
  52. 60% of Monthly Income's variance from the mean can be explained by Working Years. To explain the remaining 40%, we need to find other variables.
  53. • Other variables are correlated, too: one variable changes while another variable changes at the same time (e.g., Age vs. Working Years). • We want to know the independent effect of one variable alone. • One variable's effect on another variable might be different in different groups of the data, and we want to compare the effects among the groups.