Exploratory Analytics: Introduction to Principal Component Analysis (PCA)

Kan Nishida

September 11, 2019

Principal Component Analysis (PCA) is an unsupervised machine learning algorithm and is known as one of the most popular dimensionality reduction techniques. It is also often used to visualize the relationships between variables, or even between the subjects of your interest such as customers, products, countries, etc.

Kan will introduce the basic concept of PCA and demonstrate, with many examples, how to use it to discover patterns in data and better understand the relationships.


Transcript

  1. Introduction to PCA Principal Component Analysis Exploratory Seminar #18

  2. EXPLORATORY

  3. Kan Nishida co-founder/CEO Exploratory Summary At the beginning of 2016, Kan launched Exploratory,

    Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data. @KanAugust Speaker
  4. Mission Make Data Science Available for Everyone

  5. Data Science is not just for Engineers and Statisticians. Exploratory

    makes it possible for Everyone to do Data Science. The Third Wave
  6. Democratization of Data Science. First Wave (1976): Proprietary tools, Monetization, Statisticians. Second Wave (2000): Open Source & Programming, Commoditization, Data Scientists. Third Wave (2016): Open Source UI & Automation, Democratization, Business Users.
  7. Exploratory Data Analysis: Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning)
  8. Introduction to PCA Principal Component Analysis Exploratory Seminar #18

  9. Questions, Communication (Dashboard, Note, Slides), Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning)
  10. PCA • Unsupervised Statistical Learning (Machine Learning) algorithm. • Often used as a Dimensionality Reduction technique, which represents the original information with fewer dimensions while minimizing the loss of information. • Also useful for visualizing the relationships between the variables and characterizing the subjects of your interest, such as customers, products, countries, etc.
  11. Average $6,503, with Variation from $1,000 to $15,000.

  12. $6,503 Average

  13. Age vs. Monthly Income: the bigger the Age is, the bigger the Monthly Income is. Correlation
  14. Correlation: ranges from -1 (Strong Negative Correlation) through 0 (No Correlation) to 1 (Strong Positive Correlation).
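
  To make the -1 to +1 scale concrete, here is a minimal Python sketch (with made-up data, not the seminar's dataset) that computes Pearson correlations for a strong positive, a strong negative, and a near-zero relationship:

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  age = rng.uniform(20, 60, 500)

  # Strong positive correlation: income tends to rise with age.
  income = 200 * age + rng.normal(0, 2000, 500)
  # Strong negative correlation: this variable falls as age rises.
  declining = -150 * age + rng.normal(0, 2000, 500)
  # No correlation: pure noise, unrelated to age.
  noise = rng.normal(0, 1, 500)

  print(np.corrcoef(age, income)[0, 1])     # close to +1
  print(np.corrcoef(age, declining)[0, 1])  # close to -1
  print(np.corrcoef(age, noise)[0, 1])      # close to 0
  ```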
  15. Correlation

  16. Job Level vs. Monthly Income

  17. Job Level vs. Monthly Income

  18. When we have a set of variables that are highly

    correlated, do we need to keep all of them?
  19. Do we need all of them to explain the characteristics

    of the subjects of our interest?
  20. California Election 2016 - Ballot Measures

  21. None
  22. None
  23. Cigarette Tax vs. Firearms Ammunition

  24. Cigarette Tax vs. Firearms Ammunition

  25. None
  26. Do we need all of the measures?

  27. Maybe we have only 3 types of measures.

  28. The measures Democratic counties are overwhelmingly supporting.

  29. The measures Democratic counties are overwhelmingly NOT supporting.

  30. The measures Democratic counties don’t really care about.

  31. The fewer the questions, the higher the chance you’ll get answers.

  32. But also…

  33. If we can represent the data with fewer variables…

  34. If we can represent the data with fewer variables… it’s easier to visualize the relationships in the data.
  35. If we can represent the data with fewer variables… it’s easier to visualize the relationships in the data. This makes it easier to discover and understand the relationships in the data.
  36. Remember, we are comfortable with visualizing data and understanding it

    with 2 dimensions. (maybe 3, though not me )
  37. PCA (Principal Component Analysis)

  38. Generates a new set of artificial dimensions (components) that are not correlated with one another and that can carry as much of the original information as possible with fewer dimensions. PCA
  39. How does PCA find the new dimensions? 1. Find the center point of the whole data in the multi-dimensional space. 2. Find the direction that has the highest variance (the 1st component). 3. Find the direction that is orthogonal to the 1st component and has the highest variance (the 2nd component). 4. Find the direction that is orthogonal to the 1st and 2nd components and has the highest variance (the 3rd component). 5. Repeat until the last (Nth) component.
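
  As a sketch of those steps in code (a plain NumPy implementation via the covariance matrix and its eigenvectors; an illustration, not the exact routine Exploratory runs):

  ```python
  import numpy as np

  def pca(X, n_components=2):
      """Return the top components and the data projected onto them.

      X: (n_samples, n_features) numeric array.
      """
      # 1. Find the center of the data and shift it to the origin.
      X_centered = X - X.mean(axis=0)

      # 2-4. The eigenvectors of the covariance matrix are mutually orthogonal
      #      directions; their eigenvalues are the variances along them.
      cov = np.cov(X_centered, rowvar=False)
      eigenvalues, eigenvectors = np.linalg.eigh(cov)

      # 5. Order the directions from highest to lowest variance, keep the top N.
      order = np.argsort(eigenvalues)[::-1]
      components = eigenvectors[:, order[:n_components]]

      # Project the original rows onto the new dimensions (PC1, PC2, ...).
      return components, X_centered @ components
  ```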
  40. PCA • Find the directions (components) in the data that have high variance. • Keep the few components with high variance that can explain most of the variance in the data (the Principal Components).
  41. Examples: • US Baby Data • California Election Data •

    Employee Data
  42. US Baby Data

  43. None
  44. None
  45. None
  46. None
  47. Rotate

  48. PCA

  49. None
  50. None
  51. High Father Age Low

  52. High Mother Age Low

  53. High PC1 Low

  54. High PC2 Low

  55. PC2 PC1 100%

  56. Both Father Age and Mother Age are High.

  57. Both Father Age and Mother Age are Low.

  58. Both Father Age and Mother Age are about Average.

  59. Father Age is High and Mother Age is Low.

  60. Father Age is Low and Mother Age is High.

  61. Father Age is Super High and Mother Age is Middle.

  62. Father Age is Middle and Mother Age is Super High.

  63. D.C. Wyoming

  64. D.C. (Blue) vs. Wyoming (Orange)

  65. Mother Age and Father Age in Wyoming are Low.

  66. Mother Age and Father Age in D.C. are High.

  67. None
  68. Why do we need to create 2 new dimensions to try to express 2 original dimensions?
  69. umm… You don’t need to!

  70. But when you start adding more dimensions, you will start to appreciate it…
  71. None
  72. Father Age and Mother Age are pointing in the same direction with the same length.
  73. Weight Pounds and Mother Age/Father Age are orthogonal.

  74. None
  75. PC1 is expressing Mother Age and Father Age information well.

    High PC1 Low
  76. PC2 is expressing Weight Pounds information well. High PC2 Low

  77. A combination of PC1 and PC2 can express 92% of the original information.
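
  In code, that share of the original information is what scikit-learn reports as explained_variance_ratio_; a sketch with stand-in data (the column names mimic the baby data, but the numbers are fabricated):

  ```python
  import numpy as np
  from sklearn.decomposition import PCA

  rng = np.random.default_rng(1)
  father_age = rng.normal(32, 6, 1000)
  mother_age = father_age + rng.normal(0, 2, 1000)  # strongly tied to father age
  weight_pounds = rng.normal(7.3, 1.2, 1000)        # roughly independent of both

  X = np.column_stack([father_age, mother_age, weight_pounds])

  pca = PCA(n_components=2).fit(X)
  print(pca.explained_variance_ratio_)        # share of variance carried by PC1 and PC2
  print(pca.explained_variance_ratio_.sum())  # combined share, analogous to the 92% figure
  ```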
  78. Father Age and Mother Age are about Average, but Weight

    is Super High.
  79. Father Age and Mother Age are about Average, but Weight

    is Super Low.
  80. Father Age and Mother Age are Super High, but Weight

    is about Average.
  81. Father Age and Mother Age are Super Low, but Weight

    is about Average.
  82. California Election 2016

  83. California Election 2016 - Ballot Measures

  84. None
  85. The 1st component can represent 73% of variance in data.

  86. The 1st and 2nd components together can represent 88% of

    variance in data.
  87. The 1st and 2nd components together can represent 88% of

    variance in data.
  88. The 1st component is explaining the difference between Democratic and Republican counties.
  89. The 2nd component is explaining the difference between the counties

    that care about Adult Film and the counties that don’t care.
  90. Employee Data

  91. Employee Data

  92. Employee Data

  93. Performance / Percent Salary Hike are highly correlated. Monthly Income / Total Working Years / Job Level / Age are highly correlated.
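
  One way to spot groups like these before running PCA is a plain correlation matrix; a sketch with pandas, where the column names only resemble the employee data and the values are fabricated:

  ```python
  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(2)
  n = 300
  working_years = rng.uniform(1, 35, n)

  # Fabricated employee table: income, job level, and age all move with working years.
  df = pd.DataFrame({
      "TotalWorkingYears": working_years,
      "MonthlyIncome": 500 * working_years + rng.normal(0, 1500, n),
      "JobLevel": (working_years // 8 + 1).astype(int),
      "Age": working_years + rng.normal(25, 3, n),
      "PercentSalaryHike": rng.uniform(11, 25, n),
  })

  # Highly correlated groups show up as blocks of values close to +1 or -1.
  print(df.corr().round(2))
  ```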
  94. These 2 dimensions can represent 31% of variance in data.

  95. Manager and Research Director are at the higher side of

    the spectrum of Monthly Income, Working Years, etc.
  96. Lab Technician, Sales Rep, and Research Scientist are at the

    lower side of the spectrum of Monthly Income, Working Years, etc.
  97. Without Performance Rate and Percent Salary Hike variables

  98. These 2 dimensions can still represent 31% of variance in

    data.
  99. Why PCA? • Dimensionality Reduction. • Makes it easier to visualize high-dimensional data. • Helps you understand the patterns and characteristics inside the data better.
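
  To close out the "easier to visualize" point, a minimal sketch of projecting a many-column table down to PC1 and PC2 for a single scatter plot (standardizing first so no one column dominates; the data here is only a placeholder):

  ```python
  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn.decomposition import PCA
  from sklearn.preprocessing import StandardScaler

  # Placeholder for a high-dimensional numeric table (e.g. many measures per county).
  rng = np.random.default_rng(3)
  X = rng.normal(size=(200, 12))

  # Standardize each column, then keep only the first two principal components.
  X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

  plt.scatter(X_2d[:, 0], X_2d[:, 1])
  plt.xlabel("PC1")
  plt.ylabel("PC2")
  plt.show()
  ```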
  100. Q & A

  101. EXPLORATORY