Exploratory: Analytics : Introduction to Principal Component Analysis (PCA)

Kan Nishida
September 11, 2019

Principal Component Analysis (PCA) is an unsupervised machine learning algorithm and one of the most popular dimensionality reduction techniques. It is also often used to visualize the relationships between variables, or even between the subjects of interest such as customers, products, or countries.

Kan will introduce the basic concept of PCA and demonstrate how to use it to discover the patterns in data and understand the relationships better with many examples.

Transcript

  1. 3.

    Speaker: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. 5.

    The Third Wave: Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.
  3. 6.

    Democratization of Data Science
    First Wave (1976), Second Wave (2000), Third Wave (2016)
    Theme: Monetization, Commoditization, Democratization
    Users: Statisticians, Data Scientists, Business Users
    Algorithms, Experience, Tools: Proprietary, Open Source, UI & Programming, Programming, Open Source, UI & Automation
  4. 7.

    Exploratory Data Analysis: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)
  5. 10.

    PCA
    • An unsupervised statistical learning (machine learning) algorithm.
    • Often used as a dimensionality reduction technique, which represents the original information with fewer dimensions while minimizing the loss of information.
    • Also useful for visualizing the relationships between variables and characterizing the subjects of interest such as customers, products, or countries.
  6. 13.

    Correlation (Age vs. Monthly Income): the higher the Age, the higher the Monthly Income.
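
As a rough illustration of the correlation described on this slide, the Pearson correlation coefficient can be computed directly. This is a minimal sketch in Python using only the standard library. The function name `pearson_correlation` and the Age and Monthly Income values are made up for illustration and are not data from the talk.

```python
import math

def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (proportional to covariance).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)

# Hypothetical values chosen only to show a strong positive relationship.
ages = [25, 32, 41, 48, 55]
monthly_incomes = [2500, 3400, 4100, 5200, 5900]

r = pearson_correlation(ages, monthly_incomes)
print(round(r, 3))
```

A value close to 1 means the two variables rise together, which is the situation where keeping both of them adds little extra information.
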
  7. 18.

    When we have a set of variables that are highly correlated, do we need to keep all of them?
  8. 19.
  9. 21.
  10. 22.
  11. 25.
  12. 34.

    If we can represent the data with fewer variables… it’s easier to visualize the relationship in the data.
  13. 35.

    If we can represent the data with fewer variables… it’s easier to visualize the relationship in the data. This makes it easier to discover and understand the relationship in the data.
  14. 36.
  15. 38.

    PCA generates a new set of artificial dimensions (components) that are not correlated with one another and that carry as much of the information in the original data as possible with fewer dimensions.
  16. 39.

    How does PCA find the new dimensions?
    1. Find the center point of the whole data set in the multi-dimensional space.
    2. Find the direction that has the highest variance. (The 1st component)
    3. Find the direction that is orthogonal to the 1st component and has the highest variance. (The 2nd component)
    4. Find the direction that is orthogonal to the 1st and the 2nd components and has the highest variance. (The 3rd component)
    5. Repeat until the last (Nth) component.
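
The steps on this slide can be sketched in code. The following is a minimal, unoptimized illustration in Python using only the standard library, not the implementation used by Exploratory or any particular library. It assumes the highest-variance direction is found by power iteration on the covariance matrix, which is one of several possible techniques. All function names and the data values are hypothetical, and the variables are not standardized because the slide does not mention scaling.

```python
import math
import random

def center(rows):
    """Step 1: shift the data so its center point (column means) is the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def remove_projections(v, previous):
    """Make v orthogonal to each unit vector in `previous`."""
    for p in previous:
        dot = sum(a * b for a, b in zip(v, p))
        v = [a - dot * b for a, b in zip(v, p)]
    return v

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def next_component(cov, previous, iterations=500, seed=0):
    """Steps 2-4: power iteration restricted to directions orthogonal to the
    components already found. Returns (unit direction, variance along it)."""
    d = len(cov)
    rng = random.Random(seed)
    start = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    v = normalize(remove_projections(start, previous))
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        w = remove_projections(w, previous)
        if math.sqrt(sum(a * a for a in w)) < 1e-12:
            break  # no variance left in the remaining directions
        v = normalize(w)
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

def pca(rows):
    """Step 5: repeat until there is one component per original dimension."""
    cov = covariance_matrix(center(rows))
    components, variances = [], []
    for _ in range(len(cov)):
        direction, variance = next_component(cov, components)
        components.append(direction)
        variances.append(variance)
    return components, variances

# Made-up illustrative data: 6 observations of 3 variables.
rows = [
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.1],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 0.3],
    [2.3, 2.7, 0.6],
]
components, variances = pca(rows)
for k, (direction, variance) in enumerate(zip(components, variances), start=1):
    print(k, [round(a, 3) for a in direction], round(variance, 4))
```

Each printed row shows a component number, its direction in the original variable space, and the variance of the data along that direction. The components come out mutually orthogonal and ordered from highest to lowest variance.
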
  17. 40.

    PCA
    • Find the directions (components) in the data that have high variance.
    • Find a few components with high variance that together explain most of the variance of the data. (Principal Components)
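
One way to decide how many components count as "a few" is to look at the share of the total variance each component explains. The sketch below, in Python with only the standard library, assumes the component variances are already known and sorted from highest to lowest. The function names, the variance values, and the 90% threshold are hypothetical choices for illustration.

```python
def explained_variance_ratio(variances):
    """Share of the total variance carried by each component."""
    total = sum(variances)
    return [v / total for v in variances]

def components_needed(variances, threshold=0.9):
    """Smallest number of leading components whose cumulative share of
    variance reaches the threshold."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio(variances), start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    return len(variances)

# Hypothetical component variances, sorted from highest to lowest.
variances = [4.0, 2.0, 1.0, 0.5, 0.5]

ratios = explained_variance_ratio(variances)
print(ratios)                             # [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(components_needed(variances, 0.9))  # 4
```

With these made-up numbers, the first two components explain 75% of the variance, and four are needed to reach 90%.
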
  18. 43.
  19. 44.
  20. 45.
  21. 46.
  22. 47.
  23. 48.

    PCA

  24. 49.
  25. 50.
  26. 67.
  27. 68.

    Why do we need to create 2 new dimensions to try to express 2 original dimensions?
  28. 71.
  29. 72.

    Father Age and Mother Age are pointing in the same direction with the same length.
  30. 74.
  31. 77.

    A combination of PC1 and PC2 can express 92% of the original information.
  32. 84.
  33. 89.

    The 2nd component explains the difference between the counties that care about Adult Film and the counties that don’t.
  34. 95.

    Manager and Research Director are on the higher side of the spectrum of Monthly Income, Working Years, etc.
  35. 96.

    Lab Technician, Sales Rep, and Research Scientist are on the lower side of the spectrum of Monthly Income, Working Years, etc.
  36. 99.

    Why PCA?
    • Dimensionality reduction.
    • Makes it easier to visualize high-dimensional data.
    • Helps you better understand the patterns and characteristics inside the data.
  37. 100.