Exploratory: Analytics : Introduction to Principal Component Analysis (PCA)

Introduction to PCA Principal Component Analysis Exploratory Seminar #18

EXPLORATORY

Speaker: Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)

Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to democratize Data Science. Prior to Exploratory, Kan was a director of product development at Oracle, leading teams building various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Mission Make Data Science Available for Everyone

Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

The Third Wave: Democratization of Data Science

• First Wave (1976): Monetization. Users: Statisticians.
• Second Wave (2000): Commoditization. Users: Data Scientists.
• Third Wave (2016): Democratization. Users: Business Users.

The slide also compares Algorithms, Experience, and Tools across the three waves (Proprietary, Open Source, Programming, UI & Automation).

Exploratory Data Analysis: Questions, Data Access, Data Wrangling, Visualization, Analytics (Statistics / Machine Learning), Communication (Dashboard, Note, Slides)


PCA
• Unsupervised Statistical Learning (Machine Learning) algorithm.
• Often used as a Dimensionality Reduction technique, which represents the original information with fewer dimensions while minimizing the loss of information.
• Also useful for visualizing the relationships between the variables and characterizing the subjects of your interest, such as customers, products, countries, etc.

Variation: values range from $1,000 to $15,000 around the Average of $6,503.

Correlation: Age vs. Monthly Income. The higher the Age, the higher the Monthly Income.

Correlation ranges from -1 (Strong Negative Correlation) through 0 (No Correlation) to 1 (Strong Positive Correlation).
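To make the -1 to 1 scale concrete, here is a minimal Python sketch of the Pearson correlation coefficient. The Age and Monthly Income values are made-up numbers for illustration, not data from the seminar, and `pearson` is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only.
age = [22, 30, 38, 45, 53]
monthly_income = [2000, 3500, 6000, 9000, 12000]

# Income rises with age, so the coefficient is close to +1.
r_positive = pearson(age, monthly_income)

# Flipping the sign of one variable flips the sign of the correlation.
r_negative = pearson(age, [-v for v in monthly_income])
```

A value near 1 means the two variables move together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.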

Correlation

Job Level vs. Monthly Income

When we have a set of variables that are highly
correlated, do we need to keep all of them?

Do we need all of them to explain the characteristics
of the subjects of our interest?

California Election 2016 - Ballot Measures

Cigarette Tax vs. Firearms Ammunition

Do we need all of the measures?

Maybe we have only 3 types of measures.

The measures Democratic counties are overwhelmingly supporting.

The measures Democratic counties are overwhelmingly NOT supporting.

The measures Democratic counties don’t give a !! about.

The fewer the questions, the higher the chance you’ll get answers.

But also…


If we can represent the data with fewer variables… it’s easier to visualize the relationships in the data. This makes it easier to discover and understand the relationships in the data.

Remember, we are comfortable visualizing and understanding data in 2 dimensions (maybe 3, though not me).

PCA (Principal Component Analysis)

Generates a new set of artificial dimensions (components) that are not correlated with one another and that carry as much information from the original data as possible with fewer dimensions.

How does PCA find the new dimensions?
1. Finds the center point of the whole data in the multi-dimensional space.
2. Finds the direction that has the highest variance. (The 1st Component)
3. Finds the direction that is orthogonal to the 1st component and has the highest variance. (The 2nd Component)
4. Finds the direction that is orthogonal to the 1st and the 2nd components and has the highest variance. (The 3rd Component)
5. Repeats until the last (Nth) component.
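The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation Exploratory uses: the data is synthetic, and instead of searching for one direction at a time it takes the eigenvectors of the covariance matrix, which yields the same orthogonal directions ordered by variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 3 variables, the first two strongly correlated.
n = 500
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.1 * rng.normal(size=n),
    shared + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Step 1: find the center point and move the data so the center is the origin.
center = X.mean(axis=0)
Xc = X - center

# Steps 2 to 5: the eigenvectors of the covariance matrix are mutually
# orthogonal directions; the eigenvalues are the variances along them.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending order, so reverse to put the highest variance first.
order = np.argsort(eigvals)[::-1]
variances = eigvals[order]
components = eigvecs[:, order]  # column j is the (j+1)-th component

# Coordinates of every row in the new dimensions (the component scores).
scores = Xc @ components
```

The columns of `scores` are uncorrelated with one another, and their variances equal `variances`, which is what makes the components usable as a replacement set of dimensions.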

PCA
• Find the directions (Components) in the data that have high variance.
• Find a few components with high variance that can explain most of the variance in the data. (Principal Components)
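One way to decide how many components count as "a few" is to look at the share of variance each one explains. The sketch below is an illustration only: the data is synthetic, `explained_variance_ratio` is a hypothetical helper, and the 90% threshold is an arbitrary assumption rather than a rule from the seminar.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data, largest first.
    s = np.linalg.svd(Xc, compute_uv=False)
    # Variance along each component is s**2 / (n - 1).
    variances = s ** 2 / (X.shape[0] - 1)
    return variances / variances.sum()

rng = np.random.default_rng(1)

# Synthetic data (assumption): 5 observed variables driven by 2 hidden factors
# plus a little noise, so 2 components should carry almost all the variance.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.90  # arbitrary choice for illustration
k = int(np.searchsorted(cumulative, threshold) + 1)
```

Here `ratio` plays the role of the per-component percentages shown in the examples, and `cumulative` is the running total used to say that the first few components together represent a given share of the variance.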

Examples:
• US Baby Data
• California Election Data
• Employee Data

US Baby Data

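Rotate the axes: Father Age (Low to High) and Mother Age (Low to High) become PC1 (Low to High) and PC2 (Low to High).

With only 2 original variables, PC1 and PC2 together represent 100% of the variance.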
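With two variables, PCA amounts to rotating the axes (possibly with an axis flip), so nothing is lost. The sketch below checks this on synthetic Father Age and Mother Age values; the numbers are invented for illustration and are not the US Baby Data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic ages (assumption): fathers tend to be a bit older than mothers.
n = 400
mother_age = rng.normal(28, 5, size=n)
father_age = mother_age + rng.normal(2, 3, size=n)
X = np.column_stack([father_age, mother_age])

# Center the data, then take the eigenvectors of the 2x2 covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
rotation = eigvecs[:, order]  # orthogonal matrix: columns are PC1 and PC2

# Each point expressed on the new PC1 / PC2 axes.
scores = Xc @ rotation

# Total variance is the same before and after the change of axes.
total_before = Xc.var(axis=0, ddof=1).sum()
total_after = scores.var(axis=0, ddof=1).sum()

# Share of variance on each new axis; the two shares add up to 1 (100%).
pc_share = eigvals / eigvals.sum()
```

Because the two ages are correlated, most of the share lands on PC1, which is why a single new axis can already describe "both ages high" versus "both ages low".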

Both Father Age and Mother Age are High.

Both Father Age and Mother Age are Low.

Both Father Age and Mother Age are about Average.

Father Age is High and Mother Age is Low.

Father Age is Low and Mother Age is High.

Father Age is Super High and Mother Age is Middle.

Father Age is Middle and Mother Age is Super High.

D.C. (Blue) vs. Wyoming (Orange)

Mother Age and Father Age in Wyoming are Low.

Mother Age and Father Age in D.C. are High.

Why do we need to create 2 new dimensions to express the 2 original dimensions?

umm… You don’t need to!

But when you start adding more dimensions, you will start to appreciate it…

Father Age and Mother Age are pointing in the same direction with the same length.

Weight Pounds and Mother Age/Father Age are orthogonal.

PC1 is expressing Mother Age and Father Age information well.

PC2 is expressing Weight Pounds information well.

A combination of PC1 and PC2 can express 92% of the original information.
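The pattern described here (PC1 carrying the two ages, PC2 carrying weight) can be reproduced on synthetic data. The sketch below is an illustration under assumptions: the values are generated, not taken from the US Baby Data, the correlation of about 0.75 between the ages is an arbitrary choice, and the variables are standardized before PCA because they are in different units.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Synthetic data (assumption): the two ages are correlated with each other,
# and weight is unrelated to either of them.
z1, z2, z3 = rng.normal(size=(3, n))
rho = 0.75
mother_age = 28 + 5 * z1
father_age = 30 + 6 * (rho * z1 + np.sqrt(1 - rho ** 2) * z2)
weight_pounds = 7.3 + 1.2 * z3
X = np.column_stack([father_age, mother_age, weight_pounds])

# PCA on standardized variables is PCA on the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
loadings = eigvecs[:, order]  # rows: father, mother, weight; columns: PC1..PC3

# Cumulative share of variance explained by the first components.
cumulative = np.cumsum(eigvals) / eigvals.sum()

# How strongly each original variable contributes to PC1 and to PC2.
pc1_loadings = dict(zip(["father_age", "mother_age", "weight_pounds"], loadings[:, 0]))
pc2_loadings = dict(zip(["father_age", "mother_age", "weight_pounds"], loadings[:, 1]))
```

In this setup the two ages load on PC1 with the same sign and nearly equal size, weight loads almost entirely on PC2, and `cumulative[1]` gives the share that PC1 and PC2 express together. The sign of each column of `loadings` is arbitrary, so only relative signs within a column are meaningful.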

Father Age and Mother Age are about Average, but Weight
is Super High.

Father Age and Mother Age are about Average, but Weight
is Super Low.

Father Age and Mother Age are Super High, but Weight
is about Average.

Father Age and Mother Age are Super Low, but Weight
is about Average.

California Election 2016

California Election 2016 - Ballot Measures

The 1st component can represent 73% of variance in data.

The 1st and 2nd components together can represent 88% of
variance in data.

The 1st component is explaining the difference between Democratic and Republican counties.

The 2nd component is explaining the difference between the counties
that care about Adult Film and the counties that don’t care.

Employee Data

Performance and Percent Salary Hike are highly correlated. Monthly Income, Total Working Years, Job Level, and Age are highly correlated.

These 2 dimensions can represent 31% of variance in data.

Manager and Research Director are at the higher side of
the spectrum of Monthly Income, Working Years, etc.

Lab Technician, Sales Rep, and Research Scientist are at the
lower side of the spectrum of Monthly Income, Working Years, etc.

Without Performance Rate and Percent Salary Hike variables

These 2 dimensions can still represent 31% of variance in
data.

Why PCA?
• Dimensionality Reduction.
• Make it easier to visualize high dimensional data.
• Understand the patterns and characteristics inside the Data better.
