Exploratory Seminar: An Introduction to K-Means Clustering


Instructor
Kan Nishida, co-founder/CEO, Exploratory (@KanAugust)
Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to make Data Science available for everyone. Prior to Exploratory, Kan was a director of development at Oracle, leading development teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, and Big Data. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.

Mission
Make Data Science available for everyone.

Data Science is not just for Engineers and Statisticians. Exploratory makes it possible for Everyone to do Data Science.

The Third Wave
[Figure: three waves of Data Science tools]
• First Wave (1976). Theme: Monetization. Users: Statisticians.
• Second Wave (2000). Theme: Commoditization. Users: Data Scientists.
• Third Wave (2016). Theme: Democratization. Users: Business Users.
Other labels in the figure: Proprietary, Open Source, UI & Programming, Programming, Smart Waves - Machine Learning / AI, Algorithms, Experience, Tools, UI & Automation, Exploratory.

Data Science Workflow
Questions, Data Access, Data Wrangling, Data Visualization, Machine Learning / Statistics, Exploration, Communication

What you can do with Exploratory
Questions, Data Access, Data Wrangling, Visualization, Machine Learning / Statistics, Exploratory Data Analysis, Communication


K-Means Clustering

Machine Learning
• Supervised Learning
• Unsupervised Learning

Unsupervised Learning
• Detect patterns / trends in Data and produce insights.
• You don’t have answers in your Data.

Unsupervised Learning
• Clustering
• Anomaly Detection
• MDS (Multi-Dimensional Scaling)
• PCA / SVD (Dimensionality Reduction)

Clustering

Attributes
Woman, Generation X, Actor, High Income, American, Southern, Long Hair

Attributes are also known as Variables, Features, Predictors, etc.
Woman, Generation X, Actor, High Income, American, Southern, Long Hair

Politicians Actors

Not yet… Getting accused of…

Algorithms
• K-Means
• Hierarchical Clustering
• GMM (Gaussian Mixture Model)
• LDA

K-Means
• You specify the number of clusters in advance.
• It works only with numeric variables.
• It can be affected by noise and outliers.

Example

US Baby Data

Cluster US States into a few groups based on their
similarity using Father Age and Mother Age variables.

Data Preparation
Each subject (US State) needs to be represented as its own unique row. One row is one state.

Data Preparation
1. Group Data by US State
2. Summarize - Mean of Mother Age and Mean of Father Age
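
The two preparation steps can also be sketched outside Exploratory. The following is a minimal Python illustration, not how Exploratory implements them. The record layout and the sample values are made up, and the output names mother_age_mean and father_age_mean simply mirror the column names used later in this deck.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical birth records: (state, mother_age, father_age).
# The values are invented for illustration only.
births = [
    ("CA", 30, 33),
    ("CA", 28, 31),
    ("TX", 25, 27),
    ("TX", 27, 31),
    ("NY", 31, 34),
]

def summarize_by_state(records):
    """Group records by state, then take the mean of each age column."""
    # Step 1: group the rows by US State.
    groups = defaultdict(list)
    for state, mother_age, father_age in records:
        groups[state].append((mother_age, father_age))
    # Step 2: summarize each group with the mean of both ages.
    return {
        state: {
            "mother_age_mean": mean(m for m, _ in rows),
            "father_age_mean": mean(f for _, f in rows),
        }
        for state, rows in groups.items()
    }

summary = summarize_by_state(births)
for state, row in summary.items():
    print(state, row)
```

The result has exactly one entry per state, which is the shape K-Means needs here.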

1. Group By
Group the data by US State. From the column menu of the “state” column (in either the Table view or the Summary view), select “Group By”.

2.1. Summarize for Mean of Mother Age
Select “Summarize (Aggregate)”, then select “mean (Average)” from the column header menu for Mother Age.

2.2. Summarize for Mean of Father Age

Summarized data at the US State level.

Run K-Means Clustering!

How Does the K-Means Algorithm Work?

1. Set the centers of the clusters randomly.
2. Assign each dot to the nearest cluster center.
3. Move each cluster center to the average of its members.
4. Repeat steps 2 and 3 until the centers no longer move.
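
The four steps can be written out as a short Python sketch. This is an illustration of the steps as listed, not the implementation Exploratory uses. The function name k_means, the seed and max_iter parameters, the choice of k random data points as starting centers, and the sample points are all assumptions made for the example.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of numbers) into k groups.

    Returns (centers, labels), where labels[i] is the index of the
    center that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 1: set the centers randomly (here: k distinct data points).
    centers = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: each point belongs to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 3: move each center to the average of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Step 4: stop once the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Made-up points forming two visible groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

Because the starting centers are random, the cluster numbering can change from run to run, and on less clearly separated data the final grouping can change too.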


K-Means Clustering in Exploratory
• Command Step
• Analytics View

Run K-Means Clustering
• Select mother_age_mean and father_age_mean with Control (Command) + Click.
• Click the column menu, select “Analytics”, select “K-Means”, and then select “Selected Columns”.

Now the US States are clustered (grouped) based on the mother and father ages. Each US State now has a cluster ID.

Visualize it!

Scatter

It doesn’t have to be just two variables. You can
use many variables.

Add Cigarette Use Ratio
Let’s add another variable, “Cigarette Use Ratio”, as one of the clustering variables.

1. Create Cigarette Use Ratio column
2. Add Cigarette Use Ratio column to K-Means and re-run.

1. Create Cigarette Use Ratio Column
Summarize with ‘Ratio of TRUE’.
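
As a rough illustration of what a ratio of TRUE means, the sketch below counts the TRUE values in each state and divides by the number of rows. It assumes that is how the summary is calculated. The per-birth flags are invented, and how Exploratory treats missing values is not covered here.

```python
# Hypothetical per-birth flags: True if cigarette use was recorded.
cigarette_use = {
    "CA": [False, False, True, False],
    "TX": [True, False, True, False],
}

def ratio_of_true(flags):
    """Share of True values: True counts as 1, False as 0."""
    return sum(flags) / len(flags)

cigarette_use_ratio = {
    state: ratio_of_true(flags) for state, flags in cigarette_use.items()
}
print(cigarette_use_ratio)
```

The result always falls between 0 and 1, which is a much smaller range than the ages.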


Add cigarette_use_ratio column and re-run K-Means
• Select the ‘K-Means’ step and click the token to add the cigarette_use_ratio column.
• Add the ‘cigarette_use_ratio’ column.
• The clusters are recalculated.

Opening the chart again will automatically refresh and use the
new clusters. But, it looks the same as before… why?

The clusters are separated by the mother_age_mean variable, but not by the cigarette_use_ratio variable.

Compared to mother_age and father_age, cigarette_use_ratio values are very small.

K-Means Algorithm
• It calculates the distances between the centers of the clusters and the members, so the variables with larger values tend to have a bigger influence on how the data gets clustered.
• Mother Age and Father Age values are in a similar range, while cigarette_use_ratio values are much smaller than the ages.
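
A small worked example shows how the scale of a variable drives the distance. It assumes the straight-line (Euclidean) distance, and the three states and their values are made up for illustration.

```python
import math

# Hypothetical states as
# (mother_age_mean, father_age_mean, cigarette_use_ratio).
state_a = (27.0, 30.0, 0.05)
state_b = (29.0, 32.0, 0.05)  # differs from A only in the ages
state_c = (27.0, 30.0, 0.25)  # differs from A only in the ratio

dist_ab = math.dist(state_a, state_b)
dist_ac = math.dist(state_a, state_c)

print("A to B:", round(dist_ab, 3))
print("A to C:", round(dist_ac, 3))
```

State C has five times the cigarette use ratio of state A, yet it sits far closer to A than state B does, because a two-year difference in age outweighs any possible difference in a ratio that stays between 0 and 1.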

Normalize!

Select all three columns, and select the ‘normalize’ function from the column header menu.

The values are normalized within each variable.

Now we can compare the variables on the same scale.
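
One common way to normalize is to turn each variable into z-scores: subtract the mean, then divide by the standard deviation. The sketch below assumes that is the kind of normalization applied here, and it uses the sample standard deviation. The function is a stand-in written for this example, not Exploratory's own function, and the values are made up.

```python
from statistics import mean, stdev

def normalize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores).

    Note: a column where every value is the same has a standard
    deviation of 0 and cannot be rescaled this way.
    """
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    return [(v - center) / spread for v in values]

# Made-up values for three states.
mother_age_mean = [26.0, 28.0, 30.0]
cigarette_use_ratio = [0.05, 0.10, 0.15]

mother_age_normalized = normalize(mother_age_mean)
cigarette_normalized = normalize(cigarette_use_ratio)

print(mother_age_normalized)
print(cigarette_normalized)
```

After rescaling, both variables span roughly the same range, so neither one dominates the distance calculation.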

Clusters 2 and 3 cannot be separated by Mother and Father ages alone. This means Cigarette Use Ratio makes the difference.

Clusters 2 and 3 are differentiated by Cigarette Use Ratio values.

Using Analytics View

K-Means with Analytics View
• It builds a K-Means Clustering model and generates a set of predefined charts to help you understand the characteristics of the clusters.
• It normalizes the data before building the model.

Select ‘K-Means Clustering’

Select all 3 numerical columns.

• Each dot represents one US State.
• The X and Y axes are artificially created dimensions that express as much of the variance of the original measures (Mother Age, Father Age, Cigarette Use Ratio) as possible.

• The original measures (Mother Age, Father Age, Cigarette Use Ratio) are shown as gray lines.
• The dots closer to a gray line have higher values for that measure.
• Example: The blue cluster tends to have high values of Father Age and Mother Age.

• Cluster 1: Father Age and Mother Age are higher. Cigarette Use Ratio is lower.
• Cluster 2: Father Age, Mother Age, and Cigarette Use Ratio are lower.
• Cluster 3: Father Age and Mother Age are lower. Cigarette Use Ratio is higher.

Visualizing K-Means Clustering Results to Understand the Characteristics of Clusters Better
Take a look at the following blog post for more details.
https://exploratory.io/note/kanaugust/Introduction-to-K-Means-Clustering-under-Analytics-View-bjW2EZc3Ge

How many clusters should we build?

Elbow Curve Method

Select TRUE for Elbow Method option.

The Y-axis shows the distances between the cluster centers and their members. A point where the value drops significantly means that increasing the number of clusters up to that point helps capture the differences in the data in a meaningful way.
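
The curve can be reproduced with a short Python sketch that runs K-Means for several cluster counts and records one value per count. Several choices here are assumptions made for the example: the measure is the sum of squared distances from each member to its center, the starting centers are picked with a deterministic farthest-point rule instead of randomly so the output is repeatable, and the data is made up with three visible groups. The exact measure Exploratory plots may differ.

```python
import math

def k_means(points, k, max_iter=100):
    """Plain K-Means; returns (centers, labels)."""
    # Deterministic start so the curve is reproducible: take the first
    # point, then repeatedly add the point farthest from chosen centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Move each center to the average of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def total_within_distance(points, k):
    """Sum of squared distances from each point to its cluster center."""
    centers, labels = k_means(points, k)
    return sum(
        math.dist(p, centers[label]) ** 2 for p, label in zip(points, labels)
    )

# Made-up data with three visible groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

curve = {k: total_within_distance(points, k) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 2))
```

On this data the value falls steeply up to three clusters and then barely changes, so the elbow sits at three.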

Future Seminars

January 8th (Tuesday), 2019
• Data Wrangling: Working with Date / Time Data and Visualize It
Planned
• Analytics 101 - When to use which algorithms?
• Data Wrangling: Introduction to Regular Expression
https://exploratory.io/online-seminar

Contact
Email: [email protected]
Data Science Training: https://exploratory.io/training
Twitter: @KanAugust
Online Seminar: https://exploratory.io/online-seminar
