Exploratory Seminar: An Introduction to K-Means Clustering

Kan Nishida
December 18, 2018

This seminar introduces the K-Means Clustering algorithm, which segments data based on a given set of variables, and demonstrates it with Exploratory’s Analytics view.

Transcript

  1. Instructor: Kan Nishida (@KanAugust), co-founder/CEO, Exploratory

    Summary: At the beginning of 2016, Kan launched Exploratory, Inc. to make Data Science available for everyone. Prior to Exploratory, Kan was a director of development at Oracle, leading development teams that built various Data Science products in areas including Machine Learning, BI, Data Visualization, Mobile Analytics, Big Data, etc. While at Oracle, Kan also provided training and consulting services to help organizations transform with data.
  2. The Third Wave: Data Science is not just for Engineers and Statisticians.

    Exploratory makes it possible for Everyone to do Data Science.
  3. Waves - Machine Learning / AI

    First Wave (1976): Tools: Proprietary. Experience: UI & Programming. Theme: Monetization. Users: Statisticians.
    Second Wave (2000): Tools: Open Source. Experience: Programming. Theme: Commoditization. Users: Data Scientists.
    Third Wave (2016): Algorithms: Smart. Tools: Open Source. Experience: UI & Automation. Theme: Democratization. Users: Business Users. Exploratory.
  4. Data Science Workflow: Questions, Communication, Data Access, Data Wrangling,

    Data Visualization, Machine Learning / Statistics, Exploration.
  5. What you can do with Exploratory: Questions, Communication, Data Access,

    Data Wrangling, Visualization, Machine Learning / Statistics, Exploratory Data Analysis.
  6. Unsupervised Learning • Detect patterns / trends in Data and produce insights.

    • You don’t have answers in your Data.
  7. Also known as Variables, Features, Predictors, etc. Examples: Woman, Generation X,

    Actor, High Income, American, Southern, Long Hair.
  8. K-Means • Specify the number of clusters • Only based

    on numeric variables • Can be affected by noise and outliers
  9. Cluster US States into a few groups based on their

    similarity using Father Age and Mother Age variables.
  10. Data Preparation: Each subject (US State) needs to be represented

    as its own unique row. One row is one state.
  11. Data Preparation: 1. Group the data by US State. 2. Summarize: compute the mean

    of Mother Age and the mean of Father Age.
  12. 1. Group By: Group the data by US State. From the column menu

    of the “state” column (in either the Table view or the Summary view), select “Group By”.
  13. 2.1. Summarize for Mean of Mother Age: Select “Summarize (Aggregate)”, then select “mean (Average)” from the column

    header menu for Mother Age.
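The data preparation described above (group by state, then summarize with the mean) can be sketched outside the UI as well. The following is a minimal illustration in plain Python, not what Exploratory runs internally. The `births` records and the `summarize_by_state` helper are hypothetical; only the output column names `mother_age_mean` and `father_age_mean` come from the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: one row per birth record, several rows per state.
births = [
    {"state": "CA", "mother_age": 31, "father_age": 33},
    {"state": "CA", "mother_age": 29, "father_age": 31},
    {"state": "TX", "mother_age": 26, "father_age": 29},
    {"state": "TX", "mother_age": 28, "father_age": 31},
    {"state": "NY", "mother_age": 32, "father_age": 35},
]


def summarize_by_state(rows):
    """Group rows by state and return one row per state with mean ages."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["state"]].append(row)
    return [
        {
            "state": state,
            "mother_age_mean": mean(r["mother_age"] for r in members),
            "father_age_mean": mean(r["father_age"] for r in members),
        }
        for state, members in sorted(groups.items())
    ]


state_summary = summarize_by_state(births)
for row in state_summary:
    print(row)
```

After this step each state appears exactly once, which is the shape K-Means needs: one row per subject to be clustered.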
  15. 1. Set the centers of the clusters randomly. 2. Assign each dot

    to the nearest cluster center. 3. Move each cluster center to the average of its members. 4. Repeat steps 2 and 3 until the centers no longer move.
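The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Exploratory's implementation: the function names, the choice of random data points as starting centers, and the sample values are all hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Step 2: give each point the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
        for p in points
    ]


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers (an assumption;
    # other random initializations are possible).
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        # Step 3: move each center to the average of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Step 4: stop once no center moves.
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers


# Hypothetical (mother_age_mean, father_age_mean) pairs for six states.
points = [
    (25.0, 27.0), (25.5, 27.5), (26.0, 28.0),
    (30.0, 32.0), (30.5, 32.5), (31.0, 33.0),
]
labels, centers = k_means(points, k=2)
print(labels)
print(centers)
```

Because the starting centers are random, the cluster IDs themselves (which group is 0 and which is 1) can differ between runs even when the grouping is the same.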
  26. Run K-Means Clustering • Select mother_age_mean and father_age_mean with Control (Command) + Click. •

    Click the column menu, select “Analytics”, select “K-Means”, and select “Selected Columns”.
  27. Now the US States are clustered (grouped) based on the

    mother and father ages. Each US State now has a cluster ID.
  28. Map

  29. Add Cigarette Use Ratio: Let’s add another variable, “Cigarette Use Ratio”, as one of

    the clustering variables.
  30. Create Cigarette Use Ratio Column: 1. Create the Cigarette Use Ratio column. 2. Add the Cigarette Use

    Ratio column to K-Means and re-run.
  31. Add cigarette_use_ratio column and re-run K-Means: Select the ‘K-Means’ step and click the token to add the cigarette_use_ratio

    column.
  32. Opening the chart again will automatically refresh and use the

    new clusters. But, it looks the same as before… why?
  33. K-Means Algorithm • It calculates the distances between the centers

    of the clusters and the members, so variables with larger values tend to have a bigger influence on how the data get clustered. • Mother Age and Father Age values are in a similar range, while cigarette_use_ratio values are much smaller than the ages.
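To make the scale effect concrete, the sketch below measures how much of the squared distance between two states comes from the cigarette ratio, before and after putting every column on a common scale. The data values and helper names are hypothetical, and z-score standardization (subtract the mean, divide by the standard deviation) is an assumption about what "normalizing" means here.

```python
from statistics import mean, pstdev

# Hypothetical (mother_age_mean, father_age_mean, cigarette_use_ratio) rows.
rows = [
    (28.0, 30.0, 0.05),  # state A
    (28.5, 30.5, 0.30),  # state B: similar ages, much higher cigarette ratio
    (29.5, 31.5, 0.06),  # state C
]


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]


def share_of_last_variable(p, q):
    """Fraction of the squared distance between p and q due to the last column."""
    squared = [(a - b) ** 2 for a, b in zip(p, q)]
    return squared[-1] / sum(squared)


scaled = standardize(rows)
raw_share = share_of_last_variable(rows[0], rows[1])
scaled_share = share_of_last_variable(scaled[0], scaled[1])
print(f"cigarette share of A-B distance, raw:    {raw_share:.2f}")
print(f"cigarette share of A-B distance, scaled: {scaled_share:.2f}")
```

On the raw values the large difference in cigarette ratio between A and B contributes only a small part of their distance; after standardization it accounts for most of it, which is why clustering on unscaled columns can look unchanged when a small-valued variable is added.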
  34. Clusters 2 and 3 cannot be separated by

    Mother and Father ages alone. This means Cigarette Use Ratio makes the difference.
  35. K-Means with Analytics View • It builds a K-Means Clustering

    Model and generates a set of predefined charts to help you understand the characteristics of the clusters. • It normalizes the data before building the model.
  36. • Each dot represents a US State. • The X and

    Y axes are artificially created dimensions that express as much of the variance of the original measures (Mother Age, Father Age, Cigarette Use Ratio) as possible.
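One common way to build two such artificial dimensions is principal component analysis (PCA). The slides do not name the method, so treating it as PCA is an assumption, and the sketch below is only an illustration in plain Python: the data are hypothetical, and the eigenvectors are found with simple power iteration rather than a numerical library.

```python
import math
from statistics import mean, pstdev

# Hypothetical (mother_age_mean, father_age_mean, cigarette_use_ratio) rows.
rows = [
    (30.5, 32.8, 0.04), (30.0, 32.5, 0.06), (27.0, 29.5, 0.07),
    (27.3, 29.9, 0.05), (26.8, 29.4, 0.15), (27.1, 29.6, 0.17),
]


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(r, means, sds)) for r in rows]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def correlation_matrix(z_rows):
    """Covariance of standardized columns, which is the correlation matrix."""
    n, d = len(z_rows), len(z_rows[0])
    return [[sum(r[i] * r[j] for r in z_rows) / n for j in range(d)] for i in range(d)]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: repeatedly multiply and renormalize a vector."""
    v = [1.0 / math.sqrt(len(matrix))] * len(matrix)
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return dot(v, [dot(row, v) for row in matrix]), v


def first_two_components(matrix):
    d = len(matrix)
    val1, vec1 = top_eigenpair(matrix)
    # Remove the first component so power iteration finds the second one.
    deflated = [
        [matrix[i][j] - val1 * vec1[i] * vec1[j] for j in range(d)] for i in range(d)
    ]
    val2, vec2 = top_eigenpair(deflated)
    return (val1, vec1), (val2, vec2)


z = standardize(rows)
(val1, vec1), (val2, vec2) = first_two_components(correlation_matrix(z))
projected = [(dot(r, vec1), dot(r, vec2)) for r in z]  # X and Y for each dot
explained = (val1 + val2) / len(vec1)  # share of total variance kept in 2-D
print(projected)
print(f"variance explained by the two axes: {explained:.3f}")
```

Each tuple in `projected` is the position of one state on the two new axes. The signs of the axes are arbitrary, so a plot may appear mirrored without changing its meaning.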
  37. • The original measures (Mother Age, Father Age, Cigarette Use

    Ratio) are shown as gray lines. • The dots closer to the gray lines have higher values for those measures. • Example: The blue cluster tends to have high values of Father Age and Mother Age.
  38. • Cluster 1: Father Age and Mother Age are higher.

    Cigarette Use Ratio is lower. • Cluster 2: Father Age, Mother Age, and Cigarette Use Ratio are lower. • Cluster 3: Father Age and Mother Age are lower. Cigarette Use Ratio is higher.
  39. Visualizing K-Means Clustering Results to Understand the Characteristics of Clusters Better

    Take a look at the following blog post for more details. (https://exploratory.io/note/kanaugust/Introduction-to-K-Means-Clustering-under-Analytics-View-bjW2EZc3Ge)
  40. The distances between the centers and their members are shown

    on the Y-axis. A point where the value drops significantly means that increasing the number of clusters up to that point captures the differences in the data in a meaningful way.
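The idea of watching for a large drop can be sketched by computing, for each candidate number of clusters, the total squared distance between the members and their cluster centers. This is an illustration under assumptions: the exact quantity Exploratory plots is not stated in the slides, the data are hypothetical, and because the dataset is tiny the sketch tries every possible set of starting centers instead of a single random one.

```python
from itertools import combinations


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        for p in points
    ]


def within_cluster_ss(points, initial_centers, max_iter=100):
    """Run K-Means from the given centers and return the total squared distance."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    return sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))


def best_wss(points, k):
    """Best result over every choice of k data points as starting centers."""
    return min(within_cluster_ss(points, init) for init in combinations(points, k))


# Hypothetical (mother_age_mean, father_age_mean) pairs forming three groups.
points = [
    (25.0, 27.0), (25.4, 27.2), (25.2, 27.6),
    (28.0, 30.5), (28.4, 30.1), (28.2, 30.6),
    (31.0, 33.0), (31.3, 33.4), (30.8, 33.5),
]
wss = {k: best_wss(points, k) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 2))
```

With this data the value keeps falling sharply until three clusters and barely changes after that, so three is the natural choice. On real data the bend is usually less clean and the choice remains a judgment call.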
  41. Upcoming seminars: January 8th (Tuesday), 2019 • Data Wrangling: Working with Date

    / Time Data and Visualizing It. Planned • Analytics 101 - When to use which algorithms? • Data Wrangling: Introduction to Regular Expressions. https://exploratory.io/online-seminar