
Machine Learning 101: Clustering


Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products?

Join us for this special series “Oracle Machine Learning Office Hours – Machine Learning 101”, where we will go through the main steps of solving a Business Problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python.

Our fifth session in the series covered Clustering 101, where we learned the terminology around Clustering or Segmentation, how to get the data prepared for clustering, how to measure cluster separation, and how to identify potential pitfalls and use cases.

Marcos Arancibia

September 29, 2020

Transcript

  1. With Marcos Arancibia, Product Manager, Data Science and Big Data

    @MarcosArancibia Mark Hornick, Senior Director, Product Management, Data Science and Machine Learning @MarkHornick oracle.com/machine-learning Oracle Machine Learning Office Hours Machine Learning 101 - Clustering Copyright © 2020, Oracle and/or its affiliates. All rights reserved
  2. Today’s Agenda Upcoming session Speaker Marcos Arancibia – Machine Learning

    101: Clustering Q&A Copyright © 2020 Oracle and/or its affiliates.
  3. Next Session November 5, 2020: Oracle Machine Learning Office Hours, 8AM

    US Pacific Machine Learning 102 – Clustering Have you always been curious about what machine learning can do for your business problem, but could never find the time to learn the necessary practical skills? Do you wish to learn what Classification, Regression, Clustering and Feature Extraction techniques do, and how to apply them using the Oracle Machine Learning family of products? Join us for this special series “Oracle Machine Learning Office Hours – Machine Learning 101”, where we will go through the main steps of solving a Business Problem from beginning to end, using the different components available in Oracle Machine Learning: programming languages and interfaces, including Notebooks with SQL, UI, and languages like R and Python. This sixth session in the series will cover Clustering 102, where we will learn more about clustering methods on multiple dimensions, how to compare Clustering techniques, and explore Dimensionality Reduction and how to extract only the most meaningful attributes from datasets with lots of attributes (or derived attributes). Copyright © 2020, Oracle and/or its affiliates. All rights reserved
  4. Today’s Session: Machine Learning 101 - Clustering In this "ML

    Clustering 101" we will learn the terminology around Clustering or Segmentation, how to get the data prepared for clustering, understand Clustering techniques, how to measure cluster separation, identify potential pitfalls and see use cases. We will also learn about methods for determining the ideal number of clusters. Copyright © 2020, Oracle and/or its affiliates. All rights reserved
  5. • What is machine learning? (Watch the Session from May

    19, 2020) • What is Clustering? • Clustering algorithms and methods • Types of data needed for Clustering and terminology • Workflow and data preparation • Data for Clustering used in the example • Clustering Model Intuition for k-Means • Model evaluation • Q&A Agenda Copyright © 2020 Oracle and/or its affiliates 7
  6. Clustering is a subcategory of Unsupervised Learning (Machine Learning with

    an unknown past outcome). The notion of a "cluster" cannot be precisely defined (and there are many types of Clustering), but the commonality between them all is the goal of identifying a grouping of the data points. Examples of Clustering: • Cluster analysis is widely used in market research from surveys and test panels. Researchers use cluster analysis to partition the general population of consumers into segments and to better understand the behavior of and differences between groups of consumers/prospects. • Anomalies and outliers are typically defined with respect to clusters in the data: points that either belong to a suspicious cluster or that are far from the centroids of their respective clusters. • In crime analysis, clustering can help identify areas of the city where there is a greater incidence of particular types of crime (hot-spot zones), and help prioritize law enforcement resources more effectively. • In Social Network Analysis, clustering may be used to recognize communities within large groups of people. Some of these models can also potentially assist in fraud models (identifying ring leaders and connections) and churn models (identifying potential offers that can make a customer stay), usually in combination with Classification models. Machine learning methods that can be applied to a wide range of business problems What is Clustering? Copyright © 2020 Oracle and/or its affiliates.
  7. Copyright © 2020 Oracle and/or its affiliates. There are several

    methods for Clustering that depend on different models, and the ones we are going to talk about in this session are the following: Distance-based, or Centroid models • Probably the best-known algorithm in this family is the k-Means algorithm. This type of clustering uses a distance metric to determine the similarity between data objects. The distance metric measures the distance between the actual cases in a cluster and the prototypical case for that cluster. The prototypical case is known as the centroid. Grid-based, or Connectivity models • Connectivity models work by creating a tree that connects all data points at the top, and are known as Hierarchical Clustering models, based on distance connectivity. Grid-based models divide the input space into hyper-rectangular cells and identify adjacent high-density cells to form clusters. Density-based, or Distribution models • For these models, clusters are modeled using statistical distributions. One of the most common approaches uses multivariate normal (Gaussian) distributions to identify segments (known as Gaussian Mixture Models), and the corresponding algorithm is known as Expectation-Maximization. Algorithms and Methods What is Clustering?
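
To make the three families concrete, here is a minimal sketch using scikit-learn analogues (KMeans, AgglomerativeClustering, GaussianMixture) on synthetic data; these are illustrative open-source counterparts, not the Oracle Machine Learning implementations discussed in this deck.

```python
# Illustrative scikit-learn analogues of the three clustering families.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

centroid_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # distance/centroid-based
hierarchy_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)            # connectivity/hierarchical
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)        # distribution-based (EM)

print(np.bincount(centroid_labels), np.bincount(hierarchy_labels), np.bincount(gmm_labels))
```
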
  8. 10 • The term "k-means" was first used by James

    MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1956. • The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in 1957 as a technique for pulse-code modulation, though it wasn't published as a journal article until 1982. In 1965, Edward W. Forgy published essentially the same method, which is why it is sometimes referred to as Lloyd-Forgy. • The objective of the algorithm is to partition the data into k clusters in which each record belongs to the cluster with the nearest mean (the cluster center, or centroid), which serves as a prototype of the cluster. • The result is a partitioning of the data space into Voronoi cells, and the objective is to minimize the within-cluster variances (squared Euclidean distances). A Centroid-based, Distance algorithm K-Means clustering https://commons.wikimedia.org/w/index.php?curid=59409335 By Chire - Own work, CC BY-SA 4.0
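
For reference, the objective just described can be written in its standard textbook form (not tied to any particular implementation): partition the data into clusters S_1, …, S_k so that the total squared distance from each point to its cluster mean is minimal.

```latex
% Standard k-Means objective: minimize total within-cluster squared Euclidean distance
\arg\min_{S_1,\dots,S_k} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 ,
\qquad \mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```
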
  9. 11 • Patented by Oracle in 2002, the O-Cluster algorithm

    combines an active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. • The method operates on a limited memory buffer and requires at most a single scan through the data, making it very efficient and lightweight compared to other traditional partitioning methods. Campos, M.M., Milenova, B.L., "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, 10 Van De Graaff Drive, Burlington, MA 01803 A proprietary Grid-based partitioning algorithm O-Cluster – Orthogonal Partitioning https://www.oracle.com/technetwork/testcontent/o-cluster-algorithm-1-133481.pdf O-Cluster partitions on the DS3 data set. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging to an O-Cluster partition; recall = 71%, precision = 97%.
  10. 12 • EM is an iterative method that starts with

    an initial parameter guess, which is used to compute the likelihood of the current model (the Expectation step); the parameter values are then recomputed to maximize the likelihood (the Maximization step), and the process iterates until the model converges. • Due to its probabilistic nature, density-based clustering can compute reliable probabilities for cluster assignment. It can also handle missing values automatically. • The Oracle Machine Learning implementation includes significant enhancements, such as scalable processing of large volumes of data, automatic parameter initialization, and automatic determination of the number of EM components. A density-based clustering algorithm Expectation-Maximization The tendency of k-means to produce equal-sized clusters leads to bad results on the example at the top right, while EM (to the right) benefits from the Gaussian distributions with different radii present in the data set. https://commons.wikimedia.org/w/index.php?curid=11765684 By Chire - Own work, Public Domain
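
A minimal illustration of EM-style clustering with a Gaussian Mixture Model, using scikit-learn rather than the OML implementation; it shows the probabilistic cluster assignment mentioned above.

```python
# Illustrative EM clustering via a Gaussian Mixture Model (scikit-learn),
# not the Oracle Machine Learning EM implementation.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three blobs with very different spreads, a case where plain k-Means tends to struggle
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.5, 1.5, 3.0], random_state=1)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=1).fit(X)
labels = gmm.predict(X)         # hard cluster assignment
probs = gmm.predict_proba(X)    # probabilistic assignment: each row sums to 1
print(probs[:3].round(3))
```
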
  11. To identify how similar the flights are, based on all

    information about the flight: "Garbage in, garbage out" is especially true for ML, but so is having the right data. What type of data is needed for a Clustering (Unsupervised) problem? Copyright © 2020 Oracle and/or its affiliates. Historical data with known outcomes:

| ID   | CARR | DEPDELAY  | ORIGIN | DEST | … | DISTANCE | ARRDELAY |
|------|------|-----------|--------|------|---|----------|----------|
| 633  | AA   | -1 mins   | SJU    | DFW  | … | 2,195    | 0 mins   |
| 1184 | UA   | 18.5 mins | JAX    | IAD  | … | 631      | 20 mins  |
| 86   | NW   | 8 mins    | HNL    | SEA  | … | 2,677    | 15 mins  |
| …    | …    | …         | …      | …    | … | …        | …        |
  12. To identify how similar the flights are, based on all

    information about the flight: "Garbage in, garbage out" is especially true for ML, but so is having the right data. What type of data is needed for a Clustering (Unsupervised) problem? Copyright © 2020 Oracle and/or its affiliates. New data with unknown outcomes – predict the cluster assignment:

| ID   | CARR | DEPDELAY  | ORIGIN | DEST | … | DISTANCE | ARRDELAY | PREDICTED CLUSTER |
|------|------|-----------|--------|------|---|----------|----------|-------------------|
| 345  | AA   | -1 mins   | SJU    | DFW  | … | 2,195    | 0 mins   | 1                 |
| 1235 | UA   | 18.5 mins | JAX    | IAD  | … | 631      | 20 mins  | 3                 |
| 342  | NW   | 8 mins    | HNL    | SEA  | … | 2,677    | 15 mins  | 2                 |
| …    | …    | …         | …      | …    | … | …        | …        | …                 |

Historical data with known outcomes:

| ID   | CARR | DEPDELAY  | ORIGIN | DEST | … | DISTANCE | ARRDELAY |
|------|------|-----------|--------|------|---|----------|----------|
| 633  | AA   | -1 mins   | SJU    | DFW  | … | 2,195    | 0 mins   |
| 1184 | UA   | 18.5 mins | JAX    | IAD  | … | 631      | 20 mins  |
| 86   | NW   | 8 mins    | HNL    | SEA  | … | 2,677    | 15 mins  |
| …    | …    | …         | …      | …    | … | …        | …        |
  13. Several names are used for the same components, depending on

    the field of study Machine Learning terminology for Clustering Copyright © 2020 Oracle and/or its affiliates. Historical data with known outcomes.
Table Row • Record • Case • Instance • Example
Table Columns • Variable • Attribute • Field • Predictor
Table Column • Case ID • Unique ID
Data • Database Table or View • Data set (or dataset) • Training data – to build a model • Test data – to evaluate a model

| ID   | CARR | DEPDELAY  | ORIGIN | DEST | … | DISTANCE | ARRDELAY |
|------|------|-----------|--------|------|---|----------|----------|
| 345  | AA   | -1 mins   | SJU    | DFW  | … | 2,195    | 0 mins   |
| 1235 | UA   | 18.5 mins | JAX    | IAD  | … | 631      | 20 mins  |
| 342  | NW   | 8 mins    | HNL    | SEA  | … | 2,677    | 15 mins  |
| …    | …    | …         | …      | …    | … | …        | …        |
  14. Several names are used for the same components, depending on

    the field of study Machine Learning terminology for Clustering Copyright © 2020 Oracle and/or its affiliates. New data with unknown outcomes – predict the cluster/segment.
Table Column • Predicted Cluster/Segment
Data • Database Table or View • Scoring data – for predictions

| ID   | CARR | DEPDELAY  | ORIGIN | DEST | … | DISTANCE | ARRDELAY | PREDICTED CLUSTER |
|------|------|-----------|--------|------|---|----------|----------|-------------------|
| 345  | AA   | -1 mins   | SJU    | DFW  | … | 2,195    | 0 mins   | 1                 |
| 1235 | UA   | 18.5 mins | JAX    | IAD  | … | 631      | 20 mins  | 3                 |
| 342  | NW   | 8 mins    | HNL    | SEA  | … | 2,677    | 15 mins  | 2                 |
| …    | …    | …         | …      | …    | … | …        | …        | …                 |
  15. Copyright © 2020, Oracle and/or its affiliates | Confidential: Internal

    17 Data is usually processed as one set (Train) • Because Clustering (when Unsupervised) assumes that you need to be able to build (train) the model on the entire set of data, the process uses the whole dataset for Training and for finding the Cluster Centroids, the proper Grids, or the Probability Distributions that make up the Cluster definitions • OML EM is a bit different: it uses a held-aside sample for automatically searching for the ideal number of Clusters. Clustering workflow Build Model Score the Test Data Builds the Structure on how to Segment the data Compute Cluster Statistics Predicted Cluster # Models ideally check for records with high similarity within a cluster, and low similarity between different clusters Train Data Assigned Cluster # Basic Statistics: • Within* Sum of Squares • Between* Sum of Squares • Total Sum of Squares * we will see definitions later Assigns the most likely Cluster
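
As a rough sketch of this workflow in generic Python (scikit-learn, not the OML tooling): build the model on the full training set, inspect the basic statistics, then assign clusters to new records.

```python
# Minimal clustering workflow sketch: build on all training data, then score new data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 2))      # stand-in for the prepared training data
new_data = rng.normal(size=(5, 2))      # stand-in for records to be scored later

scaler = StandardScaler().fit(train)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(train))

print("Assigned cluster for first training records:", model.labels_[:10])
print("Total within-cluster sum of squares:", round(model.inertia_, 1))
print("Predicted cluster for new records:", model.predict(scaler.transform(new_data)))
```
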
  16. Copyright © 2020, Oracle and/or its affiliates | Confidential: Internal

    18 Most clustering algorithms will require • Data Transformation • Standardization/Normalization of values • Missing value Imputation For example, what can be derived from a single date? Data preparation 05/19/2020 Basic Information • 139 days since 1st Jan 2020 • Tuesday • Third day of the week • Second day of the workweek • Sunrise was at 6:32AM in Miami • Sun will set at 8:02PM in Miami • It's an overcast day in Miami • There were Flood Warnings in Miami Domain Knowledge • Has been a customer for 3.5 years • Machine has been operating for 564 days • Customer increased spending in the last 3 months • Revenue last month declined vs. Avg of previous 3 months • Customer usage has declined 30% since the last offer • 6 months since last Contact
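
A small sketch of deriving the "basic information" style of date features in plain Python (illustrative only; the domain-knowledge features would come from joining other business data).

```python
# Derive simple calendar features from a single date field.
from datetime import date

d = date(2020, 5, 19)
features = {
    "days_since_jan1": (d - date(2020, 1, 1)).days,  # 139 in the leap year 2020
    "day_of_week": d.strftime("%A"),                 # 'Tuesday'
    "day_of_workweek": d.isoweekday(),               # Monday=1 ... Sunday=7, so Tuesday=2
    "is_weekend": d.isoweekday() >= 6,
    "month": d.month,
    "quarter": (d.month - 1) // 3 + 1,
}
print(features)
```
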
  17. Copyright © 2020, Oracle and/or its affiliates | Confidential: Internal

    19 OML includes Automatic Data Preparation (ADP) K-Means • ADP performs normalization for k-Means. If you do not use ADP, you must normalize numeric attributes before creating or applying the model. k-Means interprets missing values as missing at random. The algorithm replaces missing categorical values with the mode and missing numerical values with the mean. O-Cluster • ADP bins numerical attributes for O-Cluster. It uses a specialized form of equi-width binning that computes the number of bins per attribute automatically. Numerical columns with all nulls or a single value are removed. O-Cluster handles missing values naturally as missing at random. Expectation-Maximization (EM) • ADP normalizes numerical attributes (in non-nested columns) when they are modeled with Gaussian distributions. ADP applies a topN binning transformation to categorical attributes. • Missing values are handled automatically. The EM algorithm replaces missing values with the mean in single-column numerical attributes that are modeled with Gaussian distributions. In other single-column attributes (categoricals and numericals modeled with Bernoulli distributions), NULLs are not replaced; they are treated as a distinct value with its own frequency count. In nested columns, missing values are treated as zeros. Data preparation
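
For readers outside the Oracle stack, the kind of preparation ADP automates for k-Means can be approximated manually; a hedged scikit-learn sketch (not the ADP implementation) of mean imputation plus normalization looks like this:

```python
# Manual stand-in for ADP-style preparation before k-Means:
# impute missing numeric values with the mean, then normalize, then cluster.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

X = np.array([[2195.0, -1.0],
              [631.0, 18.5],
              [2677.0, np.nan],   # a missing delay value
              [1250.0, 8.0]])

prep_and_cluster = make_pipeline(
    SimpleImputer(strategy="mean"),
    MinMaxScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
print(prep_and_cluster.fit_predict(X))
```
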
  18. Copyright © 2020, Oracle and/or its affiliates 20 Arrival and

    Departure Delays of Domestic Flights • The ONTIME dataset contains scheduled and actual departure and arrival times, reasons for delay, and other measurements reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. • The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). • A version of this dataset that became famous consists of flight arrival and departure details from October 1987 to April 2008 (about 123 million records), presented at the American Statistical Association Data Expo in 2009; the data is still hosted at http://stat-computing.org/dataexpo/2009/ • Newer data can be found on the BTS site at https://www.bts.gov/browse-statistical-products-and-data/bts-publications/airline-service-quality-performance-234-time Data that will be used for the current Session Data for Clustering used in the example
  19. Copyright © 2020, Oracle and/or its affiliates 21 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance]
  20. Copyright © 2020, Oracle and/or its affiliates 22 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] Step 1 – Select k points in the space defined by the input attributes, at random. Let's assume k=2
  21. Copyright © 2020, Oracle and/or its affiliates 23 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] Step 1 – Select k points in the space defined by the input attributes, at random. Let's assume k=2 Step 2 – Assign each record to the group with the closest point
  22. Copyright © 2020, Oracle and/or its affiliates 24 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] Step 1 – Select k points in the space defined by the input attributes, at random. Let's assume k=2 Step 2 – Assign each record to the group with the closest point Step 3 – Move the point to the center of the "Cloud" of points, known as the Centroid of the Cluster
  23. Copyright © 2020, Oracle and/or its affiliates 25 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] Step 1 – Select k points in the space defined by the input attributes, at random. Let's assume k=2 Step 2 – Assign each record to the group with the closest point Step 3 – Move the point to the center of the "Cloud" of points, known as the Centroid of the Cluster Step 4 – Repeat Steps 2 and 3 until the Centroids "stop moving" (or reach a threshold)
  24. Copyright © 2020, Oracle and/or its affiliates 26 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] Step 1 – Select k points in the space defined by the input attributes, at random. Let's assume k=2 Step 2 – Assign each record to the group with the closest point Step 3 – Move the point to the center of the "Cloud" of points, known as the Centroid of the Cluster Step 4 – Repeat Steps 2 and 3 until the Centroids "stop moving" (or reach a threshold)
  25. Copyright © 2020, Oracle and/or its affiliates 27 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] Step 1 – Select k points in the space defined by the input attributes, at random. Let's assume k=2 Step 2 – Assign each record to the group with the closest point Step 3 – Move the point to the center of the "Cloud" of points, known as the Centroid of the Cluster Step 4 – Repeat Steps 2 and 3 until the Centroids "stop moving" (or reach a threshold)
  26. Copyright © 2020, Oracle and/or its affiliates 28 The basic

    Objective of clustering models is to try to find natural groupings of data points in space. As a basic example, assume we are looking at a single Airport (Chicago), and we want to look at the Departure Delay of flights related to the Distance to the destination of those flights. For example: Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] • From this point on, we can see that the movement almost stops, making the position of the 2 Centroids the solution to this k-Means Clustering
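
The iteration walked through in Steps 1–4 can be written compactly in plain NumPy; this is an illustrative sketch of the textbook algorithm, not the OML implementation, and it assumes the attributes have already been normalized.

```python
# Textbook k-Means (Lloyd's algorithm): random start, assign each point to the
# nearest centroid, move centroids to the mean of their points, repeat until stable.
import numpy as np

def kmeans(X, k=2, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]            # Step 1: pick k random points
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                                    # Step 2: nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])  # Step 3: move to cluster mean
        if np.linalg.norm(new_centroids - centroids) < tol:             # Step 4: stop when stable
            break
        centroids = new_centroids
    return centroids, labels

# Toy, already-scaled "departure delay" vs. "flight distance" values
X = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.7, 0.1], [0.15, 0.95], [0.75, 0.25]])
centroids, labels = kmeans(X, k=2)
print(centroids, labels)
```
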
  27. Copyright © 2020, Oracle and/or its affiliates 29 Any new

    data point would be assigned to a Cluster by measuring its distance to the different Centroids and assigning it the Cluster of the closest one. Clustering Model Intuition for k-Means [Scatter plot: Departure Delay of Flight vs. Flight Distance] • Because the new data point is closest to the Orange cluster, it is assigned as such
  28. Copyright © 2020, Oracle and/or its affiliates 30 The

    iterations of the k-Means process try to minimize the distance between each Centroid and the points assigned to it, using Sum-of-Squares distances. The Within-Cluster Sum of Squares is what measures the distance between the points in each Cluster and their Centroid. Objective of a k-Means cluster [Scatter plot: Departure Delay of Flight vs. Flight Distance] • The Within-Cluster Sum of Squares for each Cluster measures the "compactness" quality of that Cluster, so the lower the better • The Total Within-Cluster Sum of Squares is the grand sum over all Clusters
  29. Copyright © 2020, Oracle and/or its affiliates 31 The Between-Cluster

    Sum of Squares is what measures the distance between all Centroids. To compute this statistic, one calculates the squared Euclidean distance from each cluster Centroid to all other cluster Centroids. The sum of all these values is the Between-Cluster Sum of Squares. Objective of a k-Means cluster [Scatter plot: Departure Delay of Flight vs. Flight Distance] • The Between-Cluster Sum of Squares measures "how far" the Clusters are from each other (the variation between all clusters), so the larger the better; a larger value indicates clusters that are spread apart
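
A short sketch of computing both statistics with the definitions used on these slides (within-cluster sum of squares per cluster, and between-cluster sum of squares as the sum of squared distances between centroids); illustrative Python, not OML output.

```python
# Compute within-cluster and between-cluster sums of squares for a fitted k-Means model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids, labels = km.cluster_centers_, km.labels_

# Within-cluster sum of squares: lower = more compact clusters
wss = [((X[labels == j] - centroids[j]) ** 2).sum() for j in range(len(centroids))]
print("WSS per cluster:", [round(w, 1) for w in wss], "total:", round(sum(wss), 1))  # total matches km.inertia_

# Between-cluster sum of squares (as defined above): larger = clusters more spread apart
bss = sum(((centroids[i] - centroids[j]) ** 2).sum()
          for i in range(len(centroids)) for j in range(i + 1, len(centroids)))
print("BSS:", round(bss, 1))
```
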
  30. Copyright © 2020, Oracle and/or its affiliates 32 Identifying the

    ideal number of Clusters for k-Means • k-Means requires that the end user provide the number of Clusters k, so there are several methods available that try to give a (sometimes visual) way to estimate k. • Traditional techniques include the Elbow Method from the 1950s, the Silhouette Method from the 1980s, and the Gap Statistic from the 2000s. Initialization of the random seeds for the Centroids • Centroid initialization can be crucial for a good distribution of clusters in k-Means, so many techniques have been created over the years to improve the initial selection and avoid over- or under-fitting. • Very frequently an initialization known as k-Means++ is used to improve the quality of the seeds for the Centroids. • Because k-Means++ requires k passes through the data, it does not scale well to larger datasets, and a method known as k-Means|| (or Scalable k-Means++) was created in 2012 to handle that. • OML Enhanced k-Means uses a scalable parallel model build based on k-Means|| k-Means properties
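
A hedged sketch of estimating k with the Elbow Method and the Silhouette Method, using scikit-learn's built-in k-means++ initialization (illustrative; OML's enhanced k-Means handles initialization and scaling differently):

```python
# Try several values of k and report the total within-cluster sum of squares (elbow)
# and the average silhouette score for each.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(f"k={k}  total WSS={km.inertia_:9.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
# Elbow Method: pick the k where the WSS curve stops dropping sharply.
# Silhouette Method: pick the k with the highest average silhouette score.
```
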
  31. Copyright © 2020, Oracle and/or its affiliates 33 The main

    characteristics of the three algorithms are compared in the following table. Features of the Oracle Machine Learning clustering algorithms:

| Feature | k-Means | O-Cluster | Expectation Maximization |
|---|---|---|---|
| Clustering methodology | Distance-based | Grid-based | Distribution-based |
| Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases. Handles large tables through active sampling | Handles data sets of any size |
| Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes | Appropriate for data sets with many or few attributes |
| Number of clusters | User-specified | Automatically determined | Automatically determined |
| Hierarchical clustering | Yes | Yes | Yes |
| Probabilistic cluster assignment | Yes | Yes | Yes |