Slide 1

SER 594 Software Engineering for Machine Learning
Lecture 15: Clustering II
Dr. Javier Gonzalez-Sanchez
[email protected]
javiergs.engineering.asu.edu | javiergs.com
PERALTA 230U | Office Hours: By appointment

Slide 2

Previously … Unsupervised Learning

Slide 3

Javier Gonzalez-Sanchez | SER 594 | Spring 2022
Machine Learning

Slide 4

Algorithm: K-Means

Slide 5

Code: K-Means
https://github.com/javiergs/Medium/tree/main/Clustering

Slide 6

Weka GUI

Slide 7

Weka API Code

Slide 8

Weka API: weka.jar

Slide 9

Weka: SimpleKMeans
https://github.com/javiergs/Medium/blob/main/Clustering/Weka/WekaKMeans.java

Slide 10

PS.

Slide 11

ARFF
§ An ARFF (Attribute-Relation File Format) file is a text file that describes a list of instances sharing a set of attributes.
§ ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software.

Slide 12

ARFF Examples

@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
...
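The ARFF layout above (a @RELATION line, one @ATTRIBUTE line per column, then @DATA rows) is simple enough to generate directly. Below is a minimal sketch that emits such a file for numeric attributes only; the class and method names are illustrative, not part of Weka.

```java
import java.util.List;

/** Minimal sketch that emits ARFF text like the iris example above:
 *  a @RELATION line, one @ATTRIBUTE line per numeric column, then @DATA rows. */
public class ArffWriter {

  static String toArff(String relation, String[] attrs, List<double[]> rows) {
    StringBuilder sb = new StringBuilder();
    sb.append("@RELATION ").append(relation).append('\n');
    for (String a : attrs)                       // header: one attribute per column
      sb.append("@ATTRIBUTE ").append(a).append(" NUMERIC\n");
    sb.append("@DATA\n");
    for (double[] row : rows) {                  // data: comma-separated values
      for (int i = 0; i < row.length; i++)
        sb.append(i == 0 ? "" : ",").append(row[i]);
      sb.append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String arff = toArff("iris", new String[]{"sepallength", "sepalwidth"},
        List.of(new double[]{5.1, 3.5}, new double[]{4.9, 3.0}));
    System.out.print(arff);
  }
}
```

Nominal attributes (such as the `class` attribute in the iris example) would instead list their values in braces, e.g. `@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}`.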

Slide 13

ARFF Examples
https://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html

Slide 14

Results

Slide 15

Algorithms

Slide 16

K-Means Problems
§ K-Means may cluster loosely related observations together: every observation eventually becomes part of some cluster, even observations that are scattered far apart in the vector space.
§ Clusters depend on the mean value of their elements, so every data point influences the result; a slight change in the data points might change the clustering outcome.
§ K-Means also requires the number of clusters ("k") to be specified up front, and much of the time a reasonable value of k is not known a priori.

Slide 17

Algorithms
§ K-Means - based on the distance between points and cluster centroids; minimizes the square-error criterion.
§ DBSCAN (Density-Based Spatial Clustering of Applications with Noise) - based on the distance between nearest points (density).
§ Simple EM (Expectation Maximization) - computes the likelihood (probability) of each observation belonging to each cluster; maximizes the log-likelihood criterion.
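To make the square-error criterion concrete, here is a minimal from-scratch sketch of K-Means (Lloyd's algorithm) on 1-D data: it alternates assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, which locally minimizes the square error. This is an illustration, not Weka's SimpleKMeans implementation; class names and the fixed iteration count are assumptions for brevity.

```java
import java.util.Arrays;

/** Minimal 1-D K-Means sketch: alternate assignment and centroid-update
 *  steps; each update locally reduces the square-error criterion. */
public class KMeans1D {

  static int[] cluster(double[] points, double[] centroids, int iterations) {
    int[] assign = new int[points.length];
    for (int it = 0; it < iterations; it++) {
      // Assignment step: each point goes to its nearest centroid.
      for (int i = 0; i < points.length; i++) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++)
          if (Math.abs(points[i] - centroids[c]) < Math.abs(points[i] - centroids[best]))
            best = c;
        assign[i] = best;
      }
      // Update step: each centroid becomes the mean of its assigned points.
      for (int c = 0; c < centroids.length; c++) {
        double sum = 0; int n = 0;
        for (int i = 0; i < points.length; i++)
          if (assign[i] == c) { sum += points[i]; n++; }
        if (n > 0) centroids[c] = sum / n;
      }
    }
    return assign;
  }

  public static void main(String[] args) {
    double[] points = {1.0, 1.2, 0.8, 9.0, 9.3, 8.7};
    double[] centroids = {0.0, 10.0};   // k = 2, initial guesses
    System.out.println(Arrays.toString(cluster(points, centroids, 10)));
  }
}
```

Note how the result depends on k and on the initial centroids, which is exactly the weakness listed on the previous slide.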

Slide 18

DBSCAN
§ The algorithm proceeds by arbitrarily picking a point in the dataset.
§ If there are at least N points within a radius E of that point, all of these points are considered part of the same cluster.
§ Repeat until all points have been visited.
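The three steps above can be sketched as a small from-scratch implementation on 1-D data. This is a simplified illustration of the idea, not the Apache Commons or Weka implementation: `eps` plays the role of the radius E, `minPts` the role of N, and points that never meet the density threshold are labeled noise (-1).

```java
import java.util.*;

/** Minimal DBSCAN-style sketch for 1-D points: pick an unvisited point;
 *  if at least minPts points (itself included) lie within radius eps,
 *  grow a cluster from it; otherwise mark it as noise (-1). */
public class SimpleDbscan {

  static int[] cluster(double[] pts, double eps, int minPts) {
    int[] label = new int[pts.length];      // 0 = unvisited
    int cluster = 0;
    for (int i = 0; i < pts.length; i++) {
      if (label[i] != 0) continue;
      List<Integer> seed = neighbors(pts, i, eps);
      if (seed.size() < minPts) { label[i] = -1; continue; }  // noise
      cluster++;
      label[i] = cluster;
      Deque<Integer> queue = new ArrayDeque<>(seed);
      while (!queue.isEmpty()) {            // expand the cluster outward
        int j = queue.pop();
        if (label[j] > 0) continue;         // already in a cluster
        label[j] = cluster;
        List<Integer> nb = neighbors(pts, j, eps);
        if (nb.size() >= minPts) queue.addAll(nb);  // j is a core point
      }
    }
    return label;
  }

  static List<Integer> neighbors(double[] pts, int i, double eps) {
    List<Integer> out = new ArrayList<>();
    for (int j = 0; j < pts.length; j++)
      if (Math.abs(pts[i] - pts[j]) <= eps) out.add(j);
    return out;
  }

  public static void main(String[] args) {
    double[] pts = {1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0};
    System.out.println(Arrays.toString(cluster(pts, 0.5, 3)));
  }
}
```

Unlike K-Means, no cluster count is given in advance, and the isolated point at 20.0 is left out as noise rather than forced into a cluster.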

Slide 19

Code: Apache Commons DBSCANClusterer.java
http://home.apache.org/~luc/commons-math-3.6-RC2-site/jacoco/org.apache.commons.math3.stat.clustering/DBSCANClusterer.java.html

Slide 20

Weka Code

Slide 21

Simple EM (Expectation Maximization)
EM can decide how many clusters to create by cross-validation, or you may specify a priori how many clusters to generate:
1. The number of clusters is set to 1.
2. EM assigns to each instance a probability distribution that indicates the probability of it belonging to each of the clusters.
3. The training set is split randomly into 10 folds.
4. EM is performed 10 times using the 10 folds.
5. The log-likelihood is averaged over all 10 results.
6. If the log-likelihood has increased, the number of clusters is increased by 1 and the program continues at step 2.
The number of folds is fixed at 10 as long as the training set contains at least 10 instances; otherwise, the number of folds is set equal to the number of instances.
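The core of step 2, the probability assignment, can be illustrated with a tiny from-scratch EM for a mixture of two 1-D Gaussians. This is a didactic sketch, not Weka's EM: the variance is held fixed at 1 for brevity, the class and method names are made up, and the cross-validated selection of the cluster count described above is omitted.

```java
/** Minimal 1-D EM sketch for two Gaussian components with fixed variance.
 *  E-step: compute each point's responsibility (probability of belonging
 *  to each component). M-step: re-estimate means and mixing weight. */
public class SimpleEM {

  static double gauss(double x, double mean, double var) {
    return Math.exp(-(x - mean) * (x - mean) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
  }

  static double[] fitMeans(double[] x, double m0, double m1, int iters) {
    double var = 1.0, w0 = 0.5;              // fixed variance, initial mixing weight
    for (int it = 0; it < iters; it++) {
      double r0Sum = 0, r0x = 0, r1x = 0;
      for (double xi : x) {
        // E-step: responsibility of component 0 for point xi.
        double p0 = w0 * gauss(xi, m0, var);
        double p1 = (1 - w0) * gauss(xi, m1, var);
        double r0 = p0 / (p0 + p1);
        r0Sum += r0; r0x += r0 * xi; r1x += (1 - r0) * xi;
      }
      // M-step: update means and mixing weight from the responsibilities.
      m0 = r0x / r0Sum;
      m1 = r1x / (x.length - r0Sum);
      w0 = r0Sum / x.length;
    }
    return new double[]{m0, m1};
  }

  public static void main(String[] args) {
    double[] data = {0.9, 1.0, 1.1, 7.9, 8.0, 8.1};
    double[] means = fitMeans(data, 0.0, 10.0, 25);
    System.out.printf("means: %.2f %.2f%n", means[0], means[1]);
  }
}
```

Unlike K-Means' hard assignments, each point here belongs to every cluster with some probability; the hard label, if needed, is simply the component with the highest responsibility.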

Slide 22

Simple EM (Expectation Maximization)
EM Demystified: An Expectation-Maximization Tutorial
https://vannevar.ece.uw.edu/techsite/papers/documents/UWEETR-2010-0002.pdf

Slide 23

Weka: EM

Slide 24

Evaluating Clustering

Slide 25

Evaluation
§ A clustering algorithm's quality can be estimated with the log-likelihood measure, which indicates how consistent the identified clusters are.
§ The dataset is split into multiple folds, and clustering is run on each fold. The motivation: if the clustering algorithm assigns a high probability to similar data that wasn't used to fit its parameters, it has probably done a good job of capturing the data structure.
§ Weka offers the ClusterEvaluation class to estimate it.

Slide 26

With a Dataset

import weka.clusterers.ClusterEvaluation;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Assumes a previously trained clusterer in `model`
ClusterEvaluation eval = new ClusterEvaluation();
DataSource src = new DataSource("file.arff");
Instances dt = src.getDataSet();
eval.setClusterer(model);
eval.evaluateClusterer(dt);
System.out.println(eval.clusterResultsToString());
System.out.println(eval.getLogLikelihood());

Slide 27

One More Thing

Slide 28

Questions

Slide 29

SER 594 Software Engineering for Machine Learning
Javier Gonzalez-Sanchez, Ph.D.
[email protected]
Spring 2022

Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University. They cannot be distributed or used for another purpose.