
JGS594 Lecture 15


Software Engineering for Machine Learning
Clustering II
(202203)

Javier Gonzalez-Sanchez

March 29, 2022

Transcript

  1. jgs
    SER 594
    Software Engineering for
    Machine Learning
    Lecture 15: Clustering II
    Dr. Javier Gonzalez-Sanchez
    [email protected]
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment

  2.
    Previously …
    Unsupervised Learning


  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
    Machine Learning

  4.
    Algorithm: K-Means

  5.
    Code: K-means
    https://github.com/javiergs/Medium/tree/main/Clustering

  6.
    Weka GUI

  7.
    Weka API
    Code

  8.
    Weka API
    weka.jar

  9.
    Weka: SimpleKMeans
    https://github.com/javiergs/Medium/blob/main/Clustering/Weka/WekaKMeans.java

  10.
    PS.

  11.
    ARFF
    § An ARFF (Attribute-Relation File Format) file is a text file that describes a
    list of instances sharing a common set of attributes.
    § The ARFF format was developed by the Machine Learning Project at the
    Department of Computer Science of the University of Waikato for use with
    the Weka machine learning software.

  12.
    ARFF Examples
    @RELATION iris
    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE sepalwidth NUMERIC
    @ATTRIBUTE petallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    4.6,3.4,1.4,0.3,Iris-setosa
    5.0,3.4,1.5,0.2,Iris-setosa
    4.4,2.9,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
    ...
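The structure of that file can be illustrated with a minimal hand-rolled parser (a hypothetical helper for this snippet only, not Weka's own ArffLoader): attribute names come from the `@ATTRIBUTE` lines, and every row after `@DATA` is a comma-separated instance in attribute order.

```java
import java.util.ArrayList;
import java.util.List;

public class ArffSketch {
    public static void main(String[] args) {
        // The iris snippet from the slide, abbreviated to two data rows
        String[] lines = {
            "@RELATION iris",
            "@ATTRIBUTE sepallength NUMERIC",
            "@ATTRIBUTE sepalwidth NUMERIC",
            "@ATTRIBUTE petallength NUMERIC",
            "@ATTRIBUTE petalwidth NUMERIC",
            "@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}",
            "@DATA",
            "5.1,3.5,1.4,0.2,Iris-setosa",
            "4.9,3.0,1.4,0.2,Iris-setosa"
        };
        List<String> attributes = new ArrayList<>();
        List<String[]> data = new ArrayList<>();
        boolean inData = false;
        for (String line : lines) {
            if (line.equalsIgnoreCase("@DATA")) { inData = true; continue; }
            if (!inData && line.toUpperCase().startsWith("@ATTRIBUTE")) {
                attributes.add(line.split("\\s+")[1]);   // second token is the attribute name
            } else if (inData) {
                data.add(line.split(","));               // values in attribute order
            }
        }
        System.out.println(attributes.size() + " attributes, " + data.size() + " instances");
        System.out.println("first sepallength = " + data.get(0)[0]);
    }
}
```

In practice Weka's loaders handle comments, sparse data, and quoting as well; this sketch only shows why the header and data sections are enough to reconstruct a typed table.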

  13.
    ARFF Examples
    https://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html

  14.
    Results

  15.
    Algorithms

  16.
    K-means Problems
    § K-Means may cluster loosely related observations together: every
    observation ends up in some cluster eventually, even observations that lie
    far away from everything else in the vector space.
    § Clusters depend on the mean value of their elements, so every data point
    influences the result; a slight change in the data points might change the
    clustering outcome.
    § Another challenge with K-Means is that you must specify the number of
    clusters ("k") up front, and much of the time we do not know a reasonable
    value of k a priori.
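One common workaround for the unknown-k problem is the "elbow" heuristic (an assumption here, not something the slides prescribe): run K-Means for several values of k and watch where the within-cluster sum of squared errors (SSE) stops improving. A minimal pure-Java sketch on toy 1-D data, with a tiny Lloyd's-algorithm implementation rather than Weka's SimpleKMeans:

```java
import java.util.Locale;

public class ElbowSketch {
    // One run of Lloyd's k-means on sorted 1-D data; returns the final SSE.
    static double kMeansSse(double[] x, int k) {
        double[] centers = new double[k];
        // Spread the initial centers across the (sorted) data range
        for (int j = 0; j < k; j++) centers[j] = x[j * (x.length - 1) / Math.max(1, k - 1)];
        int[] assign = new int[x.length];
        for (int iter = 0; iter < 50; iter++) {
            // Assignment step: each point goes to its nearest center
            for (int i = 0; i < x.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(x[i] - centers[j]) < Math.abs(x[i] - centers[best])) best = j;
                assign[i] = best;
            }
            // Update step: each center becomes the mean of its assigned points
            double[] sum = new double[k];
            int[] cnt = new int[k];
            for (int i = 0; i < x.length; i++) { sum[assign[i]] += x[i]; cnt[assign[i]]++; }
            for (int j = 0; j < k; j++) if (cnt[j] > 0) centers[j] = sum[j] / cnt[j];
        }
        double sse = 0;
        for (int i = 0; i < x.length; i++) sse += Math.pow(x[i] - centers[assign[i]], 2);
        return sse;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.1, 1.2, 9.0, 9.1, 9.2};  // two obvious groups
        for (int k = 1; k <= 3; k++)
            System.out.printf(Locale.US, "k=%d  SSE=%.3f%n", k, kMeansSse(x, k));
        // SSE drops sharply from k=1 to k=2, then barely improves: the elbow is at k=2
    }
}
```

The heuristic is informal: it suggests a k where adding more clusters yields diminishing returns, but it does not remove the other K-Means limitations listed above.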

  17.
    Algorithms
    § K-Means - based on the distance between points; minimizes the
    squared-error criterion.
    § DBSCAN (Density-Based Spatial Clustering of Applications with
    Noise) - based on the distance between nearest points (density).
    § Simple EM (Expectation Maximization) - estimates the probability
    (likelihood) that an observation belongs to each cluster; maximizes the
    log-likelihood criterion.

  18.
    DBSCAN
    § The algorithm proceeds by arbitrarily picking a point in the dataset.
    § If there are at least N points within a radius E of that point, all of
    those points are considered part of the same cluster.
    § Repeat until all points have been visited; points that never fall inside
    a dense neighborhood are labeled as noise.
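The loop above can be sketched in plain Java (a simplified 1-D illustration, not the Apache Commons or Weka implementation; E is the radius, usually called eps, and N the minimum neighborhood size, usually called minPts):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class DbscanSketch {
    static final int NOISE = -1, UNVISITED = 0;

    static int[] dbscan(double[] x, double eps, int minPts) {
        int[] label = new int[x.length];   // 0 = unvisited, -1 = noise, >0 = cluster id
        int cluster = 0;
        for (int i = 0; i < x.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> nbrs = neighbors(x, i, eps);
            if (nbrs.size() < minPts) { label[i] = NOISE; continue; }
            cluster++;                     // i is a core point: start a new cluster
            label[i] = cluster;
            Deque<Integer> frontier = new ArrayDeque<>(nbrs);
            while (!frontier.isEmpty()) {  // grow the cluster through density-reachable points
                int j = frontier.pop();
                if (label[j] == NOISE) label[j] = cluster;    // border point joins the cluster
                if (label[j] != UNVISITED) continue;
                label[j] = cluster;
                List<Integer> jn = neighbors(x, j, eps);
                if (jn.size() >= minPts) frontier.addAll(jn); // j is also core: expand further
            }
        }
        return label;
    }

    // All indices within eps of point i (includes i itself)
    static List<Integer> neighbors(double[] x, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < x.length; j++)
            if (Math.abs(x[i] - x[j]) <= eps) out.add(j);
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.1, 1.2, 5.0, 9.0, 9.1, 9.2};  // two dense groups, one isolated point
        System.out.println(Arrays.toString(dbscan(x, 0.3, 2)));
    }
}
```

Note the contrast with K-Means: the isolated point at 5.0 stays labeled as noise instead of being forced into the nearest cluster, and the number of clusters is discovered rather than specified.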

  19.
    Code
    Apache Commons
    DBSCANClusterer.java
    http://home.apache.org/~luc/commons-math-3.6-RC2-site/jacoco/org.apache.commons.math3.stat.clustering/DBSCANClusterer.java.html

  20.
    Weka Code

  21.
    Simple EM (Expectation Maximization)
    EM can decide how many clusters to create by cross-validation, or you may
    specify a priori how many clusters to generate.
    1. The number of clusters is set to 1.
    2. EM assigns to each instance a probability distribution that indicates
    the probability of it belonging to each of the clusters.
    3. The training set is split randomly into 10 folds.
    4. EM is performed 10 times, once per fold.
    5. The log-likelihood is averaged over all 10 results.
    6. If the log-likelihood has increased, the number of clusters is increased
    by 1 and the procedure continues at step 2.
    The number of folds is fixed at 10 as long as the training set has at least
    10 instances; otherwise, the number of folds is set equal to the number of
    instances.
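The E and M steps behind this procedure can be illustrated on a toy model (a two-component 1-D Gaussian mixture with fixed unit variances; a deliberate simplification for the sketch, not Weka's actual EM model): the E step assigns each point a membership probability, the M step re-estimates the means and mixing weights, and the log-likelihood never decreases across iterations, which is the quantity step 5 averages over folds.

```java
import java.util.Locale;

public class EmSketch {
    // Density of N(mu, 1) at x
    static double gauss(double x, double mu) {
        return Math.exp(-0.5 * (x - mu) * (x - mu)) / Math.sqrt(2 * Math.PI);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 0.8, 9.0, 9.2, 8.8};
        double mu1 = 0.0, mu2 = 5.0, w1 = 0.5, w2 = 0.5;   // crude initial guesses
        double prevLl = Double.NEGATIVE_INFINITY;
        for (int iter = 0; iter < 30; iter++) {
            double[] r = new double[x.length];             // responsibility of component 1
            double ll = 0, sum1 = 0, sumR = 0;
            for (int i = 0; i < x.length; i++) {
                double p1 = w1 * gauss(x[i], mu1), p2 = w2 * gauss(x[i], mu2);
                r[i] = p1 / (p1 + p2);                     // E step: membership probability
                ll += Math.log(p1 + p2);                   // data log-likelihood
                sum1 += r[i] * x[i];
                sumR += r[i];
            }
            // EM guarantee: the log-likelihood is non-decreasing (up to float noise)
            if (ll < prevLl - 1e-9) throw new AssertionError("log-likelihood decreased");
            prevLl = ll;
            // M step: responsibility-weighted means and mixing weights
            mu1 = sum1 / sumR;
            double sum2 = 0;
            for (int i = 0; i < x.length; i++) sum2 += (1 - r[i]) * x[i];
            mu2 = sum2 / (x.length - sumR);
            w1 = sumR / x.length;
            w2 = 1 - w1;
        }
        System.out.printf(Locale.US, "mu1=%.2f mu2=%.2f%n", mu1, mu2);
    }
}
```

The means converge to the two group centers. Weka's EM additionally estimates per-cluster variances and, in cross-validation mode, repeats this fit per fold to decide whether adding a cluster raises the averaged log-likelihood.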

  22.
    Simple EM (Expectation Maximization)
    EM Demystified: An Expectation-Maximization Tutorial
    https://vannevar.ece.uw.edu/techsite/papers/documents/UWEETR-2010-0002.pdf

  23.
    Weka: EM

  24.
    Evaluation
    Clustering

  25.
    Evaluation
    § A clustering algorithm's quality can be estimated with the
    log-likelihood measure, which measures how consistent the identified
    clusters are.
    § The dataset is split into multiple folds, and clustering is run on each
    fold. The motivation is that if the clustering algorithm assigns a high
    probability to similar data that was not used to fit the parameters, it has
    probably done a good job of capturing the data structure.
    § Weka offers the ClusterEvaluation class to estimate this.

  26.
    With a Dataset
    import weka.clusterers.ClusterEvaluation;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Load the dataset and evaluate a previously trained clusterer (model)
    ClusterEvaluation eval = new ClusterEvaluation();
    DataSource src = new DataSource("file.arff");
    Instances dt = src.getDataSet();
    eval.setClusterer(model);              // model: a built clusterer, e.g., EM
    eval.evaluateClusterer(dt);
    System.out.println(eval.clusterResultsToString());
    System.out.println(eval.getLogLikelihood());

  27.
    One More Thing

  28.
    Questions

  29.
    SER 594 Software Engineering for Machine Learning
    Javier Gonzalez-Sanchez, Ph.D.
    [email protected]
    Spring 2022
    Copyright. These slides can only be used as study material for the class SER 594 at Arizona State University.
    They cannot be distributed or used for another purpose.
