Slide 1

Slide 1 text

Whiskey Clustering with Apache Groovy & Apache Ignite
Paul King, VP at Apache Groovy, Principal Software Engineer at Unity Foundation
June 6, 2023

Slide 2

Slide 2 text

Dr Paul King Unity Foundation Groovy Lead V.P. Apache Groovy Author: https://www.manning.com/books/groovy-in-action-second-edition Slides: https://speakerdeck.com/paulk/whiskey-groovy-ignite (this talk) https://speakerdeck.com/paulk/groovy-data-science (larger talk) Examples repo: https://github.com/paulk-asert/groovy-data-science Twitter: @paulk_asert, Mastodon: @[email protected]

Slide 3

Slide 3 text

Whiskey Clustering with Apache Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 4

Slide 4 text

Apache Groovy Programming Language
• Multi-faceted extensible language
• Imperative/OO & functional
• Dynamic & static
• Aligned closely with Java
• 19+ years since inception
• ~2.5B downloads (partial count)
• ~500 contributors
• 200+ releases
• https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be

Slide 5

Slide 5 text

Friends of Apache Groovy Open Collective

Slide 6

Slide 6 text

What is Groovy?
It’s like a super version of Java:
• Supports most Java syntax but allows simpler syntax for many constructs
• Supports all Java libraries but provides many extensions and its own productivity libraries
• Has both a static and dynamic nature
• Extensible language and tooling
(Diagram labels: Java, Groovy)

Slide 7

Slide 7 text

Why use Groovy in 2023?
It’s still like a super version of Java:
• Simpler scripting
• Metaprogramming: runtime, compile-time, extension methods, AST transforms
• Language features: power assert, powerful switch, traits, closures
• Static and dynamic nature
• Productivity libraries for common tasks
• Extensibility: language, tooling, type checker
(Diagram labels: Java, Groovy)
Let’s look at just two features that reduce boilerplate code

Slide 8

Slide 8 text

Simpler scripting: Java7+
import java.util.List;
import java.util.ArrayList;

class Main {
    private List keepShorterThan(List strings, int length) {
        List result = new ArrayList();
        for (int i = 0; i < strings.size(); i++) {
            String s = (String) strings.get(i);
            if (s.length() < length) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List names = new ArrayList();
        names.add("Ted"); names.add("Fred");
        names.add("Jed"); names.add("Ned");
        System.out.println(names);
        Main m = new Main();
        List shortNames = m.keepShorterThan(names, 4);
        System.out.println(shortNames.size());
        for (int i = 0; i < shortNames.size(); i++) {
            String s = (String) shortNames.get(i);
            System.out.println(s);
        }
    }
}

Slide 9

Slide 9 text

Simpler scripting: Java21+ (with JEP 445 & preview)
import java.util.List;

void main() {
    var names = List.of("Ted", "Fred", "Jed", "Ned");
    System.out.println(names);
    var shortNames = names.stream().filter(n -> n.length() < 4).toList();
    System.out.println(shortNames.size());
    shortNames.forEach(System.out::println);
}

Slide 10

Slide 10 text

Simpler scripting: JDK5+/Groovy 1+
names = ["Ted", "Fred", "Jed", "Ned"]
println names
shortNames = names.findAll{ it.size() < 4 }
println shortNames.size()
shortNames.each{ println it }

Slide 11

Slide 11 text

Simpler scripting: DSL/command chain support
given the names "Ted", "Fred", "Jed" and "Ned"
display all the names
display the number of names having size less than 4
display the names having size less than 4

Slide 12

Slide 12 text

Scripting for Data Science
• Same example
• Same library
import org.apache.commons.math3.linear.*;

public class MatrixMain {
    public static void main(String[] args) {
        double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}};
        RealMatrix m = MatrixUtils.createRealMatrix(matrixData);
        double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}};
        RealMatrix n = new Array2DRowRealMatrix(matrixData2);
        RealMatrix o = m.multiply(n);
        // Invert o, using LU decomposition
        RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();
        RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);
        RealMatrix q = o.add(p.power(2));
        System.out.println(q);
    }
}
Output: Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}
Thanks to operator overloading and extensible tooling
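The slide credits operator overloading for the shorter Groovy version, but that version did not survive in the slide text. As a rough illustration of the mechanism only, the sketch below uses a small hypothetical Mat class (an assumption for this example, not part of Apache Commons Math). Groovy maps a * b to a.multiply(b), a + b to a.plus(b) and a ** n to a.power(n), so any class with those methods can be used with the operators. The matrix inversion step from the Java example is left out to keep the sketch short.

```groovy
// Hypothetical Mat class for illustration only; it is not the Commons Math API.
class Mat {
    final List<List<Double>> rows
    Mat(List<List<Double>> rows) { this.rows = rows }

    // a * b  ->  a.multiply(b): row-by-column matrix product
    Mat multiply(Mat other) {
        def otherCols = other.rows.transpose()
        new Mat(rows.collect { row ->
            otherCols.collect { col -> (0..<row.size()).sum { row[it] * col[it] } }
        })
    }

    // a * k  ->  a.multiply(k): scale every element
    Mat multiply(Number k) { new Mat(rows.collect { row -> row.collect { it * k } }) }

    // a + b  ->  a.plus(b): element-wise sum
    Mat plus(Mat other) {
        new Mat((0..<rows.size()).collect { i ->
            (0..<rows[i].size()).collect { j -> rows[i][j] + other.rows[i][j] }
        })
    }

    // a + k  ->  a.plus(k): add a scalar to every element
    Mat plus(Number k) { new Mat(rows.collect { row -> row.collect { it + k } }) }

    // a ** n  ->  a.power(n): repeated multiplication, for n >= 1
    Mat power(int n) { (1..<n).inject(this) { acc, ignored -> acc * this } }

    String toString() { rows.toString() }
}

def m = new Mat([[1d, 2d, 3d], [2d, 5d, 3d]])
def n = new Mat([[1d, 2d], [2d, 5d], [1d, 7d]])
def o = m * n             // same product as m.multiply(n) in the Java version
def p = (o + 1d) * 2d     // scalar add, then scalar multiply
def q = o + p ** 2
println q
```

The expression o + p ** 2 reads like the underlying maths, whereas the Java version spells out o.add(p.power(2)).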

Slide 13

Slide 13 text

Scripting in Notebooks • Jupyter/beakerx • Apache Zeppelin • GroovyLab • Seco

Slide 14

Slide 14 text

Metaprogramming: AST Transforms
• Writing a JavaBean Person class, Java 7-15
public final class Person {
    private final String first;
    private final String last;

    public String getFirst() {
        return first;
    }

    public String getLast() {
        return last;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((first == null) ? 0 : first.hashCode());
        result = prime * result + ((last == null) ? 0 : last.hashCode());
        return result;
    }

    public Person(String first, String last) {
        this.first = first;
        this.last = last;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null) return false;
        if (getClass() != obj.getClass()) return false;
        Person other = (Person) obj;
        if (first == null) {
            if (other.first != null) return false;
        } else if (!first.equals(other.first)) return false;
        if (last == null) {
            if (other.last != null) return false;
        } else if (!last.equals(other.last)) return false;
        return true;
    }

    @Override
    public String toString() {
        return "Person(first:" + first + ", last:" + last + ")";
    }
}

Slide 15

Slide 15 text

Metaprogramming: AST Transforms
• Groovy equivalent (JDK 7-15)
@Immutable
class Person {
    String first, last
}

Slide 16

Slide 16 text

Metaprogramming: AST Transforms
• Java Record (JDK16+) / Groovy Record (JDK 8+)
@Immutable
class Person {
    String first, last
}
record Person(String first, String last) { }

Slide 17

Slide 17 text

Groovy Records: differences to Java
Feature | Java Record | Groovy Emulated Record | Groovy Native Record
JDK version | 16+ | 8+ | 16+
Serialization | Record spec | Traditional | Record spec
Recognized by | Java, Groovy | Groovy | Java, Groovy
Standard features (accessors, tuple constructor, toString, equals, hashCode) | ✓ | ✓ | ✓
Optional enhancements | ✓ toMap, toList, size, getAt, components, copyWith, named-arg constructor
Customisable via coding | ✓ | ✓ | ✓
Customisable via AST transforms (declarative) | ✓ | ✓ | ✓
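The optional enhancements listed above can be tried out with a few lines of script. This is a minimal sketch assuming Groovy 4.0 or later; the Person record matches the earlier slides and the sample names are made up. The copyWith method is opt-in via @RecordOptions, while toMap, toList, size and getAt are generated by default.

```groovy
import groovy.transform.RecordOptions

// copyWith must be switched on explicitly
@RecordOptions(copyWith = true)
record Person(String first, String last) { }

def p = new Person('Ted', 'Ned')                   // tuple constructor
def q = new Person(first: 'Fred', last: 'Jed')     // named-arg constructor

assert p.first() == 'Ted'                          // record-style accessor
assert p.toMap() == [first: 'Ted', last: 'Ned']
assert p.toList() == ['Ted', 'Ned']
assert p.size() == 2
assert p[1] == 'Ned'                               // getAt
assert p.copyWith(last: 'Zed') == new Person('Ted', 'Zed')
```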

Slide 18

Slide 18 text

Metaprogramming // imports not shown public class Book { private String $to$string; private int $hash$code; private final List authors; private final String title; private final Date publicationDate; private static final java.util.Comparator this$TitleComparator; private static final java.util.Comparator this$PublicationDateComparator; public Book(List authors, String title, Date publicationDate) { if (authors == null) { this.authors = null; } else { if (authors instanceof Cloneable) { List authorsCopy = (List) ((ArrayList>) authors).clone(); this.authors = (List) (authorsCopy instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Set ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Map ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof List ? DefaultGroovyMethods.asImmutable(authorsCopy) : DefaultGroovyMethods.asImmutable(authorsCopy)); } else { this.authors = (List) (authors instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Set ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Map ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof List ? 
DefaultGroovyMethods.asImmutable(authors) : DefaultGroovyMethods.asImmutable(authors)); } } this.title= title; if (publicationDate== null) { this.publicationDate= null; } else { this.publicationDate= (Date) publicationDate.clone(); } } public Book(Map args) { if ( args == null) { args = new HashMap(); } ImmutableASTTransformation.checkPropNames(this, args); if (args.containsKey("authors")) { if ( args.get("authors") == null) { this .authors = null; } else { if (args.get("authors") instanceof Cloneable) { List authorsCopy = (List) ((ArrayList>) args.get("authors")).clone(); this.authors = (List) (authorsCopy instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Set ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Map ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof List ? DefaultGroovyMethods.asImmutable(authorsCopy) : DefaultGroovyMethods.asImmutable(authorsCopy)); } else { List authors = (List) args.get("authors"); this.authors = (List) (authors instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Set ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Map ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof List ? 
DefaultGroovyMethods.asImmutable(authors) : DefaultGroovyMethods.asImmutable(authors)); } } } else { this .authors = null; } if (args.containsKey("title")) {this .title = (String) args.get("title"); } else { this .title = null;} if (args.containsKey("publicationDate")) { if (args.get("publicationDate") == null) { this.publicationDate = null; } else { this.publicationDate = (Date) ((Date) args.get("publicationDate")).clone(); } } else {this.publicationDate = null; } } … public Book() { this (new HashMap()); } public int compareTo(Book other) { if (this == other) { return 0; } Integer value = 0 value = this .title <=> other .title if ( value != 0) { return value } value = this .publicationDate <=> other .publicationDate if ( value != 0) { return value } return 0 } public static Comparator comparatorByTitle() { return this$TitleComparator; } public static Comparator comparatorByPublicationDate() { return this$PublicationDateComparator; } public String toString() { StringBuilder _result = new StringBuilder(); boolean $toStringFirst= true; _result.append("Book("); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getAuthors())); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getTitle())); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getPublicationDate())); _result.append(")"); if ($to$string == null) { $to$string = _result.toString(); } return $to$string; } public int hashCode() { if ( $hash$code == 0) { int _result = HashCodeHelper.initHash(); if (!(this.getAuthors().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getAuthors()); } if (!(this.getTitle().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getTitle()); } if (!(this.getPublicationDate().equals(this))) { _result = HashCodeHelper.updateHash(_result, 
this.getPublicationDate()); } $hash$code = (int) _result; } return $hash$code; } public boolean canEqual(Object other) { return other instanceof Book; } … public boolean equals(Object other) { if ( other == null) { return false; } if (this == other) { return true; } if (!( other instanceof Book)) { return false; } Book otherTyped = (Book) other; if (!(otherTyped.canEqual( this ))) { return false; } if (!(this.getAuthors() == otherTyped.getAuthors())) { return false; } if (!(this.getTitle().equals(otherTyped.getTitle()))) { return false; } if (!(this.getPublicationDate().equals(otherTyped.getPublicationDate()))) { return false; } return true; } public final Book copyWith(Map map) { if (map == null || map.size() == 0) { return this; } Boolean dirty = false; HashMap construct = new HashMap(); if (map.containsKey("authors")) { Object newValue = map.get("authors"); Object oldValue = this.getAuthors(); if (newValue != oldValue) { oldValue= newValue; dirty = true; } construct.put("authors", oldValue); } else { construct.put("authors", this.getAuthors()); } if (map.containsKey("title")) { Object newValue = map.get("title"); Object oldValue = this.getTitle(); if (newValue != oldValue) { oldValue= newValue; dirty = true; } construct.put("title", oldValue); } else { construct.put("title", this.getTitle()); } if (map.containsKey("publicationDate")) { Object newValue = map.get("publicationDate"); Object oldValue = this.getPublicationDate(); if (newValue != oldValue) { oldValue= newValue; dirty = true; } construct.put("publicationDate", oldValue); } else { construct.put("publicationDate", this.getPublicationDate()); } return dirty == true ? 
new Book(construct) : this; } public void writeExternal(ObjectOutput out) throws IOException { out.writeObject(authors); out.writeObject(title); out.writeObject(publicationDate); } public void readExternal(ObjectInput oin) throws IOException, ClassNotFoundException { authors = (List) oin.readObject(); title = (String) oin.readObject(); publicationDate = (Date) oin.readObject(); } … static { this$TitleComparator = new Book$TitleComparator(); this$PublicationDateComparator = new Book$PublicationDateComparator(); } public String getAuthors(int index) { return authors.get(index); } public List getAuthors() { return authors; } public final String getTitle() { return title; } public final Date getPublicationDate() { if (publicationDate == null) { return publicationDate; } else { return (Date) publicationDate.clone(); } } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } private static class Book$TitleComparator extends AbstractComparator { public Book$TitleComparator() { } public int compare(Book arg0, Book arg1) { if (arg0 == arg1) { return 0; } if (arg0 != null && arg1 == null) { return -1; } if (arg0 == null && arg1 != null) { return 1; } return arg0.title <=> arg1.title; } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } } private static class Book$PublicationDateComparator extends AbstractComparator { public Book$PublicationDateComparator() { } public int compare(Book arg0, Book arg1) { if (arg0 == arg1) { return 0; } if (arg0 != null && arg1 == null) { return -1; } if (arg0 == null && arg1 != null) { return 1; } return arg0.publicationDate <=> arg1.publicationDate; } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } } }
@Immutable(copyWith = true)
@Sortable(excludes = 'authors')
@AutoExternalize
class Book {
    @IndexedProperty List authors
    String title
    Date publicationDate
}

Slide 19

Slide 19 text

AST Transformations: Groovy 2.4, Groovy 2.5, Groovy 3.0, Groovy 4.0 @NonSealed @RecordBase @Sealed @PlatformLog @GQ @Final @RecordType @POJO @Pure @Contracted @Ensures @Invariant @Requires @ClassInvariant @ContractElement @Postcondition @Precondition (Improved in 2.5)

Slide 20

Slide 20 text

Whiskey Clustering with Apache Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 21

Slide 21 text

Scaling up machine learning: Apache Ignite
Apache Ignite is a distributed database for high-performance computing with in-memory speed. In simple terms, it makes a cluster (or grid) of nodes appear like an in-memory cache.
Ignite can be used as:
• an in-memory cache with special features like SQL querying and transactional properties
• an in-memory data grid with advanced read-through & write-through capabilities on top of one or more distributed databases
• an ultra-fast and horizontally scalable in-memory database
• a high-performance computing engine for custom or built-in tasks including machine learning
It is mostly this last capability that we will use. Ignite’s Machine Learning API has purpose-built, cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation, among others. We’ll mostly use the distributed K-means clustering algorithm from its library.

Slide 22

Slide 22 text

Whiskey Clustering with Apache Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 23

Slide 23 text

Data Science Process
Process steps: Research Goals, Obtain Data, Data Preparation, Data Exploration, Visualization, Data Modeling
Supporting technology: Data ingestion, Data storage, Data processing platforms, Modeling algorithms, Math libraries, Graphics processing, Integration, Deployment

Slide 24

Slide 24 text

Data science algorithms
Data Mining, Statistics, Machine Learning, Optimization
• Analytics: descriptive, predictive, prescriptive
• Analysis: anomaly detection, classification, regression, clustering, association, optimization, dimension reduction
• Data relationship: linear, non-linear
• Assumptions: parametric, non-parametric
• Strategy: supervised, unsupervised, reinforcement
• Combining: ensemble, boosting

Slide 25

Slide 25 text

Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

Slide 26

Slide 26 text

Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

Slide 27

Slide 27 text

Whiskey Clustering with Apache Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 28

Slide 28 text

Clustering Overview
Clustering:
• Grouping similar items
Algorithm families:
• Hierarchical
• Partitioning (k-means, x-means)
• Density-based
• Graph-based
Aspects:
• Disjoint vs overlapping
• Preset cluster number
• Dimensionality reduction (PCA)
• Nominal feature support
Applications:
• Market segmentation
• Recommendation engines
• Search result grouping
• Social network analysis
• Medical imaging

Slide 29

Slide 29 text

Clustering Overview
Clustering:
• Grouping similar items
Algorithm families:
• Hierarchical
• Partitioning (k-means, x-means)
• Density-based
• Graph-based
Aspects:
• Disjoint vs overlapping
• Preset cluster number
• Dimensionality reduction (PCA)
• Nominal feature support
Applications:
• Market segmentation
• Recommendation engines
• Search result grouping
• Social network analysis
• Medical imaging

Slide 30

Slide 30 text

Clustering https://commons.apache.org/proper/commons-math/userguide/ml.html

Slide 31

Slide 31 text

Clustering with KMeans Step 1: • Guess k cluster centroids at random

Slide 32

Slide 32 text

Clustering with KMeans Step 1: • Guess k cluster centroids

Slide 33

Slide 33 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid

Slide 34

Slide 34 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid

Slide 35

Slide 35 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 36

Slide 36 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 37

Slide 37 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 38

Slide 38 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 39

Slide 39 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 40

Slide 40 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 41

Slide 41 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 42

Slide 42 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
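The three steps and the repeat rule above can be sketched in plain Groovy without any library. This is only an illustration under stated assumptions: the function names (kMeans, distSq, mean), the fixed random seed, the choice of squared Euclidean distance and the tiny two-blob data set are all made up for this example and are not taken from the talk's repository.

```groovy
// Squared Euclidean distance between two points of equal dimension
def distSq(List<Double> a, List<Double> b) {
    (0..<a.size()).sum { def d = a[it] - b[it]; d * d }
}

// Component-wise mean of a non-empty list of points
def mean(List<List<Double>> pts) {
    pts.transpose().collect { it.sum() / it.size() }
}

def kMeans(List<List<Double>> points, int k, int maxIterations = 100, long seed = 42L) {
    // Step 1: guess k centroids by picking k input points at random
    def shuffled = new ArrayList(points)
    Collections.shuffle(shuffled, new Random(seed))
    def centroids = shuffled.take(k)
    def assignments = []
    for (iter in 1..maxIterations) {
        // Step 2: assign each point to the index of its closest centroid
        def newAssignments = points.collect { p ->
            (0..<k).min { c -> distSq(p, centroids[c]) }
        }
        // Stop once the assignments no longer change
        if (newAssignments == assignments) break
        assignments = newAssignments
        // Step 3: move each centroid to the mean of its assigned points
        // (a centroid with no points keeps its previous position)
        centroids = (0..<k).collect { c ->
            def members = (0..<points.size()).findAll { assignments[it] == c }.collect { points[it] }
            members ? mean(members) : centroids[c]
        }
    }
    [centroids: centroids, assignments: assignments]
}

// Illustrative data: two well-separated groups of three points each
def points = [[1d, 1d], [1.5d, 2d], [1d, 0.5d], [8d, 8d], [9d, 9.5d], [8.5d, 9d]]
def result = kMeans(points, 2)
println result.assignments
println result.centroids
```

Which group gets label 0 and which gets label 1 depends on the random initial guess, so only the grouping itself is meaningful.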

Slide 43

Slide 43 text

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies • 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/

Slide 44

Slide 44 text

Clustering case study: Whiskey flavor profiles
RowID,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral
…
34,GlenElgin,2,3,1,0,0,2,1,1,1,1,2,3
35,GlenGarioch,2,1,3,0,0,0,3,1,0,2,2,2
36,GlenGrant,1,2,0,0,0,1,0,1,2,1,2,1
37,GlenKeith,2,3,1,0,0,1,2,1,2,1,2,1
38,GlenMoray,1,2,1,0,0,1,2,1,2,2,2,4
39,GlenOrd,3,2,1,0,0,1,2,1,1,2,2,2
40,GlenScotia,2,2,2,2,0,1,0,1,2,2,1,1
41,GlenSpey,1,3,1,0,0,0,1,1,1,2,0,2
42,Glenallachie,1,3,1,0,0,1,1,0,1,2,2,2
…

Slide 45

Slide 45 text

Whiskey – exploring with Dex

Slide 46

Slide 46 text

Whiskey – clustering with radar plot and weka
import …
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
            "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def numClusters = 5
def loader = new CSVLoader(file: 'whiskey.csv')
def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true)
def instances = loader.dataSet
instances.deleteAttributeAt(0) // remove RowID
clusterer.buildClusterer(instances)
println ' ' + cols.join(', ')
def dataset = new DefaultCategoryDataset()
clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num ->
    print "Cluster ${num+1}: "
    println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', '))
    (1..cols.size()).each { idx ->
        dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) }
}
def clusters = (0..<numClusters).collectEntries{ [it, []] }
clusterer.assignments.eachWithIndex { cnum, idx ->
    clusters[cnum] << instances.get(idx).stringValue(0) }
clusters.each { k, v ->
    println "Cluster ${k+1}:"
    println v.join(', ')
}
def plot = new SpiderWebPlot(dataset: dataset)
def chart = new JFreeChart('Whiskey clusters', plot)
SwingUtil.show(new ChartPanel(chart))

Output:
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000
Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591
Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591
Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133
Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714
Cluster 1: Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker
Cluster 2: Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
Cluster 3: ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith, GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff, Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore
Cluster 4: AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain, Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood, Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul
Cluster 5: Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn, OldPulteney, RoyalBrackla, Tullibardine

Slide 47

Slide 47 text

Whiskey – clustering with radar plot and medoids
Libraries: Apache Commons Math and JFreeChart
import …
def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv'))
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
            "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def clusterer = new KMeansPlusPlusClusterer(5)
def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) }
def centroids = clusterer.cluster(data)
println cols.join(', ') + ', Medoid'
def dataset = new DefaultCategoryDataset()
centroids.eachWithIndex{ ctrd, num ->
    def cpt = ctrd.center.point
    // closest member to the centroid, by sum of squared differences
    def closest = ctrd.points.min{ pt ->
        sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[]) }
    def medoid = rows.find{ row ->
        cols.collect{ row[it] as double } == closest.point }?.Distillery
    println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid"
    cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) }
}
def plot = new SpiderWebPlot(dataset: dataset)
def chart = new JFreeChart('Whiskey clusters', plot)
SwingUtil.show(new ChartPanel(chart))

Output:
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid
2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd
2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy
2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish
1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore
1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach
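The last column of the output names, for each cluster, the member that lies closest to the centroid, which the slide calls the medoid. A library-free sketch of that lookup is shown below. The closure names (sqDist, closestTo), the member names A, B, C and all values are hypothetical and chosen only to make the arithmetic easy to follow.

```groovy
// Squared distance between two equal-length points
def sqDist = { List<Double> a, List<Double> b ->
    (0..<a.size()).sum { def d = a[it] - b[it]; d * d }
}

// Name of the member whose point is closest to the given centroid
def closestTo = { List<Double> centroid, Map<String, List<Double>> members ->
    members.min { entry -> sqDist(centroid, entry.value) }.key
}

// Hypothetical cluster: three named members and an assumed centroid
def members = [A: [1d, 1d], B: [2d, 2.5d], C: [4d, 0d]]
def centroid = [2d, 2d]
def medoid = closestTo(centroid, members)
println medoid
```

A centroid is an average and usually matches no real item, so reporting the nearest real member gives each cluster a recognisable representative.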

Slide 48

Slide 48 text

Dimensionality reduction

Slide 49

Slide 49 text

import … def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0.. def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..

Slide 50

Slide 50 text

Whiskey – Exploring Weka clustering algorithms

Slide 51

Slide 51 text

Whiskey – clustering and visualizing centroids … def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 3 def projected = pca.project(data) def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) } table = table.addColumns( *(0..<3).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0.. toAdd[0].setString("Cluster", "Cluster " + (idx+1)) (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) } toAdd[0].setDouble("Centroid", 50) table.append(toAdd) } def title = "Clusters x Principal Components w/ centroids" Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))

Slide 52

Slide 52 text

Whiskey – Hierarchical clustering with Dendrogram
…
def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
    title = 'Whiskey Dendrogram'
    setAxisLabels('Distilleries', 'Similarity')
    def lb = lowerBounds
    setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
    distilleries.eachWithIndex { String label, int i ->
        add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
    }
}.panel()
def pca = PCA.fit(data)
pca.projection = 2
def projected = pca.project(data)
char mark = '#'
def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
    title = 'Clustered by dendrogram partitions'
    setAxisLabels('PCA1', 'PCA2')
}.panel()
new PlotGrid(dendrogram, scatter).window()

Slide 53

Slide 53 text

Whiskey Clustering with Apache Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 54

Slide 54 text

Clustering case study: Whiskey flavor profiles • Distributed clustering?

Slide 55

Slide 55 text

Clustering case study: Whiskey flavor profiles Node 1 Node 2

Slide 56

Slide 56 text

Clustering case study: Whiskey flavor profiles Node 1 Node 2

Slide 57

Slide 57 text

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies • 12 flavor categories • Apache Ignite has special capabilities for reading data into the cache • In a cluster environment, use IgniteDataStreamer or IgniteCache.loadCache() to load data from files, stream sources, database sources, etc. • For our little example, we have a small CSV file and a single node, so we’ll just read our data using Apache Commons CSV

Slide 58

Slide 58 text

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies • 12 flavor categories • Let’s select the regions of interest

Slide 59

Slide 59 text

Clustering case study: Whiskey flavor profiles
• Read CSV rows
• Slice out segments of interest
[Figure: CSV grid with column indices 0, 1, 2 … -1 and row indices 0, 1 …, with the distilleries, data and features regions marked]
var file = getClass().classLoader.getResource('whiskey.csv').file as File
var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() }
var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]
var distilleries = rows[1..-1]*.get(1)
var features = rows[0][2..-1]
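The slicing above relies on Groovy ranges with negative indices: `1..-1` means "from the second element to the last". A minimal sketch of the same three slices follows, run against a made-up three-line CSV so it needs no external library. The column names, distillery names and scores are invented for illustration, and the naive comma split is an assumption that only holds while no field is quoted.

```groovy
// Made-up CSV in the assumed layout: row id, distillery name, then flavor scores
var text = '''RowID,Distillery,Body,Sweetness,Smoky
1,DistilleryA,2,2,2
2,DistilleryB,3,3,1'''

// Naive parse: one list of fields per line (no quoted fields in this sample)
var rows = text.readLines()*.tokenize(',')

// Header row, skipping the first two columns
var features = rows[0][2..-1]
// Second column of every data row
var distilleries = rows[1..-1]*.get(1)
// Remaining columns of every data row, converted to a double matrix
var data = rows[1..-1].collect { it[2..-1]*.toDouble() } as double[][]

println features
println distilleries
println data.collect { it.toList() }
```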

Slide 60

Slide 60 text

Clustering case study: Whiskey flavor profiles
• Set up configuration & define some helper variables
// configure to all run on local machine but could be a cluster (can be hidden in XML)
var cfg = new IgniteConfiguration(
    peerClassLoadingEnabled: true,
    discoverySpi: new TcpDiscoverySpi(
        ipFinder: new TcpDiscoveryMulticastIpFinder(
            addresses: ['127.0.0.1:47500..47509']
        )
    )
)
var pretty = this.&sprintf.curry('%.4f')
var dist = new EuclideanDistance() // or ManhattanDistance
var vectorizer = new DoubleArrayVectorizer()
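The `pretty` helper combines two Groovy features: a method pointer (`this.&sprintf`) and currying, which fixes the format string as the first argument. A small sketch of how it behaves when run as a script follows. The centroid values are made up, and the exact decimal separator in the output depends on the default locale.

```groovy
// Method pointer to the built-in sprintf, with the format string fixed up front
var pretty = this.&sprintf.curry('%.4f')

// Made-up centroid coordinates, for illustration only
var centroid = [1.5d, 2.5d, 0.1818d]

// Each value is formatted to four decimal places, then joined into one line
var line = centroid.collect(pretty).join(', ')
println line
```

Passing the curried closure straight to `collect` is what lets the later slides print each centroid with a single expression.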

Slide 61

Slide 61 text

Whiskey flavors – scaling clustering
Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids:\n${features.join(', ')}"
    var centroids = mdl.centers*.all()
    var cols = centroids.collect{ it*.get() }
    cols.each { c -> println c.collect(pretty).join(', ') }
    dataCache.destroy()
}

Slide 62

Slide 62 text

Whiskey flavors – scaling clustering
[11:48:48]    __________  ________________
[11:48:48]   /  _/ ___/ |/ /  _/_  __/ __/
[11:48:48]  _/ // (7 7    // /  / / / _/
[11:48:48] /___/\___/_/|_/___/ /_/ /x___/
[11:48:48]
[11:48:48] ver. 2.15.0#20230425-sha1:f98f7f35
[11:48:48] 2023 Copyright(C) Apache Software Foundation
…
>>> Ignite grid started for data: 86 rows X 12 cols
>>> KMeans centroids:
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
1.5000, 2.5000, 1.0000, 0.1818, 0.0455, 0.7727, 0.8182, 0.3636, 1.6818, 1.5909, 2.0455, 1.8182
2.4400, 2.3600, 1.4400, 0.0800, 0.0400, 1.8000, 1.6800, 1.6000, 1.9200, 2.2400, 2.0800, 1.7200
2.9091, 1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455
1.6000, 2.3200, 1.4800, 0.4400, 0.1200, 1.3600, 1.6000, 0.7600, 0.6800, 1.7600, 1.5600, 2.1600
4.0000, 2.6667, 1.6667, 0.0000, 0.0000, 2.0000, 1.0000, 3.6667, 2.3333, 1.3333, 2.0000, 1.0000

Slide 63

Slide 63 text

Whiskey flavors – scaling clustering
…
var clusters = [:].withDefault{ [] }
dataCache.query(new ScanQuery()).withCloseable { observations ->
    observations.each { observation ->
        def (k, v) = observation.with{ [getKey(), getValue()] }
        int prediction = mdl.predict(vectorizer.extractFeatures(k, v))
        clusters[prediction] += distilleries[k]
    }
}
clusters.sort{ e -> e.key }.each{ k, v ->
    println "Cluster ${k+1}: ${v.join(', ')}"
}
…
…
Cluster 1: AnCnoc, Auchentoshan, Aultmore, BenNevis, Benriach, Bunnahabhain, Cardhu, Craigallechie, Dalwhinnie, Edradour, GlenElgin, GlenGrant, GlenMoray, GlenOrd, Glengoyne, Glenlossie, Glenmorangie, Knochando, Longmorn, Mannochmore, Scapa, Speyside, Strathmill, Tamdhu, Tobermory
Cluster 2: Aberlour, Belvenie, Benrinnes, Deanston, Glendullan, Glenlivet, Strathisla
Cluster 3: ArranIsleOf, Balblair, Bladnoch, Craigganmore, Dufftown, GlenDeveronMacduff, GlenGarioch, GlenSpey, Glenallachie, Glenfiddich, Glenkinchie, Inchgower, Linkwood, Loch Lomond, Miltonduff, RoyalBrackla, Speyburn, Tamnavulin, Teaninich, Tullibardine
Cluster 4: Aberfeldy, Ardmore, Auchroisk, Balmenach, Benromach, BlairAthol, Bowmore, Bruichladdich, Dailuaine, Dalmore, GlenKeith, GlenScotia, Glendronach, Glenfarclas, Glenrothes, Glenturret, Highland Park, Macallan, Mortlach, OldFettercairn, RoyalLochnagar, Springbank, Tomatin, Tomintoul, Tomore
Cluster 5: Ardbeg, Caol Ila, Clynelish, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Talisker
…
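The grouping step pairs a prediction per observation with a map whose missing keys default to an empty list. A minimal plain-Groovy sketch of that pattern follows, with no Ignite involved. The names, points and centers are made up, and the `predict` closure is a hypothetical nearest-centroid stand-in for `mdl.predict`, not the library's implementation.

```groovy
// Made-up observations and centroids (2-D for brevity)
var names = ['A', 'B', 'C']
var points = [[0.0d, 0.0d], [5.0d, 5.0d], [1.0d, 0.0d]]
var centers = [[0.5d, 0.0d], [5.0d, 4.0d]]

// Hypothetical stand-in for mdl.predict: index of the nearest centroid
var predict = { List<Double> p ->
    (0..<centers.size()).min { c ->
        (0..<p.size()).sum { i ->
            double d = p[i] - centers[c][i]
            d * d
        }
    }
}

// Missing keys start out as an empty list, so += can append immediately
var clusters = [:].withDefault { [] }
points.eachWithIndex { p, i -> clusters[predict(p)] += names[i] }

// Sort by cluster index and print with 1-based cluster numbers
clusters.sort { e -> e.key }.each { k, v ->
    println "Cluster ${k + 1}: ${v.join(', ')}"
}
```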

Slide 65

Slide 65 text

Scaling clustering: K-means k=3 Euclidean
var dist = new EuclideanDistance()
…
Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(3)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids:\n${features.join(', ')}"
    var centroids = mdl.centers*.all()
    var cols = centroids.collect{ it*.get() }
    cols.each { c -> println c.collect(pretty).join(', ') }
    dataCache.destroy()
}
[Figure: Euclidean distance illustration, labelled 5]

Slide 66

Slide 66 text

Scaling clustering: K-means k=3 Euclidean
…
Cluster 1: Ardbeg, Caol Ila, Clynelish, Lagavulin, Laphroig, Talisker
Distinguishing features: Body=3..4, Sweetness=1..2, Smoky=3..4, Medicinal=2..4, Honey=0..1, Winey=0..2, Nutty=1..2, Malty=1..2, Fruity=0..2, Floral=0..1
Cluster 2: Ardmore, ArranIsleOf, Balblair, Balmenach, BlairAthol, Bowmore, Bruichladdich, Dailuaine, Dalmore, GlenDeveronMacduff, GlenGarioch, GlenScotia, GlenSpey, Glendronach, Glenrothes, Highland Park, Isle of Jura, Loch Lomond, Mortlach, Oban, OldFettercairn, OldPulteney, Springbank, Teaninich, Tomatin, Tomore
Distinguishing features: Sweetness=1..3, Smoky=1..3, Medicinal=0..2, Honey=0..2, Floral=0..2
Cluster 3: Aberfeldy, Aberlour, AnCnoc, Auchentoshan, Auchroisk, Aultmore, Belvenie, BenNevis, Benriach, Benrinnes, Benromach, Bladnoch, Bunnahabhain, Cardhu, Craigallechie, Craigganmore, Dalwhinnie, Deanston, Dufftown, Edradour, GlenElgin, GlenGrant, GlenKeith, GlenMoray, GlenOrd, Glenallachie, Glendullan, Glenfarclas, Glenfiddich, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Glenmorangie, Glenturret, Inchgower, Knochando, Linkwood, Longmorn, Macallan, Mannochmore, Miltonduff, RoyalBrackla, RoyalLochnagar, Scapa, Speyburn, Speyside, Strathisla, Strathmill, Tamdhu, Tamnavulin, Tobermory, Tomintoul, Tullibardine
Distinguishing features: Smoky=0..2, Medicinal=0..1, Malty=1..3, Fruity=1..3
…

Slide 67

Slide 67 text

Scaling clustering: K-means k=3 Manhattan
var dist = new ManhattanDistance()
…
Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(3)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids:\n${features.join(', ')}"
    var centroids = mdl.centers*.all()
    var cols = centroids.collect{ it*.get() }
    cols.each { c -> println c.collect(pretty).join(', ') }
    dataCache.destroy()
}
[Figure: Manhattan distance illustration, labelled 4, 3 and 3 + 4 = 7]
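The two distance measures differ only in how per-coordinate differences are combined. A minimal plain-Groovy sketch follows, assuming two made-up points whose coordinates differ by 3 and 4; the closures are illustrative and are not the Ignite `EuclideanDistance` or `ManhattanDistance` classes.

```groovy
// Two made-up points whose coordinate differences are 3 and 4
var p = [0.0d, 0.0d]
var q = [3.0d, 4.0d]

// Euclidean: square root of the sum of squared differences
var euclidean = Math.sqrt((0..<p.size()).sum { i ->
    double d = q[i] - p[i]
    d * d
} as double)

// Manhattan: sum of absolute differences
var manhattan = (0..<p.size()).sum { i -> Math.abs(q[i] - p[i]) }

println "Euclidean: $euclidean, Manhattan: $manhattan"
```

Because Manhattan distance weights every coordinate difference linearly, swapping it in can move borderline observations into a different cluster, which is one reason the k=3 groupings differ between the two runs.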

Slide 68

Slide 68 text

Scaling clustering: K-means k=3 Manhattan
…
Cluster 1: Aberfeldy, Aberlour, AnCnoc, Ardmore, Auchroisk, Balmenach, Belvenie, BenNevis, Benrinnes, Benromach, BlairAthol, Bowmore, Bruichladdich, Craigallechie, Dailuaine, Dalmore, Deanston, Edradour, Glendronach, Glendullan, Glenfarclas, Glenlivet, Glenturret, Knochando, Macallan, Mortlach, OldFettercairn, RoyalLochnagar, Scapa, Strathisla, Tomatin, Tomintoul
Distinguishing features: Smoky=1..3, Medicinal=0..2
Cluster 2: Ardbeg, Caol Ila, Clynelish, GlenScotia, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Springbank, Talisker
Distinguishing features: Body=2..4, Sweetness=1..2, Smoky=2..4, Honey=0..2, Winey=0..2, Nutty=1..2, Malty=1..2, Fruity=0..2, Floral=0..2
Cluster 3: ArranIsleOf, Auchentoshan, Aultmore, Balblair, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dalwhinnie, Dufftown, GlenDeveronMacduff, GlenElgin, GlenGarioch, GlenGrant, GlenKeith, GlenMoray, GlenOrd, GlenSpey, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Glenrothes, Inchgower, Linkwood, Loch Lomond, Longmorn, Mannochmore, Miltonduff, RoyalBrackla, Speyburn, Speyside, Strathmill, Tamdhu, Tamnavulin, Teaninich, Tobermory, Tomore, Tullibardine
Distinguishing features: Medicinal=0..1, Honey=0..2, Winey=0..2
…

Slide 69

Slide 69 text

Scaling clustering: Gaussian max clusters 5
Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new GmmTrainer().withMaxCountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    …
    dataCache.destroy()
}
Image source: Wikipedia

Slide 70

Slide 70 text

Scaling clustering: Gaussian max clusters 5
…
Cluster 1: Aberfeldy, Aberlour, AnCnoc, Ardmore, ArranIsleOf, Auchentoshan, Auchroisk, Aultmore, Balmenach, Belvenie, BenNevis, Benriach, Benrinnes, Benromach, Bladnoch, BlairAthol, Bunnahabhain, Cardhu, Craigallechie, Craigganmore, Dailuaine, Dalwhinnie, Deanston, Dufftown, Edradour, GlenDeveronMacduff, GlenElgin, GlenGrant, GlenKeith, GlenMoray, GlenOrd, GlenSpey, Glenallachie, Glendronach, Glendullan, Glenfarclas, Glenfiddich, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Glenrothes, Glenturret, Inchgower, Knochando, Linkwood, Loch Lomond, Longmorn, Macallan, Mannochmore, Miltonduff, Mortlach, OldFettercairn, RoyalLochnagar, Speyburn, Speyside, Strathisla, Tamdhu, Tamnavulin, Tobermory, Tomatin, Tomintoul, Tomore, Tullibardine
Distinguishing features: Smoky=0..2, Medicinal=0..1
Cluster 2: Ardbeg, Balblair, Bowmore, Bruichladdich, Caol Ila, Clynelish, Dalmore, GlenGarioch, GlenScotia, Glenmorangie, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, RoyalBrackla, Scapa, Springbank, Strathmill, Talisker, Teaninich
Distinguishing features: Sweetness=1..3, Honey=0..2, Winey=0..2, Nutty=0..2, Malty=0..2, Floral=0..2
…
Image source: Wikipedia

Slide 71

Slide 71 text

Scaling clustering: Running examples with BeakerX

Slide 72

Slide 72 text

Scaling clustering: Running examples with GitPod

Slide 73

Slide 73 text

THANK YOU
Twitter: @paulk_asert
Mastodon: @[email protected]
Apache Groovy: https://groovy.apache.org/ https://groovy-lang.org/
Apache Ignite: https://ignite.apache.org/
Repo: https://github.com/paulk-asert/groovy-data-science
© 2023 Unity Foundation. All rights reserved.