Slide 1

Slide 1 text

objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved. No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI) Whiskey Clustering with Groovy & Apache Ignite Presented by Paul King ASF Groovy PMC & Unity Foundation ASF – Software for the public good Unity – Open systems for sharing data and benefitting underserved populations © 2023 Unity Foundation. All rights reserved. See also: https://github.com/paulk-asert/groovy-data-science unityfoundation.io Halifax, Nova Scotia, October 7-10, 2023

Slide 2

Slide 2 text

Dr Paul King Unity Foundation Groovy Lead V.P. Apache Groovy Author: https://www.manning.com/books/groovy-in-action-second-edition Slides: https://speakerdeck.com/paulk/whiskey-groovy-ignite Examples repo: https://github.com/paulk-asert/groovy-data-science Twitter/X | Mastodon: @paulk_asert | @[email protected]

Slide 3

Slide 3 text

3 Whiskey Clustering with Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 4

Slide 4 text

Apache Groovy Programming Language • Multi-faceted extensible language • Imperative/OO & functional • Dynamic & static • Aligned closely with Java • 19+ years since inception • ~2.5B downloads (partial count) • ~500 contributors • 200+ releases • https://www.youtube.com/watch?v=eIGOG- F9ZTw&feature=youtu.be

Slide 5

Slide 5 text

Friends of Apache Groovy Open Collective

Slide 6

Slide 6 text

What is Groovy? It’s like a super version of Java: • Supports most Java syntax but allows simpler syntax for many constructs • Supports all Java libraries but provides many extensions and its own productivity libraries • Has both a static and dynamic nature • Extensible language and tooling Java Groovy

Slide 7

Slide 7 text

Why use Groovy in 2023? It’s still like a super version of Java: • Simpler scripting • Metaprogramming: runtime, compile-time, extension methods, AST transforms • Language features: power assert, powerful switch, traits, closures • Static and dynamic nature • Productivity libraries for common tasks • Extensibility: language, tooling, type checker Java Groovy Let’s look at just two features that reduce boilerplate code

Slide 8

Slide 8 text

Simpler scripting: Java7+ import java.util.List; import java.util.ArrayList; class Main { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } }

Slide 9

Slide 9 text

Simpler scripting: Java21+ (with JEP 445 & preview) import java.util.List; import java.util.ArrayList; class Main { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } } import java.util.List; void main() { var names = List.of("Ted", "Fred", "Jed", "Ned"); System.out.println(names); var shortNames = names.stream().filter(n -> n.length() < 4).toList(); System.out.println(shortNames.size()); shortNames.forEach(System.out::println); }

Slide 10

Slide 10 text

Simpler scripting: JDK5+/Groovy 1+ import java.util.List; import java.util.ArrayList; class Main { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } } import java.util.List; void main() { var names = List.of("Ted", "Fred", "Jed", "Ned"); System.out.println(names); var shortNames = names.stream().filter(n -> n.length() < 4).toList(); System.out.println(shortNames.size()); shortNames.forEach(System.out::println); } names = ["Ted", "Fred", "Jed", "Ned"] println names shortNames = names.findAll{ it.size() < 4 } println shortNames.size() shortNames.each{ println it }

Slide 11

Slide 11 text

Simpler scripting: DSL/command chain support import java.util.List; import java.util.ArrayList; class Main { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } } import java.util.List; void main() { var names = List.of("Ted", "Fred", "Jed", "Ned"); System.out.println(names); var shortNames = names.stream().filter(n -> n.length() < 4).toList(); System.out.println(shortNames.size()); shortNames.forEach(System.out::println); } names = ["Ted", "Fred", "Jed", "Ned"] println names shortNames = names.findAll{ it.size() < 4 } println shortNames.size() shortNames.each{ println it } given the names "Ted", "Fred", "Jed" and "Ned" display all the names display the number of names having size less than 4 display the names having size less than 4

Slide 12

Slide 12 text

Scripting for Data Science • Same example • Same library Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}} import org.apache.commons.math3.linear.*; public class MatrixMain { public static void main(String[] args) { double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}}; RealMatrix m = MatrixUtils.createRealMatrix(matrixData); double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}}; RealMatrix n = new Array2DRowRealMatrix(matrixData2); RealMatrix o = m.multiply(n); // Invert o, using LU decomposition RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse(); RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d); RealMatrix q = o.add(p.power(2)); System.out.println(q); } } Thanks to operator overloading and extensible tooling

Slide 13

Slide 13 text

Scripting in Notebooks • Jupyter/beakerx • Apache Zeppelin • GroovyLab • Seco

Slide 14

Slide 14 text

Metaprogramming // imports not shown public class Book { private String $to$string; private int $hash$code; private final List authors; private final String title; private final Date publicationDate; private static final java.util.Comparator this$TitleComparator; private static final java.util.Comparator this$PublicationDateComparator; public Book(List authors, String title, Date publicationDate) { if (authors == null) { this.authors = null; } else { if (authors instanceof Cloneable) { List authorsCopy = (List) ((ArrayList>) authors).clone(); this.authors = (List) (authorsCopy instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Set ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Map ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof List ? DefaultGroovyMethods.asImmutable(authorsCopy) : DefaultGroovyMethods.asImmutable(authorsCopy)); } else { this.authors = (List) (authors instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Set ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Map ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof List ? DefaultGroovyMethods.asImmutable(authors) : DefaultGroovyMethods.asImmutable(authors)); } } this.title = title; if (publicationDate == null) { this.publicationDate = null; } else { this.publicationDate = (Date) publicationDate.clone(); } } public Book(Map args) { if ( args == null) { args = new HashMap(); } ImmutableASTTransformation.checkPropNames(this, args); if (args.containsKey("authors")) { if ( args.get("authors") == null) { this .authors = null; } else { if (args.get("authors") instanceof Cloneable) { List authorsCopy = (List) ((ArrayList>) args.get("authors")).clone(); this.authors = (List) (authorsCopy instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Set ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Map ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof List ? DefaultGroovyMethods.asImmutable(authorsCopy) : DefaultGroovyMethods.asImmutable(authorsCopy)); } else { List authors = (List) args.get("authors"); this.authors = (List) (authors instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Set ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Map ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof List ? DefaultGroovyMethods.asImmutable(authors) : DefaultGroovyMethods.asImmutable(authors)); } } } else { this .authors = null; } if (args.containsKey("title")) {this .title = (String) args.get("title"); } else { this .title = null;} if (args.containsKey("publicationDate")) { if (args.get("publicationDate") == null) { this.publicationDate = null; } else { this.publicationDate = (Date) ((Date) args.get("publicationDate")).clone(); } } else {this.publicationDate = null; } } … public Book() { this (new HashMap()); } public int compareTo(Book other) { if (this == other) { return 0; } Integer value = 0 value = this .title <=> other .title if ( value != 0) { return value } value = this .publicationDate <=> other .publicationDate if ( value != 0) { return value } return 0 } public static Comparator comparatorByTitle() { return this$TitleComparator; } public static Comparator comparatorByPublicationDate() { return this$PublicationDateComparator; } public String toString() { StringBuilder _result = new StringBuilder(); boolean $toStringFirst = true; _result.append("Book("); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getAuthors())); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getTitle())); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getPublicationDate())); _result.append(")"); if ($to$string == null) { $to$string = _result.toString(); } return $to$string; } public int hashCode() { if ( $hash$code == 0) { int _result = HashCodeHelper.initHash(); if (!(this.getAuthors().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getAuthors()); } if (!(this.getTitle().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getTitle()); } if (!(this.getPublicationDate().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getPublicationDate()); } $hash$code = (int) _result; } return $hash$code; } public boolean canEqual(Object other) { return other instanceof Book; } … public boolean equals(Object other) { if ( other == null) { return false; } if (this == other) { return true; } if (!( other instanceof Book)) { return false; } Book otherTyped = (Book) other; if (!(otherTyped.canEqual( this ))) { return false; } if (!(this.getAuthors() == otherTyped.getAuthors())) { return false; } if (!(this.getTitle().equals(otherTyped.getTitle()))) { return false; } if (!(this.getPublicationDate().equals(otherTyped.getPublicationDate()))) { return false; } return true; } public final Book copyWith(Map map) { if (map == null || map.size() == 0) { return this; } Boolean dirty = false; HashMap construct = new HashMap(); if (map.containsKey("authors")) { Object newValue = map.get("authors"); Object oldValue = this.getAuthors(); if (newValue != oldValue) { oldValue = newValue; dirty = true; } construct.put("authors", oldValue); } else { construct.put("authors", this.getAuthors()); } if (map.containsKey("title")) { Object newValue = map.get("title"); Object oldValue = this.getTitle(); if (newValue != oldValue) { oldValue = newValue; dirty = true; } construct.put("title", oldValue); } else { construct.put("title", this.getTitle()); } if (map.containsKey("publicationDate")) { Object newValue = map.get("publicationDate"); Object oldValue = this.getPublicationDate(); if (newValue != oldValue) { oldValue = newValue; dirty = true; } construct.put("publicationDate", oldValue); } else { construct.put("publicationDate", this.getPublicationDate()); } return dirty == true ? new Book(construct) : this; } public void writeExternal(ObjectOutput out) throws IOException { out.writeObject(authors); out.writeObject(title); out.writeObject(publicationDate); } public void readExternal(ObjectInput oin) throws IOException, ClassNotFoundException { authors = (List) oin.readObject(); title = (String) oin.readObject(); publicationDate = (Date) oin.readObject(); } … static { this$TitleComparator = new Book$TitleComparator(); this$PublicationDateComparator = new Book$PublicationDateComparator(); } public String getAuthors(int index) { return authors.get(index); } public List getAuthors() { return authors; } public final String getTitle() { return title; } public final Date getPublicationDate() { if (publicationDate == null) { return publicationDate; } else { return (Date) publicationDate.clone(); } } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } private static class Book$TitleComparator extends AbstractComparator { public Book$TitleComparator() { } public int compare(Book arg0, Book arg1) { if (arg0 == arg1) { return 0; } if (arg0 != null && arg1 == null) { return -1; } if (arg0 == null && arg1 != null) { return 1; } return arg0.title <=> arg1.title; } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } } private static class Book$PublicationDateComparator extends AbstractComparator { public Book$PublicationDateComparator() { } public int compare(Book arg0, Book arg1) { if ( arg0 == arg1 ) { return 0; } if ( arg0 != null && arg1 == null) { return -1; } if ( arg0 == null && arg1 != null) { return 1; } return arg0 .publicationDate <=> arg1 .publicationDate; } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } @Immutable(copyWith = true) @Sortable(excludes = 'authors') @AutoExternalize class Book { @IndexedProperty List authors String title Date publicationDate }

Slide 15

Slide 15 text

AST Transformations: Groovy 2.4, 2.5, 2.5 (improved), 3.0, 4.0, 5.0 • 80 AST transforms @NonSealed @RecordBase @Sealed @PlatformLog @GQ @Final @RecordType @POJO @Pure @Contracted @Ensures @Invariant @Requires @ClassInvariant @ContractElement @Postcondition @Precondition @OperatorRename

Slide 16

Slide 16 text

1 6 Whiskey Clustering with Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 17

Slide 17 text

Scaling up machine learning: Apache Ignite Apache Ignite is a distributed database for high-performance computing with in-memory speed. In simple terms, it makes a cluster (or grid) of nodes appear like an in-memory cache. Ignite can be used as: • an in-memory cache with special features like SQL querying and transactional properties • an in-memory data-grid with advanced read-through & write-through capabilities on top of one or more distributed databases • an ultra-fast and horizontally scalable in-memory database • a high-performance computing engine for custom or built-in tasks including machine learning It is mostly this last capability that we will use. Ignite’s Machine Learning API has purpose built, cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation, among others. We’ll mostly use the distributed K-means Clustering algorithm from their library.

Slide 18

Slide 18 text

1 8 Whiskey Clustering with Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 19

Slide 19 text

Data Science Process Research Goals Obtain Data Data Preparation Data Exploration Visualization Data Modeling Data ingestion Data storage Data processing platforms Modeling algorithms Math libraries Graphics processing Integration Deployment

Slide 20

Slide 20 text

Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

Slide 21

Slide 21 text

Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

Slide 22

Slide 22 text

2 2 Whiskey Clustering with Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 23

Slide 23 text

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies • 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/

Slide 24

Slide 24 text

Clustering case study: Whiskey flavor profiles RowID,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral … 34,GlenElgin,2,3,1,0,0,2,1,1,1,1,2,3 35,GlenGarioch,2,1,3,0,0,0,3,1,0,2,2,2 36,GlenGrant,1,2,0,0,0,1,0,1,2,1,2,1 37,GlenKeith,2,3,1,0,0,1,2,1,2,1,2,1 38,GlenMoray,1,2,1,0,0,1,2,1,2,2,2,4 39,GlenOrd,3,2,1,0,0,1,2,1,1,2,2,2 40,GlenScotia,2,2,2,2,0,1,0,1,2,2,1,1 41,GlenSpey,1,3,1,0,0,0,1,1,1,2,0,2 42,Glenallachie,1,3,1,0,0,1,1,0,1,2,2,2 …

Slide 25

Slide 25 text

Whiskey – exploring with Dex

Slide 26

Slide 26 text

Clustering Overview Clustering: • Grouping similar items Algorithm families: • Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging

Slide 27

Slide 27 text

Clustering Overview Clustering: • Grouping similar items Algorithm families: • Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging

Slide 28

Slide 28 text

Clustering https://commons.apache.org/proper/commons-math/userguide/ml.html

Slide 29

Slide 29 text

Clustering with KMeans Step 1: • Guess k cluster centroids at random

Slide 30

Slide 30 text

Clustering with KMeans Step 1: • Guess k cluster centroids

Slide 31

Slide 31 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid

Slide 32

Slide 32 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid

Slide 33

Slide 33 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 34

Slide 34 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 35

Slide 35 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 36

Slide 36 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 37

Slide 37 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 38

Slide 38 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 39

Slide 39 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 40

Slide 40 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 41

Slide 41 text

import … def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def numClusters = 5 def loader = new CSVLoader(file: 'whiskey.csv') def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true) def instances = loader.dataSet instances.deleteAttributeAt(0) // remove RowID clusterer.buildClusterer(instances) println ' ' + cols.join(', ') def dataset = new DefaultCategoryDataset() clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num -> print "Cluster ${num+1}: " println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', ')) (1..cols.size()).each { idx -> dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) } } def clusters = (0.. clusters[cnum] << instances.get(idx).stringValue(0) } clusters.each { k, v -> println "Cluster ${k+1}:" println v.join(', ') } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and weka Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000 Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591 Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591 Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133 Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714 Cluster 1: Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker Cluster 2: Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla Cluster 3: ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith, GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff, Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore Cluster 4: AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain, Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood, Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul Cluster 5: Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn, OldPulteney, RoyalBrackla, Tullibardine

Slide 42

Slide 42 text

import … def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv')) def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def clusterer = new KMeansPlusPlusClusterer(5) def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) } def centroids = clusterer.cluster(data) println cols.join(', ') + ', Medoid' def dataset = new DefaultCategoryDataset() centroids.eachWithIndex{ ctrd, num -> def cpt = ctrd.center.point def closest = ctrd.points.min{ pt -> sumSq((0.. cols.collect{ row[it] as double } == closest.point }?.Distillery println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid" cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) } } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and medoids Libraries: Apache Commons Math and JFreeChart Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid 2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd 2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy 2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish 1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore 1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach

Slide 43

Slide 43 text

Dimensionality reduction

Slide 44

Slide 44 text

import … def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0.. def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..

Slide 45

Slide 45 text

Whiskey – Exploring Weka clustering algorithms

Slide 46

Slide 46 text

Whiskey – clustering and visualizing centroids … def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 3 def projected = pca.project(data) def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) } table = table.addColumns( *(0..<3).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0.. toAdd[0].setString("Cluster", "Cluster " + (idx+1)) (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) } toAdd[0].setDouble("Centroid", 50) table.append(toAdd) } def title = "Clusters x Principal Components w/ centroids" Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))

Slide 47

Slide 47 text

4 8 Whiskey Clustering with Groovy & Apache Ignite • Apache Groovy • Apache Ignite • Data Science • Whiskey Clustering & Visualization • Scaling Whiskey Clustering

Slide 48

Slide 48 text

Clustering case study: Whiskey flavor profiles • Distributed clustering?

Slide 49

Slide 49 text

Clustering case study: Whiskey flavor profiles Node 1 Node 2

Slide 50

Slide 50 text

Clustering case study: Whiskey flavor profiles Node 1 Node 2

Slide 51

Slide 51 text

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies • 12 flavor categories • Apache Ignite has special capabilities for reading data into the cache • In a cluster environment, use IgniteDataStreamer or IgniteCache.loadCache() to load data from files, stream sources, database sources, etc. • For our little example, we have a small CSV file and a single node, so we’ll just read our data using Apache Commons CSV

Slide 52

Slide 52 text

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies • 12 flavor categories • Let’s select the regions of interest

Slide 53

Slide 53 text

Clustering case study: Whiskey flavor profiles • Read CSV rows • Slice out segments of interest 0 1 2 -1 0 1 … … distilleries data features var file = getClass().classLoader.getResource('whiskey.csv').file as File var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() } var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][] var distilleries = rows[1..-1]*.get(1) var features = rows[0][2..-1]

Slide 54

Slide 54 text

Clustering case study: Whiskey flavor profiles • Set up configuration & define some helper variables // configure to all run on local machine but could be a cluster (can be hidden in XML) var cfg = new IgniteConfiguration( peerClassLoadingEnabled: true, discoverySpi: new TcpDiscoverySpi( ipFinder: new TcpDiscoveryMulticastIpFinder( addresses: ['127.0.0.1:47500..47509'] ) ) ) var pretty = this.&sprintf.curry('%.4f') var dist = new EuclideanDistance() // or ManhattanDistance var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)

Slide 55

Slide 55 text

Whiskey flavors – scaling clustering Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" var dataCache = ignite.createCache(new CacheConfiguration( name: "TEST_${UUID.randomUUID()}", affinity: new RendezvousAffinityFunction(false, 10))) data.indices.each { int i -> dataCache.put(i, data[i]) } var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5) var mdl = trainer.fit(ignite, dataCache, vectorizer) println ">>> KMeans centroids:\n${features.join(', ')}" var centroids = mdl.centers*.all() var cols = centroids.collect{ it*.get() } cols.each { c -> println c.collect(pretty).join(', ') } dataCache.destroy() }

Slide 56

Slide 56 text

Whiskey flavors – scaling clustering Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" var dataCache = ignite.createCache(new CacheConfiguration( name: "TEST_${UUID.randomUUID()}", affinity: new RendezvousAffinityFunction(false, 10))) data.indices.each { int i -> dataCache.put(i, data[i]) } var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5) var mdl = trainer.fit(ignite, dataCache, vectorizer) println ">>> KMeans centroids:\n${features.join(', ')}" var centroids = mdl.centers*.all() var cols = centroids.collect{ it*.get() } cols.each { c -> println c.collect(pretty).join(', ') } dataCache.destroy() } [11:48:48] __________ ________________ [11:48:48] / _/ ___/ |/ / _/_ __/ __/ [11:48:48] _/ // (7 7 // / / / / _/ [11:48:48] /___/\___/_/|_/___/ /_/ /x___/ [11:48:48] [11:48:48] ver. 2.15.0#20230425-sha1:f98f7f35 [11:48:48] 2023 Copyright(C) Apache Software Foundation … >>> Ignite grid started for data: 86 rows X 12 cols >>> KMeans centroids: Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral 2.3793, 1.0345, 0.2414, 0.0345, 0.8966, 1.1034, 0.5517, 1.5517, 1.6207, 2.1724, 2.1379 2.5556, 1.4444, 0.0556, 0.0000, 1.8333, 1.6667, 2.3333, 2.0000, 2.0000, 2.2222, 1.5556 3.1429, 1.0000, 0.2857, 0.1429, 0.8571, 0.5714, 0.7143, 0.7143, 1.5714, 0.7143, 1.5714 2.0476, 1.7619, 0.3333, 0.1429, 1.7619, 1.7619, 0.7143, 1.0952, 2.1429, 1.6190, 1.8571 1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455

Slide 57

Slide 57 text

Whiskey flavors – scaling clustering … var clusters = [:].withDefault{ [] } dataCache.query(new ScanQuery()).withCloseable { observations -> observations.each { observation -> def (k, v) = observation.with{ [getKey(), getValue()] } int prediction = mdl.predict(vectorizer.extractFeatures(k, v)) clusters[prediction] += distilleries[k] } } clusters.sort{ e -> e.key }.each{ k, v -> println "Cluster ${k+1}: ${v.join(', ')}" } … … Cluster 1: Bunnahabhain, Dufftown, Glenmorangie, Teaninich, Glenallachie, Longmorn, Scapa, Tobermory, AnCnoc, Cardhu, GlenElgin, Mannochmore, Speyside, Craigganmore, GlenGrant, Tullibardine, Auchentoshan, Bladnoch, GlenKeith, Glengoyne, Knochando, Strathmill, GlenMoray, Aultmore, Tamdhu, Balblair, Glenlossie, Linkwood, Tamnavulin Cluster 2: Aberfeldy, Balmenach, RoyalLochnagar, Aberlour, Edradour, Glenrothes, Glendronach, Glenturret, Macallan, Glendullan, Glenfarclas, Mortlach, Strathisla, Dailuaine, Auchroisk, BlairAthol, Dalmore, Glenlivet Cluster 3: GlenSpey, GlenDeveronMacduff, Speyburn, Miltonduff, Tomore, ArranIsleOf, Glenfiddich Cluster 4: Loch Lomond, Belvenie, BenNevis, Tomatin, Benriach, Highland Park, Tomintoul, Ardmore, Benrinnes, Craigallechie, GlenGarioch, Inchgower, Benromach, Glenkinchie, OldFettercairn, Bowmore, Dalwhinnie, GlenOrd, Bruichladdich, Deanston, RoyalBrackla Cluster 5: Caol Ila, Ardbeg, Clynelish, Springbank, Isle of Jura, Oban, Lagavulin, Talisker, Laphroig, OldPulteney, GlenScotia …

Slide 58

Slide 58 text

Whiskey flavors – scaling clustering var dist = new EuclideanDistance() … Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" var dataCache = ignite.createCache(new CacheConfiguration( name: "TEST_${UUID.randomUUID()}", affinity: new RendezvousAffinityFunction(false, 10))) data.indices.each { int i -> dataCache.put(i, data[i]) } var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5) var mdl = trainer.fit(ignite, dataCache, vectorizer) println ">>> KMeans centroids:\n${features.join(', ')}" var centroids = mdl.centers*.all() var cols = centroids.collect{ it*.get() } cols.each { c -> println c.collect(pretty).join(', ') } dataCache.destroy() } 5

Slide 59

Slide 59 text

Whiskey flavors – scaling clustering var dist = new ManhattanDistance() … Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" var dataCache = ignite.createCache(new CacheConfiguration( name: "TEST_${UUID.randomUUID()}", affinity: new RendezvousAffinityFunction(false, 10))) data.indices.each { int i -> dataCache.put(i, data[i]) } var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5) var mdl = trainer.fit(ignite, dataCache, vectorizer) println ">>> KMeans centroids:\n${features.join(', ')}" var centroids = mdl.centers*.all() var cols = centroids.collect{ it*.get() } cols.each { c -> println c.collect(pretty).join(', ') } dataCache.destroy() } 4 3 3 + 4 = 7

Slide 60

Slide 60 text

Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" var dataCache = ignite.createCache(new CacheConfiguration( name: "TEST_${UUID.randomUUID()}", affinity: new RendezvousAffinityFunction(false, 10))) data.indices.each { int i -> dataCache.put(i, data[i]) } var trainer = new GmmTrainer().withMaxCountOfClusters(5) var mdl = trainer.fit(ignite, dataCache, vectorizer) … dataCache.destroy() } Whiskey flavors – scaling clustering Image source: wikipedia

Slide 61

Slide 61 text

6 2 Whiskey Clustering with Apache Groovy & Apache Ignite Twitter: Mastodon: Apache Groovy: Apache Ignite: Repo: @paulk_asert @[email protected] https://groovy.apache.org/ https://groovy-lang.org/ https://ignite.apache.org/ https://github.com/paulk-asert/groovy-data-science