
Whisky clustering with Apache Groovy and Apache Ignite

paulking
October 12, 2023


These slides look at using Apache Groovy with Apache Ignite's distributed K-Means clustering algorithm to cluster Whisky profiles. Ignite helps you scale your data, computation and machine learning applications. Groovy simplifies your data science code.




Transcript

  1. objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved.

    No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI)

    Whiskey Clustering with Groovy & Apache Ignite
    Presented by Paul King, ASF Groovy PMC & Unity Foundation
    ASF – Software for the public good
    Unity – Open systems for sharing data and benefitting underserved populations
    © 2023 Unity Foundation. All rights reserved.
    See also: https://github.com/paulk-asert/groovy-data-science
    unityfoundation.io
    Halifax, Nova Scotia, October 7-10, 2023
  2. Dr Paul King, Unity Foundation Groovy Lead, V.P. Apache Groovy

    Author: https://www.manning.com/books/groovy-in-action-second-edition
    Slides: https://speakerdeck.com/paulk/whiskey-groovy-ignite
    Examples repo: https://github.com/paulk-asert/groovy-data-science
    Twitter/X | Mastodon: @paulk_asert | @[email protected]
  3. Whiskey Clustering with Groovy & Apache Ignite
    • Apache Groovy
    • Apache Ignite
    • Data Science
    • Whiskey Clustering & Visualization
    • Scaling Whiskey Clustering
  4. Apache Groovy Programming Language
    • Multi-faceted extensible language
    • Imperative/OO & functional
    • Dynamic & static
    • Aligned closely with Java
    • 19+ years since inception
    • ~2.5B downloads (partial count)
    • ~500 contributors
    • 200+ releases
    • https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be
  5. What is Groovy? It’s like a super version of Java:
    • Supports most Java syntax but allows simpler syntax for many constructs
    • Supports all Java libraries but provides many extensions and its own productivity libraries
    • Has both a static and dynamic nature
    • Extensible language and tooling
    (diagram labels: Java, Groovy)
  6. Why use Groovy in 2023? It’s still like a super version of Java:
    • Simpler scripting
    • Metaprogramming: runtime, compile-time, extension methods, AST transforms
    • Language features: power assert, powerful switch, traits, closures
    • Static and dynamic nature
    • Productivity libraries for common tasks
    • Extensibility: language, tooling, type checker
    (diagram labels: Java, Groovy)
    Let’s look at just two features that reduce boilerplate code
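    As an aside, the power assert mentioned in the feature list is easy to demonstrate. A minimal sketch (not from the slides): a failing assertion prints the value of every sub-expression.

    def names = ['Ted', 'Fred', 'Jed', 'Ned']
    assert names.findAll { it.size() < 4 }.size() == 4
    // fails and prints something like:
    // assert names.findAll { it.size() < 4 }.size() == 4
    //        |     |                         |      |
    //        |     [Ted, Jed, Ned]           3      false
    //        [Ted, Fred, Jed, Ned]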
  7. Simpler scripting: Java7+

    import java.util.List;
    import java.util.ArrayList;

    class Main {
        private List keepShorterThan(List strings, int length) {
            List result = new ArrayList();
            for (int i = 0; i < strings.size(); i++) {
                String s = (String) strings.get(i);
                if (s.length() < length) {
                    result.add(s);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            List names = new ArrayList();
            names.add("Ted");
            names.add("Fred");
            names.add("Jed");
            names.add("Ned");
            System.out.println(names);
            Main m = new Main();
            List shortNames = m.keepShorterThan(names, 4);
            System.out.println(shortNames.size());
            for (int i = 0; i < shortNames.size(); i++) {
                String s = (String) shortNames.get(i);
                System.out.println(s);
            }
        }
    }
  8. Simpler scripting: Java21+ (with JEP 445 & preview)
    (the Java7+ code from slide 7 is shown again for comparison)

    import java.util.List;

    void main() {
        var names = List.of("Ted", "Fred", "Jed", "Ned");
        System.out.println(names);
        var shortNames = names.stream().filter(n -> n.length() < 4).toList();
        System.out.println(shortNames.size());
        shortNames.forEach(System.out::println);
    }
  9. Simpler scripting: JDK5+/Groovy 1+
    (the Java versions from slides 7 and 8 are shown again for comparison)

    names = ["Ted", "Fred", "Jed", "Ned"]
    println names
    shortNames = names.findAll{ it.size() < 4 }
    println shortNames.size()
    shortNames.each{ println it }
  10. Simpler scripting: DSL/command chain support
    (the Java and Groovy versions from slides 7-9 are shown again for comparison)

    given the names "Ted", "Fred", "Jed" and "Ned"
    display all the names
    display the number of names having size less than 4
    display the names having size less than 4
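    One way (a sketch only, not the slide's actual DSL implementation) to back English-like lines such as those above with ordinary Groovy code is a command chain, where a statement like a b c d parses as a(b).c(d). The NameQuery helper below is hypothetical:

    // hypothetical helper backing a tiny command-chain DSL
    class NameQuery {
        List names
        def display(Closure filter) { println names.findAll(filter) }
    }
    def given(List names) { new NameQuery(names: names) }

    def names = ['Ted', 'Fred', 'Jed', 'Ned']
    given names display { it.size() < 4 }   // parsed as given(names).display({ ... })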
  11. Scripting for Data Science
    • Same example
    • Same library

    import org.apache.commons.math3.linear.*;

    public class MatrixMain {
        public static void main(String[] args) {
            double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d} };
            RealMatrix m = MatrixUtils.createRealMatrix(matrixData);
            double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d,7d} };
            RealMatrix n = new Array2DRowRealMatrix(matrixData2);
            RealMatrix o = m.multiply(n);
            // Invert o, using LU decomposition
            RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();
            RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);
            RealMatrix q = o.add(p.power(2));
            System.out.println(q);
        }
    }

    Output: Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}

    Thanks to operator overloading and extensible tooling
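    For comparison, a hedged Groovy sketch of the same calculation (not necessarily the code behind the slide): Groovy maps * to multiply() and ** to power(), and property syntax replaces the get...() calls; add() stays a method call since RealMatrix has no plus() method.

    import org.apache.commons.math3.linear.*

    def m = MatrixUtils.createRealMatrix([[1d, 2d, 3d], [2d, 5d, 3d]] as double[][])
    def n = new Array2DRowRealMatrix([[1d, 2d], [2d, 5d], [1d, 7d]] as double[][])
    def o = m * n                                         // multiply()
    def oInverse = new LUDecomposition(o).solver.inverse  // getSolver().getInverse()
    def p = oInverse.scalarAdd(1d).scalarMultiply(2d)
    def q = o.add(p ** 2)                                 // add(power(2))
    println q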
  12. Metaprogramming // imports not shown public class Book { private

    String $to$string; private int $hash$code; private final List<String> authors; private final String title; private final Date publicationDate; private static final java.util.Comparator this$TitleComparator; private static final java.util.Comparator this$PublicationDateComparator; public Book(List<String> authors, String title, Date publicationDate) { if (authors == null) { this.authors = null; } else { if (authors instanceof Cloneable) { List<String> authorsCopy = (List<String>) ((ArrayList<?>) authors).clone(); this.authors = (List<String>) (authorsCopy instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Set ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Map ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof List ? DefaultGroovyMethods.asImmutable(authorsCopy) : DefaultGroovyMethods.asImmutable(authorsCopy)); } else { this.authors = (List<String>) (authors instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Set ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Map ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof List ? DefaultGroovyMethods.asImmutable(authors) : DefaultGroovyMethods.asImmutable(authors)); } } this.title = title; if (publicationDate == null) { this.publicationDate = null; } else { this.publicationDate = (Date) publicationDate.clone(); } } public Book(Map args) { if ( args == null) { args = new HashMap(); } ImmutableASTTransformation.checkPropNames(this, args); if (args.containsKey("authors")) { if ( args.get("authors") == null) { this .authors = null; } else { if (args.get("authors") instanceof Cloneable) { List<String> authorsCopy = (List<String>) ((ArrayList<?>) args.get("authors")).clone(); this.authors = (List<String>) (authorsCopy instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Set ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof Map ? DefaultGroovyMethods.asImmutable(authorsCopy) : authorsCopy instanceof List ? DefaultGroovyMethods.asImmutable(authorsCopy) : DefaultGroovyMethods.asImmutable(authorsCopy)); } else { List<String> authors = (List<String>) args.get("authors"); this.authors = (List<String>) (authors instanceof SortedSet ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof SortedMap ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Set ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof Map ? DefaultGroovyMethods.asImmutable(authors) : authors instanceof List ? 
DefaultGroovyMethods.asImmutable(authors) : DefaultGroovyMethods.asImmutable(authors)); } } } else { this .authors = null; } if (args.containsKey("title")) {this .title = (String) args.get("title"); } else { this .title = null;} if (args.containsKey("publicationDate")) { if (args.get("publicationDate") == null) { this.publicationDate = null; } else { this.publicationDate = (Date) ((Date) args.get("publicationDate")).clone(); } } else {this.publicationDate = null; } } … public Book() { this (new HashMap()); } public int compareTo(Book other) { if (this == other) { return 0; } Integer value = 0 value = this .title <=> other .title if ( value != 0) { return value } value = this .publicationDate <=> other .publicationDate if ( value != 0) { return value } return 0 } public static Comparator comparatorByTitle() { return this$TitleComparator; } public static Comparator comparatorByPublicationDate() { return this$PublicationDateComparator; } public String toString() { StringBuilder _result = new StringBuilder(); boolean $toStringFirst = true; _result.append("Book("); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getAuthors())); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getTitle())); if ($toStringFirst) { $toStringFirst = false; } else { _result.append(", "); } _result.append(InvokerHelper.toString(this.getPublicationDate())); _result.append(")"); if ($to$string == null) { $to$string = _result.toString(); } return $to$string; } public int hashCode() { if ( $hash$code == 0) { int _result = HashCodeHelper.initHash(); if (!(this.getAuthors().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getAuthors()); } if (!(this.getTitle().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getTitle()); } if (!(this.getPublicationDate().equals(this))) { _result = HashCodeHelper.updateHash(_result, this.getPublicationDate()); } $hash$code = (int) _result; } return $hash$code; } public boolean canEqual(Object other) { return other instanceof Book; } … public boolean equals(Object other) { if ( other == null) { return false; } if (this == other) { return true; } if (!( other instanceof Book)) { return false; } Book otherTyped = (Book) other; if (!(otherTyped.canEqual( this ))) { return false; } if (!(this.getAuthors() == otherTyped.getAuthors())) { return false; } if (!(this.getTitle().equals(otherTyped.getTitle()))) { return false; } if (!(this.getPublicationDate().equals(otherTyped.getPublicationDate()))) { return false; } return true; } public final Book copyWith(Map map) { if (map == null || map.size() == 0) { return this; } Boolean dirty = false; HashMap construct = new HashMap(); if (map.containsKey("authors")) { Object newValue = map.get("authors"); Object oldValue = this.getAuthors(); if (newValue != oldValue) { oldValue = newValue; dirty = true; } construct.put("authors", oldValue); } else { construct.put("authors", this.getAuthors()); } if (map.containsKey("title")) { Object newValue = map.get("title"); Object oldValue = this.getTitle(); if (newValue != oldValue) { oldValue = newValue; dirty = true; } construct.put("title", oldValue); } else { construct.put("title", this.getTitle()); } if (map.containsKey("publicationDate")) { Object newValue = map.get("publicationDate"); Object oldValue = this.getPublicationDate(); if (newValue != oldValue) { oldValue = newValue; dirty = true; } construct.put("publicationDate", 
oldValue); } else { construct.put("publicationDate", this.getPublicationDate()); } return dirty == true ? new Book(construct) : this; } public void writeExternal(ObjectOutput out) throws IOException { out.writeObject(authors); out.writeObject(title); out.writeObject(publicationDate); } public void readExternal(ObjectInput oin) throws IOException, ClassNotFoundException { authors = (List) oin.readObject(); title = (String) oin.readObject(); publicationDate = (Date) oin.readObject(); } … static { this$TitleComparator = new Book$TitleComparator(); this$PublicationDateComparator = new Book$PublicationDateComparator(); } public String getAuthors(int index) { return authors.get(index); } public List<String> getAuthors() { return authors; } public final String getTitle() { return title; } public final Date getPublicationDate() { if (publicationDate == null) { return publicationDate; } else { return (Date) publicationDate.clone(); } } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } private static class Book$TitleComparator extends AbstractComparator<Book> { public Book$TitleComparator() { } public int compare(Book arg0, Book arg1) { if (arg0 == arg1) { return 0; } if (arg0 != null && arg1 == null) { return -1; } if (arg0 == null && arg1 != null) { return 1; } return arg0.title <=> arg1.title; } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } } private static class Book$PublicationDateComparator extends AbstractComparator<Book> { public Book$PublicationDateComparator() { } public int compare(Book arg0, Book arg1) { if ( arg0 == arg1 ) { return 0; } if ( arg0 != null && arg1 == null) { return -1; } if ( arg0 == null && arg1 != null) { return 1; } return arg0 .publicationDate <=> arg1 .publicationDate; } public int compare(java.lang.Object param0, java.lang.Object param1) { return -1; } @Immutable(copyWith = true) @Sortable(excludes = 'authors') @AutoExternalize class Book { @IndexedProperty List<String> authors String title Date publicationDate }
  13. AST Transformations: Groovy 2.4, 2.5, 2.5 (improved), 3.0, 4.0, 5.0

    • 80 AST transforms @NonSealed @RecordBase @Sealed @PlatformLog @GQ @Final @RecordType @POJO @Pure @Contracted @Ensures @Invariant @Requires @ClassInvariant @ContractElement @Postcondition @Precondition @OperatorRename
  14. Whiskey Clustering with Groovy & Apache Ignite
    • Apache Groovy
    • Apache Ignite
    • Data Science
    • Whiskey Clustering & Visualization
    • Scaling Whiskey Clustering
  15. Scaling up machine learning: Apache Ignite
    Apache Ignite is a distributed database for high-performance computing with in-memory speed. In simple terms, it makes a cluster (or grid) of nodes appear like an in-memory cache.
    Ignite can be used as:
    • an in-memory cache with special features like SQL querying and transactional properties
    • an in-memory data-grid with advanced read-through & write-through capabilities on top of one or more distributed databases
    • an ultra-fast and horizontally scalable in-memory database
    • a high-performance computing engine for custom or built-in tasks including machine learning
    It is mostly this last capability that we will use. Ignite’s Machine Learning API has purpose-built, cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation, among others. We’ll mostly use the distributed K-means Clustering algorithm from their library.
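    As a tiny taste of the "cluster of nodes as a cache" idea, a sketch using default configuration (not the talk's code; the later slides configure discovery explicitly, and 'demo' is just a hypothetical cache name):

    import org.apache.ignite.Ignition

    Ignition.start().withCloseable { ignite ->
        def cache = ignite.getOrCreateCache('demo')  // treat the grid like a map
        cache.put(1, 'Lagavulin')
        assert cache.get(1) == 'Lagavulin'
    }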
  16. Whiskey Clustering with Groovy & Apache Ignite
    • Apache Groovy
    • Apache Ignite
    • Data Science
    • Whiskey Clustering & Visualization
    • Scaling Whiskey Clustering
  17. Data Science Process
    Stages: Research Goals, Obtain Data, Data Preparation, Data Exploration, Visualization, Data Modeling
    Supporting: Data ingestion, Data storage, Data processing platforms, Modeling algorithms, Math libraries, Graphics processing, Integration, Deployment
  18. Whiskey Clustering with Groovy & Apache Ignite
    • Apache Groovy
    • Apache Ignite
    • Data Science
    • Whiskey Clustering & Visualization
    • Scaling Whiskey Clustering
  19. Clustering case study: Whiskey flavor profiles
    • 86 scotch whiskies
    • 12 flavor categories
    Pictures:
    https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c
    https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/
    https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/
  20. Clustering case study: Whiskey flavor profiles

    RowID,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral
    …
    34,GlenElgin,2,3,1,0,0,2,1,1,1,1,2,3
    35,GlenGarioch,2,1,3,0,0,0,3,1,0,2,2,2
    36,GlenGrant,1,2,0,0,0,1,0,1,2,1,2,1
    37,GlenKeith,2,3,1,0,0,1,2,1,2,1,2,1
    38,GlenMoray,1,2,1,0,0,1,2,1,2,2,2,4
    39,GlenOrd,3,2,1,0,0,1,2,1,1,2,2,2
    40,GlenScotia,2,2,2,2,0,1,0,1,2,2,1,1
    41,GlenSpey,1,3,1,0,0,0,1,1,1,2,0,2
    42,Glenallachie,1,3,1,0,0,1,1,0,1,2,2,2
    …
  21. Clustering Overview
    Clustering:
    • Grouping similar items
    Algorithm families:
    • Hierarchical
    • Partitioning (k-means, x-means)
    • Density-based
    • Graph-based
    Aspects:
    • Disjoint vs overlapping
    • Preset cluster number
    • Dimensionality reduction (PCA)
    • Nominal feature support
    Applications:
    • Market segmentation
    • Recommendation engines
    • Search result grouping
    • Social network analysis
    • Medical imaging
  22. Clustering Overview
    (same content as slide 21)
  23. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  24. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  25. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  26. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  27. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  28. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  29. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  30. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  31. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  32. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
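    Those steps are all a basic implementation needs. A minimal plain-Groovy sketch of the loop (illustration only, not the talk's code; the library-backed versions follow on the next slides):

    // step 2 helper: Euclidean distance between two points
    double distance(List<Double> a, List<Double> b) {
        Math.sqrt((0..<a.size()).sum { (a[it] - b[it]) ** 2 } as double)
    }

    List kmeans(List<List<Double>> points, int k, int maxIter = 100) {
        def centroids = points.shuffled().take(k)              // step 1: guess k centroids
        maxIter.times {
            def clusters = points.groupBy { p ->               // step 2: assign to closest centroid
                centroids.indexOf(centroids.min { c -> distance(p, c) })
            }
            centroids = (0..<k).collect { i ->                 // step 3: recompute centroids
                def members = clusters[i] ?: [centroids[i]]    // keep empty clusters where they are
                (0..<members[0].size()).collect { d -> members*.get(d).sum() / members.size() }
            }
        }
        centroids
    }

    // e.g. kmeans([[0d, 0d], [0d, 1d], [5d, 5d], [5d, 6d]], 2)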
  33. Whiskey – clustering with radar plot and weka

    import …

    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
                "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def numClusters = 5
    def loader = new CSVLoader(file: 'whiskey.csv')
    def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true)
    def instances = loader.dataSet
    instances.deleteAttributeAt(0) // remove RowID
    clusterer.buildClusterer(instances)

    println ' ' + cols.join(', ')
    def dataset = new DefaultCategoryDataset()
    clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num ->
        print "Cluster ${num+1}: "
        println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', '))
        (1..cols.size()).each { idx ->
            dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1])
        }
    }

    def clusters = (0..<numClusters).collectEntries{ [it, []] }
    clusterer.assignments.eachWithIndex { cnum, idx ->
        clusters[cnum] << instances.get(idx).stringValue(0)
    }
    clusters.each { k, v ->
        println "Cluster ${k+1}:"
        println v.join(', ')
    }

    def plot = new SpiderWebPlot(dataset: dataset)
    def chart = new JFreeChart('Whiskey clusters', plot)
    SwingUtil.show(new ChartPanel(chart))

    Output:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000
    Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591
    Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591
    Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133
    Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714
    Cluster 1: Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker
    Cluster 2: Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
    Cluster 3: ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith, GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff, Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore
    Cluster 4: AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain, Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood, Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul
    Cluster 5: Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn, OldPulteney, RoyalBrackla, Tullibardine
  34. Whiskey – clustering with radar plot and medoids
    Libraries: Apache Commons Math and JFreeChart

    import …

    def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv'))
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
                "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def clusterer = new KMeansPlusPlusClusterer(5)
    def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) }
    def centroids = clusterer.cluster(data)

    println cols.join(', ') + ', Medoid'
    def dataset = new DefaultCategoryDataset()
    centroids.eachWithIndex{ ctrd, num ->
        def cpt = ctrd.center.point
        def closest = ctrd.points.min{ pt ->
            sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[])
        }
        def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery
        println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid"
        cpt.eachWithIndex { val, idx ->
            dataset.addValue(val, "Cluster ${num+1}", cols[idx])
        }
    }

    def plot = new SpiderWebPlot(dataset: dataset)
    def chart = new JFreeChart('Whiskey clusters', plot)
    SwingUtil.show(new ChartPanel(chart))

    Output:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid
    2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd
    2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy
    2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish
    1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore
    1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach
  35. Whiskey – Screeplot

    import …

    def table = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
                "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 2
    def plots = [PlotCanvas.screeplot(pca)]
    def projected = pca.project(data)
    table = table.addColumns(
        *(1..2).collect { idx ->
            DoubleColumn.create("PCA$idx", (0..<data.size()).collect { projected[it][idx - 1] })
        }
    )
    def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY]
    def symbols = ['*', 'Q', '#', 'Q', '*', '#']
    (2..6).each { k ->
        def clusterer = new KMeans(data, k)
        double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
        plots << ScatterPlot.plot(components, clusterer.clusterLabel,
                                  symbols[0..<k] as char[], colors[0..<k] as Color[])
    }
    SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))
  36. Whiskey – clustering and visualizing centroids

    …
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 3
    def projected = pca.project(data)
    def clusterer = new KMeans(data, 5)
    def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
    table = table.addColumns(
        *(0..<3).collect { idx ->
            DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{ projected[it][idx] })},
        StringColumn.create("Cluster", labels),
        DoubleColumn.create("Centroid", [10] * labels.size())
    )
    def centroids = pca.project(clusterer.centroids())
    def toAdd = table.emptyCopy(1)
    (0..<centroids.size()).each { idx ->
        toAdd[0].setString("Cluster", "Cluster " + (idx+1))
        (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) }
        toAdd[0].setDouble("Centroid", 50)
        table.append(toAdd)
    }
    def title = "Clusters x Principal Components w/ centroids"
    Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))
  37. Whiskey Clustering with Groovy & Apache Ignite
    • Apache Groovy
    • Apache Ignite
    • Data Science
    • Whiskey Clustering & Visualization
    • Scaling Whiskey Clustering
  38. Clustering case study: Whiskey flavor profiles
    • 86 scotch whiskies
    • 12 flavor categories
    • Apache Ignite has special capabilities for reading data into the cache
    • In a cluster environment, use IgniteDataStreamer or IgniteCache.loadCache() to load data from files, stream sources, database sources, etc.
    • For our little example, we have a small CSV file and a single node, so we’ll just read our data using Apache Commons CSV
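    A hedged sketch of the IgniteDataStreamer route mentioned above, assuming an ignite instance and data array like those on the following slides (the 'whiskeyCache' name is hypothetical):

    // stream Integer -> double[] rows into an existing cache instead of individual puts
    ignite.dataStreamer('whiskeyCache').withCloseable { streamer ->
        data.indices.each { int i -> streamer.addData(i, data[i]) }
    }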
  39. Clustering case study: Whiskey flavor profiles
    • 86 scotch whiskies
    • 12 flavor categories
    • Let’s select the regions of interest
  40. Clustering case study: Whiskey flavor profiles
    • Read CSV rows
    • Slice out segments of interest
    (diagram: column indices mapped to the distilleries, data and features variables)

    var file = getClass().classLoader.getResource('whiskey.csv').file as File
    var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() }
    var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]
    var distilleries = rows[1..-1]*.get(1)
    var features = rows[0][2..-1]
  41. Clustering case study: Whiskey flavor profiles
    • Set up configuration & define some helper variables

    // configure to all run on local machine but could be a cluster (can be hidden in XML)
    var cfg = new IgniteConfiguration(
        peerClassLoadingEnabled: true,
        discoverySpi: new TcpDiscoverySpi(
            ipFinder: new TcpDiscoveryMulticastIpFinder(
                addresses: ['127.0.0.1:47500..47509']
            )
        )
    )

    var pretty = this.&sprintf.curry('%.4f')
    var dist = new EuclideanDistance() // or ManhattanDistance
    var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)
  42. Whiskey flavors – scaling clustering

    Ignition.start(cfg).withCloseable { ignite ->
        println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
        var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
            name: "TEST_${UUID.randomUUID()}",
            affinity: new RendezvousAffinityFunction(false, 10)))
        data.indices.each { int i -> dataCache.put(i, data[i]) }

        var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
        var mdl = trainer.fit(ignite, dataCache, vectorizer)

        println ">>> KMeans centroids:\n${features.join(', ')}"
        var centroids = mdl.centers*.all()
        var cols = centroids.collect{ it*.get() }
        cols.each { c -> println c.collect(pretty).join(', ') }

        dataCache.destroy()
    }
  43. Whiskey flavors – scaling clustering
    (same code as slide 42, with its output)

    [11:48:48]    __________  ________________
    [11:48:48]   /  _/ ___/ |/ /  _/_  __/ __/
    [11:48:48]  _/ // (7 7    // /  / / / _/
    [11:48:48] /___/\___/_/|_/___/ /_/ /___/
    [11:48:48]
    [11:48:48] ver. 2.15.0#20230425-sha1:f98f7f35
    [11:48:48] 2023 Copyright(C) Apache Software Foundation
    …
    >>> Ignite grid started for data: 86 rows X 12 cols
    >>> KMeans centroids:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    2.3793, 1.0345, 0.2414, 0.0345, 0.8966, 1.1034, 0.5517, 1.5517, 1.6207, 2.1724, 2.1379
    2.5556, 1.4444, 0.0556, 0.0000, 1.8333, 1.6667, 2.3333, 2.0000, 2.0000, 2.2222, 1.5556
    3.1429, 1.0000, 0.2857, 0.1429, 0.8571, 0.5714, 0.7143, 0.7143, 1.5714, 0.7143, 1.5714
    2.0476, 1.7619, 0.3333, 0.1429, 1.7619, 1.7619, 0.7143, 1.0952, 2.1429, 1.6190, 1.8571
    1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455
  44. Whiskey flavors – scaling clustering

    …
    var clusters = [:].withDefault{ [] }
    dataCache.query(new ScanQuery()).withCloseable { observations ->
        observations.each { observation ->
            def (k, v) = observation.with{ [getKey(), getValue()] }
            int prediction = mdl.predict(vectorizer.extractFeatures(k, v))
            clusters[prediction] += distilleries[k]
        }
    }
    clusters.sort{ e -> e.key }.each{ k, v -> println "Cluster ${k+1}: ${v.join(', ')}" }
    …

    Output:
    …
    Cluster 1: Bunnahabhain, Dufftown, Glenmorangie, Teaninich, Glenallachie, Longmorn, Scapa, Tobermory, AnCnoc, Cardhu, GlenElgin, Mannochmore, Speyside, Craigganmore, GlenGrant, Tullibardine, Auchentoshan, Bladnoch, GlenKeith, Glengoyne, Knochando, Strathmill, GlenMoray, Aultmore, Tamdhu, Balblair, Glenlossie, Linkwood, Tamnavulin
    Cluster 2: Aberfeldy, Balmenach, RoyalLochnagar, Aberlour, Edradour, Glenrothes, Glendronach, Glenturret, Macallan, Glendullan, Glenfarclas, Mortlach, Strathisla, Dailuaine, Auchroisk, BlairAthol, Dalmore, Glenlivet
    Cluster 3: GlenSpey, GlenDeveronMacduff, Speyburn, Miltonduff, Tomore, ArranIsleOf, Glenfiddich
    Cluster 4: Loch Lomond, Belvenie, BenNevis, Tomatin, Benriach, Highland Park, Tomintoul, Ardmore, Benrinnes, Craigallechie, GlenGarioch, Inchgower, Benromach, Glenkinchie, OldFettercairn, Bowmore, Dalwhinnie, GlenOrd, Bruichladdich, Deanston, RoyalBrackla
    Cluster 5: Caol Ila, Ardbeg, Clynelish, Springbank, Isle of Jura, Oban, Lagavulin, Talisker, Laphroig, OldPulteney, GlenScotia
    …
  45. Whiskey flavors – scaling clustering

    var dist = new EuclideanDistance()
    …
    (the Ignition.start/KMeansTrainer code from slide 42 is repeated)
    5
  46. Whiskey flavors – scaling clustering

    var dist = new ManhattanDistance()
    …
    (the Ignition.start/KMeansTrainer code from slide 42 is repeated)
    4 3 3 + 4 = 7
  47. Whiskey flavors – scaling clustering

    Ignition.start(cfg).withCloseable { ignite ->
        println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
        var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
            name: "TEST_${UUID.randomUUID()}",
            affinity: new RendezvousAffinityFunction(false, 10)))
        data.indices.each { int i -> dataCache.put(i, data[i]) }

        var trainer = new GmmTrainer().withMaxCountOfClusters(5)
        var mdl = trainer.fit(ignite, dataCache, vectorizer)
        …
        dataCache.destroy()
    }

    Image source: wikipedia
  48. Whiskey Clustering with Apache Groovy & Apache Ignite
    Twitter: @paulk_asert
    Mastodon: @[email protected]
    Apache Groovy: https://groovy.apache.org/ | https://groovy-lang.org/
    Apache Ignite: https://ignite.apache.org/
    Repo: https://github.com/paulk-asert/groovy-data-science