
Groovy Whiskey

Do you have a penchant for fine whiskey? This presentation embarks on a quest to analyze whiskeys produced by the world’s top 86 distilleries and identify the perfect single-malt Scotch. For fun, several Apache projects are used along the way. Groovy simplifies your data science code. Commons CSV and Commons Math handle reading your data and the processing logic. Beam, Flink, Ignite, Spark and Wayang let you scale your machine learning applications.

paulking

May 29, 2024

Transcript

  1. Whiskey Clustering with Apache Projects: Groovy, Commons CSV, Commons Math, Ignite, Spark, Wayang, Beam, Flink

    Dr Paul King, Object Computing & VP Apache Groovy
    Twitter/X | Mastodon: @paulk_asert | @[email protected]
    Apache Groovy: https://groovy.apache.org/ https://groovy-lang.org/
    Repo: https://github.com/paulk-asert/groovy-data-science
    Slides: https://speakerdeck.com/paulk/groovy-whiskey

    objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved. No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI)
  2. • Apache Groovy • Clustering Overview • Whiskey Clustering &

    Visualization • Scaling Whiskey Clustering
  3. Apache Groovy Programming Language

    • Multi-faceted extensible language
    • Imperative/OO & functional
    • Dynamic & static
    • Aligned closely with Java
    • 20+ years since inception
    • 3.5+B downloads (partial count)
    • 520+ contributors
    • 240+ releases
    • https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be
  4. Why use Groovy in 2024? It’s like a super version of Java:

    • Simpler scripting: more powerful yet more concise
    • Extension methods: 2000+ enhancements to Java classes for a great out-of-the-box experience (batteries included; see the snippet below)
    • Flexible Typing: from dynamic duck-typing (terse code) to extensible stronger-than-Java static typing (better checking)
    • Improved OO & Functional Features: from traits (more powerful and flexible OO designs) to tail recursion and memoizing/partial application of pure functions
    • AST transforms: 10s of lines instead of 100/1000s of lines
    • Java Features Earlier: recent features on older JDKs
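    [Not on the slides: a quick illustration of the extension methods bullet, using only standard Groovy APIs:]

        // a few of Groovy's 2000+ extension methods on JDK classes
        assert (1..10).collate(3) == [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
        assert 'whiskey.csv'.tokenize('.') == ['whiskey', 'csv']
        assert [3, 1, 2].sort() == [1, 2, 3]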
  5. Scripting for Data Science • Same example • Same library

    import org.apache.commons.math3.linear.*;

    public class MatrixMain {
        public static void main(String[] args) {
            double[][] matrixData = { {1d, 2d, 3d}, {2d, 5d, 3d} };
            RealMatrix m = MatrixUtils.createRealMatrix(matrixData);
            double[][] matrixData2 = { {1d, 2d}, {2d, 5d}, {1d, 7d} };
            RealMatrix n = new Array2DRowRealMatrix(matrixData2);
            RealMatrix o = m.multiply(n);
            // Invert o, using LU decomposition
            RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();
            RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);
            RealMatrix q = o.add(p.power(2));
            System.out.println(q);
        }
    }

    Output:
    Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}

    Thanks to operator overloading and extensible tooling
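    [Not captured in the transcript: the Groovy version of this example. A minimal sketch of the same computation, assuming an extension method that aliases RealMatrix.add as plus so the + operator applies (* and ** already map to multiply and power):]

        import org.apache.commons.math3.linear.*

        var m = MatrixUtils.createRealMatrix([[1d, 2d, 3d], [2d, 5d, 3d]] as double[][])
        var n = new Array2DRowRealMatrix([[1d, 2d], [2d, 5d], [1d, 7d]] as double[][])
        var o = m * n                                          // * maps to multiply()
        var oInverse = new LUDecomposition(o).solver.inverse   // property-style getters
        var p = oInverse.scalarAdd(1d).scalarMultiply(2d)
        var q = o + p**2          // + assumes a plus alias for add(); ** maps to power()
        println q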
  6. • Apache Groovy • Clustering Overview • Whiskey Clustering &

    Visualization • Scaling Whiskey Clustering
  7. Clustering Overview

    Clustering: • Grouping similar items
    Algorithm families: • Hierarchical • Partitioning (k-means, x-means) • Density-based • Graph-based
    Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction (PCA) • Nominal feature support
    Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging
  9. Clustering with KMeans

    Step 1: • Guess k cluster centroids
    Step 2: • Assign points to closest centroid
  11. Clustering with KMeans

    Step 1: • Guess k cluster centroids
    Step 2: • Assign points to closest centroid
    Step 3: • Calculate new centroids based on selected points
  15. Clustering with KMeans

    Step 1: • Guess k cluster centroids
    Step 2: • Assign points to closest centroid
    Step 3: • Calculate new centroids based on selected points
    Repeat steps 2 and 3 until stable or some limit reached
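    [Not on the slides: the whole loop above condensed into plain Groovy; a minimal illustrative sketch only, not the Commons Math, Ignite, Spark, Wayang, Beam, or Flink implementations used later in this talk:]

        // naive k-means over a list of double[] feature rows
        List<double[]> kmeans(List<double[]> points, int k, int maxIter = 10) {
            var centroids = points.take(k)*.clone()            // Step 1: guess k centroids
            maxIter.times {
                var assigned = points.groupBy { double[] p ->  // Step 2: nearest centroid
                    centroids.indices.min { int c ->
                        (0..<p.length).sum { i -> (p[i] - centroids[c][i]) ** 2 }
                    }
                }
                assigned.each { c, pts ->                      // Step 3: recompute centroids
                    centroids[c] = (0..<pts[0].length).collect { i ->
                        pts*.getAt(i).sum() / pts.size()
                    } as double[]
                }
            }
            centroids
        }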
  19. • Apache Groovy • Clustering Overview • Whiskey Clustering &

    Visualization • Scaling Whiskey Clustering
  20. Clustering case study: Whiskey flavor profiles • 86 scotch whiskies

    • 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/
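    [Not on the slides: the shape of the dataset. Based on the column names used on later slides and the Aberfeldy medoid printed on slide 24, the first rows of whiskey.csv look roughly like this (illustrative):]

        RowID,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral
        1,Aberfeldy,2,2,2,0,0,2,1,2,2,2,2,2
        ...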
  21. Clustering case study: Whiskey flavor profiles • Read CSV records • Slice out segments of interest

    var file = getClass().classLoader.getResource('whiskey.csv').file as File
    var builder = RFC4180.builder().build()
    var records = file.withReader { r -> builder.parse(r).records*.toList() }
    var features = records[0][2..-1]
    var data = records[1..-1].collect { new DoublePoint(it[2..-1] as int[]) }
    var distilleries = records[1..-1]*.get(1)

    [diagram: slicing of the CSV records; row 0, columns 2..-1 give the features, rows 1..-1, columns 2..-1 give the data, and rows 1..-1, column 1 gives the distilleries]
  22. Whiskey Clusters – Apache Commons Math

    var clusterer = new KMeansPlusPlusClusterer(4)
    Map<Integer, List> clusterPts = [:]
    var clusters = clusterer.cluster(data)
    println features.join(', ')
    var centroids = categoryDataset()
    clusters.eachWithIndex { ctrd, num ->
        var cpt = ctrd.center.point
        clusterPts[num] = ctrd.points.collect { pt -> data.point.findIndexOf { it == pt.point } }
        println cpt.collect { sprintf '%.3f', it }.join(', ')
        cpt.eachWithIndex { val, idx ->
            centroids.addValue(val, "Cluster ${num + 1}", features[idx])
        }
    }

    Output:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    1.630, 2.333, 1.148, 0.222, 0.037, 1.185, 1.037, 0.556, 1.963, 1.630, 2.000, 2.111
    2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545
    1.450, 2.550, 1.150, 0.400, 0.150, 0.850, 1.400, 0.600, 0.450, 1.800, 1.700, 2.000
    2.607, 2.357, 1.643, 0.107, 0.036, 1.893, 1.679, 1.821, 1.679, 2.107, 1.929, 1.536
  23. Whiskey Clusters – Apache Commons Math

    println "\n${cols.join(', ')}, Medoid"
    var medoids = categoryDataset()
    clusters.eachWithIndex { ctrd, num ->
        var cpt = ctrd.center.point
        var closest = ctrd.points.min { pt ->
            sumSq((0..<cpt.size()).collect { cpt[it] - pt.point[it] } as double[])
        }
        var medoidIdx = data.findIndexOf { row -> row.point == closest.point }
        println data[medoidIdx].point.collect { sprintf '%.3f', it }.join(', ') + ", ${distilleries[medoidIdx]}"
        data[medoidIdx].point.eachWithIndex { val, idx ->
            medoids.addValue(val, distilleries[medoidIdx], cols[idx])
        }
    }
  24. Whiskey Clusters – Apache Commons Math (same code as the previous slide, now with its output)

    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid
    1.000, 3.000, 1.000, 0.000, 0.000, 1.000, 1.000, 0.000, 2.000, 2.000, 2.000, 2.000, Cardhu
    3.000, 2.000, 3.000, 3.000, 1.000, 0.000, 2.000, 0.000, 1.000, 1.000, 2.000, 0.000, Clynelish
    1.000, 3.000, 1.000, 0.000, 0.000, 1.000, 1.000, 0.000, 1.000, 2.000, 2.000, 2.000, Glenallachie
    2.000, 2.000, 2.000, 0.000, 0.000, 2.000, 1.000, 2.000, 2.000, 2.000, 2.000, 2.000, Aberfeldy
  25. Whiskey – Screeplot

    import …

    def table = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
                "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 2
    def plots = [PlotCanvas.screeplot(pca)]
    def projected = pca.project(data)
    table = table.addColumns(
        *(1..2).collect { idx ->
            DoubleColumn.create("PCA$idx", (0..<data.size()).collect { projected[it][idx - 1] })
        }
    )
    def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY]
    def symbols = ['*', 'Q', '#', 'Q', '*', '#']
    (2..6).each { k ->
        def clusterer = new KMeans(data, k)
        double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
        plots << ScatterPlot.plot(components, clusterer.clusterLabel,
            symbols[0..<k] as char[], colors[0..<k] as Color[])
    }
    SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))
  26. Whiskey – clustering and visualizing centroids

    …
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 3
    def projected = pca.project(data)
    def clusterer = new KMeans(data, 5)
    def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
    table = table.addColumns(
        *(0..<3).collect { idx ->
            DoubleColumn.create("PCA${idx + 1}", (0..<data.size()).collect { projected[it][idx] }) },
        StringColumn.create("Cluster", labels),
        DoubleColumn.create("Centroid", [10] * labels.size())
    )
    def centroids = pca.project(clusterer.centroids())
    def toAdd = table.emptyCopy(1)
    (0..<centroids.size()).each { idx ->
        toAdd[0].setString("Cluster", "Cluster " + (idx + 1))
        (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it - 1]) }
        toAdd[0].setDouble("Centroid", 50)
        table.append(toAdd)
    }
    def title = "Clusters x Principal Components w/ centroids"
    Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))
  27. Whiskey – Hierarchical clustering with Dendrogram

    …
    def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
        title = 'Whiskey Dendrogram'
        setAxisLabels('Distilleries', 'Similarity')
        def lb = lowerBounds
        setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
        distilleries.eachWithIndex { String label, int i ->
            add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
        }
    }.panel()

    def pca = PCA.fit(data)
    pca.projection = 2
    def projected = pca.project(data)
    char mark = '#'
    def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
        title = 'Clustered by dendrogram partitions'
        setAxisLabels('PCA1', 'PCA2')
    }.panel()
    new PlotGrid(dendrogram, scatter).window()
  28. • Apache Groovy • Clustering Overview • Whiskey Clustering &

    Visualization • Scaling Whiskey Clustering
  29. Scaling up machine learning: Apache Ignite • Apache Ignite is

    a distributed database for high-performance computing with in-memory speed. In simple terms, it makes a cluster (or grid) of nodes appear like an in-memory cache. • It has cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation, among others. Image source: Apache Ignite documentation
  30. Clustering case study: Whiskey flavor profiles • 86 scotch whiskies

    • 12 flavor categories • Apache Ignite has special capabilities for reading data into the cache • In a cluster environment, use IgniteDataStreamer or IgniteCache.loadCache() to load data from files, stream sources, database sources, etc. (see the sketch below) • For our little example, we have a small CSV file and a single node, so we’ll just read our data using Apache Commons CSV
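    [Not on the slides: a hedged sketch of the cluster-oriented loading path mentioned above, using Ignite's data streamer API; the cache name is illustrative and error handling is omitted:]

        // assumes cfg and data as defined on the next slides
        Ignition.start(cfg).withCloseable { ignite ->
            ignite.createCache('WHISKEY')                    // illustrative cache name
            ignite.dataStreamer('WHISKEY').withCloseable { streamer ->
                // the streamer batches and routes puts across the grid
                data.indices.each { int i -> streamer.addData(i, data[i]) }
            }
        }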
  31. Clustering case study: Whiskey flavor profiles • 86 scotch whiskies

    • 12 flavor categories • Let’s select the regions of interest
  32. Clustering case study: Whiskey flavor profiles • Read CSV rows • Slice out segments of interest

    var file = getClass().classLoader.getResource('whiskey.csv').file as File
    var rows = file.withReader { r -> RFC4180.parse(r).records*.toList() }
    var data = rows[1..-1].collect { it[2..-1]*.toDouble() } as double[][]
    var distilleries = rows[1..-1]*.get(1)
    var features = rows[0][2..-1]

    [diagram: slicing of the CSV rows into distilleries, data, and features, as before]
  33. Clustering case study: Whiskey flavor profiles • Set up configuration & define some helper variables

    // configure to all run on local machine but could be a cluster (can be hidden in XML)
    var cfg = new IgniteConfiguration(
        peerClassLoadingEnabled: true,
        discoverySpi: new TcpDiscoverySpi(
            ipFinder: new TcpDiscoveryMulticastIpFinder(
                addresses: ['127.0.0.1:47500..47509']
            )
        )
    )

    var pretty = this.&sprintf.curry('%.4f')
    var dist = new EuclideanDistance() // or ManhattanDistance
    var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)
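    [Not on the slides: this.&sprintf is a method pointer and curry fixes its first argument, so the pretty helper above acts as a one-argument formatter:]

        assert pretty(1.23456) == '1.2346'   // same as sprintf('%.4f', 1.23456)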
  34. Whiskey flavors – scaling clustering

    Ignition.start(cfg).withCloseable { ignite ->
        println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
        var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
            name: "TEST_${UUID.randomUUID()}",
            affinity: new RendezvousAffinityFunction(false, 10)))
        data.indices.each { int i -> dataCache.put(i, data[i]) }
        var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
        var mdl = trainer.fit(ignite, dataCache, vectorizer)
        println ">>> KMeans centroids:\n${features.join(', ')}"
        var centroids = mdl.centers*.all()
        var cols = centroids.collect { it*.get() }
        cols.each { c -> println c.collect(pretty).join(', ') }
        dataCache.destroy()
    }
  35. Whiskey flavors – scaling clustering (same code as the previous slide, now with its output)

    [11:48:48] [ASCII-art Ignite banner]
    [11:48:48] ver. 2.15.0#20230425-sha1:f98f7f35
    [11:48:48] 2023 Copyright(C) Apache Software Foundation
    …
    >>> Ignite grid started for data: 86 rows X 12 cols
    >>> KMeans centroids:
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    2.3793, 1.0345, 0.2414, 0.0345, 0.8966, 1.1034, 0.5517, 1.5517, 1.6207, 2.1724, 2.1379
    2.5556, 1.4444, 0.0556, 0.0000, 1.8333, 1.6667, 2.3333, 2.0000, 2.0000, 2.2222, 1.5556
    3.1429, 1.0000, 0.2857, 0.1429, 0.8571, 0.5714, 0.7143, 0.7143, 1.5714, 0.7143, 1.5714
    2.0476, 1.7619, 0.3333, 0.1429, 1.7619, 1.7619, 0.7143, 1.0952, 2.1429, 1.6190, 1.8571
    1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455
  36. Whiskey flavors – scaling clustering

    …
    var clusters = [:].withDefault { [] }
    dataCache.query(new ScanQuery()).withCloseable { observations ->
        observations.each { observation ->
            def (k, v) = observation.with { [getKey(), getValue()] }
            int prediction = mdl.predict(vectorizer.extractFeatures(k, v))
            clusters[prediction] += distilleries[k]
        }
    }
    clusters.sort { e -> e.key }.each { k, v -> println "Cluster ${k+1}: ${v.join(', ')}" }
    …

    Output:
    …
    Cluster 1: Bunnahabhain, Dufftown, Glenmorangie, Teaninich, Glenallachie, Longmorn, Scapa, Tobermory, AnCnoc, Cardhu, GlenElgin, Mannochmore, Speyside, Craigganmore, GlenGrant, Tullibardine, Auchentoshan, Bladnoch, GlenKeith, Glengoyne, Knochando, Strathmill, GlenMoray, Aultmore, Tamdhu, Balblair, Glenlossie, Linkwood, Tamnavulin
    Cluster 2: Aberfeldy, Balmenach, RoyalLochnagar, Aberlour, Edradour, Glenrothes, Glendronach, Glenturret, Macallan, Glendullan, Glenfarclas, Mortlach, Strathisla, Dailuaine, Auchroisk, BlairAthol, Dalmore, Glenlivet
    Cluster 3: GlenSpey, GlenDeveronMacduff, Speyburn, Miltonduff, Tomore, ArranIsleOf, Glenfiddich
    Cluster 4: Loch Lomond, Belvenie, BenNevis, Tomatin, Benriach, Highland Park, Tomintoul, Ardmore, Benrinnes, Craigallechie, GlenGarioch, Inchgower, Benromach, Glenkinchie, OldFettercairn, Bowmore, Dalwhinnie, GlenOrd, Bruichladdich, Deanston, RoyalBrackla
    Cluster 5: Caol Ila, Ardbeg, Clynelish, Springbank, Isle of Jura, Oban, Lagavulin, Talisker, Laphroig, OldPulteney, GlenScotia
    …
  37. Whiskey flavors – scaling clustering

    var dist = new EuclideanDistance()
    … (same clustering code as slide 34)

    [slide graphic comparing the resulting clusters; annotation: 5]
  38. Whiskey flavors – scaling clustering

    var dist = new ManhattanDistance()
    … (same clustering code as slide 34)

    [slide graphic comparing the resulting clusters; annotations: 4, 3, 3 + 4 = 7]
  39. Whiskey flavors – scaling clustering

    Ignition.start(cfg).withCloseable { ignite ->
        println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
        var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
            name: "TEST_${UUID.randomUUID()}",
            affinity: new RendezvousAffinityFunction(false, 10)))
        data.indices.each { int i -> dataCache.put(i, data[i]) }
        var trainer = new GmmTrainer().withMaxCountOfClusters(5)
        var mdl = trainer.fit(ignite, dataCache, vectorizer)
        …
        dataCache.destroy()
    }

    Image source: wikipedia
  40. Apache Spark • Multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters

    [diagram: Spark architecture; a Spark Session on the driver node talks to a cluster manager, which schedules tasks on executors (each with a cache) running on worker nodes]
  41. Apache Spark MLlib

    ML Algorithms:
    • Clustering: KMeans, Bisecting KMeans, Latent Dirichlet Allocation, Gaussian Mixture Model, Power Iteration Clustering
    • Classification • Regression • Feature engineering • Stats • Utility functions

    [diagram: your app on top of MLlib, which builds on Spark SQL and Spark Core]
  42. Whiskey flavors – scaling clustering

    var spark = builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate
    var file = WhiskeySpark.classLoader.getResource('whiskey.csv').file
    var rows = spark.read().format('com.databricks.spark.csv')
        .options(header: 'true', inferSchema: 'true').load(file)
    String[] colNames = rows.columns().toList() - ['RowID', 'Distillery']
    var assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
    var dataset = assembler.transform(rows)
    var kmeans = new KMeans(k: 5, seed: 1L)
    var model = kmeans.fit(dataset)
    println 'Cluster centers:'
    model.clusterCenters().each {
        println it.values().collect { sprintf '%.2f', it }.join(', ')
    }
    var result = model.transform(dataset)
    var clusters = result.toLocalIterator().collect { row ->
        [row.getAs('prediction'), row.getAs('Distillery')]
    }.groupBy { it[0] }.collectValues { it*.get(1) }
    clusters.each { k, v -> println "Cluster$k: ${v.join(', ')}" }
    spark.stop()
  43. Whiskey flavors – scaling clustering (same code as the previous slide, now with its output)

    Cluster centers:
    2.89, 2.42, 1.53, 0.05, 0.00, 1.84, 1.58, 2.11, 2.11, 2.11, 2.26, 1.58
    1.45, 2.35, 1.06, 0.26, 0.06, 0.84, 1.13, 0.45, 1.26, 1.65, 2.19, 2.10
    1.83, 3.17, 1.00, 0.33, 0.17, 1.00, 0.67, 0.83, 0.83, 1.50, 0.50, 1.50
    3.00, 1.50, 3.00, 2.80, 0.50, 0.30, 1.40, 0.50, 1.50, 1.50, 1.30, 0.50
    1.85, 2.20, 1.70, 0.40, 0.10, 1.85, 1.80, 1.00, 1.35, 2.00, 1.40, 1.85
    Cluster0: Aberfeldy, Aberlour, Auchroisk, Balmenach, BenNevis, Benrinnes, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
    Cluster1: AnCnoc, Auchentoshan, Aultmore, Balblair, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dufftown, GlenElgin, GlenGrant, GlenKeith, GlenMoray, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Linkwood, Loch Lomond, Mannochmore, RoyalBrackla, Speyside, Strathmill, Tamdhu, Tamnavulin, Teaninich, Tobermory, Tullibardine
    Cluster3: Ardbeg, Caol Ila, Clynelish, GlenScotia, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Talisker
    Cluster4: Ardmore, Belvenie, Benromach, Bowmore, Bruichladdich, Craigallechie, Dalwhinnie, Deanston, GlenGarioch, GlenOrd, Glenlivet, Glenturret, Highland Park, Inchgower, Knochando, OldFettercairn, Scapa, Springbank, Tomatin, Tomintoul
    Cluster2: ArranIsleOf, GlenDeveronMacduff, GlenSpey, Miltonduff, Speyburn, Tomore
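    [Not on the slides: Spark ML also ships a ClusteringEvaluator, so one hedged way to sanity-check the k=5 choice on the result dataset from the code above is a silhouette score (closer to 1 means tighter, better-separated clusters):]

        import org.apache.spark.ml.evaluation.ClusteringEvaluator

        var evaluator = new ClusteringEvaluator(featuresCol: 'features', predictionCol: 'prediction')
        println "Silhouette for k=5: ${evaluator.evaluate(result)}"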
  44. Apache Wayang • A unified data processing framework that seamlessly

    integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility Image source: Apache Wayang documentation
  45. Apache Wayang • Offers two approaches for us:

    • Roll your own Kmeans algorithm using existing operators
      • Built upon 4 abstractions: UnaryToUnaryOperator, BinaryToUnaryOperator, UnarySource, UnarySink
      • Many built-in operators: Map, Filter, Reduce, Distinct, Count, GroupBy
    • ML4all abstracts most ML algorithms with seven operators:
      • Transform, Stage, Compute, Update, Sample, Converge, Loop
      • Kmeans implementation included in next release

    Image source: Apache Wayang documentation
  46. Apache Wayang: Roll your own Kmeans

    Domain classes:

    record Point(double[] pts) implements Serializable { }

    record PointGrouping(double[] pts, int cluster, long count) implements Serializable {
        PointGrouping(List<Double> pts, int cluster, long count) {
            this(pts as double[], cluster, count)
        }

        PointGrouping plus(PointGrouping that) {
            var newPts = pts.indices.collect { pts[it] + that.pts[it] }
            new PointGrouping(newPts, cluster, count + that.count)
        }

        PointGrouping average() {
            new PointGrouping(pts.collect { double d -> d / count }, cluster, 1)
        }
    }
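    [Not on the slides: the plus/average pair is what lets centroids be computed as a distributed reduce: sum coordinates and counts, then divide once at the end. A quick illustrative check:]

        // two points already assigned to cluster 0
        var a = new PointGrouping([1d, 2d] as double[], 0, 1)
        var b = new PointGrouping([3d, 4d] as double[], 0, 1)
        assert (a + b).average().pts().toList() == [2d, 3d]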
  47. Apache Wayang: Roll your own Kmeans

    Algorithm class:

    class SelectNearestCentroid implements ExtendedSerializableFunction<Point, PointGrouping> {
        Iterable<PointGrouping> centroids

        void open(ExecutionContext context) {
            centroids = context.getBroadcast('centroids')
        }

        PointGrouping apply(Point p) {
            var minDistance = Double.POSITIVE_INFINITY
            var nearestCentroidId = -1
            for (c in centroids) {
                var distance = sqrt(p.pts.indices.collect { p.pts[it] - c.pts[it] }.sum { it ** 2 } as double)
                if (distance < minDistance) {
                    minDistance = distance
                    nearestCentroidId = c.cluster
                }
            }
            new PointGrouping(p.pts, nearestCentroidId, 1)
        }
    }
  48. Apache Wayang: Roll your own Kmeans

    class PipelineOps {
        public static SerializableFunction<PointGrouping, Integer> cluster = tpc -> tpc.cluster
        public static SerializableFunction<PointGrouping, PointGrouping> average = tpc -> tpc.average()
        public static SerializableBinaryOperator<PointGrouping> plus = (tpc1, tpc2) -> tpc1 + tpc2
    }
    import static PipelineOps.*

    int k = 5
    int iterations = 10

    // read in data from our file
    var url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
    def rows = new File(url).readLines()[1..-1]*.split(',')
    var distilleries = rows*.getAt(1)
    var pointsData = rows.collect { new Point(it[2..-1] as double[]) }
    var dims = pointsData[0].pts.size()

    // create some random points as initial centroids
    var r = new Random()
    var randomPoint = { (0..<dims).collect { r.nextGaussian() + 2 } as double[] }
    var initPts = (1..k).collect(randomPoint)
  49. Apache Wayang: Roll your own Kmeans

    var context = new WayangContext()
        .withPlugin(Java.basicPlugin())
        .withPlugin(Spark.basicPlugin())
    var planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)")

    var points = planBuilder
        .loadCollection(pointsData).withName('Load points')
    var initialCentroids = planBuilder
        .loadCollection((0..<k).collect { idx -> new PointGrouping(initPts[idx], idx, 0) })
        .withName('Load random centroids')

    var finalCentroids = initialCentroids.repeat(iterations, currentCentroids ->
        points.map(new SelectNearestCentroid())
            .withBroadcast(currentCentroids, 'centroids').withName('Find nearest centroid')
            .reduceByKey(cluster, plus).withName('Aggregate points')
            .map(average).withName('Average points')
            .withOutputClass(PointGrouping)
    ).withName('Loop').collect()
  50. Apache Wayang: Roll your own Kmeans

    println 'Centroids:'
    finalCentroids.each { c ->
        println "Cluster $c.cluster: ${c.pts.collect { sprintf '%.2f', it }.join(', ')}"
    }

    Centroids:
    Cluster 0: 2.53, 1.65, 2.76, 2.12, 0.29, 0.65, 1.65, 0.59, 1.35, 1.41, 1.35, 0.94
    Cluster 2: 3.33, 2.56, 1.67, 0.11, 0.00, 1.89, 1.89, 2.78, 2.00, 1.89, 2.33, 1.33
    Cluster 3: 1.42, 2.47, 1.03, 0.22, 0.06, 1.00, 1.03, 0.47, 1.19, 1.72, 1.92, 2.08
    Cluster 4: 2.25, 2.38, 1.38, 0.08, 0.13, 1.79, 1.54, 1.33, 1.75, 2.17, 1.75, 1.79

    var allocator = new SelectNearestCentroid(centroids: finalCentroids)
    var allocations = pointsData.withIndex()
        .collect { pt, idx -> [allocator.apply(pt).cluster, distilleries[idx]] }
        .groupBy { cluster, ds -> "Cluster $cluster" }
        .collectValues { v -> v.collect { it[1] } }
        .sort { e1, e2 -> e1.key <=> e2.key }
    allocations.each { c, ds -> println "$c (${ds.size()} members): ${ds.join(', ')}" }

    Cluster 0 (17 members): Ardbeg, Balblair, Bowmore, Bruichladdich, Caol Ila, Clynelish, GlenGarioch, GlenScotia, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Springbank, Talisker, Teaninich
    Cluster 2 (9 members): Aberlour, Balmenach, Dailuaine, Dalmore, Glendronach, Glenfarclas, Macallan, Mortlach, RoyalLochnagar
    Cluster 3 (36 members): AnCnoc, ArranIsleOf, Auchentoshan, Aultmore, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dalwhinnie, Dufftown, GlenElgin, GlenGrant, GlenMoray, GlenSpey, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Inchgower, Linkwood, Loch Lomond, Mannochmore, Miltonduff, RoyalBrackla, Speyburn, Speyside, Strathmill, Tamdhu, Tamnavulin, Tobermory, Tomintoul, Tomore, Tullibardine
    Cluster 4 (24 members): Aberfeldy, Ardmore, Auchroisk, Belvenie, BenNevis, Benrinnes, Benromach, BlairAthol, Craigallechie, Deanston, Edradour, GlenDeveronMacduff, GlenKeith, GlenOrd, Glendullan, Glenlivet, Glenrothes, Glenturret, Knochando, Longmorn, OldFettercairn, Scapa, Strathisla, Tomatin
  51. Apache Wayang: ML4all

    int k = 3
    int maxIterations = 100
    double accuracy = 0

    class TransformCSV extends Transform<double[], String> {
        double[] transform(String input) {
            input.split(',')[2..-1] as double[]
        }
    }

    class KMeansStageWithRandoms extends LocalStage {
        int k, dimension
        private r = new Random()

        void staging(ML4allModel model) {
            double[][] centers = new double[k][]
            for (i in 0..<k) {
                centers[i] = (0..<dimension).collect { r.nextGaussian() + 2 } as double[]
            }
            model.put('centers', centers)
        }
    }
  52. Apache Wayang: ML4all

    var url = WhiskeyWayangML.classLoader.getResource('whiskey_noheader.csv').path
    var dims = 12
    var context = new WayangContext()
        .withPlugin(Spark.basicPlugin())
        .withPlugin(Java.basicPlugin())

    var plan = new ML4allPlan(
        transformOp: new TransformCSV(),
        localStage: new KMeansStageWithRandoms(k: k, dimension: dims),
        computeOp: new KMeansCompute(),
        updateOp: new KMeansUpdate(),
        loopOp: new KMeansConvergeOrMaxIterationsLoop(accuracy, maxIterations)
    )

    var model = plan.execute('file:' + url, context)
    model.getByKey("centers").eachWithIndex { center, idx ->
        var pts = center.collect { sprintf '%.2f', it }.join(', ')
        println "Cluster$idx: $pts"
    }

    Output:
    Cluster0: 1.57, 2.32, 1.32, 0.45, 0.09, 1.08, 1.19, 0.60, 1.26, 1.74, 1.72, 1.85
    Cluster1: 3.43, 1.57, 3.43, 3.14, 0.57, 0.14, 1.71, 0.43, 1.29, 1.43, 1.29, 0.14
    Cluster2: 2.73, 2.42, 1.46, 0.04, 0.04, 1.88, 1.69, 1.88, 1.92, 2.04, 2.12, 1.81
  53. Apache Beam® • Apache Beam offers a unified programming model

    for batch and streaming data processing pipelines • The pipeline abstraction encapsulates all the data and steps in your data processing task • Apache Beam unifies multiple data processing engines and SDKs around its distinctive Beam model • Several language SDKs: Java, Groovy (via the Java SDK), Python, Go, SQL, … Image sources: Apache Beam documentation
  54. Apache Beam Kmeans

    record Point(double[] pts) implements Serializable {
        private static Random r = new Random()
        private static Closure<double[]> randomPoint = { dims ->
            (1..dims).collect { r.nextGaussian() + 2 } as double[]
        }
        static Point ofRandom(int dims) { new Point(randomPoint(dims)) }
        String toString() { "Point[${pts.collect { sprintf '%.2f', it }.join('. ')}]" }
    }

    record Points(List<Point> pts) implements Serializable { }
  55. Apache Beam Kmeans

    var readCsv = new DoFn<String, Point>() {
        @ProcessElement
        void processElement(@Element String path, OutputReceiver<Point> receiver) throws IOException {
            def parser = CSV.builder().setHeader().setSkipHeaderRecord(true).build()
            def records = new File(path).withReader { rdr -> parser.parse(rdr).records*.toList() }
            records.each { receiver.output(new Point(it[2..-1] as double[])) }
        }
    }

    var pointArray2out = new DoFn<Points, String>() {
        @ProcessElement
        void processElement(@Element Points pts, OutputReceiver<String> out) {
            String log = "Centroids:\n${pts.pts()*.toString().join('\n')}"
            out.output(log)
        }
    }
  56. Apache Beam Kmeans

    class MeanDoubleArrayCols implements SerializableFunction<Iterable<Point>, Point> {
        @Override
        Point apply(Iterable<Point> inputs) {
            double[] result = new double[12]
            int count = 0
            for (Point input : inputs) {
                result.indices.each { result[it] += input.pts()[it] }
                count++
            }
            result.indices.each { result[it] /= count }
            new Point(result)
        }
    }

    class Squash extends Combine.CombineFn<KV<Integer, Point>, Accum, Points> {
        int k, dims
        @Override
        Accum createAccumulator() { new Accum() }
        @Override
        Accum addInput(Accum mutableAccumulator, KV<Integer, Point> input) { … }
        @Override
        Accum mergeAccumulators(Iterable<Accum> accumulators) { … }
        @Override
        Points extractOutput(Accum accumulator) { … }

        static class Accum implements Serializable {
            List<Point> pts = []
        }
    }
  57. Apache Beam Kmeans

    var assign = { Point pt, Points centroids ->
        var minDistance = Double.POSITIVE_INFINITY
        var nearestCentroidId = -1
        var idxs = pt.pts().indices
        centroids.pts().eachWithIndex { Point next, int cluster ->
            var distance = sqrt(sumSq(idxs.collect { pt.pts()[it] - next.pts()[it] } as double[]))
            if (distance < minDistance) {
                minDistance = distance
                nearestCentroidId = cluster
            }
        }
        KV.of(nearestCentroidId, pt)
    }
  58. Apache Beam Kmeans

    Points initCentroids = new Points((1..k).collect { Point.ofRandom(dims) })
    var points = p
        .apply(Create.of(filename))
        .apply('Read points', ParDo.of(readCsv))
    var centroids = p.apply(Create.of(initCentroids))

    iterations.times {
        var centroidsView = centroids
            .apply(View.<Points> asSingleton())
        centroids = points
            .apply('Assign clusters', ParDo.of(new AssignClusters(centroidsView, assign))
                .withSideInputs(centroidsView))
            .apply('Calculate new centroids', Combine.<Integer, Point> perKey(new MeanDoubleArrayCols()))
            .apply('As Points', Combine.<KV<Integer, Point>, Points> globally(new Squash(k: k, dims: dims)))
    }
    centroids
        .apply('Display centroids', ParDo.of(pointArray2out)).apply(Log.ofElements())
  59. Apache Beam Kmeans

    int k = 5
    int iterations = 10
    int dims = 12

    var pipeline = Pipeline.create()
    def csv = getClass().classLoader.getResource('whiskey.csv').path
    buildPipeline(pipeline, csv, k, iterations, dims)
    pipeline.run().waitUntilFinish()

    Output:
    May 29, 2024 5:47:06 PM org.codehaus.groovy.vmplugin.v8.IndyInterface fromCache
    INFO: Centroids:
    Point[1.22. 2.87. 0.78. 0.11. 0.35. 0.90. 1.87. 0.81. 0.94. 1.86. 1.65. 1.67]
    Point[3.67. 1.50. 3.67. 3.33. 0.67. 0.17. 1.67. 0.50. 1.17. 1.33. 1.17. 0.17]
    Point[1.29. 1.62. 1.00. 0.10. 0.02. 1.17. 0.40. 0.31. 1.36. 1.93. 2.00. 2.14]
    Point[2.81. 2.43. 1.52. 0.05. 0.00. 2.00. 1.71. 2.05. 1.95. 2.05. 2.19. 1.71]
    Point[1.86. 2.00. 1.93. 1.07. 0.21. 1.29. 1.29. 1.00. 1.57. 1.86. 1.00. 1.00]
  60. Apache Beam with Groovy metaprogramming for Python-style coding

    Points initCentroids = new Points((1..k).collect { Point.ofRandom(dims) })
    var points = p
        | Create.of(filename)
        | 'Read points' >> ParDo.of(readCsv)
    var centroids = p | Create.of(initCentroids)

    iterations.times {
        var centroidsView = centroids | View.asSingleton()
        centroids = points
            | 'Assign clusters' >> ParDo.of(new AssignClusters(centroidsView, assign)).withSideInputs(centroidsView)
            | 'Calculate new centroids' >> Combine.perKey(new MeanDoubleArrayCols())
            | 'As Points' >> Combine.globally(new Squash(k: k, dims: dims))
    }
    centroids | 'Display centroids' >> ParDo.of(pointArray2out) | Log.ofElements()

    Output:
    INFO: Centroids:
    Point[4.00. 1.33. 4.00. 4.00. 0.67. 0.00. 1.00. 1.00. 1.00. 1.33. 0.67. 0.00]
    Point[1.56. 2.58. 1.07. 0.02. 0.08. 1.07. 1.00. 0.59. 1.39. 1.60. 1.52. 1.76]
    Point[2.18. 1.88. 2.35. 1.59. 0.24. 0.76. 1.76. 0.47. 1.41. 1.47. 1.65. 1.29]
    Point[2.42. 2.48. 1.27. 0.08. 0.08. 1.84. 1.73. 1.95. 1.98. 2.15. 2.16. 1.92]
    Point[2.24. 2.20. 3.55. 1.85. 1.58. 1.50. 1.97. 2.45. 0.84. 2.02. 0.77. 0.73]
  61. Apache Flink® • Distributed processing engine for stateful computations over

    unbounded and bounded data streams Image sources: Apache Flink documentation
  62. Apache Flink ML

    ML Algorithms:
    • Clustering: KMeans, AgglomerativeClustering
    • Classification • Regression • Evaluation • Feature engineering • Recommendation • Stats • Utility functions

    Image based on Apache Flink documentation
    [diagram: your app on top of the Flink ML library]
  63. Flink ML KMeans

    var eEnv = StreamExecutionEnvironment.executionEnvironment
    var tEnv = StreamTableEnvironment.create(eEnv)
    var file = WhiskeyFlink.classLoader.getResource('whiskey.csv').file
    var source = FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path(file)).build()
    var stream = eEnv
        .fromSource(source, WatermarkStrategy.noWatermarks(), "csvfile")
        .filter(skipHeader).flatMap(splitAndChop)
    var inputTable = tEnv.fromDataStream(stream).as("features")

    var kmeans = new KMeans(k: 3, seed: 1L)
    var kmeansModel = kmeans.fit(inputTable)
    var outputTable = kmeansModel.transform(inputTable)[0]

    var clusters = [:].withDefault { [] }
    outputTable.execute().collect().each { row ->
        var features = row.getField(kmeans.featuresCol)
        var clusterId = row.getField(kmeans.predictionCol)
        clusters[clusterId] << features
    }
    clusters.each { k, v -> println "Cluster $k has ${v.size()} members:\n${v.join('\n')}" }
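    [Not on the slides: the skipHeader and splitAndChop helpers aren't shown; plausible definitions (assumptions, relying on Groovy's closure-to-SAM coercion) might look like:]

        import org.apache.flink.api.common.functions.FilterFunction
        import org.apache.flink.api.common.functions.FlatMapFunction
        import org.apache.flink.ml.linalg.DenseVector
        import org.apache.flink.ml.linalg.Vectors
        import org.apache.flink.util.Collector

        // drop the CSV header line
        FilterFunction<String> skipHeader = { String line -> !line.startsWith('RowID') }
        // turn "id,distillery,f1,...,f12" into a DenseVector of the 12 flavor scores
        FlatMapFunction<String, DenseVector> splitAndChop = { String line, Collector<DenseVector> out ->
            out.collect(Vectors.dense(line.split(',')[2..-1] as double[]))
        }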
  64. Flink ML KMeans (same code as the previous slide, now with its output)

    Cluster 2 has 23 members:
    [2.0, 2.0, 2.0, 0.0, 0.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
    [3.0, 3.0, 1.0, 0.0, 0.0, 4.0, 3.0, 2.0, 2.0, 3.0, 3.0, 2.0]
    [2.0, 3.0, 1.0, 0.0, 0.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0]
    [4.0, 3.0, 2.0, 0.0, 0.0, 2.0, 1.0, 3.0, 3.0, 0.0, 1.0, 2.0]
    …
    Cluster 0 has 46 members:
    [1.0, 3.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 2.0, 3.0, 2.0]
    [2.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 1.0, 1.0]
    [2.0, 3.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 2.0]
    [0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 2.0, 2.0, 3.0, 3.0]
    …
  65. Flink ML Online KMeans

    …
    var data = stream.executeAndCollect().collect { Row.of(it) }
    var train = data[0..79]
    var predict = data[80..-1]

    var trainSource = new PeriodicSourceFunction(1000, train.collate(8))
    var trainStream = eEnv.addSource(trainSource, new RowTypeInfo(DenseVectorTypeInfo.INSTANCE))
    var trainTable = tEnv.fromDataStream(trainStream).as("features")

    var predictSource = new PeriodicSourceFunction(1000, [predict])
    var predictStream = eEnv.addSource(predictSource, new RowTypeInfo(DenseVectorTypeInfo.INSTANCE))
    var predictTable = tEnv.fromDataStream(predictStream).as("features")

    var kmeans = new OnlineKMeans(featuresCol: 'features', predictionCol: 'prediction',
        globalBatchSize: 8, initialModelData: randomInit, k: 3)
    var kmeansModel = kmeans.fit(trainTable)
    var outputTable = kmeansModel.transform(predictTable)[0]
    outputTable.execute().collect().each { row ->
        DenseVector features = (DenseVector) row.getField(kmeans.featuresCol)
        var clusterId = row.getField(kmeans.predictionCol)
        println "Cluster $clusterId: ${features}"
    }
  66. Flink ML Online KMeans

    Cluster 1: [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0]
    Cluster 2: [2.0, 3.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0]
    Cluster 1: [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0]
    Cluster 2: [2.0, 3.0, 2.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 0.0, 1.0]
    Cluster 2: [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0]
    Cluster 0: [2.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
    …
    Cluster 1: [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0]
    Cluster 1: [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0]
    Cluster 1: [2.0, 3.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0]
    Cluster 0: [2.0, 3.0, 2.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 0.0, 1.0]
    Cluster 2: [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0]
    Cluster 2: [2.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
    …
  67. Questions?

    Twitter/X | Mastodon: @paulk_asert | @[email protected]
    Apache Groovy: https://groovy.apache.org/ https://groovy-lang.org/
    Repo: https://github.com/paulk-asert/groovy-data-science
    Slides: https://speakerdeck.com/paulk/groovy-whiskey