Slide 1

Slide 1 text

Whiskey Clustering with Apache Projects: Groovy, Commons CSV, Commons Math, Ignite, Spark, Wayang, Beam, Flink

Dr Paul King
Object Computing & VP Apache Groovy

Twitter/X | Mastodon: @paulk_asert | @[email protected]
Apache Groovy: https://groovy.apache.org/ https://groovy-lang.org/
Repo: https://github.com/paulk-asert/groovy-data-science
Slides: https://speakerdeck.com/paulk/groovy-whiskey

objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved. No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI)

Slide 2

Slide 2 text

• Apache Groovy
• Clustering Overview
• Whiskey Clustering & Visualization
• Scaling Whiskey Clustering

Slide 3

Slide 3 text

Apache Groovy Programming Language
• Multi-faceted extensible language
• Imperative/OO & functional
• Dynamic & static
• Aligned closely with Java
• 20+ years since inception
• 3.5+B downloads (partial count)
• 520+ contributors
• 240+ releases
• https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be

Slide 4

Slide 4 text

Friends of Apache Groovy Open Collective

Slide 5

Slide 5 text

Why use Groovy in 2024?
It’s like a super version of Java:
• Simpler scripting: more powerful yet more concise
• Extension methods: 2000+ enhancements to Java classes for a great out-of-the-box experience (batteries included)
• Flexible Typing: from dynamic duck-typing (terse code) to extensible stronger-than-Java static typing (better checking)
• Improved OO & Functional Features: from traits (more powerful and flexible OO designs) to tail recursion and memoizing/partial application of pure functions
• AST transforms: 10s of lines instead of 100s/1000s of lines
• Java Features Earlier: recent features on older JDKs
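As a rough illustration of two items from the list above, the sketch below shows memoizing and partial application of closures with Groovy's built-in `memoize()` and `curry()`. The closure names and sample values are invented for this example and do not come from the talk.

```groovy
// Illustrative names only (slowSquare, fastSquare, format, pretty); not from the slides.
int calls = 0
def slowSquare = { int n -> calls++; n * n }   // pure computation plus a call counter for the demo
def fastSquare = slowSquare.memoize()          // caches results per argument

assert fastSquare(12) == 144
assert fastSquare(12) == 144                   // second call is served from the cache
assert calls == 1

// Partial application: fix the first argument (the format string) with curry()
def format = { String fmt, Number n -> String.format(Locale.ROOT, fmt, n) }
def pretty = format.curry('%.4f')
assert pretty(2.5) == '2.5000'
```

The same `curry` idea appears later in the deck, where a `sprintf` method reference is curried with a format string.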

Slide 6

Slide 6 text

Scripting for Data Science
• Same example
• Same library

import org.apache.commons.math3.linear.*;

public class MatrixMain {
    public static void main(String[] args) {
        double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}};
        RealMatrix m = MatrixUtils.createRealMatrix(matrixData);

        double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}};
        RealMatrix n = new Array2DRowRealMatrix(matrixData2);

        RealMatrix o = m.multiply(n);

        // Invert o, using LU decomposition
        RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();

        RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);
        RealMatrix q = o.add(p.power(2));

        System.out.println(q);
    }
}

Output:
Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}

Thanks to operator overloading and extensible tooling
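The Groovy counterpart of the Java listing is not present in the extracted slide text. As a hedged sketch of the operator-overloading mechanism the slide refers to, the toy class below shows how Groovy dispatches `*`, `+` and `**` to methods named `multiply`, `plus` and `power`. `TinyMatrix` is a made-up teaching class, not part of Commons Math, and it covers only the operations needed here.

```groovy
// TinyMatrix is a hypothetical class for illustration; it is not the Commons Math API.
class TinyMatrix {
    final double[][] d

    TinyMatrix(List<List<Number>> rows) {
        d = rows.collect { it as double[] } as double[][]
    }

    // a * b is dispatched to a.multiply(b)
    TinyMatrix multiply(TinyMatrix o) {
        new TinyMatrix((0..<d.length).collect { i ->
            (0..<o.d[0].length).collect { j ->
                (0..<o.d.length).sum { k -> d[i][k] * o.d[k][j] }
            }
        })
    }

    // a + b is dispatched to a.plus(b)
    TinyMatrix plus(TinyMatrix o) {
        new TinyMatrix((0..<d.length).collect { i ->
            (0..<d[i].length).collect { j -> d[i][j] + o.d[i][j] }
        })
    }

    // a ** n is dispatched to a.power(n); assumes a square matrix and n >= 1
    TinyMatrix power(int n) {
        (1..<n).inject(this) { acc, ignored -> acc * this }
    }

    String toString() { d.collect { it.toList() }.toString() }
}

def m = new TinyMatrix([[1, 2, 3], [2, 5, 3]])
def n = new TinyMatrix([[1, 2], [2, 5], [1, 7]])
def o = m * n          // same product as m.multiply(n) in the Java listing
def q = o + o ** 2     // reads like the maths: ** binds tighter than +
println q
```

With a real library such as Commons Math, `*` and `**` would map onto existing `multiply` and `power` methods, while `+` would need a `plus` method supplied by an extension, since `RealMatrix` names that operation `add`.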

Slide 7

Slide 7 text

• Apache Groovy
• Clustering Overview
• Whiskey Clustering & Visualization
• Scaling Whiskey Clustering

Slide 8

Slide 8 text

Clustering Overview

Clustering:
• Grouping similar items

Algorithm families:
• Hierarchical
• Partitioning: k-means, x-means
• Density-based
• Graph-based

Aspects:
• Disjoint vs overlapping
• Preset cluster number
• Dimensionality reduction: PCA
• Nominal feature support

Applications:
• Market segmentation
• Recommendation engines
• Search result grouping
• Social network analysis
• Medical imaging

Slide 10

Slide 10 text

Clustering https://commons.apache.org/proper/commons-math/userguide/ml.html

Slide 11

Slide 11 text

Clustering with KMeans
Step 1:
• Guess k cluster centroids at random

Slide 12

Slide 12 text

Clustering with KMeans
Step 1:
• Guess k cluster centroids

Slide 13

Slide 13 text

Clustering with KMeans
Step 1:
• Guess k cluster centroids
Step 2:
• Assign points to closest centroid

Slide 15

Slide 15 text

Clustering with KMeans
Step 1:
• Guess k cluster centroids
Step 2:
• Assign points to closest centroid
Step 3:
• Calculate new centroids based on selected points

Slide 19

Slide 19 text

Clustering with KMeans
Step 1:
• Guess k cluster centroids
Step 2:
• Assign points to closest centroid
Step 3:
• Calculate new centroids based on selected points
Repeat steps 2 and 3 until stable or some limit reached
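The steps above can be sketched in plain Groovy as follows. This is a minimal illustration written for these notes, not the Commons Math implementation; the function names (`kMeans`, `distSq`, `mean`), the fixed seed and the sample points are all assumptions.

```groovy
// Squared Euclidean distance between two equal-length points
def distSq(List a, List b) {
    (0..<a.size()).sum { def diff = a[it] - b[it]; diff * diff }
}

// Component-wise mean of a non-empty list of points
def mean(List<List> pts) {
    (0..<pts[0].size()).collect { i -> pts.sum { it[i] } / pts.size() }
}

def kMeans(List<List> points, int k, int maxIterations = 100, long seed = 42L) {
    // Step 1: guess k centroids by picking k of the points at random
    def shuffled = new ArrayList(points)
    Collections.shuffle(shuffled, new Random(seed))
    def centroids = shuffled.take(k)
    def assignment = []
    for (iter in 1..maxIterations) {
        // Step 2: assign each point to its closest centroid
        def next = points.collect { p -> (0..<k).min { c -> distSq(p, centroids[c]) } }
        if (next == assignment) break              // stable: no point changed cluster
        assignment = next
        // Step 3: recompute each centroid as the mean of its assigned points
        centroids = (0..<k).collect { c ->
            def members = points.indices.findAll { assignment[it] == c }.collect { points[it] }
            members ? mean(members) : centroids[c] // keep the old centroid if a cluster is empty
        }
    }
    [centroids: centroids, assignment: assignment]
}

// Two well-separated groups of made-up 2D points
def pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
def result = kMeans(pts, 2)
println result.assignment
println result.centroids
```

Picking the initial centroids from the data points is one simple choice; the `KMeansPlusPlusClusterer` used later in the deck refers to a more careful seeding strategy.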

Slide 23

Slide 23 text

• Apache Groovy
• Clustering Overview
• Whiskey Clustering & Visualization
• Scaling Whiskey Clustering

Slide 24

Slide 24 text

Clustering case study: Whiskey flavor profiles
• 86 scotch whiskies
• 12 flavor categories

Pictures:
https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c
https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/
https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/

Slide 25

Slide 25 text

Clustering case study: Whiskey flavor profiles
• Read CSV records
• Slice out segments of interest

var file = getClass().classLoader.getResource('whiskey.csv').file as File
var builder = RFC4180.builder().build()
var records = file.withReader { r -> builder.parse(r).records*.toList() }
var features = records[0][2..-1]
var data = records[1..-1].collect{ new DoublePoint(it[2..-1] as int[]) }
var distilleries = records[1..-1]*.get(1)

[Diagram labels: column indices 0, 1, 2, -1; row indices 0, 1, …; regions: distilleries, data, features]
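The slicing idiom can be tried without any CSV library. The sketch below uses a tiny inline stand-in for `whiskey.csv` (the distillery names and values are made up) and plain string splitting instead of Commons CSV, purely to show what the ranges `[2..-1]` and `[1..-1]` and the spread call `*.get(1)` select.

```groovy
// Hypothetical sample data standing in for whiskey.csv; the values are invented.
def csv = '''\
RowID,Distillery,Body,Sweetness,Smoky
1,Alpha,2,2,1
2,Beta,3,1,4
'''

def rows = csv.readLines()*.split(',')*.toList()

def features = rows[0][2..-1]                             // header row, from the third column to the last
def data = rows[1..-1].collect { it[2..-1]*.toDouble() }  // all data rows, numeric columns only
def distilleries = rows[1..-1]*.get(1)                    // second column of every data row

println features
println data
println distilleries
```

Negative indices count from the end, so `2..-1` means "from index 2 through the last element" regardless of how many flavor columns there are.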

Slide 26

Slide 26 text

Whiskey Clusters – Apache Commons Math

var clusterer = new KMeansPlusPlusClusterer(4)
Map clusterPts = [:]
var clusters = clusterer.cluster(data)
println features.join(', ')
var centroids = categoryDataset()
clusters.eachWithIndex { ctrd, num ->
    var cpt = ctrd.center.point
    clusterPts[num] = ctrd.points.collect { pt -> data.point.findIndexOf { it == pt.point } }
    println cpt.collect { sprintf '%.3f', it }.join(', ')
    cpt.eachWithIndex { val, idx -> centroids.addValue(val, "Cluster ${num + 1}", features[idx]) }
}

Output:
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
1.630, 2.333, 1.148, 0.222, 0.037, 1.185, 1.037, 0.556, 1.963, 1.630, 2.000, 2.111
2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545
1.450, 2.550, 1.150, 0.400, 0.150, 0.850, 1.400, 0.600, 0.450, 1.800, 1.700, 2.000
2.607, 2.357, 1.643, 0.107, 0.036, 1.893, 1.679, 1.821, 1.679, 2.107, 1.929, 1.536

Slide 28

Slide 28 text

Whiskey Clusters – Apache Commons Math

println "\n${cols.join(', ')}, Medoid"
var medoids = categoryDataset()
clusters.eachWithIndex { ctrd, num ->
    var cpt = ctrd.center.point
    var closest = ctrd.points.min { pt -> sumSq((0.. row.point == closest.point }
    println data[medoidIdx].point.collect { sprintf '%.3f', it }.join(', ') + ", ${distilleries[medoidIdx]}"
    data[medoidIdx].point.eachWithIndex { val, idx -> medoids.addValue(val, distilleries[medoidIdx], cols[idx]) }
}

Output:
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid
1.000, 3.000, 1.000, 0.000, 0.000, 1.000, 1.000, 0.000, 2.000, 2.000, 2.000, 2.000, Cardhu
3.000, 2.000, 3.000, 3.000, 1.000, 0.000, 2.000, 0.000, 1.000, 1.000, 2.000, 0.000, Clynelish
1.000, 3.000, 1.000, 0.000, 0.000, 1.000, 1.000, 0.000, 1.000, 2.000, 2.000, 2.000, Glenallachie
2.000, 2.000, 2.000, 0.000, 0.000, 2.000, 1.000, 2.000, 2.000, 2.000, 2.000, 2.000, Aberfeldy

Slide 29

Slide 29 text

Dimensionality reduction

Slide 30

Slide 30 text

import …

def table = Table.read().csv('whiskey.csv')
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def data = table.as().doubleMatrix(*cols)
def pca = new PCA(data)
pca.projection = 2
def plots = [PlotCanvas.screeplot(pca)]
def projected = pca.project(data)
table = table.addColumns(
    *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0.. def clusterer = new KMeans(data, k)
double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..

Slide 31

Slide 31 text

Whiskey – Exploring Weka clustering algorithms

Slide 32

Slide 32 text

Whiskey – clustering and visualizing centroids

…
def data = table.as().doubleMatrix(*cols)
def pca = new PCA(data)
pca.projection = 3
def projected = pca.project(data)
def clusterer = new KMeans(data, 5)
def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
table = table.addColumns(
    *(0..<3).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0.. toAdd[0].setString("Cluster", "Cluster " + (idx+1))
    (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) }
    toAdd[0].setDouble("Centroid", 50)
    table.append(toAdd)
}
def title = "Clusters x Principal Components w/ centroids"
Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))

Slide 33

Slide 33 text

Whiskey – Hierarchical clustering with Dendrogram

…
def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
    title = 'Whiskey Dendrogram'
    setAxisLabels('Distilleries', 'Similarity')
    def lb = lowerBounds
    setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
    distilleries.eachWithIndex { String label, int i ->
        add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
    }
}.panel()

def pca = PCA.fit(data)
pca.projection = 2
def projected = pca.project(data)
char mark = '#'
def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
    title = 'Clustered by dendrogram partitions'
    setAxisLabels('PCA1', 'PCA2')
}.panel()

new PlotGrid(dendrogram, scatter).window()

Slide 34

Slide 34 text

• Apache Groovy
• Clustering Overview
• Whiskey Clustering & Visualization
• Scaling Whiskey Clustering

Slide 35

Slide 35 text

Clustering case study: Whiskey flavor profiles • Distributed clustering?

Slide 36

Slide 36 text

Clustering case study: Whiskey flavor profiles Node 1 Node 2

Slide 38

Slide 38 text

Scaling up machine learning: Apache Ignite
• Apache Ignite is a distributed database for high-performance computing with in-memory speed. In simple terms, it makes a cluster (or grid) of nodes appear like an in-memory cache.
• It has cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation, among others.

Image source: Apache Ignite documentation

Slide 39

Slide 39 text

Clustering case study: Whiskey flavor profiles
• 86 scotch whiskies
• 12 flavor categories
• Apache Ignite has special capabilities for reading data into the cache
• In a cluster environment, use IgniteDataStreamer or IgniteCache.loadCache() to load data from files, stream sources, database sources, etc.
• For our little example, we have a small CSV file and a single node, so we’ll just read our data using Apache Commons CSV

Slide 40

Slide 40 text

Clustering case study: Whiskey flavor profiles
• 86 scotch whiskies
• 12 flavor categories
• Let’s select the regions of interest

Slide 41

Slide 41 text

Clustering case study: Whiskey flavor profiles
• Read CSV rows
• Slice out segments of interest

var file = getClass().classLoader.getResource('whiskey.csv').file as File
var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() }
var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]
var distilleries = rows[1..-1]*.get(1)
var features = rows[0][2..-1]

[Diagram labels: column indices 0, 1, 2, -1; row indices 0, 1, …; regions: distilleries, data, features]

Slide 42

Slide 42 text

Clustering case study: Whiskey flavor profiles
• Set up configuration & define some helper variables

// configure to all run on local machine but could be a cluster (can be hidden in XML)
var cfg = new IgniteConfiguration(
    peerClassLoadingEnabled: true,
    discoverySpi: new TcpDiscoverySpi(
        ipFinder: new TcpDiscoveryMulticastIpFinder(
            addresses: ['127.0.0.1:47500..47509']
        )
    )
)

var pretty = this.&sprintf.curry('%.4f')
var dist = new EuclideanDistance() // or ManhattanDistance
var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)

Slide 44

Slide 44 text

Whiskey flavors – scaling clustering

Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids:\n${features.join(', ')}"
    var centroids = mdl.centers*.all()
    var cols = centroids.collect{ it*.get() }
    cols.each { c -> println c.collect(pretty).join(', ') }
    dataCache.destroy()
}

Output:
[11:48:48] __________ ________________
[11:48:48] / _/ ___/ |/ / _/_ __/ __/
[11:48:48] _/ // (7 7 // / / / / _/
[11:48:48] /___/\___/_/|_/___/ /_/ /x___/
[11:48:48]
[11:48:48] ver. 2.15.0#20230425-sha1:f98f7f35
[11:48:48] 2023 Copyright(C) Apache Software Foundation
…
>>> Ignite grid started for data: 86 rows X 12 cols
>>> KMeans centroids:
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
2.3793, 1.0345, 0.2414, 0.0345, 0.8966, 1.1034, 0.5517, 1.5517, 1.6207, 2.1724, 2.1379
2.5556, 1.4444, 0.0556, 0.0000, 1.8333, 1.6667, 2.3333, 2.0000, 2.0000, 2.2222, 1.5556
3.1429, 1.0000, 0.2857, 0.1429, 0.8571, 0.5714, 0.7143, 0.7143, 1.5714, 0.7143, 1.5714
2.0476, 1.7619, 0.3333, 0.1429, 1.7619, 1.7619, 0.7143, 1.0952, 2.1429, 1.6190, 1.8571
1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455

Slide 45

Slide 45 text

Whiskey flavors – scaling clustering

…
var clusters = [:].withDefault{ [] }
dataCache.query(new ScanQuery()).withCloseable { observations ->
    observations.each { observation ->
        def (k, v) = observation.with{ [getKey(), getValue()] }
        int prediction = mdl.predict(vectorizer.extractFeatures(k, v))
        clusters[prediction] += distilleries[k]
    }
}
clusters.sort{ e -> e.key }.each{ k, v ->
    println "Cluster ${k+1}: ${v.join(', ')}"
}
…

Output:
…
Cluster 1: Bunnahabhain, Dufftown, Glenmorangie, Teaninich, Glenallachie, Longmorn, Scapa, Tobermory, AnCnoc, Cardhu, GlenElgin, Mannochmore, Speyside, Craigganmore, GlenGrant, Tullibardine, Auchentoshan, Bladnoch, GlenKeith, Glengoyne, Knochando, Strathmill, GlenMoray, Aultmore, Tamdhu, Balblair, Glenlossie, Linkwood, Tamnavulin
Cluster 2: Aberfeldy, Balmenach, RoyalLochnagar, Aberlour, Edradour, Glenrothes, Glendronach, Glenturret, Macallan, Glendullan, Glenfarclas, Mortlach, Strathisla, Dailuaine, Auchroisk, BlairAthol, Dalmore, Glenlivet
Cluster 3: GlenSpey, GlenDeveronMacduff, Speyburn, Miltonduff, Tomore, ArranIsleOf, Glenfiddich
Cluster 4: Loch Lomond, Belvenie, BenNevis, Tomatin, Benriach, Highland Park, Tomintoul, Ardmore, Benrinnes, Craigallechie, GlenGarioch, Inchgower, Benromach, Glenkinchie, OldFettercairn, Bowmore, Dalwhinnie, GlenOrd, Bruichladdich, Deanston, RoyalBrackla
Cluster 5: Caol Ila, Ardbeg, Clynelish, Springbank, Isle of Jura, Oban, Lagavulin, Talisker, Laphroig, OldPulteney, GlenScotia
…

Slide 46

Slide 46 text

Whiskey flavors – scaling clustering

var dist = new EuclideanDistance()
…
Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids:\n${features.join(', ')}"
    var centroids = mdl.centers*.all()
    var cols = centroids.collect{ it*.get() }
    cols.each { c -> println c.collect(pretty).join(', ') }
    dataCache.destroy()
}

[Figure: Euclidean distance illustration, labeled 5]

Slide 47

Slide 47 text

Whiskey flavors – scaling clustering

var dist = new ManhattanDistance()
…
Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids:\n${features.join(', ')}"
    var centroids = mdl.centers*.all()
    var cols = centroids.collect{ it*.get() }
    cols.each { c -> println c.collect(pretty).join(', ') }
    dataCache.destroy()
}

[Figure: Manhattan distance illustration, labeled 4, 3 and 3 + 4 = 7]
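The difference between the two distance measures can be checked on a small example. The sketch below is plain Groovy rather than the Ignite distance classes, and the points (0, 0) and (3, 4) are an assumption chosen because they appear to match the numbers left over from the slide figures (5, and 3 + 4 = 7).

```groovy
// Plain-Groovy closures for illustration; these are not the Ignite EuclideanDistance/ManhattanDistance classes.
def euclidean = { List a, List b ->
    Math.sqrt((0..<a.size()).sum { (a[it] - b[it]) * (a[it] - b[it]) } as double)
}
def manhattan = { List a, List b ->
    (0..<a.size()).sum { Math.abs(a[it] - b[it]) }
}

def origin = [0, 0]
def corner = [3, 4]
println euclidean(origin, corner)   // straight-line distance
println manhattan(origin, corner)   // sum of the per-axis differences
```

Because Manhattan distance adds up per-axis differences, switching the trainer's distance measure can change which centroid is nearest for a given whiskey.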

Slide 48

Slide 48 text

Whiskey flavors – scaling clustering

Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration(
        name: "TEST_${UUID.randomUUID()}",
        affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new GmmTrainer().withMaxCountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
    …
    dataCache.destroy()
}

Image source: wikipedia

Slide 49

Slide 49 text

Apache Spark
• Multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters

[Diagram labels: Driver node (Spark Session), Cluster Manager, Worker nodes (three Executors, with Cache and Tasks)]

Slide 50

Slide 50 text

Apache Spark MLlib

ML Algorithms
• Clustering
  • Kmeans
  • Bisecting Kmeans
  • Latent Dirichlet Allocation
  • Gaussian Mixture Model
  • Power Iteration Clustering
• Classification
• Regression
• Feature engineering
• Stats
• Utility functions

[Diagram labels: Your APP, MLlib, Spark SQL, Spark Core]

Slide 52

Slide 52 text

Whiskey flavors – scaling clustering

var spark = builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate
var file = WhiskeySpark.classLoader.getResource('whiskey.csv').file
var rows = spark.read().format('com.databricks.spark.csv')
    .options(header: 'true', inferSchema: 'true').load(file)
String[] colNames = rows.columns().toList() - ['RowID', 'Distillery']
var assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
var dataset = assembler.transform(rows)
var kmeans = new KMeans(k: 5, seed: 1L)
var model = kmeans.fit(dataset)
println 'Cluster centers:'
model.clusterCenters().each { println it.values().collect { sprintf '%.2f', it }.join(', ') }
var result = model.transform(dataset)
var clusters = result.toLocalIterator().collect { row ->
    [row.getAs('prediction'), row.getAs('Distillery')]
}.groupBy { it[0] }.collectValues { it*.get(1) }
clusters.each { k, v -> println "Cluster$k: ${v.join(', ')}"}
spark.stop()

Output:
Cluster centers:
2.89, 2.42, 1.53, 0.05, 0.00, 1.84, 1.58, 2.11, 2.11, 2.11, 2.26, 1.58
1.45, 2.35, 1.06, 0.26, 0.06, 0.84, 1.13, 0.45, 1.26, 1.65, 2.19, 2.10
1.83, 3.17, 1.00, 0.33, 0.17, 1.00, 0.67, 0.83, 0.83, 1.50, 0.50, 1.50
3.00, 1.50, 3.00, 2.80, 0.50, 0.30, 1.40, 0.50, 1.50, 1.50, 1.30, 0.50
1.85, 2.20, 1.70, 0.40, 0.10, 1.85, 1.80, 1.00, 1.35, 2.00, 1.40, 1.85
Cluster0: Aberfeldy, Aberlour, Auchroisk, Balmenach, BenNevis, Benrinnes, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
Cluster1: AnCnoc, Auchentoshan, Aultmore, Balblair, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dufftown, GlenElgin, GlenGrant, GlenKeith, GlenMoray, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Linkwood, Loch Lomond, Mannochmore, RoyalBrackla, Speyside, Strathmill, Tamdhu, Tamnavulin, Teaninich, Tobermory, Tullibardine
Cluster3: Ardbeg, Caol Ila, Clynelish, GlenScotia, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Talisker
Cluster4: Ardmore, Belvenie, Benromach, Bowmore, Bruichladdich, Craigallechie, Dalwhinnie, Deanston, GlenGarioch, GlenOrd, Glenlivet, Glenturret, Highland Park, Inchgower, Knochando, OldFettercairn, Scapa, Springbank, Tomatin, Tomintoul
Cluster2: ArranIsleOf, GlenDeveronMacduff, GlenSpey, Miltonduff, Speyburn, Tomore

Slide 53

Slide 53 text

Apache Wayang
• A unified data processing framework that seamlessly integrates and orchestrates multiple data platforms to deliver unparalleled performance and flexibility

Image source: Apache Wayang documentation

Slide 54

Slide 54 text

Apache Wayang
• Offers two approaches for us:
  • Roll your own Kmeans algorithm using existing operators
    • Built upon 4 abstractions: UnaryToUnaryOperator, BinaryToUnaryOperator, UnarySource, UnarySink
    • Many built-in operators: Map, Filter, Reduce, Distinct, Count, GroupBy
  • ML4all abstracts most ML algorithms with seven operators:
    • Transform, Stage, Compute, Update, Sample, Converge, Loop
    • Kmeans implementation included in next release

Image source: Apache Wayang documentation

Slide 55

Slide 55 text

Apache Wayang: Roll your own Kmeans

Domain classes:

record Point(double[] pts) implements Serializable { }

record PointGrouping(double[] pts, int cluster, long count) implements Serializable {
    PointGrouping(List pts, int cluster, long count) {
        this(pts as double[], cluster, count)
    }

    PointGrouping plus(PointGrouping that) {
        var newPts = pts.indices.collect{ pts[it] + that.pts[it] }
        new PointGrouping(newPts, cluster, count + that.count)
    }

    PointGrouping average() {
        new PointGrouping(pts.collect{ double d -> d/count }, cluster, 1)
    }
}

Slide 56

Slide 56 text

Apache Wayang: Roll your own Kmeans

Algorithm class:

class SelectNearestCentroid implements ExtendedSerializableFunction {
    Iterable centroids

    void open(ExecutionContext context) {
        centroids = context.getBroadcast('centroids')
    }

    PointGrouping apply(Point p) {
        var minDistance = Double.POSITIVE_INFINITY
        var nearestCentroidId = -1
        for (c in centroids) {
            var distance = sqrt(p.pts.indices.collect{ p.pts[it] - c.pts[it] }.sum{ it ** 2 } as double)
            if (distance < minDistance) {
                minDistance = distance
                nearestCentroidId = c.cluster
            }
        }
        new PointGrouping(p.pts, nearestCentroidId, 1)
    }
}

Slide 57

Slide 57 text

Apache Wayang: Roll your own Kmeans

class PipelineOps {
    public static SerializableFunction cluster = tpc -> tpc.cluster
    public static SerializableFunction average = tpc -> tpc.average()
    public static SerializableBinaryOperator plus = (tpc1, tpc2) -> tpc1 + tpc2
}
import static PipelineOps.*

int k = 5
int iterations = 10

// read in data from our file
var url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
def rows = new File(url).readLines()[1..-1]*.split(',')
var distilleries = rows*.getAt(1)
var pointsData = rows.collect{ new Point(it[2..-1] as double[]) }
var dims = pointsData[0].pts.size()

// create some random points as initial centroids
var r = new Random()
var randomPoint = { (0..

Slide 58

Slide 58 text

Apache Wayang: Roll your own Kmeans

var context = new WayangContext()
    .withPlugin(Java.basicPlugin())
    .withPlugin(Spark.basicPlugin())
var planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)")

var points = planBuilder
    .loadCollection(pointsData).withName('Load points')

var initialCentroids = planBuilder
    .loadCollection((0.. new PointGrouping(initPts[idx], idx, 0) })
    .withName('Load random centroids')

var finalCentroids = initialCentroids.repeat(iterations, currentCentroids ->
    points.map(new SelectNearestCentroid())
        .withBroadcast(currentCentroids, 'centroids').withName('Find nearest centroid')
        .reduceByKey(cluster, plus).withName('Aggregate points')
        .map(average).withName('Average points')
        .withOutputClass(PointGrouping)
).withName('Loop').collect()
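The loop body above maps each point to its nearest centroid, reduces by cluster key with `plus`, then averages. A rough equivalent of one such iteration using only Groovy collections might look like the sketch below. It does not use the Wayang API; the `nearest` and `step` closures, the map-based grouping and the sample data are all made up for illustration.

```groovy
// Index of the centroid closest to point p (squared Euclidean distance)
def nearest = { List p, List centroids ->
    (0..<centroids.size()).min { c ->
        (0..<p.size()).sum { def diff = p[it] - centroids[c][it]; diff * diff }
    }
}

// One iteration: map -> reduce by key -> average
def step = { List points, List centroids ->
    points
        .collect { p -> [cluster: nearest(p, centroids), pts: p, count: 1] }   // role of SelectNearestCentroid
        .groupBy { it.cluster }                                                // key by cluster id
        .collectEntries { cluster, group ->
            // role of plus: add points component-wise and add up the counts
            def total = group.inject { a, b ->
                [cluster: cluster, pts: [a.pts, b.pts].transpose()*.sum(), count: a.count + b.count]
            }
            // role of average: divide the summed point by the member count
            [cluster, total.pts.collect { it / total.count }]
        }
}

def samplePoints = [[0, 0], [0, 2], [10, 10], [10, 12]]
def startCentroids = [[1, 1], [9, 9]]
def updated = step(samplePoints, startCentroids)
println updated
```

Keeping the running sum and count together in one value is what lets the reduction be applied pairwise in any order, which matters once the work is spread across a platform such as Spark.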

Slide 59

Slide 59 text

Apache Wayang: Roll your own Kmeans

println 'Centroids:'
finalCentroids.each { c ->
    println "Cluster $c.cluster: ${c.pts.collect { sprintf '%.2f', it }.join(', ')}"
}

Output:
Centroids:
Cluster 0: 2.53, 1.65, 2.76, 2.12, 0.29, 0.65, 1.65, 0.59, 1.35, 1.41, 1.35, 0.94
Cluster 2: 3.33, 2.56, 1.67, 0.11, 0.00, 1.89, 1.89, 2.78, 2.00, 1.89, 2.33, 1.33
Cluster 3: 1.42, 2.47, 1.03, 0.22, 0.06, 1.00, 1.03, 0.47, 1.19, 1.72, 1.92, 2.08
Cluster 4: 2.25, 2.38, 1.38, 0.08, 0.13, 1.79, 1.54, 1.33, 1.75, 2.17, 1.75, 1.79

var allocator = new SelectNearestCentroid(centroids: finalCentroids)
var allocations = pointsData.withIndex()
    .collect{ pt, idx -> [allocator.apply(pt).cluster, distilleries[idx]] }
    .groupBy{ cluster, ds -> "Cluster $cluster" }
    .collectValues{ v -> v.collect{ it[1] } }
    .sort{ e1, e2 -> e1.key <=> e2.key }
allocations.each{ c, ds -> println "$c (${ds.size()} members): ${ds.join(', ')}" }

Output:
Cluster 0 (17 members): Ardbeg, Balblair, Bowmore, Bruichladdich, Caol Ila, Clynelish, GlenGarioch, GlenScotia, Highland Park, Isle of Jura, Lagavulin, Laphroig, Oban, OldPulteney, Springbank, Talisker, Teaninich
Cluster 2 (9 members): Aberlour, Balmenach, Dailuaine, Dalmore, Glendronach, Glenfarclas, Macallan, Mortlach, RoyalLochnagar
Cluster 3 (36 members): AnCnoc, ArranIsleOf, Auchentoshan, Aultmore, Benriach, Bladnoch, Bunnahabhain, Cardhu, Craigganmore, Dalwhinnie, Dufftown, GlenElgin, GlenGrant, GlenMoray, GlenSpey, Glenallachie, Glenfiddich, Glengoyne, Glenkinchie, Glenlossie, Glenmorangie, Inchgower, Linkwood, Loch Lomond, Mannochmore, Miltonduff, RoyalBrackla, Speyburn, Speyside, Strathmill, Tamdhu, Tamnavulin, Tobermory, Tomintoul, Tomore, Tullibardine
Cluster 4 (24 members): Aberfeldy, Ardmore, Auchroisk, Belvenie, BenNevis, Benrinnes, Benromach, BlairAthol, Craigallechie, Deanston, Edradour, GlenDeveronMacduff, GlenKeith, GlenOrd, Glendullan, Glenlivet, Glenrothes, Glenturret, Knochando, Longmorn, OldFettercairn, Scapa, Strathisla, Tomatin

Slide 60

Slide 60 text

Apache Wayang: ML4all

int k = 3
int maxIterations = 100
double accuracy = 0

class TransformCSV extends Transform {
    double[] transform(String input) {
        input.split(',')[2..-1] as double[]
    }
}

class KMeansStageWithRandoms extends LocalStage {
    int k, dimension
    private r = new Random()

    void staging(ML4allModel model) {
        double[][] centers = new double[k][]
        for (i in 0..

Slide 61

Slide 61 text

Apache Wayang: ML4all

var url = WhiskeyWayangML.classLoader.getResource('whiskey_noheader.csv').path
var dims = 12
var context = new WayangContext()
    .withPlugin(Spark.basicPlugin())
    .withPlugin(Java.basicPlugin())

var plan = new ML4allPlan(
    transformOp: new TransformCSV(),
    localStage: new KMeansStageWithRandoms(k: k, dimension: dims),
    computeOp: new KMeansCompute(),
    updateOp: new KMeansUpdate(),
    loopOp: new KMeansConvergeOrMaxIterationsLoop(accuracy, maxIterations)
)

var model = plan.execute('file:' + url, context)
model.getByKey("centers").eachWithIndex { center, idx ->
    var pts = center.collect { sprintf '%.2f', it }.join(', ')
    println "Cluster$idx: $pts"
}

Output:
Cluster0: 1.57, 2.32, 1.32, 0.45, 0.09, 1.08, 1.19, 0.60, 1.26, 1.74, 1.72, 1.85
Cluster1: 3.43, 1.57, 3.43, 3.14, 0.57, 0.14, 1.71, 0.43, 1.29, 1.43, 1.29, 0.14
Cluster2: 2.73, 2.42, 1.46, 0.04, 0.04, 1.88, 1.69, 1.88, 1.92, 2.04, 2.12, 1.81

Slide 62

Slide 62 text

Apache Beam®
• Apache Beam offers a unified programming model for batch and streaming data processing pipelines
• The pipeline abstraction encapsulates all the data and steps in your data processing task
• Apache Beam unifies multiple data processing engines and SDKs around its distinctive Beam model
• Several language SDKs: Java, Groovy (via Java JDK), Python, Go, SQL, …

Image sources: Apache Beam documentation

Slide 63

Slide 63 text

Apache Beam Kmeans

record Point(double[] pts) implements Serializable {
    private static Random r = new Random()
    private static Closure randomPoint = { dims ->
        (1..dims).collect { r.nextGaussian() + 2 } as double[]
    }

    static Point ofRandom(int dims) {
        new Point(randomPoint(dims))
    }

    String toString() {
        "Point[${pts.collect{ sprintf '%.2f', it }.join('. ')}]"
    }
}

record Points(List pts) implements Serializable { }

Slide 64

Slide 64 text

Apache Beam Kmeans

var readCsv = new DoFn() {
    @ProcessElement
    void processElement(@Element String path, OutputReceiver receiver) throws IOException {
        def parser = CSV.builder().setHeader().setSkipHeaderRecord(true).build()
        def records = new File(path).withReader{ rdr -> parser.parse(rdr).records*.toList() }
        records.each { receiver.output(new Point(it[2..-1] as double[])) }
    }
}

var pointArray2out = new DoFn() {
    @ProcessElement
    void processElement(@Element Points pts, OutputReceiver out) {
        String log = "Centroids:\n${pts.pts()*.toString().join('\n')}"
        out.output(log)
    }
}

Slide 65

Slide 65 text

Apache Beam Kmeans

    class MeanDoubleArrayCols implements SerializableFunction<Iterable<Point>, Point> {
        @Override
        Point apply(Iterable<Point> inputs) {
            double[] result = new double[12]
            int count = 0
            for (Point input : inputs) {
                result.indices.each { result[it] += input.pts()[it] }
                count++
            }
            result.indices.each { result[it] /= count }
            new Point(result)
        }
    }

    class Squash extends Combine.CombineFn<KV<Integer, Point>, Accum, Points> {
        int k, dims

        @Override
        Accum createAccumulator() { new Accum() }

        @Override
        Accum addInput(Accum mutableAccumulator, KV<Integer, Point> input) { … }

        @Override
        Accum mergeAccumulators(Iterable<Accum> accumulators) { … }

        @Override
        Points extractOutput(Accum accumulator) { … }

        static class Accum implements Serializable {
            List<Point> pts = []
        }
    }
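The centroid update in MeanDoubleArrayCols is a column-wise mean over the points assigned to one cluster. As a minimal sketch of that step outside Beam, here is a plain-Java version; the class name ColumnMeans and its method are hypothetical illustrations and are not part of the slides or of the Beam API. Unlike the slide, it takes the dimension from the first point instead of hard-coding 12.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Apache Beam or the talk's repo.
final class ColumnMeans {

    // Returns the per-dimension mean of the given points.
    // Assumes every point has the same length as the first one.
    static double[] mean(List<double[]> points) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("need at least one point");
        }
        int dims = points.get(0).length;
        double[] result = new double[dims];
        // Sum each column across all points.
        for (double[] p : points) {
            for (int i = 0; i < dims; i++) {
                result[i] += p[i];
            }
        }
        // Divide each column sum by the number of points.
        for (int i = 0; i < dims; i++) {
            result[i] /= points.size();
        }
        return result;
    }

    public static void main(String[] args) {
        double[] m = mean(List.of(new double[]{1, 2}, new double[]{3, 6}));
        System.out.println(Arrays.toString(m)); // prints [2.0, 4.0]
    }
}
```

An empty cluster has no mean, so the sketch rejects empty input rather than dividing by zero.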

Slide 66

Slide 66 text

Apache Beam Kmeans

    var assign = { Point pt, Points centroids ->
        var minDistance = Double.POSITIVE_INFINITY
        var nearestCentroidId = -1
        var idxs = pt.pts().indices
        centroids.pts().eachWithIndex { Point next, int cluster ->
            var distance = sqrt(sumSq(idxs.collect { pt.pts()[it] - next.pts()[it] } as double[]))
            if (distance < minDistance) {
                minDistance = distance
                nearestCentroidId = cluster
            }
        }
        KV.of(nearestCentroidId, pt)
    }
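The assign closure picks, for each point, the index of the centroid with the smallest Euclidean distance. The same logic can be sketched in plain Java without Beam's KV type or the Commons Math helpers; the class NearestCentroid and its method names are hypothetical and chosen only for this illustration.

```java
// Hypothetical helper for illustration only; not part of Apache Beam or the talk's repo.
final class NearestCentroid {

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sumSq = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sumSq += d * d;
        }
        return Math.sqrt(sumSq);
    }

    // Returns the index of the centroid closest to the point,
    // or -1 if there are no centroids (mirroring the closure's initial value).
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double min = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = distance(point, centroids[c]);
            if (d < min) {
                min = d;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {4, 4}};
        System.out.println(nearest(new double[]{3, 3}, centroids)); // prints 1
    }
}
```

Because the comparison is strict, ties go to the centroid with the lower index, as in the closure on the slide.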

Slide 67

Slide 67 text

Apache Beam Kmeans

    Points initCentroids = new Points((1..k).collect { Point.ofRandom(dims) })

    var points = p
        .apply(Create.of(filename))
        .apply('Read points', ParDo.of(readCsv))

    var centroids = p.apply(Create.of(initCentroids))

    iterations.times {
        var centroidsView = centroids
            .apply(View.<Points>asSingleton())

        centroids = points
            .apply('Assign clusters', ParDo.of(new AssignClusters(centroidsView, assign)).withSideInputs(centroidsView))
            .apply('Calculate new centroids', Combine.<Integer, Point>perKey(new MeanDoubleArrayCols()))
            .apply('As Points', Combine.<KV<Integer, Point>, Points>globally(new Squash(k: k, dims: dims)))
    }

    centroids
        .apply('Display centroids', ParDo.of(pointArray2out)).apply(Log.ofElements())

Slide 68

Slide 68 text

Apache Beam Kmeans

    int k = 5
    int iterations = 10
    int dims = 12

    var pipeline = Pipeline.create()
    def csv = getClass().classLoader.getResource('whiskey.csv').path
    buildPipeline(pipeline, csv, k, iterations, dims)
    pipeline.run().waitUntilFinish()

Output:

    May 29, 2024 5:47:06 PM org.codehaus.groovy.vmplugin.v8.IndyInterface fromCache
    INFO: Centroids:
    Point[1.22. 2.87. 0.78. 0.11. 0.35. 0.90. 1.87. 0.81. 0.94. 1.86. 1.65. 1.67]
    Point[3.67. 1.50. 3.67. 3.33. 0.67. 0.17. 1.67. 0.50. 1.17. 1.33. 1.17. 0.17]
    Point[1.29. 1.62. 1.00. 0.10. 0.02. 1.17. 0.40. 0.31. 1.36. 1.93. 2.00. 2.14]
    Point[2.81. 2.43. 1.52. 0.05. 0.00. 2.00. 1.71. 2.05. 1.95. 2.05. 2.19. 1.71]
    Point[1.86. 2.00. 1.93. 1.07. 0.21. 1.29. 1.29. 1.00. 1.57. 1.86. 1.00. 1.00]

Slide 69

Slide 69 text

Apache Beam with Groovy metaprogramming for Python-style coding

    Points initCentroids = new Points((1..k).collect { Point.ofRandom(dims) })

    var points = p
        | Create.of(filename)
        | 'Read points' >> ParDo.of(readCsv)

    var centroids = p | Create.of(initCentroids)

    iterations.times {
        var centroidsView = centroids
            | View.asSingleton()

        centroids = points
            | 'Assign clusters' >> ParDo.of(new AssignClusters(centroidsView, assign)).withSideInputs(centroidsView)
            | 'Calculate new centroids' >> Combine.perKey(new MeanDoubleArrayCols())
            | 'As Points' >> Combine.globally(new Squash(k: k, dims: dims))
    }

    centroids
        | 'Display centroids' >> ParDo.of(pointArray2out)
        | Log.ofElements()

Output:

    INFO: Centroids:
    Point[4.00. 1.33. 4.00. 4.00. 0.67. 0.00. 1.00. 1.00. 1.00. 1.33. 0.67. 0.00]
    Point[1.56. 2.58. 1.07. 0.02. 0.08. 1.07. 1.00. 0.59. 1.39. 1.60. 1.52. 1.76]
    Point[2.18. 1.88. 2.35. 1.59. 0.24. 0.76. 1.76. 0.47. 1.41. 1.47. 1.65. 1.29]
    Point[2.42. 2.48. 1.27. 0.08. 0.08. 1.84. 1.73. 1.95. 1.98. 2.15. 2.16. 1.92]
    Point[2.24. 2.20. 3.55. 1.85. 1.58. 1.50. 1.97. 2.45. 0.84. 2.02. 0.77. 0.73]

Slide 70

Slide 70 text

Apache Flink®
• Distributed processing engine for stateful computations over unbounded and bounded data streams
Image sources: Apache Flink documentation

Slide 71

Slide 71 text

Apache Flink ML
ML Algorithms
• Clustering
  • Kmeans
  • AgglomerativeClustering
• Classification
• Regression
• Evaluation
• Feature engineering
• Recommendation
• Stats
• Utility functions
Image based on Apache Flink documentation (diagram labels: ML, Your APP)

Slide 72

Slide 72 text

Flink ML KMeans

    var eEnv = StreamExecutionEnvironment.executionEnvironment
    var tEnv = StreamTableEnvironment.create(eEnv)

    var file = WhiskeyFlink.classLoader.getResource('whiskey.csv').file
    var source = FileSource.forRecordStreamFormat(new TextLineInputFormat(), new Path(file)).build()
    var stream = eEnv
        .fromSource(source, WatermarkStrategy.noWatermarks(), "csvfile")
        .filter(skipHeader).flatMap(splitAndChop)

    var inputTable = tEnv.fromDataStream(stream).as("features")

    var kmeans = new KMeans(k: 3, seed: 1L)
    var kmeansModel = kmeans.fit(inputTable)

    var outputTable = kmeansModel.transform(inputTable)[0]
    var clusters = [:].withDefault { [] }
    outputTable.execute().collect().each { row ->
        var features = row.getField(kmeans.featuresCol)
        var clusterId = row.getField(kmeans.predictionCol)
        clusters[clusterId] << features
    }
    clusters.each { k, v ->
        println "Cluster $k has ${v.size()} members:\n${v.join('\n')}"
    }

Slide 73

Slide 73 text

Flink ML KMeans (same code as the previous slide, now with its output)

    Cluster 2 has 23 members:
    [2.0, 2.0, 2.0, 0.0, 0.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
    [3.0, 3.0, 1.0, 0.0, 0.0, 4.0, 3.0, 2.0, 2.0, 3.0, 3.0, 2.0]
    [2.0, 3.0, 1.0, 0.0, 0.0, 2.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0]
    [4.0, 3.0, 2.0, 0.0, 0.0, 2.0, 1.0, 3.0, 3.0, 0.0, 1.0, 2.0]
    …
    Cluster 0 has 46 members:
    [1.0, 3.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 2.0, 3.0, 2.0]
    [2.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 1.0, 1.0]
    [2.0, 3.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 2.0]
    [0.0, 2.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 2.0, 2.0, 3.0, 3.0]
    …

Slide 74

Slide 74 text

Flink ML Online KMeans

    …
    var data = stream.executeAndCollect().collect { Row.of(it) }
    var train = data[0..79]
    var predict = data[80..-1]

    var trainSource = new PeriodicSourceFunction(1000, train.collate(8))
    var trainStream = eEnv.addSource(trainSource, new RowTypeInfo(DenseVectorTypeInfo.INSTANCE))
    var trainTable = tEnv.fromDataStream(trainStream).as("features")

    var predictSource = new PeriodicSourceFunction(1000, [predict])
    var predictStream = eEnv.addSource(predictSource, new RowTypeInfo(DenseVectorTypeInfo.INSTANCE))
    var predictTable = tEnv.fromDataStream(predictStream).as("features")

    var kmeans = new OnlineKMeans(featuresCol: 'features', predictionCol: 'prediction',
        globalBatchSize: 8, initialModelData: randomInit, k: 3)
    var kmeansModel = kmeans.fit(trainTable)

    var outputTable = kmeansModel.transform(predictTable)[0]
    outputTable.execute().collect().each { row ->
        DenseVector features = (DenseVector) row.getField(kmeans.featuresCol)
        var clusterId = row.getField(kmeans.predictionCol)
        println "Cluster $clusterId: ${features}"
    }
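The online variant feeds the first 80 rows to the model in mini-batches: train.collate(8) splits the list into consecutive chunks of 8, which lines up with globalBatchSize: 8. As a rough sketch of what that batching step does, here is a plain-Java equivalent; the class Batches and its collate method are hypothetical stand-ins for Groovy's collate extension method and are not part of Flink ML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper for illustration only; mimics Groovy's List.collate(size).
final class Batches {

    // Splits items into consecutive sublists of the given size.
    // The last sublist is shorter when the list size is not a multiple of size.
    static <T> List<List<T>> collate(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            int end = Math.min(i + size, items.size());
            out.add(new ArrayList<>(items.subList(i, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        // 80 placeholder row ids standing in for the training rows data[0..79].
        List<Integer> train = IntStream.range(0, 80).boxed().collect(Collectors.toList());
        List<List<Integer>> batches = collate(train, 8);
        System.out.println(batches.size());        // prints 10
        System.out.println(batches.get(0).size()); // prints 8
    }
}
```

With 80 training rows and a batch size of 8, the source emits 10 equally sized batches, so no partial batch is left over.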

Slide 75

Slide 75 text

Flink ML Online KMeans Cluster 1: [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0] Cluster 2: [2.0, 3.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0] Cluster 1: [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0] Cluster 2: [2.0, 3.0, 2.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 0.0, 1.0] Cluster 2: [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0] Cluster 0: [2.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0] … Cluster 1: [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 1.0, 2.0] Cluster 1: [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0, 2.0, 2.0] Cluster 1: [2.0, 3.0, 0.0, 0.0, 1.0, 0.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0] Cluster 0: [2.0, 3.0, 2.0, 0.0, 0.0, 2.0, 2.0, 1.0, 1.0, 2.0, 0.0, 1.0] Cluster 2: [2.0, 2.0, 2.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 2.0] Cluster 2: [2.0, 2.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0] …

Slide 76

Slide 76 text

Questions?
Twitter/X | Mastodon: @paulk_asert | @[email protected]
Apache Groovy: https://groovy.apache.org/ https://groovy-lang.org/
Repo: https://github.com/paulk-asert/groovy-data-science
Slides: https://speakerdeck.com/paulk/groovy-whiskey