Slide 1

Slide 1 text

objectcomputing.com © 2021, Object Computing, Inc. (OCI). All rights reserved. No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI) Groovy and Data Science Presented by Dr Paul King © 2021 Object Computing, Inc. (OCI). All rights reserved.

Slide 2

Slide 2 text

Dr Paul King OCI Groovy Lead V.P. and PMC Chair Apache Groovy Author: https://www.manning.com/books/groovy-in-action-second-edition Slides: https://speakerdeck.com/paulk/groovy-data-science Examples repo: https://github.com/paulk-asert/groovy-data-science Twitter: @paulk_asert

Slide 3

Slide 3 text

Apache Groovy Programming Language • Multi-faceted language • Imperative/OO & functional (+ other styles via libraries) • Dynamic & static • Aligned closely with Java language • Very extensible • 17+ years since inception • ~1B downloads (partial count) • ~500 contributors • 200+ releases • https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be

Slide 4

Slide 4 text

Friends of Apache Groovy Open Collective

Slide 5

Slide 5 text

What is Groovy? It’s like a super version of Java: • Supports most Java syntax but allows simpler syntax for many constructs • Supports all Java libraries but provides many extensions and its own productivity libraries • Has both a static and dynamic nature • Extensible language and tooling

Slide 6

Slide 6 text

Groovy has a low learning curve, particularly for Java developers

Slide 7

Slide 7 text

Groovy is Java's friend Seamless integration • IDEs provide cross-language compile, navigation, and refactoring • Arbitrarily mix source language • Drop-in replace any class • Overloaded methods • Syntax alignment • Shared data types Java Groovy

Slide 8

Slide 8 text

Java-like or concise shortcuts

import java.util.List;
import java.util.ArrayList;

class Main {
    private List keepShorterThan(List strings, int length) {
        List result = new ArrayList();
        for (int i = 0; i < strings.size(); i++) {
            String s = (String) strings.get(i);
            if (s.length() < length) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List names = new ArrayList();
        names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned");
        System.out.println(names);
        Main m = new Main();
        List shortNames = m.keepShorterThan(names, 4);
        System.out.println(shortNames.size());
        for (int i = 0; i < shortNames.size(); i++) {
            String s = (String) shortNames.get(i);
            System.out.println(s);
        }
    }
}

names = ["Ted", "Fred", "Jed", "Ned"]
println names
shortNames = names.findAll{ it.size() < 4 }
println shortNames.size()
shortNames.each{ println it }

Slide 9

Slide 9 text

Concise syntax including DSL friendly

import java.util.List;
import java.util.ArrayList;

class Main {
    private List keepShorterThan(List strings, int length) {
        List result = new ArrayList();
        for (int i = 0; i < strings.size(); i++) {
            String s = (String) strings.get(i);
            if (s.length() < length) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List names = new ArrayList();
        names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned");
        System.out.println(names);
        Main m = new Main();
        List shortNames = m.keepShorterThan(names, 4);
        System.out.println(shortNames.size());
        for (int i = 0; i < shortNames.size(); i++) {
            String s = (String) shortNames.get(i);
            System.out.println(s);
        }
    }
}

names = ["Ted", "Fred", "Jed", "Ned"]
println names
shortNames = names.findAll{ it.size() < 4 }
println shortNames.size()
shortNames.each{ println it }

given the names "Ted", "Fred", "Jed" and "Ned"
display all the names
display the number of names having size less than 4
display the names having size less than 4

Slide 10

Slide 10 text

Metaprogramming
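The slide itself carries no code in these notes; as a hedged aside (not from the deck), runtime metaprogramming in Groovy can add behaviour to existing classes via the metaClass, or in a scoped way via a category:

// add a method to java.lang.String at runtime via its metaClass
String.metaClass.shout = { -> delegate.toUpperCase() + '!' }
assert 'groovy'.shout() == 'GROOVY!'

// a category offers the same kind of extension, but scoped to a block
class StringExtras {
    static String whisper(String self) { self.toLowerCase() }
}
use(StringExtras) {
    assert 'LOUD'.whisper() == 'loud'
}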

Slide 11

Slide 11 text

Matrix manipulation: Java vs Groovy
• Same example
• Same library

Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}

import org.apache.commons.math3.linear.*;

public class MatrixMain {
    public static void main(String[] args) {
        double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d} };
        RealMatrix m = MatrixUtils.createRealMatrix(matrixData);
        double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d} };
        RealMatrix n = new Array2DRowRealMatrix(matrixData2);
        RealMatrix o = m.multiply(n);
        // Invert p, using LU decomposition
        RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();
        RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);
        RealMatrix q = o.add(p.power(2));
        System.out.println(q);
    }
}

Thanks to operator overloading and extensible tooling
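The Groovy half of the comparison is not captured in these notes; a hedged sketch of what it might look like, relying only on Groovy's standard operator-to-method mapping (* calls multiply(), ** calls power()) against the same Commons Math library. Methods such as add and scalarAdd are left as calls here since their names do not map to operators without an extension module:

import org.apache.commons.math3.linear.*

double[][] matrixData = [[1d, 2d, 3d], [2d, 5d, 3d]]
double[][] matrixData2 = [[1d, 2d], [2d, 5d], [1d, 7d]]
def m = MatrixUtils.createRealMatrix(matrixData)
def n = new Array2DRowRealMatrix(matrixData2)
def o = m * n                                                     // * maps to multiply()
def p = new LUDecomposition(o).solver.inverse.scalarAdd(1d).scalarMultiply(2d)
def q = o.add(p ** 2)                                             // ** maps to power()
println q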

Slide 12

Slide 12 text

groovyConsole showing @Grab, power asserts and natural language processing
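As a hedged illustration of the kind of script the console screenshot shows (the natural language processing part is omitted here), @Grab fetches a dependency at script level and a failing power assert prints every sub-expression value:

@Grab('org.apache.commons:commons-math3:3.6.1')
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

def stats = new DescriptiveStatistics()
[2, 4, 4, 4, 5, 5, 7, 9].each { stats.addValue(it) }
assert stats.mean == 5    // a power assert would show stats, stats.mean and 5 if this failed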

Slide 13

Slide 13 text

REPL for Groovy (groovysh)
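A hedged example of what a short groovysh session looks like (illustrative transcript, not from the slides), reusing the names example from earlier:

$ groovysh
groovy:000> names = ['Ted', 'Fred', 'Jed', 'Ned']
===> [Ted, Fred, Jed, Ned]
groovy:000> names.findAll { it.size() < 4 }
===> [Ted, Jed, Ned]
groovy:000> :exit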

Slide 14

Slide 14 text

Command chains and extensible type system Text file? No, it’s a statically typed Groovy program which won’t compile if the animal names aren’t valid. A constraint programming solver is used to find the solution. cranes have 2 legs tortoises have 4 legs there are 7 animals there are 20 legs display solution Cranes 4 Tortoises 3

Slide 15

Slide 15 text

Command chains and extensible type system cranes have 2 legs tortoises have 4 legs millipedes have 1000 legs there are 8 animals there are 1020 legs display solution Cranes 4 Tortoises 3 Millipedes 1

Slide 16

Slide 16 text

IDE support Intellij IDEA, Apache NetBeans, Eclipse, Groovy Web Console

Slide 17

Slide 17 text

Build tools Gradle, Apache Maven, Apache Ant

// Gradle build script
plugins {
    id 'groovy'
}
repositories {
    jcenter()
}
dependencies {
    implementation 'org.codehaus.groovy:groovy-all:2.5.7'
    testImplementation 'org.spockframework:spock-core:1.2-groovy-2.5'
}

// Maven polyglot (Groovy) POM
project {
    modelVersion '4.0.0'
    groupId 'org.exampledriven'
    artifactId 'maven-polyglot-groovy-example-parent'
    version '1.0-SNAPSHOT'
    packaging 'pom'
    modules {
        module 'maven-polyglot-submodule'
        module 'maven-polyglot-spring'
    }
}

// Ant via Groovy's AntBuilder
ant.sequential {
    echo("inside sequential")
    def myDir = "target/AntTest/"
    mkdir(dir: myDir)
    copy(todir: myDir) {
        fileset(dir: "src/test") {
            include(name: "**/*.groovy")
        }
    }
    echo("done")
}

Slide 18

Slide 18 text

Groovy frameworks plus entire Java ecosystem

Slide 19

Slide 19 text

Data Science Process Research Goals Obtain Data Data Preparation Data Exploration Visualization Data Modeling Data ingestion Data storage Data processing platforms Modeling algorithms Math libraries Graphics processing Integration Deployment

Slide 20

Slide 20 text

Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267 Python/Scala/R Libraries/Tools/Frameworks for Data Science

Slide 21

Slide 21 text

Groovy/Java Libraries/Tools/Frameworks for Data Science Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267 Machine Learning DeepLearning4J Apache Spark ML Apache MXNet Visualization GroovyFX JFreeChart Tablesaw Mathematics and Engineering GPars Apache Commons Math Smile Weka Groovylab Apache Spark Apache Ignite Apache Beam Data Manipulation and Analysis Tablesaw Apache Commons CSV OpenCSV

Slide 22

Slide 22 text

Obtain data
Identify sources
• Internal/External (data providers)
• Databases, files, events, IoT
• Unstructured/structured
• Text, XML, JSON, YAML, CSV/spreadsheets, audio, images, video, web services
• Apache Tika
• JExcel
• Apache POI
Clarify ownership
Check quality

def jsonSlurper = new JsonSlurper()
def object = jsonSlurper.parseText('{ "myList": [4, 8, 15, 16, 23, 42] }')
assert object instanceof Map
assert object.myList instanceof List
assert object.myList == [4, 8, 15, 16, 23, 42]

def response = new XmlSlurper().parseText(books)
def authorResult = response.value.books.book[0].author
assert authorResult.text() == 'Miguel de Cervantes'

def qry = 'SELECT * FROM Author'
assert sql.rows(qry, 1, 3)*.firstname == ['Dierk', 'Paul', 'Guillaume']
assert sql.rows(qry, 4, 3)*.firstname == ['Hamlet', 'Cedric', 'Erik']
assert sql.rows(qry, 7, 3)*.firstname == ['Jon']

Slide 23

Slide 23 text

Data preparation Collections: • Java: lists, maps, sets • Groovy: literal notation, GDK methods, GPath • Libraries • Google Guava https://github.com/google/guava • Apache Common Collections https://commons.apache.org/collections/ DataFrames: • Joinery https://cardillo.github.io/joinery/ A data frame implementation in the spirit of Pandas or R data frames with show/plot • Tablesaw https://jtablesaw.github.io/tablesaw/ Java dataframe and visualization library • Apache Spark DataFrames https://spark.apache.org/docs/latest/sql-programming- guide.html Equivalent to a table in a relational database or a data frame in R/Python • Paleo https://github.com/netzwerg/paleo Immutable Java 8 data frames with typed columns (including primitives)

Slide 24

Slide 24 text

Data exploration - DEX

Slide 25

Slide 25 text

Data exploration - Weka More info: https://www.youtube.com/watch?v=7quZv6WCTQc

Slide 26

Slide 26 text

Visualization • Open-Source plotting libraries: GRAL, JFreeChart, Xchart, JMathPlot, Jzy3d, JavaFX, GroovyFX • Types: Scatter plot, Line plot, Area plot, Pie plot, Donut plot, Horizontal bars, Vertical bars, Bubble plot, Radar plot, Box plot, Raster plot, Contour plot, Gauge plot, Step plot, Gantt plot, Histogram • Features: Multiple axes, Logarithmic axes, Line smoothing, Data filtering/aggregation

Slide 27

Slide 27 text

Notebooks supporting Groovy • Jupyter/beakerx • Apache Zeppelin • GroovyLab • Seco

Slide 28

Slide 28 text

Notebooks supporting Groovy • Jupyter/beakerx • Apache Zeppelin • GroovyLab • Seco

Slide 29

Slide 29 text

Data preparation IO: • Java in-built • Groovy enhancements • Libraries • Apache Commons IO https://commons.apache.org/io/ • AOL Cyclops-React https://github.com/aol/cyclops-react Web scraping: • GEB Batch processing/ETL: • Spring Batch http://projects.spring.io/spring-batch/ • Apache Camel • Apache Gobblin

Slide 30

Slide 30 text

Math libraries Math and stats libraries • Java: log, exp, other basic methods. Libraries: • Apache Commons Math http://commons.apache.org/math/ statistics, optimization, and linear algebra • Apache Mahout http://mahout.apache.org/ linear algebra, plus distributed linear algebra and machine learning • JBlas http://jblas.org/ optimized and very fast linear algebra package that uses the BLAS library Also, many of the machine learning libraries come with some math functionality, often linear algebra, stats, and optimization.
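As a hedged taste of the first of these (assuming commons-math3 is on the classpath; not code from the deck), fitting a simple regression takes only a few lines:

import org.apache.commons.math3.stat.regression.SimpleRegression

def reg = new SimpleRegression()
[[1, 2.1], [2, 3.9], [3, 6.2], [4, 7.8]].each { x, y -> reg.addData(x as double, y as double) }
printf 'slope=%.2f intercept=%.2f R2=%.2f%n', reg.slope, reg.intercept, reg.RSquare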

Slide 31

Slide 31 text

Data storage
Database:
• Java's JDBC
• Groovy JDBC enhancements and Datasets
• Other NOSQL
• Apache Cassandra
• MongoDB
• Apache CouchDB
• Neo4j

// MongoDB via GMongo
import com.gmongo.GMongo

def db = new GMongo().getDB('athletes')
db.athletes.drop()
db.athletes << [first: 'Paul', last: 'Tergat', dob: '1969-06-17',
                runs: [[distance: 42195, time: 2 * 60 * 60 + 4 * 60 + 55,
                        venue: 'Berlin', when: '2003-09-28']]]

// Neo4j traversal API
println "World rows following $marathon3.venue $marathon3.when:"
def t = new Traversal()
for (Path p in t.description().breadthFirst()
                 .relationships(MyRelationshipTypes.supercedes)
                 .evaluator(Evaluators.fromDepth(1))
                 .uniqueness(Uniqueness.NONE)
                 .traverse(marathon3)) {
    def newRow = p.endNode()
    println "$newRow.venue $newRow.when"
}

// graph queries (Gremlin/Blueprints) over Neo4j
Graph g = new Neo4jGraph(graphDb)
def pretty = { it.collect { "$it.venue $it.when" }.join(', ') }
def results = []
g.V('venue', 'London').fill(results)
println 'London world rows: ' + pretty(results)
results = []
g.V('venue', 'London').in('supercedes').fill(results)
println 'World rows after London: ' + pretty(results)

Slide 32

Slide 32 text

Machine learning and data mining Machine learning and data mining libraries: • Weka http://www.cs.waikato.ac.nz/ml/weka/ • JavaML http://java-ml.sourceforge.net/ old and reliable ML library but not actively updated • Smile http://haifengl.github.io/smile/ ML library under active development • JSAT https://github.com/EdwardRaff/JSAT contains many machine learning algorithms • H2O http://www.h2o.ai/ distributed Java ML framework also supports Scala, R and Python • Apache Mahout http://mahout.apache.org/ distributed linear algebra framework • Apache Spark ml and mllib https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples

Slide 33

Slide 33 text

Neural networks and text processing Neural networks: • Encog http://www.heatonresearch.com/encog/ • DeepLearning4j http://deeplearning4j.org/ natively supports Keras models Text processing • Java: StringTokenizer, the java.text package, regular expressions • Groovy: Regex enhancements, templates, String enhancements • Libraries: • Apache Lucene https://lucene.apache.org/ • Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/ • Apache OpenNLP https://opennlp.apache.org/ • LingPipe http://alias-i.com/lingpipe/ • GATE https://gate.ac.uk/ • MALLET http://mallet.cs.umass.edu/ • Smile http://haifengl.github.io/smile/

Slide 34

Slide 34 text

Scaling up Microservice frameworks: • Micronaut Distributed data processing: • Apache Hadoop • Apache Spark • Apache Flink • Apache Samza • Apache Kafka • Apache Storm • Apache Apex • Apache Beam • Apache Ignite • Apache Nifi

Slide 35

Slide 35 text

Scaling up: Apache Ignite Ignite™ is a memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale

Slide 36

Slide 36 text

Scaling up: Apache Spark Source: https://spark.apache.org/ and https://renovacloud.com/en/an-introduction-to-and-evaluation-of-apache-spark-for-big-data-architectures/ Apache Spark™ is a unified analytics engine for large-scale data processing

Slide 37

Slide 37 text

Scaling Up: GPars
• A Concurrency & Parallelism Framework for Groovy and Java
• Fork/join, parallel arrays, map/reduce, actors, agents, STM, CSP, dataflow, asynchronous functions
Data Parallelism: Fork/Join, Map/Reduce – fixed coordination (for collections)
Actors – explicit coordination
Safe Agents – delegated coordination
Dataflow – implicit coordination

// actors: explicit coordination
def decryptor = actor {
    loop {
        react { String message -> reply message.reverse() }
    }
}
def console = actor {
    decryptor << 'lellarap si yvoorG'
    react { println 'Decrypted message: ' + it }
}
console.join()

// dataflow: implicit coordination
final def df = new Dataflows()
task { df.z = df.x + df.y }
task { df.x = 10 }
task { df.y = 5 }
assert df.z == 15

// parallel collections
def nums = [1, 2, 3, 4]
def result = nums.collectParallel{ it * 2 }
assert result == [2, 4, 6, 8]

// agents: delegated coordination
def agent = new Agent([])
agent { it << 'Dave' }
agent { it << 'Joe' }
assert agent.val.size() == 2

// STM (software transactional memory)
class Account {
    private amount = newTxnInteger(0)
    void transfer(final int a) {
        atomic { amount.increment(a) }
    }
    int getCurrentAmount() {
        atomicWithInt { amount.get() }
    }
}

// asynchronous functions
def future = { it * 2 }.callAsync(3)
assert 6 == future.get()

Slide 38

Slide 38 text

Scaling up: Apache Beam Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines Source: https://beam.apache.org/get-started/beam-overview/

Slide 39

Slide 39 text

Computer science algorithms Decision problems Search problems Counting problems Optimization problems • Paradigms: brute force, divide & conquer, search & enumerate, randomized, complexity reduction, recursive, back tracking, graph • Optimization approaches: linear programming, dynamic programming, greedy, heuristics

Slide 40

Slide 40 text

Data science algorithms Data Mining Statistics Machine Learning Optimization • Analytics: descriptive, predictive, prescriptive • Analysis: anomaly detection, classification, regression, clustering, association, optimization, dimension reduction • Data relationship: linear, non-linear • Assumptions: parametric, non-parametric • Strategy: supervised, unsupervised, reinforcement • Combining: ensemble, boosting

Slide 41

Slide 41 text

Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

Slide 42

Slide 42 text

Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

Slide 43

Slide 43 text

Candles and COVID-19

Slide 44

Slide 44 text

Candles and COVID-19

Slide 45

Slide 45 text

Working with tables and plots

List traces(String filename, String lineColor, String markerColor) {
    def url = getClass().classLoader.getResource(filename)
    def table = new XlsxReader().read(builder(url).build())
    table.addColumns(DateColumn.create('YearMonth',
        table.column('Date').collect { LocalDate.of(it.year, it.month, 15) }))
    def janFirst2017 = LocalDateTime.of(2017, JANUARY, 1, 0, 0)
    Function from2017 = r -> r.dateTimeColumn('Date').isAfter(janFirst2017)
    Function top3 = r -> r.intColumn('CandleID').isLessThanOrEqualTo(3)
    def byMonth = table.sortAscendingOn('Date')
        .where(and(from2017, top3))
        .summarize('Rating', mean).by('YearMonth')
    def byDate = table.sortAscendingOn('Date')
        .where(and(from2017, top3))
        .summarize('Rating', mean).by('Date')
    def averaged = ScatterTrace.builder(byMonth.dateColumn('YearMonth'), byMonth.nCol('Mean [Rating]'))
        .mode(ScatterTrace.Mode.LINE)
        .line(Line.builder().width(5).color(lineColor).shape(Line.Shape.SPLINE).smoothing(1.3).build())
        .build()
    def scatter = ScatterTrace.builder(byDate.dateTimeColumn('Date'), byDate.nCol('Mean [Rating]'))
        .marker(Marker.builder().opacity(0.3).color(markerColor).build())
        .build()
    [averaged, scatter]
}

Slide 46

Slide 46 text

Working with tables and plots

def (sAverage, sScatter) = traces('Scented_all.xlsx', 'seablue', 'lightskyblue')
def (uAverage, uScatter) = traces('Unscented_all.xlsx', 'seagreen', 'lightgreen')
Plot.show(new Figure(layout('scented'), sAverage, sScatter, line))
Plot.show(new Figure(layout('unscented'), uAverage, uScatter, line))

Slide 47

Slide 47 text

Working with tables and plots

Slide 48

Slide 48 text

Regression Overview Regression: • Relationship between variables Types: • Linear OLSR • Logistic • Polynomial • Stepwise • Ridge • Lasso • ElasticNet Aspects: • Under/over fitting • Outliers Applications: • House price prediction • Market predictions • Ratings prediction • Weather forecast

Slide 49

Slide 49 text

Linear regression case study: House price predictions See also: https://nicolai92.github.io/posts/pentaho-weka-prediction Dataset: https://www.kaggle.com/harlfoxem/housesalesprediction/

Slide 50

Slide 50 text

House price predictions – view data graphically

import org.apache.commons.math3.random.EmpiricalDistribution
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
def all = csv.collect { it.bedrooms.toInteger() }
def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) }
def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }

start {
    stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) {
        scene {
            barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) {
                series(name: 'Number of properties', data: bins)
            }
        }
    }
}

Slide 51

Slide 51 text

House price predictions – view data graphically

Slide 52

Slide 52 text

House price predictions – view stats

import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
def all = csv.collect { it.bedrooms.toInteger() }
def stats = new SummaryStatistics()
all.each{ stats.addValue(it as double) }
println stats.summary

n: 21613
min: 0.0
max: 33.0
mean: 3.370841623097218
std dev: 0.930061831147451
variance: 0.8650150097573497
sum: 72854.0

Slide 53

Slide 53 text

House price predictions – investigate outliers

import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
csv.findAll{ it.bedrooms.toInteger() > 10 }.each{ println it.toMap() as TreeMap }

[bathrooms:3, bedrooms:11, condition:3, date:20140821T000000, floors:2, grade:7, id:1773100755, lat:47.556, long:-122.363, price:520000, sqft_above:2400, sqft_basement:600, sqft_living:3000, sqft_living15:1420, sqft_lot:4960, sqft_lot15:4960, view:0, waterfront:0, yr_built:1918, yr_renovated:1999, zipcode:98106]
[bathrooms:1.75, bedrooms:33, condition:5, date:20140625T000000, floors:1, grade:7, id:2402100895, lat:47.6878, long:-122.331, price:640000, sqft_above:1040, sqft_basement:580, sqft_living:1620, sqft_living15:1330, sqft_lot:6000, sqft_lot15:4700, view:0, waterfront:0, yr_built:1947, yr_renovated:0, zipcode:98103]

Slide 54

Slide 54 text

House price predictions – investigate outliers
Some libraries optionally support CSV -> richly-typed domain classes

import com.opencsv.bean.*
import groovy.transform.ToString

@ToString(includeNames = true)
class House {
    @CsvBindByName
    Integer bedrooms
    @CsvBindByName
    String bathrooms
    @CsvBindByName(column = 'sqft_lot')
    Integer area_lot
}

def file = 'kc_house_data.csv' as File
def builder = new CsvToBeanBuilder(new FileReader(file))
def rows = builder.withType(House).build().parse()
rows.findAll{ it.bedrooms > 10 }.each{ println it }

House(bedrooms:11, bathrooms:3, area_lot:4960)
House(bedrooms:33, bathrooms:1.75, area_lot:6000)

Slide 55

Slide 55 text

House prices – remove outlier and view bedrooms

import org.apache.commons.math3.random.EmpiricalDistribution
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
def all = csv.collect { it.bedrooms.toInteger() }.findAll{ it < 30 }
def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) }
def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }

start {
    stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) {
        scene {
            barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) {
                series(name: 'Number of properties', data: bins)
            }
        }
    }
}

Slide 56

Slide 56 text

House price predictions – remove outlier and view price

import org.apache.commons.math3.random.EmpiricalDistribution
import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
def all = csv.findAll { it.bedrooms.toInteger() < 30 }.collect { it.price.toDouble() }
def info = new SummaryStatistics(); all.each(info::addValue)
def head = "Price percentile (min=\$$info.min, mean=\$${info.mean as int}, max=\$$info.max)"
def dist = new EmpiricalDistribution(100).tap{ load(all as double[]) }
def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }

start {
    stage(title: 'Price histogram', show: true, width: 800, height: 600) {
        scene {
            barChart(title: head, barGap: 0, categoryGap: 0) {
                series(name: 'Number of properties', data: bins)
            }
        }
    }
}

Slide 57

Slide 57 text

House price predictions – explore with tablesaw

Slide 58

Slide 58 text

House price predictions – explore with tablesaw

import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*

def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))

Slide 59

Slide 59 text

House price predictions – explore with tablesaw import tech.tablesaw.api.* import tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) 21613 rows X 21 cols

Slide 60

Slide 60 text

House price predictions – explore with tablesaw

import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*

def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))

Structure of kc_house_data.csv
Index | Column Name   | Column Type |
    0 | id            | LONG        |
    1 | date          | STRING      |
    2 | price         | DOUBLE      |
    3 | bedrooms      | INTEGER     |
    4 | bathrooms     | DOUBLE      |
    5 | sqft_living   | INTEGER     |
    6 | sqft_lot      | INTEGER     |
    7 | floors        | DOUBLE      |
    8 | waterfront    | INTEGER     |
    9 | view          | INTEGER     |
  ... | ...           | ...         |
   11 | grade         | INTEGER     |
   12 | sqft_above    | INTEGER     |
   13 | sqft_basement | INTEGER     |
   14 | yr_built      | INTEGER     |
   15 | yr_renovated  | INTEGER     |
   16 | zipcode       | INTEGER     |
   17 | lat           | DOUBLE      |
   18 | long          | DOUBLE      |
   19 | sqft_living15 | INTEGER     |
   20 | sqft_lot15    | INTEGER     |

Slide 61

Slide 61 text

House price predictions – explore with tablesaw

import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*

def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))

Column: bedrooms
Measure  | Value              |
n        | 21613.0            |
sum      | 72854.0            |
Mean     | 3.370841623097218  |
Min      | 0.0                |
Max      | 33.0               |
Range    | 33.0               |
Variance | 0.8650150097573497 |
Std. Dev | 0.930061831147451  |

Slide 62

Slide 62 text

House price predictions – explore with tablesaw

import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*

def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))

kc_house_data.csv
id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
1773100755 | 20140821 | 520000.0 | 11 | 3.0 | 3000 | 4960 | 2.0 | 0 | 0 | 3 | 7 | 2400 | 600 | 1918 | 1999 | 98106 | 47.556 | -122.363 | 1420 | 4960 |
2402100895 | 20140625 | 640000.0 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1040 | 580 | 1947 | 0 | 98103 | 47.687 | -122.331 | 1330 | 4700 |

Slide 63

Slide 63 text

House price predictions – explore with tablesaw import tech.tablesaw.api.* import tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) 21612 rows X 21 cols

Slide 64

Slide 64 text

House price predictions – explore with tablesaw

import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*

def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))

kc_house_data.csv summary
bedrooms | Mean [price]       | Min [price] | Max [price] |
       0 | 409503.8461538461  | 139950.0    | 1295650.0   |
       1 | 317642.8844221105  | 75000.0     | 1247000.0   |
       2 | 401372.68188405805 | 78000.0     | 3278000.0   |
       3 | 466232.0784812695  | 82000.0     | 3800000.0   |
       4 | 635419.5042138912  | 100000.0    | 4489000.0   |
       5 | 786599.8288569644  | 133000.0    | 7062500.0   |
       6 | 825520.6360294118  | 175000.0    | 7700000.0   |
       7 | 951184.6578947367  | 280000.0    | 3200000.0   |
       8 | 1105076.923076923  | 340000.0    | 3300000.0   |
       9 | 893999.8333333334  | 450000.0    | 1400000.0   |
      10 | 819333.3333333334  | 650000.0    | 1148000.0   |
      11 | 520000.0           | 520000.0    | 520000.0    |

Slide 65

Slide 65 text

House price predictions – explore with tablesaw import tech.tablesaw.api.* import tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))

Slide 66

Slide 66 text

House price predictions – explore with tablesaw

cleaned.addColumns(
    StringColumn.create("waterfrontDesc", cleaned.column("waterfront").collect{ it ? 'waterfront' : 'interior' }),
    DoubleColumn.create("scaledGrade", cleaned.column("grade").collect{ it * 2 }),
    DoubleColumn.create("scaledPrice", cleaned.column("price").collect{ it / 100000 })
)

Plot.show(BubblePlot.create("Price vs living area and grade (bubble size)",
    cleaned, "sqft_living", "price", "scaledGrade", "waterfrontDesc"))

Plot.show(Scatter3DPlot.create("Grade, living space, bathrooms and price (bubble size)",
    cleaned, "sqft_living", "bathrooms", "grade", "scaledPrice", "waterfrontDesc"))

Slide 67

Slide 67 text

House price predictions – explore with jupyter/beakerx

Slide 68

Slide 68 text

Regression with Least Squares Find line of “best fit”:

Slide 69

Slide 69 text

Regression with Least Squares Find line of “best fit”: • Find intercept and slope of line which minimizes the sum of the square of the residual errors Residual errors y x

Slide 70

Slide 70 text

Regression with Least Squares
Find line of "best fit":
• Find intercept and slope of line which minimizes the sum of the square of the residual errors
• Line is represented by: y = m x + b
• For points (x0, y0)..(xn, yn) the errors will be: err0 = y0 − (m x0 + b) .. errn = yn − (m xn + b), which are then squared
• To minimise, we find where the derivative is zero
• Solving the math gives:
  m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
  b = (Σy Σx² − Σx Σxy) / (n Σx² − (Σx)²)
(diagram: residual errors err0..errn between the points and the fitted line, which crosses the y axis at (0, b))
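As a hedged sketch (not code from the deck), the closed-form solution above is only a few lines of plain Groovy:

def xs = [1d, 2d, 3d, 4d, 5d]
def ys = [2.1d, 3.9d, 6.2d, 7.8d, 10.1d]
def n = xs.size()
def sumX = xs.sum()
def sumY = ys.sum()
def sumXY = (0..<n).sum { xs[it] * ys[it] }
def sumX2 = xs.sum { it * it }
def m = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX)       // slope
def b = (sumY * sumX2 - sumX * sumXY) / (n * sumX2 - sumX * sumX)   // intercept
println "y = $m * x + $b"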

Slide 71

Slide 71 text

Regression with Gradient Descent Find line of “best fit”: • Find intercept and slope of line which minimizes the sum of the square of the residual errors • Line is represented by: y = m x + b and we might have a similar loss function • Instead of solving for the derivative of loss function, we take repeated steps in the opposite direction of the gradient, taking smaller steps as we approach zero • A learning rate influences step size • Amenable to streaming & can involve random or mini-batch subsets of data for efficiency Loss function (residual error) Parameter being optimized (could be multiple dimensions)
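A hedged minimal sketch of the same line fit done with gradient descent (fixed learning rate and iteration count, whole-batch gradients; illustrative only, not the deck's code):

def xs = [1d, 2d, 3d, 4d, 5d]
def ys = [2.1d, 3.9d, 6.2d, 7.8d, 10.1d]
def m = 0d, b = 0d
def rate = 0.01
1000.times {
    def gradM = 0d, gradB = 0d
    (0..<xs.size()).each { i ->
        def err = (m * xs[i] + b) - ys[i]        // residual for point i
        gradM += 2 * err * xs[i] / xs.size()     // d(MSE)/dm
        gradB += 2 * err / xs.size()             // d(MSE)/db
    }
    m -= rate * gradM                            // step opposite the gradient
    b -= rate * gradB
}
println "y = $m * x + $b"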

Slide 72

Slide 72 text

House price predictions – linear regression

import org.apache.commons.math3.stat.regression.SimpleRegression
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

def feature = 'bedrooms'
def nonOutliers = feature == 'bedrooms' ? { it[0] < 30 } : { true }
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
def all = csv.collect { [it[feature].toDouble(), it.price.toDouble()] }.findAll(nonOutliers)
def reg = new SimpleRegression().tap{ addData(all as double[][]) }
def (min, max) = all.transpose().with{ [it[0].min(), it[0].max()] }
def predicted = [[min, reg.predict(min)], [max, reg.predict(max)]]

start {
    stage(title: 'Price vs Number of bedrooms', show: true, width: 800, height: 600) {
        scene {
            scatterChart { series(name: 'Actual', data: all) }
            lineChart { series(name: 'Predicted', data: predicted) }
        }
    }
}

Slide 73

Slide 73 text

House price predictions – linear regression • Can be repeated for other features

Slide 74

Slide 74 text

House price predictions – linear regression

Slide 75

Slide 75 text

House price predictions – evaluating regression

@Canonical @FXBindable
class Feature {
    String name
    double rr, rmseTrain, rmseTest, mean
}

def features = [new Feature('bedrooms'), new Feature('bathrooms'),
                new Feature('grade'), new Feature('sqft_living')]
def idxs = 0..<features.size()
def priceIdx = features.size()
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
def all = csv.collect { row ->
    [*idxs.collect{ row[features[it].name].toDouble() }, row.price.toDouble()]
}.findAll{ it[0] < 30 }
def (train, test) = all.chop(all.size() * 0.8 as int, -1)
def trainT = train.transpose()
def regs = idxs.collect { idx ->
    new SimpleRegression().tap{ addData([trainT[idx], trainT[priceIdx]].transpose() as double[][]) }
}
def residuals = idxs.collect{ idx -> test.collect{ regs[idx].predict(it[idx]) - it[priceIdx] } }

Slide 76

Slide 76 text

House price predictions – evaluating regression

idxs.each { idx ->
    features[idx].rr = regs[idx].RSquare
    features[idx].rmseTrain = Math.sqrt(regs[idx].meanSquareError)
    features[idx].rmseTest = Math.sqrt(StatUtils.sumSq(residuals[idx] as double[]) / (test.size() - 1))
    features[idx].mean = StatUtils.mean(residuals[idx] as double[])
}

start {
    stage(title: "Error statistics for feature", visible: true) {
        scene(fill: groovyblue, width: 800, height: 200) {
            stackPane(padding: 10) {
                tableView(items: features) {
                    tableColumn(text: "Feature", property: 'name')
                    tableColumn(text: "R²", property: 'rr')
                    tableColumn(text: "RMSE (train)", property: 'rmseTrain')
                    tableColumn(text: "RMSE (test)", property: 'rmseTest')
                    tableColumn(text: "Residuals Mean", property: 'mean')
                }
            }
        }
    }
}

Slide 77

Slide 77 text

House price predictions – evaluating regression

def maxError = [residuals[graphIdx].min(), residuals[graphIdx].max()].max{ Math.abs(it) }
residuals[graphIdx] << maxError * -1 // make graph even around origin
maxError = Math.abs(maxError)
def step = maxError.toInteger() / 50
def dist = new EmpiricalDistribution(100).tap{ load(residuals[graphIdx] as double[]) }
def ndist = new NormalDistribution(0, dist.sampleStats.standardDeviation)
def bins = dist.binStats.indexed().collect { i, v ->
    [v.n ? v.mean.toInteger().toString() : (-maxError + i * step).toInteger().toString(), v.n]
}
def nbins = dist.binStats.indexed().collect { i, v ->
    def x = v.n ? v.mean.toInteger() : (-maxError + i * step).toInteger()
    [x.toString(), ndist.probability(x, x + step)]
}
def scale = dist.binStats.max{ it.n }.n / nbins.max{ it[1] }[1]
nbins = nbins.collect{ [it[0], it[1] * scale] }

start {
    stage(title: "Error histogram for ${features[graphIdx].name}", show: true, width: 800, height: 600) {
        scene {
            barChart(title: 'Error percentile', barGap: 0, categoryGap: 0) {
                series(name: 'Error in prediction', data: bins)
                series(name: 'Normal distribution', data: nbins)
            }
        }
    }
}

Slide 78

Slide 78 text

House price predictions – evaluating regression

Slide 79

Slide 79 text

House price predictions – multi linear regression

def features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat',
    'sqft_above', 'grade', 'view', 'waterfront', 'floors'
]
def idxs = 0..<features.size()
def priceIdx = features.size()
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
def all = csv.collect { row ->
    [*idxs.collect{ row[features[it]].toDouble() }, row.price.toDouble()]
}.findAll{ it[0] < 30 }
def (train, test) = all.chop(all.size() * 0.8 as int, -1)
def trainT = train.transpose()
def price = trainT[priceIdx]
def reg = new OLSMultipleLinearRegression().tap{
    newSampleData(price as double[], trainT[0..<priceIdx].transpose() as double[][])
}
def params = reg.estimateRegressionParameters()
def predicted = test.collect { data -> params[0] + (1..priceIdx).sum{ params[it] * data[it - 1] } }
def residuals = test.indexed().collect { i, data -> predicted[i] - data[priceIdx] }
def rr = reg.calculateRSquared()
def rmseTrain = Math.sqrt(reg.calculateResidualSumOfSquares())
def rmseTest = Math.sqrt(StatUtils.sumSq(residuals as double[]) / (test.size() - 1))
def mean = StatUtils.mean(residuals as double[])
println "$rr $rmseTrain $rmseTest $mean"

[-3.1635392225013282E7, -27484.961699434185, -5236.246584477492, 191.507648719199, 17.94379836055797, 657803.1115560957, 4.369533695386711, 74850.68999978402, 66459.03393303146, 604896.8569593901, -27203.796342244277]
0.6536012563023335 215379.66626496855 212609.34498021202 1886.6015077985683

R2: https://en.wikipedia.org/wiki/Coefficient_of_determination
rmse: https://en.wikipedia.org/wiki/Root-mean-square_deviation

Slide 80

Slide 80 text

House price predictions – multi linear regression

Slide 81

Slide 81 text

House price predictions – multi linear regression

def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
def loader = new CSVLoader(file: file)
def model = new LinearRegression()
def allInstances = loader.dataSet
def priceIndex = 2
allInstances.classIndex = priceIndex
// remove "id" and "date" columns
def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances)
def instances = Filter.useFilter(allInstances, rm)
model.buildClassifier(instances)
println model
def actual = instances.collect{ it.value(0).toDouble() }
def predicted = instances.collect{ model.classifyInstance(it) }
def chart = new XYChartBuilder()…

Linear Regression Model

price =
    -35766.5414 * bedrooms +
     41144.2785 * bathrooms +
        82.8088 * sqft_living +
         0.1286 * sqft_lot +
      6689.5501 * floors +
    582960.4584 * waterfront +
     52870.9424 * view +
     26385.6491 * condition +
     95890.4452 * grade +
        98.4193 * sqft_above +
        67.2917 * sqft_basement +
     -2620.2232 * yr_built +
        19.8126 * yr_renovated +
      -582.4199 * zipcode +
    602748.2265 * lat +
   -214729.8283 * long +
        21.6814 * sqft_living15 +
        -0.3826 * sqft_lot15 +
   6690324.6024

Slide 82

Slide 82 text

House price predictions – multi linear regression def file = getClass().classLoader.getResource('kc_house_data.csv').file as File def loader = new CSVLoader(file: file) def model = new LinearRegression() def allInstances = loader.dataSet def priceIndex = 2 allInstances.classIndex = priceIndex // remove "id" and "date" columns def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances) def instances = Filter.useFilter(allInstances, rm) model.buildClassifier(instances) println model def actual = instances.collect{ it.value(0).toDouble() } def predicted = instances.collect{ model.classifyInstance(it) } def chart = new XYChartBuilder()… Linear Regression Model price = -35766.5414 * bedrooms + 41144.2785 * bathrooms + 82.8088 * sqft_living + 0.1286 * sqft_lot + 6689.5501 * floors + 582960.4584 * waterfront + 52870.9424 * view + 26385.6491 * condition + 95890.4452 * grade + 98.4193 * sqft_above + 67.2917 * sqft_basement + -2620.2232 * yr_built + 19.8126 * yr_renovated + -582.4199 * zipcode + 602748.2265 * lat + -214729.8283 * long + 21.6814 * sqft_living15 + -0.3826 * sqft_lot15 + 6690324.6024

Slide 83

Slide 83 text

House price predictions – multi linear regression def file = getClass().classLoader.getResource('kc_house_data.csv').file as File def loader = new CSVLoader(file: file) def model = new SGD() model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled def allInstances = loader.dataSet def priceIndex = 2 allInstances.classIndex = priceIndex // remove "id", "date", 'zip', 'lat', 'long' columns def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat: allInstances) def instances = Filter.useFilter(allInstances, rm) model.buildClassifier(instances) println model def actual = instances.collect{ it.value(0).toDouble() } def predicted = instances.collect{ model.classifyInstance(it) } def chart = new XYChartBuilder()… Loss function: Huber loss (robust regression) price = -9.6832 bedrooms + 0.4953 bathrooms + 112.6971 sqft_living + 0.8189 sqft_lot + 5.5 floors + 0.7041 waterfront + 11.3372 view + 7.8443 condition + 23.9548 grade + 37.5499 sqft_above + 75.1472 sqft_basement + -7.8827 yr_built + 63.7169 yr_renovated + 103.4893 sqft_living15 + -1.3449 sqft_lot15 + 0.2788

Slide 84

Slide 84 text

House price predictions – multi linear regression def file = getClass().classLoader.getResource('kc_house_data.csv').file as File def loader = new CSVLoader(file: file) def model = new SGD() model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled def allInstances = loader.dataSet def priceIndex = 2 allInstances.classIndex = priceIndex // remove "id", "date", 'zip', 'lat', 'long' columns def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat: allInstances) def instances = Filter.useFilter(allInstances, rm) model.buildClassifier(instances) println model def actual = instances.collect{ it.value(0).toDouble() } def predicted = instances.collect{ model.classifyInstance(it) } def chart = new XYChartBuilder()… Loss function: Huber loss (robust regression) price = -9.6832 bedrooms + 0.4953 bathrooms + 112.6971 sqft_living + 0.8189 sqft_lot + 5.5 floors + 0.7041 waterfront + 11.3372 view + 7.8443 condition + 23.9548 grade + 37.5499 sqft_above + 75.1472 sqft_basement + -7.8827 yr_built + 63.7169 yr_renovated + 103.4893 sqft_living15 + -1.3449 sqft_lot15 + 0.2788

Slide 85

Slide 85 text

House price predictions – explore with Weka

Slide 86

Slide 86 text

House price predictions – explore with Weka

Slide 87

Slide 87 text

House price predictions – analyze with Weka

Slide 88

Slide 88 text

House price predictions – analyze with Weka

Slide 89

Slide 89 text

House price predictions – analyze with Weka

Slide 90

Slide 90 text

House price predictions – scaling regression import … def rows = Table.read().csv('kc_house_data.csv') rows = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) String[] features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def data = rows.as().doubleMatrix(features) // configure to all run on local machine but could be a cluster (can be hidden in XML) def cfg = new IgniteConfiguration( peerClassLoadingEnabled: true, discoverySpi: new TcpDiscoverySpi( ipFinder: new TcpDiscoveryMulticastIpFinder( addresses: ['127.0.0.1:47500..47509'] ) ) ) static pretty(mdl, features) { def sign = { val -> val < 0 ? '- ' : '+ ' } def valIdx = { idx, val -> sprintf '%.2f*%s', val, features[idx+1] } def valIdxSign = { idx, val -> sign(val) + valIdx(idx, Math.abs(val)) } def valSign = { val -> sign(val) + sprintf('%.2f', Math.abs(val)) } def (w, i) = [mdl.weights, mdl.intercept] [valIdx(0, w.get(0)), *(1..

Slide 91

Slide 91 text

House price predictions – scaling regression … Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0)) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def split = new TrainTestDatasetSplitter().split(0.8) def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer) def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() }) def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM()) dataCache.destroy() println ">>> Model: " + pretty(mdl, features) println ">>> R^2 : " + rr println ">>> RMSE : " + rmse } Candidates for distributed processing

Slide 92

Slide 92 text

House price predictions – scaling regression … Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0)) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def split = new TrainTestDatasetSplitter().split(0.8) def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer) def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() }) def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM()) dataCache.destroy() println ">>> Model: " + pretty(mdl, features) println ">>> R^2 : " + rr println ">>> RMSE : " + rmse } [15:02:09] Ignite node started OK (id=63b2d47d) >>> Ignite grid started for data: 21612 rows X 11 cols >>> Model: -33081.12*bedrooms - 16205.93*bathrooms + 225.40*sqft_living + 6.77*sqft_living15 - 8766.52*lat - 38.59*sqft_above + 96251.64*grade + 62303.58*view + 572116.79*waterfront - 11620.55*floors - 48504.18 >>> R^2 : 0.607023233473466 >>> RMSE : 228035.85816493785 [15:02:10] Ignite node stopped OK [uptime=00:00:01.465] See also: https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/regression/linear/BostonHousePricesPredictionExample.java

Slide 93

Slide 93 text

House prices – scaling regression import … def spark = builder().config('spark.master', 'local[8]').appName('HousePrices').orCreate def file = '/path/to/kc_house_data.csv' int k = 5 Dataset ds = spark.read().format('csv') .options('header': 'true', 'inferSchema': 'true').load(file) double[] splits = [80, 20] def (training, test) = ds.randomSplit(splits) String[] colNames = ds.columns().toList().minus(['id', 'date', 'price']) def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features') Dataset dataset = assembler.transform(training) def lr = new LinearRegression(labelCol: 'price', maxIter: 10) def model = lr.fit(dataset) println 'Coefficients:' println model.coefficients().values()[1..-1].collect{ sprintf '%.2f', it }.join(', ') def testSummary = model.evaluate(assembler.transform(test)) printf 'RMSE: %.2f%n', testSummary.rootMeanSquaredError printf 'r2: %.2f%n', testSummary.r2 spark.stop() Coefficients: 41979.78, 80853.89, 0.15, 5412.83, 564343.22, 53834.10, 24817.09, 93195.29, -80662.68, -80694.28, -2713.58, 19.02, -628.67, 594468.23, -228397.19, 21.23, -0.42 RMSE: 187242.12 r2: 0.70

Slide 94

Slide 94 text

Clustering Overview Clustering: • Grouping similar items Algorithm families: • Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging

Slide 95

Slide 95 text

Clustering https://commons.apache.org/proper/commons-math/userguide/ml.html

Slide 96

Slide 96 text

Clustering case study: Whiskey flavor profiles • 86 scotch whiskies • 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/

Slide 97

Slide 97 text

Whiskey – exploring with Dex

Slide 98

Slide 98 text

import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def panel = new PlotPanel( *[0.. def color = new Color(72 + (first * 16), 72 + (second * 16), 200 - (first * 4) - (second * 4)) ScatterPlot.plot(table.as().doubleMatrix(cols[first], cols[second]), '#' as char, color) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?

Slide 99

Slide 99 text

Hmm… not particularly enlightening Whiskey: How to visualize? Feature combinations?

Slide 100

Slide 100 text

Another dimension? Whiskey: How to visualize? Feature combinations?

Slide 101

Slide 101 text

import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?

Slide 102

Slide 102 text

import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?

Slide 103

Slide 103 text

import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) SwingUtil.show(size: [1200, 900], panel) Whiskey: How to visualize? Feature combinations?

Slide 104

Slide 104 text

Pretty… but only marginally more useful Whiskey: How to visualize? Feature combinations?

Slide 105

Slide 105 text

Clustering with KMeans Step 1: • Guess k cluster centroids at random

Slide 106

Slide 106 text

Clustering with KMeans Step 1: • Guess k cluster centroids

Slide 107

Slide 107 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid

Slide 108

Slide 108 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid

Slide 109

Slide 109 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 110

Slide 110 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 111

Slide 111 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 112

Slide 112 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points

Slide 113

Slide 113 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 114

Slide 114 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 115

Slide 115 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached

Slide 116

Slide 116 text

Clustering with KMeans Step 1: • Guess k cluster centroids Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
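A hedged plain-Groovy sketch of the loop just described (naive initial centroids, Euclidean distance, a fixed number of iterations; illustrative only, before handing the real work to a library):

def distSq = { a, b -> (0..<a.size()).sum { (a[it] - b[it]) * (a[it] - b[it]) } }

List points = [[1d, 1d], [1.5d, 2d], [1d, 0.5d], [8d, 8d], [9d, 11d], [9d, 9d]]
int k = 2
def centroids = points.take(k)                         // step 1: initial guess (ideally k random points)
10.times {
    def clusters = points.groupBy { p ->               // step 2: assign each point to its closest centroid
        centroids.min { c -> distSq(p, c) }
    }
    centroids = clusters.values().collect { cluster -> // step 3: new centroid = mean of the assigned points
        (0..<cluster[0].size()).collect { i -> cluster.sum { it[i] } / cluster.size() }
    }
}
println centroids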

Slide 117

Slide 117 text

import … def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def numClusters = 5 def loader = new CSVLoader(file: 'whiskey.csv') def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true) def instances = loader.dataSet instances.deleteAttributeAt(0) // remove RowID clusterer.buildClusterer(instances) println ' ' + cols.join(', ') def dataset = new DefaultCategoryDataset() clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num -> print "Cluster ${num+1}: " println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', ')) (1..cols.size()).each { idx -> dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) } } def clusters = (0.. clusters[cnum] << instances.get(idx).stringValue(0) } clusters.each { k, v -> println "Cluster ${k+1}:" println v.join(', ') } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and weka Read CSV and create clusterer Display whiskies assigned to each cluster Graph cluster centroids

Slide 118

Slide 118 text

import … def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def numClusters = 5 def loader = new CSVLoader(file: 'whiskey.csv') def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true) def instances = loader.dataSet instances.deleteAttributeAt(0) // remove RowID clusterer.buildClusterer(instances) println ' ' + cols.join(', ') def dataset = new DefaultCategoryDataset() clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num -> print "Cluster ${num+1}: " println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', ')) (1..cols.size()).each { idx -> dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) } } def clusters = (0.. clusters[cnum] << instances.get(idx).stringValue(0) } clusters.each { k, v -> println "Cluster ${k+1}:" println v.join(', ') } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and weka Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000 Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591 Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591 Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133 Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714 Cluster 1: Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker Cluster 2: Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla Cluster 3: ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith, GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff, Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore Cluster 4: AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain, Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood, Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul Cluster 5: Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn, OldPulteney, RoyalBrackla, Tullibardine

Slide 119

Slide 119 text

import … def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv')) def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def clusterer = new KMeansPlusPlusClusterer(5) def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) } def centroids = clusterer.cluster(data) println cols.join(', ') + ', Medoid' def dataset = new DefaultCategoryDataset() centroids.eachWithIndex{ ctrd, num -> def cpt = ctrd.center.point def closest = ctrd.points.min{ pt -> sumSq((0.. cols.collect{ row[it] as double } == closest.point }?.Distillery println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid" cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) } } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and medoids Same again with Apache Commons Math but also calculate medoids

Slide 120

Slide 120 text

import … def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv')) def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def clusterer = new KMeansPlusPlusClusterer(5) def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) } def centroids = clusterer.cluster(data) println cols.join(', ') + ', Medoid' def dataset = new DefaultCategoryDataset() centroids.eachWithIndex{ ctrd, num -> def cpt = ctrd.center.point def closest = ctrd.points.min{ pt -> sumSq((0.. cols.collect{ row[it] as double } == closest.point }?.Distillery println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid" cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) } } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and medoids Libraries: Apache Commons Math and JFreeChart Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid 2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd 2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy 2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish 1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore 1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach

Slide 121

Slide 121 text

Dimensionality reduction

Slide 122

Slide 122 text

Whiskey – clustering with PCA plot import … def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def projected = pca.project(data) def clusterer = new KMeans(data, 5) table = table.addColumns( DoubleColumn.create("PCA1", (0..

Slide 123

Slide 123 text

Whiskey – clustering with 2-4D PCA plot import … def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = rows.as().doubleMatrix(*cols) def pca = new PCA(data) def dims = 4 // can be 2, 3 or 4 pca.projection = dims def projected = pca.project(data) def adj = [1, 1, 1, 5] def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect{ "Cluster " + (it+1) } rows = rows.addColumns( *(0.. DoubleColumn.create("PCA${idx+1}", (0..

Slide 124

Slide 124 text

Whiskey – clustering and visualizing centroids … def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 3 def projected = pca.project(data) def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) } table = table.addColumns( *(0..<3).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0.. toAdd[0].setString("Cluster", "Cluster " + (idx+1)) (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) } toAdd[0].setDouble("Centroid", 50) table.append(toAdd) } def title = "Clusters x Principal Components w/ centroids" Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))

Slide 125

Slide 125 text

Whiskey – Screeplot import … def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0.. def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..

Slide 126

Slide 126 text

import … def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0.. def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..

Slide 127

Slide 127 text

Whiskey – XMeans clustering with 2-4D PCA plot import … def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = rows.as().doubleMatrix(*cols) def pca = new PCA(data) def dims = 4 // can be 2, 3 or 4 pca.projection = dims def projected = pca.project(data) def adj = [1, 1, 1, 5] def kmax = 10 def clusterer = new XMeans(data, kmax) def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) } rows = rows.addColumns( *(0.. DoubleColumn.create("PCA${idx+1}", (0..

Slide 128

Slide 128 text

Whiskey – SOM with Heatmap import … def file = new File(getClass().classLoader.getResource('whiskey.csv').file) def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader()) String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral'] def data = table.select(cols).toArray() def distilleries = table.column('Distillery').toStringArray() //def (ncols, nrows) = [3, 3] //def (ncols, nrows) = [7, 4] def (ncols, nrows) = [7, 6] //def (ncols, nrows) = [8, 6] def lattice = SOM.lattice(nrows, ncols, data) int epochs = 100 def model = new SOM(lattice, TimeFunction.constant(0.1), Neighborhood.Gaussian(1, data.length * epochs / 8.0)) for (int i = 0; i < epochs; i++) { for (int j : MathEx.permutate(data.length)) { model.update(data[j]) } } …

Slide 129

Slide 129 text

Whiskey – SOM with Heatmap … def groups = data.toList().indices.groupBy { idx -> group(model, data[idx]) } def names = groups.collectEntries { k, v -> [k, distilleries[v].join(', ')] } private group(SOM model, double[] row) { double[][][] neurons = model.neurons() [0.. def (i, j) = pair MathEx.distance(neurons[i][j], row) } } names.toSorted{ e1, e2 -> e1.key[0] <=> e2.key[0] ?: e1.key[1] <=> e2.key[1] }.each { k, v -> println "Cluster ${k[0]},${k[1]}: $v" } def tooltip = { i, j -> names[[i, j]] ?: '' } as Hexmap.Tooltip new Hexmap(model.umatrix(), Palette.jet(256), tooltip) .canvas().tap { setAxisLabels('', '') }.window()

Slide 130

Slide 130 text

Whiskey – SOM with Heatmap Cluster 0,0: Auchentoshan, Glengoyne Cluster 0,1: AnCnoc Cluster 0,2: Inchgower, Tomintoul Cluster 0,3: ArranIsleOf, Speyburn Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla Cluster 0,5: Glenfarclas, Glenlivet, Glenturret Cluster 0,6: Aberlour, Strathisla Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich Cluster 1,1: Cardhu, Glenallachie Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond Cluster 1,3: Benriach, Dalwhinnie Cluster 1,4: GlenOrd Cluster 1,5: Craigallechie, GlenMoray, Longmorn Cluster 1,6: Edradour, Knochando Cluster 2,0: GlenSpey, Miltonduff Cluster 2,1: Tamnavulin Cluster 2,2: Mannochmore Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory Cluster 2,4: Deanston, Scapa Cluster 2,5: Benromach Cluster 2,6: Belvenie, BenNevis, Benrinnes Cluster 3,0: GlenKeith, Tullibardine Cluster 3,1: Balblair, Glenmorangie, Strathmill Cluster 3,3: Tomore Cluster 3,4: Ardmore, OldFettercairn Cluster 3,5: Aberfeldy, BlairAthol Cluster 3,6: Glendullan, RoyalLochnagar Cluster 4,0: Craigganmore, Dufftown Cluster 4,1: OldPulteney Cluster 4,2: Oban Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank Cluster 4,4: GlenDeveronMacduff, Tomatin Cluster 4,5: Auchroisk, Glenrothes Cluster 4,6: Balmenach, Glendronach, Macallan Cluster 5,0: GlenGarioch, Teaninich Cluster 5,1: Caol Ila, Clynelish, Talisker Cluster 5,2: Ardbeg, Lagavulin, Laphroig Cluster 5,4: Bowmore, Highland Park Cluster 5,5: Dalmore Cluster 5,6: Dailuaine, Mortlach

Slide 131

Slide 131 text

Whiskey – SOM with Heatmap Cluster 0,0: Auchentoshan, Glengoyne Cluster 0,1: AnCnoc Cluster 0,2: Inchgower, Tomintoul Cluster 0,3: ArranIsleOf, Speyburn Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla Cluster 0,5: Glenfarclas, Glenlivet, Glenturret Cluster 0,6: Aberlour, Strathisla Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich Cluster 1,1: Cardhu, Glenallachie Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond Cluster 1,3: Benriach, Dalwhinnie Cluster 1,4: GlenOrd Cluster 1,5: Craigallechie, GlenMoray, Longmorn Cluster 1,6: Edradour, Knochando Cluster 2,0: GlenSpey, Miltonduff Cluster 2,1: Tamnavulin Cluster 2,2: Mannochmore Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory Cluster 2,4: Deanston, Scapa Cluster 2,5: Benromach Cluster 2,6: Belvenie, BenNevis, Benrinnes Cluster 3,0: GlenKeith, Tullibardine Cluster 3,1: Balblair, Glenmorangie, Strathmill Cluster 3,3: Tomore Cluster 3,4: Ardmore, OldFettercairn Cluster 3,5: Aberfeldy, BlairAthol Cluster 3,6: Glendullan, RoyalLochnagar Cluster 4,0: Craigganmore, Dufftown Cluster 4,1: OldPulteney Cluster 4,2: Oban Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank Cluster 4,4: GlenDeveronMacduff, Tomatin Cluster 4,5: Auchroisk, Glenrothes Cluster 4,6: Balmenach, Glendronach, Macallan Cluster 5,0: GlenGarioch, Teaninich Cluster 5,1: Caol Ila, Clynelish, Talisker Cluster 5,2: Ardbeg, Lagavulin, Laphroig Cluster 5,4: Bowmore, Highland Park Cluster 5,5: Dalmore Cluster 5,6: Dailuaine, Mortlach

Slide 132

Slide 132 text

Whiskey – Hierarchical clustering with Dendrogram import … def file = new File(getClass().classLoader.getResource('whiskey.csv').file) def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader()) String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral'] def data = table.select(cols).toArray() def distilleries = table.column('Distillery').toStringArray() def ninetyDeg = 1.57 // radians def FOREST_GREEN = new Color(0X808000) def clusters = HierarchicalClustering.fit(CompleteLinkage.of(data)) //println clusters.tree //println clusters.height def partitions = clusters.partition(4) // little trick to work out cluster colors def colorMap = new LinkedHashSet(partitions.toList()).toList().reverse().indexed().collectEntries { k, v -> [v, Palette.COLORS[k]] } Font font = new Font("BitStream Vera Sans", Font.PLAIN, 12) …

Slide 133

Slide 133 text

Whiskey – Hierarchical clustering with Dendrogram … def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap { title = 'Whiskey Dendrogram' setAxisLabels('Distilleries', 'Similarity') def lb = lowerBounds setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds) distilleries.eachWithIndex { String label, int i -> add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]])) } }.panel() def pca = PCA.fit(data) pca.projection = 2 def projected = pca.project(data) char mark = '#' def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap { title = 'Clustered by dendrogram partitions' setAxisLabels('PCA1', 'PCA2') }.panel() new PlotGrid(dendrogram, scatter).window()

Slide 134

Slide 134 text

Whiskey – Hierarchical clustering with Dendrogram … def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap { title = 'Whiskey Dendrogram' setAxisLabels('Distilleries', 'Similarity') def lb = lowerBounds setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds) distilleries.eachWithIndex { String label, int i -> add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]])) } }.panel() def pca = PCA.fit(data) pca.projection = 2 def projected = pca.project(data) char mark = '#' def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap { title = 'Clustered by dendrogram partitions' setAxisLabels('PCA1', 'PCA2') }.panel() new PlotGrid(dendrogram, scatter).window()

Slide 135

Slide 135 text

Whiskey – Exploring Weka clustering algorithms
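A minimal scripted sketch of such an exploration is below, assuming the standard Weka clusterer and evaluation classes (SimpleKMeans, EM, ClusterEvaluation) and — unlike the earlier SimpleKMeans example — dropping the Distillery column so only the numeric flavour features are clustered.

import weka.clusterers.ClusterEvaluation
import weka.clusterers.EM
import weka.clusterers.SimpleKMeans
import weka.core.Instances
import weka.core.converters.CSVLoader

def loader = new CSVLoader(file: new File('whiskey.csv'))
def instances = loader.dataSet
instances.deleteAttributeAt(0) // remove RowID (as in the earlier example)
instances.deleteAttributeAt(0) // drop Distillery so only flavour features remain

[new SimpleKMeans(numClusters: 5), new EM(numClusters: 5)].each { clusterer ->
    clusterer.buildClusterer(new Instances(instances)) // copy so each run starts fresh
    def eval = new ClusterEvaluation(clusterer: clusterer)
    eval.evaluateClusterer(new Instances(instances))
    println "${clusterer.class.simpleName}: ${eval.numClusters} clusters"
}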

Slide 136

Slide 136 text

Whiskey flavors – scaling clustering import … def rows = Table.read().csv('whiskey.csv') String[] features = [ 'Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral' ] def data = rows.as().doubleMatrix(features) def cfg = new IgniteConfiguration(…) Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def trainer = new KMeansTrainer().withDistance(new EuclideanDistance()).withAmountOfClusters(5) def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def mdl = trainer.fit(ignite, dataCache, vectorizer) println ">>> KMeans centroids" println features[1..-1].join(', ‘) mdl.centers.each { c -> println c.all().collect{ sprintf '%.4f', it.get() }.join(', ') } dataCache.destroy() } [17:18:07] Ignite node started OK (id=ea92fc20) >>> Ignite grid started for data: 86 rows X 13 cols >>> KMeans centroids Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral 2.9091, 1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455 1.8095, 2.4762, 1.5714, 0.4286, 0.1429, 1.5238, 1.7143, 0.8571, 0.7143, 1.6667, 1.6190, 2.0476 2.5217, 2.3913, 1.4783, 0.0435, 0.0435, 1.6087, 1.4783, 1.9565, 2.0435, 2.0435, 1.9565, 1.3913 1.3704, 2.3704, 1.0000, 0.2593, 0.0370, 0.7778, 0.9259, 0.4074, 1.5185, 1.7407, 2.0000, 2.1111 3.2500, 2.2500, 1.5000, 0.0000, 0.0000, 3.0000, 2.0000, 1.0000, 1.5000, 2.5000, 2.2500, 2.0000 [17:18:07] Ignite node stopped OK [uptime=00:00:00.368]

Slide 137

Slide 137 text

Whiskey flavors – scaling clustering import … def spark = builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate def file = '/path/to/whiskey.csv' int k = 5 Dataset rows = spark.read().format('com.databricks.spark.csv') .options('header': 'true', 'inferSchema': 'true').load(file) String[] colNames = rows.columns().toList().minus(['RowID', 'Distillery']) def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features') Dataset dataset = assembler.transform(rows) def clusterer = new KMeans(k: k, seed: 1L) def model = clusterer.fit(dataset) println 'Cluster centers:' model.clusterCenters().each{ println it.values().collect{ sprintf '%.2f', it }.join(', ') } spark.stop() Cluster centers: 1.73, 2.35, 1.58, 0.81, 0.19, 1.15, 1.42, 0.81, 1.23, 1.77, 1.23, 1.31 2.00, 1.00, 3.00, 0.00, 0.00, 0.00, 3.00, 1.00, 0.00, 2.00, 2.00, 2.00 2.86, 2.38, 1.52, 0.05, 0.00, 1.95, 1.76, 2.05, 1.81, 2.05, 2.19, 1.71 1.53, 2.38, 1.06, 0.16, 0.03, 1.09, 1.00, 0.50, 1.53, 1.75, 2.13, 2.28 3.67, 1.50, 3.67, 3.33, 0.67, 0.17, 1.67, 0.50, 1.17, 1.33, 1.17, 0.17

Slide 138

Slide 138 text

Whiskey flavors – scaling clustering class Point { double[] pts } class TaggedPoint extends Point { int cluster } @Canonical(includeSuperProperties=true) class TaggedPointCounter extends TaggedPoint { long count TaggedPointCounter plus(TaggedPointCounter other) { new TaggedPointCounter((0..

Slide 139

Slide 139 text

Whiskey flavors – scaling clustering class SelectNearestCentroid implements FunctionDescriptor.ExtendedSerializableFunction { Iterable centroids @Override void open(ExecutionContext executionCtx) { centroids = executionCtx.getBroadcast("centroids") } @Override TaggedPointCounter apply(Point point) { def minDistance = Double.POSITIVE_INFINITY def nearestCentroidId = -1 for (centroid in centroids) { def distance = sqrt((0..

Slide 140

Slide 140 text

Whiskey flavors – scaling clustering def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file def rheemContext = new RheemContext(configuration) .withPlugin(Java.basicPlugin()) .withPlugin(Spark.basicPlugin()) def planBuilder = new JavaPlanBuilder(rheemContext) .withJobName("KMeans ($url, k=$k, iterations=$iterations)") def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) } def pointsBuilder = planBuilder.loadCollection(points) def random = new Random() double[][] initPts = (1..k).collect{ (0..

Slide 141

Slide 141 text

Whiskey flavors – scaling clustering def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file def rheemContext = new RheemContext(configuration) .withPlugin(Java.basicPlugin()) .withPlugin(Spark.basicPlugin()) def planBuilder = new JavaPlanBuilder(rheemContext) .withJobName("KMeans ($url, k=$k, iterations=$iterations)") def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) } def pointsBuilder = planBuilder.loadCollection(points) def random = new Random() double[][] initPts = (1..k).collect{ (0..

Slide 142

Slide 142 text

Clustering lab – repeat some of the examples with beer https://rstudio-pubs-static.s3.amazonaws.com/560654_2b44eef198f24a0bba45e002ae86d237.html

Slide 143

Slide 143 text

Clustering lab – repeat some of the examples with beer

Slide 144

Slide 144 text

Classification Overview Classification: • Predicting the class of some data Algorithms: • Logistic Regression • Naïve Bayes • Stochastic Gradient Descent • K-Nearest Neighbours • Decision Tree • Random Forest • Support Vector Machine Aspects: • Over/underfitting • Ensemble • Confusion Applications: • Image recognition • Speech recognition • Medical diagnosis
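As a concrete illustration of the "Confusion" aspect listed above, a minimal sketch in plain Groovy (no libraries; the actual and predicted labels are made-up, not output from any model in these slides):

def species = ['setosa', 'versicolor', 'virginica']
def actual    = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1] // true class index per sample
def predicted = [0, 0, 1, 2, 1, 2, 2, 1, 2, 1] // classifier's prediction per sample

def confusion = new int[species.size()][species.size()]
actual.indices.each { i -> confusion[actual[i]][predicted[i]]++ }

println 'Confusion matrix (rows = actual, cols = predicted):'
confusion.eachWithIndex { row, i -> println "${species[i].padRight(12)} ${row.toList()}" }

def correct = actual.indices.count { actual[it] == predicted[it] }
printf 'Accuracy: %.2f%n', correct / actual.size()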

Slide 145

Slide 145 text

Case Study: classification of Iris flowers Introduced by British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis 150 samples, 50 each of three species of Iris: • setosa • versicolor • virginica Four features measured for each sample: • sepal length • sepal width • petal length • petal width https://en.wikipedia.org/wiki/Iris_flower_data_set https://archive.ics.uci.edu/ml/datasets/Iris setosa versicolor virginica sepal petal
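A minimal plain-Groovy sketch to peek at the data before modelling; it assumes an iris_data.csv with a header row, four numeric feature columns and the species label in the last column, matching the file used in the Weka examples that follow:

def lines = new File('iris_data.csv').readLines()*.split(',')
def header = lines.head()
def samples = lines.tail()
println "Samples: ${samples.size()}, features: ${header.size() - 1}"
samples.countBy { it[-1] }.each { name, count -> println "$name: $count" } // expect 50 of each species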

Slide 146

Slide 146 text

Iris flower data – explore with Weka

Slide 147

Slide 147 text

Iris flower data – explore with Weka (cont’d)

Slide 148

Slide 148 text

Iris flower data – explore with Weka (cont’d)

Slide 149

Slide 149 text

Iris flower data – explore with Weka (cont’d)

Slide 150

Slide 150 text

Iris flower data – explore with Weka (cont’d)

Slide 151

Slide 151 text

Iris flower data – explore with Weka (cont’d)

Slide 152

Slide 152 text

Iris flower data – Weka Naïve Bayes def file = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new NaiveBayes() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(2) } double[] petalL = allInstances.collect{ it.value(3) } def indices = actual.indices

Slide 153

Slide 153 text

Iris flower data – Weka Naïve Bayes def file = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new NaiveBayes() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart()

Slide 154

Slide 154 text

def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart() Iris flower data – Weka Naïve Bayes def file = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new NaiveBayes() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices Iris-setosa: 50 correct, 0 incorrect Iris-versicolor: 48 correct, 4 incorrect Iris-virginica: 46 correct, 2 incorrect

Slide 155

Slide 155 text

Iris flower data – Weka Logistic Regression def file = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new SimpleLogistic() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(2) } double[] petalL = allInstances.collect{ it.value(3) } def indices = actual.indices

Slide 156

Slide 156 text

Iris flower data – Weka Logistic Regression def file = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new SimpleLogistic() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart() Iris-setosa: 50 correct, 0 incorrect Iris-versicolor: 49 correct, 1 incorrect Iris-virginica: 49 correct, 1 incorrect

Slide 157

Slide 157 text

Iris flower data – Weka J48 Decision Tree def file = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new J48() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart() Petal width <= 0.6: Iris-setosa (50.0) Petal width > 0.6 | Petal width <= 1.7 | | Petal length <= 4.9: Iris-versicolor (48.0/1.0) | | Petal length > 4.9 | | | Petal width <= 1.5: Iris-virginica (3.0) | | | Petal width > 1.5: Iris-versicolor (3.0/1.0) | Petal width > 1.7: Iris-virginica (46.0/1.0) Number of Leaves : 5 Size of the tree : 9 Iris-setosa: 50 correct, 0 incorrect Iris-versicolor: 49 correct, 2 incorrect Iris-virginica: 48 correct, 1 incorrect

Slide 158

Slide 158 text

Iris flower data – wekaDeeplearning4j WekaPackageManager.loadPackages(true) def file = getClass().classLoader.getResource('iris_data.csv').file as File def loader = new CSVLoader(file: file) def data = loader.dataSet data.classIndex = 4 def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \ -lossFn weka.dl4j.lossfunctions.LossMCXENT\"") AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options) // Stratify and split Random rand = new Random(0) Instances randData = new Instances(data) randData.randomize(rand) randData.stratify(3) Instances train = randData.trainCV(3, 0) Instances test = randData.testCV(3, 0) // Build the classifier on the training data myClassifier.buildClassifier(train) // Evaluate the model on test data Evaluation eval = new Evaluation(test) eval.evaluateModel(myClassifier, test) println eval.toSummaryString() println eval.toMatrixString() Image: https://www.neuraldesigner.com/learning/examples/iris-flowers-classification

Slide 159

Slide 159 text

Iris flower data – wekaDeeplearning4j WekaPackageManager.loadPackages(true) def file = getClass().classLoader.getResource('iris_data.csv').file as File def loader = new CSVLoader(file: file) def data = loader.dataSet data.classIndex = 4 def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \ -lossFn weka.dl4j.lossfunctions.LossMCXENT\"") AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options) // Stratify and split Random rand = new Random(0) Instances randData = new Instances(data) randData.randomize(rand) randData.stratify(3) Instances train = randData.trainCV(3, 0) Instances test = randData.testCV(3, 0) // Build the classifier on the training data myClassifier.buildClassifier(train) // Evaluate the model on test data Evaluation eval = new Evaluation(test) eval.evaluateModel(myClassifier, test) println eval.toSummaryString() println eval.toMatrixString() [main] INFO org.deeplearning4j.nn.graph.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE] Training Dl4jMlpClassifier...: [] ETA: 00:00:00[INFO ] 00:03:31.035 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [1/10] took 00:00:00.670 Training Dl4jMlpClassifier...: [====== ] ETA: 00:00:06[INFO ] 00:03:31.152 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [2/10] took 00:00:00.113 Training Dl4jMlpClassifier...: [============ ] ETA: 00:00:03[INFO ] 00:03:31.244 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [3/10] took 00:00:00.090 Training Dl4jMlpClassifier...: [================== ] ETA: 00:00:02[INFO ] 00:03:31.325 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [4/10] took 00:00:00.079 Training Dl4jMlpClassifier...: [======================== ] ETA: 00:00:01[INFO ] 00:03:31.470 [main] weka.dl4j.listener.EpochListener - Epoch [5/10] Train Set: Loss: 0.510342 [INFO ] 00:03:31.470 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [5/10] took 00:00:00.144 Training Dl4jMlpClassifier...: [============================== ] ETA: 00:00:01[INFO ] 00:03:31.546 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [6/10] took 00:00:00.073 Training Dl4jMlpClassifier...: [==================================== ] ETA: 00:00:00[INFO ] 00:03:31.611 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [7/10] took 00:00:00.063 Training Dl4jMlpClassifier...: [========================================== ] ETA: 00:00:00[INFO ] 00:03:31.714 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [8/10] took 00:00:00.101 Training Dl4jMlpClassifier...: [================================================ ] ETA: 00:00:00[INFO ] 00:03:31.790 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [9/10] took 00:00:00.074 Training Dl4jMlpClassifier...: [====================================================== ] ETA: 00:00:00[INFO ] 00:03:31.882 [main] weka.dl4j.listener.EpochListener - Epoch [10/10] Train Set: Loss: 0.286469 …

Slide 160

Slide 160 text

Iris flower data – wekaDeeplearning4j WekaPackageManager.loadPackages(true) def file = getClass().classLoader.getResource('iris_data.csv').file as File def loader = new CSVLoader(file: file) def data = loader.dataSet data.classIndex = 4 def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \ -lossFn weka.dl4j.lossfunctions.LossMCXENT\"") AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options) // Stratify and split Random rand = new Random(0) Instances randData = new Instances(data) randData.randomize(rand) randData.stratify(3) Instances train = randData.trainCV(3, 0) Instances test = randData.testCV(3, 0) // Build the classifier on the training data myClassifier.buildClassifier(train) // Evaluate the model on test data Evaluation eval = new Evaluation(test) eval.evaluateModel(myClassifier, test) println eval.toSummaryString() println eval.toMatrixString() … [INFO ] 00:03:31.883 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [10/10] took 00:00:00.091 Training Dl4jMlpClassifier...: [============================================================] ETA: 00:00:00 Done! Correctly Classified Instances 40 80 % Incorrectly Classified Instances 10 20 % Kappa statistic 0.701 Mean absolute error 0.2542 Root mean squared error 0.3188 Relative absolute error 57.2154 % Root relative squared error 67.6336 % Total Number of Instances 50 === Confusion Matrix === a b c <-- classified as 17 0 0 | a = Iris-setosa 0 9 8 | b = Iris-versicolor 0 2 14 | c = Iris-virginica BUILD SUCCESSFUL in 22s

Slide 161

Slide 161 text

Classify language Apache OpenNLP parts: • sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, coreference resolution import opennlp.tools.langdetect.* def base = 'http://apache.forsale.plus/opennlp/models' def url = "$base/langdetect/1.8.3/langdetect-183.bin" def model = new LanguageDetectorModel(new URL(url)) def detector = new LanguageDetectorME(model) [ 'spa': 'Bienvenido a Madrid', 'fra': 'Bienvenue à Paris', 'dan': 'Velkommen til København' ].each { k, v -> assert detector.predictLanguage(v).lang == k } Supported language detection training models: • MaxEnt • Perceptron • Naïve Bayes

Slide 162

Slide 162 text

Classify language – entities String[] sentences = [ "A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.", "A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.", 'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.', 'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.', 'I saw Ms. May Smith waving to June Jones.', 'The parcel was passed from May to June.' ] // use a helper to cache models def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5') def modelNames = ['person', 'money', 'date', 'time', 'location'] def finders = modelNames.collect{ new NameFinderME(new TokenNameFinderModel(helper.load("en-ner-$it"))) } def tokenizer = SimpleTokenizer.INSTANCE …

Slide 163

Slide 163 text

Classify language – entities sentences.each { sentence -> String[] tokens = tokenizer.tokenize(sentence) Span[] tokenSpans = tokenizer.tokenizePos(sentence) def entityText = [:] def entityPos = [:] finders.indices.each {fi -> // could be made smarter by looking at probabilities and overlapping spans Span[] spans = finders[fi].find(tokens) spans.each{span -> def se = span.start..

Slide 164

Slide 164 text

Classify language – entities sentences.each { sentence -> String[] tokens = tokenizer.tokenize(sentence) Span[] tokenSpans = tokenizer.tokenizePos(sentence) def entityText = [:] def entityPos = [:] finders.indices.each {fi -> // could be made smarter by looking at probabilities and overlapping spans Span[] spans = finders[fi].find(tokens) spans.each{span -> def se = span.start..

Slide 165

Slide 165 text

Classify language – parts of speech def sentences = [ 'Paul has two sisters, Maree and Christine.', 'His bark was much worse than his bite', 'Turn on the lights to the main bedroom', "Light 'em all up", 'Make it dark downstairs' ] def tokenizer = SimpleTokenizer.INSTANCE sentences.each { String[] tokens = tokenizer.tokenize(it) def model = new POSModel(helper.load('en-pos-maxent')) def posTagger = new POSTaggerME(model) String[] tags = posTagger.tag(tokens) println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ') } NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) . PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite) VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) JJ(main) NN(bedroom) NN(Light) POS(') NN(em) DT(all) IN(up) VB(Make) PRP(it) JJ(dark) NN(downstairs)

Slide 166

Slide 166 text

Classify language – sentence detection def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5') def model = new SentenceModel(helper.load('en-sent')) def detector = new SentenceDetectorME(model) def sentences = detector.sentDetect(text) assert text.count('.') == 28 assert sentences.size() == 4 println sentences.join('\n\n')

Slide 167

Slide 167 text

Classify language – sentence detection def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5') def model = new SentenceModel(helper.load('en-sent')) def detector = new SentenceDetectorME(model) def sentences = detector.sentDetect(text) assert text.count('.') == 28 assert sentences.size() == 4 println sentences.join('\n\n') The most referenced scientific paper of all time is "Protein measurement with the Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall, R. J. and was published in the J. BioChem. in 1951. It describes a method for measuring the amount of protein (even as small as 0.2 γ, were γ is the specific weight) in solutions and has been cited over 300,000 times and can be found here: https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago before moving to Harvard under A. Baird Hastings. He was also the H.O.D of Pharmacology at Washington University in St. Louis for 29 years. The most referenced scientific paper of all time is "Protein measurement with the Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall, R. J. and was published in the J. BioChem. in 1951. It describes a method for measuring the amount of protein (even as small as 0.2 ?, were ? is the specific weight) in solutions and has been cited over 300,000 times and can be found here: https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago before moving to Harvard under A. Baird Hastings. He was also the H.O.D of Pharmacology at Washington University in St. Louis for 29 years.

Slide 168

Slide 168 text

Classify language Apache NLPCraft (incubating) start(LightSwitchModel) def cli = new NCTestClientBuilder().newBuilder().build() cli.open("nlpcraft.lightswitch.ex.java") println cli.ask('Turn on the lights in the master bedroom') println cli.ask("Light 'em all up") println cli.ask('Make it dark downstairs') // expecting no match if (cli) { cli.close() } stop() Lights are [on] in [master bedroom]. Lights are [on] in [entire house]. No matching intent found.

Slide 169

Slide 169 text

Classify language Apache NLPCraft (incubating) @NCIntentRef("ls") @NCIntentSample([ "Turn the lights off in the entire house.", "Switch on the illumination in the master bedroom closet.", "Get the lights on.", "Lights up in the kitchen.", "Please, put the light out in the upstairs bedroom.", "Set the lights on in the entire house.", … "Kill off all the lights now!", "No lights in the bedroom, please.", "Light up the garage, please!", "Kill the illumination now!" ]) NCResult onMatch( @NCIntentTerm("act") NCToken actTok, @NCIntentTerm("loc") List locToks) { String status = actTok.id == "ls:on" ? "on" : "off" String locations = locToks ? locToks.collect{ it.meta("nlpcraft:nlp:origtext") }.join(", ") : "entire house" // Add HomeKit, Arduino or other integration here. // By default - just return a descriptive action string. NCResult.text("Lights are [$status] in [${locations.toLowerCase()}].") } Lights are [on] in [master bedroom]. Lights are [on] in [entire house]. No matching intent found.

Slide 170

Slide 170 text

Classification with Neural networks For >1 layers (deep) model caters for non- linear relationships See: https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
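A minimal sketch (plain Groovy, made-up weights and inputs) of the forward pass through one hidden layer; the sigmoid applied after each weighted sum is the non-linearity that lets models with more than one layer capture non-linear relationships:

def sigmoid = { double x -> 1 / (1 + Math.exp(-x)) }

double[] input = [0.5, 0.2, 0.8] // one sample with three features
double[][] hiddenWeights = [[0.1, -0.4, 0.3], [0.7, 0.2, -0.6]] // two hidden neurons

def hiddenOutputs = hiddenWeights.collect { w ->
    sigmoid((0..<w.length).sum { i -> w[i] * input[i] }) // weighted sum, then non-linear squash
}
println "Hidden layer activations: $hiddenOutputs"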

Slide 171

Slide 171 text

Classification with Neural networks Case Study: MNIST • Handwritten digits • 60,000 training examples • 10,000 test examples • Size-normalized (28x28 pixels) and centered • http://yann.lecun.com/exdb/mnist/
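The examples that follow read these files with a MnistReader helper. As a minimal sketch of what is involved, the IDX label file is simply a 4-byte magic number (2049 for labels), a 4-byte count, then one unsigned byte per label; the path below is a placeholder:

import java.util.zip.GZIPInputStream

new GZIPInputStream(new FileInputStream('/path/to/train-labels-idx1-ubyte.gz')).withStream { stream ->
    def dis = new DataInputStream(stream)
    assert dis.readInt() == 2049 // magic number identifying an IDX label file
    int count = dis.readInt()
    def labels = (1..count).collect { dis.readUnsignedByte() }
    println "Read ${labels.size()} labels, first ten: ${labels.take(10)}"
}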

Slide 172

Slide 172 text

MNIST with Apache Commons Math class MnistNeuralNet { private static int inodes = 784 private static int hnodes = 200 private static int onodes = 10 private static double learning_rate = 0.1 private static RealMatrix inputHidden private static RealMatrix hiddenOutput static initRandom() { inputHidden = initRandom(createRealMatrix(hnodes, inodes), inodes**-0.5 as double) hiddenOutput = initRandom(createRealMatrix(onodes, hnodes), hnodes**-0.5 as double) } static double[] normalize(Integer[] img) { img.collect { it / 255.0d * 0.999d + 0.001d } as double[] } static RealMatrix createTarget(int label) { createRealMatrix(onodes, 1).tap { (0..

Slide 173

Slide 173 text

MNIST with Apache Commons Math … static RealMatrix initRandom(RealMatrix matrix, double desiredStandardDeviation) { scalar(matrix, it -> new Random().nextGaussian() * desiredStandardDeviation) } static void train(RealMatrix inputs, RealMatrix targets) { // forward RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs RealMatrix finalOutputs = scalarSigmoid(finalInputs) // back RealMatrix outputErrors = targets.subtract(finalOutputs) RealMatrix t1 = multiplyElements(outputErrors, finalOutputs) RealMatrix t2 = multiplyElements(t1, scalar(finalOutputs, it -> 1.0d - it)) RealMatrix t3 = t2 * hiddenOutputs.transpose() hiddenOutput = hiddenOutput.add(scalar(t3, it -> learning_rate * it)) RealMatrix hiddenErrors = hiddenOutput.transpose() * outputErrors t1 = multiplyElements(hiddenErrors, hiddenOutputs) t2 = multiplyElements(t1, scalar(hiddenOutputs, it -> 1.0d - it)) t3 = t2 * inputs.transpose() inputHidden = inputHidden.add(scalar(t3, it -> learning_rate * it)) } static void save(String filename) { new FileOutputStream(filename).withObjectOutputStream { oos -> oos.writeObject(inputHidden.data) oos.writeObject(hiddenOutput.data) } } } Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html MnistNeuralNet class supports training and prediction cont’d

Slide 174

Slide 174 text

MNIST with Apache Commons Math static RealMatrix scalar(RealMatrix matrix, Function function) { int numRows = matrix.rowDimension int numCols = matrix.columnDimension RealMatrix result = createRealMatrix(numRows, numCols) for (r in 0..= 0 && yRot >= 0 && xRot <= 27 && yRot <= 27) { result[y][x] = img[yRot][xRot] } } } return result } … Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Utility methods

Slide 175

Slide 175 text

MNIST with Apache Commons Math … static RealMatrix multiplyElements(RealMatrix matrixA, RealMatrix matrixB) { // elementWise multiplication has same compatibility requirements as addition checkAdditionCompatible(matrixA, matrixB) int numRows = matrixA.rowDimension int numCols = matrixA.columnDimension RealMatrix product = createRealMatrix(numRows, numCols) for (r in 0.. }, visit: { r, c, double v -> 1 / (1 + Math.exp(-v)) }, end : { -> 0d }] as RealMatrixChangingVisitor RealMatrix result = matrix.copy() result.walkInRowOrder(sigmoid) result } Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Utility methods cont’d

Slide 176

Slide 176 text

MNIST with Apache Commons Math int epochs = 10 initRandom() int[] labels = MnistReader.getLabels(Paths.get("/path/to/train-labels-idx1-ubyte.gz")) List images = MnistReader.getImages(Paths.get("/path/to/train-images-idx3-ubyte.gz")) println 'Processing images' def scaledImages = images.collect { image -> normalize(image.flatten() as Integer[]) } as double[][] def scaledImagesRot10 = images.collect { image -> normalize(rotate(image, 10).flatten() as Integer[]) } as double[][] def scaledImagesRot350 = images.collect { image -> normalize(rotate(image, -10).flatten() as Integer[]) } as double[][] epochs.times { e -> println "Running epoch: ${e + 1}" for (i in 0.. testImages = MnistReader.getImages(Paths.get("/path/to/t10k-images-idx3-ubyte.gz")) int correct = 0 for (i in 0..

Slide 177

Slide 177 text

MNIST with DeepLearning4J int rngSeed = 123 int batchSize = 125 int numEpochs = 15 int numInputs = 28 * 28 int hiddenLayerSize = 1000 int numOutputs = 10 def (trainSet, testSet) = [true, false].collect { new MnistDataSetIterator(batchSize, it, rngSeed) } def log = LoggerFactory.getLogger(getClass()) def conf = new NeuralNetConfiguration.Builder() .seed(rngSeed) //include the random seed for reproducibility // use stochastic gradient descent as an optimization algorithm .updater(new Nesterovs(0.006, 0.9)) .l2(1e-4) .list() // input layer: .layer(new DenseLayer.Builder() .nIn(numInputs) .nOut(hiddenLayerSize) .activation(Activation.RELU) .weightInit(WeightInit.XAVIER) .build()) // hidden layer: .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD) .nIn(hiddenLayerSize) .nOut(numOutputs) .activation(Activation.SOFTMAX) .weightInit(WeightInit.XAVIER) .build()) .build() … … log.info("Creating model ...") def model = new MultiLayerNetwork(conf) model.init() log.info("Training model ...") model.listeners = new ScoreIterationListener(100) model.fit(trainSet, numEpochs) // model.save('mlp_model.dat' as File) log.info("Evaluating model...") def eval = model.evaluate(testSet) log.info(eval.stats()) ======== Evaluation Metrics ======== # of classes: 10 Accuracy: 0.9728 Precision: 0.9728 Recall: 0.9726 F1 Score: 0.9726 Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j- examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java 0 1 2 9 … … … 10 28 * 28 = 784 1000

Slide 178

Slide 178 text

… MNIST with DeepLearning4J Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j- examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java 0 1 2 9 … … … 10 28 * 28 = 784 500 int rngSeed = 123 int batchSize = 64 int numEpochs = 15 int numInputs = 28 * 28 int numOutputs = 10 double rate = 0.0015 def (trainSet, testSet) = [true, false].collect { new MnistDataSetIterator(batchSize, it, rngSeed) } def log = LoggerFactory.getLogger(getClass()) def conf = new NeuralNetConfiguration.Builder() .seed(rngSeed) //include the random seed for reproducibility // use stochastic gradient descent as an optimization algorithm .updater(new Nadam()) .l2(rate * 0.005) // regularized .list() // first input layer: .layer(new DenseLayer.Builder() .nIn(numInputs) .nOut(500) .build()) // second input layer: .layer(new DenseLayer.Builder() .nIn(500) .nOut(100) .build()) // hidden layer: .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD) .nOut(numOutputs) .activation(Activation.SOFTMAX) .build()) .build() … log.info("Creating model ...") def model = new MultiLayerNetwork(conf) model.init() log.info("Training model ...") model.listeners = new ScoreIterationListener(100) model.fit(trainSet, numEpochs) log.info("Evaluating model...") def eval = model.evaluate(testSet) log.info(eval.stats()) ======== Evaluation Metrics ======== # of classes: 10 Accuracy: 0.9820 Precision: 0.9820 Recall: 0.9818 F1 Score: 0.9819 100

Slide 179

Slide 179 text

MNIST GUI with GroovyFX import javafx.embed.swing.SwingFXUtils import javafx.scene.canvas.GraphicsContext import javafx.scene.image.Image import javafx.scene.image.PixelFormat import javafx.scene.paint.Color import javax.imageio.ImageIO static clear(GraphicsContext g, size) { g.fill = Color.WHITE; g.fillRect(0, 0, size, size); g.fill = Color.BLACK } static int[] imageToArray(Image img) { int w = img.width int h = img.height int[] buf = new int[h * w] img.pixelReader.getPixels(0, 0, w, h, PixelFormat.intArgbInstance, buf, 0, w) buf } static Image snapshot(canvas) { def baos = new ByteArrayOutputStream() ImageIO.write(SwingFXUtils.fromFXImage(canvas.snapshot(null, null), null), "png", baos) new Image(new ByteArrayInputStream(baos.toByteArray()), 28, 28, true, true) } static displayResult(List items, Integer predict) { items.indexed().collect{ idx, val -> def marker = (idx == predict && items[predict] > 0.5) ? ' **' : '' "$idx ${val ? sprintf('%.4f', val) + marker : '?'}" }.join('\n') } A few more utility methods

Slide 180

Slide 180 text

MNIST GUI with GroovyFX class MnistInfer { private static RealMatrix inputHidden private static RealMatrix hiddenOutput static void load(Class klass) { def weights = klass.classLoader.getResourceAsStream('weights') try (ObjectInputStream ois = new ObjectInputStream(weights)) { inputHidden = MatrixUtils.createRealMatrix((double[][]) ois.readObject()) hiddenOutput = MatrixUtils.createRealMatrix((double[][]) ois.readObject()) } } static RealMatrix query(double[] inputArray) { RealMatrix inputs = createColumnRealMatrix(inputArray) RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs return scalarSigmoid(finalInputs) } static double[] normalize(int[] img) { double[] result = new double[img.length] for (int i = 0; i < img.length; i++) { int red = (img[i] >> 16) & 0xff int green = (img[i] >> 8) & 0xff int blue = img[i] & 0xff result[i] = 1 - ((red + green + blue) / 765.0 * 999 + 1) / 1000 } return result } } Predictor class

Slide 181

Slide 181 text

MNIST GUI with GroovyFX … start { stage(title: 'MNIST', visible: true) { scene(id: "scene", fill: WHITE) { borderPane { top { canvas(id: "canvas", width: size, height: size) def g = canvas.graphicsContext2D clear(g, size) canvas.onMouseDragged { e -> g.fillOval e.x - pen, e.y - pen, pen * 2, pen * 2 } } center { hbox(alignment: 'Center') { button('Clear', onAction: { clear(canvas.graphicsContext2D, size) out.text = displayResult([null] * 10, null) }) button('Predict', onAction: { def result = Mnist.query(Mnist.scale(imageToArray(snapshot(canvas)))) def predictLabel = Mnist.maxIndex(result) out.text = displayResult(result.data.collect{ it[0] }, predictLabel) }) } } bottom { textArea(displayResult([null] * 10, null), id: 'out', editable: false, prefColumnCount: 16) } } } } }

Slide 182

Slide 182 text

Apache MXNET Source: https://mxnet.apache.org/versions/master/faq/why_mxnet.html

Slide 183

Slide 183 text

Apache MXNET Object detection import groovy.swing.SwingBuilder import javax.imageio.ImageIO import org.apache.mxnet.infer.javaapi.ObjectDetector import org.apache.mxnet.javaapi.* import static javax.swing.WindowConstants.DISPOSE_ON_CLOSE static void downloadUrl(String srcDirUrl, String destDir, String filename) { def destination = new File(destDir, filename) if (!destination.parentFile.exists()) { destination.parentFile.mkdirs() } if (!destination.exists()) { destination.bytes = new URL(srcDirUrl + filename).bytes } } static downloadModelImage() { String baseDir = "${System.getProperty('java.io.tmpdir')}/resnetssd/" def imageName = 'dog-ssd.jpg' def modelName = 'resnet50_ssd_model' String imgURL = "https://s3.amazonaws.com/model-server/inputs/" println "Downloading image to ${baseDir}..." downloadUrl(imgURL, baseDir, imageName) println "Downloading model files to ${baseDir}... (may take a while)" String modelURL = "https://s3.amazonaws.com/model-server/models/resnet50_ssd/" downloadUrl(modelURL, baseDir, "$modelName-symbol.json") downloadUrl(modelURL, baseDir, "$modelName-0000.params") downloadUrl(modelURL, baseDir, 'synset.txt') [baseDir + imageName, baseDir + modelName] } … Utility methods for downloading test image and pre-trained model Test image

Slide 184

Slide 184 text

Apache MXNET Object detection … static detectObjects(String modelPath, String imagePath, inputShape) { def context = [Context.cpu()] def inputDescriptor = new DataDesc("data", inputShape, DType.Float32(), "NCHW") def objDet = new ObjectDetector(modelPath, [inputDescriptor], context, 0) return objDet.imageObjectDetect(ObjectDetector.loadImageFromFile(imagePath), 3) } def (imagePath, modelPath) = downloadModelImage() def image = ImageIO.read(imagePath as File) def (w, h) = image.with{ [it.width, it.height] } def count = 1 // batch size def channels = 3 // for RGB def inputShape = new Shape(count, channels, w * 0.8 as int, h) def results = detectObjects(modelPath, imagePath, inputShape).sum() def boxes = results.collect {[ xmin: w * it.XMin as int, ymin: h * it.YMin as int, xmax: w * it.XMax as int, ymax: h * it.YMax as int ]} def names = results.collect{ it.className + sprintf(' %.3f', it.probability) } (0..

Slide 185

Slide 185 text

Bonus Material

Slide 186

Slide 186 text

Scaling up Data Science If your big data framework doesn’t have the algorithm you need? Preprocessing Partition based dataset Linear regression K-Means clustering Genetic algorithms Multilayer perceptron Decision trees k-NN classification k-NN regression SVM binary classification SVM multi-class Classification Model cross validation Logistic regression Random forest Gradient boosting Model updating ANN (Approximate Nearest Neighbor) Deep learning Summary statistics Correlations Stratified sampling Hypothesis testing Random data generation Classification and regression: support Vector machines logistic regression Linear regression Decision trees Naive Bayes classification Alternating least squares (ALS) K-Means clustering Latent Dirichlet allocation (LDA) Singular value decomposition (SVD) Principal component analysis (PCA) Feature extraction Transformation functions Optimization algorithms Re-formulate your algorithm using map/filter/reduce
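A minimal sketch (plain Groovy, made-up numbers) of the map/filter/reduce re-formulation: each chunk is mapped to a partial (sum, count) pair and the pairs are reduced into a global mean, the same shape the house-price example below uses for its models and stats:

def values = (1..100000).collect { it as double }
def partials = values.collate(10000).collect { chunk -> [chunk.sum(), chunk.size()] } // map each chunk to (sum, count)
def (totalSum, totalCount) = partials.inject([0d, 0]) { acc, p -> [acc[0] + p[0], acc[1] + p[1]] } // reduce the pairs
println "mean = ${totalSum / totalCount}"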

Slide 187

Slide 187 text

House pricing revisited – Initial algorithm Data → Split → Training / Test → Fit (Training) → Model → Predict and Evaluate (Test) → Stats
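A minimal sketch (plain Groovy, made-up rows) of the Split step in this picture: shuffle the rows, keep 80% for training and the rest for evaluation:

def rows = (1..100).collect { [it, it * 2.5 + 3] } // made-up (feature, price) rows
Collections.shuffle(rows, new Random(42)) // reproducible shuffle
def cut = (rows.size() * 0.8) as int
def (train, test) = [rows[0..<cut], rows[cut..-1]]
println "train: ${train.size()} rows, test: ${test.size()} rows"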

Slide 188

Slide 188 text

House pricing revisited – Rejigging our algorithm Before: Data → Split → Training / Test → Fit → Model → Predict and Evaluate → Stats After: Data → Split → Chunk1 … ChunkN → Fit each chunk → Model1 … ModelN → Aggregate → Model, then Data → Split → Chunk1 … ChunkN → Predict and Evaluate each chunk against Model → Stats1 … StatsN → Aggregate → Stats
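A minimal sketch (plain Groovy, made-up numbers) of the "Aggregate" step for the models: each chunk's fit yields [intercept, coefficient, ...] and the combined model here is the element-wise mean, similar in spirit to the MeanDoubleArrayCols combine used in the Apache Beam version below:

def chunkModels = [
    [1000.0, 2.0, 3.0],
    [1100.0, 2.2, 2.8],
    [900.0, 1.8, 3.2]
] // each row: [intercept, coeff1, coeff2] from fitting one chunk
def combined = chunkModels.transpose()*.average() // element-wise mean across chunks
println "Aggregated model (intercept + coefficients): $combined"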

Slide 189

Slide 189 text

House pricing revisited - GPars import org.apache.commons.math3.stat.StatUtils import smile.math.Math import smile.regression.OLS import tech.tablesaw.api.* import groovyx.gpars.GParsPool import static java.lang.Math.sqrt def features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def file = getClass().classLoader.getResource('kc_house_data.csv').file def table = Table.read().csv(file) table = table.dropWhere(table.column("bedrooms").isGreaterThan(30)) def data = table.as().doubleMatrix(*features) … • Read CSV in normal way • Split data into chunks • Fit each chunk and aggregate resulting models • Split data into chunks • Evaluate each chunk against model and aggregate resulting stats

Slide 190

Slide 190 text

House pricing revisited - GPars … GParsPool.withPool { def trainChunkSize = 3000 def numTrainChunks = 8 def models = (0.. def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(sumSq(residuals as double[]) / chunk.size()) [rmse, residuals.average(), chunk.size()] } def evalChunkSize = 2000 def results = data.collate(evalChunkSize).collectParallel(stats) println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size()) println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size() } • Read CSV in normal way • Split data into chunks • Fit each chunk and aggregate resulting models • Split data into chunks • Evaluate each chunk against model and aggregate resulting stats

Slide 191

Slide 191 text

House pricing revisited - GPars … GParsPool.withPool { def trainChunkSize = 3000 def numTrainChunks = 8 def models = (0.. def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(sumSq(residuals as double[]) / chunk.size()) [rmse, residuals.average(), chunk.size()] } def evalChunkSize = 2000 def results = data.collate(evalChunkSize).collectParallel(stats) println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size()) println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size() } Intercept: -3.16904445043026E7 Coefficients: -21131.141578491588, -5431.675525641348, 179.47932136012156, 11.273983523229091, 657752.8815669084, -10.658550821995767, 86876.08541623666, 61708.58759419169, 611717.9878928157, -21819.967894981055 RMSE: 215185.8177879586 mean: -2151.974648800721

Slide 192

Slide 192 text

… GParsPool.withPool { def trainChunkSize = 3000 def numTrainChunks = 8 def models = (0.. def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(sumSq(residuals as double[]) / chunk.size()) [rmse, residuals.average(), chunk.size()] } def evalChunkSize = 2000 def results = data.collate(evalChunkSize).collectParallel(stats) println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size()) println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size() } House pricing revisited - GPars Data Chunk1 Model1 Stats shuffle/take results … Chunk2 Model2 ChunkN ModelN OLS#intercept/OLS#coefficients Model … … ChunkN Stats1 Stats2 StatsN … Chunk2 Chunk1 results.collect transpose/sum model collate stats

Slide 193

Slide 193 text

House pricing revisited – Apache Beam … class MeanDoubleArrayCols implements SerializableFunction, double[]> { @Override double[] apply(Iterable input) { double[] sum = null def count = 0 for (double[] next : input) { if (sum == null) { sum = new double[next.size()] (0.., double[]> { @Override double[] apply(Iterable input) { double[] sum = null for (double[] next : input) { if (sum == null) { sum = new double[next.size()] (0..
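The generic type parameters and loop bodies of the two combiner classes above were eaten by the slide export. A sketch of what an element-wise mean combiner like `MeanDoubleArrayCols` could look like; everything beyond the variable names visible above is an assumption:

import org.apache.beam.sdk.transforms.SerializableFunction

// averages a collection of double[] column-wise, e.g. the per-chunk [intercept, *coefficients] models
class MeanDoubleArrayCols implements SerializableFunction<Iterable<double[]>, double[]> {
    @Override
    double[] apply(Iterable<double[]> input) {
        double[] sum = null
        def count = 0
        for (double[] next : input) {
            if (sum == null) sum = new double[next.size()]
            (0..<next.size()).each { sum[it] += next[it] }
            count++
        }
        (0..<sum.size()).each { sum[it] /= count }
        sum
    }
}

`AggregateModelStats` would presumably combine the per-chunk [rmse, mean, count] triples in a similar way, weighting by count, matching the aggregation formulas used in the GPars version.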

Slide 194

Slide 194 text

House pricing revisited – Apache Beam … static buildPipeline(Pipeline p) { def features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def readCsvChunks = new DoFn() { @ProcessElement void processElement(@Element String path, OutputReceiver receiver) throws IOException { def chunkSize = 6000 def table = Table.read().csv(path) table = table.dropWhere(table.column("bedrooms").isGreaterThan(30)) def idxs = 0..() { @ProcessElement void processElement(@Element double[][] rows, OutputReceiver receiver) throws IOException { double[] model = new OLS(rows.collect{ it[1..-1] } as double[][], rows.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] } receiver.output(model) } } def evalModel = { double[][] chunk, double[] model -> double intercept = model[0] double[] coefficients = model[1..-1] def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(sumSq(residuals as double[]) / chunk.size()) def mean = residuals.average() [rmse, mean, chunk.size()] as double[] } … Map filename to chunks Map chunk to model Map chunk to stats
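The body of `readCsvChunks` is cut off above after `def idxs = 0..`. One plausible continuation, mirroring the chunking used in the GPars version; the shuffle and the conversion of rows to double[][] are assumptions:

// inside readCsvChunks.processElement, after reading the CSV and dropping outlier rows
def data = table.as().doubleMatrix(*features)            // same conversion as the earlier slides
def idxs = (0..<data.length).toList()
Collections.shuffle(idxs)
idxs.collate(chunkSize).each { chunkIdxs ->
    receiver.output(chunkIdxs.collect { data[it] } as double[][])
}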

Slide 195

Slide 195 text

House pricing revisited – Apache Beam … def model2out = new DoFn() { @ProcessElement void processElement(@Element double[] ds, OutputReceiver out) { out.output("intercept: ${ds[0]}, coeffs: ${ds[1..-1].join(', ')}".toString()) } } def stats2out = new DoFn() { @ProcessElement void processElement(@Element double[] ds, OutputReceiver out) { out.output("rmse: ${ds[0]}, mean: ${ds[1]}, count: ${ds[2]}".toString()) } } var csvChunks = p .apply(Create.of('/path/to/kc_house_data.csv')) .apply('Create chunks', ParDo.of(readCsvChunks)) var model = csvChunks .apply('Fit chunks', ParDo.of(fitModel)) .apply(Combine.globally(new MeanDoubleArrayCols())) var modelView = model .apply(View.asSingleton()) csvChunks .apply(ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)) .apply(Combine.globally(new AggregateModelStats())) .apply('Log stats', ParDo.of(stats2out)).apply(Log.ofElements()) model .apply('Log model', ParDo.of(model2out)).apply(Log.ofElements()) } def pipeline = Pipeline.create() buildPipeline(pipeline) pipeline.run().waitUntilFinish() INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274 INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0 Map model to output Map stats to output Create and run pipeline Build pipeline

Slide 196

Slide 196 text

… var csvChunks = p | Create.of('/path/to/kc_house_data.csv') | 'Create chunks' >> ParDo.of(readCsvChunks) var model = csvChunks | 'Fit chunks' >> ParDo.of(fitModel) | Combine.globally(new MeanDoubleArrayCols()) var modelView = model | View.asSingleton() csvChunks | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView) | Combine.globally(new AggregateModelStats()) | 'Log stats' >> ParDo.of(stats2out) | Log.ofElements() model | 'Log model' >> ParDo.of(model2out) | Log.ofElements() } PCollection.metaClass.or = { List arg -> delegate.apply(*arg) } PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) } String.metaClass.rightShift = { PTransform arg -> [delegate, arg] } Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) } def pipeline = Pipeline.create() buildPipeline(pipeline) pipeline.run().waitUntilFinish() House pricing revisited – Apache Beam INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274 INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0 Create and run pipeline Some optional metaprogramming for Python-like coding style

Slide 197

Slide 197 text

House pricing revisited – Apache Beam … var csvChunks = p | Create.of('/path/to/kc_house_data.csv') | 'Create chunks' >> ParDo.of(readCsvChunks) var model = csvChunks | 'Fit chunks' >> ParDo.of(fitModel) | Combine.globally(new MeanDoubleArrayCols()) var modelView = model | View.asSingleton() csvChunks | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView) | Combine.globally(new AggregateModelStats()) | 'Log stats' >> ParDo.of(stats2out) | Log.ofElements() model | 'Log model' >> ParDo.of(model2out) | Log.ofElements() } PCollection.metaClass.or = { List arg -> delegate.apply(*arg) } PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) } String.metaClass.rightShift = { PTransform arg -> [delegate, arg] } Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) } def pipeline = Pipeline.create() buildPipeline(pipeline) pipeline.run().waitUntilFinish() INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274 INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0 Data Chunk1 Model1 Stats readCsvChunks csvChunks … Chunk2 Model2 ChunkN ModelN fitModel Model … … ChunkN Stats1 Stats2 StatsN … Chunk2 Chunk1 AggregateModelStats MeanDoubleArrayCols csvChunks model modelView sideInput evalModel

Slide 198

Slide 198 text

Optimization Overview
Optimization: • Finding (optimal) solutions
Approaches: • Dynamic programming • Linear programming • Non-linear programming • Constraint programming • Integer programming
Aspects: • Combinatorial optimization • Continuous optimization
Applications: • Timetabling/scheduling • Vehicle routing • Packing problems • Trading systems • Logistics • Language detection
CP = Constraint Programming, CIP = Constraint Integer Programming, MINLP = Mixed Integer Non-linear Programming, MIP = Mixed Integer Linear Programming, IP = Integer Linear Programming, SAT = Satisfiability, LP = Linear Programming

Slide 199

Slide 199 text

Constraint Programming Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf

Slide 200

Slide 200 text

Constraint Programming Typical domains: • Boolean (SAT problem) • integer • rational • interval (scheduling problems) • linear (approaches to non-linear problems do exist) • finite sets • mixed (2 or more of above) Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf

Slide 201

Slide 201 text

Cryptarithmetic S E N D + M O R E = M O N E Y Replace the letters with decimal digits to make a valid arithmetic sum

Slide 202

Slide 202 text

Cryptarithmetic: Brute force def solutions(): # letters = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y') all_solutions = list() for s in range(1, 10): for e in range(0, 10): for n in range(0, 10): for d in range(0, 10): for m in range(1, 10): for o in range(0, 10): for r in range(0, 10): for y in range(0, 10): if len({s, e, n, d, m, o, r, y}) == 8: send = 1000 * s + 100 * e + 10 * n + d more = 1000 * m + 100 * o + 10 * r + e money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if send + more == money: all_solutions.append((send, more, money)) return all_solutions print(solutions()) S E N D + M O R E = M O N E Y [(9567, 1085, 10652)]

Slide 203

Slide 203 text

Cryptarithmetic: Brute force def solutions() { // letters = ['s', 'e', 'n', 'd', 'm', 'o', 'r', 'y'] def all_solutions = [] for (s in 1..<10) for (e in 0..9) for (n in 0..9) for (d in 0..9) for (m in 1..9) for (o in 0..9) for (r in 0..9) for (y in 0..9) if ([s, e, n, d, m, o, r, y].toSet().size() == 8) { def send = 1000 * s + 100 * e + 10 * n + d def more = 1000 * m + 100 * o + 10 * r + e def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if (send + more == money) all_solutions.add([send, more, money]) } return all_solutions } print(solutions()) S E N D + M O R E = M O N E Y [[9567, 1085, 10652]]

Slide 204

Slide 204 text

Cryptarithmetic: Brute force from itertools import permutations def solution2(): letters = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y') digits = range(10) for perm in permutations(digits, len(letters)): sol = dict(zip(letters, perm)) if sol['s'] == 0 or sol['m'] == 0: continue send = 1000 * sol['s'] + 100 * sol['e'] + 10 * sol['n'] + sol['d'] more = 1000 * sol['m'] + 100 * sol['o'] + 10 * sol['r'] + sol['e'] money = 10000 * sol['m'] + 1000 * sol['o'] + 100 * sol['n'] + 10 * sol['e'] + sol['y'] if send + more == money: return send, more, money print(solution2()) S E N D + M O R E = M O N E Y (9567, 1085, 10652)

Slide 205

Slide 205 text

Cryptarithmetic: Brute force def solution2() { def digits = 0..9 for (p in digits.permutations()) { if (p[-1] < p[-2]) continue def (s, e, n, d, m, o, r, y) = p if (s == 0 || m == 0) continue def send = 1000 * s + 100 * e + 10 * n + d def more = 1000 * m + 100 * o + 10 * r + e def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if (send + more == money) return [send, more, money] } } print(solution2()) S E N D + M O R E = M O N E Y [9567, 1085, 10652]

Slide 206

Slide 206 text

from csp import Constraint, CSP from typing import Dict, List, Optional class SendMoreMoneyConstraint(Constraint[str, int]): def __init__(self, letters: List[str]) -> None: super().__init__(letters) self.letters: List[str] = letters def satisfied(self, assignment: Dict[str, int]) -> bool: # not a solution if duplicate values if len(set(assignment.values())) < len(assignment): return False # if all vars assigned, check if correct if len(assignment) == len(self.letters): s: int = assignment["S"] e: int = assignment["E"] n: int = assignment["N"] d: int = assignment["D"] m: int = assignment["M"] o: int = assignment["O"] r: int = assignment["R"] y: int = assignment["Y"] send: int = s * 1000 + e * 100 + n * 10 + d more: int = m * 1000 + o * 100 + r * 10 + e money: int = m * 10000 + o * 1000 + n * 100 + e * 10 + y return send + more == money return True … Cryptarithmetic: Constraint programming … letters = ["S", "E", "N", "D", "M", "O", "R", "Y"] possible_digits = {} for letter in letters: possible_digits[letter] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] possible_digits["M"] = [1] # can't start with 0 constraint csp: CSP[str, int] = CSP(letters, possible_digits) csp.add_constraint(SendMoreMoneyConstraint(letters)) solution: Optional[Dict[str, int]] = csp.backtracking_search() if solution is None: print("No solution found!") else: print(solution) S E N D + M O R E = M O N E Y {'S': 9, 'E': 5, 'N': 6, 'D': 7, 'M': 1, 'O': 0, 'R': 8, 'Y': 2} Adapted from: https://freecontent.manning.com/constraint-satisfaction-problems-in-python/

Slide 207

Slide 207 text

@Grab('org.choco-solver:choco-solver:4.10.0') import org.chocosolver.solver.Model import org.chocosolver.solver.variables.IntVar def model = new Model("SEND+MORE=MONEY") def S = model.intVar("S", 1, 9) def E = model.intVar("E", 0, 9) def N = model.intVar("N", 0, 9) def D = model.intVar("D", 0, 9) def M = model.intVar("M", 1, 9) def O = model.intVar("O", 0, 9) def R = model.intVar("R", 0, 9) def Y = model.intVar("Y", 0, 9) model.allDifferent(S, E, N, D, M, O, R, Y).post() … Cryptarithmetic: Constraint programming … IntVar[] ALL = [ S, E, N, D, M, O, R, E, M, O, N, E, Y] int[] COEFFS = [ 1000, 100, 10, 1, 1000, 100, 10, 1, -10000, -1000, -100, -10, -1] model.scalar(ALL, COEFFS, "=", 0).post() model.solver.findSolution() S E N D + M O R E = M O N E Y Solution: S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2,

Slide 208

Slide 208 text

Dietary restrictions
Minimise cost of diet given:
• Must be at least 300 calories
• Not more than 10 grams of protein
• Not less than 10 grams of carbohydrates
• Not less than 8 grams of fat
• At least 0.5 units of fish
• No more than 1 unit of milk

                    Bread   Milk   Cheese   Potato   Fish   Yogurt
Cost                  2.0    3.5      8.0      1.5   11.0      1.0
Protein (g)           4.0    8.0      7.0      1.3    8.0      9.2
Fat (g)               1.0    5.0      9.0      0.1    7.0      1.0
Carbohydrates (g)    15.0   11.7      0.4     22.6    0.0     17.0
Calories               90    120      106       97    130      180
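Putting the table and the restrictions together, the problem can be stated as a linear program. A sketch, where b, m, c, p, f, y denote units of bread, milk, cheese, potato, fish and yogurt (the solver slides that follow all encode the same model):

$$
\begin{aligned}
\min\quad & 2.0b + 3.5m + 8.0c + 1.5p + 11.0f + 1.0y && \text{(cost)}\\
\text{s.t.}\quad & 90b + 120m + 106c + 97p + 130f + 180y \ge 300 && \text{(calories)}\\
& 4.0b + 8.0m + 7.0c + 1.3p + 8.0f + 9.2y \le 10 && \text{(protein)}\\
& 15.0b + 11.7m + 0.4c + 22.6p + 17.0y \ge 10 && \text{(carbohydrates)}\\
& 1.0b + 5.0m + 9.0c + 0.1p + 7.0f + 1.0y \ge 8 && \text{(fat)}\\
& f \ge 0.5,\quad 0 \le m \le 1,\quad b, c, p, y \ge 0
\end{aligned}
$$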

Slide 209

Slide 209 text

Dietary restrictions (Excel)

Slide 210

Slide 210 text

Dietary restrictions (Excel)

Slide 211

Slide 211 text

Dietary restrictions (Google sheets)

Slide 212

Slide 212 text

Diet problem (ojalgo) def model = new ExpressionsBasedModel() def bread = model.addVariable("Bread").lower(0) def milk = model.addVariable("Milk").lower(0).upper(1) def cheese = model.addVariable("Cheese").lower(0) def potato = model.addVariable("Potato").lower(0) def fish = model.addVariable("Fish").lower(0.5) def yogurt = model.addVariable("Yogurt").lower(0) def cost = model.addExpression("Cost") cost.set(bread, 2.0).set(milk, 3.5).set(cheese, 8.0).set(potato, 1.5).set(fish, 11.0).set(yogurt, 1.0) def protein = model.addExpression("Protein").upper(10) protein.set(bread, 4.0).set(milk, 8.0).set(cheese, 7.0).set(potato, 1.3).set(fish, 8.0).set(yogurt, 9.2) def fat = model.addExpression("Fat").lower(8) fat.set(bread, 1.0).set(milk, 5.0).set(cheese, 9.0).set(potato, 0.1).set(fish, 7.0).set(yogurt, 1.0) def carbs = model.addExpression("Carbohydrates").lower(10) carbs.set(bread, 15.0).set(milk, 11.7).set(cheese, 0.4).set(potato, 22.6).set(fish, 0.0).set(yogurt, 17.0) def calories = model.addExpression("Calories").lower(300) calories.set(bread, 90).set(milk, 120).set(cheese, 106).set(potato, 97).set(fish, 130).set(yogurt, 180) def result = model.minimise() // for a variation, see: // https://www.ojalgo.org/2019/05/the-diet-problem/ … OPTIMAL 0.0 @ { 0.0, 0.0, 0.40813898143741034, 1.8538791051880057, 0.5916230366492151, 0.0 } ############################################ 0 <= Bread: 0 0 <= Milk: 0 <= 1 0 <= Cheese: 0.408139 0 <= Potato: 1.853879 0.5 <= Fish: 0.591623 0 <= Yogurt: 0 10 <= Carbohydrates: 42.060923 8 <= Fat: 8.0 Cost: 12.553784 Protein: 10.0 <= 10 300 <= Calories: 300.0 ############################################

Slide 213

Slide 213 text

Diet problem (Choco) def model = new Model("Diet problem") def unbounded = 100000 // scale quantities by 10, coefficients by 10, products by 100 def bread = model.intVar("Bread", 0, unbounded) def milk = model.intVar("Milk", 0, 10) def cheese = model.intVar("Cheese", 0, unbounded) def potato = model.intVar("Potato", 0, unbounded) def fish = model.intVar("Fish", 5, unbounded) def yogurt = model.intVar("Yogurt", 0, unbounded) IntVar[] all = [bread, milk, cheese, potato, fish, yogurt] def cost = model.intVar("Cost", 0, unbounded) model.scalar(all, [20, 35, 80, 15, 110, 10] as int[], "=", cost).post() def protein = model.intVar("Protein", 0, 1000) model.scalar(all, [40, 80, 70, 13, 80, 92] as int[], "=", protein).post() def fat = model.intVar("Fat", 800, unbounded) model.scalar(all, [10, 50, 90, 1, 70, 10] as int[], "=", fat).post() def carbs = model.intVar("Carbohydrates", 1000, unbounded) model.scalar(all, [150, 117, 4, 226, 0, 170] as int[], "=", carbs).post() def calories = model.intVar("Calories", 30000, unbounded) model.scalar(all, [900, 1200, 1060, 970, 1300, 1800] as int[], "=", calories).post() model.setObjective(Model.MINIMIZE, cost) def found = model.solver.findSolution() if (found) { all.each { println "$it.name: ${it.value / 10}" } [carbs, fat, protein, calories, cost].each { println "$it.name: ${it.value / 100}" } } else { println "No solution" } Bread: 0 Milk: 0 Cheese: 0.5 Potato: 1.9 Fish: 0.5 Yogurt: 0 Carbohydrates: 43.14 Fat: 8.19 Protein: 9.97 Calories: 302.3 Cost: 12.35 Choco supports RealVar but doesn’t have as rich a set of possible constraints for such vars.

Slide 214

Slide 214 text

Diet problem (Choco) def model = new Model("Diet problem") def unbounded = 1000.0d def precision = 0.00001d // scale quantities by 10, coefficients by 10, products by 100 def bread = model.realVar("Bread", 0.0, unbounded, precision) def milk = model.realVar("Milk", 0.0, 1.0, precision) def cheese = model.realVar("Cheese", 0.0, unbounded, precision) def potato = model.realVar("Potato", 0.0, unbounded, precision) def fish = model.realVar("Fish", 0.5, unbounded, precision) def yogurt = model.realVar("Yogurt", 0.0, unbounded, precision) RealVar[] all = [bread, milk, cheese, potato, fish, yogurt] def scalarIbex = { coeffs, var -> def (a, b, c, d, e, f) = coeffs model.realIbexGenericConstraint("$a*{0}+$b*{1}+$c*{2}+$d*{3}+$e*{4}+$f*{5}={6}", [*all, var] as RealVar[]).post(); } def cost = model.realVar("Cost", 0.0, unbounded, precision) scalarIbex([2.0, 3.5, 8.0, 1.5, 11.0, 1.0], cost) def protein = model.realVar("Protein", 0.0, 10.0, precision) scalarIbex([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], protein) def fat = model.realVar("Fat", 8.0, unbounded, precision) scalarIbex([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], fat) def carbs = model.realVar("Carbohydrates", 10.0, unbounded, precision) scalarIbex([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], carbs) def calories = model.realVar("Calories", 300, unbounded, precision) scalarIbex([90, 120, 106, 97, 130, 180], calories) model.setObjective(Model.MINIMIZE, cost) def found = model.solver.findSolution() Bread: 0.025131 .. 0.025137 Milk: 0.000009 .. 0.000010 Cheese: 0.428571 .. 0.428571 Potato: 1.848118 .. 1.848124 Fish: 0.561836 .. 0.561836 Yogurt: 0.000007 .. 0.000010 Carbohydrates: 42.316203 .. 42.316211 Fat: 8.000000 .. 8.000005 Protein: 9.997920 .. 9.997926 Calories: 300.000000 .. 300.000008 Cost: 12.431241 .. 12.431245 Choco does have a plugin (via JNI) for the Ibex C++ constraint processing library which does handle real numbers. def pretty = { var -> def bounds = found.getRealBounds(var) printf "%s: %.6f .. %.6f%n", var.name, *bounds } if (found) { all.each { pretty(it) } [carbs, fat, protein, calories, cost].each { pretty(it) } } else { println "No solution" }

Slide 215

Slide 215 text

Diet problem (Apache Commons Math) import org.apache.commons.math3.optim.linear.* import org.apache.commons.math3.optim.nonlinear.scalar.GoalType import static org.apache.commons.math3.optim.linear.Relationship.* def cost = new LinearObjectiveFunction([2.0, 3.5, 8.0, 1.5, 11.0, 1.0] as double[], 0) static scalar(coeffs, rel, val) { new LinearConstraint(coeffs as double[], rel, val) } def bread_min = scalar([1, 0, 0, 0, 0, 0], GEQ, 0) def milk_min = scalar([0, 1, 0, 0, 0, 0], GEQ, 0) def milk_max = scalar([0, 1, 0, 0, 0, 0], LEQ, 1) def cheese_min = scalar([0, 0, 1, 0, 0, 0], GEQ, 0) def potato_min = scalar([0, 0, 0, 1, 0, 0], GEQ, 0) def fish_min = scalar([0, 0, 0, 0, 1, 0], GEQ, 0.5) def yogurt_min = scalar([0, 0, 0, 0, 0, 1], GEQ, 0) def protein = scalar([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], LEQ, 10) def fat = scalar([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], GEQ, 8) def carbs = scalar([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], GEQ, 10) def calories = scalar([90, 120, 106, 97, 130, 180], GEQ, 300) LinearConstraintSet constraints = [bread_min, milk_min, milk_max, fish_min, cheese_min, potato_min, yogurt_min, protein, fat, carbs, calories] def solution = new SimplexSolver().optimize(cost, constraints, GoalType.MINIMIZE) if (solution != null) { printf "Opt: %.2f%n", solution.value println solution.point.collect{ sprintf '%.2f', it }.join(', ') } Opt: 12.08 -0.00, 0.05, 0.45, 1.87, 0.50, 0.00

Slide 216

Slide 216 text

Diet problem (Google OrTools) import … Loader.loadNativeLibraries() static addConstraint(MPSolver solver, List vars, MPVariable comp, List coeffs) { solver.makeConstraint(0, 0).tap {constraint -> constraint.setCoefficient(comp, -1) vars.indices.each { i -> constraint.setCoefficient(vars[i], coeffs[i]) } } } MPSolver.createSolver("SCIP").with {solver -> def bread = makeNumVar(0, infinity(),'bread') def milk = makeNumVar(0, 1, 'milk') def cheese = makeNumVar(0, infinity(), 'cheese') def potato = makeNumVar(0, infinity(), 'potato') def fish = makeNumVar(0.5, infinity(), 'fish') def yogurt = makeNumVar(0, infinity(), 'yogurt') def food = [bread, milk, cheese, potato, fish, yogurt] …

Slide 217

Slide 217 text

Diet problem (Google OrTools) … def cost = makeNumVar(0, infinity(),'Cost') addConstraint(solver, food, cost, [2.0, 3.5, 8.0, 1.5, 11.0, 1.0]) def protein = makeNumVar(0, 10,'Protein') addConstraint(solver, food, protein, [4.0, 8.0, 7.0, 1.3, 8.0, 9.2]) def fat = makeNumVar(8, infinity(),'Fat') addConstraint(solver, food, fat, [1.0, 5.0, 9.0, 0.1, 7.0, 1.0]) def carbs = makeNumVar(10, infinity(),'Carbs') addConstraint(solver, food, carbs, [15.0, 11.7, 0.4, 22.6, 0.0, 17.0]) def cals = makeNumVar(300, infinity(),'Calories') addConstraint(solver, food, cals, [90, 120, 106, 97, 130, 180]) def components = [protein, fat, carbs, cals] objective().with {objective -> objective.setCoefficient(cost, 1) objective.setMinimization() def result = solve() println result if (result == OPTIMAL) { println "Solution: " + objective.value() println "Foods: ${food.collect{ "${it.name()} ${it.solutionValue()}" }}" println "Components: ${components.collect{ "${it.name()} ${it.solutionValue()}" }}" println "Iterations: ${iterations()}, Wall time: ${wallTime()}ms" } else { System.err.println "The problem does not have an optimal solution!" } } }

Slide 218

Slide 218 text

Diet problem (Google OrTools) … def cost = makeNumVar(0, infinity(),'Cost') addConstraint(solver, food, cost, [2.0, 3.5, 8.0, 1.5, 11.0, 1.0]) def protein = makeNumVar(0, 10,'Protein') addConstraint(solver, food, protein, [4.0, 8.0, 7.0, 1.3, 8.0, 9.2]) def fat = makeNumVar(8, infinity(),'Fat') addConstraint(solver, food, fat, [1.0, 5.0, 9.0, 0.1, 7.0, 1.0]) def carbs = makeNumVar(10, infinity(),'Carbs') addConstraint(solver, food, carbs, [15.0, 11.7, 0.4, 22.6, 0.0, 17.0]) def cals = makeNumVar(300, infinity(),'Calories') addConstraint(solver, food, cals, [90, 120, 106, 97, 130, 180]) def components = [protein, fat, carbs, cals] objective().with {objective -> objective.setCoefficient(cost, 1) objective.setMinimization() def result = solve() println result if (result == OPTIMAL) { println "Solution: " + objective.value() println "Foods: ${food.collect{ "${it.name()} ${it.solutionValue()}" }}" println "Components: ${components.collect{ "${it.name()} ${it.solutionValue()}" }}" println "Iterations: ${iterations()}, Wall time: ${wallTime()}ms" } else { System.err.println "The problem does not have an optimal solution!" } } } OPTIMAL Solution: 12.081337881108173 Foods: [bread 0.0, milk 0.05359877488514538, cheese 0.44949881665042474, potato 1.8651677572045107, fish 0.5, yogurt 0.0] Components: [Protein 10.0, Fat 8.0, Carbs 42.95969650563832, Calories 300.0] Iterations: 4, Wall time: 89ms

Slide 219

Slide 219 text

Groovy Community Information • groovy.apache.org • groovy-lang.org • github.com/apache/groovy • groovycommunity.com • users@groovy.apache.org • dev@groovy.apache.org • @ApacheGroovy • objectcomputing.com/training/catalog/groovy-programming/

Slide 220

Slide 220 text

CONNECT WITH US +1 (314) 579-0066 • @objectcomputing • objectcomputing.com THANK YOU Find me on Twitter @paulk_asert © 2021 Object Computing, Inc. (OCI). All rights reserved.

Slide 221

Slide 221 text

References
• https://www.analyticsvidhya.com/blog/2017/02/lintroductory-guide-on-linear-programming-explained-in-simple-english/
• Computational Mixed-Integer Programming https://www.birs.ca/cmo-workshops/2018/18w5208/files/GleixnerAmbros.pdf
• Integer Linear Programming: Graphical Introduction https://www.youtube.com/watch?v=RhHhy-8sz-4
• Integer Linear Programming: Binary Constraints https://www.youtube.com/watch?v=-3my1TkyFiM https://www.youtube.com/watch?v=B3biWsBLeCw https://www.youtube.com/watch?v=MO8uQnIch6I
• Linear Programming: Alternate solutions, Infeasibility, Unboundedness, & Redundancy https://www.youtube.com/watch?v=eMA0LWsRQQ