
Groovy Data Science

Groovy is a powerful multi-paradigm programming language for the JVM that offers a wealth of features that make it ideal for many data science and big data scenarios.

* Groovy has a dynamic nature like Python, which means that it is very powerful, easy to learn, and productive. The language gets out of the way and lets data scientists write their algorithms naturally.

* Groovy has a static nature like Java and Kotlin, which makes it fast when needed. Its close alignment with Java means that you can often just cut-and-paste the Java examples from various big data solutions and they'll work just fine in Groovy.

* Groovy has first-class functional support, meaning that it offers features and allows solutions similar to Scala. Functional and stream processing with immutable data structures can offer many advantages when working in parallel processing or clustered environments.
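To make these three points concrete, here is a small illustrative sketch (not part of the original deck) showing the dynamic, static and functional styles side by side in Groovy:

    import groovy.transform.CompileStatic

    // dynamic style: no types required, resolved at runtime
    def greet(who) { "Hello, $who!" }
    assert greet('data science') == 'Hello, data science!'

    // static style: @CompileStatic gives Java-like type checking and speed where needed
    @CompileStatic
    int sumOfSquares(List<Integer> nums) {
        int total = 0
        for (int n in nums) { total += n * n }
        total
    }
    assert sumOfSquares([1, 2, 3]) == 14

    // functional style: closures and collection pipelines
    assert (1..10).findAll { it % 2 == 0 }.collect { it * it }.sum() == 220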

These slides review the key benefits of using Groovy to develop data science solutions, including its integration with JVM libraries commonly used for data manipulation, machine learning and plotting, as well as big data solutions for scaling up these algorithms.

Topics covered include: regression, clustering, classification, neural networks, language processing, image detection.

Math/Data Science libraries covered include:
Weka, Smile, Tablesaw, Apache Commons Math, Jupyter/BeakerX notebooks, DeepLearning4J.

Libraries for scaling/concurrency include:
GPars, Apache Spark, Apache Ignite, Apache MXNet, Apache Beam, Apache NLPCraft.

paulking

July 01, 2021

Transcript

  1. objectcomputing.com © 2021, Object Computing, Inc. (OCI). All rights reserved.

    No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI) Groovy and Data Science Presented by Dr Paul King © 2021 Object Computing, Inc. (OCI). All rights reserved.
  2. Dr Paul King OCI Groovy Lead V.P. and PMC Chair

    Apache Groovy Author: https://www.manning.com/books/groovy-in-action-second-edition Slides: https://speakerdeck.com/paulk/groovy-data-science Examples repo: https://github.com/paulk-asert/groovy-data-science Twitter: @paulk_asert
  3. Apache Groovy Programming Language • Multi-faceted language • Imperative/OO &

    functional (+ other styles via libraries) • Dynamic & static • Aligned closely with Java language • Very extensible • 17+ years since inception • ~1B downloads (partial count) • ~500 contributors • 200+ releases • https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be
  4. What is Groovy? It’s like a super version of Java:

    • Supports most Java syntax but allows simpler syntax for many constructs • Supports all Java libraries but provides many extensions and its own productivity libraries • Has both a static and dynamic nature • Extensible language and tooling
  5. Groovy is Java's friend Seamless integration • IDEs provide cross-language

    compile, navigation, and refactoring • Arbitrarily mix source language • Drop-in replace any class • Overloaded methods • Syntax alignment • Shared data types Java Groovy
  6. Java-like or concise shortcuts import java.util.List; import java.util.ArrayList; class Main

    { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } } names = ["Ted", "Fred", "Jed", "Ned"] println names shortNames = names.findAll{ it.size() < 4 } println shortNames.size() shortNames.each{ println it }
  7. Concise syntax including DSL friendly import java.util.List; import java.util.ArrayList; class

    Main { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } } names = ["Ted", "Fred", "Jed", "Ned"] println names shortNames = names.findAll{ it.size() < 4 } println shortNames.size() shortNames.each{ println it } given the names "Ted", "Fred", "Jed" and "Ned" display all the names display the number of names having size less than 4 display the names having size less than 4
  8. Matrix manipulation: Java vs Groovy • Same example • Same

    library Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}} import org.apache.commons.math3.linear.*; public class MatrixMain { public static void main(String[] args) { double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}}; RealMatrix m = MatrixUtils.createRealMatrix(matrixData); double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}}; RealMatrix n = new Array2DRowRealMatrix(matrixData2); RealMatrix o = m.multiply(n); // Invert o, using LU decomposition RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse(); RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d); RealMatrix q = o.add(p.power(2)); System.out.println(q); } } Thanks to operator overloading and extensible tooling
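For comparison, the Groovy side of this slide might look roughly like the following sketch, leaning on Groovy's operator overloading over the same Apache Commons Math types ('*' maps to multiply(), '+' to add(), '**' to power()):

    import org.apache.commons.math3.linear.*

    def m = MatrixUtils.createRealMatrix([[1d, 2d, 3d], [2d, 5d, 3d]] as double[][])
    def n = MatrixUtils.createRealMatrix([[1d, 2d], [2d, 5d], [1d, 7d]] as double[][])
    def o = m * n                                           // '*' calls multiply()
    def oInverse = new LUDecomposition(o).solver.inverse    // invert o, using LU decomposition
    def p = oInverse.scalarAdd(1d).scalarMultiply(2d)
    def q = o + p**2                                        // '+' calls add(), '**' calls power()
    println q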
  9. Command chains and extensible type system Text file? No, it’s

    a statically typed Groovy program which won’t compile if the animal names aren’t valid. A constraint programming solver is used to find the solution. cranes have 2 legs tortoises have 4 legs there are 7 animals there are 20 legs display solution Cranes 4 Tortoises 3
  10. Command chains and extensible type system cranes have 2 legs

    tortoises have 4 legs millipedes have 1000 legs there are 8 animals there are 1020 legs display solution Cranes 4 Tortoises 3 Millipedes 1
  11. Build tools Gradle, Apache Maven, Apache Ant plugins { id

    'groovy' } repositories { jcenter() } dependencies { implementation 'org.codehaus.groovy:groovy-all:2.5.7' testImplementation 'org.spockframework:spock-core:1.2-groovy-2.5' } project { modelVersion '4.0.0' groupId 'org.exampledriven' artifactId 'maven-polyglot-groovy-example-parent' version '1.0-SNAPSHOT' packaging 'pom' modules { module 'maven-polyglot-submodule' module 'maven-polyglot-spring' } } ant.sequential { echo("inside sequential") def myDir = "target/AntTest/" mkdir(dir: myDir) copy(todir: myDir) { fileset(dir: "src/test") { include(name: "**/*.groovy") } } echo("done") }
  12. Data Science Process Research Goals Obtain Data Data Preparation Data

    Exploration Visualization Data Modeling Data ingestion Data storage Data processing platforms Modeling algorithms Math libraries Graphics processing Integration Deployment
  13. Groovy/Java Libraries/Tools/Frameworks for Data Science Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267 Machine Learning DeepLearning4J

    Apache Spark ML Apache MXNet Visualization GroovyFX JFreeChart Tablesaw Mathematics and Engineering GPars Apache Commons Math Smile Weka Groovylab Apache Spark Apache Ignite Apache Beam Data Manipulation and Analysis Tablesaw Apache Commons CSV OpenCSV
  14. Obtain data Identify sources • Internal/External (data providers) • Databases,

    files, events, IoT • Unstructured/structured • Text, XML, JSON, YAML, CSV/spreadsheets, audio, images, video, web services • Apache Tika • JExcel • Apache POI Clarify ownership Check quality def jsonSlurper = new JsonSlurper() def object = jsonSlurper.parseText('{ "myList": [4, 8, 15, 16, 23, 42] }') assert object instanceof Map assert object.myList instanceof List assert object.myList == [4, 8, 15, 16, 23, 42] def response = new XmlSlurper().parseText(books) def authorResult = response.value.books.book[0].author assert authorResult.text() == 'Miguel de Cervantes' def qry = 'SELECT * FROM Author' assert sql.rows(qry, 1, 3)*.firstname == ['Dierk', 'Paul', 'Guillaume'] assert sql.rows(qry, 4, 3)*.firstname == ['Hamlet', 'Cedric', 'Erik'] assert sql.rows(qry, 7, 3)*.firstname == ['Jon']
  15. Data preparation Collections: • Java: lists, maps, sets • Groovy:

    literal notation, GDK methods, GPath • Libraries • Google Guava https://github.com/google/guava • Apache Common Collections https://commons.apache.org/collections/ DataFrames: • Joinery https://cardillo.github.io/joinery/ A data frame implementation in the spirit of Pandas or R data frames with show/plot • Tablesaw https://jtablesaw.github.io/tablesaw/ Java dataframe and visualization library • Apache Spark DataFrames https://spark.apache.org/docs/latest/sql-programming- guide.html Equivalent to a table in a relational database or a data frame in R/Python • Paleo https://github.com/netzwerg/paleo Immutable Java 8 data frames with typed columns (including primitives)
  16. Visualization • Open-Source plotting libraries: GRAL, JFreeChart, Xchart, JMathPlot, Jzy3d,

    JavaFX, GroovyFX • Types: Scatter plot, Line plot, Area plot, Pie plot, Donut plot, Horizontal bars, Vertical bars, Bubble plot, Radar plot, Box plot, Raster plot, Contour plot, Gauge plot, Step plot, Gantt plot, Histogram • Features: Multiple axes, Logarithmic axes, Line smoothing, Data filtering/aggregation
  17. Data preparation IO: • Java in-built • Groovy enhancements •

    Libraries • Apache Commons IO https://commons.apache.org/io/ • AOL Cyclops-React https://github.com/aol/cyclops-react Web scraping: • GEB Batch processing/ETL: • Spring Batch http://projects.spring.io/spring-batch/ • Apache Camel • Apache Gobblin
  18. Math libraries Math and stats libraries • Java: log, exp,

    other basic methods. Libraries: • Apache Commons Math http://commons.apache.org/math/ statistics, optimization, and linear algebra • Apache Mahout http://mahout.apache.org/ linear algebra, plus distributed linear algebra and machine learning • JBlas http://jblas.org/ optimized and very fast linear algebra package that uses the BLAS library Also, many of the machine learning libraries come with some math functionality, often linear algebra, stats, and optimization.
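As a small taste of the style of these APIs, here is an illustrative sketch using Apache Commons Math's DescriptiveStatistics (the house-price slides later in the deck use the same library):

    import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

    def stats = new DescriptiveStatistics()
    [2.1, 3.5, 4.8, 5.0, 7.2].each { stats.addValue(it as double) }
    printf 'mean=%.2f stddev=%.2f median=%.2f%n', stats.mean, stats.standardDeviation, stats.getPercentile(50)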
  19. Data storage Database: • Java’s JDBC • Groovy JDBC enhancements

    and Datasets • Other NOSQL • Apache Cassandra • MongoDB • Apache CouchDB • Neo4j import com.gmongo.GMongo def db = new GMongo().getDB('athletes') db.athletes.drop() db.athletes << [ first: 'Paul', last: 'Tergat', dob: '1969-06-17', runs: [[distance: 42195, time: 2 * 60 * 60 + 4 * 60 + 55, venue: 'Berlin', when: '2003-09-28']] ] println "World rows following $marathon3.venue $marathon3.when:" def t = new Traversal() for (Path p in t.description().breadthFirst(). relationships(MyRelationshipTypes.supercedes). evaluator(Evaluators.fromDepth(1)). uniqueness(Uniqueness.NONE). traverse(marathon3)) { def newRow = p.endNode() println "$newRow.venue $newRow.when" } Graph g = new Neo4jGraph(graphDb) def pretty = { it.collect { "$it.venue $it.when" }.join(', ') } def results = [] g.V('venue', 'London').fill(results) println 'London world rows: ' + pretty(results) results = [] g.V('venue', 'London').in('supercedes').fill(results) println 'World rows after London: ' + pretty(results)
  20. Machine learning and data mining Machine learning and data mining

    libraries: • Weka http://www.cs.waikato.ac.nz/ml/weka/ • JavaML http://java-ml.sourceforge.net/ old and reliable ML library but not actively updated • Smile http://haifengl.github.io/smile/ ML library under active development • JSAT https://github.com/EdwardRaff/JSAT contains many machine learning algorithms • H2O http://www.h2o.ai/ distributed Java ML framework also supports Scala, R and Python • Apache Mahout http://mahout.apache.org/ distributed linear algebra framework • Apache Spark ml and mllib https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples
  21. Neural networks and text processing Neural networks: • Encog http://www.heatonresearch.com/encog/

    • DeepLearning4j http://deeplearning4j.org/ natively supports Keras models Text processing • Java: StringTokenizer, the java.text package, regular expressions • Groovy: Regex enhancements, templates, String enhancements • Libraries: • Apache Lucene https://lucene.apache.org/ • Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/ • Apache OpenNLP https://opennlp.apache.org/ • LingPipe http://alias-i.com/lingpipe/ • GATE https://gate.ac.uk/ • MALLET http://mallet.cs.umass.edu/ • Smile http://haifengl.github.io/smile/
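Before reaching for a full NLP library, plain Groovy's String and regex enhancements already cover a lot of simple text processing; a small illustrative sketch (not from the deck):

    def text = 'Groovy lets data scientists write algorithms naturally. Groovy also makes text processing concise.'
    def words = text.toLowerCase().findAll(/[a-z]+/)   // String.findAll(regex) returns every match
    def counts = words.countBy { it }                  // word frequency map
    assert counts.groovy == 2
    println counts.sort { -it.value }.take(3)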
  22. Scaling up Microservice frameworks: • Micronaut Distributed data processing: •

    Apache Hadoop • Apache Spark • Apache Flink • Apache Samza • Apache Kafka • Apache Storm • Apache Apex • Apache Beam • Apache Ignite • Apache Nifi
  23. Scaling up: Apache Ignite Ignite™ is a memory-centric distributed database,

    caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale
  24. • A Concurrency & Parallelism Framework for Groovy and Java

    • Fork/join, parallel arrays, map/reduce, actors, agents, STM, CSP, dataflow, asynchronous functions Scaling Up: GPars Data Parallelism: Fork/Join Map/Reduce Fixed coordination (for collections) Actors Explicit coordination Safe Agents Delegated coordination Dataflow Implicit coordination def decryptor = actor { loop { react { String message -> reply message.reverse() } } } def console = actor { decryptor << 'lellarap si yvoorG' react { println 'Decrypted message: ' + it } } console.join() final def df = new Dataflows() task { df.z = df.x + df.y } task { df.x = 10 } task { df.y = 5 } assert df.z == 15 def nums = [1, 2, 3, 4] def result = nums.collectParallel{it * 2} assert result == [2, 4, 6, 8] def agent = new Agent([]) agent { it << 'Dave' } agent { it << 'Joe' } assert agent.val.size() == 2 class Account { private amount = newTxnInteger(0) void transfer(final int a) { atomic { amount.increment(a) } } int getCurrentAmount() { atomicWithInt { amount.get() } } } def future = {it * 2}.callAsync(3) assert 6 == future.get() 5 10 15 x y z +
  25. Scaling up: Apache Beam Apache Beam is an open source,

    unified model for defining both batch and streaming data-parallel processing pipelines Source: https://beam.apache.org/get-started/beam-overview/
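Beam's Java SDK can be driven directly from Groovy. A minimal, illustrative pipeline sketch (assuming the Beam SDK and a direct runner on the classpath; not one of the deck's examples):

    import org.apache.beam.sdk.Pipeline
    import org.apache.beam.sdk.transforms.Create
    import org.apache.beam.sdk.transforms.MapElements
    import org.apache.beam.sdk.transforms.SimpleFunction

    def pipeline = Pipeline.create()
    def upperCase = MapElements.via(new SimpleFunction<String, String>() {
        String apply(String word) { word.toUpperCase() }   // applied per element, batch or streaming alike
    })
    pipeline.apply(Create.of(['groovy', 'data', 'science'])).apply(upperCase)   // tiny in-memory source
    pipeline.run().waitUntilFinish()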
  26. Computer science algorithms Decision problems Search problems Counting problems Optimization

    problems • Paradigms: brute force, divide & conquer, search & enumerate, randomized, complexity reduction, recursive, back tracking, graph • Optimization approaches: linear programming, dynamic programming, greedy, heuristics
  27. Data science algorithms Data Mining Statistics Machine Learning Optimization •

    Analytics: descriptive, predictive, prescriptive • Analysis: anomaly detection, classification, regression, clustering, association, optimization, dimension reduction • Data relationship: linear, non-linear • Assumptions: parametric, non-parametric • Strategy: supervised, unsupervised, reinforcement • Combining: ensemble, boosting
  28. Working with tables and plots List<Trace> traces(String filename, String lineColor,

    String markerColor) { def url = getClass().classLoader.getResource(filename) def table = new XlsxReader().read(builder(url).build()) table.addColumns(DateColumn.create('YearMonth', table.column('Date').collect { LocalDate.of(it.year, it.month, 15) })) def janFirst2017 = LocalDateTime.of(2017, JANUARY, 1, 0, 0) Function<Table, Selection> from2017 = r -> r.dateTimeColumn('Date').isAfter(janFirst2017) Function<Table, Selection> top3 = r -> r.intColumn('CandleID').isLessThanOrEqualTo(3) def byMonth = table.sortAscendingOn('Date') .where(and(from2017, top3)) .summarize('Rating', mean).by('YearMonth') def byDate = table.sortAscendingOn('Date') .where(and(from2017, top3)) .summarize('Rating', mean).by('Date') def averaged = ScatterTrace.builder(byMonth.dateColumn('YearMonth'), byMonth.nCol('Mean [Rating]')) .mode(ScatterTrace.Mode.LINE) .line(Line.builder().width(5).color(lineColor).shape(Line.Shape.SPLINE).smoothing(1.3).build()) .build() def scatter = ScatterTrace.builder(byDate.dateTimeColumn('Date'), byDate.nCol('Mean [Rating]')) .marker(Marker.builder().opacity(0.3).color(markerColor).build()) .build() [averaged, scatter] }
  29. Working with tables and plots def (sAverage, sScatter) = traces('Scented_all.xlsx',

    'seablue', 'lightskyblue') def (uAverage, uScatter) = traces('Unscented_all.xlsx', 'seagreen', 'lightgreen') Plot.show(new Figure(layout('scented'), sAverage, sScatter, line)) Plot.show(new Figure(layout('unscented'), uAverage, uScatter, line))
  30. Regression Overview Regression: • Relationship between variables Types: • Linear

    OLSR • Logistic • Polynomial • Stepwise • Ridge • Lasso • ElasticNet Aspects: • Under/over fitting • Outliers Applications: • House price prediction • Market predictions • Ratings prediction • Weather forecast
  31. House price predictions – view data graphically import org.apache.commons.math3.random.EmpiricalDistribution import

    static groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.collect { it.bedrooms.toInteger() } def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) } def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] } start { stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) { scene { barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) { series(name: 'Number of properties', data: bins) } } } }
  32. House price predictions – view stats import org.apache.commons.math3.stat.descriptive.SummaryStatistics import static

    org.apache.commons.csv.CSVFormat.RFC4180 as CSV def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.collect { it.bedrooms.toInteger() } def stats = new SummaryStatistics() all.each{ stats.addValue(it as double) } println stats.summary n: 21613 min: 0.0 max: 33.0 mean: 3.370841623097218 std dev: 0.930061831147451 variance: 0.8650150097573497 sum: 72854.0
  33. House price predictions – investigate outliers import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

    def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) csv.findAll{ it.bedrooms.toInteger() > 10 }.each{ println it.toMap() as TreeMap } [bathrooms:3, bedrooms:11, condition:3, date:20140821T000000, floors:2, grade:7, id:1773100755, lat:47.556, long:-122.363, price:520000, sqft_above:2400, sqft_basement:600, sqft_living:3000, sqft_living15:1420, sqft_lot:4960, sqft_lot15:4960, view:0, waterfront:0, yr_built:1918, yr_renovated:1999, zipcode:98106] [bathrooms:1.75, bedrooms:33, condition:5, date:20140625T000000, floors:1, grade:7, id:2402100895, lat:47.6878, long:-122.331, price:640000, sqft_above:1040, sqft_basement:580, sqft_living:1620, sqft_living15:1330, sqft_lot:6000, sqft_lot15:4700, view:0, waterfront:0, yr_built:1947, yr_renovated:0, zipcode:98103]
  34. House price predictions – investigate outliers Some libraries optionally support

    CSV -> richly-typed domain classes import com.opencsv.bean.* import groovy.transform.ToString @ToString(includeNames = true) class House { @CsvBindByName Integer bedrooms @CsvBindByName String bathrooms @CsvBindByName(column = 'sqft_lot') Integer area_lot } def file = 'kc_house_data.csv' as File def builder = new CsvToBeanBuilder(new FileReader(file)) def rows = builder.withType(House).build().parse() rows.findAll{ it.bedrooms > 10 }.each{ println it } House(bedrooms:11, bathrooms:3, area_lot:4960) House(bedrooms:33, bathrooms:1.75, area_lot:6000)
  35. House prices – remove outlier and view bedrooms import org.apache.commons.math3.random.EmpiricalDistribution

    import static groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv')) def all = csv.collect { it.bedrooms.toInteger() }.findAll{ it < 30 } def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) } def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] } start { stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) { scene { barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) { series(name: 'Number of properties', data: bins) } } } }
  36. House price predictions – remove outlier and view price import

    org.apache.commons.math3.random.EmpiricalDistribution import org.apache.commons.math3.stat.descriptive.SummaryStatistics import static groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.findAll { it.bedrooms.toInteger() < 30 }.collect { it.price.toDouble() } def info = new SummaryStatistics(); all.each(info::addValue) def head = "Price percentile (min=\$$info.min, mean=\$${info.mean as int}, max=\$$info.max)" def dist = new EmpiricalDistribution(100).tap{ load(all as double[]) } def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] } start { stage(title: 'Price histogram', show: true, width: 800, height: 600) { scene { barChart(title: head, barGap: 0, categoryGap: 0) { series(name: 'Number of properties', data: bins) } } } }
  37. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
  38. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) 21613 rows X 21 cols
  39. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) Structure of kc_house_data.csv Index | Column Name | Column Type | ------------------------------------------- 0 | id | LONG | 1 | date | STRING | 2 | price | DOUBLE | 3 | bedrooms | INTEGER | 4 | bathrooms | DOUBLE | 5 | sqft_living | INTEGER | 6 | sqft_lot | INTEGER | 7 | floors | DOUBLE | 8 | waterfront | INTEGER | 9 | view | INTEGER | ... | ... | ... | 11 | grade | INTEGER | 12 | sqft_above | INTEGER | 13 | sqft_basement | INTEGER | 14 | yr_built | INTEGER | 15 | yr_renovated | INTEGER | 16 | zipcode | INTEGER | 17 | lat | DOUBLE | 18 | long | DOUBLE | 19 | sqft_living15 | INTEGER | 20 | sqft_lot15 | INTEGER |
  40. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) Column: bedrooms Measure | Value | ----------------------------------- n | 21613.0 | sum | 72854.0 | Mean | 3.370841623097218 | Min | 0.0 | Max | 33.0 | Range | 33.0 | Variance | 0.8650150097573497 | Std. Dev | 0.930061831147451 |
  41. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) kc_house_data.csv id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1773100755 | 20140821 | 520000.0 | 11 | 3.0 | 3000 | 4960 | 2.0 | 0 | 0 | 3 | 7 | 2400 | 600 | 1918 | 1999 | 98106 | 47.556 | -122.363 | 1420 | 4960 | 2402100895 | 20140625 | 640000.0 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1040 | 580 | 1947 | 0 | 98103 | 47.687 | -122.331 | 1330 | 4700 |
  42. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) 21612 rows X 21 cols
  43. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) kc_house_data.csv summary bedrooms | Mean [price] | Min [price] | Max [price] | ------------------------------------------------------------------- 0 | 409503.8461538461 | 139950.0 | 1295650.0 | 1 | 317642.8844221105 | 75000.0 | 1247000.0 | 2 | 401372.68188405805 | 78000.0 | 3278000.0 | 3 | 466232.0784812695 | 82000.0 | 3800000.0 | 4 | 635419.5042138912 | 100000.0 | 4489000.0 | 5 | 786599.8288569644 | 133000.0 | 7062500.0 | 6 | 825520.6360294118 | 175000.0 | 7700000.0 | 7 | 951184.6578947367 | 280000.0 | 3200000.0 | 8 | 1105076.923076923 | 340000.0 | 3300000.0 | 9 | 893999.8333333334 | 450000.0 | 1400000.0 | 10 | 819333.3333333334 | 650000.0 | 1148000.0 | 11 | 520000.0 | 520000.0 | 520000.0 |
  44. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
  45. House price predictions – explore with tablesaw cleaned.addColumns( StringColumn.create("waterfrontDesc", cleaned.column("waterfront").collect{

    it ? 'waterfront' : 'interior' }), DoubleColumn.create("scaledGrade", cleaned.column("grade").collect{ it * 2 }), DoubleColumn.create("scaledPrice", cleaned.column("price").collect{ it / 100000 }) ) Plot.show(BubblePlot.create("Price vs living area and grade (bubble size)", cleaned, "sqft_living", "price", "scaledGrade", "waterfrontDesc")) Plot.show(Scatter3DPlot.create("Grade, living space, bathrooms and price (bubble size)", cleaned, "sqft_living", "bathrooms", "grade", "scaledPrice", "waterfrontDesc"))
  46. Regression with Least Squares Find line of “best fit”: •

    Find intercept and slope of line which minimizes the sum of the square of the residual errors [Figure: scatter of points with a fitted line; the vertical gaps are the residual errors]
  47. Regression with Least Squares Find line of “best fit”: •

    Find intercept and slope of line which minimizes the sum of the square of the residual errors • Line is represented by: y = m x + b • For points (x0, y0)..(xn, yn) errors will be: err0 = y0 – (m x0 + b) .. errn = yn – (m xn + b) which are then squared • To minimise, we find where the derivative is zero • Solving the math gives: b = (Σy Σx² − Σx Σxy) / (n Σx² − (Σx)²) and m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²) [Figure: points (x0, y0)..(xn, yn), the fitted line crossing the y-axis at (0, b), and the residual errors err0..errn]
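A direct Groovy transcription of these closed-form formulas, as an illustrative sketch (the deck's examples below use Apache Commons Math's SimpleRegression instead):

    // closed-form least squares for y = m x + b
    def leastSquares(List<Double> xs, List<Double> ys) {
        int n = xs.size()
        double sumX = xs.sum(), sumY = ys.sum()
        double sumXY = (0..<n).sum { xs[it] * ys[it] }
        double sumX2 = xs.sum { it * it }
        double denom = n * sumX2 - sumX * sumX
        def m = (n * sumXY - sumX * sumY) / denom
        def b = (sumY * sumX2 - sumX * sumXY) / denom
        [m, b]
    }

    def (m, b) = leastSquares([1d, 2d, 3d, 4d], [2.0d, 4.1d, 5.9d, 8.1d])
    printf 'y = %.3f x + %.3f%n', m, b   // prints roughly y = 2.010 x + 0.000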
  48. Regression with Gradient Descent Find line of “best fit”: •

    Find intercept and slope of line which minimizes the sum of the square of the residual errors • Line is represented by: y = m x + b and we might have a similar loss function • Instead of solving for the derivative of the loss function, we take repeated steps in the opposite direction of the gradient, taking smaller steps as we approach zero • A learning rate influences step size • Amenable to streaming & can involve random or mini-batch subsets of data for efficiency [Figure: loss function (residual error) plotted against the parameter being optimized (could be multiple dimensions)]
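The same toy data fitted with gradient descent, sketching the iterative approach just described (fixed learning rate, full-batch gradient; real libraries add mini-batching and adaptive step sizes):

    def xs = [1d, 2d, 3d, 4d], ys = [2.0d, 4.1d, 5.9d, 8.1d]
    double m = 0d, b = 0d, rate = 0.05d
    int n = xs.size()
    2000.times {
        double gradM = 0d, gradB = 0d
        (0..<n).each { i ->
            double err = (m * xs[i] + b) - ys[i]   // residual for point i
            gradM += 2 * err * xs[i] / n           // d(mean squared error) / dm
            gradB += 2 * err / n                   // d(mean squared error) / db
        }
        m -= rate * gradM                          // step in the opposite direction of the gradient
        b -= rate * gradB
    }
    printf 'y = %.3f x + %.3f%n', m, b             // approaches the closed-form answer above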
  49. House price predictions – linear regression import org.apache.commons.math3.stat.regression.SimpleRegression import static

    groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def feature = 'bedrooms' def nonOutliers = feature == 'bedrooms' ? { it[0] < 30 } : { true } def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv')) def all = csv.collect { [it[feature].toDouble(), it.price.toDouble()] }.findAll(nonOutliers) def reg = new SimpleRegression().tap{ addData(all as double[][]) } def (min, max) = all.transpose().with{ [it[0].min(), it[0].max()] } def predicted = [[min, reg.predict(min)], [max, reg.predict(max)]] start { stage(title: 'Price vs Number of bedrooms', show: true, width: 800, height: 600) { scene { scatterChart { series(name: 'Actual', data: all) } lineChart { series(name: 'Predicted', data: predicted) } } } }
  50. House price predictions – evaluating regression @Canonical @FXBindable class Feature

    { String name double rr, rmseTrain, rmseTest, mean } def features = [new Feature('bedrooms'), new Feature('bathrooms'), new Feature('grade'), new Feature('sqft_living')] def idxs = 0..<features.size() def priceIdx = features.size() def file = getClass().classLoader.getResource('kc_house_data.csv').file def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.collect { row -> [*idxs.collect{ row[features[it].name].toDouble() }, row.price.toDouble()] }.findAll{ it[0] < 30 } def (train, test) = all.chop(all.size() * 0.8 as int, -1) def trainT = train.transpose() def regs = idxs.collect { idx -> new SimpleRegression().tap{ addData([trainT[idx], trainT[priceIdx]].transpose() as double[][]) } } def residuals = idxs.collect{ idx -> test.collect{ regs[idx].predict(it[idx]) - it[priceIdx] } }
  51. House price predictions – evaluating regression idxs.each { idx ->

    features[idx].rr = regs[idx].RSquare features[idx].rmseTrain = Math.sqrt(regs[idx].meanSquareError) features[idx].rmseTest = Math.sqrt(StatUtils.sumSq(residuals[idx] as double[]) / (test.size() - 1)) features[idx].mean = StatUtils.mean(residuals[idx] as double[]) } start { stage(title: "Error statistics for feature", visible: true) { scene(fill: groovyblue, width: 800, height:200) { stackPane(padding: 10) { tableView(items: features) { tableColumn(text: "Feature", property: 'name') tableColumn(text: "R²", property: 'rr') tableColumn(text: "RMSE (train)", property: 'rmseTrain') tableColumn(text: "RMSE (test)", property: 'rmseTest') tableColumn(text: "Residuals Mean", property: 'mean') } } } } }
  52. House price predictions – evaluating regression def maxError = [residuals[graphIdx].min(),

    residuals[graphIdx].max()].max{ Math.abs(it) } residuals[graphIdx] << maxError * -1 // make graph even around origin maxError = Math.abs(maxError) def step = maxError.toInteger() / 50 def dist = new EmpiricalDistribution(100).tap{ load(residuals[graphIdx] as double[]) } def ndist = new NormalDistribution(0, dist.sampleStats.standardDeviation) def bins = dist.binStats.indexed().collect { i, v -> [v.n ? v.mean.toInteger().toString(): (-maxError + i * step).toInteger().toString(), v.n] } def nbins = dist.binStats.indexed().collect { i, v -> def x = v.n ? v.mean.toInteger() : (-maxError + i * step).toInteger(); [x.toString(), ndist.probability(x, x + step)] } def scale = dist.binStats.max{ it.n }.n / nbins.max{ it[1] }[1] nbins = nbins.collect{ [it[0], it[1] * scale] } start { stage(title: "Error histogram for ${features[graphIdx].name}", show: true, width: 800, height: 600) { scene { barChart(title: 'Error percentile', barGap: 0, categoryGap: 0) { series(name: 'Error in prediction', data: bins) series(name: 'Normal distribution', data: nbins) } } } }
  53. House price predictions – multi linear regression def features =

    [ 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def idxs = 0..<features.size() def priceIdx = features.size() def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv')) def all = csv.collect { row -> [*idxs.collect{ row[features[it]].toDouble() }, row.price.toDouble()] }.findAll{ it[0] < 30 } def (train, test) = all.chop(all.size() * 0.8 as int, -1) def trainT = train.transpose() def price = trainT[priceIdx] def reg = new OLSMultipleLinearRegression().tap{ newSampleData(price as double[], trainT[0..<priceIdx].transpose() as double[][]) } def params = reg.estimateRegressionParameters() println params def predicted = test.collect { data -> params[0] + (1..<params.size()).collect{ data[it-1] * params[it] }.sum() } def residuals = test.indexed().collect { i, data -> predicted[i] - data[priceIdx] } def rr = reg.calculateRSquared() def rmseTrain = Math.sqrt(reg.calculateResidualSumOfSquares()) def rmseTest = Math.sqrt(StatUtils.sumSq(residuals as double[]) / (test.size() - 1)) def mean = StatUtils.mean(residuals as double[]) println "$rr $rmseTrain $rmseTest $mean" [-3.1635392225013282E7, -27484.961699434185, -5236.246584477492, 191.507648719199, 17.94379836055797, 657803.1115560957, 4.369533695386711, 74850.68999978402, 66459.03393303146, 604896.8569593901, -27203.796342244277] 0.6536012563023335 215379.66626496855 212609.34498021202 1886.6015077985683 R2: https://en.wikipedia.org/wiki/Coefficient_of_determination rmse: https://en.wikipedia.org/wiki/Root-mean-square_deviation
  54. House price predictions – multi linear regression def file =

    getClass().classLoader.getResource('kc_house_data.csv').file as File def loader = new CSVLoader(file: file) def model = new LinearRegression() def allInstances = loader.dataSet def priceIndex = 2 allInstances.classIndex = priceIndex // remove "id" and "date" columns def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances) def instances = Filter.useFilter(allInstances, rm) model.buildClassifier(instances) println model def actual = instances.collect{ it.value(0).toDouble() } def predicted = instances.collect{ model.classifyInstance(it) } def chart = new XYChartBuilder()… Linear Regression Model price = -35766.5414 * bedrooms + 41144.2785 * bathrooms + 82.8088 * sqft_living + 0.1286 * sqft_lot + 6689.5501 * floors + 582960.4584 * waterfront + 52870.9424 * view + 26385.6491 * condition + 95890.4452 * grade + 98.4193 * sqft_above + 67.2917 * sqft_basement + -2620.2232 * yr_built + 19.8126 * yr_renovated + -582.4199 * zipcode + 602748.2265 * lat + -214729.8283 * long + 21.6814 * sqft_living15 + -0.3826 * sqft_lot15 + 6690324.6024
  55. House price predictions – multi linear regression def file =

    getClass().classLoader.getResource('kc_house_data.csv').file as File def loader = new CSVLoader(file: file) def model = new LinearRegression() def allInstances = loader.dataSet def priceIndex = 2 allInstances.classIndex = priceIndex // remove "id" and "date" columns def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances) def instances = Filter.useFilter(allInstances, rm) model.buildClassifier(instances) println model def actual = instances.collect{ it.value(0).toDouble() } def predicted = instances.collect{ model.classifyInstance(it) } def chart = new XYChartBuilder()… Linear Regression Model price = -35766.5414 * bedrooms + 41144.2785 * bathrooms + 82.8088 * sqft_living + 0.1286 * sqft_lot + 6689.5501 * floors + 582960.4584 * waterfront + 52870.9424 * view + 26385.6491 * condition + 95890.4452 * grade + 98.4193 * sqft_above + 67.2917 * sqft_basement + -2620.2232 * yr_built + 19.8126 * yr_renovated + -582.4199 * zipcode + 602748.2265 * lat + -214729.8283 * long + 21.6814 * sqft_living15 + -0.3826 * sqft_lot15 + 6690324.6024
  56. House price predictions – multi linear regression def file =

    getClass().classLoader.getResource('kc_house_data.csv').file as File def loader = new CSVLoader(file: file) def model = new SGD() model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled def allInstances = loader.dataSet def priceIndex = 2 allInstances.classIndex = priceIndex // remove "id", "date", 'zip', 'lat', 'long' columns def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat: allInstances) def instances = Filter.useFilter(allInstances, rm) model.buildClassifier(instances) println model def actual = instances.collect{ it.value(0).toDouble() } def predicted = instances.collect{ model.classifyInstance(it) } def chart = new XYChartBuilder()… Loss function: Huber loss (robust regression) price = -9.6832 bedrooms + 0.4953 bathrooms + 112.6971 sqft_living + 0.8189 sqft_lot + 5.5 floors + 0.7041 waterfront + 11.3372 view + 7.8443 condition + 23.9548 grade + 37.5499 sqft_above + 75.1472 sqft_basement + -7.8827 yr_built + 63.7169 yr_renovated + 103.4893 sqft_living15 + -1.3449 sqft_lot15 + 0.2788
  57. House price predictions – multi linear regression def file =

    getClass().classLoader.getResource('kc_house_data.csv').file as File def loader = new CSVLoader(file: file) def model = new SGD() model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled def allInstances = loader.dataSet def priceIndex = 2 allInstances.classIndex = priceIndex // remove "id", "date", 'zip', 'lat', 'long' columns def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat: allInstances) def instances = Filter.useFilter(allInstances, rm) model.buildClassifier(instances) println model def actual = instances.collect{ it.value(0).toDouble() } def predicted = instances.collect{ model.classifyInstance(it) } def chart = new XYChartBuilder()… Loss function: Huber loss (robust regression) price = -9.6832 bedrooms + 0.4953 bathrooms + 112.6971 sqft_living + 0.8189 sqft_lot + 5.5 floors + 0.7041 waterfront + 11.3372 view + 7.8443 condition + 23.9548 grade + 37.5499 sqft_above + 75.1472 sqft_basement + -7.8827 yr_built + 63.7169 yr_renovated + 103.4893 sqft_living15 + -1.3449 sqft_lot15 + 0.2788
  58. House price predictions – scaling regression import … def rows

    = Table.read().csv('kc_house_data.csv') rows = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) String[] features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def data = rows.as().doubleMatrix(features) // configure to all run on local machine but could be a cluster (can be hidden in XML) def cfg = new IgniteConfiguration( peerClassLoadingEnabled: true, discoverySpi: new TcpDiscoverySpi( ipFinder: new TcpDiscoveryMulticastIpFinder( addresses: ['127.0.0.1:47500..47509'] ) ) ) static pretty(mdl, features) { def sign = { val -> val < 0 ? '- ' : '+ ' } def valIdx = { idx, val -> sprintf '%.2f*%s', val, features[idx+1] } def valIdxSign = { idx, val -> sign(val) + valIdx(idx, Math.abs(val)) } def valSign = { val -> sign(val) + sprintf('%.2f', Math.abs(val)) } def (w, i) = [mdl.weights, mdl.intercept] [valIdx(0, w.get(0)), *(1..<w.size()).collect{ valIdxSign(it, w.get(it)) }, valSign(i)].join(' ') } Read in data and select features of interest Adjust config for known local mode setup Utility method to pretty print model
  59. House price predictions – scaling regression … Ignition.start(cfg).withCloseable { ignite

    -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0)) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def split = new TrainTestDatasetSplitter().split(0.8) def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer) def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() }) def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM()) dataCache.destroy() println ">>> Model: " + pretty(mdl, features) println ">>> R^2 : " + rr println ">>> RMSE : " + rmse } Candidates for distributed processing
  60. House price predictions – scaling regression … Ignition.start(cfg).withCloseable { ignite

    -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0)) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def split = new TrainTestDatasetSplitter().split(0.8) def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer) def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() }) def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM()) dataCache.destroy() println ">>> Model: " + pretty(mdl, features) println ">>> R^2 : " + rr println ">>> RMSE : " + rmse } [15:02:09] Ignite node started OK (id=63b2d47d) >>> Ignite grid started for data: 21612 rows X 11 cols >>> Model: -33081.12*bedrooms - 16205.93*bathrooms + 225.40*sqft_living + 6.77*sqft_living15 - 8766.52*lat - 38.59*sqft_above + 96251.64*grade + 62303.58*view + 572116.79*waterfront - 11620.55*floors - 48504.18 >>> R^2 : 0.607023233473466 >>> RMSE : 228035.85816493785 [15:02:10] Ignite node stopped OK [uptime=00:00:01.465] See also: https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/regression/linear/BostonHousePricesPredictionExample.java
  61. House prices – scaling regression import … def spark =

    builder().config('spark.master', 'local[8]').appName('HousePrices').orCreate def file = '/path/to/kc_house_data.csv' int k = 5 Dataset<Row> ds = spark.read().format('csv') .options('header': 'true', 'inferSchema': 'true').load(file) double[] splits = [80, 20] def (training, test) = ds.randomSplit(splits) String[] colNames = ds.columns().toList().minus(['id', 'date', 'price']) def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features') Dataset<Row> dataset = assembler.transform(training) def lr = new LinearRegression(labelCol: 'price', maxIter: 10) def model = lr.fit(dataset) println 'Coefficients:' println model.coefficients().values()[1..-1].collect{ sprintf '%.2f', it }.join(', ') def testSummary = model.evaluate(assembler.transform(test)) printf 'RMSE: %.2f%n', testSummary.rootMeanSquaredError printf 'r2: %.2f%n', testSummary.r2 spark.stop() Coefficients: 41979.78, 80853.89, 0.15, 5412.83, 564343.22, 53834.10, 24817.09, 93195.29, -80662.68, -80694.28, -2713.58, 19.02, -628.67, 594468.23, -228397.19, 21.23, -0.42 RMSE: 187242.12 r2: 0.70
  62. Clustering Overview Clustering: • Grouping similar items Algorithm families: •

    Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging
  63. Clustering case study: Whiskey flavor profiles • 86 scotch whiskies

    • 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/
  64. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def panel = new PlotPanel( *[0..<cols.size(), 0..<cols.size()].combinations().collect { first, second -> def color = new Color(72 + (first * 16), 72 + (second * 16), 200 - (first * 4) - (second * 4)) ScatterPlot.plot(table.as().doubleMatrix(cols[first], cols[second]), '#' as char, color) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?
  65. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?
  66. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?
  67. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) SwingUtil.show(size: [1200, 900], panel) Whiskey: How to visualize? Feature combinations?
  68. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  69. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  70. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  71. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  72. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  73. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  74. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  75. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  76. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  77. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
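Those three steps translate almost line for line into Groovy. A from-scratch sketch for illustration only (the following slides use the Weka, Apache Commons Math and Smile implementations):

    // k-means from scratch: step 1 guess centroids, step 2 assign, step 3 recompute
    def kmeans(List points, int k, int iterations = 20) {
        def distSq = { List a, List b -> (0..<a.size()).sum { (a[it] - b[it]) ** 2 } }
        def rnd = new Random(42)
        def centroids = (1..k).collect { points[rnd.nextInt(points.size())] }            // step 1: guess k centroids
        def clusters = [:]
        iterations.times {
            clusters = points.groupBy { p -> (0..<k).min { distSq(p, centroids[it]) } }  // step 2: nearest centroid
            centroids = (0..<k).collect { c ->                                           // step 3: new centroids
                def members = clusters[c] ?: [centroids[c]]      // keep previous centroid if a cluster goes empty
                (0..<members[0].size()).collect { d -> members.sum { it[d] } / members.size() }
            }
        }
        [centroids: centroids, clusters: clusters]
    }

    def result = kmeans([[1d, 1d], [1.5d, 2d], [1d, 0.5d], [8d, 8d], [9d, 9.5d]], 2)
    println result.centroids   // one centroid near [1.17, 1.17], the other near [8.5, 8.75]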
  78. import … def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco",

    "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def numClusters = 5 def loader = new CSVLoader(file: 'whiskey.csv') def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true) def instances = loader.dataSet instances.deleteAttributeAt(0) // remove RowID clusterer.buildClusterer(instances) println ' ' + cols.join(', ') def dataset = new DefaultCategoryDataset() clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num -> print "Cluster ${num+1}: " println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', ')) (1..cols.size()).each { idx -> dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) } } def clusters = (0..<numClusters).collectEntries{ [it, []] } clusterer.assignments.eachWithIndex { cnum, idx -> clusters[cnum] << instances.get(idx).stringValue(0) } clusters.each { k, v -> println "Cluster ${k+1}:" println v.join(', ') } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and weka Read CSV and create clusterer Display whiskies assigned to each cluster Graph cluster centroids
  79. import … def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco",

    "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def numClusters = 5 def loader = new CSVLoader(file: 'whiskey.csv') def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true) def instances = loader.dataSet instances.deleteAttributeAt(0) // remove RowID clusterer.buildClusterer(instances) println ' ' + cols.join(', ') def dataset = new DefaultCategoryDataset() clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num -> print "Cluster ${num+1}: " println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', ')) (1..cols.size()).each { idx -> dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) } } def clusters = (0..<numClusters).collectEntries{ [it, []] } clusterer.assignments.eachWithIndex { cnum, idx -> clusters[cnum] << instances.get(idx).stringValue(0) } clusters.each { k, v -> println "Cluster ${k+1}:" println v.join(', ') } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and weka Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000 Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591 Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591 Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133 Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714 Cluster 1: Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker Cluster 2: Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla Cluster 3: ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith, GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff, Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore Cluster 4: AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain, Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood, Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul Cluster 5: Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn, OldPulteney, RoyalBrackla, Tullibardine
  80. import … def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv')) def cols =

    ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def clusterer = new KMeansPlusPlusClusterer(5) def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) } def centroids = clusterer.cluster(data) println cols.join(', ') + ', Medoid' def dataset = new DefaultCategoryDataset() centroids.eachWithIndex{ ctrd, num -> def cpt = ctrd.center.point def closest = ctrd.points.min{ pt -> sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[]) } def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid" cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) } } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and medoids Same again with Apache Commons Math but also calculate medoids
  81. import … def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv')) def cols =

    ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def clusterer = new KMeansPlusPlusClusterer(5) def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) } def centroids = clusterer.cluster(data) println cols.join(', ') + ', Medoid' def dataset = new DefaultCategoryDataset() centroids.eachWithIndex{ ctrd, num -> def cpt = ctrd.center.point def closest = ctrd.points.min{ pt -> sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[]) } def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid" cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) } } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and medoids Libraries: Apache Commons Math and JFreeChart Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid 2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd 2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy 2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish 1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore 1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach
82. Whiskey – clustering with PCA plot import … def table

    = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def projected = pca.project(data) def clusterer = new KMeans(data, 5) table = table.addColumns( DoubleColumn.create("PCA1", (0..<data.size()).collect{ projected[it][0] }), DoubleColumn.create("PCA2", (0..<data.size()).collect{ projected[it][1] }), StringColumn.create("Cluster", clusterer.clusterLabel.collect{ it.toString() }) ) def title = "Clusters x Principal Components" Plot.show(ScatterPlot.create(title, table, "PCA1", "PCA2", "Cluster"))
  83. Whiskey – clustering with 2-4D PCA plot import … def

    rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = rows.as().doubleMatrix(*cols) def pca = new PCA(data) def dims = 4 // can be 2, 3 or 4 pca.projection = dims def projected = pca.project(data) def adj = [1, 1, 1, 5] def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect{ "Cluster " + (it+1) } rows = rows.addColumns( *(0..<dims).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{ adj[idx] * (projected[it][idx] + adj[idx]) }) }, StringColumn.create("Cluster", labels) ) def title = "Clusters x Principal Components" def type = dims == 2 ? ScatterPlot : Scatter3DPlot Plot.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, "Cluster"))
  84. Whiskey – clustering and visualizing centroids … def data =

    table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 3 def projected = pca.project(data) def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) } table = table.addColumns( *(0..<3).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{ projected[it][idx] })}, StringColumn.create("Cluster", labels), DoubleColumn.create("Centroid", [10] * labels.size()) ) def centroids = pca.project(clusterer.centroids()) def toAdd = table.emptyCopy(1) (0..<centroids.size()).each { idx -> toAdd[0].setString("Cluster", "Cluster " + (idx+1)) (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) } toAdd[0].setDouble("Centroid", 50) table.append(toAdd) } def title = "Clusters x Principal Components w/ centroids" Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))
85. Whiskey – Screeplot import … def table = Table.read().csv('whiskey.csv') def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0..<data.size()).collect { projected[it][idx - 1] }) } ) def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY] def symbols = ['*', 'Q', '#', 'Q', '*', '#'] (2..6).each { k -> def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..<k] as char[], colors[0..<k] as Color[]) } SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))
86. import … def table = Table.read().csv('whiskey.csv') def cols = ["Body",

    "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0..<data.size()).collect { projected[it][idx - 1] }) } ) def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY] def symbols = ['*', 'Q', '#', 'Q', '*', '#'] (2..6).each { k -> def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..<k] as char[], colors[0..<k] as Color[]) } SwingUtil.show(size: [1200, 900], new PlotPanel(*plots)) Whiskey – Screeplot
  87. Whiskey – XMeans clustering with 2-4D PCA plot import …

    def rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = rows.as().doubleMatrix(*cols) def pca = new PCA(data) def dims = 4 // can be 2, 3 or 4 pca.projection = dims def projected = pca.project(data) def adj = [1, 1, 1, 5] def kmax = 10 def clusterer = new XMeans(data, kmax) def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) } rows = rows.addColumns( *(0..<dims).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{ adj[idx] * (projected[it][idx] + adj[idx]) }) }, StringColumn.create("Cluster", labels) ) def title = "Clusters x Principal Components" def type = dims == 2 ? ScatterPlot : Scatter3DPlot Plot.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, "Cluster"))
  88. Whiskey – SOM with Heatmap import … def file =

    new File(getClass().classLoader.getResource('whiskey.csv').file) def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader()) String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral'] def data = table.select(cols).toArray() def distilleries = table.column('Distillery').toStringArray() //def (ncols, nrows) = [3, 3] //def (ncols, nrows) = [7, 4] def (ncols, nrows) = [7, 6] //def (ncols, nrows) = [8, 6] def lattice = SOM.lattice(nrows, ncols, data) int epochs = 100 def model = new SOM(lattice, TimeFunction.constant(0.1), Neighborhood.Gaussian(1, data.length * epochs / 8.0)) for (int i = 0; i < epochs; i++) { for (int j : MathEx.permutate(data.length)) { model.update(data[j]) } } …
  89. Whiskey – SOM with Heatmap … def groups = data.toList().indices.groupBy

    { idx -> group(model, data[idx]) } def names = groups.collectEntries { k, v -> [k, distilleries[v].join(', ')] } private group(SOM model, double[] row) { double[][][] neurons = model.neurons() [0..<neurons.size(), 0..<neurons[0].size()].combinations().min { pair -> def (i, j) = pair MathEx.distance(neurons[i][j], row) } } names.toSorted{ e1, e2 -> e1.key[0] <=> e2.key[0] ?: e1.key[1] <=> e2.key[1] }.each { k, v -> println "Cluster ${k[0]},${k[1]}: $v" } def tooltip = { i, j -> names[[i, j]] ?: '' } as Hexmap.Tooltip new Hexmap(model.umatrix(), Palette.jet(256), tooltip) .canvas().tap { setAxisLabels('', '') }.window()
  90. Whiskey – SOM with Heatmap Cluster 0,0: Auchentoshan, Glengoyne Cluster

    0,1: AnCnoc Cluster 0,2: Inchgower, Tomintoul Cluster 0,3: ArranIsleOf, Speyburn Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla Cluster 0,5: Glenfarclas, Glenlivet, Glenturret Cluster 0,6: Aberlour, Strathisla Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich Cluster 1,1: Cardhu, Glenallachie Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond Cluster 1,3: Benriach, Dalwhinnie Cluster 1,4: GlenOrd Cluster 1,5: Craigallechie, GlenMoray, Longmorn Cluster 1,6: Edradour, Knochando Cluster 2,0: GlenSpey, Miltonduff Cluster 2,1: Tamnavulin Cluster 2,2: Mannochmore Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory Cluster 2,4: Deanston, Scapa Cluster 2,5: Benromach Cluster 2,6: Belvenie, BenNevis, Benrinnes Cluster 3,0: GlenKeith, Tullibardine Cluster 3,1: Balblair, Glenmorangie, Strathmill Cluster 3,3: Tomore Cluster 3,4: Ardmore, OldFettercairn Cluster 3,5: Aberfeldy, BlairAthol Cluster 3,6: Glendullan, RoyalLochnagar Cluster 4,0: Craigganmore, Dufftown Cluster 4,1: OldPulteney Cluster 4,2: Oban Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank Cluster 4,4: GlenDeveronMacduff, Tomatin Cluster 4,5: Auchroisk, Glenrothes Cluster 4,6: Balmenach, Glendronach, Macallan Cluster 5,0: GlenGarioch, Teaninich Cluster 5,1: Caol Ila, Clynelish, Talisker Cluster 5,2: Ardbeg, Lagavulin, Laphroig Cluster 5,4: Bowmore, Highland Park Cluster 5,5: Dalmore Cluster 5,6: Dailuaine, Mortlach
  91. Whiskey – SOM with Heatmap Cluster 0,0: Auchentoshan, Glengoyne Cluster

    0,1: AnCnoc Cluster 0,2: Inchgower, Tomintoul Cluster 0,3: ArranIsleOf, Speyburn Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla Cluster 0,5: Glenfarclas, Glenlivet, Glenturret Cluster 0,6: Aberlour, Strathisla Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich Cluster 1,1: Cardhu, Glenallachie Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond Cluster 1,3: Benriach, Dalwhinnie Cluster 1,4: GlenOrd Cluster 1,5: Craigallechie, GlenMoray, Longmorn Cluster 1,6: Edradour, Knochando Cluster 2,0: GlenSpey, Miltonduff Cluster 2,1: Tamnavulin Cluster 2,2: Mannochmore Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory Cluster 2,4: Deanston, Scapa Cluster 2,5: Benromach Cluster 2,6: Belvenie, BenNevis, Benrinnes Cluster 3,0: GlenKeith, Tullibardine Cluster 3,1: Balblair, Glenmorangie, Strathmill Cluster 3,3: Tomore Cluster 3,4: Ardmore, OldFettercairn Cluster 3,5: Aberfeldy, BlairAthol Cluster 3,6: Glendullan, RoyalLochnagar Cluster 4,0: Craigganmore, Dufftown Cluster 4,1: OldPulteney Cluster 4,2: Oban Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank Cluster 4,4: GlenDeveronMacduff, Tomatin Cluster 4,5: Auchroisk, Glenrothes Cluster 4,6: Balmenach, Glendronach, Macallan Cluster 5,0: GlenGarioch, Teaninich Cluster 5,1: Caol Ila, Clynelish, Talisker Cluster 5,2: Ardbeg, Lagavulin, Laphroig Cluster 5,4: Bowmore, Highland Park Cluster 5,5: Dalmore Cluster 5,6: Dailuaine, Mortlach
  92. Whiskey – Hierarchical clustering with Dendrogram import … def file

    = new File(getClass().classLoader.getResource('whiskey.csv').file) def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader()) String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral'] def data = table.select(cols).toArray() def distilleries = table.column('Distillery').toStringArray() def ninetyDeg = 1.57 // radians def FOREST_GREEN = new Color(0X808000) def clusters = HierarchicalClustering.fit(CompleteLinkage.of(data)) //println clusters.tree //println clusters.height def partitions = clusters.partition(4) // little trick to work out cluster colors def colorMap = new LinkedHashSet(partitions.toList()).toList().reverse().indexed().collectEntries { k, v -> [v, Palette.COLORS[k]] } Font font = new Font("BitStream Vera Sans", Font.PLAIN, 12) …
  93. Whiskey – Hierarchical clustering with Dendrogram … def dendrogram =

    new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap { title = 'Whiskey Dendrogram' setAxisLabels('Distilleries', 'Similarity') def lb = lowerBounds setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds) distilleries.eachWithIndex { String label, int i -> add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]])) } }.panel() def pca = PCA.fit(data) pca.projection = 2 def projected = pca.project(data) char mark = '#' def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap { title = 'Clustered by dendrogram partitions' setAxisLabels('PCA1', 'PCA2') }.panel() new PlotGrid(dendrogram, scatter).window()
  94. Whiskey – Hierarchical clustering with Dendrogram … def dendrogram =

    new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap { title = 'Whiskey Dendrogram' setAxisLabels('Distilleries', 'Similarity') def lb = lowerBounds setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds) distilleries.eachWithIndex { String label, int i -> add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]])) } }.panel() def pca = PCA.fit(data) pca.projection = 2 def projected = pca.project(data) char mark = '#' def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap { title = 'Clustered by dendrogram partitions' setAxisLabels('PCA1', 'PCA2') }.panel() new PlotGrid(dendrogram, scatter).window()
  95. Whiskey flavors – scaling clustering import … def rows =

    Table.read().csv('whiskey.csv') String[] features = [ 'Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral' ] def data = rows.as().doubleMatrix(features) def cfg = new IgniteConfiguration(…) Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def trainer = new KMeansTrainer().withDistance(new EuclideanDistance()).withAmountOfClusters(5) def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def mdl = trainer.fit(ignite, dataCache, vectorizer) println ">>> KMeans centroids" println features[1..-1].join(', ') mdl.centers.each { c -> println c.all().collect{ sprintf '%.4f', it.get() }.join(', ') } dataCache.destroy() } [17:18:07] Ignite node started OK (id=ea92fc20) >>> Ignite grid started for data: 86 rows X 13 cols >>> KMeans centroids Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral 2.9091, 1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455 1.8095, 2.4762, 1.5714, 0.4286, 0.1429, 1.5238, 1.7143, 0.8571, 0.7143, 1.6667, 1.6190, 2.0476 2.5217, 2.3913, 1.4783, 0.0435, 0.0435, 1.6087, 1.4783, 1.9565, 2.0435, 2.0435, 1.9565, 1.3913 1.3704, 2.3704, 1.0000, 0.2593, 0.0370, 0.7778, 0.9259, 0.4074, 1.5185, 1.7407, 2.0000, 2.1111 3.2500, 2.2500, 1.5000, 0.0000, 0.0000, 3.0000, 2.0000, 1.0000, 1.5000, 2.5000, 2.2500, 2.0000 [17:18:07] Ignite node stopped OK [uptime=00:00:00.368]
  96. Whiskey flavors – scaling clustering import … def spark =

    builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate def file = '/path/to/whiskey.csv' int k = 5 Dataset<Row> rows = spark.read().format('com.databricks.spark.csv') .options('header': 'true', 'inferSchema': 'true').load(file) String[] colNames = rows.columns().toList().minus(['RowID', 'Distillery']) def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features') Dataset<Row> dataset = assembler.transform(rows) def clusterer = new KMeans(k: k, seed: 1L) def model = clusterer.fit(dataset) println 'Cluster centers:' model.clusterCenters().each{ println it.values().collect{ sprintf '%.2f', it }.join(', ') } spark.stop() Cluster centers: 1.73, 2.35, 1.58, 0.81, 0.19, 1.15, 1.42, 0.81, 1.23, 1.77, 1.23, 1.31 2.00, 1.00, 3.00, 0.00, 0.00, 0.00, 3.00, 1.00, 0.00, 2.00, 2.00, 2.00 2.86, 2.38, 1.52, 0.05, 0.00, 1.95, 1.76, 2.05, 1.81, 2.05, 2.19, 1.71 1.53, 2.38, 1.06, 0.16, 0.03, 1.09, 1.00, 0.50, 1.53, 1.75, 2.13, 2.28 3.67, 1.50, 3.67, 3.33, 0.67, 0.17, 1.67, 0.50, 1.17, 1.33, 1.17, 0.17
  97. Whiskey flavors – scaling clustering class Point { double[] pts

    } class TaggedPoint extends Point { int cluster } @Canonical(includeSuperProperties=true) class TaggedPointCounter extends TaggedPoint { long count TaggedPointCounter plus(TaggedPointCounter other) { new TaggedPointCounter((0..<pts.size()).collect{ pts[it] + other.pts[it] } as double[], cluster, count + other.count) } TaggedPointCounter average() { new TaggedPointCounter(pts.collect{it/count } as double[], cluster, 0) } } int k = 5 Integer iterations = 20 def configuration = new Configuration() Was: Rheem
  98. Whiskey flavors – scaling clustering class SelectNearestCentroid implements FunctionDescriptor.ExtendedSerializableFunction<Point, TaggedPointCounter>

    { Iterable<TaggedPointCounter> centroids @Override void open(ExecutionContext executionCtx) { centroids = executionCtx.getBroadcast("centroids") } @Override TaggedPointCounter apply(Point point) { def minDistance = Double.POSITIVE_INFINITY def nearestCentroidId = -1 for (centroid in centroids) { def distance = sqrt((0..<point.pts.size()).collect{ point.pts[it] - centroid.pts[it] }.sum{ it ** 2 }) if (distance < minDistance) { minDistance = distance nearestCentroidId = centroid.cluster } } new TaggedPointCounter(point.pts, nearestCentroidId, 1) } } Was: Rheem
  99. Whiskey flavors – scaling clustering def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file def

    rheemContext = new RheemContext(configuration) .withPlugin(Java.basicPlugin()) .withPlugin(Spark.basicPlugin()) def planBuilder = new JavaPlanBuilder(rheemContext) .withJobName("KMeans ($url, k=$k, iterations=$iterations)") def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) } def pointsBuilder = planBuilder.loadCollection(points) def random = new Random() double[][] initPts = (1..k).collect{ (0..<points[0].pts.size()).collect{ random.nextGaussian() * 4 } } def initialCentroidsBuilder = planBuilder .loadCollection((0..<k).collect{new TaggedPointCounter(initPts[it], it, 0)}) .withName("Load random centroids") Was: Rheem
  100. Whiskey flavors – scaling clustering def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file def

    rheemContext = new RheemContext(configuration) .withPlugin(Java.basicPlugin()) .withPlugin(Spark.basicPlugin()) def planBuilder = new JavaPlanBuilder(rheemContext) .withJobName("KMeans ($url, k=$k, iterations=$iterations)") def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) } def pointsBuilder = planBuilder.loadCollection(points) def random = new Random() double[][] initPts = (1..k).collect{ (0..<points[0].pts.size()).collect{ random.nextGaussian() * 4 } } def initialCentroidsBuilder = planBuilder .loadCollection((0..<k).collect{new TaggedPointCounter(initPts[it], it, 0)}) .withName("Load random centroids") [main] INFO org.qcri.rheem.core.api.Job - StopWatch results: * Optimization - 0:00:00.297 * Prepare - 0:00:00.063 * Prune&Isolate - 0:00:00.011 * Transformations - 0:00:00.052 * Sanity - 0:00:00.000 * Cardinality&Load Estimation - 0:00:00.115 * Create OptimizationContext - 0:00:00.008 * Create CardinalityEstimationManager - 0:00:00.001 * Push Estimation - 0:00:00.106 * Create Initial Execution Plan - 0:00:00.119 * Enumerate - 0:00:00.092 * Concatenation - 0:00:00.058 * Channel Conversion - 0:00:00.050 * Prune - 0:00:00.017 * Pick Best Plan - 0:00:00.009 * Split Stages - 0:00:00.013 * Execution - 0:00:00.352 * Execution 0 - 0:00:00.352 * Execute - 0:00:00.350 * Post-processing - 0:00:00.047 * Log measurements - 0:00:00.047 * Release Resources - 0:00:00.000 Centroids: Cluster0: 2.500, 2.324, 1.588, 0.176, 0.059, 1.882, 1.647, 1.676, 1.824, 2.088, 1.912, 1.706 Cluster1: 1.488, 2.463, 1.122, 0.268, 0.073, 0.927, 1.146, 0.512, 1.146, 1.659, 1.878, 2.000 Cluster2: 2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545 Was: Rheem
  101. Clustering lab – repeat some of the examples with beer

    https://rstudio-pubs-static.s3.amazonaws.com/560654_2b44eef198f24a0bba45e002ae86d237.html
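As a starting point for the lab, here is a minimal sketch of adapting the earlier Apache Commons Math k-means example to a beer dataset. The file name 'beer.csv', the flavour column names and the choice of 5 clusters are assumptions for illustration only; adjust them to match whatever the linked dataset actually contains.

import org.apache.commons.csv.CSVFormat
import org.apache.commons.math3.ml.clustering.DoublePoint
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer

// 'beer.csv' and these flavour columns are assumed - replace with the real dataset's numeric features
def rows = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(new FileReader('beer.csv')).records
def cols = ['Bitterness', 'Sweetness', 'Maltiness', 'Hoppiness']

// turn each CSV record into a point over the chosen feature columns
def data = rows.collect { row ->
    new DoublePoint(cols.collect { col -> row.get(col) as double } as double[])
}

def clusterer = new KMeansPlusPlusClusterer<DoublePoint>(5)   // 5 clusters, as for the whiskey example
clusterer.cluster(data).eachWithIndex { centroid, num ->
    println "Cluster ${num + 1}: " + centroid.center.point.collect { sprintf '%.3f', it }.join(', ')
}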
102. Classification Overview Classification: • Predicting the class of some data Algorithms:

    • Logistic Regression • Naïve Bayes • Stochastic Gradient Descent • K-Nearest Neighbours • Decision Tree • Random Forest • Support Vector Machine Aspects: • Over/underfitting • Ensemble • Confusion (see the tally sketch below) Applications: • Image recognition • Speech recognition • Medical diagnosis
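Since “Confusion” above refers to the confusion matrices printed by the classification examples that follow, here is a minimal sketch of how such a matrix is tallied from actual versus predicted class indices; the labels and counts below are made up purely for illustration.

// illustrative only: actual vs predicted class indices for a 3-class problem
def actual    = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]
def predicted = [0, 0, 1, 2, 1, 2, 2, 1, 2, 1]
int numClasses = 3

// rows are actual classes, columns are predicted classes
def matrix = (0..<numClasses).collect { [0] * numClasses }
(0..<actual.size()).each { matrix[actual[it]][predicted[it]]++ }

matrix.eachWithIndex { row, i -> println "actual $i: ${row.join(' ')}" }
println "accuracy: ${(0..<numClasses).sum { matrix[it][it] } / actual.size()}"   // diagonal over total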
  103. Case Study: classification of Iris flowers Created by British statistician

    and biologist Ronald Fisher for his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis 150 samples, 50 each of three species of Iris: • setosa • versicolor • virginica Four features measured for each sample: • sepal length • sepal width • petal length • petal width https://en.wikipedia.org/wiki/Iris_flower_data_set https://archive.ics.uci.edu/ml/datasets/Iris [images: setosa, versicolor and virginica flowers, with sepal and petal labelled]
  104. Iris flower data – Weka Naïve Bayes def file =

    getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new NaiveBayes() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(2) } double[] petalL = allInstances.collect{ it.value(3) } def indices = actual.indices
  105. Iris flower data – Weka Naïve Bayes def file =

    getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new NaiveBayes() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart()
  106. def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String

    name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart() Iris flower data – Weka Naïve Bayes def file = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new NaiveBayes() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices Iris-setosa: 50 correct, 0 incorrect Iris-versicolor: 48 correct, 4 incorrect Iris-virginica: 46 correct, 2 incorrect
  107. Iris flower data – Weka Logistic Regression def file =

    getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new SimpleLogistic() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(2) } double[] petalL = allInstances.collect{ it.value(3) } def indices = actual.indices
  108. Iris flower data – Weka Logistic Regression def file =

    getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new SimpleLogistic() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart() Iris-setosa: 50 correct, 0 incorrect Iris-versicolor: 49 correct, 1 incorrect Iris-virginica: 49 correct, 1 incorrect
  109. Iris flower data – Weka J48 Decision Tree def file

    = getClass().classLoader.getResource('iris_data.csv').file as File def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'] def loader = new CSVLoader(file: file) def model = new J48() def allInstances = loader.dataSet allInstances.classIndex = 4 model.buildClassifier(allInstances) double[] actual = allInstances.collect{ it.value(4) } double[] predicted = allInstances.collect{ model.classifyInstance(it) } double[] petalW = allInstances.collect{ it.value(0) } double[] petalL = allInstances.collect{ it.value(1) } def indices = actual.indices def chart = new XYChartBuilder().width(900).height(450). title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build() species.eachWithIndex{ String name, int i -> def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i } Collection found = groups[true] ?: [] Collection errors = groups[false] ?: [] println "$name: ${found.size()} correct, ${errors.size()} incorrect" chart.addSeries("$name correct", petalW[found], petalL[found]).with { XYSeriesRenderStyle = Scatter } if (errors) { chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with { XYSeriesRenderStyle = Scatter } } } new SwingWrapper(chart).displayChart() Petal width <= 0.6: Iris-setosa (50.0) Petal width > 0.6 | Petal width <= 1.7 | | Petal length <= 4.9: Iris-versicolor (48.0/1.0) | | Petal length > 4.9 | | | Petal width <= 1.5: Iris-virginica (3.0) | | | Petal width > 1.5: Iris-versicolor (3.0/1.0) | Petal width > 1.7: Iris-virginica (46.0/1.0) Number of Leaves : 5 Size of the tree : 9 Iris-setosa: 50 correct, 0 incorrect Iris-versicolor: 49 correct, 2 incorrect Iris-virginica: 48 correct, 1 incorrect
  110. Iris flower data – wekaDeeplearning4j WekaPackageManager.loadPackages(true) def file = getClass().classLoader.getResource('iris_data.csv').file

    as File def loader = new CSVLoader(file: file) def data = loader.dataSet data.classIndex = 4 def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \ -lossFn weka.dl4j.lossfunctions.LossMCXENT\"") AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options) // Stratify and split Random rand = new Random(0) Instances randData = new Instances(data) randData.randomize(rand) randData.stratify(3) Instances train = randData.trainCV(3, 0) Instances test = randData.testCV(3, 0) // Build the classifier on the training data myClassifier.buildClassifier(train) // Evaluate the model on test data Evaluation eval = new Evaluation(test) eval.evaluateModel(myClassifier, test) println eval.toSummaryString() println eval.toMatrixString() Image: https://www.neuraldesigner.com/learning/examples/iris-flowers-classification
  111. Iris flower data – wekaDeeplearning4j WekaPackageManager.loadPackages(true) def file = getClass().classLoader.getResource('iris_data.csv').file

    as File def loader = new CSVLoader(file: file) def data = loader.dataSet data.classIndex = 4 def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \ -lossFn weka.dl4j.lossfunctions.LossMCXENT\"") AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options) // Stratify and split Random rand = new Random(0) Instances randData = new Instances(data) randData.randomize(rand) randData.stratify(3) Instances train = randData.trainCV(3, 0) Instances test = randData.testCV(3, 0) // Build the classifier on the training data myClassifier.buildClassifier(train) // Evaluate the model on test data Evaluation eval = new Evaluation(test) eval.evaluateModel(myClassifier, test) println eval.toSummaryString() println eval.toMatrixString() [main] INFO org.deeplearning4j.nn.graph.ComputationGraph - Starting ComputationGraph with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE] Training Dl4jMlpClassifier...: [] ETA: 00:00:00[INFO ] 00:03:31.035 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [1/10] took 00:00:00.670 Training Dl4jMlpClassifier...: [====== ] ETA: 00:00:06[INFO ] 00:03:31.152 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [2/10] took 00:00:00.113 Training Dl4jMlpClassifier...: [============ ] ETA: 00:00:03[INFO ] 00:03:31.244 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [3/10] took 00:00:00.090 Training Dl4jMlpClassifier...: [================== ] ETA: 00:00:02[INFO ] 00:03:31.325 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [4/10] took 00:00:00.079 Training Dl4jMlpClassifier...: [======================== ] ETA: 00:00:01[INFO ] 00:03:31.470 [main] weka.dl4j.listener.EpochListener - Epoch [5/10] Train Set: Loss: 0.510342 [INFO ] 00:03:31.470 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [5/10] took 00:00:00.144 Training Dl4jMlpClassifier...: [============================== ] ETA: 00:00:01[INFO ] 00:03:31.546 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [6/10] took 00:00:00.073 Training Dl4jMlpClassifier...: [==================================== ] ETA: 00:00:00[INFO ] 00:03:31.611 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [7/10] took 00:00:00.063 Training Dl4jMlpClassifier...: [========================================== ] ETA: 00:00:00[INFO ] 00:03:31.714 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [8/10] took 00:00:00.101 Training Dl4jMlpClassifier...: [================================================ ] ETA: 00:00:00[INFO ] 00:03:31.790 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [9/10] took 00:00:00.074 Training Dl4jMlpClassifier...: [====================================================== ] ETA: 00:00:00[INFO ] 00:03:31.882 [main] weka.dl4j.listener.EpochListener - Epoch [10/10] Train Set: Loss: 0.286469 …
  112. Iris flower data – wekaDeeplearning4j WekaPackageManager.loadPackages(true) def file = getClass().classLoader.getResource('iris_data.csv').file

    as File def loader = new CSVLoader(file: file) def data = loader.dataSet data.classIndex = 4 def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \ -lossFn weka.dl4j.lossfunctions.LossMCXENT\"") AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options) // Stratify and split Random rand = new Random(0) Instances randData = new Instances(data) randData.randomize(rand) randData.stratify(3) Instances train = randData.trainCV(3, 0) Instances test = randData.testCV(3, 0) // Build the classifier on the training data myClassifier.buildClassifier(train) // Evaluate the model on test data Evaluation eval = new Evaluation(test) eval.evaluateModel(myClassifier, test) println eval.toSummaryString() println eval.toMatrixString() … [INFO ] 00:03:31.883 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [10/10] took 00:00:00.091 Training Dl4jMlpClassifier...: [============================================================] ETA: 00:00:00 Done! Correctly Classified Instances 40 80 % Incorrectly Classified Instances 10 20 % Kappa statistic 0.701 Mean absolute error 0.2542 Root mean squared error 0.3188 Relative absolute error 57.2154 % Root relative squared error 67.6336 % Total Number of Instances 50 === Confusion Matrix === a b c <-- classified as 17 0 0 | a = Iris-setosa 0 9 8 | b = Iris-versicolor 0 2 14 | c = Iris-virginica BUILD SUCCESSFUL in 22s
  113. Classify language Apache OpenNLP parts: • sentence detector, tokenizer, name

    finder, document categorizer, part-of-speech tagger, chunker, parser, coreference resolution import opennlp.tools.langdetect.* def base = 'http://apache.forsale.plus/opennlp/models' def url = "$base/langdetect/1.8.3/langdetect-183.bin" def model = new LanguageDetectorModel(new URL(url)) def detector = new LanguageDetectorME(model) [ 'spa': 'Bienvenido a Madrid', 'fra': 'Bienvenue à Paris', 'dan': 'Velkommen til København' ].each { k, v -> assert detector.predictLanguage(v).lang == k } Supported language detection training models: • MaxEnt • Perceptron • Naïve Bayes
  114. Classify language – entities String[] sentences = [ "A commit

    by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.", "A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.", 'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.', 'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.', 'I saw Ms. May Smith waving to June Jones.', 'The parcel was passed from May to June.' ] // use a helper to cache models def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5') def modelNames = ['person', 'money', 'date', 'time', 'location'] def finders = modelNames.collect{ new NameFinderME(new TokenNameFinderModel(helper.load("en-ner-$it"))) } def tokenizer = SimpleTokenizer.INSTANCE …
  115. Classify language – entities sentences.each { sentence -> String[] tokens

    = tokenizer.tokenize(sentence) Span[] tokenSpans = tokenizer.tokenizePos(sentence) def entityText = [:] def entityPos = [:] finders.indices.each {fi -> // could be made smarter by looking at probabilities and overlapping spans Span[] spans = finders[fi].find(tokens) spans.each{span -> def se = span.start..<span.end def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end) entityPos[span.start] = pos entityText[span.start] = "$span.type(${sentence[pos]})" } } entityPos.keySet().toList().reverseEach { def pos = entityPos[it] def (from, to) = [pos.from, pos.to + 1] sentence = sentence[0..<from] + entityText[it] + sentence[to..-1] } println sentence }
  116. Classify language – entities sentences.each { sentence -> String[] tokens

    = tokenizer.tokenize(sentence) Span[] tokenSpans = tokenizer.tokenizePos(sentence) def entityText = [:] def entityPos = [:] finders.indices.each {fi -> // could be made smarter by looking at probabilities and overlapping spans Span[] spans = finders[fi].find(tokens) spans.each{span -> def se = span.start..<span.end def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end) entityPos[span.start] = pos entityText[span.start] = "$span.type(${sentence[pos]})" } } entityPos.keySet().toList().reverseEach { def pos = entityPos[it] def (from, to) = [pos.from, pos.to + 1] sentence = sentence[0..<from] + entityText[it] + sentence[to..-1] } println sentence } A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query. A commit by person(Daniel) on Sun., date(December 6, 2020) improved Groovy 4's language integrated query. The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price. The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark). I saw Ms. person(May Smith) waving to person(June Jones). The parcel was passed from date(May to June).
  117. Classify language – parts of speech def sentences = [

    'Paul has two sisters, Maree and Christine.', 'His bark was much worse than his bite', 'Turn on the lights to the main bedroom', "Light 'em all up", 'Make it dark downstairs' ] def tokenizer = SimpleTokenizer.INSTANCE sentences.each { String[] tokens = tokenizer.tokenize(it) def model = new POSModel(helper.load('en-pos-maxent')) def posTagger = new POSTaggerME(model) String[] tags = posTagger.tag(tokens) println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ') } NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) . PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite) VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) JJ(main) NN(bedroom) NN(Light) POS(') NN(em) DT(all) IN(up) VB(Make) PRP(it) JJ(dark) NN(downstairs)
  118. Classify language – sentence detection def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')

    def model = new SentenceModel(helper.load('en-sent')) def detector = new SentenceDetectorME(model) def sentences = detector.sentDetect(text) assert text.count('.') == 28 assert sentences.size() == 4 println sentences.join('\n\n')
  119. Classify language – sentence detection def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')

    def model = new SentenceModel(helper.load('en-sent')) def detector = new SentenceDetectorME(model) def sentences = detector.sentDetect(text) assert text.count('.') == 28 assert sentences.size() == 4 println sentences.join('\n\n') The most referenced scientific paper of all time is "Protein measurement with the Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall, R. J. and was published in the J. BioChem. in 1951. It describes a method for measuring the amount of protein (even as small as 0.2 γ, were γ is the specific weight) in solutions and has been cited over 300,000 times and can be found here: https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago before moving to Harvard under A. Baird Hastings. He was also the H.O.D of Pharmacology at Washington University in St. Louis for 29 years. The most referenced scientific paper of all time is "Protein measurement with the Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall, R. J. and was published in the J. BioChem. in 1951. It describes a method for measuring the amount of protein (even as small as 0.2 ?, were ? is the specific weight) in solutions and has been cited over 300,000 times and can be found here: https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago before moving to Harvard under A. Baird Hastings. He was also the H.O.D of Pharmacology at Washington University in St. Louis for 29 years.
  120. Classify language Apache NLPCraft (incubating) start(LightSwitchModel) def cli = new

    NCTestClientBuilder().newBuilder().build() cli.open("nlpcraft.lightswitch.ex.java") println cli.ask('Turn on the lights in the master bedroom') println cli.ask("Light 'em all up") println cli.ask('Make it dark downstairs') // expecting no match if (cli) { cli.close() } stop() Lights are [on] in [master bedroom]. Lights are [on] in [entire house]. No matching intent found.
  121. Classify language Apache NLPCraft (incubating) @NCIntentRef("ls") @NCIntentSample([ "Turn the lights

    off in the entire house.", "Switch on the illumination in the master bedroom closet.", "Get the lights on.", "Lights up in the kitchen.", "Please, put the light out in the upstairs bedroom.", "Set the lights on in the entire house.", … "Kill off all the lights now!", "No lights in the bedroom, please.", "Light up the garage, please!", "Kill the illumination now!" ]) NCResult onMatch( @NCIntentTerm("act") NCToken actTok, @NCIntentTerm("loc") List<NCToken> locToks) { String status = actTok.id == "ls:on" ? "on" : "off" String locations = locToks ? locToks.collect{ it.meta("nlpcraft:nlp:origtext") }.join(", ") : "entire house" // Add HomeKit, Arduino or other integration here. // By default - just return a descriptive action string. NCResult.text("Lights are [$status] in [${locations.toLowerCase()}].") } Lights are [on] in [master bedroom]. Lights are [on] in [entire house]. No matching intent found.
122. Classification with Neural networks With more than one layer (deep), the model caters for non-linear relationships See: https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
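To make the non-linearity point concrete, here is a tiny forward-pass sketch with made-up weights: each layer is a weighted sum followed by a non-linear activation (sigmoid here), and stacking more than one such layer is what lets a network model non-linear relationships. This is only an illustration of the idea, not how the libraries below implement it.

// 2 inputs -> 2 hidden neurons -> 1 output, with illustrative weights
def sigmoid = { double x -> 1 / (1 + Math.exp(-x)) }

double[] input = [0.5, -1.2]
double[][] hiddenWeights = [[0.1, -0.4], [0.7, 0.2]]   // made-up values
double[] outputWeights = [0.6, -0.3]                   // made-up values

// hidden layer: weighted sum of inputs then activation, per neuron
def hidden = hiddenWeights.collect { row ->
    sigmoid((0..<input.length).sum { row[it] * input[it] })
}
// output layer: weighted sum of hidden activations then activation
def output = sigmoid((0..<hidden.size()).sum { outputWeights[it] * hidden[it] })
println "hidden = $hidden, output = $output"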
  123. Classification with Neural networks Case Study: MNIST • Handwritten digits

    • 60,000 training examples • 10,000 test examples • Size-normalized (28x28 pixels) and centered • http://yann.lecun.com/exdb/mnist/
  124. MNIST with Apache Commons Math class MnistNeuralNet { private static

    int inodes = 784 private static int hnodes = 200 private static int onodes = 10 private static double learning_rate = 0.1 private static RealMatrix inputHidden private static RealMatrix hiddenOutput static initRandom() { inputHidden = initRandom(createRealMatrix(hnodes, inodes), inodes**-0.5 as double) hiddenOutput = initRandom(createRealMatrix(onodes, hnodes), hnodes**-0.5 as double) } static double[] normalize(Integer[] img) { img.collect { it / 255.0d * 0.999d + 0.001d } as double[] } static RealMatrix createTarget(int label) { createRealMatrix(onodes, 1).tap { (0..<onodes).each { setEntry(it, 0, it != label ? 0.001d : 0.999d) } } } static RealMatrix query(double[] inputArray) { RealMatrix inputs = createColumnRealMatrix(inputArray) RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs return scalarSigmoid(finalInputs) } … Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html MnistNeuralNet class supports training and prediction 0 1 2 9 … … … 10 28 * 28 = 784 200
  125. MNIST with Apache Commons Math … static RealMatrix initRandom(RealMatrix matrix,

    double desiredStandardDeviation) { scalar(matrix, it -> new Random().nextGaussian() * desiredStandardDeviation) } static void train(RealMatrix inputs, RealMatrix targets) { // forward RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs RealMatrix finalOutputs = scalarSigmoid(finalInputs) // back RealMatrix outputErrors = targets.subtract(finalOutputs) RealMatrix t1 = multiplyElements(outputErrors, finalOutputs) RealMatrix t2 = multiplyElements(t1, scalar(finalOutputs, it -> 1.0d - it)) RealMatrix t3 = t2 * hiddenOutputs.transpose() hiddenOutput = hiddenOutput.add(scalar(t3, it -> learning_rate * it)) RealMatrix hiddenErrors = hiddenOutput.transpose() * outputErrors t1 = multiplyElements(hiddenErrors, hiddenOutputs) t2 = multiplyElements(t1, scalar(hiddenOutputs, it -> 1.0d - it)) t3 = t2 * inputs.transpose() inputHidden = inputHidden.add(scalar(t3, it -> learning_rate * it)) } static void save(String filename) { new FileOutputStream(filename).withObjectOutputStream { oos -> oos.writeObject(inputHidden.data) oos.writeObject(hiddenOutput.data) } } } Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html MnistNeuralNet class supports training and prediction cont’d
  126. MNIST with Apache Commons Math static RealMatrix scalar(RealMatrix matrix, Function<Double,

    Double> function) { int numRows = matrix.rowDimension int numCols = matrix.columnDimension RealMatrix result = createRealMatrix(numRows, numCols) for (r in 0..<numRows) { for (c in 0..<numCols) { result.setEntry(r, c, function.apply(matrix.getEntry(r, c))) } } return result } static int[][] rotate(int[][] img, double angleInDegrees) { double angle = Math.toRadians(angleInDegrees) int[][] result = new int[img.length][] for (y in 0..<img.length) { result[y] = new int[img[y].length] Arrays.fill(result[y], 0) } double cosAngle = Math.cos(angle) double sinAngle = Math.sin(angle) double x0 = img[0].length / 2 - cosAngle * img[0].length / 2 - sinAngle * img.length / 2 double y0 = img.length / 2 - cosAngle * img.length / 2 + sinAngle * img[0].length / 2 for (y in 0..<img.length) { for (x in 0..<img[y].length) { int xRot = (int) (x * cosAngle + y * sinAngle + x0) int yRot = (int) (-x * sinAngle + y * cosAngle + y0) if (xRot >= 0 && yRot >= 0 && xRot <= 27 && yRot <= 27) { result[y][x] = img[yRot][xRot] } } } return result } … Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Utility methods
  127. MNIST with Apache Commons Math … static RealMatrix multiplyElements(RealMatrix matrixA,

    RealMatrix matrixB) { // elementWise multiplication has same compatibility requirements as addition checkAdditionCompatible(matrixA, matrixB) int numRows = matrixA.rowDimension int numCols = matrixA.columnDimension RealMatrix product = createRealMatrix(numRows, numCols) for (r in 0..<numRows) { for (c in 0..<numCols) { product.setEntry(r, c, matrixA.getEntry(r, c) * matrixB.getEntry(r, c)) } } return product } static int maxIndex(RealMatrix result) { double[][] data = result.data (0..<data.size()).max { data[it][0] } } static RealMatrix scalarSigmoid(RealMatrix matrix) { def sigmoid = [start: { a, b, c, d, e, f -> }, visit: { r, c, double v -> 1 / (1 + Math.exp(-v)) }, end : { -> 0d }] as RealMatrixChangingVisitor RealMatrix result = matrix.copy() result.walkInRowOrder(sigmoid) result } Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Utility methods cont’d
  128. MNIST with Apache Commons Math int epochs = 10 initRandom()

    int[] labels = MnistReader.getLabels(Paths.get("/path/to/train-labels-idx1-ubyte.gz")) List<int[][]> images = MnistReader.getImages(Paths.get("/path/to/train-images-idx3-ubyte.gz")) println 'Processing images' def scaledImages = images.collect { image -> normalize(image.flatten() as Integer[]) } as double[][] def scaledImagesRot10 = images.collect { image -> normalize(rotate(image, 10).flatten() as Integer[]) } as double[][] def scaledImagesRot350 = images.collect { image -> normalize(rotate(image, -10).flatten() as Integer[]) } as double[][] epochs.times { e -> println "Running epoch: ${e + 1}" for (i in 0..<labels.length) { RealMatrix targets = createTarget(labels[i]) train(createColumnRealMatrix(scaledImages[i]), targets) train(createColumnRealMatrix(scaledImagesRot10[i]), targets) train(createColumnRealMatrix(scaledImagesRot350[i]), targets) } } int[] testLabels = MnistReader.getLabels(Paths.get("/path/to/t10k-labels-idx1-ubyte.gz")) List<int[][]> testImages = MnistReader.getImages(Paths.get("/path/to/t10k-images-idx3-ubyte.gz")) int correct = 0 for (i in 0..<testLabels.length) { int correctLabel = testLabels[i] RealMatrix predict = query(scale(testImages.get(i).flatten() as Integer[])) int predictLabel = maxIndex(predict) if (predictLabel == correctLabel) correct++ } println "Accuracy: " + correct / testLabels.length save("../resources/weights") Processing images Running epoch: 1 Running epoch: 2 Running epoch: 3 Running epoch: 4 Running epoch: 5 Running epoch: 6 Running epoch: 7 Running epoch: 8 Running epoch: 9 Running epoch: 10 Accuracy: 0.9767 Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Train model Test model
  129. MNIST with DeepLearning4J int rngSeed = 123 int batchSize =

    125 int numEpochs = 15 int numInputs = 28 * 28 int hiddenLayerSize = 1000 int numOutputs = 10 def (trainSet, testSet) = [true, false].collect { new MnistDataSetIterator(batchSize, it, rngSeed) } def log = LoggerFactory.getLogger(getClass()) def conf = new NeuralNetConfiguration.Builder() .seed(rngSeed) //include the random seed for reproducibility // use stochastic gradient descent as an optimization algorithm .updater(new Nesterovs(0.006, 0.9)) .l2(1e-4) .list() // input layer: .layer(new DenseLayer.Builder() .nIn(numInputs) .nOut(hiddenLayerSize) .activation(Activation.RELU) .weightInit(WeightInit.XAVIER) .build()) // hidden layer: .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD) .nIn(hiddenLayerSize) .nOut(numOutputs) .activation(Activation.SOFTMAX) .weightInit(WeightInit.XAVIER) .build()) .build() … … log.info("Creating model ...") def model = new MultiLayerNetwork(conf) model.init() log.info("Training model ...") model.listeners = new ScoreIterationListener(100) model.fit(trainSet, numEpochs) // model.save('mlp_model.dat' as File) log.info("Evaluating model...") def eval = model.evaluate(testSet) log.info(eval.stats()) ======== Evaluation Metrics ======== # of classes: 10 Accuracy: 0.9728 Precision: 0.9728 Recall: 0.9726 F1 Score: 0.9726 Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j- examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java 0 1 2 9 … … … 10 28 * 28 = 784 1000
  130. … MNIST with DeepLearning4J Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j- examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java 0 1

    2 9 … … … 10 28 * 28 = 784 500 int rngSeed = 123 int batchSize = 64 int numEpochs = 15 int numInputs = 28 * 28 int numOutputs = 10 double rate = 0.0015 def (trainSet, testSet) = [true, false].collect { new MnistDataSetIterator(batchSize, it, rngSeed) } def log = LoggerFactory.getLogger(getClass()) def conf = new NeuralNetConfiguration.Builder() .seed(rngSeed) //include the random seed for reproducibility // use stochastic gradient descent as an optimization algorithm .updater(new Nadam()) .l2(rate * 0.005) // regularized .list() // first input layer: .layer(new DenseLayer.Builder() .nIn(numInputs) .nOut(500) .build()) // second input layer: .layer(new DenseLayer.Builder() .nIn(500) .nOut(100) .build()) // hidden layer: .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD) .nOut(numOutputs) .activation(Activation.SOFTMAX) .build()) .build() … log.info("Creating model ...") def model = new MultiLayerNetwork(conf) model.init() log.info("Training model ...") model.listeners = new ScoreIterationListener(100) model.fit(trainSet, numEpochs) log.info("Evaluating model...") def eval = model.evaluate(testSet) log.info(eval.stats()) ======== Evaluation Metrics ======== # of classes: 10 Accuracy: 0.9820 Precision: 0.9820 Recall: 0.9818 F1 Score: 0.9819 100
  131. MNIST GUI with GroovyFX import javafx.embed.swing.SwingFXUtils import javafx.scene.canvas.GraphicsContext import javafx.scene.image.Image

    import javafx.scene.image.PixelFormat import javafx.scene.paint.Color import javax.imageio.ImageIO static clear(GraphicsContext g, size) { g.fill = Color.WHITE; g.fillRect(0, 0, size, size); g.fill = Color.BLACK } static int[] imageToArray(Image img) { int w = img.width int h = img.height int[] buf = new int[h * w] img.pixelReader.getPixels(0, 0, w, h, PixelFormat.intArgbInstance, buf, 0, w) buf } static Image snapshot(canvas) { def baos = new ByteArrayOutputStream() ImageIO.write(SwingFXUtils.fromFXImage(canvas.snapshot(null, null), null), "png", baos) new Image(new ByteArrayInputStream(baos.toByteArray()), 28, 28, true, true) } static displayResult(List items, Integer predict) { items.indexed().collect{ idx, val -> def marker = (idx == predict && items[predict] > 0.5) ? ' **' : '' "$idx ${val ? sprintf('%.4f', val) + marker : '?'}" }.join('\n') } A few more utility methods
  132. MNIST GUI with GroovyFX class MnistInfer { private static RealMatrix

    inputHidden private static RealMatrix hiddenOutput static void load(Class klass) { def weights = klass.classLoader.getResourceAsStream('weights') try (ObjectInputStream ois = new ObjectInputStream(weights)) { inputHidden = MatrixUtils.createRealMatrix((double[][]) ois.readObject()) hiddenOutput = MatrixUtils.createRealMatrix((double[][]) ois.readObject()) } } static RealMatrix query(double[] inputArray) { RealMatrix inputs = createColumnRealMatrix(inputArray) RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs return scalarSigmoid(finalInputs) } static double[] normalize(int[] img) { double[] result = new double[img.length] for (int i = 0; i < img.length; i++) { int red = (img[i] >> 16) & 0xff int green = (img[i] >> 8) & 0xff int blue = img[i] & 0xff result[i] = 1 - ((red + green + blue) / 765.0 * 999 + 1) / 1000 } return result } } Predictor class
  133. MNIST GUI with GroovyFX … start { stage(title: 'MNIST', visible:

    true) { scene(id: "scene", fill: WHITE) { borderPane { top { canvas(id: "canvas", width: size, height: size) def g = canvas.graphicsContext2D clear(g, size) canvas.onMouseDragged { e -> g.fillOval e.x - pen, e.y - pen, pen * 2, pen * 2 } } center { hbox(alignment: 'Center') { button('Clear', onAction: { clear(canvas.graphicsContext2D, size) out.text = displayResult([null] * 10, null) }) button('Predict', onAction: { def result = Mnist.query(Mnist.scale(imageToArray(snapshot(canvas)))) def predictLabel = Mnist.maxIndex(result) out.text = displayResult(result.data.collect{ it[0] }, predictLabel) }) } } bottom { textArea(displayResult([null] * 10, null), id: 'out', editable: false, prefColumnCount: 16) } } } } }
  134. Apache MXNET Object detection import groovy.swing.SwingBuilder import javax.imageio.ImageIO import org.apache.mxnet.infer.javaapi.ObjectDetector

    import org.apache.mxnet.javaapi.* import static javax.swing.WindowConstants.DISPOSE_ON_CLOSE static void downloadUrl(String srcDirUrl, String destDir, String filename) { def destination = new File(destDir, filename) if (!destination.parentFile.exists()) { destination.parentFile.mkdirs() } if (!destination.exists()) { destination.bytes = new URL(srcDirUrl + filename).bytes } } static downloadModelImage() { String baseDir = "${System.getProperty('java.io.tmpdir')}/resnetssd/" def imageName = 'dog-ssd.jpg' def modelName = 'resnet50_ssd_model' String imgURL = "https://s3.amazonaws.com/model-server/inputs/" println "Downloading image to ${baseDir}..." downloadUrl(imgURL, baseDir, imageName) println "Downloading model files to ${baseDir}... (may take a while)" String modelURL = "https://s3.amazonaws.com/model-server/models/resnet50_ssd/" downloadUrl(modelURL, baseDir, "$modelName-symbol.json") downloadUrl(modelURL, baseDir, "$modelName-0000.params") downloadUrl(modelURL, baseDir, 'synset.txt') [baseDir + imageName, baseDir + modelName] } … Utility methods for downloading test image and pre-trained model Test image
135. Apache MXNet Object detection

…
static detectObjects(String modelPath, String imagePath,

    inputShape) { def context = [Context.cpu()] def inputDescriptor = new DataDesc("data", inputShape, DType.Float32(), "NCHW") def objDet = new ObjectDetector(modelPath, [inputDescriptor], context, 0) return objDet.imageObjectDetect(ObjectDetector.loadImageFromFile(imagePath), 3) } def (imagePath, modelPath) = downloadModelImage() def image = ImageIO.read(imagePath as File) def (w, h) = image.with{ [it.width, it.height] } def count = 1 // batch size def channels = 3 // for RGB def inputShape = new Shape(count, channels, w * 0.8 as int, h) def results = detectObjects(modelPath, imagePath, inputShape).sum() def boxes = results.collect {[ xmin: w * it.XMin as int, ymin: h * it.YMin as int, xmax: w * it.XMax as int, ymax: h * it.YMax as int ]} def names = results.collect{ it.className + sprintf(' %.3f', it.probability) } (0..<names.size()).each{ println "${names[it]} ${boxes[it]}" } Image.drawBoundingBox(image, boxes, names) new SwingBuilder().edt { frame(title: "${results.size()} detected objects", size: [w, h], show: true, defaultCloseOperation: DISPOSE_ON_CLOSE) { label(icon: imageIcon(image: image)) } } car 0.996 [xmin:439, ymin:62, xmax:715, ymax:173] bicycle 0.986 [xmin:144, ymin:128, xmax:547, ymax:453] dog 0.919 [xmin:102, ymin:184, xmax:344, ymax:541]
136. Scaling up Data Science

What if your big data framework doesn't have the algorithm you need? Framework catalogues typically cover items such as:

• Preprocessing, partition based dataset, linear regression, K-Means clustering, genetic algorithms, multilayer perceptron, decision trees, k-NN classification, k-NN regression, SVM binary classification, SVM multi-class classification, model cross validation, logistic regression, random forest, gradient boosting, model updating, ANN (Approximate Nearest Neighbor), deep learning
• Summary statistics, correlations, stratified sampling, hypothesis testing, random data generation, classification and regression (support vector machines, logistic regression, linear regression, decision trees, naive Bayes), alternating least squares (ALS), K-Means clustering, latent Dirichlet allocation (LDA), singular value decomposition (SVD), principal component analysis (PCA), feature extraction, transformation functions, optimization algorithms

If the algorithm you need isn't there, re-formulate it using map/filter/reduce (a minimal sketch follows).
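A minimal, framework-free sketch of that re-formulation (not from the slides): a mean expressed as a per-chunk map step plus a reduce step, so each chunk could be handled on a different core or cluster node.

def data = (1..100000).collect { it as double }
def chunks = data.collate(10000)                              // split
def partials = chunks.collect { [it.sum(), it.size()] }       // map: per-chunk [sum, count]
def (sum, count) = partials.inject([0d, 0]) { acc, p ->       // reduce: combine partials
    [acc[0] + p[0], acc[1] + p[1]]
}
println "mean = ${sum / count}"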
137. House pricing revisited – Rejigging our algorithm

[Diagram] Original approach: Data → Split → Training/Test; Fit on Training → Model; Predict and Evaluate on Test → Stats.
Rejigged approach: Data → split into Chunk1…ChunkN → Fit each chunk → Model1…ModelN → Aggregate → Model;
then Data → Chunk1…ChunkN → Predict and Evaluate against the aggregated model → Stats1…StatsN → Aggregate → Stats.
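Before the GPars and Beam versions, here is a framework-free sequential sketch of the rejigged flow (not from the slides; it uses synthetic data and an assumed Smile 1.x dependency so it runs standalone): fit an OLS model per chunk, average the per-chunk models, then pool the per-chunk RMSEs weighted by chunk size.

@Grab('com.github.haifengl:smile-core:1.5.3')   // dependency/version are assumptions; any Smile 1.x with this OLS API works
import smile.regression.OLS
import static java.lang.Math.sqrt

// synthetic rows shaped like [target, feature1, feature2] so the sketch runs standalone
def rnd = new Random(42)
double[][] data = (1..12000).collect {
    double f1 = rnd.nextDouble() * 10
    double f2 = rnd.nextDouble() * 5
    [3 + 2 * f1 - f2 + rnd.nextGaussian() * 0.1, f1, f2] as double[]
} as double[][]

def chunks = data.toList().collate(3000)                       // split

def models = chunks.collect { chunk ->                         // fit one model per chunk
    new OLS(chunk.collect { it[1..-1] } as double[][],
            chunk.collect { it[0] } as double[]).with { [it.intercept(), *it.coefficients()] }
}
def model = models.transpose()*.sum().collect { it / models.size() }   // aggregate: average coefficients

def stats = chunks.collect { chunk ->                          // evaluate each chunk against the model
    def residuals = chunk.collect { row ->
        model[0] + (1..<row.size()).sum { i -> model[i] * row[i] } - row[0]
    }
    [sqrt(residuals.sum { it * it } / chunk.size()), chunk.size()]
}
println 'RMSE: ' + sqrt(stats.sum { rmse, n -> rmse * rmse * n } / data.size())   // aggregate stats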
138. House pricing revisited - GPars

import org.apache.commons.math3.stat.StatUtils
import smile.math.Math
import smile.regression.OLS
import tech.tablesaw.api.*
import groovyx.gpars.GParsPool
import static java.lang.Math.sqrt

def features = [
    'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15',
    'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors'
]

def file = getClass().classLoader.getResource('kc_house_data.csv').file
def table = Table.read().csv(file)
table = table.dropWhere(table.column("bedrooms").isGreaterThan(30))
def data = table.as().doubleMatrix(*features)
…

• Read CSV in normal way
• Split data into chunks
• Fit each chunk and aggregate resulting models
• Split data into chunks
• Evaluate each chunk against model and aggregate resulting stats
139. House pricing revisited - GPars

…
GParsPool.withPool {
    def trainChunkSize = 3000
    def numTrainChunks = 8
    def models = (0..<numTrainChunks).collectParallel {
        def list = data.toList().shuffled().take(trainChunkSize)
        new OLS(list.collect{ it[1..-1] } as double[][],
                list.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] }
    }
    def model = models.transpose()*.sum().collect{ it/numTrainChunks }
    println "Intercept: ${model[0]}"
    println "Coefficients: ${model[1..-1].join(', ')}"
    double[] coefficients = model[1..-1]
    double intercept = model[0]
    def stats = { chunk ->
        def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
        def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
        def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
        [rmse, residuals.average(), chunk.size()]
    }
    def evalChunkSize = 2000
    def results = data.collate(evalChunkSize).collectParallel(stats)
    println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size())
    println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size()
}

• Read CSV in normal way
• Split data into chunks
• Fit each chunk and aggregate resulting models
• Split data into chunks
• Evaluate each chunk against model and aggregate resulting stats
140. House pricing revisited - GPars

(Same GPars script as the previous slide, now with its output.)

Intercept: -3.16904445043026E7
Coefficients: -21131.141578491588, -5431.675525641348, 179.47932136012156, 11.273983523229091,
657752.8815669084, -10.658550821995767, 86876.08541623666, 61708.58759419169,
611717.9878928157, -21819.967894981055
RMSE: 215185.8177879586
mean: -2151.974648800721
141. House pricing revisited - GPars

(Same GPars script again, annotated with how it maps onto the chunked pipeline:
shuffle/take produces the training chunks, OLS#intercept/OLS#coefficients fits Model1…ModelN,
transpose/sum aggregates them into the model, collate re-chunks the data for evaluation,
the stats closure produces Stats1…StatsN, and results.collect aggregates the final stats.)
  142. House pricing revisited – Apache Beam … class MeanDoubleArrayCols implements

    SerializableFunction<Iterable<double[]>, double[]> { @Override double[] apply(Iterable<double[]> input) { double[] sum = null def count = 0 for (double[] next : input) { if (sum == null) { sum = new double[next.size()] (0..<sum.size()).each{ sum[it] = 0.0d } } (0..<sum.size()).each{ sum[it] += next[it] } count++ } if (sum != null) (0..<sum.size()).each{ sum[it] /= count } return sum } } class AggregateModelStats implements SerializableFunction<Iterable<double[]>, double[]> { @Override double[] apply(Iterable<double[]> input) { double[] sum = null for (double[] next : input) { if (sum == null) { sum = new double[next.size()] (0..<sum.size()).each{ sum[it] = 0.0d } } def total = sum[2] + next[2] sum[0] = sqrt((sum[2] * sum[0] * sum[0] + next[2] * next[0] * next[0]) / total) sum[1] = (sum[2] * sum[1] + next[2] * next[1]) / total sum[2] = total } return sum } } … Aggregate models Aggregate stats
  143. House pricing revisited – Apache Beam … static buildPipeline(Pipeline p)

    { def features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def readCsvChunks = new DoFn<String, double[][]>() { @ProcessElement void processElement(@Element String path, OutputReceiver<double[][]> receiver) throws IOException { def chunkSize = 6000 def table = Table.read().csv(path) table = table.dropWhere(table.column("bedrooms").isGreaterThan(30)) def idxs = 0..<table.rowCount() for (nextChunkIdxs in idxs.shuffled().collate(chunkSize)) { def chunk = table.rows(*nextChunkIdxs) receiver.output(chunk.as().doubleMatrix(*features)) } } } def fitModel = new DoFn<double[][], double[]>() { @ProcessElement void processElement(@Element double[][] rows, OutputReceiver<double[]> receiver) throws IOException { double[] model = new OLS(rows.collect{ it[1..-1] } as double[][], rows.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] } receiver.output(model) } } def evalModel = { double[][] chunk, double[] model -> double intercept = model[0] double[] coefficients = model[1..-1] def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(sumSq(residuals as double[]) / chunk.size()) def mean = residuals.average() [rmse, mean, chunk.size()] as double[] } … Map filename to chunks Map chunk to model Map chunk to stats
  144. House pricing revisited – Apache Beam … def model2out =

    new DoFn<double[], String>() { @ProcessElement void processElement(@Element double[] ds, OutputReceiver<String> out) { out.output("intercept: ${ds[0]}, coeffs: ${ds[1..-1].join(', ')}".toString()) } } def stats2out = new DoFn<double[], String>() { @ProcessElement void processElement(@Element double[] ds, OutputReceiver<String> out) { out.output("rmse: ${ds[0]}, mean: ${ds[1]}, count: ${ds[2]}".toString()) } } var csvChunks = p .apply(Create.of('/path/to/kc_house_data.csv')) .apply('Create chunks', ParDo.of(readCsvChunks)) var model = csvChunks .apply('Fit chunks', ParDo.of(fitModel)) .apply(Combine.globally(new MeanDoubleArrayCols())) var modelView = model .apply(View.<double[]>asSingleton()) csvChunks .apply(ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)) .apply(Combine.globally(new AggregateModelStats())) .apply('Log stats', ParDo.of(stats2out)).apply(Log.ofElements()) model .apply('Log model', ParDo.of(model2out)).apply(Log.ofElements()) } def pipeline = Pipeline.create() buildPipeline(pipeline) pipeline.run().waitUntilFinish() INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274 INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0 Map model to output Map stats to output Create and run pipeline Build pipeline
145. House pricing revisited – Apache Beam

…
var csvChunks = p
    | Create.of('/path/to/kc_house_data.csv')
    | 'Create chunks' >> ParDo.of(readCsvChunks)
var model = csvChunks
    | 'Fit chunks' >> ParDo.of(fitModel)
    | Combine.globally(new MeanDoubleArrayCols())
var modelView = model
    | View.<double[]>asSingleton()
csvChunks
    | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)
    | Combine.globally(new AggregateModelStats())
    | 'Log stats' >> ParDo.of(stats2out) | Log.ofElements()
model
    | 'Log model' >> ParDo.of(model2out) | Log.ofElements()
}

Some optional metaprogramming for a Python-like coding style:
PCollection.metaClass.or = { List arg -> delegate.apply(*arg) }
PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) }
String.metaClass.rightShift = { PTransform arg -> [delegate, arg] }
Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) }

Create and run pipeline:
def pipeline = Pipeline.create()
buildPipeline(pipeline)
pipeline.run().waitUntilFinish()

INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077,
198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333,
83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274
INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0
146. House pricing revisited – Apache Beam

(Same piped pipeline, metaprogramming and output as the previous slide, annotated with the
data flow: readCsvChunks produces csvChunks, fitModel fits Model1…ModelN which
MeanDoubleArrayCols averages into the model, the model becomes a side input (modelView)
to evalModel, and AggregateModelStats combines the per-chunk stats.)
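One hedged variation not shown on the slides: Pipeline.create() above runs with default options, i.e. Beam's direct runner. To target a different runner, the options would typically be built from the script's command-line arguments first:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory

// 'args' is the standard Groovy script binding for command-line arguments
def options = PipelineOptionsFactory.fromArgs(args as String[]).withValidation().create()
def pipeline = Pipeline.create(options)
buildPipeline(pipeline)
pipeline.run().waitUntilFinish()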
147. Optimization Overview

Optimization:
• Finding (optimal) solutions

Approaches:
• Dynamic programming
• Linear programming
• Non-linear programming
• Constraint programming
• Integer programming

Aspects:
• Combinatorial optimization
• Continuous optimization

Applications:
• Timetabling/scheduling
• Vehicle routing
• Packing problems
• Trading systems
• Logistics
• Language detection

Abbreviations:
CP – Constraint Programming
CIP – Constraint Integer Programming
MINLP – Mixed Integer Non-linear Programming
MIP – Mixed Integer Linear Programming
IP – Integer Linear Programming
SAT – Satisfiability
LP – Linear Programming
148. Constraint Programming

Typical domains:
• Boolean (SAT problem)
• integer
• rational
• interval (scheduling problems)
• linear (approaches to non-linear problems do exist)
• finite sets
• mixed (2 or more of above)

Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf
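To make the paradigm concrete before the larger examples, here is a minimal sketch (not from the slides) using the same Choco Solver dependency the cryptarithmetic example below grabs: declare finite-domain variables, post constraints, then ask the solver for a solution.

@Grab('org.choco-solver:choco-solver:4.10.0')
import org.chocosolver.solver.Model

def model = new Model('tiny CP example')
def x = model.intVar('x', 0, 9)            // finite integer domain 0..9
def y = model.intVar('y', 0, 9)
model.arithm(x, '+', y, '=', 10).post()    // x + y == 10
model.arithm(x, '>', y).post()             // x > y
println model.solver.findSolution()        // prints one satisfying assignment, e.g. x=9, y=1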
149. Cryptarithmetic

    S E N D
+   M O R E
= M O N E Y

Replace each letter with a decimal digit so that the addition is arithmetically valid.
Different letters must map to different digits, and the leading letters S and M cannot be zero.
  150. Cryptarithmetic: Brute force def solutions(): # letters = ('s', 'e',

    'n', 'd', 'm', 'o', 'r', 'y') all_solutions = list() for s in range(1, 10): for e in range(0, 10): for n in range(0, 10): for d in range(0, 10): for m in range(1, 10): for o in range(0, 10): for r in range(0, 10): for y in range(0, 10): if len({s, e, n, d, m, o, r, y}) == 8: send = 1000 * s + 100 * e + 10 * n + d more = 1000 * m + 100 * o + 10 * r + e money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if send + more == money: all_solutions.append((send, more, money)) return all_solutions print(solutions()) S E N D + M O R E = M O N E Y [(9567, 1085, 10652)]
  151. Cryptarithmetic: Brute force def solutions() { // letters = ['s',

    'e', 'n', 'd', 'm', 'o', 'r', 'y'] def all_solutions = [] for (s in 1..<10) for (e in 0..9) for (n in 0..9) for (d in 0..9) for (m in 1..9) for (o in 0..9) for (r in 0..9) for (y in 0..9) if ([s, e, n, d, m, o, r, y].toSet().size() == 8) { def send = 1000 * s + 100 * e + 10 * n + d def more = 1000 * m + 100 * o + 10 * r + e def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if (send + more == money) all_solutions.add([send, more, money]) } return all_solutions } print(solutions()) S E N D + M O R E = M O N E Y [[9567, 1085, 10652]]
  152. Cryptarithmetic: Brute force from itertools import permutations def solution2(): letters

    = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y') digits = range(10) for perm in permutations(digits, len(letters)): sol = dict(zip(letters, perm)) if sol['s'] == 0 or sol['m'] == 0: continue send = 1000 * sol['s'] + 100 * sol['e'] + 10 * sol['n'] + sol['d'] more = 1000 * sol['m'] + 100 * sol['o'] + 10 * sol['r'] + sol['e'] money = 10000 * sol['m'] + 1000 * sol['o'] + 100 * sol['n'] + 10 * sol['e'] + sol['y'] if send + more == money: return send, more, money print(solution2()) S E N D + M O R E = M O N E Y (9567, 1085, 10652)
153. Cryptarithmetic: Brute force

def solution2() {
    def digits = 0..9
    for (p in digits.permutations()) {
        // only the first 8 digits are used, so each assignment appears twice
        // (once per ordering of the 2 unused digits); skip one of the pair
        if (p[-1] < p[-2]) continue
        def (s, e, n, d, m, o, r, y) = p
        if (s == 0 || m == 0) continue
        def send = 1000 * s + 100 * e + 10 * n + d
        def more = 1000 * m + 100 * o + 10 * r + e
        def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
        if (send + more == money) return [send, more, money]
    }
}
print(solution2())

S E N D + M O R E = M O N E Y

[9567, 1085, 10652]
  154. from csp import Constraint, CSP from typing import Dict, List,

    Optional class SendMoreMoneyConstraint(Constraint[str, int]): def __init__(self, letters: List[str]) -> None: super().__init__(letters) self.letters: List[str] = letters def satisfied(self, assignment: Dict[str, int]) -> bool: # not a solution if duplicate values if len(set(assignment.values())) < len(assignment): return False # if all vars assigned, check if correct if len(assignment) == len(self.letters): s: int = assignment["S"] e: int = assignment["E"] n: int = assignment["N"] d: int = assignment["D"] m: int = assignment["M"] o: int = assignment["O"] r: int = assignment["R"] y: int = assignment["Y"] send: int = s * 1000 + e * 100 + n * 10 + d more: int = m * 1000 + o * 100 + r * 10 + e money: int = m * 10000 + o * 1000 + n * 100 + e * 10 + y return send + more == money return True … Cryptarithmetic: Constraint programming … letters = ["S", "E", "N", "D", "M", "O", "R", "Y"] possible_digits = {} for letter in letters: possible_digits[letter] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] possible_digits["M"] = [1] # can't start with 0 constraint csp: CSP[str, int] = CSP(letters, possible_digits) csp.add_constraint(SendMoreMoneyConstraint(letters)) solution: Optional[Dict[str, int]] = csp.backtracking_search() if solution is None: print("No solution found!") else: print(solution) S E N D + M O R E = M O N E Y {'S': 9, 'E': 5, 'N': 6, 'D': 7, 'M': 1, 'O': 0, 'R': 8, 'Y': 2} Adapted from: https://freecontent.manning.com/constraint-satisfaction-problems-in-python/
155. Cryptarithmetic: Constraint programming

@Grab('org.choco-solver:choco-solver:4.10.0')
import org.chocosolver.solver.Model
import org.chocosolver.solver.variables.IntVar

def model = new Model("SEND+MORE=MONEY")
def S = model.intVar("S", 1, 9)
def E = model.intVar("E", 0, 9)
def N = model.intVar("N", 0, 9)
def D = model.intVar("D", 0, 9)
def M = model.intVar("M", 1, 9)
def O = model.intVar("O", 0, 9)
def R = model.intVar("R", 0, 9)
def Y = model.intVar("Y", 0, 9)
model.allDifferent(S, E, N, D, M, O, R, Y).post()
…
// SEND + MORE - MONEY == 0, expressed as one scalar (weighted sum) constraint
IntVar[] ALL =    [    S,   E,  N, D,    M,   O,  R, E,      M,     O,    N,   E,  Y]
int[]    COEFFS = [ 1000, 100, 10, 1, 1000, 100, 10, 1, -10000, -1000, -100, -10, -1]
model.scalar(ALL, COEFFS, "=", 0).post()
model.solver.findSolution()

S E N D + M O R E = M O N E Y

Solution: S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2,
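As a follow-up (an assumed usage, not shown on the slide), the solver can also enumerate every solution, which confirms the puzzle's answer is unique:

def solutions = model.solver.findAllSolutions()
println "${solutions.size()} solution(s) found"   // expect: 1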
156. Dietary restrictions

Minimise cost of diet given:
• Must be at least 300 calories
• Not more than 10 grams of protein
• Not less than 10 grams of carbohydrates
• Not less than 8 grams of fat
• At least 0.5 units of fish
• No more than 1 unit of milk

                    Bread   Milk   Cheese   Potato   Fish   Yogurt
Cost                  2.0    3.5      8.0      1.5   11.0      1.0
Protein (g)           4.0    8.0      7.0      1.3    8.0      9.2
Fat (g)               1.0    5.0      9.0      0.1    7.0      1.0
Carbohydrates (g)    15.0   11.7      0.4     22.6    0.0     17.0
Calories               90    120      106       97    130      180
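Written out as the linear program that each of the following slides encodes:

minimise  2.0*bread + 3.5*milk + 8.0*cheese + 1.5*potato + 11.0*fish + 1.0*yogurt
subject to
  protein:        4.0*bread +  8.0*milk + 7.0*cheese +  1.3*potato + 8.0*fish +  9.2*yogurt <= 10
  fat:            1.0*bread +  5.0*milk + 9.0*cheese +  0.1*potato + 7.0*fish +  1.0*yogurt >= 8
  carbohydrates: 15.0*bread + 11.7*milk + 0.4*cheese + 22.6*potato + 0.0*fish + 17.0*yogurt >= 10
  calories:        90*bread +  120*milk + 106*cheese +   97*potato + 130*fish +  180*yogurt >= 300
  fish >= 0.5,  0 <= milk <= 1,  bread, cheese, potato, yogurt >= 0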
  157. Diet problem (ojalgo) def model = new ExpressionsBasedModel() def bread

    = model.addVariable("Bread").lower(0) def milk = model.addVariable("Milk").lower(0).upper(1) def cheese = model.addVariable("Cheese").lower(0) def potato = model.addVariable("Potato").lower(0) def fish = model.addVariable("Fish").lower(0.5) def yogurt = model.addVariable("Yogurt").lower(0) def cost = model.addExpression("Cost") cost.set(bread, 2.0).set(milk, 3.5).set(cheese, 8.0).set(potato, 1.5).set(fish, 11.0).set(yogurt, 1.0) def protein = model.addExpression("Protein").upper(10) protein.set(bread, 4.0).set(milk, 8.0).set(cheese, 7.0).set(potato, 1.3).set(fish, 8.0).set(yogurt, 9.2) def fat = model.addExpression("Fat").lower(8) fat.set(bread, 1.0).set(milk, 5.0).set(cheese, 9.0).set(potato, 0.1).set(fish, 7.0).set(yogurt, 1.0) def carbs = model.addExpression("Carbohydrates").lower(10) carbs.set(bread, 15.0).set(milk, 11.7).set(cheese, 0.4).set(potato, 22.6).set(fish, 0.0).set(yogurt, 17.0) def calories = model.addExpression("Calories").lower(300) calories.set(bread, 90).set(milk, 120).set(cheese, 106).set(potato, 97).set(fish, 130).set(yogurt, 180) def result = model.minimise() // for a variation, see: // https://www.ojalgo.org/2019/05/the-diet-problem/ … OPTIMAL 0.0 @ { 0.0, 0.0, 0.40813898143741034, 1.8538791051880057, 0.5916230366492151, 0.0 } ############################################ 0 <= Bread: 0 0 <= Milk: 0 <= 1 0 <= Cheese: 0.408139 0 <= Potato: 1.853879 0.5 <= Fish: 0.591623 0 <= Yogurt: 0 10 <= Carbohydrates: 42.060923 8 <= Fat: 8.0 Cost: 12.553784 Protein: 10.0 <= 10 300 <= Calories: 300.0 ############################################
  158. Diet problem (Choco) def model = new Model("Diet problem") def

    unbounded = 100000 // scale quantities by 10, coefficients by 10, products by 100 def bread = model.intVar("Bread", 0, unbounded) def milk = model.intVar("Milk", 0, 10) def cheese = model.intVar("Cheese", 0, unbounded) def potato = model.intVar("Potato", 0, unbounded) def fish = model.intVar("Fish", 5, unbounded) def yogurt = model.intVar("Yogurt", 0, unbounded) IntVar[] all = [bread, milk, cheese, potato, fish, yogurt] def cost = model.intVar("Cost", 0, unbounded) model.scalar(all, [20, 35, 80, 15, 110, 10] as int[], "=", cost).post() def protein = model.intVar("Protein", 0, 1000) model.scalar(all, [40, 80, 70, 13, 80, 92] as int[], "=", protein).post() def fat = model.intVar("Fat", 800, unbounded) model.scalar(all, [10, 50, 90, 1, 70, 10] as int[], "=", fat).post() def carbs = model.intVar("Carbohydrates", 1000, unbounded) model.scalar(all, [150, 117, 4, 226, 0, 170] as int[], "=", carbs).post() def calories = model.intVar("Calories", 30000, unbounded) model.scalar(all, [900, 1200, 1060, 970, 1300, 1800] as int[], "=", calories).post() model.setObjective(Model.MINIMIZE, cost) def found = model.solver.findSolution() if (found) { all.each { println "$it.name: ${it.value / 10}" } [carbs, fat, protein, calories, cost].each { println "$it.name: ${it.value / 100}" } } else { println "No solution" } Bread: 0 Milk: 0 Cheese: 0.5 Potato: 1.9 Fish: 0.5 Yogurt: 0 Carbohydrates: 43.14 Fat: 8.19 Protein: 9.97 Calories: 302.3 Cost: 12.35 Choco supports RealVar but doesn’t have as rich a set of possible constraints for such vars.
  159. Diet problem (Choco) def model = new Model("Diet problem") def

    unbounded = 1000.0d def precision = 0.00001d // scale quantities by 10, coefficients by 10, products by 100 def bread = model.realVar("Bread", 0.0, unbounded, precision) def milk = model.realVar("Milk", 0.0, 1.0, precision) def cheese = model.realVar("Cheese", 0.0, unbounded, precision) def potato = model.realVar("Potato", 0.0, unbounded, precision) def fish = model.realVar("Fish", 0.5, unbounded, precision) def yogurt = model.realVar("Yogurt", 0.0, unbounded, precision) RealVar[] all = [bread, milk, cheese, potato, fish, yogurt] def scalarIbex = { coeffs, var -> def (a, b, c, d, e, f) = coeffs model.realIbexGenericConstraint("$a*{0}+$b*{1}+$c*{2}+$d*{3}+$e*{4}+$f*{5}={6}", [*all, var] as RealVar[]).post(); } def cost = model.realVar("Cost", 0.0, unbounded, precision) scalarIbex([2.0, 3.5, 8.0, 1.5, 11.0, 1.0], cost) def protein = model.realVar("Protein", 0.0, 10.0, precision) scalarIbex([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], protein) def fat = model.realVar("Fat", 8.0, unbounded, precision) scalarIbex([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], fat) def carbs = model.realVar("Carbohydrates", 10.0, unbounded, precision) scalarIbex([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], carbs) def calories = model.realVar("Calories", 300, unbounded, precision) scalarIbex([90, 120, 106, 97, 130, 180], calories) model.setObjective(Model.MINIMIZE, cost) def found = model.solver.findSolution() Bread: 0.025131 .. 0.025137 Milk: 0.000009 .. 0.000010 Cheese: 0.428571 .. 0.428571 Potato: 1.848118 .. 1.848124 Fish: 0.561836 .. 0.561836 Yogurt: 0.000007 .. 0.000010 Carbohydrates: 42.316203 .. 42.316211 Fat: 8.000000 .. 8.000005 Protein: 9.997920 .. 9.997926 Calories: 300.000000 .. 300.000008 Cost: 12.431241 .. 12.431245 Choco does have a plugin (via JNI) for the Ibex C++ constraint processing library which does handle real numbers. def pretty = { var -> def bounds = found.getRealBounds(var) printf "%s: %.6f .. %.6f%n", var.name, *bounds } if (found) { all.each { pretty(it) } [carbs, fat, protein, calories, cost].each { pretty(it) } } else { println "No solution" }
  160. Diet problem (Apache Commons Math) import org.apache.commons.math3.optim.linear.* import org.apache.commons.math3.optim.nonlinear.scalar.GoalType import

    static org.apache.commons.math3.optim.linear.Relationship.* def cost = new LinearObjectiveFunction([2.0, 3.5, 8.0, 1.5, 11.0, 1.0] as double[], 0) static scalar(coeffs, rel, val) { new LinearConstraint(coeffs as double[], rel, val) } def bread_min = scalar([1, 0, 0, 0, 0, 0], GEQ, 0) def milk_min = scalar([0, 1, 0, 0, 0, 0], GEQ, 0) def milk_max = scalar([0, 1, 0, 0, 0, 0], LEQ, 1) def cheese_min = scalar([0, 0, 1, 0, 0, 0], GEQ, 0) def potato_min = scalar([0, 0, 0, 1, 0, 0], GEQ, 0) def fish_min = scalar([0, 0, 0, 0, 1, 0], GEQ, 0.5) def yogurt_min = scalar([0, 0, 0, 0, 0, 1], GEQ, 0) def protein = scalar([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], LEQ, 10) def fat = scalar([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], GEQ, 8) def carbs = scalar([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], GEQ, 10) def calories = scalar([90, 120, 106, 97, 130, 180], GEQ, 300) LinearConstraintSet constraints = [bread_min, milk_min, milk_max, fish_min, cheese_min, potato_min, yogurt_min, protein, fat, carbs, calories] def solution = new SimplexSolver().optimize(cost, constraints, GoalType.MINIMIZE) if (solution != null) { printf "Opt: %.2f%n", solution.value println solution.point.collect{ sprintf '%.2f', it }.join(', ') } Opt: 12.08 -0.00, 0.05, 0.45, 1.87, 0.50, 0.00
  161. Diet problem (Google OrTools) import … Loader.loadNativeLibraries() static addConstraint(MPSolver solver,

    List<MPVariable> vars, MPVariable comp, List<Double> coeffs) { solver.makeConstraint(0, 0).tap {constraint -> constraint.setCoefficient(comp, -1) vars.indices.each { i -> constraint.setCoefficient(vars[i], coeffs[i]) } } } MPSolver.createSolver("SCIP").with {solver -> def bread = makeNumVar(0, infinity(),'bread') def milk = makeNumVar(0, 1, 'milk') def cheese = makeNumVar(0, infinity(), 'cheese') def potato = makeNumVar(0, infinity(), 'potato') def fish = makeNumVar(0.5, infinity(), 'fish') def yogurt = makeNumVar(0, infinity(), 'yogurt') def food = [bread, milk, cheese, potato, fish, yogurt] …
  162. Diet problem (Google OrTools) … def cost = makeNumVar(0, infinity(),'Cost')

    addConstraint(solver, food, cost, [2.0, 3.5, 8.0, 1.5, 11.0, 1.0]) def protein = makeNumVar(0, 10,'Protein') addConstraint(solver, food, protein, [4.0, 8.0, 7.0, 1.3, 8.0, 9.2]) def fat = makeNumVar(8, infinity(),'Fat') addConstraint(solver, food, fat, [1.0, 5.0, 9.0, 0.1, 7.0, 1.0]) def carbs = makeNumVar(10, infinity(),'Carbs') addConstraint(solver, food, carbs, [15.0, 11.7, 0.4, 22.6, 0.0, 17.0]) def cals = makeNumVar(300, infinity(),'Calories') addConstraint(solver, food, cals, [90, 120, 106, 97, 130, 180]) def components = [protein, fat, carbs, cals] objective().with {objective -> objective.setCoefficient(cost, 1) objective.setMinimization() def result = solve() println result if (result == OPTIMAL) { println "Solution: " + objective.value() println "Foods: ${food.collect{ "${it.name()} ${it.solutionValue()}" }}" println "Components: ${components.collect{ "${it.name()} ${it.solutionValue()}" }}" println "Iterations: ${iterations()}, Wall time: ${wallTime()}ms" } else { System.err.println "The problem does not have an optimal solution!" } } }
163. Diet problem (Google OrTools)

(Same OrTools code as the previous slide, now with its output.)

OPTIMAL
Solution: 12.081337881108173
Foods: [bread 0.0, milk 0.05359877488514538, cheese 0.44949881665042474,
potato 1.8651677572045107, fish 0.5, yogurt 0.0]
Components: [Protein 10.0, Fat 8.0, Carbs 42.95969650563832, Calories 300.0]
Iterations: 4, Wall time: 89ms
  164. 225 objectcomputing.com © 2021, Object Computing, Inc. (OCI). All rights

    reserved. Groovy Community Information • groovy.apache.org • groovy-lang.org • github.com/apache/groovy • groovycommunity.com • [email protected][email protected] • @ApacheGroovy • objectcomputing.com/training/catalog/groovy-programming/ © 2021 Object Computing, Inc. (OCI). All rights reserved.
  165. objectcomputing.com © 2021, Object Computing, Inc. (OCI). All rights reserved.

    226 CONNECT WITH US 1+ (314) 579-0066 @objectcomputing objectcomputing.com THANK YOU Find me on twitter @paulk_asert © 2021 Object Computing, Inc. (OCI). All rights reserved.
166. References

• Introductory guide on linear programming:
  https://www.analyticsvidhya.com/blog/2017/02/lintroductory-guide-on-linear-programming-explained-in-simple-english/
• Computational Mixed-Integer Programming:
  https://www.birs.ca/cmo-workshops/2018/18w5208/files/GleixnerAmbros.pdf
• Integer Linear Programming: Graphical Introduction:
  https://www.youtube.com/watch?v=RhHhy-8sz-4
• Integer Linear Programming: Binary Constraints:
  https://www.youtube.com/watch?v=-3my1TkyFiM
  https://www.youtube.com/watch?v=B3biWsBLeCw
  https://www.youtube.com/watch?v=MO8uQnIch6I
• Linear Programming: Alternate solutions, Infeasibility, Unboundedness, & Redundancy:
  https://www.youtube.com/watch?v=eMA0LWsRQQ