Groovy Data Science

Groovy is a powerful multi-paradigm programming language for the JVM that offers a wealth of features that make it ideal for many data science and big data scenarios.

* Groovy has a dynamic nature like Python, which means that it is very powerful, easy to learn, and productive. The language gets out of the way and lets data scientists write their algorithms naturally.

* Groovy has a static nature like Java and Kotlin, which makes it fast when needed. Its close alignment with Java means that you can often just cut-and-paste the Java examples from various big data solutions and they'll work just fine in Groovy.

* Groovy has first-class functional support, meaning that it offers features and allows solutions similar to Scala. Functional and stream processing with immutable data structures can offer many advantages when working in parallel processing or clustered environments.
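
To make these three flavors concrete, here is a small sketch added for this summary (it is not taken from the deck and needs nothing beyond a standard Groovy install), showing the dynamic, statically compiled and functional styles side by side:

    import groovy.transform.CompileStatic

    // dynamic style: no type declarations needed, types are resolved at runtime
    def doubled = [1, 2, 3].collect { it * 2 }
    assert doubled == [2, 4, 6]

    // static style: @CompileStatic gives Java-like type checking and bytecode
    @CompileStatic
    int sumOfSquares(List<Integer> nums) {
        int total = 0
        for (int n in nums) { total += n * n }
        total
    }
    assert sumOfSquares([1, 2, 3]) == 14

    // functional style: closures and method chaining over an immutable range
    assert (1..10).findAll { it % 2 == 0 }.sum() == 30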

These slides review the key benefits of using Groovy to develop data science solutions, covering its integration with the JVM libraries commonly used in data science, including libraries for data manipulation, machine learning and plotting, as well as various big data solutions for scaling up these algorithms.

Math/Data Science libraries covered include:
Weka, Smile, Tablesaw, Apache Commons Math, Jupyter/BeakerX notebooks, DeepLearning4J.

Libraries for scaling/concurrency include:
Apache Spark, Apache Ignite, Apache MXNet, GPars, Apache Beam.

paulking

June 19, 2019

Transcript

  1. objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved.

    No part of these notes may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written permission of Object Computing, Inc. (OCI) Data Science with Groovy Presented by Dr Paul King
  2. 2 objectcomputing.com © 2019, Object Computing, Inc. (OCI). All rights

    reserved. Dr Paul King OCI Groovy Lead V.P. and PMC Chair Apache Groovy Author: https://www.manning.com/books/groovy-in-action-second-edition https://speakerdeck.com/paulk/groovy-data-science https://github.com/paulk-asert/groovy-data-science @paulk_asert
  3. objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved.

    3 REIMAGINE TOGETHER Grails
  4. objectcomputing.com © 2018, Object Computing, Inc. (OCI). All rights reserved.

    4 REIMAGINE TOGETHER
  5. What is Groovy? It’s like a super version of Java:

    • Supports most Java syntax but allows simpler syntax for many constructs • Supports all Java libraries but provides many extensions and its own productivity libraries • Has both a static and dynamic nature • Extensible language and tooling
  6. Groovy has a low learning curve, particularly for Java developers

  7. Groovy by the numbers: Downloads Popular & growing 2016: 23M

    2017: 50M 2018: 103M currently: approx. 20M+ per month
  8. Groovy by the numbers: TIOBE Sep/Oct 2019 Groovy ranked 11th.

    Not perfect index but still great to be: • Closing in on 10% rating of Java • Fastest growing of top ranking languages • Almost double the rating of all the other alternative JVM languages combined
  9. Java-like or concise shortcuts import java.util.List; import java.util.ArrayList; class Main

    { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } } names = ["Ted", "Fred", "Jed", "Ned"] println names shortNames = names.findAll{ it.size() < 4 } println shortNames.size() shortNames.each{ println it }
  10. Concise syntax including DSL friendly import java.util.List; import java.util.ArrayList; class

    Main { private List keepShorterThan(List strings, int length) { List result = new ArrayList(); for (int i = 0; i < strings.size(); i++) { String s = (String) strings.get(i); if (s.length() < length) { result.add(s); } } return result; } public static void main(String[] args) { List names = new ArrayList(); names.add("Ted"); names.add("Fred"); names.add("Jed"); names.add("Ned"); System.out.println(names); Main m = new Main(); List shortNames = m.keepShorterThan(names, 4); System.out.println(shortNames.size()); for (int i = 0; i < shortNames.size(); i++) { String s = (String) shortNames.get(i); System.out.println(s); } } } names = ["Ted", "Fred", "Jed", "Ned"] println names shortNames = names.findAll{ it.size() < 4 } println shortNames.size() shortNames.each{ println it } given the names "Ted", "Fred", "Jed" and "Ned" display all the names display the number of names having size less than 4 display the names having size less than 4
  11. None
  12. None
  13. Metaprogramming

  14. Matrix manipulation: Java vs Groovy • Same example • Same

    library Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}} import org.apache.commons.math3.linear.*; public class MatrixMain { public static void main(String[] args) { double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}}; RealMatrix m = MatrixUtils.createRealMatrix(matrixData); double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}}; RealMatrix n = new Array2DRowRealMatrix(matrixData2); RealMatrix o = m.multiply(n); // Invert p, using LU decomposition RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse(); RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d); RealMatrix q = o.add(p.power(2)); System.out.println(q); } } Thanks to operator overloading and extensible tooling
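
For comparison, here is a hedged sketch (not the exact code from the slide) of how the same calculation can read in Groovy with Apache Commons Math 3.x: the * and ** operators delegate to the library's multiply() and power() methods, while add() and the scalar operations are still called explicitly because RealMatrix has no plus() method for Groovy's + operator to map onto:

    @Grab('org.apache.commons:commons-math3:3.6.1')
    import org.apache.commons.math3.linear.*

    def m = MatrixUtils.createRealMatrix([[1d, 2d, 3d], [2d, 5d, 3d]] as double[][])
    def n = MatrixUtils.createRealMatrix([[1d, 2d], [2d, 5d], [1d, 7d]] as double[][])
    def o = m * n                                         // '*' delegates to multiply()
    def oInverse = new LUDecomposition(o).solver.inverse  // property access for getSolver().getInverse()
    def p = oInverse.scalarAdd(1d).scalarMultiply(2d)
    def q = o.add(p ** 2)                                 // '**' delegates to power()
    println q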
  15. groovyConsole showing @Grab, power asserts and natural language processing

  16. REPL for Groovy (groovysh)

  17. Command chains and extensible type system Text file? No it’s

    a statically typed Groovy program which won’t compile if the animal names aren’t valid. A constraint programming solver is used to find the solution. cranes have 2 legs tortoises have 4 legs there are 7 animals there are 20 legs display solution Cranes 4 Tortoises 3
  18. Command chains and extensible type system cranes have 2 legs

    tortoises have 4 legs millipedes have 1000 legs there are 8 animals there are 1020 legs display solution Cranes 4 Tortoises 3 Millipedes 1
  19. IDE support IntelliJ IDEA, Apache NetBeans, Eclipse, Groovy Web Console

  20. Groovy frameworks plus entire Java ecosystem

  21. Data Science Process Research Goals Obtain Data Data Preparation Data

    Exploration Visualization Data Modeling Data ingestion Data storage Data processing platforms Modeling algorithms Math libraries Graphics processing Integration Deployment
  22. Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267 Python/Scala/R Libraries/Tools/Frameworks for Data Science

  23. Groovy/Java Libraries/Tools/Frameworks for Data Science Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267 Machine Learning DeepLearning4J

    Apache Spark ML Apache MXNet Visualization GroovyFX JFreeChart Tablesaw Jupyter/beakerx Data Manipulation and Analysis Tablesaw Apache Commons CSV OpenCSV Mathematics and Engineering GPars Apache Commons Math Smile Weka Groovylab Apache Spark Apache Ignite Apache Beam
  24. Obtain data Identify sources • Internal/External (data providers) • Databases,

    files, events, IoT • Unstructured/structured • Text, XML, JSON, YAML, CSV/spreadsheets, audio, images, video, web services • Apache Tika • JExcel • Apache POI Clarify ownership Check quality def jsonSlurper = new JsonSlurper() def object = jsonSlurper.parseText('{ "myList": [4, 8, 15, 16, 23, 42] }') assert object instanceof Map assert object.myList instanceof List assert object.myList == [4, 8, 15, 16, 23, 42] def response = new XmlSlurper().parseText(books) def authorResult = response.value.books.book[0].author assert authorResult.text() == 'Miguel de Cervantes' def qry = 'SELECT * FROM Author' assert sql.rows(qry, 1, 3)*.firstname == ['Dierk', 'Paul', 'Guillaume'] assert sql.rows(qry, 4, 3)*.firstname == ['Hamlet', 'Cedric', 'Erik'] assert sql.rows(qry, 7, 3)*.firstname == ['Jon']
  25. Data preparation Collections: • Java: lists, maps, sets • Groovy:

    literal notation, GDK methods, GPath • Libraries • Google Guava https://github.com/google/guava • Apache Common Collections https://commons.apache.org/collections/ DataFrames: • Joinery https://cardillo.github.io/joinery/ A data frame implementation in the spirit of Pandas or R data frames with show/plot • Tablesaw https://jtablesaw.github.io/tablesaw/ Java dataframe and visualization library • Apache Spark DataFrames https://spark.apache.org/docs/latest/sql-programming- guide.html Equivalent to a table in a relational database or a data frame in R/Python • Paleo https://github.com/netzwerg/paleo Immutable Java 8 data frames with typed columns (including primitives)
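
As a small illustration of the Groovy bullet above (literal collection notation, GDK methods and GPath navigation), a self-contained sketch not taken from the deck:

    def people = [
        [name: 'Alice', age: 34, langs: ['Groovy', 'Python']],
        [name: 'Bob',   age: 28, langs: ['Java']]
    ]
    assert people*.name == ['Alice', 'Bob']                  // spread operator
    assert people.findAll { it.age > 30 }*.name == ['Alice'] // GDK filtering
    assert people.langs.flatten().size() == 3                // GPath walks into each map
    def byName = people.collectEntries { [it.name, it.age] }
    assert byName.Bob == 28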
  26. Data exploration - DEX

  27. Data exploration - Weka More info: https://www.youtube.com/watch?v=7quZv6WCTQc

  28. Visualization • Open-Source plotting libraries: GRAL, JFreeChart, Xchart, JMathPlot, Jzy3d,

    JavaFX, GroovyFX • Types: Scatter plot, Line plot, Area plot, Pie plot, Donut plot, Horizontal bars, Vertical bars, Bubble plot, Radar plot, Box plot, Raster plot, Contour plot, Gauge plot, Step plot, Gantt plot, Histogram • Features: Multiple axes, Logarithmic axes, Line smoothing, Data filtering/aggregation
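
As a minimal sketch of driving one of the listed libraries (JFreeChart) from a Groovy script, with illustrative @Grab coordinates and chart details:

    @Grab('org.jfree:jfreechart:1.5.3')
    import org.jfree.chart.ChartFactory
    import org.jfree.chart.ChartPanel
    import org.jfree.chart.plot.PlotOrientation
    import org.jfree.data.xy.XYSeries
    import org.jfree.data.xy.XYSeriesCollection
    import javax.swing.JFrame

    def series = new XYSeries('y = x squared')
    (0..10).each { series.add(it as double, (it * it) as double) }
    def chart = ChartFactory.createXYLineChart('Squares', 'x', 'y',
            new XYSeriesCollection(series), PlotOrientation.VERTICAL, true, true, false)
    new JFrame('Chart demo').with {
        contentPane = new ChartPanel(chart)
        pack()
        visible = true
    }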
  29. Notebooks supporting Groovy • Jupyter/beakerx • Apache Zeppelin • GroovyLab

    • Seco
  30. Notebooks supporting Groovy • Jupyter/beakerx • Apache Zeppelin • GroovyLab

    • Seco
  31. Data preparation IO: • Java in-built • Groovy enhancements •

    Libraries • Apache Commons IO https://commons.apache.org/io/ • AOL Cyclops-React https://github.com/aol/cyclops-react Web scraping: • GEB Batch processing/ETL: • Spring Batch http://projects.spring.io/spring-batch/ • Apache Camel • Apache Gobblin
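
The "Groovy enhancements" for IO mentioned above are GDK methods added to classes such as java.io.File; a quick sketch (the file name is purely illustrative):

    def f = new File('data.txt')
    f.text = 'alpha\nbeta\ngamma\n'               // write the whole file in one go
    assert f.readLines().size() == 3              // read it back as a list of lines
    f.withWriterAppend { w -> w << 'delta\n' }    // append via a managed writer
    f.eachLine { line, n -> println "$n: $line" } // line numbers start at 1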
  32. Math libraries Math and stats libraries • Java: log, exp,

    other basic methods. Libraries: • Apache Commons Math http://commons.apache.org/math/ statistics, optimization, and linear algebra • Apache Mahout http://mahout.apache.org/ linear algebra, plus distributed linear algebra and machine learning • JBlas http://jblas.org/ optimized and very fast linear algebra package that uses the BLAS library Also, many of the machine learning libraries come with some math functionality, often linear algebra, stats, and optimization.
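
As one small example, basic descriptive statistics with Apache Commons Math (the version in the @Grab is an assumption):

    @Grab('org.apache.commons:commons-math3:3.6.1')
    import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

    def stats = new DescriptiveStatistics()
    [2.3, 4.5, 6.7, 1.2, 8.9].each { stats.addValue(it as double) }
    printf 'mean=%.2f stddev=%.2f median=%.2f%n', stats.mean, stats.standardDeviation, stats.getPercentile(50)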
  33. Data storage Database: • Java’s JDBC • Groovy JDBC enhancements

    and Datasets • Other NOSQL • Apache Cassandra • MongoDB • Apache CouchDB • Neo4j import com.gmongo.GMongo def db = new GMongo().getDB('athletes') db.athletes.drop() db.athletes << [ first: 'Paul', last: 'Tergat', dob: '1969-06-17', runs: [[distance: 42195, time: 2 * 60 * 60 + 4 * 60 + 55, venue: 'Berlin', when: '2003-09-28']] ] println "World records following $marathon3.venue $marathon3.when:" def t = new Traversal() for (Path p in t.description().breadthFirst(). relationships(MyRelationshipTypes.supercedes). evaluator(Evaluators.fromDepth(1)). uniqueness(Uniqueness.NONE). traverse(marathon3)) { def newRow = p.endNode() println "$newRow.venue $newRow.when" } Graph g = new Neo4jGraph(graphDb) def pretty = { it.collect { "$it.venue $it.when" }.join(', ') } def results = [] g.V('venue', 'London').fill(results) println 'London world records: ' + pretty(results) results = [] g.V('venue', 'London').in('supercedes').fill(results) println 'World records after London: ' + pretty(results)
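
The relational side ("Groovy JDBC enhancements and Datasets") is not shown above, so here is a minimal groovy.sql.Sql sketch against an in-memory HSQLDB; the database coordinates and data are illustrative only:

    @GrabConfig(systemClassLoader=true)
    @Grab('org.hsqldb:hsqldb:2.5.1')
    import groovy.sql.Sql

    def sql = Sql.newInstance('jdbc:hsqldb:mem:demo', 'sa', '', 'org.hsqldb.jdbc.JDBCDriver')
    sql.execute 'CREATE TABLE athlete (id INTEGER, firstname VARCHAR(64))'
    sql.executeInsert "INSERT INTO athlete VALUES (1, 'Paul')"
    sql.executeInsert "INSERT INTO athlete VALUES (2, 'Dierk')"
    assert sql.rows('SELECT * FROM athlete ORDER BY id')*.firstname == ['Paul', 'Dierk']
    sql.eachRow('SELECT firstname FROM athlete WHERE id = ?', [1]) { row -> println row.firstname }
    sql.close()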
  34. Machine learning and data mining Machine learning and data mining

    libraries: • Weka http://www.cs.waikato.ac.nz/ml/weka/ • JavaML http://java-ml.sourceforge.net/ old and reliable ML library but not actively updated • Smile http://haifengl.github.io/smile/ ML library under active development • JSAT https://github.com/EdwardRaff/JSAT contains many machine learning algorithms • H2O http://www.h2o.ai/ distributed Java ML framework also supports Scala, R and Python • Apache Mahout http://mahout.apache.org/ distributed linear algebra framework • Apache Spark ml and mllib https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples
  35. Neural networks and text processing Neural networks: • Encog http://www.heatonresearch.com/encog/

    • DeepLearning4j http://deeplearning4j.org/ natively supports Keras models Text processing • Java: StringTokenizer, the java.text package, regular expressions • Groovy: Regex enhancements, templates, String enhancements • Libraries: • Apache Lucene https://lucene.apache.org/ • Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/ • Apache OpenNLP https://opennlp.apache.org/ • LingPipe http://alias-i.com/lingpipe/ • GATE https://gate.ac.uk/ • MALLET http://mallet.cs.umass.edu/ • Smile http://haifengl.github.io/smile/
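
Groovy's own string handling covers a lot of light text processing before any of those NLP libraries are needed; a small sketch, not from the deck:

    def text = 'Groovy makes the JVM groovier'
    assert text.split(/\s+/).size() == 5                 // regex split
    def hits = (text =~ /(?i)groov\w*/).collect { it }   // iterate all matches
    assert hits == ['Groovy', 'groovier']
    def engine = new groovy.text.SimpleTemplateEngine()
    def report = engine.createTemplate('Found ${n} matches: ${hits.join(", ")}').make(n: hits.size(), hits: hits)
    assert report.toString() == 'Found 2 matches: Groovy, groovier'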
  36. Scaling up Microservice frameworks: • Micronaut Distributed data processing: •

    Apache Hadoop • Apache Spark • Apache Flink • Apache Samza • Apache Kafka • Apache Storm • Apache Apex • Apache Beam • Apache Ignite • Apache Nifi
  37. Scaling up: Apache Ignite Ignite™ is a memory-centric distributed database,

    caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale
  38. Scaling up: Apache Spark Source: https://spark.apache.org/ and https://renovacloud.com/en/an-introduction-to-and-evaluation-of-apache-spark-for-big-data-architectures/ Apache Spark™

    is a unified analytics engine for large-scale data processing
  39. Scaling up: GPars Website: http://gpars.org/ • A Concurrency & Parallelism

    Framework for Groovy and Java Data Parallelism: Fork/Join Map/Reduce Fixed coordination (for collections) Actors Explicit coordination Safe Agents Delegated coordination Dataflow Implicit coordination def decryptor = actor { loop { react { String message -> reply message.reverse() } } } def console = actor { decryptor << 'lellarap si yvoorG' react { println 'Decrypted message: ' + it } } console.join() final def df = new Dataflows() task { df.z = df.x + df.y } task { df.x = 10 } task { df.y = 5 } assert df.z == 15 GParsPool.withPool { def result = [1, 2, 3, 4].collectParallel{it * 2} assert result == [2, 4, 6, 8] } 5 10 15 x y z + def agent = new Agent([]) agent { it << 'Dave' } agent { it << 'Joe' } assert agent.val.size() == 2 class Account { private amount = newTxnInteger(0) void transfer(final int a) { atomic { amount.increment(a) } } int getCurrentAmount() { atomicWithInt { amount.get() } } }
  40. Scaling up: Apache Beam Apache Beam is an open source,

    unified model for defining both batch and streaming data-parallel processing pipelines Source: https://beam.apache.org/get-started/beam-overview/
  41. Computer science algorithms Decision problems Search problems Counting problems Optimization

    problems • Paradigms: brute force, divide & conquer, search & enumerate, randomized, complexity reduction, recursive, back tracking, graph • Optimization approaches: linear programming, dynamic programming, greedy, heuristics
  42. Data science algorithms Data Mining Statistics Machine Learning Optimization •

    Analytics: descriptive, predictive, prescriptive • Analysis: anomaly detection, classification, regression, clustering, association, optimization, dimension reduction • Data relationship: linear, non-linear • Assumptions: parametric, non-parametric • Strategy: supervised, unsupervised, reinforcement • Combining: ensemble, boosting
  43. Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

  44. Data Science Algorithms Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/

  45. Regression Overview Regression: • Relationship between variables Types: • Linear

    • Logistic • Polynomial • Stepwise • Ridge • Lasso • ElasticNet Aspects: • Under/over fitting • Outliers Applications: • House price prediction • Market predictions • Ratings prediction • Weather forecast
  46. Linear regression case study: House price predictions See also: https://nicolai92.github.io/posts/pentaho-weka-prediction

    Dataset: https://www.kaggle.com/harlfoxem/housesalesprediction/
  47. House price predictions – view data graphically import org.apache.commons.math3.random.EmpiricalDistribution import

    static groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.collect { it.bedrooms.toInteger() } def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) } def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] } start { stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) { scene { barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) { series(name: 'Number of properties', data: bins) } } } }
  48. House price predictions – view data graphically

  49. House price predictions – view stats import org.apache.commons.math3.stat.descriptive.SummaryStatistics import static

    org.apache.commons.csv.CSVFormat.RFC4180 as CSV def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.collect { it.bedrooms.toInteger() } def stats = new SummaryStatistics() all.each{ stats.addValue(it as double) } println stats.summary n: 21613 min: 0.0 max: 33.0 mean: 3.370841623097218 std dev: 0.930061831147451 variance: 0.8650150097573497 sum: 72854.0
  50. House price predictions – investigate outliers import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV

    def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) csv.findAll{ it.bedrooms.toInteger() > 10 }.each{ println it.toMap() as TreeMap } [bathrooms:3, bedrooms:11, condition:3, date:20140821T000000, floors:2, grade:7, id:1773100755, lat:47.556, long:-122.363, price:520000, sqft_above:2400, sqft_basement:600, sqft_living:3000, sqft_living15:1420, sqft_lot:4960, sqft_lot15:4960, view:0, waterfront:0, yr_built:1918, yr_renovated:1999, zipcode:98106] [bathrooms:1.75, bedrooms:33, condition:5, date:20140625T000000, floors:1, grade:7, id:2402100895, lat:47.6878, long:-122.331, price:640000, sqft_above:1040, sqft_basement:580, sqft_living:1620, sqft_living15:1330, sqft_lot:6000, sqft_lot15:4700, view:0, waterfront:0, yr_built:1947, yr_renovated:0, zipcode:98103]
  51. House price predictions – investigate outliers Some libraries optionally support

    CSV -> richly-typed domain classes import com.opencsv.bean.* import groovy.transform.ToString @ToString(includeNames = true) class House { @CsvBindByName Integer bedrooms @CsvBindByName String bathrooms @CsvBindByName(column = 'sqft_lot') Integer area_lot } def file = 'kc_house_data.csv' as File def builder = new CsvToBeanBuilder(new FileReader(file)) def rows = builder.withType(House).build().parse() rows.findAll{ it.bedrooms > 10 }.each{ println it } House(bedrooms:11, bathrooms:3, area_lot:4960) House(bedrooms:33, bathrooms:1.75, area_lot:6000)
  52. House prices – remove outlier and view bedrooms import org.apache.commons.math3.random.EmpiricalDistribution

    import static groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv')) def all = csv.collect { it.bedrooms.toInteger() }.findAll{ it < 30 } def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) } def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] } start { stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) { scene { barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) { series(name: 'Number of properties', data: bins) } } } }
  53. House price predictions – remove outlier and view price import

    org.apache.commons.math3.random.EmpiricalDistribution import static groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def file = 'kc_house_data.csv' as File def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.findAll { it.bedrooms.toInteger() < 30 }.collect { it.price.toDouble() } def info = new SummaryStatistics(); all.each(info::addValue) def head = "Price percentile (min=\$$info.min, mean=\$${info.mean as int}, max=\$$info.max)" def dist = new EmpiricalDistribution(100).tap{ load(all as double[]) } def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] } start { stage(title: 'Price histogram', show: true, width: 800, height: 600) { scene { barChart(title: head, barGap: 0, categoryGap: 0) { series(name: 'Number of properties', data: bins) } } } }
  54. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
  55. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) 21613 rows X 21 cols
  56. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) Structure of kc_house_data.csv Index | Column Name | Column Type | ------------------------------------------- 0 | id | LONG | 1 | date | STRING | 2 | price | DOUBLE | 3 | bedrooms | INTEGER | 4 | bathrooms | DOUBLE | 5 | sqft_living | INTEGER | 6 | sqft_lot | INTEGER | 7 | floors | DOUBLE | 8 | waterfront | INTEGER | 9 | view | INTEGER | ... | ... | ... | 11 | grade | INTEGER | 12 | sqft_above | INTEGER | 13 | sqft_basement | INTEGER | 14 | yr_built | INTEGER | 15 | yr_renovated | INTEGER | 16 | zipcode | INTEGER | 17 | lat | DOUBLE | 18 | long | DOUBLE | 19 | sqft_living15 | INTEGER | 20 | sqft_lot15 | INTEGER |
  57. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) Column: bedrooms Measure | Value | ----------------------------------- n | 21613.0 | sum | 72854.0 | Mean | 3.370841623097218 | Min | 0.0 | Max | 33.0 | Range | 33.0 | Variance | 0.8650150097573497 | Std. Dev | 0.930061831147451 |
  58. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) kc_house_data.csv id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1773100755 | 20140821 | 520000.0 | 11 | 3.0 | 3000 | 4960 | 2.0 | 0 | 0 | 3 | 7 | 2400 | 600 | 1918 | 1999 | 98106 | 47.556 | -122.363 | 1420 | 4960 | 2402100895 | 20140625 | 640000.0 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1040 | 580 | 1947 | 0 | 98103 | 47.687 | -122.331 | 1330 | 4700 |
  59. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) 21612 rows X 21 cols
  60. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade')) kc_house_data.csv summary bedrooms | Mean [price] | Min [price] | Max [price] | ------------------------------------------------------------------- 0 | 409503.8461538461 | 139950.0 | 1295650.0 | 1 | 317642.8844221105 | 75000.0 | 1247000.0 | 2 | 401372.68188405805 | 78000.0 | 3278000.0 | 3 | 466232.0784812695 | 82000.0 | 3800000.0 | 4 | 635419.5042138912 | 100000.0 | 4489000.0 | 5 | 786599.8288569644 | 133000.0 | 7062500.0 | 6 | 825520.6360294118 | 175000.0 | 7700000.0 | 7 | 951184.6578947367 | 280000.0 | 3200000.0 | 8 | 1105076.923076923 | 340000.0 | 3300000.0 | 9 | 893999.8333333334 | 450000.0 | 1400000.0 | 10 | 819333.3333333334 | 650000.0 | 1148000.0 | 11 | 520000.0 | 520000.0 | 520000.0 |
  61. House price predictions – explore with tablesaw import tech.tablesaw.api.* import

    tech.tablesaw.plotly.Plot import tech.tablesaw.plotly.api.* import static tech.tablesaw.aggregate.AggregateFunctions.* def rows = Table.read().csv('kc_house_data.csv') println rows.shape() println rows.structure() println rows.column("bedrooms").summary().print() println rows.where(rows.column("bedrooms").isGreaterThan(10)) def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) println cleaned.shape() println cleaned.summarize("price", mean, min, max).by("bedrooms") Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
  62. House price predictions – explore with tablesaw cleaned.addColumns( StringColumn.create("waterfrontDesc", cleaned.column("waterfront").collect{

    it ? 'waterfront' : 'interior' }), DoubleColumn.create("scaledGrade", cleaned.column("grade").collect{ it * 2 }), DoubleColumn.create("scaledPrice", cleaned.column("price").collect{ it / 100000 }) ) Plot.show(BubblePlot.create("Price vs living area and grade (bubble size)", cleaned, "sqft_living", "price", "scaledGrade", "waterfrontDesc")) Plot.show(Scatter3DPlot.create("Grade, living space, bathrooms and price (bubble size)", cleaned, "sqft_living", "bathrooms", "grade", "scaledPrice", "waterfrontDesc"))
  63. House price predictions – explore with jupyter/beakerx

  64. Regression with Least Squares Find line of “best fit”:

  65. Regression with Least Squares Find line of “best fit”: •

    Find intercept and slope of line which minimizes the sum of the square of the residual errors Residual errors
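
For a single feature the "best fit" line has a closed form: slope = sum((x - xMean) * (y - yMean)) / sum((x - xMean)^2) and intercept = yMean - slope * xMean. A tiny Groovy sketch with toy numbers (not the house data):

    def xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    def ys = [2.1, 3.9, 6.2, 8.0, 9.9]
    def xMean = xs.sum() / xs.size()
    def yMean = ys.sum() / ys.size()
    def sxy = (0..<xs.size()).sum { (xs[it] - xMean) * (ys[it] - yMean) }
    def sxx = (0..<xs.size()).sum { (xs[it] - xMean) ** 2 }
    def slope = sxy / sxx
    def intercept = yMean - slope * xMean
    printf 'y = %.3f + %.3f * x%n', intercept, slope   // the line minimizing the sum of squared residuals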
  66. House price predictions – linear regression import org.apache.commons.math3.stat.regression.SimpleRegression import static

    groovyx.javafx.GroovyFX.start import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV def feature = 'bedrooms' def nonOutliers = feature == 'bedrooms' ? { it[0] < 30 } : { true } def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv')) def all = csv.collect { [it[feature].toDouble(), it.price.toDouble()] }.findAll(nonOutliers) def reg = new SimpleRegression().tap{ addData(all as double[][]) } def (min, max) = all.transpose().with{ [it[0].min(), it[0].max()] } def predicted = [[min, reg.predict(min)], [max, reg.predict(max)]] start { stage(title: 'Price vs Number of bedrooms', show: true, width: 800, height: 600) { scene { scatterChart { series(name: 'Actual', data: all) } lineChart { series(name: 'Predicted', data: predicted) } } } }
  67. House price predictions – linear regression • Can be repeated

    for other features
  68. House price predictions – linear regression

  69. House price predictions – evaluating regression @Canonical @FXBindable class Feature

    { String name double rr, rmseTrain, rmseTest, mean } def features = [new Feature('bedrooms'), new Feature('bathrooms'), new Feature('grade'), new Feature('sqft_living')] def idxs = 0..<features.size() def priceIdx = features.size() def file = getClass().classLoader.getResource('kc_house_data.csv').file def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file)) def all = csv.collect { row -> [*idxs.collect{ row[features[it].name].toDouble() }, row.price.toDouble()] }.findAll{ it[0] < 30 } def (train, test) = all.chop(all.size() * 0.8 as int, -1) def trainT = train.transpose() def regs = idxs.collect { idx -> new SimpleRegression().tap{ addData([trainT[idx], trainT[priceIdx]].transpose() as double[][]) } } def residuals = idxs.collect{ idx -> test.collect{ regs[idx].predict(it[idx]) - it[priceIdx] } }
  70. House price predictions – evaluating regression idxs.each { idx ->

    features[idx].rr = regs[idx].RSquare features[idx].rmseTrain = Math.sqrt(regs[idx].meanSquareError) features[idx].rmseTest = Math.sqrt(StatUtils.sumSq(residuals[idx] as double[]) / (test.size() - 1)) features[idx].mean = StatUtils.mean(residuals[idx] as double[]) } start { stage(title: "Error statistics for feature", visible: true) { scene(fill: groovyblue, width: 800, height:200) { stackPane(padding: 10) { tableView(items: features) { tableColumn(text: "Feature", property: 'name') tableColumn(text: "R²", property: 'rr') tableColumn(text: "RMSE (train)", property: 'rmseTrain') tableColumn(text: "RMSE (test)", property: 'rmseTest') tableColumn(text: "Residuals Mean", property: 'mean') } } } } }
  71. House price predictions – evaluating regression def maxError = [residuals[graphIdx].min(),

    residuals[graphIdx].max()].max{ Math.abs(it) } residuals[graphIdx] << maxError * -1 // make graph even around origin maxError = Math.abs(maxError) def step = maxError.toInteger() / 50 def dist = new EmpiricalDistribution(100).tap{ load(residuals[graphIdx] as double[]) } def ndist = new NormalDistribution(0, dist.sampleStats.standardDeviation) def bins = dist.binStats.indexed().collect { i, v -> [v.n ? v.mean.toInteger().toString(): (-maxError + i * step).toInteger().toString(), v.n] } def nbins = dist.binStats.indexed().collect { i, v -> def x = v.n ? v.mean.toInteger() : (-maxError + i * step).toInteger(); [x.toString(), ndist.probability(x, x + step)] } def scale = dist.binStats.max{ it.n }.n / nbins.max{ it[1] }[1] nbins = nbins.collect{ [it[0], it[1] * scale] } start { stage(title: "Error histogram for ${features[graphIdx].name}", show: true, width: 800, height: 600) { scene { barChart(title: 'Error percentile', barGap: 0, categoryGap: 0) { series(name: 'Error in prediction', data: bins) series(name: 'Normal distribution', data: nbins) } } } }
  72. House price predictions – evaluating regression

  73. House price predictions – multi linear regression def features =

    [ 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def idxs = 0..<features.size() def priceIdx = features.size() def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv')) def all = csv.collect { row -> [*idxs.collect{ row[features[it]].toDouble() }, row.price.toDouble()] }.findAll{ it[0] < 30 } def (train, test) = all.chop(all.size() * 0.8 as int, -1) def trainT = train.transpose() def price = trainT[priceIdx] def reg = new OLSMultipleLinearRegression().tap{ newSampleData(price as double[], trainT[0..<priceIdx].transpose() as double[][]) } def params = reg.estimateRegressionParameters() println params def predicted = test.collect { data -> params[0] + (1..<params.size()).collect{ data[it-1] * params[it] }.sum() } def residuals = test.indexed().collect { i, data -> predicted[i] - data[priceIdx] } def rr = reg.calculateRSquared() def rmseTrain = Math.sqrt(reg.calculateResidualSumOfSquares()) def rmseTest = Math.sqrt(StatUtils.sumSq(residuals as double[]) / (test.size() - 1)) def mean = StatUtils.mean(residuals as double[]) println "$rr $rmseTrain $rmseTest $mean" [-3.1635392225013282E7, -27484.961699434185, -5236.246584477492, 191.507648719199, 17.94379836055797, 657803.1115560957, 4.369533695386711, 74850.68999978402, 66459.03393303146, 604896.8569593901, -27203.796342244277] 0.6536012563023335 215379.66626496855 212609.34498021202 1886.6015077985683 R2: https://en.wikipedia.org/wiki/Coefficient_of_determination rmse: https://en.wikipedia.org/wiki/Root-mean-square_deviation
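
For reference, the two measures printed by these scripts follow the standard definitions from the linked pages (the test-set RMSE above divides by n - 1 rather than n, a small-sample correction):

    R^2  = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    RMSE = \sqrt{\frac{1}{n} \sum_i (\hat{y}_i - y_i)^2}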
  74. House price predictions – multi linear regression

  75. House price predictions – explore with Weka

  76. House price predictions – explore with Weka

  77. House price predictions – analyze with Weka

  78. House price predictions – analyze with Weka

  79. House price predictions – analyze with Weka

  80. House price predictions – scaling regression import … def rows

    = Table.read().csv('kc_house_data.csv') rows = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30)) String[] features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def data = rows.as().doubleMatrix(features) // configure to all run on local machine but could be a cluster (can be hidden in XML) def cfg = new IgniteConfiguration( peerClassLoadingEnabled: true, discoverySpi: new TcpDiscoverySpi( ipFinder: new TcpDiscoveryMulticastIpFinder( addresses: ['127.0.0.1:47500..47509'] ) ) ) static pretty(mdl, features) { def sign = { val -> val < 0 ? '- ' : '+ ' } def valIdx = { idx, val -> sprintf '%.2f*%s', val, features[idx+1] } def valIdxSign = { idx, val -> sign(val) + valIdx(idx, Math.abs(val)) } def valSign = { val -> sign(val) + sprintf('%.2f', Math.abs(val)) } def (w, i) = [mdl.weights, mdl.intercept] [valIdx(0, w.get(0)), *(1..<w.size()).collect{ valIdxSign(it, w.get(it)) }, valSign(i)].join(' ') } Read in data and select features of interest Adjust config for known local mode setup Utility method to pretty print model
  81. House price predictions – scaling regression … Ignition.start(cfg).withCloseable { ignite

    -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0)) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def split = new TrainTestDatasetSplitter().split(0.8) def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer) def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() }) def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM()) dataCache.destroy() println ">>> Model: " + pretty(mdl, features) println ">>> R^2 : " + rr println ">>> RMSE : " + rmse } Candidates for distributed processing
  82. House price predictions – scaling regression … Ignition.start(cfg).withCloseable { ignite

    -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0)) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def split = new TrainTestDatasetSplitter().split(0.8) def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer) def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() }) def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM()) dataCache.destroy() println ">>> Model: " + pretty(mdl, features) println ">>> R^2 : " + rr println ">>> RMSE : " + rmse } [15:02:09] Ignite node started OK (id=63b2d47d) >>> Ignite grid started for data: 21612 rows X 11 cols >>> Model: -33081.12*bedrooms - 16205.93*bathrooms + 225.40*sqft_living + 6.77*sqft_living15 - 8766.52*lat - 38.59*sqft_above + 96251.64*grade + 62303.58*view + 572116.79*waterfront - 11620.55*floors - 48504.18 >>> R^2 : 0.607023233473466 >>> RMSE : 228035.85816493785 [15:02:10] Ignite node stopped OK [uptime=00:00:01.465] See also: https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/regression/linear/BostonHousePricesPredictionExample.java
  83. House prices – scaling regression import … def spark =

    builder().config('spark.master', 'local[8]').appName('HousePrices').orCreate def file = '/path/to/kc_house_data.csv' int k = 5 Dataset<Row> ds = spark.read().format('csv') .options('header': 'true', 'inferSchema': 'true').load(file) double[] splits = [80, 20] def (training, test) = ds.randomSplit(splits) String[] colNames = ds.columns().toList().minus(['id', 'date', 'price']) def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features') Dataset<Row> dataset = assembler.transform(training) def lr = new LinearRegression(labelCol: 'price', maxIter: 10) def model = lr.fit(dataset) println 'Coefficients:' println model.coefficients().values()[1..-1].collect{ sprintf '%.2f', it }.join(', ') def testSummary = model.evaluate(assembler.transform(test)) printf 'RMSE: %.2f%n', testSummary.rootMeanSquaredError printf 'r2: %.2f%n', testSummary.r2 spark.stop() Coefficients: 41979.78, 80853.89, 0.15, 5412.83, 564343.22, 53834.10, 24817.09, 93195.29, -80662.68, -80694.28, -2713.58, 19.02, -628.67, 594468.23, -228397.19, 21.23, -0.42 RMSE: 187242.12 r2: 0.70
  84. Clustering Overview Clustering: • Grouping similar items Algorithm families: •

    Hierarchical • Partitioning k-means, x-means • Density-based • Graph-based Aspects: • Disjoint vs overlapping • Preset cluster number • Dimensionality reduction PCA • Nominal feature support Applications: • Market segmentation • Recommendation engines • Search result grouping • Social network analysis • Medical imaging
  85. Clustering https://commons.apache.org/proper/commons-math/userguide/ml.html

  86. Dimensionality reduction https://stats.stackexchange.com/questions/183236/what-is-the-relation-between-k-means-clustering-and-pca

  87. Clustering case study: Whiskey flavor profiles • 86 scotch whiskies

    • 12 flavor categories Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/ https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/
  88. Whiskey – exploring with Dex

  89. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def panel = new PlotPanel( *[0..<cols.size(), 0..<cols.size()].combinations().collect { first, second -> def color = new Color(72 + (first * 16), 72 + (second * 16), 200 - (first * 4) - (second * 4)) ScatterPlot.plot(table.as().doubleMatrix(cols[first], cols[second]), '#' as char, color) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?
  90. Hmm… not particularly enlightening Whiskey: How to visualize? Feature combinations?

  91. Another dimension? Whiskey: How to visualize? Feature combinations?

  92. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?
  93. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) new SwingBuilder().edt { frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) { widget(panel) } } Whiskey: How to visualize? Feature combinations?
  94. import … def table = Table.read().csv('whiskey.csv') table = table.removeColumns(0) def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW] def panel = new PlotPanel( *[cols, cols].combinations().collect { first, second -> Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors) } ) SwingUtil.show(size: [1200, 900], panel) Whiskey: How to visualize? Feature combinations?
  95. Pretty… but only marginally more useful Whiskey: How to visualize?

    Feature combinations?
  96. Clustering with KMeans Step 1: • Guess k cluster centroids

    at random
  97. Clustering with KMeans Step 1: • Guess k cluster centroids

  98. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  99. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid
  100. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  101. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  102. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  103. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points
  104. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  105. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  106. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
  107. Clustering with KMeans Step 1: • Guess k cluster centroids

    Step 2: • Assign points to closest centroid Step 3: • Calculate new centroids based on selected points Repeat steps 2 and 3 until stable or some limit reached
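
Before the library-based versions on the following slides, here is a naive plain-Groovy sketch of those three steps on a toy 2-D data set (illustration only; no handling of empty clusters or early convergence):

    def points = [[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]]
    int k = 2
    def dist2 = { a, b -> (0..<a.size()).sum { (a[it] - b[it]) ** 2 } }  // squared Euclidean distance

    def shuffled = new ArrayList(points)
    Collections.shuffle(shuffled, new Random(42))
    def centroids = shuffled.take(k)                        // step 1: guess k centroids at random
    10.times {
        def clusters = points.groupBy { p ->
            centroids.min { c -> dist2(p, c) }              // step 2: assign each point to its closest centroid
        }
        centroids = clusters.values().collect { members ->  // step 3: recompute each centroid as the cluster mean
            (0..<members[0].size()).collect { d -> members.sum { it[d] } / members.size() }
        }
    }
    points.each { p -> println "$p is closest to centroid ${centroids.min { c -> dist2(p, c) }}" }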
  108. import … def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco",

    "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def numClusters = 5 def loader = new CSVLoader(file: 'whiskey.csv') def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true) def instances = loader.dataSet instances.deleteAttributeAt(0) // remove RowID clusterer.buildClusterer(instances) println ' ' + cols.join(', ') def dataset = new DefaultCategoryDataset() clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num -> print "Cluster ${num+1}: " println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', ')) (1..cols.size()).each { idx -> dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) } } def clusters = (0..<numClusters).collectEntries{ [it, []] } clusterer.assignments.eachWithIndex { cnum, idx -> clusters[cnum] << instances.get(idx).stringValue(0) } clusters.each { k, v -> println "Cluster ${k+1}:" println v.join(', ') } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and weka Read CSV and create clusterer Display whiskies assigned to each cluster Graph cluster centroids
  109. import … def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco",

    "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def numClusters = 5 def loader = new CSVLoader(file: 'whiskey.csv') def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true) def instances = loader.dataSet instances.deleteAttributeAt(0) // remove RowID clusterer.buildClusterer(instances) println ' ' + cols.join(', ') def dataset = new DefaultCategoryDataset() clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num -> print "Cluster ${num+1}: " println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', ')) (1..cols.size()).each { idx -> dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) } } def clusters = (0..<numClusters).collectEntries{ [it, []] } clusterer.assignments.eachWithIndex { cnum, idx -> clusters[cnum] << instances.get(idx).stringValue(0) } clusters.each { k, v -> println "Cluster ${k+1}:" println v.join(', ') } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and weka Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000 Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591 Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591 Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133 Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714 Cluster 1: Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker Cluster 2: Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach, BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas, Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla Cluster 3: ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith, GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff, Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore Cluster 4: AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain, Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne, Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood, Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul Cluster 5: Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn, OldPulteney, RoyalBrackla, Tullibardine
  110. import … def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv')) def cols =

    ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def clusterer = new KMeansPlusPlusClusterer(5) def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) } def centroids = clusterer.cluster(data) println cols.join(', ') + ', Medoid' def dataset = new DefaultCategoryDataset() centroids.eachWithIndex{ ctrd, num -> def cpt = ctrd.center.point def closest = ctrd.points.min{ pt -> sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[]) } def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid" cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) } } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and medoids Same again with Apache Commons Math but also calculate medoids
  111. import … def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv')) def cols =

    ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def clusterer = new KMeansPlusPlusClusterer(5) def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) } def centroids = clusterer.cluster(data) println cols.join(', ') + ', Medoid' def dataset = new DefaultCategoryDataset() centroids.eachWithIndex{ ctrd, num -> def cpt = ctrd.center.point def closest = ctrd.points.min{ pt -> sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[]) } def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid" cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) } } def plot = new SpiderWebPlot(dataset: dataset) def chart = new JFreeChart('Whiskey clusters', plot) SwingUtil.show(new ChartPanel(chart)) Whiskey – clustering with radar plot and medoids Libraries: Apache Commons Math and JFreeChart Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid 2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd 2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy 2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish 1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore 1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach
  112. Whiskey – clustering with PCA plot import … def rows

    = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def projected = pca.project(data) def clusterer = new KMeans(data, 5) table = table.addColumns( DoubleColumn.create("PCA1", (0..<data.size()).collect{ projected[it][0] }), DoubleColumn.create("PCA2", (0..<data.size()).collect{ projected[it][1] }), StringColumn.create("Cluster", clusterer.clusterLabel.collect{ it.toString() }) ) def title = "Clusters x Principal Components" Plot.show(ScatterPlot.create(title, table, "PCA1", "PCA2", "Cluster"))
  113. Whiskey – clustering with 2-4D PCA plot import … def

    rows = Table.read().csv('whiskey.csv') def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = rows.as().doubleMatrix(*cols) def pca = new PCA(data) def dims = 4 // can be 2, 3 or 4 pca.projection = dims def projected = pca.project(data) def adj = [1, 1, 1, 5] def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect{ "Cluster " + (it+1) } rows = rows.addColumns( *(0..<dims).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{ adj[idx] * (projected[it][idx] + adj[idx]) }) }, StringColumn.create("Cluster", labels) ) def title = "Clusters x Principal Components" def type = dims == 2 ? ScatterPlot : Scatter3DPlot Plot.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, "Cluster"))
  114. Whiskey – clustering and visualizing centroids … def data =

    table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 3 def projected = pca.project(data) def clusterer = new KMeans(data, 5) def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) } table = table.addColumns( *(0..<3).collect { idx -> DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{ projected[it][idx] })}, StringColumn.create("Cluster", labels), DoubleColumn.create("Centroid", [10] * labels.size()) ) def centroids = pca.project(clusterer.centroids()) def toAdd = table.emptyCopy(1) (0..<centroids.size()).each { idx -> toAdd[0].setString("Cluster", "Cluster " + (idx+1)) (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) } toAdd[0].setDouble("Centroid", 50) table.append(toAdd) } def title = "Clusters x Principal Components w/ centroids" Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))
  115. Whiskey – Screeplot import … def rows = Table.read().csv('whiskey.csv') def

    cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0..<data.size()).collect { projected[it][idx - 1] }) } ) def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY] def symbols = ['*', 'Q', '#', 'Q', '*', '#'] (2..6).each { k -> def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..<k] as char[], colors[0..<k] as Color[]) } SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))
  116. import … def table = Table.read().csv('whiskey.csv') def cols = ["Body",

    "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey", "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"] def data = table.as().doubleMatrix(*cols) def pca = new PCA(data) pca.projection = 2 def plots = [PlotCanvas.screeplot(pca)] def projected = pca.project(data) table = table.addColumns( *(1..2).collect { idx -> DoubleColumn.create("PCA$idx", (0..<data.size()).collect { projected[it][idx - 1] }) } ) def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY] def symbols = ['*', 'Q', '#', 'Q', '*', '#'] (2..6).each { k -> def clusterer = new KMeans(data, k) double[][] components = table.as().doubleMatrix('PCA1', 'PCA2') plots << ScatterPlot.plot(components, clusterer.clusterLabel, symbols[0..<k] as char[], colors[0..<k] as Color[]) } SwingUtil.show(size: [1200, 900], new PlotPanel(*plots)) Whiskey – Screeplot
  117. Whiskey flavors – scaling clustering import … def rows =

    Table.read().csv('whiskey.csv') String[] features = [ 'Distillery', 'Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral' ] def data = rows.as().doubleMatrix(features) def cfg = new IgniteConfiguration(…) Ignition.start(cfg).withCloseable { ignite -> println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols" def trainer = new KMeansTrainer().withDistance(new EuclideanDistance()).withAmountOfClusters(5) def dataCache = new SandboxMLCache(ignite).fillCacheWith(data) def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST) def mdl = trainer.fit(ignite, dataCache, vectorizer) println ">>> KMeans centroids" println features[1..-1].join(', ') mdl.centers.each { c -> println c.all().collect{ sprintf '%.4f', it.get() }.join(', ') } dataCache.destroy() } [17:18:07] Ignite node started OK (id=ea92fc20) >>> Ignite grid started for data: 86 rows X 13 cols >>> KMeans centroids Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral 2.9091, 1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455 1.8095, 2.4762, 1.5714, 0.4286, 0.1429, 1.5238, 1.7143, 0.8571, 0.7143, 1.6667, 1.6190, 2.0476 2.5217, 2.3913, 1.4783, 0.0435, 0.0435, 1.6087, 1.4783, 1.9565, 2.0435, 2.0435, 1.9565, 1.3913 1.3704, 2.3704, 1.0000, 0.2593, 0.0370, 0.7778, 0.9259, 0.4074, 1.5185, 1.7407, 2.0000, 2.1111 3.2500, 2.2500, 1.5000, 0.0000, 0.0000, 3.0000, 2.0000, 1.0000, 1.5000, 2.5000, 2.2500, 2.0000 [17:18:07] Ignite node stopped OK [uptime=00:00:00.368]
  118. Whiskey flavors – scaling clustering import … def spark =

    builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate def file = '/path/to/whiskey.csv' int k = 5 Dataset<Row> rows = spark.read().format('com.databricks.spark.csv') .options('header': 'true', 'inferSchema': 'true').load(file) String[] colNames = rows.columns().toList().minus(['RowID', 'Distillery']) def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features') Dataset<Row> dataset = assembler.transform(rows) def clusterer = new KMeans(k: k, seed: 1L) def model = clusterer.fit(dataset) println 'Cluster centers:' model.clusterCenters().each{ println it.values().collect{ sprintf '%.2f', it }.join(', ') } spark.stop() Cluster centers: 1.73, 2.35, 1.58, 0.81, 0.19, 1.15, 1.42, 0.81, 1.23, 1.77, 1.23, 1.31 2.00, 1.00, 3.00, 0.00, 0.00, 0.00, 3.00, 1.00, 0.00, 2.00, 2.00, 2.00 2.86, 2.38, 1.52, 0.05, 0.00, 1.95, 1.76, 2.05, 1.81, 2.05, 2.19, 1.71 1.53, 2.38, 1.06, 0.16, 0.03, 1.09, 1.00, 0.50, 1.53, 1.75, 2.13, 2.28 3.67, 1.50, 3.67, 3.33, 0.67, 0.17, 1.67, 0.50, 1.17, 1.33, 1.17, 0.17
  119. Optimization Overview Optimization: • Finding (optimal) solutions Approaches: • Dynamic

    programming • Linear programming • Non-linear programming • Constraint programming • Integer programming Aspects: • Combinatorial optimization • Continuous optimization Applications: • Timetabling/scheduling • Vehicle routing • Packing problems • Trading systems • Logistics • Language detection CP Constraint Programming CIP Constraint Integer Programming MINLP Mixed Integer Non-linear Programming MIP Mixed Integer Linear Programming IP Integer Linear Programming SAT Satisfiability LP Linear Programming
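To make the Linear programming bullet above concrete, here is a minimal hedged example using the Apache Commons Math simplex solver, the same optim.linear classes applied to the diet problem later in the deck; the toy objective and constraints are invented purely for illustration:

import org.apache.commons.math3.optim.MaxIter
import org.apache.commons.math3.optim.linear.*
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType

// toy LP: maximise 2x + 3y subject to x + y <= 4 and x <= 2, with x, y >= 0
def objective = new LinearObjectiveFunction([2, 3] as double[], 0)
def constraints = new LinearConstraintSet(
    new LinearConstraint([1, 1] as double[], Relationship.LEQ, 4),
    new LinearConstraint([1, 0] as double[], Relationship.LEQ, 2))
def best = new SimplexSolver().optimize(new MaxIter(100), objective, constraints,
    GoalType.MAXIMIZE, new NonNegativeConstraint(true))
printf 'Optimum %.1f at (%s)%n', best.value, best.point.collect { sprintf '%.1f', it }.join(', ')

This should report an optimum of 12.0 at (0.0, 4.0); the diet problem slides later apply the same machinery to real constraints.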
  120. Constraint Programming Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf

  121. Constraint Programming Typical domains: • Boolean (SAT problem) • integer

    • rational • interval (scheduling problems) • linear (approaches to non- linear problems do exist) • finite sets • mixed (2 or more of above) Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf
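As a hedged warm-up for the finite integer domains listed above, a minimal Choco model (using the same choco-solver coordinates grabbed on the SEND+MORE=MONEY slide later) declares variables over small domains and posts constraints on them; the particular variables and constraints here are invented purely for illustration:

@Grab('org.choco-solver:choco-solver:4.10.0')
import org.chocosolver.solver.Model
import org.chocosolver.solver.variables.IntVar

def model = new Model('finite integer domains')
def x = model.intVar('x', 0, 9)   // each variable ranges over a finite integer domain
def y = model.intVar('y', 0, 9)
def z = model.intVar('z', 0, 9)
model.allDifferent(x, y, z).post()                                       // all values distinct
model.arithm(x, '<', y).post()                                           // x strictly less than y
model.scalar([x, y, z] as IntVar[], [1, 1, -1] as int[], '=', 0).post()  // x + y == z
println model.solver.findSolution()                                      // one satisfying assignment, or null if none

The solver propagates the constraints over the finite domains and backtracks where needed, which is the same approach used for the cryptarithmetic and diet examples that follow.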
  122. Cryptarithmetic S E N D + M O R E

    = M O N E Y Replace the letters with decimal digits to make a valid arithmetic sum
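Before the search-based solutions on the following slides, the condition a valid assignment must satisfy can be written down in a few lines of Groovy; the asNumber helper is illustrative only and here simply checks the known solution shown later:

// turn a list of digits into the number they spell, then check SEND + MORE == MONEY
def asNumber = { List digits -> digits.inject(0) { acc, d -> acc * 10 + d } }
def (s, e, n, d, m, o, r, y) = [9, 5, 6, 7, 1, 0, 8, 2]
assert asNumber([s, e, n, d]) + asNumber([m, o, r, e]) == asNumber([m, o, n, e, y])  // 9567 + 1085 == 10652

The solvers on the next slides search for digit assignments satisfying exactly this condition, with all eight letters mapped to distinct digits and no leading zeros.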
  123. Cryptarithmetic: Brute force def solutions(): # letters = ('s', 'e',

    'n', 'd', 'm', 'o', 'r', 'y') all_solutions = list() for s in range(1, 10): for e in range(0, 10): for n in range(0, 10): for d in range(0, 10): for m in range(1, 10): for o in range(0, 10): for r in range(0, 10): for y in range(0, 10): if len({s, e, n, d, m, o, r, y}) == 8: send = 1000 * s + 100 * e + 10 * n + d more = 1000 * m + 100 * o + 10 * r + e money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if send + more == money: all_solutions.append((send, more, money)) return all_solutions print(solutions()) S E N D + M O R E = M O N E Y [(9567, 1085, 10652)]
  124. Cryptarithmetic: Brute force def solutions() { // letters = ['s',

    'e', 'n', 'd', 'm', 'o', 'r', 'y'] def all_solutions = [] for (s in 1..<10) for (e in 0..9) for (n in 0..9) for (d in 0..9) for (m in 1..9) for (o in 0..9) for (r in 0..9) for (y in 0..9) if ([s, e, n, d, m, o, r, y].toSet().size() == 8) { def send = 1000 * s + 100 * e + 10 * n + d def more = 1000 * m + 100 * o + 10 * r + e def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if (send + more == money) all_solutions.add([send, more, money]) } return all_solutions } print(solutions()) S E N D + M O R E = M O N E Y [[9567, 1085, 10652]]
  125. Cryptarithmetic: Brute force from itertools import permutations def solution2(): letters

    = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y') digits = range(10) for perm in permutations(digits, len(letters)): sol = dict(zip(letters, perm)) if sol['s'] == 0 or sol['m'] == 0: continue send = 1000 * sol['s'] + 100 * sol['e'] + 10 * sol['n'] + sol['d'] more = 1000 * sol['m'] + 100 * sol['o'] + 10 * sol['r'] + sol['e'] money = 10000 * sol['m'] + 1000 * sol['o'] + 100 * sol['n'] + 10 * sol['e'] + sol['y'] if send + more == money: return send, more, money print(solution2()) S E N D + M O R E = M O N E Y (9567, 1085, 10652)
  126. Cryptarithmetic: Brute force def solution2() { def digits = 0..9

    for (p in digits.permutations()) { if (p[-1] < p[-2]) continue def (s, e, n, d, m, o, r, y) = p if (s == 0 || m == 0) continue def send = 1000 * s + 100 * e + 10 * n + d def more = 1000 * m + 100 * o + 10 * r + e def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y if (send + more == money) return [send, more, money] } } print(solution2()) S E N D + M O R E = M O N E Y [9567, 1085, 10652]
  127. from csp import Constraint, CSP from typing import Dict, List,

    Optional class SendMoreMoneyConstraint(Constraint[str, int]): def __init__(self, letters: List[str]) -> None: super().__init__(letters) self.letters: List[str] = letters def satisfied(self, assignment: Dict[str, int]) -> bool: # not a solution if duplicate values if len(set(assignment.values())) < len(assignment): return False # if all vars assigned, check if correct if len(assignment) == len(self.letters): s: int = assignment["S"] e: int = assignment["E"] n: int = assignment["N"] d: int = assignment["D"] m: int = assignment["M"] o: int = assignment["O"] r: int = assignment["R"] y: int = assignment["Y"] send: int = s * 1000 + e * 100 + n * 10 + d more: int = m * 1000 + o * 100 + r * 10 + e money: int = m * 10000 + o * 1000 + n * 100 + e * 10 + y return send + more == money return True … Cryptarithmetic: Constraint programming … letters = ["S", "E", "N", "D", "M", "O", "R", "Y"] possible_digits = {} for letter in letters: possible_digits[letter] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] possible_digits["M"] = [1] # can't start with 0 constraint csp: CSP[str, int] = CSP(letters, possible_digits) csp.add_constraint(SendMoreMoneyConstraint(letters)) solution: Optional[Dict[str, int]] = csp.backtracking_search() if solution is None: print("No solution found!") else: print(solution) S E N D + M O R E = M O N E Y {'S': 9, 'E': 5, 'N': 6, 'D': 7, 'M': 1, 'O': 0, 'R': 8, 'Y': 2} Adapted from: https://freecontent.manning.com/constraint-satisfaction-problems-in-python/
  128. @Grab('org.choco-solver:choco-solver:4.10.0') import org.chocosolver.solver.Model import org.chocosolver.solver.variables.IntVar def model = new Model("SEND+MORE=MONEY")

    def S = model.intVar("S", 1, 9) def E = model.intVar("E", 0, 9) def N = model.intVar("N", 0, 9) def D = model.intVar("D", 0, 9) def M = model.intVar("M", 1, 9) def O = model.intVar("O", 0, 9) def R = model.intVar("R", 0, 9) def Y = model.intVar("Y", 0, 9) model.allDifferent(S, E, N, D, M, O, R, Y).post() … Cryptarithmetic: Constraint programming … IntVar[] ALL = [ S, E, N, D, M, O, R, E, M, O, N, E, Y] int[] COEFFS = [ 1000, 100, 10, 1, 1000, 100, 10, 1, -10000, -1000, -100, -10, -1] model.scalar(ALL, COEFFS, "=", 0).post() model.solver.findSolution() S E N D + M O R E = M O N E Y Solution: S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2,
  129. Dietary restrictions Minimise cost of diet given: • Must be

    at least 300 calories • Not more than 10 grams of protein • Not less than 10 grams of carbohydrates • Not less than 8 grams of fat • At least 0.5 units of fish • No more than 1 unit of milk Bread Milk Cheese Potato Fish Yogurt Cost 2.0 3.5 8.0 1.5 11.0 1.0 Protein (g) 4.0 8.0 7.0 1.3 8.0 9.2 Fat (g) 1.0 5.0 9.0 0.1 7.0 1.0 Carbohydrates (g) 15.0 11.7 0.4 22.6 0.0 17.0 Calories 90 120 106 97 130 180
  130. Dietary restrictions (Excel)

  131. Dietary restrictions (Excel)

  132. Dietary restrictions (Google sheets)

  133. Diet problem (ojalgo) def model = new ExpressionsBasedModel() def bread

    = model.addVariable("Bread").lower(0) def milk = model.addVariable("Milk").lower(0).upper(1) def cheese = model.addVariable("Cheese").lower(0) def potato = model.addVariable("Potato").lower(0) def fish = model.addVariable("Fish").lower(0.5) def yogurt = model.addVariable("Yogurt").lower(0) def cost = model.addExpression("Cost") cost.set(bread, 2.0).set(milk, 3.5).set(cheese, 8.0).set(potato, 1.5).set(fish, 11.0).set(yogurt, 1.0) def protein = model.addExpression("Protein").upper(10) protein.set(bread, 4.0).set(milk, 8.0).set(cheese, 7.0).set(potato, 1.3).set(fish, 8.0).set(yogurt, 9.2) def fat = model.addExpression("Fat").lower(8) fat.set(bread, 1.0).set(milk, 5.0).set(cheese, 9.0).set(potato, 0.1).set(fish, 7.0).set(yogurt, 1.0) def carbs = model.addExpression("Carbohydrates").lower(10) carbs.set(bread, 15.0).set(milk, 11.7).set(cheese, 0.4).set(potato, 22.6).set(fish, 0.0).set(yogurt, 17.0) def calories = model.addExpression("Calories").lower(300) calories.set(bread, 90).set(milk, 120).set(cheese, 106).set(potato, 97).set(fish, 130).set(yogurt, 180) def result = model.minimise() // for a variation, see: … OPTIMAL 0.0 @ { 0.0, 0.0, 0.40813898143741034, 1.8538791051880057, 0.5916230366492151, 0.0 } ############################################ 0 <= Bread: 0 0 <= Milk: 0 <= 1 0 <= Cheese: 0.408139 0 <= Potato: 1.853879 0.5 <= Fish: 0.591623 0 <= Yogurt: 0 10 <= Carbohydrates: 42.060923 8 <= Fat: 8.0 Cost: 12.553784 Protein: 10.0 <= 10 300 <= Calories: 300.0 ############################################
  134. Diet problem (Choco) def model = new Model("Diet problem") def

    unbounded = 100000 // scale quantities by 10, coefficients by 10, products by 100 def bread = model.intVar("Bread", 0, unbounded) def milk = model.intVar("Milk", 0, 10) def cheese = model.intVar("Cheese", 0, unbounded) def potato = model.intVar("Potato", 0, unbounded) def fish = model.intVar("Fish", 5, unbounded) def yogurt = model.intVar("Yogurt", 0, unbounded) IntVar[] all = [bread, milk, cheese, potato, fish, yogurt] def cost = model.intVar("Cost", 0, unbounded) model.scalar(all, [20, 35, 80, 15, 110, 10] as int[], "=", cost).post() def protein = model.intVar("Protein", 0, 1000) model.scalar(all, [40, 80, 70, 13, 80, 92] as int[], "=", protein).post() def fat = model.intVar("Fat", 800, unbounded) model.scalar(all, [10, 50, 90, 1, 70, 10] as int[], "=", fat).post() def carbs = model.intVar("Carbohydrates", 1000, unbounded) model.scalar(all, [150, 117, 4, 226, 0, 170] as int[], "=", carbs).post() def calories = model.intVar("Calories", 30000, unbounded) model.scalar(all, [900, 1200, 1060, 970, 1300, 1800] as int[], "=", calories).post() model.setObjective(Model.MINIMIZE, cost) def found = model.solver.findSolution() if (found) { all.each { println "$it.name: ${it.value / 10}" } [carbs, fat, protein, calories, cost].each { println "$it.name: ${it.value / 100}" } } else { println "No solution" } Bread: 0 Milk: 0 Cheese: 0.5 Potato: 1.9 Fish: 0.5 Yogurt: 0 Carbohydrates: 43.14 Fat: 8.19 Protein: 9.97 Calories: 302.3 Cost: 12.35 Choco supports RealVar but doesn’t have as rich a set of possible constraints for such vars.
  135. Diet problem (Choco) def model = new Model("Diet problem") def

    unbounded = 1000.0d def precision = 0.00001d def bread = model.realVar("Bread", 0.0, unbounded, precision) def milk = model.realVar("Milk", 0.0, 1.0, precision) def cheese = model.realVar("Cheese", 0.0, unbounded, precision) def potato = model.realVar("Potato", 0.0, unbounded, precision) def fish = model.realVar("Fish", 0.5, unbounded, precision) def yogurt = model.realVar("Yogurt", 0.0, unbounded, precision) RealVar[] all = [bread, milk, cheese, potato, fish, yogurt] def scalarIbex = { coeffs, var -> def (a, b, c, d, e, f) = coeffs model.realIbexGenericConstraint("$a*{0}+$b*{1}+$c*{2}+$d*{3}+$e*{4}+$f*{5}={6}", [*all, var] as RealVar[]).post(); } def cost = model.realVar("Cost", 0.0, unbounded, precision) scalarIbex([2.0, 3.5, 8.0, 1.5, 11.0, 1.0], cost) def protein = model.realVar("Protein", 0.0, 10.0, precision) scalarIbex([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], protein) def fat = model.realVar("Fat", 8.0, unbounded, precision) scalarIbex([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], fat) def carbs = model.realVar("Carbohydrates", 10.0, unbounded, precision) scalarIbex([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], carbs) def calories = model.realVar("Calories", 300, unbounded, precision) scalarIbex([90, 120, 106, 97, 130, 180], calories) model.setObjective(Model.MINIMIZE, cost) def found = model.solver.findSolution() Bread: 0.025131 .. 0.025137 Milk: 0.000009 .. 0.000010 Cheese: 0.428571 .. 0.428571 Potato: 1.848118 .. 1.848124 Fish: 0.561836 .. 0.561836 Yogurt: 0.000007 .. 0.000010 Carbohydrates: 42.316203 .. 42.316211 Fat: 8.000000 .. 8.000005 Protein: 9.997920 .. 9.997926 Calories: 300.000000 .. 300.000008 Cost: 12.431241 .. 12.431245 Choco does have a plugin (via JNI) for the Ibex C++ constraint processing library which does handle real numbers. def pretty = { var -> def bounds = found.getRealBounds(var) printf "%s: %.6f .. %.6f%n", var.name, *bounds } if (found) { all.each { pretty(it) } [carbs, fat, protein, calories, cost].each { pretty(it) } } else { println "No solution" }
  136. Diet problem (Apache Commons Math) import org.apache.commons.math3.optim.linear.* import org.apache.commons.math3.optim.nonlinear.scalar.GoalType import

    static org.apache.commons.math3.optim.linear.Relationship.* def cost = new LinearObjectiveFunction([2.0, 3.5, 8.0, 1.5, 11.0, 1.0] as double[], 0) static scalar(coeffs, rel, val) { new LinearConstraint(coeffs as double[], rel, val) } def bread_min = scalar([1, 0, 0, 0, 0, 0], GEQ, 0) def milk_min = scalar([0, 1, 0, 0, 0, 0], GEQ, 0) def milk_max = scalar([0, 1, 0, 0, 0, 0], LEQ, 1) def cheese_min = scalar([0, 0, 1, 0, 0, 0], GEQ, 0) def potato_min = scalar([0, 0, 0, 1, 0, 0], GEQ, 0) def fish_min = scalar([0, 0, 0, 0, 1, 0], GEQ, 0.5) def yogurt_min = scalar([0, 0, 0, 0, 0, 1], GEQ, 0) def protein = scalar([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], LEQ, 10) def fat = scalar([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], GEQ, 8) def carbs = scalar([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], GEQ, 10) def calories = scalar([90, 120, 106, 97, 130, 180], GEQ, 300) LinearConstraintSet constraints = [bread_min, milk_min, milk_max, fish_min, cheese_min, potato_min, yogurt_min, protein, fat, carbs, calories] def solution = new SimplexSolver().optimize(cost, constraints, GoalType.MINIMIZE) if (solution != null) { printf "Opt: %.2f%n", solution.value println solution.point.collect{ sprintf '%.2f', it }.join(', ') } Opt: 12.08 -0.00, 0.05, 0.45, 1.87, 0.50, 0.00
  137. Classification Overview Classification: • Predicting class of some data Algorithms:

    • Logistic Regression • Naïve Bayes • Stochastic Gradient Descent • K-Nearest Neighbors • Decision Tree • Random Forest • Support Vector Machine Aspects: • Over/underfitting • Ensemble • Confusion Applications: • Image recognition • Speech recognition • Medical diagnosis
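As one hedged illustration of the Decision Tree bullet above, Weka (one of the libraries listed at the start of the deck) can fit and cross-validate a tree in a few lines of Groovy; the ARFF path is a placeholder for any local dataset whose last attribute is the nominal class:

import weka.classifiers.Evaluation
import weka.classifiers.trees.J48
import weka.core.converters.ConverterUtils.DataSource

def data = new DataSource('/path/to/iris.arff').dataSet   // placeholder path
data.classIndex = data.numAttributes() - 1                // last attribute is the class
def tree = new J48()                                      // C4.5-style decision tree
tree.buildClassifier(data)
def eval = new Evaluation(data)
eval.crossValidateModel(tree, data, 10, new Random(1))    // 10-fold cross validation
println eval.toSummaryString()

The summary reports accuracy and error statistics, and swapping J48 for another Weka classifier (e.g. NaiveBayes or RandomForest) exercises other algorithms from the list above.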
  138. Classify language Apache OpenNLP parts: • sentence detector, tokenizer, name

    finder, document categorizer, part-of-speech tagger, chunker, parser, coreference resolution import opennlp.tools.langdetect.* def base = 'http://apache.forsale.plus/opennlp/models' def url = "$base/langdetect/1.8.3/langdetect-183.bin" def model = new LanguageDetectorModel(new URL(url)) def detector = new LanguageDetectorME(model) [ 'spa': 'Bienvenido a Madrid', 'fra': 'Bienvenue à Paris', 'dan': 'Velkommen til København' ].each { k, v -> assert detector.predictLanguage(v).lang == k } Supported language detection training models: • MaxEnt • Perceptron • Naïve Bayes
  139. Classification with Neural networks For >1 layers (deep) model caters

    for non-linear relationships See: https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
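A hedged back-of-the-envelope sketch of what one such neuron computes (the weights, inputs and bias below are made-up numbers): a weighted sum plus bias is passed through a non-linear activation, and stacking layers of these units is what lets the deeper models on the following slides capture non-linear relationships:

def sigmoid = { double v -> 1 / (1 + Math.exp(-v)) }        // non-linear activation
def neuron = { double[] inputs, double[] weights, double bias ->
    sigmoid(bias + (0..<inputs.length).sum { inputs[it] * weights[it] })
}
println neuron([0.9, 0.1] as double[], [2.0, -3.0] as double[], -0.5)    // sigmoid(1.0) ≈ 0.73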
  140. Classification with Neural networks Case Study: MNIST • Handwritten digits

    • 60,000 training examples • 10,000 test examples • Size-normalized (28x28 pixels) and centered • http://yann.lecun.com/exdb/mnist/
  141. MNIST with Apache Commons Math class MnistNeuralNet { private static

    int inodes = 784 private static int hnodes = 200 private static int onodes = 10 private static double learning_rate = 0.1 private static RealMatrix inputHidden private static RealMatrix hiddenOutput static initRandom() { inputHidden = initRandom(createRealMatrix(hnodes, inodes), inodes**-0.5 as double) hiddenOutput = initRandom(createRealMatrix(onodes, hnodes), hnodes**-0.5 as double) } static double[] normalize(Integer[] img) { img.collect { it / 255.0d * 0.999d + 0.001d } as double[] } static RealMatrix createTarget(int label) { createRealMatrix(onodes, 1).tap { (0..<onodes).each { setEntry(it, 0, it != label ? 0.001d : 0.999d) } } } static RealMatrix query(double[] inputArray) { RealMatrix inputs = createColumnRealMatrix(inputArray) RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs return scalarSigmoid(finalInputs) } … Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html MnistNeuralNet class supports training and prediction 0 1 2 9 … … … 10 28 * 28 = 784 200
  142. MNIST with Apache Commons Math … static RealMatrix initRandom(RealMatrix matrix,

    double desiredStandardDeviation) { scalar(matrix, it -> new Random().nextGaussian() * desiredStandardDeviation) } static void train(RealMatrix inputs, RealMatrix targets) { // forward RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs RealMatrix finalOutputs = scalarSigmoid(finalInputs) // back RealMatrix outputErrors = targets.subtract(finalOutputs) RealMatrix t1 = multiplyElements(outputErrors, finalOutputs) RealMatrix t2 = multiplyElements(t1, scalar(finalOutputs, it -> 1.0d - it)) RealMatrix t3 = t2 * hiddenOutputs.transpose() hiddenOutput = hiddenOutput.add(scalar(t3, it -> learning_rate * it)) RealMatrix hiddenErrors = hiddenOutput.transpose() * outputErrors t1 = multiplyElements(hiddenErrors, hiddenOutputs) t2 = multiplyElements(t1, scalar(hiddenOutputs, it -> 1.0d - it)) t3 = t2 * inputs.transpose() inputHidden = inputHidden.add(scalar(t3, it -> learning_rate * it)) } static void save(String filename) { new FileOutputStream(filename).withObjectOutputStream { oos -> oos.writeObject(inputHidden.data) oos.writeObject(hiddenOutput.data) } } } Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html MnistNeuralNet class supports training and prediction cont’d
  143. MNIST with Apache Commons Math static RealMatrix scalar(RealMatrix matrix, Function<Double,

    Double> function) { int numRows = matrix.rowDimension int numCols = matrix.columnDimension RealMatrix result = createRealMatrix(numRows, numCols) for (r in 0..<numRows) { for (c in 0..<numCols) { result.setEntry(r, c, function.apply(matrix.getEntry(r, c))) } } return result } static int[][] rotate(int[][] img, double angleInDegrees) { double angle = Math.toRadians(angleInDegrees) int[][] result = new int[img.length][] for (y in 0..<img.length) { result[y] = new int[img[y].length] Arrays.fill(result[y], 0) } double cosAngle = Math.cos(angle) double sinAngle = Math.sin(angle) double x0 = img[0].length / 2 - cosAngle * img[0].length / 2 - sinAngle * img.length / 2 double y0 = img.length / 2 - cosAngle * img.length / 2 + sinAngle * img[0].length / 2 for (y in 0..<img.length) { for (x in 0..<img[y].length) { int xRot = (int) (x * cosAngle + y * sinAngle + x0) int yRot = (int) (-x * sinAngle + y * cosAngle + y0) if (xRot >= 0 && yRot >= 0 && xRot <= 27 && yRot <= 27) { result[y][x] = img[yRot][xRot] } } } return result } … Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Utility methods
  144. MNIST with Apache Commons Math … static RealMatrix multiplyElements(RealMatrix matrixA,

    RealMatrix matrixB) { // elementWise multiplication has same compatibility requirements as addition checkAdditionCompatible(matrixA, matrixB) int numRows = matrixA.rowDimension int numCols = matrixA.columnDimension RealMatrix product = createRealMatrix(numRows, numCols) for (r in 0..<numRows) { for (c in 0..<numCols) { product.setEntry(r, c, matrixA.getEntry(r, c) * matrixB.getEntry(r, c)) } } return product } static int maxIndex(RealMatrix result) { double[][] data = result.data (0..<data.size()).max { data[it][0] } } static RealMatrix scalarSigmoid(RealMatrix matrix) { def sigmoid = [start: { a, b, c, d, e, f -> }, visit: { r, c, double v -> 1 / (1 + Math.exp(-v)) }, end : { -> 0d }] as RealMatrixChangingVisitor RealMatrix result = matrix.copy() result.walkInRowOrder(sigmoid) result } Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Utility methods cont’d
  145. MNIST with Apache Commons Math int epochs = 10 initRandom()

    int[] labels = MnistReader.getLabels(Paths.get("/path/to/train-labels-idx1-ubyte.gz")) List<int[][]> images = MnistReader.getImages(Paths.get("/path/to/train-images-idx3-ubyte.gz")) println 'Processing images' def scaledImages = images.collect { image -> normalize(image.flatten() as Integer[]) } as double[][] def scaledImagesRot10 = images.collect { image -> normalize(rotate(image, 10).flatten() as Integer[]) } as double[][] def scaledImagesRot350 = images.collect { image -> normalize(rotate(image, -10).flatten() as Integer[]) } as double[][] epochs.times { e -> println "Running epoch: ${e + 1}" for (i in 0..<labels.length) { RealMatrix targets = createTarget(labels[i]) train(createColumnRealMatrix(scaledImages[i]), targets) train(createColumnRealMatrix(scaledImagesRot10[i]), targets) train(createColumnRealMatrix(scaledImagesRot350[i]), targets) } } int[] testLabels = MnistReader.getLabels(Paths.get("/path/to/t10k-labels-idx1-ubyte.gz")) List<int[][]> testImages = MnistReader.getImages(Paths.get("/path/to/t10k-images-idx3-ubyte.gz")) int correct = 0 for (i in 0..<testLabels.length) { int correctLabel = testLabels[i] RealMatrix predict = query(normalize(testImages.get(i).flatten() as Integer[])) int predictLabel = maxIndex(predict) if (predictLabel == correctLabel) correct++ } println "Accuracy: " + correct / testLabels.length Processing images Running epoch: 1 Running epoch: 2 Running epoch: 3 Running epoch: 4 Running epoch: 5 Running epoch: 6 Running epoch: 7 Running epoch: 8 Running epoch: 9 Running epoch: 10 Accuracy: 0.9767 Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html Train model Test model
  146. MNIST with DeepLearning4J int rngSeed = 123 int batchSize =

    125 int numEpochs = 15 int numInputs = 28 * 28 int hiddenLayerSize = 1000 int numOutputs = 10 def (trainSet, testSet) = [true, false].collect { new MnistDataSetIterator(batchSize, it, rngSeed) } def log = LoggerFactory.getLogger(getClass()) def conf = new NeuralNetConfiguration.Builder() .seed(rngSeed) //include the random seed for reproducibility // use stochastic gradient descent as an optimization algorithm .updater(new Nesterovs(0.006, 0.9)) .l2(1e-4) .list() // hidden layer: .layer(new DenseLayer.Builder() .nIn(numInputs) .nOut(hiddenLayerSize) .activation(Activation.RELU) .weightInit(WeightInit.XAVIER) .build()) // output layer: .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD) .nIn(hiddenLayerSize) .nOut(numOutputs) .activation(Activation.SOFTMAX) .weightInit(WeightInit.XAVIER) .build()) .build() … … log.info("Creating model ...") def model = new MultiLayerNetwork(conf) model.init() log.info("Training model ...") model.listeners = new ScoreIterationListener(100) model.fit(trainSet, numEpochs) // model.save('mlp_model.dat' as File) log.info("Evaluating model...") def eval = model.evaluate(testSet) log.info(eval.stats()) ======== Evaluation Metrics ======== # of classes: 10 Accuracy: 0.9728 Precision: 0.9728 Recall: 0.9726 F1 Score: 0.9726 Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java [network diagram: 28 * 28 = 784 inputs → 1000 hidden nodes → 10 outputs (digits 0–9)]
  147. … MNIST with DeepLearning4J Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java [network diagram: 28 * 28 = 784 inputs → 500 → 100 hidden nodes → 10 outputs (digits 0–9)]

    int rngSeed = 123 int batchSize = 64 int numEpochs = 15 int numInputs = 28 * 28 int numOutputs = 10 double rate = 0.0015 def (trainSet, testSet) = [true, false].collect { new MnistDataSetIterator(batchSize, it, rngSeed) } def log = LoggerFactory.getLogger(getClass()) def conf = new NeuralNetConfiguration.Builder() .seed(rngSeed) //include the random seed for reproducibility // use stochastic gradient descent as an optimization algorithm .updater(new Nadam()) .l2(rate * 0.005) // regularized .list() // first hidden layer: .layer(new DenseLayer.Builder() .nIn(numInputs) .nOut(500) .build()) // second hidden layer: .layer(new DenseLayer.Builder() .nIn(500) .nOut(100) .build()) // output layer: .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD) .nOut(numOutputs) .activation(Activation.SOFTMAX) .build()) .build() … log.info("Creating model ...") def model = new MultiLayerNetwork(conf) model.init() log.info("Training model ...") model.listeners = new ScoreIterationListener(100) model.fit(trainSet, numEpochs) log.info("Evaluating model...") def eval = model.evaluate(testSet) log.info(eval.stats()) ======== Evaluation Metrics ======== # of classes: 10 Accuracy: 0.9820 Precision: 0.9820 Recall: 0.9818 F1 Score: 0.9819
  148. MNIST GUI with GroovyFX import javafx.embed.swing.SwingFXUtils import javafx.scene.canvas.GraphicsContext import javafx.scene.image.Image

    import javafx.scene.image.PixelFormat import javafx.scene.paint.Color import javax.imageio.ImageIO static clear(GraphicsContext g, size) { g.fill = Color.WHITE; g.fillRect(0, 0, size, size); g.fill = Color.BLACK } static int[] imageToArray(Image img) { int w = img.width int h = img.height int[] buf = new int[h * w] img.pixelReader.getPixels(0, 0, w, h, PixelFormat.intArgbInstance, buf, 0, w) buf } static Image snapshot(canvas) { def baos = new ByteArrayOutputStream() ImageIO.write(SwingFXUtils.fromFXImage(canvas.snapshot(null, null), null), "png", baos) new Image(new ByteArrayInputStream(baos.toByteArray()), 28, 28, true, true) } static displayResult(List items, Integer predict) { items.indexed().collect{ idx, val -> def marker = (idx == predict && items[predict] > 0.5) ? ' **' : '' "$idx ${val ? sprintf('%.4f', val) + marker : '?'}" }.join('\n') } A few more utility methods
  149. MNIST GUI with GroovyFX class MnistInfer { private static RealMatrix

    inputHidden private static RealMatrix hiddenOutput static void load(Class klass) { def weights = klass.classLoader.getResourceAsStream('weights') try (ObjectInputStream ois = new ObjectInputStream(weights)) { inputHidden = MatrixUtils.createRealMatrix((double[][]) ois.readObject()) hiddenOutput = MatrixUtils.createRealMatrix((double[][]) ois.readObject()) } } static RealMatrix query(double[] inputArray) { RealMatrix inputs = createColumnRealMatrix(inputArray) RealMatrix hiddenInputs = inputHidden * inputs RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs) RealMatrix finalInputs = hiddenOutput * hiddenOutputs return scalarSigmoid(finalInputs) } static double[] normalize(int[] img) { double[] result = new double[img.length] for (int i = 0; i < img.length; i++) { int red = (img[i] >> 16) & 0xff int green = (img[i] >> 8) & 0xff int blue = img[i] & 0xff result[i] = 1 - ((red + green + blue) / 765.0 * 999 + 1) / 1000 } return result } } Predictor class
  150. MNIST GUI with GroovyFX … start { stage(title: 'MNIST', visible:

    true) { scene(id: "scene", fill: WHITE) { borderPane { top { canvas(id: "canvas", width: size, height: size) def g = canvas.graphicsContext2D clear(g, size) canvas.onMouseDragged { e -> g.fillOval e.x - pen, e.y - pen, pen * 2, pen * 2 } } center { hbox(alignment: 'Center') { button('Clear', onAction: { clear(canvas.graphicsContext2D, size) out.text = displayResult([null] * 10, null) }) button('Predict', onAction: { def result = Mnist.query(Mnist.scale(imageToArray(snapshot(canvas)))) def predictLabel = Mnist.maxIndex(result) out.text = displayResult(result.data.collect{ it[0] }, predictLabel) }) } } bottom { textArea(displayResult([null] * 10, null), id: 'out', editable: false, prefColumnCount: 16) } } } } }
  151. Apache MXNET Source: https://mxnet.apache.org/versions/master/faq/why_mxnet.html

  152. Apache MXNET Object detection import groovy.swing.SwingBuilder import javax.imageio.ImageIO import org.apache.mxnet.infer.javaapi.ObjectDetector

    import org.apache.mxnet.javaapi.* import static javax.swing.WindowConstants.DISPOSE_ON_CLOSE static void downloadUrl(String srcDirUrl, String destDir, String filename) { def destination = new File(destDir, filename) if (!destination.parentFile.exists()) { destination.parentFile.mkdirs() } if (!destination.exists()) { destination.bytes = new URL(srcDirUrl + filename).bytes } } static downloadModelImage() { String baseDir = "${System.getProperty('java.io.tmpdir')}/resnetssd/" def imageName = 'dog-ssd.jpg' def modelName = 'resnet50_ssd_model' String imgURL = "https://s3.amazonaws.com/model-server/inputs/" println "Downloading image to ${baseDir}..." downloadUrl(imgURL, baseDir, imageName) println "Downloading model files to ${baseDir}... (may take a while)" String modelURL = "https://s3.amazonaws.com/model-server/models/resnet50_ssd/" downloadUrl(modelURL, baseDir, "$modelName-symbol.json") downloadUrl(modelURL, baseDir, "$modelName-0000.params") downloadUrl(modelURL, baseDir, 'synset.txt') [baseDir + imageName, baseDir + modelName] } … Utility methods for downloading test image and pre-trained model Test image
  153. Apache MXNET Object detection … static detectObjects(String modelPath, String imagePath,

    inputShape) { def context = [Context.cpu()] def inputDescriptor = new DataDesc("data", inputShape, DType.Float32(), "NCHW") def objDet = new ObjectDetector(modelPath, [inputDescriptor], context, 0) return objDet.imageObjectDetect(ObjectDetector.loadImageFromFile(imagePath), 3) } def (imagePath, modelPath) = downloadModelImage() def image = ImageIO.read(imagePath as File) def (w, h) = image.with{ [it.width, it.height] } def count = 1 // batch size def channels = 3 // for RGB def inputShape = new Shape(count, channels, w * 0.8 as int, h) def results = detectObjects(modelPath, imagePath, inputShape).sum() def boxes = results.collect {[ xmin: w * it.XMin as int, ymin: h * it.YMin as int, xmax: w * it.XMax as int, ymax: h * it.YMax as int ]} def names = results.collect{ it.className + sprintf(' %.3f', it.probability) } (0..<names.size()).each{ println "${names[it]} ${boxes[it]}" } Image.drawBoundingBox(image, boxes, names) new SwingBuilder().edt { frame(title: "${results.size()} detected objects", size: [w, h], show: true, defaultCloseOperation: DISPOSE_ON_CLOSE) { label(icon: imageIcon(image: image)) } } car 0.996 [xmin:439, ymin:62, xmax:715, ymax:173] bicycle 0.986 [xmin:144, ymin:128, xmax:547, ymax:453] dog 0.919 [xmin:102, ymin:184, xmax:344, ymax:541]
  154. Apache MXNET Object detection … static detectObjects(String modelPath, String imagePath,

    inputShape) { def context = [Context.cpu()] def inputDescriptor = new DataDesc("data", inputShape, DType.Float32(), "NCHW") def objDet = new ObjectDetector(modelPath, [inputDescriptor], context, 0) return objDet.imageObjectDetect( ObjectDetector.loadImageFromFile(imagePath), 6) } def (imagePath, modelPath) = downloadModelImage() def image = ImageIO.read(imagePath as File) def (w, h) = image.with{ [it.width, it.height] } def count = 1 // batch size def channels = 3 // for RGB def inputShape = new Shape(count, channels, 460, 480) def results = detectObjects(modelPath, imagePath, inputShape).sum() def boxes = results.collect {[ xmin: w * it.XMin as int, ymin: h * it.YMin as int, xmax: w * it.XMax as int, ymax: h * it.YMax as int ]} def names = results.collect{ it.className + sprintf(' %.3f', it.probability) } (0..<names.size()).each{ println "${names[it]} ${boxes[it]}" } Image.drawBoundingBox(image, boxes, names) new SwingBuilder().edt { frame(title: "${results.size()} detected objects", size: [w, h], show: true, defaultCloseOperation: DISPOSE_ON_CLOSE) { label(icon: imageIcon(image: image)) } } car 0.997 [xmin:475, ymin:80, xmax:683, ymax:159] bicycle 0.993 [xmin:149, ymin:128, xmax:540, ymax:400] dog 0.844 [xmin:117, ymin:184, xmax:293, ymax:472] bicycle 0.220 [xmin:133, ymin:145, xmax:242, ymax:214] bicycle 0.158 [xmin:122, ymin:158, xmax:361, ymax:435] motorbike 0.112 [xmin:58, ymin:82, xmax:116, ymax:121]
  155. Apache MXNET Object detection bicycle 0.987 [xmin:175, ymin:100, xmax:570, ymax:454]

    car 0.976 [xmin:435, ymin:72, xmax:697, ymax:177] person 0.786 [xmin:468, ymin:40, xmax:768, ymax:504] dog 0.703 [xmin:97, ymin:167, xmax:331, ymax:540] diningtable 0.537 [xmin:0, ymin:277, xmax:738, ymax:576] sofa 0.498 [xmin:0, ymin:304, xmax:722, ymax:576] horse 0.430 [xmin:0, ymin:276, xmax:488, ymax:576] diningtable 0.422 [xmin:430, ymin:125, xmax:768, ymax:576] horse 0.309 [xmin:95, ymin:0, xmax:768, ymax:459] pottedplant 0.289 [xmin:299, ymin:257, xmax:768, ymax:576] bird 0.286 [xmin:424, ymin:40, xmax:768, ymax:382] motorbike 0.272 [xmin:50, ymin:78, xmax:98, ymax:130]
  156. Scaling up Data Science If your big data framework doesn’t

    have the algorithm you need? Preprocessing Partition based dataset Linear regression K-Means clustering Genetic algorithms Multilayer perceptron Decision trees k-NN classification k-NN regression SVM binary classification SVM multi-class Classification Model cross validation Logistic regression Random forest Gradient boosting Model updating ANN (Approximate Nearest Neighbor) Deep learning Summary statistics Correlations Stratified sampling Hypothesis testing Random data generation Classification and regression: support Vector machines logistic regression Linear regression Decision trees Naive Bayes classification Alternating least squares (ALS) K-Means clustering Latent Dirichlet allocation (LDA) Singular value decomposition (SVD) Principal component analysis (PCA) Feature extraction Transformation functions Optimization algorithms Re-formulate your algorithm using map/filter/reduce
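The reformulation can be sketched in plain Groovy before any framework is involved: map chunks of rows to per-chunk models, then reduce by averaging the models. This sequential sketch assumes the data house-price matrix prepared on the GPars slides below and uses Smile's OLS in the same way; the GPars and Apache Beam versions that follow parallelise exactly this shape:

import smile.regression.OLS

// map: fit an OLS model per shuffled chunk of rows (first column is the price label)
def list = data.toList()
Collections.shuffle(list)
def models = list.collate(3000).collect { chunk ->
    new OLS(chunk.collect { it[1..-1] } as double[][],
            chunk.collect { it[0] } as double[]).with { [it.intercept(), *it.coefficients()] }
}
// reduce: average the per-chunk models into a single model
def model = models.transpose()*.sum().collect { it / models.size() }
println "Intercept: ${model[0]}"
println "Coefficients: ${model[1..-1].join(', ')}"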
  157. House pricing revisited – Initial algorithm [diagram: Split the Data into Training and Test sets; Fit a Model on the Training data; Predict and Evaluate against the Test data to produce Stats]
  158. House pricing revisited – Rejigging our algorithm [diagram: instead of a single Split/Fit/Predict/Evaluate pass, Split the Data into Chunks, Fit a Model per Chunk and Aggregate the models, then Evaluate each Chunk against the aggregated Model and Aggregate the per-chunk Stats]
  159. House pricing revisited - GPars import org.apache.commons.math3.stat.StatUtils import smile.math.Math import

    smile.regression.OLS import tech.tablesaw.api.* import groovyx.gpars.GParsPool import static java.lang.Math.sqrt def features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def file = getClass().classLoader.getResource('kc_house_data.csv').file def table = Table.read().csv(file) table = table.dropWhere(table.column("bedrooms").isGreaterThan(30)) def data = table.as().doubleMatrix(*features) … • Read CSV in normal way • Split data into chunks • Fit each chunk and aggregate resulting models • Split data into chunks • Evaluate each chunk against model and aggregate resulting stats
  160. House pricing revisited - GPars … GParsPool.withPool { def trainChunkSize

    = 3000 def numTrainChunks = 8 def models = (0..<numTrainChunks).collectParallel { def list = data.toList() Collections.shuffle(list) list = list.take(trainChunkSize) new OLS(list.collect{ it[1..-1] } as double[][], list.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] } } def model = models.transpose()*.sum().collect{ it/numTrainChunks } println "Intercept: ${model[0]}" println "Coefficients: ${model[1..-1].join(', ')}" double[] coefficients = model[1..-1] double intercept = model[0] def stats = { chunk -> def predicted = chunk.collect { row -> intercept + Math.dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(StatUtils.sumSq(residuals as double[]) / chunk.size()) def mean = Math.mean(residuals as double[]) [rmse, mean, chunk.size()] } def evalChunkSize = 2000 def results = data.collate(evalChunkSize).collectParallel(stats) println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size()) println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size() } • Read CSV in normal way • Split data into chunks • Fit each chunk and aggregate resulting models • Split data into chunks • Evaluate each chunk against model and aggregate resulting stats
  161. House pricing revisited - GPars … GParsPool.withPool { def trainChunkSize

    = 3000 def numTrainChunks = 8 def models = (0..<numTrainChunks).collectParallel { def list = data.toList() Collections.shuffle(list) list = list.take(trainChunkSize) new OLS(list.collect{ it[1..-1] } as double[][], list.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] } } def model = models.transpose()*.sum().collect{ it/numTrainChunks } println "Intercept: ${model[0]}" println "Coefficients: ${model[1..-1].join(', ')}" double[] coefficients = model[1..-1] double intercept = model[0] def stats = { chunk -> def predicted = chunk.collect { row -> intercept + Math.dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(StatUtils.sumSq(residuals as double[]) / chunk.size()) def mean = Math.mean(residuals as double[]) [rmse, mean, chunk.size()] } def evalChunkSize = 2000 def results = data.collate(evalChunkSize).collectParallel(stats) println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size()) println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size() } Intercept: -3.16904445043026E7 Coefficients: -21131.141578491588, -5431.675525641348, 179.47932136012156, 11.273983523229091, 657752.8815669084, -10.658550821995767, 86876.08541623666, 61708.58759419169, 611717.9878928157, -21819.967894981055 RMSE: 215185.8177879586 mean: -2151.974648800721
  162. … GParsPool.withPool { def trainChunkSize = 3000 def numTrainChunks =

    8 def models = (0..<numTrainChunks).collectParallel { def list = data.toList() Collections.shuffle(list) list = list.take(trainChunkSize) new OLS(list.collect{ it[1..-1] } as double[][], list.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] } } def model = models.transpose()*.sum().collect{ it/numTrainChunks } println "Intercept: ${model[0]}" println "Coefficients: ${model[1..-1].join(', ')}" double[] coefficients = model[1..-1] double intercept = model[0] def stats = { chunk -> def predicted = chunk.collect { row -> intercept + Math.dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(StatUtils.sumSq(residuals as double[]) / chunk.size()) def mean = Math.mean(residuals as double[]) [rmse, mean, chunk.size()] } def evalChunkSize = 2000 def results = data.collate(evalChunkSize).collectParallel(stats) println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size()) println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size() } House pricing revisited - GPars Data Chunk1 Model1 Stats shuffle/take results … Chunk2 Model2 ChunkN ModelN OLS#intercept/OLS#coefficients Model … … ChunkN Stats1 Stats2 StatsN … Chunk2 Chunk1 results.collect transpose/sum model collate stats
  163. House pricing revisited – Apache Beam … class MeanDoubleArrayCols implements

    SerializableFunction<Iterable<double[]>, double[]> { @Override double[] apply(Iterable<double[]> input) { double[] sum = null def count = 0 for (double[] next : input) { if (sum == null) { sum = new double[next.size()] (0..<sum.size()).each{ sum[it] = 0.0d } } (0..<sum.size()).each{ sum[it] += next[it] } count++ } if (sum != null) (0..<sum.size()).each{ sum[it] /= count } return sum } } class AggregateModelStats implements SerializableFunction<Iterable<double[]>, double[]> { @Override double[] apply(Iterable<double[]> input) { double[] sum = null for (double[] next : input) { if (sum == null) { sum = new double[next.size()] (0..<sum.size()).each{ sum[it] = 0.0d } } def total = sum[2] + next[2] sum[0] = Math.sqrt((sum[2] * sum[0] * sum[0] + next[2] * next[0] * next[0]) / total) sum[1] = (sum[2] * sum[1] + next[2] * next[1]) / total sum[2] = total } return sum } } … Aggregate models Aggregate stats
  164. House pricing revisited – Apache Beam … static buildPipeline(Pipeline p)

    { def features = [ 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors' ] def readCsvChunks = new DoFn<String, double[][]>() { @ProcessElement void processElement(@Element String path, OutputReceiver<double[][]> receiver) throws IOException { def chunkSize = 6000 def table = Table.read().csv(path) table = table.dropWhere(table.column("bedrooms").isGreaterThan(30)) def idxs = 0..<table.rowCount() for (nextChunkIdxs in idxs.shuffled().collate(chunkSize)) { def chunk = table.rows(*nextChunkIdxs) receiver.output(chunk.as().doubleMatrix(*features)) } sleep 2000 } } def fitModel = new DoFn<double[][], double[]>() { @ProcessElement void processElement(@Element double[][] rows, OutputReceiver<double[]> receiver) throws IOException { double[] model = new OLS(rows.collect{ it[1..-1] } as double[][], rows.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] } receiver.output(model) } } def evalModel = { double[][] chunk, double[] model -> double intercept = model[0] double[] coefficients = model[1..-1] def predicted = chunk.collect { row -> intercept + Math.dot(row[1..-1] as double[], coefficients) } def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] } def rmse = sqrt(StatUtils.sumSq(residuals as double[]) / chunk.size()) [rmse, residuals.average(), chunk.size()] as double[] } … Map filename to chunks Map chunk to model Map chunk to stats
  165. House pricing revisited – Apache Beam … def model2out =

    new DoFn<double[], String>() { @ProcessElement void processElement(@Element double[] ds, OutputReceiver<String> out) { out.output("intercept: ${ds[0]}, coeffs: ${ds[1..-1].join(', ')}".toString()) } } def stats2out = new DoFn<double[], String>() { @ProcessElement void processElement(@Element double[] ds, OutputReceiver<String> out) { out.output("rmse: ${ds[0]}, mean: ${ds[1]}, count: ${ds[2]}".toString()) } } var csvChunks = p .apply(Create.of('/path/to/kc_house_data.csv')) .apply('Create chunks', ParDo.of(readCsvChunks)) var model = csvChunks .apply('Fit chunks', ParDo.of(fitModel)) .apply(Combine.globally(new MeanDoubleArrayCols())) var modelView = model .apply(View.<double[]>asSingleton()) csvChunks .apply(ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)) .apply(Combine.globally(new AggregateModelStats())) .apply('Log stats', ParDo.of(stats2out)).apply(Log.ofElements()) model .apply('Log model', ParDo.of(model2out)).apply(Log.ofElements()) } def pipeline = Pipeline.create() buildPipeline(pipeline) pipeline.run().waitUntilFinish() INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274 INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0 Map model to output Map stats to output Create and run pipeline Build pipeline
  166. … var csvChunks = p | Create.of('/path/to/kc_house_data.csv') | 'Create chunks'

    >> ParDo.of(readCsvChunks) var model = csvChunks | 'Fit chunks' >> ParDo.of(fitModel) | Combine.globally(new MeanDoubleArrayCols()) var modelView = model | View.<double[]>asSingleton() csvChunks | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView) | Combine.globally(new AggregateModelStats()) | 'Log stats' >> ParDo.of(stats2out) | Log.ofElements() model | 'Log model' >> ParDo.of(model2out) | Log.ofElements() } PCollection.metaClass.or = { List arg -> delegate.apply(*arg) } PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) } String.metaClass.rightShift = { PTransform arg -> [delegate, arg] } Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) } def pipeline = Pipeline.create() buildPipeline(pipeline) pipeline.run().waitUntilFinish() House pricing revisited – Apache Beam INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274 INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0 Create and run pipeline Some optional metaprogramming for Python-like coding style
  167. House pricing revisited – Apache Beam … var csvChunks =

    p | Create.of('/path/to/kc_house_data.csv') | 'Create chunks' >> ParDo.of(readCsvChunks) var model = csvChunks | 'Fit chunks' >> ParDo.of(fitModel) | Combine.globally(new MeanDoubleArrayCols()) var modelView = model | View.<double[]>asSingleton() csvChunks | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView) | Combine.globally(new AggregateModelStats()) | 'Log stats' >> ParDo.of(stats2out) | Log.ofElements() model | 'Log model' >> ParDo.of(model2out) | Log.ofElements() } PCollection.metaClass.or = { List arg -> delegate.apply(*arg) } PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) } String.metaClass.rightShift = { PTransform arg -> [delegate, arg] } Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) } def pipeline = Pipeline.create() buildPipeline(pipeline) pipeline.run().waitUntilFinish() INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327, 675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274 INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0 Data Chunk1 Model1 Stats readCsvChunks csvChunks … Chunk2 Model2 ChunkN ModelN fitModel Model … … ChunkN Stats1 Stats2 StatsN … Chunk2 Chunk1 AggregateModelStats MeanDoubleArrayCols csvChunks model modelView sideInput evalModel
  168. objectcomputing.com © 2019, Object Computing, Inc. (OCI). All rights

    reserved. Groovy Community Information • groovy.apache.org • groovy-lang.org • github.com/apache/groovy • groovycommunity.com • dev@groovy.apache.org • users@groovy.apache.org • @ApacheGroovy • objectcomputing.com/training/catalog/groovy-programming/
  169. objectcomputing.com © 2019, Object Computing, Inc. (OCI). All rights reserved.

    CONNECT WITH US +1 (314) 579-0066 @objectcomputing objectcomputing.com THANK YOU Find me on Twitter @paulk_asert
  170. References • https://www.analyticsvidhya.com/blog/2017/02/lintroductory-guide-on-linear-programming-explained-in-simple-english/ • Computational Mixed-Integer Programming https://www.birs.ca/cmo-workshops/2018/18w5208/files/GleixnerAmbros.pdf •

    Integer Linear Programming: Graphical Introduction https://www.youtube.com/watch?v=RhHhy-8sz-4 • Integer Linear Programming: Binary Constraints https://www.youtube.com/watch?v=-3my1TkyFiM https://www.youtube.com/watch?v=B3biWsBLeCw https://www.youtube.com/watch?v=MO8uQnIch6I • Linear Programming: Alternate solutions, Infeasibility, Unboundedness, & Redundancy https://www.youtube.com/watch?v=eMA0LWsRQQ