Groovy Data Science

Groovy is a powerful multi-paradigm programming language for the JVM that offers a wealth of features that make it ideal for many data science and big data scenarios.

* Groovy has a dynamic nature like Python, which means that it is very powerful, easy to learn, and productive. The language gets out of the way and lets data scientists write their algorithms naturally.

* Groovy has a static nature like Java and Kotlin, which makes it fast when needed. Its close alignment with Java means that you can often just cut-and-paste the Java examples from various big data solutions and they'll work just fine in Groovy.

* Groovy has first-class functional support, meaning that it offers features and allows solutions similar to Scala. Functional and stream processing with immutable data structures can offer many advantages when working in parallel processing or clustered environments.
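
A tiny illustration (not from the deck) of that functional style in Groovy, with GDK collection methods and Java streams side by side:

    def prices = [520, 640, 310, 450]
    assert prices.findAll { it > 400 }.sum() == 1610            // GDK methods, functional style
    assert prices.stream().filter { it > 400 }.count() == 3     // Java streams work unchanged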

These slides review the key benefits of using Groovy to develop data science solutions, including its integration with the JVM libraries commonly used in this space: libraries for data manipulation, machine learning and plotting, as well as various big data solutions for scaling up these algorithms.

Topics covered include: regression, clustering, classification, neural networks, language processing, image detection.

Math/Data Science libraries covered include:
Weka, Smile, Tablesaw, Apache Commons Math, Jupyter/BeakerX notebooks, DeepLearning4J.

Libraries for scaling/concurrency include:
GPars, Apache Spark, Apache Ignite, Apache MXNet, Apache Beam, Apache NLPCraft.

paulking

July 01, 2021

Transcript

  1. objectcomputing.com
    © 2021, Object Computing, Inc. (OCI). All rights reserved. No part of these notes may be reproduced, stored in a retrieval system,
    or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written
    permission of Object Computing, Inc. (OCI)
    Groovy and Data Science
    Presented by
    Dr Paul King
    © 2021 Object Computing, Inc. (OCI). All rights reserved.


  2. Dr Paul King
    OCI Groovy Lead
    V.P. and PMC Chair Apache Groovy
    Author:
    https://www.manning.com/books/groovy-in-action-second-edition
    Slides:
    https://speakerdeck.com/paulk/groovy-data-science
    Examples repo:
    https://github.com/paulk-asert/groovy-data-science
    Twitter:
    @paulk_asert


  3. Apache Groovy Programming Language
    • Multi-faceted language
    • Imperative/OO & functional (+ other styles via libraries)
    • Dynamic & static
    • Aligned closely with Java language
    • Very extensible
    • 17+ years since inception
    • ~1B downloads (partial count)
    • ~500 contributors
    • 200+ releases
    • https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be


  4. Friends of Apache Groovy
    Open Collective


  5. What is Groovy?
    It’s like a super version of Java:
    • Supports most Java syntax but allows
    simpler syntax for many constructs
    • Supports all Java libraries but
    provides many extensions and its
    own productivity libraries
    • Has both a static and dynamic nature
    • Extensible language and tooling


  6. Groovy has a low learning curve, particularly for Java developers


  7. Groovy is Java's friend
    Seamless integration
    • IDEs provide cross-language compile,
    navigation, and refactoring
    • Arbitrarily mix source language
    • Drop-in replace any class
    • Overloaded methods
    • Syntax alignment
    • Shared data types
    Java
    Groovy


  8. Java-like or concise shortcuts
    import java.util.List;
    import java.util.ArrayList;
    class Main {
    private List keepShorterThan(List strings, int length) {
    List result = new ArrayList();
    for (int i = 0; i < strings.size(); i++) {
    String s = (String) strings.get(i);
    if (s.length() < length) {
    result.add(s);
    }
    }
    return result;
    }
    public static void main(String[] args) {
    List names = new ArrayList();
    names.add("Ted"); names.add("Fred");
    names.add("Jed"); names.add("Ned");
    System.out.println(names);
    Main m = new Main();
    List shortNames = m.keepShorterThan(names, 4);
    System.out.println(shortNames.size());
    for (int i = 0; i < shortNames.size(); i++) {
    String s = (String) shortNames.get(i);
    System.out.println(s);
    }
    }
    }
    names = ["Ted", "Fred", "Jed", "Ned"]
    println names
    shortNames = names.findAll{ it.size() < 4 }
    println shortNames.size()
    shortNames.each{ println it }


  9. Concise syntax including DSL friendly
    import java.util.List;
    import java.util.ArrayList;
    class Main {
    private List keepShorterThan(List strings, int length) {
    List result = new ArrayList();
    for (int i = 0; i < strings.size(); i++) {
    String s = (String) strings.get(i);
    if (s.length() < length) {
    result.add(s);
    }
    }
    return result;
    }
    public static void main(String[] args) {
    List names = new ArrayList();
    names.add("Ted"); names.add("Fred");
    names.add("Jed"); names.add("Ned");
    System.out.println(names);
    Main m = new Main();
    List shortNames = m.keepShorterThan(names, 4);
    System.out.println(shortNames.size());
    for (int i = 0; i < shortNames.size(); i++) {
    String s = (String) shortNames.get(i);
    System.out.println(s);
    }
    }
    }
    names = ["Ted", "Fred", "Jed", "Ned"]
    println names
    shortNames = names.findAll{ it.size() < 4 }
    println shortNames.size()
    shortNames.each{ println it }
    given the names "Ted", "Fred", "Jed" and "Ned"
    display all the names
    display the number of names having size less than 4
    display the names having size less than 4


  10. Metaprogramming


  11. Matrix manipulation: Java vs Groovy
    • Same example
    • Same library
    Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}
    import org.apache.commons.math3.linear.*;
    public class MatrixMain {
    public static void main(String[] args) {
    double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}};
    RealMatrix m = MatrixUtils.createRealMatrix(matrixData);
    double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}};
    RealMatrix n = new Array2DRowRealMatrix(matrixData2);
    RealMatrix o = m.multiply(n);
    // Invert p, using LU decomposition
    RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();
    RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);
    RealMatrix q = o.add(p.power(2));
    System.out.println(q);
    }
    }
    Thanks to operator
    overloading and
    extensible tooling
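
    The transcript above only kept the Java half of the slide. A minimal Groovy sketch of the same computation (not the exact slide code):
    '*' maps to multiply() and '**' to power() out of the box, while the slide's '+' presumably relies on an extension method since
    RealMatrix exposes add() rather than plus(), so this sketch just calls add():

        import org.apache.commons.math3.linear.*

        def m = MatrixUtils.createRealMatrix([[1d, 2d, 3d], [2d, 5d, 3d]] as double[][])
        def n = MatrixUtils.createRealMatrix([[1d, 2d], [2d, 5d], [1d, 7d]] as double[][])
        def o = m * n                                                   // operator overloading: multiply()
        def p = new LUDecomposition(o).solver.inverse.scalarAdd(1d).scalarMultiply(2d)
        println o.add(p ** 2)                                           // '**' maps to power()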


  12. groovyConsole
    showing @Grab, power asserts and natural language processing


  13. REPL for Groovy (groovysh)


  14. Command chains and extensible type system
    Text file? No, it’s a statically typed Groovy program which
    won’t compile if the animal names aren’t valid.
    A constraint programming solver is used to find the solution.
    cranes have 2 legs
    tortoises have 4 legs
    there are 7 animals
    there are 20 legs
    display solution
    Cranes 4
    Tortoises 3


  15. Command chains and extensible type system
    cranes have 2 legs
    tortoises have 4 legs
    millipedes have 1000 legs
    there are 8 animals
    there are 1020 legs
    display solution
    Cranes 4
    Tortoises 3
    Millipedes 1
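
    To see how such a "plain text" program can parse at all, recall the Groovy command-chain rule: an unparenthesised chain a b c d
    is shorthand for a(b).c(d). The canonical example from the Groovy documentation (not the constraint solver behind these slides):

        show = { println it }
        square_root = { Math.sqrt(it) }

        def please(action) {
            [the: { what ->
                [of: { n -> action(what(n)) }]
            }]
        }

        // equivalent to: please(show).the(square_root).of(100)
        please show the square_root of 100      // prints 10.0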


  16. IDE support
    Intellij IDEA, Apache NetBeans, Eclipse, Groovy Web Console


  17. Build tools
    Gradle, Apache Maven, Apache Ant
    plugins {
    id 'groovy'
    }
    repositories {
    jcenter()
    }
    dependencies {
    implementation 'org.codehaus.groovy:groovy-all:2.5.7'
    testImplementation 'org.spockframework:spock-core:1.2-groovy-2.5'
    }
    project {
    modelVersion '4.0.0'
    groupId 'org.exampledriven'
    artifactId 'maven-polyglot-groovy-example-parent'
    version '1.0-SNAPSHOT'
    packaging 'pom'
    modules {
    module 'maven-polyglot-submodule'
    module 'maven-polyglot-spring'
    }
    }
    ant.sequential {
    echo("inside sequential")
    def myDir = "target/AntTest/"
    mkdir(dir: myDir)
    copy(todir: myDir) {
    fileset(dir: "src/test") {
    include(name: "**/*.groovy")
    }
    }
    echo("done")
    }


  18. Groovy frameworks plus entire Java ecosystem


  19. Data Science Process
    Research
    Goals
    Obtain Data
    Data
    Preparation
    Data
    Exploration
    Visualization
    Data
    Modeling
    Data
    ingestion
    Data
    storage
    Data processing
    platforms
    Modeling
    algorithms
    Math
    libraries
    Graphics
    processing
    Integration
    Deployment


  20. Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267
    Python/Scala/R Libraries/Tools/Frameworks for Data Science


  21. Groovy/Java Libraries/Tools/Frameworks for Data Science
    Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267
    Machine Learning
    DeepLearning4J
    Apache Spark ML
    Apache MXNet
    Visualization
    GroovyFX
    JFreeChart
    Tablesaw
    Mathematics and Engineering
    GPars
    Apache Commons Math
    Smile
    Weka
    Groovylab
    Apache Spark
    Apache Ignite
    Apache Beam
    Data Manipulation and Analysis
    Tablesaw
    Apache Commons CSV
    OpenCSV


  22. Obtain data
    Identify sources
    • Internal/External (data providers)
    • Databases, files, events, IoT
    • Unstructured/structured
    • Text, XML, JSON, YAML,
    CSV/spreadsheets, audio,
    images, video, web services
    • Apache Tika
    • JExcel
    • Apache POI
    Clarify ownership
    Check quality
    def jsonSlurper = new JsonSlurper()
    def object = jsonSlurper.parseText('{ "myList": [4, 8, 15, 16, 23, 42] }')
    assert object instanceof Map
    assert object.myList instanceof List
    assert object.myList == [4, 8, 15, 16, 23, 42]
    def response = new XmlSlurper().parseText(books)
    def authorResult = response.value.books.book[0].author
    assert authorResult.text() == 'Miguel de Cervantes'
    def qry = 'SELECT * FROM Author'
    assert sql.rows(qry, 1, 3)*.firstname == ['Dierk', 'Paul', 'Guillaume']
    assert sql.rows(qry, 4, 3)*.firstname == ['Hamlet', 'Cedric', 'Erik']
    assert sql.rows(qry, 7, 3)*.firstname == ['Jon']


  23. Data preparation
    Collections:
    • Java: lists, maps, sets
    • Groovy: literal notation, GDK methods, GPath
    • Libraries
    • Google Guava https://github.com/google/guava
    • Apache Common Collections https://commons.apache.org/collections/
    DataFrames:
    • Joinery https://cardillo.github.io/joinery/
    A data frame implementation in the spirit of Pandas or R data frames with show/plot
    • Tablesaw https://jtablesaw.github.io/tablesaw/
    Java dataframe and visualization library
    • Apache Spark DataFrames https://spark.apache.org/docs/latest/sql-programming-
    guide.html
    Equivalent to a table in a relational database or a data frame in R/Python
    • Paleo https://github.com/netzwerg/paleo
    Immutable Java 8 data frames with typed columns (including primitives)
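
    A small illustration (not from the slides, using made-up data) of the Groovy collection features listed above:
    literal notation, GDK methods and GPath navigation:

        def prices = [house1: 520_000, house2: 640_000, house3: 310_000]    // map literal
        def expensive = prices.findAll { it.value > 400_000 }               // GDK method
        assert expensive.keySet() == ['house1', 'house2'] as Set
        def rows = [[bedrooms: 3, price: 450_000], [bedrooms: 4, price: 635_000]]
        assert rows.price.max() == 635_000                                  // GPath over a list of maps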


  24. Data exploration - DEX


  25. Data exploration - Weka
    More info: https://www.youtube.com/watch?v=7quZv6WCTQc


  26. Visualization
    • Open-Source plotting libraries:
    GRAL, JFreeChart, Xchart, JMathPlot, Jzy3d, JavaFX, GroovyFX
    • Types: Scatter plot, Line plot, Area plot, Pie plot, Donut plot, Horizontal bars, Vertical
    bars, Bubble plot, Radar plot, Box plot, Raster plot, Contour plot, Gauge plot, Step
    plot, Gantt plot, Histogram
    • Features: Multiple axes, Logarithmic axes, Line smoothing, Data filtering/aggregation


  27. Notebooks supporting Groovy
    • Jupyter/beakerx
    • Apache Zeppelin
    • GroovyLab
    • Seco


  28. Notebooks supporting Groovy
    • Jupyter/beakerx
    • Apache Zeppelin
    • GroovyLab
    • Seco


  29. Data preparation
    IO:
    • Java in-built
    • Groovy enhancements
    • Libraries
    • Apache Commons IO https://commons.apache.org/io/
    • AOL Cyclops-React https://github.com/aol/cyclops-react
    Web scraping:
    • GEB
    Batch processing/ETL:
    • Spring Batch http://projects.spring.io/spring-batch/
    • Apache Camel
    • Apache Gobblin
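
    The "Groovy enhancements" bullet above refers to the GDK's file and stream helpers. A small sketch with hypothetical file names:

        def src = new File('ratings_raw.csv')                                // hypothetical input file
        def cleaned = src.readLines().findAll { it && !it.startsWith('#') }  // drop blank and comment lines
        new File('ratings_clean.csv').withWriter { w ->
            cleaned.each { w << it.trim() << '\n' }                          // writer is closed automatically
        }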


  30. Math libraries
    Math and stats libraries
    • Java: log, exp, other basic methods.
    Libraries:
    • Apache Commons Math http://commons.apache.org/math/
    statistics, optimization, and linear algebra
    • Apache Mahout http://mahout.apache.org/
    linear algebra, plus distributed linear algebra and machine learning
    • JBlas http://jblas.org/ optimized and very fast linear algebra package
    that uses the BLAS library
    Also, many of the machine learning libraries come with some math
    functionality, often linear algebra, stats, and optimization.
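
    A small sketch (not from the slides, made-up data) of the Commons Math statistics support mentioned above:

        import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics

        def stats = new DescriptiveStatistics()
        [4, 8, 15, 16, 23, 42].each { stats.addValue(it as double) }
        printf 'mean=%.2f median=%.2f stddev=%.2f%n',
                stats.mean, stats.getPercentile(50), stats.standardDeviation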


  31. Data storage
    Database:
    • Java’s JDBC
    • Groovy JDBC enhancements
    and Datasets
    • Other NOSQL
    • Apache Cassandra
    • MongoDB
    • Apache CouchDB
    • Neo4j
    import com.gmongo.GMongo
    def db = new GMongo().getDB('athletes')
    db.athletes.drop()
    db.athletes << [
    first: 'Paul',
    last: 'Tergat', dob: '1969-06-17',
    runs: [[distance: 42195, time: 2 * 60 * 60 + 4 * 60 + 55,
    venue: 'Berlin', when: '2003-09-28']]
    ]
    println "World rows following $marathon3.venue $marathon3.when:"
    def t = new Traversal()
    for (Path p in t.description().breadthFirst().
    relationships(MyRelationshipTypes.supercedes).
    evaluator(Evaluators.fromDepth(1)).
    uniqueness(Uniqueness.NONE).
    traverse(marathon3)) {
    def newRow = p.endNode()
    println "$newRow.venue $newRow.when"
    }
    Graph g = new Neo4jGraph(graphDb)
    def pretty = { it.collect { "$it.venue $it.when" }.join(', ') }
    def results = []
    g.V('venue', 'London').fill(results)
    println 'London world records: ' + pretty(results)
    results = []
    g.V('venue', 'London').in('supercedes').fill(results)
    println 'World records after London: ' + pretty(results)


  32. Machine learning and data mining
    Machine learning and data mining libraries:
    • Weka http://www.cs.waikato.ac.nz/ml/weka/
    • JavaML http://java-ml.sourceforge.net/
    old and reliable ML library but not actively updated
    • Smile http://haifengl.github.io/smile/
    ML library under active development
    • JSAT https://github.com/EdwardRaff/JSAT
    contains many machine learning algorithms
    • H2O http://www.h2o.ai/
    distributed Java ML framework also supports Scala, R and Python
    • Apache Mahout http://mahout.apache.org/
    distributed linear algebra framework
    • Apache Spark ml and mllib
    https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples


  33. Neural networks and text processing
    Neural networks:
    • Encog http://www.heatonresearch.com/encog/
    • DeepLearning4j http://deeplearning4j.org/
    natively supports Keras models
    Text processing
    • Java: StringTokenizer, the java.text package, regular expressions
    • Groovy: Regex enhancements, templates, String enhancements
    • Libraries:
    • Apache Lucene https://lucene.apache.org/
    • Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/
    • Apache OpenNLP https://opennlp.apache.org/
    • LingPipe http://alias-i.com/lingpipe/
    • GATE https://gate.ac.uk/
    • MALLET http://mallet.cs.umass.edu/
    • Smile http://haifengl.github.io/smile/
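
    A small illustration (not from the slides) of the Groovy regex, String and template enhancements listed above:

        def text = 'Groovy is groovy for NLP'
        assert text =~ /(?i)groovy/                                   // =~ find operator
        def counts = text.toLowerCase().split(/\s+/).toList().countBy { it }
        assert counts.groovy == 2                                     // GDK countBy
        def tpl = new groovy.text.SimpleTemplateEngine().createTemplate('Found ${n} unique words')
        assert tpl.make(n: counts.size()).toString() == 'Found 4 unique words'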


  34. Scaling up
    Microservice frameworks:
    • Micronaut
    Distributed data processing:
    • Apache Hadoop
    • Apache Spark
    • Apache Flink
    • Apache Samza
    • Apache Kafka
    • Apache Storm
    • Apache Apex
    • Apache Beam
    • Apache Ignite
    • Apache Nifi


  35. Scaling up: Apache Ignite
    Ignite™ is a memory-centric distributed database, caching, and
    processing platform for transactional, analytical, and streaming
    workloads delivering in-memory speeds at petabyte scale


  36. Scaling up: Apache Spark
    Source: https://spark.apache.org/ and https://renovacloud.com/en/an-introduction-to-and-evaluation-of-apache-spark-for-big-data-architectures/
    Apache Spark™ is a unified analytics engine for
    large-scale data processing


  37. • A Concurrency & Parallelism Framework for
    Groovy and Java
    • Fork/join, parallel arrays, map/reduce, actors, agents,
    STM, CSP, dataflow, asynchronous functions
    Scaling Up: GPars
    Data Parallelism:
    Fork/Join
    Map/Reduce
    Fixed coordination
    (for collections)
    Actors Explicit coordination
    Safe Agents Delegated coordination
    Dataflow Implicit coordination
    def decryptor = actor {
    loop {
    react { String message ->
    reply message.reverse()
    }
    }
    }
    def console = actor {
    decryptor << 'lellarap si yvoorG'
    react {
    println 'Decrypted message: ' + it
    }
    }
    console.join()
    final def df = new Dataflows()
    task { df.z = df.x + df.y }
    task { df.x = 10 }
    task { df.y = 5 }
    assert df.z == 15
    def nums = [1, 2, 3, 4]
    def result = nums.collectParallel{it * 2}
    assert result == [2, 4, 6, 8]
    def agent = new Agent([])
    agent { it << 'Dave' }
    agent { it << 'Joe' }
    assert agent.val.size() == 2
    class Account {
    private amount = newTxnInteger(0)
    void transfer(final int a) {
    atomic { amount.increment(a) }
    }
    int getCurrentAmount() {
    atomicWithInt { amount.get() }
    }
    }
    def future = {it * 2}.callAsync(3)
    assert 6 == future.get()
    (Dataflow diagram: tasks supplying x = 10 and y = 5 feed a '+' node producing z = 15)


  38. Scaling up: Apache Beam
    Apache Beam is an open source, unified model for defining
    both batch and streaming data-parallel processing pipelines
    Source: https://beam.apache.org/get-started/beam-overview/


  39. Computer science algorithms
    Decision
    problems
    Search
    problems
    Counting
    problems
    Optimization
    problems
    • Paradigms: brute force, divide & conquer, search & enumerate,
    randomized, complexity reduction, recursive, back tracking, graph
    • Optimization approaches: linear programming,
    dynamic programming, greedy, heuristics


  40. Data science algorithms
    Data
    Mining
    Statistics
    Machine
    Learning
    Optimization
    • Analytics: descriptive, predictive, prescriptive
    • Analysis: anomaly detection, classification, regression, clustering,
    association, optimization, dimension reduction
    • Data relationship: linear, non-linear
    • Assumptions: parametric, non-parametric
    • Strategy: supervised, unsupervised, reinforcement
    • Combining: ensemble, boosting


  41. Data Science Algorithms
    Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/


  42. Data Science Algorithms
    Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/


  43. Candles and COVID-19


  44. Candles and COVID-19


  45. Working with tables and plots
    List traces(String filename, String lineColor, String markerColor) {
    def url = getClass().classLoader.getResource(filename)
    def table = new XlsxReader().read(builder(url).build())
    table.addColumns(DateColumn.create('YearMonth', table.column('Date').collect { LocalDate.of(it.year, it.month, 15) }))
    def janFirst2017 = LocalDateTime.of(2017, JANUARY, 1, 0, 0)
    Function from2017 = r -> r.dateTimeColumn('Date').isAfter(janFirst2017)
    Function top3 = r -> r.intColumn('CandleID').isLessThanOrEqualTo(3)
    def byMonth = table.sortAscendingOn('Date')
    .where(and(from2017, top3))
    .summarize('Rating', mean).by('YearMonth')
    def byDate = table.sortAscendingOn('Date')
    .where(and(from2017, top3))
    .summarize('Rating', mean).by('Date')
    def averaged = ScatterTrace.builder(byMonth.dateColumn('YearMonth'), byMonth.nCol('Mean [Rating]'))
    .mode(ScatterTrace.Mode.LINE)
    .line(Line.builder().width(5).color(lineColor).shape(Line.Shape.SPLINE).smoothing(1.3).build())
    .build()
    def scatter = ScatterTrace.builder(byDate.dateTimeColumn('Date'), byDate.nCol('Mean [Rating]'))
    .marker(Marker.builder().opacity(0.3).color(markerColor).build())
    .build()
    [averaged, scatter]
    }


  46. Working with tables and plots
    def (sAverage, sScatter) = traces('Scented_all.xlsx', 'seablue', 'lightskyblue')
    def (uAverage, uScatter) = traces('Unscented_all.xlsx', 'seagreen', 'lightgreen')
    Plot.show(new Figure(layout('scented'), sAverage, sScatter, line))
    Plot.show(new Figure(layout('unscented'), uAverage, uScatter, line))


  47. Working with tables and plots


  48. Regression Overview Regression:
    • Relationship between variables
    Types:
    • Linear OLSR
    • Logistic
    • Polynomial
    • Stepwise
    • Ridge
    • Lasso
    • ElasticNet
    Aspects:
    • Under/over fitting
    • Outliers
    Applications:
    • House price prediction
    • Market predictions
    • Ratings prediction
    • Weather forecast


  49. Linear regression case study: House price predictions
    See also: https://nicolai92.github.io/posts/pentaho-weka-prediction
    Dataset: https://www.kaggle.com/harlfoxem/housesalesprediction/


  50. House price predictions – view data graphically
    import org.apache.commons.math3.random.EmpiricalDistribution
    import static groovyx.javafx.GroovyFX.start
    import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
    def file = 'kc_house_data.csv' as File
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
    def all = csv.collect { it.bedrooms.toInteger() }
    def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) }
    def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }
    start {
    stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) {
    scene {
    barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) {
    series(name: 'Number of properties', data: bins)
    }
    }
    }
    }


  51. House price predictions – view data graphically


  52. House price predictions – view stats
    import org.apache.commons.math3.stat.descriptive.SummaryStatistics
    import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
    def file = 'kc_house_data.csv' as File
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
    def all = csv.collect { it.bedrooms.toInteger() }
    def stats = new SummaryStatistics()
    all.each{ stats.addValue(it as double) }
    println stats.summary
    n: 21613
    min: 0.0
    max: 33.0
    mean: 3.370841623097218
    std dev: 0.930061831147451
    variance: 0.8650150097573497
    sum: 72854.0


  53. House price predictions – investigate outliers
    import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
    def file = 'kc_house_data.csv' as File
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
    csv.findAll{ it.bedrooms.toInteger() > 10 }.each{ println it.toMap() as TreeMap }
    [bathrooms:3, bedrooms:11, condition:3, date:20140821T000000, floors:2, grade:7, id:1773100755, lat:47.556,
    long:-122.363, price:520000, sqft_above:2400, sqft_basement:600, sqft_living:3000, sqft_living15:1420,
    sqft_lot:4960, sqft_lot15:4960, view:0, waterfront:0, yr_built:1918, yr_renovated:1999, zipcode:98106]
    [bathrooms:1.75, bedrooms:33, condition:5, date:20140625T000000, floors:1, grade:7, id:2402100895, lat:47.6878,
    long:-122.331, price:640000, sqft_above:1040, sqft_basement:580, sqft_living:1620, sqft_living15:1330,
    sqft_lot:6000, sqft_lot15:4700, view:0, waterfront:0, yr_built:1947, yr_renovated:0, zipcode:98103]


  54. House price predictions – investigate outliers
    Some libraries optionally support CSV -> richly-typed domain classes
    import com.opencsv.bean.*
    import groovy.transform.ToString
    @ToString(includeNames = true)
    class House {
    @CsvBindByName
    Integer bedrooms
    @CsvBindByName
    String bathrooms
    @CsvBindByName(column = 'sqft_lot')
    Integer area_lot
    }
    def file = 'kc_house_data.csv' as File
    def builder = new CsvToBeanBuilder(new FileReader(file))
    def rows = builder.withType(House).build().parse()
    rows.findAll{ it.bedrooms > 10 }.each{ println it }
    House(bedrooms:11, bathrooms:3, area_lot:4960)
    House(bedrooms:33, bathrooms:1.75, area_lot:6000)


  55. House prices – remove outlier and view bedrooms
    import org.apache.commons.math3.random.EmpiricalDistribution
    import static groovyx.javafx.GroovyFX.start
    import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
    def all = csv.collect { it.bedrooms.toInteger() }.findAll{ it < 30 }
    def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) }
    def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }
    start {
    stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) {
    scene {
    barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) {
    series(name: 'Number of properties', data: bins)
    }
    }
    }
    }


  56. House price predictions – remove outlier and view price
    import org.apache.commons.math3.random.EmpiricalDistribution
    import org.apache.commons.math3.stat.descriptive.SummaryStatistics
    import static groovyx.javafx.GroovyFX.start
    import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
    def file = 'kc_house_data.csv' as File
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
    def all = csv.findAll { it.bedrooms.toInteger() < 30 }.collect { it.price.toDouble() }
    def info = new SummaryStatistics(); all.each(info::addValue)
    def head = "Price percentile (min=\$$info.min, mean=\$${info.mean as int}, max=\$$info.max)"
    def dist = new EmpiricalDistribution(100).tap{ load(all as double[]) }
    def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }
    start {
    stage(title: 'Price histogram', show: true, width: 800, height: 600) {
    scene {
    barChart(title: head, barGap: 0, categoryGap: 0) {
    series(name: 'Number of properties', data: bins)
    }
    }
    }
    }


  57. House price predictions – explore with tablesaw


  58. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))


  59. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
    21613 rows X 21 cols


  60. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
    Structure of kc_house_data.csv
    Index | Column Name | Column Type |
    -------------------------------------------
    0 | id | LONG |
    1 | date | STRING |
    2 | price | DOUBLE |
    3 | bedrooms | INTEGER |
    4 | bathrooms | DOUBLE |
    5 | sqft_living | INTEGER |
    6 | sqft_lot | INTEGER |
    7 | floors | DOUBLE |
    8 | waterfront | INTEGER |
    9 | view | INTEGER |
    ... | ... | ... |
    11 | grade | INTEGER |
    12 | sqft_above | INTEGER |
    13 | sqft_basement | INTEGER |
    14 | yr_built | INTEGER |
    15 | yr_renovated | INTEGER |
    16 | zipcode | INTEGER |
    17 | lat | DOUBLE |
    18 | long | DOUBLE |
    19 | sqft_living15 | INTEGER |
    20 | sqft_lot15 | INTEGER |


  61. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
    Column: bedrooms
    Measure | Value |
    -----------------------------------
    n | 21613.0 |
    sum | 72854.0 |
    Mean | 3.370841623097218 |
    Min | 0.0 |
    Max | 33.0 |
    Range | 33.0 |
    Variance | 0.8650150097573497 |
    Std. Dev | 0.930061831147451 |


  62. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
    kc_house_data.csv
    id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    1773100755 | 20140821 | 520000.0 | 11 | 3.0 | 3000 | 4960 | 2.0 | 0 | 0 | 3 | 7 | 2400 | 600 | 1918 | 1999 | 98106 | 47.556 | -122.363 | 1420 | 4960 |
    2402100895 | 20140625 | 640000.0 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1040 | 580 | 1947 | 0 | 98103 | 47.687 | -122.331 | 1330 | 4700 |


  63. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
    21612 rows X 21 cols


  64. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
    kc_house_data.csv summary
    bedrooms | Mean [price] | Min [price] | Max [price] |
    -------------------------------------------------------------------
    0 | 409503.8461538461 | 139950.0 | 1295650.0 |
    1 | 317642.8844221105 | 75000.0 | 1247000.0 |
    2 | 401372.68188405805 | 78000.0 | 3278000.0 |
    3 | 466232.0784812695 | 82000.0 | 3800000.0 |
    4 | 635419.5042138912 | 100000.0 | 4489000.0 |
    5 | 786599.8288569644 | 133000.0 | 7062500.0 |
    6 | 825520.6360294118 | 175000.0 | 7700000.0 |
    7 | 951184.6578947367 | 280000.0 | 3200000.0 |
    8 | 1105076.923076923 | 340000.0 | 3300000.0 |
    9 | 893999.8333333334 | 450000.0 | 1400000.0 |
    10 | 819333.3333333334 | 650000.0 | 1148000.0 |
    11 | 520000.0 | 520000.0 | 520000.0 |


  65. House price predictions – explore with tablesaw
    import tech.tablesaw.api.*
    import tech.tablesaw.plotly.Plot
    import tech.tablesaw.plotly.api.*
    import static tech.tablesaw.aggregate.AggregateFunctions.*
    def rows = Table.read().csv('kc_house_data.csv')
    println rows.shape()
    println rows.structure()
    println rows.column("bedrooms").summary().print()
    println rows.where(rows.column("bedrooms").isGreaterThan(10))
    def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    println cleaned.shape()
    println cleaned.summarize("price", mean, min, max).by("bedrooms")
    Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))


  66. House price predictions – explore with tablesaw
    cleaned.addColumns(
    StringColumn.create("waterfrontDesc", cleaned.column("waterfront").collect{ it ? 'waterfront' : 'interior' }),
    DoubleColumn.create("scaledGrade", cleaned.column("grade").collect{ it * 2 }),
    DoubleColumn.create("scaledPrice", cleaned.column("price").collect{ it / 100000 })
    )
    Plot.show(BubblePlot.create("Price vs living area and grade (bubble size)",
    cleaned, "sqft_living", "price", "scaledGrade", "waterfrontDesc"))
    Plot.show(Scatter3DPlot.create("Grade, living space, bathrooms and price (bubble size)",
    cleaned, "sqft_living", "bathrooms", "grade", "scaledPrice", "waterfrontDesc"))


  67. House price predictions – explore with jupyter/beakerx


  68. Regression with Least Squares Find line of “best fit”:


  69. Regression with Least Squares Find line of “best fit”:
    • Find intercept and slope of
    line which minimizes the
    sum of the square of the
    residual errors
    (Diagram: residual errors between the data points and the fitted line, y plotted against x)


  70. Regression with Least Squares Find line of “best fit”:
    • Find intercept and slope of
    line which minimizes the
    sum of the square of the
    residual errors
    • Line is represented by:
    y = m x + b
    • For points (x₀, y₀) .. (xₙ, yₙ) the errors will be:
      err₀ = y₀ − (m·x₀ + b), ..., errₙ = yₙ − (m·xₙ + b)
      which are then squared
    • To minimise, we find where the derivative is zero
    • Solving the math gives:
      b = (Σy·Σx² − Σx·Σxy) / (n·Σx² − (Σx)²)
      m = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
    (Diagram: residual errors err₀ .. errₙ between the points (x₀, y₀) .. (xₙ, yₙ) and the fitted line, which crosses the y axis at (0, b))
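
    A tiny Groovy sketch (not from the slides) applying those closed-form formulas to made-up points lying exactly on y = 2x + 1:

        def x = [1, 2, 3, 4, 5]
        def y = [3, 5, 7, 9, 11]
        def n = x.size()
        def (sx, sy) = [x.sum(), y.sum()]
        def sxx = x.sum { it * it }
        def sxy = [x, y].transpose().sum { a, b -> a * b }
        def m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        def b = (sy * sxx - sx * sxy) / (n * sxx - sx * sx)
        assert m == 2 && b == 1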


  71. Regression with Gradient Descent Find line of “best fit”:
    • Find intercept and slope of
    line which minimizes the
    sum of the square of the
    residual errors
    • Line is represented by:
    y = m x + b
    and we might have a
    similar loss function
    • Instead of solving for the
    derivative of loss function,
    we take repeated steps in
    the opposite direction of
    the gradient, taking smaller
    steps as we approach zero
    • A learning rate influences
    step size
    • Amenable to streaming &
    can involve random or
    mini-batch subsets of
    data for efficiency
    (Diagram: loss function (residual error) plotted against the parameter being optimized, which could be multiple dimensions)
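
    A toy Groovy sketch (not from the slides, made-up data) of gradient descent for the same y = m·x + b model,
    minimising mean squared error with a fixed learning rate:

        def xs = [1, 2, 3, 4, 5] as double[]
        def ys = [3, 5, 7, 9, 11] as double[]        // exactly y = 2x + 1
        double m = 0, b = 0, rate = 0.05
        1000.times {
            double dm = 0, db = 0
            xs.eachWithIndex { x, i ->
                def err = (m * x + b) - ys[i]        // residual for this point
                dm += 2 * err * x                    // d(err²)/dm
                db += 2 * err                        // d(err²)/db
            }
            m -= rate * dm / xs.length               // step opposite the gradient
            b -= rate * db / xs.length
        }
        printf 'm=%.3f b=%.3f%n', m, b               // converges towards m=2, b=1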


  72. House price predictions – linear regression
    import org.apache.commons.math3.stat.regression.SimpleRegression
    import static groovyx.javafx.GroovyFX.start
    import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
    def feature = 'bedrooms'
    def nonOutliers = feature == 'bedrooms' ? { it[0] < 30 } : { true }
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
    def all = csv.collect { [it[feature].toDouble(), it.price.toDouble()] }.findAll(nonOutliers)
    def reg = new SimpleRegression().tap{ addData(all as double[][]) }
    def (min, max) = all.transpose().with{ [it[0].min(), it[0].max()] }
    def predicted = [[min, reg.predict(min)], [max, reg.predict(max)]]
    start {
    stage(title: 'Price vs Number of bedrooms', show: true, width: 800, height: 600) {
    scene {
    scatterChart {
    series(name: 'Actual', data: all)
    }
    lineChart {
    series(name: 'Predicted', data: predicted)
    }
    }
    }
    }


  73. House price predictions – linear regression
    • Can be repeated for other features


  74. House price predictions – linear regression


  75. House price predictions – evaluating regression
    @Canonical
    @FXBindable
    class Feature {
    String name
    double rr, rmseTrain, rmseTest, mean
    }
    def features = [new Feature('bedrooms'), new Feature('bathrooms'), new Feature('grade'), new Feature('sqft_living')]
    def idxs = 0..&lt;features.size()
    def priceIdx = features.size()
    def file = getClass().classLoader.getResource('kc_house_data.csv').file
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
    def all = csv.collect { row ->
    [*idxs.collect{ row[features[it].name].toDouble() }, row.price.toDouble()]
    }.findAll{ it[0] < 30 }
    def (train, test) = all.chop(all.size() * 0.8 as int, -1)
    def trainT = train.transpose()
    def regs = idxs.collect { idx ->
    new SimpleRegression().tap{ addData([trainT[idx], trainT[priceIdx]].transpose() as double[][]) }
    }
    def residuals = idxs.collect{ idx -> test.collect{ regs[idx].predict(it[idx]) - it[priceIdx] } }


  76. House price predictions – evaluating regression
    idxs.each { idx ->
    features[idx].rr = regs[idx].RSquare
    features[idx].rmseTrain = Math.sqrt(regs[idx].meanSquareError)
    features[idx].rmseTest = Math.sqrt(StatUtils.sumSq(residuals[idx] as double[]) / (test.size() - 1))
    features[idx].mean = StatUtils.mean(residuals[idx] as double[])
    }
    start {
    stage(title: "Error statistics for feature", visible: true) {
    scene(fill: groovyblue, width: 800, height:200) {
    stackPane(padding: 10) {
    tableView(items: features) {
    tableColumn(text: "Feature", property: 'name')
    tableColumn(text: "R²", property: 'rr')
    tableColumn(text: "RMSE (train)", property: 'rmseTrain')
    tableColumn(text: "RMSE (test)", property: 'rmseTest')
    tableColumn(text: "Residuals Mean", property: 'mean')
    }
    }
    }
    }
    }


  77. House price predictions – evaluating regression
    def maxError = [residuals[graphIdx].min(), residuals[graphIdx].max()].max{ Math.abs(it) }
    residuals[graphIdx] << maxError * -1 // make graph even around origin
    maxError = Math.abs(maxError)
    def step = maxError.toInteger() / 50
    def dist = new EmpiricalDistribution(100).tap{ load(residuals[graphIdx] as double[]) }
    def ndist = new NormalDistribution(0, dist.sampleStats.standardDeviation)
    def bins = dist.binStats.indexed().collect { i, v ->
    [v.n ? v.mean.toInteger().toString(): (-maxError + i * step).toInteger().toString(), v.n]
    }
    def nbins = dist.binStats.indexed().collect { i, v ->
    def x = v.n ? v.mean.toInteger() : (-maxError + i * step).toInteger(); [x.toString(), ndist.probability(x, x + step)]
    }
    def scale = dist.binStats.max{ it.n }.n / nbins.max{ it[1] }[1]
    nbins = nbins.collect{ [it[0], it[1] * scale] }
    start {
    stage(title: "Error histogram for ${features[graphIdx].name}", show: true, width: 800, height: 600) {
    scene {
    barChart(title: 'Error percentile', barGap: 0, categoryGap: 0) {
    series(name: 'Error in prediction', data: bins)
    series(name: 'Normal distribution', data: nbins)
    }
    }
    }
    }


  78. House price predictions – evaluating regression


  79. House price predictions – multi linear regression
    def features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat',
    'sqft_above', 'grade', 'view', 'waterfront', 'floors'
    ]
    def idxs = 0..&lt;features.size()
    def priceIdx = features.size()
    def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
    def all = csv.collect { row -> [*idxs.collect{ row[features[it]].toDouble() }, row.price.toDouble()] }.findAll{ it[0] < 30 }
    def (train, test) = all.chop(all.size() * 0.8 as int, -1)
    def trainT = train.transpose()
    def price = trainT[priceIdx]
    def reg = new OLSMultipleLinearRegression().tap{ newSampleData(price as double[], trainT[0..&lt;priceIdx].transpose() as double[][]) }
    def params = reg.estimateRegressionParameters()
    println params
    def predicted = test.collect { data -> params[0] + (1..priceIdx).sum{ params[it] * data[it - 1] } }
    def residuals = test.indexed().collect { i, data -> predicted[i] - data[priceIdx] }
    def rr = reg.calculateRSquared()
    def rmseTrain = Math.sqrt(reg.calculateResidualSumOfSquares())
    def rmseTest = Math.sqrt(StatUtils.sumSq(residuals as double[]) / (test.size() - 1))
    def mean = StatUtils.mean(residuals as double[])
    println "$rr $rmseTrain $rmseTest $mean"
    [-3.1635392225013282E7, -27484.961699434185, -5236.246584477492, 191.507648719199, 17.94379836055797, 657803.1115560957,
    4.369533695386711, 74850.68999978402, 66459.03393303146, 604896.8569593901, -27203.796342244277]
    0.6536012563023335 215379.66626496855 212609.34498021202 1886.6015077985683
    R2: https://en.wikipedia.org/wiki/Coefficient_of_determination
    rmse: https://en.wikipedia.org/wiki/Root-mean-square_deviation


  80. House price predictions – multi linear regression


  81. House price predictions – multi linear regression
    def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
    def loader = new CSVLoader(file: file)
    def model = new LinearRegression()
    def allInstances = loader.dataSet
    def priceIndex = 2
    allInstances.classIndex = priceIndex
    // remove "id" and "date" columns
    def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances)
    def instances = Filter.useFilter(allInstances, rm)
    model.buildClassifier(instances)
    println model
    def actual = instances.collect{ it.value(0).toDouble() }
    def predicted = instances.collect{ model.classifyInstance(it) }
    def chart = new XYChartBuilder()…
    Linear Regression Model
    price =
    -35766.5414 * bedrooms +
    41144.2785 * bathrooms +
    82.8088 * sqft_living +
    0.1286 * sqft_lot +
    6689.5501 * floors +
    582960.4584 * waterfront +
    52870.9424 * view +
    26385.6491 * condition +
    95890.4452 * grade +
    98.4193 * sqft_above +
    67.2917 * sqft_basement +
    -2620.2232 * yr_built +
    19.8126 * yr_renovated +
    -582.4199 * zipcode +
    602748.2265 * lat +
    -214729.8283 * long +
    21.6814 * sqft_living15 +
    -0.3826 * sqft_lot15 +
    6690324.6024


  82. House price predictions – multi linear regression
    def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
    def loader = new CSVLoader(file: file)
    def model = new LinearRegression()
    def allInstances = loader.dataSet
    def priceIndex = 2
    allInstances.classIndex = priceIndex
    // remove "id" and "date" columns
    def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances)
    def instances = Filter.useFilter(allInstances, rm)
    model.buildClassifier(instances)
    println model
    def actual = instances.collect{ it.value(0).toDouble() }
    def predicted = instances.collect{ model.classifyInstance(it) }
    def chart = new XYChartBuilder()…
    Linear Regression Model
    price =
    -35766.5414 * bedrooms +
    41144.2785 * bathrooms +
    82.8088 * sqft_living +
    0.1286 * sqft_lot +
    6689.5501 * floors +
    582960.4584 * waterfront +
    52870.9424 * view +
    26385.6491 * condition +
    95890.4452 * grade +
    98.4193 * sqft_above +
    67.2917 * sqft_basement +
    -2620.2232 * yr_built +
    19.8126 * yr_renovated +
    -582.4199 * zipcode +
    602748.2265 * lat +
    -214729.8283 * long +
    21.6814 * sqft_living15 +
    -0.3826 * sqft_lot15 +
    6690324.6024


  83. House price predictions – multi linear regression
    def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
    def loader = new CSVLoader(file: file)
    def model = new SGD()
    model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled
    def allInstances = loader.dataSet
    def priceIndex = 2
    allInstances.classIndex = priceIndex
    // remove "id", "date", 'zip', 'lat', 'long' columns
    def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat:
    allInstances)
    def instances = Filter.useFilter(allInstances, rm)
    model.buildClassifier(instances)
    println model
    def actual = instances.collect{ it.value(0).toDouble() }
    def predicted = instances.collect{ model.classifyInstance(it) }
    def chart = new XYChartBuilder()…
    Loss function: Huber loss
    (robust regression)
    price =
    -9.6832 bedrooms
    + 0.4953 bathrooms
    + 112.6971 sqft_living
    + 0.8189 sqft_lot
    + 5.5 floors
    + 0.7041 waterfront
    + 11.3372 view
    + 7.8443 condition
    + 23.9548 grade
    + 37.5499 sqft_above
    + 75.1472 sqft_basement
    + -7.8827 yr_built
    + 63.7169 yr_renovated
    + 103.4893 sqft_living15
    + -1.3449 sqft_lot15
    + 0.2788


  84. House price predictions – multi linear regression
    def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
    def loader = new CSVLoader(file: file)
    def model = new SGD()
    model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled
    def allInstances = loader.dataSet
    def priceIndex = 2
    allInstances.classIndex = priceIndex
    // remove "id", "date", 'zip', 'lat', 'long' columns
    def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat:
    allInstances)
    def instances = Filter.useFilter(allInstances, rm)
    model.buildClassifier(instances)
    println model
    def actual = instances.collect{ it.value(0).toDouble() }
    def predicted = instances.collect{ model.classifyInstance(it) }
    def chart = new XYChartBuilder()…
    Loss function: Huber loss
    (robust regression)
    price =
    -9.6832 bedrooms
    + 0.4953 bathrooms
    + 112.6971 sqft_living
    + 0.8189 sqft_lot
    + 5.5 floors
    + 0.7041 waterfront
    + 11.3372 view
    + 7.8443 condition
    + 23.9548 grade
    + 37.5499 sqft_above
    + 75.1472 sqft_basement
    + -7.8827 yr_built
    + 63.7169 yr_renovated
    + 103.4893 sqft_living15
    + -1.3449 sqft_lot15
    + 0.2788


  85. House price predictions – explore with Weka


  86. House price predictions – explore with Weka


  87. House price predictions – analyze with Weka


  88. House price predictions – analyze with Weka


  89. House price predictions – analyze with Weka


  90. House price predictions – scaling regression
    import …
    def rows = Table.read().csv('kc_house_data.csv')
    rows = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
    String[] features = [
    'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15',
    'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors'
    ]
    def data = rows.as().doubleMatrix(features)
    // configure to all run on local machine but could be a cluster (can be hidden in XML)
    def cfg = new IgniteConfiguration(
    peerClassLoadingEnabled: true,
    discoverySpi: new TcpDiscoverySpi(
    ipFinder: new TcpDiscoveryMulticastIpFinder(
    addresses: ['127.0.0.1:47500..47509']
    )
    )
    )
    static pretty(mdl, features) {
    def sign = { val -> val < 0 ? '- ' : '+ ' }
    def valIdx = { idx, val -> sprintf '%.2f*%s', val, features[idx+1] }
    def valIdxSign = { idx, val -> sign(val) + valIdx(idx, Math.abs(val)) }
    def valSign = { val -> sign(val) + sprintf('%.2f', Math.abs(val)) }
    def (w, i) = [mdl.weights, mdl.intercept]
    [valIdx(0, w.get(0)), *(1..&lt;w.size()).collect{ valIdxSign(it, w.get(it)) }, valSign(i)].join(' ')
    }
    Read in data and select
    features of interest
    Adjust config for known
    local mode setup
    Utility method to pretty
    print model


  91. House price predictions – scaling regression

    Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    def dataCache = new SandboxMLCache(ignite).fillCacheWith(data)
    def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0))
    def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST)
    def split = new TrainTestDatasetSplitter().split(0.8)
    def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer)
    def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() })
    def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM())
    dataCache.destroy()
    println ">>> Model: " + pretty(mdl, features)
    println ">>> R^2 : " + rr
    println ">>> RMSE : " + rmse
    }
    Candidates for
    distributed processing


  92. House price predictions – scaling regression

    Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    def dataCache = new SandboxMLCache(ignite).fillCacheWith(data)
    def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0))
    def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST)
    def split = new TrainTestDatasetSplitter().split(0.8)
    def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer)
    def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() })
    def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM())
    dataCache.destroy()
    println ">>> Model: " + pretty(mdl, features)
    println ">>> R^2 : " + rr
    println ">>> RMSE : " + rmse
    }
    [15:02:09] Ignite node started OK (id=63b2d47d)
    >>> Ignite grid started for data: 21612 rows X 11 cols
    >>> Model: -33081.12*bedrooms - 16205.93*bathrooms + 225.40*sqft_living + 6.77*sqft_living15 - 8766.52*lat
    - 38.59*sqft_above + 96251.64*grade + 62303.58*view + 572116.79*waterfront - 11620.55*floors - 48504.18
    >>> R^2 : 0.607023233473466
    >>> RMSE : 228035.85816493785
    [15:02:10] Ignite node stopped OK [uptime=00:00:01.465]
    See also: https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/regression/linear/BostonHousePricesPredictionExample.java


  93. House prices – scaling regression
    import …
    def spark = builder().config('spark.master', 'local[8]').appName('HousePrices').orCreate
    def file = '/path/to/kc_house_data.csv'
    int k = 5
    Dataset ds = spark.read().format('csv')
    .options('header': 'true', 'inferSchema': 'true').load(file)
    double[] splits = [80, 20]
    def (training, test) = ds.randomSplit(splits)
    String[] colNames = ds.columns().toList().minus(['id', 'date', 'price'])
    def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
    Dataset dataset = assembler.transform(training)
    def lr = new LinearRegression(labelCol: 'price', maxIter: 10)
    def model = lr.fit(dataset)
    println 'Coefficients:'
    println model.coefficients().values()[1..-1].collect{ sprintf '%.2f', it }.join(', ')
    def testSummary = model.evaluate(assembler.transform(test))
    printf 'RMSE: %.2f%n', testSummary.rootMeanSquaredError
    printf 'r2: %.2f%n', testSummary.r2
    spark.stop()
    Coefficients:
    41979.78, 80853.89, 0.15, 5412.83, 564343.22, 53834.10, 24817.09, 93195.29, -80662.68,
    -80694.28, -2713.58, 19.02, -628.67, 594468.23, -228397.19, 21.23, -0.42
    RMSE: 187242.12
    r2: 0.70


  94. Clustering Overview Clustering:
    • Grouping similar items
    Algorithm families:
    • Hierarchical
    • Partitioning k-means, x-means
    • Density-based
    • Graph-based
    Aspects:
    • Disjoint vs overlapping
    • Preset cluster number
    • Dimensionality reduction PCA
    • Nominal feature support
    Applications:
    • Market segmentation
    • Recommendation engines
    • Search result grouping
    • Social network analysis
    • Medical imaging


  95. Clustering
    https://commons.apache.org/proper/commons-math/userguide/ml.html


  96. Clustering case study: Whiskey flavor profiles
    • 86 scotch
    whiskies
    • 12 flavor
    categories
    Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c
    https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/
    https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/


  97. Whiskey – exploring with Dex


  98. import …
    def table = Table.read().csv('whiskey.csv')
    table = table.removeColumns(0)
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def panel = new PlotPanel(
    *[0..&lt;cols.size(), 0..&lt;cols.size()].combinations().collect { first, second ->
    def color = new Color(72 + (first * 16), 72 + (second * 16), 200 - (first * 4) - (second * 4))
    ScatterPlot.plot(table.as().doubleMatrix(cols[first], cols[second]), '#' as char, color)
    }
    )
    new SwingBuilder().edt {
    frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) {
    widget(panel)
    }
    }
    Whiskey: How to visualize? Feature combinations?


  99. Hmm… not particularly enlightening
    Whiskey: How to visualize? Feature combinations?

    View Slide

  100. Another dimension?
    Whiskey: How to visualize? Feature combinations?

    View Slide

  101. import …
    def table = Table.read().csv('whiskey.csv')
    table = table.removeColumns(0)
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW]
    def panel = new PlotPanel(
    *[cols, cols].combinations().collect { first, second ->
    Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors)
    }
    )
    new SwingBuilder().edt {
    frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) {
    widget(panel)
    }
    }
    Whiskey: How to visualize? Feature combinations?

    View Slide

  102. import …
    def table = Table.read().csv('whiskey.csv')
    table = table.removeColumns(0)
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW]
    def panel = new PlotPanel(
    *[cols, cols].combinations().collect { first, second ->
    Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors)
    }
    )
    new SwingBuilder().edt {
    frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) {
    widget(panel)
    }
    }
    Whiskey: How to visualize? Feature combinations?

    View Slide

  103. import …
    def table = Table.read().csv('whiskey.csv')
    table = table.removeColumns(0)
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW]
    def panel = new PlotPanel(
    *[cols, cols].combinations().collect { first, second ->
    Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors)
    }
    )
    SwingUtil.show(size: [1200, 900], panel)
    Whiskey: How to visualize? Feature combinations?

    View Slide

  104. Pretty… but only marginally more useful
    Whiskey: How to visualize? Feature combinations?

    View Slide

  105. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    at random

    View Slide

  106. Clustering with KMeans Step 1:
    • Guess k cluster centroids

    View Slide

  107. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid

    View Slide

  108. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid

    View Slide

  109. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points

    View Slide

  110. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points

    View Slide

  111. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points

    View Slide

  112. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points

    View Slide

  113. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points
    Repeat steps 2 and 3 until
    stable or some limit reached

    View Slide

  114. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points
    Repeat steps 2 and 3 until
    stable or some limit reached

    View Slide

  115. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points
    Repeat steps 2 and 3 until
    stable or some limit reached

    View Slide

  116. Clustering with KMeans Step 1:
    • Guess k cluster centroids
    Step 2:
    • Assign points to closest
    centroid
    Step 3:
    • Calculate new centroids
    based on selected points
    Repeat steps 2 and 3 until
    stable or some limit reached

    View Slide

  117. import …
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def numClusters = 5
    def loader = new CSVLoader(file: 'whiskey.csv')
    def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true)
    def instances = loader.dataSet
    instances.deleteAttributeAt(0) // remove RowID
    clusterer.buildClusterer(instances)
    println ' ' + cols.join(', ')
    def dataset = new DefaultCategoryDataset()
    clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num ->
    print "Cluster ${num+1}: "
    println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', '))
    (1..cols.size()).each { idx ->
    dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) }
    }
    def clusters = (0..<numClusters).collectEntries{ [it, []] }
    clusterer.assignments.eachWithIndex { cnum, idx ->
        clusters[cnum] << instances.get(idx).stringValue(0) }
    clusters.each { k, v ->
    println "Cluster ${k+1}:"
    println v.join(', ')
    }
    def plot = new SpiderWebPlot(dataset: dataset)
    def chart = new JFreeChart('Whiskey clusters', plot)
    SwingUtil.show(new ChartPanel(chart))
    Whiskey – clustering with radar plot and weka
    Read CSV and create clusterer
    Display whiskies assigned to each cluster
    Graph cluster centroids

    View Slide

  118. import …
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def numClusters = 5
    def loader = new CSVLoader(file: 'whiskey.csv')
    def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true)
    def instances = loader.dataSet
    instances.deleteAttributeAt(0) // remove RowID
    clusterer.buildClusterer(instances)
    println ' ' + cols.join(', ')
    def dataset = new DefaultCategoryDataset()
    clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num ->
    print "Cluster ${num+1}: "
    println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', '))
    (1..cols.size()).each { idx ->
    dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) }
    }
    def clusters = (0..<numClusters).collectEntries{ [it, []] }
    clusterer.assignments.eachWithIndex { cnum, idx ->
        clusters[cnum] << instances.get(idx).stringValue(0) }
    clusters.each { k, v ->
    println "Cluster ${k+1}:"
    println v.join(', ')
    }
    def plot = new SpiderWebPlot(dataset: dataset)
    def chart = new JFreeChart('Whiskey clusters', plot)
    SwingUtil.show(new ChartPanel(chart))
    Whiskey – clustering with radar plot and weka
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000
    Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591
    Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591
    Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133
    Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714
    Cluster 1:
    Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker
    Cluster 2:
    Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach,
    BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas,
    Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
    Cluster 3:
    ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith,
    GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff,
    Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore
    Cluster 4:
    AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain,
    Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne,
    Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood,
    Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul
    Cluster 5:
    Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn,
    OldPulteney, RoyalBrackla, Tullibardine

    View Slide

  119. import …
    def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv'))
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def clusterer = new KMeansPlusPlusClusterer(5)
    def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) }
    def centroids = clusterer.cluster(data)
    println cols.join(', ') + ', Medoid'
    def dataset = new DefaultCategoryDataset()
    centroids.eachWithIndex{ ctrd, num ->
    def cpt = ctrd.center.point
    def closest = ctrd.points.min{ pt ->
        sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[])
    }
    def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery
    println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid"
    cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) }
    }
    def plot = new SpiderWebPlot(dataset: dataset)
    def chart = new JFreeChart('Whiskey clusters', plot)
    SwingUtil.show(new ChartPanel(chart))
    Whiskey – clustering with radar plot and medoids
    Same again with Apache Commons Math but also calculate medoids

    View Slide

  120. import …
    def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv'))
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def clusterer = new KMeansPlusPlusClusterer(5)
    def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) }
    def centroids = clusterer.cluster(data)
    println cols.join(', ') + ', Medoid'
    def dataset = new DefaultCategoryDataset()
    centroids.eachWithIndex{ ctrd, num ->
    def cpt = ctrd.center.point
    def closest = ctrd.points.min{ pt ->
        sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[])
    }
    def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery
    println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid"
    cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) }
    }
    def plot = new SpiderWebPlot(dataset: dataset)
    def chart = new JFreeChart('Whiskey clusters', plot)
    SwingUtil.show(new ChartPanel(chart))
    Whiskey – clustering with radar plot and medoids
    Libraries: Apache Commons Math and JFreeChart
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid
    2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd
    2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy
    2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish
    1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore
    1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach

    View Slide

  121. Dimensionality reduction
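    A minimal sketch of the idea with Smile's PCA on synthetic data (three columns, the first two highly
    correlated) before applying it to the whiskey features on the next slides:
    import smile.projection.PCA

    // synthetic data: column 2 is almost a multiple of column 1, column 3 is noise
    double[][] data = (1..50).collect { [it, 2 * it + (Math.random() - 0.5), 10 * Math.random()] as double[] }
    def pca = PCA.fit(data)
    println pca.varianceProportion.toList()   // most of the variance sits in the first component
    pca.projection = 2
    println pca.project(data[0]).toList()     // a single row squeezed into 2 components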

    View Slide

  122. Whiskey – clustering with PCA plot
    import …
    def table = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal",
    "Tobacco", "Honey", "Spicy", "Winey",
    "Nutty", "Malty", "Fruity", "Floral"]
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 2
    def projected = pca.project(data)
    def clusterer = new KMeans(data, 5)
    table = table.addColumns(
    DoubleColumn.create("PCA1", (0..DoubleColumn.create("PCA2", (0..StringColumn.create("Cluster", clusterer.clusterLabel.collect{ it.toString() })
    )
    def title = "Clusters x Principal Components"
    Plot.show(ScatterPlot.create(title, table, "PCA1", "PCA2", "Cluster"))

    View Slide

  123. Whiskey – clustering with 2-4D PCA plot
    import …
    def rows = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def data = rows.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    def dims = 4 // can be 2, 3 or 4
    pca.projection = dims
    def projected = pca.project(data)
    def adj = [1, 1, 1, 5]
    def clusterer = new KMeans(data, 5)
    def labels = clusterer.clusterLabel.collect{ "Cluster " + (it+1) }
    rows = rows.addColumns(
        *(0..<dims).collect { idx ->
            DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect {
                adj[idx] * (projected[it][idx] + adj[idx])
    })
    },
    StringColumn.create("Cluster", labels)
    )
    def title = "Clusters x Principal Components"
    def type = dims == 2 ? ScatterPlot : Scatter3DPlot
    Plot.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, "Cluster"))

    View Slide

  124. Whiskey – clustering and visualizing centroids

    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 3
    def projected = pca.project(data)
    def clusterer = new KMeans(data, 5)
    def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
    table = table.addColumns(
    *(0..<3).collect { idx ->
    DoubleColumn.create("PCA${idx+1}", (0..projected[it][idx]
    })},
    StringColumn.create("Cluster", labels),
    DoubleColumn.create("Centroid", [10] * labels.size())
    )
    def centroids = pca.project(clusterer.centroids())
    def toAdd = table.emptyCopy(1)
    (0..<centroids.size()).each { idx ->
    toAdd[0].setString("Cluster", "Cluster " + (idx+1))
    (1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) }
    toAdd[0].setDouble("Centroid", 50)
    table.append(toAdd)
    }
    def title = "Clusters x Principal Components w/ centroids"
    Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))

    View Slide

  125. Whiskey – Screeplot
    import …
    def table = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 2
    def plots = [PlotCanvas.screeplot(pca)]
    def projected = pca.project(data)
    table = table.addColumns(
    *(1..2).collect { idx ->
    DoubleColumn.create("PCA$idx", (0..}
    )
    def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY]
    def symbols = ['*', 'Q', '#', 'Q', '*', '#']
    (2..6).each { k ->
    def clusterer = new KMeans(data, k)
    double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
    plots << ScatterPlot.plot(components, clusterer.clusterLabel,
            symbols[0..<k] as char[], colors[0..<k] as Color[])
    }
    SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))

    View Slide

  126. import …
    def table = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def data = table.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    pca.projection = 2
    def plots = [PlotCanvas.screeplot(pca)]
    def projected = pca.project(data)
    table = table.addColumns(
    *(1..2).collect { idx ->
    DoubleColumn.create("PCA$idx", (0..}
    )
    def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY]
    def symbols = ['*', 'Q', '#', 'Q', '*', '#']
    (2..6).each { k ->
    def clusterer = new KMeans(data, k)
    double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
    plots << ScatterPlot.plot(components, clusterer.clusterLabel,
            symbols[0..<k] as char[], colors[0..<k] as Color[])
    }
    SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))
    Whiskey – Screeplot

    View Slide

  127. Whiskey – XMeans clustering with 2-4D PCA plot
    import …
    def rows = Table.read().csv('whiskey.csv')
    def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
    "Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
    def data = rows.as().doubleMatrix(*cols)
    def pca = new PCA(data)
    def dims = 4 // can be 2, 3 or 4
    pca.projection = dims
    def projected = pca.project(data)
    def adj = [1, 1, 1, 5]
    def kmax = 10
    def clusterer = new XMeans(data, kmax)
    def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
    rows = rows.addColumns(
        *(0..<dims).collect { idx ->
            DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect {
                adj[idx] * (projected[it][idx] + adj[idx])
    })
    },
    StringColumn.create("Cluster", labels)
    )
    def title = "Clusters x Principal Components"
    def type = dims == 2 ? ScatterPlot : Scatter3DPlot
    Plot.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, "Cluster"))

    View Slide

  128. Whiskey – SOM with Heatmap
    import …
    def file = new File(getClass().classLoader.getResource('whiskey.csv').file)
    def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader())
    String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey',
    'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
    def data = table.select(cols).toArray()
    def distilleries = table.column('Distillery').toStringArray()
    //def (ncols, nrows) = [3, 3]
    //def (ncols, nrows) = [7, 4]
    def (ncols, nrows) = [7, 6]
    //def (ncols, nrows) = [8, 6]
    def lattice = SOM.lattice(nrows, ncols, data)
    int epochs = 100
    def model = new SOM(lattice, TimeFunction.constant(0.1),
    Neighborhood.Gaussian(1, data.length * epochs / 8.0))
    for (int i = 0; i < epochs; i++) {
    for (int j : MathEx.permutate(data.length)) {
    model.update(data[j])
    }
    }

    View Slide

  129. Whiskey – SOM with Heatmap

    def groups = data.toList().indices.groupBy { idx -> group(model, data[idx]) }
    def names = groups.collectEntries { k, v -> [k, distilleries[v].join(', ')] }
    private group(SOM model, double[] row) {
    double[][][] neurons = model.neurons()
    [0..<neurons.size(), 0..<neurons[0].size()].combinations().min { pair ->
    def (i, j) = pair
    MathEx.distance(neurons[i][j], row)
    }
    }
    names.toSorted{ e1, e2 -> e1.key[0] <=> e2.key[0] ?: e1.key[1] <=> e2.key[1] }.each { k, v ->
    println "Cluster ${k[0]},${k[1]}: $v"
    }
    def tooltip = { i, j -> names[[i, j]] ?: '' } as Hexmap.Tooltip
    new Hexmap(model.umatrix(), Palette.jet(256), tooltip)
    .canvas().tap { setAxisLabels('', '') }.window()

    View Slide

  130. Whiskey – SOM with Heatmap
    Cluster 0,0: Auchentoshan, Glengoyne
    Cluster 0,1: AnCnoc
    Cluster 0,2: Inchgower, Tomintoul
    Cluster 0,3: ArranIsleOf, Speyburn
    Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla
    Cluster 0,5: Glenfarclas, Glenlivet, Glenturret
    Cluster 0,6: Aberlour, Strathisla
    Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich
    Cluster 1,1: Cardhu, Glenallachie
    Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond
    Cluster 1,3: Benriach, Dalwhinnie
    Cluster 1,4: GlenOrd
    Cluster 1,5: Craigallechie, GlenMoray, Longmorn
    Cluster 1,6: Edradour, Knochando
    Cluster 2,0: GlenSpey, Miltonduff
    Cluster 2,1: Tamnavulin
    Cluster 2,2: Mannochmore
    Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory
    Cluster 2,4: Deanston, Scapa
    Cluster 2,5: Benromach
    Cluster 2,6: Belvenie, BenNevis, Benrinnes
    Cluster 3,0: GlenKeith, Tullibardine
    Cluster 3,1: Balblair, Glenmorangie, Strathmill
    Cluster 3,3: Tomore
    Cluster 3,4: Ardmore, OldFettercairn
    Cluster 3,5: Aberfeldy, BlairAthol
    Cluster 3,6: Glendullan, RoyalLochnagar
    Cluster 4,0: Craigganmore, Dufftown
    Cluster 4,1: OldPulteney
    Cluster 4,2: Oban
    Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank
    Cluster 4,4: GlenDeveronMacduff, Tomatin
    Cluster 4,5: Auchroisk, Glenrothes
    Cluster 4,6: Balmenach, Glendronach, Macallan
    Cluster 5,0: GlenGarioch, Teaninich
    Cluster 5,1: Caol Ila, Clynelish, Talisker
    Cluster 5,2: Ardbeg, Lagavulin, Laphroig
    Cluster 5,4: Bowmore, Highland Park
    Cluster 5,5: Dalmore
    Cluster 5,6: Dailuaine, Mortlach

    View Slide

  131. Whiskey – SOM with Heatmap
    Cluster 0,0: Auchentoshan, Glengoyne
    Cluster 0,1: AnCnoc
    Cluster 0,2: Inchgower, Tomintoul
    Cluster 0,3: ArranIsleOf, Speyburn
    Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla
    Cluster 0,5: Glenfarclas, Glenlivet, Glenturret
    Cluster 0,6: Aberlour, Strathisla
    Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich
    Cluster 1,1: Cardhu, Glenallachie
    Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond
    Cluster 1,3: Benriach, Dalwhinnie
    Cluster 1,4: GlenOrd
    Cluster 1,5: Craigallechie, GlenMoray, Longmorn
    Cluster 1,6: Edradour, Knochando
    Cluster 2,0: GlenSpey, Miltonduff
    Cluster 2,1: Tamnavulin
    Cluster 2,2: Mannochmore
    Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory
    Cluster 2,4: Deanston, Scapa
    Cluster 2,5: Benromach
    Cluster 2,6: Belvenie, BenNevis, Benrinnes
    Cluster 3,0: GlenKeith, Tullibardine
    Cluster 3,1: Balblair, Glenmorangie, Strathmill
    Cluster 3,3: Tomore
    Cluster 3,4: Ardmore, OldFettercairn
    Cluster 3,5: Aberfeldy, BlairAthol
    Cluster 3,6: Glendullan, RoyalLochnagar
    Cluster 4,0: Craigganmore, Dufftown
    Cluster 4,1: OldPulteney
    Cluster 4,2: Oban
    Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank
    Cluster 4,4: GlenDeveronMacduff, Tomatin
    Cluster 4,5: Auchroisk, Glenrothes
    Cluster 4,6: Balmenach, Glendronach, Macallan
    Cluster 5,0: GlenGarioch, Teaninich
    Cluster 5,1: Caol Ila, Clynelish, Talisker
    Cluster 5,2: Ardbeg, Lagavulin, Laphroig
    Cluster 5,4: Bowmore, Highland Park
    Cluster 5,5: Dalmore
    Cluster 5,6: Dailuaine, Mortlach

    View Slide

  132. Whiskey – Hierarchical clustering with Dendrogram
    import …
    def file = new File(getClass().classLoader.getResource('whiskey.csv').file)
    def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader())
    String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey',
    'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
    def data = table.select(cols).toArray()
    def distilleries = table.column('Distillery').toStringArray()
    def ninetyDeg = 1.57 // radians
    def FOREST_GREEN = new Color(0X808000)
    def clusters = HierarchicalClustering.fit(CompleteLinkage.of(data))
    //println clusters.tree
    //println clusters.height
    def partitions = clusters.partition(4)
    // little trick to work out cluster colors
    def colorMap = new LinkedHashSet(partitions.toList()).toList().reverse().indexed().collectEntries { k, v ->
    [v, Palette.COLORS[k]]
    }
    Font font = new Font("BitStream Vera Sans", Font.PLAIN, 12)

    View Slide

  133. Whiskey – Hierarchical clustering with Dendrogram

    def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
    title = 'Whiskey Dendrogram'
    setAxisLabels('Distilleries', 'Similarity')
    def lb = lowerBounds
    setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
    distilleries.eachWithIndex { String label, int i ->
    add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
    }
    }.panel()
    def pca = PCA.fit(data)
    pca.projection = 2
    def projected = pca.project(data)
    char mark = '#'
    def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
    title = 'Clustered by dendrogram partitions'
    setAxisLabels('PCA1', 'PCA2')
    }.panel()
    new PlotGrid(dendrogram, scatter).window()

    View Slide

  134. Whiskey – Hierarchical clustering with Dendrogram

    def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
    title = 'Whiskey Dendrogram'
    setAxisLabels('Distilleries', 'Similarity')
    def lb = lowerBounds
    setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
    distilleries.eachWithIndex { String label, int i ->
    add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
    }
    }.panel()
    def pca = PCA.fit(data)
    pca.projection = 2
    def projected = pca.project(data)
    char mark = '#'
    def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
    title = 'Clustered by dendrogram partitions'
    setAxisLabels('PCA1', 'PCA2')
    }.panel()
    new PlotGrid(dendrogram, scatter).window()

    View Slide

  135. Whiskey – Exploring Weka clustering algorithms

    View Slide

  136. Whiskey flavors – scaling clustering
    import …
    def rows = Table.read().csv('whiskey.csv')
    String[] features = [
    'Distillery', 'Body', 'Sweetness',
    'Smoky', 'Medicinal', 'Tobacco',
    'Honey', 'Spicy', 'Winey', 'Nutty',
    'Malty', 'Fruity', 'Floral'
    ]
    def data = rows.as().doubleMatrix(features)
    def cfg = new IgniteConfiguration(…)
    Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    def trainer = new KMeansTrainer().withDistance(new EuclideanDistance()).withAmountOfClusters(5)
    def dataCache = new SandboxMLCache(ignite).fillCacheWith(data)
    def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST)
    def mdl = trainer.fit(ignite, dataCache, vectorizer)
    println ">>> KMeans centroids"
    println features[1..-1].join(', ')
    mdl.centers.each { c -> println c.all().collect{ sprintf '%.4f', it.get() }.join(', ') }
    dataCache.destroy()
    }
    [17:18:07] Ignite node started OK (id=ea92fc20)
    >>> Ignite grid started for data: 86 rows X 13 cols
    >>> KMeans centroids
    Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
    2.9091, 1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455
    1.8095, 2.4762, 1.5714, 0.4286, 0.1429, 1.5238, 1.7143, 0.8571, 0.7143, 1.6667, 1.6190, 2.0476
    2.5217, 2.3913, 1.4783, 0.0435, 0.0435, 1.6087, 1.4783, 1.9565, 2.0435, 2.0435, 1.9565, 1.3913
    1.3704, 2.3704, 1.0000, 0.2593, 0.0370, 0.7778, 0.9259, 0.4074, 1.5185, 1.7407, 2.0000, 2.1111
    3.2500, 2.2500, 1.5000, 0.0000, 0.0000, 3.0000, 2.0000, 1.0000, 1.5000, 2.5000, 2.2500, 2.0000
    [17:18:07] Ignite node stopped OK [uptime=00:00:00.368]

    View Slide

  137. Whiskey flavors – scaling clustering
    import …
    def spark = builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate
    def file = '/path/to/whiskey.csv'
    int k = 5
    Dataset rows = spark.read().format('com.databricks.spark.csv')
    .options('header': 'true', 'inferSchema': 'true').load(file)
    String[] colNames = rows.columns().toList().minus(['RowID', 'Distillery'])
    def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
    Dataset dataset = assembler.transform(rows)
    def clusterer = new KMeans(k: k, seed: 1L)
    def model = clusterer.fit(dataset)
    println 'Cluster centers:'
    model.clusterCenters().each{ println it.values().collect{ sprintf '%.2f', it }.join(', ') }
    spark.stop()
    Cluster centers:
    1.73, 2.35, 1.58, 0.81, 0.19, 1.15, 1.42, 0.81, 1.23, 1.77, 1.23, 1.31
    2.00, 1.00, 3.00, 0.00, 0.00, 0.00, 3.00, 1.00, 0.00, 2.00, 2.00, 2.00
    2.86, 2.38, 1.52, 0.05, 0.00, 1.95, 1.76, 2.05, 1.81, 2.05, 2.19, 1.71
    1.53, 2.38, 1.06, 0.16, 0.03, 1.09, 1.00, 0.50, 1.53, 1.75, 2.13, 2.28
    3.67, 1.50, 3.67, 3.33, 0.67, 0.17, 1.67, 0.50, 1.17, 1.33, 1.17, 0.17

    View Slide

  138. Whiskey flavors – scaling clustering
    class Point { double[] pts }
    class TaggedPoint extends Point { int cluster }
    @Canonical(includeSuperProperties=true)
    class TaggedPointCounter extends TaggedPoint {
    long count
    TaggedPointCounter plus(TaggedPointCounter other) {
        new TaggedPointCounter((0..<pts.size()).collect{ pts[it] + other.pts[it] } as double[], cluster, count + other.count)
    }
    TaggedPointCounter average() {
    new TaggedPointCounter(pts.collect{it/count } as double[], cluster, 0)
    }
    }
    int k = 5
    Integer iterations = 20
    def configuration = new Configuration()
    Was: Rheem

    View Slide

  139. Whiskey flavors – scaling clustering
    class SelectNearestCentroid implements FunctionDescriptor.ExtendedSerializableFunction {
    Iterable centroids
    @Override
    void open(ExecutionContext executionCtx) {
    centroids = executionCtx.getBroadcast("centroids")
    }
    @Override
    TaggedPointCounter apply(Point point) {
    def minDistance = Double.POSITIVE_INFINITY
    def nearestCentroidId = -1
    for (centroid in centroids) {
            def distance = sqrt((0..<point.pts.size()).collect{ (point.pts[it] - centroid.pts[it]) ** 2 }.sum())
            if (distance < minDistance) {
    minDistance = distance
    nearestCentroidId = centroid.cluster
    }
    }
    new TaggedPointCounter(point.pts, nearestCentroidId, 1)
    }
    }
    Was: Rheem

    View Slide

  140. Whiskey flavors – scaling clustering
    def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file
    def rheemContext = new RheemContext(configuration)
    .withPlugin(Java.basicPlugin())
    .withPlugin(Spark.basicPlugin())
    def planBuilder = new JavaPlanBuilder(rheemContext)
    .withJobName("KMeans ($url, k=$k, iterations=$iterations)")
    def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) }
    def pointsBuilder = planBuilder.loadCollection(points)
    def random = new Random()
    double[][] initPts = (1..k).collect{ (0..<points[0].pts.size()).collect{ random.nextGaussian() } as double[] }
    def initialCentroidsBuilder = planBuilder
        .loadCollection((0..<k).collect{ new TaggedPointCounter(initPts[it], it, 0) })
        .withName("Load random centroids")
    Was: Rheem

    View Slide

  141. Whiskey flavors – scaling clustering
    def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file
    def rheemContext = new RheemContext(configuration)
    .withPlugin(Java.basicPlugin())
    .withPlugin(Spark.basicPlugin())
    def planBuilder = new JavaPlanBuilder(rheemContext)
    .withJobName("KMeans ($url, k=$k, iterations=$iterations)")
    def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) }
    def pointsBuilder = planBuilder.loadCollection(points)
    def random = new Random()
    double[][] initPts = (1..k).collect{ (0..<points[0].pts.size()).collect{ random.nextGaussian() } as double[] }
    def initialCentroidsBuilder = planBuilder
        .loadCollection((0..<k).collect{ new TaggedPointCounter(initPts[it], it, 0) })
        .withName("Load random centroids")
    [main] INFO org.qcri.rheem.core.api.Job - StopWatch results:
    * Optimization - 0:00:00.297
    * Prepare - 0:00:00.063
    * Prune&Isolate - 0:00:00.011
    * Transformations - 0:00:00.052
    * Sanity - 0:00:00.000
    * Cardinality&Load Estimation - 0:00:00.115
    * Create OptimizationContext - 0:00:00.008
    * Create CardinalityEstimationManager - 0:00:00.001
    * Push Estimation - 0:00:00.106
    * Create Initial Execution Plan - 0:00:00.119
    * Enumerate - 0:00:00.092
    * Concatenation - 0:00:00.058
    * Channel Conversion - 0:00:00.050
    * Prune - 0:00:00.017
    * Pick Best Plan - 0:00:00.009
    * Split Stages - 0:00:00.013
    * Execution - 0:00:00.352
    * Execution 0 - 0:00:00.352
    * Execute - 0:00:00.350
    * Post-processing - 0:00:00.047
    * Log measurements - 0:00:00.047
    * Release Resources - 0:00:00.000
    Centroids:
    Cluster0: 2.500, 2.324, 1.588, 0.176, 0.059, 1.882, 1.647, 1.676, 1.824, 2.088, 1.912, 1.706
    Cluster1: 1.488, 2.463, 1.122, 0.268, 0.073, 0.927, 1.146, 0.512, 1.146, 1.659, 1.878, 2.000
    Cluster2: 2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545
    Was: Rheem

    View Slide

  142. Clustering lab – repeat some of the examples with beer
    https://rstudio-pubs-static.s3.amazonaws.com/560654_2b44eef198f24a0bba45e002ae86d237.html
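    A possible starting point for the lab, assuming a beer ratings CSV has been downloaded
    ('beer.csv' and its columns are placeholders; adjust to whichever dataset you choose):
    import org.apache.commons.math3.ml.clustering.DoublePoint
    import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer
    import tech.tablesaw.api.Table

    def table = Table.read().csv('beer.csv')                     // hypothetical file name
    def numeric = table.numericColumns()                         // cluster on whatever numeric columns exist
    def data = (0..<table.rowCount()).collect { row ->
        new DoublePoint(numeric.collect { it.getDouble(row) } as double[])
    }
    new KMeansPlusPlusClusterer<DoublePoint>(3).cluster(data).eachWithIndex { cluster, i ->
        println "Cluster ${i + 1}: ${cluster.points.size()} beers, center=" +
                cluster.center.point.collect { sprintf '%.2f', it }.join(', ')
    }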

    View Slide

  143. Clustering lab – repeat some of the examples with beer

    View Slide

  144. Classification Overview Classification:
    • Predicting class of some data
    Algorithms:
    • Logistic Regression
    • Naïve Bayes
    • Stochastic Gradient Descent
    • K-Nearest Neighbours (sketch below)
    • Decision Tree
    • Random Forest
    • Support Vector Machine
    Aspects:
    • Over/underfitting
    • Ensemble
    • Confusion matrix
    Applications:
    • Image recognition
    • Speech recognition
    • Medical diagnosis
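    A minimal sketch of one algorithm from the list that the following slides don't cover, k-nearest
    neighbours via Weka's IBk, cross-validated on the iris data used in the upcoming case study:
    import weka.classifiers.Evaluation
    import weka.classifiers.lazy.IBk
    import weka.core.converters.CSVLoader

    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def data = new CSVLoader(file: file).dataSet
    data.classIndex = 4                               // species is the last column
    def eval = new Evaluation(data)
    eval.crossValidateModel(new IBk(3), data, 10, new Random(0))
    println eval.toSummaryString()
    println eval.toMatrixString()                     // the confusion matrix aspect listed above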

    View Slide

  145. Case Study: classification of Iris flowers
    Created by British statistician and biologist Ronald
    Fisher for his 1936 paper “The use of multiple
    measurements in taxonomic problems” as an
    example of linear discriminant analysis
    150 samples, 50 each of three species of Iris:
    • setosa
    • versicolor
    • virginica
    Four features measured for each sample:
    • sepal length
    • sepal width
    • petal length
    • petal width
    https://en.wikipedia.org/wiki/Iris_flower_data_set
    https://archive.ics.uci.edu/ml/datasets/Iris
    [Images: setosa, versicolor and virginica flowers, with sepal and petal labelled]

    View Slide

  146. Iris flower data – explore with Weka

    View Slide

  147. Iris flower data – explore with Weka (cont’d)

    View Slide

  148. Iris flower data – explore with Weka (cont’d)

    View Slide

  149. Iris flower data – explore with Weka (cont’d)

    View Slide

  150. Iris flower data – explore with Weka (cont’d)

    View Slide

  151. Iris flower data – explore with Weka (cont’d)

    View Slide

  152. Iris flower data – Weka Naïve Bayes
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    def loader = new CSVLoader(file: file)
    def model = new NaiveBayes()
    def allInstances = loader.dataSet
    allInstances.classIndex = 4
    model.buildClassifier(allInstances)
    double[] actual = allInstances.collect{ it.value(4) }
    double[] predicted = allInstances.collect{ model.classifyInstance(it) }
    double[] petalW = allInstances.collect{ it.value(2) }
    double[] petalL = allInstances.collect{ it.value(3) }
    def indices = actual.indices

    View Slide

  153. Iris flower data – Weka Naïve Bayes
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    def loader = new CSVLoader(file: file)
    def model = new NaiveBayes()
    def allInstances = loader.dataSet
    allInstances.classIndex = 4
    model.buildClassifier(allInstances)
    double[] actual = allInstances.collect{ it.value(4) }
    double[] predicted = allInstances.collect{ model.classifyInstance(it) }
    double[] petalW = allInstances.collect{ it.value(0) }
    double[] petalL = allInstances.collect{ it.value(1) }
    def indices = actual.indices
    def chart = new XYChartBuilder().width(900).height(450).
    title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
    species.eachWithIndex{ String name, int i ->
    def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
    Collection found = groups[true] ?: []
    Collection errors = groups[false] ?: []
    println "$name: ${found.size()} correct, ${errors.size()} incorrect"
    chart.addSeries("$name correct", petalW[found], petalL[found]).with {
    XYSeriesRenderStyle = Scatter
    }
    if (errors) {
    chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
    XYSeriesRenderStyle = Scatter
    }
    }
    }
    new SwingWrapper(chart).displayChart()

    View Slide

  154. def chart = new XYChartBuilder().width(900).height(450).
    title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
    species.eachWithIndex{ String name, int i ->
    def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
    Collection found = groups[true] ?: []
    Collection errors = groups[false] ?: []
    println "$name: ${found.size()} correct, ${errors.size()} incorrect"
    chart.addSeries("$name correct", petalW[found], petalL[found]).with {
    XYSeriesRenderStyle = Scatter
    }
    if (errors) {
    chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
    XYSeriesRenderStyle = Scatter
    }
    }
    }
    new SwingWrapper(chart).displayChart()
    Iris flower data – Weka Naïve Bayes
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    def loader = new CSVLoader(file: file)
    def model = new NaiveBayes()
    def allInstances = loader.dataSet
    allInstances.classIndex = 4
    model.buildClassifier(allInstances)
    double[] actual = allInstances.collect{ it.value(4) }
    double[] predicted = allInstances.collect{ model.classifyInstance(it) }
    double[] petalW = allInstances.collect{ it.value(0) }
    double[] petalL = allInstances.collect{ it.value(1) }
    def indices = actual.indices
    Iris-setosa: 50 correct, 0 incorrect
    Iris-versicolor: 48 correct, 4 incorrect
    Iris-virginica: 46 correct, 2 incorrect

    View Slide

  155. Iris flower data – Weka Logistic Regression
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    def loader = new CSVLoader(file: file)
    def model = new SimpleLogistic()
    def allInstances = loader.dataSet
    allInstances.classIndex = 4
    model.buildClassifier(allInstances)
    double[] actual = allInstances.collect{ it.value(4) }
    double[] predicted = allInstances.collect{ model.classifyInstance(it) }
    double[] petalW = allInstances.collect{ it.value(2) }
    double[] petalL = allInstances.collect{ it.value(3) }
    def indices = actual.indices

    View Slide

  156. Iris flower data – Weka Logistic Regression
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    def loader = new CSVLoader(file: file)
    def model = new SimpleLogistic()
    def allInstances = loader.dataSet
    allInstances.classIndex = 4
    model.buildClassifier(allInstances)
    double[] actual = allInstances.collect{ it.value(4) }
    double[] predicted = allInstances.collect{ model.classifyInstance(it) }
    double[] petalW = allInstances.collect{ it.value(0) }
    double[] petalL = allInstances.collect{ it.value(1) }
    def indices = actual.indices
    def chart = new XYChartBuilder().width(900).height(450).
    title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
    species.eachWithIndex{ String name, int i ->
    def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
    Collection found = groups[true] ?: []
    Collection errors = groups[false] ?: []
    println "$name: ${found.size()} correct, ${errors.size()} incorrect"
    chart.addSeries("$name correct", petalW[found], petalL[found]).with {
    XYSeriesRenderStyle = Scatter
    }
    if (errors) {
    chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
    XYSeriesRenderStyle = Scatter
    }
    }
    }
    new SwingWrapper(chart).displayChart()
    Iris-setosa: 50 correct, 0 incorrect
    Iris-versicolor: 49 correct, 1 incorrect
    Iris-virginica: 49 correct, 1 incorrect

    View Slide

  157. Iris flower data – Weka J48 Decision Tree
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
    def loader = new CSVLoader(file: file)
    def model = new J48()
    def allInstances = loader.dataSet
    allInstances.classIndex = 4
    model.buildClassifier(allInstances)
    double[] actual = allInstances.collect{ it.value(4) }
    double[] predicted = allInstances.collect{ model.classifyInstance(it) }
    double[] petalW = allInstances.collect{ it.value(0) }
    double[] petalL = allInstances.collect{ it.value(1) }
    def indices = actual.indices
    def chart = new XYChartBuilder().width(900).height(450).
    title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
    species.eachWithIndex{ String name, int i ->
    def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
    Collection found = groups[true] ?: []
    Collection errors = groups[false] ?: []
    println "$name: ${found.size()} correct, ${errors.size()} incorrect"
    chart.addSeries("$name correct", petalW[found], petalL[found]).with {
    XYSeriesRenderStyle = Scatter
    }
    if (errors) {
    chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
    XYSeriesRenderStyle = Scatter
    }
    }
    }
    new SwingWrapper(chart).displayChart()
    Petal width <= 0.6: Iris-setosa (50.0)
    Petal width > 0.6
    | Petal width <= 1.7
    | | Petal length <= 4.9: Iris-versicolor (48.0/1.0)
    | | Petal length > 4.9
    | | | Petal width <= 1.5: Iris-virginica (3.0)
    | | | Petal width > 1.5: Iris-versicolor (3.0/1.0)
    | Petal width > 1.7: Iris-virginica (46.0/1.0)
    Number of Leaves : 5
    Size of the tree : 9
    Iris-setosa: 50 correct, 0 incorrect
    Iris-versicolor: 49 correct, 2 incorrect
    Iris-virginica: 48 correct, 1 incorrect

    View Slide

  158. Iris flower data – wekaDeeplearning4j
    WekaPackageManager.loadPackages(true)
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def loader = new CSVLoader(file: file)
    def data = loader.dataSet
    data.classIndex = 4
    def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \
    -lossFn weka.dl4j.lossfunctions.LossMCXENT\"")
    AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options)
    // Stratify and split
    Random rand = new Random(0)
    Instances randData = new Instances(data)
    randData.randomize(rand)
    randData.stratify(3)
    Instances train = randData.trainCV(3, 0)
    Instances test = randData.testCV(3, 0)
    // Build the classifier on the training data
    myClassifier.buildClassifier(train)
    // Evaluate the model on test data
    Evaluation eval = new Evaluation(test)
    eval.evaluateModel(myClassifier, test)
    println eval.toSummaryString()
    println eval.toMatrixString()
    Image: https://www.neuraldesigner.com/learning/examples/iris-flowers-classification

    View Slide

  159. Iris flower data – wekaDeeplearning4j
    WekaPackageManager.loadPackages(true)
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def loader = new CSVLoader(file: file)
    def data = loader.dataSet
    data.classIndex = 4
    def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \
    -lossFn weka.dl4j.lossfunctions.LossMCXENT\"")
    AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options)
    // Stratify and split
    Random rand = new Random(0)
    Instances randData = new Instances(data)
    randData.randomize(rand)
    randData.stratify(3)
    Instances train = randData.trainCV(3, 0)
    Instances test = randData.testCV(3, 0)
    // Build the classifier on the training data
    myClassifier.buildClassifier(train)
    // Evaluate the model on test data
    Evaluation eval = new Evaluation(test)
    eval.evaluateModel(myClassifier, test)
    println eval.toSummaryString()
    println eval.toMatrixString()
    [main] INFO org.deeplearning4j.nn.graph.ComputationGraph - Starting ComputationGraph
    with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
    Training Dl4jMlpClassifier...: [] ETA: 00:00:00[INFO ] 00:03:31.035
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [1/10] took 00:00:00.670
    Training Dl4jMlpClassifier...: [====== ] ETA: 00:00:06[INFO ] 00:03:31.152
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [2/10] took 00:00:00.113
    Training Dl4jMlpClassifier...: [============ ] ETA: 00:00:03[INFO ] 00:03:31.244
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [3/10] took 00:00:00.090
    Training Dl4jMlpClassifier...: [================== ] ETA: 00:00:02[INFO ] 00:03:31.325
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [4/10] took 00:00:00.079
    Training Dl4jMlpClassifier...: [======================== ] ETA: 00:00:01[INFO ] 00:03:31.470
    [main] weka.dl4j.listener.EpochListener - Epoch [5/10]
    Train Set:
    Loss: 0.510342
    [INFO ] 00:03:31.470 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [5/10] took 00:00:00.144
    Training Dl4jMlpClassifier...: [============================== ] ETA: 00:00:01[INFO ] 00:03:31.546
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [6/10] took 00:00:00.073
    Training Dl4jMlpClassifier...: [==================================== ] ETA: 00:00:00[INFO ] 00:03:31.611
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [7/10] took 00:00:00.063
    Training Dl4jMlpClassifier...: [========================================== ] ETA: 00:00:00[INFO ] 00:03:31.714
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [8/10] took 00:00:00.101
    Training Dl4jMlpClassifier...: [================================================ ] ETA: 00:00:00[INFO ] 00:03:31.790
    [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [9/10] took 00:00:00.074
    Training Dl4jMlpClassifier...: [====================================================== ] ETA: 00:00:00[INFO ] 00:03:31.882
    [main] weka.dl4j.listener.EpochListener - Epoch [10/10]
    Train Set:
    Loss: 0.286469

    View Slide

  160. Iris flower data – wekaDeeplearning4j
    WekaPackageManager.loadPackages(true)
    def file = getClass().classLoader.getResource('iris_data.csv').file as File
    def loader = new CSVLoader(file: file)
    def data = loader.dataSet
    data.classIndex = 4
    def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \
    -lossFn weka.dl4j.lossfunctions.LossMCXENT\"")
    AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options)
    // Stratify and split
    Random rand = new Random(0)
    Instances randData = new Instances(data)
    randData.randomize(rand)
    randData.stratify(3)
    Instances train = randData.trainCV(3, 0)
    Instances test = randData.testCV(3, 0)
    // Build the classifier on the training data
    myClassifier.buildClassifier(train)
    // Evaluate the model on test data
    Evaluation eval = new Evaluation(test)
    eval.evaluateModel(myClassifier, test)
    println eval.toSummaryString()
    println eval.toMatrixString()

    [INFO ] 00:03:31.883 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [10/10] took 00:00:00.091
    Training Dl4jMlpClassifier...: [============================================================] ETA: 00:00:00
    Done!
    Correctly Classified Instances 40 80 %
    Incorrectly Classified Instances 10 20 %
    Kappa statistic 0.701
    Mean absolute error 0.2542
    Root mean squared error 0.3188
    Relative absolute error 57.2154 %
    Root relative squared error 67.6336 %
    Total Number of Instances 50
    === Confusion Matrix ===
    a b c <-- classified as
    17 0 0 | a = Iris-setosa
    0 9 8 | b = Iris-versicolor
    0 2 14 | c = Iris-virginica
    BUILD SUCCESSFUL in 22s

    View Slide

  161. Classify language
    Apache OpenNLP parts:
    • sentence detector, tokenizer, name finder, document
    categorizer, part-of-speech tagger, chunker, parser,
    coreference resolution
    import opennlp.tools.langdetect.*
    def base = 'http://apache.forsale.plus/opennlp/models'
    def url = "$base/langdetect/1.8.3/langdetect-183.bin"
    def model = new LanguageDetectorModel(new URL(url))
    def detector = new LanguageDetectorME(model)
    [ 'spa': 'Bienvenido a Madrid',
    'fra': 'Bienvenue à Paris',
    'dan': 'Velkommen til København'
    ].each { k, v ->
    assert detector.predictLanguage(v).lang == k
    }
    Supported language detection training models:
    • MaxEnt
    • Perceptron
    • Naïve Bayes

    View Slide

  162. Classify language – entities
    String[] sentences = [
    "A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
    "A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.",
    'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
    'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
    'I saw Ms. May Smith waving to June Jones.',
    'The parcel was passed from May to June.'
    ]
    // use a helper to cache models
    def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
    def modelNames = ['person', 'money', 'date', 'time', 'location']
    def finders = modelNames.collect{ new NameFinderME(new TokenNameFinderModel(helper.load("en-ner-$it"))) }
    def tokenizer = SimpleTokenizer.INSTANCE

    View Slide

  163. Classify language – entities
    sentences.each { sentence ->
    String[] tokens = tokenizer.tokenize(sentence)
    Span[] tokenSpans = tokenizer.tokenizePos(sentence)
    def entityText = [:]
    def entityPos = [:]
    finders.indices.each {fi ->
    // could be made smarter by looking at probabilities and overlapping spans
    Span[] spans = finders[fi].find(tokens)
    spans.each{span ->
    def se = span.start..<span.end
    def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
    entityPos[span.start] = pos
    entityText[span.start] = "$span.type(${sentence[pos]})"
    }
    }
    entityPos.keySet().toList().reverseEach {
    def pos = entityPos[it]
    def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
    println sentence
    }

    View Slide

  164. Classify language – entities
    sentences.each { sentence ->
    String[] tokens = tokenizer.tokenize(sentence)
    Span[] tokenSpans = tokenizer.tokenizePos(sentence)
    def entityText = [:]
    def entityPos = [:]
    finders.indices.each {fi ->
    // could be made smarter by looking at probabilities and overlapping spans
    Span[] spans = finders[fi].find(tokens)
    spans.each{span ->
    def se = span.start..<span.end
    def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
    entityPos[span.start] = pos
    entityText[span.start] = "$span.type(${sentence[pos]})"
    }
    }
    entityPos.keySet().toList().reverseEach {
    def pos = entityPos[it]
    def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
    println sentence
    }
    A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query.
    A commit by person(Daniel) on Sun., date(December 6, 2020) improved Groovy 4's language integrated query.
    The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price.
    The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark).
    I saw Ms. person(May Smith) waving to person(June Jones).
    The parcel was passed from date(May to June).

    View Slide

  165. Classify language – parts of speech
    def sentences = [
    'Paul has two sisters, Maree and Christine.',
    'His bark was much worse than his bite',
    'Turn on the lights to the main bedroom',
    "Light 'em all up",
    'Make it dark downstairs'
    ]
    def tokenizer = SimpleTokenizer.INSTANCE
    sentences.each {
    String[] tokens = tokenizer.tokenize(it)
    def model = new POSModel(helper.load('en-pos-maxent'))
    def posTagger = new POSTaggerME(model)
    String[] tags = posTagger.tag(tokens)
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
    }
    NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) .
    PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite)
    VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) JJ(main) NN(bedroom)
    NN(Light) POS(') NN(em) DT(all) IN(up)
    VB(Make) PRP(it) JJ(dark) NN(downstairs)
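    The chunker mentioned earlier in the OpenNLP parts list works the same way; a minimal sketch, assuming
    the same ResourceHelper and that an 'en-chunker' model is available at the same models-1.5 location:
    import opennlp.tools.chunker.ChunkerME
    import opennlp.tools.chunker.ChunkerModel
    import opennlp.tools.postag.POSModel
    import opennlp.tools.postag.POSTaggerME
    import opennlp.tools.tokenize.SimpleTokenizer

    def posTagger = new POSTaggerME(new POSModel(helper.load('en-pos-maxent')))
    def chunker = new ChunkerME(new ChunkerModel(helper.load('en-chunker')))
    String[] tokens = SimpleTokenizer.INSTANCE.tokenize('Turn on the lights to the main bedroom')
    String[] tags = posTagger.tag(tokens)
    String[] chunks = chunker.chunk(tokens, tags)   // phrase tags such as B-VP, B-PP, B-NP, I-NP
    println tokens.indices.collect { "${chunks[it]}(${tokens[it]})" }.join(' ')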

    View Slide

  166. Classify language – sentence detection
    def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
    def model = new SentenceModel(helper.load('en-sent'))
    def detector = new SentenceDetectorME(model)
    def sentences = detector.sentDetect(text)
    assert text.count('.') == 28
    assert sentences.size() == 4
    println sentences.join('\n\n')

    View Slide

  167. Classify language – sentence detection
    def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
    def model = new SentenceModel(helper.load('en-sent'))
    def detector = new SentenceDetectorME(model)
    def sentences = detector.sentDetect(text)
    assert text.count('.') == 28
    assert sentences.size() == 4
    println sentences.join('\n\n')
    The most referenced scientific paper of all time is "Protein measurement with the
    Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
    R. J. and was published in the J. BioChem. in 1951. It describes a method for
    measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
    weight) in solutions and has been cited over 300,000 times and can be found here:
    https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed
    two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
    before moving to Harvard under A. Baird Hastings. He was also the H.O.D of
    Pharmacology at Washington University in St. Louis for 29 years.
    The most referenced scientific paper of all time is "Protein measurement with the
    Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
    R. J. and was published in the J. BioChem. in 1951.
    It describes a method for
    measuring the amount of protein (even as small as 0.2 ?, were ? is the specific
    weight) in solutions and has been cited over 300,000 times and can be found here:
    https://www.jbc.org/content/193/1/265.full.pdf.
    Dr. Lowry completed
    two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
    before moving to Harvard under A. Baird Hastings.
    He was also the H.O.D of
    Pharmacology at Washington University in St. Louis for 29 years.

    View Slide

  168. Classify language
    Apache NLPCraft (incubating)
    start(LightSwitchModel)
    def cli = new NCTestClientBuilder().newBuilder().build()
    cli.open("nlpcraft.lightswitch.ex.java")
    println cli.ask('Turn on the lights in the master bedroom')
    println cli.ask("Light 'em all up")
    println cli.ask('Make it dark downstairs') // expecting no match
    if (cli) {
    cli.close()
    }
    stop()
    Lights are [on] in [master bedroom].
    Lights are [on] in [entire house].
    No matching intent found.

    View Slide

  169. Classify language
    Apache NLPCraft (incubating)
    @NCIntentRef("ls")
    @NCIntentSample([
    "Turn the lights off in the entire house.",
    "Switch on the illumination in the master bedroom closet.",
    "Get the lights on.",
    "Lights up in the kitchen.",
    "Please, put the light out in the upstairs bedroom.",
    "Set the lights on in the entire house.",

    "Kill off all the lights now!",
    "No lights in the bedroom, please.",
    "Light up the garage, please!",
    "Kill the illumination now!"
    ])
    NCResult onMatch(
    @NCIntentTerm("act") NCToken actTok,
    @NCIntentTerm("loc") List locToks) {
    String status = actTok.id == "ls:on" ? "on" : "off"
    String locations = locToks ? locToks.collect{ it.meta("nlpcraft:nlp:origtext") }.join(", ") : "entire house"
    // Add HomeKit, Arduino or other integration here.
    // By default - just return a descriptive action string.
    NCResult.text("Lights are [$status] in [${locations.toLowerCase()}].")
    }
    Lights are [on] in [master bedroom].
    Lights are [on] in [entire house].
    No matching intent found.

    View Slide

  170. Classification with Neural networks
    With more than one layer (deep), the model can capture non-linear relationships.
    See: https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
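    As a rough illustration (mine, not from the slides), a forward pass just alternates a weighted
    sum with a non-linear activation; the weights below are made-up values:
    def sigmoid = { double x -> 1 / (1 + Math.exp(-x)) }
    double[][] w1 = [[0.5, -0.6], [0.1, 0.8]]   // hypothetical input-to-hidden weights
    double[]   w2 = [1.2, -0.4]                 // hypothetical hidden-to-output weights
    double[]   x  = [0.3, 0.9]                  // one input sample
    def hidden = w1.collect { row -> sigmoid((0..<x.length).sum { row[it] * x[it] }) }
    def output = sigmoid((0..<hidden.size()).sum { w2[it] * hidden[it] })
    println output                              // a single activation between 0 and 1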

    View Slide

  171. Classification with Neural networks Case Study: MNIST
    • Handwritten digits
    • 60,000 training examples
    • 10,000 test examples
    • Size-normalized (28x28 pixels) and centered
    • http://yann.lecun.com/exdb/mnist/
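    A small aside (my sketch, not a slide): each 28x28 image is flattened into a 784-element vector
    and its 0..255 pixel values squeezed into a small positive range before being fed to the network:
    int[][] image = new int[28][28]     // one (blank) MNIST image for illustration
    double[] input = image.flatten().collect { it / 255.0d * 0.999d + 0.001d } as double[]
    assert input.length == 28 * 28      // 784 network inputs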

    View Slide

  172. MNIST with Apache Commons Math
    class MnistNeuralNet {
    private static int inodes = 784
    private static int hnodes = 200
    private static int onodes = 10
    private static double learning_rate = 0.1
    private static RealMatrix inputHidden
    private static RealMatrix hiddenOutput
    static initRandom() {
    inputHidden = initRandom(createRealMatrix(hnodes, inodes), inodes**-0.5 as double)
    hiddenOutput = initRandom(createRealMatrix(onodes, hnodes), hnodes**-0.5 as double)
    }
    static double[] normalize(Integer[] img) {
    img.collect { it / 255.0d * 0.999d + 0.001d } as double[]
    }
    static RealMatrix createTarget(int label) {
    createRealMatrix(onodes, 1).tap {
    (0..<onodes).each { setEntry(it, 0, it == label ? 0.99d : 0.01d) } // target ~1 for the labelled digit, ~0 elsewhere (representative values)
    }
    }
    static RealMatrix query(double[] inputArray) {
    RealMatrix inputs = createColumnRealMatrix(inputArray)
    RealMatrix hiddenInputs = inputHidden * inputs
    RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs)
    RealMatrix finalInputs = hiddenOutput * hiddenOutputs
    return scalarSigmoid(finalInputs)
    }

    Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
    MnistNeuralNet class supports training and prediction
    (Diagram: 784 inputs (28 * 28 pixels) → 200 hidden nodes → 10 output nodes, one per digit 0-9)

    View Slide

  173. MNIST with Apache Commons Math

    static RealMatrix initRandom(RealMatrix matrix, double desiredStandardDeviation) {
    scalar(matrix, it -> new Random().nextGaussian() * desiredStandardDeviation)
    }
    static void train(RealMatrix inputs, RealMatrix targets) {
    // forward
    RealMatrix hiddenInputs = inputHidden * inputs
    RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs)
    RealMatrix finalInputs = hiddenOutput * hiddenOutputs
    RealMatrix finalOutputs = scalarSigmoid(finalInputs)
    // back
    RealMatrix outputErrors = targets.subtract(finalOutputs)
    RealMatrix t1 = multiplyElements(outputErrors, finalOutputs)
    RealMatrix t2 = multiplyElements(t1, scalar(finalOutputs, it -> 1.0d - it))
    RealMatrix t3 = t2 * hiddenOutputs.transpose()
    hiddenOutput = hiddenOutput.add(scalar(t3, it -> learning_rate * it))
    RealMatrix hiddenErrors = hiddenOutput.transpose() * outputErrors
    t1 = multiplyElements(hiddenErrors, hiddenOutputs)
    t2 = multiplyElements(t1, scalar(hiddenOutputs, it -> 1.0d - it))
    t3 = t2 * inputs.transpose()
    inputHidden = inputHidden.add(scalar(t3, it -> learning_rate * it))
    }
    static void save(String filename) {
    new FileOutputStream(filename).withObjectOutputStream { oos ->
    oos.writeObject(inputHidden.data)
    oos.writeObject(hiddenOutput.data)
    }
    }
    }
    Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
    MnistNeuralNet
    class supports training
    and prediction cont’d

    View Slide

  174. MNIST with Apache Commons Math
    static RealMatrix scalar(RealMatrix matrix, Function<Double, Double> function) {
    int numRows = matrix.rowDimension
    int numCols = matrix.columnDimension
    RealMatrix result = createRealMatrix(numRows, numCols)
    for (r in 0..<numRows) {
    for (c in 0..<numCols) {
    result.setEntry(r, c, function.apply(matrix.getEntry(r, c)))
    }
    }
    return result
    }
    static int[][] rotate(int[][] img, double angleInDegrees) {
    double angle = Math.toRadians(angleInDegrees)
    int[][] result = new int[img.length][]
    for (y in 0..<img.length) {
    result[y] = new int[img[y].length]
    Arrays.fill(result[y], 0)
    }
    double cosAngle = Math.cos(angle)
    double sinAngle = Math.sin(angle)
    double x0 = img[0].length / 2 - cosAngle * img[0].length / 2 - sinAngle * img.length / 2
    double y0 = img.length / 2 - cosAngle * img.length / 2 + sinAngle * img[0].length / 2
    for (y in 0..<img.length) {
    for (x in 0..<img[y].length) {
    int xRot = (int) (x * cosAngle + y * sinAngle + x0)
    int yRot = (int) (-x * sinAngle + y * cosAngle + y0)
    if (xRot >= 0 && yRot >= 0 && xRot <= 27 && yRot <= 27) {
    result[y][x] = img[yRot][xRot]
    }
    }
    }
    return result
    } …
    Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
    Utility methods

    View Slide

  175. MNIST with Apache Commons Math

    static RealMatrix multiplyElements(RealMatrix matrixA, RealMatrix matrixB) {
    // elementWise multiplication has same compatibility requirements as addition
    checkAdditionCompatible(matrixA, matrixB)
    int numRows = matrixA.rowDimension
    int numCols = matrixA.columnDimension
    RealMatrix product = createRealMatrix(numRows, numCols)
    for (r in 0..<numRows) {
    for (c in 0..<numCols) {
    product.setEntry(r, c, matrixA.getEntry(r, c) * matrixB.getEntry(r, c))
    }
    }
    return product
    }
    static int maxIndex(RealMatrix result) {
    double[][] data = result.data
    (0..<data.length).max{ data[it][0] }
    }
    static RealMatrix scalarSigmoid(RealMatrix matrix) {
    def sigmoid = [start: { a, b, c, d, e, f -> },
    visit: { r, c, double v -> 1 / (1 + Math.exp(-v)) },
    end : { -> 0d }] as RealMatrixChangingVisitor
    RealMatrix result = matrix.copy()
    result.walkInRowOrder(sigmoid)
    result
    }
    Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
    Utility methods
    cont’d

    View Slide

  176. MNIST with Apache Commons Math
    int epochs = 10
    initRandom()
    int[] labels = MnistReader.getLabels(Paths.get("/path/to/train-labels-idx1-ubyte.gz"))
    List<int[][]> images = MnistReader.getImages(Paths.get("/path/to/train-images-idx3-ubyte.gz"))
    println 'Processing images'
    def scaledImages = images.collect { image -> normalize(image.flatten() as Integer[]) } as double[][]
    def scaledImagesRot10 = images.collect { image -> normalize(rotate(image, 10).flatten() as Integer[]) } as double[][]
    def scaledImagesRot350 = images.collect { image -> normalize(rotate(image, -10).flatten() as Integer[]) } as double[][]
    epochs.times { e ->
    println "Running epoch: ${e + 1}"
    for (i in 0..<labels.length) {
    RealMatrix targets = createTarget(labels[i])
    train(createColumnRealMatrix(scaledImages[i]), targets)
    train(createColumnRealMatrix(scaledImagesRot10[i]), targets)
    train(createColumnRealMatrix(scaledImagesRot350[i]), targets)
    }
    }
    int[] testLabels = MnistReader.getLabels(Paths.get("/path/to/t10k-labels-idx1-ubyte.gz"))
    List<int[][]> testImages = MnistReader.getImages(Paths.get("/path/to/t10k-images-idx3-ubyte.gz"))
    int correct = 0
    for (i in 0..<testLabels.length) {
    int correctLabel = testLabels[i]
    RealMatrix predict = query(scale(testImages.get(i).flatten() as Integer[]))
    int predictLabel = maxIndex(predict)
    if (predictLabel == correctLabel) correct++
    }
    println "Accuracy: " + correct / testLabels.length
    save("../resources/weights")
    Processing images
    Running epoch: 1
    Running epoch: 2
    Running epoch: 3
    Running epoch: 4
    Running epoch: 5
    Running epoch: 6
    Running epoch: 7
    Running epoch: 8
    Running epoch: 9
    Running epoch: 10
    Accuracy: 0.9767
    Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
    Train
    model
    Test
    model

    View Slide

  177. MNIST with DeepLearning4J
    int rngSeed = 123
    int batchSize = 125
    int numEpochs = 15
    int numInputs = 28 * 28
    int hiddenLayerSize = 1000
    int numOutputs = 10
    def (trainSet, testSet) = [true, false].collect {
    new MnistDataSetIterator(batchSize, it, rngSeed) }
    def log = LoggerFactory.getLogger(getClass())
    def conf = new NeuralNetConfiguration.Builder()
    .seed(rngSeed) //include the random seed for reproducibility
    // use stochastic gradient descent as an optimization algorithm
    .updater(new Nesterovs(0.006, 0.9))
    .l2(1e-4)
    .list()
    // input layer:
    .layer(new DenseLayer.Builder()
    .nIn(numInputs)
    .nOut(hiddenLayerSize)
    .activation(Activation.RELU)
    .weightInit(WeightInit.XAVIER)
    .build())
    // hidden layer:
    .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
    .nIn(hiddenLayerSize)
    .nOut(numOutputs)
    .activation(Activation.SOFTMAX)
    .weightInit(WeightInit.XAVIER)
    .build())
    .build()


    log.info("Creating model ...")
    def model = new MultiLayerNetwork(conf)
    model.init()
    log.info("Training model ...")
    model.listeners = new ScoreIterationListener(100)
    model.fit(trainSet, numEpochs)
    // model.save('mlp_model.dat' as File)
    log.info("Evaluating model...")
    def eval = model.evaluate(testSet)
    log.info(eval.stats())
    ======== Evaluation Metrics ========
    # of classes: 10
    Accuracy: 0.9728
    Precision: 0.9728
    Recall: 0.9726
    F1 Score: 0.9726
    Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-
    examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java
    (Diagram: 784 inputs (28 * 28 pixels) → 1000 hidden nodes → 10 output nodes, one per digit 0-9)

    View Slide


  178. MNIST with DeepLearning4J
    Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-
    examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java
    (Diagram: 784 inputs (28 * 28 pixels) → 500 → 100 hidden nodes → 10 output nodes, one per digit 0-9)
    int rngSeed = 123
    int batchSize = 64
    int numEpochs = 15
    int numInputs = 28 * 28
    int numOutputs = 10
    double rate = 0.0015
    def (trainSet, testSet) = [true, false].collect {
    new MnistDataSetIterator(batchSize, it, rngSeed) }
    def log = LoggerFactory.getLogger(getClass())
    def conf = new NeuralNetConfiguration.Builder()
    .seed(rngSeed) //include the random seed for reproducibility
    // use stochastic gradient descent as an optimization algorithm
    .updater(new Nadam())
    .l2(rate * 0.005) // regularized
    .list()
    // first input layer:
    .layer(new DenseLayer.Builder()
    .nIn(numInputs)
    .nOut(500)
    .build())
    // second input layer:
    .layer(new DenseLayer.Builder()
    .nIn(500)
    .nOut(100)
    .build())
    // hidden layer:
    .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
    .nOut(numOutputs)
    .activation(Activation.SOFTMAX)
    .build())
    .build()

    log.info("Creating model ...")
    def model = new MultiLayerNetwork(conf)
    model.init()
    log.info("Training model ...")
    model.listeners = new ScoreIterationListener(100)
    model.fit(trainSet, numEpochs)
    log.info("Evaluating model...")
    def eval = model.evaluate(testSet)
    log.info(eval.stats())
    ======== Evaluation Metrics ========
    # of classes: 10
    Accuracy: 0.9820
    Precision: 0.9820
    Recall: 0.9818
    F1 Score: 0.9819

    View Slide

  179. MNIST GUI with GroovyFX
    import javafx.embed.swing.SwingFXUtils
    import javafx.scene.canvas.GraphicsContext
    import javafx.scene.image.Image
    import javafx.scene.image.PixelFormat
    import javafx.scene.paint.Color
    import javax.imageio.ImageIO
    static clear(GraphicsContext g, size) {
    g.fill = Color.WHITE; g.fillRect(0, 0, size, size); g.fill = Color.BLACK
    }
    static int[] imageToArray(Image img) {
    int w = img.width
    int h = img.height
    int[] buf = new int[h * w]
    img.pixelReader.getPixels(0, 0, w, h, PixelFormat.intArgbInstance, buf, 0, w)
    buf
    }
    static Image snapshot(canvas) {
    def baos = new ByteArrayOutputStream()
    ImageIO.write(SwingFXUtils.fromFXImage(canvas.snapshot(null, null), null), "png", baos)
    new Image(new ByteArrayInputStream(baos.toByteArray()), 28, 28, true, true)
    }
    static displayResult(List items, Integer predict) {
    items.indexed().collect{ idx, val ->
    def marker = (idx == predict && items[predict] > 0.5) ? ' **' : ''
    "$idx ${val ? sprintf('%.4f', val) + marker : '?'}"
    }.join('\n')
    }
    A few more
    utility methods

    View Slide

  180. MNIST GUI with GroovyFX
    class MnistInfer {
    private static RealMatrix inputHidden
    private static RealMatrix hiddenOutput
    static void load(Class klass) {
    def weights = klass.classLoader.getResourceAsStream('weights')
    try (ObjectInputStream ois = new ObjectInputStream(weights)) {
    inputHidden = MatrixUtils.createRealMatrix((double[][]) ois.readObject())
    hiddenOutput = MatrixUtils.createRealMatrix((double[][]) ois.readObject())
    }
    }
    static RealMatrix query(double[] inputArray) {
    RealMatrix inputs = createColumnRealMatrix(inputArray)
    RealMatrix hiddenInputs = inputHidden * inputs
    RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs)
    RealMatrix finalInputs = hiddenOutput * hiddenOutputs
    return scalarSigmoid(finalInputs)
    }
    static double[] normalize(int[] img) {
    double[] result = new double[img.length]
    for (int i = 0; i < img.length; i++) {
    int red = (img[i] >> 16) & 0xff
    int green = (img[i] >> 8) & 0xff
    int blue = img[i] & 0xff
    result[i] = 1 - ((red + green + blue) / 765.0 * 999 + 1) / 1000
    }
    return result
    }
    }
    Predictor class

    View Slide

  181. MNIST GUI with GroovyFX

    start {
    stage(title: 'MNIST', visible: true) {
    scene(id: "scene", fill: WHITE) {
    borderPane {
    top {
    canvas(id: "canvas", width: size, height: size)
    def g = canvas.graphicsContext2D
    clear(g, size)
    canvas.onMouseDragged { e -> g.fillOval e.x - pen, e.y - pen, pen * 2, pen * 2 }
    }
    center {
    hbox(alignment: 'Center') {
    button('Clear', onAction: {
    clear(canvas.graphicsContext2D, size)
    out.text = displayResult([null] * 10, null)
    })
    button('Predict', onAction: {
    def result = Mnist.query(Mnist.scale(imageToArray(snapshot(canvas))))
    def predictLabel = Mnist.maxIndex(result)
    out.text = displayResult(result.data.collect{ it[0] }, predictLabel)
    })
    }
    }
    bottom {
    textArea(displayResult([null] * 10, null), id: 'out', editable: false, prefColumnCount: 16)
    }
    }
    }
    }
    }

    View Slide

  182. Apache MXNET
    Source: https://mxnet.apache.org/versions/master/faq/why_mxnet.html

    View Slide

  183. Apache MXNET Object detection
    import groovy.swing.SwingBuilder
    import javax.imageio.ImageIO
    import org.apache.mxnet.infer.javaapi.ObjectDetector
    import org.apache.mxnet.javaapi.*
    import static javax.swing.WindowConstants.DISPOSE_ON_CLOSE
    static void downloadUrl(String srcDirUrl, String destDir, String filename) {
    def destination = new File(destDir, filename)
    if (!destination.parentFile.exists()) {
    destination.parentFile.mkdirs()
    }
    if (!destination.exists()) {
    destination.bytes = new URL(srcDirUrl + filename).bytes
    }
    }
    static downloadModelImage() {
    String baseDir = "${System.getProperty('java.io.tmpdir')}/resnetssd/"
    def imageName = 'dog-ssd.jpg'
    def modelName = 'resnet50_ssd_model'
    String imgURL = "https://s3.amazonaws.com/model-server/inputs/"
    println "Downloading image to ${baseDir}..."
    downloadUrl(imgURL, baseDir, imageName)
    println "Downloading model files to ${baseDir}... (may take a while)"
    String modelURL = "https://s3.amazonaws.com/model-server/models/resnet50_ssd/"
    downloadUrl(modelURL, baseDir, "$modelName-symbol.json")
    downloadUrl(modelURL, baseDir, "$modelName-0000.params")
    downloadUrl(modelURL, baseDir, 'synset.txt')
    [baseDir + imageName, baseDir + modelName]
    }

    Utility methods for
    downloading test image
    and pre-trained model
    Test image

    View Slide

  184. Apache MXNET Object detection

    static detectObjects(String modelPath, String imagePath, inputShape) {
    def context = [Context.cpu()]
    def inputDescriptor = new DataDesc("data", inputShape, DType.Float32(), "NCHW")
    def objDet = new ObjectDetector(modelPath, [inputDescriptor], context, 0)
    return objDet.imageObjectDetect(ObjectDetector.loadImageFromFile(imagePath), 3)
    }
    def (imagePath, modelPath) = downloadModelImage()
    def image = ImageIO.read(imagePath as File)
    def (w, h) = image.with{ [it.width, it.height] }
    def count = 1 // batch size
    def channels = 3 // for RGB
    def inputShape = new Shape(count, channels, w * 0.8 as int, h)
    def results = detectObjects(modelPath, imagePath, inputShape).sum()
    def boxes = results.collect {[
    xmin: w * it.XMin as int, ymin: h * it.YMin as int,
    xmax: w * it.XMax as int, ymax: h * it.YMax as int
    ]}
    def names = results.collect{ it.className + sprintf(' %.3f', it.probability) }
    (0..<results.size()).each { println "${names[it]} ${boxes[it]}" }
    Image.drawBoundingBox(image, boxes, names)
    new SwingBuilder().edt {
    frame(title: "${results.size()} detected objects", size: [w, h],
    show: true, defaultCloseOperation: DISPOSE_ON_CLOSE) {
    label(icon: imageIcon(image: image))
    }
    } car 0.996 [xmin:439, ymin:62, xmax:715, ymax:173]
    bicycle 0.986 [xmin:144, ymin:128, xmax:547, ymax:453]
    dog 0.919 [xmin:102, ymin:184, xmax:344, ymax:541]

    View Slide

  185. Bonus Material

    View Slide

  186. Scaling up Data Science
    What if your big data framework doesn't have the algorithm you need?
    Preprocessing
    Partition based dataset
    Linear regression
    K-Means clustering
    Genetic algorithms
    Multilayer perceptron
    Decision trees
    k-NN classification
    k-NN regression
    SVM binary classification
    SVM multi-class Classification
    Model cross validation
    Logistic regression
    Random forest
    Gradient boosting
    Model updating
    ANN (Approximate Nearest Neighbor)
    Deep learning
    Summary statistics
    Correlations
    Stratified sampling
    Hypothesis testing
    Random data generation
    Classification and regression: support
    Vector machines logistic regression
    Linear regression
    Decision trees
    Naive Bayes classification
    Alternating least squares (ALS)
    K-Means clustering
    Latent Dirichlet allocation (LDA)
    Singular value decomposition (SVD)
    Principal component analysis (PCA)
    Feature extraction
    Transformation functions
    Optimization algorithms
    Re-formulate your algorithm using map/filter/reduce, for example:
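    A minimal sketch (mine, not a slide) of the idea using GPars: a mean-squared-error calculation
    recast as a map step (square each residual) and a reduce step (sum, then divide), a shape which
    carries over to Spark/Ignite/Beam style APIs:
    import groovyx.gpars.GParsPool
    def errors = [1.5d, -2.0d, 0.5d, 3.0d, -1.0d]   // residuals from some model
    GParsPool.withPool {
        def mse = errors.collectParallel { it * it }.sum() / errors.size()
        println "MSE: $mse"
    }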

    View Slide

  187. House pricing revisited – Initial algorithm
    (Diagram: Data → Split → Training/Test → Fit → Model → Predict and Evaluate → Stats)

    View Slide

  188. House pricing revisited – Rejigging our algorithm
    (Diagram, before: Data → Split → Training/Test → Fit → Model → Predict and Evaluate → Stats)
    (Diagram, after: Data → Split → Chunk1..ChunkN → Fit → Model1..ModelN → Aggregate → Model,
    then Chunk1..ChunkN → Predict and Evaluate against the aggregated Model → Stats1..StatsN → Aggregate → Stats)
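    A note on the Aggregate steps (my summary of the code that follows): per-chunk models are combined
    by averaging their coefficients, while per-chunk [rmse, mean, count] triples combine as count-weighted
    means, the overall RMSE being the square root of the count-weighted mean of the squared per-chunk RMSEs:
    def chunkStats = [[2.0d, 1.0d, 10], [4.0d, 3.0d, 30]]   // hypothetical [rmse, mean, count] per chunk
    def n = chunkStats.collect { it[2] }.sum()
    def rmse = Math.sqrt(chunkStats.collect { it[0] * it[0] * it[2] }.sum() / n)
    def mean = chunkStats.collect { it[1] * it[2] }.sum() / n
    println "rmse: $rmse, mean: $mean, count: $n"           // rmse ≈ 3.61, mean = 2.5, count = 40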

    View Slide

  189. House pricing revisited - GPars
    import static org.apache.commons.math3.stat.StatUtils.sumSq
    import static smile.math.Math.dot
    import smile.regression.OLS
    import tech.tablesaw.api.*
    import groovyx.gpars.GParsPool
    import static java.lang.Math.sqrt
    def features = [
    'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat',
    'sqft_above', 'grade', 'view', 'waterfront', 'floors'
    ]
    def file = getClass().classLoader.getResource('kc_house_data.csv').file
    def table = Table.read().csv(file)
    table = table.dropWhere(table.column("bedrooms").isGreaterThan(30))
    def data = table.as().doubleMatrix(*features)

    • Read CSV in normal way
    • Split data into chunks
    • Fit each chunk and aggregate resulting models
    • Split data into chunks
    • Evaluate each chunk against model and aggregate resulting stats

    View Slide

  190. House pricing revisited - GPars

    GParsPool.withPool {
    def trainChunkSize = 3000
    def numTrainChunks = 8
    def models = (0..<numTrainChunks).collectParallel {
    def list = data.toList().shuffled().take(trainChunkSize)
    new OLS(list.collect{ it[1..-1] } as double[][],
    list.collect{ it[0] } as double[]).with{
    [it.intercept(), *it.coefficients()]
    }
    }
    def model = models.transpose()*.sum().collect{ it/numTrainChunks }
    println "Intercept: ${model[0]}"
    println "Coefficients: ${model[1..-1].join(', ')}"
    double[] coefficients = model[1..-1]
    double intercept = model[0]
    def stats = { chunk ->
    def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
    def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
    def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
    [rmse, residuals.average(), chunk.size()]
    }
    def evalChunkSize = 2000
    def results = data.collate(evalChunkSize).collectParallel(stats)
    println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size())
    println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size()
    }
    • Read CSV in normal way
    • Split data into chunks
    • Fit each chunk and aggregate resulting models
    • Split data into chunks
    • Evaluate each chunk against model and aggregate resulting stats

    View Slide

  191. House pricing revisited - GPars

    GParsPool.withPool {
    def trainChunkSize = 3000
    def numTrainChunks = 8
    def models = (0..<numTrainChunks).collectParallel {
    def list = data.toList().shuffled().take(trainChunkSize)
    new OLS(list.collect{ it[1..-1] } as double[][],
    list.collect{ it[0] } as double[]).with{
    [it.intercept(), *it.coefficients()]
    }
    }
    def model = models.transpose()*.sum().collect{ it/numTrainChunks }
    println "Intercept: ${model[0]}"
    println "Coefficients: ${model[1..-1].join(', ')}"
    double[] coefficients = model[1..-1]
    double intercept = model[0]
    def stats = { chunk ->
    def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
    def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
    def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
    [rmse, residuals.average(), chunk.size()]
    }
    def evalChunkSize = 2000
    def results = data.collate(evalChunkSize).collectParallel(stats)
    println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size())
    println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size()
    }
    Intercept: -3.16904445043026E7
    Coefficients: -21131.141578491588, -5431.675525641348,
    179.47932136012156, 11.273983523229091, 657752.8815669084,
    -10.658550821995767, 86876.08541623666, 61708.58759419169,
    611717.9878928157, -21819.967894981055
    RMSE: 215185.8177879586
    mean: -2151.974648800721

    View Slide


  192. GParsPool.withPool {
    def trainChunkSize = 3000
    def numTrainChunks = 8
    def models = (0..<numTrainChunks).collectParallel {
    def list = data.toList().shuffled().take(trainChunkSize)
    new OLS(list.collect{ it[1..-1] } as double[][],
    list.collect{ it[0] } as double[]).with{
    [it.intercept(), *it.coefficients()]
    }
    }
    def model = models.transpose()*.sum().collect{ it/numTrainChunks }
    println "Intercept: ${model[0]}"
    println "Coefficients: ${model[1..-1].join(', ')}"
    double[] coefficients = model[1..-1]
    double intercept = model[0]
    def stats = { chunk ->
    def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
    def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
    def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
    [rmse, residuals.average(), chunk.size()]
    }
    def evalChunkSize = 2000
    def results = data.collate(evalChunkSize).collectParallel(stats)
    println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size())
    println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size()
    }
    House pricing revisited - GPars
    (Diagram: the chunked flow from slide 188 annotated with the code above - shuffle/take and collate
    produce the chunks, OLS#intercept/OLS#coefficients fit Model1..ModelN, transpose/sum aggregates
    them into the model, and the stats closure plus results.collect aggregate Stats1..StatsN)

    View Slide

  193. House pricing revisited – Apache Beam

    class MeanDoubleArrayCols implements SerializableFunction<Iterable<double[]>, double[]> {
    @Override
    double[] apply(Iterable<double[]> input) {
    double[] sum = null
    def count = 0
    for (double[] next : input) {
    if (sum == null) {
    sum = new double[next.size()]
    (0..<sum.size()).each { sum[it] = 0d } // start the running column sums at zero
    }
    (0..<sum.size()).each { sum[it] += next[it] }
    count++
    }
    if (sum != null) (0..<sum.size()).each { sum[it] /= count }
    return sum
    }
    }
    class AggregateModelStats implements SerializableFunction<Iterable<double[]>, double[]> {
    @Override
    double[] apply(Iterable<double[]> input) {
    double[] sum = null
    for (double[] next : input) {
    if (sum == null) {
    sum = new double[next.size()]
    (0..<sum.size()).each { sum[it] = 0d } // running [rmse, mean, count] starts at zero
    }
    def total = sum[2] + next[2]
    sum[0] = sqrt((sum[2] * sum[0] * sum[0] + next[2] * next[0] * next[0]) / total)
    sum[1] = (sum[2] * sum[1] + next[2] * next[1]) / total
    sum[2] = total
    }
    return sum
    }
    }

    Aggregate models
    Aggregate stats

    View Slide

  194. House pricing revisited – Apache Beam

    static buildPipeline(Pipeline p) {
    def features = [
    'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat',
    'sqft_above', 'grade', 'view', 'waterfront', 'floors'
    ]
    def readCsvChunks = new DoFn<String, double[][]>() {
    @ProcessElement
    void processElement(@Element String path, OutputReceiver<double[][]> receiver) throws IOException {
    def chunkSize = 6000
    def table = Table.read().csv(path)
    table = table.dropWhere(table.column("bedrooms").isGreaterThan(30))
    def idxs = 0..<table.rowCount()
    for (nextChunkIdxs in idxs.shuffled().collate(chunkSize)) {
    def chunk = table.rows(*nextChunkIdxs)
    receiver.output(chunk.as().doubleMatrix(*features))
    }
    }
    }
    def fitModel = new DoFn<double[][], double[]>() {
    @ProcessElement
    void processElement(@Element double[][] rows, OutputReceiver<double[]> receiver) throws IOException {
    double[] model = new OLS(rows.collect{ it[1..-1] } as double[][], rows.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] }
    receiver.output(model)
    }
    }
    def evalModel = { double[][] chunk, double[] model ->
    double intercept = model[0]
    double[] coefficients = model[1..-1]
    def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
    def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
    def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
    def mean = residuals.average()
    [rmse, mean, chunk.size()] as double[]
    }

    Map filename to chunks
    Map chunk to model
    Map chunk to stats

    View Slide

  195. House pricing revisited – Apache Beam

    def model2out = new DoFn<double[], String>() {
    @ProcessElement
    void processElement(@Element double[] ds, OutputReceiver<String> out) {
    out.output("intercept: ${ds[0]}, coeffs: ${ds[1..-1].join(', ')}".toString())
    }
    }
    def stats2out = new DoFn<double[], String>() {
    @ProcessElement
    void processElement(@Element double[] ds, OutputReceiver<String> out) {
    out.output("rmse: ${ds[0]}, mean: ${ds[1]}, count: ${ds[2]}".toString())
    }
    }
    var csvChunks = p
    .apply(Create.of('/path/to/kc_house_data.csv'))
    .apply('Create chunks', ParDo.of(readCsvChunks))
    var model = csvChunks
    .apply('Fit chunks', ParDo.of(fitModel))
    .apply(Combine.globally(new MeanDoubleArrayCols()))
    var modelView = model
    .apply(View.asSingleton())
    csvChunks
    .apply(ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView))
    .apply(Combine.globally(new AggregateModelStats()))
    .apply('Log stats', ParDo.of(stats2out)).apply(Log.ofElements())
    model
    .apply('Log model', ParDo.of(model2out)).apply(Log.ofElements())
    }
    def pipeline = Pipeline.create()
    buildPipeline(pipeline)
    pipeline.run().waitUntilFinish()
    INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327,
    675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274
    INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0
    Map model
    to output
    Map stats to
    output
    Create and run pipeline
    Build
    pipeline

    View Slide


  196. var csvChunks = p
    | Create.of('/path/to/kc_house_data.csv')
    | 'Create chunks' >> ParDo.of(readCsvChunks)
    var model = csvChunks
    | 'Fit chunks' >> ParDo.of(fitModel)
    | Combine.globally(new MeanDoubleArrayCols())
    var modelView = model
    | View.asSingleton()
    csvChunks
    | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)
    | Combine.globally(new AggregateModelStats())
    | 'Log stats' >> ParDo.of(stats2out) | Log.ofElements()
    model
    | 'Log model' >> ParDo.of(model2out) | Log.ofElements()
    }
    PCollection.metaClass.or = { List arg -> delegate.apply(*arg) }
    PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) }
    String.metaClass.rightShift = { PTransform arg -> [delegate, arg] }
    Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) }
    def pipeline = Pipeline.create()
    buildPipeline(pipeline)
    pipeline.run().waitUntilFinish()
    House pricing revisited – Apache Beam
    INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327,
    675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274
    INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0
    Create and run pipeline
    Some optional
    metaprogramming for
    Python-like coding style

    View Slide

  197. House pricing revisited – Apache Beam

    var csvChunks = p
    | Create.of('/path/to/kc_house_data.csv')
    | 'Create chunks' >> ParDo.of(readCsvChunks)
    var model = csvChunks
    | 'Fit chunks' >> ParDo.of(fitModel)
    | Combine.globally(new MeanDoubleArrayCols())
    var modelView = model
    | View.asSingleton()
    csvChunks
    | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)
    | Combine.globally(new AggregateModelStats())
    | 'Log stats' >> ParDo.of(stats2out) | Log.ofElements()
    model
    | 'Log model' >> ParDo.of(model2out) | Log.ofElements()
    }
    PCollection.metaClass.or = { List arg -> delegate.apply(*arg) }
    PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) }
    String.metaClass.rightShift = { PTransform arg -> [delegate, arg] }
    Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) }
    def pipeline = Pipeline.create()
    buildPipeline(pipeline)
    pipeline.run().waitUntilFinish()
    INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327,
    675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274
    INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0
    (Diagram: the chunked flow annotated with the Beam pipeline - readCsvChunks produces csvChunks
    (Chunk1..ChunkN), fitModel produces Model1..ModelN, MeanDoubleArrayCols combines them into model,
    modelView is passed as a sideInput to evalModel, and AggregateModelStats combines Stats1..StatsN
    into the overall stats)

    View Slide

  198. Optimization Overview Optimization:
    • Finding (optimal) solutions
    Approaches:
    • Dynamic programming
    • Linear programming
    • Non-linear programming
    • Constraint programming
    • Integer programming
    Aspects:
    • Combinatorial optimization
    • Continuous optimization
    Applications:
    • Timetabling/scheduling
    • Vehicle routing
    • Packing problems
    • Trading systems
    • Logistics
    • Language detection
    Problem classes (diagram):
    • CP – Constraint Programming
    • CIP – Constraint Integer Programming
    • MINLP – Mixed Integer Non-linear Programming
    • MIP – Mixed Integer Linear Programming
    • IP – Integer Linear Programming
    • SAT – Satisfiability
    • LP – Linear Programming
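    As a tiny, hedged illustration of the LP case (my example, using the same Apache Commons Math
    classes as the diet problem later in the deck):
    import org.apache.commons.math3.optim.linear.*
    import org.apache.commons.math3.optim.nonlinear.scalar.GoalType
    // maximise x + y subject to x + 2y <= 4 and x <= 3 (with x, y >= 0)
    def objective = new LinearObjectiveFunction([1, 1] as double[], 0)
    def constraints = new LinearConstraintSet([
        new LinearConstraint([1, 2] as double[], Relationship.LEQ, 4),
        new LinearConstraint([1, 0] as double[], Relationship.LEQ, 3)
    ])
    def best = new SimplexSolver().optimize(objective, constraints,
        GoalType.MAXIMIZE, new NonNegativeConstraint(true))
    printf "x = %.1f, y = %.1f, max = %.1f%n", best.point[0], best.point[1], best.value
    // expected: x = 3.0, y = 0.5, max = 3.5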

    View Slide

  199. Constraint Programming
    Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf

    View Slide

  200. Constraint Programming
    Typical domains:
    • Boolean
    (SAT problem)
    • integer
    • rational
    • interval
    (scheduling problems)
    • linear
    (approaches to non-linear problems do exist)
    • finite sets
    • mixed
    (2 or more of above)
    Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf

    View Slide

  201. Cryptarithmetic
    S E N D
    + M O R E
    = M O N E Y
    Replace the letters with
    decimal digits to make a
    valid arithmetic sum

    View Slide

  202. Cryptarithmetic: Brute force
    def solutions():
    # letters = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y')
    all_solutions = list()
    for s in range(1, 10):
    for e in range(0, 10):
    for n in range(0, 10):
    for d in range(0, 10):
    for m in range(1, 10):
    for o in range(0, 10):
    for r in range(0, 10):
    for y in range(0, 10):
    if len({s, e, n, d, m, o, r, y}) == 8:
    send = 1000 * s + 100 * e + 10 * n + d
    more = 1000 * m + 100 * o + 10 * r + e
    money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
    if send + more == money:
    all_solutions.append((send, more, money))
    return all_solutions
    print(solutions())
    S E N D
    + M O R E
    = M O N E Y
    [(9567, 1085, 10652)]

    View Slide

  203. Cryptarithmetic: Brute force
    def solutions() {
    // letters = ['s', 'e', 'n', 'd', 'm', 'o', 'r', 'y']
    def all_solutions = []
    for (s in 1..<10)
    for (e in 0..9)
    for (n in 0..9)
    for (d in 0..9)
    for (m in 1..9)
    for (o in 0..9)
    for (r in 0..9)
    for (y in 0..9)
    if ([s, e, n, d, m, o, r, y].toSet().size() == 8) {
    def send = 1000 * s + 100 * e + 10 * n + d
    def more = 1000 * m + 100 * o + 10 * r + e
    def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
    if (send + more == money)
    all_solutions.add([send, more, money])
    }
    return all_solutions
    }
    print(solutions())
    S E N D
    + M O R E
    = M O N E Y
    [[9567, 1085, 10652]]

    View Slide

  204. Cryptarithmetic: Brute force
    from itertools import permutations
    def solution2():
    letters = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y')
    digits = range(10)
    for perm in permutations(digits, len(letters)):
    sol = dict(zip(letters, perm))
    if sol['s'] == 0 or sol['m'] == 0:
    continue
    send = 1000 * sol['s'] + 100 * sol['e'] + 10 * sol['n'] + sol['d']
    more = 1000 * sol['m'] + 100 * sol['o'] + 10 * sol['r'] + sol['e']
    money = 10000 * sol['m'] + 1000 * sol['o'] + 100 * sol['n'] + 10 * sol['e'] + sol['y']
    if send + more == money:
    return send, more, money
    print(solution2())
    S E N D
    + M O R E
    = M O N E Y
    (9567, 1085, 10652)

    View Slide

  205. Cryptarithmetic: Brute force
    def solution2() {
    def digits = 0..9
    for (p in digits.permutations()) {
    if (p[-1] < p[-2]) continue
    def (s, e, n, d, m, o, r, y) = p
    if (s == 0 || m == 0)
    continue
    def send = 1000 * s + 100 * e + 10 * n + d
    def more = 1000 * m + 100 * o + 10 * r + e
    def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
    if (send + more == money)
    return [send, more, money]
    }
    }
    print(solution2())
    S E N D
    + M O R E
    = M O N E Y
    [9567, 1085, 10652]

    View Slide

  206. from csp import Constraint, CSP
    from typing import Dict, List, Optional
    class SendMoreMoneyConstraint(Constraint[str, int]):
    def __init__(self, letters: List[str]) -> None:
    super().__init__(letters)
    self.letters: List[str] = letters
    def satisfied(self, assignment: Dict[str, int]) -> bool:
    # not a solution if duplicate values
    if len(set(assignment.values())) < len(assignment):
    return False
    # if all vars assigned, check if correct
    if len(assignment) == len(self.letters):
    s: int = assignment["S"]
    e: int = assignment["E"]
    n: int = assignment["N"]
    d: int = assignment["D"]
    m: int = assignment["M"]
    o: int = assignment["O"]
    r: int = assignment["R"]
    y: int = assignment["Y"]
    send: int = s * 1000 + e * 100 + n * 10 + d
    more: int = m * 1000 + o * 100 + r * 10 + e
    money: int = m * 10000 + o * 1000 + n * 100 + e * 10 + y
    return send + more == money
    return True

    Cryptarithmetic: Constraint programming

    letters = ["S", "E", "N", "D", "M", "O", "R", "Y"]
    possible_digits = {}
    for letter in letters:
    possible_digits[letter] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    possible_digits["M"] = [1] # can't start with 0 constraint
    csp: CSP[str, int] = CSP(letters, possible_digits)
    csp.add_constraint(SendMoreMoneyConstraint(letters))
    solution: Optional[Dict[str, int]] = csp.backtracking_search()
    if solution is None:
    print("No solution found!")
    else:
    print(solution)
    S E N D
    + M O R E
    = M O N E Y
    {'S': 9, 'E': 5, 'N': 6, 'D': 7,
    'M': 1, 'O': 0, 'R': 8, 'Y': 2}
    Adapted from: https://freecontent.manning.com/constraint-satisfaction-problems-in-python/

    View Slide

  207. @Grab('org.choco-solver:choco-solver:4.10.0')
    import org.chocosolver.solver.Model
    import org.chocosolver.solver.variables.IntVar
    def model = new Model("SEND+MORE=MONEY")
    def S = model.intVar("S", 1, 9)
    def E = model.intVar("E", 0, 9)
    def N = model.intVar("N", 0, 9)
    def D = model.intVar("D", 0, 9)
    def M = model.intVar("M", 1, 9)
    def O = model.intVar("O", 0, 9)
    def R = model.intVar("R", 0, 9)
    def Y = model.intVar("Y", 0, 9)
    model.allDifferent(S, E, N, D, M, O, R, Y).post()

    Cryptarithmetic: Constraint programming

    IntVar[] ALL = [
    S, E, N, D,
    M, O, R, E,
    M, O, N, E, Y]
    int[] COEFFS = [
    1000, 100, 10, 1,
    1000, 100, 10, 1,
    -10000, -1000, -100, -10, -1]
    model.scalar(ALL, COEFFS, "=", 0).post()
    model.solver.findSolution()
    S E N D
    + M O R E
    = M O N E Y
    Solution: S=9, E=5, N=6,
    D=7, M=1, O=0, R=8, Y=2,

    View Slide

  208. Dietary restrictions
    Minimise cost of diet given:
    • Must be at least 300 calories
    • Not more than 10 grams of protein
    • Not less than 10 grams of carbohydrates
    • Not less than 8 grams of fat
    • At least 0.5 units of fish
    • No more than 1 unit of milk
    Bread Milk Cheese Potato Fish Yogurt
    Cost 2.0 3.5 8.0 1.5 11.0 1.0
    Protein (g) 4.0 8.0 7.0 1.3 8.0 9.2
    Fat (g) 1.0 5.0 9.0 0.1 7.0 1.0
    Carbohydrates (g) 15.0 11.7 0.4 22.6 0.0 17.0
    Calories 90 120 106 97 130 180
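    Written as a linear program (my restatement of the table above, one variable per food in units bought):
    minimise    2.0·b + 3.5·m + 8.0·c + 1.5·p + 11.0·f + 1.0·y              (cost)
    subject to  90·b + 120·m + 106·c + 97·p + 130·f + 180·y ≥ 300           (calories)
                4.0·b + 8.0·m + 7.0·c + 1.3·p + 8.0·f + 9.2·y ≤ 10          (protein)
                15.0·b + 11.7·m + 0.4·c + 22.6·p + 0.0·f + 17.0·y ≥ 10      (carbohydrates)
                1.0·b + 5.0·m + 9.0·c + 0.1·p + 7.0·f + 1.0·y ≥ 8           (fat)
                f ≥ 0.5,   0 ≤ m ≤ 1,   b, c, p, y ≥ 0
    where b, m, c, p, f, y are the quantities of bread, milk, cheese, potato, fish and yogurt.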

    View Slide

  209. Dietary restrictions (Excel)

    View Slide

  210. Dietary restrictions (Excel)

    View Slide

  211. Dietary restrictions (Google sheets)

    View Slide

  212. Diet problem (ojalgo)
    def model = new ExpressionsBasedModel()
    def bread = model.addVariable("Bread").lower(0)
    def milk = model.addVariable("Milk").lower(0).upper(1)
    def cheese = model.addVariable("Cheese").lower(0)
    def potato = model.addVariable("Potato").lower(0)
    def fish = model.addVariable("Fish").lower(0.5)
    def yogurt = model.addVariable("Yogurt").lower(0)
    def cost = model.addExpression("Cost")
    cost.set(bread, 2.0).set(milk, 3.5).set(cheese, 8.0).set(potato, 1.5).set(fish,
    11.0).set(yogurt, 1.0)
    def protein = model.addExpression("Protein").upper(10)
    protein.set(bread, 4.0).set(milk, 8.0).set(cheese, 7.0).set(potato,
    1.3).set(fish, 8.0).set(yogurt, 9.2)
    def fat = model.addExpression("Fat").lower(8)
    fat.set(bread, 1.0).set(milk, 5.0).set(cheese, 9.0).set(potato, 0.1).set(fish,
    7.0).set(yogurt, 1.0)
    def carbs = model.addExpression("Carbohydrates").lower(10)
    carbs.set(bread, 15.0).set(milk, 11.7).set(cheese, 0.4).set(potato,
    22.6).set(fish, 0.0).set(yogurt, 17.0)
    def calories = model.addExpression("Calories").lower(300)
    calories.set(bread, 90).set(milk, 120).set(cheese, 106).set(potato, 97).set(fish,
    130).set(yogurt, 180)
    def result = model.minimise()
    // for a variation, see:
    // https://www.ojalgo.org/2019/05/the-diet-problem/

    OPTIMAL 0.0 @ { 0.0, 0.0, 0.40813898143741034,
    1.8538791051880057, 0.5916230366492151, 0.0 }
    ############################################
    0 <= Bread: 0
    0 <= Milk: 0 <= 1
    0 <= Cheese: 0.408139
    0 <= Potato: 1.853879
    0.5 <= Fish: 0.591623
    0 <= Yogurt: 0
    10 <= Carbohydrates: 42.060923
    8 <= Fat: 8.0
    Cost: 12.553784
    Protein: 10.0 <= 10
    300 <= Calories: 300.0
    ############################################

    View Slide

  213. Diet problem (Choco)
    def model = new Model("Diet problem")
    def unbounded = 100000
    // scale quantities by 10, coefficients by 10, products by 100
    def bread = model.intVar("Bread", 0, unbounded)
    def milk = model.intVar("Milk", 0, 10)
    def cheese = model.intVar("Cheese", 0, unbounded)
    def potato = model.intVar("Potato", 0, unbounded)
    def fish = model.intVar("Fish", 5, unbounded)
    def yogurt = model.intVar("Yogurt", 0, unbounded)
    IntVar[] all = [bread, milk, cheese, potato, fish, yogurt]
    def cost = model.intVar("Cost", 0, unbounded)
    model.scalar(all, [20, 35, 80, 15, 110, 10] as int[], "=", cost).post()
    def protein = model.intVar("Protein", 0, 1000)
    model.scalar(all, [40, 80, 70, 13, 80, 92] as int[], "=", protein).post()
    def fat = model.intVar("Fat", 800, unbounded)
    model.scalar(all, [10, 50, 90, 1, 70, 10] as int[], "=", fat).post()
    def carbs = model.intVar("Carbohydrates", 1000, unbounded)
    model.scalar(all, [150, 117, 4, 226, 0, 170] as int[], "=", carbs).post()
    def calories = model.intVar("Calories", 30000, unbounded)
    model.scalar(all, [900, 1200, 1060, 970, 1300, 1800] as int[], "=", calories).post()
    model.setObjective(Model.MINIMIZE, cost)
    def found = model.solver.findSolution()
    if (found) {
    all.each { println "$it.name: ${it.value / 10}" }
    [carbs, fat, protein, calories, cost].each { println "$it.name: ${it.value / 100}" }
    } else {
    println "No solution"
    }
    Bread: 0
    Milk: 0
    Cheese: 0.5
    Potato: 1.9
    Fish: 0.5
    Yogurt: 0
    Carbohydrates: 43.14
    Fat: 8.19
    Protein: 9.97
    Calories: 302.3
    Cost: 12.35
    Choco supports RealVar but doesn’t have as
    rich a set of possible constraints for such vars.

    View Slide

  214. Diet problem (Choco)
    def model = new Model("Diet problem")
    def unbounded = 1000.0d
    def precision = 0.00001d
    // scale quantities by 10, coefficients by 10, products by 100
    def bread = model.realVar("Bread", 0.0, unbounded, precision)
    def milk = model.realVar("Milk", 0.0, 1.0, precision)
    def cheese = model.realVar("Cheese", 0.0, unbounded, precision)
    def potato = model.realVar("Potato", 0.0, unbounded, precision)
    def fish = model.realVar("Fish", 0.5, unbounded, precision)
    def yogurt = model.realVar("Yogurt", 0.0, unbounded, precision)
    RealVar[] all = [bread, milk, cheese, potato, fish, yogurt]
    def scalarIbex = { coeffs, var ->
    def (a, b, c, d, e, f) = coeffs
    model.realIbexGenericConstraint("$a*{0}+$b*{1}+$c*{2}+$d*{3}+$e*{4}+$f*{5}={6}",
    [*all, var] as RealVar[]).post();
    }
    def cost = model.realVar("Cost", 0.0, unbounded, precision)
    scalarIbex([2.0, 3.5, 8.0, 1.5, 11.0, 1.0], cost)
    def protein = model.realVar("Protein", 0.0, 10.0, precision)
    scalarIbex([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], protein)
    def fat = model.realVar("Fat", 8.0, unbounded, precision)
    scalarIbex([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], fat)
    def carbs = model.realVar("Carbohydrates", 10.0, unbounded, precision)
    scalarIbex([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], carbs)
    def calories = model.realVar("Calories", 300, unbounded, precision)
    scalarIbex([90, 120, 106, 97, 130, 180], calories)
    model.setObjective(Model.MINIMIZE, cost)
    def found = model.solver.findSolution()
    Bread: 0.025131 .. 0.025137
    Milk: 0.000009 .. 0.000010
    Cheese: 0.428571 .. 0.428571
    Potato: 1.848118 .. 1.848124
    Fish: 0.561836 .. 0.561836
    Yogurt: 0.000007 .. 0.000010
    Carbohydrates: 42.316203 .. 42.316211
    Fat: 8.000000 .. 8.000005
    Protein: 9.997920 .. 9.997926
    Calories: 300.000000 .. 300.000008
    Cost: 12.431241 .. 12.431245
    Choco does have a plugin (via JNI) for
    the Ibex C++ constraint processing library
    which does handle real numbers.
    def pretty = { var ->
    def bounds = found.getRealBounds(var)
    printf "%s: %.6f .. %.6f%n", var.name, *bounds
    }
    if (found) {
    all.each { pretty(it) }
    [carbs, fat, protein, calories, cost].each { pretty(it) }
    } else {
    println "No solution"
    }

    View Slide

  215. Diet problem (Apache Commons Math)
    import org.apache.commons.math3.optim.linear.*
    import org.apache.commons.math3.optim.nonlinear.scalar.GoalType
    import static org.apache.commons.math3.optim.linear.Relationship.*
    def cost = new LinearObjectiveFunction([2.0, 3.5, 8.0, 1.5, 11.0, 1.0] as double[], 0)
    static scalar(coeffs, rel, val) { new LinearConstraint(coeffs as double[], rel, val) }
    def bread_min = scalar([1, 0, 0, 0, 0, 0], GEQ, 0)
    def milk_min = scalar([0, 1, 0, 0, 0, 0], GEQ, 0)
    def milk_max = scalar([0, 1, 0, 0, 0, 0], LEQ, 1)
    def cheese_min = scalar([0, 0, 1, 0, 0, 0], GEQ, 0)
    def potato_min = scalar([0, 0, 0, 1, 0, 0], GEQ, 0)
    def fish_min = scalar([0, 0, 0, 0, 1, 0], GEQ, 0.5)
    def yogurt_min = scalar([0, 0, 0, 0, 0, 1], GEQ, 0)
    def protein = scalar([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], LEQ, 10)
    def fat = scalar([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], GEQ, 8)
    def carbs = scalar([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], GEQ, 10)
    def calories = scalar([90, 120, 106, 97, 130, 180], GEQ, 300)
    LinearConstraintSet constraints = [bread_min, milk_min, milk_max, fish_min, cheese_min,
    potato_min, yogurt_min, protein, fat, carbs, calories]
    def solution = new SimplexSolver().optimize(cost, constraints, GoalType.MINIMIZE)
    if (solution != null) {
    printf "Opt: %.2f%n", solution.value
    println solution.point.collect{ sprintf '%.2f', it }.join(', ')
    }
    Opt: 12.08
    -0.00, 0.05, 0.45, 1.87, 0.50, 0.00

    View Slide

  216. Diet problem (Google OrTools)
    import …
    Loader.loadNativeLibraries()
    static addConstraint(MPSolver solver, List vars, MPVariable comp, List coeffs) {
    solver.makeConstraint(0, 0).tap {constraint ->
    constraint.setCoefficient(comp, -1)
    vars.indices.each { i ->
    constraint.setCoefficient(vars[i], coeffs[i])
    }
    }
    }
    MPSolver.createSolver("SCIP").with {solver ->
    def bread = makeNumVar(0, infinity(),'bread')
    def milk = makeNumVar(0, 1, 'milk')
    def cheese = makeNumVar(0, infinity(), 'cheese')
    def potato = makeNumVar(0, infinity(), 'potato')
    def fish = makeNumVar(0.5, infinity(), 'fish')
    def yogurt = makeNumVar(0, infinity(), 'yogurt')
    def food = [bread, milk, cheese, potato, fish, yogurt]

    View Slide

  217. Diet problem (Google OrTools)

    def cost = makeNumVar(0, infinity(),'Cost')
    addConstraint(solver, food, cost, [2.0, 3.5, 8.0, 1.5, 11.0, 1.0])
    def protein = makeNumVar(0, 10,'Protein')
    addConstraint(solver, food, protein, [4.0, 8.0, 7.0, 1.3, 8.0, 9.2])
    def fat = makeNumVar(8, infinity(),'Fat')
    addConstraint(solver, food, fat, [1.0, 5.0, 9.0, 0.1, 7.0, 1.0])
    def carbs = makeNumVar(10, infinity(),'Carbs')
    addConstraint(solver, food, carbs, [15.0, 11.7, 0.4, 22.6, 0.0, 17.0])
    def cals = makeNumVar(300, infinity(),'Calories')
    addConstraint(solver, food, cals, [90, 120, 106, 97, 130, 180])
    def components = [protein, fat, carbs, cals]
    objective().with {objective ->
    objective.setCoefficient(cost, 1)
    objective.setMinimization()
    def result = solve()
    println result
    if (result == OPTIMAL) {
    println "Solution: " + objective.value()
    println "Foods: ${food.collect{ "${it.name()} ${it.solutionValue()}" }}"
    println "Components: ${components.collect{ "${it.name()} ${it.solutionValue()}" }}"
    println "Iterations: ${iterations()}, Wall time: ${wallTime()}ms"
    } else {
    System.err.println "The problem does not have an optimal solution!"
    }
    }
    }

    View Slide

  218. Diet problem (Google OrTools)

    def cost = makeNumVar(0, infinity(),'Cost')
    addConstraint(solver, food, cost, [2.0, 3.5, 8.0, 1.5, 11.0, 1.0])
    def protein = makeNumVar(0, 10,'Protein')
    addConstraint(solver, food, protein, [4.0, 8.0, 7.0, 1.3, 8.0, 9.2])
    def fat = makeNumVar(8, infinity(),'Fat')
    addConstraint(solver, food, fat, [1.0, 5.0, 9.0, 0.1, 7.0, 1.0])
    def carbs = makeNumVar(10, infinity(),'Carbs')
    addConstraint(solver, food, carbs, [15.0, 11.7, 0.4, 22.6, 0.0, 17.0])
    def cals = makeNumVar(300, infinity(),'Calories')
    addConstraint(solver, food, cals, [90, 120, 106, 97, 130, 180])
    def components = [protein, fat, carbs, cals]
    objective().with {objective ->
    objective.setCoefficient(cost, 1)
    objective.setMinimization()
    def result = solve()
    println result
    if (result == OPTIMAL) {
    println "Solution: " + objective.value()
    println "Foods: ${food.collect{ "${it.name()} ${it.solutionValue()}" }}"
    println "Components: ${components.collect{ "${it.name()} ${it.solutionValue()}" }}"
    println "Iterations: ${iterations()}, Wall time: ${wallTime()}ms"
    } else {
    System.err.println "The problem does not have an optimal solution!"
    }
    }
    }
    OPTIMAL
    Solution: 12.081337881108173
    Foods: [bread 0.0, milk 0.05359877488514538, cheese 0.44949881665042474, potato 1.8651677572045107, fish 0.5, yogurt 0.0]
    Components: [Protein 10.0, Fat 8.0, Carbs 42.95969650563832, Calories 300.0]
    Iterations: 4, Wall time: 89ms

    View Slide

  219. Groovy Community Information
    objectcomputing.com
    © 2021, Object Computing, Inc. (OCI). All rights reserved.
    • groovy.apache.org
    • groovy-lang.org
    • github.com/apache/groovy
    • groovycommunity.com
    [email protected]
    [email protected]
    • @ApacheGroovy
    • objectcomputing.com/training/catalog/groovy-programming/
    © 2021 Object Computing, Inc. (OCI). All rights reserved.

    View Slide

  220. objectcomputing.com
    © 2021, Object Computing, Inc. (OCI). All rights reserved.
    CONNECT WITH US
    +1 (314) 579-0066
    @objectcomputing
    objectcomputing.com
    THANK YOU
    Find me on twitter
    @paulk_asert
    © 2021 Object Computing, Inc. (OCI). All rights reserved.

    View Slide

  221. References
    • https://www.analyticsvidhya.com/blog/2017/02/lintroductory-guide-on-linear-programming-
    explained-in-simple-english/
    • Computational Mixed-Integer Programming
    https://www.birs.ca/cmo-workshops/2018/18w5208/files/GleixnerAmbros.pdf
    • Integer Linear Programming: Graphical Introduction
    https://www.youtube.com/watch?v=RhHhy-8sz-4
    • Integer Linear Programming: Binary Constraints
    https://www.youtube.com/watch?v=-3my1TkyFiM
    https://www.youtube.com/watch?v=B3biWsBLeCw
    https://www.youtube.com/watch?v=MO8uQnIch6I
    • Linear Programming: Alternate solutions, Infeasibility, Unboundedness, & Redundancy
    https://www.youtube.com/watch?v=eMA0LWsRQQ

    View Slide