Groovy is a powerful multi-paradigm programming language for the JVM that offers a wealth of features that make it ideal for many data science and big data scenarios.
* Groovy has a dynamic nature like Python, which means that it is very powerful, easy to learn, and productive. The language gets out of the way and lets data scientists write their algorithms naturally.
* Groovy has a static nature like Java and Kotlin, which makes it fast when needed. Its close alignment with Java means that you can often just cut-and-paste the Java examples from various big data solutions and they'll work just fine in Groovy.
* Groovy has first-class functional support, meaning that it offers features and allows solutions similar to Scala. Functional and stream processing with immutable data structures can offer many advantages when working in parallel processing or clustered environments.
These slides review the key benefits of using Groovy to develop data science solutions, including its integration with the JVM libraries commonly used in data science: libraries for data manipulation, machine learning and plotting, as well as big data solutions for scaling up these algorithms.
Topics covered include: regression, clustering, classification, neural networks, language processing, image detection.
Math/Data Science libraries covered include:
Weka, Smile, Tablesaw, Apache Commons Math, Jupyter/BeakerX notebooks, DeepLearning4J.
Libraries for scaling/concurrency include:
GPars, Apache Spark, Apache Ignite, Apache MXNet, Apache Beam, Apache NLPCraft.
objectcomputing.com
© 2021, Object Computing, Inc. (OCI). All rights reserved. No part of these notes may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior, written
permission of Object Computing, Inc. (OCI)
Groovy and Data Science
Presented by
Dr Paul King
© 2021 Object Computing, Inc. (OCI). All rights reserved.
Dr Paul King
OCI Groovy Lead
V.P. and PMC Chair Apache Groovy
Author:
https://www.manning.com/books/groovy-in-action-second-edition
Slides:
https://speakerdeck.com/paulk/groovy-data-science
Examples repo:
https://github.com/paulk-asert/groovy-data-science
Twitter:
@paulk_asert
Apache Groovy Programming Language
• Multi-faceted language
• Imperative/OO & functional (+ other styles via libraries)
• Dynamic & static
• Aligned closely with Java language
• Very extensible
• 17+ years since inception
• ~1B downloads (partial count)
• ~500 contributors
• 200+ releases
• https://www.youtube.com/watch?v=eIGOG-F9ZTw&feature=youtu.be
Friends of Apache Groovy
Open Collective
What is Groovy?
It’s like a super version of Java:
• Supports most Java syntax but allows simpler syntax for many constructs
• Supports all Java libraries but provides many extensions and its own productivity libraries
• Has both a static and dynamic nature
• Extensible language and tooling
Groovy has a low learning curve, particularly for Java developers
Groovy is Java's friend
Seamless integration
• IDEs provide cross-language compile, navigation, and refactoring
• Arbitrarily mix source language
• Drop-in replace any class
• Overloaded methods
• Syntax alignment
• Shared data types
Java
Groovy
Java-like or concise shortcuts
import java.util.List;
import java.util.ArrayList;
class Main {
private List keepShorterThan(List strings, int length) {
List result = new ArrayList();
for (int i = 0; i < strings.size(); i++) {
String s = (String) strings.get(i);
if (s.length() < length) {
result.add(s);
}
}
return result;
}
public static void main(String[] args) {
List names = new ArrayList();
names.add("Ted"); names.add("Fred");
names.add("Jed"); names.add("Ned");
System.out.println(names);
Main m = new Main();
List shortNames = m.keepShorterThan(names, 4);
System.out.println(shortNames.size());
for (int i = 0; i < shortNames.size(); i++) {
String s = (String) shortNames.get(i);
System.out.println(s);
}
}
}
names = ["Ted", "Fred", "Jed", "Ned"]
println names
shortNames = names.findAll{ it.size() < 4 }
println shortNames.size()
shortNames.each{ println it }
Concise syntax including DSL friendly
import java.util.List;
import java.util.ArrayList;
class Main {
private List keepShorterThan(List strings, int length) {
List result = new ArrayList();
for (int i = 0; i < strings.size(); i++) {
String s = (String) strings.get(i);
if (s.length() < length) {
result.add(s);
}
}
return result;
}
public static void main(String[] args) {
List names = new ArrayList();
names.add("Ted"); names.add("Fred");
names.add("Jed"); names.add("Ned");
System.out.println(names);
Main m = new Main();
List shortNames = m.keepShorterThan(names, 4);
System.out.println(shortNames.size());
for (int i = 0; i < shortNames.size(); i++) {
String s = (String) shortNames.get(i);
System.out.println(s);
}
}
}
names = ["Ted", "Fred", "Jed", "Ned"]
println names
shortNames = names.findAll{ it.size() < 4 }
println shortNames.size()
shortNames.each{ println it }
given the names "Ted", "Fred", "Jed" and "Ned"
display all the names
display the number of names having size less than 4
display the names having size less than 4
Metaprogramming
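A minimal sketch of what runtime metaprogramming looks like (illustrative example, not from the original slides):
// add a method to all Strings at runtime via the metaclass
String.metaClass.shout = { -> delegate.toUpperCase() + '!' }
assert 'groovy'.shout() == 'GROOVY!'
// intercept calls to methods that don't exist
class Robot {
    def methodMissing(String name, args) { "I don't know how to $name" }
}
assert new Robot().dance() == "I don't know how to dance"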
Matrix manipulation: Java vs Groovy
• Same example
• Same library
Array2DRowRealMatrix{{15.1379501385,40.488531856},{21.4354570637,59.5951246537}}
import org.apache.commons.math3.linear.*;
public class MatrixMain {
public static void main(String[] args) {
double[][] matrixData = { {1d,2d,3d}, {2d,5d,3d}};
RealMatrix m = MatrixUtils.createRealMatrix(matrixData);
double[][] matrixData2 = { {1d,2d}, {2d,5d}, {1d, 7d}};
RealMatrix n = new Array2DRowRealMatrix(matrixData2);
RealMatrix o = m.multiply(n);
// Invert o, using LU decomposition
RealMatrix oInverse = new LUDecomposition(o).getSolver().getInverse();
RealMatrix p = oInverse.scalarAdd(1d).scalarMultiply(2d);
RealMatrix q = o.add(p.power(2));
System.out.println(q);
}
}
Thanks to operator overloading and extensible tooling
groovyConsole
showing @Grab, power asserts and natural language processing
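A hedged sketch of the kind of script typically shown in the console: @Grab fetches a dependency at run time and a failing power assert prints an annotated diagram of each sub-expression (the library coordinates here are just an illustration):
@Grab('org.apache.commons:commons-lang3:3.12.0')
import org.apache.commons.lang3.StringUtils

def word = 'Groovy'
assert StringUtils.reverse(word) == 'yvoorG'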
REPL for Groovy (groovysh)
Command chains and extensible type system
Text file? No, it’s a statically typed Groovy program which won’t compile if the animal names aren’t valid.
A constraint programming solver is used to find the solution.
cranes have 2 legs
tortoises have 4 legs
there are 7 animals
there are 20 legs
display solution
Cranes 4
Tortoises 3
Command chains and extensible type system
cranes have 2 legs
tortoises have 4 legs
millipedes have 1000 legs
there are 8 animals
there are 1020 legs
display solution
Cranes 4
Tortoises 3
Millipedes 1
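Command chains are the language feature underneath such DSLs: alternating method calls and arguments can be written without dots and parentheses so they read like a sentence. A minimal standalone sketch (adapted from the Groovy documentation, not the constraint-solver DSL used above):
show = { println it }
square_root = { Math.sqrt(it) }

def please(action) {
    [the: { what ->
        [of: { n -> action(what(n)) }]
    }]
}

// parses as: please(show).the(square_root).of(100)
please show the square_root of 100      // prints 10.0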
IDE support
Intellij IDEA, Apache NetBeans, Eclipse, Groovy Web Console
Build tools
Gradle, Apache Maven, Apache Ant
plugins {
id 'groovy'
}
repositories {
jcenter()
}
dependencies {
implementation 'org.codehaus.groovy:groovy-all:2.5.7'
testImplementation 'org.spockframework:spock-core:1.2-groovy-2.5'
}
project {
modelVersion '4.0.0'
groupId 'org.exampledriven'
artifactId 'maven-polyglot-groovy-example-parent'
version '1.0-SNAPSHOT'
packaging 'pom'
modules {
module 'maven-polyglot-submodule'
module 'maven-polyglot-spring'
}
}
ant.sequential {
echo("inside sequential")
def myDir = "target/AntTest/"
mkdir(dir: myDir)
copy(todir: myDir) {
fileset(dir: "src/test") {
include(name: "**/*.groovy")
}
}
echo("done")
}
Groovy frameworks plus entire Java ecosystem
Data Science Process
Research Goals → Obtain Data → Data Preparation → Data Exploration / Visualization → Data Modeling
Data ingestion, Data storage, Data processing platforms, Modeling algorithms, Math libraries, Graphics processing, Integration, Deployment
Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267
Python/Scala/R Libraries/Tools/Frameworks for Data Science
Groovy/Java Libraries/Tools/Frameworks for Data Science
Source: https://medium.com/activewizards-machine-learning-company/comparison-of-top-data-science-libraries-for-python-r-and-scala-infographic-574069949267
Machine Learning
DeepLearning4J
Apache Spark ML
Apache MXNet
Visualization
GroovyFX
JFreeChart
Tablesaw
Mathematics and Engineering
GPars
Apache Commons Math
Smile
Weka
Groovylab
Apache Spark
Apache Ignite
Apache Beam
Data Manipulation and Analysis
Tablesaw
Apache Commons CSV
OpenCSV
Obtain data
Identify sources
• Internal/External (data providers)
• Databases, files, events, IoT
• Unstructured/structured
• Text, XML, JSON, YAML, CSV/spreadsheets, audio, images, video, web services
• Apache Tika
• JExcel
• Apache POI
Clarify ownership
Check quality
def jsonSlurper = new JsonSlurper()
def object = jsonSlurper.parseText('{ "myList": [4, 8, 15, 16, 23, 42] }')
assert object instanceof Map
assert object.myList instanceof List
assert object.myList == [4, 8, 15, 16, 23, 42]
def response = new XmlSlurper().parseText(books)
def authorResult = response.value.books.book[0].author
assert authorResult.text() == 'Miguel de Cervantes'
def qry = 'SELECT * FROM Author'
assert sql.rows(qry, 1, 3)*.firstname == ['Dierk', 'Paul', 'Guillaume']
assert sql.rows(qry, 4, 3)*.firstname == ['Hamlet', 'Cedric', 'Erik']
assert sql.rows(qry, 7, 3)*.firstname == ['Jon']
Data preparation
Collections:
• Java: lists, maps, sets
• Groovy: literal notation, GDK methods, GPath (small sketch after this list)
• Libraries
• Google Guava https://github.com/google/guava
• Apache Common Collections https://commons.apache.org/collections/
DataFrames:
• Joinery https://cardillo.github.io/joinery/
A data frame implementation in the spirit of Pandas or R data frames with show/plot
• Tablesaw https://jtablesaw.github.io/tablesaw/
Java dataframe and visualization library
• Apache Spark DataFrames https://spark.apache.org/docs/latest/sql-programming-guide.html
Equivalent to a table in a relational database or a data frame in R/Python
• Paleo https://github.com/netzwerg/paleo
Immutable Java 8 data frames with typed columns (including primitives)
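A small hedged sketch of the Groovy collection niceties mentioned above (literal notation, GDK methods, GPath-style navigation):
def langs = [[name: 'Groovy', year: 2003], [name: 'Java', year: 1995]]   // list/map literals
assert langs*.name == ['Groovy', 'Java']                                 // GPath/spread navigation
assert langs.findAll{ it.year > 2000 }*.name == ['Groovy']               // GDK filtering
def counts = [a: 1, b: 2]
assert counts.collect { k, v -> k + '=' + v } == ['a=1', 'b=2']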
Data exploration - DEX
Data exploration - Weka
More info: https://www.youtube.com/watch?v=7quZv6WCTQc
Visualization
• Open-Source plotting libraries:
GRAL, JFreeChart, Xchart, JMathPlot, Jzy3d, JavaFX, GroovyFX
• Types: Scatter plot, Line plot, Area plot, Pie plot, Donut plot, Horizontal bars, Vertical bars, Bubble plot, Radar plot, Box plot, Raster plot, Contour plot, Gauge plot, Step plot, Gantt plot, Histogram
• Features: Multiple axes, Logarithmic axes, Line smoothing, Data filtering/aggregation
Notebooks supporting Groovy
• Jupyter/beakerx
• Apache Zeppelin
• GroovyLab
• Seco
Data preparation
IO:
• Java in-built
• Groovy enhancements
• Libraries
• Apache Commons IO https://commons.apache.org/io/
• AOL Cyclops-React https://github.com/aol/cyclops-react
Web scraping:
• GEB
Batch processing/ETL:
• Spring Batch http://projects.spring.io/spring-batch/
• Apache Camel
• Apache Gobblin
Math libraries
Math and stats libraries
• Java: log, exp, other basic methods.
Libraries:
• Apache Commons Math http://commons.apache.org/math/
statistics, optimization, and linear algebra
• Apache Mahout http://mahout.apache.org/
linear algebra, plus distributed linear algebra and machine learning
• JBlas http://jblas.org/ optimized and very fast linear algebra package
that uses the BLAS library
Also, many of the machine learning libraries come with some math
functionality, often linear algebra, stats, and optimization.
Data storage
Database:
• Java’s JDBC
• Groovy JDBC enhancements
and Datasets
• Other NOSQL
• Apache Cassandra
• MongoDB
• Apache CouchDB
• Neo4j
import com.gmongo.GMongo
def db = new GMongo().getDB('athletes')
db.athletes.drop()
db.athletes << [
first: 'Paul',
last: 'Tergat', dob: '1969-06-17',
runs: [[distance: 42195, time: 2 * 60 * 60 + 4 * 60 + 55,
venue: 'Berlin', when: '2003-09-28']]
]
println "World rows following $marathon3.venue $marathon3.when:"
def t = new Traversal()
for (Path p in t.description().breadthFirst().
relationships(MyRelationshipTypes.supercedes).
evaluator(Evaluators.fromDepth(1)).
uniqueness(Uniqueness.NONE).
traverse(marathon3)) {
def newRow = p.endNode()
println "$newRow.venue $newRow.when"
}
Graph g = new Neo4jGraph(graphDb)
def pretty = { it.collect { "$it.venue $it.when" }.join(', ') }
def results = []
g.V('venue', 'London').fill(results)
println 'London world records: ' + pretty(results)
results = []
g.V('venue', 'London').in('supercedes').fill(results)
println 'World records after London: ' + pretty(results)
Machine learning and data mining
Machine learning and data mining libraries:
• Weka http://www.cs.waikato.ac.nz/ml/weka/
• JavaML http://java-ml.sourceforge.net/
old and reliable ML library but not actively updated
• Smile http://haifengl.github.io/smile/
ML library under active development
• JSAT https://github.com/EdwardRaff/JSAT
contains many machine learning algorithms
• H2O http://www.h2o.ai/
distributed Java ML framework also supports Scala, R and Python
• Apache Mahout http://mahout.apache.org/
distributed linear algebra framework
• Apache Spark ml and mllib
https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples
Neural networks and text processing
Neural networks:
• Encog http://www.heatonresearch.com/encog/
• DeepLearning4j http://deeplearning4j.org/
natively supports Keras models
Text processing
• Java: StringTokenizer, the java.text package, regular expressions
• Groovy: Regex enhancements, templates, String enhancements (small sketch after this list)
• Libraries:
• Apache Lucene https://lucene.apache.org/
• Stanford CoreNLP http://stanfordnlp.github.io/CoreNLP/
• Apache OpenNLP https://opennlp.apache.org/
• LingPipe http://alias-i.com/lingpipe/
• GATE https://gate.ac.uk/
• MALLET http://mallet.cs.umass.edu/
• Smile http://haifengl.github.io/smile/
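Before reaching for a library, Groovy's own regex and String enhancements cover many cases; a small sketch of those built-ins:
def text = 'Groovy makes text processing easy'
assert text =~ /text/                                 // =~ find
assert '2021-06-15' ==~ /\d{4}-\d{2}-\d{2}/           // ==~ full match
assert text.tokenize().size() == 5                    // GDK String method
def name = 'Groovy'
assert "Hello ${name}!" == 'Hello Groovy!'            // GString template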
Scaling up
Microservice frameworks:
• Micronaut
Distributed data processing:
• Apache Hadoop
• Apache Spark
• Apache Flink
• Apache Samza
• Apache Kafka
• Apache Storm
• Apache Apex
• Apache Beam
• Apache Ignite
• Apache Nifi
Scaling up: Apache Ignite
Ignite™ is a memory-centric distributed database, caching, and
processing platform for transactional, analytical, and streaming
workloads delivering in-memory speeds at petabyte scale
Scaling up: Apache Spark
Source: https://spark.apache.org/ and https://renovacloud.com/en/an-introduction-to-and-evaluation-of-apache-spark-for-big-data-architectures/
Apache Spark™ is a unified analytics engine for
large-scale data processing
• A Concurrency & Parallelism Framework for
Groovy and Java
• Fork/join, parallel arrays, map/reduce, actors, agents,
STM, CSP, dataflow, asynchronous functions
Scaling Up: GPars
Data Parallelism (Fork/Join, Map/Reduce): fixed coordination (for collections)
Actors: explicit coordination
Safe Agents: delegated coordination
Dataflow: implicit coordination
def decryptor = actor {
loop {
react { String message ->
reply message.reverse()
}
}
}
def console = actor {
decryptor << 'lellarap si yvoorG'
react {
println 'Decrypted message: ' + it
}
}
console.join()
final def df = new Dataflows()
task { df.z = df.x + df.y }
task { df.x = 10 }
task { df.y = 5 }
assert df.z == 15
def nums = [1, 2, 3, 4]
def result = nums.collectParallel{it * 2}
assert result == [2, 4, 6, 8]
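Note that collectParallel is a GPars extension method, available inside a GParsPool.withPool block; a minimal sketch of the surrounding context assumed by the snippet above:
import groovyx.gpars.GParsPool

GParsPool.withPool {
    def nums = [1, 2, 3, 4]
    assert nums.collectParallel { it * 2 } == [2, 4, 6, 8]
}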
def agent = new Agent([])
agent { it << 'Dave' }
agent { it << 'Joe' }
assert agent.val.size() == 2
class Account {
private amount = newTxnInteger(0)
void transfer(final int a) {
atomic { amount.increment(a) }
}
int getCurrentAmount() {
atomicWithInt { amount.get() }
}
}
def future = {it * 2}.callAsync(3)
assert 6 == future.get()
(Diagram: dataflow graph where tasks producing x = 10 and y = 5 feed a + node yielding z = 15)
Scaling up: Apache Beam
Apache Beam is an open source, unified model for defining
both batch and streaming data-parallel processing pipelines
Source: https://beam.apache.org/get-started/beam-overview/
Computer science algorithms
Decision problems, Search problems, Counting problems, Optimization problems
• Paradigms: brute force, divide & conquer, search & enumerate, randomized, complexity reduction, recursive, back tracking, graph
• Optimization approaches: linear programming, dynamic programming, greedy, heuristics
Data science algorithms
Data Mining, Statistics, Machine Learning, Optimization
• Analytics: descriptive, predictive, prescriptive
• Analysis: anomaly detection, classification, regression, clustering, association, optimization, dimension reduction
• Data relationship: linear, non-linear
• Assumptions: parametric, non-parametric
• Strategy: supervised, unsupervised, reinforcement
• Combining: ensemble, boosting
Data Science Algorithms
Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/
Data Science Algorithms
Source: Jason Brownlee, https://machinelearningmastery.com/master-machine-learning-algorithms/
Candles and COVID-19
Candles and COVID-19
Working with tables and plots
List traces(String filename, String lineColor, String markerColor) {
def url = getClass().classLoader.getResource(filename)
def table = new XlsxReader().read(builder(url).build())
table.addColumns(DateColumn.create('YearMonth', table.column('Date').collect { LocalDate.of(it.year, it.month, 15) }))
def janFirst2017 = LocalDateTime.of(2017, JANUARY, 1, 0, 0)
Function from2017 = r -> r.dateTimeColumn('Date').isAfter(janFirst2017)
Function top3 = r -> r.intColumn('CandleID').isLessThanOrEqualTo(3)
def byMonth = table.sortAscendingOn('Date')
.where(and(from2017, top3))
.summarize('Rating', mean).by('YearMonth')
def byDate = table.sortAscendingOn('Date')
.where(and(from2017, top3))
.summarize('Rating', mean).by('Date')
def averaged = ScatterTrace.builder(byMonth.dateColumn('YearMonth'), byMonth.nCol('Mean [Rating]'))
.mode(ScatterTrace.Mode.LINE)
.line(Line.builder().width(5).color(lineColor).shape(Line.Shape.SPLINE).smoothing(1.3).build())
.build()
def scatter = ScatterTrace.builder(byDate.dateTimeColumn('Date'), byDate.nCol('Mean [Rating]'))
.marker(Marker.builder().opacity(0.3).color(markerColor).build())
.build()
[averaged, scatter]
}
Working with tables and plots
def (sAverage, sScatter) = traces('Scented_all.xlsx', 'seablue', 'lightskyblue')
def (uAverage, uScatter) = traces('Unscented_all.xlsx', 'seagreen', 'lightgreen')
Plot.show(new Figure(layout('scented'), sAverage, sScatter, line))
Plot.show(new Figure(layout('unscented'), uAverage, uScatter, line))
Working with tables and plots
Regression Overview
Regression:
• Relationship between variables
Types:
• Linear OLSR
• Logistic
• Polynomial
• Stepwise
• Ridge
• Lasso
• ElasticNet
Aspects:
• Under/over fitting
• Outliers
Applications:
• House price prediction
• Market predictions
• Ratings prediction
• Weather forecast
Linear regression case study: House price predictions
See also: https://nicolai92.github.io/posts/pentaho-weka-prediction
Dataset: https://www.kaggle.com/harlfoxem/housesalesprediction/
House price predictions – view data graphically
import org.apache.commons.math3.random.EmpiricalDistribution
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
def all = csv.collect { it.bedrooms.toInteger() }
def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) }
def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }
start {
stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) {
scene {
barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) {
series(name: 'Number of properties', data: bins)
}
}
}
}
House price predictions – view data graphically
House price predictions – view stats
import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
def all = csv.collect { it.bedrooms.toInteger() }
def stats = new SummaryStatistics()
all.each{ stats.addValue(it as double) }
println stats.summary
n: 21613
min: 0.0
max: 33.0
mean: 3.370841623097218
std dev: 0.930061831147451
variance: 0.8650150097573497
sum: 72854.0
House price predictions – investigate outliers
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
csv.findAll{ it.bedrooms.toInteger() > 10 }.each{ println it.toMap() as TreeMap }
[bathrooms:3, bedrooms:11, condition:3, date:20140821T000000, floors:2, grade:7, id:1773100755, lat:47.556,
long:-122.363, price:520000, sqft_above:2400, sqft_basement:600, sqft_living:3000, sqft_living15:1420,
sqft_lot:4960, sqft_lot15:4960, view:0, waterfront:0, yr_built:1918, yr_renovated:1999, zipcode:98106]
[bathrooms:1.75, bedrooms:33, condition:5, date:20140625T000000, floors:1, grade:7, id:2402100895, lat:47.6878,
long:-122.331, price:640000, sqft_above:1040, sqft_basement:580, sqft_living:1620, sqft_living15:1330,
sqft_lot:6000, sqft_lot15:4700, view:0, waterfront:0, yr_built:1947, yr_renovated:0, zipcode:98103]
House price predictions – investigate outliers
Some libraries optionally support CSV -> richly-typed domain classes
import com.opencsv.bean.*
import groovy.transform.ToString
@ToString(includeNames = true)
class House {
@CsvBindByName
Integer bedrooms
@CsvBindByName
String bathrooms
@CsvBindByName(column = 'sqft_lot')
Integer area_lot
}
def file = 'kc_house_data.csv' as File
def builder = new CsvToBeanBuilder(new FileReader(file))
def rows = builder.withType(House).build().parse()
rows.findAll{ it.bedrooms > 10 }.each{ println it }
House(bedrooms:11, bathrooms:3, area_lot:4960)
House(bedrooms:33, bathrooms:1.75, area_lot:6000)
House prices – remove outlier and view bedrooms
import org.apache.commons.math3.random.EmpiricalDistribution
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
def all = csv.collect { it.bedrooms.toInteger() }.findAll{ it < 30 }
def dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) }
def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }
start {
stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) {
scene {
barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) {
series(name: 'Number of properties', data: bins)
}
}
}
}
House price predictions – remove outlier and view price
import org.apache.commons.math3.random.EmpiricalDistribution
import org.apache.commons.math3.stat.descriptive.SummaryStatistics
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
def file = 'kc_house_data.csv' as File
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
def all = csv.findAll { it.bedrooms.toInteger() < 30 }.collect { it.price.toDouble() }
def info = new SummaryStatistics(); all.each(info::addValue)
def head = "Price percentile (min=\$$info.min, mean=\$${info.mean as int}, max=\$$info.max)"
def dist = new EmpiricalDistribution(100).tap{ load(all as double[]) }
def bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }
start {
stage(title: 'Price histogram', show: true, width: 800, height: 600) {
scene {
barChart(title: head, barGap: 0, categoryGap: 0) {
series(name: 'Number of properties', data: bins)
}
}
}
}
House price predictions – explore with tablesaw
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
21613 rows X 21 cols
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
Structure of kc_house_data.csv
Index | Column Name | Column Type |
-------------------------------------------
0 | id | LONG |
1 | date | STRING |
2 | price | DOUBLE |
3 | bedrooms | INTEGER |
4 | bathrooms | DOUBLE |
5 | sqft_living | INTEGER |
6 | sqft_lot | INTEGER |
7 | floors | DOUBLE |
8 | waterfront | INTEGER |
9 | view | INTEGER |
... | ... | ... |
11 | grade | INTEGER |
12 | sqft_above | INTEGER |
13 | sqft_basement | INTEGER |
14 | yr_built | INTEGER |
15 | yr_renovated | INTEGER |
16 | zipcode | INTEGER |
17 | lat | DOUBLE |
18 | long | DOUBLE |
19 | sqft_living15 | INTEGER |
20 | sqft_lot15 | INTEGER |
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
Column: bedrooms
Measure | Value |
-----------------------------------
n | 21613.0 |
sum | 72854.0 |
Mean | 3.370841623097218 |
Min | 0.0 |
Max | 33.0 |
Range | 33.0 |
Variance | 0.8650150097573497 |
Std. Dev | 0.930061831147451 |
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
kc_house_data.csv
id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1773100755 | 20140821 | 520000.0 | 11 | 3.0 | 3000 | 4960 | 2.0 | 0 | 0 | 3 | 7 | 2400 | 600 | 1918 | 1999 | 98106 | 47.556 | -122.363 | 1420 | 4960 |
2402100895 | 20140625 | 640000.0 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1040 | 580 | 1947 | 0 | 98103 | 47.687 | -122.331 | 1330 | 4700 |
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
21612 rows X 21 cols
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
kc_house_data.csv summary
bedrooms | Mean [price] | Min [price] | Max [price] |
-------------------------------------------------------------------
0 | 409503.8461538461 | 139950.0 | 1295650.0 |
1 | 317642.8844221105 | 75000.0 | 1247000.0 |
2 | 401372.68188405805 | 78000.0 | 3278000.0 |
3 | 466232.0784812695 | 82000.0 | 3800000.0 |
4 | 635419.5042138912 | 100000.0 | 4489000.0 |
5 | 786599.8288569644 | 133000.0 | 7062500.0 |
6 | 825520.6360294118 | 175000.0 | 7700000.0 |
7 | 951184.6578947367 | 280000.0 | 3200000.0 |
8 | 1105076.923076923 | 340000.0 | 3300000.0 |
9 | 893999.8333333334 | 450000.0 | 1400000.0 |
10 | 819333.3333333334 | 650000.0 | 1148000.0 |
11 | 520000.0 | 520000.0 | 520000.0 |
House price predictions – explore with tablesaw
import tech.tablesaw.api.*
import tech.tablesaw.plotly.Plot
import tech.tablesaw.plotly.api.*
import static tech.tablesaw.aggregate.AggregateFunctions.*
def rows = Table.read().csv('kc_house_data.csv')
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))
def cleaned = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
println cleaned.shape()
println cleaned.summarize("price", mean, min, max).by("bedrooms")
Plot.show(ScatterPlot.create("Price x bathrooms x grade", cleaned, "bathrooms", "price", 'grade'))
House price predictions – explore with tablesaw
cleaned.addColumns(
StringColumn.create("waterfrontDesc", cleaned.column("waterfront").collect{ it ? 'waterfront' : 'interior' }),
DoubleColumn.create("scaledGrade", cleaned.column("grade").collect{ it * 2 }),
DoubleColumn.create("scaledPrice", cleaned.column("price").collect{ it / 100000 })
)
Plot.show(BubblePlot.create("Price vs living area and grade (bubble size)",
cleaned, "sqft_living", "price", "scaledGrade", "waterfrontDesc"))
Plot.show(Scatter3DPlot.create("Grade, living space, bathrooms and price (bubble size)",
cleaned, "sqft_living", "bathrooms", "grade", "scaledPrice", "waterfrontDesc"))
House price predictions – explore with jupyter/beakerx
Regression with Least Squares
Find line of “best fit”:
• Find intercept and slope of line which minimizes the sum of the square of the residual errors
• Line is represented by: y = m x + b
• For points (x₀, y₀)..(xₙ, yₙ) errors will be:
  err₀ = y₀ − (m x₀ + b)
  ..
  errₙ = yₙ − (m xₙ + b)
  which are then squared
• To minimise, we find where the derivative is zero
• Solving the math gives:
  b = (Σy Σx² − Σx Σxy) / (n Σx² − (Σx)²)
  m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
(Diagram: points (x₀, y₀)..(xₙ, yₙ), the fitted line with intercept (0, b), and the residual errors err₀..errₙ)
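As a quick illustration of the closed-form result, a hedged sketch in plain Groovy (toy data, not from the slides):
def xs = [1, 2, 3, 4, 5]
def ys = [2.1, 3.9, 6.2, 8.0, 9.8]
def n = xs.size()
def sx = xs.sum()
def sy = ys.sum()
def sxx = xs.sum{ it * it }
def sxy = [xs, ys].transpose().sum{ x, y -> x * y }
def m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
def b = (sy * sxx - sx * sxy) / (n * sxx - sx * sx)
printf 'm = %.3f, b = %.3f%n', m, b      // about 1.950 and 0.150 for this data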
Regression with Gradient Descent
Find line of “best fit”:
• Find intercept and slope of line which minimizes the sum of the square of the residual errors
• Line is represented by: y = m x + b and we might have a similar loss function
• Instead of solving for the derivative of the loss function, we take repeated steps in the opposite direction of the gradient, taking smaller steps as we approach zero
• A learning rate influences step size
• Amenable to streaming & can involve random or mini-batch subsets of data for efficiency
(Diagram: loss function (residual error) plotted against the parameter being optimized (could be multiple dimensions))
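A hedged sketch of the same fit done iteratively with a fixed learning rate and full-batch updates (toy data, illustrative only):
def xs = [1, 2, 3, 4, 5]
def ys = [2.1, 3.9, 6.2, 8.0, 9.8]
def m = 0.0
def b = 0.0
def rate = 0.01
1000.times {
    // gradients of the mean squared error with respect to m and b
    def gradM = xs.indices.sum{ i -> -2 * xs[i] * (ys[i] - (m * xs[i] + b)) } / xs.size()
    def gradB = xs.indices.sum{ i -> -2 * (ys[i] - (m * xs[i] + b)) } / xs.size()
    m -= rate * gradM
    b -= rate * gradB
}
printf 'm = %.3f, b = %.3f%n', m, b      // approaches the closed-form least squares solution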
House price predictions – linear regression
import org.apache.commons.math3.stat.regression.SimpleRegression
import static groovyx.javafx.GroovyFX.start
import static org.apache.commons.csv.CSVFormat.RFC4180 as CSV
def feature = 'bedrooms'
def nonOutliers = feature == 'bedrooms' ? { it[0] < 30 } : { true }
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
def all = csv.collect { [it[feature].toDouble(), it.price.toDouble()] }.findAll(nonOutliers)
def reg = new SimpleRegression().tap{ addData(all as double[][]) }
def (min, max) = all.transpose().with{ [it[0].min(), it[0].max()] }
def predicted = [[min, reg.predict(min)], [max, reg.predict(max)]]
start {
stage(title: 'Price vs Number of bedrooms', show: true, width: 800, height: 600) {
scene {
scatterChart {
series(name: 'Actual', data: all)
}
lineChart {
series(name: 'Predicted', data: predicted)
}
}
}
}
House price predictions – linear regression
• Can be repeated for other features
House price predictions – linear regression
House price predictions – evaluating regression
@Canonical
@FXBindable
class Feature {
String name
double rr, rmseTrain, rmseTest, mean
}
def features = [new Feature('bedrooms'), new Feature('bathrooms'), new Feature('grade'), new Feature('sqft_living')]
def idxs = 0..<features.size()
def priceIdx = features.size()
def file = getClass().classLoader.getResource('kc_house_data.csv').file
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
def all = csv.collect { row ->
[*idxs.collect{ row[features[it].name].toDouble() }, row.price.toDouble()]
}.findAll{ it[0] < 30 }
def (train, test) = all.chop(all.size() * 0.8 as int, -1)
def trainT = train.transpose()
def regs = idxs.collect { idx ->
new SimpleRegression().tap{ addData([trainT[idx], trainT[priceIdx]].transpose() as double[][]) }
}
def residuals = idxs.collect{ idx -> test.collect{ regs[idx].predict(it[idx]) - it[priceIdx] } }
House price predictions – evaluating regression
idxs.each { idx ->
features[idx].rr = regs[idx].RSquare
features[idx].rmseTrain = Math.sqrt(regs[idx].meanSquareError)
features[idx].rmseTest = Math.sqrt(StatUtils.sumSq(residuals[idx] as double[]) / (test.size() - 1))
features[idx].mean = StatUtils.mean(residuals[idx] as double[])
}
start {
stage(title: "Error statistics for feature", visible: true) {
scene(fill: groovyblue, width: 800, height:200) {
stackPane(padding: 10) {
tableView(items: features) {
tableColumn(text: "Feature", property: 'name')
tableColumn(text: "R²", property: 'rr')
tableColumn(text: "RMSE (train)", property: 'rmseTrain')
tableColumn(text: "RMSE (test)", property: 'rmseTest')
tableColumn(text: "Residuals Mean", property: 'mean')
}
}
}
}
}
House price predictions – evaluating regression
def maxError = [residuals[graphIdx].min(), residuals[graphIdx].max()].max{ Math.abs(it) }
residuals[graphIdx] << maxError * -1 // make graph even around origin
maxError = Math.abs(maxError)
def step = maxError.toInteger() / 50
def dist = new EmpiricalDistribution(100).tap{ load(residuals[graphIdx] as double[]) }
def ndist = new NormalDistribution(0, dist.sampleStats.standardDeviation)
def bins = dist.binStats.indexed().collect { i, v ->
[v.n ? v.mean.toInteger().toString(): (-maxError + i * step).toInteger().toString(), v.n]
}
def nbins = dist.binStats.indexed().collect { i, v ->
def x = v.n ? v.mean.toInteger() : (-maxError + i * step).toInteger(); [x.toString(), ndist.probability(x, x + step)]
}
def scale = dist.binStats.max{ it.n }.n / nbins.max{ it[1] }[1]
nbins = nbins.collect{ [it[0], it[1] * scale] }
start {
stage(title: "Error histogram for ${features[graphIdx].name}", show: true, width: 800, height: 600) {
scene {
barChart(title: 'Error percentile', barGap: 0, categoryGap: 0) {
series(name: 'Error in prediction', data: bins)
series(name: 'Normal distribution', data: nbins)
}
}
}
}
House price predictions – evaluating regression
House price predictions – multi linear regression
def features = [
'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat',
'sqft_above', 'grade', 'view', 'waterfront', 'floors'
]
def idxs = 0..<features.size()
def priceIdx = features.size()
def csv = CSV.withFirstRecordAsHeader().parse(new FileReader('kc_house_data.csv'))
def all = csv.collect { row -> [*idxs.collect{ row[features[it]].toDouble() }, row.price.toDouble()] }.findAll{ it[0] < 30 }
def (train, test) = all.chop(all.size() * 0.8 as int, -1)
def trainT = train.transpose()
def price = trainT[priceIdx]
def reg = new OLSMultipleLinearRegression().tap{ newSampleData(price as double[], trainT[0..<priceIdx].transpose() as double[][]) }
def params = reg.estimateRegressionParameters()
println params
def predicted = test.collect { data -> params[0] + (0..<priceIdx).sum{ params[it + 1] * data[it] } }
def residuals = test.indexed().collect { i, data -> predicted[i] - data[priceIdx] }
def rr = reg.calculateRSquared()
def rmseTrain = Math.sqrt(reg.calculateResidualSumOfSquares())
def rmseTest = Math.sqrt(StatUtils.sumSq(residuals as double[]) / (test.size() - 1))
def mean = StatUtils.mean(residuals as double[])
println "$rr $rmseTrain $rmseTest $mean"
[-3.1635392225013282E7, -27484.961699434185, -5236.246584477492, 191.507648719199, 17.94379836055797, 657803.1115560957,
4.369533695386711, 74850.68999978402, 66459.03393303146, 604896.8569593901, -27203.796342244277]
0.6536012563023335 215379.66626496855 212609.34498021202 1886.6015077985683
R2: https://en.wikipedia.org/wiki/Coefficient_of_determination
rmse: https://en.wikipedia.org/wiki/Root-mean-square_deviation
House price predictions – multi linear regression
House price predictions – multi linear regression
def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
def loader = new CSVLoader(file: file)
def model = new LinearRegression()
def allInstances = loader.dataSet
def priceIndex = 2
allInstances.classIndex = priceIndex
// remove "id" and "date" columns
def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances)
def instances = Filter.useFilter(allInstances, rm)
model.buildClassifier(instances)
println model
def actual = instances.collect{ it.value(0).toDouble() }
def predicted = instances.collect{ model.classifyInstance(it) }
def chart = new XYChartBuilder()…
Linear Regression Model
price =
-35766.5414 * bedrooms +
41144.2785 * bathrooms +
82.8088 * sqft_living +
0.1286 * sqft_lot +
6689.5501 * floors +
582960.4584 * waterfront +
52870.9424 * view +
26385.6491 * condition +
95890.4452 * grade +
98.4193 * sqft_above +
67.2917 * sqft_basement +
-2620.2232 * yr_built +
19.8126 * yr_renovated +
-582.4199 * zipcode +
602748.2265 * lat +
-214729.8283 * long +
21.6814 * sqft_living15 +
-0.3826 * sqft_lot15 +
6690324.6024
House price predictions – multi linear regression
def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
def loader = new CSVLoader(file: file)
def model = new LinearRegression()
def allInstances = loader.dataSet
def priceIndex = 2
allInstances.classIndex = priceIndex
// remove "id" and "date" columns
def rm = new Remove(attributeIndices: '1,2', inputFormat: allInstances)
def instances = Filter.useFilter(allInstances, rm)
model.buildClassifier(instances)
println model
def actual = instances.collect{ it.value(0).toDouble() }
def predicted = instances.collect{ model.classifyInstance(it) }
def chart = new XYChartBuilder()…
Linear Regression Model
price =
-35766.5414 * bedrooms +
41144.2785 * bathrooms +
82.8088 * sqft_living +
0.1286 * sqft_lot +
6689.5501 * floors +
582960.4584 * waterfront +
52870.9424 * view +
26385.6491 * condition +
95890.4452 * grade +
98.4193 * sqft_above +
67.2917 * sqft_basement +
-2620.2232 * yr_built +
19.8126 * yr_renovated +
-582.4199 * zipcode +
602748.2265 * lat +
-214729.8283 * long +
21.6814 * sqft_living15 +
-0.3826 * sqft_lot15 +
6690324.6024
House price predictions – multi linear regression
def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
def loader = new CSVLoader(file: file)
def model = new SGD()
model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled
def allInstances = loader.dataSet
def priceIndex = 2
allInstances.classIndex = priceIndex
// remove "id", "date", 'zip', 'lat', 'long' columns
def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat: allInstances)
def instances = Filter.useFilter(allInstances, rm)
model.buildClassifier(instances)
println model
def actual = instances.collect{ it.value(0).toDouble() }
def predicted = instances.collect{ model.classifyInstance(it) }
def chart = new XYChartBuilder()…
Loss function: Huber loss
(robust regression)
price =
-9.6832 bedrooms
+ 0.4953 bathrooms
+ 112.6971 sqft_living
+ 0.8189 sqft_lot
+ 5.5 floors
+ 0.7041 waterfront
+ 11.3372 view
+ 7.8443 condition
+ 23.9548 grade
+ 37.5499 sqft_above
+ 75.1472 sqft_basement
+ -7.8827 yr_built
+ 63.7169 yr_renovated
+ 103.4893 sqft_living15
+ -1.3449 sqft_lot15
+ 0.2788
House price predictions – multi linear regression
def file = getClass().classLoader.getResource('kc_house_data.csv').file as File
def loader = new CSVLoader(file: file)
def model = new SGD()
model.options = ['-F', '4', '-N'] as String[] // Huber loss, unscaled
def allInstances = loader.dataSet
def priceIndex = 2
allInstances.classIndex = priceIndex
// remove "id", "date", 'zip', 'lat', 'long' columns
def rm = new Remove(attributeIndices: '1,2,17,18,19', inputFormat: allInstances)
def instances = Filter.useFilter(allInstances, rm)
model.buildClassifier(instances)
println model
def actual = instances.collect{ it.value(0).toDouble() }
def predicted = instances.collect{ model.classifyInstance(it) }
def chart = new XYChartBuilder()…
Loss function: Huber loss
(robust regression)
price =
-9.6832 bedrooms
+ 0.4953 bathrooms
+ 112.6971 sqft_living
+ 0.8189 sqft_lot
+ 5.5 floors
+ 0.7041 waterfront
+ 11.3372 view
+ 7.8443 condition
+ 23.9548 grade
+ 37.5499 sqft_above
+ 75.1472 sqft_basement
+ -7.8827 yr_built
+ 63.7169 yr_renovated
+ 103.4893 sqft_living15
+ -1.3449 sqft_lot15
+ 0.2788
House price predictions – explore with Weka
House price predictions – explore with Weka
House price predictions – analyze with Weka
House price predictions – analyze with Weka
House price predictions – analyze with Weka
House price predictions – scaling regression
import …
def rows = Table.read().csv('kc_house_data.csv')
rows = rows.dropWhere(rows.column("bedrooms").isGreaterThan(30))
String[] features = [
'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15',
'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors'
]
def data = rows.as().doubleMatrix(features)
// configure it all to run on the local machine but could be a cluster (can be hidden in XML)
def cfg = new IgniteConfiguration(
peerClassLoadingEnabled: true,
discoverySpi: new TcpDiscoverySpi(
ipFinder: new TcpDiscoveryMulticastIpFinder(
addresses: ['127.0.0.1:47500..47509']
)
)
)
static pretty(mdl, features) {
def sign = { val -> val < 0 ? '- ' : '+ ' }
def valIdx = { idx, val -> sprintf '%.2f*%s', val, features[idx+1] }
def valIdxSign = { idx, val -> sign(val) + valIdx(idx, Math.abs(val)) }
def valSign = { val -> sign(val) + sprintf('%.2f', Math.abs(val)) }
def (w, i) = [mdl.weights, mdl.intercept]
[valIdx(0, w.get(0)), *(1..<w.size()).collect{ valIdxSign(it, w.get(it)) }, valSign(i)].join(' ')
}
Read in data and select features of interest
Adjust config for known local mode setup
Utility method to pretty print model
House price predictions – scaling regression
…
Ignition.start(cfg).withCloseable { ignite ->
println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
def dataCache = new SandboxMLCache(ignite).fillCacheWith(data)
def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0))
def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST)
def split = new TrainTestDatasetSplitter().split(0.8)
def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer)
def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() })
def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM())
dataCache.destroy()
println ">>> Model: " + pretty(mdl, features)
println ">>> R^2 : " + rr
println ">>> RMSE : " + rmse
}
Candidates for distributed processing
House price predictions – scaling regression
…
Ignition.start(cfg).withCloseable { ignite ->
println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
def dataCache = new SandboxMLCache(ignite).fillCacheWith(data)
def trainer = new LSQRTrainer().withEnvironmentBuilder(defaultBuilder().withRNGSeed(0))
def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST)
def split = new TrainTestDatasetSplitter().split(0.8)
def mdl = trainer.fit(ignite, dataCache, split.trainFilter, vectorizer)
def rr = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM().withMetric { it.r2() })
def rmse = Evaluator.evaluate(dataCache, split.testFilter, mdl, vectorizer, new RM())
dataCache.destroy()
println ">>> Model: " + pretty(mdl, features)
println ">>> R^2 : " + rr
println ">>> RMSE : " + rmse
}
[15:02:09] Ignite node started OK (id=63b2d47d)
>>> Ignite grid started for data: 21612 rows X 11 cols
>>> Model: -33081.12*bedrooms - 16205.93*bathrooms + 225.40*sqft_living + 6.77*sqft_living15 - 8766.52*lat
- 38.59*sqft_above + 96251.64*grade + 62303.58*view + 572116.79*waterfront - 11620.55*floors - 48504.18
>>> R^2 : 0.607023233473466
>>> RMSE : 228035.85816493785
[15:02:10] Ignite node stopped OK [uptime=00:00:01.465]
See also: https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/ml/regression/linear/BostonHousePricesPredictionExample.java
House prices – scaling regression
import …
def spark = builder().config('spark.master', 'local[8]').appName('HousePrices').orCreate
def file = '/path/to/kc_house_data.csv'
int k = 5
Dataset ds = spark.read().format('csv')
.options('header': 'true', 'inferSchema': 'true').load(file)
double[] splits = [80, 20]
def (training, test) = ds.randomSplit(splits)
String[] colNames = ds.columns().toList().minus(['id', 'date', 'price'])
def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
Dataset dataset = assembler.transform(training)
def lr = new LinearRegression(labelCol: 'price', maxIter: 10)
def model = lr.fit(dataset)
println 'Coefficients:'
println model.coefficients().values()[1..-1].collect{ sprintf '%.2f', it }.join(', ')
def testSummary = model.evaluate(assembler.transform(test))
printf 'RMSE: %.2f%n', testSummary.rootMeanSquaredError
printf 'r2: %.2f%n', testSummary.r2
spark.stop()
Coefficients:
41979.78, 80853.89, 0.15, 5412.83, 564343.22, 53834.10, 24817.09, 93195.29, -80662.68,
-80694.28, -2713.58, 19.02, -628.67, 594468.23, -228397.19, 21.23, -0.42
RMSE: 187242.12
r2: 0.70
Clustering Overview
Clustering:
• Grouping similar items
Algorithm families:
• Hierarchical
• Partitioning k-means, x-means
• Density-based
• Graph-based
Aspects:
• Disjoint vs overlapping
• Preset cluster number
• Dimensionality reduction PCA
• Nominal feature support
Applications:
• Market segmentation
• Recommendation engines
• Search result grouping
• Social network analysis
• Medical imaging
Clustering
https://commons.apache.org/proper/commons-math/userguide/ml.html
Clustering case study: Whiskey flavor profiles
• 86 scotch whiskies
• 12 flavor categories
Pictures: https://prasant.net/clustering-scotch-whisky-grouping-distilleries-by-k-means-clustering-81f2ecde069c
https://www.r-bloggers.com/where-the-whisky-flavor-profile-data-came-from/
https://www.centerspace.net/clustering-analysis-part-iv-non-negative-matrix-factorization/
Whiskey – exploring with Dex
import …
def table = Table.read().csv('whiskey.csv')
table = table.removeColumns(0)
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def panel = new PlotPanel(
*[0..<cols.size(), 0..<cols.size()].combinations().collect { first, second ->
def color = new Color(72 + (first * 16), 72 + (second * 16), 200 - (first * 4) - (second * 4))
ScatterPlot.plot(table.as().doubleMatrix(cols[first], cols[second]), '#' as char, color)
}
)
new SwingBuilder().edt {
frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) {
widget(panel)
}
}
Whiskey: How to visualize? Feature combinations?
Hmm… not particularly enlightening
Whiskey: How to visualize? Feature combinations?
Another dimension?
Whiskey: How to visualize? Feature combinations?
import …
def table = Table.read().csv('whiskey.csv')
table = table.removeColumns(0)
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW]
def panel = new PlotPanel(
*[cols, cols].combinations().collect { first, second ->
Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors)
}
)
new SwingBuilder().edt {
frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) {
widget(panel)
}
}
Whiskey: How to visualize? Feature combinations?
import …
def table = Table.read().csv('whiskey.csv')
table = table.removeColumns(0)
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW]
def panel = new PlotPanel(
*[cols, cols].combinations().collect { first, second ->
Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors)
}
)
new SwingBuilder().edt {
frame(title: 'Frame', size: [800, 600], show: true, defaultCloseOperation: DISPOSE) {
widget(panel)
}
}
Whiskey: How to visualize? Feature combinations?
import …
def table = Table.read().csv('whiskey.csv')
table = table.removeColumns(0)
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
Color[] colors = [CYAN, PINK, MAGENTA, ORANGE, GREEN, BLUE, RED, YELLOW]
def panel = new PlotPanel(
*[cols, cols].combinations().collect { first, second ->
Histogram3D.plot(table.as().doubleMatrix(first, second), 4, colors)
}
)
SwingUtil.show(size: [1200, 900], panel)
Whiskey: How to visualize? Feature combinations?
Pretty… but only marginally more useful
Whiskey: How to visualize? Feature combinations?
Clustering with KMeans
Step 1:
• Guess k cluster centroids at random
Step 2:
• Assign points to closest centroid
Step 3:
• Calculate new centroids based on selected points
Repeat steps 2 and 3 until stable or some limit reached
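A tiny hand-rolled sketch of those steps on 1-D toy data (illustrative only; the following slides use real library implementations from Weka and Apache Commons Math):
def points = [1.0, 1.5, 8.0, 9.0, 1.2, 8.5]
def centroids = [points[0], points[2]]              // step 1: initial guesses (k = 2)
10.times {
    // step 2: assign each point to its closest centroid
    def clusters = points.groupBy { p -> centroids.min { c -> (p - c).abs() } }
    // step 3: recompute each centroid as the mean of its assigned points
    centroids = centroids.collect { c -> clusters[c] ? clusters[c].sum() / clusters[c].size() : c }
}
println centroids      // roughly [1.23, 8.5]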
import …
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def numClusters = 5
def loader = new CSVLoader(file: 'whiskey.csv')
def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true)
def instances = loader.dataSet
instances.deleteAttributeAt(0) // remove RowID
clusterer.buildClusterer(instances)
println ' ' + cols.join(', ')
def dataset = new DefaultCategoryDataset()
clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num ->
print "Cluster ${num+1}: "
println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', '))
(1..cols.size()).each { idx ->
dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) }
}
def clusters = (0..<numClusters).collectEntries{ [it, []] }
clusterer.assignments.eachWithIndex { cnum, idx ->
clusters[cnum] << instances.get(idx).stringValue(0) }
clusters.each { k, v ->
println "Cluster ${k+1}:"
println v.join(', ')
}
def plot = new SpiderWebPlot(dataset: dataset)
def chart = new JFreeChart('Whiskey clusters', plot)
SwingUtil.show(new ChartPanel(chart))
Whiskey – clustering with radar plot and weka
Read CSV and create clusterer
Display whiskies assigned to each cluster
Graph cluster centroids
import …
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def numClusters = 5
def loader = new CSVLoader(file: 'whiskey.csv')
def clusterer = new SimpleKMeans(numClusters: numClusters, preserveInstancesOrder: true)
def instances = loader.dataSet
instances.deleteAttributeAt(0) // remove RowID
clusterer.buildClusterer(instances)
println ' ' + cols.join(', ')
def dataset = new DefaultCategoryDataset()
clusterer.clusterCentroids.eachWithIndex{ Instance ctrd, num ->
print "Cluster ${num+1}: "
println ((1..cols.size()).collect{ sprintf '%.3f', ctrd.value(it) }.join(', '))
(1..cols.size()).each { idx ->
dataset.addValue(ctrd.value(idx), "Cluster ${num+1}", cols[idx-1]) }
}
def clusters = (0..<numClusters).collectEntries{ [it, []] }
clusterer.assignments.eachWithIndex { cnum, idx ->
clusters[cnum] << instances.get(idx).stringValue(0) }
clusters.each { k, v ->
println "Cluster ${k+1}:"
println v.join(', ')
}
def plot = new SpiderWebPlot(dataset: dataset)
def chart = new JFreeChart('Whiskey clusters', plot)
SwingUtil.show(new ChartPanel(chart))
Whiskey – clustering with radar plot and weka
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
Cluster 1: 3.800, 1.600, 3.600, 3.600, 0.600, 0.200, 1.600, 0.600, 1.000, 1.400, 1.200, 0.000
Cluster 2: 2.773, 2.409, 1.545, 0.045, 0.000, 1.818, 1.591, 2.000, 2.091, 2.136, 2.136, 1.591
Cluster 3: 1.773, 2.455, 1.318, 0.636, 0.000, 0.636, 1.000, 0.409, 1.636, 1.364, 1.591, 1.591
Cluster 4: 1.500, 2.233, 1.267, 0.267, 0.000, 1.533, 1.400, 0.700, 1.000, 1.900, 1.900, 2.133
Cluster 5: 2.000, 2.143, 1.857, 0.857, 1.000, 0.857, 1.714, 1.000, 1.286, 2.000, 1.429, 1.714
Cluster 1:
Ardbeg, Clynelish, Lagavulin, Laphroig, Talisker
Cluster 2:
Aberfeldy, Aberlour, Ardmore, Auchroisk, Balmenach, BenNevis, Benrinnes, Benromach,
BlairAthol, Dailuaine, Dalmore, Edradour, Glendronach, Glendullan, Glenfarclas,
Glenrothes, Glenturret, Longmorn, Macallan, Mortlach, RoyalLochnagar, Strathisla
Cluster 3:
ArranIsleOf, Aultmore, Balblair, Cardhu, Craigganmore, Dufftown, GlenGrant, GlenKeith,
GlenScotia, GlenSpey, Glenfiddich, Glenmorangie, Isle of Jura, Mannochmore, Miltonduff,
Oban, Speyside, Springbank, Strathmill, Tamnavulin, Teaninich, Tomore
Cluster 4:
AnCnoc, Auchentoshan, Belvenie, Benriach, Bladnoch, Bowmore, Bruichladdich, Bunnahabhain,
Dalwhinnie, Deanston, GlenElgin, GlenGarioch, GlenMoray, GlenOrd, Glenallachie, Glengoyne,
Glenkinchie, Glenlivet, Glenlossie, Highland Park, Inchgower, Knochando, Linkwood,
Loch Lomond, Scapa, Speyburn, Tamdhu, Tobermory, Tomatin, Tomintoul
Cluster 5:
Caol Ila, Craigallechie, GlenDeveronMacduff, OldFettercairn,
OldPulteney, RoyalBrackla, Tullibardine
import …
def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv'))
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def clusterer = new KMeansPlusPlusClusterer(5)
def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) }
def centroids = clusterer.cluster(data)
println cols.join(', ') + ', Medoid'
def dataset = new DefaultCategoryDataset()
centroids.eachWithIndex{ ctrd, num ->
def cpt = ctrd.center.point
def closest = ctrd.points.min{ pt ->
    sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[]) }
def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery
println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid"
cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) }
}
def plot = new SpiderWebPlot(dataset: dataset)
def chart = new JFreeChart('Whiskey clusters', plot)
SwingUtil.show(new ChartPanel(chart))
Whiskey – clustering with radar plot and medoids
Same again with Apache
Commons Math but also
calculate medoids
import …
def rows = CSV.withFirstRecordAsHeader().parse(new FileReader('whiskey.csv'))
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def clusterer = new KMeansPlusPlusClusterer(5)
def data = rows.collect{ row -> new DoublePoint(cols.collect{ col -> row[col] } as int[]) }
def centroids = clusterer.cluster(data)
println cols.join(', ') + ', Medoid'
def dataset = new DefaultCategoryDataset()
centroids.eachWithIndex{ ctrd, num ->
def cpt = ctrd.center.point
def closest = ctrd.points.min{ pt ->
    sumSq((0..<cpt.size()).collect{ cpt[it] - pt.point[it] } as double[]) }
def medoid = rows.find{ row -> cols.collect{ row[it] as double } == closest.point }?.Distillery
println cpt.collect{ sprintf '%.3f', it }.join(', ') + ", $medoid"
cpt.eachWithIndex { val, idx -> dataset.addValue(val, "Cluster ${num+1}", cols[idx]) }
}
def plot = new SpiderWebPlot(dataset: dataset)
def chart = new JFreeChart('Whiskey clusters', plot)
SwingUtil.show(new ChartPanel(chart))
Whiskey – clustering with radar plot and medoids
Libraries: Apache Commons Math and
JFreeChart
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral, Medoid
2.000, 2.533, 1.267, 0.267, 0.200, 1.067, 1.667, 0.933, 0.267, 1.733, 1.800, 1.733, GlenOrd
2.789, 2.474, 1.474, 0.053, 0.000, 1.895, 1.632, 2.211, 2.105, 2.105, 2.211, 1.737, Aberfeldy
2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545, Clynelish
1.333, 2.333, 0.944, 0.111, 0.000, 1.000, 0.444, 0.444, 1.500, 1.944, 1.778, 1.778, Aultmore
1.696, 2.304, 1.565, 0.435, 0.087, 1.391, 1.696, 0.609, 1.652, 1.652, 1.783, 2.130, Benromach
Dimensionality reduction
Whiskey – clustering with PCA plot
import …
def table = Table.read().csv('whiskey.csv')
def cols = ["Body", "Sweetness", "Smoky", "Medicinal",
"Tobacco", "Honey", "Spicy", "Winey",
"Nutty", "Malty", "Fruity", "Floral"]
def data = table.as().doubleMatrix(*cols)
def pca = new PCA(data)
pca.projection = 2
def projected = pca.project(data)
def clusterer = new KMeans(data, 5)
table = table.addColumns(
DoubleColumn.create("PCA1", (0..DoubleColumn.create("PCA2", (0..StringColumn.create("Cluster", clusterer.clusterLabel.collect{ it.toString() })
)
def title = "Clusters x Principal Components"
Plot.show(ScatterPlot.create(title, table, "PCA1", "PCA2", "Cluster"))
Whiskey – clustering with 2-4D PCA plot
import …
def rows = Table.read().csv('whiskey.csv')
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def data = rows.as().doubleMatrix(*cols)
def pca = new PCA(data)
def dims = 4 // can be 2, 3 or 4
pca.projection = dims
def projected = pca.project(data)
def adj = [1, 1, 1, 5]
def clusterer = new KMeans(data, 5)
def labels = clusterer.clusterLabel.collect{ "Cluster " + (it+1) }
rows = rows.addColumns(
    *(0..<dims).collect { idx ->
        DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{
            adj[idx] * (projected[it][idx] + adj[idx])
})
},
StringColumn.create("Cluster", labels)
)
def title = "Clusters x Principal Components"
def type = dims == 2 ? ScatterPlot : Scatter3DPlot
Plot.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, "Cluster"))
Whiskey – clustering and visualizing centroids
…
def data = table.as().doubleMatrix(*cols)
def pca = new PCA(data)
pca.projection = 3
def projected = pca.project(data)
def clusterer = new KMeans(data, 5)
def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
table = table.addColumns(
*(0..<3).collect { idx ->
DoubleColumn.create("PCA${idx+1}", (0..projected[it][idx]
})},
StringColumn.create("Cluster", labels),
DoubleColumn.create("Centroid", [10] * labels.size())
)
def centroids = pca.project(clusterer.centroids())
def toAdd = table.emptyCopy(1)
(0..<centroids.size()).each { idx ->
toAdd[0].setString("Cluster", "Cluster " + (idx+1))
(1..3).each { toAdd[0].setDouble("PCA" + it, centroids[idx][it-1]) }
toAdd[0].setDouble("Centroid", 50)
table.append(toAdd)
}
def title = "Clusters x Principal Components w/ centroids"
Plot.show(Scatter3DPlot.create(title, table, *(1..3).collect { "PCA$it" }, "Centroid", "Cluster"))
Whiskey – Screeplot
import …
def table = Table.read().csv('whiskey.csv')
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def data = table.as().doubleMatrix(*cols)
def pca = new PCA(data)
pca.projection = 2
def plots = [PlotCanvas.screeplot(pca)]
def projected = pca.project(data)
table = table.addColumns(
*(1..2).collect { idx ->
DoubleColumn.create("PCA$idx", (0..}
)
def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY]
def symbols = ['*', 'Q', '#', 'Q', '*', '#']
(2..6).each { k ->
def clusterer = new KMeans(data, k)
double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
plots << ScatterPlot.plot(components, clusterer.clusterLabel,
        symbols[0..<k] as char[], colors[0..<k] as Color[])
}
SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))
import …
def table = Table.read().csv('whiskey.csv')
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def data = table.as().doubleMatrix(*cols)
def pca = new PCA(data)
pca.projection = 2
def plots = [PlotCanvas.screeplot(pca)]
def projected = pca.project(data)
table = table.addColumns(
*(1..2).collect { idx ->
DoubleColumn.create("PCA$idx", (0..}
)
def colors = [RED, BLUE, GREEN, ORANGE, MAGENTA, GRAY]
def symbols = ['*', 'Q', '#', 'Q', '*', '#']
(2..6).each { k ->
def clusterer = new KMeans(data, k)
double[][] components = table.as().doubleMatrix('PCA1', 'PCA2')
plots << ScatterPlot.plot(components, clusterer.clusterLabel,
        symbols[0..<k] as char[], colors[0..<k] as Color[])
}
SwingUtil.show(size: [1200, 900], new PlotPanel(*plots))
Whiskey – Screeplot
Whiskey – XMeans clustering with 2-4D PCA plot
import …
def rows = Table.read().csv('whiskey.csv')
def cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco", "Honey",
"Spicy", "Winey", "Nutty", "Malty", "Fruity", "Floral"]
def data = rows.as().doubleMatrix(*cols)
def pca = new PCA(data)
def dims = 4 // can be 2, 3 or 4
pca.projection = dims
def projected = pca.project(data)
def adj = [1, 1, 1, 5]
def kmax = 10
def clusterer = new XMeans(data, kmax)
def labels = clusterer.clusterLabel.collect { "Cluster " + (it + 1) }
rows = rows.addColumns(
    *(0..<dims).collect { idx ->
        DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{
            adj[idx] * (projected[it][idx] + adj[idx])
})
},
StringColumn.create("Cluster", labels)
)
def title = "Clusters x Principal Components"
def type = dims == 2 ? ScatterPlot : Scatter3DPlot
Plot.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, "Cluster"))
Whiskey – SOM with Heatmap
import …
def file = new File(getClass().classLoader.getResource('whiskey.csv').file)
def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader())
String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey',
'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
def data = table.select(cols).toArray()
def distilleries = table.column('Distillery').toStringArray()
//def (ncols, nrows) = [3, 3]
//def (ncols, nrows) = [7, 4]
def (ncols, nrows) = [7, 6]
//def (ncols, nrows) = [8, 6]
def lattice = SOM.lattice(nrows, ncols, data)
int epochs = 100
def model = new SOM(lattice, TimeFunction.constant(0.1),
Neighborhood.Gaussian(1, data.length * epochs / 8.0))
for (int i = 0; i < epochs; i++) {
for (int j : MathEx.permutate(data.length)) {
model.update(data[j])
}
}
…
Whiskey – SOM with Heatmap
…
def groups = data.toList().indices.groupBy { idx -> group(model, data[idx]) }
def names = groups.collectEntries { k, v -> [k, distilleries[v].join(', ')] }
private group(SOM model, double[] row) {
double[][][] neurons = model.neurons()
    [0..<neurons.length, 0..<neurons[0].length].combinations().min { pair ->
def (i, j) = pair
MathEx.distance(neurons[i][j], row)
}
}
names.toSorted{ e1, e2 -> e1.key[0] <=> e2.key[0] ?: e1.key[1] <=> e2.key[1] }.each { k, v ->
println "Cluster ${k[0]},${k[1]}: $v"
}
def tooltip = { i, j -> names[[i, j]] ?: '' } as Hexmap.Tooltip
new Hexmap(model.umatrix(), Palette.jet(256), tooltip)
.canvas().tap { setAxisLabels('', '') }.window()
Whiskey – SOM with Heatmap
Cluster 0,0: Auchentoshan, Glengoyne
Cluster 0,1: AnCnoc
Cluster 0,2: Inchgower, Tomintoul
Cluster 0,3: ArranIsleOf, Speyburn
Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla
Cluster 0,5: Glenfarclas, Glenlivet, Glenturret
Cluster 0,6: Aberlour, Strathisla
Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich
Cluster 1,1: Cardhu, Glenallachie
Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond
Cluster 1,3: Benriach, Dalwhinnie
Cluster 1,4: GlenOrd
Cluster 1,5: Craigallechie, GlenMoray, Longmorn
Cluster 1,6: Edradour, Knochando
Cluster 2,0: GlenSpey, Miltonduff
Cluster 2,1: Tamnavulin
Cluster 2,2: Mannochmore
Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory
Cluster 2,4: Deanston, Scapa
Cluster 2,5: Benromach
Cluster 2,6: Belvenie, BenNevis, Benrinnes
Cluster 3,0: GlenKeith, Tullibardine
Cluster 3,1: Balblair, Glenmorangie, Strathmill
Cluster 3,3: Tomore
Cluster 3,4: Ardmore, OldFettercairn
Cluster 3,5: Aberfeldy, BlairAthol
Cluster 3,6: Glendullan, RoyalLochnagar
Cluster 4,0: Craigganmore, Dufftown
Cluster 4,1: OldPulteney
Cluster 4,2: Oban
Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank
Cluster 4,4: GlenDeveronMacduff, Tomatin
Cluster 4,5: Auchroisk, Glenrothes
Cluster 4,6: Balmenach, Glendronach, Macallan
Cluster 5,0: GlenGarioch, Teaninich
Cluster 5,1: Caol Ila, Clynelish, Talisker
Cluster 5,2: Ardbeg, Lagavulin, Laphroig
Cluster 5,4: Bowmore, Highland Park
Cluster 5,5: Dalmore
Cluster 5,6: Dailuaine, Mortlach
Whiskey – SOM with Heatmap
Cluster 0,0: Auchentoshan, Glengoyne
Cluster 0,1: AnCnoc
Cluster 0,2: Inchgower, Tomintoul
Cluster 0,3: ArranIsleOf, Speyburn
Cluster 0,4: GlenElgin, Linkwood, RoyalBrackla
Cluster 0,5: Glenfarclas, Glenlivet, Glenturret
Cluster 0,6: Aberlour, Strathisla
Cluster 1,0: Bladnoch, Bunnahabhain, Glenfiddich
Cluster 1,1: Cardhu, Glenallachie
Cluster 1,2: Glenkinchie, Glenlossie, Loch Lomond
Cluster 1,3: Benriach, Dalwhinnie
Cluster 1,4: GlenOrd
Cluster 1,5: Craigallechie, GlenMoray, Longmorn
Cluster 1,6: Edradour, Knochando
Cluster 2,0: GlenSpey, Miltonduff
Cluster 2,1: Tamnavulin
Cluster 2,2: Mannochmore
Cluster 2,3: Aultmore, GlenGrant, Speyside, Tamdhu, Tobermory
Cluster 2,4: Deanston, Scapa
Cluster 2,5: Benromach
Cluster 2,6: Belvenie, BenNevis, Benrinnes
Cluster 3,0: GlenKeith, Tullibardine
Cluster 3,1: Balblair, Glenmorangie, Strathmill
Cluster 3,3: Tomore
Cluster 3,4: Ardmore, OldFettercairn
Cluster 3,5: Aberfeldy, BlairAthol
Cluster 3,6: Glendullan, RoyalLochnagar
Cluster 4,0: Craigganmore, Dufftown
Cluster 4,1: OldPulteney
Cluster 4,2: Oban
Cluster 4,3: Bruichladdich, GlenScotia, Isle of Jura, Springbank
Cluster 4,4: GlenDeveronMacduff, Tomatin
Cluster 4,5: Auchroisk, Glenrothes
Cluster 4,6: Balmenach, Glendronach, Macallan
Cluster 5,0: GlenGarioch, Teaninich
Cluster 5,1: Caol Ila, Clynelish, Talisker
Cluster 5,2: Ardbeg, Lagavulin, Laphroig
Cluster 5,4: Bowmore, Highland Park
Cluster 5,5: Dalmore
Cluster 5,6: Dailuaine, Mortlach
Whiskey – Hierarchical clustering with Dendrogram
import …
def file = new File(getClass().classLoader.getResource('whiskey.csv').file)
def table = Read.csv(file.toPath(), CSV.withFirstRecordAsHeader())
String[] cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey',
'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
def data = table.select(cols).toArray()
def distilleries = table.column('Distillery').toStringArray()
def ninetyDeg = 1.57 // radians
def FOREST_GREEN = new Color(0X808000)
def clusters = HierarchicalClustering.fit(CompleteLinkage.of(data))
//println clusters.tree
//println clusters.height
def partitions = clusters.partition(4)
// little trick to work out cluster colors
def colorMap = new LinkedHashSet(partitions.toList()).toList().reverse().indexed().collectEntries { k, v ->
[v, Palette.COLORS[k]]
}
Font font = new Font("BitStream Vera Sans", Font.PLAIN, 12)
…
Whiskey – Hierarchical clustering with Dendrogram
…
def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
title = 'Whiskey Dendrogram'
setAxisLabels('Distilleries', 'Similarity')
def lb = lowerBounds
setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
distilleries.eachWithIndex { String label, int i ->
add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
}
}.panel()
def pca = PCA.fit(data)
pca.projection = 2
def projected = pca.project(data)
char mark = '#'
def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
title = 'Clustered by dendrogram partitions'
setAxisLabels('PCA1', 'PCA2')
}.panel()
new PlotGrid(dendrogram, scatter).window()
Whiskey – Hierarchical clustering with Dendrogram
…
def dendrogram = new Dendrogram(clusters.tree, clusters.height, FOREST_GREEN).canvas().tap {
title = 'Whiskey Dendrogram'
setAxisLabels('Distilleries', 'Similarity')
def lb = lowerBounds
setBound([lb[0] - 1, lb[1] - 20] as double[], upperBounds)
distilleries.eachWithIndex { String label, int i ->
add(new Label(label, [i, -1] as double[], 0, 0, ninetyDeg, font, colorMap[partitions[i]]))
}
}.panel()
def pca = PCA.fit(data)
pca.projection = 2
def projected = pca.project(data)
char mark = '#'
def scatter = ScatterPlot.of(projected, partitions, mark).canvas().tap {
title = 'Clustered by dendrogram partitions'
setAxisLabels('PCA1', 'PCA2')
}.panel()
new PlotGrid(dendrogram, scatter).window()
Whiskey – Exploring Weka clustering algorithms
Whiskey flavors – scaling clustering
import …
def rows = Table.read().csv('whiskey.csv')
String[] features = [
'Distillery', 'Body', 'Sweetness',
'Smoky', 'Medicinal', 'Tobacco',
'Honey', 'Spicy', 'Winey', 'Nutty',
'Malty', 'Fruity', 'Floral'
]
def data = rows.as().doubleMatrix(features)
def cfg = new IgniteConfiguration(…)
Ignition.start(cfg).withCloseable { ignite ->
println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
def trainer = new KMeansTrainer().withDistance(new EuclideanDistance()).withAmountOfClusters(5)
def dataCache = new SandboxMLCache(ignite).fillCacheWith(data)
def vectorizer = new DoubleArrayVectorizer().labeled(Vectorizer.LabelCoordinate.FIRST)
def mdl = trainer.fit(ignite, dataCache, vectorizer)
println ">>> KMeans centroids"
println features[1..-1].join(', ')
mdl.centers.each { c -> println c.all().collect{ sprintf '%.4f', it.get() }.join(', ') }
dataCache.destroy()
}
[17:18:07] Ignite node started OK (id=ea92fc20)
>>> Ignite grid started for data: 86 rows X 13 cols
>>> KMeans centroids
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
2.9091, 1.5455, 2.9091, 2.7273, 0.4545, 0.4545, 1.4545, 0.5455, 1.5455, 1.4545, 1.1818, 0.5455
1.8095, 2.4762, 1.5714, 0.4286, 0.1429, 1.5238, 1.7143, 0.8571, 0.7143, 1.6667, 1.6190, 2.0476
2.5217, 2.3913, 1.4783, 0.0435, 0.0435, 1.6087, 1.4783, 1.9565, 2.0435, 2.0435, 1.9565, 1.3913
1.3704, 2.3704, 1.0000, 0.2593, 0.0370, 0.7778, 0.9259, 0.4074, 1.5185, 1.7407, 2.0000, 2.1111
3.2500, 2.2500, 1.5000, 0.0000, 0.0000, 3.0000, 2.0000, 1.0000, 1.5000, 2.5000, 2.2500, 2.0000
[17:18:07] Ignite node stopped OK [uptime=00:00:00.368]
Whiskey flavors – scaling clustering
import …
def spark = builder().config('spark.master', 'local[8]').appName('Whiskey').orCreate
def file = '/path/to/whiskey.csv'
int k = 5
Dataset rows = spark.read().format('com.databricks.spark.csv')
.options('header': 'true', 'inferSchema': 'true').load(file)
String[] colNames = rows.columns().toList().minus(['RowID', 'Distillery'])
def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
Dataset dataset = assembler.transform(rows)
def clusterer = new KMeans(k: k, seed: 1L)
def model = clusterer.fit(dataset)
println 'Cluster centers:'
model.clusterCenters().each{ println it.values().collect{ sprintf '%.2f', it }.join(', ') }
spark.stop()
Cluster centers:
1.73, 2.35, 1.58, 0.81, 0.19, 1.15, 1.42, 0.81, 1.23, 1.77, 1.23, 1.31
2.00, 1.00, 3.00, 0.00, 0.00, 0.00, 3.00, 1.00, 0.00, 2.00, 2.00, 2.00
2.86, 2.38, 1.52, 0.05, 0.00, 1.95, 1.76, 2.05, 1.81, 2.05, 2.19, 1.71
1.53, 2.38, 1.06, 0.16, 0.03, 1.09, 1.00, 0.50, 1.53, 1.75, 2.13, 2.28
3.67, 1.50, 3.67, 3.33, 0.67, 0.17, 1.67, 0.50, 1.17, 1.33, 1.17, 0.17
Whiskey flavors – scaling clustering
class Point { double[] pts }
class TaggedPoint extends Point { int cluster }
@Canonical(includeSuperProperties=true)
class TaggedPointCounter extends TaggedPoint {
long count
TaggedPointCounter plus(TaggedPointCounter other) {
        new TaggedPointCounter((0..<pts.size()).collect{ pts[it] + other.pts[it] } as double[], cluster, count + other.count)
    }
TaggedPointCounter average() {
new TaggedPointCounter(pts.collect{it/count } as double[], cluster, 0)
}
}
int k = 5
Integer iterations = 20
def configuration = new Configuration()
Was: Rheem
Whiskey flavors – scaling clustering
class SelectNearestCentroid implements FunctionDescriptor.ExtendedSerializableFunction {
Iterable centroids
@Override
void open(ExecutionContext executionCtx) {
centroids = executionCtx.getBroadcast("centroids")
}
@Override
TaggedPointCounter apply(Point point) {
def minDistance = Double.POSITIVE_INFINITY
def nearestCentroidId = -1
for (centroid in centroids) {
        def distance = sqrt((0..<point.pts.size()).collect{ (point.pts[it] - centroid.pts[it]) ** 2 }.sum())
        if (distance < minDistance) {
minDistance = distance
nearestCentroidId = centroid.cluster
}
}
new TaggedPointCounter(point.pts, nearestCentroidId, 1)
}
}
Was: Rheem
Whiskey flavors – scaling clustering
def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file
def rheemContext = new RheemContext(configuration)
.withPlugin(Java.basicPlugin())
.withPlugin(Spark.basicPlugin())
def planBuilder = new JavaPlanBuilder(rheemContext)
.withJobName("KMeans ($url, k=$k, iterations=$iterations)")
def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) }
def pointsBuilder = planBuilder.loadCollection(points)
def random = new Random()
double[][] initPts = (1..k).collect{ (0..<points[0].pts.size()).collect{ random.nextGaussian() } as double[] }
def initialCentroidsBuilder = planBuilder
    .loadCollection((0..<k).collect{ new TaggedPointCounter(initPts[it], it, 0) })
    .withName("Load random centroids")
Was: Rheem
Whiskey flavors – scaling clustering
def url = WhiskeyRheem.classLoader.getResource('whiskey.csv').file
def rheemContext = new RheemContext(configuration)
.withPlugin(Java.basicPlugin())
.withPlugin(Spark.basicPlugin())
def planBuilder = new JavaPlanBuilder(rheemContext)
.withJobName("KMeans ($url, k=$k, iterations=$iterations)")
def points = new File(url).readLines()[1..-1].collect{ new Point(pts: it.split(",")[2..-1]*.toDouble()) }
def pointsBuilder = planBuilder.loadCollection(points)
def random = new Random()
double[][] initPts = (1..k).collect{ (0..<points[0].pts.size()).collect{ random.nextGaussian() } as double[] }
def initialCentroidsBuilder = planBuilder
    .loadCollection((0..<k).collect{ new TaggedPointCounter(initPts[it], it, 0) })
    .withName("Load random centroids")
[main] INFO org.qcri.rheem.core.api.Job - StopWatch results:
* Optimization - 0:00:00.297
* Prepare - 0:00:00.063
* Prune&Isolate - 0:00:00.011
* Transformations - 0:00:00.052
* Sanity - 0:00:00.000
* Cardinality&Load Estimation - 0:00:00.115
* Create OptimizationContext - 0:00:00.008
* Create CardinalityEstimationManager - 0:00:00.001
* Push Estimation - 0:00:00.106
* Create Initial Execution Plan - 0:00:00.119
* Enumerate - 0:00:00.092
* Concatenation - 0:00:00.058
* Channel Conversion - 0:00:00.050
* Prune - 0:00:00.017
* Pick Best Plan - 0:00:00.009
* Split Stages - 0:00:00.013
* Execution - 0:00:00.352
* Execution 0 - 0:00:00.352
* Execute - 0:00:00.350
* Post-processing - 0:00:00.047
* Log measurements - 0:00:00.047
* Release Resources - 0:00:00.000
Centroids:
Cluster0: 2.500, 2.324, 1.588, 0.176, 0.059, 1.882, 1.647, 1.676, 1.824, 2.088, 1.912, 1.706
Cluster1: 1.488, 2.463, 1.122, 0.268, 0.073, 0.927, 1.146, 0.512, 1.146, 1.659, 1.878, 2.000
Cluster2: 2.909, 1.545, 2.909, 2.727, 0.455, 0.455, 1.455, 0.545, 1.545, 1.455, 1.182, 0.545
Was: Rheem
Clustering lab – repeat some of the examples with beer
https://rstudio-pubs-static.s3.amazonaws.com/560654_2b44eef198f24a0bba45e002ae86d237.html
Clustering lab – repeat some of the examples with beer
Classification Overview
Classification:
• Predicting class of some data
Algorithms:
• Logistic Regression
• Naïve Bayes
• Stochastic Gradient Descent
• K-Nearest Neighbours
• Decision Tree
• Random Forest
• Support Vector Machine
Aspects:
• Over/underfitting
• Ensemble
• Confusion
Applications:
• Image recognition
• Speech recognition
• Medical diagnosis
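A minimal sketch of the confusion and over/underfitting aspects listed above, using Weka cross-validation (the dataset path and the J48 classifier are illustrative assumptions, not part of the original slides):
import weka.classifiers.Evaluation
import weka.classifiers.trees.J48
import weka.core.converters.ConverterUtils.DataSource

def data = DataSource.read('some_dataset.arff') // hypothetical path; any dataset with a nominal class works
data.classIndex = data.numAttributes() - 1
def eval = new Evaluation(data)
eval.crossValidateModel(new J48(), data, 10, new Random(1)) // 10-fold cross-validation guards against overfitting
println eval.toSummaryString()
println eval.toMatrixString() // confusion matrix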
Case Study: classification of Iris flowers
Created by British statistician and biologist Ronald Fisher
for his 1936 paper "The use of multiple measurements in
taxonomic problems", as an example of linear discriminant analysis
150 samples, 50 each of three species of Iris:
• setosa
• versicolor
• virginica
Four features measured for each sample:
• sepal length
• sepal width
• petal length
• petal width
https://en.wikipedia.org/wiki/Iris_flower_data_set
https://archive.ics.uci.edu/ml/datasets/Iris
[Images: Iris setosa, versicolor and virginica flowers, with sepal and petal labelled]
Iris flower data – explore with Weka
Iris flower data – explore with Weka (cont’d)
Iris flower data – explore with Weka (cont’d)
Iris flower data – explore with Weka (cont’d)
Iris flower data – explore with Weka (cont’d)
Iris flower data – explore with Weka (cont’d)
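The exploration above uses the Weka Explorer GUI (screenshots omitted here); a rough programmatic equivalent for eyeballing the four numeric attributes, assuming the same iris_data.csv used in the following slides:
import weka.core.converters.CSVLoader

def data = new CSVLoader(file: new File('iris_data.csv')).dataSet
(0..<4).each { i ->
    def s = data.attributeStats(i).numericStats // min/max/mean/stdDev per attribute
    printf '%-14s min=%.2f max=%.2f mean=%.2f stdDev=%.2f%n',
        data.attribute(i).name(), s.min, s.max, s.mean, s.stdDev
}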
Iris flower data – Weka Naïve Bayes
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
def loader = new CSVLoader(file: file)
def model = new NaiveBayes()
def allInstances = loader.dataSet
allInstances.classIndex = 4
model.buildClassifier(allInstances)
double[] actual = allInstances.collect{ it.value(4) }
double[] predicted = allInstances.collect{ model.classifyInstance(it) }
double[] petalW = allInstances.collect{ it.value(2) }
double[] petalL = allInstances.collect{ it.value(3) }
def indices = actual.indices
Iris flower data – Weka Naïve Bayes
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
def loader = new CSVLoader(file: file)
def model = new NaiveBayes()
def allInstances = loader.dataSet
allInstances.classIndex = 4
model.buildClassifier(allInstances)
double[] actual = allInstances.collect{ it.value(4) }
double[] predicted = allInstances.collect{ model.classifyInstance(it) }
double[] petalW = allInstances.collect{ it.value(0) }
double[] petalL = allInstances.collect{ it.value(1) }
def indices = actual.indices
def chart = new XYChartBuilder().width(900).height(450).
title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
species.eachWithIndex{ String name, int i ->
def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
Collection found = groups[true] ?: []
Collection errors = groups[false] ?: []
println "$name: ${found.size()} correct, ${errors.size()} incorrect"
chart.addSeries("$name correct", petalW[found], petalL[found]).with {
XYSeriesRenderStyle = Scatter
}
if (errors) {
chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
XYSeriesRenderStyle = Scatter
}
}
}
new SwingWrapper(chart).displayChart()
def chart = new XYChartBuilder().width(900).height(450).
title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
species.eachWithIndex{ String name, int i ->
def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
Collection found = groups[true] ?: []
Collection errors = groups[false] ?: []
println "$name: ${found.size()} correct, ${errors.size()} incorrect"
chart.addSeries("$name correct", petalW[found], petalL[found]).with {
XYSeriesRenderStyle = Scatter
}
if (errors) {
chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
XYSeriesRenderStyle = Scatter
}
}
}
new SwingWrapper(chart).displayChart()
Iris flower data – Weka Naïve Bayes
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
def loader = new CSVLoader(file: file)
def model = new NaiveBayes()
def allInstances = loader.dataSet
allInstances.classIndex = 4
model.buildClassifier(allInstances)
double[] actual = allInstances.collect{ it.value(4) }
double[] predicted = allInstances.collect{ model.classifyInstance(it) }
double[] petalW = allInstances.collect{ it.value(0) }
double[] petalL = allInstances.collect{ it.value(1) }
def indices = actual.indices
Iris-setosa: 50 correct, 0 incorrect
Iris-versicolor: 48 correct, 4 incorrect
Iris-virginica: 46 correct, 2 incorrect
Iris flower data – Weka Logistic Regression
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
def loader = new CSVLoader(file: file)
def model = new SimpleLogistic()
def allInstances = loader.dataSet
allInstances.classIndex = 4
model.buildClassifier(allInstances)
double[] actual = allInstances.collect{ it.value(4) }
double[] predicted = allInstances.collect{ model.classifyInstance(it) }
double[] petalW = allInstances.collect{ it.value(2) }
double[] petalL = allInstances.collect{ it.value(3) }
def indices = actual.indices
Iris flower data – Weka Logistic Regression
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
def loader = new CSVLoader(file: file)
def model = new SimpleLogistic()
def allInstances = loader.dataSet
allInstances.classIndex = 4
model.buildClassifier(allInstances)
double[] actual = allInstances.collect{ it.value(4) }
double[] predicted = allInstances.collect{ model.classifyInstance(it) }
double[] petalW = allInstances.collect{ it.value(0) }
double[] petalL = allInstances.collect{ it.value(1) }
def indices = actual.indices
def chart = new XYChartBuilder().width(900).height(450).
title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
species.eachWithIndex{ String name, int i ->
def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
Collection found = groups[true] ?: []
Collection errors = groups[false] ?: []
println "$name: ${found.size()} correct, ${errors.size()} incorrect"
chart.addSeries("$name correct", petalW[found], petalL[found]).with {
XYSeriesRenderStyle = Scatter
}
if (errors) {
chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
XYSeriesRenderStyle = Scatter
}
}
}
new SwingWrapper(chart).displayChart()
Iris-setosa: 50 correct, 0 incorrect
Iris-versicolor: 49 correct, 1 incorrect
Iris-virginica: 49 correct, 1 incorrect
Iris flower data – Weka J48 Decision Tree
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
def loader = new CSVLoader(file: file)
def model = new J48()
def allInstances = loader.dataSet
allInstances.classIndex = 4
model.buildClassifier(allInstances)
double[] actual = allInstances.collect{ it.value(4) }
double[] predicted = allInstances.collect{ model.classifyInstance(it) }
double[] petalW = allInstances.collect{ it.value(0) }
double[] petalL = allInstances.collect{ it.value(1) }
def indices = actual.indices
def chart = new XYChartBuilder().width(900).height(450).
title("Species").xAxisTitle("Petal length").yAxisTitle("Petal width").build()
species.eachWithIndex{ String name, int i ->
def groups = indices.findAll{ predicted[it] == i }.groupBy{ actual[it] == i }
Collection found = groups[true] ?: []
Collection errors = groups[false] ?: []
println "$name: ${found.size()} correct, ${errors.size()} incorrect"
chart.addSeries("$name correct", petalW[found], petalL[found]).with {
XYSeriesRenderStyle = Scatter
}
if (errors) {
chart.addSeries("$name incorrect", petalW[errors], petalL[errors]).with {
XYSeriesRenderStyle = Scatter
}
}
}
new SwingWrapper(chart).displayChart()
Petal width <= 0.6: Iris-setosa (50.0)
Petal width > 0.6
| Petal width <= 1.7
| | Petal length <= 4.9: Iris-versicolor (48.0/1.0)
| | Petal length > 4.9
| | | Petal width <= 1.5: Iris-virginica (3.0)
| | | Petal width > 1.5: Iris-versicolor (3.0/1.0)
| Petal width > 1.7: Iris-virginica (46.0/1.0)
Number of Leaves : 5
Size of the tree : 9
Iris-setosa: 50 correct, 0 incorrect
Iris-versicolor: 49 correct, 2 incorrect
Iris-virginica: 48 correct, 1 incorrect
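The decision tree text above is simply the model's own string form; assuming the J48 model variable from the snippet, it can be obtained directly:
println model         // the textual tree shown above
println model.graph() // Graphviz dot source for rendering the tree, if desired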
Iris flower data – wekaDeeplearning4j
WekaPackageManager.loadPackages(true)
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def loader = new CSVLoader(file: file)
def data = loader.dataSet
data.classIndex = 4
def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \
-lossFn weka.dl4j.lossfunctions.LossMCXENT\"")
AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options)
// Stratify and split
Random rand = new Random(0)
Instances randData = new Instances(data)
randData.randomize(rand)
randData.stratify(3)
Instances train = randData.trainCV(3, 0)
Instances test = randData.testCV(3, 0)
// Build the classifier on the training data
myClassifier.buildClassifier(train)
// Evaluate the model on test data
Evaluation eval = new Evaluation(test)
eval.evaluateModel(myClassifier, test)
println eval.toSummaryString()
println eval.toMatrixString()
Image: https://www.neuraldesigner.com/learning/examples/iris-flowers-classification
Iris flower data – wekaDeeplearning4j
WekaPackageManager.loadPackages(true)
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def loader = new CSVLoader(file: file)
def data = loader.dataSet
data.classIndex = 4
def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \
-lossFn weka.dl4j.lossfunctions.LossMCXENT\"")
AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options)
// Stratify and split
Random rand = new Random(0)
Instances randData = new Instances(data)
randData.randomize(rand)
randData.stratify(3)
Instances train = randData.trainCV(3, 0)
Instances test = randData.testCV(3, 0)
// Build the classifier on the training data
myClassifier.buildClassifier(train)
// Evaluate the model on test data
Evaluation eval = new Evaluation(test)
eval.evaluateModel(myClassifier, test)
println eval.toSummaryString()
println eval.toMatrixString()
[main] INFO org.deeplearning4j.nn.graph.ComputationGraph - Starting ComputationGraph
with WorkspaceModes set to [training: ENABLED; inference: ENABLED], cacheMode set to [NONE]
Training Dl4jMlpClassifier...: [] ETA: 00:00:00[INFO ] 00:03:31.035
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [1/10] took 00:00:00.670
Training Dl4jMlpClassifier...: [====== ] ETA: 00:00:06[INFO ] 00:03:31.152
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [2/10] took 00:00:00.113
Training Dl4jMlpClassifier...: [============ ] ETA: 00:00:03[INFO ] 00:03:31.244
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [3/10] took 00:00:00.090
Training Dl4jMlpClassifier...: [================== ] ETA: 00:00:02[INFO ] 00:03:31.325
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [4/10] took 00:00:00.079
Training Dl4jMlpClassifier...: [======================== ] ETA: 00:00:01[INFO ] 00:03:31.470
[main] weka.dl4j.listener.EpochListener - Epoch [5/10]
Train Set:
Loss: 0.510342
[INFO ] 00:03:31.470 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [5/10] took 00:00:00.144
Training Dl4jMlpClassifier...: [============================== ] ETA: 00:00:01[INFO ] 00:03:31.546
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [6/10] took 00:00:00.073
Training Dl4jMlpClassifier...: [==================================== ] ETA: 00:00:00[INFO ] 00:03:31.611
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [7/10] took 00:00:00.063
Training Dl4jMlpClassifier...: [========================================== ] ETA: 00:00:00[INFO ] 00:03:31.714
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [8/10] took 00:00:00.101
Training Dl4jMlpClassifier...: [================================================ ] ETA: 00:00:00[INFO ] 00:03:31.790
[main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [9/10] took 00:00:00.074
Training Dl4jMlpClassifier...: [====================================================== ] ETA: 00:00:00[INFO ] 00:03:31.882
[main] weka.dl4j.listener.EpochListener - Epoch [10/10]
Train Set:
Loss: 0.286469
…
Iris flower data – wekaDeeplearning4j
WekaPackageManager.loadPackages(true)
def file = getClass().classLoader.getResource('iris_data.csv').file as File
def loader = new CSVLoader(file: file)
def data = loader.dataSet
data.classIndex = 4
def options = Utils.splitOptions("-S 1 -numEpochs 10 -layer \"weka.dl4j.layers.OutputLayer -activation weka.dl4j.activations.ActivationSoftmax \
-lossFn weka.dl4j.lossfunctions.LossMCXENT\"")
AbstractClassifier myClassifier = Utils.forName(AbstractClassifier, "weka.classifiers.functions.Dl4jMlpClassifier", options)
// Stratify and split
Random rand = new Random(0)
Instances randData = new Instances(data)
randData.randomize(rand)
randData.stratify(3)
Instances train = randData.trainCV(3, 0)
Instances test = randData.testCV(3, 0)
// Build the classifier on the training data
myClassifier.buildClassifier(train)
// Evaluate the model on test data
Evaluation eval = new Evaluation(test)
eval.evaluateModel(myClassifier, test)
println eval.toSummaryString()
println eval.toMatrixString()
…
[INFO ] 00:03:31.883 [main] weka.classifiers.functions.Dl4jMlpClassifier - Epoch [10/10] took 00:00:00.091
Training Dl4jMlpClassifier...: [============================================================] ETA: 00:00:00
Done!
Correctly Classified Instances 40 80 %
Incorrectly Classified Instances 10 20 %
Kappa statistic 0.701
Mean absolute error 0.2542
Root mean squared error 0.3188
Relative absolute error 57.2154 %
Root relative squared error 67.6336 %
Total Number of Instances 50
=== Confusion Matrix ===
a b c <-- classified as
17 0 0 | a = Iris-setosa
0 9 8 | b = Iris-versicolor
0 2 14 | c = Iris-virginica
BUILD SUCCESSFUL in 22s
Classify language
Apache OpenNLP parts:
• sentence detector, tokenizer, name finder, document
categorizer, part-of-speech tagger, chunker, parser,
coreference resolution
import opennlp.tools.langdetect.*
def base = 'http://apache.forsale.plus/opennlp/models'
def url = "$base/langdetect/1.8.3/langdetect-183.bin"
def model = new LanguageDetectorModel(new URL(url))
def detector = new LanguageDetectorME(model)
[ 'spa': 'Bienvenido a Madrid',
'fra': 'Bienvenue à Paris',
'dan': 'Velkommen til København'
].each { k, v ->
assert detector.predictLanguage(v).lang == k
}
Supported language
detection training
models:
• MaxEnt
• Perceptron
• Naïve Bayes
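Training a custom language detector is also only a few lines; a sketch, assuming a hypothetical 'langdetect.train' file with one "language<TAB>text" sample per line (the file name and the NAIVEBAYES choice are illustrative):
import opennlp.tools.langdetect.*
import opennlp.tools.util.*

def isf = { new FileInputStream('langdetect.train') } as InputStreamFactory // hypothetical training file
def samples = new LanguageDetectorSampleStream(new PlainTextByLineStream(isf, 'UTF-8'))
def params = new TrainingParameters()
params.put(TrainingParameters.ALGORITHM_PARAM, 'NAIVEBAYES') // or MAXENT, PERCEPTRON
def model = LanguageDetectorME.train(samples, params, new LanguageDetectorFactory())
println new LanguageDetectorME(model).predictLanguage('Bienvenido a Madrid').lang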
Classify language – entities
String[] sentences = [
"A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
"A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.",
'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
'I saw Ms. May Smith waving to June Jones.',
'The parcel was passed from May to June.'
]
// use a helper to cache models
def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
def modelNames = ['person', 'money', 'date', 'time', 'location']
def finders = modelNames.collect{ new NameFinderME(new TokenNameFinderModel(helper.load("en-ner-$it"))) }
def tokenizer = SimpleTokenizer.INSTANCE
…
Classify language – entities
sentences.each { sentence ->
String[] tokens = tokenizer.tokenize(sentence)
Span[] tokenSpans = tokenizer.tokenizePos(sentence)
def entityText = [:]
def entityPos = [:]
finders.indices.each {fi ->
// could be made smarter by looking at probabilities and overlapping spans
Span[] spans = finders[fi].find(tokens)
spans.each{span ->
            def se = span.start..<span.end
            def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
entityPos[span.start] = pos
entityText[span.start] = "$span.type(${sentence[pos]})"
}
}
entityPos.keySet().toList().reverseEach {
def pos = entityPos[it]
def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
println sentence
}
Classify language – entities
sentences.each { sentence ->
String[] tokens = tokenizer.tokenize(sentence)
Span[] tokenSpans = tokenizer.tokenizePos(sentence)
def entityText = [:]
def entityPos = [:]
finders.indices.each {fi ->
// could be made smarter by looking at probabilities and overlapping spans
Span[] spans = finders[fi].find(tokens)
spans.each{span ->
            def se = span.start..<span.end
            def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
entityPos[span.start] = pos
entityText[span.start] = "$span.type(${sentence[pos]})"
}
}
entityPos.keySet().toList().reverseEach {
def pos = entityPos[it]
def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
println sentence
}
A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query.
A commit by person(Daniel) on Sun., date(December 6, 2020) improved Groovy 4's language integrated query.
The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price.
The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark).
I saw Ms. person(May Smith) waving to person(June Jones).
The parcel was passed from date(May to June).
Classify language – parts of speech
def sentences = [
'Paul has two sisters, Maree and Christine.',
'His bark was much worse than his bite',
'Turn on the lights to the main bedroom',
"Light 'em all up",
'Make it dark downstairs'
]
def tokenizer = SimpleTokenizer.INSTANCE
sentences.each {
String[] tokens = tokenizer.tokenize(it)
def model = new POSModel(helper.load('en-pos-maxent'))
def posTagger = new POSTaggerME(model)
String[] tags = posTagger.tag(tokens)
println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}
NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) .
PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite)
VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) JJ(main) NN(bedroom)
NN(Light) POS(') NN(em) DT(all) IN(up)
VB(Make) PRP(it) JJ(dark) NN(downstairs)
Classify language – sentence detection
def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
def model = new SentenceModel(helper.load('en-sent'))
def detector = new SentenceDetectorME(model)
def sentences = detector.sentDetect(text)
assert text.count('.') == 28
assert sentences.size() == 4
println sentences.join('\n\n')
Classify language – sentence detection
def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
def model = new SentenceModel(helper.load('en-sent'))
def detector = new SentenceDetectorME(model)
def sentences = detector.sentDetect(text)
assert text.count('.') == 28
assert sentences.size() == 4
println sentences.join('\n\n')
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951. It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings. He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951.
It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf.
Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings.
He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
Classify language
Apache NLPCraft (incubating)
start(LightSwitchModel)
def cli = new NCTestClientBuilder().newBuilder().build()
cli.open("nlpcraft.lightswitch.ex.java")
println cli.ask('Turn on the lights in the master bedroom')
println cli.ask("Light 'em all up")
println cli.ask('Make it dark downstairs') // expecting no match
if (cli) {
cli.close()
}
stop()
Lights are [on] in [master bedroom].
Lights are [on] in [entire house].
No matching intent found.
Classify language
Apache NLPCraft (incubating)
@NCIntentRef("ls")
@NCIntentSample([
"Turn the lights off in the entire house.",
"Switch on the illumination in the master bedroom closet.",
"Get the lights on.",
"Lights up in the kitchen.",
"Please, put the light out in the upstairs bedroom.",
"Set the lights on in the entire house.",
…
"Kill off all the lights now!",
"No lights in the bedroom, please.",
"Light up the garage, please!",
"Kill the illumination now!"
])
NCResult onMatch(
@NCIntentTerm("act") NCToken actTok,
@NCIntentTerm("loc") List locToks) {
String status = actTok.id == "ls:on" ? "on" : "off"
String locations = locToks ? locToks.collect{ it.meta("nlpcraft:nlp:origtext") }.join(", ") : "entire house"
// Add HomeKit, Arduino or other integration here.
// By default - just return a descriptive action string.
NCResult.text("Lights are [$status] in [${locations.toLowerCase()}].")
}
Lights are [on] in [master bedroom].
Lights are [on] in [entire house].
No matching intent found.
Classification with Neural networks
For >1 layers (deep)
model caters for non-
linear relationships
See: https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6
Classification with Neural networks Case Study: MNIST
• Handwritten digits
• 60,000 training examples
• 10,000 test examples
• Size-normalized (28x28 pixels) and centered
• http://yann.lecun.com/exdb/mnist/
MNIST with Apache Commons Math
class MnistNeuralNet {
private static int inodes = 784
private static int hnodes = 200
private static int onodes = 10
private static double learning_rate = 0.1
private static RealMatrix inputHidden
private static RealMatrix hiddenOutput
static initRandom() {
inputHidden = initRandom(createRealMatrix(hnodes, inodes), inodes**-0.5 as double)
hiddenOutput = initRandom(createRealMatrix(onodes, hnodes), hnodes**-0.5 as double)
}
static double[] normalize(Integer[] img) {
img.collect { it / 255.0d * 0.999d + 0.001d } as double[]
}
static RealMatrix createTarget(int label) {
createRealMatrix(onodes, 1).tap {
        (0..<onodes).each { setEntry(it, 0, it == label ? 0.99d : 0.01d) }
    }
}
static RealMatrix query(double[] inputArray) {
RealMatrix inputs = createColumnRealMatrix(inputArray)
RealMatrix hiddenInputs = inputHidden * inputs
RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs)
RealMatrix finalInputs = hiddenOutput * hiddenOutputs
return scalarSigmoid(finalInputs)
}
…
Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
MnistNeuralNet
class supports training
and prediction
[Diagram: fully connected network – 784 inputs (28 x 28 pixels), 200 hidden nodes, 10 outputs (digits 0–9)]
MNIST with Apache Commons Math
…
static RealMatrix initRandom(RealMatrix matrix, double desiredStandardDeviation) {
scalar(matrix, it -> new Random().nextGaussian() * desiredStandardDeviation)
}
static void train(RealMatrix inputs, RealMatrix targets) {
// forward
RealMatrix hiddenInputs = inputHidden * inputs
RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs)
RealMatrix finalInputs = hiddenOutput * hiddenOutputs
RealMatrix finalOutputs = scalarSigmoid(finalInputs)
// back
RealMatrix outputErrors = targets.subtract(finalOutputs)
RealMatrix t1 = multiplyElements(outputErrors, finalOutputs)
RealMatrix t2 = multiplyElements(t1, scalar(finalOutputs, it -> 1.0d - it))
RealMatrix t3 = t2 * hiddenOutputs.transpose()
hiddenOutput = hiddenOutput.add(scalar(t3, it -> learning_rate * it))
RealMatrix hiddenErrors = hiddenOutput.transpose() * outputErrors
t1 = multiplyElements(hiddenErrors, hiddenOutputs)
t2 = multiplyElements(t1, scalar(hiddenOutputs, it -> 1.0d - it))
t3 = t2 * inputs.transpose()
inputHidden = inputHidden.add(scalar(t3, it -> learning_rate * it))
}
static void save(String filename) {
new FileOutputStream(filename).withObjectOutputStream { oos ->
oos.writeObject(inputHidden.data)
oos.writeObject(hiddenOutput.data)
}
}
}
Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
MnistNeuralNet
class supports training
and prediction cont’d
MNIST with Apache Commons Math
static RealMatrix scalar(RealMatrix matrix, Function function) {
int numRows = matrix.rowDimension
int numCols = matrix.columnDimension
RealMatrix result = createRealMatrix(numRows, numCols)
    for (r in 0..<numRows) {
        for (c in 0..<numCols) {
            result.setEntry(r, c, function.apply(matrix.getEntry(r, c)))
}
}
return result
}
static int[][] rotate(int[][] img, double angleInDegrees) {
double angle = Math.toRadians(angleInDegrees)
int[][] result = new int[img.length][]
    for (y in 0..<img.length) {
        result[y] = new int[img[y].length]
Arrays.fill(result[y], 0)
}
double cosAngle = Math.cos(angle)
double sinAngle = Math.sin(angle)
double x0 = img[0].length / 2 - cosAngle * img[0].length / 2 - sinAngle * img.length / 2
double y0 = img.length / 2 - cosAngle * img.length / 2 + sinAngle * img[0].length / 2
    for (y in 0..<img.length) {
        for (x in 0..<img[y].length) {
            int xRot = (int) (x * cosAngle + y * sinAngle + x0)
int yRot = (int) (-x * sinAngle + y * cosAngle + y0)
if (xRot >= 0 && yRot >= 0 && xRot <= 27 && yRot <= 27) {
result[y][x] = img[yRot][xRot]
}
}
}
return result
} …
Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
Utility methods
MNIST with Apache Commons Math
…
static RealMatrix multiplyElements(RealMatrix matrixA, RealMatrix matrixB) {
// elementWise multiplication has same compatibility requirements as addition
checkAdditionCompatible(matrixA, matrixB)
int numRows = matrixA.rowDimension
int numCols = matrixA.columnDimension
RealMatrix product = createRealMatrix(numRows, numCols)
    for (r in 0..<numRows) {
        for (c in 0..<numCols) {
            product.setEntry(r, c, matrixA.getEntry(r, c) * matrixB.getEntry(r, c))
}
}
return product
}
static int maxIndex(RealMatrix result) {
double[][] data = result.data
    (0..<data.length).max{ data[it][0] }
}
static RealMatrix scalarSigmoid(RealMatrix matrix) {
def sigmoid = [start: { a, b, c, d, e, f -> },
visit: { r, c, double v -> 1 / (1 + Math.exp(-v)) },
end : { -> 0d }] as RealMatrixChangingVisitor
RealMatrix result = matrix.copy()
result.walkInRowOrder(sigmoid)
result
}
Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
Utility methods
cont’d
MNIST with Apache Commons Math
int epochs = 10
initRandom()
int[] labels = MnistReader.getLabels(Paths.get("/path/to/train-labels-idx1-ubyte.gz"))
List images = MnistReader.getImages(Paths.get("/path/to/train-images-idx3-ubyte.gz"))
println 'Processing images'
def scaledImages = images.collect { image -> normalize(image.flatten() as Integer[]) } as double[][]
def scaledImagesRot10 = images.collect { image -> normalize(rotate(image, 10).flatten() as Integer[]) } as double[][]
def scaledImagesRot350 = images.collect { image -> normalize(rotate(image, -10).flatten() as Integer[]) } as double[][]
epochs.times { e ->
println "Running epoch: ${e + 1}"
    for (i in 0..<labels.length) {
        RealMatrix targets = createTarget(labels[i])
train(createColumnRealMatrix(scaledImages[i]), targets)
train(createColumnRealMatrix(scaledImagesRot10[i]), targets)
train(createColumnRealMatrix(scaledImagesRot350[i]), targets)
}
}
int[] testLabels = MnistReader.getLabels(Paths.get("/path/to/t10k-labels-idx1-ubyte.gz"))
List testImages = MnistReader.getImages(Paths.get("/path/to/t10k-images-idx3-ubyte.gz"))
int correct = 0
for (i in 0..<testLabels.length) {
    int correctLabel = testLabels[i]
RealMatrix predict = query(scale(testImages.get(i).flatten() as Integer[]))
int predictLabel = maxIndex(predict)
if (predictLabel == correctLabel) correct++
}
println "Accuracy: " + correct / testLabels.length
save("../resources/weights")
Processing images
Running epoch: 1
Running epoch: 2
Running epoch: 3
Running epoch: 4
Running epoch: 5
Running epoch: 6
Running epoch: 7
Running epoch: 8
Running epoch: 9
Running epoch: 10
Accuracy: 0.9767
Inspired by: https://golb.hplar.ch/2018/12/simple-neural-network.html
Train
model
Test
model
MNIST with DeepLearning4J
int rngSeed = 123
int batchSize = 125
int numEpochs = 15
int numInputs = 28 * 28
int hiddenLayerSize = 1000
int numOutputs = 10
def (trainSet, testSet) = [true, false].collect {
new MnistDataSetIterator(batchSize, it, rngSeed) }
def log = LoggerFactory.getLogger(getClass())
def conf = new NeuralNetConfiguration.Builder()
.seed(rngSeed) //include the random seed for reproducibility
// use stochastic gradient descent as an optimization algorithm
.updater(new Nesterovs(0.006, 0.9))
.l2(1e-4)
.list()
// input layer:
.layer(new DenseLayer.Builder()
.nIn(numInputs)
.nOut(hiddenLayerSize)
.activation(Activation.RELU)
.weightInit(WeightInit.XAVIER)
.build())
// hidden layer:
.layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
.nIn(hiddenLayerSize)
.nOut(numOutputs)
.activation(Activation.SOFTMAX)
.weightInit(WeightInit.XAVIER)
.build())
.build()
…
…
log.info("Creating model ...")
def model = new MultiLayerNetwork(conf)
model.init()
log.info("Training model ...")
model.listeners = new ScoreIterationListener(100)
model.fit(trainSet, numEpochs)
// model.save('mlp_model.dat' as File)
log.info("Evaluating model...")
def eval = model.evaluate(testSet)
log.info(eval.stats())
======== Evaluation Metrics ========
# of classes: 10
Accuracy: 0.9728
Precision: 0.9728
Recall: 0.9726
F1 Score: 0.9726
Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-
examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java
[Diagram: network with 784 inputs (28 x 28 pixels), 1000 hidden nodes, 10 outputs (digits 0–9)]
MNIST with DeepLearning4J
Based on: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-
examples/src/main/java/org/deeplearning4j/examples/feedforward/mnist/MLPMnistSingleLayerExample.java
[Diagram: network with 784 inputs (28 x 28 pixels), hidden layers of 500 and 100 nodes, 10 outputs (digits 0–9)]
int rngSeed = 123
int batchSize = 64
int numEpochs = 15
int numInputs = 28 * 28
int numOutputs = 10
double rate = 0.0015
def (trainSet, testSet) = [true, false].collect {
new MnistDataSetIterator(batchSize, it, rngSeed) }
def log = LoggerFactory.getLogger(getClass())
def conf = new NeuralNetConfiguration.Builder()
.seed(rngSeed) //include the random seed for reproducibility
// use stochastic gradient descent as an optimization algorithm
.updater(new Nadam())
.l2(rate * 0.005) // regularized
.list()
// first input layer:
.layer(new DenseLayer.Builder()
.nIn(numInputs)
.nOut(500)
.build())
// second input layer:
.layer(new DenseLayer.Builder()
.nIn(500)
.nOut(100)
.build())
// hidden layer:
.layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
.nOut(numOutputs)
.activation(Activation.SOFTMAX)
.build())
.build()
…
log.info("Creating model ...")
def model = new MultiLayerNetwork(conf)
model.init()
log.info("Training model ...")
model.listeners = new ScoreIterationListener(100)
model.fit(trainSet, numEpochs)
log.info("Evaluating model...")
def eval = model.evaluate(testSet)
log.info(eval.stats())
======== Evaluation Metrics ========
# of classes: 10
Accuracy: 0.9820
Precision: 0.9820
Recall: 0.9818
F1 Score: 0.9819
MNIST GUI with GroovyFX
import javafx.embed.swing.SwingFXUtils
import javafx.scene.canvas.GraphicsContext
import javafx.scene.image.Image
import javafx.scene.image.PixelFormat
import javafx.scene.paint.Color
import javax.imageio.ImageIO
static clear(GraphicsContext g, size) {
g.fill = Color.WHITE; g.fillRect(0, 0, size, size); g.fill = Color.BLACK
}
static int[] imageToArray(Image img) {
int w = img.width
int h = img.height
int[] buf = new int[h * w]
img.pixelReader.getPixels(0, 0, w, h, PixelFormat.intArgbInstance, buf, 0, w)
buf
}
static Image snapshot(canvas) {
def baos = new ByteArrayOutputStream()
ImageIO.write(SwingFXUtils.fromFXImage(canvas.snapshot(null, null), null), "png", baos)
new Image(new ByteArrayInputStream(baos.toByteArray()), 28, 28, true, true)
}
static displayResult(List items, Integer predict) {
items.indexed().collect{ idx, val ->
def marker = (idx == predict && items[predict] > 0.5) ? ' **' : ''
"$idx ${val ? sprintf('%.4f', val) + marker : '?'}"
}.join('\n')
}
A few more
utility methods
MNIST GUI with GroovyFX
class MnistInfer {
private static RealMatrix inputHidden
private static RealMatrix hiddenOutput
static void load(Class klass) {
def weights = klass.classLoader.getResourceAsStream('weights')
try (ObjectInputStream ois = new ObjectInputStream(weights)) {
inputHidden = MatrixUtils.createRealMatrix((double[][]) ois.readObject())
hiddenOutput = MatrixUtils.createRealMatrix((double[][]) ois.readObject())
}
}
static RealMatrix query(double[] inputArray) {
RealMatrix inputs = createColumnRealMatrix(inputArray)
RealMatrix hiddenInputs = inputHidden * inputs
RealMatrix hiddenOutputs = scalarSigmoid(hiddenInputs)
RealMatrix finalInputs = hiddenOutput * hiddenOutputs
return scalarSigmoid(finalInputs)
}
static double[] normalize(int[] img) {
double[] result = new double[img.length]
for (int i = 0; i < img.length; i++) {
int red = (img[i] >> 16) & 0xff
int green = (img[i] >> 8) & 0xff
int blue = img[i] & 0xff
result[i] = 1 - ((red + green + blue) / 765.0 * 999 + 1) / 1000
}
return result
}
}
Predictor class
MNIST GUI with GroovyFX
…
start {
stage(title: 'MNIST', visible: true) {
scene(id: "scene", fill: WHITE) {
borderPane {
top {
canvas(id: "canvas", width: size, height: size)
def g = canvas.graphicsContext2D
clear(g, size)
canvas.onMouseDragged { e -> g.fillOval e.x - pen, e.y - pen, pen * 2, pen * 2 }
}
center {
hbox(alignment: 'Center') {
button('Clear', onAction: {
clear(canvas.graphicsContext2D, size)
out.text = displayResult([null] * 10, null)
})
button('Predict', onAction: {
def result = Mnist.query(Mnist.scale(imageToArray(snapshot(canvas))))
def predictLabel = Mnist.maxIndex(result)
out.text = displayResult(result.data.collect{ it[0] }, predictLabel)
})
}
}
bottom {
textArea(displayResult([null] * 10, null), id: 'out', editable: false, prefColumnCount: 16)
}
}
}
}
}
Apache MXNET
Source: https://mxnet.apache.org/versions/master/faq/why_mxnet.html
Apache MXNET Object detection
import groovy.swing.SwingBuilder
import javax.imageio.ImageIO
import org.apache.mxnet.infer.javaapi.ObjectDetector
import org.apache.mxnet.javaapi.*
import static javax.swing.WindowConstants.DISPOSE_ON_CLOSE
static void downloadUrl(String srcDirUrl, String destDir, String filename) {
def destination = new File(destDir, filename)
if (!destination.parentFile.exists()) {
destination.parentFile.mkdirs()
}
if (!destination.exists()) {
destination.bytes = new URL(srcDirUrl + filename).bytes
}
}
static downloadModelImage() {
String baseDir = "${System.getProperty('java.io.tmpdir')}/resnetssd/"
def imageName = 'dog-ssd.jpg'
def modelName = 'resnet50_ssd_model'
String imgURL = "https://s3.amazonaws.com/model-server/inputs/"
println "Downloading image to ${baseDir}..."
downloadUrl(imgURL, baseDir, imageName)
println "Downloading model files to ${baseDir}... (may take a while)"
String modelURL = "https://s3.amazonaws.com/model-server/models/resnet50_ssd/"
downloadUrl(modelURL, baseDir, "$modelName-symbol.json")
downloadUrl(modelURL, baseDir, "$modelName-0000.params")
downloadUrl(modelURL, baseDir, 'synset.txt')
[baseDir + imageName, baseDir + modelName]
}
…
Utility methods for downloading test image and pre-trained model
Test image
Apache MXNET Object detection
…
static detectObjects(String modelPath, String imagePath, inputShape) {
def context = [Context.cpu()]
def inputDescriptor = new DataDesc("data", inputShape, DType.Float32(), "NCHW")
def objDet = new ObjectDetector(modelPath, [inputDescriptor], context, 0)
return objDet.imageObjectDetect(ObjectDetector.loadImageFromFile(imagePath), 3)
}
def (imagePath, modelPath) = downloadModelImage()
def image = ImageIO.read(imagePath as File)
def (w, h) = image.with{ [it.width, it.height] }
def count = 1 // batch size
def channels = 3 // for RGB
def inputShape = new Shape(count, channels, w * 0.8 as int, h)
def results = detectObjects(modelPath, imagePath, inputShape).sum()
def boxes = results.collect {[
xmin: w * it.XMin as int, ymin: h * it.YMin as int,
xmax: w * it.XMax as int, ymax: h * it.YMax as int
]}
def names = results.collect{ it.className + sprintf(' %.3f', it.probability) }
(0..<boxes.size()).each { println "${names[it]} ${boxes[it]}" }
Image.drawBoundingBox(image, boxes, names)
new SwingBuilder().edt {
frame(title: "${results.size()} detected objects", size: [w, h],
show: true, defaultCloseOperation: DISPOSE_ON_CLOSE) {
label(icon: imageIcon(image: image))
}
}
car 0.996 [xmin:439, ymin:62, xmax:715, ymax:173]
bicycle 0.986 [xmin:144, ymin:128, xmax:547, ymax:453]
dog 0.919 [xmin:102, ymin:184, xmax:344, ymax:541]
Bonus Material
Scaling up Data Science
What if your big data framework doesn't have the algorithm you need?
Preprocessing
Partition based dataset
Linear regression
K-Means clustering
Genetic algorithms
Multilayer perceptron
Decision trees
k-NN classification
k-NN regression
SVM binary classification
SVM multi-class Classification
Model cross validation
Logistic regression
Random forest
Gradient boosting
Model updating
ANN (Approximate Nearest Neighbor)
Deep learning
Summary statistics
Correlations
Stratified sampling
Hypothesis testing
Random data generation
Classification and regression: support vector machines, logistic regression
Linear regression
Decision trees
Naive Bayes classification
Alternating least squares (ALS)
K-Means clustering
Latent Dirichlet allocation (LDA)
Singular value decomposition (SVD)
Principal component analysis (PCA)
Feature extraction
Transformation functions
Optimization algorithms
Re-formulate your algorithm using map/filter/reduce
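A minimal, sequential sketch of that map/filter/reduce shape on toy data (fitChunk is a made-up stand-in for a real regression fit); the GPars and Beam versions below parallelise exactly these map and reduce steps:
// Sequential sketch of the chunk/fit/aggregate idea in plain Groovy collections
def data = (1..100).collect { [it, it * 2, it * 3] }            // pretend rows
def chunkSize = 25
def fitChunk = { chunk -> [chunk.size(), chunk*.get(0).sum()] } // fake per-chunk "model"

def chunks = data.shuffled().collate(chunkSize)                 // map: data -> chunks
def models = chunks.collect { fitChunk(it) }                    // map: chunk -> model
def model  = models.transpose()*.sum().collect { it / models.size() } // reduce: average the models
println model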
House pricing revisited – Initial algorithm
[Diagram: Split the Data into Training and Test sets; Fit the Model on the Training set; Predict and Evaluate against the Test set to produce Stats]
House pricing revisited – Rejigging our algorithm
[Diagram: Split the Data into Chunk1..ChunkN and Fit a model per chunk (Model1..ModelN); Aggregate these into a single Model; then Split the Data into chunks again, Predict and Evaluate each chunk against the aggregated Model, and Aggregate the per-chunk Stats1..StatsN into overall Stats]
House pricing revisited - GPars
import smile.regression.OLS
import tech.tablesaw.api.*
import groovyx.gpars.GParsPool
import static java.lang.Math.sqrt
import static org.apache.commons.math3.stat.StatUtils.sumSq
import static smile.math.Math.dot
def features = [
'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat',
'sqft_above', 'grade', 'view', 'waterfront', 'floors'
]
def file = getClass().classLoader.getResource('kc_house_data.csv').file
def table = Table.read().csv(file)
table = table.dropWhere(table.column("bedrooms").isGreaterThan(30))
def data = table.as().doubleMatrix(*features)
…
• Read CSV in normal way
• Split data into chunks
• Fit each chunk and aggregate resulting models
• Split data into chunks
• Evaluate each chunk against model and aggregate resulting stats
House pricing revisited - GPars
…
GParsPool.withPool {
def trainChunkSize = 3000
def numTrainChunks = 8
def models = (0..<numTrainChunks).collectParallel {
def list = data.toList().shuffled().take(trainChunkSize)
new OLS(list.collect{ it[1..-1] } as double[][],
list.collect{ it[0] } as double[]).with{
[it.intercept(), *it.coefficients()]
}
}
def model = models.transpose()*.sum().collect{ it/numTrainChunks }
println "Intercept: ${model[0]}"
println "Coefficients: ${model[1..-1].join(', ')}"
double[] coefficients = model[1..-1]
double intercept = model[0]
def stats = { chunk ->
def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
[rmse, residuals.average(), chunk.size()]
}
def evalChunkSize = 2000
def results = data.collate(evalChunkSize).collectParallel(stats)
println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size())
println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size()
}
• Read CSV in normal way
• Split data into chunks
• Fit each chunk and aggregate resulting models
• Split data into chunks
• Evaluate each chunk against model and aggregate resulting stats
House pricing revisited - GPars
…
GParsPool.withPool {
def trainChunkSize = 3000
def numTrainChunks = 8
def models = (0..<numTrainChunks).collectParallel {
def list = data.toList().shuffled().take(trainChunkSize)
new OLS(list.collect{ it[1..-1] } as double[][],
list.collect{ it[0] } as double[]).with{
[it.intercept(), *it.coefficients()]
}
}
def model = models.transpose()*.sum().collect{ it/numTrainChunks }
println "Intercept: ${model[0]}"
println "Coefficients: ${model[1..-1].join(', ')}"
double[] coefficients = model[1..-1]
double intercept = model[0]
def stats = { chunk ->
def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
[rmse, residuals.average(), chunk.size()]
}
def evalChunkSize = 2000
def results = data.collate(evalChunkSize).collectParallel(stats)
println 'RMSE: ' + sqrt(results.collect{ it[0] * it[0] * it[2] }.sum() / data.size())
println 'mean: ' + results.collect{ it[1] * it[2] }.sum() / data.size()
}
Intercept: -3.16904445043026E7
Coefficients: -21131.141578491588, -5431.675525641348,
179.47932136012156, 11.273983523229091, 657752.8815669084,
-10.658550821995767, 86876.08541623666, 61708.58759419169,
611717.9878928157, -21819.967894981055
RMSE: 215185.8177879586
mean: -2151.974648800721
House pricing revisited - GPars
[Diagram: shuffle/take produces Chunk1..ChunkN; OLS#intercept/OLS#coefficients fits Model1..ModelN; transpose/sum aggregates them into the model; collate splits the data into evaluation chunks; the stats closure evaluates each chunk and results.collect aggregates the overall Stats]
House pricing revisited – Apache Beam
…
class MeanDoubleArrayCols implements SerializableFunction<Iterable<double[]>, double[]> {
@Override
double[] apply(Iterable<double[]> input) {
double[] sum = null
def count = 0
for (double[] next : input) {
if (sum == null) {
sum = new double[next.size()]
}
(0..<sum.size()).each { sum[it] += next[it] }
count++
}
if (sum != null) (0..<sum.size()).each { sum[it] /= count }
return sum
}
}
class AggregateModelStats implements SerializableFunction<Iterable<double[]>, double[]> {
@Override
double[] apply(Iterable<double[]> input) {
double[] sum = null
for (double[] next : input) {
if (sum == null) {
sum = new double[next.size()]
}
def total = sum[2] + next[2]
sum[0] = sqrt((sum[2] * sum[0] * sum[0] + next[2] * next[0] * next[0]) / total)
sum[1] = (sum[2] * sum[1] + next[2] * next[1]) / total
sum[2] = total
}
return sum
}
}
…
Aggregate models
Aggregate stats
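The stats aggregation relies on the per-chunk [rmse, mean, count] triples pooling back to the whole-dataset values when weighted by chunk size. A tiny standalone check of that identity (illustrative only, not part of the Beam pipeline):
import static java.lang.Math.sqrt

// two pretend chunks of residuals
def chunkA = [1.0d, -2.0d, 3.0d]
def chunkB = [-1.0d, 4.0d]
def all = chunkA + chunkB

// per-chunk [rmse, mean, count] triples, as the evalModel closure produces
def stats = [chunkA, chunkB].collect { c ->
    [sqrt(c.sum { it * it } / c.size()), c.sum() / c.size(), c.size()]
}

// pool them the same way the aggregation above does
def total = stats.collect { it[2] }.sum()
def pooledRmse = sqrt(stats.sum { it[2] * it[0] * it[0] } / total)
def pooledMean = stats.sum { it[2] * it[1] } / total

assert Math.abs(pooledRmse - sqrt(all.sum { it * it } / all.size())) < 1e-9
assert Math.abs(pooledMean - all.sum() / all.size()) < 1e-9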
House pricing revisited – Apache Beam
…
static buildPipeline(Pipeline p) {
def features = [
'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'lat',
'sqft_above', 'grade', 'view', 'waterfront', 'floors'
]
def readCsvChunks = new DoFn<String, double[][]>() {
@ProcessElement
void processElement(@Element String path, OutputReceiver<double[][]> receiver) throws IOException {
def chunkSize = 6000
def table = Table.read().csv(path)
table = table.dropWhere(table.column("bedrooms").isGreaterThan(30))
def idxs = 0..<table.rowCount()
for (nextChunkIdxs in idxs.shuffled().collate(chunkSize)) {
def chunk = table.rows(*nextChunkIdxs)
receiver.output(chunk.as().doubleMatrix(*features))
}
}
}
def fitModel = new DoFn<double[][], double[]>() {
@ProcessElement
void processElement(@Element double[][] rows, OutputReceiver<double[]> receiver) throws IOException {
double[] model = new OLS(rows.collect{ it[1..-1] } as double[][], rows.collect{ it[0] } as double[]).with{ [it.intercept(), *it.coefficients()] }
receiver.output(model)
}
}
def evalModel = { double[][] chunk, double[] model ->
double intercept = model[0]
double[] coefficients = model[1..-1]
def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
def residuals = chunk.toList().indexed().collect { idx, row -> predicted[idx] - row[0] }
def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
def mean = residuals.average()
[rmse, mean, chunk.size()] as double[]
}
…
Map filename to chunks
Map chunk to model
Map chunk to stats
House pricing revisited – Apache Beam
…
def model2out = new DoFn<double[], String>() {
@ProcessElement
void processElement(@Element double[] ds, OutputReceiver<String> out) {
out.output("intercept: ${ds[0]}, coeffs: ${ds[1..-1].join(', ')}".toString())
}
}
def stats2out = new DoFn<double[], String>() {
@ProcessElement
void processElement(@Element double[] ds, OutputReceiver<String> out) {
out.output("rmse: ${ds[0]}, mean: ${ds[1]}, count: ${ds[2]}".toString())
}
}
var csvChunks = p
.apply(Create.of('/path/to/kc_house_data.csv'))
.apply('Create chunks', ParDo.of(readCsvChunks))
var model = csvChunks
.apply('Fit chunks', ParDo.of(fitModel))
.apply(Combine.globally(new MeanDoubleArrayCols()))
var modelView = model
.apply(View.asSingleton())
csvChunks
.apply(ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView))
.apply(Combine.globally(new AggregateModelStats()))
.apply('Log stats', ParDo.of(stats2out)).apply(Log.ofElements())
model
.apply('Log model', ParDo.of(model2out)).apply(Log.ofElements())
}
def pipeline = Pipeline.create()
buildPipeline(pipeline)
pipeline.run().waitUntilFinish()
INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327,
675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274
INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0
Map model to output
Map stats to output
Create and run pipeline
Build pipeline
…
var csvChunks = p
| Create.of('/path/to/kc_house_data.csv')
| 'Create chunks' >> ParDo.of(readCsvChunks)
var model = csvChunks
| 'Fit chunks' >> ParDo.of(fitModel)
| Combine.globally(new MeanDoubleArrayCols())
var modelView = model
| View.asSingleton()
csvChunks
| ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)
| Combine.globally(new AggregateModelStats())
| 'Log stats' >> ParDo.of(stats2out) | Log.ofElements()
model
| 'Log model' >> ParDo.of(model2out) | Log.ofElements()
}
PCollection.metaClass.or = { List arg -> delegate.apply(*arg) }
PCollection.metaClass.or = { PTransform arg -> delegate.apply(arg) }
String.metaClass.rightShift = { PTransform arg -> [delegate, arg] }
Pipeline.metaClass.or = { PTransform arg -> delegate.apply(arg) }
def pipeline = Pipeline.create()
buildPipeline(pipeline)
pipeline.run().waitUntilFinish()
House pricing revisited – Apache Beam
INFO: intercept: -3.2504584735458568E7, coeffs: -27862.329468295735, -2269.8423856962077, 198.1365232723505, 2.9357512250793327,
675276.3494316386, -2.323389992226333, 83241.42291377622, 66606.19682072607, 607326.5819472861, -30354.844181689274
INFO: rmse: 214702.83588071115, mean: -1191.2433423574066, count: 21612.0
Create and run pipeline
Some optional metaprogramming for Python-like coding style
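The same trick in isolation, applied to a toy class rather than Beam's Pipeline/PCollection (purely illustrative):
// Toy illustration of metaClass operator overloading for a '|' pipeline DSL
class Pipe {
    List steps = []
    def apply(step) { steps << step; this }
}

// map the '|' operator onto apply(), just like the PCollection/Pipeline tweaks above
Pipe.metaClass.or = { step -> delegate.apply(step) }

def p = new Pipe()
p | 'read' | 'fit' | 'evaluate'
assert p.steps == ['read', 'fit', 'evaluate']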
House pricing revisited – Apache Beam
[Diagram: readCsvChunks maps the file into the csvChunks collection (Chunk1..ChunkN); fitModel maps each chunk to Model1..ModelN; MeanDoubleArrayCols combines these into the model, exposed as modelView (a side input); evalModel maps each chunk plus modelView to Stats1..StatsN; AggregateModelStats combines them into the overall Stats]
Optimization Overview
Optimization:
• Finding (optimal) solutions
Approaches:
• Dynamic programming
• Linear programming
• Non-linear programming
• Constraint programming
• Integer programming
Aspects:
• Combinatorial optimization
• Continuous optimization
Applications:
• Timetabling/scheduling
• Vehicle routing
• Packing problems
• Trading systems
• Logistics
• Language detection
CP: Constraint Programming
CIP: Constraint Integer Programming
MINLP: Mixed Integer Non-linear Programming
MIP: Mixed Integer Linear Programming
IP: Integer Linear Programming
SAT: Satisfiability
LP: Linear Programming
Constraint Programming
Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf
Constraint Programming
Typical domains:
• Boolean (SAT problem)
• integer
• rational
• interval (scheduling problems)
• linear (approaches to non-linear problems do exist)
• finite sets
• mixed (2 or more of above)
Picture: http://osullivan.ucc.ie/AAAI_OSullivan.pdf
Cryptarithmetic
S E N D
+ M O R E
= M O N E Y
Replace the letters with decimal digits to make a valid arithmetic sum
Cryptarithmetic: Brute force
def solutions():
# letters = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y')
all_solutions = list()
for s in range(1, 10):
for e in range(0, 10):
for n in range(0, 10):
for d in range(0, 10):
for m in range(1, 10):
for o in range(0, 10):
for r in range(0, 10):
for y in range(0, 10):
if len({s, e, n, d, m, o, r, y}) == 8:
send = 1000 * s + 100 * e + 10 * n + d
more = 1000 * m + 100 * o + 10 * r + e
money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
if send + more == money:
all_solutions.append((send, more, money))
return all_solutions
print(solutions())
S E N D
+ M O R E
= M O N E Y
[(9567, 1085, 10652)]
Cryptarithmetic: Brute force
def solutions() {
// letters = ['s', 'e', 'n', 'd', 'm', 'o', 'r', 'y']
def all_solutions = []
for (s in 1..<10)
for (e in 0..9)
for (n in 0..9)
for (d in 0..9)
for (m in 1..9)
for (o in 0..9)
for (r in 0..9)
for (y in 0..9)
if ([s, e, n, d, m, o, r, y].toSet().size() == 8) {
def send = 1000 * s + 100 * e + 10 * n + d
def more = 1000 * m + 100 * o + 10 * r + e
def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
if (send + more == money)
all_solutions.add([send, more, money])
}
return all_solutions
}
print(solutions())
S E N D
+ M O R E
= M O N E Y
[[9567, 1085, 10652]]
Cryptarithmetic: Brute force
from itertools import permutations
def solution2():
letters = ('s', 'e', 'n', 'd', 'm', 'o', 'r', 'y')
digits = range(10)
for perm in permutations(digits, len(letters)):
sol = dict(zip(letters, perm))
if sol['s'] == 0 or sol['m'] == 0:
continue
send = 1000 * sol['s'] + 100 * sol['e'] + 10 * sol['n'] + sol['d']
more = 1000 * sol['m'] + 100 * sol['o'] + 10 * sol['r'] + sol['e']
money = 10000 * sol['m'] + 1000 * sol['o'] + 100 * sol['n'] + 10 * sol['e'] + sol['y']
if send + more == money:
return send, more, money
print(solution2())
S E N D
+ M O R E
= M O N E Y
(9567, 1085, 10652)
Cryptarithmetic: Brute force
def solution2() {
def digits = 0..9
for (p in digits.permutations()) {
if (p[-1] < p[-2]) continue // the last two digits are unused, so skip mirror-image duplicate permutations
def (s, e, n, d, m, o, r, y) = p
if (s == 0 || m == 0)
continue
def send = 1000 * s + 100 * e + 10 * n + d
def more = 1000 * m + 100 * o + 10 * r + e
def money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
if (send + more == money)
return [send, more, money]
}
}
print(solution2())
S E N D
+ M O R E
= M O N E Y
[9567, 1085, 10652]
from csp import Constraint, CSP
from typing import Dict, List, Optional
class SendMoreMoneyConstraint(Constraint[str, int]):
def __init__(self, letters: List[str]) -> None:
super().__init__(letters)
self.letters: List[str] = letters
def satisfied(self, assignment: Dict[str, int]) -> bool:
# not a solution if duplicate values
if len(set(assignment.values())) < len(assignment):
return False
# if all vars assigned, check if correct
if len(assignment) == len(self.letters):
s: int = assignment["S"]
e: int = assignment["E"]
n: int = assignment["N"]
d: int = assignment["D"]
m: int = assignment["M"]
o: int = assignment["O"]
r: int = assignment["R"]
y: int = assignment["Y"]
send: int = s * 1000 + e * 100 + n * 10 + d
more: int = m * 1000 + o * 100 + r * 10 + e
money: int = m * 10000 + o * 1000 + n * 100 + e * 10 + y
return send + more == money
return True
…
Cryptarithmetic: Constraint programming
…
letters = ["S", "E", "N", "D", "M", "O", "R", "Y"]
possible_digits = {}
for letter in letters:
possible_digits[letter] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
possible_digits["M"] = [1] # can't start with 0 constraint
csp: CSP[str, int] = CSP(letters, possible_digits)
csp.add_constraint(SendMoreMoneyConstraint(letters))
solution: Optional[Dict[str, int]] = csp.backtracking_search()
if solution is None:
print("No solution found!")
else:
print(solution)
S E N D
+ M O R E
= M O N E Y
{'S': 9, 'E': 5, 'N': 6, 'D': 7,
'M': 1, 'O': 0, 'R': 8, 'Y': 2}
Adapted from: https://freecontent.manning.com/constraint-satisfaction-problems-in-python/
@Grab('org.choco-solver:choco-solver:4.10.0')
import org.chocosolver.solver.Model
import org.chocosolver.solver.variables.IntVar
def model = new Model("SEND+MORE=MONEY")
def S = model.intVar("S", 1, 9)
def E = model.intVar("E", 0, 9)
def N = model.intVar("N", 0, 9)
def D = model.intVar("D", 0, 9)
def M = model.intVar("M", 1, 9)
def O = model.intVar("O", 0, 9)
def R = model.intVar("R", 0, 9)
def Y = model.intVar("Y", 0, 9)
model.allDifferent(S, E, N, D, M, O, R, Y).post()
…
Cryptarithmetic: Constraint programming
…
IntVar[] ALL = [
S, E, N, D,
M, O, R, E,
M, O, N, E, Y]
int[] COEFFS = [
1000, 100, 10, 1,
1000, 100, 10, 1,
-10000, -1000, -100, -10, -1]
model.scalar(ALL, COEFFS, "=", 0).post()
model.solver.findSolution()
S E N D
+ M O R E
= M O N E Y
Solution: S=9, E=5, N=6,
D=7, M=1, O=0, R=8, Y=2,
Dietary restrictions
Minimise cost of diet given:
• Must be at least 300 calories
• Not more than 10 grams of protein
• Not less than 10 grams of carbohydrates
• Not less than 8 grams of fat
• At least 0.5 units of fish
• No more than 1 unit of milk
Bread Milk Cheese Potato Fish Yogurt
Cost 2.0 3.5 8.0 1.5 11.0 1.0
Protein (g) 4.0 8.0 7.0 1.3 8.0 9.2
Fat (g) 1.0 5.0 9.0 0.1 7.0 1.0
Carbohydrates (g) 15.0 11.7 0.4 22.6 0.0 17.0
Calories 90 120 106 97 130 180
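Written out as a linear program, the constraints and cost table above amount to the following formulation (quantities are units of each food); this is what each of the solver examples below encodes:
minimise 2.0*bread + 3.5*milk + 8.0*cheese + 1.5*potato + 11.0*fish + 1.0*yogurt
subject to:
4.0*bread + 8.0*milk + 7.0*cheese + 1.3*potato + 8.0*fish + 9.2*yogurt <= 10 (protein)
1.0*bread + 5.0*milk + 9.0*cheese + 0.1*potato + 7.0*fish + 1.0*yogurt >= 8 (fat)
15.0*bread + 11.7*milk + 0.4*cheese + 22.6*potato + 0.0*fish + 17.0*yogurt >= 10 (carbohydrates)
90*bread + 120*milk + 106*cheese + 97*potato + 130*fish + 180*yogurt >= 300 (calories)
fish >= 0.5, milk <= 1, all quantities >= 0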
Dietary restrictions (Excel)
Dietary restrictions (Excel)
Dietary restrictions (Google sheets)
Diet problem (ojalgo)
def model = new ExpressionsBasedModel()
def bread = model.addVariable("Bread").lower(0)
def milk = model.addVariable("Milk").lower(0).upper(1)
def cheese = model.addVariable("Cheese").lower(0)
def potato = model.addVariable("Potato").lower(0)
def fish = model.addVariable("Fish").lower(0.5)
def yogurt = model.addVariable("Yogurt").lower(0)
def cost = model.addExpression("Cost")
cost.set(bread, 2.0).set(milk, 3.5).set(cheese, 8.0).set(potato, 1.5).set(fish, 11.0).set(yogurt, 1.0)
def protein = model.addExpression("Protein").upper(10)
protein.set(bread, 4.0).set(milk, 8.0).set(cheese, 7.0).set(potato, 1.3).set(fish, 8.0).set(yogurt, 9.2)
def fat = model.addExpression("Fat").lower(8)
fat.set(bread, 1.0).set(milk, 5.0).set(cheese, 9.0).set(potato, 0.1).set(fish, 7.0).set(yogurt, 1.0)
def carbs = model.addExpression("Carbohydrates").lower(10)
carbs.set(bread, 15.0).set(milk, 11.7).set(cheese, 0.4).set(potato, 22.6).set(fish, 0.0).set(yogurt, 17.0)
def calories = model.addExpression("Calories").lower(300)
calories.set(bread, 90).set(milk, 120).set(cheese, 106).set(potato, 97).set(fish, 130).set(yogurt, 180)
def result = model.minimise()
// for a variation, see:
// https://www.ojalgo.org/2019/05/the-diet-problem/
…
OPTIMAL 0.0 @ { 0.0, 0.0, 0.40813898143741034,
1.8538791051880057, 0.5916230366492151, 0.0 }
############################################
0 <= Bread: 0
0 <= Milk: 0 <= 1
0 <= Cheese: 0.408139
0 <= Potato: 1.853879
0.5 <= Fish: 0.591623
0 <= Yogurt: 0
10 <= Carbohydrates: 42.060923
8 <= Fat: 8.0
Cost: 12.553784
Protein: 10.0 <= 10
300 <= Calories: 300.0
############################################
Diet problem (Choco)
def model = new Model("Diet problem")
def unbounded = 100000
// scale quantities by 10, coefficients by 10, products by 100
def bread = model.intVar("Bread", 0, unbounded)
def milk = model.intVar("Milk", 0, 10)
def cheese = model.intVar("Cheese", 0, unbounded)
def potato = model.intVar("Potato", 0, unbounded)
def fish = model.intVar("Fish", 5, unbounded)
def yogurt = model.intVar("Yogurt", 0, unbounded)
IntVar[] all = [bread, milk, cheese, potato, fish, yogurt]
def cost = model.intVar("Cost", 0, unbounded)
model.scalar(all, [20, 35, 80, 15, 110, 10] as int[], "=", cost).post()
def protein = model.intVar("Protein", 0, 1000)
model.scalar(all, [40, 80, 70, 13, 80, 92] as int[], "=", protein).post()
def fat = model.intVar("Fat", 800, unbounded)
model.scalar(all, [10, 50, 90, 1, 70, 10] as int[], "=", fat).post()
def carbs = model.intVar("Carbohydrates", 1000, unbounded)
model.scalar(all, [150, 117, 4, 226, 0, 170] as int[], "=", carbs).post()
def calories = model.intVar("Calories", 30000, unbounded)
model.scalar(all, [900, 1200, 1060, 970, 1300, 1800] as int[], "=", calories).post()
model.setObjective(Model.MINIMIZE, cost)
def found = model.solver.findSolution()
if (found) {
all.each { println "$it.name: ${it.value / 10}" }
[carbs, fat, protein, calories, cost].each { println "$it.name: ${it.value / 100}" }
} else {
println "No solution"
}
Bread: 0
Milk: 0
Cheese: 0.5
Potato: 1.9
Fish: 0.5
Yogurt: 0
Carbohydrates: 43.14
Fat: 8.19
Protein: 9.97
Calories: 302.3
Cost: 12.35
Choco supports RealVar but doesn't have as rich a set of possible constraints for such vars.
Diet problem (Choco)
def model = new Model("Diet problem")
def unbounded = 1000.0d
def precision = 0.00001d
// no scaling needed here: real-valued variables are used directly
def bread = model.realVar("Bread", 0.0, unbounded, precision)
def milk = model.realVar("Milk", 0.0, 1.0, precision)
def cheese = model.realVar("Cheese", 0.0, unbounded, precision)
def potato = model.realVar("Potato", 0.0, unbounded, precision)
def fish = model.realVar("Fish", 0.5, unbounded, precision)
def yogurt = model.realVar("Yogurt", 0.0, unbounded, precision)
RealVar[] all = [bread, milk, cheese, potato, fish, yogurt]
def scalarIbex = { coeffs, var ->
def (a, b, c, d, e, f) = coeffs
model.realIbexGenericConstraint("$a*{0}+$b*{1}+$c*{2}+$d*{3}+$e*{4}+$f*{5}={6}",
[*all, var] as RealVar[]).post();
}
def cost = model.realVar("Cost", 0.0, unbounded, precision)
scalarIbex([2.0, 3.5, 8.0, 1.5, 11.0, 1.0], cost)
def protein = model.realVar("Protein", 0.0, 10.0, precision)
scalarIbex([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], protein)
def fat = model.realVar("Fat", 8.0, unbounded, precision)
scalarIbex([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], fat)
def carbs = model.realVar("Carbohydrates", 10.0, unbounded, precision)
scalarIbex([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], carbs)
def calories = model.realVar("Calories", 300, unbounded, precision)
scalarIbex([90, 120, 106, 97, 130, 180], calories)
model.setObjective(Model.MINIMIZE, cost)
def found = model.solver.findSolution()
Bread: 0.025131 .. 0.025137
Milk: 0.000009 .. 0.000010
Cheese: 0.428571 .. 0.428571
Potato: 1.848118 .. 1.848124
Fish: 0.561836 .. 0.561836
Yogurt: 0.000007 .. 0.000010
Carbohydrates: 42.316203 .. 42.316211
Fat: 8.000000 .. 8.000005
Protein: 9.997920 .. 9.997926
Calories: 300.000000 .. 300.000008
Cost: 12.431241 .. 12.431245
Choco does have a plugin (via JNI) for the Ibex C++ constraint processing library which does handle real numbers.
def pretty = { var ->
def bounds = found.getRealBounds(var)
printf "%s: %.6f .. %.6f%n", var.name, *bounds
}
if (found) {
all.each { pretty(it) }
[carbs, fat, protein, calories, cost].each { pretty(it) }
} else {
println "No solution"
}
Diet problem (Apache Commons Math)
import org.apache.commons.math3.optim.linear.*
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType
import static org.apache.commons.math3.optim.linear.Relationship.*
def cost = new LinearObjectiveFunction([2.0, 3.5, 8.0, 1.5, 11.0, 1.0] as double[], 0)
static scalar(coeffs, rel, val) { new LinearConstraint(coeffs as double[], rel, val) }
def bread_min = scalar([1, 0, 0, 0, 0, 0], GEQ, 0)
def milk_min = scalar([0, 1, 0, 0, 0, 0], GEQ, 0)
def milk_max = scalar([0, 1, 0, 0, 0, 0], LEQ, 1)
def cheese_min = scalar([0, 0, 1, 0, 0, 0], GEQ, 0)
def potato_min = scalar([0, 0, 0, 1, 0, 0], GEQ, 0)
def fish_min = scalar([0, 0, 0, 0, 1, 0], GEQ, 0.5)
def yogurt_min = scalar([0, 0, 0, 0, 0, 1], GEQ, 0)
def protein = scalar([4.0, 8.0, 7.0, 1.3, 8.0, 9.2], LEQ, 10)
def fat = scalar([1.0, 5.0, 9.0, 0.1, 7.0, 1.0], GEQ, 8)
def carbs = scalar([15.0, 11.7, 0.4, 22.6, 0.0, 17.0], GEQ, 10)
def calories = scalar([90, 120, 106, 97, 130, 180], GEQ, 300)
LinearConstraintSet constraints = [bread_min, milk_min, milk_max, fish_min, cheese_min,
potato_min, yogurt_min, protein, fat, carbs, calories]
def solution = new SimplexSolver().optimize(cost, constraints, GoalType.MINIMIZE)
if (solution != null) {
printf "Opt: %.2f%n", solution.value
println solution.point.collect{ sprintf '%.2f', it }.join(', ')
}
Opt: 12.08
-0.00, 0.05, 0.45, 1.87, 0.50, 0.00
Diet problem (Google OrTools)
import …
Loader.loadNativeLibraries()
static addConstraint(MPSolver solver, List vars, MPVariable comp, List coeffs) {
// encodes comp == sum(coeffs[i] * vars[i]) by moving comp to the left-hand side of an equality (0..0) constraint
solver.makeConstraint(0, 0).tap {constraint ->
constraint.setCoefficient(comp, -1)
vars.indices.each { i ->
constraint.setCoefficient(vars[i], coeffs[i])
}
}
}
MPSolver.createSolver("SCIP").with {solver ->
def bread = makeNumVar(0, infinity(),'bread')
def milk = makeNumVar(0, 1, 'milk')
def cheese = makeNumVar(0, infinity(), 'cheese')
def potato = makeNumVar(0, infinity(), 'potato')
def fish = makeNumVar(0.5, infinity(), 'fish')
def yogurt = makeNumVar(0, infinity(), 'yogurt')
def food = [bread, milk, cheese, potato, fish, yogurt]
…
Diet problem (Google OrTools)
…
def cost = makeNumVar(0, infinity(),'Cost')
addConstraint(solver, food, cost, [2.0, 3.5, 8.0, 1.5, 11.0, 1.0])
def protein = makeNumVar(0, 10,'Protein')
addConstraint(solver, food, protein, [4.0, 8.0, 7.0, 1.3, 8.0, 9.2])
def fat = makeNumVar(8, infinity(),'Fat')
addConstraint(solver, food, fat, [1.0, 5.0, 9.0, 0.1, 7.0, 1.0])
def carbs = makeNumVar(10, infinity(),'Carbs')
addConstraint(solver, food, carbs, [15.0, 11.7, 0.4, 22.6, 0.0, 17.0])
def cals = makeNumVar(300, infinity(),'Calories')
addConstraint(solver, food, cals, [90, 120, 106, 97, 130, 180])
def components = [protein, fat, carbs, cals]
objective().with {objective ->
objective.setCoefficient(cost, 1)
objective.setMinimization()
def result = solve()
println result
if (result == OPTIMAL) {
println "Solution: " + objective.value()
println "Foods: ${food.collect{ "${it.name()} ${it.solutionValue()}" }}"
println "Components: ${components.collect{ "${it.name()} ${it.solutionValue()}" }}"
println "Iterations: ${iterations()}, Wall time: ${wallTime()}ms"
} else {
System.err.println "The problem does not have an optimal solution!"
}
}
}
Diet problem (Google OrTools)
OPTIMAL
Solution: 12.081337881108173
Foods: [bread 0.0, milk 0.05359877488514538, cheese 0.44949881665042474, potato 1.8651677572045107, fish 0.5, yogurt 0.0]
Components: [Protein 10.0, Fat 8.0, Carbs 42.95969650563832, Calories 300.0]
Iterations: 4, Wall time: 89ms
Groovy Community Information
• groovy.apache.org
• groovy-lang.org
• github.com/apache/groovy
• groovycommunity.com
• [email protected]
• [email protected]
• @ApacheGroovy
• objectcomputing.com/training/catalog/groovy-programming/
© 2021 Object Computing, Inc. (OCI). All rights reserved.
CONNECT WITH US
+1 (314) 579-0066
@objectcomputing
objectcomputing.com
THANK YOU
Find me on twitter
@paulk_asert
© 2021 Object Computing, Inc. (OCI). All rights reserved.
References
• https://www.analyticsvidhya.com/blog/2017/02/lintroductory-guide-on-linear-programming-explained-in-simple-english/
• Computational Mixed-Integer Programming
https://www.birs.ca/cmo-workshops/2018/18w5208/files/GleixnerAmbros.pdf
• Integer Linear Programming: Graphical Introduction
https://www.youtube.com/watch?v=RhHhy-8sz-4
• Integer Linear Programming: Binary Constraints
https://www.youtube.com/watch?v=-3my1TkyFiM
https://www.youtube.com/watch?v=B3biWsBLeCw
https://www.youtube.com/watch?v=MO8uQnIch6I
• Linear Programming: Alternate solutions, Infeasibility, Unboundedness, & Redundancy
https://www.youtube.com/watch?v=eMA0LWsRQQ