Slide 1

Slide 1 text

Machine Learning with Clojure and Apache Spark ;; Eric Weinstein ;; EuroClojure 2016 ;; Bratislava, Slovakia ;; 25 October 2016

Slide 2

Slide 2 text

for Joshua

Slide 3

Slide 3 text

Part 0: Hello!

Slide 4

Slide 4 text

About Me

(def eric-weinstein
  {:employer "Hulu"
   :github   "ericqweinstein"
   :twitter  "ericqweinstein"
   :website  "ericweinste.in"})

30% off with EURORUBY30!

Slide 5

Slide 5 text

Agenda • Machine learning • Apache Spark • Flambo vs. Sparkling • DL4J, deep learning, and convolutional neural networks

Slide 6

Slide 6 text

Part 1: ⚡✨

Slide 7

Slide 7 text

What’s machine learning?

Slide 8

Slide 8 text

In a word:

Slide 9

Slide 9 text

Generalization

Slide 10

Slide 10 text

What’s Supervised Learning? Classification or regression, generalizing from labeled data to unlabeled data
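
As a minimal sketch of that idea (the data below is illustrative, not from the talk): a supervised learner is shown feature vectors paired with known labels and must predict labels for instances it has never seen.

;; Labeled data: each instance pairs a feature vector with a known label.
(def training-data
  [{:features [0 3 1] :label 1}
   {:features [1 2 0] :label 0}])

;; Unlabeled data: the trained model must generalize to predict this label.
(def new-instance {:features [0 2 1]})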

Slide 11

Slide 11 text

What’s Apache Spark? Apache Spark is an open-source cluster computing framework; its parallelism makes it ideal for processing large data sets, and in ML, the more data, the better!

Slide 12

Slide 12 text

Some Spark Terminology • RDD: Resilient Distributed Dataset • Dataset: RDD + Spark SQL execution engine • DataFrame: Dataset organized into named columns
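
As a brief, hedged sketch of what an RDD feels like from Clojure (assuming a SparkContext sc like the one built on a later slide, with sparkling.core aliased as spark; the numbers are illustrative):

;; (require '[sparkling.core :as spark])
;; An RDD: an in-memory collection split into partitions across the cluster.
(def numbers (spark/parallelize sc [1 2 3 4 5]))

;; Actions such as reduce run in parallel over the partitions.
(spark/reduce + numbers) ; => 15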

Slide 13

Slide 13 text

Our Data • Police stop data for the city of Los Angeles, California in 2015 • 4 features, ~600,000 instances • http://bit.ly/2f9jVwn
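
A hedged sketch of how such a CSV might be pulled into an RDD with Sparkling (the file name and parsing are assumptions, reusing the sc and spark aliases from the sketch above):

;; (require '[clojure.string :as str])
;; Load the stop data as an RDD of lines, then split each row into fields.
(def stops
  (->> (spark/text-file sc "data/la-police-stops-2015.csv")
       (spark/map #(str/split % #","))))

(spark/count stops) ; => on the order of 600,000 rows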

Slide 14

Slide 14 text

Features && Labels • Sex (Male | Female) • Race (American Indian | Asian | Black | Hispanic | White | Other) • Stop type (Pedestrian | Vehicle) • Post-stop activity (Yes | No)

Slide 15

Slide 15 text

Features && Labels • Sex (Male | Female) • Race (American Indian | Asian | Black | Hispanic | White | Other) • Stop type (Pedestrian | Vehicle) • Post-stop activity (Yes | No)
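
A hedged sketch of how these categorical features might be encoded for MLlib: each stop becomes a LabeledPoint whose label is the post-stop activity and whose feature vector holds numeric codes for sex, race, and stop type (the encodings and helper names are assumptions, not the talk's code):

(import '[org.apache.spark.mllib.regression LabeledPoint]
        '[org.apache.spark.mllib.linalg Vectors])

(def sex-codes  {"Male" 0.0 "Female" 1.0})
(def race-codes {"American Indian" 0.0 "Asian" 1.0 "Black" 2.0
                 "Hispanic" 3.0 "White" 4.0 "Other" 5.0})
(def stop-codes {"Pedestrian" 0.0 "Vehicle" 1.0})

(defn row->labeled-point
  "Turns one parsed row [sex race stop-type post-stop-activity] into a
   LabeledPoint suitable for DecisionTree/trainClassifier."
  [[sex race stop-type activity]]
  (LabeledPoint. (if (= activity "Yes") 1.0 0.0)
                 (Vectors/dense (double-array [(sex-codes sex)
                                               (race-codes race)
                                               (stop-codes stop-type)]))))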

Slide 16

Slide 16 text

Decision Trees

[Decision tree diagram: the fitted tree's binary splits on the encoded features, each node annotated with its gini impurity, sample count, and per-class counts (root: X[0] <= 0.5, gini 0.4033, 139,572 samples).]

Slide 17

Slide 17 text

Part 2: A Tale of Two DSLs vs. ✨✨ Image credit: Adventure Time

Slide 18

Slide 18 text

Flambo Example

(defn make-spark-context
  "Creates the Apache Spark context using the Flambo DSL."
  []
  (-> (conf/spark-conf)
      (conf/master "local")
      (conf/app-name "euroclojure")
      (f/spark-context)))

Slide 19

Slide 19 text

Sparkling Example

(defn make-spark-context
  "Creates the Apache Spark context using the Sparkling DSL."
  []
  (-> (conf/spark-conf)
      (conf/master "local")
      (conf/app-name "euroclojure")
      (spark/spark-context)))

Slide 20

Slide 20 text

Straight Spark

(def model
  (DecisionTree/trainClassifier training 2 categorical-features-info
                                "gini" 5 32)) ; impurity: gini, max depth: 5, max bins: 32

(defn predict
  [p] ; LabeledPoint
  (let [prediction (.predict model (.features p))]
    [(.label p) prediction]))
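
To connect this to the accuracy figure on the next slide, a hedged sketch of how the predictions might be scored against a held-out test RDD (the test split itself is not shown in the talk; spark is sparkling.core as before):

;; Fraction of test LabeledPoints whose predicted label matches the true label.
(defn accuracy
  [test-data] ; RDD of LabeledPoint
  (let [results (spark/map predict test-data)
        correct (spark/filter (fn [[label prediction]] (= label prediction)) results)]
    (/ (spark/count correct)
       (double (spark/count results)))))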

Slide 21

Slide 21 text

Accuracy: 0.77352

Slide 22

Slide 22 text

Part 3: Deep Learning

Slide 23

Slide 23 text

What’s Deep Learning? • Neural networks (computational architecture modeled after the human brain) • Neural networks with many layers (> 1 hidden layer, but in practice, can be hundreds) • The vanishing/exploding gradient problem

Slide 24

Slide 24 text

Vanishing && Exploding Gradients

Slide 25

Slide 25 text

Image credit for all ConvNet images: https://deeplearning4j.org/convolutionalnets

Slide 26

Slide 26 text

Max Pooling/Downsampling
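
The DL4J configuration on a later slide calls a max-pool helper; as a hedged sketch (the helper name mirrors the slide, and the builder calls are the standard DL4J 0.x API):

(import '[org.deeplearning4j.nn.conf.layers
          SubsamplingLayer$Builder SubsamplingLayer$PoolingType])

(defn max-pool
  "A max-pooling (downsampling) layer: keeps only the largest activation in
   each kernel-sized window, shrinking the feature maps between convolutions."
  [layer-name kernel-size]
  (-> (SubsamplingLayer$Builder. SubsamplingLayer$PoolingType/MAX kernel-size)
      (.name layer-name)
      (.build)))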

Slide 27

Slide 27 text

Alternating Layers

Slide 28

Slide 28 text

Our Data Image credit: http://digitalmedia.fws.gov/cdm/

Slide 29

Slide 29 text

What’s DL4J? • DL4J == Deep Learning 4 Java, a library (for Java, unsurprisingly) • Examples on GitHub: https://github.com/deeplearning4j/deeplearning4j • ConvNet worked example: http://bit.ly/2eBM8ss

Slide 30

Slide 30 text

DL4J Example

(def nn-conf
  (-> (NeuralNetConfiguration$Builder.)
      ;; Some values omitted for space
      (.activation "relu")
      (.learningRate 0.0001)
      (.weightInit WeightInit/XAVIER)
      (.optimizationAlgo OptimizationAlgorithm/STOCHASTIC_GRADIENT_DESCENT)
      (.updater Updater/RMSPROP)
      (.momentum 0.9)
      (.list)
      (.layer 0 conv-init)
      (.layer 1 (max-pool "maxpool1" (int-array [2 2])))
      (.layer 2 (conv-5x5 "cnn2" 100 (int-array [5 5]) (int-array [1 1]) 0))
      (.layer 3 (max-pool "maxpool2" (int-array [2 2])))
      (.layer 4 (fully-connected 500))
      (.layer 5 output-layer)
      (.build)))
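
The configuration above also leans on helpers (conv-init, conv-5x5, fully-connected, output-layer) that likely mirror the DL4J ConvNet worked example linked on the previous slide; as a hedged sketch of two of them plus the final network (signatures and parameters are assumptions):

(import '[org.deeplearning4j.nn.conf.layers ConvolutionLayer$Builder DenseLayer$Builder]
        '[org.deeplearning4j.nn.multilayer MultiLayerNetwork])

(defn conv-5x5
  "A 5x5 convolution layer with the given output depth, stride, padding, and bias."
  [layer-name out stride pad bias]
  (-> (ConvolutionLayer$Builder. (int-array [5 5]) stride pad)
      (.name layer-name)
      (.nOut out)
      (.biasInit bias)
      (.build)))

(defn fully-connected
  "A dense (fully connected) layer with the given number of output units."
  [out]
  (-> (DenseLayer$Builder.)
      (.nOut out)
      (.build)))

;; The finished configuration is wrapped in a MultiLayerNetwork before training.
(def network
  (doto (MultiLayerNetwork. nn-conf)
    (.init)))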

Slide 31

Slide 31 text

How’d We Do? • Accuracy: 0.375 • Precision: 0.3333 • Recall: 0.375 • F1 Score: 0.3529
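
For reference, the F1 score is just the harmonic mean of precision and recall, which lines up with the figures above:

;; F1 = 2 * (precision * recall) / (precision + recall)
(defn f1-score [precision recall]
  (/ (* 2 precision recall)
     (+ precision recall)))

(f1-score 0.3333 0.375) ; => ~0.3529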

Slide 32

Slide 32 text

Summary • Clojure + Spark = • Flambo and Sparkling are roughly equally powerful • Deep learning is super doable with Clojure (though Java interop is kind of a pain)

Slide 33

Slide 33 text

Takeaways (TL;DPA) • Contribute to Flambo and/or Sparkling! • Let’s build or contribute to a nicer DSL for DL4J • https://github.com/ericqweinstein/euroclojure

Slide 34

Slide 34 text

No content