Machine Learning with Clojure and Apache Spark

Slides for my EuroClojure 2016 talk on machine learning.

Eric Weinstein

October 25, 2016

Transcript

  1. Machine Learning with Clojure and Apache Spark ;; Eric Weinstein

    ;; EuroClojure 2016 ;; Bratislava, Slovakia ;; 25 October 2016
  2. for Joshua

  3. Part 0: Hello!

  4. About Me (def eric-weinstein {:employer "Hulu" :github "ericqweinstein" :twitter "ericqweinstein"

    :website "ericweinste.in"}) 30% off with EURORUBY30!
  5. Agenda • Machine learning • Apache Spark • Flambo vs.

    Sparkling • DL4J, deep learning, and convolutional neural networks
  6. Part 1: ⚡✨

  7. What’s machine learning?

  8. In a word:

  9. Generalization

  10. What’s Supervised Learning? Classification or regression, generalizing from labeled data

    to unlabeled data
  11. What’s Apache Spark? Apache Spark is an open-source cluster computing

    framework; its parallelism makes it ideal for processing large data sets, and in ML, the more data, the better!
  12. Some Spark Terminology • RDD: Resilient Distributed Dataset • Dataset:

    RDD + Spark SQL execution engine • DataFrame: Dataset organized into named columns
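
    As a quick illustration of the RDD idea, here is a minimal sketch using Sparkling (which appears later in the deck); sc is assumed to be an already-created SparkContext:

    (require '[sparkling.core :as spark])

    ;; Distribute a local collection as an RDD, transform it in parallel,
    ;; and collect the results back on the driver: 2 3 4 5 6.
    (->> (spark/parallelize sc [1 2 3 4 5])
         (spark/map inc)
         spark/collect)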
  13. Our Data • Police stop data for the city of

    Los Angeles, California in 2015 • 4 features, ~600,000 instances • http://bit.ly/2f9jVwn
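
    A minimal sketch of how this data might be loaded with Sparkling (the file name, location, and comma-separated layout are assumptions for illustration, not details from the talk):

    (require '[sparkling.core :as spark]
             '[clojure.string :as str])

    (def raw-stops
      ;; One line per police stop, e.g. "M,H,VEH,Y"
      (spark/text-file sc "data/la-stops-2015.csv"))

    (def stops
      ;; Split each CSV line into a vector of its four fields.
      (spark/map #(str/split % #",") raw-stops))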
  14. Features && Labels • Sex (Male | Female) • Race

    (American Indian | Asian | Black | Hispanic | White | Other) • Stop type (Pedestrian | Vehicle) • Post-stop activity (Yes | No)
  15. Features && Labels • Sex (Male | Female) • Race

    (American Indian | Asian | Black | Hispanic | White | Other) • Stop type (Pedestrian | Vehicle) • Post-stop activity (Yes | No)
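
    MLlib's decision tree works on numeric feature vectors, so each categorical value has to be mapped to an index and each row turned into a LabeledPoint. A rough sketch of that encoding (the index maps, column order, and abbreviations are assumptions for illustration):

    (import '[org.apache.spark.mllib.regression LabeledPoint]
            '[org.apache.spark.mllib.linalg Vectors])

    ;; Assumed encodings; the real mappings aren't shown in the deck.
    (def sex->idx  {"M" 0.0 "F" 1.0})
    (def race->idx {"I" 0.0 "A" 1.0 "B" 2.0 "H" 3.0 "W" 4.0 "O" 5.0})
    (def stop->idx {"PED" 0.0 "VEH" 1.0})

    (defn row->labeled-point
      "Turns one [sex race stop-type post-stop-activity] row into a LabeledPoint."
      [[sex race stop-type activity]]
      (LabeledPoint. (if (= activity "Y") 1.0 0.0)
                     (Vectors/dense (double-array [(sex->idx sex)
                                                   (race->idx race)
                                                   (stop->idx stop-type)]))))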
  16. Decision Trees

    [Decision tree diagram: each node shows its split test (e.g. X[0] <= 0.5), its Gini impurity, the number of samples, and the class counts. The root node reads gini = 0.4033, samples = 139572, value = [100477, 39095].]
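
    For context, each node's gini value is its Gini impurity, 1 - Σᵢ pᵢ². At the root, for example, 1 - ((100477/139572)² + (39095/139572)²) ≈ 0.4033, which is the value shown in the diagram.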
  17. Part 2: A Tale of Two DSLs vs. ✨✨ Image

    credit: Adventure Time
  18. Flambo Example

    (defn make-spark-context
      "Creates the Apache Spark context using the Flambo DSL."
      []
      (-> (conf/spark-conf)
          (conf/master "local")
          (conf/app-name "euroclojure")
          (f/spark-context)))
  19. Sparkling Example

    (defn make-spark-context
      "Creates the Apache Spark context using the Sparkling DSL."
      []
      (-> (conf/spark-conf)
          (conf/master "local")
          (conf/app-name "euroclojure")
          (spark/spark-context)))
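
    Both snippets lean on aliased requires that aren't shown on the slides; in their respective namespaces they would presumably look roughly like this:

    ;; Flambo version (aliases assumed from the snippet above)
    (require '[flambo.conf :as conf]
             '[flambo.api  :as f])

    ;; Sparkling version
    (require '[sparkling.conf :as conf]
             '[sparkling.core :as spark])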
  20. Straight Spark

    (def model
      (DecisionTree/trainClassifier training 2 categorical-features-info
                                    "gini" 5 32)) ; max depth: 5, max bins: 32

    (defn predict
      [p] ; p is a LabeledPoint
      (let [prediction (.predict model (.features p))]
        [(.label p) prediction]))
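
    One plausible shape for the surrounding split-and-evaluate code (the labeled-data RDD, the 70/30 split, and the variable names are assumptions for illustration):

    ;; Split the LabeledPoint RDD into training and test sets.
    (def splits    (.randomSplit labeled-data (double-array [0.7 0.3])))
    (def training  (first splits))
    (def test-data (second splits))

    ;; Fraction of test points whose predicted label matches the true label.
    (def accuracy
      (let [results (map predict (.collect test-data))
            correct (count (filter (fn [[label prediction]] (= label prediction)) results))]
        (double (/ correct (count results)))))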
  21. Accuracy: 0.77352

  22. Part 3: Deep Learning

  23. What’s Deep Learning? • Neural networks (computational architecture modeled after

    the human brain) • Neural networks with many layers (> 1 hidden layer, but in practice, can be hundreds) • The vanishing/exploding gradient problem
  24. Vanishing && Exploding Gradients

  25. Image credit for all ConvNet images: https://deeplearning4j.org/convolutionalnets

  26. Max Pooling/Downsampling

  27. Alternating Layers

  28. Our Data Image credit: http://digitalmedia.fws.gov/cdm/

  29. What’s DL4J? • DL4J == Deep Learning 4 Java, a

    library (for Java, unsurprisingly) • Examples on GitHub: https://github.com/deeplearning4j/deeplearning4j • ConvNet worked example: http://bit.ly/2eBM8ss
  30. DL4J Example

    (def nn-conf
      (-> (NeuralNetConfiguration$Builder.) ;; Some values omitted for space
          (.activation "relu")
          (.learningRate 0.0001)
          (.weightInit (WeightInit/XAVIER))
          (.optimizationAlgo OptimizationAlgorithm/STOCHASTIC_GRADIENT_DESCENT)
          (.updater Updater/RMSPROP)
          (.momentum 0.9)
          (.list)
          (.layer 0 conv-init)
          (.layer 1 (max-pool "maxpool1" (int-array [2 2])))
          (.layer 2 (conv-5x5 "cnn2" 100 (int-array [5 5]) (int-array [1 1]) 0))
          (.layer 3 (max-pool "maxpool2" (int-array [2 2])))
          (.layer 4 (fully-connected 500))
          (.layer 5 output-layer)
          (.build)))
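
    The conv-init, max-pool, conv-5x5, fully-connected, and output-layer helpers aren't shown in the deck; a rough reconstruction of a few of them against the DL4J builder API (a sketch, not the talk's actual definitions) might look like:

    (import '[org.deeplearning4j.nn.conf.layers
              ConvolutionLayer$Builder SubsamplingLayer$Builder
              SubsamplingLayer$PoolingType DenseLayer$Builder])

    (defn conv-5x5
      "Convolution layer; out is the number of output channels."
      [layer-name out kernel stride pad]
      (-> (ConvolutionLayer$Builder. kernel stride (int-array [pad pad]))
          (.name layer-name)
          (.nOut out)
          (.build)))

    (defn max-pool
      "Max-pooling (downsampling) layer with the given kernel size."
      [layer-name kernel]
      (-> (SubsamplingLayer$Builder. SubsamplingLayer$PoolingType/MAX kernel)
          (.name layer-name)
          (.build)))

    (defn fully-connected
      "Dense (fully connected) layer with n output units."
      [n]
      (-> (DenseLayer$Builder.)
          (.nOut n)
          (.build)))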
  31. How’d We Do? • Accuracy: 0.375 • Precision: 0.3333 •

    Recall: 0.375 • F1 Score: 0.3529
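
    For reference, the F1 score is the harmonic mean of precision and recall: 2PR / (P + R) = (2 × 0.3333 × 0.375) / (0.3333 + 0.375) ≈ 0.3529, matching the figure on the slide.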
  32. Summary • Clojure + Spark = • Flambo and Sparkling

    are roughly equally powerful • Deep learning is super doable with Clojure (though Java interop is kind of a pain)
  33. Takeaways (TL;DPA) • Contribute to Flambo and/or Sparkling! • Let’s

    build or contribute to a nicer DSL for DL4J • https://github.com/ericqweinstein/euroclojure