Machine Learning with Clojure and Apache Spark

Slides for my EuroClojure 2016 talk on machine learning.

Eric Weinstein

October 25, 2016

Transcript

  1. Machine Learning with Clojure and Apache Spark
     ;; Eric Weinstein
     ;; EuroClojure 2016
     ;; Bratislava, Slovakia
     ;; 25 October 2016
  2. Agenda
     • Machine learning
     • Apache Spark
     • Flambo vs. Sparkling
     • DL4J, deep learning, and convolutional neural networks
  3. What’s Apache Spark?
     Apache Spark is an open-source cluster computing framework; its parallelism makes it ideal for processing large data sets, and in ML, the more data, the better!
  4. Some Spark Terminology
     • RDD: Resilient Distributed Dataset
     • Dataset: RDD + Spark SQL execution engine
     • DataFrame: Dataset organized into named columns
  5. Our Data
     • Police stop data for the city of Los Angeles, California in 2015
     • 4 features, ~600,000 instances
     • http://bit.ly/2f9jVwn
  6. Features && Labels
     • Sex (Male | Female)
     • Race (American Indian | Asian | Black | Hispanic | White | Other)
     • Stop type (Pedestrian | Vehicle)
     • Post-stop activity (Yes | No)
  7. Features && Labels
     • Sex (Male | Female)
     • Race (American Indian | Asian | Black | Hispanic | White | Other)
     • Stop type (Pedestrian | Vehicle)
     • Post-stop activity (Yes | No)
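
Tree learners such as Spark's decision trees work on numeric vectors, so categorical values like these are usually mapped to indices first. The following is a minimal sketch of that encoding in plain Clojure, not the talk's actual preprocessing. It assumes the first three bullets are features and post-stop activity is the label (the slide does not say so explicitly); the keyword names, the index order, and the helper names are hypothetical.

```clojure
(def categories
  "Category values copied from the slide. The index order is an assumption."
  {:sex       ["Male" "Female"]
   :race      ["American Indian" "Asian" "Black" "Hispanic" "White" "Other"]
   :stop-type ["Pedestrian" "Vehicle"]})

(def feature-order
  "Hypothetical column order for the feature vector."
  [:sex :race :stop-type])

(defn encode-value
  "Maps a category string to its index (as a double) within its feature."
  [feature value]
  (let [idx (.indexOf ^java.util.List (get categories feature) value)]
    (when (neg? idx)
      (throw (ex-info "Unknown category" {:feature feature :value value})))
    (double idx)))

(defn encode-row
  "Turns a map of category strings into a vector of numeric indices."
  [row]
  (mapv #(encode-value % (get row %)) feature-order))

(def categorical-features-info
  "Feature index -> number of categories for that feature."
  (into {} (map-indexed (fn [i k] [i (count (get categories k))])
                        feature-order)))

(encode-row {:sex "Female" :race "Hispanic" :stop-type "Vehicle"})
;; => [1.0 3.0 1.0]
```

A map shaped like `categorical-features-info` tells a tree learner which columns hold categories rather than continuous values. Converting it to the exact Java types Spark expects is left out here.
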
  8. Decision Trees
     X[0] <= 0.5, gini = 0.4033, samples = 139572, value = [100477, 39095]
       True: X[1] <= 5.5, gini = 0.4318, samples = 102419, value = [70118, 32301]
         X[1] <= 4.5, gini = 0.4399, samples = 96665, value = [65083, 31582]
           X[1] <= 3.5, gini = 0.4483, samples = 78400, value = [51805, 26595]
             X[1] <= 2.5, gini = 0.4324, samples = 51662, value = [35328, 16334]
               X[1] <= 0.5, gini = 0.4406, samples = 48927, value = [32894, 16033]
                 gini = 0.4658, samples = 65, value = [41, 24]
                 gini = 0.4406, samples = 48862, value = [32853, 16009]
               gini = 0.1959, samples = 2735, value = [2434, 301]
             gini = 0.473, samples = 26738, value = [16477, 10261]
           gini = 0.397, samples = 18265, value = [13278, 4987]
         gini = 0.2187, samples = 5754, value = [5035, 719]
       False: X[1] <= 5.5, gini = 0.2989, samples = 37153, value = [30359, 6794]
         X[1] <= 3.5, gini = 0.3067, samples = 34817, value = [28234, 6583]
           X[1] <= 2.5, gini = 0.2796, samples = 15786, value = [13133, 2653]
             X[1] <= 0.5, gini = 0.2921, samples = 13985, value = [11501, 2484]
               gini = 0.426, samples = 26, value = [18, 8]
               gini = 0.2918, samples = 13959, value = [11483, 2476]
             gini = 0.1701, samples = 1801, value = [1632, 169]
           X[1] <= 4.5, gini = 0.3277, samples = 19031, value = [15101, 3930]
             gini = 0.3747, samples = 9522, value = [7144, 2378]
             gini = 0.2732, samples = 9509, value = [7957, 1552]
         gini = 0.1643, samples = 2336, value = [2125, 211]
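
The `gini` figure on each node is the Gini impurity of that node's class counts: one minus the sum of squared class proportions. As a small sketch in plain Clojure (the function names `gini` and `weighted-gini` are my own, not from the talk), the root and one leaf from the slide can be checked by hand:

```clojure
(defn gini
  "Gini impurity of a node, given its per-class sample counts."
  [counts]
  (let [total (double (reduce + counts))]
    (- 1.0
       (reduce + (map (fn [c]
                        (let [p (/ c total)]
                          (* p p)))
                      counts)))))

(defn weighted-gini
  "Impurity after a split: each child's impurity weighted by its share of samples."
  [children]
  (let [total (double (reduce + (mapcat identity children)))]
    (reduce + (map (fn [counts]
                     (* (/ (reduce + counts) total) (gini counts)))
                   children))))

(gini [100477 39095]) ; root node, roughly 0.4033
(gini [41 24])        ; the 65-sample leaf, roughly 0.4658
```

A split is useful when `weighted-gini` of the two children is lower than `gini` of the parent; for the root's children on the slide, `(weighted-gini [[70118 32301] [30359 6794]])` comes out slightly below the root's impurity.
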
  9. Flambo Example
         (defn make-spark-context
           "Creates the Apache Spark context using the Flambo DSL."
           []
           (-> (conf/spark-conf)
               (conf/master "local")
               (conf/app-name "euroclojure")
               (f/spark-context)))
  10. Sparkling Example
          (defn make-spark-context
            "Creates the Apache Spark context using the Sparkling DSL."
            []
            (-> (conf/spark-conf)
                (conf/master "local")
                (conf/app-name "euroclojure")
                (spark/spark-context)))
  11. Straight Spark
          (def model
            (DecisionTree/trainClassifier training
                                          2 ; number of classes
                                          categorical-features-info
                                          "gini"
                                          5 32)) ; max depth: 5, max bins: 32

          (defn predict [p] ; LabeledPoint
            (let [prediction (.predict model (.features p))]
              [(.label p) prediction]))
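
Since `predict` returns a `[label prediction]` pair, a natural next step is to score a collection of such pairs. This is a sketch in plain Clojure operating on an ordinary sequence of pairs; the `accuracy` function is hypothetical and not taken from the talk, and collecting the pairs out of a Spark RDD is not shown.

```clojure
(defn accuracy
  "Fraction of [label prediction] pairs where the prediction matches the label.
   Returns 0.0 for an empty collection."
  [pairs]
  (let [n (count pairs)]
    (if (zero? n)
      0.0
      (/ (double (count (filter (fn [[label prediction]]
                                  (== label prediction))
                                pairs)))
         n))))

;; Four made-up pairs, three of them correct.
(accuracy [[1.0 1.0] [0.0 1.0] [0.0 0.0] [1.0 1.0]])
;; => 0.75
```
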
  12. What’s Deep Learning?
      • Neural networks (computational architecture modeled after the human brain)
      • Neural networks with many layers (> 1 hidden layer, but in practice, can be hundreds)
      • The vanishing/exploding gradient problem
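
To make the vanishing/exploding gradient bullet concrete, here is a deliberately simplified sketch in plain Clojure. It assumes a chain of single-unit sigmoid layers that all share one weight `w` and all sit at pre-activation 0, so backpropagation multiplies the gradient by `w * sigmoid'(0) = w * 0.25` at every layer. Real networks are messier; the function names are my own.

```clojure
(defn sigmoid [x]
  (/ 1.0 (+ 1.0 (Math/exp (- (double x))))))

(defn sigmoid-derivative
  "Derivative of the sigmoid; its maximum value is 0.25, at x = 0."
  [x]
  (let [s (sigmoid x)]
    (* s (- 1.0 s))))

(defn gradient-scale
  "Factor by which a gradient is scaled after passing back through
   n-layers identical layers (toy model, see the assumptions above)."
  [n-layers w]
  (Math/pow (* (double w) (sigmoid-derivative 0.0)) (double n-layers)))

(gradient-scale 10 1.0) ; 0.25^10, under one millionth: the gradient vanishes
(gradient-scale 10 8.0) ; 2^10 = 1024.0: the gradient explodes
```

With a per-layer factor below 1 the signal shrinks exponentially with depth, and with a factor above 1 it grows exponentially, which is why very deep networks need careful initialization and activation choices.
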
  13. What’s DL4J?
      • DL4J == Deep Learning 4 Java, a library (for Java, unsurprisingly)
      • Examples on GitHub: https://github.com/deeplearning4j/deeplearning4j
      • ConvNet worked example: http://bit.ly/2eBM8ss
  14. DL4J Example
          (def nn-conf
            (-> (NeuralNetConfiguration$Builder.)
                ;; Some values omitted for space
                (.activation "relu")
                (.learningRate 0.0001)
                (.weightInit (WeightInit/XAVIER))
                (.optimizationAlgo OptimizationAlgorithm/STOCHASTIC_GRADIENT_DESCENT)
                (.updater Updater/RMSPROP)
                (.momentum 0.9)
                (.list)
                (.layer 0 conv-init)
                (.layer 1 (max-pool "maxpool1" (int-array [2 2])))
                (.layer 2 (conv-5x5 "cnn2" 100 (int-array [5 5]) (int-array [1 1]) 0))
                (.layer 3 (max-pool "maxpool2" (int-array [2 2])))
                (.layer 4 (fully-connected 500))
                (.layer 5 output-layer)
                (.build)))
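
The configuration above leans on a common interop pattern: threading a Java builder through `->`, so that each method call receives the return value of the previous one. The sketch below shows the same pattern with `java.lang.StringBuilder` from the JDK as a stand-in, so it runs without DL4J on the classpath. The function names are hypothetical.

```clojure
(defn build-greeting
  "Threads a StringBuilder through a chain of calls with ->.
   Each form receives the return value of the previous call."
  [^String name]
  (-> (StringBuilder.)
      (.append "Hello, ")
      (.append name)
      (.append "!")
      (.toString)))

(defn build-greeting-doto
  "Same result with doto, which always passes the original object
   and ignores what each call returns."
  [^String name]
  (str (doto (StringBuilder.)
         (.append "Hello, ")
         (.append name)
         (.append "!"))))

(build-greeting "Ada")
;; => "Hello, Ada!"
```

The distinction matters for fluent APIs where a call may hand back a different builder object than the one it was invoked on: `->` follows the returned object, while `doto` keeps calling methods on the first one.
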
  15. Summary
      • Clojure + Spark =
      • Flambo and Sparkling are roughly equally powerful
      • Deep learning is super doable with Clojure (though Java interop is kind of a pain)
  16. Takeaways (TL;DPA)
      • Contribute to Flambo and/or Sparkling!
      • Let’s build or contribute to a nicer DSL for DL4J
      • https://github.com/ericqweinstein/euroclojure