Machine Learning with Clojure and Apache Spark

Slides for my EuroClojure 2016 talk on machine learning.

Eric Weinstein

October 25, 2016

Transcript

  1. Machine Learning with
    Clojure and Apache Spark
    ;; Eric Weinstein
    ;; EuroClojure 2016
    ;; Bratislava, Slovakia
    ;; 25 October 2016

  2. for Joshua

  3. Part 0: Hello!

  4. About Me
    (def eric-weinstein
      {:employer "Hulu"
       :github   "ericqweinstein"
       :twitter  "ericqweinstein"
       :website  "ericweinste.in"})
    30% off with EURORUBY30!

  5. Agenda
    • Machine learning
    • Apache Spark
    • Flambo vs. Sparkling
    • DL4J, deep learning, and convolutional neural networks

  6. Part 1: ⚡✨

  7. What’s machine learning?

  8. In a word:

  9. Generalization

  10. What’s Supervised Learning?
    Classification or regression, generalizing from
    labeled data to unlabeled data

  11. What’s Apache Spark?
    Apache Spark is an open-source cluster computing
    framework; its parallelism makes it ideal for
    processing large data sets, and in ML, the more
    data, the better!

  12. Some Spark Terminology
    • RDD: Resilient Distributed Dataset
    • Dataset: RDD + Spark SQL execution engine
    • DataFrame: Dataset organized into named columns

  13. Our Data
    • Police stop data for the city of Los Angeles, California, in 2015
    • 4 features, ~600,000 instances
    • http://bit.ly/2f9jVwn

  14. Features && Labels
    • Sex (Male | Female)
    • Race (American Indian | Asian | Black | Hispanic | White | Other)
    • Stop type (Pedestrian | Vehicle)
    • Post-stop activity (Yes | No)

  15. Features && Labels
    • Sex (Male | Female)
    • Race (American Indian | Asian | Black | Hispanic | White | Other)
    • Stop type (Pedestrian | Vehicle)
    • Post-stop activity (Yes | No)
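
A tree learner needs these categorical fields as numbers. One common approach is to map each category to its index, sketched below in plain Clojure. The category ordering, the map keys, and the choice of post-stop activity as the label are assumptions made for illustration; the slides do not specify them.

```clojure
;; Hypothetical category orderings; the talk does not specify them.
(def sexes ["Male" "Female"])
(def races ["American Indian" "Asian" "Black" "Hispanic" "White" "Other"])
(def stop-types ["Pedestrian" "Vehicle"])

(defn index-of
  "Returns the position of `value` in `categories` as a double,
  or throws if the category is unknown."
  [categories value]
  (let [i (.indexOf ^java.util.List categories value)]
    (if (neg? i)
      (throw (ex-info "Unknown category" {:value value}))
      (double i))))

(defn encode-stop
  "Encodes one stop record as {:label l :features [...]}.
  Assumes post-stop activity is the label (Yes = 1.0, No = 0.0)."
  [{:keys [sex race stop-type post-stop-activity]}]
  {:label    (if (= post-stop-activity "Yes") 1.0 0.0)
   :features [(index-of sexes sex)
              (index-of races race)
              (index-of stop-types stop-type)]})
```

With this encoding, a six-valued field such as race takes indices 0.0 through 5.0, so a split like "feature <= 2.5" separates the first three categories from the rest.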

  16. Decision Trees
    X[0] <= 0.5, gini = 0.4033, samples = 139572, value = [100477, 39095]
      True: X[1] <= 5.5, gini = 0.4318, samples = 102419, value = [70118, 32301]
        X[1] <= 4.5, gini = 0.4399, samples = 96665, value = [65083, 31582]
          X[1] <= 3.5, gini = 0.4483, samples = 78400, value = [51805, 26595]
            X[1] <= 2.5, gini = 0.4324, samples = 51662, value = [35328, 16334]
              X[1] <= 0.5, gini = 0.4406, samples = 48927, value = [32894, 16033]
                leaf: gini = 0.4658, samples = 65, value = [41, 24]
                leaf: gini = 0.4406, samples = 48862, value = [32853, 16009]
              leaf: gini = 0.1959, samples = 2735, value = [2434, 301]
            leaf: gini = 0.473, samples = 26738, value = [16477, 10261]
          leaf: gini = 0.397, samples = 18265, value = [13278, 4987]
        leaf: gini = 0.2187, samples = 5754, value = [5035, 719]
      False: X[1] <= 5.5, gini = 0.2989, samples = 37153, value = [30359, 6794]
        X[1] <= 3.5, gini = 0.3067, samples = 34817, value = [28234, 6583]
          X[1] <= 2.5, gini = 0.2796, samples = 15786, value = [13133, 2653]
            X[1] <= 0.5, gini = 0.2921, samples = 13985, value = [11501, 2484]
              leaf: gini = 0.426, samples = 26, value = [18, 8]
              leaf: gini = 0.2918, samples = 13959, value = [11483, 2476]
            leaf: gini = 0.1701, samples = 1801, value = [1632, 169]
          X[1] <= 4.5, gini = 0.3277, samples = 19031, value = [15101, 3930]
            leaf: gini = 0.3747, samples = 9522, value = [7144, 2378]
            leaf: gini = 0.2732, samples = 9509, value = [7957, 1552]
        leaf: gini = 0.1643, samples = 2336, value = [2125, 211]
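
The `gini` figure on each node is the Gini impurity of that node's class counts, one minus the sum of squared class proportions. A minimal sketch in plain Clojure follows; the function names are illustrative and not taken from the talk's code.

```clojure
(defn gini
  "Gini impurity of a node given its per-class sample counts."
  [counts]
  (let [total (double (reduce + counts))]
    (- 1.0 (reduce + (map (fn [c] (let [p (/ c total)] (* p p))) counts)))))

(defn weighted-gini
  "Impurity after a split: each child's impurity weighted by its share
  of the samples. `children` is a collection of per-class count vectors."
  [children]
  (let [total (double (reduce + (mapcat identity children)))]
    (reduce + (map (fn [c] (* (/ (reduce + c) total) (gini c))) children))))

;; Root node counts from the slide; evaluates to roughly 0.4033.
(gini [100477 39095])
```

A split is worth making when the weighted impurity of the children is lower than the parent's; for the root split shown on the slide, `(weighted-gini [[70118 32301] [30359 6794]])` comes out below `(gini [100477 39095])`.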

  17. Part 2: A Tale of Two DSLs
    vs. ✨✨
    Image credit: Adventure Time

  18. Flambo Example
    (require '[flambo.conf :as conf]
             '[flambo.api :as f])

    (defn make-spark-context
      "Creates the Apache Spark context using the Flambo DSL."
      []
      (-> (conf/spark-conf)
          (conf/master "local")
          (conf/app-name "euroclojure")
          (f/spark-context)))

  19. Sparkling Example
    (require '[sparkling.conf :as conf]
             '[sparkling.core :as spark])

    (defn make-spark-context
      "Creates the Apache Spark context using the Sparkling DSL."
      []
      (-> (conf/spark-conf)
          (conf/master "local")
          (conf/app-name "euroclojure")
          (spark/spark-context)))

  20. Straight Spark
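    (import '[org.apache.spark.mllib.tree DecisionTree])

    ;; Arguments: training data, 2 classes, categorical feature arities,
    ;; impurity measure, max depth 5, max bins 32.
    (def model
      (DecisionTree/trainClassifier
       training 2 categorical-features-info "gini" 5 32))

    (defn predict
      "Returns [actual-label predicted-label] for a LabeledPoint."
      [p]
      (let [prediction (.predict model (.features p))]
        [(.label p) prediction]))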

  21. Accuracy: 0.77352
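
An accuracy figure like this is the fraction of test points whose predicted label matches the actual label. As a rough sketch, assuming the `[label prediction]` pairs returned by `predict` have been collected into an ordinary Clojure sequence (the talk's code works on Spark RDDs, and the function name here is illustrative):

```clojure
(defn accuracy
  "Fraction of [label prediction] pairs in which the two values agree."
  [pairs]
  (/ (count (filter (fn [[label prediction]] (== label prediction)) pairs))
     (double (count pairs))))

;; Three of these four made-up pairs agree.
(accuracy [[1.0 1.0] [0.0 0.0] [1.0 0.0] [0.0 0.0]])
```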

  22. Part 3: Deep Learning

  23. What’s Deep Learning?
    • Neural networks (computational architecture modeled after the human brain)
    • Neural networks with many layers (> 1 hidden layer, but in practice, can be hundreds)
    • The vanishing/exploding gradient problem
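
To see why gradients can vanish, note that backpropagation multiplies one activation derivative per layer, and the sigmoid's derivative never exceeds 0.25. The sketch below is a simplification that ignores the weights entirely and assumes sigmoid activations throughout; it illustrates the effect and is not code from the talk.

```clojure
(defn sigmoid [x]
  (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn sigmoid-prime
  "Derivative of the sigmoid; its maximum is 0.25, at x = 0."
  [x]
  (let [s (sigmoid x)]
    (* s (- 1.0 s))))

(defn gradient-scale
  "Largest factor that `n` stacked sigmoid activations can contribute
  to a backpropagated gradient (weights ignored)."
  [n]
  (Math/pow (sigmoid-prime 0.0) n))

;; Ten layers already shrink the signal to under one millionth.
(gradient-scale 10)
```

Large weights can push the product the other way, which is the exploding half of the problem.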

  24. Vanishing && Gradients

  25. Image credit for all ConvNet images: https://deeplearning4j.org/convolutionalnets

  26. Max Pooling/Downsampling
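
Max pooling shrinks a feature map by keeping only the largest value in each window. A minimal plain-Clojure sketch for non-overlapping 2x2 windows follows; it is an illustration of the idea only, unrelated to how DL4J implements its pooling layers, and the function name is made up.

```clojure
(defn max-pool-2x2
  "Downsamples a matrix (a vector of equal-length rows) by taking the
  maximum of each non-overlapping 2x2 window. A trailing odd row or
  column is dropped."
  [matrix]
  (vec (for [[top bottom] (partition 2 matrix)]
         (mapv (fn [a b] (apply max (concat a b)))
               (partition 2 top)
               (partition 2 bottom)))))

;; A 4x4 input becomes a 2x2 output: [[6 8] [9 6]]
(max-pool-2x2 [[1 3 2 4]
               [5 6 7 8]
               [9 2 1 0]
               [3 4 5 6]])
```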

  27. Alternating Layers

  28. Our Data
    Image credit: http://digitalmedia.fws.gov/cdm/

  29. What’s DL4J?
    • DL4J == Deep Learning 4 Java, a library (for Java, unsurprisingly)
    • Examples on GitHub: https://github.com/deeplearning4j/deeplearning4j
    • ConvNet worked example: http://bit.ly/2eBM8ss

  30. DL4J Example
    (def nn-conf
      (-> (NeuralNetConfiguration$Builder.)
          ;; Some values omitted for space
          (.activation "relu")
          (.learningRate 0.0001)
          (.weightInit WeightInit/XAVIER)
          (.optimizationAlgo OptimizationAlgorithm/STOCHASTIC_GRADIENT_DESCENT)
          (.updater Updater/RMSPROP)
          (.momentum 0.9)
          (.list)
          (.layer 0 conv-init)
          (.layer 1 (max-pool "maxpool1" (int-array [2 2])))
          (.layer 2 (conv-5x5 "cnn2" 100 (int-array [5 5]) (int-array [1 1]) 0))
          (.layer 3 (max-pool "maxpool2" (int-array [2 2])))
          (.layer 4 (fully-connected 500))
          (.layer 5 output-layer)
          (.build)))

  31. How’d We Do?
    • Accuracy: 0.375
    • Precision: 0.3333
    • Recall: 0.375
    • F1 Score: 0.3529
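
The F1 score is the harmonic mean of precision and recall, so the last figure can be checked against the two above it. A small plain-Clojure sketch (the function name is illustrative, not from the talk's code):

```clojure
(defn f1-score
  "Harmonic mean of precision and recall; 0.0 when both are zero."
  [precision recall]
  (if (zero? (+ precision recall))
    0.0
    (double (/ (* 2 precision recall) (+ precision recall)))))

;; Precision and recall as reported on the slide; roughly 0.3529.
(f1-score 0.3333 0.375)
```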

  32. Summary
    • Clojure + Spark =
    • Flambo and Sparkling are roughly equally powerful
    • Deep learning is super doable with Clojure (though Java interop is kind of a pain)

  33. Takeaways (TL;DPA)
    • Contribute to Flambo and/or Sparkling!
    • Let’s build or contribute to a nicer DSL for DL4J
    • https://github.com/ericqweinstein/euroclojure
