Machine Learning with Clojure and Apache Spark

Slides for my EuroClojure 2016 talk on machine learning.

Eric Weinstein

October 25, 2016

Transcript

  1. Machine Learning with
    Clojure and Apache Spark
    ;; Eric Weinstein
    ;; EuroClojure 2016
    ;; Bratislava, Slovakia
    ;; 25 October 2016

  2. Part 0: Hello!

  3. About Me
    (def eric-weinstein
      {:employer "Hulu"
       :github   "ericqweinstein"
       :twitter  "ericqweinstein"
       :website  "ericweinste.in"})
    30% off with EURORUBY30!

  4. Agenda
    • Machine learning
    • Apache Spark
    • Flambo vs. Sparkling
    • DL4J, deep learning, and convolutional neural
    networks

  5. Part 1: ⚡✨

  6. What’s machine learning?

  7. Generalization

  8. What’s Supervised Learning?
    Classification or regression, generalizing from
    labeled data to unlabeled data

  9. What’s Apache Spark?
    Apache Spark is an open-source cluster computing
    framework; its parallelism makes it ideal for
    processing large data sets, and in ML, the more
    data, the better!

  10. Some Spark Terminology
    • RDD: Resilient Distributed Dataset
    • Dataset: RDD + Spark SQL execution engine
    • DataFrame: Dataset organized into named
    columns

  11. Our Data
    • Police stop data for the city of Los Angeles,
    California in 2015
    • 4 features, ~600,000 instances
    • http://bit.ly/2f9jVwn

  12. Features && Labels
    • Sex (Male | Female)
    • Race (American Indian | Asian | Black |
    Hispanic | White | Other)
    • Stop type (Pedestrian | Vehicle)
    • Post-stop activity (Yes | No)
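
Before training, categorical values like these need to become numbers. The following is a minimal sketch of one way that could look, assuming post-stop activity is the label and the other three fields are features. The category orderings, the keyword names, and the `index-of` and `encode-stop` helpers are hypothetical and not taken from the talk's code.

```clojure
;; Hypothetical category orderings; the real data set may encode these differently.
(def sexes      ["Male" "Female"])
(def races      ["American Indian" "Asian" "Black" "Hispanic" "White" "Other"])
(def stop-types ["Pedestrian" "Vehicle"])

(defn index-of
  "Returns the position of value in categories as a double, or throws if absent."
  [categories value]
  (let [i (.indexOf ^java.util.List categories value)]
    (if (neg? i)
      (throw (ex-info "Unknown category" {:value value :categories categories}))
      (double i))))

(defn encode-stop
  "Maps one stop record to {:label 0.0|1.0, :features [sex race stop-type]}."
  [{:keys [sex race stop-type post-stop-activity]}]
  {:label    (if (= "Yes" post-stop-activity) 1.0 0.0)
   :features [(index-of sexes sex)
              (index-of races race)
              (index-of stop-types stop-type)]})

(encode-stop {:sex "Female" :race "Hispanic" :stop-type "Vehicle"
              :post-stop-activity "No"})
;; => {:label 0.0, :features [1.0 3.0 1.0]}
```

A label plus a vector of doubles is the same shape that a Spark MLlib `LabeledPoint` carries, so a map like this could be converted to one in a later step.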

  14. Decision Trees
    [Figure: decision tree. Each node lists its split (if any), Gini impurity,
    sample count, and per-class counts. Indentation shows parent/child structure,
    reconstructed from the sample counts; the True/False labels on the root's
    branches come from the figure.]
    X[0] <= 0.5 | gini = 0.4033 | samples = 139572 | value = [100477, 39095]
      True: X[1] <= 5.5 | gini = 0.4318 | samples = 102419 | value = [70118, 32301]
        X[1] <= 4.5 | gini = 0.4399 | samples = 96665 | value = [65083, 31582]
          X[1] <= 3.5 | gini = 0.4483 | samples = 78400 | value = [51805, 26595]
            X[1] <= 2.5 | gini = 0.4324 | samples = 51662 | value = [35328, 16334]
              X[1] <= 0.5 | gini = 0.4406 | samples = 48927 | value = [32894, 16033]
                leaf | gini = 0.4658 | samples = 65 | value = [41, 24]
                leaf | gini = 0.4406 | samples = 48862 | value = [32853, 16009]
              leaf | gini = 0.1959 | samples = 2735 | value = [2434, 301]
            leaf | gini = 0.473 | samples = 26738 | value = [16477, 10261]
          leaf | gini = 0.397 | samples = 18265 | value = [13278, 4987]
        leaf | gini = 0.2187 | samples = 5754 | value = [5035, 719]
      False: X[1] <= 5.5 | gini = 0.2989 | samples = 37153 | value = [30359, 6794]
        X[1] <= 3.5 | gini = 0.3067 | samples = 34817 | value = [28234, 6583]
          X[1] <= 2.5 | gini = 0.2796 | samples = 15786 | value = [13133, 2653]
            X[1] <= 0.5 | gini = 0.2921 | samples = 13985 | value = [11501, 2484]
              leaf | gini = 0.426 | samples = 26 | value = [18, 8]
              leaf | gini = 0.2918 | samples = 13959 | value = [11483, 2476]
            leaf | gini = 0.1701 | samples = 1801 | value = [1632, 169]
          X[1] <= 4.5 | gini = 0.3277 | samples = 19031 | value = [15101, 3930]
            leaf | gini = 0.3747 | samples = 9522 | value = [7144, 2378]
            leaf | gini = 0.2732 | samples = 9509 | value = [7957, 1552]
        leaf | gini = 0.1643 | samples = 2336 | value = [2125, 211]
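
The `gini` figure on each node is the Gini impurity of that node's class counts: one minus the sum of squared class proportions. A minimal sketch, where the function name `gini` is chosen here for illustration:

```clojure
(defn gini
  "Gini impurity of a node, given its per-class sample counts."
  [counts]
  (let [total (double (reduce + counts))]
    (- 1.0
       (reduce + (map (fn [c]
                        (let [p (/ c total)]
                          (* p p)))
                      counts)))))

;; Root node counts from the tree above; the result is approximately 0.4033.
(gini [100477 39095])
```

A node containing only one class scores 0.0, and an evenly split two-class node scores 0.5, which is why the purer leaves in the tree show the lowest values.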

  15. Part 2: A Tale of Two DSLs
    Flambo vs. Sparkling ✨✨
    Image credit: Adventure Time

  16. Flambo Example
    ;; Assumes (require '[flambo.conf :as conf] '[flambo.api :as f])
    (defn make-spark-context
      "Creates the Apache Spark context using the Flambo DSL."
      []
      (-> (conf/spark-conf)
          (conf/master "local")
          (conf/app-name "euroclojure")
          (f/spark-context)))

  17. Sparkling Example
    ;; Assumes (require '[sparkling.conf :as conf] '[sparkling.core :as spark])
    (defn make-spark-context
      "Creates the Apache Spark context using the Sparkling DSL."
      []
      (-> (conf/spark-conf)
          (conf/master "local")
          (conf/app-name "euroclojure")
          (spark/spark-context)))

  18. Straight Spark
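    ;; Assumes (import '[org.apache.spark.mllib.tree DecisionTree]); `training` and
    ;; `categorical-features-info` are defined elsewhere.
    (def model
      (DecisionTree/trainClassifier training 2 categorical-features-info
                                    "gini" 5 32)) ; max depth: 5, max bins: 32

    (defn predict
      [p] ; LabeledPoint
      (let [prediction (.predict model (.features p))]
        [(.label p) prediction]))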
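
The `categorical-features-info` argument is not shown on the slide. MLlib's Java API takes it as a `java.util.Map<Integer, Integer>` from feature index to number of categories, so one way it might be built is sketched below. The feature order and arities are assumptions read off the feature list earlier in the deck, not taken from the talk's code.

```clojure
;; Feature index -> number of categories.
;; Indices and arities are assumed: 0 = sex, 1 = race, 2 = stop type.
(def categorical-features-info
  (doto (java.util.HashMap.)
    (.put (int 0) (int 2))    ; sex: Male | Female
    (.put (int 1) (int 6))    ; race: six listed categories
    (.put (int 2) (int 2))))  ; stop type: Pedestrian | Vehicle
```

The `(int …)` casts matter because Clojure integer literals are `Long`s, while the Java signature expects `Integer` keys and values.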

  19. Accuracy: 0.77352
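
Accuracy here is the fraction of test instances whose predicted label matches the true label. A minimal sketch over `[label prediction]` pairs, the shape that `predict` returns; the `accuracy` function and the sample pairs are illustrative, not the talk's data:

```clojure
(defn accuracy
  "Fraction of [label prediction] pairs whose two elements agree."
  [pairs]
  (if (empty? pairs)
    0.0
    (/ (double (count (filter (fn [[label prediction]] (== label prediction))
                              pairs)))
       (count pairs))))

;; Made-up pairs for illustration: three of four predictions are correct.
(accuracy [[1.0 1.0] [0.0 1.0] [0.0 0.0] [1.0 1.0]])
;; => 0.75
```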

  20. Part 3: Deep Learning

  21. What’s Deep Learning?
    • Neural networks (computational architecture
    modeled after the human brain)
    • Neural networks with many layers (> 1 hidden
    layer, but in practice, can be hundreds)
    • The vanishing/exploding gradient problem

  22. Vanishing && Exploding Gradients
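
One way to see the vanishing side of the problem: backpropagation multiplies one activation derivative per layer, and the sigmoid's derivative never exceeds 0.25. The sketch below is a simplification that assumes sigmoid activations and weights of 1; the function names are chosen here for illustration.

```clojure
(defn sigmoid
  [x]
  (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn sigmoid-deriv
  "Derivative of the sigmoid at x; its maximum, 0.25, occurs at x = 0."
  [x]
  (let [s (sigmoid x)]
    (* s (- 1.0 s))))

(defn gradient-scale
  "Product of sigmoid derivatives at pre-activation x across n layers,
  with every weight taken as 1 (a simplifying assumption)."
  [n x]
  (Math/pow (sigmoid-deriv x) n))

;; Even in the best case (x = 0) the factor starts at 0.25 for one layer
;; and falls below 1e-12 by 20 layers.
(map #(gradient-scale % 0.0) [1 5 10 20])
```

With weights larger than 1 the same product can instead grow without bound, which is the exploding case.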

  23. Image credit for all ConvNet images: https://deeplearning4j.org/convolutionalnets

  24. Max Pooling/Downsampling
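
Max pooling keeps only the largest value in each window of a feature map, shrinking it while preserving the strongest activations. A minimal plain-Clojure sketch for non-overlapping 2x2 windows; `max-pool-2x2` is a hypothetical helper, not DL4J's implementation, and it assumes a matrix with even dimensions:

```clojure
(defn max-pool-2x2
  "Downsamples a matrix (a vector of equal-length row vectors with even
  dimensions) by taking the max of each non-overlapping 2x2 window."
  [matrix]
  (vec (for [[top bottom] (partition 2 matrix)]
         (mapv (fn [a b] (apply max (concat a b)))
               (partition 2 top)
               (partition 2 bottom)))))

(max-pool-2x2 [[1 3 2 4]
               [5 6 7 8]
               [3 2 1 0]
               [1 2 3 4]])
;; => [[6 8] [3 4]]
```

A 4x4 input becomes 2x2, so each pooling layer cuts the number of values passed to the next layer by a factor of four.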

  25. Alternating Layers

  26. Our Data
    Image credit: http://digitalmedia.fws.gov/cdm/

  27. What’s DL4J?
    • DL4J == Deep Learning 4 Java, a library (for
    Java, unsurprisingly)
    • Examples on GitHub:
    https://github.com/deeplearning4j/deeplearning4j
    • ConvNet worked example: http://bit.ly/2eBM8ss

  28. DL4J Example
    (def nn-conf
      (-> (NeuralNetConfiguration$Builder.)
          ;; Some values omitted for space
          (.activation "relu")
          (.learningRate 0.0001)
          (.weightInit WeightInit/XAVIER)
          (.optimizationAlgo OptimizationAlgorithm/STOCHASTIC_GRADIENT_DESCENT)
          (.updater Updater/RMSPROP)
          (.momentum 0.9)
          (.list)
          (.layer 0 conv-init)
          (.layer 1 (max-pool "maxpool1" (int-array [2 2])))
          (.layer 2 (conv-5x5 "cnn2" 100 (int-array [5 5]) (int-array [1 1]) 0))
          (.layer 3 (max-pool "maxpool2" (int-array [2 2])))
          (.layer 4 (fully-connected 500))
          (.layer 5 output-layer)
          (.build)))

  29. How’d We Do?
    • Accuracy: 0.375
    • Precision: 0.3333
    • Recall: 0.375
    • F1 Score: 0.3529
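
The F1 score above is consistent with the harmonic mean of the reported precision and recall. A minimal sketch of that calculation, with `f1-score` as a name chosen here for illustration:

```clojure
(defn f1-score
  "Harmonic mean of precision and recall; 0.0 when both are zero."
  [precision recall]
  (if (zero? (+ precision recall))
    0.0
    (/ (* 2.0 precision recall)
       (+ precision recall))))

;; Using the precision and recall reported above; the result is approximately 0.3529.
(f1-score 0.3333 0.375)
```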

  30. Summary
    • Clojure + Spark =
    • Flambo and Sparkling are roughly equally
    powerful
    • Deep learning is super doable with Clojure
    (though Java interop is kind of a pain)

  31. Takeaways (TL;DPA)
    • Contribute to Flambo and/or Sparkling!
    • Let’s build or contribute to a nicer DSL for
    DL4J
    • https://github.com/ericqweinstein/euroclojure
