Machine Learning Fundamentals in Clojure

By Paul English
Twitter: @logymxm
Email: [email protected]

Paul English
Student & Software Developer
Twitter: @logymxm
Email: [email protected]
Red Brain Labs

• Small Team: a few computer scientists, a couple of bald mathematicians, and an analytics guy
• Clojure (Ruby, Python, Node.js, R, C#, whatever works best or is needed)
• Predictive Analytics (Automated Decisions, Machine Learning, Visualizations & Intelligence)
• Math Mornings (Stochastic Calculus for Finance; Friends of Red Brain Labs)

• Why Lisp? Why Clojure?
• Managing Data & IO
• Analysis
• Features & Data Manipulation
• Machine Learning, Training, & Evaluation
• Distributed Computing Overview

Why Lisp?
• 1958: List Processing
• Second-oldest high-level language, after Fortran
• Practical mathematical notation (lambda calculus)
• A favorite for early AI research

Why Clojure?
• A modern Lisp
• Laziness, concurrency, sequence manipulation, and more
• Everything great about Lisp abstractions
• Portability and integration of the JVM
• Still a favorite for AI applications

Managing Data & IO
• Loading Data (CSV, Database)
• Working Lazily
• Development & Production Differences

Loading Data
• clojure.data.csv
• clojure.java.io
• SQL Korma (DSL for SQL using JDBC)
• Anything Java can do, we can do better and with fewer lines of code.

Laziness
(use 'clojure.data.csv)
(use 'clojure.java.io)
(defn lazy-read-csv
  [csv-file]
  (let [in-file (reader csv-file)
        csv-seq (read-csv in-file)
        ;; Walk the rows lazily and close the reader once the last row is consumed.
        lazy (fn lazy [wrapped]
               (lazy-seq
                 (if-let [s (seq wrapped)]
                   (cons (first s) (lazy (rest s)))
                   (.close in-file))))]
    (lazy csv-seq)))
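A minimal sketch of the same concern using only the standard library: the lazy line sequence is consumed while the reader is still open, so nothing is read after it closes. The helper name `sum-second-column` and the sample input are made up for illustration.

```clojure
(require '[clojure.string :as str])
(import '(java.io BufferedReader StringReader))

(defn sum-second-column
  "Hypothetical helper: sums the integer in the second CSV field of each line."
  [source]
  (with-open [rdr (BufferedReader. source)]
    ;; line-seq is lazy; reduce forces it before with-open closes the reader.
    (reduce + (map (fn [line] (Long/parseLong (second (str/split line #","))))
                   (line-seq rdr)))))

(sum-second-column (StringReader. "a,1\nb,2\nc,3"))
;; => 6
```

Returning the unrealized sequence from inside `with-open` would fail later with a closed stream, which is the problem the self-closing wrapper on the slide works around.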

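Data In Development
;; Uses lazy-read-csv from the Laziness slide; clojure.data.csv/read-csv
;; would parse the file name itself as CSV text instead of opening the file.
(def big-list-of-superheros
  (lazy-read-csv "marvel-superheros.csv"))
(def development-sample
  (take 100 big-list-of-superheros))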

Features & Data Manipulation
• Map/Reduce and the Sequence Library
• Functional Data Manipulation

Map/Reduce
• Who needs Hadoop?
• JK, distributed processing is cool.
• We're already thinking functionally!

Functional Data Manipulation
(defn column
  "Returns the ith column from a matrix as a sequence"
  [M i]
  (map (fn [m] (nth m i)) M))

Functional Data Manipulation
(defn argmax
  "Returns the element of coll that produces the maximum result of the function f"
  [f coll]
  (reduce (fn [a b] (if (> (f a) (f b)) a b)) coll))
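A short usage sketch for the two helpers above. The matrix `M` is made-up sample data, and the functions are repeated so the example stands alone.

```clojure
(defn column [M i] (map (fn [m] (nth m i)) M))

(defn argmax [f coll]
  (reduce (fn [a b] (if (> (f a) (f b)) a b)) coll))

(def M [[1 9] [4 2] [3 5]])

(column M 1)            ;; => (9 2 5)
(argmax first M)        ;; => [4 2], the row with the largest first entry
(argmax #(apply + %) M) ;; => [1 9], the row with the largest sum
```

Because the comparison is a strict `>`, a tie resolves to the later element of the collection.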

Elementary Row Operations
(defn interchange-rows
  "Produce a new matrix from `M` with rows `a` & `b` swapped"
  [M a b]
  (assoc M b (M a) a (M b)))

Elementary Row Operations
(defn scaler-multiply
  "Returns row `a` of the matrix `M` multiplied by the scalar `s`"
  [M s a]
  (map #(* % s) (M a)))

Elementary Row Operations
(defn add-row-to-row
  "Adds the values of a row `a` from the matrix `M` to the row `b`, returning
  the updated matrix. Assumes that `a` is the values of the row, and `b` is the
  index of the row to be operated on."
  [M a b]
  (assoc M b (vec (map (fn [a_i b_i] (+ a_i b_i)) a (M b)))))

The Start Of Our Own Matrix Math Library
;; scale-factor and partial-pivot are not shown on the slides.
(defn eliminate
  [M pivot]
  (let [m (count M)]
    (reduce (fn [M i]
              (let [scaler-factor (scale-factor M i pivot)
                    scaled-row (scaler-multiply M scaler-factor pivot)]
                (add-row-to-row M scaled-row i)))
            M
            (range (inc pivot) m))))
(defn row-echelon-form
  "Computes the row-echelon form of a matrix, `A`, using Gaussian elimination."
  [A]
  (let [m (count A)]
    (reduce (fn [A pivot]
              (let [B (partial-pivot A pivot)]
                (eliminate B pivot)))
            A
            (range 0 m))))
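The slides do not show `scale-factor` or `partial-pivot`. The sketch below fills them in with one plausible reading (both definitions are assumptions, not the author's code) so the elimination can be run end to end. It assumes a vector-of-vectors matrix with at least as many columns as rows; the sample matrix is made up.

```clojure
(defn interchange-rows [M a b] (assoc M b (M a) a (M b)))
(defn scaler-multiply [M s a] (map #(* % s) (M a)))
(defn add-row-to-row [M a b] (assoc M b (mapv + a (M b))))

;; Hypothetical: the multiple of the pivot row that cancels row i's entry
;; in the pivot column (0 when the pivot entry is itself zero).
(defn scale-factor [M i pivot]
  (let [p (get-in M [pivot pivot])]
    (if (zero? p) 0 (- (/ (get-in M [i pivot]) p)))))

;; Hypothetical: swap in the row at or below `pivot` whose entry in the
;; pivot column has the largest absolute value.
(defn partial-pivot [M pivot]
  (let [best (apply max-key
                    #(Math/abs (double (get-in M [% pivot])))
                    (range pivot (count M)))]
    (interchange-rows M pivot best)))

(defn eliminate [M pivot]
  (reduce (fn [M i]
            (add-row-to-row M (scaler-multiply M (scale-factor M i pivot) pivot) i))
          M
          (range (inc pivot) (count M))))

(defn row-echelon-form [A]
  (reduce (fn [A pivot] (eliminate (partial-pivot A pivot) pivot))
          A
          (range 0 (count A))))

;; Integer input keeps the arithmetic exact through Clojure's ratios.
(= (row-echelon-form [[2 1 -1] [-3 -1 2] [-2 1 2]])
   [[-3 -1 2] [0 5/3 2/3] [0 0 1/5]])
;; => true
```

Pivoting on the largest absolute value is the usual choice for floating-point input; with exact ratios it only changes the row order.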

Transforming Data: One Hot Encoding
;; vacant?, zeros, and indexes-of are helpers not shown on the slides.
(defn one-hot-encode
  "Will fit and encode an array of categorical values using one-hot encoding.
  Assumes that empty lists and nil values aren't supposed to be encoded."
  [& values]
  (let [categories (remove vacant? (distinct (apply concat values)))
        empty (vec (zeros (count categories)))
        encoded (map (fn [row]
                       (if (> (count row) 1)
                         (map (fn [category]
                                (let [index (first (indexes-of category categories))]
                                  (if index
                                    (assoc empty index 1)
                                    empty)))
                              row)
                         empty))
                     values)]
    [categories encoded]))
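A simplified sketch of the idea for a single flat column of categories. It is not the slide's function: `one-hot` and its internals are hypothetical names, and it skips the row-of-rows handling above.

```clojure
(defn one-hot
  "Hypothetical simplification: returns [categories encoded-rows] for a flat
  collection of values; nil encodes to an all-zero row."
  [values]
  (let [categories (vec (distinct (remove nil? values)))
        index-of   (zipmap categories (range))
        zero-row   (vec (repeat (count categories) 0))]
    [categories
     (mapv (fn [v]
             (if-let [i (index-of v)]
               (assoc zero-row i 1)
               zero-row))
           values)]))

(one-hot [:flying :regenerates nil :flying])
;; => [[:flying :regenerates] [[1 0] [0 1] [0 0] [1 0]]]
```

Returning the category vector alongside the encoding keeps the column order available for decoding or for encoding new rows consistently.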

Transforming Data: Transpose
(defn transpose [m]
  (apply mapv vector m))

Multiple Data Files
;; read-csv here is called with a path and a file name; that helper is not shown.
(defn read-lf1 [dir]
  (let [path (.getPath dir)]
    (map concat
         (read-csv path "TimeTicks1")
         (read-csv path "LF1I")
         (read-csv path "LF1V"))))

Let’s speed it up!
• Parallel: pmap, preduce
• Reducers (fork/join)
• Native Arrays: areduce, amap
• Hip Hip Array!
• core.matrix (BLAS support!)

Hip Hip Array!
;; 1ms for 10k doubles: 20 MFlops
(defn dot-product [^doubles ws ^doubles xs]
  (reduce + (map * ws xs)))
;; 8.5 us for 10k doubles: 2.3 GFlops
;; (11 us with *unchecked-math* false)
(defn dot-product [^doubles ws ^doubles xs]
  (areduce xs i ret 0.0
    (+ ret (* (aget xs i) (aget ws i)))))
;; 8.5 us for 10k doubles: 2.3 GFlops
(require '[hiphip.double :as dbl])
(defn dot-product [ws xs]
  (dbl/asum [x xs w ws] (* x w)))
// 8.5 us for 10k doubles: 2.3 GFlops
public double dotProduct(double[] ws, double[] xs) {
  double result = 0.0;
  for (int i = 0; i < ws.length; ++i) {
    result += ws[i] * xs[i];
  }
  return result;
}
http://blog.getprismatic.com/blog/2013/7/10/introducing-hiphip-array-fast-and-flexible-numerical-computation-in-clojure
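A small check, using only core Clojure, that the sequence version and the `areduce` version compute the same dot product. The two function names are changed here so both can coexist, and the input arrays are made up.

```clojure
(defn dot-product-seq [ws xs]
  ;; Boxes every element and builds an intermediate lazy sequence.
  (reduce + (map * ws xs)))

(defn dot-product-array [^doubles ws ^doubles xs]
  ;; Loops over the primitive arrays directly, with no intermediate sequence.
  (areduce xs i ret 0.0
    (+ ret (* (aget xs i) (aget ws i)))))

(def ws (double-array [1 2 3]))
(def xs (double-array [4 5 6]))

(dot-product-seq ws xs)   ;; => 32.0
(dot-product-array ws xs) ;; => 32.0
```

The type hints matter for speed: without `^doubles`, `aget` falls back to reflection.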

Core.Matrix
(use 'clojure.core.matrix)
(use 'clojure.core.matrix.operators) ; matrix-aware + and *
(+ [[1 2] [3 4]]
   (* (identity-matrix 2) 3.0))
=> [[4.0 2.0] [3.0 7.0]]

Analysis With Incanter
• Summary Statistics
• Scaling/Normalizing, Relationships
• Time Series Data
• Graphing and plotting
• and More

Summary Statistics
(require '[incanter.core :as i]
         'incanter.io
         '[incanter.stats :as s])
(def data-file "data/all_160.P3.csv")
(def census (incanter.io/read-dataset data-file :header true))
(i/$rollup :mean :POP100 :STATE census)
(i/$rollup s/sd :POP100 :STATE census)
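As a rough picture of what a grouped summary like the one above computes, here is a plain-Clojure sketch over a sequence of maps. `rollup` and `mean` are hypothetical stand-ins, not Incanter's implementation, and the rows are made-up values rather than census figures.

```clojure
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn rollup
  "Hypothetical stand-in for a grouped summary: applies f to the values of
  value-key within each group-key group."
  [f value-key group-key rows]
  (into {}
        (for [[group members] (group-by group-key rows)]
          [group (f (map value-key members))])))

;; Made-up rows, not real census data.
(def rows [{:STATE "A" :POP100 10}
           {:STATE "A" :POP100 30}
           {:STATE "B" :POP100 5}])

(rollup mean :POP100 :STATE rows)
;; => {"A" 20.0, "B" 5.0}
```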

Time Series
(require '[incanter.core :as i]
         'incanter.io
         '[incanter.zoo :as zoo]
         '[clj-time.format :as tf])
(def ^:dynamic *formatter* (tf/formatter "dd-MMM-yy"))
(defn parse-date [date]
  (tf/parse *formatter* date))
(def data
  (i/with-data
    (i/col-names
      (incanter.io/read-dataset data-file)
      [:date-str :open :high :low :close :volume])
    (->> (i/$map parse-date :date-str)
         (i/dataset [:date])
         (i/conj-cols i/$data))))
(def data-zoo (zoo/zoo data :date))
(def data-roll5
  (->> (i/sel data-zoo :cols :close)
       (zoo/roll-mean 5)
       (i/dataset [:five-day])
       (i/conj-cols data-zoo)))
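The five-day rolling mean can be pictured with a plain-Clojure sketch. This `roll-mean` is a hypothetical local function, not the `incanter.zoo` one: it returns only the full windows, and how the library treats the first n-1 positions is not covered here. The input numbers are made up.

```clojure
(defn roll-mean
  "Hypothetical sketch: mean of every full window of n consecutive values."
  [n xs]
  (map (fn [window] (/ (reduce + window) (double n)))
       (partition n 1 xs)))

(roll-mean 5 [1 2 3 4 5 6 7])
;; => (3.0 4.0 5.0)
```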

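Graphing/Plotting
(require '[incanter.core :as i]
         '[incanter.charts :as c])
;; iris is assumed to be an already loaded dataset,
;; e.g. (incanter.datasets/get-dataset :iris).
(def iris-petal-scatter
  (c/scatter-plot (i/sel iris :cols :Petal.Width)
                  (i/sel iris :cols :Petal.Length)
                  :title "Irises: Petal Width by Petal Length"
                  :x-label "Width (cm)"
                  :y-label "Length (cm)"))
(i/view iris-petal-scatter)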

Machine Learning
• A few algorithms implemented in Clojure
• WEKA
• clj-ml

Logistic Regression
;; minus, plus, mult, mmult, trans, and matrix are Incanter matrix functions;
;; labels and data-matrix are not shown on the slide.
(defn sigmoid [z]
  (/ 1 (+ 1 (Math/exp (- z)))))
(defn weights [init c mat]
  (let [start-values init
        alpha 0.001
        error (minus labels (map sigmoid (mmult mat start-values)))
        stop-values (plus start-values (mult alpha (mmult (trans mat) error)))]
    (if (> (+ c 1) 500)
      start-values
      (weights stop-values (+ c 1) mat))))
(println (weights (matrix 1 3 1) 1 data-matrix))
https://github.com/jandot/machine-learning-in-action/blob/master/logistic-regression.clj
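The update on the slide is batch gradient ascent: w ← w + α·Xᵀ(labels − sigmoid(X·w)). A dependency-free sketch of that same step follows, with plain vectors in place of Incanter matrices. The names `dot`, `step`, and `train`, the learning rate, and the two-point data set are all made up for illustration.

```clojure
(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn dot [u v] (reduce + (map * u v)))

(defn step
  "One batch gradient-ascent update: w + alpha * X^T (labels - sigmoid(X w))."
  [alpha rows labels w]
  (let [errors   (map (fn [row y] (- y (sigmoid (dot row w)))) rows labels)
        ;; (apply map f rows) walks the columns of X, giving X^T * errors.
        gradient (apply map (fn [& col] (dot col errors)) rows)]
    (mapv (fn [w-j g-j] (+ w-j (* alpha g-j))) w gradient)))

(defn train
  "Applies `step` a fixed number of times, starting from w0."
  [alpha iterations rows labels w0]
  (nth (iterate (partial step alpha rows labels) w0) iterations))

;; Made-up data: the first entry of each row is a constant bias input.
(def rows   [[1.0 -1.0] [1.0 1.0]])
(def labels [0 1])

(step 0.5 rows labels [0.0 0.0])
;; => [0.0 0.5]

(def w (train 0.5 100 rows labels [0.0 0.0]))
(> (sigmoid (dot [1.0 1.0] w)) 0.5)  ;; => true
(< (sigmoid (dot [1.0 -1.0] w)) 0.5) ;; => true
```

Using `iterate` with a fixed count mirrors the slide's 500-iteration cutoff without the explicit counter argument.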

K-Nearest Neighbors
;; Docstrings moved ahead of the argument vectors so Clojure attaches them.
(defn calculate-distance
  "Calculates distance between 2 vectors. The higher the result, the more different the vectors."
  [v1 v2]
  (/ (count (filter #(= false %) (map = v1 v2)))
     (count v1)))
(defn count-occurences
  "Counts occurrences in a vector"
  [v]
  (partition 2 (interleave (set v) (map #(count (filter #{%} v)) (set v)))))
(defn majority-vote
  "Calculates which item appears most often in a vector. CAUTION: in case two items appear equally often, will pick at random"
  [v]
  (first (last (sort-by second (count-occurences v)))))
(defn classify [sample training-set k]
  (let [distances (pmap #(calculate-distance (:pattern %) sample) training-set)
        labels (map :label training-set)
        distances-with-labels (map first
                                   (take k (sort-by second
                                                    (partition 2 (interleave labels distances)))))]
    (majority-vote distances-with-labels)))
https://github.com/jandot/machine-learning-in-action/blob/master/knn.clj
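A compact usage sketch of the same classifier. The distance function is the slide's; `majority-vote` is rewritten with `frequencies` in place of the slide's counting helper, and `classify` sorts the training records directly. The training set is made up.

```clojure
(defn calculate-distance
  "Fraction of positions at which the two vectors differ."
  [v1 v2]
  (/ (count (filter false? (map = v1 v2)))
     (count v1)))

(defn majority-vote
  "Most frequent label; frequencies is swapped in for the slide's count-occurences."
  [labels]
  (key (apply max-key val (frequencies labels))))

(defn classify [sample training-set k]
  (->> training-set
       (sort-by #(calculate-distance (:pattern %) sample))
       (take k)
       (map :label)
       majority-vote))

;; Made-up training data.
(def training-set
  [{:pattern [1 1 1] :label :a}
   {:pattern [1 1 0] :label :a}
   {:pattern [0 0 0] :label :b}])

(classify [1 1 1] training-set 2)
;; => :a
```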

Perceptron
https://github.com/fffej/ClojureProjects/blob/master/neural-networks/src/uk/co/fatvat/perceptron.clj
(defn create-network [in]
  (repeat in 0))
(defn run-network [input weights]
  (if (pos? (reduce + (map * input weights))) 1 0))
(def learning-rate 0.05)
(defn- update-weights [weights inputs error]
  (map (fn [weight input] (+ weight (* learning-rate error input)))
       weights inputs))
(defn train
  ([samples expecteds]
   (train samples expecteds (create-network (count (first samples)))))
  ([samples expecteds weights]
   (if (empty? samples)
     weights
     (let [sample (first samples)
           expected (first expecteds)
           actual (run-network sample weights)
           error (- expected actual)]
       (recur (rest samples) (rest expecteds) (update-weights weights sample error))))))
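The `train` function above makes a single pass over the samples, which is rarely enough to converge. A sketch of repeating that pass for several epochs follows; `train-once` and `train-epochs` are hypothetical names, and the OR data set with a constant bias input is made up for illustration.

```clojure
(def learning-rate 0.05)

(defn run-network [input weights]
  (if (pos? (reduce + (map * input weights))) 1 0))

(defn train-once
  "One pass over the samples, applying the same update rule as the slide."
  [samples expecteds weights]
  (reduce (fn [ws [sample expected]]
            (let [error (- expected (run-network sample ws))]
              (mapv (fn [w x] (+ w (* learning-rate error x))) ws sample)))
          weights
          (map vector samples expecteds)))

(defn train-epochs
  "Hypothetical helper: repeats the single pass `epochs` times from zero weights."
  [epochs samples expecteds]
  (nth (iterate (partial train-once samples expecteds)
                (vec (repeat (count (first samples)) 0)))
       epochs))

;; Logical OR; the leading 1 in each sample is a constant bias input.
(def samples   [[1 0 0] [1 0 1] [1 1 0] [1 1 1]])
(def expecteds [0 1 1 1])

(def weights (train-epochs 10 samples expecteds))
(map #(run-network % weights) samples)
;; => (0 1 1 1)
```

A perceptron only converges on linearly separable data, so the same loop would never settle on something like XOR.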

WEKA
• Data mining & machine learning libraries for Java, built by the University of Waikato
• clj-ml: Clojure interface for using WEKA

Loading Data
(use 'clj-ml-dev.io)
(def ds (load-instances :arff "marvel-universe.arff"))
(def ds (load-instances :csv "marvel-universe.csv"))
(save-instances :csv "marvel-universe-transformed.csv" ds)
(use 'clj-ml-dev.data)
(def ds
  (make-dataset "super-powers"
                [:name :first-issue {:superpower [:flying :melts-things :regenerates]}]
                [["Wolverine" "Incredible Hulk #180" :regenerates]
                 ["Bruno Horgan" "Tales of Suspense #47" :melts-things]
                 ...]))

Filtering & Attributes
(use '(clj-ml-dev filters io))
(def filtered-ds
  (-> ds
      (unsupervised-string-to-nominal {:attributes [:name :first-issue]})
      (unsupervised-nominal-to-binary {:attributes [:name :first-issue]})))

Classification & Training
(use 'clj-ml-dev.classifiers)
(def classifier (make-classifier :decission-tree :c45))
(dataset-set-class filtered-ds :superpower)
(classifier-train classifier filtered-ds)
(def evaluation
  (classifier-evaluate classifier :cross-validation filtered-ds 10))
(def to-classify
  (make-instance ds {:superpower :metal-suit
                     :name "Iron Man"
                     :first-issue "Tales of Suspense #39"}))
(classifier-classify classifier to-classify)

Saving Your Progress
(use 'clj-ml-dev.utils)
(serialize-to-file classifier "superhero-classifier-svm.bin")

Performance Computing
• Parallel Processing (Reducers, Parallel Computations)
• Hazelcast
• Cascalog

Parallel Processing
• pmap, preduce, and other parallel operations.
• Reducers: map, reduce, filter, fold. Moves all the transformation operations into the reduce step; fold can make use of Java’s Fork/Join framework for parallel processing.

Reducers
;; benchmark, N, and times are not shown on the slide.
(require '[clojure.core.reducers :as r])
(defn old-reduce [nums]
  (reduce + (map inc (map inc (map inc nums)))))
(defn new-reduce [nums]
  (reduce + (r/map inc (r/map inc (r/map inc nums)))))
(defn new-fold [nums]
  (r/fold + (r/map inc (r/map inc (r/map inc nums)))))
(println "Old reduce: " (benchmark old-reduce N times) "ms")
(println "New reduce: " (benchmark new-reduce N times) "ms")
(println "New fold: " (benchmark new-fold N times) "ms")
;; Old reduce: 1450 ms
;; Reducers reduce: 1256 ms
;; Reducers fold: 306 ms
http://adambard.com/blog/clojure-reducers-for-mortals/
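A runnable sketch of the comparison above. The `benchmark` helper here is hypothetical (the original is not shown), and its timings will vary by machine, so no numbers are quoted; the input size is arbitrary.

```clojure
(require '[clojure.core.reducers :as r])

(defn old-reduce [nums]
  (reduce + (map inc (map inc (map inc nums)))))

(defn new-fold [nums]
  (r/fold + (r/map inc (r/map inc (r/map inc nums)))))

(defn benchmark
  "Hypothetical timing helper: total milliseconds for `times` calls of (f nums)."
  [f nums times]
  (let [start (System/nanoTime)]
    (dotimes [_ times] (f nums))
    (/ (- (System/nanoTime) start) 1e6)))

;; fold only runs in parallel over foldable collections such as vectors.
(def nums (vec (range 1000)))

(= (old-reduce nums) (new-fold nums)) ;; => true
(println "Old reduce:" (benchmark old-reduce nums 10) "ms")
(println "New fold:  " (benchmark new-fold nums 10) "ms")
```

On a lazy sequence `r/fold` falls back to an ordinary sequential reduce, so converting the input with `vec` is what makes the parallel path available.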

Hazelcast
• Clustering & Distributed Processing Queue
• Dynamically spool up, configure, and start up jobs on a cluster

Cascalog
• Connect to Hadoop
• DSL inspired by Datalog

Cascalog
(?- (stdout)
    (<- [?word ?count]
        (sentence :> ?line)
        (tokenise :< ?line :> ?word)
        (c/count :> ?count)))
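For comparison, the word count that the query above describes can be sketched on a single machine in plain Clojure. This `tokenise` is a hypothetical stand-in for the one the query uses; lower-casing and splitting on non-word characters are assumptions, and the input lines are made up.

```clojure
(require '[clojure.string :as str])

(defn tokenise
  "Hypothetical tokeniser: lower-cases a line and splits it on non-word characters."
  [line]
  (remove str/blank? (str/split (str/lower-case line) #"\W+")))

(defn word-count [lines]
  (frequencies (mapcat tokenise lines)))

(word-count ["the quick fox" "The lazy dog"])
;; => {"the" 2, "quick" 1, "fox" 1, "lazy" 1, "dog" 1}
```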
