Simulating Cassandra Cluster Sizes
disclaimer: model is simplified and numbers are made up!
1 node can handle 10K requests per second
Latency is normally distributed, with a mean of 20ms and a standard deviation of 5ms
“Extra” requests add overhead exponentially
Anglican
Slide 25
;; Capacity of a single node (requests per second).
(def base-requests (* 10 1000))

(defquery cluster-latency [n write-rate]
  (let [per-node (/ write-rate n)
        ;; Rate of the exponential overhead term: the further a node is
        ;; pushed past its capacity, the heavier the latency tail.
        overhead (/ 1000
                    (if (> per-node base-requests)
                      (- per-node base-requests)
                      1))]
    (predict :latency
             (+ (sample (exponential overhead))
                (sample (normal 20 5))))))
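A hedged sketch of how such a query might be run (this assumes the pre-1.0 Anglican API that the `predict` form above belongs to, with `doquery` as the entry point and `get-predicts` to pull predicted values out of each sample; the namespace layout and the :lmh sampler choice are likewise assumptions):

(ns cluster-sim
  (:use [anglican core runtime emit]
        [anglican.state :only [get-predicts]]))

;; Approximate the latency distribution for 5 nodes at 50K writes/sec.
(->> (doquery :lmh cluster-latency [5 50000])
     (map get-predicts)
     (map :latency)
     (take 1000))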
disclaimer: model is simplified and numbers are made up!
Simulating Cassandra Cluster Sizes
Slide 26
5 Nodes / 50K requests per second
Simulating Cassandra Cluster Sizes
Slide 27
5 Nodes / 500K requests per second
Simulating Cassandra Cluster Sizes
Subset of Clojure, compiled into CPS-style fns
Stackless language
Built-in memoisation
DSL for building sampling fns for distributions (see the sketch below)
Anglican
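As a hedged illustration of that distribution DSL (the protocol method names changed between Anglican versions; this sketch assumes the `defdist` macro from anglican.runtime with `sample*`/`observe*` methods):

(require '[anglican.runtime :refer [defdist]])

;; A point-mass ("dirac") distribution: sampling always returns x,
;; observing anything else has log-probability negative infinity.
(defdist dirac [x] []
  (sample* [this] x)
  (observe* [this value]
    (if (= x value) 0.0 Double/NEGATIVE_INFINITY)))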
Slide 30
Statistiker
En statistiker er en person som jobber innen faget statistikk.
(Norwegian: "A statistician is a person who works in the field of statistics.")
Slide 31
Implementing the Gaussian Naïve Bayes Algorithm
Slide 32
Implementing Naïve Bayes Algorithm
Slide 33
P(blue) = Number of Blue / Total number of objects
P(red) = Number of Red / Total number of objects
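A quick worked example with made-up counts, just to ground the formulas: with 40 blue and 20 red training points, P(blue) = 40/60 ≈ 0.67 and P(red) = 20/60 ≈ 0.33.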
Slide 34
P(X | blue) = Number of Blue near X / Total number of Blue
P(X | red) = Number of Red near X / Total number of Red
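Continuing the made-up counts: if 3 of the 40 blue points and 1 of the 20 red points fall in the neighbourhood of X, then P(X | blue) = 3/40 = 0.075 and P(X | red) = 1/20 = 0.05.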
Slide 35
P(blue) = Number of Blue / Total number of objects
P(red) = Number of Red / Total number of objects
Model (prior)
(defn make-model
  [train-data]
  (let [total (->> train-data
                   vals
                   (map count)
                   (reduce +))]
    (for [[k v] train-data]
      [k {:p        (/ (count v) total)      ; class prior
          :evidence (->> v
                         transpose           ; one seq per feature column
                         (map (fn [v]
                                {:mean     (mean v)
                                 :variance (variance v)})))}])))
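A hedged usage sketch (the data is invented; the input shape, a map from class label to a collection of feature vectors, is implied by the `vals`, `count` and `transpose` calls above):

(def train-data
  {:blue [[1.0 2.1] [0.9 1.8] [1.2 2.3]]
   :red  [[4.0 0.5] [3.8 0.7]]})

(make-model train-data)
;; => ([:blue {:p 3/5, :evidence ({:mean ... :variance ...} {:mean ... :variance ...})}]
;;     [:red  {:p 2/5, :evidence (...)}])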
Slide 36
Classifier (posterior)
(defn posterior-prob
  "Gaussian density of `point` under the given mean and variance."
  [point variance mean]
  (* (/ 1 (sqrt (* 2 pi variance)))
     (exp (/ (* -1 (pow (- point mean) 2))
             (* 2 variance)))))

;; Per-feature likelihoods for one class: map the density over the
;; point's feature values and that class's :evidence entries.
(map
 (fn [x {:keys [mean variance]}]
   (posterior-prob x variance mean))
 point
 evidence)
P(X | blue) = Number of Blue near X / Total number of Blue
P(X | red) = Number of Red near X / Total number of Red
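Putting prior and likelihood together, as a minimal sketch (the `classify` name is hypothetical; `make-model` and `posterior-prob` are the functions from the slides above): score each class by its prior times the product of per-feature Gaussian likelihoods, then pick the highest-scoring label.

(defn classify
  [model point]
  (->> model
       (map (fn [[label {:keys [p evidence]}]]
              [label (* p
                        (reduce * (map (fn [x {:keys [mean variance]}]
                                         (posterior-prob x variance mean))
                                       point
                                       evidence)))]))
       (apply max-key second)
       first))

(classify (make-model train-data) [1.1 2.0])
;; => :blue (for the made-up train-data above)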
Slide 37
P(X | blue) = Number of Blue near X / Total number of Blue
P(X | red) = Number of Red near X / Total number of Red
Slide 38
Implementing Linear Regression with Gradient Descent
Slide 39
Linear Regression with Gradient Descent
(s/defrecord GradientProblem
  [^{:s ObjectiveFunction} objective-fn
   ^{:s ObjectiveFunctionGradient} objective-fn-gradient])
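A hedged sketch, not the library's actual solver, of how such a record could drive plain gradient descent; the `gradient-descent` name, the fixed learning rate `alpha` and the iteration cap are assumptions:

(defn gradient-descent
  [{:keys [objective-fn-gradient]} initial-point alpha iterations]
  (loop [point initial-point
         i     0]
    (if (= i iterations)
      point
      ;; Standard descent step: move against the gradient.
      (let [grad (apply objective-fn-gradient point)]
        (recur (mapv (fn [p g] (- p (* alpha g))) point grad)
               (inc i))))))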
Slide 40
Linear Regression: Objective Function
Basically, the sum of squared distances between the predicted and the actual Y:
(objective-function
 (fn [intercept slope]
   (let [f   (line intercept slope)
         res (->> points
                  (map (fn [[x y]]
                         (sqr (- y (f x)))))
                  (reduce +))]
     res)))
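The `line` and `sqr` helpers are not shown on the slides; plausible definitions (assumptions, not the original code) would be:

(defn line
  "Linear model y = intercept + slope * x."
  [intercept slope]
  (fn [x] (+ intercept (* slope x))))

(defn sqr
  [x]
  (* x x))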
Slide 41
Linear Regression: Objective Function Gradient
And the gradient part, computed here from the closed-form least-squares solution:
(objective-function-gradient
 (let [factors (->> points
                    (map butlast)
                    (map #(cons 1 %)))
       y       (map last points)]
   (fn [& point]
     (let [xT (matrix/transpose factors)
           m! (matrix/inverse (matrix/dot xT factors))
           b  (matrix/dot xT y)]
       (ops/- (matrix/mmul m! b)
              point)))))
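In matrix terms, with X being `factors` (the design matrix with a leading column of ones) and y the observed targets, the function above returns

    (Xᵀ X)⁻¹ Xᵀ y − point

i.e. the difference between the closed-form least-squares solution and the current parameter vector, a direction pointing from the current estimate towards the minimum.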
Slide 42
Linear Regression: Objective Function
Slide 43
A bunch of JVM libraries are available
clojure.matrix is great
clojure.match greatly helps with algorithms
Clojure fns are easy to test
With immutable data structures, nothing goes wrong
Experience Report
Slide 44
Balagan
When `update-in` and `get-in` are not enough
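To illustrate the pain point with a small, hypothetical example using only core Clojure (not Balagan's API): with map-inside-vector-inside-map data, `get-in` and `update-in` need every index spelled out by hand.

(def conf
  {:servers [{:host "a" :port 8080}
             {:host "b" :port 8081}]})

;; Reading or updating one known position is fine...
(get-in conf [:servers 0 :host])          ;=> "a"
(update-in conf [:servers 1 :port] inc)

;; ...but touching every :port already needs manual iteration:
(update conf :servers
        (fn [servers] (mapv #(update % :port inc) servers)))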
Slide 45
Slide 46
Nested data structures
Map-inside-vector-inside-map
Straightforward query language
Balagan
Reduce boilerplate for processing topologies
Implicit wiring between occurring parts
No changes to the base API
Attach parts of the stream for better composition
DSLs