Clojure - A Sweetspot for Analytics

EuroClojure 2015 Talk Slides:

Clojure is gaining more and more traction, and more people are using it for all kinds of backend processing. Over the last three years, we at ClojureWerkz have concentrated on making the lives of backend developers simpler. Today, Clojure is one of the best choices for analytics and data-driven backends.

I'll tell you about our motivation, our experiences, and our success story: how we built a data processing backend that currently pushes millions of messages per second, and how Clojure shortened our development cycles and time to production, made our developers' lives better, and made our customers happier.

About the speaker: Alex works on backends for analytics and data processing. He has been involved with Clojure since 2011, co-created ClojureWerkz, and is actively involved in the development and maintenance of many Clojure libraries. He spends most of his free time reading math and probability theory textbooks and figuring out how things work.

αλεx π

June 25, 2015

Transcript

  1. ClojureWerkz
     - 35+ high-quality Clojure libraries
     - User reports from all over the world
     - 20+ active contributors
     - We value documentation
  2. Coin flips (Anglican)

         (let [data (->> #(sample (flip 0.5))
                         repeatedly
                         (take 50)
                         (map #(if % "heads" "tails"))
                         frequencies)]
           (plot/bar-chart (keys data) (vals data)))
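
The coin-flip experiment above can also be approximated without Anglican. The following is a minimal sketch in plain Clojure: `flip-counts` is a hypothetical helper (not part of Anglican), it uses `java.util.Random` in place of `(sample (flip 0.5))`, and it leaves out the plotting step.

```clojure
(import 'java.util.Random)

(defn flip-counts
  "Flips a coin with heads-probability p n times and counts the outcomes.
  Hypothetical helper: a seeded java.util.Random stands in for Anglican's flip."
  [seed n p]
  (let [rng (Random. seed)]
    (->> #(< (.nextDouble rng) p)       ; true with probability p
         repeatedly
         (take n)
         (map #(if % "heads" "tails"))
         frequencies)))

;; Usage: (flip-counts 42 50 0.5) returns a map from "heads"/"tails" to counts.
```

Seeding the generator keeps the run reproducible, which is convenient when comparing against a chart.
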
  3. Simulating Cassandra Cluster Sizes (Anglican)
     - Disclaimer: the model is simplified and the numbers are made up!
     - 1 node can handle 10K requests
     - Latency is normally distributed, with a mean of 20 ms and a standard deviation of 5 ms
     - “Extra” requests add overhead exponentially
  4. Simulating Cassandra Cluster Sizes
     Disclaimer: the model is simplified and the numbers are made up!

         (def base-requests (* 10 1000))

         (defquery cluster-latency [n write-rate]
           (let [per-node (/ write-rate n)
                 overhead (/ 1000 (if (> per-node base-requests)
                                    (- per-node base-requests)
                                    1))]
             (predict :latency (+ (sample (exponential overhead))
                                  (sample (normal 20 5))))))
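
To get a feel for what the query above computes, the same made-up model can be run as a plain Monte Carlo simulation. This is a sketch under stated assumptions, not Anglican's inference machinery: `sample-exponential`, `sample-normal`, `sample-latency`, and `mean-latency` are hypothetical helpers, and the exponential distribution is assumed to be parameterised by its rate.

```clojure
(import 'java.util.Random)

(def base-requests (* 10 1000))

(defn sample-exponential
  "Draws from an exponential distribution with the given rate (inverse-CDF method)."
  [^Random rng rate]
  (/ (- (Math/log (- 1.0 (.nextDouble rng)))) rate))

(defn sample-normal
  "Draws from a normal distribution with the given mean and standard deviation."
  [^Random rng mean sd]
  (+ mean (* sd (.nextGaussian rng))))

(defn sample-latency
  "One latency draw for a cluster of n nodes under the given write rate."
  [^Random rng n write-rate]
  (let [per-node (/ write-rate n)
        ;; Assumption: overhead is a rate, so more extra requests mean a
        ;; smaller rate and therefore a larger expected delay.
        overhead (/ 1000.0 (if (> per-node base-requests)
                             (- per-node base-requests)
                             1))]
    (+ (sample-exponential rng overhead)
       (sample-normal rng 20 5))))

(defn mean-latency
  "Averages `runs` latency draws, using a seeded generator for reproducibility."
  [seed n write-rate runs]
  (let [rng (Random. seed)]
    (/ (reduce + (repeatedly runs #(sample-latency rng n write-rate)))
       runs)))
```

Calling `(mean-latency 42 10 100000 20000)` and `(mean-latency 42 5 100000 20000)` compares a cluster that stays within the per-node budget against one that exceeds it.
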
  5. Anglican
     - Subset of Clojure, compiled into CPS-style fns
     - Stackless language
     - Built-in memoisation
     - DSL for building sampling fns for distributions
  6. $P(X \mid \text{blue}) = \frac{\text{number of blue near } X}{\text{total number of blue}}$
     $P(X \mid \text{red}) = \frac{\text{number of red near } X}{\text{total number of red}}$
  7. Model (prior)
     $P(\text{blue}) = \frac{\text{number of blue}}{\text{total number of objects}}$
     $P(\text{red}) = \frac{\text{number of red}}{\text{total number of objects}}$

         (defn make-model [train-data]
           (let [total (->> train-data vals (map count) (reduce +))]
             (for [[k v] train-data]
               [k {:p        (/ (count v) total)
                   :evidence (->> v
                                  transpose
                                  (map (fn [v] {:mean     (mean v)
                                                :variance (variance v)})))}])))
  8. Classifier (posterior)

         (defn posterior-prob [point variance mean]
           (* (/ 1 (sqrt (* 2 pi variance)))
              (exp (/ (* -1 (pow (- point mean) 2))
                      (* 2 variance)))))

         (map (fn [point {:keys [mean variance]}]
                (posterior-prob point variance mean))
              model)

     $P(X \mid \text{blue}) = \frac{\text{number of blue near } X}{\text{total number of blue}}$
     $P(X \mid \text{red}) = \frac{\text{number of red near } X}{\text{total number of red}}$
  9. $P(X \mid \text{blue}) = \frac{\text{number of blue near } X}{\text{total number of blue}}$
     $P(X \mid \text{red}) = \frac{\text{number of red near } X}{\text{total number of red}}$
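
The prior and Gaussian likelihood from the slides above can be combined into a small, self-contained naive Bayes classifier. The sketch below fills in helpers the slides leave undefined (`mean`, `variance`, `transpose`) and adds a hypothetical `classify` function; it assumes training data shaped as a map from class label to a sequence of numeric feature vectors, and it uses the sample variance.

```clojure
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn variance
  "Sample variance; assumes at least two values."
  [xs]
  (let [m (mean xs)]
    (/ (reduce + (map #(Math/pow (- % m) 2) xs))
       (dec (count xs)))))

(defn transpose [rows]
  (apply map vector rows))

(defn make-model
  "Returns {class {:p prior, :evidence [{:mean .. :variance ..} per feature]}}."
  [train-data]
  (let [total (->> train-data vals (map count) (reduce +))]
    (into {}
          (for [[k v] train-data]
            [k {:p        (/ (count v) (double total))
                :evidence (->> v
                               transpose
                               (map (fn [col] {:mean     (mean col)
                                               :variance (variance col)})))}]))))

(defn gaussian-density
  "Normal density of `point` for the given variance and mean."
  [point variance mean]
  (* (/ 1 (Math/sqrt (* 2 Math/PI variance)))
     (Math/exp (/ (* -1 (Math/pow (- point mean) 2))
                  (* 2 variance)))))

(defn classify
  "Picks the class with the largest prior times product of feature densities."
  [model point]
  (->> model
       (map (fn [[k {:keys [p evidence]}]]
              [k (reduce * p (map (fn [x {:keys [mean variance]}]
                                    (gaussian-density x variance mean))
                                  point evidence))]))
       (apply max-key second)
       first))
```

For example, `(classify (make-model {:blue [[1.0 1.0] [1.5 2.0] [2.0 1.5]] :red [[6.0 6.0] [7.0 6.5] [6.5 7.0]]}) [1.2 1.4])` scores the point against both classes and returns the better-scoring label.
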
  10. Linear Regression: Objective Function
      Basically, the distance between predicted and actual Y:

          (objective-function
            (fn [intercept slope]
              (let [f   (line intercept slope)
                    res (->> points
                             (map (fn [[x y]] (sqr (- y (f x)))))
                             (reduce +))]
                res)))
  11. Linear Regression: Objective Function
      Basically, the distance between predicted and actual Y:

          (objective-function-gradient
            (let [factors (->> points (map butlast) (map #(cons 1 %)))
                  y       (map last points)]
              (fn [& point]
                (let [xT (matrix/transpose factors)
                      m! (matrix/inverse (matrix/dot xT factors))
                      b  (matrix/dot xT y)]
                  (ops/- (matrix/mmul m! b) point)))))
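
For a single predictor, the least-squares solution that the matrix code above works towards reduces to a closed form: $\text{slope} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$. The following is a minimal sketch in plain Clojure with no matrix library; `sum-of-squares` and `fit-line` are hypothetical names, and `points` is assumed to be a sequence of `[x y]` pairs.

```clojure
(defn line
  "Returns the function x -> intercept + slope * x."
  [intercept slope]
  (fn [x] (+ intercept (* slope x))))

(defn sum-of-squares
  "Objective: sum of squared distances between actual and predicted y."
  [points intercept slope]
  (let [f (line intercept slope)]
    (->> points
         (map (fn [[x y]] (let [d (- y (f x))] (* d d))))
         (reduce +))))

(defn fit-line
  "Closed-form least-squares fit for one predictor."
  [points]
  (let [n      (count points)
        mean-x (/ (reduce + (map first points)) (double n))
        mean-y (/ (reduce + (map second points)) (double n))
        sxy    (reduce + (map (fn [[x y]] (* (- x mean-x) (- y mean-y))) points))
        sxx    (reduce + (map (fn [[x _]] (let [d (- x mean-x)] (* d d))) points))
        slope  (/ sxy sxx)]
    {:intercept (- mean-y (* slope mean-x))
     :slope     slope}))
```

Points that lie exactly on a line, such as `[[0 1] [1 3] [2 5] [3 7]]`, should be recovered with an objective value of zero, which makes a quick sanity check.
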
  12. Experience Report
      - Bunch of JVM libraries available
      - clojure.matrix is great
      - clojure.match greatly helps with algos
      - Clojure fns are easy to test
      - With immutable DSs nothing goes wrong
  13. Balagan

          (b/select {:a {:b [{:c 1} {:c 2} {:c 3}]}}
                    [:* :* even? :c])
          ;; => [1 3]
  14. Balagan

          (b/update {:a {:b [1 2 3]}} [:a :b :*] inc)
          ;; => {:a {:b [2 3 4]}}
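
The path semantics shown in the two Balagan examples (literal keys, the `:*` wildcard, and predicates applied to keys or vector indices) can be illustrated with a short sketch in plain Clojure. This is not Balagan's implementation: `select-path` and `update-path` are hypothetical functions that only mimic the behaviour visible on the slides.

```clojure
(defn- entries
  "Key/value pairs of a map, or index/value pairs of a sequential collection."
  [data]
  (cond
    (map? data)        (seq data)
    (sequential? data) (map-indexed vector data)))

(defn select-path
  "Collects every value reachable by following path through data."
  [data path]
  (if (empty? path)
    [data]
    (let [[step & more] path
          matches? (cond
                     (= step :*) (constantly true)   ; wildcard matches any key
                     (fn? step)  step                ; predicate on key or index
                     :else       #(= step %))]       ; literal key
      (vec (for [[k v]  (entries data)
                 :when  (matches? k)
                 result (select-path v more)]
             result)))))

(defn update-path
  "Applies f to every value addressed by path (literal keys and :* only)."
  [data path f]
  (if (empty? path)
    (f data)
    (let [[step & more] path
          descend #(update-path % more f)]
      (cond
        (and (= step :*) (map? data))
        (into {} (map (fn [[k v]] [k (descend v)]) data))

        (and (= step :*) (sequential? data))
        (mapv descend data)

        :else
        (assoc data step (descend (get data step)))))))
```

With these definitions, `(select-path {:a {:b [{:c 1} {:c 2} {:c 3}]}} [:* :* even? :c])` follows the same path as the `b/select` call above, with `even?` filtering on vector indices.
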
  15. DSLs: “Named” streams

          (reactor/on ($ "a")
            filter* map* batch* reduce*
            (reactor/on ($ "b")
              (reactor/on ($ "c")
                (reactor/on ($ "a")))))
  16. DSLs
      - Reduce boilerplate for processing topologies
      - Implicit wiring between occurring parts
      - No changes to the base API
      - Attach parts of the stream for better composition
  17. DSLs
      - On-premise streams
      - Per-entity streams
      - Decouple data processing pipelines
      - Avoid hash lookups within sync operations
      - Parallelize
      - Maintain streams independently
  18. DSLs: Macros
      - Powerful way to hide the “wiring”
      - No changes to API
      - Completely different handling logic
      - Eager, delayed, wired, etc. streams
  19. Buffy: create a spec out of parts

          (spec :my-field-1 (int32-type)
                :my-field-2 (string-type 10))

      Memory layout:

          0            4                        14
          +------------+-------------------------+
          | my-field-1 | my-field-2              |
          | (int)      | (10 string)             |
          +------------+-------------------------+
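
The memory layout above (a 4-byte integer at offset 0 followed by a 10-byte string at offset 4) can be reproduced directly on a `java.nio.ByteBuffer`. The sketch below is not Buffy's API: `layout`, `make-buffer`, `write-field!`, and `read-field` are hypothetical names, and the string is assumed to be UTF-8, zero-padded to its fixed width.

```clojure
(import 'java.nio.ByteBuffer
        'java.nio.charset.StandardCharsets)

;; Hypothetical layout table mirroring the spec above.
(def layout
  {:my-field-1 {:offset 0 :size 4  :type :int32}
   :my-field-2 {:offset 4 :size 10 :type :string}})

(defn make-buffer []
  (ByteBuffer/allocate 14))

(defn write-field!
  "Writes value at the field's fixed offset; strings are zero-padded or truncated."
  [^ByteBuffer buf field value]
  (let [{:keys [offset size type]} (layout field)]
    (case type
      :int32  (.putInt buf (int offset) (int value))
      :string (let [bs (.getBytes ^String value StandardCharsets/UTF_8)]
                (dotimes [i size]
                  (.put buf
                        (int (+ offset i))
                        (byte (if (< i (alength bs)) (aget bs i) 0))))))
    buf))

(defn read-field
  "Reads the field back from its fixed offset."
  [^ByteBuffer buf field]
  (let [{:keys [offset size type]} (layout field)]
    (case type
      :int32  (.getInt buf (int offset))
      :string (let [bs (byte-array size)]
                (dotimes [i size]
                  (aset-byte bs i (.get buf (int (+ offset i)))))
                ;; trim drops the trailing zero padding
                (.trim (String. bs StandardCharsets/UTF_8))))))
```

Because every field has a fixed offset, fields can be read and written in any order without scanning the buffer.
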
  20. Buffy: use a spec to access fields like in a map

          (let [s   (spec :int-field    (int32-type)
                          :string-field (string-type 10))
                buf (compose-buffer s)]
            (set-field buf :int-field 101)
            (get-field buf :int-field))
          ;; => 101
  21. Buffy: or decode them all together

          (let [s   (spec :first-field  (int32-type)
                          :second-field (string-type 10)
                          :third-field  (boolean-type))
                buf (compose-buffer s)]
            (set-fields buf {:first-field 101 :second-field "string" :third-field true})
            (decompose buf))
          ;; => {:third-field true :second-field "string" :first-field 101}
  22. Buffy DSLs: dynamic types (netstrings)

          (def dynamic-string
            (frame-type
              (frame-encoder [value]
                             length (short-type) (count value)
                             string (string-type (count value)) value)
              (frame-decoder [buffer offset]
                             length (short-type)
                             string (string-type (read length buffer offset)))
              second))
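
The dynamic string frame above is a length-prefixed encoding: a short holding the length, followed by that many bytes of string data. A minimal sketch of the same wire format on a plain `java.nio.ByteBuffer` follows. It is not Buffy's implementation: `encode-dynamic-string` and `decode-dynamic-string` are hypothetical names, the payload is assumed to be UTF-8, and the prefix counts bytes rather than characters.

```clojure
(import 'java.nio.ByteBuffer
        'java.nio.charset.StandardCharsets)

(defn encode-dynamic-string
  "Returns a buffer holding a 2-byte length prefix followed by the string bytes."
  ^ByteBuffer [^String value]
  (let [bs  (.getBytes value StandardCharsets/UTF_8)
        buf (ByteBuffer/allocate (+ 2 (alength bs)))]
    (.putShort buf (short (alength bs)))   ; length prefix
    (.put buf bs)                          ; payload
    (.flip buf)                            ; make the buffer ready for reading
    buf))

(defn decode-dynamic-string
  "Reads the length prefix at offset, then decodes that many payload bytes."
  [^ByteBuffer buf offset]
  (let [len (.getShort buf (int offset))
        bs  (byte-array len)]
    (dotimes [i len]
      (aset-byte bs i (.get buf (int (+ offset 2 i)))))
    (String. bs StandardCharsets/UTF_8)))
```

The decoder has to read the prefix before it knows how much to consume, which is why the frame decoder on the slide reads `length` first and sizes the string type from it.
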
  23. Lessons Learned
      - Protocols help to abstract a notion of Data Type
      - Data Types are extendable!
      - Macros for creating the custom decoders