Functional Monitoring with Riemann

Riemann is a specialized stream-processing engine, dedicated to monitoring distributed systems. Built on top of Clojure, it provides a comprehensive syntax for dealing with events. In this talk we will walk through the underlying concepts and the benefits of internal DSLs in Clojure.

Pierre-Yves Ritschard

December 10, 2016


  1. @pyr

    Co-Founder & CTO at Exoscale: Simple cloud computing for teams. Open source developer: riemann, cyanite, collectd, graphite, OpenBSD/OpenSSH. Datastax MVP for Apache Cassandra.
  2. Agenda

    Founding principles
    Introduction to riemann
    A bit of technology
    Select use-cases
    Looking under the hood
  3. The nirvana of ops

    Quiet days, silent nights
    Better insight
    Informed decision
  4. How do we make informed decisions?

    Facts, Numbers, Visualizations
    A better map for our territory
  5. Why do we need this?

    Systems are increasingly complex
    We're still mostly looking at system metrics
  6. Q: How is business doing today?

    A: Based on these key metrics, we're looking good.
  7. Exoscale in a nutshell

    Infrastructure as a Service
    VMs
    Object Storage
    A large distributed system
    Control plane
    Large number of spread-out tasks
    High node volatility
  8. Monitoring the cloud

    No place for traditional metrics
    High CPU usage is a good thing
    High Memory usage is a good thing
    How much are we billing per hour?
    Early spike detection, API error surfacing
  9. Our shopping list, back in 2012

    Passive monitoring engine
    Ability to work on windows of events
    Ability to combine in-app and system metrics
  10. Nothing new to functional programmers

    A different intended audience
    Same concepts, better UI
  11. A unified language

        {:host "web01"
         :service "cpu"
         :state "ok"
         :time 1465944922
         :description "all good"
         :tags ["production" "web-cluster"]
         :metric 64.5
         :ttl 60}

    The identity of an event is the combination of its host and service.
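    A minimal sketch, not from the slides, of what this identity rule implies for an index: the latest event replaces any earlier one with the same host and service. A plain map stands in for Riemann's index here, and `identity-of` and `index-events` are hypothetical helper names.

    ```clojure
    ;; Sketch only: a plain Clojure map stands in for Riemann's index.
    (def events
      [{:host "web01" :service "cpu" :state "ok"       :metric 64.5 :time 1465944922}
       {:host "web01" :service "cpu" :state "critical" :metric 99.0 :time 1465944932}
       {:host "web02" :service "cpu" :state "ok"       :metric 12.0 :time 1465944933}])

    (defn identity-of
      "Hypothetical helper: the [host service] pair that identifies an event."
      [event]
      [(:host event) (:service event)])

    (defn index-events
      "Hypothetical helper: keep only the most recent event seen per identity."
      [events]
      (reduce (fn [idx event] (assoc idx (identity-of event) event))
              {}
              events))

    (def idx (index-events events))

    (count idx)                         ;; one entry per distinct identity
    (:state (get idx ["web01" "cpu"]))  ;; state of the later web01/cpu event
    ```

    Three events arrive, but only two identities remain in the map, and the entry for web01/cpu holds the later of its two events.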
  12. Sending data to riemann

        (require '[riemann.client :as r])

        (def c (r/tcp-client {:host "localhost"}))

        ;; send-event takes the client as its first argument
        @(r/send-event c {:host "myhost.foobar.com"
                          :service "myservice"
                          :metric 12.0})

    Other languages: Java, Golang, Python, Ruby, C/C++, C#, and more!
  13. A stream processing engine

    Fast multi-core asynchronous network processing engine
    An in-memory store for events
    A language (DSL) for routing events from input to output
  14. Fast network input engine

    Based on Netty
    Upwards of a million events per second on a VM
    Protobuf encoded events, with batching support
  15. The event store

    Based on a (very) fast lock-less concurrent hash map.
    Stores the last version of events by identity (host, service).
    In-memory event store
  16. The riemann DSL

    An internal DSL, in Clojure
    Simple functions to: work on events, set up inputs, set up outputs
    A plug-in system to pull in 3rd-party extensions
  17. How is this better than Storm?

    Single host solution
    In-memory event store
    These compromises enable performance
    Scaling is still possible
  18. Configuration key concepts

    Events go through streams
    Streams are lists of functions called for each event
  19. A simple config

        (logging/init {:file "/logriemann.log" :console? false})
        (tcp-server {:host "" :port 5555})
        (periodically-expire 60)

        (let [store! (index)
              email  (mailer {:from "[email protected]"})]
          (streams
            (default {:state "ok" :ttl 120}
              (expired #(info %))
              store!
              (by [:host :service]
                (changed :state {:init "ok"}
                  (email "[email protected]"))))))
  20. UI Matters!

    This is hard:

        (def high-latency-transducer
          (comp
            (filter #(and (= (:service %) "api request latency")
                          (> (:metric %) 300.0)))
            (map #(assoc % :state "warning"))))
  21. UI Matters!

    This is easy:

        (def store! (index))

        (streams
          (where (and (service "api request latency")
                      (> metric 300.0))
            (with {:state "warning"} store!)))
  22. Grouping and folding

        (moving-time-window 60      ;; moving time window gives us the whole window
          (smap folds/minimum       ;; mapping folds/minimum always yields the minimum
            (with :service "minimum per minute" store!)))
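    A minimal sketch, not from the slides, of what a windowed minimum computes, written in plain Clojure. It assumes the window keeps events newer than a cutoff and that the fold selects the event with the smallest metric. `window` and `minimum` are hypothetical helpers, not Riemann's implementation.

    ```clojure
    ;; Sketch only: plain Clojure, not Riemann's implementation.
    (defn window
      "Hypothetical helper: events whose :time is within `seconds` of `now`."
      [seconds now events]
      (filter #(> (:time %) (- now seconds)) events))

    (defn minimum
      "Hypothetical helper: the event with the smallest :metric,
       or nil when the window is empty."
      [events]
      (when (seq events)
        (apply min-key :metric events)))

    (def events
      [{:time 10  :metric 5.0}
       {:time 50  :metric 2.0}
       {:time 100 :metric 3.0}])

    ;; With a 60 second window ending at time 100, the event at time 10
    ;; has aged out, so only the last two events are folded.
    (minimum (window 60 100 events))
    ```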
  23. Bounds checking

        (within [0 1] ...)
        (without [0 1] ...)
        (over 9000 ...)
        (under 10 ...)
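    A minimal sketch, not from the slides, of the filtering each of these streams performs, written as plain predicates on an event's metric. It assumes `within` and `without` treat the range as inclusive and that `over` and `under` are strict comparisons. The predicate names are hypothetical.

    ```clojure
    ;; Sketch only: predicates mirroring the four bounds-checking streams.
    (defn within?
      "True when the metric falls inside the inclusive range [lo hi]."
      [[lo hi] event]
      (<= lo (:metric event) hi))

    (defn without?
      "True when the metric falls outside the inclusive range [lo hi]."
      [[lo hi] event]
      (not (within? [lo hi] event)))

    (defn over?
      "True when the metric is strictly greater than x."
      [x event]
      (> (:metric event) x))

    (defn under?
      "True when the metric is strictly smaller than x."
      [x event]
      (< (:metric event) x))

    ;; Keep only the events a (within [0 1] ...) stream would pass on.
    (filter #(within? [0 1] %)
            [{:metric 0.5} {:metric 1} {:metric 2}])
    ```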
  24. Fun idea: trending metrics

        (let [store    (index)
              trending (top 10 (juxt :metric :time)
                         (tag "top" store)
                         store)]
          (streams
            (by :service
              (moving-time-window 3600
                (smap folds/sum trending)))))

    This is awesome for finding outliers in your cluster.
    Full description: http://spootnik.org/entries/2014/01/14_real-time-twitter-trending-on-a-budget.html
  25. A look under the hood

    How do we build a good programming UI?
    Lots of macros
    Relying on Clojure's STM
  26. Clojure as a configuration language

        (defn include [path]
          (let [path (config-file-path path)
                file (file path)]
            (binding [*config-file* path
                      *ns* (find-ns 'riemann.config)]
              (load-file path))))
  27. Mutable state

        (def store! (index))

        (streams
          store!
          prn
          (expired #(info "event expired:" %)))

        (streams
          (where (state "critical")
            (email "[email protected]")))
  28. Mutable state

        (defn streams [& children]
          (swap! next-core update :streams #(reduce conj % children)))
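    A minimal sketch, not from the slides, of why successive `streams` calls in a configuration accumulate instead of replacing each other. It assumes `next-core` is an atom holding a map, and the initial `{:streams []}` value is an assumption made for illustration.

    ```clojure
    ;; Sketch only: a bare atom stands in for the core being configured.
    (def next-core (atom {:streams []}))

    (defn streams
      "Append the given stream functions to the core under construction."
      [& children]
      (swap! next-core update :streams #(reduce conj % children)))

    ;; Two separate calls, as in a config file split across several forms.
    (streams prn)
    (streams identity identity)

    (count (:streams @next-core)) ;; every stream from both calls is kept
    ```

    Each top-level form in the configuration is an ordinary function call with a side effect on shared state, which is what lets the config file read as a declaration.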
  29. Clojure Internals

    A simple example:

        (def store! (index))

        (streams
          store!
          (tagged-any ["should-print" "debug"] prn))

        (streams
          (coalesce
            (smap folds/sum
              (with :service "global sum" store!))))
  30. tagged-any

        (defn tagged-any? [tags event]
          (not= nil (some (set tags) (:tags event))))

        (defn tagged-any [tags & children]
          (fn stream [event]
            (when (tagged-any? tags event)
              (call-rescue event children))))
  31. call-rescue

        (defmacro call-rescue [event children]
          `(doseq [child# ~children]
             (try
               (child# ~event)
               (catch Throwable e#
                 (warn e# (str child# " threw"))))))
  32. coalesce

        (defn coalesce [& children]
          (let [stored (atom [])]                       ;; Our mutable state
            (fn stream [event]
              (let [events (swap! stored conj event)]   ;; Add new event to stored list
                (call-rescue events children)))))       ;; Hand-off to children
  33. Clojure helps!

    Exposing DSLs is as simple as it can get.
    The STM makes building functions that safely hold on to state very easy.
  34. They're using riemann now

    Spotify, extensively (7k+ nodes)
    Kickstarter
    CC in2p3 (Huge research compute center)
    Plenty of startups