Functional Monitoring with Riemann

Riemann is a specialized stream-processing engine, dedicated to monitoring distributed systems. Built on top of Clojure, it provides a comprehensive syntax for dealing with events. In this talk we will walk through the underlying concepts and the benefits of internal DSLs in Clojure.

Pierre-Yves Ritschard

December 10, 2016

Transcript

  1. Riemann @pyr

  2. @pyr
    Co-Founder & CTO at Exoscale: Simple cloud computing for teams.
    Open source developer: riemann, cyanite, collectd, graphite, OpenBSD/OpenSSH.
    Datastax MVP for Apache Cassandra.
  3. Agenda
    Founding principles
    Introduction to riemann
    A bit of technology
    Select use-cases
    Looking under the hood
  4. Founding Principles

  5. The nirvana of ops
    Quiet days, silent nights
    Better insight
    Informed decision
  6. Why the need?
    Breaking out of our mental model
  7. None
  8. I'll push this minor change, it won't do any harm.

  9. None

  10. I'll just add a static route real quick
  11. None

  12. How do we make informed decisions?
    Facts, Numbers, Visualizations
    A better map for our territory
  13. Why do we need this?
    Systems are increasingly complex
    We're still mostly looking at system metrics
  14. An example: the web industry

  15. Web infrastructure circa '00
    2 servers

  16. Visibility circa '00

  17. Web infrastructure circa '16

  18. Visibility circa '16

  19. Q: How is business doing today? A:

  20. Q: How is business doing today?
    A: Based on these key metrics, we're looking good.
  21. Quick case study: Exoscale

  22. Exoscale in a nutshell
    Infrastructure as a Service
    VMs
    Object Storage
    A large distributed system
    Control plane
    Large number of spread-out tasks
    High node volatility
  23. Monitoring the cloud
    No place for traditional metrics
    High CPU usage is a good thing
    High Memory usage is a good thing
    How much are we billing per hour?
    Early spike detection, API error surfacing
  24. Our shopping list, back in 2012
    Passive monitoring engine
    Ability to work on windows of events
    Ability to combine in-app and system metrics
  25. Introducing Riemann
    http://riemann.io
    https://github.com/riemann/riemann

  26. A distributed system monitoring engine
    A unified language for dealing with events
  27. Nothing new to functional programmers
    A different intended audience
    Same concepts, better UI
  28. None

  29. A unified language
    {:host "web01"
     :service "cpu"
     :state "ok"
     :time 1465944922
     :description "all good"
     :tags ["production" "web-cluster"]
     :metric 64.5
     :ttl 60}
    The identity of an event is the combination of its host and service.
  30. Sending data to riemann
    (require '[riemann.client :as r])
    (def c (r/tcp-client {:host "localhost"}))
    ;; send-event takes the client as its first argument
    @(r/send-event c {:host "myhost.foobar.com"
                      :service "myservice"
                      :metric 12.0})
    Other languages: Java, Golang, Python, Ruby, C/C++, C#, and more!
  31. Data emission
    Collectd Syslog-ng Nagios TCP/TLS/UDP HTTP/Websockets/SSE

  32. Output support
    TCP/TLS/UDP UDP HTTP Graphite Pagerduty Slack Hipchat
  33. Visualization support

  34. A stream processing engine
    Fast multi-core asynchronous network processing engine
    An in-memory store for events
    A language (DSL) for routing events from input to output
  35. Fast network input engine
    Based on Netty
    Upwards of a million events per second on a VM
    Protobuf encoded events, with batching support
  36. The event store
    Based on a (very) fast lock-less concurrent hash map.
    Stores the last version of events by identity (host, service).
    In memory event store
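
To make the "last version by identity" idea concrete, here is a minimal sketch in plain Clojure. It is a simplification under stated assumptions: it uses an atom holding a map rather than the lock-less concurrent hash map mentioned on the slide, and `make-index`, `index-event!` and `lookup` are hypothetical names, not Riemann's API.

```clojure
;; Hypothetical sketch, not Riemann's implementation: an atom holding a map
;; keyed by [host service] stands in for the concurrent hash map.
(defn make-index
  "Creates an empty in-memory event store."
  []
  (atom {}))

(defn index-event!
  "Stores `event` under its identity, replacing any previous version."
  [index event]
  (swap! index assoc [(:host event) (:service event)] event)
  event)

(defn lookup
  "Returns the last event stored for the given host and service, or nil."
  [index host service]
  (get @index [host service]))

(def idx (make-index))

(index-event! idx {:host "web01" :service "cpu" :state "ok"       :metric 64.5})
(index-event! idx {:host "web01" :service "cpu" :state "critical" :metric 98.0})
(index-event! idx {:host "web02" :service "cpu" :state "ok"       :metric 12.0})

;; Two identities remain; the second web01/cpu event replaced the first.
(:state (lookup idx "web01" "cpu"))
```

Keying on the pair means a new reading for the same host and service overwrites the old one, so the store only ever holds the current view of each service.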
  37. The riemann DSL
    An internal DSL, in Clojure
    Simple functions to:
      Work on events
      Set-up inputs
      Set-up outputs
    A plug-in system to pull-in 3rd-party extensions
  38. How is this better than Storm?
    Single host solution
    In-memory event store
    These compromises enable performance
    Scaling is still possible
  39. Configuring Riemann

  40. Configuration key concepts
    Events go through streams
    Streams are lists of functions called for each event
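
The "streams are functions" idea can be sketched in plain Clojure without Riemann on the classpath. Everything below is illustrative: `where-state`, `dispatch!`, `top-level-streams` and `seen` are hypothetical names, and Riemann's own `where` is considerably richer.

```clojure
;; Hypothetical sketch: a stream is a function of one event, and a parent
;; stream decides whether to forward the event to its child streams.
(defn where-state
  "Returns a stream that passes events whose :state equals `state` to children."
  [state & children]
  (fn stream [event]
    (when (= state (:state event))
      (doseq [child children]
        (child event)))))

;; Records what each stream saw, in order.
(def seen (atom []))

(def top-level-streams
  [(fn [event] (swap! seen conj [:all (:service event)]))
   (where-state "critical"
                (fn [event] (swap! seen conj [:alert (:service event)])))])

(defn dispatch!
  "Calls every top-level stream with the event."
  [event]
  (doseq [stream top-level-streams]
    (stream event)))

(dispatch! {:service "cpu" :state "ok"})
(dispatch! {:service "disk" :state "critical"})
;; @seen => [[:all "cpu"] [:all "disk"] [:alert "disk"]]
```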
  41. A simple config
    (logging/init {:file "/logriemann.log" :console? false})
    (tcp-server {:host "0.0.0.0" :port 5555})
    (periodically-expire 60)
    (let [store! (index)
          email  (mailer {:from "ops@example.com"})]
      (streams
        (default {:state "ok" :ttl 120}
          (expired #(info %))
          store!
          (by [:host :service]
            (changed :state {:init "ok"}
              (email "ops@example.com"))))))
  42. UI Matters! This is hard:
    (def high-latency-transducer
      (comp (filter #(and (= (:service %) "api request latency")
                          (> (:metric %) 300.0)))
            (map #(assoc % :state "warning"))))
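
For readers less familiar with transducers, this is how the definition above could be applied to a small batch of events with plain `into`. The sample events and the `flagged` name are made up for illustration.

```clojure
;; The transducer from the slide, repeated so the example is self-contained.
(def high-latency-transducer
  (comp (filter #(and (= (:service %) "api request latency")
                      (> (:metric %) 300.0)))
        (map #(assoc % :state "warning"))))

;; Hypothetical sample events: only the second one matches both conditions.
(def flagged
  (into []
        high-latency-transducer
        [{:service "api request latency" :metric 120.0}
         {:service "api request latency" :metric 450.0}
         {:service "cpu"                 :metric 900.0}]))
;; flagged => [{:service "api request latency" :metric 450.0 :state "warning"}]
```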
  43. UI Matters! This is easy:
    (def store! (index))
    (streams
      (where (and (service "api request latency")
                  (> metric 300.0))
        (with {:state "warning"} store!)))
  44. Select use-cases

  45. Filtering events
    (streams
      (where (and (service "iptv")
                  (state "critical"))
        (email "alerts@example.com")))
  46. Logical manifolds with by
    (by [:host :service]
      (changed :state {:init "ok"}
        (email "alerts@example.com")))
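
One way to picture what `by` does with `changed` is a stream that lazily creates one child per distinct field combination, so each host/service pair tracks its own previous state. The sketch below is plain Clojure (1.9+ for `reset-vals!`) under that assumption; `by-fields`, `changed-state` and `alerts` are hypothetical names and do not reproduce Riemann's actual implementation.

```clojure
(defn by-fields
  "Returns a stream that creates one child stream per distinct combination
  of `fields`, using the no-argument function `make-child`."
  [fields make-child]
  (let [children (atom {})
        key-fn   (apply juxt fields)]
    (fn stream [event]
      (let [k     (key-fn event)
            ;; Create the child for this key on first sight, then reuse it.
            child (get (swap! children
                              (fn [m]
                                (if (contains? m k)
                                  m
                                  (assoc m k (make-child)))))
                       k)]
        (child event)))))

(defn changed-state
  "Returns a stream that calls `child` only when :state differs from the
  previously seen state, starting from `init`."
  [init child]
  (let [previous (atom init)]
    (fn stream [event]
      (let [[old new] (reset-vals! previous (:state event))]
        (when (not= old new)
          (child event))))))

;; Stand-in for sending an email: record host and new state.
(def alerts (atom []))

(def per-service
  (by-fields [:host :service]
             #(changed-state "ok"
                             (fn [e] (swap! alerts conj [(:host e) (:state e)])))))

(doseq [e [{:host "web01" :service "cpu" :state "ok"}
           {:host "web02" :service "cpu" :state "critical"}
           {:host "web01" :service "cpu" :state "critical"}
           {:host "web02" :service "cpu" :state "critical"}]]
  (per-service e))
;; @alerts => [["web02" "critical"] ["web01" "critical"]]
```

Without the per-key split, a single `changed-state` would see web01 and web02 interleaved and report transitions that never happened on either host.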
  47. Rollups
    (rollup 5 3600 (email "ops@example.com"))

  48. Rewriting
    (with :service "rewritten service" ...)
    (adjust [:service str "rate"] ...)
  49. Grouping and folding
    (moving-time-window 60    ;; moving time window gives us the whole window
      (smap folds/min         ;; mapping folds/min always yields the minima
        (with :service "minimum per minute" store!)))
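
The window-then-fold pattern can be approximated in a few lines of plain Clojure. This is a rough sketch that assumes events arrive in time order with :time in seconds; `time-window-stream`, `minimum-event` and `minima` are hypothetical names, and Riemann's own windowing streams may behave differently at the edges.

```clojure
(defn time-window-stream
  "Returns a stream that keeps the events whose :time is within `n` seconds
  of the newest event and passes the whole window (a vector) to `child`."
  [n child]
  (let [window (atom [])]
    (fn stream [event]
      (let [cutoff (- (:time event) n)
            events (swap! window
                          (fn [w]
                            (filterv #(> (:time %) cutoff) (conj w event))))]
        (child events)))))

(defn minimum-event
  "Returns the event with the smallest :metric in a non-empty window."
  [events]
  (apply min-key :metric events))

;; Collect the minimum metric seen in each successive window.
(def minima (atom []))

(def per-minute-minimum
  (time-window-stream 60
                      (fn [events]
                        (swap! minima conj (:metric (minimum-event events))))))

(doseq [e [{:time 0   :metric 5}
           {:time 30  :metric 3}
           {:time 70  :metric 9}
           {:time 100 :metric 7}]]
  (per-minute-minimum e))
;; @minima => [5 3 3 7]
```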
  50. Bounds checking
    (within [0 1] ...)
    (without [0 1] ...)
    (over 9000 ...)
    (under 10 ...)
  51. Fun idea: trending metrics
    (let [store    (index)
          trending (top 10 (juxt :metric :time)
                     (tag "top" store)
                     store)]
      (streams
        (by :service
          (moving-time-window 3600
            (smap folds/sum trending)))))
    This is awesome for finding outliers in your cluster.
    Full description: http://spootnik.org/entries/2014/01/14_real-time-twitter-trending-on-a-budget.html
  52. A look under the hood
    How do we build a good programming UI?
    Lots of macros
    Relying on Clojure's STM
  53. Clojure as a configuration language
    (defn include [path]
      (let [path (config-file-path path)
            file (file path)]
        (binding [*config-file* path
                  *ns* (find-ns 'riemann.config)]
          (load-file path))))
  54. Mutable state
    (def store! (index))
    (streams
      store!
      prn
      (expired #(info "event expired:" %)))
    (streams
      (where (state "critical")
        (email "ops@example.com")))
  55. Mutable state
    (defn streams [& children]
      (swap! next-core update :streams #(reduce conj % children)))
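
This explains why several `(streams ...)` forms in one config add up rather than overwrite each other. The sketch below replays the definition above against a stand-in `next-core`; treating `next-core` as an atom holding a map with a `:streams` vector is an assumption made for the example, not a description of Riemann's actual core structure.

```clojure
;; Assumption: next-core is an atom whose value has a :streams vector.
(def next-core (atom {:streams []}))

(defn streams
  "Appends the given child streams to the streams of the next core."
  [& children]
  (swap! next-core update :streams #(reduce conj % children)))

;; Two separate calls, as in a config file with two (streams ...) forms.
(streams prn)
(streams identity str)

(count (:streams @next-core))
;; => 3
```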
  56. Clojure Internals
    A simple example:
    (def store! (index))
    (streams
      store!
      (tagged-any ["should-print" "debug"] prn))
    (streams
      (coalesce
        (smap folds/sum
          (with :service "global sum" store!))))
  57. tagged-any
    (defn tagged-any? [tags event]
      (not= nil (some (set tags) (:tags event))))
    (defn tagged-any [tags & children]
      (fn stream [event]
        (when (tagged-any? tags event)
          (call-rescue event children))))
  58. call-rescue
    (defmacro call-rescue [event children]
      `(doseq [child# ~children]
         (try
           (child# ~event)
           (catch Throwable e#
             (warn e# (str child# " threw"))))))
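
Putting `tagged-any` and `call-rescue` together shows the point of the macro: one failing child does not prevent its siblings from receiving the event. The version below is a self-contained sketch in which `println` stands in for the logging `warn` call, and the `received` atom and sample events are invented for the example.

```clojure
(defmacro call-rescue
  "Calls each child with the event, reporting (not propagating) exceptions."
  [event children]
  `(doseq [child# ~children]
     (try
       (child# ~event)
       (catch Throwable e#
         ;; Stand-in for the logging `warn` used on the slide.
         (println "child stream threw:" (.getMessage e#))))))

(defn tagged-any?
  "True when the event carries at least one of the given tags."
  [tags event]
  (not= nil (some (set tags) (:tags event))))

(defn tagged-any
  "Returns a stream passing events with any of `tags` to all children."
  [tags & children]
  (fn stream [event]
    (when (tagged-any? tags event)
      (call-rescue event children))))

(def received (atom []))

(def debug-stream
  (tagged-any ["should-print" "debug"]
              (fn [_] (throw (ex-info "boom" {})))               ; a failing child
              (fn [event] (swap! received conj (:service event))))) ; still runs

(debug-stream {:service "cpu"  :tags ["debug" "web"]})   ; matches a tag
(debug-stream {:service "disk" :tags ["production"]})    ; no matching tag
(debug-stream {:service "mem"})                          ; no tags at all
;; @received => ["cpu"]
```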
  59. coalesce
    (defn coalesce [& children]
      (let [stored (atom [])]                        ;; Our mutable state
        (fn stream [event]
          (let [events (swap! stored conj event)]    ;; Add new event to stored list
            (call-rescue events children)))))        ;; Hand-off to children
  60. Wrapping Up

  61. Clojure helps!
    Exposing DSLs is as simple as it can get.
    The STM makes building functions that safely hold on to state very easy.
  62. They're using riemann now
    Spotify, extensively (7k+ nodes)
    Kickstarter
    CC in2p3 (Huge research compute center)
    Plenty of startups
  63. Thank you! Questions?
    Illustrations by Kyle Kingsbury - @aphyr.