Slide 1

Slide 1 text

Riemann
@pyr

Slide 2

Slide 2 text

@pyr
- Co-Founder & CTO at Exoscale: simple cloud computing for teams.
- Open source developer: riemann, cyanite, collectd, graphite, OpenBSD/OpenSSH.
- Datastax MVP for Apache Cassandra.

Slide 3

Slide 3 text

Agenda
- Founding principles
- Introduction to riemann
- A bit of technology
- Select use-cases
- Looking under the hood

Slide 4

Slide 4 text

Founding Principles

Slide 5

Slide 5 text

The nirvana of ops
- Quiet days, silent nights
- Better insight
- Informed decision

Slide 6

Slide 6 text

Why the need?
- Breaking out of our mental model

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

I'll push this minor change, it won't do any harm.

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

I'll just add a static route real quick

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

How do we make informed decisions?
- Facts, Numbers, Visualizations
- A better map for our territory

Slide 13

Slide 13 text

Why do we need this?
- Systems are increasingly complex
- We're still mostly looking at system metrics

Slide 14

Slide 14 text

An example: the web industry

Slide 15

Slide 15 text

Web infrastructure circa '00
- 2 servers

Slide 16

Slide 16 text

Visibility circa '00

Slide 17

Slide 17 text

Web infrastructure circa '16

Slide 18

Slide 18 text

Visibility circa '16

Slide 19

Slide 19 text

Q: How is business doing today?
A:

Slide 20

Slide 20 text

Q: How is business doing today?
A: Based on these key metrics, we're looking good.

Slide 21

Slide 21 text

Quick case study: Exoscale

Slide 22

Slide 22 text

Exoscale in a nutshell
- Infrastructure as a Service
  - VMs
  - Object Storage
- A large distributed system
  - Control plane
  - Large number of spread-out tasks
  - High node volatility

Slide 23

Slide 23 text

Monitoring the cloud
- No place for traditional metrics
- High CPU usage is a good thing
- High memory usage is a good thing
- How much are we billing per hour?
- Early spike detection, API error surfacing

Slide 24

Slide 24 text

Our shopping list, back in 2012
- Passive monitoring engine
- Ability to work on windows of events
- Ability to combine in-app and system metrics

Slide 25

Slide 25 text

Introducing Riemann
http://riemann.io
https://github.com/riemann/riemann

Slide 26

Slide 26 text

A distributed system monitoring engine
A unified language for dealing with events

Slide 27

Slide 27 text

Nothing new to functional programmers
- A different intended audience
- Same concepts, better UI

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

A unified language
    {:host "web01"
     :service "cpu"
     :state "ok"
     :time 1465944922
     :description "all good"
     :tags ["production" "web-cluster"]
     :metric 64.5
     :ttl 60}
The identity of an event is the combination of its host and service.

Slide 30

Slide 30 text

Sending data to riemann
    (require '[riemann.client :as r])
    (def c (r/tcp-client {:host "localhost"}))
    ;; send-event takes the client as its first argument; deref the
    ;; result to wait for the server's acknowledgement.
    @(r/send-event c {:host "myhost.foobar.com"
                      :service "myservice"
                      :metric 12.0})
Other languages: Java, Golang, Python, Ruby, C/C++, C#, and more!

Slide 31

Slide 31 text

Data emission
- Collectd
- Syslog-ng
- Nagios
- TCP/TLS/UDP
- HTTP/Websockets/SSE

Slide 32

Slide 32 text

Output support
- TCP/TLS/UDP
- UDP
- HTTP
- Graphite
- Pagerduty
- Slack
- Hipchat

Slide 33

Slide 33 text

Visualization support

Slide 34

Slide 34 text

A stream processing engine
- Fast multi-core asynchronous network processing engine
- An in-memory store for events
- A language (DSL) for routing events from input to output

Slide 35

Slide 35 text

Fast network input engine
- Based on Netty
- Upwards of a million events per second on a VM
- Protobuf-encoded events, with batching support

Slide 36

Slide 36 text

The event store
- Based on a (very) fast lock-less concurrent hash map.
- Stores the last version of events by identity (host, service).
- In-memory event store
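
To make "last version of events by identity" concrete, here is a minimal sketch in plain Clojure. It is a toy, not Riemann's index: the names `toy-index`, `store-event!` and `lookup` are hypothetical, and an atom holding a map stands in for the lock-less concurrent hash map mentioned on the slide.

```clojure
;; Toy stand-in for the event store: one entry per [host service] identity.
;; All names here are hypothetical, chosen for illustration only.
(def toy-index (atom {}))

(defn store-event!
  "Keep only the latest event for each [host service] identity."
  [event]
  (swap! toy-index assoc [(:host event) (:service event)] event)
  event)

(defn lookup
  "Return the last event stored for the given host and service, or nil."
  [host service]
  (get @toy-index [host service]))

;; Made-up sample events: two share an identity, one does not.
(store-event! {:host "web01" :service "cpu" :state "ok" :metric 64.5})
(store-event! {:host "web01" :service "cpu" :state "critical" :metric 98.0})
(store-event! {:host "web02" :service "cpu" :state "ok" :metric 12.0})

(count @toy-index)               ;; two identities are stored
(:state (lookup "web01" "cpu"))  ;; the later event replaced the earlier one
```

Three events arrive, but only two entries remain, because the second web01/cpu event overwrites the first.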

Slide 37

Slide 37 text

The riemann DSL
- An internal DSL, in Clojure
- Simple functions to:
  - Work on events
  - Set up inputs
  - Set up outputs
- A plug-in system to pull in 3rd-party extensions

Slide 38

Slide 38 text

How is this better than Storm?
- Single host solution
- In-memory event store
- These compromises enable performance
- Scaling is still possible

Slide 39

Slide 39 text

Configuring Riemann

Slide 40

Slide 40 text

Configuration key concepts
- Events go through streams
- Streams are lists of functions called for each event
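
The idea that streams are functions called for each event can be sketched in plain Clojure without Riemann. Everything below is an illustrative assumption: `where-state`, `toy-streams` and `dispatch!` are hypothetical names, not Riemann's API.

```clojure
;; A stream is a function of one event. A stream with children decides
;; whether to pass the event on to them. Names are hypothetical.
(defn where-state
  "Return a stream that forwards events in the given state to its children."
  [state & children]
  (fn stream [event]
    (when (= state (:state event))
      (doseq [child children]
        (child event)))))

;; Records what each stream saw, so the flow can be inspected afterwards.
(def seen (atom []))

;; The "list of functions called for each event".
(def toy-streams
  [(where-state "critical" #(swap! seen conj (:service %)))
   (fn [_event] (swap! seen conj :all))])

(defn dispatch!
  "Hand the event to every top-level stream, in order."
  [event]
  (doseq [stream toy-streams]
    (stream event)))

(dispatch! {:service "api" :state "ok"})
(dispatch! {:service "db" :state "critical"})

@seen ;; [:all "db" :all]
```

The first stream only reacts to the critical event, while the second sees both, which is the same shape as nesting `where` inside `streams`.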

Slide 41

Slide 41 text

A simple config
    (logging/init {:file "/logriemann.log" :console? false})
    (tcp-server {:host "0.0.0.0" :port 5555})
    (periodically-expire 60)
    (let [store! (index)
          email  (mailer {:from "ops@example.com"})]
      (streams
        (default {:state "ok" :ttl 120}
          (expired #(info %))
          store!
          (by [:host :service]
            (changed :state {:init "ok"}
              (email "ops@example.com"))))))

Slide 42

Slide 42 text

UI Matters!
This is hard:
    (def high-latency-transducer
      (comp (filter #(and (= (:service %) "api request latency")
                          (> (:metric %) 300.0)))
            (map #(assoc % :state "warning"))))
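
For readers less familiar with transducers, this is roughly how the definition above would be applied to a batch of events. The sample events and the name `flagged` are made up for illustration.

```clojure
;; Same transducer as on the slide, repeated so the block stands alone.
(def high-latency-transducer
  (comp (filter #(and (= (:service %) "api request latency")
                      (> (:metric %) 300.0)))
        (map #(assoc % :state "warning"))))

;; Made-up sample events.
(def sample-events
  [{:service "api request latency" :metric 120.0}
   {:service "api request latency" :metric 450.0}
   {:service "cpu" :metric 900.0}])

;; Run the events through the transducer and collect the results.
(def flagged (into [] high-latency-transducer sample-events))

flagged
;; [{:service "api request latency" :metric 450.0 :state "warning"}]
```

Only the slow API request survives the filter and gets tagged with a warning state.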

Slide 43

Slide 43 text

UI Matters!
This is easy:
    (def store! (index))
    (streams
      (where (and (service "api request latency")
                  (> metric 300.0))
        (with {:state "warning"}
          store!)))

Slide 44

Slide 44 text

Select use-cases

Slide 45

Slide 45 text

Filtering events
    (streams
      (where (and (service "iptv")
                  (state "critical"))
        (email "alerts@example.com")))

Slide 46

Slide 46 text

Logical manifolds with by
    (by [:host :service]
      (changed :state {:init "ok"}
        (email "alerts@example.com")))
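
Why `by` matters here: `changed` has to remember the previous state, and that memory must be kept separately for each host and service. The following single-threaded sketch mimics that behaviour in plain Clojure. `toy-by` and `toy-changed` are hypothetical stand-ins and do not reflect how Riemann implements `by` or `changed`.

```clojure
;; Hypothetical stand-ins, for illustration only (not thread-safe).
(defn toy-by
  "Create one child stream per distinct key, and route each event to
  the child for its key. `make-child` is called once per new key."
  [key-fn make-child]
  (let [children (atom {})]
    (fn stream [event]
      (let [k (key-fn event)]
        (when-not (contains? @children k)
          (swap! children assoc k (make-child)))
        ((get @children k) event)))))

(defn toy-changed
  "Forward an event only when (field event) differs from the last value seen."
  [field init child]
  (let [prev (atom init)]
    (fn stream [event]
      (let [v (field event)]
        (when (not= v @prev)
          (reset! prev v)
          (child event))))))

;; Stands in for the mailer: records [host state] for each alert.
(def alerts (atom []))

(def state-changes
  (toy-by (juxt :host :service)
          #(toy-changed :state "ok"
                        (fn [e] (swap! alerts conj [(:host e) (:state e)])))))

;; Made-up sample events.
(doseq [e [{:host "a" :service "cpu" :state "ok"}
           {:host "b" :service "cpu" :state "critical"}
           {:host "a" :service "cpu" :state "critical"}
           {:host "b" :service "cpu" :state "critical"}
           {:host "a" :service "cpu" :state "ok"}]]
  (state-changes e))

@alerts ;; [["b" "critical"] ["a" "critical"] ["a" "ok"]]
```

Without the per-key split, the interleaved events from hosts a and b would look like constant flapping; with it, each host only alerts on its own transitions.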

Slide 47

Slide 47 text

Rollups
    (rollup 5 3600
      (email "ops@example.com"))

Slide 48

Slide 48 text

Rewriting
    (with :service "rewritten service" ...)
    (adjust [:service str "rate"] ...)

Slide 49

Slide 49 text

Grouping and folding
    ;; moving-time-window hands the whole window to its children
    (moving-time-window 60
      ;; mapping folds/minimum over the window always yields the minimum event
      (smap folds/minimum
        (with :service "minimum per minute"
          store!)))
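
The window-then-fold idea can be sketched as a pure function: select the events that fall inside the time window, then reduce them to a single event. `window-minimum` is a hypothetical helper written for this illustration, and the sample events are made up.

```clojure
;; Hypothetical helper: not part of Riemann.
(defn window-minimum
  "Return the event with the smallest :metric among events whose :time
  falls within the last `seconds` before `now`, or nil if there are none."
  [seconds now events]
  (let [recent (filter #(> (:time %) (- now seconds)) events)]
    (when (seq recent)
      (apply min-key :metric recent))))

;; Made-up sample events; times are in seconds.
(def sample-events
  [{:time 0   :metric 1.0}
   {:time 30  :metric 2.0}
   {:time 70  :metric 9.0}
   {:time 100 :metric 5.0}])

;; With a 60-second window ending at t=100, only the last two events count.
(window-minimum 60 100 sample-events) ;; {:time 100 :metric 5.0}
```

The globally smallest metric (1.0) is ignored because it is older than the window, which is the point of combining a window with a fold.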

Slide 50

Slide 50 text

Bounds checking
    (within [0 1] ...)
    (without [0 1] ...)
    (over 9000 ...)
    (under 10 ...)

Slide 51

Slide 51 text

Fun idea: trending metrics
    (let [store    (index)
          trending (top 10 (juxt :metric :time)
                        (tag "top" store)
                        store)]
      (streams
        (by :service
          (moving-time-window 3600
            (smap folds/sum trending)))))
This is awesome for finding outliers in your cluster.
Full description: http://spootnik.org/entries/2014/01/14_real-time-twitter-trending-on-a-budget.html

Slide 52

Slide 52 text

A look under the hood
How do we build a good programming UI?
- Lots of macros
- Relying on Clojure's STM

Slide 53

Slide 53 text

Clojure as a configuration language
    (defn include [path]
      (let [path (config-file-path path)
            file (file path)]
        (binding [*config-file* path
                  *ns* (find-ns 'riemann.config)]
          (load-file path))))

Slide 54

Slide 54 text

Mutable state
    (def store! (index))

    (streams
      store!
      prn
      (expired #(info "event expired:" %)))

    (streams
      (where (state "critical")
        (email "ops@example.com")))

Slide 55

Slide 55 text

Mutable state
    (defn streams [& children]
      (swap! next-core update :streams #(reduce conj % children)))
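
The effect of that `swap!` is that each call to `streams` appends to the same list, so a config can call it as many times as it likes. A stand-alone sketch of the same pattern follows; `toy-core` and `toy-streams` are hypothetical names, and the atom here only mimics the shape of the real core.

```clojure
;; Hypothetical stand-in for the core being built while the config loads.
(def toy-core (atom {:streams []}))

(defn toy-streams
  "Append the given stream functions to the core's :streams vector."
  [& children]
  (swap! toy-core update :streams #(reduce conj % children)))

;; Two separate calls, as a config file might make.
(toy-streams prn)
(toy-streams identity str)

(count (:streams @toy-core)) ;; 3
(:streams @toy-core)         ;; streams are kept in call order
```

Because the update goes through an atom, evaluating the config is just a series of side-effecting function calls that accumulate state.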

Slide 56

Slide 56 text

Clojure Internals
A simple example:
    (def store! (index))

    (streams
      store!
      (tagged-any ["should-print" "debug"]
        prn))

    (streams
      (coalesce
        (smap folds/sum
          (with :service "global sum"
            store!))))

Slide 57

Slide 57 text

tagged-any
    (defn tagged-any? [tags event]
      (not= nil (some (set tags) (:tags event))))

    (defn tagged-any [tags & children]
      (fn stream [event]
        (when (tagged-any? tags event)
          (call-rescue event children))))
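
The predicate relies on a Clojure idiom: a set can be called as a function, so `(some (set tags) coll)` returns the first element of `coll` that is in the set, or nil. A quick check with made-up events:

```clojure
;; Same predicate as on the slide, repeated so the block stands alone.
(defn tagged-any? [tags event]
  (not= nil (some (set tags) (:tags event))))

;; One of the wanted tags is present.
(tagged-any? ["should-print" "debug"] {:tags ["production" "debug"]}) ;; true

;; None of the wanted tags is present.
(tagged-any? ["should-print" "debug"] {:tags ["production"]})         ;; false

;; An event without :tags is handled too, since (some f nil) is nil.
(tagged-any? ["debug"] {})                                             ;; false
```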

Slide 58

Slide 58 text

call-rescue
    (defmacro call-rescue [event children]
      `(doseq [child# ~children]
         (try
           (child# ~event)
           (catch Throwable e#
             (warn e# (str child# " threw"))))))

Slide 59

Slide 59 text

coalesce
    (defn coalesce [& children]
      (let [stored (atom [])]                           ;; Our mutable state
        (fn stream [event]
          (let [events (swap! stored conj event)]       ;; Add new event to stored list
            (call-rescue events children)))))           ;; Hand-off to children
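
The slide's version is simplified and keeps every event it has ever seen. A variant that remembers only the latest event per host and service can be sketched as below. This is an assumption about what one would usually want, not Riemann's implementation: `toy-coalesce` is a hypothetical name, and error handling and expiry are left out.

```clojure
;; Hypothetical variant for illustration; no call-rescue, no expiry.
(defn toy-coalesce
  "Remember the latest event per [host service] and pass the full set of
  remembered events to each child whenever a new event arrives."
  [& children]
  (let [stored (atom {})]
    (fn stream [event]
      (let [events (vals (swap! stored assoc
                                [(:host event) (:service event)]
                                event))]
        (doseq [child children]
          (child events))))))

;; Records the running total computed from the coalesced events.
(def sums (atom []))

(def global-sum
  (toy-coalesce
    (fn [events] (swap! sums conj (reduce + (map :metric events))))))

;; Made-up sample events: host a reports twice, host b once.
(global-sum {:host "a" :service "req" :metric 1})
(global-sum {:host "b" :service "req" :metric 2})
(global-sum {:host "a" :service "req" :metric 5})

@sums ;; [1 3 7]
```

The third total is 7 rather than 8 because the newer event from host a replaces its older one, which is what makes a "global sum" across hosts meaningful.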

Slide 60

Slide 60 text

Wrapping Up

Slide 61

Slide 61 text

Clojure helps!
- Exposing DSLs is as simple as it can get.
- The STM makes building functions that safely hold on to state very easy.

Slide 62

Slide 62 text

They're using riemann now
- Spotify, extensively (7k+ nodes)
- Kickstarter
- CC in2p3 (huge research compute center)
- Plenty of startups

Slide 63

Slide 63 text

Thank you!
Questions?
Illustrations by Kyle Kingsbury - @aphyr.