Slide 1

Slide 1 text

Monitorin’ it Friday, September 20, 13

Slide 2

Slide 2 text


Slide 3

Slide 3 text

@ifesdjeen: tweet along during the talk if you have a question

Slide 4

Slide 4 text

This talk is rather philosophical

Slide 5

Slide 5 text

Hobbyist monitoring geek (in free time, of course)

Slide 6

Slide 6 text

Several rewrites from scratch

Slide 7

Slide 7 text

Tons of experience

Slide 8

Slide 8 text

Many extracted components already available as open-source solutions

Slide 9

Slide 9 text

Needs your opinion

Slide 10

Slide 10 text

Monitoring is not about software

Slide 11

Slide 11 text

It’s all about insight

Slide 12

Slide 12 text

Current state of the art:
- So easy to get started
- Just as easy to hit the limit
- Just a tiny bit too hard to change tooling
- And even harder to extend the existing one

Slide 13

Slide 13 text

General path:
- Try out some existing tool
- Find ways to implement more complex scenarios
- Eventually give in and use what’s already there
- Keep ranting and saying `#monitoringsucks`

Slide 14

Slide 14 text

But wait, I thought there was...

Slide 15

Slide 15 text

#monitoringlove #monitoringlove

Slide 16

Slide 16 text

Put on your monitoring gloves, monitoring hats, monitoring socks (you got ’em). We’re going on a journey.

Slide 17

Slide 17 text

When I hear Monitoring, I’m like

Slide 18

Slide 18 text

In my world everyone's a pony and they all eat rainbows and poop butterflies

Slide 19

Slide 19 text

Let’s try to redefine it all, but first let’s simplify it

Slide 20

Slide 20 text

Ad-hoc vs Post-hoc

Slide 21

Slide 21 text


Slide 22

Slide 22 text

I once wanted to understand what’s going on on my website

Slide 23

Slide 23 text

...and then I was like

    monitoring.increment "page_load_#{response.status}_count"
    monitoring.timing "page_load_#{response.status}_load_time", time
    monitoring.increment "page_load_#{request.user_agent}_count"
    monitoring.gauge "page_load_#{response.status}_time_gauge", time

Slide 24

Slide 24 text

And deployed it to a hundred-something servers (rofl, right?)

Slide 25

Slide 25 text

And then I wanted to add more

Slide 26

Slide 26 text

And had to deploy it to a hundred-something servers (lmao, right?)

Slide 27

Slide 27 text

And then I wanted to... well, you get the idea

Slide 28

Slide 28 text

Anything that a server can do, the server should do

Slide 29

Slide 29 text

What if you want a suit...

Slide 30

Slide 30 text

Let’s turn it the other way around

Slide 31

Slide 31 text

Package everything related to a single event into a single payload, and figure out which metrics you need on the server side
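To make the "single payload" idea concrete, here is a minimal sketch of what the client side could shrink to. Everything in it is an assumption made for illustration: the `build_page_load_event` helper, the `Reporter` class, and JSON as the wire format are hypothetical, and the transport is injected so the same code could sit on top of UDP, TCP, or AMQP.

```ruby
require "json"

# Hypothetical helper: gather everything known about one page load into
# a single event hash instead of computing metrics on the client.
def build_page_load_event(application:, environment:, host:, status:, execution_time:, user_agent:)
  {
    application: application,
    environment: environment,
    type: "page_load",
    host: host,
    status: status,
    execution_time: execution_time,
    user_agent: user_agent
  }
end

# Hypothetical reporter: serializes the event and hands it to a transport.
# The transport is anything that responds to #call, for example a lambda
# wrapping a socket write. JSON is an assumed wire format.
class Reporter
  def initialize(transport)
    @transport = transport
  end

  def report(event)
    @transport.call(JSON.generate(event))
  end
end

# Usage: capture payloads in an array instead of sending them anywhere.
sent = []
reporter = Reporter.new(->(bytes) { sent << bytes })
reporter.report(
  build_page_load_event(
    application: "my-app", environment: "production", host: "web001",
    status: 200, execution_time: 542, user_agent: "Mozilla..."
  )
)
puts sent.first
```

The client no longer decides what a metric is; it only describes what happened, so adding a metric later does not require redeploying it.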

Slide 32

Slide 32 text

What it gives you:
- granularity
- simple, stupid client
- no need to think in advance*
- rethink your metrics any time you want
- add more rollups and aggregates as you need them

* (of course, you’d still have to think; you can just take it much easier)

Slide 33

Slide 33 text

What would a monitoring system look like?

Slide 34

Slide 34 text


Slide 35

Slide 35 text

[Diagram: system architecture. Components shown: reporters; transports (UDP, TCP, AMQP); Collectors; Processing unit; Persistent Store; State Machine; Alerts; Reduce Engine (ad-hoc queries and analytics); History Graphs; Table Data; Real-time graphs; Real-time table data; Latest Events (capped in-memory store); Console (nrepl, custom consumers, real-time analysis)]

Slide 36

Slide 36 text


Slide 37

Slide 37 text

Processing unit

Slide 38

Slide 38 text

Simple Scenario
[Diagram: Even / Odd splitter; Even numbers; Odd numbers]

Slide 39

Slide 39 text

Add counts for both
[Diagram: Even / Odd splitter; Multicast (x2); Even numbers count; Even numbers buffer; Odd numbers count; Odd numbers buffer]

Slide 40

Slide 40 text

Calculate sum for each 10
[Diagram: Even / Odd splitter; Multicast (x2); Even numbers count; Even numbers buffer; Odd numbers count; Odd numbers buffer; Summarizer (x2); Sum buffer (x2)]

Slide 41

Slide 41 text

Calculate sum for each 10
[Diagram: Even / Odd splitter; Multicast (x2); Even numbers count; Even numbers buffer; Odd numbers count; Odd numbers buffer; Summarizer; Sum buffer]

Slide 42

Slide 42 text

Calculate sum for each 10
[Diagram: Even / Odd splitter; Multicast (x2); Even numbers count; Even numbers buffer; Summarizer; Sum buffer]

Slide 43

Slide 43 text

Event-based
- Create simple, independent parts (aggregate, filter, multicast, transformer, etc.)
- Define dependencies between them (routing)
- Parts are completely decoupled
- Every part can have its own state
- Routing is dynamic and can be changed at runtime
- Graphs are expressive and easy to understand
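The even/odd example from the previous slides can be sketched in a few lines. This is not EEP's API; the `Emitter`, `Counter`, `Buffer`, and `Summarizer` classes, the `build_topology` helper, and the routing keys are all hypothetical names chosen to show decoupled parts, multicast (several parts under one key), per-part state, and routing that can change at runtime.

```ruby
# A tiny in-process emitter: parts are registered under routing keys.
class Emitter
  def initialize
    @handlers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Attach a part (anything responding to #call) to a routing key.
  # Registering several parts under one key acts as a multicast.
  def on(key, part)
    @handlers[key] << part
    self
  end

  # Routing is plain data, so it can be changed while events flow.
  def remove_all(key)
    @handlers.delete(key)
    self
  end

  def notify(key, value)
    @handlers.fetch(key, []).each { |part| part.call(value) }
  end
end

# Stateful part: counts the values it sees.
class Counter
  attr_reader :count

  def initialize
    @count = 0
  end

  def call(_value)
    @count += 1
  end
end

# Stateful part: remembers every value it sees.
class Buffer
  attr_reader :items

  def initialize
    @items = []
  end

  def call(value)
    @items << value
  end
end

# Sums every `size` values and forwards the sum to the `target` key.
class Summarizer
  def initialize(emitter, target, size)
    @emitter = emitter
    @target = target
    @size = size
    @pending = []
  end

  def call(value)
    @pending << value
    return unless @pending.size == @size

    @emitter.notify(@target, @pending.sum)
    @pending = []
  end
end

def build_topology
  emitter = Emitter.new
  parts = {
    even_count: Counter.new, even_buffer: Buffer.new,
    odd_count: Counter.new, odd_buffer: Buffer.new,
    even_sums: Buffer.new
  }
  # Splitter: re-route each number to the :even or :odd key.
  emitter.on(:numbers, ->(n) { emitter.notify(n.even? ? :even : :odd, n) })
  emitter.on(:even, parts[:even_count]).on(:even, parts[:even_buffer])
  emitter.on(:even, Summarizer.new(emitter, :even_sums, 10))
  emitter.on(:odd, parts[:odd_count]).on(:odd, parts[:odd_buffer])
  emitter.on(:even_sums, parts[:even_sums])
  [emitter, parts]
end

emitter, parts = build_topology
(1..40).each { |n| emitter.notify(:numbers, n) }
puts parts[:even_count].count
puts parts[:even_sums].items.inspect
```

No part knows about any other part; only the routing table connects them, which is why dropping the odd branch is a single `remove_all(:odd)` call.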

Slide 44

Slide 44 text

Even more applicable to monitoring

Slide 45

Slide 45 text

Windowed operations

Slide 46

Slide 46 text

Why does it matter?
- Raw numbers are too much
- A window is an easy way to accumulate and summarize
- Flexible and very composable

Slide 47

Slide 47 text

Sliding window

    t0      t1          t2 (emit)
    +---+   +---+---+   +---+---+---+
    | 1 |   | 1 | 2 |   | 1 | 2 | 3 | <6>
    +---+   +---+---+   +---+---+---+

    t4 (emit)
    -...+---+---+---+        -...+---+---+---+
    : 1 : 2 | 3 | 4 | <9>    : 2 : 3 | 4 | 5 | <12>
    -...+---+---+---+        -...+---+---+---+

Slide 48

Slide 48 text

Sliding window
- Accumulates items in buffer
- When full:
  - emits all contents
  - drops the oldest value
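A count-based sliding window along these lines might be sketched as follows. The `SlidingWindow` class and the `sliding_sums` helper are hypothetical names, not EEP's API; the sum stands in for whatever aggregate you attach.

```ruby
# Count-based sliding window: once the buffer is full, every new value
# triggers an emit of the whole buffer, then the oldest value is dropped.
class SlidingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(value)
    @buffer << value
    return unless @buffer.size == @size

    @on_emit.call(@buffer.dup) # emit a copy so consumers cannot mutate state
    @buffer.shift              # drop the oldest value
  end
end

# Convenience helper: sum each emitted window.
def sliding_sums(values, size)
  sums = []
  window = SlidingWindow.new(size) { |items| sums << items.sum }
  values.each { |value| window.push(value) }
  sums
end

p sliding_sums([1, 2, 3, 4, 5], 3) # => [6, 9, 12]
```

The output matches the diagram: 1+2+3, then 2+3+4, then 3+4+5.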

Slide 49

Slide 49 text

Tumbling window

    t0      t1          t2 (emit)
    +---+   +---+---+   +---+---+---+         -...-...-...-
    | 1 |   | 1 | 2 |   | 1 | 2 | 3 | <6>     : 1 : 2 : 3 :
    +---+   +---+---+   +---+---+---+         -...-...-...-

    t3      t4          t5 (emit)
    +---+   +---+---+   +---+---+---+         -...-...-...-
    | 4 |   | 4 | 5 |   | 4 | 5 | 6 | <15>    : 4 : 5 : 6 :
    +---+   +---+---+   +---+---+---+         -...-...-...-

Slide 50

Slide 50 text

Tumbling window
- Accumulates items in buffer
- When full:
  - emits all contents
  - drops all values
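The tumbling variant differs from the sliding one only in what happens after the emit. A minimal sketch, with `TumblingWindow` and `tumbling_sums` as hypothetical names rather than any library's API:

```ruby
# Count-based tumbling window: when the buffer is full, emit all
# contents and start over with an empty buffer.
class TumblingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(value)
    @buffer << value
    return unless @buffer.size == @size

    @on_emit.call(@buffer)
    @buffer = [] # drop all values by starting a fresh buffer
  end
end

# Convenience helper: sum each emitted window.
def tumbling_sums(values, size)
  sums = []
  window = TumblingWindow.new(size) { |items| sums << items.sum }
  values.each { |value| window.push(value) }
  sums
end

p tumbling_sums([1, 2, 3, 4, 5, 6], 3) # => [6, 15]
```

Each value lands in exactly one window, which is what makes tumbling windows a natural fit for rollups such as "max per second".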

Slide 51

Slide 51 text

Clock
- Controls whether a window should or should not emit yet
- Clocks in windows are arbitrary
- Can be:
  - monotonic
  - wall clock
  - arbitrary (business clock)
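One way to read "clocks are arbitrary" is that the window never asks for the time itself; it is handed a callable and only compares readings. The sketch below assumes that design. `TimedTumblingWindow` and `demo_with_manual_clock` are hypothetical names, and checking the clock only when a value arrives is a simplification (a real system would likely also flush on a timer).

```ruby
# Tumbling window driven by an injected clock. The clock is any callable
# returning a number; the window emits once `period` clock units have
# passed since the current window started.
class TimedTumblingWindow
  def initialize(period, clock, &on_emit)
    @period = period
    @clock = clock
    @on_emit = on_emit
    @buffer = []
    @window_start = clock.call
  end

  def push(value)
    flush_if_due
    @buffer << value
  end

  def flush_if_due
    now = @clock.call
    return if now - @window_start < @period

    @on_emit.call(@buffer) unless @buffer.empty?
    @buffer = []
    @window_start = now
  end
end

# Possible clocks (only the manual one is exercised below):
MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
WALL_CLOCK = -> { Time.now.to_f }

# A "business clock" is just a counter we advance ourselves.
def demo_with_manual_clock
  ticks = 0
  emitted = []
  window = TimedTumblingWindow.new(10, -> { ticks }) { |items| emitted << items.sum }
  [[0, 1], [4, 2], [9, 3], [10, 4], [15, 5], [21, 6]].each do |at, value|
    ticks = at
    window.push(value)
  end
  emitted
end

p demo_with_manual_clock # => [6, 9]
```

Values 1, 2, 3 arrive before tick 10 and are emitted together; 4 and 5 form the next window, and 6 is still buffered when the demo ends.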

Slide 52

Slide 52 text

Summing up

Slide 53

Slide 53 text

The idea is simple:
• Everything that’s coming in is an event
• Events are split by triplet (application/env/event type)
• Every event can have multiple metrics
• Metric has a key and value
• And filter
• And several rollups (tumbling window)
• Rollup has an aggregate function triggered on overflow
• And sliding window with last N values
• And visualization (area, line, barchart) attached

Slide 54

Slide 54 text

Incoming payload

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

Slide 55

Slide 55 text

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

Identification (splitter)

Slide 56

Slide 56 text

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

key -> value -> aggregate: median

Slide 57

Slide 57 text

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

key -> value -> aggregate: max

Slide 58

Slide 58 text

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

key -> aggregate: count

Slide 59

Slide 59 text

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

key -> aggregate: count

Slide 60

Slide 60 text

    {:application "my-app"
     :environment "production"
     :type "exception"
     :stacktrace "NullPointer."
     :host "web001"}

key -> aggregate: count
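Putting the payload slides together: events are routed by the (application, environment, type) triplet, and each metric picks a value out of the event and reduces the collected values with an aggregate. The sketch below is an assumption-heavy illustration of that flow, not the talk's implementation. `MEDIAN`, `METRICS`, the metric names, and the `rollup` helper are all hypothetical, and metric names are assumed to be unique across triplets.

```ruby
# Median of a non-empty list of numbers.
MEDIAN = lambda do |values|
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Hypothetical metric definitions, keyed by the identifying triplet.
# Each metric extracts a value from the event and names an aggregate.
METRICS = {
  ["my-app", "production", "page_load"] => {
    median_execution_time: { value: ->(e) { e[:execution_time] }, aggregate: MEDIAN },
    max_execution_time: { value: ->(e) { e[:execution_time] }, aggregate: ->(v) { v.max } },
    page_loads: { value: ->(_e) { 1 }, aggregate: ->(v) { v.size } }
  },
  ["my-app", "production", "exception"] => {
    exceptions: { value: ->(_e) { 1 }, aggregate: ->(v) { v.size } }
  }
}.freeze

# Collect values per metric, then apply each metric's aggregate once.
# Events whose triplet has no definitions are ignored.
def rollup(events, metrics = METRICS)
  samples = Hash.new { |hash, key| hash[key] = [] }
  events.each do |event|
    triplet = event.values_at(:application, :environment, :type)
    metrics.fetch(triplet, {}).each do |name, definition|
      samples[[triplet, name]] << definition[:value].call(event)
    end
  end
  samples.map do |(triplet, name), values|
    [name, metrics[triplet][name][:aggregate].call(values)]
  end.to_h
end

base = { application: "my-app", environment: "production", host: "web001" }
events = [
  base.merge(type: "page_load", execution_time: 542, status: 200),
  base.merge(type: "page_load", execution_time: 120, status: 200),
  base.merge(type: "page_load", execution_time: 300, status: 404),
  base.merge(type: "exception", stacktrace: "NullPointer.")
]
p rollup(events)
```

Because the definitions live on the server side, adding a new aggregate over `:execution_time` means adding one entry to `METRICS`, with no change to the clients that produce the events.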

Slide 61

Slide 61 text

The idea is simple:
• Everything that’s coming in is an event
• Events are split by triplet (application/env/event type)
• Every event can have multiple metrics
• Metric has a key and value
• And filter
• And several rollups (tumbling window)
• Rollup has an aggregate function triggered on overflow
• And sliding window with last N values
• And visualization (area, line, barchart) attached

Slide 62

Slide 62 text

Please pardon my lisp

Slide 63

Slide 63 text

    (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

Slide 64

Slide 64 text

    (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

Slide 65

Slide 65 text

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

Slide 66

Slide 66 text

    (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:status]))
        (value [:execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

Slide 67

Slide 67 text

    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

Slide 68

Slide 68 text

    (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

Slide 69

Slide 69 text

Wait, I want to see correlations!

Slide 70

Slide 70 text


Slide 71

Slide 71 text

Boxplots?

Slide 72

Slide 72 text

You got ’em

Slide 73

Slide 73 text

Linear regression?

Slide 74

Slide 74 text

Sure, sir!

Slide 75

Slide 75 text


Slide 76

Slide 76 text

Built on open source
- EEP for event emitter & windows: https://github.com/clojurewerkz/eep
- Meltdown for anonymous topologies: https://github.com/clojurewerkz/meltdown
- Eventoverse-graphs for graphs: https://github.com/ifesdjeen/eventoverse-graphs
- Clj-push for websockets: https://github.com/ifesdjeen/clj-pushr
- Cascalog for map/reduce: https://github.com/nathanmarz/cascalog

Slide 77

Slide 77 text

Available soon under @clojurewerkz

Slide 78

Slide 78 text


Slide 79

Slide 79 text

@ifesdjeen