New-age monitoring toolchain

αλεx π
September 20, 2013

A talk about a monitoring toolchain for a more civilized age

Transcript

  1. @ifesdjeen: tweet right along with the talk if you have a question
    Friday, September 20, 13
  2. Current state of the art:
    - So easy to get started
    - Just as easy to hit the limit
    - Just a tiny bit too hard to change tooling
    - And even harder to extend the existing one
  3. General path:
    - Try out some existing tool
    - Find ways to implement more complex scenarios
    - Eventually give in and use what’s already there
    - Keep ranting and saying `#monitoringsucks`
  4. Put on your monitoring gloves, monitoring hats, monitoring socks (you got ’em).
    We’re going on a journey.
  5. In my world everyone's a pony and they all eat rainbows and poop butterflies
  6. I once wanted to understand what’s going on on my website
  7. ...and then I was like
    monitoring.increment "page_load_#{response.status}_count"
    monitoring.timing "page_load_#{response.status}_load_time", time
    monitoring.increment "page_load_#{request.user_agent}_count"
    monitoring.gauge "page_load_#{response.status}_time_gauge", time
  8. And had to deploy it to a hundred-something servers (lmao, right?)
  9. Package everything related to a single event into a single payload
    and figure out which metrics you need on the server side
  10. What it gives you:
    - granularity
    - simple, stupid client
    - no need to think in advance*
    - rethink your metrics any time you want
    - add more rollups and aggregates as you need them
    * (of course, you’d still have to think, you can just take it much easier)
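A minimal sketch of the "one payload per event" client, written in Ruby to match the snippet on slide 7. The `EventReporter` class, the collector address, and the JSON-over-UDP encoding are assumptions made for illustration; the talk does not show the real reporter or its wire format.

```ruby
require "json"
require "socket"

# Hypothetical client: it packs everything known about one event into a
# single payload and leaves the choice of metrics to the server side.
class EventReporter
  def initialize(application:, environment:, host: "127.0.0.1", port: 8125)
    @base = { application: application, environment: environment }
    @host = host
    @port = port
  end

  # Builds the payload without sending it.
  def build(type, attributes = {})
    @base.merge(type: type).merge(attributes)
  end

  # Fire-and-forget: one UDP datagram per event, JSON-encoded (an assumption).
  def report(type, attributes = {})
    payload = JSON.generate(build(type, attributes))
    @socket ||= UDPSocket.new
    @socket.send(payload, 0, @host, @port)
    payload
  end
end

reporter = EventReporter.new(application: "my-app", environment: "production")
event = reporter.build("page_load", execution_time: 542, status: 200, host: "web001")
```

The client never decides whether `execution_time` becomes a counter, a timer, or a gauge; it only reports what happened.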
  11. [Architecture diagram] Components shown:
    - reporter (×5)
    - UDP, TCP, AMQP
    - Collector (×2)
    - Processing unit
    - Persistent Store
    - State Machine
    - Alerts
    - Reduce Engine (ad-hoc queries and analytics)
    - History Graphs, Table Data
    - Real-time graphs, Real-time table data
    - Latest Events (capped in-memory store)
    - Console (nrepl, custom consumers, real-time analysis)
  12. add counts for both
    [Diagram] Even / Odd splitter
    - Multicast: Even numbers count, Even numbers buffer
    - Multicast: Odd numbers count, Odd numbers buffer
  13. calculate sum for each 10
    [Diagram] Even / Odd splitter
    - Multicast: Even numbers count, Even numbers buffer
    - Multicast: Odd numbers count, Odd numbers buffer
    - Summarizer, Sum buffer (×2)
  14. calculate sum for each 10
    [Diagram] Even / Odd splitter
    - Multicast: Even numbers count, Even numbers buffer
    - Multicast: Odd numbers count, Odd numbers buffer
    - Summarizer, Sum buffer
  15. calculate sum for each 10
    [Diagram] Even / Odd splitter
    - Multicast (×2): Even numbers count, Even numbers buffer
    - Summarizer, Sum buffer
  16. Event-based
    - Create simple, independent parts (aggregate, filter, multicast, transformer etc.)
    - Define dependencies between them (routing)
    - Parts are completely decoupled
    - Every part can have its own state
    - Routing is dynamic, and can be changed at runtime
    - Graphs are expressive, easy to understand
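The even/odd topology from the preceding slides can be sketched with a tiny in-process emitter. This is an illustration in Ruby, not the EEP API: the `Emitter` class and its `on`, `off`, and `notify` methods are hypothetical names chosen for this example.

```ruby
# Hypothetical in-process emitter: handlers are registered under a key and
# know nothing about each other; routing is just data and can change at runtime.
class Emitter
  def initialize
    @handlers = Hash.new { |hash, key| hash[key] = [] }
  end

  def on(key, &handler)
    @handlers[key] << handler
    self
  end

  def off(key)
    @handlers.delete(key)
    self
  end

  def notify(key, value)
    @handlers.fetch(key, []).each { |handler| handler.call(value) }
  end
end

emitter = Emitter.new
counts  = Hash.new(0)
buffers = { even: [], odd: [] }

# Splitter: routes each number to the :even or the :odd branch.
emitter.on(:numbers) { |n| emitter.notify(n.even? ? :even : :odd, n) }

# Multicast: several handlers on one key fan the same event out.
%i[even odd].each do |branch|
  emitter.on(branch) { |_n| counts[branch] += 1 }
  emitter.on(branch) { |n| buffers[branch] << n }
end

(1..10).each { |n| emitter.notify(:numbers, n) }

# Routing is dynamic: drop the odd branch, later odd numbers go nowhere.
emitter.off(:odd)
emitter.notify(:numbers, 11)
```

Each part keeps its own state (`counts`, `buffers`), and removing a branch does not require touching the splitter.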
  17. Why it matters?
    - Raw numbers are too much
    - A window is an easy way to accumulate and summarize
    - Flexible and very composable
  18. Sliding window
    t0       t1           t2 (emit)
    +---+    +---+---+    +---+---+---+
    | 1 |    | 1 | 2 |    | 1 | 2 | 3 | <6>
    +---+    +---+---+    +---+---+---+

    t4 (emit)
    -...+---+---+---+         -...+---+---+---+
    : 1 : 2 | 3 | 4 | <9>     : 2 : 3 | 4 | 5 | <12>
    -...+---+---+---+         -...+---+---+---+
  19. Sliding window
    Accumulates items in buffer
    When full:
    - emits all contents
    - drops the oldest value
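A minimal Ruby sketch of the sliding window described above, assuming a count-based window of fixed size. The `SlidingWindow` class is a hypothetical name for illustration and is not how EEP implements it.

```ruby
# Sliding window sketch: once the buffer holds `size` items, every new item
# pushes out the oldest one and the whole buffer is emitted again.
class SlidingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(item)
    @buffer << item
    @buffer.shift if @buffer.size > @size          # drop the oldest value
    @on_emit.call(@buffer.dup) if @buffer.size == @size
  end
end

sums = []
window = SlidingWindow.new(3) { |items| sums << items.sum }
(1..5).each { |n| window.push(n) }
# sums is now [6, 9, 12], the <6>, <9> and <12> of the sliding-window diagram
```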
  20. Tumbling window
    t0       t1           t2 (emit)
    +---+    +---+---+    +---+---+---+          -...-...-...-
    | 1 |    | 1 | 2 |    | 1 | 2 | 3 | <6>      : 1 : 2 : 3 :
    +---+    +---+---+    +---+---+---+          -...-...-...-

    t3       t4           t5 (emit)
    +---+    +---+---+    +---+---+---+          -...-...-...-
    | 4 |    | 4 | 5 |    | 4 | 5 | 6 | <15>     : 4 : 5 : 6 :
    +---+    +---+---+    +---+---+---+          -...-...-...-
  21. Tumbling window
    Accumulates items in buffer
    When full:
    - emits all contents
    - drops all values
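For comparison, a tumbling window under the same assumptions (count-based, fixed size) might look like the sketch below. `TumblingWindow` is again a hypothetical name; the only difference from a sliding window is that the whole buffer is discarded after each emit.

```ruby
# Tumbling window sketch: emit the full buffer, then start over from empty.
class TumblingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(item)
    @buffer << item
    return unless @buffer.size == @size

    @on_emit.call(@buffer)
    @buffer = []                                   # drop all values
  end
end

sums = []
window = TumblingWindow.new(3) { |items| sums << items.sum }
(1..7).each { |n| window.push(n) }
# sums is now [6, 15]; the 7 stays buffered until the window fills again
```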
  22. Clock
    Controls whether a window should or should not emit yet
    Clocks in windows are arbitrary
    Can be:
    - monotonic
    - wall clock
    - arbitrary (business clock)
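One way to read "clocks are arbitrary" is that the window asks a pluggable clock whether its period has elapsed. The sketch below assumes a clock is any callable that returns a number; `ClockedWindow` and the batch-counting "business clock" are hypothetical and only illustrate the idea.

```ruby
# Window that delegates the "should I emit yet?" decision to a pluggable clock.
class ClockedWindow
  def initialize(period, clock:, &on_emit)
    @period = period
    @clock = clock
    @on_emit = on_emit
    @buffer = []
    @started_at = clock.call
  end

  def push(item)
    now = @clock.call
    if now - @started_at >= @period
      @on_emit.call(@buffer) unless @buffer.empty?
      @buffer = []
      @started_at = now
    end
    @buffer << item
  end
end

# Three interchangeable clocks.
monotonic = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
wall      = -> { Time.now.to_f }
batch     = 0
business  = -> { batch }   # time moves only when a batch is processed

emitted = []
window = ClockedWindow.new(2, clock: business) { |items| emitted << items }
window.push(:a)            # batch 0
batch = 1
window.push(:b)            # batch 1, still inside the period
batch = 2
window.push(:c)            # period elapsed: emits [:a, :b], then buffers :c
```

Swapping `business` for `monotonic` or `wall` changes when the window emits without changing the window itself.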
  23. Idea is simple:
    • Everything that’s coming in is an event
    • Events are split by triplet (application/env/event type)
    • Every event can have multiple metrics
    • Metric has a key and value
    • And filter
    • And several rollups (tumbling window)
    • Rollup has an aggregate function triggered on overflow
    • And sliding window with last N values
    • And visualization (area, line, barchart) attached
  24. Incoming payload
    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
  25. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    Identification (splitter)
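The identification step can be sketched as pulling the (application, environment, type) triplet out of the payload and treating the rest as metric data. The helper names `split_key` and `split_event` are hypothetical; the payload is the one from the slide, written as a Ruby hash.

```ruby
# Keys that identify an event stream; everything else is metric data.
IDENTITY_KEYS = %i[application environment type].freeze

def split_key(event)
  event.values_at(*IDENTITY_KEYS)
end

def split_event(event)
  [split_key(event), event.reject { |key, _value| IDENTITY_KEYS.include?(key) }]
end

event = {
  application: "my-app",
  environment: "production",
  type: "page_load",
  execution_time: 542,
  user_agent: "Mozilla...",
  host: "web001",
  status: 200
}

key, metrics = split_event(event)
# key     is ["my-app", "production", "page_load"]
# metrics holds :execution_time, :user_agent, :host and :status
```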
  26. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    key -> value -> aggregate: median
  27. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    key -> value -> aggregate: max
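The "key -> value -> aggregate" chain can be sketched as: pick a key, collect its values over a window, and apply each aggregate function to the collected values. The `AGGREGATES` table and the sample execution times other than 542 are made up for illustration.

```ruby
# Median of a list of numbers; nil for an empty list.
def median(values)
  return nil if values.empty?

  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Aggregate name -> function over the values collected in one window.
AGGREGATES = {
  median: method(:median),
  max: ->(values) { values.max }
}.freeze

# Illustrative events; only the 542 comes from the slides.
events = [542, 120, 310, 98].map { |ms| { type: "page_load", execution_time: ms } }

values = events.map { |event| event[:execution_time] }            # key -> value
rollup = AGGREGATES.transform_values { |fn| fn.call(values) }     # value -> aggregate
# rollup is { median: 215.0, max: 542 }
```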
  28. Idea is simple:
    • Everything that’s coming in is an event
    • Events are split by triplet (application/env/event type)
    • Every event can have multiple metrics
    • Metric has a key and value
    • And filter
    • And several rollups (tumbling window)
    • Rollup has an aggregate function triggered on overflow
    • And sliding window with last N values
    • And visualization (area, line, barchart) attached
  29. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))
  30. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))
  31. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:status]))
        (value [:execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))
  32. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))
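To make the moving parts of such a definition concrete, here is a rough Ruby sketch of what a metric with `group`, `value`, `rollup`, `aggregate` and `store-last` might do at runtime. It is a guess at the semantics, not the talk's implementation: the `Metric` class is hypothetical, `rollup!` is called by hand instead of by a one-second clock, visualization is left out, and the payload uses top-level `:status` and `:execution_time` keys.

```ruby
# Hypothetical runtime for one metric definition.
class Metric
  attr_reader :history

  def initialize(group:, value:, aggregates:, store_last: 60)
    @group = group
    @value = value
    @aggregates = aggregates
    @store_last = store_last
    @pending = Hash.new { |hash, key| hash[key] = [] }   # current rollup window
    @history = Hash.new { |hash, key| hash[key] = [] }   # last N aggregate values
  end

  # Extract the value and file it under its group (e.g. the HTTP status).
  def record(event)
    @pending[@group.call(event)] << @value.call(event)
  end

  # Called when the rollup window overflows (every second in the slides).
  def rollup!
    @pending.each do |group, values|
      @aggregates.each do |name, fn|
        series = @history[[group, name]]
        series << fn.call(values)
        series.shift while series.size > @store_last     # store-last N
      end
    end
    @pending.clear
  end
end

page_load = Metric.new(
  group: ->(event) { event[:status] },
  value: ->(event) { event[:execution_time] },
  aggregates: {
    max: ->(values) { values.max },
    mean: ->(values) { values.sum.to_f / values.size },
    min: ->(values) { values.min }
  }
)

[542, 120, 310].each { |ms| page_load.record(status: 200, execution_time: ms) }
page_load.record(status: 500, execution_time: 900)
page_load.rollup!
# page_load.history[[200, :max]] is [542], [[200, :mean]] is [324.0], [[200, :min]] is [120]
```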
  33. Built on open source
    - EEP for event emitter & windows: https://github.com/clojurewerkz/eep
    - Meltdown for anonymous topologies: https://github.com/clojurewerkz/meltdown
    - Eventoverse-graphs for graphs: https://github.com/ifesdjeen/eventoverse-graphs
    - Clj-push for websockets: https://github.com/ifesdjeen/clj-pushr
    - Cascalog for map/reduce: https://github.com/nathanmarz/cascalog