Berlin 2013 - Session - Alex Petrov

Monitorama
September 20, 2013

Transcript

  1. Monitorin’ it Friday, September 20, 13

  3. @ifesdjeen: tweet right along the talk if you have a question

  4. This talk is rather philosophical

  5. Hobbyist monitoring geek (in free time, of course)

  6. Several rewrites from scratch

  7. Tons of experience

  8. Many extracted components already available as open-source solutions

  9. Needs your opinion

  10. Monitoring is not about software

  11. It’s all about insight

  12. Current state of the art:
    - So easy to get started
    - Just as easy to hit the limit
    - Just a tiny bit too hard to change tooling
    - And even harder to extend the existing one

  13. General path:
    - Try out some existing tool
    - Find ways to implement more complex scenarios
    - Eventually give in and use what’s already there
    - Keep ranting and saying `#monitoringsucks`

  14. But wait, I thought there was...

  15. #monitoringlove #monitoringlove

  16. Put on your monitoring gloves, monitoring hats, monitoring socks (you got ’em). We’re going out for a journey.

  17. When I hear Monitoring, I’m like

  18. In my world everyone's a pony and they all eat rainbows and poop butterflies

  19. Let’s try to redefine it all, but first let’s simplify it

  20. Ad-hoc vs Post-hoc

  22. I once wanted to understand what’s going on on my website

  23. ...and then I was like
    monitoring.increment "page_load_#{response.status}_count"
    monitoring.timing "page_load_#{response.status}_load_time", time
    monitoring.increment "page_load_#{request.user_agent}_count"
    monitoring.gauge "page_load_#{response.status}_time_gauge", time

  24. And deployed it to a hundred something servers (rofl, right?)

  25. And then I wanted to add more

  26. And had to deploy it to a hundred something servers (lmao, right?)

  27. And then I wanted to... well, you get the idea

  28. Anything that a server can do, the server should do

  29. What if you want a suit...

  30. Let’s turn it the other way around

  31. Package everything related to a single event into a single payload and figure out which metrics you need on the server side

  32. What it gives you:
    - granularity
    - simple, stupid client
    - no need to think in advance*
    - rethink your metrics any time you want
    - add more rollups and aggregates as you need them
    * Of course, you’d still have to think; you can just take it much easier.
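The shift described on slides 31 and 32 can be sketched in Ruby, the language of the client snippet on slide 23. This is only an illustration under stated assumptions, not the speaker's implementation: `build_event` and `derive_metrics` are hypothetical names, and a plain in-memory array stands in for whatever transport and store a real system would use.

```ruby
# Client side: one generic payload per event, with no metric names baked in.
# `build_event` is a hypothetical helper, not part of any library.
def build_event(application:, environment:, type:, **fields)
  { application: application, environment: environment, type: type }.merge(fields)
end

# Server side: metrics are derived after the fact, so they can be changed
# or extended without redeploying the clients.
def derive_metrics(events)
  page_loads = events.select { |e| e[:type] == "page_load" }
  {
    count_by_status: page_loads.group_by { |e| e[:status] }
                               .transform_values(&:size),
    max_execution_time: page_loads.map { |e| e[:execution_time] }.max
  }
end

events = [
  build_event(application: "my-app", environment: "production", type: "page_load",
              execution_time: 542, host: "web001", status: 200),
  build_event(application: "my-app", environment: "production", type: "page_load",
              execution_time: 120, host: "web002", status: 200),
  build_event(application: "my-app", environment: "production", type: "page_load",
              execution_time: 87, host: "web001", status: 404)
]

metrics = derive_metrics(events)
puts metrics[:count_by_status][200]   # prints 2
puts metrics[:max_execution_time]     # prints 542
```

Adding a new metric here means touching only `derive_metrics`; the clients keep sending the same payload.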
  33. What would a monitoring system look like?

  35. Architecture diagram. Labelled components:
    - reporters (several)
    - transports: UDP, TCP, AMQP
    - collectors
    - processing unit
    - persistent store
    - state machine
    - alerts
    - reduce engine (ad-hoc queries and analytics)
    - latest events (capped in-memory store)
    - console (nREPL, custom consumers, real-time analysis)
    - history graphs, table data
    - real-time graphs, real-time table data

  37. Processing unit

  38. Simple scenario
    Even / Odd splitter
    - Even numbers
    - Odd numbers

  39. Add counts for both
    Even / Odd splitter
    - Even branch: Multicast, Even numbers count, Even numbers buffer
    - Odd branch: Multicast, Odd numbers count, Odd numbers buffer

  40. Calculate sum for each 10
    Even / Odd splitter
    - Even branch: Multicast, Even numbers count, Even numbers buffer, Summarizer, Sum buffer
    - Odd branch: Multicast, Odd numbers count, Odd numbers buffer, Summarizer, Sum buffer

  41. Calculate sum for each 10
    Even / Odd splitter
    - Even branch: Multicast, Even numbers count, Even numbers buffer, Summarizer, Sum buffer
    - Odd branch: Multicast, Odd numbers count, Odd numbers buffer

  42. Calculate sum for each 10
    Even / Odd splitter
    - Even branch: Multicast, Even numbers count, Even numbers buffer, Summarizer, Sum buffer
  43. Event-based
    - Create simple, independent parts (aggregate, filter, multicast, transformer etc)
    - Define dependencies between them (routing)
    - Parts are completely decoupled
    - Every part can have its own state
    - Routing is dynamic, and can be changed at runtime
    - Graphs are expressive, easy to understand
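The even/odd example and the "independent parts plus routing" idea can be sketched in a few lines of Ruby. Everything here is an assumption made for illustration: the names `splitter`, `multicast`, `Counter`, `Buffer`, `Summarizer` and `build_even_odd_topology` are hypothetical, and the wiring (a summarizer on the even branch only, fed by the multicast) is one plausible reading of the diagrams, not the speaker's code.

```ruby
# Every part is a small callable; routing is just "who gets called next".

# Splitter: routes a value to one of two downstream parts.
def splitter(predicate, if_true, if_false)
  ->(value) { predicate.call(value) ? if_true.call(value) : if_false.call(value) }
end

# Multicast: forwards the same value to every downstream part.
def multicast(*downstreams)
  ->(value) { downstreams.each { |part| part.call(value) } }
end

# Counter and Buffer: parts that keep their own state.
class Counter
  attr_reader :count

  def initialize
    @count = 0
  end

  def call(_value)
    @count += 1
  end
end

class Buffer
  attr_reader :items

  def initialize
    @items = []
  end

  def call(value)
    @items << value
  end
end

# Summarizer: sends the sum of every `size` values to a downstream part.
class Summarizer
  def initialize(size, downstream)
    @size = size
    @downstream = downstream
    @pending = []
  end

  def call(value)
    @pending << value
    return unless @pending.size == @size

    @downstream.call(@pending.sum)
    @pending = []
  end
end

def build_even_odd_topology
  parts = {
    even_count: Counter.new, even_buffer: Buffer.new, even_sums: Buffer.new,
    odd_count: Counter.new, odd_buffer: Buffer.new
  }
  even_branch = multicast(parts[:even_count], parts[:even_buffer],
                          Summarizer.new(10, parts[:even_sums]))
  odd_branch = multicast(parts[:odd_count], parts[:odd_buffer])
  [splitter(:even?.to_proc, even_branch, odd_branch), parts]
end

entry, parts = build_even_odd_topology
(1..40).each { |n| entry.call(n) }
puts parts[:even_count].count         # prints 20
puts parts[:even_sums].items.inspect  # prints [110, 310]
```

Because each part only knows how to be called, adding or removing a branch means changing the wiring in `build_even_odd_topology`, not the parts themselves.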
  44. Even more applicable to monitoring

  45. Windowed operations

  46. Why does it matter?
    - Raw numbers are too much
    - Window is an easy way to accumulate and summarize
    - Flexible and very composable
  47. Sliding window
    t0      t1          t2 (emit)
    +---+   +---+---+   +---+---+---+
    | 1 |   | 1 | 2 |   | 1 | 2 | 3 |  <6>
    +---+   +---+---+   +---+---+---+

    t4 (emit)
    -...+---+---+---+         -...+---+---+---+
    : 1 : 2 | 3 | 4 |  <9>    : 2 : 3 | 4 | 5 |  <12>
    -...+---+---+---+         -...+---+---+---+
  48. Sliding window
    Accumulates items in buffer
    When full:
    - emits all contents
    - drops the oldest value
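A minimal Ruby sketch of the sliding window as the slide describes it. The class name `SlidingWindow` and its block-based interface are hypothetical, chosen for illustration; this is not the EEP API.

```ruby
# Sliding window: once the buffer is full, every new value causes an emit
# of the whole buffer, after which the oldest value is dropped.
class SlidingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(value)
    @buffer << value
    return unless @buffer.size == @size

    @on_emit.call(@buffer.dup)  # emit all contents
    @buffer.shift               # drop the oldest value
  end
end

sums = []
window = SlidingWindow.new(3) { |items| sums << items.sum }
(1..5).each { |n| window.push(n) }
puts sums.inspect  # prints [6, 9, 12]
```

With a window of size 3 and the inputs 1 to 5, the emitted sums match the `<6>`, `<9>` and `<12>` values in the sliding window diagram.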
  49. Tumbling window
    t0      t1          t2 (emit)
    +---+   +---+---+   +---+---+---+         -...-...-...-
    | 1 |   | 1 | 2 |   | 1 | 2 | 3 |  <6>    : 1 : 2 : 3 :
    +---+   +---+---+   +---+---+---+         -...-...-...-

    t3      t4          t5 (emit)
    +---+   +---+---+   +---+---+---+         -...-...-...-
    | 4 |   | 4 | 5 |   | 4 | 5 | 6 |  <15>   : 4 : 5 : 6 :
    +---+   +---+---+   +---+---+---+         -...-...-...-
  50. Tumbling window
    Accumulates items in buffer
    When full:
    - emits all contents
    - drops all values
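The tumbling variant differs from the sliding one only in what happens after an emit. Again a hedged sketch: `TumblingWindow` is a hypothetical name and interface, not the EEP API.

```ruby
# Tumbling window: when the buffer is full, emit everything and start over
# with an empty buffer, so emitted batches never overlap.
class TumblingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(value)
    @buffer << value
    return unless @buffer.size == @size

    @on_emit.call(@buffer)  # emit all contents
    @buffer = []            # drop all values
  end
end

sums = []
window = TumblingWindow.new(3) { |items| sums << items.sum }
(1..6).each { |n| window.push(n) }
puts sums.inspect  # prints [6, 15]
```

With a window of size 3 and the inputs 1 to 6, the emitted sums match the `<6>` and `<15>` values in the tumbling window diagram.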
  51. Clock
    Controls whether the window should emit yet
    Clocks in windows are arbitrary
    Can be:
    - monotonic
    - wall clock
    - arbitrary (business clock)
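One way to picture an arbitrary clock is a window that takes the clock as a parameter and asks it whether it is time to emit. The sketch below is an assumption-laden illustration: `TimedWindow`, its `clock:` argument and the `tick` method are hypothetical names, and the "business clock" is simply a counter that the caller advances.

```ruby
# Tumbling window whose emission is decided by a clock, not by buffer size.
# The clock is any callable that returns a number.
class TimedWindow
  def initialize(interval, clock:, &on_emit)
    @interval = interval
    @clock = clock
    @on_emit = on_emit
    @buffer = []
    @window_start = clock.call
  end

  def push(value)
    tick
    @buffer << value
  end

  # Emits and resets the buffer once the clock has advanced by `interval`.
  def tick
    now = @clock.call
    return if now - @window_start < @interval

    @on_emit.call(@buffer) unless @buffer.empty?
    @buffer = []
    @window_start = now
  end
end

# Three interchangeable clocks: monotonic, wall clock, and a business clock.
monotonic_clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
wall_clock = -> { Time.now.to_f }
business_time = 0
business_clock = -> { business_time }

emitted = []
window = TimedWindow.new(10, clock: business_clock) { |items| emitted << items.sum }
window.push(1)
window.push(2)
business_time = 10   # the business clock advances by a full interval
window.push(4)       # closes the first window before buffering 4
business_time = 25
window.tick          # closes the second window
puts emitted.inspect # prints [3, 4]
```

Swapping `business_clock` for `monotonic_clock` or `wall_clock` changes when the window emits without touching the window itself.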
  52. Summing up

  53. Idea is simple:
    • Everything that’s coming in is an event
    • Events are split by triplet (application/env/event type)
    • Every event can have multiple metrics
    • Metric has a key and value
    • And filter
    • And several rollups (tumbling window)
    • Rollup has an aggregate function triggered on overflow
    • And sliding window with last N values
    • And visualization (area, line, barchart) attached
  54. Incoming payload
    {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

  55. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    Identification (splitter)

  56. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    key -> value -> aggregate: median

  57. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    key -> value -> aggregate: max

  58. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    key -> aggregate: count

  59. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}
    key -> aggregate: count

  60. {:application "my-app"
     :environment "production"
     :type "exception"
     :stacktrace "NullPointer."
     :host "web001"}
    key -> aggregate: count
  61. Idea is simple:
    • Everything that’s coming in is an event
    • Events are split by triplet (application/env/event type)
    • Every event can have multiple metrics
    • Metric has a key and value
    • And filter
    • And several rollups (tumbling window)
    • Rollup has an aggregate function triggered on overflow
    • And sliding window with last N values
    • And visualization (area, line, barchart) attached

  62. Please pardon my lisp

  63. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

  64. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

  65. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

  66. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:status]))
        (value [:execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

  67. {:application "my-app"
     :environment "production"
     :type "page_load"
     :execution_time 542
     :user_agent "Mozilla..."
     :host "web001"
     :status 200}

  68. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))
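For readers less used to Lisp, the same idea (group by status, take the execution time as the value, roll up with max, mean and min, keep the last 60 results) can be approximated in Ruby. This is a loose analogue under stated assumptions, not a port of the DSL: `rollup`, `AGGREGATES` and `CappedStore` are hypothetical names, and the caller is assumed to pass in the events that fell into one rollup window.

```ruby
# Capped history: keeps only the last `limit` rollup results,
# a rough analogue of (store-last 60).
class CappedStore
  attr_reader :values

  def initialize(limit)
    @limit = limit
    @values = []
  end

  def <<(value)
    @values << value
    @values.shift while @values.size > @limit
    self
  end
end

# Aggregate functions applied to the values collected in one window.
AGGREGATES = {
  max: ->(xs) { xs.max },
  mean: ->(xs) { xs.sum.to_f / xs.size },
  min: ->(xs) { xs.min }
}.freeze

# Rolls up one window's worth of events: group by status, then aggregate
# the execution times of each group.
def rollup(events)
  events.group_by { |e| e[:status] }.transform_values do |group|
    values = group.map { |e| e[:execution_time] }
    AGGREGATES.transform_values { |fn| fn.call(values) }
  end
end

history = CappedStore.new(60)
history << rollup([
  { type: "page_load", status: 200, execution_time: 542 },
  { type: "page_load", status: 200, execution_time: 120 },
  { type: "page_load", status: 500, execution_time: 900 }
])
puts history.values.last[200][:mean]  # prints 331.0
```

Each entry in `history.values` holds one point per status and aggregate, which is the shape a line chart per aggregate would need.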
  69. Wait, I want to see correlations!

  71. Boxplots?

  72. You got ’em

  73. Linear regression?

  74. Sure, sir!

  76. Built on open source
    - EEP for event emitter & windows: https://github.com/clojurewerkz/eep
    - Meltdown for anonymous topologies: https://github.com/clojurewerkz/meltdown
    - Eventoverse-graphs for graphs: https://github.com/ifesdjeen/eventoverse-graphs
    - Clj-push for websockets: https://github.com/ifesdjeen/clj-pushr
    - Cascalog for map/reduce: https://github.com/nathanmarz/cascalog

  77. Available soon under @clojurewerkz

  79. @ifesdjeen