Berlin 2013 - Session - Alex Petrov

Monitorama
September 20, 2013

Transcript

  1. Monitorin’ it
    Friday, September 20, 13


  3. @ifesdjeen
    tweet along during the talk if you have a question

  4. This talk is rather
    philosophical

  5. Hobbyist monitoring geek
    (in free time, of course)

  6. Several rewrites from
    scratch

  7. Tons of experiences

  8. Many extracted
    components
    already available as open-source solutions

  9. Needs your opinion

  10. Monitoring is not about
    software

  11. It’s all about
    insight

  12. Current state of the art:
    So easy to get started
    Just as easy to hit the limit
    Just a tiny bit too hard to change tooling
    And even harder to extend the existing one

  13. General path:
    Try out some existing tool
    Find ways to implement more complex scenarios
    Eventually give in and use what’s already there
    Keep ranting and saying `#monitoringsucks`

  14. But wait, I thought there
    was...

  15. #monitoringlove
    #monitoringlove

  16. Put on your
    monitoring gloves
    monitoring hats
    monitoring socks (you got ‘em)
    We’re going out for a journey

  17. When I hear Monitoring, I’m like

  18. In my world everyone's a pony
    and they all eat rainbows and poop butterflies

  19. Let’s try to redefine it all but first let’s
    simplify it

  20. Ad-hoc vs Post-hoc

  22. I once wanted to
    understand what’s going on on my website

  23. ...and then I was like
    monitoring.increment "page_load_#{response.status}_count"
    monitoring.timing "page_load_#{response.status}_load_time", time
    monitoring.increment "page_load_#{request.user_agent}_count"
    monitoring.gauge "page_load_#{response.status}_time_gauge", time

  24. And deployed it to
    a hundred something servers (rofl, right?)

  25. And then I wanted
    to add more

  26. And had to deploy it to
    a hundred something servers (lmao, right?)

  27. And then I wanted to...
    well, you get the idea

  28. Anything that
    a server
    can do
    a server
    should do

  29. What if you want a suit...

  30. Let’s turn it the other way
    around

  31. Package everything related to a
    single event
    into
    a single payload
    and figure out which metrics you need on the server side

  32. What it gives you
    granularity
    simple, stupid client
    no need to think in advance*
    rethink your metrics any time you want
    add more rollups and aggregates as you need them
    * (of course, you’d still have to think; you can just take it much easier)
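To make the "single payload" idea concrete, here is a minimal client-side sketch. It is not the speaker's reporter: the `build_event` and `report` helpers, the JSON encoding, and the port number are all assumptions made for illustration. The field names follow the payload shown later in the talk.

```ruby
require "json"
require "socket"

# Build one self-contained event. The client decides nothing about metrics;
# it only describes what happened. (Helper name is hypothetical.)
def build_event(type, fields = {})
  { application: "my-app", environment: "production", type: type }.merge(fields)
end

# Hypothetical reporter: serialize the event and send it as one UDP datagram.
# JSON and port 8125 are assumptions, not something the talk specifies.
def report(event, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(JSON.generate(event), 0, host, port)
ensure
  socket&.close
end

event = build_event("page_load", execution_time: 542, status: 200, host: "web001")
```

Because the whole event travels together, adding a new metric later means changing only the server-side definition, not redeploying the client.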

  33. What would a monitoring
    system look like?

  35. [Architecture diagram]
    Reporters -> (UDP, TCP, AMQP) -> Collectors
    Processing unit
    Persistent Store
    State Machine Alerts
    Reduce Engine (ad-hoc queries and analytics)
    History Graphs, Table Data
    Real-time graphs, Real-time table data
    Latest Events (capped in-memory store)
    Console (nrepl, custom consumers, real-time analysis)

  37. Processing
    unit

  38. Simple Scenario
    [Diagram: Even / Odd splitter -> Even numbers | Odd numbers]

  39. add counts for both
    [Diagram: Even / Odd splitter -> one Multicast per branch]
    Even branch: Even numbers count, Even numbers buffer
    Odd branch: Odd numbers count, Odd numbers buffer

  40. calculate sum for each 10
    [Diagram: Even / Odd splitter -> one Multicast per branch]
    Even branch: Even numbers count, Even numbers buffer
    Odd branch: Odd numbers count, Odd numbers buffer
    Each branch also has a Summarizer with a Sum buffer

  41. calculate sum for each 10
    [Diagram: Even / Odd splitter -> one Multicast per branch]
    Even branch: Even numbers count, Even numbers buffer
    Odd branch: Odd numbers count, Odd numbers buffer
    One Summarizer with a Sum buffer remains

  42. calculate sum for each 10
    [Diagram: Even / Odd splitter, two Multicasts]
    Even numbers count, Even numbers buffer
    Summarizer with a Sum buffer

  43. Event-based
    Create simple, independent parts (aggregate, filter, multicast, transformer, etc.)
    Define dependencies between them (routing)
    Parts are completely decoupled
    Every part can have its own state
    Routing is dynamic and can be changed at runtime
    Graphs are expressive, easy to understand
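The even/odd scenario from the preceding slides can be sketched with a tiny in-process emitter. This is an illustration of the event-based style only, not the EEP API: the `Emitter` class and its `on`/`notify` methods are hypothetical names chosen for this example.

```ruby
# A tiny in-process emitter: handlers are registered under a routing key.
# All names here are hypothetical; this is not the EEP API.
class Emitter
  def initialize
    @handlers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Attach a handler block to a routing key.
  def on(key, &handler)
    @handlers[key] << handler
    self
  end

  # Deliver a value to every handler registered under the key.
  def notify(key, value)
    @handlers[key].each { |handler| handler.call(value) }
  end
end

emitter = Emitter.new
counts  = Hash.new(0)
buffers = Hash.new { |hash, key| hash[key] = [] }

# Splitter: route each number to the :even or :odd branch.
emitter.on(:numbers) { |n| emitter.notify(n.even? ? :even : :odd, n) }

# Multicast: two handlers under one key fan the same event out to both.
%i[even odd].each do |branch|
  emitter.on(branch) { |_n| counts[branch] += 1 }
  emitter.on(branch) { |n| buffers[branch] << n }
end

(1..10).each { |n| emitter.notify(:numbers, n) }
```

Each handler keeps its own state and knows nothing about the others, so adding a summarizer later only means registering one more handler.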

  44. Even more applicable to monitoring

  45. Windowed operations

  46. Why does it matter?
    Raw numbers are too much
    Window is an easy way to accumulate and summarize
    Flexible and very composable

  47. Sliding window
    t0       t1           t2 (emit)
    +---+    +---+---+    +---+---+---+
    | 1 |    | 1 | 2 |    | 1 | 2 | 3 | <6>
    +---+    +---+---+    +---+---+---+
    t3 (emit)                    t4 (emit)
    -...+---+---+---+            -...+---+---+---+
    : 1 : 2 | 3 | 4 | <9>        : 2 : 3 | 4 | 5 | <12>
    -...+---+---+---+            -...+---+---+---+

  48. Sliding window
    Accumulates items in buffer
    When full:
    - emits all contents
    - drops the oldest value
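The sliding window rules above can be sketched in a few lines. The `SlidingWindow` class is a hypothetical illustration written for this transcript, not code from the talk or from EEP.

```ruby
# Sliding window: once the buffer is full, every new item triggers an emit,
# after which the oldest value is dropped to make room for the next one.
class SlidingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(item)
    @buffer << item
    return if @buffer.size < @size

    @on_emit.call(@buffer.dup) # emit all current contents
    @buffer.shift              # drop the oldest value
  end
end

sums = []
window = SlidingWindow.new(3) { |items| sums << items.sum }
(1..5).each { |n| window.push(n) }
# sums is now [6, 9, 12], the same values as in the sliding window diagram
```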

  49. Tumbling window
    t0       t1           t2 (emit)
    +---+    +---+---+    +---+---+---+         -...-...-...-
    | 1 |    | 1 | 2 |    | 1 | 2 | 3 | <6>     : 1 : 2 : 3 :
    +---+    +---+---+    +---+---+---+         -...-...-...-
    t3       t4           t5 (emit)
    +---+    +---+---+    +---+---+---+         -...-...-...-
    | 4 |    | 4 | 5 |    | 4 | 5 | 6 | <15>    : 4 : 5 : 6 :
    +---+    +---+---+    +---+---+---+         -...-...-...-

  50. Tumbling window
    Accumulates items in buffer
    When full:
    - emits all contents
    - drops all values
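A tumbling window differs from a sliding one only in what happens after an emit. The `TumblingWindow` class below is a hypothetical sketch for illustration, not code from the talk.

```ruby
# Tumbling window: accumulate until full, emit everything, then start empty.
class TumblingWindow
  def initialize(size, &on_emit)
    @size = size
    @on_emit = on_emit
    @buffer = []
  end

  def push(item)
    @buffer << item
    return if @buffer.size < @size

    @on_emit.call(@buffer.dup) # emit all contents
    @buffer.clear              # drop all values
  end
end

sums = []
window = TumblingWindow.new(3) { |items| sums << items.sum }
(1..6).each { |n| window.push(n) }
# sums is now [6, 15], the same values as in the tumbling window diagram
```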

  51. Clock
    Controls whether a window should emit yet
    Clocks in windows are arbitrary
    Can be
    - monotonic
    - wall clock
    - arbitrary (business clock)
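One way to read the clock slide is that the window asks a pluggable clock whether it is time to emit. The sketch below assumes that reading; `TimedWindow` and its `clock:` parameter are hypothetical names, and the "business clock" here is simply a counter of processed orders.

```ruby
# Tumbling window whose emit decision is delegated to a clock. The clock is
# any callable returning the current "time". All names are hypothetical.
class TimedWindow
  def initialize(period, clock:, &on_emit)
    @period = period
    @clock = clock
    @on_emit = on_emit
    @buffer = []
    @started_at = clock.call
  end

  def push(item)
    flush_if_due
    @buffer << item
  end

  private

  # Emit and reset once the clock says the period has elapsed.
  def flush_if_due
    now = @clock.call
    return if now - @started_at < @period

    @on_emit.call(@buffer.dup) unless @buffer.empty?
    @buffer.clear
    @started_at = now
  end
end

# Three interchangeable clocks (the first two are shown but not used below).
monotonic = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
wall      = -> { Time.now.to_f }
orders    = 0
business  = -> { orders } # advances only when the business does

emitted = []
window = TimedWindow.new(2, clock: business) { |items| emitted << items }
window.push(:a) # business time 0
orders += 1
window.push(:b) # business time 1, still inside the period
orders += 1
window.push(:c) # business time 2, so [:a, :b] is emitted before :c is buffered
```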

  52. Summing up

  53. Idea is simple:
    • Everything that’s coming in is an event
    • Events are split by triplet (application/env/event type)
    • Every event can have multiple metrics
    • Metric has a key and value
    • And filter
    • And several rollups (tumbling window)
    • Rollup has an aggregate function triggered on overflow
    • And sliding window with last N values
    • And visualization (area, line, barchart) attached

  54. Incoming payload
    {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}

  55. {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}
    Identification (splitter)

  56. {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}
    key ->
    value ->
    aggregate: median

  57. {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}
    key ->
    value ->
    aggregate: max

  58. {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}
    key ->
    aggregate: count

  59. {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}
    key ->
    aggregate: count

  60. {:application "my-app"
    :environment "production"
    :type "exception"
    :stacktrace "NullPointer."
    :host "web001"}
    key ->
    aggregate: count

  61. Idea is simple:
    • Everything that’s coming in is an event
    • Events are split by triplet (application/env/event type)
    • Every event can have multiple metrics
    • Metric has a key and value
    • And filter
    • And several rollups (tumbling window)
    • Rollup has an aggregate function triggered on overflow
    • And sliding window with last N values
    • And visualization (area, line, barchart) attached

  62. Please pardon my lisp

  63. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

  64. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

  65. {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}

  66. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:status]))
        (value [:execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))

  67. {:application "my-app"
    :environment "production"
    :type "page_load"
    :execution_time 542
    :user_agent "Mozilla..."
    :host "web001"
    :status 200}

  68. (event :my_web_app :production :page_load
      (metric :page-load
        (group #(get-in % [:additional_info :status]))
        (value [:additional_info :execution_time])
        (rollup 1 :second
          (aggregate :max #(apply max %)
            (store-last 60)
            (visualize :line :y "Max response time"))
          (aggregate :mean stats/mean
            (store-last 60)
            (visualize :line :y "Mean response time"))
          (aggregate :min #(apply min %)
            (store-last 60)
            (visualize :line :y "Min response time")))))
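For readers who do not read Lisp, the core of the metric definition (group, value, aggregate) might look roughly like this in plain Ruby. This is a loose paraphrase under assumptions, not a port of the DSL: the `rollup` function and `AGGREGATES` table are hypothetical, the events use the flat payload keys, and the one-second window, `store-last`, and `visualize` steps are left out.

```ruby
# Hypothetical plain-Ruby paraphrase of the metric definition: group events
# by status, extract execution_time, and apply each aggregate to each group.
AGGREGATES = {
  max:  ->(values) { values.max },
  mean: ->(values) { values.sum.to_f / values.size },
  min:  ->(values) { values.min }
}.freeze

# `events` stands for the contents of one rollup window.
def rollup(events)
  events
    .group_by { |event| event[:status] }
    .transform_values do |group|
      values = group.map { |event| event[:execution_time] }
      AGGREGATES.transform_values { |aggregate| aggregate.call(values) }
    end
end

events = [
  { type: "page_load", status: 200, execution_time: 542 },
  { type: "page_load", status: 200, execution_time: 120 },
  { type: "page_load", status: 500, execution_time: 900 }
]

result = rollup(events)
# result[200] is { max: 542, mean: 331.0, min: 120 }
```

Adding another aggregate is one more entry in the table; the client that sends the events does not change.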

  69. Wait, I want to see
    correlations!

  71. Boxplots?

  72. You got ’em

  73. Linear regression?

  74. Sure, sir!

  76. Built on open source
    EEP for event emitter & windows https://github.com/clojurewerkz/eep
    Meltdown for anonymous topologies https://github.com/clojurewerkz/meltdown
    Eventoverse-graphs for graphs https://github.com/ifesdjeen/eventoverse-graphs
    Clj-push for websockets https://github.com/ifesdjeen/clj-pushr
    Cascalog for map/reduce https://github.com/nathanmarz/cascalog

  77. Available soon under
    @clojurewerkz

  79. @ifesdjeen