Monitoring like a boss

Fabian Gutierrez

December 07, 2017

Transcript

  1. Agenda
    • Why are we here?
    • Monitoring
    • Logging
    • Push model andThen demo
    • Pull model andThen demo
    • Questions
  5. Monitoring
    Keeping track of the state of an application. It can be achieved with any or all of these three things:
    • Logging
    • Tracing
    • Metrics
  6. Logging
    • “Solved” problem
    • No more SSH + tail -f
    • Strings containing diverse information (level, user, host, etc.)
    • SLF4J (Logback, others)
    • Individual records matter
    • Direct business value (€)
    • Non-ephemeral
    • Logs can be used as metrics
  7. Metrics
    • (Hopefully) multidimensional data
    • Direct tech value
      ◦ Response times
      ◦ Complex flows
    • Business value (?)
    • Ephemeral
    Logs vs. metrics:
    • Logs: 217.0.0.1 - jean [INFO] "GET /icon.gif HTTP/1.0" 200 2326
    • Metrics: http_request_total{method="post", code="200"} 1027 1395066363000
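
The earlier bullet "Logs can be used as metrics" can be illustrated with a minimal sketch that counts log records per label pair and prints them in the same shape as the metric sample above. The log layout is assumed from the single sample record on the slide, and `count_requests` and `render` are hypothetical helper names, not part of any library mentioned in the talk.

```python
import re
from collections import Counter

# Assumed layout, modelled on the slide's sample record:
# host - user [LEVEL] "METHOD path protocol" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) - (?P<user>\S+) \[(?P<level>\w+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)$'
)


def count_requests(lines):
    """Count log records per (method, status) pair; skip unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            counts[(match["method"].lower(), match["status"])] += 1
    return counts


def render(counts, name="http_request_total"):
    """Render the counts as one labelled sample per line."""
    return [
        f'{name}{{method="{method}",code="{code}"}} {value}'
        for (method, code), value in sorted(counts.items())
    ]


sample_logs = [
    '217.0.0.1 - jean [INFO] "GET /icon.gif HTTP/1.0" 200 2326',
    '217.0.0.1 - jean [INFO] "GET /index.html HTTP/1.0" 200 512',
    '217.0.0.1 - jean [INFO] "POST /login HTTP/1.0" 500 87',
]
for sample in render(count_requests(sample_logs)):
    print(sample)
```

The individual records (who requested what) are lost in the aggregation, which is the trade-off the two slides contrast: logs keep every event, metrics keep only the labelled totals.
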
  9. Kamon
    Monitoring tool for the JVM
    • Open source
    • Metrics and tracing API
    • Instrumentation for common libraries (Akka, Play, etc.)
    • Collection and reporting are separate
      ◦ Instrument once, report anywhere
  10. Kamon - High Level
    • Core: kamon core
    • Modules (write only): kamon akka, kamon scala, kamon play
    • Reporters (read only): kamon new relic, kamon statsd, kamon datadog
  11. Kamon + Telegraf + InfluxDB
    Diagram: each Application sends StatsD over UDP to a Telegraf agent, which forwards to InfluxDB
    Telegraf:
    • Agent that accepts StatsD protocol metrics
    • Aggregates and parses metrics
    • Periodically forwards the metrics to InfluxDB
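
To make the push step concrete, here is a minimal sketch of what an application-side StatsD push over UDP looks like. It uses the plain StatsD line format `<name>:<value>|<type>`; the metric names are invented for illustration, and a local UDP socket stands in for the Telegraf agent so the example runs without any external service (a real agent conventionally listens on port 8125).

```python
import socket


def format_statsd(name, value, metric_type):
    """Build one StatsD datagram payload: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def push(sock, address, name, value, metric_type):
    """Fire and forget: UDP gives no delivery guarantee and no reply."""
    sock.sendto(format_statsd(name, value, metric_type), address)


# Stand-in for the agent: a local UDP socket on an ephemeral port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
agent_address = receiver.getsockname()

# The application side: push a counter increment and a timing in milliseconds.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
push(sender, agent_address, "login.requests", 1, "c")
push(sender, agent_address, "login.duration", 320, "ms")

received = [receiver.recvfrom(1024)[0] for _ in range(2)]
print(received)

sender.close()
receiver.close()
```

The application never waits for an answer, which keeps the overhead low but also means a stopped agent goes unnoticed by the sender.
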
  13. Kamon and Actors
    For each actor you have access to 4 metrics:
    • errors
    • mailbox-size
    • processing-time
    • time-in-mailbox
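
As a rough illustration of what these four metrics measure, the toy actor below records them by hand. This is not how Kamon instruments Akka (the deck attributes that to bytecode instrumentation); `InstrumentedActor` and its fields are hypothetical, and a fake clock is used so the numbers are deterministic.

```python
import time
from collections import deque


class InstrumentedActor:
    """Toy single-threaded actor that records the four metrics by hand."""

    def __init__(self, handler, clock=time.monotonic):
        self.handler = handler
        self.clock = clock
        self.mailbox = deque()
        self.errors = 0               # errors
        self.processing_times = []    # processing-time
        self.times_in_mailbox = []    # time-in-mailbox

    @property
    def mailbox_size(self):           # mailbox-size
        return len(self.mailbox)

    def tell(self, message):
        # Remember when the message was enqueued.
        self.mailbox.append((self.clock(), message))

    def process_one(self):
        enqueued_at, message = self.mailbox.popleft()
        started_at = self.clock()
        self.times_in_mailbox.append(started_at - enqueued_at)
        try:
            self.handler(message)
        except Exception:
            self.errors += 1
        finally:
            self.processing_times.append(self.clock() - started_at)


def handler(message):
    if message == "boom":
        raise ValueError("simulated failure")


# Fake clock: every reading advances time by one unit.
ticks = iter(range(100))
actor = InstrumentedActor(handler, clock=lambda: next(ticks))

actor.tell("hello")
actor.tell("boom")
size_before = actor.mailbox_size
actor.process_one()
actor.process_one()

print(size_before, actor.errors, actor.times_in_mailbox, actor.processing_times)
```
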
  14. Recap on Kamon
    • Push approach
    • Great integration with the JVM
    • Several modules (JMX, StatsD, etc.)
    • Active project: a new version (1.0.0) came out a couple of months ago
  15. Recap on Kamon
    • Bytecode instrumentation (?)
    • Working with modules is sometimes confusing (à la Spring)
    • Potential bytecode incompatibilities
  16. Prometheus
    Pull approach
    Diagram components: Application, Metric Collector, Metric Store
    • You can run your monitoring on your laptop when developing changes
    • You can more easily tell if a target is down
    • You can manually go to a target and inspect its health with a web browser
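
The pull approach can be sketched with the standard library alone: the application exposes a `/metrics` endpoint and a collector fetches it over HTTP. This is only an illustration of the idea, not Prometheus itself; `MetricsHandler` and `scrape` are hypothetical names, and the `up` value of 1 or 0 mirrors the point that a failed scrape immediately shows a target is down.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """Application side: serve the current metric values as plain text."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = b"play_current_users 3.0\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Ignore any proxy settings so the local target is reached directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def scrape(url, timeout=2.0):
    """Collector side: return (up, body), where up is 1 if the target answered."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return 1, response.read().decode("utf-8")
    except (urllib.error.URLError, OSError):
        return 0, ""


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

up_result = scrape(f"http://127.0.0.1:{port}/metrics")

# Stop the target and scrape again: the collector notices it is down.
server.shutdown()
server.server_close()
down_result = scrape(f"http://127.0.0.1:{port}/metrics")

print(up_result, down_result)
```

Because the endpoint is plain HTTP, the same URL can be opened in a web browser, which is the manual health inspection the last bullet refers to.
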
  17. Prometheus
    System monitoring tool with a built-in time series DB
    • Integrates collecting and reporting
    • Metric API
    • Alerting already provided
    • Only numeric time series metrics
    What it does not do:
    • Does not do logging or tracing
    • Does not care about individual events
    • No distributed storage (only local), by design!
  18. /metrics
    # HELP http_request_duration_seconds Duration of HTTP request in seconds
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_count{method="GET", path="/metrics", status="2xx"} 5
    http_request_duration_seconds_sum{method="GET", path="/metrics", status="2xx"} 0.065599873
    # HELP http_request_mismatch_total Number mismatched routes
    # TYPE http_request_mismatch_total counter
    http_request_mismatch_total 1.0
    # HELP play_current_users Actual connected users
    # TYPE play_current_users gauge
    play_current_users 3.0
    # HELP play_requests_total Total requests.
    # TYPE play_requests_total counter
    play_requests_total 1.0
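
As a small worked example of reading this output, the sketch below parses a few of the sample lines and derives the mean request duration from the histogram's `_sum` and `_count` series (0.065599873 / 5). The parser is deliberately simplified and is an assumption of this example: it skips comment lines and does not handle escaped quotes or timestamps the way a real client library would.

```python
import re

SAMPLE_PATTERN = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value", ...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, ignoring # lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_PATTERN.match(line)
        if match:
            labels = dict(LABEL_PATTERN.findall(match["labels"] or ""))
            samples.append((match["name"], labels, float(match["value"])))
    return samples


exposition = """
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_count{method="GET", path="/metrics", status="2xx"} 5
http_request_duration_seconds_sum{method="GET", path="/metrics", status="2xx"} 0.065599873
# TYPE play_current_users gauge
play_current_users 3.0
"""

samples = parse_samples(exposition)
values = {name: value for name, _labels, value in samples}

# Mean duration = total observed seconds / number of observations.
mean_seconds = (
    values["http_request_duration_seconds_sum"]
    / values["http_request_duration_seconds_count"]
)
print(f"mean request duration: {mean_seconds:.4f} s")
```
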
  19. Alerting rules
    ALERT low_connected_users
      IF play_current_users < 2
      FOR 30s
      LABELS { severity = "warning" }
      ANNOTATIONS {
        summary = "Instance {{ $labels.instance }} under lower load",
        description = "{{ $labels.instance }} of job {{ $labels.job }} is under lower load.",
      }
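
The effect of the `FOR 30s` clause can be sketched as a small state machine: the condition must hold continuously for 30 seconds before the alert moves from pending to firing. This is an approximation written for illustration, not the Prometheus rule evaluator; `evaluate_alert` is a hypothetical function and the sample timestamps are invented.

```python
def evaluate_alert(samples, threshold=2, hold_seconds=30):
    """Evaluate `value < threshold FOR hold_seconds` over time-ordered samples.

    samples: list of (timestamp_seconds, value).
    Returns a list of (timestamp_seconds, state) with state being
    "inactive", "pending" or "firing".
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value < threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held_for = timestamp - active_since
            state = "firing" if held_for >= hold_seconds else "pending"
        else:
            active_since = None  # any recovery resets the timer
            state = "inactive"
        states.append((timestamp, state))
    return states


# play_current_users sampled every 15 seconds (invented values).
observations = [(0, 3), (15, 1), (30, 1), (45, 1), (60, 3)]
for timestamp, state in evaluate_alert(observations):
    print(timestamp, state)
```

With these observations the alert is pending at 15 s and 30 s, fires at 45 s, and returns to inactive at 60 s once the user count recovers.
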
  21. Recap on Prometheus
    • Pull approach
    • Prepackaged solution (collect + storage)
    • Easy to start with
    • Simple metric API
    • Active project (version 2 just came out)
    • Lots of exporters
    • prometheus-akka seems nice
  22. Recap on Prometheus
    • Ephemeral persistence (?)
    • What to do after a few weeks of logs? (existing adaptors to InfluxDB)
    • App overhead?
    • Kamon-prometheus bridge :(
  23. From zero to hero
    • Start with high-level metrics, like user-experienced response time
    • Go a bit deeper and analyze sections of functionality within your app
    • Go even deeper and analyze the core components of your app
    Example questions:
    • How long does a login take?
    • How long did the "select all products" JDBC call take?
    • How many messages is this actor handling?
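
One way to picture the three levels is nested timers: one around the whole request, one around a section such as login, and one around a core call such as the database query. The sketch below is an assumption-laden illustration; `timed`, the section names, and the sleeping `select_all_products` stand-in are all hypothetical and replace whatever instrumentation library is actually in use.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

durations = defaultdict(list)  # section name -> list of elapsed seconds


@contextmanager
def timed(section, clock=time.perf_counter):
    """Record how long the enclosed block took under the given section name."""
    start = clock()
    try:
        yield
    finally:
        durations[section].append(clock() - start)


def select_all_products():
    time.sleep(0.01)  # stand-in for the "select all products" JDBC call
    return ["product-1", "product-2"]


def handle_request():
    with timed("request"):                      # high level: response time
        with timed("request.login"):            # section of functionality
            time.sleep(0.005)                   # stand-in for the login logic
        with timed("request.db.select_all_products"):  # core component
            return select_all_products()


products = handle_request()
for section, values in durations.items():
    print(f"{section}: {values[0] * 1000:.1f} ms")
```

Starting with only the outer timer and adding the inner ones when the outer number looks wrong follows the order the slide suggests.
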
  25. Conclusions and takeaways
    • Both approaches are robust enough
    • Good integrations for both
    • Don’t guess … monitor
    • It does not matter which approach … choose one