Monitoring like a boss

Fabian Gutierrez

December 07, 2017
Transcript

  1. Monitoring my application like a boss
    fagossa
    fabgutierr
    fabian.gutierrez

  2. Agenda
    ● Why are we here?
    ● Monitoring
    ● Logging
    ● Push model andThen demo
    ● Pull model andThen demo
    ● Questions

  3. Why are we here?
    Because we need to know when the house is on fire

  7. Monitoring
    Keeping track of the state of an application
    Can be achieved with any or all of these three things:
    ● Logging
    ● Tracing
    ● Metrics

  8. Logging vs Tracing

  9. Logging
    ● “Solved” problem
    ● No more SSH + tail -f
    ● Strings containing diverse information (level, user, host, etc.)
    ● SLF4J (Logback, others)
    ● Individual records matter
    ● Direct business value (€)
    ● Non-ephemeral
    ● Logs can be used as metrics

  10. [Diagram labels: Application, Logback appender (×2), Rsyslog]

  11. Metrics
    ● (Hopefully) Multidimensional data
    ● Direct tech value
      ○ Response times
      ○ Complex flows
    ● Business value (?)
    ● Ephemeral
    ● Logs
      217.0.0.1 - jean [INFO] "GET /icon.gif HTTP/1.0" 200 2326
    ● Metrics
      http_request_total{method="post",code="200"} 1027 1395066363000
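
The point that logs can be used as metrics can be illustrated with a small sketch. It assumes every log line follows the shape of the sample line on the slide; the function names `count_requests` and `render_metrics` and the regular expression are illustrative, not part of any logging library.

```python
import re
from collections import Counter

# Assumed log layout, modelled on the sample line from the slide:
# host - user [LEVEL] "METHOD path PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) - (?P<user>\S+) \[(?P<level>\w+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<size>\d+)$'
)


def count_requests(lines):
    """Aggregate individual log records into per-(method, code) counts."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like access logs
        counts[(match["method"].lower(), match["code"])] += 1
    return counts


def render_metrics(counts):
    """Render the counts as labelled metric lines, one per label combination."""
    return [
        f'http_request_total{{method="{method}",code="{code}"}} {total}'
        for (method, code), total in sorted(counts.items())
    ]
```

Each log line keeps the individual event (who, what, when), while the derived metric keeps only the aggregated count per label combination.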

  12. How can I get metrics out of my app?

  14. Push Pull

  15. We are talking about this

  16. Push approach

  17. Push approach (Kamon)
    [Diagram labels: Application, Metric Collector, Metric Store]

  18. Kamon
    Monitoring tool for the JVM
    ● Open source
    ● Metrics and tracing API
    ● Instrumentation for common libraries (Akka, Play, etc.)
    ● Collection and reporting are separate
      ○ Instrument once, report anywhere

  19. Kamon - High Level
    [Diagram: kamon-core in the middle]
    ● Modules / write only: kamon-akka, kamon-scala, kamon-play
    ● Reporters / read only: kamon-newrelic, kamon-statsd, kamon-datadog

  20. Kamon + JMX Reporter
    [Diagram labels: Application, Kamon, JMX, Mission Control]

  21. Kamon + Telegraf + InfluxDB
    [Diagram: two Application instances, each sending StatsD over UDP to a Telegraf agent; InfluxDB]
    Telegraf:
    ● Agent that accepts StatsD protocol metrics
    ● Aggregates and parses metrics
    ● Periodically forwards the metrics to InfluxDB
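
To make the "StatsD over UDP" arrow concrete, here is a minimal sketch of what a push looks like on the wire. It is not Kamon's reporter; the helper names are hypothetical, and the default port 8125 is the conventional StatsD port, assumed here to be where the agent listens.

```python
import socket


def format_counter(name, value=1):
    """StatsD counter line: <name>:<value>|c"""
    return f"{name}:{value}|c"


def format_timer(name, millis):
    """StatsD timer line: <name>:<value>|ms"""
    return f"{name}:{millis}|ms"


def format_gauge(name, value):
    """StatsD gauge line: <name>:<value>|g"""
    return f"{name}:{value}|g"


def send_statsd(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram per metric, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))
```

Because UDP is connectionless, the application does not block or fail when the agent is down; the trade-off is that such datagrams are silently lost.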

  23. Kamon and Actors
    For each actor you have access to 4 metrics:
    ● errors
    ● mailbox-size
    ● processing-time
    ● time-in-mailbox
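
What the four actor metrics measure can be sketched with a toy mailbox. This is not Akka or Kamon code: `InstrumentedMailbox`, `ActorMetrics`, and the injected `clock` are hypothetical names used only to show where each value would be taken.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ActorMetrics:
    errors: int = 0
    mailbox_sizes: list = field(default_factory=list)
    processing_times: list = field(default_factory=list)
    times_in_mailbox: list = field(default_factory=list)


class InstrumentedMailbox:
    """Toy single-threaded 'actor' that records the four metrics."""

    def __init__(self, handler, clock):
        self.handler = handler
        self.clock = clock          # callable returning the current time
        self.queue = deque()
        self.metrics = ActorMetrics()

    def tell(self, message):
        # Remember when the message was enqueued and sample the mailbox size.
        self.queue.append((self.clock(), message))
        self.metrics.mailbox_sizes.append(len(self.queue))

    def process_one(self):
        enqueued_at, message = self.queue.popleft()
        started_at = self.clock()
        # time-in-mailbox: how long the message waited before being handled
        self.metrics.times_in_mailbox.append(started_at - enqueued_at)
        try:
            self.handler(message)
        except Exception:
            self.metrics.errors += 1
        finally:
            # processing-time: how long the handler itself took
            self.metrics.processing_times.append(self.clock() - started_at)
```

A growing mailbox size together with a growing time-in-mailbox suggests the actor cannot keep up, even when its processing time per message stays flat.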

  24. Recap on Kamon
    ● Push approach
    ● Great integration with the JVM
    ● Several modules (JMX, StatsD, etc.)
    ● Active project: a new version (1.0.0) came out a couple of months ago

  25. Recap on Kamon
    ● Bytecode instrumentation (?)
    ● Working with modules is sometimes confusing (à la Spring)
    ● Potential bytecode incompatibilities

  26. Pull approach

  27. Pull approach (Prometheus)
    [Diagram labels: Application, Metric Collector, Metric Store]
    ● You can run your monitoring on your laptop when developing changes
    ● You can more easily tell if a target is down
    ● You can manually go to a target and inspect its health with a web browser
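
In the pull model the application only has to expose its current values over HTTP and wait to be scraped. The sketch below shows that idea with the Python standard library; it is not the Prometheus client library, and `Counter`, `REQUESTS_TOTAL`, `render_metrics`, and `start_server` are hypothetical names. The metric name `play_requests_total` is borrowed from the deck's own sample output.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """Thread-safe counter that only goes up."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self):
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


REQUESTS_TOTAL = Counter()


def render_metrics():
    """Render the current state in the text exposition format."""
    return (
        "# HELP play_requests_total Total requests.\n"
        "# TYPE play_requests_total counter\n"
        f"play_requests_total {float(REQUESTS_TOTAL.value)}\n"
    )


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # The collector (or a browser) pulls the values from here.
            body = render_metrics().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Any other request counts as application traffic.
            REQUESTS_TOTAL.inc()
            self.send_response(204)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_server(port=0):
    """Serve in a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_server(9000)` and opening `http://127.0.0.1:9000/metrics` in a browser is the "inspect its health with a web browser" case from the slide; the port number is arbitrary.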

  28. Prometheus
    System monitoring tool with a built-in time series DB
    ● Integrates collecting and reporting
    ● Metric API
    ● Alerting already provided
    ● Only numeric time series metrics
    It does not:
    ● Do logging or tracing
    ● Care about individual events
    ● Provide distributed storage (local only, by design!)

  29. /metrics
    # HELP http_request_duration_seconds Duration of HTTP request in seconds
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_count{method="GET",path="/metrics",status="2xx"} 5
    http_request_duration_seconds_sum{method="GET",path="/metrics",status="2xx"} 0.065599873
    # HELP http_request_mismatch_total Number mismatched routes
    # TYPE http_request_mismatch_total counter
    http_request_mismatch_total 1.0
    # HELP play_current_users Actual connected users
    # TYPE play_current_users gauge
    play_current_users 3.0
    # HELP play_requests_total Total requests.
    # TYPE play_requests_total counter
    play_requests_total 1.0
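
The `_sum` and `_count` series in the output above are enough to derive an average request duration. The following sketch reads the text format and does that division; it is a deliberately simplified reader (no escaping, no timestamps, no bucket handling), and `parse_samples` and `mean_duration` are hypothetical helper names.

```python
import re

# Simplified sample line: name, optional {labels}, then a numeric value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE lines and blanks carry no sample
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(LABEL.findall(match["labels"] or ""))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def mean_duration(samples, base="http_request_duration_seconds"):
    """Average duration = <base>_sum / <base>_count, or None without data."""
    total = sum(value for name, _, value in samples if name == base + "_sum")
    count = sum(value for name, _, value in samples if name == base + "_count")
    return total / count if count else None
```

For the values shown on the slide this works out to roughly 13 ms per request (0.065599873 s over 5 requests).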

  30. Prometheus + docker

  31. Prometheus - High Level

  32. Alerting

  33. Alerting rules
    ALERT low_connected_users
      IF play_current_users < 2
      FOR 30s
      LABELS {
        severity = "warning"
      }
      ANNOTATIONS {
        summary = "Instance {{ $labels.instance }} under lower load",
        description = "{{ $labels.instance }} of job {{ $labels.job }} is under lower load.",
      }
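
The effect of the `FOR 30s` clause is easier to see in a small simulation: the condition has to hold continuously for 30 seconds before the alert moves from pending to firing. The sketch below mimics that behaviour under simplified assumptions (evenly trusted samples, no staleness handling); `evaluate_alert` is a hypothetical function, not Prometheus code. Note that the rule on the slide uses the Prometheus 1.x syntax, while 2.x defines rules in YAML files.

```python
def evaluate_alert(samples, threshold=2, for_seconds=30):
    """Classify each (timestamp_seconds, value) sample as
    'inactive', 'pending', or 'firing' for the rule `value < threshold`."""
    pending_since = None
    states = []
    for timestamp, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            held_for = timestamp - pending_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            pending_since = None  # condition broken, start over
            state = "inactive"
        states.append((timestamp, state))
    return states
```

A short dip in connected users therefore only produces a pending alert, which disappears without notifying anyone if the value recovers within the window.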

  35. A more complex architecture
    [Diagram labels: Application (×3), JMX Exporter, DB Exporter, targets, Prometheus, Alert Manager]

  36. Recap on Prometheus
    ● Pull approach
    ● Prepackaged solution (collect + storage)
    ● Easy to start with
    ● Simple metric API
    ● Active project (version 2 just came out)
    ● Lots of exporters
    ● prometheus-akka seems nice

  37. Recap on Prometheus
    ● Ephemeral persistence (?)
    ● What to do after a few weeks of logs? (existing adaptors to InfluxDB)
    ● App overhead?
    ● Kamon-prometheus bridge :(

  38. From zero to hero
    ● Start with high level metrics, like user experienced response time
      How long does a login take?
    ● Go a bit deeper and analyze sections of functionality within your app
      How long did the "select all products" JDBC call take?
    ● Go even deeper and analyze the core components of your app
      How many messages is this actor handling?
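
Questions such as "how long does a login take?" come down to wrapping a piece of code in a timer and recording the result under a metric name. A minimal sketch, assuming an in-memory `durations` store and a hypothetical `timed` helper (a real setup would hand the value to a metrics library instead):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Metric name -> list of observed durations in seconds (in-memory only).
durations = defaultdict(list)


@contextmanager
def timed(name, clock=time.perf_counter):
    """Record how long the wrapped block took, even if it raises."""
    start = clock()
    try:
        yield
    finally:
        durations[name].append(clock() - start)
```

Usage follows the three levels on the slide: wrap the whole login request first, then the individual JDBC call inside it, each under its own name, for example `with timed("login_seconds"): ...`.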

  40. Conclusions and takeaways
    ● Both approaches are robust enough
    ● Good integrations for both
    ● Don’t guess … monitor
    ● Does not matter which approach … choose one

  42. Going further
    ● https://github.com/fagossa/play-prometheus
    ● http://blog.xebia.fr/2017/07/28/superviser-mon-application-play-avec-prometheus
    ● https://en.fabernovel.com/insights/tech-en/alerting-in-prometheus-or-how-i-can-sleep-well-at-night