Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Beyond grep: Practical Logging and Metrics

Beyond grep: Practical Logging and Metrics

Your Python applications are running but you’re wondering what they are doing? Your only clue about their current state is the server load after ssh-ing into the servers? Let’s change that!

Hynek Schlawack

April 12, 2015
Tweet

More Decks by Hynek Schlawack

Other Decks in Programming

Transcript

  1. Beyond grep

    Pragmatic Logging & Metrics

    https://ox.cx/b
    Hynek Schlawack

    View Slide

  2. Hi!
    @hynek
    https://hynek.me
    https://github.com/hynek

    View Slide

  3. Agenda

    View Slide

  4. Agenda

    View Slide

  5. Agenda

    View Slide

  6. Agenda

    View Slide

  7. Errors

    View Slide

  8. Errors

    View Slide

  9. Requirements

    View Slide

  10. Requirements
    •fast

    View Slide

  11. Requirements
    •fast
    •once

    View Slide

  12. Requirements
    •fast
    •once

    View Slide

  13. Requirements
    •fast
    •once
    •context

    View Slide

  14. Requirements
    •fast
    •once
    •context

    View Slide

  15. View Slide

  16. View Slide

  17. View Slide

  18. View Slide

  19. View Slide

  20. View Slide

  21. View Slide

  22. View Slide

  23. View Slide

  24. Raven-Python

    View Slide

  25. Raven-Python
    Integrations:

    View Slide

  26. Raven-Python
    Integrations:
    • logging

    View Slide

  27. Raven-Python
    Integrations:
    • logging
    • Django

    View Slide

  28. Raven-Python
    Integrations:
    • logging
    • Django
    • WSGI

    View Slide

  29. Raven-Python
    Integrations:
    • logging
    • Django
    • WSGI
    • …9 others

    View Slide

  30. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View Slide

  31. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View Slide

  32. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View Slide

  33. Django
    INSTALLED_APPS = (

    "raven.contrib.django.raven_compat",

    )

    View Slide

  34. Progress!

    View Slide

  35. Progress!

    View Slide

  36. Metrics

    View Slide

  37. Metrics?

    View Slide

  38. Metrics?
    • numbers in a DB

    View Slide

  39. Metrics?
    • numbers in a DB
    • guessing vs knowing

    View Slide

  40. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …

    View Slide

  41. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters

    View Slide

  42. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers

    View Slide

  43. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers
    • gauges

    View Slide

  44. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers
    • gauges
    • …

    View Slide

  45. Aggregation

    View Slide

  46. Correlation

    View Slide

  47. Math

    View Slide

  48. Math
    • # reqs / s?

    View Slide

  49. Math
    • # reqs / s?
    • worst 0.01% ⟨req time⟩?

    View Slide

  50. Math
    • # reqs / s?
    • worst 0.01% ⟨req time⟩?
    • don’t try this alone!

    View Slide

  51. Monitoring

    View Slide

  52. Monitoring
    • latency

    View Slide

  53. Monitoring
    • latency
    • error rates

    View Slide

  54. Monitoring
    • latency
    • error rates
    • anomalies

    View Slide

  55. Storage

    View Slide

  56. Librato Metrics

    View Slide

  57. Graphite
    800 pound gorilla

    View Slide

  58. Grafana

    View Slide

  59. InfluxDB
    Graphite++ in Go

    View Slide

  60. InfluxDB
    Graphite++ in Go

    View Slide

  61. Collecting

    View Slide

  62. Approaches

    View Slide

  63. Approaches
    1. external aggregation: StatsD, Riemann

    View Slide

  64. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple

    View Slide

  65. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection

    View Slide

  66. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB

    View Slide

  67. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB
    + in-app dashboard, useful in dev

    View Slide

  68. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB
    + in-app dashboard, useful in dev
    – state w/i app

    View Slide

  69. View Slide

  70. (g|py)?statsd(py|-client)?
    import statsd
    c = statsd.StatsClient(
    "statsd.local", 8125
    )

    View Slide

  71. (g|py)?statsd(py|-client)?

    View Slide

  72. (g|py)?statsd(py|-client)?
    c.incr("foo")

    View Slide

  73. (g|py)?statsd(py|-client)?
    c.incr("foo")
    c.timing("stats.timed", 320)

    View Slide

  74. Scales
    from greplin import scales
    from greplin.scales.meter import MeterStat
    STATS = scales.collection(
    "/Resource",
    MeterStat("reqs"),
    scales.PmfStat("request_time")
    )

    View Slide

  75. Scales
    from greplin import scales
    from greplin.scales.meter import MeterStat
    STATS = scales.collection(
    "/Resource",
    MeterStat("reqs"),
    scales.PmfStat("request_time")
    )

    View Slide

  76. Scales

    View Slide

  77. Scales
    STATS.reqs.mark()

    View Slide

  78. Scales
    STATS.reqs.mark()
    with STATS.request_time.time():

    View Slide

  79. Scales
    STATS.reqs.mark()
    with STATS.request_time.time():
    do_something_expensive()

    View Slide

  80. Dashboard
    Scales

    View Slide

  81. Dashboard
    Scales

    View Slide

  82. Dashboard
    Scales

    "request_time": {
    "count": 567315293,
    "99percentile": 0.10978688716888428,
    "75percentile": 0.013181567192077637,
    "min": 0.0002448558807373047,
    "max": 30.134822130203247,
    "98percentile": 0.08934824466705339,
    "95percentile": 0.027234303951263434,
    "median": 0.009176492691040039,
    "999percentile": 0.14235656142234793,
    "stddev": 0.01676855570363413,
    "mean": 0.013247184020535955
    },

    View Slide

  83. Progress!

    View Slide

  84. Progress!
    ✓ ✓

    View Slide

  85. Logging

    View Slide

  86. Splunk

    View Slide

  87. Splunk

    View Slide

  88. Papertrail

    View Slide

  89. loggly

    View Slide

  90. ELK
    Elasticsearch + Logstash + Kibana

    View Slide

  91. Graylog

    View Slide

  92. Graylog

    View Slide

  93. Graylog

    View Slide

  94. Graylog

    View Slide

  95. View Slide

  96. Goal
    @400000005270e0d604afce64 {
    "event": "logged_in",
    "user": "guido",
    "ip": "8.8.8.8",
    "referrer": "http://google.com"
    }

    View Slide

  97. Context

    &
    Format

    View Slide

  98. structlog

    View Slide

  99. BoundLogger
    structlog

    View Slide

  100. Original
    Logger
    BoundLogger
    structlog
    e.g.
    logging.Logger

    View Slide

  101. Original
    Logger
    BoundLogger
    bind values
    log.bind(key=value)
    Context
    structlog

    View Slide

  102. Original
    Logger
    BoundLogger
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View Slide

  103. Original
    Logger
    BoundLogger
    Processor 1 Processor n Return
    Value
    Return
    Value
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View Slide

  104. Original
    Logger
    BoundLogger
    Processor 1 Processor n Return
    Value
    Return
    Value
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View Slide

  105. import logging, sys
    logger = logging.getLogger()
    logger.addHandler(
    logging.StreamHandler(
    sys.stdout
    )
    )

    View Slide

  106. import logging, sys
    logger = logging.getLogger()
    logger.addHandler(
    logging.StreamHandler(
    sys.stdout
    )
    )

    View Slide

  107. Capture

    View Slide

  108. Capture
    • into files

    View Slide

  109. Capture
    • into files
    • to syslog / a queue

    View Slide

  110. Capture
    • into files
    • to syslog / a queue
    • pipe into a logging agent

    View Slide

  111. log = log.bind(user="guido")
    log.info("user.login")

    View Slide

  112. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog

    View Slide

  113. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    stdout
    logging

    View Slide

  114. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View Slide

  115. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    logstash-forwarder
    logstash
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View Slide

  116. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    logstash-forwarder
    logstash
    1010001101
    Elasticsearch
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View Slide

  117. Progress!
    ✓ ✓

    View Slide

  118. Progress!
    ✓ ✓ ✓

    View Slide

  119. Wait a Minute…

    View Slide

  120. Ugh
    try:
    STATS.time.timing():
    something()
    except Exception as e:
    log.error("omg", exc_info=e)
    raven_client.captureError()
    STATS.errors.mark()

    View Slide

  121. Ugh
    try:
    STATS.time.timing():
    something()
    except Exception as e:
    log.error("omg", exc_info=e)
    raven_client.captureError()
    STATS.errors.mark()

    View Slide

  122. Awww
    try:
    something()
    except Exception as e:
    log.error("omg", exc_info=e)

    View Slide

  123. Errors

    View Slide

  124. Errors
    • logging integration

    View Slide

  125. Errors
    • logging
    • structlog

    View Slide

  126. Errors
    • logging
    • structlog
    • web apps: error views

    View Slide

  127. Error View
    @view_config(context=Exception)
    def err(exc, request):
    return Response(
    "oops: " + raven_client.captureException()
    )

    View Slide

  128. Metrics
    measure from outside

    View Slide

  129. WSGI Servers

    View Slide

  130. WSGI Servers
    • gunicorn: --statsd-host

    View Slide

  131. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:

    View Slide

  132. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:
    • --stats-push statsd:

    View Slide

  133. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:
    • --stats-push statsd:
    • --carbon

    View Slide

  134. Middleware
    def timing_tween_factory(handler, registry):
    def timing_tween(request):
    with STATS.request_time.time():
    return handler(request)
    return timing_tween

    View Slide

  135. Middleware
    def timing_tween_factory(handler, registry):
    def timing_tween(request):
    with STATS.request_time.time():
    return handler(request)
    return timing_tween

    View Slide

  136. Extract from Logs

    View Slide

  137. Leverage
    Monitoring

    View Slide

  138. Remaining

    View Slide

  139. Remaining
    1. measure code paths

    View Slide

  140. Remaining
    1. measure code paths
    2. expose gauges

    View Slide

  141. Summary

    View Slide

  142. Summary

    View Slide

  143. Summary

    View Slide

  144. Summary

    View Slide

  145. Summary

    View Slide

  146. ox.cx/b
    @hynek
    vrmd.de

    View Slide