Beyond grep: Practical Logging and Metrics

Beyond grep: Practical Logging and Metrics

Your Python applications are running but you’re wondering what they are doing? Your only clue about their current state is the server load after ssh-ing into the servers? Let’s change that!

174e7b0ff60963f821d0b9a4f1a3ef52?s=128

Hynek Schlawack

April 12, 2015
Tweet

Transcript

  1. Beyond grep Pragmatic Logging & Metrics https://ox.cx/b Hynek Schlawack

  2. Hi! @hynek https://hynek.me https://github.com/hynek

  3. Agenda

  4. Agenda

  5. Agenda

  6. Agenda

  7. Errors

  8. Errors

  9. Requirements

  10. Requirements •fast

  11. Requirements •fast •once

  12. Requirements •fast •once

  13. Requirements •fast •once •context

  14. Requirements •fast •once •context

  15. None
  16. None
  17. None
  18. None
  19. None
  20. None
  21. None
  22. None
  23. None
  24. Raven-Python

  25. Raven-Python Integrations:

  26. Raven-Python Integrations: • logging

  27. Raven-Python Integrations: • logging • Django

  28. Raven-Python Integrations: • logging • Django • WSGI

  29. Raven-Python Integrations: • logging • Django • WSGI • …9

    others
  30. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  31. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  32. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  33. Django INSTALLED_APPS = ( … "raven.contrib.django.raven_compat", … )

  34. Progress!

  35. Progress! ✓

  36. Metrics

  37. Metrics?

  38. Metrics? • numbers in a DB

  39. Metrics? • numbers in a DB • guessing vs knowing

  40. System Metrics vs App Metrics • load • network traffic

    • I/O • …
  41. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters
  42. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers
  43. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers • gauges
  44. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers • gauges • …
  45. Aggregation

  46. Correlation

  47. Math

  48. Math • # reqs / s?

  49. Math • # reqs / s? • worst 0.01% ⟨req

    time⟩?
  50. Math • # reqs / s? • worst 0.01% ⟨req

    time⟩? • don’t try this alone!
  51. Monitoring

  52. Monitoring • latency

  53. Monitoring • latency • error rates

  54. Monitoring • latency • error rates • anomalies

  55. Storage

  56. Librato Metrics

  57. Graphite 800 pound gorilla

  58. Grafana

  59. InfluxDB Graphite++ in Go

  60. InfluxDB Graphite++ in Go

  61. Collecting

  62. Approaches

  63. Approaches 1. external aggregation: StatsD, Riemann

  64. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

  65. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection
  66. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB
  67. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB + in-app dashboard, useful in dev
  68. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB + in-app dashboard, useful in dev – state w/i app
  69. None
  70. (g|py)?statsd(py|-client)? import statsd c = statsd.StatsClient( "statsd.local", 8125 )

  71. (g|py)?statsd(py|-client)?

  72. (g|py)?statsd(py|-client)? c.incr("foo")

  73. (g|py)?statsd(py|-client)? c.incr("foo") c.timing("stats.timed", 320)

  74. Scales from greplin import scales from greplin.scales.meter import MeterStat STATS

    = scales.collection( "/Resource", MeterStat("reqs"), scales.PmfStat("request_time") )
  75. Scales from greplin import scales from greplin.scales.meter import MeterStat STATS

    = scales.collection( "/Resource", MeterStat("reqs"), scales.PmfStat("request_time") )
  76. Scales

  77. Scales STATS.reqs.mark()

  78. Scales STATS.reqs.mark() with STATS.request_time.time():

  79. Scales STATS.reqs.mark() with STATS.request_time.time(): do_something_expensive()

  80. Dashboard Scales

  81. Dashboard Scales

  82. Dashboard Scales … "request_time": { "count": 567315293, "99percentile": 0.10978688716888428, "75percentile":

    0.013181567192077637, "min": 0.0002448558807373047, "max": 30.134822130203247, "98percentile": 0.08934824466705339, "95percentile": 0.027234303951263434, "median": 0.009176492691040039, "999percentile": 0.14235656142234793, "stddev": 0.01676855570363413, "mean": 0.013247184020535955 }, …
  83. Progress! ✓

  84. Progress! ✓ ✓

  85. Logging

  86. Splunk

  87. Splunk

  88. Papertrail

  89. loggly

  90. ELK Elasticsearch + Logstash + Kibana

  91. Graylog

  92. Graylog

  93. Graylog

  94. Graylog

  95. None
  96. Goal @400000005270e0d604afce64 { "event": "logged_in", "user": "guido", "ip": "8.8.8.8", "referrer":

    "http://google.com" }
  97. Context & Format

  98. structlog

  99. BoundLogger structlog

  100. Original Logger BoundLogger structlog e.g. logging.Logger

  101. Original Logger BoundLogger bind values log.bind(key=value) Context structlog

  102. Original Logger BoundLogger bind values log.bind(key=value) Context log events log.info(event,

    another_key=another_value) + structlog
  103. Original Logger BoundLogger Processor 1 Processor n Return Value Return

    Value bind values log.bind(key=value) Context log events log.info(event, another_key=another_value) + structlog
  104. Original Logger BoundLogger Processor 1 Processor n Return Value Return

    Value bind values log.bind(key=value) Context log events log.info(event, another_key=another_value) + structlog
  105. import logging, sys logger = logging.getLogger() logger.addHandler( logging.StreamHandler( sys.stdout )

    )
  106. import logging, sys logger = logging.getLogger() logger.addHandler( logging.StreamHandler( sys.stdout )

    )
  107. Capture

  108. Capture • into files

  109. Capture • into files • to syslog / a queue

  110. Capture • into files • to syslog / a queue

    • pipe into a logging agent
  111. log = log.bind(user="guido") log.info("user.login")

  112. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog

  113. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog stdout

    logging
  114. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog stdout

    logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  115. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog logstash-forwarder

    logstash stdout logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  116. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog logstash-forwarder

    logstash 1010001101 Elasticsearch stdout logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  117. Progress! ✓ ✓

  118. Progress! ✓ ✓ ✓

  119. Wait a Minute…

  120. Ugh try: STATS.time.timing(): something() except Exception as e: log.error("omg", exc_info=e)

    raven_client.captureError() STATS.errors.mark()
  121. Ugh try: STATS.time.timing(): something() except Exception as e: log.error("omg", exc_info=e)

    raven_client.captureError() STATS.errors.mark()
  122. Awww try: something() except Exception as e: log.error("omg", exc_info=e)

  123. Errors

  124. Errors • logging integration

  125. Errors • logging • structlog

  126. Errors • logging • structlog • web apps: error views

  127. Error View @view_config(context=Exception) def err(exc, request): return Response( "oops: "

    + raven_client.captureException() )
  128. Metrics measure from outside

  129. WSGI Servers

  130. WSGI Servers • gunicorn: --statsd-host <host>

  131. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI:

  132. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI: • --stats-push

    statsd:<host>
  133. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI: • --stats-push

    statsd:<host> • --carbon <host>
  134. Middleware def timing_tween_factory(handler, registry): def timing_tween(request): with STATS.request_time.time(): return handler(request)

    return timing_tween
  135. Middleware def timing_tween_factory(handler, registry): def timing_tween(request): with STATS.request_time.time(): return handler(request)

    return timing_tween
  136. Extract from Logs

  137. Leverage Monitoring

  138. Remaining

  139. Remaining 1. measure code paths

  140. Remaining 1. measure code paths 2. expose gauges

  141. Summary

  142. Summary

  143. Summary

  144. Summary

  145. Summary

  146. ox.cx/b @hynek vrmd.de