Beyond grep – EuroPython Edition

Beyond grep – EuroPython Edition

Knowing that your application is up and running is great. However in order to make informed decisions about the future, you also need to know in what state your application currently is and how its state is developing over time.

This talk combines two topics that are usually discussed separately. However I do believe that they have a lot of overlap and ultimately a similar goal: giving you vital insights about your system in production.

We'll have a look at their commonalities, differences, popular tools, and how to apply everything in your own systems while avoiding some common pitfalls.

174e7b0ff60963f821d0b9a4f1a3ef52?s=128

Hynek Schlawack

July 21, 2015
Tweet

Transcript

  1. Beyond grep Pragmatic Logging & Metrics Hynek Schlawack

  2. Hi! @hynek https://hynek.me https://github.com/hynek

  3. www.variomedia.de

  4. ox.cx/b

  5. Agenda

  6. Agenda

  7. Agenda

  8. Agenda

  9. Errors

  10. Errors

  11. Requirements

  12. Requirements •fast

  13. Requirements •fast •once

  14. Requirements •fast •once

  15. Requirements •fast •once •context

  16. Requirements •fast •once •context

  17. None
  18. None
  19. None
  20. None
  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. Raven-Python

  28. Raven-Python Transports:

  29. Raven-Python Transports: • gevent

  30. Raven-Python Transports: • gevent • aiohttp

  31. Raven-Python Transports: • gevent • aiohttp • Twisted

  32. Raven-Python Transports: • gevent • aiohttp • Twisted • …8

    others
  33. Raven-Python Integrations: Transports: • gevent • aiohttp • Twisted •

    …8 others
  34. Raven-Python Integrations: • logging Transports: • gevent • aiohttp •

    Twisted • …8 others
  35. Raven-Python Integrations: • logging • Django Transports: • gevent •

    aiohttp • Twisted • …8 others
  36. Raven-Python Integrations: • logging • Django • WSGI Transports: •

    gevent • aiohttp • Twisted • …8 others
  37. Raven-Python Integrations: • logging • Django • WSGI • …9

    others Transports: • gevent • aiohttp • Twisted • …8 others
  38. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  39. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  40. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  41. Vanilla @client.capture_exceptions def foo(): raise Exception()

  42. Django INSTALLED_APPS = ( … "raven.contrib.django.raven_compat", … )

  43. Progress!

  44. Progress! ✓

  45. hynek2015

  46. Metrics

  47. Metrics?

  48. Metrics? • numbers in a DB

  49. Metrics? • numbers in a DB • guessing vs knowing

  50. –D. J. Bernstein “Measure. Don’t speculate.”

  51. System Metrics vs App Metrics • load • network traffic

    • I/O • …
  52. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters
  53. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers
  54. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers • gauges
  55. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers • gauges • …
  56. Aggregation

  57. Correlation

  58. Math

  59. Math • # reqs / s?

  60. Math • # reqs / s? • worst 0.01% ⟨req

    time⟩?
  61. Math • # reqs / s? • worst 0.01% ⟨req

    time⟩? • don’t try this alone!
  62. Monitoring

  63. Monitoring • latency

  64. Monitoring • latency • error rates

  65. Monitoring • latency • error rates • anomalies

  66. Storage

  67. Librato Metrics

  68. Graphite 800 pound gorilla

  69. Grafana

  70. InfluxDB Graphite++ in Go

  71. InfluxDB Graphite++ in Go

  72. Collecting

  73. Approaches

  74. Approaches 1. external aggregation: StatsD, Riemann

  75. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

  76. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection
  77. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB
  78. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB + in-app dashboard, useful in dev
  79. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB + in-app dashboard, useful in dev – state w/i app
  80. None
  81. (g|py)?statsd(py|-client)? import statsd c = statsd.StatsClient( "statsd.local", 8125 )

  82. (g|py)?statsd(py|-client)?

  83. (g|py)?statsd(py|-client)? c.incr("foo")

  84. (g|py)?statsd(py|-client)? c.incr("foo") c.timing("stats.timed", 320)

  85. Scales from greplin import scales from greplin.scales.meter import MeterStat STATS

    = scales.collection( "/Resource", MeterStat("reqs"), scales.PmfStat("request_time") )
  86. Scales from greplin import scales from greplin.scales.meter import MeterStat STATS

    = scales.collection( "/Resource", MeterStat("reqs"), scales.PmfStat("request_time") )
  87. Scales

  88. Scales STATS.reqs.mark()

  89. Scales STATS.reqs.mark() with STATS.request_time.time():

  90. Scales STATS.reqs.mark() with STATS.request_time.time(): do_something_expensive()

  91. Dashboard Scales

  92. Dashboard Scales

  93. Dashboard Scales … "request_time": { "count": 567315293, "99percentile": 0.10978688716888428, "75percentile":

    0.013181567192077637, "min": 0.0002448558807373047, "max": 30.134822130203247, "98percentile": 0.08934824466705339, "95percentile": 0.027234303951263434, "median": 0.009176492691040039, "999percentile": 0.14235656142234793, "stddev": 0.01676855570363413, "mean": 0.013247184020535955 }, …
  94. from greplin.scales import graphite pusher = graphite.GraphitePeriodicPusher( "graphite.host", 2003, period=30,

    prefix="yourapp", ) pusher.allow("**") pusher.start()
  95. Progress! ✓

  96. Progress! ✓ ✓

  97. Logging

  98. Splunk

  99. Splunk

  100. Papertrail

  101. loggly

  102. ELK Elasticsearch + Logstash + Kibana

  103. None
  104. None
  105. Logstash

  106. Elasticsearch Logstash

  107. Elasticsearch Logstash

  108. Elasticsearch Logstash Kibana

  109. Graylog

  110. Graylog

  111. Graylog

  112. Graylog

  113. None
  114. None
  115. Goal @400000005270e0d604afce64 { "event": "logged_in", "user": "guido", "ip": "8.8.8.8", "referrer":

    "http://google.com" }
  116. Context & Format

  117. structlog

  118. BoundLogger structlog

  119. Original Logger BoundLogger structlog e.g. logging.Logger

  120. Original Logger BoundLogger bind values log.bind(key=value) Context structlog

  121. Original Logger BoundLogger bind values log.bind(key=value) Context log events log.info(event,

    another_key=another_value) + structlog
  122. Original Logger BoundLogger Processor 1 Processor n Return Value Return

    Value bind values log.bind(key=value) Context log events log.info(event, another_key=another_value) + structlog
  123. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  124. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  125. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  126. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  127. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  128. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  129. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  130. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  131. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  132. def view(request): log = logger.bind(request=request) log.info("view_rendered") return "hello, world"

  133. def view(request): log = logger.bind(request=request) log.info("view_rendered") return "hello, world"

  134. def view(request): log = logger.bind(request=request) log.info("view_rendered") return "hello, world"

  135. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  136. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  137. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  138. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  139. {"client_addr": "192.168.42.42", "user_id": null, "event": "view_rendered"}

  140. import logging, sys logger = logging.getLogger() logger.addHandler( logging.StreamHandler( sys.stdout )

    )
  141. import logging, sys logger = logging.getLogger() logger.addHandler( logging.StreamHandler( sys.stdout )

    )
  142. Capture

  143. Capture • into files

  144. Capture • into files • to syslog / a queue

  145. Capture • into files • to syslog / a queue

    • pipe into a logging agent
  146. None
  147. log = log.bind(user="guido") log.info("user.login")

  148. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog

  149. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog stdout

    logging
  150. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog stdout

    logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  151. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog logstash-forwarder

    logstash stdout logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  152. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog logstash-forwarder

    logstash 1010001101 Elasticsearch stdout logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  153. Progress! ✓ ✓

  154. Progress! ✓ ✓ ✓

  155. Wait a Minute…

  156. Wait a Minute…

  157. Ugh try: STATS.time.timing(): something() except Exception as e: log.error("omg", exc_info=e)

    raven_client.captureError() STATS.errors.mark()
  158. Ugh try: STATS.time.timing(): something() except Exception as e: log.error("omg", exc_info=e)

    raven_client.captureError() STATS.errors.mark()
  159. Awww try: something() except Exception as e: log.exception("omg")

  160. Errors

  161. Errors • logging integration

  162. Errors • logging • structlog

  163. Errors • logging • structlog • web apps: error views

  164. Error View @view_config(context=Exception) def err(exc, request): return Response( "oops: "

    + raven_client.captureException() )
  165. None
  166. Metrics measure from outside

  167. WSGI Servers

  168. WSGI Servers • gunicorn: --statsd-host <host>

  169. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI:

  170. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI: • --stats-push

    statsd:<host>
  171. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI: • --stats-push

    statsd:<host> • --carbon <host>
  172. Middleware def timing_tween_factory(handler, registry): def timing_tween(request): with STATS.request_time.time(): return handler(request)

    return timing_tween
  173. Middleware def timing_tween_factory(handler, registry): def timing_tween(request): with STATS.request_time.time(): return handler(request)

    return timing_tween
  174. Extract from Logs

  175. Extract from Logs logstash

  176. Extract from Logs logstash • StatsD

  177. Extract from Logs logstash • StatsD • Graphite

  178. Extract from Logs logstash • StatsD • Graphite • librato

  179. Extract from Logs logstash • StatsD • Graphite • librato

    • …
  180. Extract from Logs logstash • StatsD • Graphite • librato

    • … or structlog
  181. Leverage Monitoring

  182. Remaining

  183. Remaining 1. measure code paths

  184. Remaining 1. measure code paths 2. expose gauges

  185. Summary

  186. Summary

  187. Summary

  188. Summary

  189. Summary

  190. ox.cx/b @hynek vrmd.de