Beyond grep – PyCon JP

Beyond grep – PyCon JP

174e7b0ff60963f821d0b9a4f1a3ef52?s=128

Hynek Schlawack

October 10, 2015
Tweet

Transcript

  1. Beyond grep Pragmatic Logging & Metrics Hynek Schlawack

  2. None
  3. ͩΩ΁ͷ΅! @hynek https://hynek.me https://github.com/hynek

  4. www.variomedia.de

  5. ox.cx/b

  6. Agenda

  7. Agenda

  8. Agenda

  9. Agenda

  10. Errors

  11. Errors

  12. Requirements

  13. Requirements •fast

  14. Requirements •fast •once

  15. Requirements •fast •once

  16. Requirements •fast •once •context

  17. Requirements •fast •once •context

  18. None
  19. None
  20. None
  21. None
  22. None
  23. None
  24. None
  25. None
  26. None
  27. None
  28. Raven-Python

  29. Raven-Python Transports:

  30. Raven-Python Transports: • gevent

  31. Raven-Python Transports: • gevent • aiohttp

  32. Raven-Python Transports: • gevent • aiohttp • Twisted

  33. Raven-Python Transports: • gevent • aiohttp • Twisted • …8

    others
  34. Raven-Python Integrations: Transports: • gevent • aiohttp • Twisted •

    …8 others
  35. Raven-Python Integrations: • logging Transports: • gevent • aiohttp •

    Twisted • …8 others
  36. Raven-Python Integrations: • logging • Django Transports: • gevent •

    aiohttp • Twisted • …8 others
  37. Raven-Python Integrations: • logging • Django • WSGI Transports: •

    gevent • aiohttp • Twisted • …8 others
  38. Raven-Python Integrations: • logging • Django • WSGI • …9

    others Transports: • gevent • aiohttp • Twisted • …8 others
  39. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  40. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  41. Vanilla from raven import Client client = Client("https://yoursentry") try: 1

    / 0 except ZeroDivisionError: client.captureException()
  42. Vanilla @client.capture_exceptions def foo(): raise Exception()

  43. Django INSTALLED_APPS = ( … "raven.contrib.django.raven_compat", … )

  44. Progress!

  45. Progress! ✓

  46. Metrics

  47. Metrics?

  48. Metrics? • numbers in a DB

  49. Metrics? • numbers in a DB • guessing vs knowing

  50. –D. J. Bernstein “Measure. Don’t speculate.”

  51. System Metrics vs App Metrics • load • network traffic

    • I/O • …
  52. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters
  53. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers
  54. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers • gauges
  55. System Metrics vs App Metrics • load • network traffic

    • I/O • … • counters • timers • gauges • …
  56. Aggregation

  57. Correlation

  58. Math

  59. Math • # reqs / s?

  60. Math • # reqs / s? • worst 0.01% ⟨req

    time⟩?
  61. Math • # reqs / s? • worst 0.01% ⟨req

    time⟩? • don’t try this alone!
  62. Monitoring

  63. Monitoring • latency

  64. Monitoring • latency • error rates

  65. Monitoring • latency • error rates • anomalies

  66. Storage

  67. Librato Metrics

  68. Graphite 800 pound gorilla

  69. Grafana

  70. InfluxDB Graphite++ in Go

  71. InfluxDB Graphite++ in Go

  72. Prometheus Metrics & Monitoring

  73. Prometheus Metrics & Monitoring

  74. Prometheus Metrics & Monitoring

  75. Prometheus Metrics & Monitoring

  76. Prometheus Metrics & Monitoring

  77. Prometheus Metrics & Monitoring

  78. Collecting

  79. Approaches

  80. Approaches 1. external aggregation: StatsD, Riemann

  81. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

  82. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection
  83. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB
  84. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB + in-app dashboard, useful in dev
  85. Approaches 1. external aggregation: StatsD, Riemann + no state, simple

    – no direct introspection 2. aggregate in-app, deliver to DB + in-app dashboard, useful in dev – state w/i app
  86. None
  87. (g|py)?statsd(py|-client)? import statsd c = statsd.StatsClient( "statsd.local", 8125 )

  88. (g|py)?statsd(py|-client)?

  89. (g|py)?statsd(py|-client)? c.incr("foo")

  90. (g|py)?statsd(py|-client)? c.incr("foo") c.timing("stats.timed", 320)

  91. Scales from greplin import scales from greplin.scales.meter import MeterStat STATS

    = scales.collection( "/Resource", MeterStat("reqs"), scales.PmfStat("request_time") )
  92. Scales from greplin import scales from greplin.scales.meter import MeterStat STATS

    = scales.collection( "/Resource", MeterStat("reqs"), scales.PmfStat("request_time") )
  93. Scales

  94. Scales STATS.reqs.mark()

  95. Scales STATS.reqs.mark() with STATS.request_time.time():

  96. Scales STATS.reqs.mark() with STATS.request_time.time(): do_something_expensive()

  97. Dashboard Scales

  98. Dashboard Scales

  99. Dashboard Scales … "request_time": { "count": 567315293, "99percentile": 0.10978688716888428, "75percentile":

    0.013181567192077637, "min": 0.0002448558807373047, "max": 30.134822130203247, "98percentile": 0.08934824466705339, "95percentile": 0.027234303951263434, "median": 0.009176492691040039, "999percentile": 0.14235656142234793, "stddev": 0.01676855570363413, "mean": 0.013247184020535955 }, …
  100. from greplin.scales import graphite pusher = graphite.GraphitePeriodicPusher( "graphite.host", 2003, period=30,

    prefix="yourapp", ) pusher.allow("**") pusher.start()
  101. Prometheus

  102. Prometheus import random, time from prometheus_client import \ start_http_server, Summary

  103. Prometheus import random, time from prometheus_client import \ start_http_server, Summary

    FUNC_TIME = Summary( "func_seconds", "Time spent in func")
  104. Prometheus import random, time from prometheus_client import \ start_http_server, Summary

    FUNC_TIME = Summary( "func_seconds", "Time spent in func") @FUNC_TIME.time() def func(t): time.sleep(t)
  105. Prometheus import random, time from prometheus_client import \ start_http_server, Summary

    FUNC_TIME = Summary( "func_seconds", "Time spent in func") @FUNC_TIME.time() def func(t): time.sleep(t) if __name__ == '__main__': start_http_server(8000) while True: func(random.random())
  106. Prometheus # HELP func_seconds Time spent in func # TYPE

    func_seconds summary func_seconds_count 78.0 func_seconds_sum 37.8028838634491
  107. Progress! ✓

  108. Progress! ✓ ✓

  109. Logging

  110. Splunk

  111. Splunk

  112. Papertrail

  113. loggly

  114. ELK Elasticsearch + Logstash + Kibana

  115. None
  116. None
  117. Logstash

  118. Elasticsearch Logstash

  119. Elasticsearch Logstash

  120. Elasticsearch Logstash Kibana

  121. Graylog

  122. Graylog

  123. Graylog

  124. Graylog

  125. None
  126. None
  127. Goal @400000005270e0d604afce64 { "event": "logged_in", "user": "guido", "ip": "8.8.8.8", "referrer":

    "http://google.com" }
  128. Context & Format

  129. structlog

  130. BoundLogger structlog

  131. Original Logger BoundLogger structlog e.g. logging.Logger

  132. Original Logger BoundLogger bind values log.bind(key=value) Context structlog

  133. Original Logger BoundLogger bind values log.bind(key=value) Context log events log.info(event,

    another_key=another_value) + structlog
  134. Original Logger BoundLogger Processor 1 Processor n Return Value Return

    Value bind values log.bind(key=value) Context log events log.info(event, another_key=another_value) + structlog
  135. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  136. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  137. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  138. >>> import structlog >>> log = structlog.get_logger() >>> log.error("ohai", why=42,

    wtf=True) why=42 wtf=True event='ohai'
  139. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  140. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  141. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  142. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  143. >>> from structlog import get_logger >>> log = get_logger() >>>

    log = log.bind(user='anonymous', some_key=23) >>> log = log.bind(user='hynek', another_key=42) >>> log.info('user.logged_in', happy=True) some_key=23 user='hynek' another_key=42 happy=True event='user.logged_in'
  144. def view(request): log = logger.bind(request=request) log.info("view_rendered") return "hello, world"

  145. def view(request): log = logger.bind(request=request) log.info("view_rendered") return "hello, world"

  146. def view(request): log = logger.bind(request=request) log.info("view_rendered") return "hello, world"

  147. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  148. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  149. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  150. def request_extractor(_, __, event): req = event.pop("request", None) if req

    is not None: event.update({ "client_addr": req.client_addr, "user_id": req.authenticated_userid, }) return event
  151. {"client_addr": "192.168.42.42", "user_id": null, "event": "view_rendered"}

  152. import logging, sys logger = logging.getLogger() logger.addHandler( logging.StreamHandler( sys.stdout )

    )
  153. import logging, sys logger = logging.getLogger() logger.addHandler( logging.StreamHandler( sys.stdout )

    )
  154. Capture

  155. Capture • into files

  156. Capture • into files • to syslog / a queue

  157. Capture • into files • to syslog / a queue

    • pipe into a logging agent
  158. None
  159. log = log.bind(user="guido") log.info("user.login")

  160. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog

  161. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog stdout

    logging
  162. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog stdout

    logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  163. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog logstash-forwarder

    logstash stdout logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  164. {"event": "user.login", "user": "guido"} log = log.bind(user="guido") log.info("user.login") structlog logstash-forwarder

    logstash 1010001101 Elasticsearch stdout logging /var/log/app/current runit’s svlogd (adds TAI64 timestamp)
  165. Progress! ✓ ✓

  166. Progress! ✓ ✓ ✓

  167. Wait a Minute…

  168. Wait a Minute…

  169. Ugh try: STATS.time.timing(): something() except Exception as e: log.error("omg", exc_info=e)

    raven_client.captureError() STATS.errors.mark()
  170. Ugh try: STATS.time.timing(): something() except Exception as e: log.error("omg", exc_info=e)

    raven_client.captureError() STATS.errors.mark()
  171. Awww try: something() except Exception as e: log.exception("omg")

  172. Errors

  173. Errors • logging integration

  174. Errors • logging integration • structlog

  175. Errors • logging integration • structlog • web apps: error

    views
  176. Error View @view_config(context=Exception) def err(exc, request): return Response( "oops: "

    + raven_client.captureException() )
  177. None
  178. Metrics measure from outside

  179. WSGI Servers

  180. WSGI Servers • gunicorn: --statsd-host <host>

  181. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI:

  182. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI: • --stats-push

    statsd:<host>
  183. WSGI Servers • gunicorn: --statsd-host <host> • uWSGI: • --stats-push

    statsd:<host> • --carbon <host>
  184. Middleware def timing_tween_factory(handler, registry): def timing_tween(request): with STATS.request_time.time(): return handler(request)

    return timing_tween
  185. Middleware def timing_tween_factory(handler, registry): def timing_tween(request): with STATS.request_time.time(): return handler(request)

    return timing_tween
  186. Extract from Logs

  187. Extract from Logs logstash

  188. Extract from Logs logstash • StatsD

  189. Extract from Logs logstash • StatsD • Graphite

  190. Extract from Logs logstash • StatsD • Graphite • librato

  191. Extract from Logs logstash • StatsD • Graphite • librato

    • …
  192. Extract from Logs logstash • StatsD • Graphite • librato

    • … or structlog
  193. Leverage Monitoring

  194. Remaining

  195. Remaining 1. measure code paths

  196. Remaining 1. measure code paths 2. expose gauges

  197. Summary

  198. Summary

  199. Summary

  200. Summary

  201. Summary

  202. ox.cx/b @hynek vrmd.de