$30 off During Our Annual Pro Sale. View Details »

Beyond grep – PyCon JP

Beyond grep – PyCon JP

Hynek Schlawack

October 10, 2015
Tweet

More Decks by Hynek Schlawack

Other Decks in Programming

Transcript

  1. Beyond grep

    Pragmatic Logging & Metrics
    Hynek Schlawack

    View Slide

  2. View Slide

  3. ͩΩ΁ͷ΅!
    @hynek
    https://hynek.me
    https://github.com/hynek

    View Slide

  4. www.variomedia.de

    View Slide

  5. ox.cx/b

    View Slide

  6. Agenda

    View Slide

  7. Agenda

    View Slide

  8. Agenda

    View Slide

  9. Agenda

    View Slide

  10. Errors

    View Slide

  11. Errors

    View Slide

  12. Requirements

    View Slide

  13. Requirements
    •fast

    View Slide

  14. Requirements
    •fast
    •once

    View Slide

  15. Requirements
    •fast
    •once

    View Slide

  16. Requirements
    •fast
    •once
    •context

    View Slide

  17. Requirements
    •fast
    •once
    •context

    View Slide

  18. View Slide

  19. View Slide

  20. View Slide

  21. View Slide

  22. View Slide

  23. View Slide

  24. View Slide

  25. View Slide

  26. View Slide

  27. View Slide

  28. Raven-Python

    View Slide

  29. Raven-Python
    Transports:

    View Slide

  30. Raven-Python
    Transports:
    • gevent

    View Slide

  31. Raven-Python
    Transports:
    • gevent
    • aiohttp

    View Slide

  32. Raven-Python
    Transports:
    • gevent
    • aiohttp
    • Twisted

    View Slide

  33. Raven-Python
    Transports:
    • gevent
    • aiohttp
    • Twisted
    • …8 others

    View Slide

  34. Raven-Python
    Integrations:
    Transports:
    • gevent
    • aiohttp
    • Twisted
    • …8 others

    View Slide

  35. Raven-Python
    Integrations:
    • logging
    Transports:
    • gevent
    • aiohttp
    • Twisted
    • …8 others

    View Slide

  36. Raven-Python
    Integrations:
    • logging
    • Django
    Transports:
    • gevent
    • aiohttp
    • Twisted
    • …8 others

    View Slide

  37. Raven-Python
    Integrations:
    • logging
    • Django
    • WSGI
    Transports:
    • gevent
    • aiohttp
    • Twisted
    • …8 others

    View Slide

  38. Raven-Python
    Integrations:
    • logging
    • Django
    • WSGI
    • …9 others
    Transports:
    • gevent
    • aiohttp
    • Twisted
    • …8 others

    View Slide

  39. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View Slide

  40. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View Slide

  41. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View Slide

  42. Vanilla
    @client.capture_exceptions
    def foo():
    raise Exception()

    View Slide

  43. Django
    INSTALLED_APPS = (

    "raven.contrib.django.raven_compat",

    )

    View Slide

  44. Progress!

    View Slide

  45. Progress!

    View Slide

  46. Metrics

    View Slide

  47. Metrics?

    View Slide

  48. Metrics?
    • numbers in a DB

    View Slide

  49. Metrics?
    • numbers in a DB
    • guessing vs knowing

    View Slide

  50. –D. J. Bernstein
    “Measure. Don’t speculate.”

    View Slide

  51. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …

    View Slide

  52. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters

    View Slide

  53. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers

    View Slide

  54. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers
    • gauges

    View Slide

  55. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers
    • gauges
    • …

    View Slide

  56. Aggregation

    View Slide

  57. Correlation

    View Slide

  58. Math

    View Slide

  59. Math
    • # reqs / s?

    View Slide

  60. Math
    • # reqs / s?
    • worst 0.01% ⟨req time⟩?

    View Slide

  61. Math
    • # reqs / s?
    • worst 0.01% ⟨req time⟩?
    • don’t try this alone!

    View Slide

  62. Monitoring

    View Slide

  63. Monitoring
    • latency

    View Slide

  64. Monitoring
    • latency
    • error rates

    View Slide

  65. Monitoring
    • latency
    • error rates
    • anomalies

    View Slide

  66. Storage

    View Slide

  67. Librato Metrics

    View Slide

  68. Graphite
    800 pound gorilla

    View Slide

  69. Grafana

    View Slide

  70. InfluxDB
    Graphite++ in Go

    View Slide

  71. InfluxDB
    Graphite++ in Go

    View Slide

  72. Prometheus
    Metrics & Monitoring

    View Slide

  73. Prometheus
    Metrics & Monitoring

    View Slide

  74. Prometheus
    Metrics & Monitoring

    View Slide

  75. Prometheus
    Metrics & Monitoring

    View Slide

  76. Prometheus
    Metrics & Monitoring

    View Slide

  77. Prometheus
    Metrics & Monitoring

    View Slide

  78. Collecting

    View Slide

  79. Approaches

    View Slide

  80. Approaches
    1. external aggregation: StatsD, Riemann

    View Slide

  81. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple

    View Slide

  82. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection

    View Slide

  83. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB

    View Slide

  84. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB
    + in-app dashboard, useful in dev

    View Slide

  85. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB
    + in-app dashboard, useful in dev
    – state w/i app

    View Slide

  86. View Slide

  87. (g|py)?statsd(py|-client)?
    import statsd
    c = statsd.StatsClient(
    "statsd.local", 8125
    )

    View Slide

  88. (g|py)?statsd(py|-client)?

    View Slide

  89. (g|py)?statsd(py|-client)?
    c.incr("foo")

    View Slide

  90. (g|py)?statsd(py|-client)?
    c.incr("foo")
    c.timing("stats.timed", 320)

    View Slide

  91. Scales
    from greplin import scales
    from greplin.scales.meter import MeterStat
    STATS = scales.collection(
    "/Resource",
    MeterStat("reqs"),
    scales.PmfStat("request_time")
    )

    View Slide

  92. Scales
    from greplin import scales
    from greplin.scales.meter import MeterStat
    STATS = scales.collection(
    "/Resource",
    MeterStat("reqs"),
    scales.PmfStat("request_time")
    )

    View Slide

  93. Scales

    View Slide

  94. Scales
    STATS.reqs.mark()

    View Slide

  95. Scales
    STATS.reqs.mark()
    with STATS.request_time.time():

    View Slide

  96. Scales
    STATS.reqs.mark()
    with STATS.request_time.time():
    do_something_expensive()

    View Slide

  97. Dashboard
    Scales

    View Slide

  98. Dashboard
    Scales

    View Slide

  99. Dashboard
    Scales

    "request_time": {
    "count": 567315293,
    "99percentile": 0.10978688716888428,
    "75percentile": 0.013181567192077637,
    "min": 0.0002448558807373047,
    "max": 30.134822130203247,
    "98percentile": 0.08934824466705339,
    "95percentile": 0.027234303951263434,
    "median": 0.009176492691040039,
    "999percentile": 0.14235656142234793,
    "stddev": 0.01676855570363413,
    "mean": 0.013247184020535955
    },

    View Slide

  100. from greplin.scales import graphite
    pusher = graphite.GraphitePeriodicPusher(
    "graphite.host",
    2003,
    period=30,
    prefix="yourapp",
    )
    pusher.allow("**")
    pusher.start()

    View Slide

  101. Prometheus

    View Slide

  102. Prometheus
    import random, time
    from prometheus_client import \
    start_http_server, Summary

    View Slide

  103. Prometheus
    import random, time
    from prometheus_client import \
    start_http_server, Summary
    FUNC_TIME = Summary(
    "func_seconds", "Time spent in func")

    View Slide

  104. Prometheus
    import random, time
    from prometheus_client import \
    start_http_server, Summary
    FUNC_TIME = Summary(
    "func_seconds", "Time spent in func")
    @FUNC_TIME.time()
    def func(t):
    time.sleep(t)

    View Slide

  105. Prometheus
    import random, time
    from prometheus_client import \
    start_http_server, Summary
    FUNC_TIME = Summary(
    "func_seconds", "Time spent in func")
    @FUNC_TIME.time()
    def func(t):
    time.sleep(t)
    if __name__ == '__main__':
    start_http_server(8000)
    while True:
    func(random.random())

    View Slide

  106. Prometheus
    # HELP func_seconds Time spent in func
    # TYPE func_seconds summary
    func_seconds_count 78.0
    func_seconds_sum 37.8028838634491

    View Slide

  107. Progress!

    View Slide

  108. Progress!
    ✓ ✓

    View Slide

  109. Logging

    View Slide

  110. Splunk

    View Slide

  111. Splunk

    View Slide

  112. Papertrail

    View Slide

  113. loggly

    View Slide

  114. ELK
    Elasticsearch + Logstash + Kibana

    View Slide

  115. View Slide

  116. View Slide

  117. Logstash

    View Slide

  118. Elasticsearch
    Logstash

    View Slide

  119. Elasticsearch
    Logstash

    View Slide

  120. Elasticsearch
    Logstash
    Kibana

    View Slide

  121. Graylog

    View Slide

  122. Graylog

    View Slide

  123. Graylog

    View Slide

  124. Graylog

    View Slide

  125. View Slide

  126. View Slide

  127. Goal
    @400000005270e0d604afce64 {
    "event": "logged_in",
    "user": "guido",
    "ip": "8.8.8.8",
    "referrer": "http://google.com"
    }

    View Slide

  128. Context

    &
    Format

    View Slide

  129. structlog

    View Slide

  130. BoundLogger
    structlog

    View Slide

  131. Original
    Logger
    BoundLogger
    structlog
    e.g.
    logging.Logger

    View Slide

  132. Original
    Logger
    BoundLogger
    bind values
    log.bind(key=value)
    Context
    structlog

    View Slide

  133. Original
    Logger
    BoundLogger
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View Slide

  134. Original
    Logger
    BoundLogger
    Processor 1 Processor n Return
    Value
    Return
    Value
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View Slide

  135. >>> import structlog
    >>> log = structlog.get_logger()
    >>> log.error("ohai", why=42, wtf=True)
    why=42 wtf=True event='ohai'

    View Slide

  136. >>> import structlog
    >>> log = structlog.get_logger()
    >>> log.error("ohai", why=42, wtf=True)
    why=42 wtf=True event='ohai'

    View Slide

  137. >>> import structlog
    >>> log = structlog.get_logger()
    >>> log.error("ohai", why=42, wtf=True)
    why=42 wtf=True event='ohai'

    View Slide

  138. >>> import structlog
    >>> log = structlog.get_logger()
    >>> log.error("ohai", why=42, wtf=True)
    why=42 wtf=True event='ohai'

    View Slide

  139. >>> from structlog import get_logger
    >>> log = get_logger()
    >>> log = log.bind(user='anonymous', some_key=23)
    >>> log = log.bind(user='hynek', another_key=42)
    >>> log.info('user.logged_in', happy=True)
    some_key=23 user='hynek' another_key=42
    happy=True event='user.logged_in'

    View Slide

  140. >>> from structlog import get_logger
    >>> log = get_logger()
    >>> log = log.bind(user='anonymous', some_key=23)
    >>> log = log.bind(user='hynek', another_key=42)
    >>> log.info('user.logged_in', happy=True)
    some_key=23 user='hynek' another_key=42
    happy=True event='user.logged_in'

    View Slide

  141. >>> from structlog import get_logger
    >>> log = get_logger()
    >>> log = log.bind(user='anonymous', some_key=23)
    >>> log = log.bind(user='hynek', another_key=42)
    >>> log.info('user.logged_in', happy=True)
    some_key=23 user='hynek' another_key=42
    happy=True event='user.logged_in'

    View Slide

  142. >>> from structlog import get_logger
    >>> log = get_logger()
    >>> log = log.bind(user='anonymous', some_key=23)
    >>> log = log.bind(user='hynek', another_key=42)
    >>> log.info('user.logged_in', happy=True)
    some_key=23 user='hynek' another_key=42
    happy=True event='user.logged_in'

    View Slide

  143. >>> from structlog import get_logger
    >>> log = get_logger()
    >>> log = log.bind(user='anonymous', some_key=23)
    >>> log = log.bind(user='hynek', another_key=42)
    >>> log.info('user.logged_in', happy=True)
    some_key=23 user='hynek' another_key=42
    happy=True event='user.logged_in'

    View Slide

  144. def view(request):
    log = logger.bind(request=request)
    log.info("view_rendered")
    return "hello, world"

    View Slide

  145. def view(request):
    log = logger.bind(request=request)
    log.info("view_rendered")
    return "hello, world"

    View Slide

  146. def view(request):
    log = logger.bind(request=request)
    log.info("view_rendered")
    return "hello, world"

    View Slide

  147. def request_extractor(_, __, event):
    req = event.pop("request", None)
    if req is not None:
    event.update({
    "client_addr":
    req.client_addr,
    "user_id":
    req.authenticated_userid,
    })
    return event

    View Slide

  148. def request_extractor(_, __, event):
    req = event.pop("request", None)
    if req is not None:
    event.update({
    "client_addr":
    req.client_addr,
    "user_id":
    req.authenticated_userid,
    })
    return event

    View Slide

  149. def request_extractor(_, __, event):
    req = event.pop("request", None)
    if req is not None:
    event.update({
    "client_addr":
    req.client_addr,
    "user_id":
    req.authenticated_userid,
    })
    return event

    View Slide

  150. def request_extractor(_, __, event):
    req = event.pop("request", None)
    if req is not None:
    event.update({
    "client_addr":
    req.client_addr,
    "user_id":
    req.authenticated_userid,
    })
    return event

    View Slide

  151. {"client_addr": "192.168.42.42",
    "user_id": null,
    "event": "view_rendered"}

    View Slide

  152. import logging, sys
    logger = logging.getLogger()
    logger.addHandler(
    logging.StreamHandler(
    sys.stdout
    )
    )

    View Slide

  153. import logging, sys
    logger = logging.getLogger()
    logger.addHandler(
    logging.StreamHandler(
    sys.stdout
    )
    )

    View Slide

  154. Capture

    View Slide

  155. Capture
    • into files

    View Slide

  156. Capture
    • into files
    • to syslog / a queue

    View Slide

  157. Capture
    • into files
    • to syslog / a queue
    • pipe into a logging agent

    View Slide

  158. View Slide

  159. log = log.bind(user="guido")
    log.info("user.login")

    View Slide

  160. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog

    View Slide

  161. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    stdout
    logging

    View Slide

  162. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View Slide

  163. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    logstash-forwarder
    logstash
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View Slide

  164. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    logstash-forwarder
    logstash
    1010001101
    Elasticsearch
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View Slide

  165. Progress!
    ✓ ✓

    View Slide

  166. Progress!
    ✓ ✓ ✓

    View Slide

  167. Wait a Minute…

    View Slide

  168. Wait a Minute…

    View Slide

  169. Ugh
    try:
    STATS.time.timing():
    something()
    except Exception as e:
    log.error("omg", exc_info=e)
    raven_client.captureError()
    STATS.errors.mark()

    View Slide

  170. Ugh
    try:
    STATS.time.timing():
    something()
    except Exception as e:
    log.error("omg", exc_info=e)
    raven_client.captureError()
    STATS.errors.mark()

    View Slide

  171. Awww
    try:
    something()
    except Exception as e:
    log.exception("omg")

    View Slide

  172. Errors

    View Slide

  173. Errors
    • logging integration

    View Slide

  174. Errors
    • logging integration
    • structlog

    View Slide

  175. Errors
    • logging integration
    • structlog
    • web apps: error views

    View Slide

  176. Error View
    @view_config(context=Exception)
    def err(exc, request):
    return Response(
    "oops: " + raven_client.captureException()
    )

    View Slide

  177. View Slide

  178. Metrics
    measure from outside

    View Slide

  179. WSGI Servers

    View Slide

  180. WSGI Servers
    • gunicorn: --statsd-host

    View Slide

  181. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:

    View Slide

  182. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:
    • --stats-push statsd:

    View Slide

  183. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:
    • --stats-push statsd:
    • --carbon

    View Slide

  184. Middleware
    def timing_tween_factory(handler, registry):
    def timing_tween(request):
    with STATS.request_time.time():
    return handler(request)
    return timing_tween

    View Slide

  185. Middleware
    def timing_tween_factory(handler, registry):
    def timing_tween(request):
    with STATS.request_time.time():
    return handler(request)
    return timing_tween

    View Slide

  186. Extract from Logs

    View Slide

  187. Extract from Logs
    logstash

    View Slide

  188. Extract from Logs
    logstash
    • StatsD

    View Slide

  189. Extract from Logs
    logstash
    • StatsD
    • Graphite

    View Slide

  190. Extract from Logs
    logstash
    • StatsD
    • Graphite
    • librato

    View Slide

  191. Extract from Logs
    logstash
    • StatsD
    • Graphite
    • librato
    • …

    View Slide

  192. Extract from Logs
    logstash
    • StatsD
    • Graphite
    • librato
    • …
    or structlog

    View Slide

  193. Leverage
    Monitoring

    View Slide

  194. Remaining

    View Slide

  195. Remaining
    1. measure code paths

    View Slide

  196. Remaining
    1. measure code paths
    2. expose gauges

    View Slide

  197. Summary

    View Slide

  198. Summary

    View Slide

  199. Summary

    View Slide

  200. Summary

    View Slide

  201. Summary

    View Slide

  202. ox.cx/b
    @hynek
    vrmd.de

    View Slide