Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Beyond grep: Practical Logging and Metrics

Beyond grep: Practical Logging and Metrics

Your Python applications are running but you’re wondering what they are doing? Your only clue about their current state is the server load after ssh-ing into the servers? Let’s change that!

Hynek Schlawack

April 12, 2015
Tweet

More Decks by Hynek Schlawack

Other Decks in Programming

Transcript

  1. Beyond grep

    Pragmatic Logging & Metrics

    https://ox.cx/b
    Hynek Schlawack

    View full-size slide

  2. Hi!
    @hynek
    https://hynek.me
    https://github.com/hynek

    View full-size slide

  3. Requirements

    View full-size slide

  4. Requirements
    •fast

    View full-size slide

  5. Requirements
    •fast
    •once

    View full-size slide

  6. Requirements
    •fast
    •once

    View full-size slide

  7. Requirements
    •fast
    •once
    •context

    View full-size slide

  8. Requirements
    •fast
    •once
    •context

    View full-size slide

  9. Raven-Python

    View full-size slide

  10. Raven-Python
    Integrations:

    View full-size slide

  11. Raven-Python
    Integrations:
    • logging

    View full-size slide

  12. Raven-Python
    Integrations:
    • logging
    • Django

    View full-size slide

  13. Raven-Python
    Integrations:
    • logging
    • Django
    • WSGI

    View full-size slide

  14. Raven-Python
    Integrations:
    • logging
    • Django
    • WSGI
    • …9 others

    View full-size slide

  15. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View full-size slide

  16. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View full-size slide

  17. Vanilla
    from raven import Client
    client = Client("https://yoursentry")
    try:
    1 / 0
    except ZeroDivisionError:
    client.captureException()

    View full-size slide

  18. Django
    INSTALLED_APPS = (

    "raven.contrib.django.raven_compat",

    )

    View full-size slide

  19. Progress!

    View full-size slide

  20. Metrics?
    • numbers in a DB

    View full-size slide

  21. Metrics?
    • numbers in a DB
    • guessing vs knowing

    View full-size slide

  22. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …

    View full-size slide

  23. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters

    View full-size slide

  24. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers

    View full-size slide

  25. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers
    • gauges

    View full-size slide

  26. System Metrics vs App Metrics
    • load
    • network traffic
    • I/O
    • …
    • counters
    • timers
    • gauges
    • …

    View full-size slide

  27. Math
    • # reqs / s?

    View full-size slide

  28. Math
    • # reqs / s?
    • worst 0.01% ⟨req time⟩?

    View full-size slide

  29. Math
    • # reqs / s?
    • worst 0.01% ⟨req time⟩?
    • don’t try this alone!

    View full-size slide

  30. Monitoring
    • latency

    View full-size slide

  31. Monitoring
    • latency
    • error rates

    View full-size slide

  32. Monitoring
    • latency
    • error rates
    • anomalies

    View full-size slide

  33. Librato Metrics

    View full-size slide

  34. Graphite
    800 pound gorilla

    View full-size slide

  35. InfluxDB
    Graphite++ in Go

    View full-size slide

  36. InfluxDB
    Graphite++ in Go

    View full-size slide

  37. Approaches
    1. external aggregation: StatsD, Riemann

    View full-size slide

  38. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple

    View full-size slide

  39. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection

    View full-size slide

  40. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB

    View full-size slide

  41. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB
    + in-app dashboard, useful in dev

    View full-size slide

  42. Approaches
    1. external aggregation: StatsD, Riemann
    + no state, simple
    – no direct introspection
    2. aggregate in-app, deliver to DB
    + in-app dashboard, useful in dev
    – state w/i app

    View full-size slide

  43. (g|py)?statsd(py|-client)?
    import statsd
    c = statsd.StatsClient(
    "statsd.local", 8125
    )

    View full-size slide

  44. (g|py)?statsd(py|-client)?

    View full-size slide

  45. (g|py)?statsd(py|-client)?
    c.incr("foo")

    View full-size slide

  46. (g|py)?statsd(py|-client)?
    c.incr("foo")
    c.timing("stats.timed", 320)

    View full-size slide

  47. Scales
    from greplin import scales
    from greplin.scales.meter import MeterStat
    STATS = scales.collection(
    "/Resource",
    MeterStat("reqs"),
    scales.PmfStat("request_time")
    )

    View full-size slide

  48. Scales
    from greplin import scales
    from greplin.scales.meter import MeterStat
    STATS = scales.collection(
    "/Resource",
    MeterStat("reqs"),
    scales.PmfStat("request_time")
    )

    View full-size slide

  49. Scales
    STATS.reqs.mark()

    View full-size slide

  50. Scales
    STATS.reqs.mark()
    with STATS.request_time.time():

    View full-size slide

  51. Scales
    STATS.reqs.mark()
    with STATS.request_time.time():
    do_something_expensive()

    View full-size slide

  52. Dashboard
    Scales

    View full-size slide

  53. Dashboard
    Scales

    View full-size slide

  54. Dashboard
    Scales

    "request_time": {
    "count": 567315293,
    "99percentile": 0.10978688716888428,
    "75percentile": 0.013181567192077637,
    "min": 0.0002448558807373047,
    "max": 30.134822130203247,
    "98percentile": 0.08934824466705339,
    "95percentile": 0.027234303951263434,
    "median": 0.009176492691040039,
    "999percentile": 0.14235656142234793,
    "stddev": 0.01676855570363413,
    "mean": 0.013247184020535955
    },

    View full-size slide

  55. Progress!

    View full-size slide

  56. Progress!
    ✓ ✓

    View full-size slide

  57. ELK
    Elasticsearch + Logstash + Kibana

    View full-size slide

  58. Goal
    @400000005270e0d604afce64 {
    "event": "logged_in",
    "user": "guido",
    "ip": "8.8.8.8",
    "referrer": "http://google.com"
    }

    View full-size slide

  59. Context

    &
    Format

    View full-size slide

  60. BoundLogger
    structlog

    View full-size slide

  61. Original
    Logger
    BoundLogger
    structlog
    e.g.
    logging.Logger

    View full-size slide

  62. Original
    Logger
    BoundLogger
    bind values
    log.bind(key=value)
    Context
    structlog

    View full-size slide

  63. Original
    Logger
    BoundLogger
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View full-size slide

  64. Original
    Logger
    BoundLogger
    Processor 1 Processor n Return
    Value
    Return
    Value
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View full-size slide

  65. Original
    Logger
    BoundLogger
    Processor 1 Processor n Return
    Value
    Return
    Value
    bind values
    log.bind(key=value)
    Context
    log events
    log.info(event, another_key=another_value)
    +
    structlog

    View full-size slide

  66. import logging, sys
    logger = logging.getLogger()
    logger.addHandler(
    logging.StreamHandler(
    sys.stdout
    )
    )

    View full-size slide

  67. import logging, sys
    logger = logging.getLogger()
    logger.addHandler(
    logging.StreamHandler(
    sys.stdout
    )
    )

    View full-size slide

  68. Capture
    • into files

    View full-size slide

  69. Capture
    • into files
    • to syslog / a queue

    View full-size slide

  70. Capture
    • into files
    • to syslog / a queue
    • pipe into a logging agent

    View full-size slide

  71. log = log.bind(user="guido")
    log.info("user.login")

    View full-size slide

  72. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog

    View full-size slide

  73. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    stdout
    logging

    View full-size slide

  74. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View full-size slide

  75. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    logstash-forwarder
    logstash
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View full-size slide

  76. {"event": "user.login", "user": "guido"}
    log = log.bind(user="guido")
    log.info("user.login")
    structlog
    logstash-forwarder
    logstash
    1010001101
    Elasticsearch
    stdout
    logging
    /var/log/app/current
    runit’s svlogd
    (adds TAI64 timestamp)

    View full-size slide

  77. Progress!
    ✓ ✓

    View full-size slide

  78. Progress!
    ✓ ✓ ✓

    View full-size slide

  79. Wait a Minute…

    View full-size slide

  80. Ugh
    try:
    STATS.time.timing():
    something()
    except Exception as e:
    log.error("omg", exc_info=e)
    raven_client.captureError()
    STATS.errors.mark()

    View full-size slide

  81. Ugh
    try:
    STATS.time.timing():
    something()
    except Exception as e:
    log.error("omg", exc_info=e)
    raven_client.captureError()
    STATS.errors.mark()

    View full-size slide

  82. Awww
    try:
    something()
    except Exception as e:
    log.error("omg", exc_info=e)

    View full-size slide

  83. Errors
    • logging integration

    View full-size slide

  84. Errors
    • logging
    • structlog

    View full-size slide

  85. Errors
    • logging
    • structlog
    • web apps: error views

    View full-size slide

  86. Error View
    @view_config(context=Exception)
    def err(exc, request):
    return Response(
    "oops: " + raven_client.captureException()
    )

    View full-size slide

  87. Metrics
    measure from outside

    View full-size slide

  88. WSGI Servers

    View full-size slide

  89. WSGI Servers
    • gunicorn: --statsd-host

    View full-size slide

  90. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:

    View full-size slide

  91. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:
    • --stats-push statsd:

    View full-size slide

  92. WSGI Servers
    • gunicorn: --statsd-host
    • uWSGI:
    • --stats-push statsd:
    • --carbon

    View full-size slide

  93. Middleware
    def timing_tween_factory(handler, registry):
    def timing_tween(request):
    with STATS.request_time.time():
    return handler(request)
    return timing_tween

    View full-size slide

  94. Middleware
    def timing_tween_factory(handler, registry):
    def timing_tween(request):
    with STATS.request_time.time():
    return handler(request)
    return timing_tween

    View full-size slide

  95. Extract from Logs

    View full-size slide

  96. Leverage
    Monitoring

    View full-size slide

  97. Remaining
    1. measure code paths

    View full-size slide

  98. Remaining
    1. measure code paths
    2. expose gauges

    View full-size slide

  99. ox.cx/b
    @hynek
    vrmd.de

    View full-size slide