Alex Landau - Building a Culture of Observability

Observability is often thought of as just a new word for monitoring. While it encompasses traditional devops areas such as monitoring, metrics, and infrastructure management, it’s much deeper and empowers developers at all levels of the stack. Observability is about achieving a deep understanding of your software. This not only helps you localize and debug production issues but removes uncertainty and speculation, empowering developers to know their tools and improving engineering excellence. Observability helps developers “understand the narrative” of what’s going on in their software.

This talk is about how we’ve driven adoption of observability within our engineering culture. We'll define and motivate our focus on observability; discuss the tangible tools we’ve built and the best practices we’ve adopted to ingrain observability into our engineering culture; and share specific, real-world results we’ve achieved as part of this effort. We'll focus particularly on the tooling we’ve adopted around Django and Celery and on some interesting experiences we had extending their internals.


PyCon 2019

May 05, 2019


  1. Quick Facts
     • Global leader in the pet care space
     • 34,000 cities globally
     • 300,000 sitters and walkers
     • 97% 5-star reviews
     • One booking is made every three seconds
     • 500 employees in ten countries
  2. What are we talking about?
     - What does observability mean? Why is it important?
     - How do we achieve it for a Python webapp?
     - Making logs useful
     - Metrics for everything
     - Building effective dashboards
     - How do we do it at Rover?
  4. Observability
     • What’s going on in my webapp?
     • When things go wrong, what happened?
     • Goals:
       1. Tell the narrative of your application
       2. Empower developers
  5. A Complex Webapp
     • 600,000+ lines of Python code
     • 100 developers
     • Monolithic Django app over MySQL
     • Thousands of views, Celery tasks, crons and one-off commands
  6. What does observability give us?
     • Wrangle a complex webapp
     • Much faster bug resolution
     • Significantly reduced time to detection of production issues
     • More thorough root cause analyses
     • With good observability, there are no mysteries.
  7. Webapp Logging
     • Logs come from a lot of places and end up in a single aggregated stream
       ◦ Loggly, ELK stack
     • NGINX, webapp, system messages, daemon processes, deployment logs…
     • Rover runs Django and Celery - even more logs from more contexts!
  8. What’s important in a log?
     • Request/response
     • Application logging (errors/warnings)
     • Asynchronous workflows
     • Proxy jumps
     • External service calls
     Connecting these is like finding a needle in a haystack.
  9. Unifying Logs
     • Use a tracing ID that is injected into every log message.
       ◦ Unique per “execution”
       ◦ Searchable within the aggregated stream
       ◦ Present in every log message, regardless of source
  10. Implementation
      • Store a unique identifier in thread-local storage
      • Inject it into each LogRecord with a filter
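
The two steps on this slide can be sketched with the standard library alone. This is a minimal illustration, not Rover's actual implementation: the names `_context`, `set_tracing_id`, `get_tracing_id`, `TracingIdFilter`, and `build_logger` are hypothetical, and generating the ID with `uuid4` is an assumption.

```python
import logging
import threading
import uuid

# Hypothetical module-level storage; each thread sees its own attributes.
_context = threading.local()


def set_tracing_id(tracing_id=None):
    """Store a tracing ID for the current thread, generating one if needed."""
    _context.tracing_id = tracing_id or uuid.uuid4().hex
    return _context.tracing_id


def get_tracing_id():
    """Return the current thread's tracing ID, or "-" if none was set."""
    return getattr(_context, "tracing_id", "-")


class TracingIdFilter(logging.Filter):
    """Copy the thread-local tracing ID onto every LogRecord."""

    def filter(self, record):
        record.tracing_id = get_tracing_id()
        return True  # never drop the record, only annotate it


def build_logger(stream):
    """Build a demo logger whose output lines start with the tracing ID."""
    handler = logging.StreamHandler(stream)
    # Attaching the filter to the handler annotates records from any logger
    # that routes through this handler.
    handler.addFilter(TracingIdFilter())
    handler.setFormatter(
        logging.Formatter("%(tracing_id)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger("tracing_demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

In a web application, `set_tracing_id()` would presumably be called once at the start of each request or task (for example in a middleware), so that every message logged during that execution carries the same searchable ID.
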
  11. Logs are only part of the strategy.
      • Too granular to see systemic impact
      • Hard to monitor
      • Expensive: cost is (roughly) linear with growth
      What else do we need?
  12. Going Deeper
      • Our webapp performance is dominated by queries.
      • Queries run everywhere we execute code: views, Celery tasks, crons and commands
  13. Metrics at Rover
      • We collect the number of queries and the amount of time spent querying the database, per request to each view and per execution of each Celery task
      • StatsD and DataDog
  14. Django Query Metrics
      • Idea: wrap database queries with metrics
      • Django 1.11: Create a custom database engine backend
      • Django 2.0+: Use connection.execute_wrapper
        ◦ https://docs.djangoproject.com/en/dev/topics/db/instrumentation/#database-instrumentation
      • Don’t emit a counter after every query; gather them until the end of request or task execution and emit a histogram (distribution)
      • We wrote a library to make this easy (if you use DataDog)
        ◦ https://github.com/roverdotcom/dogstatsd-collector
  15. Graphing and Aggregation Strategy
      • Make it easy to eyeball
        ◦ Visual diff
        ◦ Trends
      • Make dashboards self-documenting
      • Write down and share examples!
  16. Building Observability Culture
      • Create, document, and share tools
      • Don’t make observability opt-in; give developers useful metrics by default
      • Measure everything. Err on the side of overly granular
      • Focus on empowering developers