
Alex Landau - Building a Culture of Observability


Observability is often thought of as just a new word for monitoring. While it encompasses traditional devops areas such as monitoring, metrics, and infrastructure management, it’s much deeper and empowers developers at all levels of the stack. Observability is about achieving a deep understanding of your software. This not only helps you localize and debug production issues but removes uncertainty and speculation, empowering developers to know their tools and improving engineering excellence. Observability helps developers “understand the narrative” of what’s going on in their software.

This talk is about how we’ve driven adoption of a culture of observability within our engineering culture. We'll define and motivate our focus on observability; discuss the tangible tools we’ve built and best practices we’ve adopted to ingrain observability into our engineering culture; and provide some specific, real-world results we’ve achieved as part of this effort. We'll focus particularly on the tooling we’ve adopted around Django and Celery and some interesting experiences we had extending their internals.

https://us.pycon.org/2019/schedule/presentation/211/


PyCon 2019

May 05, 2019

Transcript

  1. Building a Culture of Observability at Rover, PyCon 2019

    Alex Landau, balexlandau.com
  2. Quick Facts • Global leader in the pet care

    space • 34,000 cities globally • 300,000 sitters and walkers • 97% 5-star reviews • One booking is made every three seconds • 500 employees in ten countries
  3. What are we talking about? - What does observability mean?

    Why is it important? - How do we achieve it for a Python webapp? - Making logs useful - Metrics for everything - Building effective dashboards - How do we do it at Rover?
  4. 4

  5. Observability • What’s going on in my webapp? • When

    things go wrong, what happened? • Goals: 1. Tell the narrative of your application 2. Empower developers
  6. A Complex Webapp • 600,000+ lines of Python code •

    100 developers • Monolithic Django app over MySQL • Thousands of views, Celery tasks, crons and one-off commands
  7. What does observability give us? • Wrangle a complex webapp

    • Much faster bug resolution • Significantly reduced time to detection of production issues • More thorough root cause analyses • With good observability, there are no mysteries.
  8. The Pillars of Observability • Useful Logs • Granular Metrics • Narrative-driven Dashboards
  9. Making Logging Useful

  10. Webapp Logging • Logs come from a lot of places

    and end up in a single aggregated stream ◦ Loggly, ELK stack • NGINX, webapp, system messages, daemon processes, deployment logs… • Rover runs Django and Celery - even more logs from more contexts!
  11. Rover logging events over 10 mins

  12. What’s important in a log? • Request/response • Application logging (errors/warnings) • Asynchronous workflows • Proxy jumps • External service calls. Connecting these is like finding a needle in a haystack.
  13. Unifying Logs • Use a tracing ID that is injected

    into every log message. ◦ Unique per “execution” ◦ Searchable within the aggregated stream ◦ Present in every log message, regardless of source
  14. Unifying Logs

  15. Implementation • Store a unique identifier in thread local storage

    • Inject into LogRecord with a filter
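The two steps on slide 15 can be sketched with the standard library alone. This is a minimal illustration, not Rover's actual implementation: the helper names (`set_trace_id`, `get_trace_id`, `TraceIdFilter`, `build_logger`) and the log format are assumptions made for the example.

```python
import logging
import threading
import uuid

# Thread-local storage: each thread (i.e. each request or task execution
# in a threaded worker) sees its own trace_id attribute.
_local = threading.local()


def set_trace_id(trace_id=None):
    """Store a tracing ID for the current thread; generate one if not given."""
    _local.trace_id = trace_id or uuid.uuid4().hex
    return _local.trace_id


def get_trace_id():
    """Return the current thread's tracing ID, or "-" if none was set."""
    return getattr(_local, "trace_id", "-")


class TraceIdFilter(logging.Filter):
    """Copy the thread-local tracing ID onto every LogRecord."""

    def filter(self, record):
        record.trace_id = get_trace_id()
        return True  # never drop the record, only annotate it


def build_logger(name="webapp", stream=None):
    """Hypothetical helper: a logger whose handler stamps the tracing ID."""
    handler = logging.StreamHandler(stream)
    # Attaching the filter to the handler also covers records that
    # propagate up from child loggers.
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("[%(trace_id)s] %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

In a web app, `set_trace_id()` would typically be called once at the start of each request (for example in a middleware), so that every message logged while handling that request carries the same searchable ID.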
  16. Implementation

  17. Implementation • Bonus: passing down through Celery tasks... https://www.rover.com/blog/engineering/post/needle-haystack-wrangling-celery-workflows/

  18. Logs are only part of the strategy. • Too granular to see systemic impact • Hard to monitor • Expensive: cost is (roughly) linear with growth • What else do we need?
  19. Enter Metrics

  20. Metrics Examples • Error rate • Response Time • Request Volume
  21. Going Deeper • Our webapp performance is dominated by queries. • Queries run everywhere we execute code: views, Celery tasks, crons and commands
  22. Metrics at Rover • We collect the number of queries and the amount of time spent querying the database, per request to each view and per execution of each Celery task • StatsD and DataDog
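To make the per-view metrics on slide 22 concrete, here is a rough sketch of what such measurements look like on the wire in the StatsD text protocol, using the DogStatsD tag extension. The metric names (`webapp.request.query_count`, `webapp.request.query_time_ms`) and the `view` tag are hypothetical, chosen only for illustration. In practice a client library would do this formatting for you.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one StatsD datagram: <name>:<value>|<type>[|#tag:value,...].

    The "|#..." suffix is the DogStatsD tag extension; "h" is its
    histogram type.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a StatsD agent (8125 is the usual port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


def request_metrics(view, query_count, query_seconds):
    """Datagrams for one request: query count and query time, tagged by view."""
    tags = {"view": view}
    return [
        format_metric("webapp.request.query_count", query_count, "h", tags),
        format_metric(
            "webapp.request.query_time_ms", round(query_seconds * 1000), "h", tags
        ),
    ]
```

Tagging by view (or by Celery task name) is what lets a dashboard break a single metric down per endpoint instead of needing one metric name per view.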
  23. Django Query Metrics • Idea: wrap database queries with metrics

    • Django 1.11: Create a custom database engine backend • Django 2.0+: Use connection.execute_wrapper ◦ https://docs.djangoproject.com/en/dev/topics/db/instrumentation/#database-instrumentation • Don’t emit a counter after every query; gather them until the end of request or task execution and emit a histogram (distribution) • We wrote a library to make this easy (if you use DataDog) ◦ https://github.com/roverdotcom/dogstatsd-collector
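The Django 2.0+ approach on slide 23 might look roughly like the sketch below. It is not the dogstatsd-collector library's API: the class `QueryMetrics`, the helpers `emit_query_metrics` and `run_instrumented`, and the metric names are assumptions for illustration. The wrapper signature `(execute, sql, params, many, context)` is the one Django's `connection.execute_wrapper` expects. The connection and the `emit` callable are passed in, so the block does not import Django itself.

```python
import time


class QueryMetrics:
    """Execute wrapper that accumulates query count and time in memory."""

    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    def __call__(self, execute, sql, params, many, context):
        start = time.monotonic()
        try:
            # Always delegate to the real execute and return its result.
            return execute(sql, params, many, context)
        finally:
            # Record the query even if it raised.
            self.count += 1
            self.total_seconds += time.monotonic() - start


def emit_query_metrics(metrics, emit, tags):
    """Emit one data point per metric for the whole request or task.

    `emit(name, value, tags)` is a hypothetical callable, e.g. a thin
    wrapper around a StatsD client's histogram method.
    """
    emit("db.query.count", metrics.count, tags)
    emit("db.query.seconds", metrics.total_seconds, tags)


def run_instrumented(connection, run, emit, tags):
    """Run `run()` with queries on `connection` wrapped, then emit once.

    `connection` is expected to be a Django database connection
    (django.db.connection), whose execute_wrapper() is a context manager.
    """
    metrics = QueryMetrics()
    try:
        with connection.execute_wrapper(metrics):
            return run()
    finally:
        emit_query_metrics(metrics, emit, tags)
```

Collecting in memory and emitting once at the end follows the slide's advice: the metrics pipeline receives one histogram sample per request or task rather than one counter increment per query.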
  24. Effective Dashboards

  25. Graphing and Aggregation Strategy • Make it easy to eyeball

    ◦ Visual diff ◦ Trends • Make dashboards self-documenting • Write down and share examples!
  26. N+1 Query Problem

  27. Full Table Scan

  28. One Impactful Slow Query

  29. Putting it all together

  30. Building Observability Culture • Create, document, and share tools •

    Don’t make observability opt-in; give developers useful metrics by default • Measure everything. Err on the side of overly granular • Focus on empowering developers
  31. THANK YOU!