Alex Landau - Building a Culture of Observability

Observability is often thought of as just a new word for monitoring. While it encompasses traditional devops areas such as monitoring, metrics, and infrastructure management, it’s much deeper and empowers developers at all levels of the stack. Observability is about achieving a deep understanding of your software. This not only helps you localize and debug production issues but removes uncertainty and speculation, empowering developers to know their tools and improving engineering excellence. Observability helps developers “understand the narrative” of what’s going on in their software.

This talk is about how we’ve driven adoption of a culture of observability within our engineering organization. We'll define and motivate our focus on observability; discuss the tangible tools we’ve built and best practices we’ve adopted to ingrain observability into our engineering culture; and provide some specific, real-world results we’ve achieved as part of this effort. We'll focus particularly on the tooling we’ve adopted around Django and Celery and some interesting experiences we had extending their internals.

https://us.pycon.org/2019/schedule/presentation/211/

PyCon 2019

May 05, 2019

Transcript

  1. Building a Culture of Observability at Rover
    PyCon 2019
    Alex Landau, balexlandau.com

  2. Quick Facts
    ● Global leader in the pet care space
    ● 34,000 cities globally
    ● 300,000 sitters and walkers
    ● 97% 5-star reviews
    ● One booking is made every three seconds
    ● 500 employees in ten countries

  3. What are we talking about?
    - What does observability mean? Why is it important?
    - How do we achieve it for a Python webapp?
    - Making logs useful
    - Metrics for everything
    - Building effective dashboards
    - How do we do it at Rover?

  4. Observability
    ● What’s going on in my webapp?
    ● When things go wrong, what happened?
    ● Goals:
    1. Tell the narrative of your application
    2. Empower developers

  5. A Complex Webapp
    ● 600,000+ lines of Python code
    ● 100 developers
    ● Monolithic Django app over MySQL
    ● Thousands of views, Celery tasks, crons and one-off commands

  6. What does observability give us?
    ● Wrangle a complex webapp
    ● Much faster bug resolution
    ● Significantly reduced time to detection of production issues
    ● More thorough root cause analyses
    ● With good observability, there are no mysteries.

  7. The Pillars of Observability
    ● Useful Logs
    ● Granular Metrics
    ● Narrative-driven Dashboards

  8. Making Logging Useful

  9. Webapp Logging
    ● Logs come from a lot of places and end up in a single aggregated stream
    ○ Loggly, ELK stack
    ● NGINX, webapp, system messages, daemon processes, deployment logs…
    ● Rover runs Django and Celery - even more logs from more contexts!

  10. Rover logging events over 10 mins

  11. What’s important in a log?
    ● Request/response
    ● Application logging (errors/warnings)
    ● Asynchronous workflows
    ● Proxy jumps
    ● External Service Calls
    Connecting these is like finding a needle in a haystack.

  12. Unifying Logs
    ● Use a tracing ID that is injected into every log message.
    ○ Unique per “execution”
    ○ Searchable within the aggregated stream
    ○ Present in every log message, regardless of source

  13. Unifying Logs

  14. Implementation
    ● Store a unique identifier in thread local storage
    ● Inject into LogRecord with a filter
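
The two bullets above can be sketched with only the standard library. This is a minimal illustration, not the code shown in the talk: the names `start_trace`, `current_trace_id`, `TraceIdFilter`, `build_logger` and the log format are assumptions made for the example.

```python
import io
import logging
import threading
import uuid

# Thread-local holder for the current execution's tracing ID (hypothetical name).
_trace_state = threading.local()


def start_trace(trace_id=None):
    """Begin a new "execution" (one request, task, ...) and return its ID."""
    _trace_state.trace_id = trace_id or uuid.uuid4().hex
    return _trace_state.trace_id


def current_trace_id():
    """Return this thread's tracing ID, or "-" outside any execution."""
    return getattr(_trace_state, "trace_id", "-")


class TraceIdFilter(logging.Filter):
    """Copy the thread-local tracing ID onto every LogRecord."""

    def filter(self, record):
        record.trace_id = current_trace_id()
        return True  # annotate only, never drop the record


def build_logger(stream):
    """Return a logger whose handler prefixes each line with the tracing ID."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(trace_id)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger("trace_demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: every message logged during one execution carries the same ID.
stream = io.StringIO()
logger = build_logger(stream)
trace_id = start_trace()
logger.info("request started")
logger.warning("slow query")
print(stream.getvalue(), end="")
```

Attaching the filter to the handler rather than to one logger means records from any logger routed through that handler get the ID, which matches the "present in every log message" goal. In a web app, `start_trace` would presumably be called from middleware at the start of each request.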

  15. Implementation

  16. Implementation
    Bonus: passing down through Celery tasks...
    https://www.rover.com/blog/engineering/post/needle-haystack-wrangling-celery-workflows/

  17. What else do we need?
    ● Too granular to see systemic impact
    ● Hard to monitor
    ● Expensive: cost is (roughly) linear with growth
    Logs are only part of the strategy.

  18. Enter Metrics

  19. Metrics Examples
    ● Error rate
    ● Response Time
    ● Request Volume

  20. Going Deeper
    ● Our webapp performance is dominated by queries.
    ● Queries run everywhere we execute code: views, Celery tasks, crons and commands

  21. Metrics at Rover
    ● We collect the number of queries and the amount of time spent querying the database, per request to each view and per execution of each Celery task
    ● StatsD and DataDog
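
As a rough illustration of what such per-view and per-task metrics look like on the wire, the sketch below builds StatsD datagrams with DogStatsD-style tags. The helper names, metric names, and tag keys are hypothetical and chosen only for this example; a real project would more likely use an existing client library.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one datagram: name:value|type, plus optional |#key:value tags."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP; 8125 is the conventional StatsD port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# "h" is the DogStatsD histogram type; the tag says which view or task ran.
query_count = format_metric("db.query.count", 12, "h", {"view": "search"})
query_time = format_metric("db.query.time_ms", 48.5, "h", {"task": "send_email"})
print(query_count)
print(query_time)
```

Tagging by view or task name is what lets a dashboard break the same metric down per endpoint instead of needing one metric name per view.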

  22. Django Query Metrics
    ● Idea: wrap database queries with metrics
    ● Django 1.11: Create a custom database engine backend
    ● Django 2.0+: Use connection.execute_wrapper
    ○ https://docs.djangoproject.com/en/dev/topics/db/instrumentation/#database-instrumentation
    ● Don’t emit a counter after every query; gather them until the end of request or task execution and emit a histogram (distribution)
    ● We wrote a library to make this easy (if you use DataDog)
    ○ https://github.com/roverdotcom/dogstatsd-collector
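
The "gather, then emit once" idea can be sketched as a callable with the signature Django documents for `execute_wrapper` callables. This is an assumption-laden illustration, not the API of dogstatsd-collector: `QueryCollector`, `summary`, and the metric names are made up here, and a fake `execute` and clock stand in for Django so the example runs on its own.

```python
import time


class QueryCollector:
    """Accumulate query count and total query time for one request or task."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.count = 0
        self.total_seconds = 0.0

    def __call__(self, execute, sql, params, many, context):
        # Signature expected of a Django execute_wrapper callable.
        start = self.clock()
        try:
            return execute(sql, params, many, context)
        finally:
            # Record the query even if it raised.
            self.count += 1
            self.total_seconds += self.clock() - start

    def summary(self, tags):
        """Values to emit once, at the end of the request or task."""
        return [
            ("db.query.count", self.count, tags),
            ("db.query.seconds", self.total_seconds, tags),
        ]


def fake_execute(sql, params, many, context):
    """Stand-in for the database call Django would pass in."""
    return f"ran {sql}"


# A scripted clock makes the timings deterministic for the demo.
ticks = iter([0.0, 0.25, 1.0, 1.5])
collector = QueryCollector(clock=lambda: next(ticks))
results = [
    collector(fake_execute, f"SELECT {i}", None, False, {}) for i in (1, 2)
]
print(collector.summary({"view": "search"}))
```

On Django 2.0+, a collector like this would be installed around the view or task body with `with connection.execute_wrapper(collector):`, and the summary emitted as histogram values when the block exits.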

  23. Effective Dashboards

  24. Graphing and Aggregation Strategy
    ● Make it easy to eyeball
    ○ Visual diff
    ○ Trends
    ● Make dashboards self-documenting
    ● Write down and share examples!

  25. N+1 Query Problem

  26. Full Table Scan

  27. One Impactful Slow Query

  28. Putting it all together

  29. Building Observability Culture
    ● Create, document, and share tools
    ● Don’t make observability opt-in; give developers useful metrics by default
    ● Measure everything. Err on the side of overly granular
    ● Focus on empowering developers

  30. THANK YOU!