Observability is often thought of as just a new word for monitoring. While it encompasses traditional devops areas such as monitoring, metrics, and infrastructure management, it’s much deeper and empowers developers at all levels of the stack. Observability is about achieving a deep understanding of your software. This not only helps you localize and debug production issues but removes uncertainty and speculation, empowering developers to know their tools and improving engineering excellence. Observability helps developers “understand the narrative” of what’s going on in their software.
This talk is about how we’ve driven adoption of a culture of observability within our engineering organization. We’ll define observability and motivate our focus on it; discuss the tangible tools we’ve built and the best practices we’ve adopted to ingrain observability into our engineering culture; and share some specific, real-world results we’ve achieved as part of this effort. We’ll focus particularly on the tooling we’ve adopted around Django and Celery and on some interesting experiences we had extending their internals.
Building a Culture of Observability at Rover
Alex Landau, balexlandau.com
● Global leader in the pet care space
● 34,000 cities globally
● 300,000 sitters and walkers
● 97% 5-star reviews
● One booking is made every three
● 500 employees in ten countries
What are we talking about?
- What does observability mean? Why is it important?
- How do we achieve it for a Python webapp?
- Making logs useful
- Metrics for everything
- Building effective dashboards
- How do we do it at Rover?
● What’s going on in my webapp?
● When things go wrong, what happened?
1. Tell the narrative of your application
2. Empower developers
A Complex Webapp
● 600,000+ lines of Python code
● 100 developers
● Monolithic Django app over MySQL
● Thousands of views, Celery tasks, crons and one-off
What does observability give us?
● Wrangle a complex webapp
● Much faster bug resolution
● Significantly reduced time to detection of production
● More thorough root cause analyses
● With good observability, there are no mysteries.
The Pillars of Observability
Useful Logs Granular
Making Logging Useful
● Logs come from a lot of places and end up in a single
○ Loggly, ELK stack
● NGINX, webapp, system messages, daemon
processes, deployment logs…
● Rover runs Django and Celery - even more logs from
more contexts!
Rover logging events over 10 mins
● Application logging
● Asynchronous workflows
● Proxy jumps
● External Service Calls
important in a
Connecting these is like
finding a needle in a haystack.
● Use a tracing ID that is injected into every log
○ Unique per “execution”
○ Searchable within the aggregated stream
○ Present in every log message, regardless of source
● Store a unique identifier in thread local storage
● Inject into LogRecord with a filter
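
A minimal sketch of the tracing ID approach, using only the Python standard library. The names start_trace, current_trace_id, TraceIdFilter and the "webapp" logger are hypothetical and chosen for illustration; this is not Rover's actual implementation.

```python
import logging
import threading
import uuid

# Thread-local storage: each thread (one request or task execution at a
# time) sees its own trace ID.
_local = threading.local()


def start_trace(trace_id=None):
    """Begin a new "execution"; call at the start of a request or task."""
    _local.trace_id = trace_id or uuid.uuid4().hex
    return _local.trace_id


def current_trace_id():
    """Return the active trace ID, or "-" when none has been set."""
    return getattr(_local, "trace_id", "-")


class TraceIdFilter(logging.Filter):
    """Copy the thread-local trace ID onto every LogRecord."""

    def filter(self, record):
        record.trace_id = current_trace_id()
        return True  # never drop the record, only annotate it


def configure_logging(stream=None):
    """Attach the filter to a handler so the format string can use it."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(trace_id)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger("webapp")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Attaching the filter to the handler rather than to a single logger means every record that reaches the handler is annotated, whichever module logged it. In a Django app, start_trace would plausibly be called from a middleware at the top of each request.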
Bonus: passing down through Celery tasks...
● Too granular to see
● Hard to monitor
● Expensive: cost is (roughly)
linear with growth
Logs are only part of the
What else do
● Error rate
● Response Time
● Request Volume
● Our webapp performance
is dominated by queries.
● Queries run everywhere we
execute code: views,
Celery tasks, crons and
Metrics at Rover
● We collect the number of queries and the amount
of time spent querying the database, per request to
each view and per execution of each Celery task
● StatsD and DataDog
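
As an illustration of what a per-view metric looks like on the wire, here is a small sketch that formats a histogram datagram in the DogStatsD style (name:value|h|#tags) and sends it over UDP. The metric and tag names are made up, and in practice a client library would normally do this work; the sketch only shows the shape of the data.

```python
import socket


def format_histogram(name, value, tags=None):
    """Build a DogStatsD-style histogram line, e.g. "x:1|h|#view:home"."""
    line = f"{name}:{value}|h"
    if tags:
        # Sort tags so the output is deterministic.
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local StatsD agent (assumed address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

Tagging each value with the view or task name is what makes it possible to break the same metric down per view and per Celery task on a dashboard.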
Django Query Metrics
● Idea: wrap database queries with metrics
● Django 1.11: Create a custom database engine backend
● Django 2.0+: Use connection.execute_wrapper
● Don’t emit a counter after every query; gather them until the end of
request or task execution and emit a histogram (distribution)
● We wrote a library to make this easy (if you use DataDog)
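
A rough sketch of the Django 2.0+ approach: a callable with the signature Django expects from an execute wrapper accumulates the query count and time, and a middleware emits both once at the end of the request. QueryStats, query_metrics_middleware, the emit callable and the metric names are hypothetical; this is not the library mentioned above.

```python
import time


class QueryStats:
    """Execute wrapper that tallies queries instead of emitting per query."""

    def __init__(self):
        self.count = 0
        self.seconds = 0.0

    def __call__(self, execute, sql, params, many, context):
        start = time.monotonic()
        try:
            # Hand the query on to Django (or the next wrapper) unchanged.
            return execute(sql, params, many, context)
        finally:
            # Record the query even when it raises.
            self.count += 1
            self.seconds += time.monotonic() - start


def query_metrics_middleware(get_response, emit=print):
    """Django middleware factory; emit(name, value) is an assumed hook
    standing in for a StatsD/DataDog histogram call."""

    def middleware(request):
        # Imported lazily so the class above stays usable without Django.
        from django.db import connection

        stats = QueryStats()
        with connection.execute_wrapper(stats):
            response = get_response(request)
        # One data point per request, suitable for a histogram.
        emit("db.query.count", stats.count)
        emit("db.query.seconds", stats.seconds)
        return response

    return middleware
```

The same QueryStats object could wrap the body of a Celery task to get per-execution numbers. Note that connection here is the default database connection only; a multi-database setup would need one wrapper per connection.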
Graphing and Aggregation Strategy
● Make it easy to eyeball
○ Visual diff
● Make dashboards self-documenting
● Write down and share examples!
N+1 Query Problem
Full Table Scan
One Impactful Slow Query
Putting it all together
Building Observability Culture
● Create, document, and share tools
● Don’t make observability opt-in; give developers
useful metrics by default
● Measure everything. Err on the side of overly granular
● Focus on empowering developers