Alex Landau - Building a Culture of Observability

Building a Culture of Observability at Rover
PyCon 2019
Alex Landau, balexlandau.com

Quick Facts
• Global leader in the pet care space
• 34,000 cities globally
• 300,000 sitters and walkers
• 97% 5-star reviews
• One booking is made every three seconds
• 500 employees in ten countries

What are we talking about?
- What does observability mean? Why is it important?
- How do we achieve it for a Python webapp?
- Making logs useful
- Metrics for everything
- Building effective dashboards
- How do we do it at Rover?

Observability
• What’s going on in my webapp?
• When things go wrong, what happened?
• Goals:
  1. Tell the narrative of your application
  2. Empower developers

A Complex Webapp
• 600,000+ lines of Python code
• 100 developers
• Monolithic Django app over MySQL
• Thousands of views, Celery tasks, crons and one-off commands

What does observability give us?
• Wrangle a complex webapp
• Much faster bug resolution
• Significantly reduced time to detection of production issues
• More thorough root cause analyses
• With good observability, there are no mysteries.

The Pillars of Observability
• Useful Logs
• Granular Metrics
• Narrative-driven Dashboards

Making Logging Useful

Webapp Logging
• Logs come from a lot of places and end up in a single aggregated stream
  ◦ Loggly, ELK stack
• NGINX, webapp, system messages, daemon processes, deployment logs…
• Rover runs Django and Celery: even more logs from more contexts!

Rover logging events over 10 mins

What’s important in a log?
• Request/response
• Application logging (errors/warnings)
• Asynchronous workflows
• Proxy jumps
• External service calls
Connecting these is like finding a needle in a haystack.

Unifying Logs
• Use a tracing ID that is injected into every log message.
  ◦ Unique per “execution”
  ◦ Searchable within the aggregated stream
  ◦ Present in every log message, regardless of source

Unifying Logs

Implementation
• Store a unique identifier in thread local storage
• Inject into LogRecord with a filter
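
The two bullets above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code shown on the original slides; the names `start_trace`, `current_trace_id`, `TraceIdFilter` and `build_logger`, and the log format string, are assumptions made for this example.

```python
import logging
import threading
import uuid

# Thread-local holder for the current execution's trace ID (illustrative name).
_local = threading.local()


def start_trace(trace_id=None):
    """Begin a new "execution", e.g. at the start of a request or task."""
    _local.trace_id = trace_id or uuid.uuid4().hex
    return _local.trace_id


def current_trace_id():
    """Return the trace ID for this thread, or "-" if none has been set."""
    return getattr(_local, "trace_id", "-")


class TraceIdFilter(logging.Filter):
    """Attach the thread's trace ID to every LogRecord passing through."""

    def filter(self, record):
        record.trace_id = current_trace_id()
        return True  # never drop the record, only annotate it


def build_logger(name, stream=None):
    """Create a logger whose handler stamps each message with the trace ID."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

In a Django app, `start_trace()` would plausibly be called from a middleware at the top of each request, and the filter attached to handlers through the `LOGGING` setting, so every message in the aggregated stream becomes searchable by the same ID.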

Implementation

Implementation
Bonus: passing down through Celery tasks...
https://www.rover.com/blog/engineering/post/needle-haystack-wrangling-celery-workflows/

Logs are only part of the strategy.
• Too granular to see systemic impact
• Hard to monitor
• Expensive: cost is (roughly) linear with growth
What else do we need?

Enter Metrics

Metrics Examples
• Error rate
• Response time
• Request volume

Going Deeper
• Our webapp performance is dominated by queries.
• Queries run everywhere we execute code: views, Celery tasks, crons and commands

Metrics at Rover
• We collect the number of queries and the amount of time spent querying the database, per request to each view and per execution of each Celery task
• StatsD and DataDog

Django Query Metrics
• Idea: wrap database queries with metrics
• Django 1.11: Create a custom database engine backend
• Django 2.0+: Use connection.execute_wrapper
  ◦ https://docs.djangoproject.com/en/dev/topics/db/instrumentation/#database-instrumentation
• Don’t emit a counter after every query; gather them until the end of request or task execution and emit a histogram (distribution)
• We wrote a library to make this easy (if you use DataDog)
  ◦ https://github.com/roverdotcom/dogstatsd-collector
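
The "gather, then emit once" idea for Django 2.0+ might look something like the sketch below. It is not the dogstatsd-collector library's API and not Rover's implementation; `QueryMetricsCollector`, the metric names, and the `emit_histogram` callable are hypothetical. Only the five-argument wrapper signature comes from Django's documented `connection.execute_wrapper` contract.

```python
import time


class QueryMetricsCollector:
    """Accumulate query count and query time for one request or task."""

    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    def __call__(self, execute, sql, params, many, context):
        # Signature expected by Django's connection.execute_wrapper (2.0+).
        start = time.monotonic()
        try:
            return execute(sql, params, many, context)
        finally:
            # Record the query even if it raised.
            self.count += 1
            self.total_seconds += time.monotonic() - start

    def flush(self, emit_histogram, tags):
        # emit_histogram is a hypothetical callable taking (name, value, tags);
        # a StatsD/DataDog client's histogram method could be passed here.
        # Metric names are illustrative.
        emit_histogram("db.query.count", self.count, tags)
        emit_histogram("db.query.time", self.total_seconds, tags)


# Possible use inside a Django middleware (shown as a comment, not executed):
#
#     from django.db import connection
#     collector = QueryMetricsCollector()
#     with connection.execute_wrapper(collector):
#         response = get_response(request)
#     collector.flush(emit_histogram, tags)
```

Emitting two histogram points per execution, instead of one counter per query, keeps metric volume proportional to requests and tasks and preserves the per-execution distribution that the dashboards rely on.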

Effective Dashboards

Graphing and Aggregation Strategy
• Make it easy to eyeball
  ◦ Visual diff
  ◦ Trends
• Make dashboards self-documenting
• Write down and share examples!

N+1 Query Problem

Full Table Scan

One Impactful Slow Query

Putting it all together

Building Observability Culture
• Create, document, and share tools
• Don’t make observability opt-in; give developers useful metrics by default
• Measure everything. Err on the side of overly granular
• Focus on empowering developers

THANK YOU!
