Alex Landau - Building a Culture of Observability

Slide 1

Slide 1 text

Building a Culture of Observability at Rover
PyCon 2019
Alex Landau, balexlandau.com

Slide 2

Slide 2 text

Quick Facts
● Global leader in the pet care space
● 34,000 cities globally
● 300,000 sitters and walkers
● 97% 5-star reviews
● One booking is made every three seconds
● 500 employees in ten countries

Slide 3

Slide 3 text

What are we talking about?
- What does observability mean? Why is it important?
- How do we achieve it for a Python webapp?
- Making logs useful
- Metrics for everything
- Building effective dashboards
- How do we do it at Rover?

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Observability
● What’s going on in my webapp?
● When things go wrong, what happened?
● Goals:
  1. Tell the narrative of your application
  2. Empower developers

Slide 6

Slide 6 text

A Complex Webapp
● 600,000+ lines of Python code
● 100 developers
● Monolithic Django app over MySQL
● Thousands of views, Celery tasks, crons and one-off commands

Slide 7

Slide 7 text

What does observability give us?
● Wrangle a complex webapp
● Much faster bug resolution
● Significantly reduced time to detection of production issues
● More thorough root cause analyses
● With good observability, there are no mysteries.

Slide 8

Slide 8 text

The Pillars of Observability
● Useful Logs
● Granular Metrics
● Narrative-driven Dashboards

Slide 9

Slide 9 text

Making Logging Useful

Slide 10

Slide 10 text

Webapp Logging
● Logs come from a lot of places and end up in a single aggregated stream
  ○ Loggly, ELK stack
● NGINX, webapp, system messages, daemon processes, deployment logs…
● Rover runs Django and Celery - even more logs from more contexts!

Slide 11

Slide 11 text

Rover logging events over 10 mins

Slide 12

Slide 12 text

What’s important in a log?
● Request/response
● Application logging (errors/warnings)
● Asynchronous workflows
● Proxy jumps
● External service calls
Connecting these is like finding a needle in a haystack.

Slide 13

Slide 13 text

Unifying Logs
● Use a tracing ID that is injected into every log message.
  ○ Unique per “execution”
  ○ Searchable within the aggregated stream
  ○ Present in every log message, regardless of source

Slide 14

Slide 14 text

Unifying Logs

Slide 15

Slide 15 text

Implementation
● Store a unique identifier in thread local storage
● Inject into LogRecord with a filter
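The two steps above can be sketched with the standard library alone. This is a minimal illustration, not Rover's actual code: the names `start_execution`, `get_tracing_id`, `TracingIDFilter`, and `build_logger`, the log format, and the use of a UUID as the identifier are all assumptions made for the example.

```python
import logging
import threading
import uuid

# Thread-local storage: each thread (one request or task at a time) sees its own ID.
_local = threading.local()


def start_execution():
    """Assign a fresh tracing ID to the current thread (hypothetical helper)."""
    _local.tracing_id = uuid.uuid4().hex
    return _local.tracing_id


def get_tracing_id():
    """Return the current thread's tracing ID, or "-" if none was set."""
    return getattr(_local, "tracing_id", "-")


class TracingIDFilter(logging.Filter):
    """Copy the thread-local tracing ID onto every LogRecord."""

    def filter(self, record):
        record.tracing_id = get_tracing_id()
        return True  # never drop the record, only annotate it


def build_logger(name="webapp", stream=None):
    """Build a logger whose handler prefixes each message with the tracing ID."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(TracingIDFilter())
    handler.setFormatter(
        logging.Formatter("%(tracing_id)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

In a Django app, `start_execution()` would plausibly be called from a middleware at the start of each request, so every message logged while handling that request carries the same searchable ID.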

Slide 16

Slide 16 text

Implementation

Slide 17

Slide 17 text

Implementation
Bonus: passing down through Celery tasks...
https://www.rover.com/blog/engineering/post/needle-haystack-wrangling-celery-workflows/

Slide 18

Slide 18 text

Logs are only part of the strategy.
● Too granular to see systemic impact
● Hard to monitor
● Expensive: cost is (roughly) linear with growth
What else do we need?

Slide 19

Slide 19 text

Enter Metrics

Slide 20

Slide 20 text

Metrics Examples
● Error rate
● Response time
● Request volume

Slide 21

Slide 21 text

Going Deeper
● Our webapp performance is dominated by queries.
● Queries run everywhere we execute code: views, Celery tasks, crons and commands

Slide 22

Slide 22 text

Metrics at Rover
● We collect the number of queries and the amount of time spent querying the database, per request to each view and per execution of each Celery task
● StatsD and DataDog
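To make the StatsD side concrete, here is a rough sketch of what emitting one per-request histogram value looks like on the wire, using the DogStatsD plain-text datagram format (`name:value|h|#tags`). The metric name, tag, host, and port below are illustrative assumptions, not Rover's real configuration, and a production app would normally use a client library rather than raw sockets.

```python
import socket


def format_histogram(name, value, tags=None):
    """Build a DogStatsD histogram datagram, e.g. 'a.b:3|h|#view:search'."""
    datagram = f"{name}:{value}|h"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget the datagram over UDP to a local agent (assumed address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical usage at the end of a request that ran 12 queries:
# send_metric(format_histogram("webapp.db.query_count", 12, ["view:search"]))
```

Tagging each value with the view or task name is what lets a dashboard break query counts and query time down per view and per Celery task.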

Slide 23

Slide 23 text

Django Query Metrics
● Idea: wrap database queries with metrics
● Django 1.11: Create a custom database engine backend
● Django 2.0+: Use connection.execute_wrapper
  ○ https://docs.djangoproject.com/en/dev/topics/db/instrumentation/#database-instrumentation
● Don’t emit a counter after every query; gather them until the end of request or task execution and emit a histogram (distribution)
● We wrote a library to make this easy (if you use DataDog)
  ○ https://github.com/roverdotcom/dogstatsd-collector
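The Django 2.0+ approach can be sketched as a callable that matches the `execute_wrapper` signature and accumulates totals instead of emitting a metric per query. `QueryCollector` is a hypothetical name for this example; it is not the API of dogstatsd-collector, and the Django wiring is shown only in comments so the class itself stays runnable on its own.

```python
import time


class QueryCollector:
    """Accumulate query count and time for one request or task execution."""

    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    def __call__(self, execute, sql, params, many, context):
        # Signature required by Django's connection.execute_wrapper.
        start = time.monotonic()
        try:
            return execute(sql, params, many, context)
        finally:
            # Count the query even if it raised, so failures stay visible.
            self.count += 1
            self.total_seconds += time.monotonic() - start


# Assumed wiring inside a Django middleware or Celery task wrapper:
#
#   from django.db import connection
#
#   collector = QueryCollector()
#   with connection.execute_wrapper(collector):
#       response = get_response(request)
#   # Emit collector.count and collector.total_seconds once, as histograms.
```

Gathering values until the end of the execution keeps the metric volume at one data point per request or task, which is what makes a per-view distribution meaningful.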

Slide 24

Slide 24 text

Effective Dashboards

Slide 25

Slide 25 text

Graphing and Aggregation Strategy
● Make it easy to eyeball
  ○ Visual diff
  ○ Trends
● Make dashboards self-documenting
● Write down and share examples!

Slide 26

Slide 26 text

N+1 Query Problem

Slide 27

Slide 27 text

Full Table Scan

Slide 28

Slide 28 text

One Impactful Slow Query

Slide 29

Slide 29 text

Putting it all together

Slide 30

Slide 30 text

Building Observability Culture
● Create, document, and share tools
● Don’t make observability opt-in; give developers useful metrics by default
● Measure everything. Err on the side of overly granular
● Focus on empowering developers

Slide 31

Slide 31 text

THANK YOU!