Observability from the Panopticon:
Measuring What Matters
Aditya Mukerjee
Observability Engineer at Stripe
Asbury Agile
October 2019
Slide 2
Observability measures how well internal states of a system
can be inferred from knowledge of its external outputs
@chimeracoder
Slide 3
1. What should I observe or monitor?
2. How do I measure and monitor those things?
3. What do we enable using this framework for observability?
Slide 4
Slide 5
Disclaimer: Surveillance of people has different ethical
properties from surveillance of software microservices!
Slide 6
(Don’t build tech for human rights abusers!)
Slide 7
In the panopticon, nobody knows if they’re being watched,
so everyone behaves as if they’re always being watched
Slide 8
Panopticon-style observability
We can’t observe all actions…
…but if we choose the right subset of actions to observe,
we can have high confidence that everything is doing its job
Slide 9
Let’s Create an API
• Return a list of all Twitter followers
• Record a copy to the database
• Distributed!
[Diagram: three API instances and one DB]
Slide 10
99.99% of requests return HTTP 200 in <300ms
Is this API healthy?
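A figure like the one on this slide can be computed from per-request records. The following is a minimal sketch, assuming a hypothetical Request record with a status code and a latency in milliseconds; neither the type nor the function name comes from the talk. Whether the resulting number means the API is healthy is exactly the question the slide poses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical record of one served request (not from the talk).
    status: int
    latency_ms: float


def good_fraction(requests, max_latency_ms=300.0):
    """Fraction of requests that returned HTTP 200 in under max_latency_ms."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(
        1 for r in requests
        if r.status == 200 and r.latency_ms < max_latency_ms
    )
    return good / len(requests)


# Illustrative data: 9,999 fast successes and one fast failure.
sample = [Request(200, 120.0)] * 9999 + [Request(500, 45.0)]
print(f"{good_fraction(sample):.4%} of requests were good")
```

Note that this only measures what the client sees in the HTTP response. It says nothing about whether the copy was actually recorded to the database.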
Slide 11
Service-Level Agreement: What we promise our clients
Service-Level Indicators: Data used to evaluate the SLA
Slide 12
Service-Level Objective: What we target internally
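One way to picture how the three terms relate: the indicator is a measured number, and the agreement and the objective are two thresholds it is compared against. The sketch below assumes the internal objective is stricter than the external agreement, which is common practice but is not stated on the slide. The function name and the threshold values are hypothetical.

```python
def evaluate(sli, slo, sla):
    """Compare a measured indicator against the internal and external targets.

    Assumes slo >= sla, i.e. the internal target is the stricter one.
    """
    if slo < sla:
        raise ValueError("this sketch assumes the SLO is at least as strict as the SLA")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "below SLO, still within SLA"
    return "SLA breached"


# Hypothetical targets: promise 99.9% to clients, aim for 99.95% internally.
print(evaluate(sli=0.9993, slo=0.9995, sla=0.999))
```

The gap between the two thresholds is what gives a team room to react before a client-facing promise is broken.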
Slide 13
Service Indicators
• Rate: Number of requests received
• Errors: Number of responses written, broken down by HTTP status
• Duration: Distribution of response latency
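The three indicators above can be derived from a plain request log. This is a rough sketch, assuming each log entry is a (status, latency_ms) tuple collected over a known time window; the function name, the tuple layout, and the choice of p50 and p99 as the summary of the latency distribution are all assumptions made for illustration.

```python
from collections import Counter
from statistics import quantiles


def service_indicators(records, window_seconds):
    """Summarize rate, errors, and duration for one time window.

    records: list of (status, latency_ms) tuples, at least two entries.
    """
    # Rate: requests received, expressed here per second of the window.
    rate = len(records) / window_seconds

    # Errors: responses written, broken down by HTTP status.
    by_status = Counter(status for status, _ in records)

    # Duration: 99 percentile cut points; index 49 is p50, index 98 is p99.
    cuts = quantiles((lat for _, lat in records), n=100, method="inclusive")

    return {
        "rate_per_s": rate,
        "by_status": dict(by_status),
        "duration_ms": {"p50": cuts[49], "p99": cuts[98]},
    }


# Illustrative data: mostly fast successes, one error, one slow success.
log = [(200, 100.0)] * 98 + [(500, 20.0), (200, 900.0)]
print(service_indicators(log, window_seconds=10))
```

Reporting percentiles rather than a mean keeps the slow tail visible, which is usually the part of the distribution users notice.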
Slide 14
Every monitor involves a service-level indicator*
*for sufficiently broad definitions of “service”
Slide 15
Define an SLA for every behavior your clients rely on…
…then apply this recursively, for behaviors you rely on
Slide 16
Define and measure your service indicator metrics, based on
the externally-observable behaviors your users will notice
Slide 17
What does the panopticon approach enable?
Slide 18
Panopticon observability helps us set development priorities
Slide 19
Error budgets are bidirectional
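The arithmetic behind an error budget is short enough to sketch. The example assumes a request-based objective, so the budget is the number of failed requests the objective tolerates over a period. The function names are hypothetical, and the two-way reading in the second function is one interpretation of "bidirectional", not a quotation from the talk.

```python
from fractions import Fraction


def error_budget(slo, total_requests, failed_requests):
    """Return (allowed_failures, remaining_budget) for a request-based SLO.

    slo is passed as a string such as "0.999" so the arithmetic stays exact.
    """
    allowed = (1 - Fraction(slo)) * total_requests
    remaining = allowed - failed_requests
    return allowed, remaining


def direction(remaining):
    """One reading of 'bidirectional': the budget informs priorities both ways."""
    if remaining < 0:
        return "budget overspent: prioritize reliability work"
    return "budget remaining: room to ship riskier changes"


# Illustrative numbers: a 99.9% objective over one million requests.
allowed, remaining = error_budget("0.999", 1_000_000, 250)
print(int(allowed), int(remaining), direction(remaining))
```

With these numbers the objective tolerates 1,000 failures, 250 have occurred, and 750 remain to be spent.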
Slide 20
Panopticon observability helps us recover from failures
Slide 21
Panopticon observability helps us understand how our
systems actually work
Slide 22
Yes, there is such a thing as “too much reliability”
Slide 23
Slide 24
We can’t observe everything
But if we choose and observe the right indicators, that’s enough