Observability from the Panopticon: Measuring What Matters

Slide 1

Slide 1 text

Observability from the Panopticon: Measuring What Matters
Aditya Mukerjee, Observability Engineer at Stripe
Asbury Agile, October 2019

Slide 2

Slide 2 text

Observability measures how well internal states of a system can be inferred from knowledge of its external outputs @chimeracoder

Slide 3

Slide 3 text

1. What should I observe or monitor?
2. How do I measure and monitor those things?
3. What do we enable using this framework for observability?

Slide 5

Slide 5 text

Disclaimer: Surveillance of people has different ethical properties from surveillance of software microservices!

Slide 6

Slide 6 text

Disclaimer: Surveillance of people has different ethical properties from surveillance of software microservices! (Don’t build tech for human rights abusers!)

Slide 7

Slide 7 text

In the panopticon, nobody knows if they’re being watched, so everyone behaves as if they’re always being watched

Slide 8

Slide 8 text

Panopticon-style observability
We can’t observe all actions… but if we choose the right subset of actions to observe, we can have high confidence that everything is doing its job

Slide 9

Slide 9 text

Let’s Create an API
• Return a list of all Twitter followers
• Record a copy to the database
• Distributed!
[Diagram: three boxes labeled API and one box labeled DB]

Slide 10

Slide 10 text

99.99% of requests return HTTP 200 in <300 ms
Is this API healthy?
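One way to turn the statement on this slide into a number is to count the requests that satisfy both conditions at once. The following is a minimal sketch, not anything shown in the talk: the `Request` record, its field names, and the `good_fraction` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one observed request (not from the talk).
    status: int        # HTTP status code of the response
    latency_ms: float  # time taken to write the response


def good_fraction(requests, max_latency_ms=300.0):
    """Fraction of requests that returned HTTP 200 in under max_latency_ms."""
    if not requests:
        raise ValueError("no requests observed")
    good = sum(
        1 for r in requests
        if r.status == 200 and r.latency_ms < max_latency_ms
    )
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 280.0),
    Request(200, 450.0),  # correct status, but too slow
    Request(500, 90.0),   # fast, but an error
]
print(good_fraction(sample))  # 2 of the 4 sample requests count as good
```

Whether a value of 99.99% on this measure means the API is healthy is the question the slide leaves open.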

Slide 11

Slide 11 text

Service-Level Agreement: What we promise our clients
Service-Level Indicators: Data used to evaluate the SLA

Slide 12

Slide 12 text

Service-Level Agreement: What we promise our clients
Service-Level Indicators: Data used to evaluate the SLA
Service-Level Objective: What we target internally
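The relationship between the three terms can be sketched as a comparison of one measured indicator against two thresholds. This is an illustration only: the `evaluate` function, the example targets, and the assumption that the internal objective is at least as strict as the external agreement are not from the slides.

```python
def evaluate(sli, slo_target, sla_target):
    """Compare a measured indicator (SLI) with the internal target (SLO)
    and the promise made to clients (SLA).

    Assumption for this sketch: higher is better, and the SLO is at
    least as strict as the SLA.
    """
    if slo_target < sla_target:
        raise ValueError("this sketch assumes the SLO is at least as strict as the SLA")
    if sli < sla_target:
        return "SLA breached"
    if sli < slo_target:
        return "SLO missed, SLA still met"
    return "within SLO"


# Hypothetical targets: promise 99.9% to clients, aim for 99.95% internally.
SLA_TARGET = 0.999
SLO_TARGET = 0.9995

for measured in (0.9997, 0.9992, 0.998):
    print(measured, evaluate(measured, SLO_TARGET, SLA_TARGET))
```

Keeping the internal target tighter than the external promise leaves a margin in which a problem can be noticed before the promise to clients is broken.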

Slide 13

Slide 13 text

Service Indicators
• Rate: Number of requests received
• Errors: Number of responses written, broken down by HTTP status
• Duration: Distribution of response latency
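The three indicators above could be derived from a window of request records roughly as follows. This is a sketch under assumptions: the `(status, latency_ms)` tuple format, the `service_indicators` name, and the choice of median and maximum as a summary of the latency distribution are illustrative, not part of the talk.

```python
import statistics
from collections import Counter


def service_indicators(records, window_seconds):
    """Summarize rate, errors, and duration for one time window.

    records: list of (http_status, latency_ms) tuples (hypothetical format).
    """
    if not records:
        raise ValueError("no requests observed in this window")
    by_status = Counter(status for status, _ in records)
    latencies = [latency for _, latency in records]
    return {
        # Rate: number of requests received, per second
        "rate_per_second": len(records) / window_seconds,
        # Errors: responses written, broken down by HTTP status
        "responses_by_status": dict(by_status),
        # Duration: a coarse summary of the latency distribution
        "latency_ms": {
            "median": statistics.median(latencies),
            "max": max(latencies),
        },
    }


records = [(200, 100.0), (200, 200.0), (500, 50.0), (200, 300.0)]
print(service_indicators(records, window_seconds=2))
```

A real system would more likely record a histogram or several percentiles for duration; two summary values are used here only to keep the sketch short.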

Slide 14

Slide 14 text

Every monitor involves a service-level indicator*
*for sufficiently broad definitions of “service”

Slide 15

Slide 15 text

Define an SLA for every behavior your clients rely on… then apply this recursively, for behaviors you rely on

Slide 16

Slide 16 text

Define and measure your service indicator metrics, based on the externally observable behaviors your users will notice

Slide 17

Slide 17 text

What does the panopticon approach enable?

Slide 18

Slide 18 text

Panopticon observability helps us set development priorities

Slide 19

Slide 19 text

Panopticon observability helps us set development priorities
Error budgets are bidirectional
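An error budget is commonly computed as the number of failures an objective still permits. The sketch below is one reading of "bidirectional", not the speaker's stated method: an overspent budget argues for reliability work, while an unspent one leaves room to ship changes. The function names and the example numbers are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_budget) for one period.

    slo_target is the fraction of requests that should succeed.
    """
    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return allowed, remaining


def suggested_priority(remaining_budget):
    """One possible interpretation of a bidirectional budget."""
    if remaining_budget < 0:
        return "budget overspent: prioritize reliability work"
    return "budget remaining: room to ship changes and take risks"


# Hypothetical month: 99.9% objective, one million requests, 400 failures.
allowed, remaining = error_budget(0.999, 1_000_000, 400)
print(round(allowed), round(remaining), suggested_priority(remaining))
```

Read in the second direction, a budget that is never spent suggests the service may be more reliable than its users need, which is time that could have gone to other work.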

Slide 20

Slide 20 text

Panopticon observability helps us recover from failures

Slide 21

Slide 21 text

Panopticon observability helps us understand how our systems actually work

Slide 22

Slide 22 text

Yes, there is such a thing as “too much reliability”

Slide 24

Slide 24 text

We can’t observe everything. But if we choose and observe the right indicators, that’s enough

Slide 25

Slide 25 text

Thank you! Aditya Mukerjee @chimeracoder