Observability from the Panopticon: Measuring What Matters

If a microservice falls down in the middle of a server farm, does my pager make a sound? Hopefully, the answer is “yes!” But all too often, services become partially degraded in ways that are difficult to predict, and therefore difficult to monitor proactively. How can we develop confidence that the services we build are instrumented for observability in the right places (the parts that actually matter), so that we're alerted quickly when problems arise and have enough information to resolve them?

We'll look at a framework for modeling interdependent systems so we can identify the areas of our code that need to be instrumented. By isolating these key components, we ensure that we're writing software designed for resiliency.


Aditya Mukerjee

October 04, 2019

Transcript

  1. Observability from the Panopticon: Measuring What Matters. Aditya Mukerjee, Observability Engineer at Stripe. Asbury Agile, October 2019

  2. Observability measures how well internal states of a system can be inferred from knowledge of its external outputs @chimeracoder

  3. 1. What should I observe or monitor? 2. How do I measure and monitor those things? 3. What do we enable using this framework for observability? @chimeracoder

  4. @chimeracoder

  5. Disclaimer: Surveillance of people has different ethical properties from surveillance of software microservices! @chimeracoder

  6. Disclaimer: Surveillance of people has different ethical properties from surveillance of software microservices! @chimeracoder (Don’t build tech for human rights abusers!)

  7. In the panopticon, nobody knows if they’re being watched, so everyone behaves as if they’re always being watched @chimeracoder

  8. Panopticon-style observability @chimeracoder We can’t observe all actions… …but if we choose the right subset of actions to observe, we can have high confidence that everything is doing its job

  9. Let’s Create an API •Return a list of all Twitter followers •Record a copy to the database •Distributed! @chimeracoder [diagram: three API instances and a DB] (see the handler sketch after the transcript)

  10. 99.99% of requests return HTTP 200 in <300ms @chimeracoder Is this API healthy?

  11. Service-Level Agreement: What we promise our clients @chimeracoder Service-Level Indicators: Data used to evaluate the SLA

  12. Service-Level Agreement: What we promise our clients @chimeracoder Service-Level Indicators: Data used to evaluate the SLA Service-Level Objective: What we target internally (see the worked SLA/SLO/SLI example after the transcript)

  13. Service Indicators •Rate: Number of requests received •Errors: Number of responses written, broken down by HTTP status •Duration: Distribution of response latency @chimeracoder (see the instrumentation sketch after the transcript)

  14. Every monitor involves a service-level indicator* @chimeracoder *for sufficiently broad definitions of “service”

  15. @chimeracoder Define an SLA for every behavior your clients rely on… …then apply this recursively, for behaviors you rely on

  16. Define and measure your service indicator metrics, based on the externally-observable behaviors your users will notice @chimeracoder

  17. What does the panopticon approach enable? @chimeracoder

  18. @chimeracoder Panopticon observability helps us set development priorities

  19. Error budgets are bidirectional @chimeracoder Panopticon observability helps us set development priorities (see the error-budget calculation after the transcript)

  20. Panopticon observability helps us recover from failures @chimeracoder

  21. Panopticon observability helps us understand how our systems actually work @chimeracoder

  22. Yes, there is such a thing as “too much reliability” @chimeracoder

  23. @chimeracoder

  24. We can’t observe everything @chimeracoder But if we choose and observe the right indicators, that’s enough

  25. Thank you! Aditya Mukerjee @chimeracoder
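
Code sketches

The example API on slide 9 returns a user's Twitter followers and records a copy to the database. As a rough illustration of where that behavior lives (and therefore where instrumentation belongs), here is a minimal Go sketch of the handler. The FollowerSource and FollowerStore interfaces, the Handler type, and the "user" query parameter are hypothetical stand-ins, not code from the talk.

```go
package followers

import (
	"encoding/json"
	"log"
	"net/http"
)

// FollowerSource and FollowerStore are hypothetical stand-ins for the Twitter
// client and the database described on the slide.
type FollowerSource interface {
	ListFollowers(user string) ([]string, error)
}

type FollowerStore interface {
	SaveFollowers(user string, followers []string) error
}

// Handler returns the list of followers and records a copy to the database,
// mirroring the two behaviors the example API promises its clients.
type Handler struct {
	Source FollowerSource
	Store  FollowerStore
}

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	user := r.URL.Query().Get("user")

	// Fetch the follower list from the upstream source.
	followers, err := h.Source.ListFollowers(user)
	if err != nil {
		http.Error(w, "upstream error", http.StatusBadGateway)
		return
	}

	// Record a copy to the database.
	if err := h.Store.SaveFollowers(user, followers); err != nil {
		http.Error(w, "storage error", http.StatusInternalServerError)
		return
	}

	// Return the list to the client.
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(followers); err != nil {
		log.Printf("encoding response: %v", err)
	}
}
```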
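Slides 11 and 12 distinguish the SLA (what we promise our clients), the SLO (what we target internally), and the SLI (the data used to evaluate them), and slide 10 supplies a concrete indicator: 99.99% of requests return HTTP 200 in <300ms. A small worked example in Go of how the three fit together; the 99.9% SLA figure, the package, and the function name are assumptions made for illustration, not values from the talk.

```go
package slo

// Hypothetical targets illustrating the distinction between the SLA, the SLO,
// and the SLI for the example API.
const (
	// SLA: what we promise our clients (contractual, typically looser).
	slaSuccessTarget = 0.999

	// SLO: what we target internally (tighter than the SLA).
	sloSuccessTarget = 0.9999
)

// MeetsSLO compares the service-level indicator (the measured fraction of
// requests that returned HTTP 200 in under 300ms over some window) against
// the internal objective.
func MeetsSLO(totalRequests, fastSuccessfulRequests int) bool {
	if totalRequests == 0 {
		return true // no traffic means no violation
	}
	sli := float64(fastSuccessfulRequests) / float64(totalRequests)
	return sli >= sloSuccessTarget
}
```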
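Slide 13 names the three service indicators: rate, errors, and duration. A minimal sketch, assuming a plain net/http service, of middleware that captures all three for every request. It only logs the values, on the assumption that a real service would forward them to its metrics pipeline; the package and identifier names are invented for the example.

```go
package redmetrics

import (
	"log"
	"net/http"
	"time"
)

// statusRecorder captures the status code written by the wrapped handler so
// that errors can be broken down by HTTP status.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Instrument wraps a handler and records the three indicators for every
// request: rate (one event per request), errors (the status code), and
// duration (response latency). Here the values are only logged.
func Instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}

		next.ServeHTTP(rec, r)

		log.Printf("request path=%s status=%d duration=%s",
			r.URL.Path, rec.status, time.Since(start))
	})
}
```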
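Slide 19's point that error budgets are bidirectional rests on simple arithmetic: the complement of the SLO, multiplied by the measurement window, is how much "bad" time you are allowed to spend. An unspent budget is a signal that you can afford more risk; a blown budget is a signal to slow down. A small sketch of that calculation (the package and function name are illustrative):

```go
package errorbudget

import "time"

// Budget returns how much "bad" time an objective leaves over a given window.
func Budget(objective float64, window time.Duration) time.Duration {
	return time.Duration((1 - objective) * float64(window))
}
```

For example, Budget(0.9999, 30*24*time.Hour) comes out to roughly 4 minutes and 19 seconds of allowable unavailability per 30-day window.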