Observability from the Panopticon: Measuring What Matters

If a microservice falls down in the middle of a server farm, does my pager make a sound? Hopefully, the answer is “yes!” But all too often, services become partially degraded in ways that are difficult to predict, and therefore difficult to monitor proactively. How can we gain confidence that the services we develop are instrumented for observability in the right places, the parts that actually matter, so that we're alerted quickly to problems and have enough information to resolve them?

We'll look at a framework for modeling interdependent systems so that we can identify the areas of our code that need to be instrumented. By isolating these key components, we'll ensure that we are writing software designed for resiliency.

Aditya Mukerjee

October 04, 2019

Transcript

  1. Observability from the Panopticon:
    Measuring What Matters
    Aditya Mukerjee
    Observability Engineer at Stripe
    Asbury Agile
    October 2019

  2. Observability measures how well internal states of a system
    can be inferred from knowledge of its external outputs
    @chimeracoder

  3. 1. What should I observe or monitor?
    2. How do I measure and monitor those things?
    3. What do we enable using this framework for observability?

  4. @chimeracoder

  5. Disclaimer: Surveillance of people has different ethical
    properties from surveillance of software microservices!

  6. Disclaimer: Surveillance of people has different ethical
    properties from surveillance of software microservices!
    (Don’t build tech for human rights abusers!)

  7. In the panopticon, nobody knows if they’re being watched,
    so everyone behaves as if they’re always being watched

  8. Panopticon-style observability
    We can’t observe all actions…
    …but if we choose the right subset of actions to observe,
    we can have high confidence that everything is doing its job

  9. Let’s Create an API
    •Return a list of all Twitter followers
    •Record a copy to the database
    •Distributed!
    [Diagram: three API instances and one DB]

  10. 99.99% of requests return HTTP 200 in <300ms
    Is this API healthy?
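
As a rough illustration of the figure quoted on this slide, the sketch below computes the fraction of requests that returned HTTP 200 in under 300 ms. The `Request` record, the `good_fraction` helper, and the sample data are hypothetical names and values introduced here, not taken from the talk. Whether a high value means the API is healthy depends on which behaviors clients actually rely on; this number says nothing, for example, about whether a copy was recorded to the database.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code written to the client
    duration_ms: float  # time taken to write the response


def good_fraction(requests, max_ms=300.0):
    """Fraction of requests that returned HTTP 200 in under max_ms."""
    if not requests:
        return None  # no traffic: the indicator is undefined, not 100%
    good = sum(1 for r in requests if r.status == 200 and r.duration_ms < max_ms)
    return good / len(requests)


# Hypothetical sample: 9,999 fast successes and one slow success.
sample = [Request(200, 120.0)] * 9999 + [Request(200, 450.0)]
fraction = good_fraction(sample)
print(f"{fraction:.2%}")  # prints 99.99%
```

Returning `None` for an empty window is a design choice in this sketch: a service that receives no requests at all should not be reported as perfectly healthy.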

  11. Service-Level Agreement: What we promise our clients
    Service-Level Indicators: Data used to evaluate the SLA

  12. Service-Level Agreement: What we promise our clients
    Service-Level Indicators: Data used to evaluate the SLA
    Service-Level Objective: What we target internally
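
To make the relationship between the three terms concrete, here is a minimal sketch that compares a measured indicator against both targets. The `evaluate` function and the threshold values are hypothetical. The sketch also assumes the internal objective is at least as strict as the external agreement, which is a common convention but is not stated on the slide.

```python
def evaluate(sli, slo, sla):
    """Compare a measured indicator with the internal and external targets."""
    if slo < sla:
        # Assumption of this sketch: we aim higher internally than we promise.
        raise ValueError("the SLO should be at least as strict as the SLA")
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA still met"
    return "healthy"


SLA = 0.999   # hypothetical promise to clients
SLO = 0.9995  # hypothetical stricter internal target

for measured in (0.9999, 0.9992, 0.9985):
    print(measured, evaluate(measured, SLO, SLA))
```

The middle case is the useful one: the gap between the SLO and the SLA gives a team time to react before a promise to clients is broken.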

  13. Service Indicators
    •Rate: Number of requests received
    •Errors: Number of responses written, broken down by HTTP status
    •Duration: Distribution of response latency
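
The three indicators listed on this slide could be tallied along the following lines. This is an in-memory sketch for illustration only: the `ServiceIndicators` class and its method names are hypothetical, and a real service would normally hand these values to a metrics library rather than keep them in a list.

```python
import math
from collections import Counter


class ServiceIndicators:
    """In-memory tally of rate, errors, and duration for one service."""

    def __init__(self):
        self.requests_received = 0            # Rate
        self.responses_by_status = Counter()  # Errors, keyed by HTTP status
        self.durations_ms = []                # Duration

    def request_received(self):
        self.requests_received += 1

    def response_written(self, status, duration_ms):
        self.responses_by_status[status] += 1
        self.durations_ms.append(duration_ms)

    def percentile(self, p):
        """Nearest-rank percentile of recorded latencies, for p in (0, 100]."""
        if not self.durations_ms:
            return None
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


# Hypothetical traffic: three successes and one slow server error.
indicators = ServiceIndicators()
for status, ms in [(200, 80), (200, 95), (500, 310), (200, 120)]:
    indicators.request_received()
    indicators.response_written(status, ms)

print(indicators.requests_received)          # prints 4
print(dict(indicators.responses_by_status))  # prints {200: 3, 500: 1}
print(indicators.percentile(50))             # prints 95
```

Counting requests received separately from responses written is deliberate in this sketch: a gap between the two totals can point to requests that never got a response.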

  14. Every monitor involves a service-level indicator*
    *for sufficiently broad definitions of “service”

  15. Define an SLA for every behavior your clients rely on…
    …then apply this recursively, for behaviors you rely on
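
One way to picture the recursive step is as a walk over a dependency graph: start from the service your clients call, then visit everything it relies on. The sketch below assumes dependencies can be written down as a simple mapping. The service names and the `services_needing_slas` helper are hypothetical and only loosely modeled on the followers API from earlier in the deck.

```python
def services_needing_slas(service, depends_on, seen=None):
    """Depth-first walk of the dependency graph, listing each service once."""
    if seen is None:
        seen = set()
    if service in seen:
        return []  # already visited; also guards against dependency cycles
    seen.add(service)
    result = [service]
    for dependency in depends_on.get(service, []):
        result.extend(services_needing_slas(dependency, depends_on, seen))
    return result


# Hypothetical dependency map for illustration.
depends_on = {
    "followers-api": ["twitter-client", "database"],
    "twitter-client": ["dns"],
    "database": ["dns"],
}

order = services_needing_slas("followers-api", depends_on)
print(order)  # prints ['followers-api', 'twitter-client', 'dns', 'database']
```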

  16. Define and measure your service indicator metrics, based on
    the externally-observable behaviors your users will notice

  17. What does the panopticon approach enable?

  18. Panopticon observability helps us set development priorities

  19. Error budgets are bidirectional
    Panopticon observability helps us set development priorities
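
The slide text does not spell out what "bidirectional" means. The sketch below follows one common reading, which is an assumption here: an overspent budget argues for reliability work, while a largely unspent budget leaves room for feature work and riskier changes. The function names, the target, and the request counts are all hypothetical.

```python
def error_budget(slo, total_requests, failed_requests):
    """Return (allowed_failures, remaining); remaining goes negative when overspent."""
    allowed = (1 - slo) * total_requests
    remaining = allowed - failed_requests
    return allowed, remaining


def suggest_priority(remaining):
    """Map the remaining budget to a development priority (one possible policy)."""
    if remaining < 0:
        return "over budget: favor reliability work"
    return "within budget: room for feature work and riskier changes"


# Hypothetical period: 99.9% target, one million requests, 400 failures.
allowed, remaining = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(round(allowed), round(remaining), suggest_priority(remaining))
```

With these numbers about 1,000 failures are allowed and about 600 remain, so the policy above would leave room for riskier changes. With 1,500 failures the remaining budget would be negative.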

  20. Panopticon observability helps us recover from failures

  21. Panopticon observability helps us understand how our
    systems actually work

  22. Yes, there is such a thing as “too much reliability”

  23. @chimeracoder

  24. We can’t observe everything
    But if we choose and observe the right indicators, that’s enough

  25. Thank you!
    Aditya Mukerjee
    @chimeracoder
