Observability of Distributed Systems

Operating distributed systems is hard, not only because of the inherent complexity that comes from the number of components and their distribution, but also because of the unpredictability of their failure modes: they are full of unknown unknowns. We are left with an imperative to build systems that we can debug armed with evidence instead of conjecture.

Observability is the practice of understanding the internal state of a system via knowledge of its external outputs. In this talk, we will discuss observability practices, benefits, and opportunities. We’ll also explore observability as a part of the development process.

José Carlos Chávez

November 07, 2019

Transcript

  1.
    Øbservability of Distributed Systems
    Øredev 2019
    Photo by Daniel Cossio

  2.
    Expedia Group Proprietary and Confidential
    About me
    - Software Engineer at Expedia Group
    - Zipkin core team member and open source contributor for observability projects
    @jcchavezs - #oredev2019

  3.
    Distributed Systems & Complexity
    Photo by Claudio Testa

  4.
    Distributed systems
    A collection of independent components that appears to its users as a single coherent system.
    Image source: https://link.medium.com/jey42ga7p1

  5.
    Complexity (noun)
    1. the state of having many parts and being difficult to understand or find an answer to.
    Cambridge Dictionary

  6.
    The three-body problem (1687)
    Given the initial positions and velocities of three masses, find their subsequent paths of motion according to the laws of motion and universal gravitation.
    TL;DR
    - Known initial conditions
    - Unpredictable state of the system at a given time

  7.
    Distributed systems are complex
    System complexity can be described as a measure of how understandable a system is and how difficult it is to understand an operation in the system.
    Sources of complexity in systems:
    - Task-Structure Complexity
    - Unpredictability
    - Size Complexity
    - Chaotic Complexity
    - Algorithmic Complexity

  8.
    Why is it hard to operate a Distributed System?
    - Systems change all the time
    - Things fail in unexpected ways
    - Unknown unknowns
    - Most problems are the convergence of many different things failing at once
    - Everyone on the team is expected to respond with the same level of confidence and the same tools, regardless of experience or expertise, yet the more components there are, the less each individual knows about them

  9.
    Distributed systems are never "up"; they exist in a constant state of partially degraded service.
    Source: https://opensource.com/article/17/7/state-systems-administration

  10.
    Observability
    Photo by Toa Heftiba

  11.
    What is Observability?
    [...] is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals... one can determine the behavior of the entire system from the system’s outputs. If a system is not observable, this means that the current values of some of its state variables cannot be determined through output sensors. This implies that their value is unknown to the controller (although they can be estimated by various means).
    Wikipedia
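
The control-theory definition quoted above can be made concrete for the standard linear time-invariant case. The notation below (state x, input u, output y, matrices A, B, C, state dimension n) is the usual textbook convention and is an assumption here, since the slide does not show any formulas.

```latex
\[
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n} \\[4pt]
\mathcal{O} &= \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad \text{observable} \iff \operatorname{rank}\mathcal{O} = n \\[4pt]
\mathcal{C} &= \begin{bmatrix} B & AB & A^{2}B & \cdots & A^{n-1}B \end{bmatrix},
\qquad \text{controllable} \iff \operatorname{rank}\mathcal{C} = n \\[4pt]
&(A, C)\ \text{observable} \iff (A^{\top}, C^{\top})\ \text{controllable}
\end{aligned}
\]
```

The last line is the duality the quote mentions: checking observability of one system is the same computation as checking controllability of its transposed counterpart. Software systems are rarely linear, so this reads best as the origin of the term rather than as a test that can be run against a service.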

  12.
    What is Observability?
    Observability is the property of a system that allows us to understand its internal states from its input and output signals, in such a way that actions can be distilled from that understanding.
    That means:
    - Observability is not tooling
    - It is fundamentally tied to control
    - Signals are not data but measurements connected to something we need to know

  13.
    What is Observability?
    Source: https://twitter.com/popsysdig/status/1139505998299877377

  14.
    Three pillars of observability
    Image source: https://twitter.com/autoletics/status/1163345131128401920

  15.
    Three aggregates for signals
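
The image for this slide is not part of the transcript. Assuming the three aggregates are logs, metrics and traces (the three named in the "Observability 3 ways" talk listed under See also), one way to read the slide is that a single operation produces one underlying signal that gets recorded in three shapes. The sketch below illustrates that reading with the standard library only; `handle_request`, `LOGS`, `METRICS` and `SPANS` are hypothetical names, not any real library's API.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for real backends.
LOGS = []            # discrete events, each with its own context
METRICS = Counter()  # numeric aggregates over many events
SPANS = []           # timed operations tied together by a trace id


def handle_request(path, trace_id=None):
    """Serve a fake request and record it as a log, a metric and a span."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.monotonic()
    # Stand-in for real work: one path is hard-wired to fail.
    status = 500 if path == "/broken" else 200
    duration_ms = (time.monotonic() - start) * 1000.0

    # Same operation, three shapes.
    LOGS.append(json.dumps({"trace_id": trace_id, "path": path, "status": status}))
    METRICS[("requests_total", path, status)] += 1
    SPANS.append({"trace_id": trace_id, "name": f"GET {path}", "duration_ms": duration_ms})
    return status, trace_id
```

Calling `handle_request("/broken")` leaves one entry in each sink, and the shared `trace_id` is what lets the log line and the span be connected afterwards.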

  16.
    Why should we invest in observability?
    - Gives real-time feedback from signals
    - Helps to understand unknown unknowns
    - Eases debugging by providing context and scope for signals
    - Improves the resilience of systems by giving visibility into baseline failure modes during the development cycle

  17.
    Building observable systems

  18.
    Observability as part of the software lifecycle
    - During development, make sure your system can emit meaningful signals.
    - When testing, make sure actionable failure modes can be surfaced.
    - At deploy time, use observability signals to understand the impact of the changes being released.
    Image source: https://link.medium.com/zvm1AfYvy0
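
The development and testing points on this slide could look roughly like the sketch below: the code emits a structured signal for an expected failure mode, and a test asserts that the signal really appears with enough context to act on. Everything here is illustrative. The `payments` logger name, `charge`, `PaymentDeclined` and the `gateway` callable are hypothetical, and only the standard `logging` and `json` modules are used.

```python
import json
import logging

logger = logging.getLogger("payments")  # hypothetical service name


class PaymentDeclined(Exception):
    """A failure mode we expect and want to be able to act on."""


def charge(order_id, amount_cents, gateway):
    """Charge an order and emit one structured signal per outcome.

    `gateway` is any callable that takes the amount and returns True on
    success; it stands in for a real payment client.
    """
    context = {"order_id": order_id, "amount_cents": amount_cents}
    if not gateway(amount_cents):
        # The signal names the failure mode and carries context to act on.
        logger.error(json.dumps({"event": "payment_declined", **context}))
        raise PaymentDeclined(order_id)
    logger.info(json.dumps({"event": "payment_charged", **context}))


class ListHandler(logging.Handler):
    """Collects emitted records so a test can assert on them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        self.events.append(json.loads(record.getMessage()))


def test_declined_payment_is_surfaced():
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    try:
        try:
            charge("order-1", 500, gateway=lambda amount: False)
        except PaymentDeclined:
            pass
        # The failure mode must show up as exactly one actionable signal.
        assert handler.events == [
            {"event": "payment_declined", "order_id": "order-1", "amount_cents": 500}
        ]
    finally:
        logger.removeHandler(handler)
```

Treating the emitted signal as part of the function's contract means a refactor that silently drops it fails the test suite instead of being discovered during an incident.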

  19.
    Observability as part of the software lifecycle
    - When operating a system, use signals to:
      - understand health
      - detect anomalies
      - triage problems
      - evolve the system
    - When in support, you can re-scope the issues based on the signal context
    Image source: https://link.medium.com/zvm1AfYvy0
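
As a small illustration of "detect anomalies", the sketch below compares a current measurement against baseline samples with a plain z-score. The talk does not prescribe a method, so the technique, the function name `is_anomalous` and the default threshold of 3 standard deviations are all assumptions chosen for brevity.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, threshold=3.0):
    """Return True when `current` lies more than `threshold` sample
    standard deviations away from the mean of the baseline samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold
```

With baseline error rates of `[0.010, 0.012, 0.011, 0.009, 0.010]`, a current rate of `0.011` stays within the threshold while `0.05` is flagged. Real signals are often seasonal or skewed, so a fixed z-score is only a starting point.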

  20.
    Building an observability culture

  21.
    Building an observability culture
    Ownership
    Landing observability in an engineering department needs champions who:
    - Raise awareness about the problems that can be solved by introducing observability
    - Understand teams’ pains when it comes to operating and triaging the system, and choose the right tools for those pains
    - Set practices, evolve them and help to replicate them among teams

  22.
    Building an observability culture
    Tooling
    Observability is not tooling, but tooling is key to achieving good observability. What is needed:
    - Suitable observability platforms and instrumentation in place
    - Tools and dashboards that connect the dots among stakeholders
    - Automated checks that make sure signal outputs make sense after a deploy
    - The right processes to make sure Personally Identifiable Information (PII) is safe
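
The last two bullets could be automated along the lines of the sketch below, which checks a batch of emitted events after a deploy: each event must carry the agreed fields, and string values that look like e-mail addresses are flagged as possible PII. The required field names and the e-mail pattern are assumptions made for illustration; a real PII policy covers far more than e-mail addresses.

```python
import re

# Hypothetical contract: the fields every emitted event must carry.
REQUIRED_FIELDS = {"event", "service", "timestamp"}
# Deliberately simple pattern; real PII detection needs much more than this.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def check_event(event):
    """Return a list of problems found in one emitted event (a dict)."""
    problems = []
    missing = REQUIRED_FIELDS - set(event)
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    for key, value in event.items():
        if isinstance(value, str) and EMAIL_PATTERN.search(value):
            problems.append(f"possible PII in field '{key}'")
    return problems


def check_events(events):
    """Map the index of each faulty event to its problems; {} means all OK."""
    report = {}
    for index, event in enumerate(events):
        problems = check_event(event)
        if problems:
            report[index] = problems
    return report
```

A deploy pipeline could run `check_events` on a sample of freshly emitted events and stop the rollout when the report is not empty.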

  23.
    Building an observability culture
    Business value
    Observability can also be beneficial for other stakeholders of the system:
    - Helping to achieve SLOs by improving the triage experience.
    - Giving support teams and engineers a common context to understand and fix problems in production.
    - Improving support teams’ awareness by foreseeing trends in failures.

  24.
    Summary
    - Systems are complex and will remain so; observability helps us to better understand failure modes.
    - Observability is not a goal in itself; it is only important if we close the cycle by acting on the observations.
    - Observability will benefit not only developers and operators but all stakeholders of the system.
    - Like everything else in the software industry, building the culture is more important than the code, infrastructure and tooling.

  25.
    Thank you
    Q&A

  26.
    See also
    - Does software understand complexity? - Michael Feathers
    - What is the Complexity of a Distributed System? - Anand Ranganathan, Roy H. Campbell
    - Observability: The significant parts - William Louth
    - Observations on observability - Colin Breck
    - Observability 3 ways: Logging, Metrics & Tracing - Adrian Cole