Slide 1

Slide 1 text

Øbservability of Distributed Systems
Øredev 2019
Photo by Daniel Cossio

Slide 2

Slide 2 text

Expedia Group Proprietary and Confidential
About me
- Software Engineer at Expedia Group
- Zipkin core team member and open source contributor for observability projects
@jcchavezs - #oredev2019

Slide 3

Slide 3 text

Distributed Systems & Complexity
Photo by Claudio Testa

Slide 4

Slide 4 text

Distributed systems
A collection of independent components that appears to its users as a single coherent system.
Image source: https://link.medium.com/jey42ga7p1

Slide 5

Slide 5 text

Complexity (noun)
1. the state of having many parts and being difficult to understand or find an answer to.
Source: Cambridge Dictionary

Slide 6

Slide 6 text

The three-body problem (1687)
Given the initial positions and velocities of three masses, find their subsequent paths of motion according to the laws of motion and universal gravitation.
TL;DR
- Known initial conditions
- Unpredictable state of the system at a given time

Slide 7

Slide 7 text

Distributed systems are complex
System complexity can be described as a measure of how understandable a system is and how difficult it is to understand an operation in the system.
Sources of complexity in systems:
- Task-Structure Complexity
- Unpredictability
- Size Complexity
- Chaotic Complexity
- Algorithmic Complexity

Slide 8

Slide 8 text

Why is it hard to operate a distributed system?
- Systems change all the time
- Things fail in unexpected ways
- Unknown unknowns
- Most problems are the convergence of many different things failing at once
- Everyone on the team is expected to respond with the same level of confidence and tools regardless of experience or expertise, and the more components there are, the less each individual knows about them

Slide 9

Slide 9 text

Distributed systems are never "up"; they exist in a constant state of partially degraded service.
Source: https://opensource.com/article/17/7/state-systems-administration

Slide 10

Slide 10 text

Observability
Photo by Toa Heftiba

Slide 11

Slide 11 text

What is Observability?
[...] is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals. [...] one can determine the behavior of the entire system from the system’s outputs. If a system is not observable, this means that the current values of some of its state variables cannot be determined through output sensors. This implies that their value is unknown to the controller (although they can be estimated by various means).
Source: Wikipedia
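In control theory, where this definition originates, the idea has a precise form for linear time-invariant systems. The block below is only a brief illustration; the symbols A, B, C, D, x, u, y and n are standard textbook notation and do not appear on the slides.

```latex
% State-space model: x is the n-dimensional internal state,
% u the input and y the measured output.
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)

% Observability matrix and the rank test.
\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1}
\end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

When the rank is lower than n, some state variables leave no trace in the outputs, which is the situation the quote describes as state that cannot be determined through output sensors.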

Slide 12

Slide 12 text

What is Observability?
Observability is the property of a system that allows us to understand its internal states from its inputs and output signals, in such a way that actions can be distilled from that understanding.
That means:
- Observability is not tooling
- It is fundamentally tied to control
- Signals are not data but measurements connected to something we need to know

Slide 13

Slide 13 text

What is Observability?
Source: https://twitter.com/popsysdig/status/1139505998299877377

Slide 14

Slide 14 text

Three pillars of observability
Image source: https://twitter.com/autoletics/status/1163345131128401920

Slide 15

Slide 15 text

Three aggregates for signals

Slide 16

Slide 16 text

Why should we invest in observability?
- Gives real-time feedback from signals
- Helps to understand unknown unknowns
- Eases debugging by providing context and scope for signals
- Improves the resilience of systems by giving visibility into baseline failure modes during the development cycle

Slide 17

Slide 17 text

Building observable systems

Slide 18

Slide 18 text

Observability as part of the software lifecycle
- During development, make sure your system can emit meaningful signals.
- When testing, make sure actionable failure modes can be surfaced.
- At deploy time, use observability signals to understand the impact of the changes being released.
Image source: https://link.medium.com/zvm1AfYvy0
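As a rough illustration of what "emitting meaningful signals" could look like during development, the sketch below wraps an operation and records its outcome, duration and context. Everything here is hypothetical and not from the talk: the `observe` helper, the `EVENTS` and `COUNTERS` in-memory sinks, and the field names are assumptions, and a real system would ship the data to an observability backend instead.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory sinks; a real system would send these to a backend.
EVENTS = []           # structured events, one per observed operation
COUNTERS = Counter()  # counts keyed by (operation, outcome)


@contextmanager
def observe(operation, **context):
    """Record one event per operation with its outcome, duration and context."""
    # Reuse a caller-supplied trace id if present, otherwise create one.
    trace_id = context.pop("trace_id", None) or uuid.uuid4().hex
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield trace_id
    except Exception as exc:
        # Surface the failure mode in the signal, then let the error propagate.
        outcome = "error"
        context["error"] = type(exc).__name__
        raise
    finally:
        event = {
            "operation": operation,
            "outcome": outcome,
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
            "trace_id": trace_id,
            **context,
        }
        EVENTS.append(event)
        COUNTERS[(operation, outcome)] += 1
        print(json.dumps(event, default=str))
```

Wrapping a handler body in `with observe("checkout", customer_tier="gold"):` would then produce one event carrying the context needed to scope a later investigation, and a failing body would be recorded with outcome `error` before the exception is re-raised.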

Slide 19

Slide 19 text

Observability as part of the software lifecycle
- When operating a system, use signals to:
  - understand health
  - detect anomalies
  - triage problems
  - evolve the system
- When in support, you can re-scope issues based on the signal context
Image source: https://link.medium.com/zvm1AfYvy0
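One simple way to "detect anomalies" from a signal is to compare a new measurement against a recent baseline. The sketch below uses a z-score test; the function name, the threshold of three standard deviations and the sample latencies are assumptions chosen for illustration, not a method described in the talk.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, threshold=3.0):
    """Return True when value lies more than `threshold` sample standard
    deviations away from the mean of the baseline measurements."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return value != baseline[0]
    return abs(value - mean(baseline)) / spread > threshold
```

With baseline latencies of `[100, 102, 98, 101, 99]` milliseconds, a new reading of 150 would be flagged while a reading of 101 would not. Real signals are rarely this well behaved, so a production check would likely need seasonality handling and a more robust statistic.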

Slide 20

Slide 20 text

Building an observability culture

Slide 21

Slide 21 text

Building an observability culture
Ownership
Landing observability in an engineering department needs champions who:
- Raise awareness about the problems that can be solved by introducing observability
- Understand teams’ pains when it comes to operating and triaging the system, and decide on the right tools for those pains
- Set practices, evolve them and help replicate them among teams

Slide 22

Slide 22 text

Building an observability culture
Tooling
Observability is not tooling, but tooling is key to achieving good observability. What is needed:
- Suitable observability platforms and instrumentation in place
- Tools and dashboards that connect the dots among stakeholders
- Automated checks that make sure signal outputs make sense after a deploy
- The right processes to make sure Personally Identifiable Information (PII) is safe
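Two of the needs listed above lend themselves to a small sketch: an automated post-deploy check on signal outputs and a safeguard for PII. The code below is a minimal illustration under stated assumptions: the function names, the dictionary shape with `requests` and `errors` counts, the 0.01 tolerance, and the choice to treat only e-mail addresses as PII are all hypothetical.

```python
import re

# Assumption: e-mail addresses are the only PII of interest in this sketch.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(message):
    """Mask e-mail addresses before a signal leaves the process."""
    return EMAIL.sub("[redacted-email]", message)


def deploy_looks_healthy(before, after, max_error_rate_increase=0.01):
    """Compare error rates from two windows of request counts around a deploy.

    `before` and `after` are dicts with "requests" and "errors" counts.
    """
    if after["requests"] == 0:
        # No signal at all after a deploy is itself suspicious.
        return False
    rate_before = before["errors"] / before["requests"] if before["requests"] else 0.0
    rate_after = after["errors"] / after["requests"]
    return rate_after - rate_before <= max_error_rate_increase
```

A deploy pipeline could call `deploy_looks_healthy` with counts taken from the metrics backend shortly before and after the release, and roll back or alert when it returns `False`. Redaction by regular expression is only a last line of defence; the processes mentioned on the slide still need to keep PII out of signals in the first place.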

Slide 23

Slide 23 text

Building an observability culture
Business value
Observability can also be beneficial for other stakeholders of the system:
- Helping to achieve SLOs by improving the triage experience.
- Giving support teams and engineers a common context to understand and fix problems in production.
- Improving support teams’ awareness by foreseeing trends in failures.
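To make the link between signals and SLOs concrete, the arithmetic behind a request-based error budget can be sketched as follows. The function name and the request-based formulation are assumptions for illustration; the talk does not prescribe how SLOs are measured.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_fraction) for a request-based SLO.

    slo_target is a fraction such as 0.999 for "99.9% of requests succeed".
    remaining_fraction goes negative once the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0, 0.0
    remaining = 1 - failed_requests / allowed_failures
    return allowed_failures, remaining
```

For a 99.9% target over 1,000,000 requests the budget is about 1,000 failed requests, so 250 failures would leave roughly 75% of it. Faster triage helps here because it shortens the window in which failures keep consuming the budget.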

Slide 24

Slide 24 text

Summary
- Systems are complex and will remain so; observability helps us to better understand failure modes.
- Observability is not a goal in itself; it is only important if we close the cycle through the actions we take from the observations.
- Observability benefits not only developers and operators but all stakeholders of the system.
- Like everything else in the software industry, building the culture is more important than the code, infrastructure and tooling.

Slide 25

Slide 25 text

Thank you
Q&A

Slide 26

Slide 26 text

See also
- Does software understand complexity? - Michael Feathers
- What is the Complexity of a Distributed System? - Anand Ranganathan, Roy H. Campbell
- Observability: The significant parts - William Louth
- Observations on observability - Colin Breck
- Observability 3 ways: Logging, Metrics & Tracing - Adrian Cole