Slide 1

observability ++ <= (distributed) tracing

[email protected] · @jeqo89 · github.com/jeqo

Slide 2

## references

### papers
* **dapper** https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
* pivot tracing http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/122-mace.pdf
* **canopy** http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf

### articles
* ok log https://peter.bourgon.org/ok-log/
* logs - 12 factor application https://12factor.net/logs
* the problem with logging https://blog.codinghorror.com/the-problem-with-logging/
* logging v. instrumentation https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
* measure anything, measure everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
* metrics, tracing and logging https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
* monitoring and observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
* monitoring in the time of cloud native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
* sre book https://landing.google.com/sre/book/index.html
* distributed tracing at uber https://eng.uber.com/distributed-tracing/
* ...

Slide 3

# context

Slide 4

## complexity: from monoliths to distributed systems

Slide 5

### WARNING sign

Slide 6

### data on the outside vs data on the inside

“Going [from monolithic architecture] to SOA is like going from Newton’s physics to Einstein’s physics. Newton’s time marched forward uniformly with instant knowledge at a distance. Before SOA, distributed computing strove to make many systems look like one with RPC, 2PC, etc [...]

Slide 7

### data on the outside vs data on the inside

[...] In Einstein’s universe, everything is relative to one’s perspective. SOA has “now” inside and the “past” arriving in messages.” - Pat Helland

Slide 8

### data on the outside vs data on the inside

“perhaps we should rename the “extract microservice” refactoring operation to “change model of time and space” ;).” - Adrian Colyer

Slide 9

### `death star` architectures

Slide 10

### service collaboration

#### orchestration
* **explicit** data flow
* coordination, coupling

#### choreography
* **implicit** data flow
* flexible, moves faster

Slide 11

## from snowflake servers, to containers, to serverless

Slide 12

# distributed tracing

Slide 13

> track execution, follow evidence, monitor status and location

Slide 14

### tracing definition

“[...] tracing involves a specialized use of logging to record information about a program's execution. this information is typically used by programmers for debugging purposes, [...] and by software monitoring tools to diagnose common problems with software. tracing is a cross-cutting concern.” - wikipedia

Slide 15

### tracing definition

“[...] the single defining characteristic of tracing, then, is that it deals with information that is request-scoped: any bit of data or metadata that can be bound to the lifecycle of a single transactional object in the system. [...]” - Peter Bourgon

Slide 16

### tracing _monoliths_

Slide 17

### tracing `distributed` systems

Slide 18

### distributed tracing
* 1 story, N storytellers
* **aggregated** traces, **across** boundaries
* “distributed tracing _commoditizes knowledge_” - Adrian Cole

Slide 19

## distributed tracing approaches

Slide 20

### `dapper`: how google does it

> [...] was built to provide developers with **more information** about the **behavior** of complex distributed systems

> understanding system behavior [...] requires **observing** related activities _across many different programs and machines_.

> monitoring should **always be on(!)**

Slide 21

### annotation-based approach

> Two classes of solutions have been proposed to aggregate this information [...]: **black-box** and **annotation-based** monitoring schemes. annotation-based schemes rely on applications or middleware to **explicitly tag every record with a global identifier** that links these message records back to the originating request.
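To make the scheme concrete, here is a toy, hypothetical sketch (not Dapper's actual implementation) of middleware stamping every outgoing record with a global identifier:

```java
import java.util.Map;
import java.util.UUID;

// hypothetical annotation-based middleware: explicitly tag every record
// with a global identifier that links it back to the originating request
public class AnnotatingMiddleware {

    public void send(Map<String, String> headers, String payload) {
        // reuse the id minted at the edge of the system,
        // or mint a new one if this hop is the origin
        String traceId = headers.computeIfAbsent("X-Trace-Id",
                k -> UUID.randomUUID().toString());
        // every record emitted on this hop carries the same identifier,
        // so a collector can join records across services by traceId
        System.out.printf("traceId=%s event=send payload=%s%n", traceId, payload);
        // ... transmit headers + payload to the next service ...
    }
}
```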

Slide 22

### `dapper`: impact on development
* **performance**: devs were able to track progress against request latency targets and pinpoint easy optimization opportunities.
* **correctness**: it became possible to spot clients that were accessing a master replica when they didn’t need to.
* **understanding**: it became possible to understand the overall latency of queries fanned out across back-ends.
* **testing**

Slide 23

### distributed tracing **commons**

Slide 24

### black-box approach

> Black-box schemes **assume there is no additional information other than the message record** described above, and _use statistical regression techniques_ to infer that association. _while black-box schemes are more portable than annotation-based methods_, **they need more data in order to gain sufficient accuracy** due to their reliance on statistical inference. - Dapper

Slide 25

### `black-box` approach

Slide 26

## opentracing

Slide 27

### `opentracing`

(diagram: the OpenTracing API sits between the instrumented code (application logic, µ-service frameworks, control-flow packages, RPC frameworks, existing instrumentation) and the tracing infrastructure: a `tracer` such as Jaeger, running alongside the service process from `main()`)

Slide 28

#### opentracing semantics
* `trace` = tree of `spans` (i.e. DAG)
* `span`: `service_name` + `operation` + `tags` + `logs` + `latency`
* `span` identified by `context`
* `spanContext`: `traceId` + `spanId` + `baggages`
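A minimal sketch of these semantics with the opentracing-java API (the operation names, tag, and log message are made up for illustration):

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class Semantics {
    public static void main(String[] args) {
        // a no-op tracer unless a concrete one (jaeger, zipkin, ...) is registered
        Tracer tracer = GlobalTracer.get();

        // a trace is a tree of spans; this span is the root
        Span parent = tracer.buildSpan("http.get/orders")
                .withTag("span.kind", "server")
                .start();

        // child span: same traceId, new spanId, parent recorded in its context
        Span child = tracer.buildSpan("db.query")
                .asChildOf(parent)
                .start();
        child.log("rows fetched");  // a timestamped log bound to this span
        child.finish();             // finishing captures the span's latency

        parent.finish();
    }
}
```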

Slide 29

### demo time

> intro to **opentracing**

Slide 30

### context propagation

> “the efficient implementation of the _happened-before join_ requires advice in one tracepoint to send information along the execution path to advice in subsequent tracepoints. this is done through a new **baggage abstraction**, which uses _causal metadata propagation_” - pivot tracing
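A sketch of baggage and context propagation with opentracing-java (the header carrier and baggage key are illustrative, and a concrete tracer must be registered for extraction to yield a context):

```java
import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class BaggageDemo {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();

        Span client = tracer.buildSpan("checkout").start();
        // baggage is causal metadata: it rides along to every downstream span
        client.setBaggageItem("customer.tier", "gold");

        // inject the span context (traceId + spanId + baggage) into a carrier
        Map<String, String> headers = new HashMap<>();
        tracer.inject(client.context(), Format.Builtin.TEXT_MAP,
                new TextMapAdapter(headers));

        // the receiving side extracts the context, baggage included
        SpanContext remote = tracer.extract(Format.Builtin.TEXT_MAP,
                new TextMapAdapter(headers));
        Span server = tracer.buildSpan("payment").asChildOf(remote).start();
        server.finish();
        client.finish();
    }
}
```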

Slide 31

### WARNING sign: dist. tracing & context propagation

Slide 32

### sampling
* under 10 events per second: don’t sample.
* if you decide to sample, think about the characteristics of your traffic that you want to preserve, and use those fields to guide your sample rate - honeycomb.io
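One way to act on that advice (an illustrative sketch, not honeycomb's implementation): decide per trace, deterministically on the trace id so all spans of a trace share the decision, and always keep the traffic you care about, here errors:

```java
// head-based sampler sketch: deterministic on the traceId, so every span
// in a trace gets the same keep/drop decision; the keep-all-errors rule
// is an illustrative choice of "characteristics to preserve"
public class TraceSampler {
    private final double rate; // e.g. 0.01 keeps ~1% of non-error traces

    public TraceSampler(double rate) {
        this.rate = rate;
    }

    public boolean keep(long traceId, boolean isError) {
        if (isError) {
            return true; // preserve all errors regardless of rate
        }
        // scramble the id, map it into [0, 1), compare against the rate
        long bucket = Long.remainderUnsigned(traceId * 0x9E3779B97F4A7C15L, 10_000);
        return bucket / 10_000.0 < rate;
    }
}
```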

Slide 33

### demo time

> **distributed tracing** in practice

Slide 34

### jaeger architecture
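For reference, wiring a Jaeger tracer into OpenTracing's `GlobalTracer` takes a few lines with the jaeger-client-java configuration helper (a sketch; the service name is illustrative):

```java
import io.jaegertracing.Configuration;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class JaegerSetup {
    public static void main(String[] args) {
        // reads JAEGER_AGENT_HOST, JAEGER_SAMPLER_TYPE, etc. from the environment
        Tracer tracer = Configuration.fromEnv("hello-service").getTracer();
        GlobalTracer.registerIfAbsent(tracer);
        // ... code instrumented via GlobalTracer.get() now reports to jaeger ...
    }
}
```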

Slide 35

### zipkin architecture

Slide 36

# a step back: **observability**

Slide 37

Slide 38

### observability at twitter

> “these are the _four pillars_ of the **Observability Engineering team’s** charter:
> * monitoring
> * alerting/visualization
> * distributed systems tracing infrastructure
> * log aggregation/analytics” - twitter, 2013

Slide 39

### observability at google

> the **holistic approach** to be able to **observe** systems

> we observe systems via **various signals**: metrics, traces, logs, events, … - @rakyll

Slide 40

### observability is a **superset** of:

> **monitoring**: how you operate a system (_known unknowns_)
> **instrumentation**: how you develop a system to be monitorable

and about _making systems more_:

> **debuggable**: tracking down failures and bugs
> **understandable**: answering questions, spotting trends - Charity Majors

Slide 41

### observability for **unknown unknowns**

> “A good example of something that needs “monitoring” would be a storage server running out of disk space or a proxy server running out of file descriptors. An I/O bound service has different failure modes compared to a memory bound one. An HA system has different failure modes compared to a CP system.”

> “in essence “Observability” captures what “monitoring” doesn’t (and ideally, shouldn’t).” - Charity Majors

Slide 42

## observability methods

Slide 43

### observability pillars

Slide 44

Slide 45

### observability pillars - Peter Bourgon

Slide 46

### logging v. instrumentation

> services should **only log actionable data**
> logs should be **treated as event streams**
> understand that **logging is expensive**
> services should **instrument every meaningful number available for capture**
> 3 metrics to get started: from the USE method to the RED method
>   * host-oriented: utilization, saturation, errors
>   * app-oriented: requests, errors, and duration
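A sketch of RED-style instrumentation with the Prometheus Java simpleclient (metric and label names are illustrative choices):

```java
import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;

public class RedMetrics {
    // rate: how many requests the service handles
    static final Counter requests = Counter.build()
            .name("http_requests_total").help("Total requests.")
            .labelNames("path").register();

    // errors: how many of those requests fail
    static final Counter errors = Counter.build()
            .name("http_request_errors_total").help("Failed requests.")
            .labelNames("path").register();

    // duration: how long requests take
    static final Histogram duration = Histogram.build()
            .name("http_request_duration_seconds").help("Request latency.")
            .labelNames("path").register();

    static void handle(String path, Runnable handler) {
        requests.labels(path).inc();
        Histogram.Timer timer = duration.labels(path).startTimer();
        try {
            handler.run();
        } catch (RuntimeException e) {
            errors.labels(path).inc();
            throw e;
        } finally {
            timer.observeDuration();
        }
    }
}
```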

Slide 47

### demo time

> metrics and distributed tracing with **opentracing**

Slide 48

### opencensus
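A minimal sketch with the opencensus-java tracing API (span name and annotation are illustrative; exporter setup, which determines where spans are sent, is omitted):

```java
import io.opencensus.common.Scope;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;

public class OpenCensusDemo {
    private static final Tracer tracer = Tracing.getTracer();

    public static void main(String[] args) {
        // startScopedSpan makes the span current for the duration of the try block
        try (Scope scope = tracer.spanBuilder("checkout").startScopedSpan()) {
            tracer.getCurrentSpan().addAnnotation("processing order");
        } // the span ends when the scope closes
    }
}
```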

Slide 49

### demo time

> intro to **opencensus**

Slide 50

### WARNING sign - Cindy Sridharan

Slide 51

# what’s next?

Slide 52

## `canopy`: how facebook does it

> `canopy` constructs traces by propagating identifiers through the system to correlate information across components. **challenges** about this:
> * end-to-end data is heterogeneous [...]
> * consuming instrumented data directly is cumbersome and infeasible at scale.

Slide 53

## `canopy`: how facebook does it

> evaluating interactive queries over raw traces is computationally infeasible, because Facebook captures over one billion traces per day.

> unless we provide further abstractions (on top of traces and events), users will have to consume trace data directly, which entails complicated queries to extract simple high-level features.

> **users should be able to view traces through a lens appropriate for their particular tasks**.

Slide 54

### `canopy`: building models from traces

Slide 55

### distributed systems verification
* unit-testing
  > testing error-handling code could have prevented 58% of catastrophic failures
* integration-testing
  > 3 nodes or less can reproduce 98% of failures
* property-based testing (see the sketch below)
  > **caution**: passing tests does not ensure correctness

- “The Verification of a Distributed System” by **Caitie McCaffrey**
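As an illustration of property-based testing (a hypothetical jqwik sketch, not from the talk): state an invariant and let the framework generate inputs, keeping in mind the caution above that passing tests still does not prove correctness:

```java
import net.jqwik.api.ForAll;
import net.jqwik.api.Property;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReplicationProperties {
    // hypothetical invariant: merging the same batch of updates in any order
    // converges to the same state - the kind of property a CRDT must satisfy
    @Property
    boolean mergeIsOrderIndependent(@ForAll List<Integer> updates) {
        List<Integer> shuffled = new ArrayList<>(updates);
        Collections.shuffle(shuffled);
        // a max-register: merge = max, which is commutative and associative
        int a = updates.stream().reduce(Integer.MIN_VALUE, Math::max);
        int b = shuffled.stream().reduce(Integer.MIN_VALUE, Math::max);
        return a == b;
    }
}
```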

Slide 56

### distributed systems verification
* formal verification: TLA+
* fault-injection: chaos engineering
  > **without explicitly forcing a system to fail, it is unreasonable to have any confidence it will operate correctly in failure mode**
* testing in production: canaries

- “The Verification of a Distributed System” by **Caitie McCaffrey**

Slide 57

> “fault tolerance is a **non compositional** property” - peter alvaro

Slide 58

### lineage-driven fault injection

“Orchestrating Chaos: Applying Database Research in the Wild” - Peter Alvaro https://www.youtube.com/watch?v=YplkQu6a80Q

Slide 59

Slide 60

Slide 61

Slide 62

> “the best way to find patterns in a system is looking at it from above” - es devlin, designer, in _Abstract: The Art of Design_ (Netflix)

Slide 63

### simulation with `simianviz`
* what would my architecture look like if it grows?
* monitoring tools often “explode on impact” with real-world use cases at scale
* `spigo` (aka simianviz) is a tool that can produce output in any format to feed your monitoring tools from a laptop - Adrian Cockcroft

Slide 64

### vizceral - “pain suit”

“Intuition Engineering at Netflix” - Justin Reynolds https://vimeo.com/173607639

Slide 65

### vizceral - intuition engineering

“Intuition Engineering at Netflix” - Justin Reynolds https://vimeo.com/173607639

Slide 66

### demo time

> simianviz to zipkin and vizceral

Slide 67

https://twitter.com/mipsytipsy/status/932551447555858433
https://twitter.com/jessitron/status/579109266042150912

Slide 68

## references

### papers
* **dapper** https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
* **canopy** http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf
* **automating failure testing research at internet scale** https://people.ucsc.edu/~palvaro/socc16.pdf
* data on the outside vs data on the inside http://cidrdb.org/cidr2005/papers/P12.pdf
* pivot tracing http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/122-mace.pdf

### articles
* ok log https://peter.bourgon.org/ok-log/
* logs - 12 factor application https://12factor.net/logs
* the problem with logging https://blog.codinghorror.com/the-problem-with-logging/
* logging v. instrumentation https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
* logs and metrics https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
* measure anything, measure everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
* metrics, tracing and logging https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
* monitoring and observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
* monitoring in the time of cloud native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
* sre book https://landing.google.com/sre/book/index.html
* distributed tracing at uber https://eng.uber.com/distributed-tracing/
* spigo and simianviz https://github.com/adrianco/spigo
* observability: what’s in a name? https://honeycomb.io/blog/2017/08/observability-whats-in-a-name/
* wtf is operations? #serverless https://charity.wtf/2016/05/31/wtf-is-operations-serverless/
* event foo: what should i add to an event https://honeycomb.io/blog/2017/08/event-foo-what-should-i-add-to-an-event/

Slide 69

## references

### articles (continued)
* Google’s approach to Observability https://medium.com/@rakyll/googles-approach-to-observability-frameworks-c89fc1f0e058
* Microservices and Observability https://medium.com/@rakyll/microservices-observability-26a8b7056bb4
* Best Practices for Observability https://honeycomb.io/blog/2017/11/best-practices-for-observability/
* https://thenewstack.io/dev-ops-doesnt-matter-need-observability/

### talks
* “Observability for Emerging Infra: What Got You Here Won't Get You There” by Charity Majors https://www.youtube.com/watch?v=1wjovFSCGhE
* “The Verification of a Distributed System” by Caitie McCaffrey https://www.youtube.com/watch?v=kDh5BrqiGhI
* “Mastering Chaos - A Netflix Guide to Microservices” by Josh Evans https://www.youtube.com/watch?v=CZ3wIuvmHeM
* “Monitoring Microservices” by Tom Wilkie https://www.youtube.com/watch?v=emaPPg_zxb4
* “Microservice application tracing standards and simulations” by Adrian Cole and Adrian Cockcroft https://www.slideshare.net/adriancockcroft/microservices-application-tracing-standards-and-simulators-adrians-at-oscon
* “Intuition Engineering at Netflix” by Justin Reynolds https://vimeo.com/173607639
* “Distributed Tracing: Understanding how all your components work together” by José Carlos Chávez https://speakerdeck.com/jcchavezs/distributed-tracing-understanding-how-your-all-your-components-work-together
* “Monitoring isn’t just an accident” https://docs.google.com/presentation/d/1IEJIaQoCjzBsVq0h2Y7qcsWRWPS5lYt9CS2Jl25eurc/edit#slide=id.g327c9fd948_0_534
* “Orchestrating Chaos: Applying Database Research in the Wild” by Peter Alvaro https://www.youtube.com/watch?v=YplkQu6a80Q

Slide 70

## references

### articles (continued)
* “The Verification of A Distributed System” - Caitie McCaffrey https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem
* “Testing in Production” by Charity Majors https://opensource.com/article/17/8/testing-production
* “Data on the outside vs Data on the inside - Review” by Adrian Colyer https://blog.acolyer.org/2016/09/13/data-on-the-outside-versus-data-on-the-inside/

Slide 71

**thank you!**

* github.com/jeqo/talk-observing-distributed-systems
* jeqo.github.io
* twitter.com/jeqo89