
Increasing Observability with Distributed Tracing


Jorge Quilcate

February 28, 2018


Transcript

  2. ### data on the outside vs data on the inside

    “Going [from monolithic architecture] to SOA is like going from Newton’s physics to Einstein’s physics. Newton’s time marched forward uniformly with instant knowledge at a distance. Before SOA, distributed computing strove to make many systems look like one with RPC, 2PC, etc [...]
  3. ### data on the outside vs data on the inside

    [...] In Einstein’s universe, everything is relative to one’s perspective. SOA has “now” inside and the “past” arriving in messages.” - Pat Helland
  4. ### data on the outside vs data on the inside

    “perhaps we should rename the “extract microservice” refactoring operation to “change model of time and space” ;).” - Adrian Colyer
  5. ### service collaboration

    #### orchestration
    * **explicit** data flow
    * coordination, coupling

    #### choreography
    * **implicit** data flow
    * flexible, moves faster

    (both styles are sketched in the code below)
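
To make the explicit vs. implicit data flow contrast concrete, here is a minimal, hypothetical Java sketch of an "order placed" flow in both styles; the service names and the in-memory subscriber list are illustrative assumptions, not anything from the deck.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class Collaboration {

    // Orchestration: one coordinator explicitly calls each service, so the
    // data flow is visible in one place but the services are coupled to it.
    static void orchestrate() {
        chargePayment(); // explicit call #1
        reserveStock();  // explicit call #2
    }

    // Choreography: the producer only publishes an event; whoever subscribed
    // reacts. The data flow is implicit, and consumers can be added freely.
    static final List<Consumer<String>> subscribers = new ArrayList<>();

    static void choreograph() {
        subscribers.add(event -> chargePayment());
        subscribers.add(event -> reserveStock());
        subscribers.forEach(s -> s.accept("OrderPlaced")); // publish the event
    }

    static void chargePayment() { System.out.println("payment charged"); }
    static void reserveStock() { System.out.println("stock reserved"); }

    public static void main(String[] args) {
        orchestrate();
        choreograph();
    }
}
```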
  6. ### tracing definition

    “[...] tracing involves a specialized use of logging to record information about a program's execution. this information is typically used by programmers for debugging purposes, [...] and by software monitoring tools to diagnose common problems with software. tracing is a cross-cutting concern.” - wikipedia
  7. ### tracing definition

    “[...] the single defining characteristic of tracing, then, is that it deals with information that is request-scoped. Any bit of data or metadata that can be bound to the lifecycle of a single transactional object in the system. [...]” - Peter Bourgon
  8. ### distributed tracing

    * 1 story, N storytellers
    * **aggregated** traces, **across** boundaries
    * “distributed tracing _commoditizes knowledge_” - Adrian Cole
  9. ### `dapper`: how google does it

    > [...] was built to provide developers with **more information** about the **behavior** of complex distributed systems
    > understanding system behavior [...] requires **observing** related activities _across many different programs and machines_.
    > monitoring should **always be on(!)**
  10. ### annotation-based approach

    > Two classes of solutions have been proposed to aggregate this information [...]: **black-box** and **annotation-based** monitoring schemes. annotation-based schemes rely on applications or middleware to **explicitly tag every record with a global identifier** that links these message records back to the originating request.
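
A minimal sketch of the annotation-based scheme described above: a global identifier minted at the edge is explicitly attached to every outgoing message, so records from different processes can later be joined on it. The header name and helpers here are hypothetical illustrations, not Dapper's actual wire format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class AnnotationBasedTagging {

    static final String TRACE_ID_HEADER = "x-trace-id"; // assumed header name

    // Edge service: mint a new request-scoped identifier.
    static Map<String, String> newRequestHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_ID_HEADER, UUID.randomUUID().toString());
        return headers;
    }

    // Downstream service: reuse the incoming id, tag every record with it,
    // and copy the same header onto every call it makes in turn.
    static void handle(Map<String, String> incoming) {
        String traceId = incoming.get(TRACE_ID_HEADER);
        System.out.printf("traceId=%s msg=%s%n", traceId, "looked up user");
    }

    public static void main(String[] args) {
        Map<String, String> headers = newRequestHeaders();
        handle(headers); // this record now links back to the originating request
    }
}
```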
  11. ### `dapper`: impact on development

    * **performance**: Devs were able to track progress against request latency targets and pinpoint easy optimization opportunities.
    * **correctness**: it became possible to see where clients were accessing a master replica when they didn’t need to.
    * **understanding**: it became possible to understand the latency of fan-out queries across back-ends.
    * **testing**
  12. ### black-box approach

    > Black-box schemes **assume there is no additional information other than the message record** described above, and _use statistical regression techniques_ to infer that association. _while black-box schemes are more portable than annotation-based methods_, **they need more data in order to gain sufficient accuracy** due to their reliance on statistical inference. - Dapper
  13. ### `opentracing`

    [diagram: the OpenTracing API sits between existing instrumentation (application logic, µ-service frameworks, control-flow packages, RPC frameworks) and the tracing infrastructure, i.e. a tracer implementation such as Jaeger running inside the service process]
  14. #### opentracing semantics

    * `trace` = tree of `spans` (i.e. DAG)
    * `span`: `service_name` + `operation` + `tags` + `logs` + `latency`
    * `span` identified by `context`
    * `spanContext`: `traceId` + `spanId` + `baggage`

    (these pieces are sketched in code below)
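
A small sketch of these semantics using the `io.opentracing` Java API (version 0.33 assumed), with the `opentracing-mock` tracer so it runs standalone; a production setup would register a real tracer such as Jaeger instead.

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.mock.MockTracer;

import java.util.Map;

// A trace as a tree of spans: a parent span with one child, each carrying
// an operation name, tags, logs, and (via start/finish) latency.
public class OpenTracingSemantics {
    public static void main(String[] args) {
        Tracer tracer = new MockTracer(); // in-memory tracer; swap for Jaeger in production

        Span parent = tracer.buildSpan("http_request")
                .withTag("span.kind", "server")
                .withTag("http.url", "/checkout")
                .start();

        Span child = tracer.buildSpan("db_query")
                .asChildOf(parent) // same traceId, new spanId
                .withTag("db.type", "sql")
                .start();
        child.log(Map.of("event", "query_executed", "rows", 42));
        child.finish(); // latency = finish - start

        parent.finish();

        // The spanContext identifies each span within the trace.
        System.out.println("traceId=" + parent.context().toTraceId()
                + " spanId=" + parent.context().toSpanId());
    }
}
```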
  15. ### context propagation

    > “the efficient implementation of the _happened-before join_ requires advice in one tracepoint to send information along the execution path to advice in subsequent tracepoints. this is done through a new **baggage abstraction**, which uses _causal metadata propagation_” - pivot tracing
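
A sketch of that baggage abstraction with the same OpenTracing mock setup: metadata set on the caller's span travels inside the injected context, is extracted in the callee, and is inherited by descendant spans. The plain `Map` stands in for real HTTP headers.

```java
import io.opentracing.Span;
import io.opentracing.SpanContext;
import io.opentracing.Tracer;
import io.opentracing.mock.MockTracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;

import java.util.HashMap;
import java.util.Map;

// Baggage: causal metadata propagated along the execution path.
public class BaggagePropagation {
    public static void main(String[] args) {
        Tracer tracer = new MockTracer();

        // Caller process: set baggage and inject the context into "headers".
        Span caller = tracer.buildSpan("caller").start();
        caller.setBaggageItem("tenant", "acme"); // visible to every descendant
        Map<String, String> headers = new HashMap<>();
        tracer.inject(caller.context(), Format.Builtin.HTTP_HEADERS,
                new TextMapAdapter(headers));
        caller.finish();

        // Callee process: extract the context and read the baggage back.
        SpanContext extracted = tracer.extract(Format.Builtin.HTTP_HEADERS,
                new TextMapAdapter(headers));
        Span callee = tracer.buildSpan("callee").asChildOf(extracted).start();
        System.out.println("tenant=" + callee.getBaggageItem("tenant")); // -> acme
        callee.finish();
    }
}
```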
  16. ### sampling

    * < 10 events per second, don’t sample.
    * if you decide to sample, think about the characteristics in your traffic that you want to preserve and use those fields to guide your sample rate - honeycomb.io

    (one way to apply this is sketched below)
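
One possible reading of that advice, as a hypothetical sketch: hash a field you care about (here a made-up customer id) so the keep/drop decision is consistent per key rather than uniformly random, keeping whole slices of traffic intact instead of losing rare keys.

```java
// Hypothetical consistent sampler: the sampling decision is keyed on a
// chosen traffic field, so all events for a given key stay together.
public class FieldBasedSampler {
    private final double sampleRate; // e.g. 0.1 keeps roughly 10% of keys

    FieldBasedSampler(double sampleRate) {
        this.sampleRate = sampleRate;
    }

    boolean shouldSample(String fieldValue) {
        // floorMod keeps the bucket non-negative for negative hash codes.
        int bucket = Math.floorMod(fieldValue.hashCode(), 10_000);
        return bucket < sampleRate * 10_000;
    }

    public static void main(String[] args) {
        FieldBasedSampler sampler = new FieldBasedSampler(0.1);
        for (String customer : new String[] {"acme", "globex", "initech"}) {
            System.out.println(customer + " sampled=" + sampler.shouldSample(customer));
        }
    }
}
```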
  17. ### observability at twitter

    > “these are the _four pillars_ of the **Observability Engineering team’s** charter:
    > * monitoring
    > * alerting/visualization
    > * distributed systems tracing infrastructure
    > * log aggregation/analytics” - twitter, 2013
  18. ### observability at google

    > the **holistic approach** to be able to **observe** systems
    > we observe systems via **various signals**: metrics, traces, logs, events, … - @rakyll
  19. ### observability is a **superset** of:

    > **monitoring**: how you operate a system (_known unknowns_)
    > **instrumentation**: how you develop a system to be monitorable

    and about _making systems more_:

    > **debuggable**: tracking down failures and bugs
    > **understandable**: answer questions, trends - Charity Majors
  20. ### observability for **unknown unknowns**

    > “A good example of something that needs “monitoring” would be a storage server running out of disk space or a proxy server running out of file descriptors. An I/O-bound service has different failure modes compared to a memory-bound one. An HA system has different failure modes compared to a CP system.”
    > “in essence “Observability” captures what “monitoring” doesn’t (and ideally, shouldn’t).” - Charity Majors
  21. ### logging v. instrumentation

    > services should **only log actionable data**
    > logs should be **treated as event streams**
    > understand that **logging is expensive**
    > services should **instrument every meaningful number available for capture**

    3 metrics to get started, from the USE method to the RED method:
    * host-oriented (USE): utilization, saturation, errors
    * app-oriented (RED): requests, errors, and duration

    (a bare-bones RED sketch follows below)
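
A minimal, self-contained sketch of RED-style instrumentation (requests, errors, duration); a real service would use a metrics library such as Micrometer or a Prometheus client rather than these hand-rolled counters.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hand-rolled RED metrics to show the idea: count every request, count
// failures, and accumulate wall-clock duration around the handler.
public class RedMetrics {
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong errors = new AtomicLong();
    private final AtomicLong totalDurationNanos = new AtomicLong();

    <T> T record(Supplier<T> handler) {
        requests.incrementAndGet();
        long start = System.nanoTime();
        try {
            return handler.get();
        } catch (RuntimeException e) {
            errors.incrementAndGet();
            throw e;
        } finally {
            totalDurationNanos.addAndGet(System.nanoTime() - start);
        }
    }

    public static void main(String[] args) {
        RedMetrics metrics = new RedMetrics();
        metrics.record(() -> "ok"); // a successful request
        try {
            metrics.record(() -> { throw new RuntimeException("boom"); });
        } catch (RuntimeException ignored) { }
        System.out.printf("requests=%d errors=%d avg_ms=%.3f%n",
                metrics.requests.get(), metrics.errors.get(),
                metrics.totalDurationNanos.get() / 1e6 / metrics.requests.get());
    }
}
```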
  22. ## `canopy`: how facebook does it

    > `canopy` constructs traces by propagating identifiers through the system to correlate information across components. **challenges** about this:
    > * end-to-end data is heterogeneous
    > * [...] consuming instrumented data directly is cumbersome and infeasible at scale.
  23. ## `canopy`: how facebook does it

    > evaluating interactive queries over raw traces is computationally infeasible, because Facebook captures over one billion traces per day.
    > unless we provide further abstractions (on top of traces and events) users will have to consume trace data directly, which entails complicated queries to extract simple high-level features.
    > **users should be able to view traces through a lens appropriate for their particular tasks**.
  24. ### distributed systems verification

    * unit-testing > testing error handling code could have prevented 58% of catastrophic failures
    * integration-testing > 3 nodes or less can reproduce 98% of failures
    * property-based testing > **caution**: passing tests does not ensure correctness

    - “The Verification of a Distributed System” by **Caitie McCaffrey**
  25. ### distributed systems verification

    * formal verification: TLA+
    * fault-injection: chaos engineering > **without explicitly forcing a system to fail, it is unreasonable to have any confidence it will operate correctly in failure mode** (a toy injector is sketched below)
    * testing in production: canaries

    - “The Verification of a Distributed System” by **Caitie McCaffrey**
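
A toy illustration of the fault-injection point above, and a hypothetical wrapper rather than any particular chaos tool: deliberately fail a fraction of calls to a dependency so the error-handling path gets exercised before a real outage does it for you.

```java
import java.util.Random;
import java.util.function.Supplier;

// Toy fault injector: forces a configurable fraction of dependency calls
// to fail so that failure-mode behavior can actually be observed.
public class FaultInjector {
    private final Random random = new Random();
    private final double failureRate;

    FaultInjector(double failureRate) {
        this.failureRate = failureRate;
    }

    <T> T call(Supplier<T> dependency) {
        if (random.nextDouble() < failureRate) {
            throw new RuntimeException("injected fault"); // simulated dependency failure
        }
        return dependency.get();
    }

    public static void main(String[] args) {
        FaultInjector injector = new FaultInjector(0.5); // fail roughly half the calls
        for (int i = 0; i < 4; i++) {
            try {
                System.out.println(injector.call(() -> "dependency ok"));
            } catch (RuntimeException e) {
                System.out.println("handled: " + e.getMessage()); // the path under test
            }
        }
    }
}
```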
  26. ### lineage-driven fault injection

    “Orchestrating Chaos: Applying Database Research in the Wild” - Peter Alvaro https://www.youtube.com/watch?v=YplkQu6a80Q
  30. > “the best way to find patterns in a system is looking at it from above” - Es Devlin, designer, in _Abstract: The Art of Design_ (Netflix)
  31. ### simulation with `simianviz`

    * what would my architecture look like if it grows?
    * monitoring tools often “explode on impact” with real-world use cases at scale
    * `spigo` (a.k.a. simianviz) is a tool that can produce output in any format to feed your monitoring tools from a laptop - Adrian Cockcroft
  32. ### vizceral - “pain suit”

    “Intuition Engineering at Netflix” - Justin Reynolds https://vimeo.com/173607639
  33. ## references

    ### papers
    * **dapper** https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36356.pdf
    * **canopy** http://cs.brown.edu/~jcmace/papers/kaldor2017canopy.pdf
    * **automating failure testing research at internet scale** https://people.ucsc.edu/~palvaro/socc16.pdf
    * data on the outside vs data on the inside http://cidrdb.org/cidr2005/papers/P12.pdf
    * pivot tracing http://sigops.org/sosp/sosp15/current/2015-Monterey/printable/122-mace.pdf

    ### articles
    * ok log https://peter.bourgon.org/ok-log/
    * logs - 12 factor application https://12factor.net/logs
    * the problem with logging https://blog.codinghorror.com/the-problem-with-logging/
    * logging v. instrumentation https://peter.bourgon.org/blog/2016/02/07/logging-v-instrumentation.html
    * logs and metrics https://medium.com/@copyconstruct/logs-and-metrics-6d34d3026e38
    * measure anything, measure everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
    * metrics, tracing and logging https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
    * monitoring and observability https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
    * monitoring in the time of cloud native https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
    * sre book https://landing.google.com/sre/book/index.html
    * distributed tracing at uber https://eng.uber.com/distributed-tracing/
    * spigo and simianviz https://github.com/adrianco/spigo
    * observability: what’s in a name? https://honeycomb.io/blog/2017/08/observability-whats-in-a-name/
    * wtf is operations? #serverless https://charity.wtf/2016/05/31/wtf-is-operations-serverless/
    * event foo: what should i add to an event https://honeycomb.io/blog/2017/08/event-foo-what-should-i-add-to-an-event/
  34. ## references

    ### articles (continued)
    * Google’s approach to Observability https://medium.com/@rakyll/googles-approach-to-observability-frameworks-c89fc1f0e058
    * Microservices and Observability https://medium.com/@rakyll/microservices-observability-26a8b7056bb4
    * Best Practices for Observability https://honeycomb.io/blog/2017/11/best-practices-for-observability/
    * https://thenewstack.io/dev-ops-doesnt-matter-need-observability/

    ### talks
    * “Observability for Emerging Infra: What Got You Here Won't Get You There” by Charity Majors https://www.youtube.com/watch?v=1wjovFSCGhE
    * “The Verification of a Distributed System” by Caitie McCaffrey https://www.youtube.com/watch?v=kDh5BrqiGhI
    * “Mastering Chaos - A Netflix Guide to Microservices” by Josh Evans https://www.youtube.com/watch?v=CZ3wIuvmHeM
    * “Monitoring Microservices” by Tom Wilkie https://www.youtube.com/watch?v=emaPPg_zxb4
    * “Microservice application tracing standards and simulations” by Adrian Cole and Adrian Cockcroft https://www.slideshare.net/adriancockcroft/microservices-application-tracing-standards-and-simulators-adrians-at-oscon
    * “Intuition Engineering at Netflix” by Justin Reynolds https://vimeo.com/173607639
    * “Distributed Tracing: Understanding how all your components work together” by José Carlos Chávez https://speakerdeck.com/jcchavezs/distributed-tracing-understanding-how-your-all-your-components-work-together
    * “Monitoring isn't just an accident” https://docs.google.com/presentation/d/1IEJIaQoCjzBsVq0h2Y7qcsWRWPS5lYt9CS2Jl25eurc/edit#slide=id.g327c9fd948_0_534
    * “Orchestrating Chaos: Applying Database Research in the Wild” by Peter Alvaro https://www.youtube.com/watch?v=YplkQu6a80Q
  35. ## references

    ### articles (continued)
    * “The Verification of a Distributed System” by Caitie McCaffrey https://github.com/CaitieM20/Talks/tree/master/TheVerificationOfADistributedSystem
    * “Testing in Production” by Charity Majors https://opensource.com/article/17/8/testing-production
    * “Data on the outside vs Data on the inside - Review” by Adrian Colyer https://blog.acolyer.org/2016/09/13/data-on-the-outside-versus-data-on-the-inside/