Let's talk about Tracing

Tracing plays an important role in the CNCF: OpenTracing was accepted as its third hosted project, after Kubernetes and Prometheus. Experienced developers should understand the role and importance of distributed tracing when building microservices at scale. Unlike logging and metrics monitoring, however, tracing instrumentation needs to propagate the tracing context both within and between processes. Logging tells you what happened in each process, monitoring gives you insight into your system, and tracing tells stories about your service. This calls for standard instrumentation, an effort that is now strongly aligned with the CNCF. In this talk we walk through the tracing topic: what distributed tracing is (with a brief history since Google Dapper), followed by introductions to OpenTracing, Jaeger, OpenCensus, and others.

Cheng-Lung Sung

October 09, 2019

Transcript

  1. About.Me/clsung • Tech Lead of Data Research & Development Center,

    CTBC Bank • RD Head of Product Development, HTC Health Care (DeepQ) • Cloud Service Infrastructure (Golang, Python) • Mobile App Development (Golang, Java, Swift, Node.js) • Deep Learning AI Platform (Golang, Python, Node.js) • Open Source contributor • GitHub github.com/clsung • Golang golang.org/AUTHORS • Plurk API www.plurk.com/API • LINE API Expert www.line-community.me/contributors
  2. Outline • CNCF - Observability • History of distributed tracing

    • Google Dapper • Facebook Canopy • Introduction to distributed tracing • Open Source tracing frameworks and tools • OpenCensus, Jaeger, Appdash, Zipkin, … • OpenTelemetry = OpenCensus + OpenTracing
  3. • Containerization • CI/CD • Orchestration & Application Definition •

    Observability & Analysis • Service Proxy, Discovery, & Mesh • Networking & Policy • Distributed Database & Storage • Streaming & Messaging • Container Registry & Runtime • Software Distribution CNCF Cloud Native Trail Map
  4. –Cindy Sridharan, author of Distributed Systems Observability “An observable system

    is one that exposes enough data about itself so that generating information and easily accessing this information becomes simple.”
  5. VI

  6. Observability - Logs Tailer: Tail file and publish it to

    NATS Server Host B Tailer Anywhere nats daemon topic Host Z Tailer Nail connects to only one natsd server and subscribes to a specific topic Tailer publishes logs to gnatsd Host A Tailer Nail
  7. Logging • Error messages and stack traces • Helpful for troubleshooting • But logs at scale become very expensive • Cannot be sampled • No context, no trace

    Metrics • CPU, RAM, error (rate), latency, throughput • Monitoring individual components of the system • Can be aggregated • Choose when you need alerts
  8. Distributed Tracing Timeline (axis: 2007 to 2018) • Magpie • Stardust • … • X-Trace

    https://ucsdnews.ucsd.edu/pressrelease/computer_scientists_honored_for_tracing_research_that_stood_10_year_test_of
  9. Google Dapper • Design requirement • Low overhead • Application-level

    transparency • Scalability • Available for analysis quickly
  10. Google Dapper • Tree • Span • Tree node •

    Annotation-based • vs Black-box (statistical)
  11. Facebook Canopy • Generate events • Propagate a TraceID •

    Instrumentation APIs • Emit events • Aggregate events • Raw traces to • Model construction • Feature extraction • Query and Visualization
  12. “Distributed tracing, also called distributed request tracing, is a method

    used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.”
  13. • Commercial • Epsagon • LightStep • Datadog • Elastic

    • Instana • Wavefront • OpenSource • Appdash • Jaeger • Apache SkyWalking • Expedia Haystack
  14. Span A Span B Span C Span D Span E

    Time Trace Spans Span F spanA := tracer.StartSpan("Span A")
  15. Terminology (OpenTracing) • Time • Trace • Spans • Tags

    (Attributes) • Logs (Events) • SpanContext • Baggage (DistributedContext) Span A Span B Span C Span D Span E Time Trace Spans Span F
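To make the SpanContext and Baggage terms concrete, here is a minimal sketch of in-process context propagation using only the Go standard library. The `SpanContext` struct, `WithSpanContext`, `FromContext`, `ChildOf`, and `handle` are hypothetical names invented for illustration; they are not the OpenTracing API. The idea shown is that a child span keeps the parent's trace ID, gets its own span ID, and inherits a copy of the baggage.

```go
package main

import (
	"context"
	"fmt"
)

// SpanContext is a hypothetical, simplified stand-in for the state that must
// travel with a request: trace identity plus user-defined baggage.
type SpanContext struct {
	TraceID string
	SpanID  string
	Baggage map[string]string
}

// ctxKey is an unexported key type so other packages cannot collide with it.
type ctxKey struct{}

// WithSpanContext stores sc in ctx for callees in the same process.
func WithSpanContext(ctx context.Context, sc SpanContext) context.Context {
	return context.WithValue(ctx, ctxKey{}, sc)
}

// FromContext retrieves the SpanContext stored in ctx, if any.
func FromContext(ctx context.Context) (SpanContext, bool) {
	sc, ok := ctx.Value(ctxKey{}).(SpanContext)
	return sc, ok
}

// ChildOf derives the context of a child span: same trace ID, a new span ID,
// and a copy of the parent's baggage so later changes do not leak upward.
func ChildOf(parent SpanContext, spanID string) SpanContext {
	baggage := make(map[string]string, len(parent.Baggage))
	for k, v := range parent.Baggage {
		baggage[k] = v
	}
	return SpanContext{TraceID: parent.TraceID, SpanID: spanID, Baggage: baggage}
}

// handle plays the role of a downstream function that continues the trace.
func handle(ctx context.Context) SpanContext {
	parent, ok := FromContext(ctx)
	if !ok {
		// No incoming context: this call would start a new trace.
		return SpanContext{TraceID: "new-trace", SpanID: "root"}
	}
	return ChildOf(parent, "span-b")
}

func main() {
	root := SpanContext{
		TraceID: "trace-1",
		SpanID:  "span-a",
		Baggage: map[string]string{"tenant": "demo"},
	}
	ctx := WithSpanContext(context.Background(), root)
	child := handle(ctx)
	fmt.Println(child.TraceID, child.SpanID, child.Baggage["tenant"])
}
```

Between processes the same fields would have to be serialized into request headers, which is where the header standards discussed later in the deck come in.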
  16. • A trace is a data/execution path through the system,

    and can be thought of as a directed acyclic graph (DAG) of spans. • A trace is a collection of linked spans. The edges indicate the causal relationships (references) between spans. A B C D E F
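The DAG view of a trace can be sketched in a few lines of Go. This is an illustrative model only: the `Span` struct, its `Parents` field, and `IsAcyclic` are hypothetical names, and the edges between spans A to F in `main` are invented for the example, since the slide's diagram does not survive in the transcript. The function checks that the reference edges between spans contain no cycle.

```go
package main

import "fmt"

// Span is a hypothetical, minimal span record: an ID plus the IDs of the
// spans it references (its causal predecessors).
type Span struct {
	ID      string
	Parents []string
}

// IsAcyclic reports whether the reference edges between spans form a DAG.
// It runs a depth-first search and fails as soon as it finds a back edge.
func IsAcyclic(spans []Span) bool {
	parents := make(map[string][]string, len(spans))
	for _, s := range spans {
		parents[s.ID] = s.Parents
	}

	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int, len(spans))

	var visit func(id string) bool
	visit = func(id string) bool {
		switch state[id] {
		case inProgress:
			return false // back edge: the references loop
		case done:
			return true
		}
		state[id] = inProgress
		for _, p := range parents[id] {
			if !visit(p) {
				return false
			}
		}
		state[id] = done
		return true
	}

	for _, s := range spans {
		if !visit(s.ID) {
			return false
		}
	}
	return true
}

func main() {
	// Edges are assumed for illustration; they are not taken from the slide.
	trace := []Span{
		{ID: "A"},
		{ID: "B", Parents: []string{"A"}},
		{ID: "C", Parents: []string{"A"}},
		{ID: "D", Parents: []string{"B"}},
		{ID: "E", Parents: []string{"C"}},
		{ID: "F", Parents: []string{"C"}},
	}
	fmt.Println("trace is a DAG:", IsAcyclic(trace))
}
```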
  17. Span A Span B Span C Span D Span E

    Time Trace Spans Span F • A span is the execution of a client request. • A span represents an individual unit of work done in a distributed system. • https://www.w3.org/TR/trace-context/ • https://opentracing.io/docs/overview/spans/
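  18. • An operation name • A start timestamp • A finish timestamp • A set of zero or more Span Tags • A set of zero or more Span Logs • A SpanContext (not shown here)

    // Assumes imports of net/http/httptrace and
    // github.com/opentracing/opentracing-go/log, and that h.sp is an opentracing.Span.
    // gotConn tags the span with connection-reuse details and logs the event.
    func (h *Tracer) gotConn(info httptrace.GotConnInfo) {
        h.sp.SetTag("net/http.reused", info.Reused)
        h.sp.SetTag("net/http.was_idle", info.WasIdle)
        h.sp.LogFields(log.String("event", "GotConn"))
    }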
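The span fields listed above map naturally onto a small struct. The following is a standard-library sketch, not the opentracing-go implementation: `Span`, `LogEntry`, `StartSpan`, `LogEvent`, `FinishAt`, and `demoSpan` are hypothetical names, and timestamps are passed in explicitly so the example is deterministic. The SpanContext is left out, as on the slide.

```go
package main

import (
	"fmt"
	"time"
)

// LogEntry is a timestamped event recorded on a span.
type LogEntry struct {
	Time   time.Time
	Fields map[string]string
}

// Span is a hypothetical record holding the fields named on the slide.
type Span struct {
	OperationName string
	Start         time.Time
	Finish        time.Time
	Tags          map[string]interface{} // zero or more tags
	Logs          []LogEntry             // zero or more logs
}

// StartSpan creates a span with an operation name and a start timestamp.
func StartSpan(op string, now time.Time) *Span {
	return &Span{OperationName: op, Start: now, Tags: map[string]interface{}{}}
}

// SetTag attaches a key:value annotation that applies to the whole span.
func (s *Span) SetTag(key string, value interface{}) {
	s.Tags[key] = value
}

// LogEvent records something that happened at one instant during the span.
func (s *Span) LogEvent(now time.Time, event string) {
	s.Logs = append(s.Logs, LogEntry{Time: now, Fields: map[string]string{"event": event}})
}

// FinishAt sets the finish timestamp.
func (s *Span) FinishAt(now time.Time) {
	s.Finish = now
}

// Duration is the time between the start and finish timestamps.
func (s *Span) Duration() time.Duration {
	return s.Finish.Sub(s.Start)
}

// demoSpan builds a span resembling the gotConn example, using fixed times.
func demoSpan() *Span {
	t0 := time.Date(2019, 10, 9, 0, 0, 0, 0, time.UTC)
	sp := StartSpan("HTTP GET", t0)
	sp.SetTag("net/http.reused", true)
	sp.LogEvent(t0.Add(5*time.Millisecond), "GotConn")
	sp.FinishAt(t0.Add(20 * time.Millisecond))
	return sp
}

func main() {
	sp := demoSpan()
	fmt.Println(sp.OperationName, sp.Duration(), len(sp.Tags), len(sp.Logs))
}
```

The distinction the sketch tries to show is that tags describe the span as a whole, while logs are pinned to a moment inside it.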
  19. • Span Tags represent contextual metadata relevant to a specific

    request. • Tags are key:value pairs that enable user-defined annotation of spans in order to query, filter, and comprehend trace data. • Keys are strings; values can be strings, numbers, or booleans. https://github.com/opentracing/specification/blob/master/semantic_conventions.md
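As a rough illustration of why tags make trace data queryable, the sketch below filters a set of spans by a tag value. The `Span` struct and `FilterByTag` are hypothetical helpers, not part of any tracing library; the tag keys `http.method`, `http.status_code`, and `error` follow the OpenTracing semantic conventions linked above. Note that the comparison is type-sensitive, so an `int` 500 does not match an `int64` 500.

```go
package main

import "fmt"

// Span is a hypothetical, minimal span: an operation name and its tags.
type Span struct {
	Operation string
	Tags      map[string]interface{} // values: strings, numbers, or booleans
}

// FilterByTag returns the spans whose tag `key` equals `want`.
// It assumes tag values are comparable (strings, numbers, booleans).
func FilterByTag(spans []Span, key string, want interface{}) []Span {
	var out []Span
	for _, s := range spans {
		if got, ok := s.Tags[key]; ok && got == want {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	spans := []Span{
		{Operation: "GET /decks", Tags: map[string]interface{}{"http.method": "GET", "http.status_code": 200}},
		{Operation: "POST /decks", Tags: map[string]interface{}{"http.method": "POST", "http.status_code": 500, "error": true}},
		{Operation: "GET /users", Tags: map[string]interface{}{"http.method": "GET", "http.status_code": 200}},
	}

	for _, s := range FilterByTag(spans, "error", true) {
		fmt.Println("failed:", s.Operation)
	}
	fmt.Println("GET spans:", len(FilterByTag(spans, "http.method", "GET")))
}
```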
  20. Four Myths about Distributed Tracing • I need distributed tracing

    to measure the health / latency / throughput of my services • I need distributed tracing to see which services talk to which other services • I can “get distributed tracing” from the service mesh without having to make any changes to my application • I can “get distributed tracing” from a service mesh by instrumenting my application with a distributed tracing library https://linkerd.io/2019/08/09/service-mesh-distributed-tracing-myths/
  21. Distributed Tracing Frameworks/Tools Timeline (axis: 2007 to 2018) • Magpie • Stardust • … • X-Trace • Google Dapper • Twitter Zipkin • Sourcegraph Appdash • Uber Jaeger • Facebook Canopy • OpenTracing • Google OpenCensus • Expedia Haystack
  22. Distributed Tracing Framework
 Major Components • Recorder • Agent and/or

    • Collector • Storage • Memory/Queue • NoSQL • Visualization • Interactive UI • Analytics • Index • Search • Distributed Context Propagation
  23. Distributed Tracing Frameworks/Tools Timeline (axis: 2007 to 2019) • Magpie • Stardust • … • X-Trace • Google Dapper • Twitter Zipkin • Sourcegraph Appdash • Uber Jaeger • Facebook Canopy • OpenTracing • Google OpenCensus • Expedia Haystack
  24. • Transaction trace is lost because tools use different headers for context propagation • With standardized headers, traces don't break (even for proprietary information) • https://medium.com/@AloisReitbauer/trace-context-and-the-road-toward-trace-tool-interoperability-d4d56932369c
  25. • Effective observability requires high-quality telemetry. • OpenTelemetry makes robust,

    portable telemetry a built-in feature of cloud-native software. • Provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application. • Distributed Tracing Working Group • Data formats for on-the-wire trace context & correlation-context, and out-of-band trace data • This specification defines formats to pass trace context information across systems. • Various tracing and diagnostics products can operate together. • https://opentelemetry.io/ • https://www.w3.org/2018/distributed-tracing/
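The on-the-wire format mentioned above can be illustrated with the W3C Trace Context `traceparent` header, which carries four dash-separated fields: version, trace-id, parent-id, and trace-flags. The parser below is a simplified sketch that handles only version `00`; the `TraceParent` type and the `ParseTraceParent` function are hypothetical names, and the header value in `main` is the example used in the W3C specification.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TraceParent holds the four fields of a version-00 traceparent header.
type TraceParent struct {
	Version  string // "00"
	TraceID  string // 32 lowercase hex characters, not all zero
	ParentID string // 16 lowercase hex characters, not all zero
	Flags    uint8  // bit 0 is the "sampled" flag
}

// isLowerHex reports whether s has exactly n characters, all in [0-9a-f].
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceParent validates and splits a traceparent header value.
// Simplification: versions other than 00 are rejected rather than
// parsed in a forward-compatible way.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: want 4 dash-separated fields")
	}
	if parts[0] != "00" {
		return TraceParent{}, errors.New("traceparent: only version 00 is handled in this sketch")
	}
	if !isLowerHex(parts[1], 32) || strings.Trim(parts[1], "0") == "" {
		return TraceParent{}, errors.New("traceparent: invalid trace-id")
	}
	if !isLowerHex(parts[2], 16) || strings.Trim(parts[2], "0") == "" {
		return TraceParent{}, errors.New("traceparent: invalid parent-id")
	}
	if !isLowerHex(parts[3], 2) {
		return TraceParent{}, errors.New("traceparent: invalid trace-flags")
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	return TraceParent{Version: parts[0], TraceID: parts[1], ParentID: parts[2], Flags: uint8(flags)}, nil
}

// Sampled reports whether the caller recorded this trace.
func (tp TraceParent) Sampled() bool { return tp.Flags&0x01 != 0 }

// String renders the header value for the outgoing request.
func (tp TraceParent) String() string {
	return fmt.Sprintf("%s-%s-%s-%02x", tp.Version, tp.TraceID, tp.ParentID, tp.Flags)
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// A service keeps the trace-id and substitutes its own span ID
	// (an arbitrary value here) before calling downstream.
	tp.ParentID = "b7ad6b7169203331"
	fmt.Println("sampled:", tp.Sampled())
	fmt.Println("outgoing traceparent:", tp)
}
```

Because every tool reads and writes the same header, the trace-id survives hops through services instrumented by different vendors, which is the interoperability point made on the previous slide.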