Let's talk about Tracing

Tracing plays an important role in the CNCF: OpenTracing was accepted as its third hosted project, after Kubernetes and Prometheus. Experienced developers should understand the role and importance of distributed tracing when building microservices at scale. Unlike logging and metrics monitoring, however, tracing instrumentation needs to propagate the tracing context both within and between processes: logging tells you what happened in each process, monitoring gives you insight into your system, and tracing tells the story of your service. Standard instrumentation is therefore required, and that effort is now strongly aligned with the CNCF. This talk walks through the tracing topic: what distributed tracing is (a brief history since Google Dapper), followed by introductions to OpenTracing, Jaeger, OpenCensus, and others.

Cheng-Lung Sung

October 09, 2019

Transcript

  1. Let’s talk about tracing Cheng-Lung Sung (clsung@)

  2. About.Me/clsung • Tech Lead of Data Research & Development Center,

    CTBC Bank • RD Head of Product Development, HTC Health Care (DeepQ) • Cloud Service Infrastructure (Golang, Python) • Mobile App Development (Golang, Java, Swift, Node.js) • Deep Learning AI Platform (Golang, Python, Node.js) • Open Source contributor • GitHub github.com/clsung • Golang golang.org/AUTHORS • Plurk API www.plurk.com/API • LINE API Expert www.line-community.me/contributors
  3. Outline • CNCF - Observability • History of distributed tracing

    • Google Dapper • Facebook Canopy • Introduction to distributed tracing • Open Source tracing frameworks and tools • OpenCensus, Jaeger, Appdash, Zipkin, … • OpenTelemetry = OpenCensus + OpenTracing
  4. CNCF Cloud Native Landscape

  5. CNCF Cloud Native Trail Map • Containerization • CI/CD • Orchestration & Application Definition

    • Observability & Analysis • Service Proxy, Discovery, & Mesh • Networking & Policy • Distributed Database & Storage • Streaming & Messaging • Container Registry & Runtime • Software Distribution
  6. None
  7. None
  8. Microservices Debugging Challenges What Where Why How Who

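  9. • What - What happened? Did something fail, or did it take too long? • Where - Which (micro)service has the problem?

    • Why - Under what circumstances does it happen? • How - What is the context?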
  10. –Cindy Sridharan, author of Distributed Systems Observability “An observable system

    is one that exposes enough data about itself so that generating information and easily accessing this information becomes simple.”
  11. Visualize the problem

  12. “The three treasures of DevOps: Logging, Tracing, Metrics”

  13. Pillars of Observability Metrics Logs Traces Visualization

  14. VI

  15. Visual Studio https://docs.microsoft.com/en-us/visualstudio/

  16. GDB https://blogs.msdn.microsoft.com

  17. Observability - Logs Stackdriver logging

  18. Observability - Logs • Tailer: tails a file and publishes it to a NATS server • Tailer publishes logs to gnatsd • Nail connects to only one natsd server and subscribes to a specific topic

    [Diagram labels: Host A Tailer, Host B Tailer, Host Z Tailer, nats daemon topic, Nail, Anywhere]
  19. Observability - Metrics Prometheus + Grafana

  20. Logging • Error messages and stack traces • Helpful for troubleshooting • But logs at scale become very expensive • Cannot be sampled • No context, no trace

    Metrics • CPU, RAM, error (rate), latency, throughput • Monitoring individual components of the system • Can be aggregated • Choose when you need alerts
  21. Tracing https://en.wikipedia.org/wiki/Animal_migration_tracking

  22. Observability - Tracing

  23. Distributed Tracing Timeline (2007 to 2018) • Magpie • Stardust • … • X-Trace https://ucsdnews.ucsd.edu/pressrelease/computer_scientists_honored_for_tracing_research_that_stood_10_year_test_of
  24. Google Dapper • Design requirements • Low overhead • Application-level

    transparency • Scalability • Available for analysis quickly
  25. Google Dapper • Tree • Span • Tree node •

    Annotation-based • vs Black-box (statistical)
  26. Facebook Canopy • Generate events • Propagate a TraceID •

    Instrumentation APIs • Emit events • Aggregate events • Raw traces to • Model construction • Feature extraction • Query and Visualization
  27. Observability - Tracing

  28. “Distributed tracing, also called distributed request tracing, is a method

    used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.”
  29. Distributed Tracing https://opentracing.io/docs/overview/

  30. • Commercial • Epsagon • LightStep • Datadog • Elastic

    • Instana • Wavefront • OpenSource • Appdash • Jaeger • Apache SkyWalking • Expedia Haystack
  31. [Diagram: Spans A to F laid out along a time axis, together forming one trace]

    spanA := tracer.StartSpan("Span A")
  32. Terminology (OpenTracing) • Time • Trace • Spans • Tags (Attributes) • Logs (Events) • SpanContext • Baggage (DistributedContext)

    [Diagram: Spans A to F laid out along a time axis, together forming one trace]
  33. • A trace is a data/execution path through the system, and can be thought of as a directed acyclic graph (DAG) of spans.

    • A trace is a collection of linked spans. The edges indicate the causal relationships (references) between spans. [Diagram: graph of spans A, B, C, D, E, F]
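As a rough illustration of slide 33, the sketch below models a trace as spans linked by parent references and walks the resulting graph. The `Span` type, the helper names, and the A to F edge layout are all assumptions made for this example; the slide does not specify which span references which, and real OpenTracing spans may carry more than one reference, which this single-parent simplification ignores.

```go
package main

import "fmt"

// Span is a hypothetical, minimal span record; it is not the OpenTracing type.
type Span struct {
	ID       string
	ParentID string // empty for the root span
}

// childrenOf indexes span IDs by parent ID, i.e. the edges of the trace graph.
func childrenOf(spans []Span) map[string][]string {
	edges := make(map[string][]string)
	for _, s := range spans {
		edges[s.ParentID] = append(edges[s.ParentID], s.ID)
	}
	return edges
}

// depth returns the number of spans on the longest causal chain starting at id.
func depth(edges map[string][]string, id string) int {
	longest := 0
	for _, child := range edges[id] {
		if d := depth(edges, child); d > longest {
			longest = d
		}
	}
	return longest + 1
}

func main() {
	// Assumed shape: A is the root, B and C are its children, and so on.
	spans := []Span{
		{"A", ""}, {"B", "A"}, {"C", "A"}, {"D", "B"}, {"E", "C"}, {"F", "C"},
	}
	edges := childrenOf(spans)
	fmt.Println("children of A:", edges["A"])
	fmt.Println("longest chain from A:", depth(edges, "A"))
}
```

Storing only the parent ID on each span is what lets spans be reported independently by different services and stitched into one trace later.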
  34. [Diagram: Spans A to F laid out along a time axis, together forming one trace]

    • A span is the execution of a client request. • A span represents an individual unit of work done in a distributed system. • https://www.w3.org/TR/trace-context/ • https://opentracing.io/docs/overview/spans/
  35. • An operation name • A start timestamp • A finish timestamp • A set of zero or more Span Tags • A set of zero or more Span Logs • A SpanContext (not shown here)

    // Tag the span with connection details and log the event.
    func (h *Tracer) gotConn(info httptrace.GotConnInfo) {
        h.sp.SetTag("net/http.reused", info.Reused)
        h.sp.SetTag("net/http.was_idle", info.WasIdle)
        h.sp.LogFields(log.String("event", "GotConn"))
    }
  36. • Span Tags represent contextual metadata relevant to a specific request.

    • Tags are key:value pairs that enable user-defined annotation of spans in order to query, filter, and comprehend trace data. • Keys are strings; values can be strings, numbers, or booleans. https://github.com/opentracing/specification/blob/master/semantic_conventions.md
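To make the "query, filter" point on slide 36 concrete, here is a minimal sketch using only the standard library. The `Span` struct and the `filterByTag` helper are hypothetical stand-ins, not the opentracing-go API; the tag keys `http.status_code` and `error` come from the OpenTracing semantic conventions linked on the slide.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical stand-in for the span fields listed on slide 35.
type Span struct {
	Operation string
	Start     time.Time
	Finish    time.Time
	Tags      map[string]interface{} // string keys; string, number, or bool values
}

// SetTag records one key:value annotation on the span.
func (s *Span) SetTag(key string, value interface{}) {
	if s.Tags == nil {
		s.Tags = make(map[string]interface{})
	}
	s.Tags[key] = value
}

// filterByTag returns the operation names of spans whose tag equals want.
// It assumes tag values are comparable (strings, numbers, booleans).
func filterByTag(spans []*Span, key string, want interface{}) []string {
	var names []string
	for _, s := range spans {
		if v, ok := s.Tags[key]; ok && v == want {
			names = append(names, s.Operation)
		}
	}
	return names
}

func main() {
	now := time.Now()
	ok := &Span{Operation: "GET /users", Start: now, Finish: now.Add(12 * time.Millisecond)}
	ok.SetTag("http.status_code", 200)

	bad := &Span{Operation: "GET /orders", Start: now, Finish: now.Add(850 * time.Millisecond)}
	bad.SetTag("http.status_code", 500)
	bad.SetTag("error", true)

	// Find every span that was tagged as failed.
	fmt.Println(filterByTag([]*Span{ok, bad}, "error", true))
}
```

Sticking to conventional tag keys is what allows a tracing backend to offer the same filters across services written by different teams.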
  37. None
  38. Four Myths about Distributed Tracing • I need distributed tracing

    to measure the health / latency / throughput of my services • I need distributed tracing to see which services talk to which other services • I can “get distributed tracing” from the service mesh without having to make any changes to my application • I can “get distributed tracing” from a service mesh by instrumenting my application with a distributed tracing library https://linkerd.io/2019/08/09/service-mesh-distributed-tracing-myths/
  39. Distributed Tracing Frameworks/Tools Timeline (2007 to 2018) • Magpie • Stardust • … • X-Trace • Google Dapper • Facebook Canopy • Twitter Zipkin • Sourcegraph Appdash • Uber Jaeger • OpenTracing • Google OpenCensus • Expedia Haystack
  40. Distributed Tracing Framework Major Components • Recorder • Agent and/or Collector

    • Storage • Memory/Queue • NoSQL • Visualization • Interactive UI • Analytics • Index • Search • Distributed Context Propagation
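The components on slide 40 can be pictured as a small pipeline: instrumented code records finished spans into a queue, a collector drains the queue, and storage indexes spans by trace ID so a UI can query them. The sketch below is only an in-memory toy under that reading; `Span`, `Store`, and `Collect` are hypothetical names and do not correspond to the internals of Zipkin, Jaeger, or any other framework.

```go
package main

import (
	"fmt"
	"sync"
)

// Span is a hypothetical finished span as reported by a recorder.
type Span struct {
	TraceID   string
	Operation string
}

// Store is an in-memory stand-in for the storage component, indexed by trace ID.
type Store struct {
	mu     sync.Mutex
	traces map[string][]Span
}

func NewStore() *Store {
	return &Store{traces: make(map[string][]Span)}
}

// Add files a span under its trace ID.
func (s *Store) Add(sp Span) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.traces[sp.TraceID] = append(s.traces[sp.TraceID], sp)
}

// Trace returns a copy of all spans stored for one trace ID.
func (s *Store) Trace(id string) []Span {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]Span(nil), s.traces[id]...)
}

// Collect drains the queue into the store and returns once the queue is closed.
func Collect(queue <-chan Span, store *Store) {
	for sp := range queue {
		store.Add(sp)
	}
}

func main() {
	queue := make(chan Span, 16) // buffered: sends do not block until the buffer fills
	store := NewStore()
	done := make(chan struct{})
	go func() {
		Collect(queue, store)
		close(done)
	}()

	// Recorder side: instrumented code reports finished spans.
	queue <- Span{TraceID: "t1", Operation: "frontend"}
	queue <- Span{TraceID: "t1", Operation: "db.query"}
	queue <- Span{TraceID: "t2", Operation: "frontend"}
	close(queue)
	<-done

	// Query side: what a visualization layer would ask the storage for.
	fmt.Println(len(store.Trace("t1")), "spans in trace t1")
}
```

Decoupling the recorder from the collector with a queue keeps span reporting off the request path, which is in the spirit of the low-overhead requirement listed for Dapper on slide 24.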
  41. Zipkin

  42. Jaeger

  43. Haystack

  44. OpenCensus

  45. Different Frameworks

  46. Distributed Tracing Frameworks/Tools Timeline (2007 to 2019) • Magpie • Stardust • … • X-Trace • Google Dapper • Facebook Canopy • Twitter Zipkin • Sourcegraph Appdash • Uber Jaeger • OpenTracing • Google OpenCensus • Expedia Haystack
  47. Transaction trace is lost because tools use different headers for context propagation. With standardized headers, traces don’t break (even for proprietary information). https://medium.com/@AloisReitbauer/trace-context-and-the-road-toward-trace-tool-interoperability-d4d56932369c
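The standardized header referred to on slide 47 is the W3C Trace Context `traceparent` header, whose version 00 value has the form `version-traceid-parentid-flags`. The following is a minimal sketch of parsing and re-emitting that value with the standard library only; `TraceContext`, `ParseTraceparent`, and the replacement span ID are names and values invented for this example, and the sketch ignores the companion `tracestate` header.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceContext holds the fields of a version 00 traceparent value.
type TraceContext struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters
	Flags    string // 2 hex characters; "01" means sampled
}

// Header renders the context back into the traceparent wire format.
func (tc TraceContext) Header() string {
	return "00-" + tc.TraceID + "-" + tc.ParentID + "-" + tc.Flags
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n || s != strings.ToLower(s) {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// ParseTraceparent validates and splits a version 00 traceparent value.
func ParseTraceparent(h string) (TraceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceContext{}, errors.New("unsupported traceparent: " + h)
	}
	tc := TraceContext{TraceID: parts[1], ParentID: parts[2], Flags: parts[3]}
	if !isLowerHex(tc.TraceID, 32) || !isLowerHex(tc.ParentID, 16) || !isLowerHex(tc.Flags, 2) {
		return TraceContext{}, errors.New("malformed traceparent: " + h)
	}
	// All-zero IDs are defined as invalid.
	if strings.Trim(tc.TraceID, "0") == "" || strings.Trim(tc.ParentID, "0") == "" {
		return TraceContext{}, errors.New("all-zero id in traceparent: " + h)
	}
	return tc, nil
}

func main() {
	incoming := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	tc, err := ParseTraceparent(incoming)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// A downstream call keeps the trace ID and substitutes this service's own span ID.
	tc.ParentID = "b7ad6b7169203331"
	fmt.Println(tc.Header())
}
```

Because every tool reads and writes the same header, the trace ID survives a hop through a service instrumented by a different vendor, which is the interoperability the slide argues for.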
  48. • Effective observability requires high-quality telemetry. • OpenTelemetry makes robust,

    portable telemetry a built-in feature of cloud-native software. • Provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application. • Distributed Tracing Working Group • Data formats for on-the-wire trace context & correlation-context, and out-of-band trace data • This specification defines formats to pass trace context information across systems. • Various tracing and diagnostics products can operate together. • https://opentelemetry.io/ • https://www.w3.org/2018/distributed-tracing/
  49. https://opentelemetry.io OpenTracing and OpenCensus: A Roadmap to Convergence

  50. KubeCon + CloudNativeCon North America 2019

  51. Summary • Gain a basic understanding of distributed tracing • Understand how tracing works and why to use it

    • Directions for choosing tracing tools in the future
  52. Thank you!