Let's talk about Tracing

Let’s talk about tracing Cheng-Lung Sung (clsung@)

About.Me/clsung
• Tech Lead of Data Research & Development Center, CTBC Bank
• RD Head of Product Development, HTC Health Care (DeepQ)
  • Cloud Service Infrastructure (Golang, Python)
  • Mobile App Development (Golang, Java, Swift, Node.js)
  • Deep Learning AI Platform (Golang, Python, Node.js)
• Open Source contributor
  • GitHub github.com/clsung
  • Golang golang.org/AUTHORS
  • Plurk API www.plurk.com/API
  • LINE API Expert www.line-community.me/contributors

Outline
• CNCF - Observability
• History of distributed tracing
  • Google Dapper
  • Facebook Canopy
• Introduction to distributed tracing
• Open Source tracing frameworks and tools
  • OpenCensus, Jaeger, Appdash, Zipkin, …
• OpenTelemetry = OpenCensus + OpenTracing

CNCF Cloud Native Landscape

CNCF Cloud Native Trail Map
• Containerization
• CI/CD
• Orchestration & Application Definition
• Observability & Analysis
• Service Proxy, Discovery, & Mesh
• Networking & Policy
• Distributed Database & Storage
• Streaming & Messaging
• Container Registry & Runtime
• Software Distribution

Microservices Debugging Challenges What Where Why How Who

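• What - What happened? Did it fail, or is it taking too long?
• Where - Which (micro)service has the problem?
• Why - Under what circumstances does it happen?
• How - What is the context?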

“An observable system is one that exposes enough data about itself so that generating information and easily accessing this information becomes simple.”
–Cindy Sridharan, author of Distributed Systems Observability

Visualize the problem

“The three treasures of DevOps: Logging, Tracing, Metrics”

Pillars of Observability Metrics Logs Traces Visualization

Visual Studio https://docs.microsoft.com/en-us/visualstudio/

GDB https://blogs.msdn.microsoft.com

Observability - Logs Stackdriver logging

Observability - Logs
Tailer: tails a file and publishes it to a NATS server
[Diagram: Tailers on Host A, Host B, … Host Z publish logs to a topic on the gnatsd daemon; a Nail, running anywhere, connects to only one natsd server and subscribes to a specific topic]

Observability - Metrics Prometheus + Grafana

Logging
• Error messages and stack traces
• Helpful for troubleshooting
• But logs at scale become very expensive
• Cannot be sampled
• No context, no trace

Metrics
• CPU, RAM, error (rate), latency, throughput
• Monitoring individual components of the system
• Can be aggregated
• Choose when you need alerts

Tracing https://en.wikipedia.org/wiki/Animal_migration_tracking

Observability - Tracing

Distributed Tracing Timeline
[Timeline, 2007 to 2018: Magpie, Stardust, …, X-Trace]
https://ucsdnews.ucsd.edu/pressrelease/computer_scientists_honored_for_tracing_research_that_stood_10_year_test_of

Google Dapper
• Design requirements
  • Low overhead
  • Application-level transparency
  • Scalability
  • Available for analysis quickly
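
A low-overhead requirement like Dapper's is commonly met by sampling, so that only a fraction of traces is recorded. The sketch below is a minimal illustration of that idea, not Dapper's actual implementation: `shouldSample`, the 64-bit trace ID, and the rate handling are all assumptions made for the example. The decision is derived from the trace ID so that every service reaches the same verdict for the same trace.

```go
package main

import "math"

// shouldSample reports whether a trace should be recorded.
// Hypothetical helper: it keeps a trace when its ID falls into the lowest
// `rate` fraction of the 64-bit ID space, so the decision is deterministic
// per trace and needs no coordination between services.
func shouldSample(traceID uint64, rate float64) bool {
	if rate <= 0 {
		return false // sampling disabled
	}
	if rate >= 1 {
		return true // record everything
	}
	threshold := uint64(rate * float64(math.MaxUint64))
	return traceID < threshold
}
```

With randomly generated trace IDs, `shouldSample(id, 0.01)` would keep roughly one trace in a hundred.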

Google Dapper
• Tree
• Span
  • Tree node
• Annotation-based
  • vs Black-box (statistical)

Facebook Canopy
• Generate events
• Propagate a TraceID
• Instrumentation APIs
• Emit events
• Aggregate events
• Raw traces to
• Model construction
• Feature extraction
• Query and Visualization

Observability - Tracing

“Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.”

Distributed Tracing https://opentracing.io/docs/overview/

• Commercial
  • Epsagon
  • LightStep
  • Datadog
  • Elastic
  • Instana
  • Wavefront
• Open Source
  • Appdash
  • Jaeger
  • Apache SkyWalking
  • Expedia Haystack

[Diagram: a trace drawn as Spans A to F laid out along a time axis]
spanA := tracer.StartSpan("Span A")

Terminology (OpenTracing)
• Time
• Trace
• Spans
• Tags (Attributes)
• Logs (Events)
• SpanContext
• Baggage (DistributedContext)
[Diagram: a trace drawn as Spans A to F laid out along a time axis]

• A trace is a data/execution path through the system, and can be thought of as a directed acyclic graph (DAG) of spans.
• A trace is a collection of linked spans. The edges indicate the causal relationships (references) between spans.
[Diagram: graph of spans A, B, C, D, E, F]
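
The graph view of a trace can be sketched in a few lines of Go. This is a simplified model under stated assumptions: the `Span` type, `childrenOf`, and `depth` are hypothetical names, and each span has at most one parent, so the DAG degenerates into a tree.

```go
package main

// Span is a hypothetical, minimal node in a trace graph.
type Span struct {
	ID       string
	ParentID string // empty for the root span
}

// childrenOf indexes span IDs by parent ID; each entry is one set of
// causal edges (parent -> children) in the trace.
func childrenOf(spans []Span) map[string][]string {
	edges := make(map[string][]string)
	for _, s := range spans {
		if s.ParentID != "" {
			edges[s.ParentID] = append(edges[s.ParentID], s.ID)
		}
	}
	return edges
}

// depth returns the number of spans on the longest causal chain that
// starts at the span with the given ID.
func depth(edges map[string][]string, id string) int {
	longest := 0
	for _, child := range edges[id] {
		if d := depth(edges, child); d > longest {
			longest = d
		}
	}
	return longest + 1
}
```

For a trace where A calls B and C, and B calls D, `depth(childrenOf(spans), "A")` walks the chain A, B, D.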

[Diagram: a trace drawn as Spans A to F laid out along a time axis]
• The span is the execution of a client request.
• A span represents an individual unit of work done in a distributed system.
• https://www.w3.org/TR/trace-context/
• https://opentracing.io/docs/overview/spans/

• An operation name
• A start timestamp
• A finish timestamp
• A set of zero or more Span Tags
• A set of zero or more Span Logs
• A SpanContext (not shown here)

func (h *Tracer) gotConn(info httptrace.GotConnInfo) {
	// Record connection details as tags, then log the event on the span.
	h.sp.SetTag("net/http.reused", info.Reused)
	h.sp.SetTag("net/http.was_idle", info.WasIdle)
	h.sp.LogFields(log.String("event", "GotConn"))
}
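
To make the field list concrete, here is a minimal standard-library sketch of a span record. It is not the OpenTracing API: the `Span` and `LogEntry` types and the `StartSpan`, `SetTag`, `LogEvent`, `End`, and `Duration` names are assumptions chosen for illustration, and the SpanContext is left out.

```go
package main

import "time"

// LogEntry is a timestamped event attached to a span.
type LogEntry struct {
	Time  time.Time
	Event string
}

// Span is a hypothetical record holding an operation name, start and
// finish timestamps, tags, and logs.
type Span struct {
	OperationName string
	Start         time.Time
	Finish        time.Time
	Tags          map[string]interface{}
	Logs          []LogEntry
}

// StartSpan creates a span and records its start timestamp.
func StartSpan(operationName string) *Span {
	return &Span{
		OperationName: operationName,
		Start:         time.Now(),
		Tags:          make(map[string]interface{}),
	}
}

// SetTag attaches a key:value pair to the span.
func (s *Span) SetTag(key string, value interface{}) { s.Tags[key] = value }

// LogEvent appends a timestamped event to the span's logs.
func (s *Span) LogEvent(event string) {
	s.Logs = append(s.Logs, LogEntry{Time: time.Now(), Event: event})
}

// End records the finish timestamp.
func (s *Span) End() { s.Finish = time.Now() }

// Duration is the time between the start and finish timestamps.
func (s *Span) Duration() time.Duration { return s.Finish.Sub(s.Start) }
```

A caller would create the span with `StartSpan("GotConn")`, add tags and events while the work runs, and call `End()` when it completes.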

• Span Tags represent contextual metadata relevant to a specific request.
• Tags are key:value pairs that enable user-defined annotation of spans in order to query, filter, and comprehend trace data.
• Keys are strings; values can be strings, numbers, or booleans.
https://github.com/opentracing/specification/blob/master/semantic_conventions.md
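
The "query and filter" use of tags can be sketched as follows. The `Tags` and `Span` types and the `filterByTag` helper are hypothetical; only the tag keys `error` and `http.status_code` come from the OpenTracing semantic conventions. Values are compared with `==`, which is safe here because tag values are limited to strings, numbers, and booleans.

```go
package main

// Tags maps string keys to string, numeric, or boolean values.
type Tags map[string]interface{}

// Span is a hypothetical, minimal span carrying only a name and tags.
type Span struct {
	OperationName string
	Tags          Tags
}

// filterByTag returns the spans whose tag `key` is present and equal to
// `want`. Both the dynamic type and the value must match, so an int 500
// does not equal an int64 500.
func filterByTag(spans []Span, key string, want interface{}) []Span {
	var matched []Span
	for _, s := range spans {
		if v, ok := s.Tags[key]; ok && v == want {
			matched = append(matched, s)
		}
	}
	return matched
}
```

For example, `filterByTag(spans, "error", true)` would select only the spans that were marked as failed.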

Four Myths about Distributed Tracing
• I need distributed tracing to measure the health / latency / throughput of my services
• I need distributed tracing to see which services talk to which other services
• I can “get distributed tracing” from the service mesh without having to make any changes to my application
• I can “get distributed tracing” from a service mesh by instrumenting my application with a distributed tracing library
https://linkerd.io/2019/08/09/service-mesh-distributed-tracing-myths/

Distributed Tracing Frameworks/Tools Timeline
[Timeline, 2007 to 2018: Twitter Zipkin, Uber Jaeger, Google Dapper, Facebook Canopy, Magpie, Stardust, …, X-Trace, Sourcegraph Appdash, OpenTracing, Google OpenCensus, Expedia Haystack]

Distributed Tracing Framework Major Components
• Recorder
• Agent and/or Collector
• Storage
  • Memory/Queue
  • NoSQL
• Visualization
  • Interactive UI
  • Analytics
• Index
• Search
• Distributed Context Propagation
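
Distributed context propagation, the last component in the list, boils down to writing the span context into a carrier (such as request headers) on the caller side and reading it back on the callee side. The sketch below assumes a plain `map[string]string` carrier and made-up `x-example-*` key names; the `SpanContext` type and the `inject` and `extract` helpers are hypothetical and do not belong to any particular framework.

```go
package main

import "strings"

// SpanContext is the hypothetical state that must cross process boundaries.
type SpanContext struct {
	TraceID string
	SpanID  string
	Baggage map[string]string
}

// Carrier key names, invented for this example.
const (
	traceKey      = "x-example-trace-id"
	spanKey       = "x-example-span-id"
	baggagePrefix = "x-example-baggage-"
)

// inject writes the span context into the carrier before an outgoing call.
func inject(sc SpanContext, carrier map[string]string) {
	carrier[traceKey] = sc.TraceID
	carrier[spanKey] = sc.SpanID
	for k, v := range sc.Baggage {
		carrier[baggagePrefix+k] = v
	}
}

// extract rebuilds the span context from an incoming carrier. The boolean
// is false when the trace or span ID is missing.
func extract(carrier map[string]string) (SpanContext, bool) {
	sc := SpanContext{Baggage: make(map[string]string)}
	for k, v := range carrier {
		switch {
		case k == traceKey:
			sc.TraceID = v
		case k == spanKey:
			sc.SpanID = v
		case strings.HasPrefix(k, baggagePrefix):
			sc.Baggage[strings.TrimPrefix(k, baggagePrefix)] = v
		}
	}
	return sc, sc.TraceID != "" && sc.SpanID != ""
}
```

If two tools disagree on these key names, the callee finds no context and starts a new trace, which is how a transaction trace gets lost between tools.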

Zipkin

Jaeger

Haystack

OpenCensus

Different Frameworks

Distributed Tracing Frameworks/Tools Timeline
[Timeline, 2007 to 2019: Twitter Zipkin, Uber Jaeger, Google Dapper, Facebook Canopy, Magpie, Stardust, …, X-Trace, Sourcegraph Appdash, OpenTracing, Google OpenCensus, Expedia Haystack]

                   
• Transaction trace is lost because tools use different headers for context propagation
• With standardized headers, traces don’t break (even for proprietary information)
https://medium.com/@AloisReitbauer/trace-context-and-the-road-toward-trace-tool-interoperability-d4d56932369c

• Effective observability requires high-quality telemetry.
• OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software.
• Provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application.
• Distributed Tracing Working Group
  • Data formats for on-the-wire trace context & correlation-context, and out-of-band trace data
  • This specification defines formats to pass trace context information across systems.
  • Various tracing and diagnostics products can operate together.
• https://opentelemetry.io/
• https://www.w3.org/2018/distributed-tracing/
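
The on-the-wire trace context format from the W3C working group is the `traceparent` header, which carries four dash-separated, lowercase hex fields: version, trace ID, parent ID, and flags. Below is a minimal parsing sketch that handles version `00` only; the `TraceParent` type and `parseTraceParent` function are hypothetical names, and a production parser would need to follow the full specification.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// TraceParent holds the fields of a version-00 traceparent header.
type TraceParent struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters
	Sampled  bool   // lowest bit of the trace-flags field
}

// parseTraceParent validates and splits a header such as
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
func parseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 fields")
	}
	version, traceID, parentID, flags := parts[0], parts[1], parts[2], parts[3]
	if version != "00" {
		return TraceParent{}, errors.New("traceparent: unsupported version")
	}
	if len(traceID) != 32 || len(parentID) != 16 || len(flags) != 2 {
		return TraceParent{}, errors.New("traceparent: wrong field length")
	}
	if header != strings.ToLower(header) {
		return TraceParent{}, errors.New("traceparent: hex must be lowercase")
	}
	for _, field := range []string{traceID, parentID} {
		if _, err := hex.DecodeString(field); err != nil {
			return TraceParent{}, errors.New("traceparent: invalid hex")
		}
		if strings.Trim(field, "0") == "" {
			return TraceParent{}, errors.New("traceparent: all-zero ID")
		}
	}
	flagBytes, err := hex.DecodeString(flags)
	if err != nil {
		return TraceParent{}, errors.New("traceparent: invalid flags")
	}
	return TraceParent{
		TraceID:  traceID,
		ParentID: parentID,
		Sampled:  flagBytes[0]&0x01 == 0x01,
	}, nil
}
```

Because every participating tool reads and writes the same header, the trace ID survives hops through services instrumented by different vendors.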

https://opentelemetry.io OpenTracing and OpenCensus: A Roadmap to Convergence

KubeCon + CloudNativeCon  North America 2019

Summary
• Gain a basic understanding of distributed tracing
• Understand how tracing works and why to use it
• Directions for choosing tracing tools in the future

Thank you!
