Let's talk about Tracing

Slide 1

Slide 1 text

Let’s talk about tracing Cheng-Lung Sung (clsung@)

Slide 2

Slide 2 text

About.Me/clsung
• Tech Lead of Data Research & Development Center, CTBC Bank
• RD Head of Product Development, HTC Health Care (DeepQ)
• Cloud Service Infrastructure (Golang, Python)
• Mobile App Development (Golang, Java, Swift, Node.js)
• Deep Learning AI Platform (Golang, Python, Node.js)
• Open Source contributor
• GitHub: github.com/clsung
• Golang: golang.org/AUTHORS
• Plurk API: www.plurk.com/API
• LINE API Expert: www.line-community.me/contributors

Slide 3

Slide 3 text

Outline
• CNCF - Observability
• History of distributed tracing: Google Dapper, Facebook Canopy
• Introduction to distributed tracing
• Open Source tracing frameworks and tools: OpenCensus, Jaeger, Appdash, Zipkin, …
• OpenTelemetry = OpenCensus + OpenTracing

Slide 4

Slide 4 text

CNCF Cloud Native Landscape

Slide 5

Slide 5 text

CNCF Cloud Native Trail Map
• Containerization
• CI/CD
• Orchestration & Application Definition
• Observability & Analysis
• Service Proxy, Discovery, & Mesh
• Networking & Policy
• Distributed Database & Storage
• Streaming & Messaging
• Container Registry & Runtime
• Software Distribution

Slide 8

Slide 8 text

Microservices Debugging Challenges: What, Where, Why, How, Who

Slide 9

Slide 9 text

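• What: What happened? Did the request fail, or did it take too long?
• Where: Which (micro)service has the problem?
• Why: Under what conditions does it happen?
• How: What is the context?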

Slide 10

Slide 10 text

“An observable system is one that exposes enough data about itself so that generating information and easily accessing this information becomes simple.” (Cindy Sridharan, author of Distributed Systems Observability)

Slide 11

Slide 11 text

Visualize the problem

Slide 12

Slide 12 text

“The three treasures of DevOps: Logging, Tracing, Metrics”

Slide 13

Slide 13 text

Pillars of Observability: Metrics, Logs, Traces, Visualization

Slide 15

Slide 15 text

Visual Studio https://docs.microsoft.com/en-us/visualstudio/

Slide 16

Slide 16 text

GDB https://blogs.msdn.microsoft.com

Slide 17

Slide 17 text

Observability - Logs: Stackdriver Logging

Slide 18

Slide 18 text

Observability - Logs
• Tailer: tails a file and publishes it to a NATS server.
• Tailer publishes logs to gnatsd.
• Nail connects to only one natsd server and subscribes to a specific topic.
[Diagram labels: Host A, Host B, and Host Z each run Tailer; NATS daemon; topic; Nail runs anywhere.]

Slide 19

Slide 19 text

Observability - Metrics: Prometheus + Grafana

Slide 20

Slide 20 text

Logging
• Error messages and stack traces
• Helpful for troubleshooting
• But logs at scale become very expensive
• Cannot be sampled
• No context, no trace

Metrics
• CPU, RAM, error (rate), latency, throughput
• Monitor individual components of the system
• Can be aggregated
• Choose metrics when you need alerts

Slide 21

Slide 21 text

Tracing https://en.wikipedia.org/wiki/Animal_migration_tracking

Slide 22

Slide 22 text

Observability - Tracing

Slide 23

Slide 23 text

Distributed Tracing Timeline (time axis: 2007 to 2018)
Systems shown: Magpie, Stardust, …, X-Trace
https://ucsdnews.ucsd.edu/pressrelease/computer_scientists_honored_for_tracing_research_that_stood_10_year_test_of

Slide 24

Slide 24 text

Google Dapper: design requirements
• Low overhead
• Application-level transparency
• Scalability
• Available for analysis quickly

Slide 25

Slide 25 text

Google Dapper
• Tree
• Span
• Tree node
• Annotation-based
• vs Black-box (statistical)

Slide 26

Slide 26 text

Facebook Canopy
• Generate events
• Propagate a TraceID
• Instrumentation APIs
• Emit events
• Aggregate events
• Raw traces to
• Model construction
• Feature extraction
• Query and Visualization

Slide 27

Slide 27 text

Observability - Tracing

Slide 28

Slide 28 text

“Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.”

Slide 29

Slide 29 text

Distributed Tracing https://opentracing.io/docs/overview/

Slide 30

Slide 30 text

• Commercial: Epsagon, LightStep, Datadog, Elastic, Instana, Wavefront
• Open Source: Appdash, Jaeger, Apache SkyWalking, Expedia Haystack

Slide 31

Slide 31 text

[Diagram: one trace drawn as Spans A, B, C, D, E, and F laid out along a time axis.]
spanA := tracer.StartSpan("Span A")

Slide 32

Slide 32 text

Terminology (OpenTracing)
• Time
• Trace
• Spans
• Tags (Attributes)
• Logs (Events)
• SpanContext
• Baggage (DistributedContext)
[Diagram: one trace drawn as Spans A, B, C, D, E, and F laid out along a time axis.]

Slide 33

Slide 33 text

• A trace is a data/execution path through the system, and can be thought of as a directed acyclic graph (DAG) of spans.
• A trace is a collection of linked spans. The edges indicate the causal relationships (references) between spans.
[Diagram: graph with nodes A, B, C, D, E, F.]
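
The DAG view above can be sketched in Go with the standard library alone. This is an illustration under assumptions, not the OpenTracing data model: the Span type, the Children and Depth helpers, and the particular parent links between A and F are all hypothetical, since the slide does not say which span references which.

```go
package main

import "fmt"

// Span is a hypothetical minimal record, not the OpenTracing type.
// ParentID is the causal reference to another span ("" for the root).
type Span struct {
	ID       string
	ParentID string
}

// Children builds the edge list of the trace graph: parent ID -> child IDs.
func Children(spans []Span) map[string][]string {
	edges := make(map[string][]string)
	for _, s := range spans {
		if s.ParentID != "" {
			edges[s.ParentID] = append(edges[s.ParentID], s.ID)
		}
	}
	return edges
}

// Depth counts the causal edges between span id and the root span.
// It returns -1 for an unknown span, a missing parent, or a cycle.
func Depth(spans []Span, id string) int {
	parent := make(map[string]string, len(spans))
	for _, s := range spans {
		parent[s.ID] = s.ParentID
	}
	depth := 0
	for {
		p, ok := parent[id]
		if !ok || depth > len(spans) {
			return -1
		}
		if p == "" {
			return depth
		}
		id = p
		depth++
	}
}

func main() {
	// Assumed shape: A is the root, B and C follow from A, and so on.
	trace := []Span{
		{ID: "A"}, {ID: "B", ParentID: "A"}, {ID: "C", ParentID: "A"},
		{ID: "D", ParentID: "B"}, {ID: "E", ParentID: "C"}, {ID: "F", ParentID: "C"},
	}
	edges := Children(trace)
	for _, s := range trace {
		fmt.Printf("%s depth=%d children=%v\n", s.ID, Depth(trace, s.ID), edges[s.ID])
	}
}
```

Storing only a parent reference per span keeps each record small; the full graph is rebuilt when the spans of one trace are collected together.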

Slide 34

Slide 34 text

[Diagram: one trace drawn as Spans A, B, C, D, E, and F laid out along a time axis.]
• A span is the execution of a client request.
• A span represents an individual unit of work done in a distributed system.
• https://www.w3.org/TR/trace-context/
• https://opentracing.io/docs/overview/spans/

Slide 35

Slide 35 text

• An operation name
• A start timestamp
• A finish timestamp
• A set of zero or more Span Tags
• A set of zero or more Span Logs
• A SpanContext (not shown here)

// gotConn records connection details as span tags and logs one event.
func (h *Tracer) gotConn(info httptrace.GotConnInfo) {
	h.sp.SetTag("net/http.reused", info.Reused)
	h.sp.SetTag("net/http.was_idle", info.WasIdle)
	h.sp.LogFields(log.String("event", "GotConn"))
}

Slide 36

Slide 36 text

• Span Tags represent contextual metadata relevant to a specific request.
• Tags are key:value pairs that enable user-defined annotation of spans in order to query, filter, and comprehend trace data.
• Keys are strings; values can be strings, numbers, or booleans.
https://github.com/opentracing/specification/blob/master/semantic_conventions.md
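
The span fields and tag rules described on these slides can be sketched with the Go standard library. Everything here is an assumption made for illustration: the Span and LogEntry types, StartSpan, LogEvent, and in particular a SetTag that returns an error for unsupported value types. The real opentracing-go SetTag accepts any value and returns the span instead.

```go
package main

import (
	"fmt"
	"time"
)

// LogEntry is a hypothetical timestamped event attached to a span.
type LogEntry struct {
	Time  time.Time
	Event string
}

// Span is a hypothetical stand-in holding the fields listed on the slide,
// except the SpanContext.
type Span struct {
	OperationName string
	StartTime     time.Time
	FinishTime    time.Time
	Tags          map[string]interface{}
	Logs          []LogEntry
}

// StartSpan records the operation name and the start timestamp.
func StartSpan(operationName string) *Span {
	return &Span{
		OperationName: operationName,
		StartTime:     time.Now(),
		Tags:          make(map[string]interface{}),
	}
}

// SetTag stores a key:value pair. Only strings, numbers, and booleans
// are accepted (an assumption of this sketch, modeled on the slide).
func (s *Span) SetTag(key string, value interface{}) error {
	switch value.(type) {
	case string, bool, int, int64, float64:
		s.Tags[key] = value
		return nil
	default:
		return fmt.Errorf("tag %q: unsupported value type %T", key, value)
	}
}

// LogEvent appends a timestamped event to the span's logs.
func (s *Span) LogEvent(event string) {
	s.Logs = append(s.Logs, LogEntry{Time: time.Now(), Event: event})
}

// Finish records the finish timestamp.
func (s *Span) Finish() { s.FinishTime = time.Now() }

func main() {
	sp := StartSpan("HTTP GET")
	_ = sp.SetTag("net/http.reused", true)
	_ = sp.SetTag("http.status_code", 200)
	sp.LogEvent("GotConn")
	if err := sp.SetTag("bad", []string{"x"}); err != nil {
		fmt.Println("rejected:", err)
	}
	sp.Finish()
	fmt.Println(sp.OperationName, len(sp.Tags), "tags,", len(sp.Logs), "log entries")
}
```

Tags describe the whole span and are what a tracing backend typically indexes for search, while logs capture moments inside the span's lifetime.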

Slide 38

Slide 38 text

Four Myths about Distributed Tracing
• I need distributed tracing to measure the health / latency / throughput of my services
• I need distributed tracing to see which services talk to which other services
• I can “get distributed tracing” from the service mesh without having to make any changes to my application
• I can “get distributed tracing” from a service mesh by instrumenting my application with a distributed tracing library
https://linkerd.io/2019/08/09/service-mesh-distributed-tracing-myths/

Slide 39

Slide 39 text

Distributed Tracing Frameworks/Tools Timeline (time axis: 2007 to 2018)
Shown: Twitter Zipkin, Uber Jaeger, Google Dapper, Facebook Canopy, Magpie, Stardust, …, X-Trace, Sourcegraph Appdash, OpenTracing, Google OpenCensus, Expedia Haystack

Slide 40

Slide 40 text

Distributed Tracing Framework: Major Components
• Recorder
• Agent and/or Collector
• Storage
• Memory/Queue
• NoSQL
• Visualization
• Interactive UI
• Analytics
• Index
• Search
• Distributed Context Propagation

Slide 41

Slide 41 text

Zipkin

Slide 42

Slide 42 text

Jaeger

Slide 43

Slide 43 text

Haystack

Slide 44

Slide 44 text

OpenCensus

Slide 45

Slide 45 text

Different Frameworks

Slide 46

Slide 46 text

Distributed Tracing Frameworks/Tools Timeline (time axis: 2007 to 2019)
Shown: Twitter Zipkin, Uber Jaeger, Google Dapper, Facebook Canopy, Magpie, Stardust, …, X-Trace, Sourcegraph Appdash, OpenTracing, Google OpenCensus, Expedia Haystack

Slide 47

Slide 47 text

• Transaction trace is lost because tools use different headers for context propagation.
• With standardized headers, traces don’t break (even for proprietary information).
https://medium.com/@AloisReitbauer/trace-context-and-the-road-toward-trace-tool-interoperability-d4d56932369c
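
One such standardized header is the W3C Trace Context `traceparent` header, whose version 00 layout is `version-traceid-parentid-flags` in lowercase hex. The sketch below parses and re-emits it using only the Go standard library. The TraceContext type, the function names, and the sample span ID are hypothetical, and the sketch deliberately handles only version 00 and only the sampled flag bit.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TraceContext is a hypothetical holder for the traceparent fields.
type TraceContext struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters
	Sampled  bool   // lowest bit of the trace-flags byte
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if (c < '0' || c > '9') && (c < 'a' || c > 'f') {
			return false
		}
	}
	return true
}

// ParseTraceparent validates a version 00 traceparent header value.
func ParseTraceparent(header string) (TraceContext, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceContext{}, errors.New("traceparent: want version 00 with four fields")
	}
	traceID, parentID, flags := parts[1], parts[2], parts[3]
	// All-zero IDs are invalid, so trimming every '0' must leave something.
	if !isLowerHex(traceID, 32) || strings.Trim(traceID, "0") == "" {
		return TraceContext{}, errors.New("traceparent: invalid trace-id")
	}
	if !isLowerHex(parentID, 16) || strings.Trim(parentID, "0") == "" {
		return TraceContext{}, errors.New("traceparent: invalid parent-id")
	}
	if !isLowerHex(flags, 2) {
		return TraceContext{}, errors.New("traceparent: invalid trace-flags")
	}
	bits, _ := strconv.ParseUint(flags, 16, 8) // cannot fail after the hex check
	return TraceContext{TraceID: traceID, ParentID: parentID, Sampled: bits&1 == 1}, nil
}

// Header formats the context as a version 00 header value.
// Flag bits other than "sampled" are not preserved in this sketch.
func (tc TraceContext) Header() string {
	flags := 0
	if tc.Sampled {
		flags = 1
	}
	return fmt.Sprintf("00-%s-%s-%02x", tc.TraceID, tc.ParentID, flags)
}

func main() {
	in := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	tc, err := ParseTraceparent(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// An outgoing call keeps the trace ID but carries this service's span ID.
	tc.ParentID = "b7ad6b7169203331" // hypothetical new span ID
	fmt.Println(tc.Header())
}
```

Because every tool reads and writes the same field layout, the trace ID survives a hop through a service instrumented by a different vendor, which is the situation the slide's second caption describes.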

Slide 48

Slide 48 text

OpenTelemetry
• Effective observability requires high-quality telemetry.
• OpenTelemetry makes robust, portable telemetry a built-in feature of cloud-native software.
• Provides a single set of APIs, libraries, agents, and collector services to capture distributed traces and metrics from your application.
• https://opentelemetry.io/

Distributed Tracing Working Group
• Data formats for on-the-wire trace context & correlation-context, and out-of-band trace data
• This specification defines formats to pass trace context information across systems.
• Various tracing and diagnostics products can operate together.
• https://www.w3.org/2018/distributed-tracing/

Slide 49

Slide 49 text

https://opentelemetry.io OpenTracing and OpenCensus: A Roadmap to Convergence

Slide 50

Slide 50 text

KubeCon + CloudNativeCon  North America 2019

Slide 51

Slide 51 text

Summary
• Gain a basic understanding of distributed tracing
• Understand how tracing works and why to use it
• Directions for choosing tracing tools in the future

Slide 52

Slide 52 text

Thank you!