From zero to distributed traces: an OpenTracing tutorial

Slide 1

Slide 1 text

From zero to distributed traces: An OpenTracing Tutorial
Bryan Liles (Capital One), Yuri Shkuro (Uber), Won Jun Jang (Uber), Prithvi Raj (Uber)
Velocity NYC, Oct 2 2017

Slide 2

Slide 2 text

Agenda
● Why care about tracing
● Tracing demo
● Why care about OpenTracing
● CNCF Jaeger
● OpenTracing deep dive
● Showcase & open discussion

Slide 3

Slide 3 text

Getting the most out of this workshop
● Learn the ropes. If you already know them, help teach the ropes :)
● Meet some people
Everyone can walk away with practical tracing experience and a better sense of the space.

Slide 4

Slide 4 text

Why care about Tracing
Tracing is fun

Slide 5

Slide 5 text

Today’s applications are complex

Slide 6

Slide 6 text

BILLIONS of times a day!

Slide 7

Slide 7 text

How do we know what’s going on?

Slide 8

Slide 8 text

We use MONITORING tools
Metrics / Stats
● Counters, timers, gauges, histograms
● Four golden signals
○ latency
○ traffic
○ errors
○ saturation
● Statsd, Prometheus, Grafana
Logging
● Application events
● Errors, stack traces
● ELK, Splunk, Fluentd
Monitoring tools must “tell stories” about your system

Slide 9

Slide 9 text

Metrics and logs don’t cut it anymore!
Metrics and logs are per-instance
We need to monitor distributed transactions

Slide 10

Slide 10 text

Systems are Distributed and Concurrent
[Figure panels: “The Simple [Inefficient] Thing”, Basic Concurrency, Async Concurrency, Distributed Concurrency]

Slide 11

Slide 11 text

How do we “tell stories” about distributed concurrency?

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Distributed Tracing Systems
● performance and latency optimization
● distributed transaction monitoring
● service dependency analysis
● root cause analysis
● distributed context propagation

Slide 14

Slide 14 text

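Context Propagation and Distributed Tracing
[Figure, left: a request enters at the edge service, where a unique ID becomes the {context}; the {context} is passed along with every call between services A, B, C, D and E.]
[Figure, right: the resulting TRACE, drawn as SPANS for A, B, C, D and E along a time axis.]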

Slide 15

Slide 15 text

Understanding Sampling
Tracing data can exceed business traffic. Most tracing systems sample transactions:
● Head-based sampling: the sampling decision is made just before the trace is started, and it is respected by all nodes in the graph
● Tail-based sampling: the sampling decision is made after the trace is completed / collected
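A minimal sketch of head-based sampling, assuming the decision is derived from a hash of the trace ID. The names headSampled, requestContext, startTrace and shouldRecord are hypothetical and are not taken from Jaeger or OpenTracing; the point is only that the decision is made once, at the start, and then travels with the request.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// headSampled makes the sampling decision from the trace ID alone.
// Hashing the ID into [0, 1) and comparing it with the rate is one simple,
// deterministic way to do this; it is an illustrative choice, not Jaeger's code.
func headSampled(traceID string, rate float64) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return float64(h.Sum32())/float64(1<<32) < rate
}

// requestContext is a hypothetical stand-in for the context that travels
// with a request. The decision is stored once and never recomputed.
type requestContext struct {
	TraceID string
	Sampled bool
}

// startTrace runs at the edge service, just before the trace is started.
func startTrace(traceID string, rate float64) requestContext {
	return requestContext{TraceID: traceID, Sampled: headSampled(traceID, rate)}
}

// shouldRecord is what every downstream node consults: it respects the
// decision made at the head instead of making its own.
func shouldRecord(ctx requestContext) bool {
	return ctx.Sampled
}

func main() {
	for _, id := range []string{"trace-a", "trace-b", "trace-c"} {
		ctx := startTrace(id, 0.5)
		fmt.Printf("%s sampled=%v\n", id, shouldRecord(ctx))
	}
}
```

A tail-based sampler would instead buffer all spans of a trace and decide after the trace is complete, which is why it needs a collection tier rather than a flag in the request context.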

Slide 16

Slide 16 text

Let’s look at some traces
demo time: http://bit.do/jaeger-hotrod

Slide 17

Slide 17 text

Great… Why isn’t everyone tracing?
Tracing instrumentation has been too hard.
● Lock-in is unacceptable: instrumentation must be decoupled from vendors
● Monkey patching doesn’t scale: instrumentation must be explicit
● Inconsistent APIs: tracing semantics must not be language-dependent
● Handoff woes: tracing libs in Project X don’t hand off to tracing libs in Project Y

Slide 18

Slide 18 text

Enter OpenTracing
http://opentracing.io

Slide 19

Slide 19 text

OpenTracing in a nutshell
OpenTracing addresses the instrumentation problem.
Who cares? Developers building:
● Cloud-native / microservice applications
● OSS packages, especially near process edges (web frameworks, managed service clients, etc.)
● Tracing and/or monitoring systems

Slide 20

Slide 20 text

Where does tracing code live?
● OSS and commercial / in-house instrumentation
● Tracer SDKs / clients
● Tracing backends and UIs

Slide 21

Slide 21 text

OpenTracing Architecture
[Figure labels: microservice process; application logic; µ-service frameworks; Lambda functions; RPC & control-flow frameworks; existing instrumentation; OpenTracing API; main(); tracing infrastructure (Instana, CNCF Jaeger)]

Slide 22

Slide 22 text

A young, growing project
~2 years old
Tracer implementations: Jaeger, Zipkin, LightStep, SkyWalking, others
All sorts of companies use OpenTracing

Slide 23

Slide 23 text

Rapidly growing OSS and vendor support
[Logos shown include: JDBI, Java Webservlet, Jaxr]

Slide 24

Slide 24 text

Jaeger
A distributed tracing system

Slide 25

Slide 25 text

New CNCF Project: Jaeger
https://github.com/uber/jaeger

Slide 26

Slide 26 text

Jaeger - /ˈyāɡər/, noun: hunter
• Inspired by Google’s Dapper and OpenZipkin
• Started at Uber in August 2015
• Open sourced in April 2017
• Official CNCF project, Sep 2017
• Built-in OpenTracing support
• https://github.com/uber/jaeger

Slide 27

Slide 27 text

Jaeger: Technology Stack
● Go backend
● Pluggable storage
○ Cassandra, Elasticsearch, memory, ...
● React/JavaScript frontend
● OpenTracing instrumentation libraries

Slide 28

Slide 28 text

Jaeger: Community
● 10 full-time engineers at Uber and Red Hat
● 30+ contributors on GitHub
● Already used by many organizations
○ including Symantec, Red Hat, Base CRM, Massachusetts Open Cloud, Nets, FarmersEdge, Grafana Labs, Northwestern Mutual, Zenly

Slide 29

Slide 29 text

OpenTracing deep dive
Doc: http://bit.do/velocity17

Slide 30

Slide 30 text

Materials
● Setup instructions: http://bit.do/velocity17
● Tutorial: http://bit.do/opentracing-tutorial
● Q&A: https://gitter.im/opentracing/workshop

Slide 31

Slide 31 text

Lesson 1: Hello, World

Slide 32

Slide 32 text

Lesson 1 Objectives
● Basic concepts
● Instantiate a Tracer
● Create a simple trace
● Annotate the trace

Slide 33

Slide 33 text

Basic concepts: SPAN
Span: a basic unit of work, timing, and causality.
A span contains:
● operation name
● start / finish timestamps
● tags and logs
● references to other spans

Slide 34

Slide 34 text

Basic concepts: TRACE
Trace: a directed acyclic graph (DAG) of spans
[Figure: a DAG of Span A through Span H]

Slide 35

Slide 35 text

Trace as a time sequence diagram
[Figure: spans A through H drawn as bars along a time axis]

Slide 36

Slide 36 text

Basic concepts: OPERATION NAME
A human-readable string which concisely represents the work of the span.
● E.g. an RPC method name, a function name, or the name of a subtask or stage within a larger computation
● Can be set at span creation or later
● Should be low cardinality, aggregatable, identifying a class of spans
Examples:
get: too general
get_account/12345: too specific
get_account: good, “12345” could be a tag

Slide 37

Slide 37 text

Basic concepts: TAG
A key-value pair that describes the span overall.
Examples:
● http.url = “http://google.com”
● http.status_code = 200
● peer.service = “mysql”
● db.statement = “select * from users”
https://github.com/opentracing/specification/blob/master/semantic_conventions.md

Slide 38

Slide 38 text

Basic concepts: LOG
Describes an event at a point in time during the span lifetime.
● OpenTracing supports structured logging
● Contains a timestamp and a set of fields
    span.log_kv({'event': 'open_conn', 'port': 433})
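The span, operation name, tag and log concepts above can be sketched together in a few lines of Go. This is a simplified model for illustration only: the Span and LogRecord types and their methods are hypothetical and are not the opentracing-go interfaces, and account_id is a made-up tag key.

```go
package main

import (
	"fmt"
	"time"
)

// LogRecord is an event at a point in time: a timestamp plus a set of fields.
type LogRecord struct {
	Timestamp time.Time
	Fields    map[string]interface{}
}

// Span is a hypothetical, simplified span: an operation name,
// start / finish timestamps, tags and logs.
type Span struct {
	OperationName string
	StartTime     time.Time
	FinishTime    time.Time
	Tags          map[string]interface{}
	Logs          []LogRecord
}

// StartSpan records the start timestamp and the operation name.
func StartSpan(operationName string) *Span {
	return &Span{
		OperationName: operationName,
		StartTime:     time.Now(),
		Tags:          map[string]interface{}{},
	}
}

// SetTag attaches a key-value pair that describes the span as a whole.
func (s *Span) SetTag(key string, value interface{}) {
	s.Tags[key] = value
}

// LogKV appends a timestamped, structured event to the span.
func (s *Span) LogKV(fields map[string]interface{}) {
	s.Logs = append(s.Logs, LogRecord{Timestamp: time.Now(), Fields: fields})
}

// Finish records the finish timestamp.
func (s *Span) Finish() {
	s.FinishTime = time.Now()
}

func main() {
	// Low-cardinality operation name; the specific account goes into a tag.
	span := StartSpan("get_account")
	span.SetTag("account_id", "12345")
	span.SetTag("http.status_code", 200)
	span.LogKV(map[string]interface{}{"event": "open_conn", "port": 433})
	span.Finish()
	fmt.Println(span.OperationName, span.Tags, len(span.Logs))
}
```

Keeping the account number out of the operation name is what lets a backend aggregate all get_account spans while still allowing a search by tag.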

Slide 39

Slide 39 text

Basic concepts: TRACER
A tracer is a concrete implementation of the OpenTracing API.
    // Simplified constructor; "How to create Jaeger Tracer" shows the actual setup.
    tracer := jaeger.New("hello-world")
    span := tracer.StartSpan("say-hello")
    // do the work
    span.Finish()

Slide 40

Slide 40 text

Understanding Sampling
● Tracing data can exceed business traffic
● Most tracing systems sample transactions
● Head-based sampling: the sampling decision is made just before the trace is started, and it is respected by all nodes in the graph
● Tail-based sampling: the sampling decision is made after the trace is completed / collected

Slide 41

Slide 41 text

How to create Jaeger Tracer
    cfg := &config.Configuration{
        // "const" sampler with Param 1: sample every trace.
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        // Log each span as it is reported.
        Reporter: &config.ReporterConfig{LogSpans: true},
    }
    tracer, closer, err := cfg.New(serviceName)

Slide 42

Slide 42 text

Lesson 2: Context and Tracing Functions

Slide 43

Slide 43 text

Lesson 2 Objectives
● Trace individual functions
● Combine multiple spans into a single trace
● Propagate the in-process context

Slide 44

Slide 44 text

How do we build a DAG?
    span1 := tracer.StartSpan("say-hello")
    // do the work
    span1.Finish()

    span2 := tracer.StartSpan("format-string")
    // do the work
    span2.Finish()
This just creates two independent traces!

Slide 45

Slide 45 text

Build a DAG with Span References
    span1 := tracer.StartSpan("say-hello")
    // do the work
    span1.Finish()

    span2 := tracer.StartSpan(
        "format-string",
        opentracing.ChildOf(span1.Context()),
    )
    // do the work
    span2.Finish()

Slide 46

Slide 46 text

Basic concepts: SPAN CONTEXT
Serializable format for linking spans across network boundaries.
Carries trace/span identity and baggage.
    type SpanContext struct {
        traceID  TraceID
        spanID   SpanID
        parentID SpanID
        flags    byte
        baggage  map[string]string
    }

Slide 47

Slide 47 text

Basic concepts: SPAN REFERENCE
Describes causal relationship to another span.
    type Reference struct {
        Type    opentracing.SpanReferenceType
        Context SpanContext
    }

Slide 48

Slide 48 text

Types of Span References
ChildOf: referenced span is an ancestor that depends on the results of the current span. E.g. RPC call, database call, local function.
FollowsFrom: referenced span is an ancestor that does not depend on the results of the current span. E.g. async fire-n-forget cache write.
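To make the role of references concrete, here is a small self-contained sketch of how a tracer might use them to place spans into the same trace. The Tracer, Span and Reference types below are hypothetical simplifications with integer IDs, not the opentracing-go or Jaeger types, and the operation names cache-write and print-hello are made up for the example.

```go
package main

import "fmt"

// RefType distinguishes the two kinds of causal reference.
type RefType int

const (
	ChildOf     RefType = iota // the referenced span waits for this span's result
	FollowsFrom                // the referenced span does not wait for this span
)

// Reference points at another span by ID (a simplification of SpanContext).
type Reference struct {
	Type   RefType
	SpanID int
}

// Span is a hypothetical minimal span.
type Span struct {
	ID        int
	TraceID   int
	Operation string
	Refs      []Reference
}

// Tracer hands out IDs and remembers the spans it has started.
type Tracer struct {
	nextID int
	spans  map[int]*Span
}

func NewTracer() *Tracer {
	return &Tracer{spans: map[int]*Span{}}
}

// StartSpan joins the trace of the first referenced span if that span is
// known; with no references it starts a new trace.
func (t *Tracer) StartSpan(op string, refs ...Reference) *Span {
	t.nextID++
	s := &Span{ID: t.nextID, TraceID: t.nextID, Operation: op, Refs: refs}
	if len(refs) > 0 {
		if parent, ok := t.spans[refs[0].SpanID]; ok {
			s.TraceID = parent.TraceID
		}
	}
	t.spans[s.ID] = s
	return s
}

func main() {
	t := NewTracer()
	hello := t.StartSpan("say-hello")
	format := t.StartSpan("format-string", Reference{ChildOf, hello.ID})
	cache := t.StartSpan("cache-write", Reference{FollowsFrom, hello.ID})
	lonely := t.StartSpan("print-hello") // no reference: a separate trace

	for _, s := range []*Span{hello, format, cache, lonely} {
		fmt.Printf("%-14s trace=%d refs=%v\n", s.Operation, s.TraceID, s.Refs)
	}
}
```

Both reference types put the new span in the referenced span's trace; the type only records whether the referenced span depends on the result.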

Slide 49

Slide 49 text

In-process Context Propagation
We don’t want to keep passing Spans around. Need a more general request context.
● Go: context.Context (from std lib)
● Java, Python: thread-locals (WIP)
● Node.js: TBD (internally: @uber/node-context)

Slide 50

Slide 50 text

Lesson 3: Tracing RPC Requests

Slide 51

Slide 51 text

Lesson 3 Objectives
● Trace a transaction across more than one microservice
● Pass the context between processes using Inject and Extract
● Apply OpenTracing-recommended tags

Slide 52

Slide 52 text

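Three Steps for Instrumentation
[Figure: MY SERVICE sits between an inbound request and an outbound request. Numbered steps 1, 2, 3 mark the path: Handler instrumentation reads the TraceID from the inbound Headers into a Context and Span; the Context and Span pass through the service; Client instrumentation writes the TraceID into the outbound Headers. The Jaeger client library sends trace data to Jaeger on a background thread.]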

Slide 53

Slide 53 text

Basic concepts: Inject and Extract
Tracer methods used to serialize Span Context to or from RPC requests (or other network comms)
    void Inject(SpanContext, Format, Carrier)
    SpanContext Extract(Format, Carrier)

Slide 54

Slide 54 text

Basic concepts: Propagation Format
OpenTracing does not define the wire format. It assumes that the frameworks for network comms allow passing the context (request metadata) as one of these (the Format enum):
1. TextMap: Arbitrary string key/value headers
2. Binary: A binary blob
3. HTTPHeaders: as a special case of #1

Slide 55

Slide 55 text

Basic concepts: Carrier
Each Format defines a corresponding Carrier interface that the Tracer uses to read/write the span context.
The instrumentation implements the Carrier interface as an adapter around its custom types.

Slide 56

Slide 56 text

Inject Example
[Figure: the Tracer writes to a TextMap Carrier via Set(key, value) or to a Binary Carrier via Write(byte[]); in each case an Adapter translates the call into AddHeader(key, value) or Write(byte[]) on the RPC Request.]

Slide 57

Slide 57 text

Lesson 4: Baggage

Slide 58

Slide 58 text

Lesson 4 Objectives
● Understand distributed context propagation
● Use baggage to pass data through the call graph

Slide 59

Slide 59 text

Distributed Context Propagation
[Figure: spans for Client, Frontend, Ad, Content, Shard A, Shard B, and five groups of Cassandra spans. The Client span carries button=buy; every other span carries button=buy, exp_id=57.]
Problem: how to aggregate disk writes in Cassandra by “button” type (or experiment id, etc.)?
See the Pivot Tracing paper: http://pivottracing.io/

Slide 60

Slide 60 text

Basic concepts: Baggage
Baggage is a general purpose in-band key-value store.
    span.SetBaggageItem("Bender", "Rodriguez")
Transparent to most services.
Powerful but dangerous
● Bloats the request size
[Figure: baggage passing through services A, B, C, D, E]
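A small sketch of why baggage is both powerful and costly: items set upstream are re-injected on every hop, so a service deep in the call graph can read them, and every downstream request grows by their size. The SpanContext type, the plain map used as headers, and the trace-id / baggage- key names are hypothetical choices for this example, not a real tracer's format.

```go
package main

import (
	"fmt"
	"strings"
)

const baggagePrefix = "baggage-"

// SpanContext is a hypothetical span context: identity plus baggage.
type SpanContext struct {
	TraceID string
	Baggage map[string]string
}

// WithBaggageItem returns a copy with one more item, leaving sc untouched.
func (sc SpanContext) WithBaggageItem(key, value string) SpanContext {
	out := SpanContext{TraceID: sc.TraceID, Baggage: map[string]string{}}
	for k, v := range sc.Baggage {
		out.Baggage[k] = v
	}
	out.Baggage[key] = value
	return out
}

// Inject writes identity and all baggage into the outbound headers.
// Each baggage item adds bytes to every downstream request.
func Inject(sc SpanContext, headers map[string]string) {
	headers["trace-id"] = sc.TraceID
	for k, v := range sc.Baggage {
		headers[baggagePrefix+k] = v
	}
}

// Extract rebuilds the context, baggage included, from inbound headers.
func Extract(headers map[string]string) SpanContext {
	sc := SpanContext{TraceID: headers["trace-id"], Baggage: map[string]string{}}
	for k, v := range headers {
		if strings.HasPrefix(k, baggagePrefix) {
			sc.Baggage[strings.TrimPrefix(k, baggagePrefix)] = v
		}
	}
	return sc
}

func main() {
	// Client: tags the request with the button that was pressed.
	client := SpanContext{TraceID: "t1"}.WithBaggageItem("button", "buy")
	toFrontend := map[string]string{}
	Inject(client, toFrontend)

	// Frontend: adds an experiment ID and calls further downstream.
	frontend := Extract(toFrontend).WithBaggageItem("exp_id", "57")
	toStorage := map[string]string{}
	Inject(frontend, toStorage)

	// Storage layer: never talked to the client, yet sees both items.
	storage := Extract(toStorage)
	fmt.Println(storage.Baggage["button"], storage.Baggage["exp_id"], len(toStorage))
}
```

Intermediate services that do not care about the items simply pass them along, which is what makes baggage transparent to most of the call graph.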

Slide 61

Slide 61 text

Extra Credit

Slide 62

Slide 62 text

Logging v. Tracing

Slide 63

Slide 63 text

Monitoring == Observing Events
Metrics - Record events as aggregates (e.g. counters)
Tracing - Record transaction-scoped events
Logging - Record unique events
[Axis: Low volume to High volume]

Slide 64

Slide 64 text

Logging v. Tracing
Tracing
● Contextual
● High granularity (debug and ↓)
● Per-transaction sampling
● Lower volume, higher fidelity
Logging
● No context
● Low granularity (warn and ↑)
● Per-process sampling (at best)
● High volume, low fidelity
Industry advice: don’t log on success (https://vimeo.com/221066726)

Slide 65

Slide 65 text

Q & A
Open Discussion

Slide 66

Slide 66 text

Thank You and See You in Austin!
• See you in Austin and Copenhagen!
• KubeCon + CloudNativeCon North America 2017
– Austin, Texas (December 6 - 8, 2017)
– Registration & Sponsorships now open: kubecon.io
• KubeCon + CloudNativeCon Europe 2018
– Copenhagen, Denmark (May 2 - 4, 2018)
– http://events.linuxfoundation.org/events/kubecon-and-cloudnativecon-europe

Slide 67

Slide 67 text

Appendix

Slide 68

Slide 68 text

Jaeger at Uber
● Root cause and dependency analysis
● Distributed context propagation
○ Tenancy
○ Security
○ Chaos Engineering
● Data mining
○ Capacity Planning
○ Latency and SLA analysis

Slide 69

Slide 69 text

Jaeger: Roadmap
● Adaptive sampling
● Data mining pipeline
● Instrumentation in more languages
● Drop-in replacement for Zipkin
● Path-based dependency diagrams
● Latency histograms