Slide 1

Observable Applications With OpenTelemetry
Virtual Rejekts | 01.04.20

Slide 2

Do we need distributed tracing?

Slide 3

That depends on the questions we want to ask...

Slide 4

Hi, I'm Johannes
Johannes Liebermann
Software Developer, Kinvolk
Github: johananl
Twitter: @j_lieb
Email: [email protected]

Slide 5

Kinvolk
The Kubernetes Linux Experts
Engineering services and products for Kubernetes, containers, process management and Linux user-space + kernel
Blog: kinvolk.io/blog
Github: kinvolk
Twitter: kinvolkio
Email: [email protected]

Slide 6

https://xkcd.com/927/

Slide 7

Agenda
● Logs, metrics and traces
● Distributed tracing: why it's great but also hard
● Introduction to OpenTelemetry
● A look into the tracing libraries (Go, Python)
● Demo: instrumenting a distributed application
● Context propagation

Slide 8

This talk assumes familiarity with distributed tracing.

Slide 9

Distributed Tracing in 30 Seconds

helloHandler := func(w http.ResponseWriter, req *http.Request) {
    ...
    ctx, span := tr.Start(
        req.Context(),
        "handle-hello-request",
        ...
    )
    defer span.End()

    _, _ = io.WriteString(w, "Hello, world!\n")
}

Slide 10

Distributed Tracing in 30 Seconds

Slide 11

Distributed Tracing in 30 Seconds
● A span measures a unit of work in a service
● A trace combines multiple spans together

[Diagram: one trace containing the spans handle-http-request (5ms), query-database (3ms) and render-response (2ms)]

Slide 12

Logs, Metrics, Traces

Slide 13

Logs, Metrics, Traces

         Generating  Processing & Storing  Querying  Scope
Logs     Easy        Hard                  Hard      Node
Metrics  So-so       Easy                  Easy      Node / Service
Traces   Hard        So-so                 Easy      Request

Slide 14

Logs, Metrics, Traces

         Generating  Processing & Storing  Querying  Scope
Logs     Easy        Hard                  Hard      Node
Metrics  So-so       Easy                  Easy      Node / Service
Traces   Hard        So-so                 Easy      Request

Slide 15

What Is the Question?
● Why did this node crash?
● Was function X on the node called?
● Is my service healthy?
● How much traffic do we have?
● Why was this request slow?
● Where should I optimize performance?
● Which services are involved?

Slide 16

What Is the Question?

Logs:
● Why did this node crash?
● Was function X on the node called?

Metrics:
● Is my service healthy?
● How much traffic do we have?

Traces:
● Why was this request slow?
● Where should I optimize performance?
● Which services are involved?

Slide 17

What Is the Question?
● Why did this node crash?
● Was function X on the node called?
● Is my service healthy?
● How much traffic do we have?
● Why was this request slow?
● Where should I optimize performance?
● Which services are involved?

Slide 18

Distributed tracing allows us to get low-level, end-to-end information about individual requests.

Slide 19

...so why not trace everything all the time?

Slide 20

Logs, Metrics, Traces

         Generating  Processing & Storing  Querying  Scope
Logs     Easy        Hard                  Hard      Service
Metrics  So-so       Easy                  Easy      Service / system
Traces   Hard        So-so                 Easy      Request

Slide 21

1. It's a Lot of Work
● Instrumentation == code changes
● Hard to justify reducing team velocity for tracing
● You can't have "instrumentation holes"
    ○ At the very least you must propagate context

Slide 22

2. We Can't Afford Vendor Lock-In
● Vendor lock-in is especially problematic for tracing
● Importing a vendor-specific library is scary
    ○ What if my monitoring vendor raises prices?
● Open-source libraries must remain neutral
    ○ You can't require users to use a specific vendor
    ○ Maintaining support for multiple vendors is a lot of work

Slide 23

3. Distributed Tracing vs. Microservices
● Does distributed tracing conflict with microservices?

Slide 24

Multiple Everything
● Multiple microservices
● Multiple programming languages and frameworks
● Multiple protocols (HTTP, gRPC, messaging, ...)
● Multiple tracing backends (Jaeger, Zipkin, Datadog, LightStep, New Relic, Dynatrace, ...)

Slide 25

Is there a solution?

Slide 26

Standards!

Slide 27

Lack of standards is especially costly for distributed tracing.

Slide 28

https://xkcd.com/927/

Slide 29

(no text on this slide)

Slide 30

OpenCensus

Slide 31

OpenCensus

Slide 32

[Logos: OpenTracing + OpenCensus = OpenTelemetry, May 2019]

Slide 33

Introducing OpenTelemetry
● opentelemetry.io
● Announced May 2019
● The next major version of both OpenTracing and OpenCensus
● A real community effort
● A spec and a set of libraries
● API and implementation
● Tracing and metrics

Slide 34

OpenTelemetry — Architecture
● API
    ○ Follows the OpenTelemetry specification
    ○ Can be used without an implementation
● SDK
    ○ A ready-to-use implementation
    ○ Alternative implementations are supported
● Exporters
● Bridges

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/library-guidelines.md

Slide 35

Separation of Concerns
● Library developers depend only on the API
● Application developers depend on the API and on an implementation
● Monitoring vendors maintain their own exporters

Slide 36

Protecting User Applications
● I may want to use an instrumented 3rd-party library without using OpenTelemetry
    ○ If no implementation is plugged in, telemetry data is not produced
● My code should not be broken by instrumentation
    ○ The API package is self-sufficient thanks to a built-in noop implementation
● Performance impact should be minimal
    ○ No blocking of end-user application by default
    ○ Noop implementation produces negligible overhead
    ○ Asynchronous exporting of telemetry data
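The noop pattern described above can be sketched as follows. This is a simplified illustration, not the real OpenTelemetry API (the `Tracer`/`Span` interfaces and `globalTracer` variable here are invented for the example): instrumented code calls an interface, and unless an SDK replaces the default, every call hits a no-op that does nothing and costs almost nothing.

```go
package main

import "fmt"

// Tracer and Span are minimal stand-ins for a tracing API.
type Tracer interface {
	StartSpan(name string) Span
}

type Span interface {
	End()
}

// noopTracer/noopSpan implement the API but do nothing, so
// instrumented code runs safely when no implementation is plugged in.
type noopTracer struct{}
type noopSpan struct{}

func (noopTracer) StartSpan(name string) Span { return noopSpan{} }
func (noopSpan) End()                         {}

// globalTracer defaults to the noop; an SDK would swap it at startup.
var globalTracer Tracer = noopTracer{}

// doWork is "instrumented" code: it traces itself through the API
// without knowing whether a real implementation is installed.
func doWork() string {
	span := globalTracer.StartSpan("do-work")
	defer span.End()
	return "done"
}

func main() {
	fmt.Println(doWork()) // prints "done"
}
```

Because `doWork` depends only on the interface, swapping in a real tracer later requires no changes to the instrumented code.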

Slide 37

Project Status
● Current status: beta
● Production readiness: 2nd half of 2020
● Libraries for: Go, Python, Java, C++, Rust, PHP, Ruby, .NET, JavaScript, Erlang

Slide 38

Go Library Status
Latest release: v0.4.2 (beta)
✓ API (tracing + metrics)
✓ SDK (tracing + metrics)
✓ Context propagation
✓ Exporters: Jaeger, Zipkin, Prometheus (metrics)
✓ OpenTracing bridge

Slide 39

Python Library Status
Latest release: v0.6.0 (beta)
✓ API (tracing + metrics)
✓ SDK (tracing + metrics)
✓ Context propagation
✓ Exporters: Jaeger, Zipkin, Prometheus (metrics)
✓ OpenTracing bridge

Slide 40

How do I instrument my services?

Slide 41

Instrumenting Go Code

// Explicit span creation.
handler := func(w http.ResponseWriter, r *http.Request) {
    ctx, span := tr.Start(r.Context(), "handle-request")
    defer span.End()

    // Handle HTTP request.
}

// Implicit span creation.
err := tr.WithSpan(ctx, "do-stuff",
    func(context.Context) error {
        return do_stuff()
    },
)

Slide 42

Instrumenting Go Code

// Log an event on the span.
span.AddEvent(ctx, "Generating response",
    key.New("response").String("stuff"),
)

// Set key-value pairs on the span.
span.SetAttributes(
    key.New("cool").Bool(true),
    key.New("usefulness").String("very"),
)

Slide 43

Slide 43 text

// Inject tracing metadata on outgoing requests. grpctrace.Inject(ctx, &metadata) // Extract tracing metadata on incoming requests. metadata, spanCtx := grpctrace.Extract(ctx, &metadataCopy) Propagating Context Between Processes Protocol dependent!

Slide 44

Instrumenting Python Code

# Implicit span creation.
with tracer.start_as_current_span("do-stuff") as span:
    do_stuff()

# Log an event on the span.
span.add_event("something happened", {"foo": "bar"})

# Set key-value pairs on the span.
span.set_attribute("cool", True)

Slide 45

Demo: Fake Job Title Generator

[Diagram: a Frontend service calls the Seniority, Field and Role services over HTTP and gRPC]

Slide 46

Demo

Slide 47

The tricky part: context propagation.

Slide 48

Context — Request-Scoped Data
● "Context" refers to request-scoped data
    ○ Example: request/transaction ID
● Context is propagated across a request's path
● Needed for span correlation
    ○ Trace ID and span ID must be propagated
● Two types of context propagation: in-process and distributed

Slide 49

In-Process Context Propagation

// User application:
handleIncomingRequest() {
    span := tr.Start()
    defer span.End()
    library1.doSomething()
}

// 3rd-party library:
library1.doSomething() {
    // Current span?
    library2.doSomething()
}

// 3rd-party library:
library2.doSomething() {
    // Current span?
}

Slide 50

In-Process Context Propagation
● Used among functions or goroutines within a service
● Must be thread-safe
● Two main approaches:
    ○ Implicit: thread-local storage, global variables, ...
    ○ Explicit: as an argument in function calls
● Go uses the context standard library package
● Python uses context vars

Slide 51

In-Process Context Propagation — Go (api/trace/context.go)

// Set current span.
func ContextWithSpan(ctx context.Context, span Span) context.Context {
    return context.WithValue(ctx, currentSpanKey, span)
}

// Get current span.
func SpanFromContext(ctx context.Context) Span {
    if span, has := ctx.Value(currentSpanKey).(Span); has {
        return span
    }
    return NoopSpan{}
}

Slide 52

Distributed Context Propagation

[Diagram: Service A calls Service B over HTTP (httptrace.Inject / httptrace.Extract); Service B calls Service C over gRPC (grpctrace.Inject / grpctrace.Extract)]

Slide 53

Conclusion

Slide 54

Conclusion
● Tracing is tricky, but may well be worth it
● It's much easier than before
    ○ Hopefully you'll never have to re-instrument
    ○ Auto-instrumentation is in the works
● No vendor lock-in
    ○ Architecture encourages separation of concerns
● A good balance between freedom and uniformity
    ○ Simple APIs
    ○ Support for arbitrary implementations
    ○ A real community effort

Slide 55

Thank you!

Johannes Liebermann
Github: johananl
Twitter: @j_lieb
Email: [email protected]

Kinvolk
Blog: kinvolk.io/blog
Github: kinvolk
Twitter: kinvolkio
Email: [email protected]