
Observable Applications With OpenTelemetry

Johannes Liebermann
April 01, 2020

OpenTelemetry is a CNCF sandbox project that standardizes application tracing and monitoring across multiple programming languages, protocols, platforms and vendors. In this talk I give a brief introduction to the OpenTelemetry project, explore some of its language libraries, and demonstrate how they can be used to make distributed applications observable. I also look at some of the tricky parts of implementing distributed tracing and at how OpenTelemetry handles them.

Transcript

  1. Kinvolk: The Kubernetes Linux Experts. Engineering services and
     products for Kubernetes, containers, process management and Linux
     user-space + kernel.
     Blog: kinvolk.io/blog | Github: kinvolk | Twitter: kinvolkio | Email: [email protected]
  2. Agenda
     • Logs, metrics and traces
     • Distributed tracing: why it's great but also hard
     • Introduction to OpenTelemetry
     • A look into the tracing libraries (Go, Python)
     • Demo: instrumenting a distributed application
     • Context propagation
  3. Distributed Tracing in 30 Seconds

     helloHandler := func(w http.ResponseWriter, req *http.Request) {
         ...
         ctx, span := tr.Start(
             req.Context(), "handle-hello-request", ...
         )
         defer span.End()
         _, _ = io.WriteString(w, "Hello, world!\n")
     }
  4. Distributed Tracing in 30 Seconds
     • A span measures a unit of work in a service
     • A trace combines multiple spans together
     (diagram: a trace of three spans, handle-http-request 5ms,
     query-database 3ms and render-response 2ms)
  5. Logs, Metrics, Traces

                Generating   Processing & Storing   Querying   Scope
     Logs       Easy         Hard                   Hard       Node
     Metrics    So-so        Easy                   Easy       Node / Service
     Traces     Hard         So-so                  Easy       Request
  7. What Is the Question?
     • Why did this node crash?
     • Was function X on the node called?
     • Is my service healthy?
     • How much traffic do we have?
     • Why was this request slow?
     • Where should I optimize performance?
     • Which services are involved?
  8. What Is the Question?
     Logs:
     • Why did this node crash?
     • Was function X on the node called?
     Metrics:
     • Is my service healthy?
     • How much traffic do we have?
     Traces:
     • Why was this request slow?
     • Where should I optimize performance?
     • Which services are involved?
  10. Logs, Metrics, Traces

                 Generating   Processing & Storing   Querying   Scope
      Logs       Easy         Hard                   Hard       Service
      Metrics    So-so        Easy                   Easy       Service / system
      Traces     Hard         So-so                  Easy       Request
  11. 1. It's a Lot of Work
      • Instrumentation == code changes
      • Hard to justify reducing team velocity for tracing
      • You can't have "instrumentation holes"
        ◦ At the very least you must propagate context
  12. 2. We Can't Vendor-Lock
      • Vendor lock-in is especially problematic for tracing
      • Importing a vendor-specific library is scary
        ◦ What if my monitoring vendor raises prices?
      • Open-source libraries must remain neutral
        ◦ You can't require users to use a specific vendor
        ◦ Maintaining support for multiple vendors is a lot of work
  13. Multiple Everything
      • Multiple microservices
      • Multiple programming languages and frameworks
      • Multiple protocols (HTTP, gRPC, messaging, ...)
      • Multiple tracing backends (Jaeger, Zipkin, Datadog, LightStep,
        NewRelic, Dynatrace, …)
  14. Introducing OpenTelemetry
      • opentelemetry.io
      • Announced May 2019
      • The next major version of both OpenTracing and OpenCensus
      • A real community effort
      • A spec and a set of libraries
      • API and implementation
      • Tracing and metrics
  15. OpenTelemetry Architecture
      • API
        ◦ Follows the OpenTelemetry specification
        ◦ Can be used without an implementation
      • SDK
        ◦ A ready-to-use implementation
        ◦ Alternative implementations are supported
      • Exporters
      • Bridges
      https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/library-guidelines.md
  16. Separation of Concerns
      • Library developers depend only on the API
      • Application developers depend on the API and on an implementation
      • Monitoring vendors maintain their own exporters
  17. Protecting User Applications
      • I may want to use an instrumented 3rd-party library without using OpenTelemetry
        ◦ If no implementation is plugged in, telemetry data is not produced
      • My code should not be broken by instrumentation
        ◦ The API package is self-sufficient thanks to a built-in noop implementation
      • Performance impact should be minimal
        ◦ No blocking of the end-user application by default
        ◦ The noop implementation produces negligible overhead
        ◦ Telemetry data is exported asynchronously
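The noop pattern behind those guarantees can be sketched in a few lines of plain Go. This is a simplified illustration of the idea, not the actual OpenTelemetry API; all names are invented:

```go
package main

import "fmt"

// Tracer and Span form a toy API package: an interface plus a
// built-in no-op implementation, so instrumented code still works
// when no SDK is plugged in.
type Tracer interface {
	StartSpan(name string) Span
}

type Span interface {
	End()
}

// noopTracer and noopSpan do nothing and cost almost nothing.
type noopTracer struct{}
type noopSpan struct{}

func (noopTracer) StartSpan(string) Span { return noopSpan{} }
func (noopSpan) End()                    {}

// The global tracer defaults to the no-op; installing an SDK would
// replace it with a real implementation.
var globalTracer Tracer = noopTracer{}

// doWork is "instrumented" library code: it never checks whether a
// real tracer is installed.
func doWork() {
	span := globalTracer.StartSpan("do-work")
	defer span.End()
	fmt.Println("working")
}

func main() {
	doWork() // safe even though no real tracer is plugged in
}
```

Because the default never blocks or allocates meaningfully, library authors can instrument unconditionally without breaking their users.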
  18. Project Status
      • Current status: beta
      • Production readiness: 2nd half of 2020
      • Libraries for: Go, Python, Java, C++, Rust, PHP, Ruby, .NET, JavaScript, Erlang
  19. Go Library Status
      Latest release: v0.4.2 (beta)
      ✓ API (tracing + metrics)
      ✓ SDK (tracing + metrics)
      ✓ Context propagation
      ✓ Exporters: Jaeger, Zipkin, Prometheus (metrics)
      ✓ OpenTracing bridge
  20. Python Library Status
      Latest release: v0.6.0 (beta)
      ✓ API (tracing + metrics)
      ✓ SDK (tracing + metrics)
      ✓ Context propagation
      ✓ Exporters: Jaeger, Zipkin, Prometheus (metrics)
      ✓ OpenTracing bridge
  21. Instrumenting Go Code

      // Explicit span creation.
      handler := func(w http.ResponseWriter, r *http.Request) {
          ctx, span := tr.Start(r.Context(), "handle-request")
          defer span.End()
          // Handle HTTP request.
      }

      // Implicit span creation.
      err := tr.WithSpan(ctx, "do-stuff",
          func(context.Context) error { return do_stuff() },
      )
  22. Instrumenting Go Code

      // Log an event on the span.
      span.AddEvent(ctx, "Generating response",
          key.New("response").String("stuff"),
      )

      // Set key-value pairs on the span.
      span.SetAttributes(
          key.New("cool").Bool(true),
          key.New("usefulness").String("very"),
      )
  23. Propagating Context Between Processes

      // Inject tracing metadata on outgoing requests.
      grpctrace.Inject(ctx, &metadata)

      // Extract tracing metadata on incoming requests.
      metadata, spanCtx := grpctrace.Extract(ctx, &metadataCopy)

      Protocol dependent!
  24. Instrumenting Python Code

      # Implicit span creation.
      with tracer.start_as_current_span("do-stuff") as span:
          do_stuff()

      # Log an event on the span.
      span.add_event("something happened", {"foo": "bar"})

      # Set a key-value pair on the span.
      span.set_attribute("cool", True)
  25. Context: Request-Scoped Data
      • "Context" refers to request-scoped data
        ◦ Example: request/transaction ID
      • Context is propagated along a request's path
      • Needed for span correlation
        ◦ Trace ID and span ID must be propagated
      • Two types of context propagation: in-process and distributed
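Span correlation works because every span in a request carries the same trace ID while getting its own span ID. A minimal sketch of that invariant, with toy names and simplified ID handling (OpenTelemetry uses 16-byte trace IDs and 8-byte span IDs):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newID returns n random bytes as a hex string, standing in for
// real trace/span ID generation.
func newID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// SpanContext is the propagated part of a span: just enough for a
// backend to correlate spans into one trace.
type SpanContext struct {
	TraceID string
	SpanID  string
}

// startSpan keeps the caller's trace ID (or mints one at the root)
// and always mints a fresh span ID.
func startSpan(parent *SpanContext) SpanContext {
	if parent == nil {
		return SpanContext{TraceID: newID(16), SpanID: newID(8)}
	}
	return SpanContext{TraceID: parent.TraceID, SpanID: newID(8)}
}

func main() {
	root := startSpan(nil)
	child := startSpan(&root)
	fmt.Println(root.TraceID == child.TraceID) // shared trace ID
	fmt.Println(root.SpanID == child.SpanID)   // distinct span IDs
}
```

Propagation, in-process or distributed, is ultimately about moving this small struct along the request's path.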
  26. In-Process Context Propagation

      // User application
      handleIncomingRequest() {
          span := tr.Start()
          defer span.End()
          library1.doSomething()
      }

      // 3rd-party library
      library1.doSomething() {
          // Current span?
          library2.doSomething()
      }

      // 3rd-party library
      library2.doSomething() {
          // Current span?
      }
  27. In-Process Context Propagation
      • Used among functions or goroutines within a service
      • Must be thread-safe
      • Two main approaches:
        ◦ Implicit: thread-local storage, global variables, …
        ◦ Explicit: as an argument in function calls
      • Go uses the context standard library package
      • Python uses context vars
  28. In-Process Context Propagation in Go

      // Set current span.
      func ContextWithSpan(ctx context.Context, span Span) context.Context {
          return context.WithValue(ctx, currentSpanKey, span)
      }

      // Get current span.
      func SpanFromContext(ctx context.Context) Span {
          if span, has := ctx.Value(currentSpanKey).(Span); has {
              return span
          }
          return NoopSpan{}
      }

      (api/trace/context.go)
  29. Conclusion
      • Tracing is tricky, but may well be worth it
      • It's much easier than before
        ◦ Hopefully you'll never have to re-instrument
        ◦ Auto-instrumentation is in the works
      • No vendor lock-in
        ◦ The architecture encourages separation of concerns
      • A good balance between freedom and uniformity
        ◦ Simple APIs
        ◦ Support for arbitrary implementations
        ◦ A real community effort