Distributed Tracing: Understanding how your components work together - MicroCPH 2018

Distributed Tracing: Understand how your components work together

About me
José Carlos Chávez
• Software Engineer at Typeform, focused on the responses services aggregate.
• Zipkin core team member, DataDog consultant and open source contributor to Distributed Tracing projects.

Distributed Systems

Distributed systems
A collection of independent components that appears to its users as a single coherent system.
Characteristics:
• Concurrency
• No global clock
• Independent failures
#microcph / @jcchavezs

Distributed systems
[Figure: house plumbing diagram. Labels: water heater, gas supplier, cold water storage tank, shutoff valve, first floor branch, tank valve. Speech bubble: 爆$❄#☭]

Distributed systems: Understanding failures
[Figure: service diagram. Components: API Proxy, Media API, Auth service, Images service, Videos service, DB1, DB2, DB3, DB4. Labels: GET /media/u1k, TCP error (2003), 500 Internal Error, 500 Internal Error]

Distributed systems: Understanding failures
[Figure: house plumbing diagram. Labels: water heater, gas supplier, cold water storage tank, shutoff valve, first floor branch, tank valve. Speech bubbles: 爆$❄#☭ and "I AM HERE! First floor distributor is clogged!"]

We do have that, it is called logs!

Logs: Concurrency
[Figure: service diagram. Components: API Proxy, Auth service, Media API, Images service, Videos service, DB1, DB2, DB3, DB4. Labels: GET /media/u1k, TCP error (2003), 500 Internal Error, 500 Internal Error]

Logs
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/13548”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/23948”
[24/Oct/2017 13:50:08 +0000] “GET /media HTTP/1.1” 200 … **0/12396”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/23748”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23248”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 200 … **0/26548”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/13148”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/2588”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 500 … **0/3248”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/23548”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/22598”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/13948”
...
? ?

Distributed Tracing to unclog your pipes

Distributed Tracing
• Understanding latency issues across services
• Identifying dependencies and critical paths
• Visualizing concurrency
• Request scoped
• Complementary to other observability means

Distributed tracing as a means of observability
Credits: Peter Bourgon

Distributed microservices architecture
[Figure: service diagram. Components: API Proxy, Auth service, Media API, Images service, Videos service, DB1, DB2, DB3, DB4. Labels: GET /media/u1k, TCP error (2003) Can’t connect to server, 500 Internal Error, 500 Internal Error]

Logs
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/13548”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/23948”
[24/Oct/2017 13:50:08 +0000] “GET /media HTTP/1.1” 200 … **0/12396”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/23748”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23248”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 200 … **0/26548”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/13148”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/2588”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 500 … **0/3248”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/23548”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/22598”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/13948”
...

Distributed tracing
[Figure: trace timeline. Rows: API Proxy, Media API, Auth, Videos, Images; horizontal axis: Time. Annotations, in order: [1508410470] error TCP (2003) 500; [1508410442] no cache for resource, retrieving from DB; ip_1027301; error; error]

Elements of Distributed Tracing
• A trace shows an execution path through a distributed system
• A context includes information that should be propagated across services
• A span in the trace represents a logical unit of work (with a beginning and end)
• Tags and logs (optional) add complementary information to spans.
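The four elements above can be sketched as a small data model. This is a minimal illustration, not the data model of Zipkin or any other tracer: the class names (SpanContext, Span), their fields and the trace_of helper are assumptions introduced here.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SpanContext:
    """Hypothetical context: the part that is propagated across services."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    sampled: bool = True


@dataclass
class Span:
    """Hypothetical span: a logical unit of work with a beginning and an end."""
    name: str
    context: SpanContext
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: Dict[str, str] = field(default_factory=dict)            # optional key/value details
    logs: List[Tuple[float, str]] = field(default_factory=list)   # optional timestamped events

    def tag(self, key: str, value: str) -> None:
        self.tags[key] = value

    def log(self, message: str) -> None:
        self.logs.append((time.time(), message))

    def finish(self) -> None:
        self.end = time.time()


def trace_of(spans: List[Span], trace_id: str) -> List[Span]:
    """A trace is the set of spans that share one trace ID."""
    return [s for s in spans if s.context.trace_id == trace_id]
```

A root span and its child share the trace ID, and the child records the root's span ID as its parent; that link is what lets a tracing backend rebuild the execution path.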

Elements of Distributed Tracing
[Figure: context propagation. Labels: Client / Producer, Extract(), Format, DB Call, Inject(), Extract().
First context: TraceID: fAf3oXLoD, SpanID: dZ0qHIBa1, Sampled: true, …
Second context: TraceID: fAf3oXLoD, ParentSpanID: dZ0qHIBa1, SpanID: oYa7m31wq, Sampled: true, ...]
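The Inject() and Extract() steps in this slide can be sketched as follows, assuming the context travels in HTTP headers named after Zipkin's B3 propagation format (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled). The Context type and the inject, extract and child_of functions are hypothetical names for illustration, and the IDs reuse the ones shown on the slide rather than real hexadecimal identifiers.

```python
import secrets
from typing import Dict, NamedTuple, Optional


class Context(NamedTuple):
    """Hypothetical propagated context, mirroring the fields on the slide."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    sampled: bool


def inject(ctx: Context, headers: Dict[str, str]) -> None:
    """Client side: write the context into the outgoing request headers."""
    headers["X-B3-TraceId"] = ctx.trace_id
    headers["X-B3-SpanId"] = ctx.span_id
    if ctx.parent_span_id:
        headers["X-B3-ParentSpanId"] = ctx.parent_span_id
    headers["X-B3-Sampled"] = "1" if ctx.sampled else "0"


def extract(headers: Dict[str, str]) -> Optional[Context]:
    """Server side: read the context back; None if the request is untraced.

    Note: a plain dict is case-sensitive, while real HTTP headers are not.
    """
    trace_id = headers.get("X-B3-TraceId")
    span_id = headers.get("X-B3-SpanId")
    if not trace_id or not span_id:
        return None
    return Context(
        trace_id,
        span_id,
        headers.get("X-B3-ParentSpanId"),
        headers.get("X-B3-Sampled") == "1",
    )


def child_of(parent: Context) -> Context:
    """Same trace ID, new span ID, the parent's span ID as parent, same sampling decision."""
    return Context(parent.trace_id, secrets.token_hex(8), parent.span_id, parent.sampled)
```

The receiving service would call extract() on the incoming headers and child_of() on the result before doing its own work, which yields the second context of the slide: same TraceID, new SpanID, and the caller's SpanID as ParentSpanID.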

Benefits of Distributed Tracing
• Effective monitoring
• System insight, clarifies non-trivial interactions
• Visibility into critical paths and dependencies
• Observe E2E latency in near real time
• Request scoped, not request’s lifecycle scoped.

What about overhead?
• Observability tools are unintrusive
• Sampling reduces overhead
• Instrumentation can be delegated to common frameworks
• (Don’t) trace every single operation
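To make the sampling point concrete, here is a minimal sketch of a probabilistic sampler. It assumes the decision is taken once, where the trace starts, and is then inherited downstream through the propagated Sampled flag. ProbabilisticSampler and decide are hypothetical names, not the API of any tracing library.

```python
import random
from typing import Optional


class ProbabilisticSampler:
    """Hypothetical sampler: keeps roughly `rate` of the traces it is asked about."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._rng = rng or random.Random()

    def is_sampled(self) -> bool:
        # random() is in [0.0, 1.0): rate 0.0 never samples, rate 1.0 always does.
        return self._rng.random() < self.rate


def decide(sampler: ProbabilisticSampler, upstream_decision: Optional[bool]) -> bool:
    """Honor the caller's decision if there is one; otherwise decide locally."""
    if upstream_decision is not None:
        return upstream_decision
    return sampler.is_sampled()
```

Inheriting the upstream decision keeps a trace either fully recorded or fully dropped, so a sampled trace does not end up with missing spans in the middle.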

Introducing Zipkin

(open)Zipkin
• Distributed tracing tool
• Based on Google Dapper (2010)
• Created by Twitter (2012)
• Open source (2015)
• Mature tracing model
• Strong community:
  ◦ @zipkinproject
  ◦ github.com/openzipkin

Zipkin: architecture
• Service (instrumented) → Transport (http/kafka/grpc): collect spans
• Collector: receive spans, deserialize and schedule for storage
• Storage (DB: Cassandra/MySQL/ElasticSearch): store spans
• API: retrieve data
• UI: visualize
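As an illustration of the first hop in this architecture (instrumented service to collector over HTTP), the sketch below builds a span in Zipkin's v2 JSON format and posts it to the collector. It assumes a Zipkin server listening on its default port 9411 on localhost; to_zipkin_v2 and report are hypothetical helper names, and a real service would normally rely on an existing instrumentation library instead of hand-written reporting.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def to_zipkin_v2(trace_id: str, span_id: str, name: str, service: str,
                 start_s: float, duration_s: float,
                 parent_id: Optional[str] = None,
                 tags: Optional[Dict[str, str]] = None) -> dict:
    """Build one span as a dict following the Zipkin v2 JSON field names."""
    span = {
        "traceId": trace_id,                        # 16 or 32 lower-hex characters
        "id": span_id,                              # 16 lower-hex characters
        "name": name,
        "timestamp": int(start_s * 1_000_000),      # epoch microseconds
        "duration": max(1, int(duration_s * 1_000_000)),  # microseconds
        "localEndpoint": {"serviceName": service},
    }
    if parent_id:
        span["parentId"] = parent_id
    if tags:
        span["tags"] = dict(tags)
    return span


def report(spans: List[dict],
           endpoint: str = "http://localhost:9411/api/v2/spans") -> int:
    """POST a JSON list of spans to the collector; returns the HTTP status code."""
    body = json.dumps(spans).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=body,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.status
```

With a collector running, report([to_zipkin_v2(...)]) would hand the span over for storage, after which it can be retrieved through the API and visualized in the UI.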

Zipkin: traces

Zipkin: trace overview

Zipkin: tags and logs

Zipkin: traces with errors

Zipkin: traces for async operations

Zipkin: dependency graph

Q&As @jcchavezs Look around resources: https://goo.gl/rMXLmp
