
Distributed Tracing: Understanding how all your components work together

Understanding system failures traditionally starts with looking at a single component in isolation. However, this approach does not provide sufficient information in distributed services architectures. In these systems, end-user requests traverse dozens of components, and therefore a new approach is needed.

In this talk we’ll look at distributed tracing, which summarizes and contextualizes all sides of the story into a well-scoped and shared timeline. We’ll also look at distributed tracing tools, like Zipkin, which highlight the relationship between components, from the very top of the stack to the deepest aspects of the system.

José Carlos Chávez

November 15, 2017

Transcript

  1. About me José Carlos Chávez Software Engineer at Typeform, focused

    on the responses services aggregate. Open source contributor to Distributed Tracing projects. @jcchavezs [email protected]
  2. Distributed systems A collection of independent components that appears to its

    users as a single coherent system. Characteristics: → Concurrency → No global clock → Independent failures
  3. Distributed systems Water heater Gas supplier Cold water storage tank

    Shutoff valve First floor branch Tank valve 爆$❄#☭
  4. Distributed microservices architecture Frontend Service Ads Content Search Images Search

    DB2 DB3 DB1 TCP error (2003) 500 Internal Error 500 Internal Error Search Service GET /?q=cats
  5. Distributed microservices architecture Water heater Gas supplier Cold water storage

    tank Shutoff valve First floor branch Tank valve 爆$❄#☭ I AM HERE! First floor distributor is clogged!
  6. Logs Frontend Service Ads Search Service Content Search Images Search

    DB2 DB3 DB1 TCP error (2003) 500 Internal Error 500 Internal Error GET /?q=cats
  7. Logs ? ?

    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13548”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23948”
    [24/Oct/2017 13:50:08 +0000] “GET / HTTP/1.1” 200 … **0/12396”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23748”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/23248”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 200 … **0/26548”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13148”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/2588”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 500 … **0/3248”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23548”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/22598”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/13948”
    ...
  8. Distributed Tracing → Understanding latency issues across services → Identifying

    dependencies and critical paths → Visualizing concurrency → Request scoped → Complementary to other monitoring tools
  9. Distributed Tracing Frontend Service Ads Search Service Content Search Images

    Search DB2 DB3 DB1 TCP error (2003) Can’t connect to server 500 Internal Error 500 Internal Error GET /?q=cats
  10. Logs

    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13548”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23948”
    [24/Oct/2017 13:50:08 +0000] “GET / HTTP/1.1” 200 … **0/12396”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23748”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/23248”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 200 … **0/26548”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13148”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/2588”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 500 … **0/3248”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23548”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/22598”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/13948”
    ...
  11. Distributed Tracing FRONTEND SEARCH ADS CONTENT IMAGES Time [1508410470] error

    TCP (2003) 500 [1508410442] no cache for resource, retrieving from DB db_2_inst1 error error
  12. Elements of Distributed Tracing → A trace shows an execution

    path through a distributed system → A context includes information that should be propagated across services → A span in the trace represents a logical unit of work (with a beginning and end) → Tags and logs (optional) add complementary information to spans.
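The elements above can be sketched as a minimal data model. This is a hypothetical illustration in Python (the class and field names are my own, not any particular tracer's API):

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

def new_id() -> str:
    """Random 64-bit id as a hex string (illustrative only)."""
    return uuid.uuid4().hex[:16]

@dataclass
class SpanContext:
    """The information that must be propagated across services."""
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None
    sampled: bool = True

@dataclass
class Span:
    """A logical unit of work with a beginning and an end."""
    name: str
    context: SpanContext
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: dict = field(default_factory=dict)   # e.g. {"http.status_code": "500"}
    logs: list = field(default_factory=list)   # timestamped events

    def finish(self) -> None:
        self.end = time.time()

# A trace is simply the set of spans that share one trace_id:
root = Span("GET /?q=cats", SpanContext(trace_id=new_id(), span_id=new_id()))
child = Span("search-service", SpanContext(
    trace_id=root.context.trace_id,      # same trace id everywhere
    span_id=new_id(),
    parent_id=root.context.span_id))     # links the child to its parent
child.tags["error"] = "TCP error (2003)"
child.finish()
root.finish()
```

Real tracers use 64- or 128-bit binary ids and microsecond timestamps; the point here is the shape of the data: every span carries the context that links it into its trace.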
  13. How does propagation work? Client / Producer Extract() Inject() Span

    2 Span 1 Extract() TraceID: fAf3oXLoDS ParentID: ... SpanID: dZ0qHIBa1A Sampled: true ... TraceID: fAf3oXLoDS ParentID: dZ0qHIBa1A SpanID: oYa7m31wq5 Sampled: true ...
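The Inject()/Extract() step in the diagram can be illustrated with Zipkin's B3 HTTP headers (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled). The helpers below are a simplified sketch, not a real instrumentation library:

```python
import uuid

def new_id() -> str:
    """Random id as a hex string (illustrative only)."""
    return uuid.uuid4().hex[:16]

def inject(ctx: dict, headers: dict) -> None:
    """Client side: write the span context into outgoing request headers."""
    headers["X-B3-TraceId"] = ctx["trace_id"]
    headers["X-B3-SpanId"] = ctx["span_id"]
    if ctx.get("parent_id"):
        headers["X-B3-ParentSpanId"] = ctx["parent_id"]
    headers["X-B3-Sampled"] = "1" if ctx["sampled"] else "0"

def extract(headers: dict) -> dict:
    """Server side: rebuild the caller's context and start a child span."""
    return {
        "trace_id": headers["X-B3-TraceId"],   # same trace id continues
        "parent_id": headers["X-B3-SpanId"],   # caller's span becomes our parent
        "span_id": new_id(),                   # fresh id for the server's span
        "sampled": headers.get("X-B3-Sampled", "1") == "1",
    }

# Span 1 (client) injects, Span 2 (server) extracts:
span1 = {"trace_id": "fAf3oXLoDS", "span_id": "dZ0qHIBa1A",
         "parent_id": None, "sampled": True}
headers = {}
inject(span1, headers)
span2 = extract(headers)
```

Note how the server's new span keeps the caller's trace id and records the caller's span id as its parent; that is all the propagation needed to reassemble the timeline later.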
  14. Benefits of Distributed Tracing → Effective monitoring → System insight,

    clarifies non-trivial interactions → Visibility into critical paths and dependencies → Observe E2E latency in near real time → Request scoped, not request’s lifecycle scoped.
  15. What about overhead? → Observability tools are unobtrusive → Sampling

    reduces overhead → Instrumentation can be delegated to common frameworks → (Don’t) trace every single operation
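A common way sampling keeps overhead down is to make the decision once, deterministically from the trace id, so every service in the call chain agrees without coordination. A sketch (hypothetical helper; it assumes hex trace ids, as real Zipkin ids are):

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Sample a fixed fraction of traces, chosen deterministically
    from the (hex) trace id so every hop makes the same decision."""
    bucket = int(trace_id, 16) % 10000
    return bucket < rate * 10000

# The root service decides once and propagates the result
# (e.g. in a sampled flag) so downstream services never re-decide.
```

At a 1% rate, 99% of requests pay almost nothing for tracing while the sampled traces still expose latency patterns and dependencies.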
  16. Zipkin → Distributed tracing tool → Based on Google Dapper

    (2010) → Created by Twitter (2012) → Open source (2015) → Strong community
  17. Zipkin’s architecture Service (instrumented) → Transport (http/kafka/grpc)

    → Collector (receives spans, deserializes and schedules them for storage) → Storage DB (Cassandra/MySQL/ElasticSearch) → API (retrieves data) → UI (visualizes traces)
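The "collect spans" step can be sketched against Zipkin's v2 HTTP collector, which accepts a JSON array of spans POSTed to /api/v2/spans (port 9411 by default). A minimal stdlib-only example; the ids, names and durations are made up:

```python
import json
import time
import urllib.request

def zipkin_span(trace_id, span_id, name, service,
                duration_us, parent_id=None, tags=None):
    """Build one span in Zipkin's v2 JSON format."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": int(time.time() * 1_000_000),  # microseconds since epoch
        "duration": duration_us,                    # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},
    }
    if parent_id:
        span["parentId"] = parent_id
    return span

def report(spans, collector="http://localhost:9411"):
    """POST a batch of spans to the collector over HTTP."""
    req = urllib.request.Request(
        collector + "/api/v2/spans",
        data=json.dumps(spans).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

span = zipkin_span("5af7183fb1d4cf5f", "6b221d5bc9e6496c",
                   "get /?q=cats", "frontend", duration_us=26500)
# report([span])  # requires a running Zipkin instance at localhost:9411
```

In practice services use an instrumentation library (e.g. Brave for Java) that builds, batches and ships these payloads automatically over the chosen transport.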
  18. Q&A