Slide 1

Slide 1 text

Distributed Tracing: Understand how your components work together

Slide 2

Slide 2 text

About me
José Carlos Chávez
● Software Engineer at Typeform, focused on the response aggregation services.
● Zipkin core team member and open source contributor to observability projects.

Slide 3

Slide 3 text

Distributed Systems

Slide 4

Slide 4 text

Distributed systems
A collection of independent components that appears to its users as a single coherent system.
Characteristics:
● Concurrency
● No global clock
● Independent failures

Slide 5

Slide 5 text

Distributed systems
[Diagram: a plumbing system. Labels: Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve. Speech bubble: 爆$❄#☭]

Slide 6

Slide 6 text

Distributed systems: Understanding failures
[Diagram: a GET /media/e5k2 request. Components: API Proxy, Auth service, Media API, Images service, Videos service, DB1, DB2, DB3, DB4. Errors shown: TCP error (2003), 500 Internal Error, 500 Internal Error.]

Slide 7

Slide 7 text

Distributed systems: Understanding failures
[Diagram: a plumbing system. Labels: Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve. Speech bubble: 爆$❄#☭ Callout: I AM HERE! First floor distributor is clogged!]

Slide 8

Slide 8 text

We do have that, it is called logs!

Slide 9

Slide 9 text

Logs & Concurrency
[Diagram: a GET /media/e5k2 request. Components: API Proxy, Auth service, Media API, Images service, Videos service, DB1, DB2, DB3, DB4. Errors shown: TCP error (2003), 500 Internal Error, 500 Internal Error.]

Slide 10

Slide 10 text

Logs & Concurrency
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/13548”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/23948”
[24/Oct/2017 13:50:08 +0000] “GET /media HTTP/1.1” 200 … **0/12396”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/23748”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23248”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 200 … **0/26548”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/13148”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/2588”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 500 … **0/3248”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/23548”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/22598”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/13948”
...
? ?

Slide 11

Slide 11 text

Distributed systems: Understanding failures
[Diagram: a plumbing system. Labels: Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve. Speech bubble: 爆$❄#☭ Callout: I AM HERE! First floor distributor is clogged!]

Slide 12

Slide 12 text

Distributed Tracing to unclog your pipes

Slide 13

Slide 13 text

Distributed tracing
[Timeline diagram: rows for API Proxy, Media API, Auth, Videos and Images plotted against Time, with a 500 error marked.]
TraceID d52d38b69b0fb15efa
[1508410442] no cache for resource, retrieving from DBc

Slide 14

Slide 14 text

Distributed Tracing: What answers do I get?
● What services did a request pass through?
● What occurred in each service for a given request?
● Where did the error happen?
● Where are the bottlenecks?
● What is the critical path for a request?
● Who should I page?

Slide 15

Slide 15 text

Distributed Tracing & friends
Credits: Peter Bourgon

Slide 16

Slide 16 text

Benefits of Distributed Tracing
● (Almost) immediate feedback
● System insight: clarifies non-trivial interactions
● Visibility into critical paths and dependencies
● Understand latencies
● Request scoped, not request lifecycle scoped

Slide 17

Slide 17 text

Trace’s Anatomy
● A trace shows an execution path through a distributed system.
● A span in the trace represents a logical unit of work (with a start and an end).
● A context includes information that should be propagated across services.
● Tags and logs (optional) add complementary information to spans.
[Diagram: a TRACE drawn as spans over Time. Span labels: /things, auth.Auth, GET /videos, mysql.Get.]
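
The anatomy above can be sketched as a small data model. This is an illustrative sketch only, not Zipkin's actual data model: the class names, field names and the `start_span` helper are assumptions made for this example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SpanContext:
    """Identifiers that must be propagated across services."""
    trace_id: str                    # shared by every span in the trace
    span_id: str                     # unique per span
    parent_id: Optional[str] = None  # None for the root span


@dataclass
class Span:
    """A logical unit of work with a start and an end."""
    name: str
    context: SpanContext
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: Dict[str, str] = field(default_factory=dict)           # optional key/value data
    logs: List[Tuple[float, str]] = field(default_factory=list)  # optional timed events

    def finish(self) -> None:
        self.end = time.time()


def new_id() -> str:
    """Random 64-bit identifier rendered as 16 hex characters."""
    return secrets.token_hex(8)


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child span that stays in the parent's trace."""
    if parent is None:
        ctx = SpanContext(trace_id=new_id(), span_id=new_id())
    else:
        ctx = SpanContext(parent.context.trace_id, new_id(), parent.context.span_id)
    return Span(name, ctx)
```

For example, `start_span("mysql.Get", parent=start_span("GET /videos"))` yields a child span that shares the trace ID of its parent and records the parent's span ID.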

Slide 18

Slide 18 text

Elements of distributed tracing
Credits: Nic Munroe
● Leg 1: inbound propagation
● Leg 2: outbound propagation
● Leg 3: in-process propagation

Slide 19

Slide 19 text

Leg 1: Inbound propagation
When your service processes a request or consumes a message.
[Diagram: API Proxy calls Media API with GET /media, carrying TraceID: fAf3oXL6DS, SpanID: dZ0xHIBa1A ...]
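
A minimal sketch of inbound propagation, assuming the trace context arrives in B3 HTTP headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`). The `TraceContext` type and the `extract_b3` function are hypothetical names for this example, and no ID validation is done. The IDs on the slide are illustrative; real B3 IDs are lower-case hex strings.

```python
from typing import Mapping, NamedTuple, Optional


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    sampled: Optional[bool]  # None means the caller made no sampling decision


def extract_b3(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Read the trace context from incoming headers, or return None if absent."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}
    trace_id = h.get("x-b3-traceid")
    span_id = h.get("x-b3-spanid")
    if not trace_id or not span_id:
        return None  # no upstream trace: the service would start a new one
    raw = h.get("x-b3-sampled")
    sampled = None if raw is None else raw in ("1", "true")
    return TraceContext(trace_id, span_id, h.get("x-b3-parentspanid"), sampled)
```

With the slide's values, `extract_b3({"X-B3-TraceId": "fAf3oXL6DS", "X-B3-SpanId": "dZ0xHIBa1A"})` returns a context with no parent and no sampling decision.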

Slide 20

Slide 20 text

Leg 2: Outbound propagation
When your service makes an outbound call to another service.
[Diagram: Media API calls Video service with GET /videos (span http/get), carrying TraceID: fAf3oXL6DS, ParentID: dZ0xHIBa1A, SpanID: y74fr5udj]
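
A minimal sketch of outbound propagation, again assuming B3 HTTP headers. The `inject_b3` function is a hypothetical helper written for this example: it keeps the trace ID, records the current span as the parent, and generates a fresh span ID for the outgoing call.

```python
import secrets
from typing import Dict


def inject_b3(trace_id: str, current_span_id: str,
              headers: Dict[str, str], sampled: bool = True) -> str:
    """Add trace context to outgoing headers and return the new child span ID."""
    child_span_id = secrets.token_hex(8)  # 64-bit ID as 16 hex characters
    headers["X-B3-TraceId"] = trace_id             # unchanged across the whole trace
    headers["X-B3-ParentSpanId"] = current_span_id  # the span making the call
    headers["X-B3-SpanId"] = child_span_id          # the span for this outbound call
    headers["X-B3-Sampled"] = "1" if sampled else "0"
    return child_span_id
```

Called with the slide's values, `inject_b3("fAf3oXL6DS", "dZ0xHIBa1A", headers)` fills `headers` so that the Video service sees the same TraceID and `dZ0xHIBa1A` as the ParentID.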

Slide 21

Slide 21 text

Leg 3: In-process propagation
When performing an operation inside the service.
[Diagram: Media API calls Images service with GET /images; inside the service, spans redis.Get (Cache service) and mysql.Query.]

Slide 22

Slide 22 text

Distributed tracing
[Timeline diagram: rows for API Proxy, Media API, Auth, Videos and Images plotted against Time, with a 500 error marked.]
TraceID d52d38b69b0fb15efa
[1508410442] no cache for resource, retrieving from DBc

Slide 23

Slide 23 text

Any overhead?
For users:
● Observability tools are meant to be unintrusive
● Sampling reduces overhead
● (Don’t) trace every single operation
For developers:
● Not all libraries are ready to have instrumentation plugged in
● Instrumentation can be delegated to common frameworks
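
To illustrate how sampling reduces overhead, here is a minimal sketch of a probabilistic sampler that keeps only a fraction of traces. The `ProbabilitySampler` class is a hypothetical example, not a Zipkin API; in practice the decision is usually taken once at the start of a trace and then propagated to downstream services.

```python
import random
from typing import Optional


class ProbabilitySampler:
    """Decide per trace whether to record it, keeping roughly `rate` of them."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._rng = rng or random.Random()

    def is_sampled(self) -> bool:
        # random() returns a value in [0.0, 1.0), so a rate of 0.0 never
        # samples and a rate of 1.0 always samples.
        return self._rng.random() < self.rate
```

A service might create `ProbabilitySampler(0.1)` to record about one trace in ten and skip span recording and reporting for the rest.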

Slide 24

Slide 24 text

Introducing Apache Zipkin

Slide 25

Slide 25 text

Apache Zipkin
Based on BigBrotherBird (B3) and inspired by Google Dapper (2010). It was open sourced by Twitter (2012) and joined the Apache Incubator in September 2018.
● Mature tracing model that emerged from users’ needs.
● Used by large companies like Netflix, SoundCloud and Yelp, but also by smaller ones.
● Strong community:
○ @zipkinproject
○ gitter.im/openzipkin

Slide 26

Slide 26 text

Zipkin: architecture
● Service (instrumented): collect spans
● Transport: http/kafka/grpc
● Collector: receive spans, deserialize and schedule for storage
● Storage (DB): store spans in Cassandra/MySQL/Elasticsearch
● API: retrieve data
● UI: visualize
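
As a sketch of the first two stages (instrumented service and HTTP transport), the code below builds a span in Zipkin's v2 JSON format and prepares a POST to the collector's `/api/v2/spans` endpoint. The helper names, the `localhost:9411` address and the IDs in the usage note are assumptions for this example, and the request is only built, not sent.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def to_zipkin_v2(trace_id: str, span_id: str, name: str, service_name: str,
                 start_seconds: float, duration_seconds: float,
                 parent_id: Optional[str] = None, kind: Optional[str] = None,
                 tags: Optional[Dict[str, str]] = None) -> dict:
    """Convert span data into a Zipkin v2 JSON span (times in microseconds)."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": round(start_seconds * 1_000_000),
        "duration": round(duration_seconds * 1_000_000),
        "localEndpoint": {"serviceName": service_name},
    }
    if parent_id is not None:
        span["parentId"] = parent_id
    if kind is not None:
        span["kind"] = kind  # for example "CLIENT" or "SERVER"
    if tags:
        span["tags"] = dict(tags)
    return span


def build_report_request(
        spans: List[dict],
        endpoint: str = "http://localhost:9411/api/v2/spans",  # assumed address
) -> urllib.request.Request:
    """Prepare (but do not send) the HTTP request that reports a batch of spans."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        endpoint, data=body,
        headers={"Content-Type": "application/json"}, method="POST")
```

A reporter could pass the result to `urllib.request.urlopen` once a collector is running, for instance after calling `to_zipkin_v2("463ac35c9f6413ad", "a2fb4a1d1a96d312", "get /videos", "media-api", 1508410442.0, 0.25, kind="CLIENT")` with illustrative IDs.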

Slide 27

Slide 27 text

Zipkin: traces

Slide 28

Slide 28 text

Zipkin: traces

Slide 29

Slide 29 text

Zipkin: traces

Slide 30

Slide 30 text

Zipkin: trace overview

Slide 31

Slide 31 text

Zipkin: tags and logs

Slide 32

Slide 32 text

Zipkin: traces with errors

Slide 33

Slide 33 text

Zipkin: traces for async operations

Slide 34

Slide 34 text

Zipkin: dependency graph

Slide 35

Slide 35 text

Zipkin: dependency graph

Slide 36

Slide 36 text

Q&A
twitter.com/jcchavezs
Find more: http://bit.ly/dist-trac