Slide 1

Slide 1 text

Distributed Tracing: Understand how your components work together
BuildStuff Ukraine 2019
Photo by Samuel Sianipar

Slide 2

Slide 2 text

Expedia Group Proprietary and Confidential
About me
- Software Engineer at Expedia Group
- Zipkin core team member and open source contributor for observability projects
@jcchavezs - #BuildStuffUA

Slide 3

Slide 3 text

Distributed Systems

Slide 4

Slide 4 text

Distributed systems
A collection of independent components that appears to its users as a single coherent system.
Characteristics:
- Concurrency
- No global clock
- Independent failures
Image source: https://link.medium.com/jey42ga7p1

Slide 5

Slide 5 text

Distributed Systems
[Diagram labels: Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve. Speech bubble: 爆 $❄#☭]

Slide 6

Slide 6 text

Distributed Systems
[Diagram labels: GET /media/e5k2, API Proxy, Media API, Auth service, Images service, Videos service, DB1, DB2, DB3, DB4. Errors shown: 500 Internal Error (twice), Error 1152 ER_ABORTING_CONNECTION]

Slide 7

Slide 7 text

Distributed Systems
[Diagram labels: Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve. Speech bubbles: 爆 $❄#☭ and "I AM HERE! First floor distributor is clogged!"]

Slide 8

Slide 8 text

We do have that, it is called logs!

Slide 9

Slide 9 text

Logs & Concurrency
[Diagram labels: GET /media/e5k2, API Proxy, Media API, Auth service, Images service, Videos service, DB1, DB2, DB3, DB4. Errors shown: 500 Internal Error (twice), Error 1152 ER_ABORTING_CONNECTION]

Slide 10

Slide 10 text

Logs & Concurrency
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/13548”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/23948”
[24/Oct/2017 13:50:08 +0000] “GET /media HTTP/1.1” 200 … **0/12396”
[24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/23748”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23248”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 200 … **0/26548”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/13148”
[24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/2588”
[24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 500 … **0/3248”
[24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/23548”
[24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/22598”
...
[Annotation on the slide: ? ?]

Slide 11

Slide 11 text

Distributed Systems
Why is it hard to operate a Distributed System?
● Systems change all the time
● Things fail in unexpected ways
● Unknown unknowns
● Most problems are the convergence of many different things failing at once
● Everyone on the team is expected to respond with the same level of confidence and the same tools, regardless of experience or expertise, yet the more components there are, the less each individual knows about them

Slide 12

Slide 12 text

Distributed Systems
[Diagram labels: Water heater, Gas supplier, Cold water storage tank, Shutoff valve, First floor branch, Tank valve. Speech bubbles: 爆 $❄#☭ and "I AM HERE! First floor distributor is clogged!"]

Slide 13

Slide 13 text

Distributed Tracing to unclog your pipes

Slide 14

Slide 14 text

Distributed Tracing
[Diagram labels: TraceID d52d38b69b0fb15efa; spans API Proxy, Media API, Auth, Videos, Images along a Time axis. Annotations: "error", "[1508410442] no cache for resource, retrieving from DB", "Aborted connection", "I AM HERE!"]

Slide 15

Slide 15 text

Distributed Tracing
The Answers
● What services did a request/message pass through?
● What occurred in each service for a given request/message?
● Where did the error happen?
● Where are the bottlenecks?
● What is the critical path for a request?
● Who should I page?

Slide 16

Slide 16 text

Distributed Tracing
Source: https://twitter.com/rakyll/status/971231712049971200

Slide 17

Slide 17 text

Distributed Tracing
Distributed Tracing & friends
Logs tell you that an event happened.
Metrics tell you how many events of this type are happening in the system.
Tracing tells you what happened (who did what) and how its impact propagated across your system.
Image source: https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Slide 18

Slide 18 text

Distributed Tracing
The Benefits
● Immediate feedback
● System insight: clarifies non-trivial interactions
● Visibility into critical paths and dependencies
● Understand latencies
● Request scoped, not request’s lifecycle scoped.
[Diagram labels: op 1, op 2]

Slide 19

Slide 19 text

Distributed Tracing
Trace’s Anatomy
● A trace shows an execution path through a distributed system
● A span in the trace represents a logical unit of work (with a start and end)
● A context includes information that should be propagated across services
● Tags and logs (optional) add complementary information to spans.
[Diagram labels: TRACE, Time axis, spans /things, auth.Auth, GET /videos, mysql.Get]
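
The anatomy above can be sketched as a small data model. This is only an illustration, not Zipkin's actual model: the `Span` class, the `render` helper, the IDs, the timings and the parent/child nesting of the four span names from the slide are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One logical unit of work with a start and an end (hypothetical model)."""
    trace_id: str                 # shared by every span in the same trace
    span_id: str                  # identifies this unit of work
    parent_id: Optional[str]      # None for the root span
    name: str
    start_us: int
    end_us: int
    tags: Dict[str, str] = field(default_factory=dict)  # optional extra info

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def render(spans: List[Span]) -> List[str]:
    """Return the trace as indented lines, children listed under their parent."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_us):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_us}us)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Assumed nesting and timings, for illustration only.
trace = [
    Span("t1", "a", None, "/things", 0, 900),
    Span("t1", "b", "a", "auth.Auth", 50, 200),
    Span("t1", "c", "a", "GET /videos", 250, 850),
    Span("t1", "d", "c", "mysql.Get", 300, 800, tags={"error": "true"}),
]
print("\n".join(render(trace)))
```

The parent ID is what turns a flat list of spans into the tree that a tracing UI draws against the time axis.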

Slide 20

Slide 20 text

Distributed Tracing
Leg 1: inbound propagation
Leg 2: outbound propagation
Leg 3: in-process propagation
Source: https://link.medium.com/BXTM1u5oH1

Slide 21

Slide 21 text

Distributed Tracing
Leg 1: Inbound propagation
When a service processes a request or consumes a message, it extracts the context from upstream (if present) to continue the trace; otherwise it starts a new trace.
[Diagram labels: API Proxy, Media API, GET /media/{id}, TraceID: fAf3oXL6DS SpanID: dZ0xHIBa1A ...]
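
A minimal sketch of inbound propagation, assuming the context travels in B3 HTTP headers (`X-B3-TraceId`, `X-B3-SpanId`). The `TraceContext` class and the `extract` and `new_id` helpers are hypothetical names, and creating a child span for the server side is one possible choice rather than what any particular tracer does.

```python
import secrets
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TraceContext:
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None


def new_id() -> str:
    """Random 64-bit identifier rendered as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def extract(headers: Mapping[str, str]) -> TraceContext:
    """Continue the upstream trace when its context is present, else start a new one."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    upstream_span_id = lowered.get("x-b3-spanid")
    if trace_id and upstream_span_id:
        # Same trace, new span whose parent is the caller's span.
        return TraceContext(trace_id, new_id(), parent_id=upstream_span_id)
    # No usable context: this service becomes the root of a new trace.
    return TraceContext(new_id(), new_id())


continued = extract({"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312"})
started = extract({})
```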

Slide 22

Slide 22 text

Distributed Tracing
Leg 2: Outbound propagation
When a service makes an outbound call to another service, it injects the context into the request (headers) or message (metadata).
[Diagram labels: Media API, Video service, GET /videos, http/get, TraceID: fAf3oXL6DS ParentID: y74fr5udj SpanID: dZ0xHIBa1A; TraceID: fAf3oXL6DS SpanID: y74fr5udj]
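
A minimal sketch of outbound propagation, again assuming B3 HTTP headers. The `inject` helper and its parameters are hypothetical; a real tracer library would normally do this inside its HTTP client instrumentation.

```python
import secrets
from typing import Dict, Optional


def inject(
    trace_id: str,
    current_span_id: str,
    headers: Optional[Dict[str, str]] = None,
    sampled: bool = True,
) -> Dict[str, str]:
    """Return a copy of `headers` carrying the context for a new client span."""
    out = dict(headers or {})
    out["X-B3-TraceId"] = trace_id                 # unchanged along the whole trace
    out["X-B3-ParentSpanId"] = current_span_id     # the span making the call
    out["X-B3-SpanId"] = secrets.token_hex(8)      # new span for the outbound call
    out["X-B3-Sampled"] = "1" if sampled else "0"  # downstream honors the decision
    return out


outbound = inject("463ac35c9f6413ad", "a2fb4a1d1a96d312", {"Accept": "application/json"})
```

Returning a copy rather than mutating the caller's dictionary keeps the helper safe to use when the same base headers are reused for several outbound calls.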

Slide 23

Slide 23 text

Distributed Tracing
Leg 3: In-process propagation
When performing an operation inside the service, it uses the server context as the parent to create local spans.
[Diagram labels: Media API, Cache service, Images service, GET /images, mysql.Query, redis.Get]

Slide 24

Slide 24 text

Distributed Tracing
[Diagram labels: TraceID d52d38b69b0fb15efa; spans API Proxy, Media API, Auth, Videos, Images along a Time axis. Annotations: "error", "[1508410442] no cache for resource, retrieving from DB", "Aborted connection", "I AM HERE!"]

Slide 25

Slide 25 text

Distributed Tracing
Are they all benefits?
Overhead for users:
• Observability tools are meant to be non-intrusive
• Sampling reduces overhead
• (Don’t) trace every single operation
Overhead for developers:
• Not all libraries are ready for instrumentation to be plugged in
• Instrumentation can be delegated to common frameworks
• Getting sampling right is hard
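
One way sampling can reduce overhead is to decide from the trace ID alone, so that every service reaches the same decision for the same trace. The sketch below is an illustration of that idea, not the sampler of any specific tracer; `is_sampled` and the 10,000-bucket scheme are assumptions. In practice the decision is often made once at the root and then propagated downstream.

```python
def is_sampled(trace_id: str, rate: float) -> bool:
    """Decide from the trace ID alone so every service agrees for the same trace."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Use the low 64 bits of the hex trace ID as a stable pseudo-random number.
    bucket = int(trace_id[-16:], 16) % 10_000
    return bucket < round(rate * 10_000)


# Keep roughly 10% of traces; the same ID always gives the same answer.
decision = is_sampled("463ac35c9f6413ad", 0.1)
```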

Slide 26

Slide 26 text

Introducing Zipkin

Slide 27

Slide 27 text

Zipkin
Based on B3 and inspired by Google Dapper (2010). It was open sourced by Twitter (2012).
● Mature tracing model that emerged from users’ needs.
● Used by large companies like LINE, Netflix, SoundCloud and Yelp, but also by small ones.
● Strong community:
  ○ @zipkinproject
  ○ gitter.im/openzipkin

Slide 28

Slide 28 text

Zipkin
[Architecture diagram]
Service (instrumented): collect spans
Transport: http/kafka/grpc
Collector: receive spans, deserialize and schedule for storage
Storage: store spans
DB: Cassandra/MySQL/ElasticSearch
API: retrieve data
UI: visualize
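
The reporting step over the HTTP transport can be sketched as follows. The JSON field names follow the Zipkin v2 span format and the path `/api/v2/spans` is Zipkin's HTTP collector endpoint; the helper names `to_zipkin_v2` and `report`, the sample values, and the assumption of a Zipkin server listening on `localhost:9411` are illustrative only.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def to_zipkin_v2(
    trace_id: str,
    span_id: str,
    parent_id: Optional[str],
    name: str,
    service: str,
    start_us: int,
    duration_us: int,
    tags: Optional[Dict[str, str]] = None,
) -> Dict:
    """Build one span as a dictionary in the Zipkin v2 JSON shape."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": start_us,    # epoch microseconds
        "duration": duration_us,  # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": dict(tags or {}),
    }
    if parent_id:
        span["parentId"] = parent_id
    return span


def report(spans: List[Dict], endpoint: str = "http://localhost:9411/api/v2/spans") -> int:
    """POST a batch of spans to the collector and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


span = to_zipkin_v2(
    "463ac35c9f6413ad", "a2fb4a1d1a96d312", None,
    "get /media", "media-api", 1508410442000000, 25000,
    tags={"http.path": "/media/e5k2"},
)
# report([span])  # uncomment when a Zipkin server is reachable at the endpoint
```

Instrumentation libraries usually batch spans and send them asynchronously so that reporting stays off the request path.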

Slide 37

Slide 37 text

Summary
● Distributed systems are complex and will remain so.
● Distributed tracing helps you understand latencies, critical paths and errors within a request or message flow.
● Distributed tracing provides contextual insights within a request; it is complementary to other observability tools.

Slide 38

Slide 38 text

Thank you
Q&A