Slide 1

Slide 1 text

Messaging: the second hardest thing to trace AKA why distributed tracing tools do this last @adrianfcole works at Pivotal works on Zipkin

Slide 2

Slide 2 text

Introduction introduction http: logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up @adrianfcole #zipkin

Slide 3

Slide 3 text

@adrianfcole • spring cloud at pivotal • focused on distributed tracing • helped open zipkin

Slide 4

Slide 4 text

Understanding code execution The Stack(trace) shows which functions blocked on the one of interest Logs can show events which happened before the one of interest Distributed Trace can explain the request context (no offense metrics, profilers, flame graphers etc)

Slide 5

Slide 5 text

Http with logs and thread dumps @adrianfcole #zipkin introduction http: logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up

Slide 6

Slide 6 text

Let’s figure out why this request took a while GET /

Slide 7

Slide 7 text

Ooh a thread dump from the frontend! "qtp38625764-23" #23 prio=5 os_prio=31 tid=0x00007fce80769000 nid=0x7203 runnable [0x000070000b4c6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0([email protected]/Native Method) --snip-- at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at brave.webmvc.Frontend.callBackend(Frontend.java:16)

Slide 8

Slide 8 text

Interpreting the frontend thread dump "qtp38625764-23" #23 prio=5 os_prio=31 tid=0x00007fce80769000 nid=0x7203 runnable [0x000070000b4c6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0([email protected]/Native Method) --snip-- at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at brave.webmvc.Frontend.callBackend(Frontend.java:16) Thread dumps tell you what a thread was running, but not how long, to whom, what preceded it, etc..

Slide 9

Slide 9 text

Logs on the frontend 14:56:43,443 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 14:56:43,444 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:122) - http-outgoing-0 << HTTP/1.1 200 OK

Slide 10

Slide 10 text

Interpreting the frontend logs 14:56:43,443 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 14:56:43,444 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 DEBUG [qtp38625764-23] client.RestTemplate (RestTemplate.java:547) - GET request for “http://backend/api" resulted in 200 (OK) If you are lucky, logs preceding and following a thread dump explain context further.

Slide 11

Slide 11 text

Let’s figure out why this request took a while GET /api GET /

Slide 12

Slide 12 text

Ooh a thread dump from the backend! "qtp78377968-22" #22 prio=5 os_prio=31 tid=0x00007fc0ff3c9000 nid=0xa103 waiting on condition [0x0000700007fb5000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.printDate(Backend.java:18) This looks suspicious..

Slide 13

Slide 13 text

Interpreting the backend thread dump "qtp78377968-22" #22 prio=5 os_prio=31 tid=0x00007fc0ff3c9000 nid=0xa103 waiting on condition [0x0000700007fb5000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.printDate(Backend.java:18) Just like us, threads can sleep!

Slide 14

Slide 14 text

Logs on the backend 14:56:43,451 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path /api 14:56:43,452 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52]

Slide 15

Slide 15 text

14:56:43,451 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path /api 14:56:43,452 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52] Interpreting the backend logs Logs tell us details like the time, but can’t tell us we are asleep!

Slide 16

Slide 16 text

Logs and Thread dumps can get us far We didn’t have to change our code except maybe to raise log level Through analysis, we can narrow down to the code if we have a bundle of good info, some experience and a dash of patience

Slide 17

Slide 17 text

Http with logs and tracing @adrianfcole #zipkin introduction http: logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up

Slide 18

Slide 18 text

Interpreting the frontend logs with tracing 14:56:43,443 [43b8b8a1849142c4/43b8b8a1849142c4] DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 14:56:43,444 [43b8b8a1849142c4/43b8b8a1849142c4] DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:122) - http-outgoing-0 << HTTP/1.1 200 OK Initial step is the root span Next step is a child span

Slide 19

Slide 19 text

Interpreting the backend logs with tracing 14:56:43,451 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path /api 14:56:43,452 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java: 170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52] By default, the client and server share the same ID, so you can correlate.

Slide 20

Slide 20 text

Interpreting the initial server call { "traceId": "43b8b8a1849142c4", "id": "43b8b8a1849142c4", "kind": "SERVER", "name": "get /", "timestamp": 1540191403443082, "duration": 5020712, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv6": "::1", "port": 53225 }, "tags": { "http.method": "GET", "http.path": "/", "mvc.controller.class": "Frontend", "mvc.controller.method": "callBackend" } }

Slide 21

Slide 21 text

{ "traceId": "43b8b8a1849142c4", "parentId": "43b8b8a1849142c4", "id": "b260d4ccbdf9c0be", "kind": "CLIENT", "name": "get", "timestamp": 1540191403444849, "duration": 5016903, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "tags": { "http.method": "GET", "http.path": "/api" } } Interpreting the outbound client call

Slide 22

Slide 22 text

{ "traceId": "43b8b8a1849142c4", "parentId": "43b8b8a1849142c4", "id": "b260d4ccbdf9c0be", "kind": "SERVER", "name": "get /api", "timestamp": 1540191403451067, "duration": 5007961, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv4": "127.0.0.1", "port": 53217 }, "tags": { "http.method": "GET", "http.path": "/api", "mvc.controller.class": "Backend", "mvc.controller.method": "printDate" }, "shared": true Interpreting the next server call

Slide 23

Slide 23 text

Visualizing the trace

Slide 24

Slide 24 text

Interpreting the initial server call

Slide 25

Slide 25 text

Interpreting the next server call

Slide 26

Slide 26 text

Distributed tracing is a higher level abstraction The data and UI don’t look like code so easier on non-coders Visualizations focus on service abstraction primary facts are services, endpoints and timing information

Slide 27

Slide 27 text

Messaging with logs and dumps @adrianfcole #zipkin introduction http: logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up

Slide 28

Slide 28 text

Let’s figure out why this processing took a while on-message

Slide 29

Slide 29 text

Ooh a thread dump from the backend! "ActiveMQ Session Task-1" #28 prio=7 os_prio=31 tid=0x00007ff0fb3e6000 nid=0x4907 waiting on condition [0x00007000098f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.onMessage(Backend.java:10) --snip-- at org.apache.activemq.ActiveMQSessionExecutor.dispatch(ActiveMQSessionExecutor.java:131) at org.apache.activemq.ActiveMQSessionExecutor.iterate(ActiveMQSessionExecutor.java:202) --snip--

Slide 30

Slide 30 text

Interpreting the backend thread dump "ActiveMQ Session Task-1" #28 prio=7 os_prio=31 tid=0x00007ff0fb3e6000 nid=0x4907 waiting on condition [0x00007000098f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.onMessage(Backend.java:10) --snip-- at org.apache.activemq.ActiveMQSessionExecutor.dispatch(ActiveMQSessionExecutor.java:131) at org.apache.activemq.ActiveMQSessionExecutor.iterate(ActiveMQSessionExecutor.java:202) --snip-- we’ll get back to this

Slide 31

Slide 31 text

Logs on the backend 17:35:26,825 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1]

Slide 32

Slide 32 text

Interpreting the backend logs 17:35:26,825 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1] We are very unlucky: no lines in the logs are from the stack trace!

Slide 33

Slide 33 text

No thread dump from the frontend! EEEK are we missing something?!

Slide 34

Slide 34 text

Logs on the frontend 17:26:40,784 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 17:26:40,785 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" …

Slide 35

Slide 35 text

Interpreting the frontend logs 17:26:40,784 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 17:26:40,785 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" … NOT MUCH HELP!

Slide 36

Slide 36 text

No thread dump from the frontend! It is unlikely you’ll see a related thread dump from the frontend as message production is usually quick and does not block on a response.

Slide 37

Slide 37 text

Let’s put this visually poll send- message on-message Lack of explicit connection has to do with how message processing works

Slide 38

Slide 38 text

Let’s put this literally MessageDispatch md = unconsumedMessages.dequeueNoWait(); // there’s often no stack relationship between message // consumption and processing! if (md != null) { dispatch(md); return true; }

Slide 39

Slide 39 text

Messaging with logs and tracing @adrianfcole #zipkin introduction http: logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up

Slide 40

Slide 40 text

Interpreting the frontend logs with tracing 17:26:40,784 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 17:26:40,785 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" … no span for the JMS producer!

Slide 41

Slide 41 text

Interpreting the backend logs with tracing 17:35:26,825 TRACE [47eaae8393b4405c/b9b1184f3f3c2dcf] [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [47eaae8393b4405c/b9b1184f3f3c2dcf] [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1] Even if you can’t see the producer ID, the trace ID is there!

Slide 42

Slide 42 text

Interpreting the initial server call { "traceId": "47eaae8393b4405c", "id": "47eaae8393b4405c", "kind": "SERVER", "name": "get /", "timestamp": 1540200484420851, "duration": 69995, "localEndpoint": { "serviceName": “frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv6": "::1", "port": 55041 }, "tags": { "http.method": "GET", "http.path": "/", "mvc.controller.class": "Frontend", "mvc.controller.method": "callBackend" } } Looks exactly like the other one?

Slide 43

Slide 43 text

{ "traceId": "47eaae8393b4405c", "parentId": "47eaae8393b4405c", "id": "221447f43d8df065", "kind": "PRODUCER", "name": "send", "timestamp": 1540200484456255, "duration": 6904, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "serviceName": "jms" }, "tags": { "jms.queue": "backend" } } Interpreting the message producer

Slide 44

Slide 44 text

{ "traceId": "47eaae8393b4405c", "parentId": "221447f43d8df065", "id": "3df175b9151faea7", "kind": "CONSUMER", "name": "receive", "timestamp": 1540200484478931, "duration": 1, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "serviceName": "jms" }, "tags": { "jms.queue": "backend" } } Interpreting the message consumer

Slide 45

Slide 45 text

{ "traceId": "47eaae8393b4405c", "parentId": "3df175b9151faea7", "id": "b9b1184f3f3c2dcf", "name": "on-message", "timestamp": 1540200484478932, "duration": 5013861, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" } } Interpreting the message processor

Slide 46

Slide 46 text

Visualizing the trace

Slide 47

Slide 47 text

Visualizing the trace (zoomed!)

Slide 48

Slide 48 text

Interpreting the initial server call

Slide 49

Slide 49 text

Interpreting the message producer

Slide 50

Slide 50 text

Interpreting the message consumer

Slide 51

Slide 51 text

Interpreting the message processor

Slide 52

Slide 52 text

Summary Async is HARD difficult if possible to track manually Tracing ubiquitizes by affording the same tools in messaging as normal RPC P.S. the hardest is reactive…

Slide 53

Slide 53 text

Wrapping Up @adrianfcole #zipkin introduction http: logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up

Slide 54

Slide 54 text

Wrapping up Recognize async is much harder than sync comms Don’t roll your own correlation system You are not alone! @adrianfcole #zipkin gitter.im/openzipkin/zipkin github.com/openzipkin/zipkin

Slide 55

Slide 55 text

Wrapping up Recognize async is much harder than sync comms Don’t roll your own correlation system You are not alone! @adrianfcole #zipkin gitter.im/openzipkin/zipkin github.com/openzipkin/zipkin