Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Messaging: the second hardest thing to trace

Adrian Cole
October 11, 2018

Messaging: the second hardest thing to trace

This is the second draft. Sharing for feedback

Adrian Cole

October 11, 2018
Tweet

More Decks by Adrian Cole

Other Decks in Technology

Transcript

  1. Messaging: the second hardest thing to trace AKA why distributed

    tracing tools do this last @adrianfcole works at Pivotal works on Zipkin
  2. Introduction introduction http: logs and dumps http: logs and tracing

    messaging: logs and dumps messaging: logs and tracing wrapping up @adrianfcole #zipkin
  3. Understanding code execution The Stack(trace) shows which functions blocked on

    the one of interest Logs can show events which happened before the one of interest Distributed Trace can explain the request context (no offense metrics, profilers, flame graphers etc)
  4. Http with logs and thread dumps @adrianfcole #zipkin introduction http:

    logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  5. Ooh a thread dump from the frontend! "qtp38625764-23" #23 prio=5

    os_prio=31 tid=0x00007fce80769000 nid=0x7203 runnable [0x000070000b4c6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0([email protected]/Native Method) --snip-- at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at brave.webmvc.Frontend.callBackend(Frontend.java:16)
  6. Interpreting the frontend thread dump "qtp38625764-23" #23 prio=5 os_prio=31 tid=0x00007fce80769000

    nid=0x7203 runnable [0x000070000b4c6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0([email protected]/Native Method) --snip-- at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at brave.webmvc.Frontend.callBackend(Frontend.java:16) Thread dumps tell you what a thread was running, but not how long, to whom, what preceded it, etc..
  7. Logs on the frontend 14:56:43,443 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 14:56:43,444 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:122) - http-outgoing-0 << HTTP/1.1 200 OK
  8. Interpreting the frontend logs 14:56:43,443 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 14:56:43,444 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 DEBUG [qtp38625764-23] client.RestTemplate (RestTemplate.java:547) - GET request for “http://backend/api" resulted in 200 (OK) If you are lucky, logs preceding and following a thread dump explain context further.
  9. Ooh a thread dump from the backend! "qtp78377968-22" #22 prio=5

    os_prio=31 tid=0x00007fc0ff3c9000 nid=0xa103 waiting on condition [0x0000700007fb5000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.printDate(Backend.java:18) This looks suspicious..
  10. Interpreting the backend thread dump "qtp78377968-22" #22 prio=5 os_prio=31 tid=0x00007fc0ff3c9000

    nid=0xa103 waiting on condition [0x0000700007fb5000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.printDate(Backend.java:18) Just like us, threads can sleep!
  11. Logs on the backend 14:56:43,451 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path /api 14:56:43,452 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52]
  12. 14:56:43,451 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method

    for path /api 14:56:43,452 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52] Interpreting the backend logs Logs tell us details like the time, but can’t tell us we are asleep!
  13. Logs and Thread dumps can get us far We didn’t

    have to change our code except maybe to raise log level Through analysis, we can narrow down to the code if we have a bundle of good info, some experience and a dash of patience
  14. Http with logs and tracing @adrianfcole #zipkin introduction http: logs

    and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  15. Interpreting the frontend logs with tracing 14:56:43,443 [43b8b8a1849142c4/43b8b8a1849142c4] DEBUG [qtp38625764-23]

    handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 14:56:43,444 [43b8b8a1849142c4/43b8b8a1849142c4] DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:122) - http-outgoing-0 << HTTP/1.1 200 OK Initial step is the root span Next step is a child span
  16. Interpreting the backend logs with tracing 14:56:43,451 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22]

    handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path /api 14:56:43,452 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java: 170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52] By default, the client and server share the same ID, so you can correlate.
  17. Interpreting the initial server call { "traceId": "43b8b8a1849142c4", "id": "43b8b8a1849142c4",

    "kind": "SERVER", "name": "get /", "timestamp": 1540191403443082, "duration": 5020712, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv6": "::1", "port": 53225 }, "tags": { "http.method": "GET", "http.path": "/", "mvc.controller.class": "Frontend", "mvc.controller.method": "callBackend" } }
  18. { "traceId": "43b8b8a1849142c4", "parentId": "43b8b8a1849142c4", "id": "b260d4ccbdf9c0be", "kind": "CLIENT", "name":

    "get", "timestamp": 1540191403444849, "duration": 5016903, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "tags": { "http.method": "GET", "http.path": "/api" } } Interpreting the outbound client call
  19. { "traceId": "43b8b8a1849142c4", "parentId": "43b8b8a1849142c4", "id": "b260d4ccbdf9c0be", "kind": "SERVER", "name":

    "get /api", "timestamp": 1540191403451067, "duration": 5007961, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv4": "127.0.0.1", "port": 53217 }, "tags": { "http.method": "GET", "http.path": "/api", "mvc.controller.class": "Backend", "mvc.controller.method": "printDate" }, "shared": true Interpreting the next server call
  20. Distributed tracing is a higher level abstraction The data and

    UI don’t look like code so easier on non-coders Visualizations focus on service abstraction primary facts are services, endpoints and timing information
  21. Messaging with logs and dumps @adrianfcole #zipkin introduction http: logs

    and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  22. Ooh a thread dump from the backend! "ActiveMQ Session Task-1"

    #28 prio=7 os_prio=31 tid=0x00007ff0fb3e6000 nid=0x4907 waiting on condition [0x00007000098f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.onMessage(Backend.java:10) --snip-- at org.apache.activemq.ActiveMQSessionExecutor.dispatch(ActiveMQSessionExecutor.java:131) at org.apache.activemq.ActiveMQSessionExecutor.iterate(ActiveMQSessionExecutor.java:202) --snip--
  23. Interpreting the backend thread dump "ActiveMQ Session Task-1" #28 prio=7

    os_prio=31 tid=0x00007ff0fb3e6000 nid=0x4907 waiting on condition [0x00007000098f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.onMessage(Backend.java:10) --snip-- at org.apache.activemq.ActiveMQSessionExecutor.dispatch(ActiveMQSessionExecutor.java:131) at org.apache.activemq.ActiveMQSessionExecutor.iterate(ActiveMQSessionExecutor.java:202) --snip-- we’ll get back to this
  24. Logs on the backend 17:35:26,825 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager

    (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1]
  25. Interpreting the backend logs 17:35:26,825 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager

    (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1] We are very unlucky: no lines in the logs are from the stack trace!
  26. Logs on the frontend 17:26:40,784 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 17:26:40,785 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" …
  27. Interpreting the frontend logs 17:26:40,784 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 17:26:40,785 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" … NOT MUCH HELP!
  28. No thread dump from the frontend! It is unlikely you’ll

    see a related thread dump from the frontend as message production is usually quick and does not block on a response.
  29. Let’s put this visually poll send- message on-message Lack of

    explicit connection has to do with how message processing works
  30. Let’s put this literally MessageDispatch md = unconsumedMessages.dequeueNoWait(); // there’s

    often no stack relationship between message // consumption and processing! if (md != null) { dispatch(md); return true; }
  31. Messaging with logs and tracing @adrianfcole #zipkin introduction http: logs

    and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  32. Interpreting the frontend logs with tracing 17:26:40,784 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp38625764-23]

    handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 17:26:40,785 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" … no span for the JMS producer!
  33. Interpreting the backend logs with tracing 17:35:26,825 TRACE [47eaae8393b4405c/b9b1184f3f3c2dcf] [ActiveMQ

    Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [47eaae8393b4405c/b9b1184f3f3c2dcf] [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1] Even if you can’t see the producer ID, the trace ID is there!
  34. Interpreting the initial server call { "traceId": "47eaae8393b4405c", "id": "47eaae8393b4405c",

    "kind": "SERVER", "name": "get /", "timestamp": 1540200484420851, "duration": 69995, "localEndpoint": { "serviceName": “frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv6": "::1", "port": 55041 }, "tags": { "http.method": "GET", "http.path": "/", "mvc.controller.class": "Frontend", "mvc.controller.method": "callBackend" } } Looks exactly like the other one?
  35. { "traceId": "47eaae8393b4405c", "parentId": "47eaae8393b4405c", "id": "221447f43d8df065", "kind": "PRODUCER", "name":

    "send", "timestamp": 1540200484456255, "duration": 6904, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "serviceName": "jms" }, "tags": { "jms.queue": "backend" } } Interpreting the message producer
  36. { "traceId": "47eaae8393b4405c", "parentId": "221447f43d8df065", "id": "3df175b9151faea7", "kind": "CONSUMER", "name":

    "receive", "timestamp": 1540200484478931, "duration": 1, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "serviceName": "jms" }, "tags": { "jms.queue": "backend" } } Interpreting the message consumer
  37. { "traceId": "47eaae8393b4405c", "parentId": "3df175b9151faea7", "id": "b9b1184f3f3c2dcf", "name": "on-message", "timestamp":

    1540200484478932, "duration": 5013861, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" } } Interpreting the message processor
  38. Summary Async is HARD difficult if possible to track manually

    Tracing ubiquitizes by affording the same tools in messaging as normal RPC P.S. the hardest is reactive…
  39. Wrapping Up @adrianfcole #zipkin introduction http: logs and dumps http:

    logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  40. Wrapping up Recognize async is much harder than sync comms

    Don’t roll your own correlation system You are not alone! @adrianfcole #zipkin gitter.im/openzipkin/zipkin github.com/openzipkin/zipkin
  41. Wrapping up Recognize async is much harder than sync comms

    Don’t roll your own correlation system You are not alone! @adrianfcole #zipkin gitter.im/openzipkin/zipkin github.com/openzipkin/zipkin