Messaging: the second hardest thing to trace

5b0bda421e4500b706c5c42d20c3a632?s=47 Adrian Cole
October 11, 2018

Messaging: the second hardest thing to trace

This is the second draft. Sharing for feedback

5b0bda421e4500b706c5c42d20c3a632?s=128

Adrian Cole

October 11, 2018
Tweet

Transcript

  1. Messaging: the second hardest thing to trace AKA why distributed

    tracing tools do this last @adrianfcole works at Pivotal works on Zipkin
  2. Introduction introduction http: logs and dumps http: logs and tracing

    messaging: logs and dumps messaging: logs and tracing wrapping up @adrianfcole #zipkin
  3. @adrianfcole • spring cloud at pivotal • focused on distributed

    tracing • helped open zipkin
  4. Understanding code execution The Stack(trace) shows which functions blocked on

    the one of interest Logs can show events which happened before the one of interest Distributed Trace can explain the request context (no offense metrics, profilers, flame graphers etc)
  5. Http with logs and thread dumps @adrianfcole #zipkin introduction http:

    logs and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  6. Let’s figure out why this request took a while GET

    /
  7. Ooh a thread dump from the frontend! "qtp38625764-23" #23 prio=5

    os_prio=31 tid=0x00007fce80769000 nid=0x7203 runnable [0x000070000b4c6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(java.base@9.0.4/Native Method) --snip-- at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at brave.webmvc.Frontend.callBackend(Frontend.java:16)
  8. Interpreting the frontend thread dump "qtp38625764-23" #23 prio=5 os_prio=31 tid=0x00007fce80769000

    nid=0x7203 runnable [0x000070000b4c6000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(java.base@9.0.4/Native Method) --snip-- at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at brave.webmvc.Frontend.callBackend(Frontend.java:16) Thread dumps tell you what a thread was running, but not how long, to whom, what preceded it, etc..
  9. Logs on the frontend 14:56:43,443 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 14:56:43,444 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:122) - http-outgoing-0 << HTTP/1.1 200 OK
  10. Interpreting the frontend logs 14:56:43,443 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 14:56:43,444 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 DEBUG [qtp38625764-23] client.RestTemplate (RestTemplate.java:547) - GET request for “http://backend/api" resulted in 200 (OK) If you are lucky, logs preceding and following a thread dump explain context further.
  11. Let’s figure out why this request took a while GET

    /api GET /
  12. Ooh a thread dump from the backend! "qtp78377968-22" #22 prio=5

    os_prio=31 tid=0x00007fc0ff3c9000 nid=0xa103 waiting on condition [0x0000700007fb5000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.printDate(Backend.java:18) This looks suspicious..
  13. Interpreting the backend thread dump "qtp78377968-22" #22 prio=5 os_prio=31 tid=0x00007fc0ff3c9000

    nid=0xa103 waiting on condition [0x0000700007fb5000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.printDate(Backend.java:18) Just like us, threads can sleep!
  14. Logs on the backend 14:56:43,451 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path /api 14:56:43,452 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52]
  15. 14:56:43,451 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method

    for path /api 14:56:43,452 DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52] Interpreting the backend logs Logs tell us details like the time, but can’t tell us we are asleep!
  16. Logs and Thread dumps can get us far We didn’t

    have to change our code except maybe to raise log level Through analysis, we can narrow down to the code if we have a bundle of good info, some experience and a dash of patience
  17. Http with logs and tracing @adrianfcole #zipkin introduction http: logs

    and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  18. Interpreting the frontend logs with tracing 14:56:43,443 [43b8b8a1849142c4/43b8b8a1849142c4] DEBUG [qtp38625764-23]

    handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 14:56:43,444 [43b8b8a1849142c4/43b8b8a1849142c4] DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 14:56:43,448 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:133) - http-outgoing-0 >> GET /api HTTP/1.1 14:56:43,448 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:136) - http-outgoing-0 >> Host: localhost:9000 --snip-- 14:56:48,461 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp38625764-23] conn.LoggingManagedHttpClientConnection (LoggingManagedHttpClientConnection.java:122) - http-outgoing-0 << HTTP/1.1 200 OK Initial step is the root span Next step is a child span
  19. Interpreting the backend logs with tracing 14:56:43,451 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22]

    handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path /api 14:56:43,452 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Backend.printDate(java.lang.String)] 14:56:48,459 [43b8b8a1849142c4/b260d4ccbdf9c0be] DEBUG [qtp78377968-22] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java: 170) - Written [Mon Oct 22 14:56:48 CST 2018] as "text/plain;charset=ISO-8859-1" using [org.springframework.http.converter.StringHttpMessageConverter@74cbbb52] By default, the client and server share the same ID, so you can correlate.
  20. Interpreting the initial server call { "traceId": "43b8b8a1849142c4", "id": "43b8b8a1849142c4",

    "kind": "SERVER", "name": "get /", "timestamp": 1540191403443082, "duration": 5020712, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv6": "::1", "port": 53225 }, "tags": { "http.method": "GET", "http.path": "/", "mvc.controller.class": "Frontend", "mvc.controller.method": "callBackend" } }
  21. { "traceId": "43b8b8a1849142c4", "parentId": "43b8b8a1849142c4", "id": "b260d4ccbdf9c0be", "kind": "CLIENT", "name":

    "get", "timestamp": 1540191403444849, "duration": 5016903, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "tags": { "http.method": "GET", "http.path": "/api" } } Interpreting the outbound client call
  22. { "traceId": "43b8b8a1849142c4", "parentId": "43b8b8a1849142c4", "id": "b260d4ccbdf9c0be", "kind": "SERVER", "name":

    "get /api", "timestamp": 1540191403451067, "duration": 5007961, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv4": "127.0.0.1", "port": 53217 }, "tags": { "http.method": "GET", "http.path": "/api", "mvc.controller.class": "Backend", "mvc.controller.method": "printDate" }, "shared": true Interpreting the next server call
  23. Visualizing the trace

  24. Interpreting the initial server call

  25. Interpreting the next server call

  26. Distributed tracing is a higher level abstraction The data and

    UI don’t look like code so easier on non-coders Visualizations focus on service abstraction primary facts are services, endpoints and timing information
  27. Messaging with logs and dumps @adrianfcole #zipkin introduction http: logs

    and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  28. Let’s figure out why this processing took a while on-message

  29. Ooh a thread dump from the backend! "ActiveMQ Session Task-1"

    #28 prio=7 os_prio=31 tid=0x00007ff0fb3e6000 nid=0x4907 waiting on condition [0x00007000098f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.onMessage(Backend.java:10) --snip-- at org.apache.activemq.ActiveMQSessionExecutor.dispatch(ActiveMQSessionExecutor.java:131) at org.apache.activemq.ActiveMQSessionExecutor.iterate(ActiveMQSessionExecutor.java:202) --snip--
  30. Interpreting the backend thread dump "ActiveMQ Session Task-1" #28 prio=7

    os_prio=31 tid=0x00007ff0fb3e6000 nid=0x4907 waiting on condition [0x00007000098f8000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at brave.webmvc.Backend.onMessage(Backend.java:10) --snip-- at org.apache.activemq.ActiveMQSessionExecutor.dispatch(ActiveMQSessionExecutor.java:131) at org.apache.activemq.ActiveMQSessionExecutor.iterate(ActiveMQSessionExecutor.java:202) --snip-- we’ll get back to this
  31. Logs on the backend 17:35:26,825 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager

    (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1]
  32. Interpreting the backend logs 17:35:26,825 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager

    (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1] We are very unlucky: no lines in the logs are from the stack trace!
  33. No thread dump from the frontend! EEEK are we missing

    something?!
  34. Logs on the frontend 17:26:40,784 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 17:26:40,785 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" …
  35. Interpreting the frontend logs 17:26:40,784 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) -

    Looking up handler method for path / 17:26:40,785 DEBUG [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" … NOT MUCH HELP!
  36. No thread dump from the frontend! It is unlikely you’ll

    see a related thread dump from the frontend as message production is usually quick and does not block on a response.
  37. Let’s put this visually poll send- message on-message Lack of

    explicit connection has to do with how message processing works
  38. Let’s put this literally MessageDispatch md = unconsumedMessages.dequeueNoWait(); // there’s

    often no stack relationship between message // consumption and processing! if (md != null) { dispatch(md); return true; }
  39. Messaging with logs and tracing @adrianfcole #zipkin introduction http: logs

    and dumps http: logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  40. Interpreting the frontend logs with tracing 17:26:40,784 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp38625764-23]

    handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:229) - Looking up handler method for path / 17:26:40,785 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp38625764-23] handler.AbstractHandlerMethodMapping (AbstractHandlerMethodMapping.java:234) - Returning handler method [public org.springframework.http.ResponseEntity<java.lang.String> brave.webmvc.Frontend.callBackend()] --snip-- 17:26:40,794 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp839259983-25] core.JmsTemplate (JmsTemplate.java:567) - Sending created message: ActiveMQObjectMessage … 17:26:40,801 DEBUG [47eaae8393b4405c/47eaae8393b4405c] [qtp839259983-25] annotation.AbstractMessageConverterMethodProcessor (AbstractMessageConverterMethodProcessor.java:170) - Written [sent] as "text/ plain;charset=ISO-8859-1" … no span for the JMS producer!
  41. Interpreting the backend logs with tracing 17:35:26,825 TRACE [47eaae8393b4405c/b9b1184f3f3c2dcf] [ActiveMQ

    Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:193) - Bound value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] to thread [ActiveMQ Session Task-1] 17:35:31,831 TRACE [47eaae8393b4405c/b9b1184f3f3c2dcf] [ActiveMQ Session Task-1] support.TransactionSynchronizationManager (TransactionSynchronizationManager.java:243) - Removed value [org.springframework.jms.listener.LocallyExposedJmsResourceHolder@a544fe8] for key [org.apache.activemq.ActiveMQConnectionFactory@e9b6969] from thread [ActiveMQ Session Task-1] Even if you can’t see the producer ID, the trace ID is there!
  42. Interpreting the initial server call { "traceId": "47eaae8393b4405c", "id": "47eaae8393b4405c",

    "kind": "SERVER", "name": "get /", "timestamp": 1540200484420851, "duration": 69995, "localEndpoint": { "serviceName": “frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "ipv6": "::1", "port": 55041 }, "tags": { "http.method": "GET", "http.path": "/", "mvc.controller.class": "Frontend", "mvc.controller.method": "callBackend" } } Looks exactly like the other one?
  43. { "traceId": "47eaae8393b4405c", "parentId": "47eaae8393b4405c", "id": "221447f43d8df065", "kind": "PRODUCER", "name":

    "send", "timestamp": 1540200484456255, "duration": 6904, "localEndpoint": { "serviceName": "frontend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "serviceName": "jms" }, "tags": { "jms.queue": "backend" } } Interpreting the message producer
  44. { "traceId": "47eaae8393b4405c", "parentId": "221447f43d8df065", "id": "3df175b9151faea7", "kind": "CONSUMER", "name":

    "receive", "timestamp": 1540200484478931, "duration": 1, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" }, "remoteEndpoint": { "serviceName": "jms" }, "tags": { "jms.queue": "backend" } } Interpreting the message consumer
  45. { "traceId": "47eaae8393b4405c", "parentId": "3df175b9151faea7", "id": "b9b1184f3f3c2dcf", "name": "on-message", "timestamp":

    1540200484478932, "duration": 5013861, "localEndpoint": { "serviceName": "backend", "ipv4": "192.168.99.1" } } Interpreting the message processor
  46. Visualizing the trace

  47. Visualizing the trace (zoomed!)

  48. Interpreting the initial server call

  49. Interpreting the message producer

  50. Interpreting the message consumer

  51. Interpreting the message processor

  52. Summary Async is HARD difficult if possible to track manually

    Tracing ubiquitizes by affording the same tools in messaging as normal RPC P.S. the hardest is reactive…
  53. Wrapping Up @adrianfcole #zipkin introduction http: logs and dumps http:

    logs and tracing messaging: logs and dumps messaging: logs and tracing wrapping up
  54. Wrapping up Recognize async is much harder than sync comms

    Don’t roll your own correlation system You are not alone! @adrianfcole #zipkin gitter.im/openzipkin/zipkin github.com/openzipkin/zipkin
  55. Wrapping up Recognize async is much harder than sync comms

    Don’t roll your own correlation system You are not alone! @adrianfcole #zipkin gitter.im/openzipkin/zipkin github.com/openzipkin/zipkin