Distributed Tracing at Netflix

This deck was presented at a private workshop on distributed tracing and describes the state of distributed tracing at Netflix as of July 2015.

Nitesh Kant

July 10, 2015

Transcript

  1. Nitesh Kant - @NiteshKant - Software Engineer, Cloud Platform, Netflix

    Distributed Tracing @Netflix a.k.a Salp July 2015
  2. Nitesh Kant Who Am I? ❖ Engineer, Cloud Platform, Netflix.

    ❖ Leading Reactive IPC @Netflix ❖ Core contributor, RxNetty* ❖ Introduced distributed tracing inside Netflix (not yet OSS) * https://github.com/ReactiveX/RxNetty @NiteshKant
  3. 62 Million Subscribers world-wide 10 Billion hours / month http://ir.netflix.com/

  4. AWS Availability Zone AWS Availability Zone AWS Availability Zone

  5. Availability Zone Zuul (Routing layer) Edge Services Various Microservices

  6. The problem

  7. The food chain A simplistic hypothetical call chain. Zuul API

    Recommendation service Subscriber service Cache C*
  8. The food chain A simplistic hypothetical call chain. Zuul API

    Recommendation service Subscriber service Cache C* What services does a request touch?
  9. The food chain A simplistic hypothetical call chain. Zuul API

    Recommendation service Subscriber service Cache C* Who am I dependent on? What services does a request touch?
  10. The food chain A simplistic hypothetical call chain. Zuul API

    Recommendation service Subscriber service Cache C* Who am I dependent on? Who am I dependent on? Who calls me? What services does a request touch?
  11. The food chain A simplistic hypothetical call chain. Zuul API

    Recommendation service Subscriber service Cache C* Who am I dependent on? Who am I dependent on? Who calls me? Who overwhelmed me now? What services does a request touch?
  12. Back in the days

  13. Request Trace In-memory storage of data till request completion. Zuul

    API Recommendation service Subscriber service Cache C* A simplistic hypothetical call chain.
  14. Request Trace In-memory storage of data till request completion. Zuul

    API Recommendation service Subscriber service Cache C* A simplistic hypothetical call chain.
  15. Request Trace In-memory storage of data till request completion. Zuul

    API Recommendation service Subscriber service Cache C* A simplistic hypothetical call chain.
  16. Request Trace In-memory storage of data till request completion. Zuul

    API Recommendation service Subscriber service Cache C* A simplistic hypothetical call chain.
  17. Request Trace In-memory storage of data till request completion. Inflexible

    data model (averse to custom data) Zuul API Recommendation service Subscriber service Cache C* A simplistic hypothetical call chain.
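
As a rough illustration of the approach in slides 13-17: the trace was held in memory until the request completed, in a fixed schema with no room for custom data. A minimal Java sketch, with all names invented for illustration:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the old approach: trace data lives in memory
    // until the request completes, in a fixed, inflexible record.
    final class InMemoryRequestTrace {
        // Fixed schema: no slot for custom key-value data.
        record Hop(String service, long startMillis, long durationMillis) { }

        private final List<Hop> hops = new ArrayList<>();

        void record(String service, long startMillis, long durationMillis) {
            hops.add(new Hop(service, startMillis, durationMillis));
        }

        // The trace only becomes available once the request finishes.
        List<Hop> onRequestComplete() {
            return List.copyOf(hops);
        }
    }
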
  18. Today* * July 2015

  19. Salp (Transparent Microservices Interactions)

  20. Zuul API Recommendation service Subscriber service Cache C* A simplistic

    hypothetical call chain.
  21. User Request Tracing Filter Zuul

  22. User Request Tracing Filter Local tracing decision Zuul

  23. User Request Tracing Filter Dynamic Config Service Get sampling rate

    Zuul
  24. User Request Tracing Filter Initialize local tracing context, when traced

    Zuul
  25. User Request Tracing Filter IPC instrumentation filter (Server) Zuul

  26. User Request Tracing Filter IPC instrumentation filter (Server) Emit server

    receive annotation Zuul
  27. User Request Tracing Filter IPC instrumentation filter (Server) Async publish

    to Suro* Logging Filter * https://github.com/Netflix/suro Zuul
  28. User Request Tracing Filter IPC instrumentation filter (Server) Zuul IPC

    instrumentation filter (Client) Emit client send annotation Dependency request
  29. Tracing Filter IPC instrumentation filter (Server) Zuul IPC instrumentation filter

    (Client) Emit client receive annotation Dependency response
  30. IPC instrumentation filter (Server) Zuul IPC instrumentation filter (Client) Emit

    server send annotation Dependency response Response
  31. IPC instrumentation filter (Server) Zuul IPC instrumentation filter (Client) Emit

    server send annotation Dependency response Response Any other service
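
Slides 21-31 walk through the filter pipeline: a tracing filter makes a local sampling decision and initializes the tracing context, and IPC instrumentation filters emit server-receive/server-send (and, on outbound calls, client-send/client-receive) annotations that are published asynchronously to Suro. A minimal Java sketch of the server side, assuming hypothetical names throughout (the real Netflix filters are not public):

    import java.util.Map;
    import java.util.UUID;
    import java.util.function.Function;

    // Illustrative only: none of these names are Netflix's real APIs.
    final class TracingServerFilter {

        interface Publisher { void publishAsync(String annotation, String traceId); }

        private final double sampleRate; // fetched from the dynamic config service
        private final Publisher suro;    // async publisher to Suro

        TracingServerFilter(double sampleRate, Publisher suro) {
            this.sampleRate = sampleRate;
            this.suro = suro;
        }

        String handle(Map<String, String> headers,
                      Function<Map<String, String>, String> next) {
            // Local tracing decision: trace a sampled fraction of requests.
            if (Math.random() >= sampleRate) {
                return next.apply(headers);
            }
            // Initialize the local tracing context, reusing an upstream
            // trace id when one was propagated in the headers.
            String traceId = headers.getOrDefault(
                    "X-Trace-Id", UUID.randomUUID().toString());
            suro.publishAsync("server-receive", traceId);
            try {
                return next.apply(headers);
            } finally {
                suro.publishAsync("server-send", traceId);
            }
        }
    }

The client-side filter would mirror this, emitting a client-send annotation before the dependency request and a client-receive annotation on the dependency response.
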
  32. Tracing Context Propagation.

  33. In Process Within thread boundaries Thread local

  34. In Process Within thread boundaries Thread local Across thread boundaries

    Thread “Variable”. Copied on thread boundaries.
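
A minimal sketch of the "thread variable" idea on slide 34: the context lives in a ThreadLocal and is explicitly copied whenever work crosses a thread boundary. The names are illustrative, not Netflix's proprietary library:

    import java.util.concurrent.Executor;

    final class TraceContextHolder {
        private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

        static void set(String ctx) { CONTEXT.set(ctx); }
        static String get()         { return CONTEXT.get(); }
        static void clear()         { CONTEXT.remove(); }

        // Wrap an Executor so the submitting thread's context is copied
        // onto the worker thread for the duration of the task.
        static Executor propagating(Executor delegate) {
            return task -> {
                String captured = get();       // copy at the thread boundary
                delegate.execute(() -> {
                    String previous = get();
                    set(captured);
                    try {
                        task.run();
                    } finally {
                        set(previous);         // restore the worker's own context
                    }
                });
            };
        }
    }
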
  35. Across processes A simplistic hypothetical call chain. Zuul API Recommendation

    Subscriber service Cache C*
  36. Across processes A simplistic hypothetical call chain. Zuul API Recommendation

    Subscriber service Cache C* ❖ Proprietary, generic over-the-network “context propagation” library. ❖ HTTP traffic only; context propagated as headers. ❖ On receipt, realized back into the thread local.
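
A sketch of the cross-process propagation slide 36 describes, reusing the TraceContextHolder from the previous sketch: the context travels only as HTTP headers and is restored into the thread local on receipt. "X-Trace-Id" is an invented header name, not the actual proprietary one:

    import java.net.URI;
    import java.net.http.HttpRequest;
    import java.util.Map;

    final class HeaderPropagation {

        // Client side: inject the current context into the outgoing request.
        static HttpRequest inject(URI uri) {
            HttpRequest.Builder builder = HttpRequest.newBuilder(uri);
            String ctx = TraceContextHolder.get();
            if (ctx != null) {
                builder.header("X-Trace-Id", ctx);
            }
            return builder.build();
        }

        // Server side: on receipt, realize the header back into the thread local.
        static void extract(Map<String, String> headers) {
            String ctx = headers.get("X-Trace-Id");
            if (ctx != null) {
                TraceContextHolder.set(ctx);
            }
        }
    }
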
  37. Sampling

  38. User Request Tracing Filter Dynamic Config Service Get sampling rate

    Zuul
  39. User Request Tracing Filter Dynamic Config Service Get sampling rate

    Zuul ❖ Simple sample rate (% of requests) stored in the configuration service. ❖ Granularity: Per application. ❖ Dynamic: Checked per request.
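
A sketch of that sampling scheme: a per-application percentage held in a dynamic configuration service (Archaius-style at Netflix) and re-read on every request, so it can be changed at runtime without a redeploy. DynamicConfig here is a stand-in, not the real API:

    interface DynamicConfig {
        double getDouble(String key, double defaultValue);
    }

    final class PercentageSampler {
        private final DynamicConfig config;
        private final String app;

        PercentageSampler(DynamicConfig config, String app) {
            this.config = config;
            this.app = app;
        }

        boolean shouldTrace() {
            // Dynamic: the rate is checked per request, so operators can
            // dial tracing up or down while the app is running.
            double rate = config.getDouble("tracing." + app + ".sampleRate", 0.0);
            return Math.random() < rate;
        }
    }
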
  40. Targeted tracing

  41. User Request Tracing Filter Dynamic Config Service Get sampling rate

    Zuul
  42. User Request Tracing Filter Dynamic Config Service Get sampling rate

    Zuul OR Dynamic criterion on request headers, URI, etc.
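
Targeted tracing, as sketched below, ORs the plain sampling decision with a dynamic predicate over the request itself. The criterion shown (a forced-trace header, a URI prefix) is a made-up example, and PercentageSampler is the type from the sampling sketch above:

    import java.util.Map;

    final class TargetedTracer {
        private final PercentageSampler sampler;

        TargetedTracer(PercentageSampler sampler) { this.sampler = sampler; }

        boolean shouldTrace(String uri, Map<String, String> headers) {
            return matchesCriterion(uri, headers) || sampler.shouldTrace();
        }

        private boolean matchesCriterion(String uri, Map<String, String> headers) {
            // e.g. trace every request carrying a debug header,
            // or every request to one URI prefix.
            return "true".equals(headers.get("X-Force-Trace"))
                    || uri.startsWith("/recommendations");
        }
    }
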
  43. None
  44. None
  45. None
  46. None
  47. None
  48. Data Publishing (via Suro) Tracing Filter IPC instrumentation filter (Server)

    Async publish to Suro* Logging Filter * https://github.com/Netflix/suro Zuul
  49. None
  50. None
  51. Data Format Custom K-V pair over Thrift

  52. Data Format Custom K-V pair over Thrift JSON JSON
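
As a rough picture of what such a key-value annotation might look like (field names invented for illustration; the real schema is not public):

    import java.util.Map;

    // Each annotation is a small bag of custom key-value pairs, carried
    // over Thrift on the wire and converted to JSON downstream.
    record Annotation(String traceId, String name, long timestampMicros,
                      Map<String, String> customTags) { }
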

  53. Publishing Rate Annotations per second. Daily (for a week)

  54. Publishing Rate Bytes per second. Daily (for a week)

  55. Visualizations

  56. Request traces Source: ElasticSearch. Granularity: Per request.

  57. Flow graphs Source: Druid. Granularity: Aggregated per cluster.

  58. Flow graphs Source: Druid. Granularity: Aggregated per cluster. Call

    volume & HTTP response code distribution per edge.
  59. Another visualization Source: API. Granularity: Aggregated. http://techblog.netflix.com/2015/02/a-microscope-on-microservices.html

  60. Tech landscape (Today*) Homogeneous architecture. Apps in the critical path (customer

    request) are all JVM-based.* HTTP/1.1-based services. * July 2015 * Node.js front-end app, but its IPC calls go via a JVM sidecar.
  61. Tech landscape (Future) Heterogeneous architecture. Polyglot environment. Full-duplex communication.

  62. Thank You! @NiteshKant July 2015