
Distributed Tracing: Understanding how all your components work together

Understanding system failures traditionally starts with looking at a single component in isolation. However, this approach does not provide sufficient information in distributed services architectures. In these systems, end-user requests traverse dozens of components, and therefore a new approach is needed.

In this talk we’ll look at distributed tracing, which summarizes and contextualizes all sides of the story into a well-scoped and shared timeline. We’ll also look at distributed tracing tools, like Zipkin, which highlight the relationship between components, from the very top of the stack to the deepest aspects of the system.

José Carlos Chávez

November 15, 2017

Transcript

  1. About me José Carlos Chávez Software Engineer at Typeform, focused

    on the responses services aggregate. Open source contributor to Distributed Tracing projects. @jcchavezs [email protected]
  2. Distributed systems A collection of independent components that appears to its

    users as a single coherent system. Characteristics: → Concurrency → No global clock → Independent failures
  3. Distributed systems Water heater Gas supplier Cold water storage tank

    Shutoff valve First floor branch Tank valve 爆$❄#☭
  4. Distributed microservices architecture Frontend Service Ads Content Search Images Search

    DB2 DB3 DB1 TCP error (2003) 500 Internal Error 500 Internal Error Search Service GET /?q=cats
  5. Distributed microservices architecture Water heater Gas supplier Cold water storage

    tank Shutoff valve First floor branch Tank valve 爆$❄#☭ I AM HERE! First floor distributor is clogged!
  6. Logs Frontend Service Ads Search Service Content Search Images Search

    DB2 DB3 DB1 TCP error (2003) 500 Internal Error 500 Internal Error GET /?q=cats
  7. Logs ? ?

    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13548”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23948”
    [24/Oct/2017 13:50:08 +0000] “GET / HTTP/1.1” 200 … **0/12396”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23748”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/23248”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 200 … **0/26548”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13148”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/2588”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 500 … **0/3248”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23548”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/22598”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/13948”
    ...
  8. Distributed Tracing → Understanding latency issues across services → Identifying

    dependencies and critical paths → Visualizing concurrency → Request scoped → Complementary to other monitoring tools
  9. Distributed Tracing Frontend Service Ads Search Service Content Search Images

    Search DB2 DB3 DB1 TCP error (2003) Can’t connect to server 500 Internal Error 500 Internal Error GET /?q=cats
  10. Logs

    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13548”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23948”
    [24/Oct/2017 13:50:08 +0000] “GET / HTTP/1.1” 200 … **0/12396”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23748”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/23248”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 200 … **0/26548”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/13148”
    [24/Oct/2017 13:50:07 +0000] “GET / HTTP/1.1” 200 … **0/2588”
    [24/Oct/2017 13:50:07 +0000] “GET /ads HTTP/1.1” 500 … **0/3248”
    [24/Oct/2017 13:50:07 +0000] “GET /omnisearch HTTP/1.1” 200 … **0/23548”
    [24/Oct/2017 13:50:07 +0000] “GET /content HTTP/1.1” 200 … **0/22598”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/13948”
    ...
  11. Distributed Tracing FRONTEND SEARCH ADS CONTENT IMAGES Time [1508410470] error

    TCP (2003) 500 [1508410442] no cache for resource, retrieving from DB db_2_inst1 error error
  12. Elements of Distributed Tracing → A trace shows an execution

    path through a distributed system → A context includes information that should be propagated across services → A span in the trace represents a logical unit of work (with a beginning and end) → Tags and logs (optional) add complementary information to spans.
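The elements above can be sketched as a minimal data model. This is a hypothetical illustration in Python (the class and field names are my own, not any particular tracer's API):

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

def new_id() -> str:
    """Random 64-bit id as a hex string (illustrative only)."""
    return uuid.uuid4().hex[:16]

@dataclass
class SpanContext:
    """The information that must be propagated across services."""
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None
    sampled: bool = True

@dataclass
class Span:
    """A logical unit of work with a beginning and an end."""
    name: str
    context: SpanContext
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: dict = field(default_factory=dict)   # e.g. {"http.status_code": "500"}
    logs: list = field(default_factory=list)   # timestamped events

    def finish(self) -> None:
        self.end = time.time()

# A trace is simply the set of spans that share one trace_id:
root = Span("GET /?q=cats", SpanContext(trace_id=new_id(), span_id=new_id()))
child = Span("search-service", SpanContext(
    trace_id=root.context.trace_id,      # same trace id everywhere
    span_id=new_id(),
    parent_id=root.context.span_id))     # links the child to its parent
child.tags["error"] = "TCP error (2003)"
child.finish()
root.finish()
```

Real tracers use 64- or 128-bit binary ids and microsecond timestamps; the point here is the shape of the data: every span carries the context that links it into its trace.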
  13. How does propagation work? Client / Producer Extract() Inject() Span

    2 Span 1 Extract() TraceID: fAf3oXLoDS ParentID: ... SpanID: dZ0qHIBa1A Sampled: true ... TraceID: fAf3oXLoDS ParentID: dZ0qHIBa1A SpanID: oYa7m31wq5 Sampled: true ...
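The Inject()/Extract() step in the diagram can be illustrated with Zipkin's B3 HTTP headers (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, X-B3-Sampled). The helpers below are a simplified sketch, not a real instrumentation library:

```python
import uuid

def new_id() -> str:
    """Random id as a hex string (illustrative only)."""
    return uuid.uuid4().hex[:16]

def inject(ctx: dict, headers: dict) -> None:
    """Client side: write the span context into outgoing request headers."""
    headers["X-B3-TraceId"] = ctx["trace_id"]
    headers["X-B3-SpanId"] = ctx["span_id"]
    if ctx.get("parent_id"):
        headers["X-B3-ParentSpanId"] = ctx["parent_id"]
    headers["X-B3-Sampled"] = "1" if ctx["sampled"] else "0"

def extract(headers: dict) -> dict:
    """Server side: rebuild the caller's context and start a child span."""
    return {
        "trace_id": headers["X-B3-TraceId"],   # same trace id continues
        "parent_id": headers["X-B3-SpanId"],   # caller's span becomes our parent
        "span_id": new_id(),                   # fresh id for the server's span
        "sampled": headers.get("X-B3-Sampled", "1") == "1",
    }

# Span 1 (client) injects, Span 2 (server) extracts:
span1 = {"trace_id": "fAf3oXLoDS", "span_id": "dZ0qHIBa1A",
         "parent_id": None, "sampled": True}
headers = {}
inject(span1, headers)
span2 = extract(headers)
```

Note how the server's new span keeps the caller's trace id and records the caller's span id as its parent; that is all the propagation needed to reassemble the timeline later.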
  14. Benefits of Distributed Tracing → Effective monitoring → System insight,

    clarifies non-trivial interactions → Visibility into critical paths and dependencies → Observe E2E latency in near real time → Request scoped, not request’s lifecycle scoped.
  15. What about overhead? → Observability tools are unobtrusive → Sampling

    reduces overhead → Instrumentation can be delegated to common frameworks → (Don’t) trace every single operation
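A common way sampling keeps overhead down is to make the decision once, deterministically from the trace id, so every service in the call chain agrees without coordination. A sketch (hypothetical helper; it assumes hex trace ids, as real Zipkin ids are):

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Sample a fixed fraction of traces, chosen deterministically
    from the (hex) trace id so every hop makes the same decision."""
    bucket = int(trace_id, 16) % 10000
    return bucket < rate * 10000

# The root service decides once and propagates the result
# (e.g. in a sampled flag) so downstream services never re-decide.
```

At a 1% rate, 99% of requests pay almost nothing for tracing while the sampled traces still expose latency patterns and dependencies.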
  16. Zipkin → Distributed tracing tool → Based on Google Dapper

    (2010) → Created by Twitter (2012) → Open source (2015) → Strong community
  17. Zipkin’s architecture Service (instrumented) → Transport (http/kafka/grpc)

    → Collector (receives spans, deserializes and schedules them for storage) → Storage DB (Cassandra/MySQL/ElasticSearch) → API (retrieves data) → UI (visualizes traces)
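The "collect spans" step can be sketched against Zipkin's v2 HTTP collector, which accepts a JSON array of spans POSTed to /api/v2/spans (port 9411 by default). A minimal stdlib-only example; the ids, names and durations are made up:

```python
import json
import time
import urllib.request

def zipkin_span(trace_id, span_id, name, service,
                duration_us, parent_id=None, tags=None):
    """Build one span in Zipkin's v2 JSON format."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": int(time.time() * 1_000_000),  # microseconds since epoch
        "duration": duration_us,                    # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},
    }
    if parent_id:
        span["parentId"] = parent_id
    return span

def report(spans, collector="http://localhost:9411"):
    """POST a batch of spans to the collector over HTTP."""
    req = urllib.request.Request(
        collector + "/api/v2/spans",
        data=json.dumps(spans).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

span = zipkin_span("5af7183fb1d4cf5f", "6b221d5bc9e6496c",
                   "get /?q=cats", "frontend", duration_us=26500)
# report([span])  # requires a running Zipkin instance at localhost:9411
```

In practice services use an instrumentation library (e.g. Brave for Java) that builds, batches and ships these payloads automatically over the chosen transport.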
  18. Q&A