Distributed tracing with OpenTracing and Jaeger

Max Klyga
January 05, 2018

Distributed tracing helps identify performance bottlenecks in distributed systems.
This presentation describes how Stream uses OpenTracing and Jaeger for distributed tracing in our system.

Build Scalable Newsfeeds & Activity Streams - https://getstream.io

Transcript

  1. Which requests are slow? Why?
     Lookup in Kibana: lacks context, why was it slow?
     Grafana/Graphite: we know what operations were slow but can’t correlate them to any particular request
     • Unclear how to aggregate data
     • Lack of transparency and extensibility
  2. • Understanding individual request behaviour
       - Look at a specific request, not just an aggregated view
       - Search by request ID
     • Identification of performance bottlenecks
       - Given the data for a particular request, identify which operation caused the performance issue
     • Structured data access
       - View the N slowest requests
       - Search for a request on a particular shard/host or from a particular client
  3. • Pass around a unique ID (Trace ID) from the service entry point to the end of execution
     • Build a DAG of related operations
     • Contextualise metadata
  4. OpenTracing: vendor-neutral open API standard for distributed tracing
     • Decoupling tracing backend from client
     • Library/framework integrations
       - Integration via middleware for cross-service tracing for a variety of transports
  5. Integration with StreamGoKit: timer metrics can create spans via *WithTracing method versions
     Integration with API and Proxy servers: incoming request starts span, extracts relevant metadata (shard, API key, etc.)
  6. Tracing span information passed in code via context.Context object
     Cross-service trace linking (Trace ID passed in HTTP headers/gRPC metadata)

         func myOperation(...) {
             timer, subcontext := metrics.StartWithTracing(context, "<operation>")
             defer timer.Stop()

             // timed operation

             subTimer, _ := metrics.StartWithTracing(subcontext, "<sub-operation>")
             // timed sub-operation
             subTimer.Stop()

             // timed operation continues
         }
  7. Jaeger
     • Distributed tracing system released as open source by Uber Technologies
       - v1.0 released Dec 2017
     • Written in Go
  8. Flexible pipeline
     • Agent
       - abstracts routing and discovery of collectors away from the client
       - allows for adaptive sampling rate to be implemented
     • Collector
     • UI for querying
  9. Our deployment now (WIP)
     • Collectors collocated with agents on service machines
     • Proxy, API, DB servers partially instrumented
     • ElasticSearch as storage
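
The trace linking described in slides 3 and 6 (a Trace ID created at the service entry point, carried in context.Context, and forwarded in HTTP headers) can be sketched with the Go standard library alone. This is a minimal illustration, not the StreamGoKit or OpenTracing API: the header name X-Trace-Id and the helpers tracingMiddleware, newDownstreamRequest and propagate are hypothetical names chosen for this example, and the "backend" is called in-process rather than over the network.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// traceHeader is a hypothetical header name used only in this sketch.
const traceHeader = "X-Trace-Id"

// traceKey is an unexported context key type, so other packages cannot collide with it.
type traceKey struct{}

// newTraceID returns a random identifier of 16 hex characters.
func newTraceID() string {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// traceFrom returns the trace ID stored in ctx, or "" if there is none.
func traceFrom(ctx context.Context) string {
	id, _ := ctx.Value(traceKey{}).(string)
	return id
}

// tracingMiddleware reuses the trace ID from the incoming header or, at the
// entry point of the system, creates a new one. The ID is stored in the
// request context so that handlers can read it.
func tracingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(traceHeader)
		if id == "" {
			id = newTraceID()
		}
		ctx := context.WithValue(r.Context(), traceKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// newDownstreamRequest builds an outgoing request that carries the trace ID
// found in ctx, which is what links the two services into one trace.
func newDownstreamRequest(ctx context.Context, url string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set(traceHeader, traceFrom(ctx))
	return req, nil
}

// propagate sends one request through a frontend and a backend handler and
// reports the trace ID each of them observed.
func propagate(incoming string) (frontID, backID string) {
	backend := tracingMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		backID = traceFrom(r.Context())
	}))
	frontend := tracingMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		frontID = traceFrom(r.Context())
		out, err := newDownstreamRequest(r.Context(), "http://backend.invalid/")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// A real service would send this request with an http.Client;
		// the sketch invokes the backend handler directly instead.
		backend.ServeHTTP(httptest.NewRecorder(), out)
	}))

	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if incoming != "" {
		req.Header.Set(traceHeader, incoming)
	}
	frontend.ServeHTTP(httptest.NewRecorder(), req)
	return frontID, backID
}

func main() {
	front, back := propagate("")
	fmt.Println("same trace ID across services:", front == back)
}
```

A tracing library would additionally record a span per operation and report it to a collector; the sketch covers only the ID propagation step that makes those spans correlatable.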