Distributed Tracing at UBER Scale

Presented at Monitorama PDX 2017.
Video: https://vimeo.com/221070602

Yuri Shkuro

May 24, 2017

Transcript

  1. Distributed Tracing at UBER Scale
    Creating a treasure map
    for your monitoring data
    Yuri Shkuro, UBER Technologies

  2. ABOUT ME
    •  Software Engineer on the
    Observability team in NYC
    •  Working on the open source
    distributed tracing system Jaeger
    •  Co-founded the OpenTracing
    project
    •  Banking industry survivor
    •  Github: yurishkuro
    •  Twitter: @yurishkuro

  3. Would You Like Some Tracing with
    Your Monitoring?
    What does it take to roll it out?

  4. Why Distributed Tracing
    •  Distributed transaction monitoring
    •  Performance / latency optimization
    •  Root cause analysis
    •  Service dependency analysis
    •  Distributed context propagation (“baggage”)

  5. JAEGER, Distributed Tracing
    •  Open Source
    •  OpenTracing inside
    •  In active development
    •  PRs are welcome
    •  Zipkin compatible
    •  github.com/uber/jaeger

  6. Who Thinks Tracing is Awesome?

  9. Why Doesn’t Everyone Do Tracing?

  10. Tracing Instrumentation is
    HARD
    EXPENSIVE
    BORING

  11. Instrumentation
    •  Metrics and logging are not new
    •  Tracing is both new and harder

  12. Context Propagation
    [Diagram: services A, B, C, D and E call one another]
    Edge service: Unique ID → {context}
    The {context} is passed along on each call between services

  13. Context Propagation
    [Diagram: APPLICATION / MICROSERVICE]
    Inbound HTTP Request (Headers: . . . Trace ID . . .)
    → Instrumentation → Handler Context [Span]
    → Client Context [Span] → Instrumentation
    → Outbound HTTP Request (Headers: . . . Trace ID . . .)
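
The inbound-to-outbound flow on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not the Jaeger or OpenTracing API: the header name `x-trace-id`, the `SpanContext` class, and the `extract` / `inject` helpers are hypothetical names chosen for the example.

```python
import secrets
from dataclasses import dataclass

# Hypothetical header name, chosen for illustration only.
TRACE_HEADER = "x-trace-id"


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # shared by every span in the same trace
    span_id: str   # unique to one unit of work


def new_id() -> str:
    # 8 random bytes rendered as 16 hex characters.
    return secrets.token_hex(8)


def extract(headers: dict) -> SpanContext:
    """Inbound side: reuse the caller's trace ID, or start a new trace
    when the request arrives without one (as at an edge service)."""
    trace_id = headers.get(TRACE_HEADER) or new_id()
    return SpanContext(trace_id=trace_id, span_id=new_id())


def inject(ctx: SpanContext, headers: dict) -> dict:
    """Outbound side: copy the headers and add the trace ID."""
    out = dict(headers)
    out[TRACE_HEADER] = ctx.trace_id
    return out


# Handler span for the inbound request, then a client span for the outbound call.
handler_ctx = extract({TRACE_HEADER: "abc123"})
client_ctx = SpanContext(trace_id=handler_ctx.trace_id, span_id=new_id())
outbound_headers = inject(client_ctx, {"accept": "application/json"})
```

The step this sketch leaves out is how `handler_ctx` gets from the handler to the client code inside the process.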

  14. In-Process Context Propagation
    Implicit, via Thread-Locals
    but: thread pools, futures
    Explicit
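
A small Python sketch of why thread-locals break down with thread pools, and why explicit passing does not. The function names and the `trace_id` key are hypothetical; only `threading.local` and `concurrent.futures.ThreadPoolExecutor` come from the standard library.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def implicit_trace_id():
    # Reads whatever the *current* thread stored, or None if nothing was set.
    return getattr(_local, "trace_id", None)


def explicit_trace_id(ctx):
    # The context is an ordinary argument, so it follows the call to any thread.
    return ctx["trace_id"]


_local.trace_id = "abc123"      # set on the request-handling thread
ctx = {"trace_id": "abc123"}    # the same value, carried explicitly

with ThreadPoolExecutor(max_workers=1) as pool:
    # The worker thread has its own, empty thread-local storage.
    seen_implicitly = pool.submit(implicit_trace_id).result()
    seen_explicitly = pool.submit(explicit_trace_id, ctx).result()
```

Here `seen_implicitly` is `None` because the pool's worker thread never saw the value set on the submitting thread, while `seen_explicitly` is `"abc123"`.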

  15. It’s Also the Frameworks
    •  Go: stdlib, gorilla, …
    •  Java: jaxrs2, okhttp, ApacheHttpClient, …
    •  Python: Flask, Django, Tornado, urllib2, …
    •  Node.js – who knows…

  16. OpenTracing to the Rescue

  17. No Help With In-Process Propagation
    •  Must be done manually
    •  UBER has 2000-3000 microservices
    •  Resources of the tracing team are limited
    •  Developers must instrument their code!

  18. BITE MAKE ME!
    How do we mobilize the org?

  19. Traveling Salesman Problem
    2017 edition

  20. They Must Want Your Product
    or Sticks and Carrots

  21. Recap: Why Distributed Tracing
    •  Distributed transaction monitoring
    •  Performance / latency optimization
    •  Root cause analysis
    •  Service dependency analysis
    •  Distributed context propagation (“baggage”)

  22. Service Dependency Analysis
    •  Explain to us what we just built
    •  Who are my dependencies
    •  Workflow analysis
    •  Where is all this traffic coming from?
    •  Service tiers

  23. Baggage
    •  Tenancy, test or production
    – Set at the top
    – Used at the storage layer, prod or test DB
    •  Authentication tokens
    – Signed user or service identity
    – Checked at multiple levels
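
The tenancy use case can be sketched as follows. This is a toy illustration under stated assumptions: the `synthetic` request flag, the function names, and the database names `prod-db` / `test-db` are all hypothetical, and in a real system the baggage would travel between processes with the trace context rather than as a function argument.

```python
def edge_handler(request: dict) -> str:
    # Set at the top: the edge decides the tenancy once per request.
    tenancy = "test" if request.get("synthetic") else "production"
    baggage = {"tenancy": tenancy}
    return business_logic(baggage)


def business_logic(baggage: dict) -> str:
    # Intermediate layers do not inspect the baggage; they only pass it on.
    return storage_layer(baggage)


def storage_layer(baggage: dict) -> str:
    # Used at the storage layer: pick the database from the tenancy.
    databases = {"production": "prod-db", "test": "test-db"}
    return databases[baggage["tenancy"]]


test_target = edge_handler({"synthetic": True})
prod_target = edge_handler({})
```

The middle layer never needs to know what tenancy means, which is the appeal of carrying it as baggage.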

  24. Sticks and Carrots
    •  Get other teams to build features on top
    – Performance team
    – Capacity & cost accounting
    – Baggage
    •  More carrots
    •  Eventually they become sticks (peer pressure)

  25. Each Organization is Different
    Find what works best

  26. How to Measure Adoption?
    Measure everything

  27. Does Service X Report Traces?
    •  Daily aggregation job
    •  Auto-book tickets
    •  Build a dashboard
    •  Pass/Fail: too easy to pass
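
A pass/fail check of this kind might look like the sketch below. It assumes a hypothetical input shape (one dict per span with a `service` key) and a hypothetical service list; it is not the job described in the talk.

```python
from collections import Counter


def pass_fail(all_services, spans):
    """Map each known service to True if it reported at least one span."""
    counts = Counter(span["service"] for span in spans)
    return {service: counts[service] > 0 for service in all_services}


all_services = ["api", "billing", "search"]
spans_today = [
    {"service": "api"},
    {"service": "api"},
    {"service": "search"},
]
report = pass_fail(all_services, spans_today)
```

A single span is enough to turn a service green here, which illustrates why a binary check is too easy to pass.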

  28. Trace Quality Score
    •  Inspect traces
    – See a caller, but no spans
    •  Join with other data
    – Routing logs
    •  Auto-book tickets (carefully, not for everyone)
    – With detailed report
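
Two of these checks can be sketched with plain set operations. The span fields `service` and `peer` and the routing-log format (caller, callee pairs) are assumptions made for the example, not the actual data model.

```python
def missing_callees(spans):
    """Services that a caller's span names as the callee,
    but that never reported a span of their own."""
    reporting = {s["service"] for s in spans}
    called = {s["peer"] for s in spans if s.get("peer")}
    return called - reporting


def untraced_edges(spans, routing_log):
    """(caller, callee) pairs seen by the routing layer but absent from traces."""
    traced = {(s["service"], s["peer"]) for s in spans if s.get("peer")}
    return set(routing_log) - traced


spans = [
    {"service": "api", "peer": "billing"},
    {"service": "api", "peer": "search"},
    {"service": "search", "peer": None},
]
routing_log = [("api", "billing"), ("api", "search"), ("search", "index")]

no_spans = missing_callees(spans)
no_trace = untraced_edges(spans, routing_log)
```

In this toy data `billing` is called but reports nothing, and the `search` to `index` call appears only in the routing log, so both would be candidates for a ticket with a detailed report.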

  29. Trace Quality Metrics by Service

  30. Thank You
    •  Jaeger
    –  https://github.com/uber/jaeger
    –  Blog: Evolving Distributed Tracing at UBER
    –  Blog: Take OpenTracing for a HotROD Ride
    •  OpenTracing: http://opentracing.io/
    •  We are hiring
    •  @yurishkuro
