Distributed Tracing at UBER Scale

Presented at Monitorama PDX 2017.
Video: https://vimeo.com/221070602

Yuri Shkuro

May 24, 2017

Transcript

  1. Distributed Tracing at UBER Scale
    Creating a treasure map for your monitoring data
    Yuri Shkuro, UBER Technologies
  2. ABOUT ME
    • Software Engineer on the Observability team in NYC
    • Working on the open source distributed tracing system Jaeger
    • Co-founded the OpenTracing project
    • Banking industry survivor
    • GitHub: yurishkuro
    • Twitter: @yurishkuro
  3. Would You Like Some Tracing with Your Monitoring?
    What does it take to roll it out?
  4. Why Distributed Tracing
    • Distributed transaction monitoring
    • Performance / latency optimization
    • Root cause analysis
    • Service dependency analysis
    • Distributed context propagation (“baggage”)
  5. JAEGER, Distributed Tracing
    • Open Source
    • OpenTracing inside
    • In active development
    • PRs are welcome
    • Zipkin compatible
    • github.com/uber/jaeger
  6. Who Thinks Tracing is Awesome?

  7. None
  8. None
  9. Why Doesn’t Everyone Do Tracing?

  10. Tracing Instrumentation is HARD, EXPENSIVE, BORING

  11. Instrumentation
    • Metrics and logging are not new
    • Tracing is both new and harder
  12. Context Propagation
    [Diagram: services A, B, C, D, E, with {context} passed along each call. Labels: "Unique ID → {context}", "Edge service".]
  13. Context Propagation
    [Diagram: APPLICATION / MICROSERVICE. An Inbound HTTP Request (Headers: . . . Trace ID . . .) passes through Instrumentation to the Handler (Context [Span]), then to the Client (Context [Span]), and through Instrumentation again to an Outbound HTTP Request (Headers: . . . Trace ID . . .).]
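
The flow in the slide 13 diagram can be sketched in a few lines. This is only an illustration, not Jaeger's or OpenTracing's actual API: the header name `x-trace-id` and the helpers `extract_context`, `start_span`, `inject_context`, and `handle_request` are all hypothetical names invented for the sketch.

```python
import uuid

# Hypothetical header name, chosen for this sketch only.
TRACE_HEADER = "x-trace-id"


def extract_context(inbound_headers):
    """Read the trace ID from inbound headers, or start a new trace."""
    trace_id = inbound_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {"trace_id": trace_id}


def start_span(context, operation):
    """Create a span record that belongs to the trace in `context`."""
    return {
        "trace_id": context["trace_id"],
        "span_id": uuid.uuid4().hex[:16],
        "operation": operation,
    }


def inject_context(context, outbound_headers):
    """Copy the trace ID into the headers of an outbound request."""
    outbound_headers[TRACE_HEADER] = context["trace_id"]
    return outbound_headers


def handle_request(inbound_headers):
    """Extract the context, record handler and client spans, build outbound headers."""
    context = extract_context(inbound_headers)
    server_span = start_span(context, "handler")
    client_span = start_span(context, "client")
    outbound_headers = inject_context(context, {"accept": "application/json"})
    return server_span, client_span, outbound_headers
```

When a request arrives without the header, as it would at an edge service, `extract_context` generates a fresh ID, and every downstream call then carries it.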
  14. In-Process Context Propagation
    • Implicit, via thread-locals (but: thread pools, futures)
    • Explicit
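
A small sketch of why thread pools complicate implicit propagation, using Python's standard `threading.local` and `ThreadPoolExecutor`. The trace ID value and the function names are made up for the example; a real tracer would store a span or context object rather than a bare string.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # implicit, per-thread storage


def current_trace_id():
    """Return the trace ID stored for the calling thread, if any."""
    return getattr(_local, "trace_id", None)


def implicit_task():
    # Runs on a pool thread, which has its own (empty) thread-local storage.
    return current_trace_id()


def explicit_task(trace_id):
    # The caller hands the context over as an ordinary argument.
    return trace_id


def run_demo():
    _local.trace_id = "abc123"  # set on the request-handling thread
    with ThreadPoolExecutor(max_workers=1) as pool:
        implicit = pool.submit(implicit_task).result()
        explicit = pool.submit(explicit_task, current_trace_id()).result()
    return implicit, explicit
```

Here `run_demo()` returns `(None, "abc123")`: the implicit lookup loses the context once the work moves to a pool thread, while the explicit argument survives at the cost of changing function signatures.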
  15. It’s Also the Frameworks
    • Go: stdlib, gorilla, …
    • Java: jaxrs2, okhttp, ApacheHttpClient, …
    • Python: Flask, Django, Tornado, urllib2, …
    • Node.js – who knows…
  16. OpenTracing to the Rescue

  17. No Help With In-Process Propagation
    • Must be done manually
    • UBER has 2000-3000 microservices
    • Resources of the tracing team are limited
    • Developers must instrument their code!
  18. BITE MAKE ME! How do we mobilize the org?

  19. Traveling Salesman Problem, 2017 edition

  20. They Must Want Your Product, or Sticks and Carrots

  21. Recap: Why Distributed Tracing
    • Distributed transaction monitoring
    • Performance / latency optimization
    • Root cause analysis
    • Service dependency analysis
    • Distributed context propagation (“baggage”)
  22. Service Dependency Analysis
    • Explain to us what we just built
    • Who are my dependencies
    • Workflow analysis
    • Where is all this traffic coming from?
    • Service tiers
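
One way to answer "who are my dependencies" from trace data is to turn parent/child span links into service-to-service edges. The sketch below assumes a simplified span record with `span_id`, `parent_id`, and `service` fields; these field names are hypothetical and do not reflect Jaeger's storage schema.

```python
from collections import Counter


def service_dependencies(spans):
    """Count caller -> callee edges between services from parent/child span links."""
    service_by_span = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for span in spans:
        parent_service = service_by_span.get(span.get("parent_id"))
        # Skip root spans and spans whose parent is in the same service.
        if parent_service and parent_service != span["service"]:
            edges[(parent_service, span["service"])] += 1
    return edges


example_spans = [
    {"span_id": 1, "parent_id": None, "service": "edge"},
    {"span_id": 2, "parent_id": 1, "service": "rides"},
    {"span_id": 3, "parent_id": 2, "service": "pricing"},
    {"span_id": 4, "parent_id": 2, "service": "rides"},
    {"span_id": 5, "parent_id": 4, "service": "pricing"},
]
```

For `example_spans`, the result has one `("edge", "rides")` edge and two `("rides", "pricing")` edges; aggregating these counts over many traces gives a rough traffic-weighted dependency graph.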
  23. Baggage
    • Tenancy, test or production
      – Set at the top
      – Used at the storage layer, prod or test DB
    • Authentication tokens
      – Signed user or service identity
      – Checked at multiple levels
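
The tenancy use case can be illustrated with a toy call chain in which baggage is a plain dictionary passed through each layer. The database names, the function names, and the choice to default to production when tenancy is missing are all assumptions made for this sketch, not details from the talk.

```python
# Hypothetical database names, for illustration only.
DATABASES = {"production": "rides_prod", "test": "rides_test"}


def edge_handler(is_synthetic_request):
    """Top of the call chain: decide tenancy once and put it in baggage."""
    baggage = {"tenancy": "test" if is_synthetic_request else "production"}
    return service_layer(baggage)


def service_layer(baggage):
    """Intermediate service: does not inspect baggage, only passes it on."""
    return storage_layer(baggage)


def storage_layer(baggage):
    """Bottom of the chain: choose the database from the propagated tenancy."""
    # Assumption of this sketch: missing tenancy is treated as production.
    tenancy = baggage.get("tenancy", "production")
    return DATABASES[tenancy]
```

With this shape, `edge_handler(True)` resolves to `"rides_test"` and `edge_handler(False)` to `"rides_prod"`, without the intermediate layer knowing anything about tenancy.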
  24. Sticks and Carrots
    • Get other teams to build features on top
      – Performance team
      – Capacity & cost accounting
      – Baggage
    • More carrots
    • Eventually they become sticks (peer pressure)
  25. Each Organization is Different. Find what works best

  26. How to Measure Adoption? Measure everything

  27. Does Service X Report Traces?
    • Daily aggregation job
    • Auto-book tickets
    • Build a dashboard
    • Pass/Fail: too easy to pass
  28. Trace Quality Score
    • Inspect traces
      – See a caller, but no spans
    • Join with other data
      – Routing logs
    • Auto-book tickets (carefully, not for everyone)
      – With detailed report
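
The "see a caller, but no spans" check might look roughly like the sketch below. It is a guess at the general idea, not UBER's actual scoring job: it assumes each span record carries a `service` field and an optional `caller` field naming the upstream service, and it scores each service by the fraction of traces mentioning it in which it reported at least one span.

```python
from collections import defaultdict


def missing_callers(trace):
    """Return services named as callers in a trace that emitted no spans in it."""
    reporting = {span["service"] for span in trace}
    callers = {span["caller"] for span in trace if span.get("caller")}
    return callers - reporting


def quality_scores(traces):
    """Per service: fraction of traces mentioning it in which it reported spans."""
    seen = defaultdict(int)
    passed = defaultdict(int)
    for trace in traces:
        reporting = {span["service"] for span in trace}
        missing = missing_callers(trace)
        for service in reporting | missing:
            seen[service] += 1
            if service not in missing:
                passed[service] += 1
    return {service: passed[service] / seen[service] for service in seen}


example_traces = [
    [{"service": "rides", "caller": "edge"}, {"service": "edge", "caller": None}],
    [{"service": "rides", "caller": "edge"}],
]
```

For `example_traces`, `rides` scores 1.0 and `edge` scores 0.5, because in the second trace `edge` is named as a caller but reported nothing. A graded score like this is harder to pass trivially than a yes/no "reports traces" check.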
  29. Trace Quality Metrics by Service

  30. Thank You
    • Jaeger
      – https://github.com/uber/jaeger
      – Blog: Evolving Distributed Tracing at UBER
      – Blog: Take OpenTracing for a HotROD Ride
    • OpenTracing: http://opentracing.io/
    • We are hiring
    • @yurishkuro