Distributed Tracing at UBER Scale

Distributed Tracing at UBER Scale
Creating a treasure map for your monitoring data
Yuri Shkuro, UBER Technologies

ABOUT ME
•  Software Engineer on the Observability team in NYC
•  Working on the open source distributed tracing system Jaeger
•  Co-founded the OpenTracing project
•  Banking industry survivor
•  GitHub: yurishkuro
•  Twitter: @yurishkuro

Would You Like Some Tracing with Your Monitoring?
What does it take to roll it out?

Why Distributed Tracing
•  Distributed transaction monitoring
•  Performance / latency optimization
•  Root cause analysis
•  Service dependency analysis
•  Distributed context propagation (“baggage”)

JAEGER, Distributed Tracing
•  Open Source
•  OpenTracing inside
•  In active development
•  PRs are welcome
•  Zipkin compatible
•  github.com/uber/jaeger

Who Thinks Tracing is Awesome?

Why Doesn’t Everyone Do Tracing?

Tracing Instrumentation is HARD EXPENSIVE BORING

Instrumentation
•  Metrics and logging are not new
•  Tracing is both new and harder

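Context Propagation
[Diagram: a request enters at the edge service, which assigns a unique ID that becomes the {context}; the {context} is then passed along each call between services A, B, C, D, and E.]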

Context Propagation
[Diagram: inside an application / microservice, instrumentation on the inbound HTTP request reads the headers (including the trace ID) into a Context [Span] for the handler; instrumentation on the client writes the Context [Span] into the headers (including the trace ID) of the outbound HTTP request.]
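The inbound and outbound halves of this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Jaeger's implementation: the header name `x-trace-id` and the functions `extract_context`, `inject_context`, and `handle_request` are hypothetical names chosen for the example.

```python
import uuid

# Hypothetical header name for this sketch; not Jaeger's actual wire format.
TRACE_HEADER = "x-trace-id"


def extract_context(headers):
    """Read the trace ID from inbound headers, or start a new trace at the edge."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {"trace_id": trace_id}


def inject_context(context, headers):
    """Copy the trace ID into outbound headers so the next service can continue the trace."""
    outbound = dict(headers)
    outbound[TRACE_HEADER] = context["trace_id"]
    return outbound


def handle_request(inbound_headers):
    """Handler side: extract the context, then build headers for an outbound call."""
    context = extract_context(inbound_headers)
    return inject_context(context, {"accept": "application/json"})
```

A service in the middle of the call chain forwards the ID it received, while a request with no trace header is treated as arriving at the edge and gets a fresh ID.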

In-Process Context Propagation
•  Implicit, via thread-locals, but: thread pools, futures
•  Explicit
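The thread-pool caveat is easy to reproduce. The sketch below, using only the Python standard library, stores a trace ID in a thread-local on the request thread and then submits work to a pool; the function names are made up for the illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


def current_trace_id():
    """Implicit lookup: the trace ID stored on the calling thread, if any."""
    return getattr(_local, "trace_id", None)


def implicit_task():
    # Runs on a pool thread, which never had the trace ID set.
    return current_trace_id()


def explicit_task(trace_id):
    # The context arrives as an argument, so the executing thread does not matter.
    return trace_id


def run_both(trace_id):
    """Set the thread-local on the request thread, then run both styles in a pool."""
    _local.trace_id = trace_id
    with ThreadPoolExecutor(max_workers=1) as pool:
        implicit = pool.submit(implicit_task).result()
        explicit = pool.submit(explicit_task, trace_id).result()
    return implicit, explicit
```

Here the implicit lookup comes back empty on the worker thread, while the explicitly passed value survives the hop. That is the trade-off on the slide: implicit propagation needs no code changes until work crosses threads, and explicit propagation always works but touches every function signature.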

It’s Also the Frameworks
•  Go: stdlib, gorilla, …
•  Java: jaxrs2, okhttp, ApacheHttpClient, …
•  Python: Flask, Django, Tornado, urllib2, …
•  Node.js – who knows…

OpenTracing to the Rescue

No Help With In-Process Propagation
•  Must be done manually
•  UBER has 2000-3000 microservices
•  Resources of the tracing team are limited
•  Developers must instrument their code!

BITE MAKE ME! How do we mobilize the org?

Traveling Salesman Problem
2017 edition

They Must Want Your Product or Sticks and Carrots

Recap: Why Distributed Tracing
•  Distributed transaction monitoring
•  Performance / latency optimization
•  Root cause analysis
•  Service dependency analysis
•  Distributed context propagation (“baggage”)

Service Dependency Analysis
•  Explain to us what we just built
•  Who are my dependencies
•  Workflow analysis
•  Where is all this traffic coming from?
•  Service tiers
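One way to answer "who are my dependencies" is to walk the parent/child links between spans. The sketch below assumes a simplified span shape, a dict with `service`, `span_id`, and `parent_id` keys, which is invented for this example and is not Jaeger's data model.

```python
from collections import defaultdict


def service_dependencies(spans):
    """Return {caller_service: set_of_callee_services} from parent/child span links."""
    by_id = {span["span_id"]: span for span in spans}
    deps = defaultdict(set)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        # Only cross-service edges count as dependencies.
        if parent and parent["service"] != span["service"]:
            deps[parent["service"]].add(span["service"])
    return dict(deps)
```

Reversing the edges of the same map gives the answer to "where is all this traffic coming from?" for any one service.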

Baggage
•  Tenancy, test or production
   – Set at the top
   – Used at the storage layer, prod or test DB
•  Authentication tokens
   – Signed user or service identity
   – Checked at multiple levels
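The tenancy use case can be sketched as follows. All names here (`start_request`, `call_downstream`, `storage_layer`, the `tenancy` key, the database names) are hypothetical, and falling back to production when the item is missing is an assumption of this sketch rather than something the slides state.

```python
def start_request(is_test):
    """Edge service: set the tenancy baggage item once, at the top of the call chain."""
    tenancy = "test" if is_test else "production"
    return {"trace_id": "t1", "baggage": {"tenancy": tenancy}}


def call_downstream(context):
    """Intermediate services forward the context without reading the baggage."""
    return storage_layer(context)


def storage_layer(context):
    """Storage layer: choose the database from baggage (assumed default: production)."""
    tenancy = context["baggage"].get("tenancy", "production")
    return "test_db" if tenancy == "test" else "prod_db"
```

The services in the middle never inspect the tenancy value, which is the appeal of baggage: the decision made at the top reaches the bottom without changing any API in between.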

Sticks and Carrots
•  Get other teams to build features on top
   – Performance team
   – Capacity & cost accounting
   – Baggage
•  More carrots
•  Eventually they become sticks (peer pressure)

Each Organization is Different
Find what works best

How to Measure Adoption?
Measure everything

Does Service X Report Traces?
•  Daily aggregation job
•  Auto-book tickets
•  Build a dashboard
•  Pass/Fail: too easy to pass
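A pass/fail check of this kind might look like the sketch below. The inputs are assumptions for the example: `known_services` stands in for a service registry, and `spans_last_day` for one day of spans, each a dict with a `service` key.

```python
def adoption_report(known_services, spans_last_day):
    """Return {service: True/False}: did the service report any span in the window?"""
    reporting = {span["service"] for span in spans_last_day}
    return {service: service in reporting for service in sorted(known_services)}
```

A single span in the window is enough to pass, which illustrates why the slide calls this check too easy.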

Trace Quality Score
•  Inspect traces
   – See a caller, but no spans
•  Join with other data
   – Routing logs
•  Auto-book tickets (carefully, not for everyone)
   – With detailed report
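The "see a caller, but no spans" check can be turned into a per-service score. The sketch below is one possible reading, not the scoring used at UBER: it assumes each span is a dict with a `service` key and an optional, hypothetical `caller` key naming the service that made the call, and it scores a caller by how often it is named versus how often it actually emitted spans in the same trace.

```python
from collections import defaultdict


def quality_scores(trace_spans):
    """Score each named caller as passes / (passes + fails) within one trace.

    Pass: a span names a caller, and that caller emitted at least one span.
    Fail: a span names a caller that emitted no spans at all.
    """
    reporting = {span["service"] for span in trace_spans}
    passes, fails = defaultdict(int), defaultdict(int)
    for span in trace_spans:
        caller = span.get("caller")  # hypothetical tag naming the calling service
        if caller is None:
            continue
        if caller in reporting:
            passes[caller] += 1
        else:
            fails[caller] += 1
    services = set(passes) | set(fails)
    return {svc: passes[svc] / (passes[svc] + fails[svc]) for svc in services}
```

Unlike the pass/fail report, a fractional score separates services that are partly instrumented from those that are not instrumented at all, and the failing spans can feed the detailed report attached to a ticket.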

Trace Quality Metrics by Service

Thank You
•  Jaeger
   – https://github.com/uber/jaeger
   – Blog: Evolving Distributed Tracing at UBER
   – Blog: Take OpenTracing for a HotROD Ride
•  OpenTracing: http://opentracing.io/
•  We are hiring
•  @yurishkuro
