Slide 1

Slide 1 text

Distributed Tracing at UBER Scale Creating a treasure map for your monitoring data Yuri Shkuro, UBER Technologies

Slide 2

Slide 2 text

ABOUT ME •  Software Engineer on the Observability team in NYC •  Working on the open source distributed tracing system Jaeger •  Co-founded the OpenTracing project •  Banking industry survivor •  Github: yurishkuro •  Twitter: @yurishkuro

Slide 3

Slide 3 text

Would You Like Some Tracing with Your Monitoring? What does it take to roll it out?

Slide 4

Slide 4 text

Why Distributed Tracing •  Distributed transaction monitoring •  Performance / latency optimization •  Root cause analysis •  Service dependency analysis •  Distributed context propagation (“baggage”)

Slide 5

Slide 5 text

JAEGER, Distributed Tracing •  Open Source •  OpenTracing inside •  In active development •  PRs are welcome •  Zipkin compatible •  github.com/uber/jaeger

Slide 6

Slide 6 text

Who Thinks Tracing is Awesome?

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Why Doesn’t Everyone Do Tracing?

Slide 10

Slide 10 text

Tracing Instrumentation is HARD, EXPENSIVE, BORING

Slide 11

Slide 11 text

Instrumentation •  Metrics and logging are not new •  Tracing is both new and harder

Slide 12

Slide 12 text

Context Propagation [Diagram: the edge service assigns a unique ID and stores it in a {context}; the {context} is passed along on each call between services A, B, C, D, and E]

Slide 13

Slide 13 text

Context Propagation [Diagram: APPLICATION / MICROSERVICE. Inbound HTTP Request (Headers: ... Trace ID ...) → Instrumentation → Handler, Context [Span] → Client, Context [Span] → Instrumentation → Outbound HTTP Request (Headers: ... Trace ID ...)]
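
The inbound/outbound flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Jaeger or OpenTracing API: the header name `x-trace-id` and the helpers `extract_context`, `inject_context`, and `handle_request` are hypothetical names made up for the example.

```python
import uuid

# Hypothetical header name, chosen for illustration only.
TRACE_HEADER = "x-trace-id"


def extract_context(inbound_headers):
    """Read the trace ID from inbound headers, or start a new trace at the edge."""
    trace_id = inbound_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {"trace_id": trace_id}


def inject_context(context, outbound_headers):
    """Return a copy of the outbound headers with the trace ID added."""
    headers = dict(outbound_headers)
    headers[TRACE_HEADER] = context["trace_id"]
    return headers


def handle_request(inbound_headers):
    """Handler: extract the context on the way in, inject it on the way out."""
    context = extract_context(inbound_headers)
    # A real handler would do its work here and call downstream services.
    return inject_context(context, {"accept": "application/json"})
```

A request that arrives with a trace ID leaves with the same one; a request that arrives without one (the edge-service case) gets a freshly generated ID.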

Slide 14

Slide 14 text

In-Process Context Propagation •  Implicit, via thread-locals (but: thread pools, futures) •  Explicit
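
A small Python sketch of why thread pools break implicit propagation. The names `_local`, `_trace_id`, and `demo` are hypothetical, and the use of `contextvars` as a repaired form of implicit propagation is an addition for illustration, not something the slide mentions.

```python
import contextvars
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # implicit storage, visible only to the thread that set it
_trace_id = contextvars.ContextVar("trace_id", default=None)


def read_thread_local():
    # Runs on a pool worker, which never had trace_id set.
    return getattr(_local, "trace_id", None)


def read_explicit(trace_id):
    # Explicit propagation: the value travels as an argument.
    return trace_id


def demo():
    _local.trace_id = "abc123"
    token = _trace_id.set("abc123")
    try:
        with ThreadPoolExecutor(max_workers=1) as pool:
            implicit = pool.submit(read_thread_local).result()
            explicit = pool.submit(read_explicit, "abc123").result()
            # Copy the caller's context and run the lookup inside that copy.
            snapshot = contextvars.copy_context()
            copied = pool.submit(snapshot.run, _trace_id.get).result()
    finally:
        _trace_id.reset(token)
    return implicit, explicit, copied
```

Here `demo()` returns `(None, "abc123", "abc123")`: the thread-local value is lost on the worker thread, while the explicit argument and the copied context both carry the trace ID across.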

Slide 15

Slide 15 text

It’s Also the Frameworks •  Go: stdlib, gorilla, … •  Java: jaxrs2, okhttp, ApacheHttpClient, … •  Python: Flask, Django, Tornado, urllib2, … •  Node.js – who knows…

Slide 16

Slide 16 text

OpenTracing to the Rescue

Slide 17

Slide 17 text

No Help With In-Process Propagation •  Must be done manually •  UBER has 2000-3000 microservices •  Resources of the tracing team are limited •  Developers must instrument their code!

Slide 18

Slide 18 text

BITE MAKE ME! How do we mobilize the org?

Slide 19

Slide 19 text

Traveling Salesman Problem 2017 edition

Slide 20

Slide 20 text

They Must Want Your Product or Sticks and Carrots

Slide 21

Slide 21 text

Recap: Why Distributed Tracing •  Distributed transaction monitoring •  Performance / latency optimization •  Root cause analysis •  Service dependency analysis •  Distributed context propagation (“baggage”)

Slide 22

Slide 22 text

Service Dependency Analysis •  Explain to us what we just built •  Who are my dependencies? •  Workflow analysis •  Where is all this traffic coming from? •  Service tiers

Slide 23

Slide 23 text

Baggage •  Tenancy, test or production – Set at the top – Used at the storage layer, prod or test DB •  Authentication tokens – Signed user or service identity – Checked at multiple levels
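
The tenancy use case can be sketched as follows. Everything here is assumed for illustration: the context layout, the baggage key `tenancy`, the database names, and the choice to default to production when the key is missing are all hypothetical.

```python
def start_request(is_test):
    """Edge service: set the tenancy baggage item once, at the top."""
    tenancy = "test" if is_test else "production"
    return {"trace_id": "abc123", "baggage": {"tenancy": tenancy}}


def call_downstream(context):
    """Intermediate services pass the context along without reading the baggage."""
    return choose_database(context)


def choose_database(context):
    """Storage layer: pick the prod or test DB from the baggage item."""
    # Assumption: a missing tenancy item is treated as production traffic.
    tenancy = context["baggage"].get("tenancy", "production")
    return "orders_test" if tenancy == "test" else "orders_prod"
```

The point of the sketch is that only the top and the bottom of the call chain know about tenancy; the services in between just forward the context.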

Slide 24

Slide 24 text

Sticks and Carrots •  Get other teams to build features on top – Performance team – Capacity & cost accounting – Baggage •  More carrots •  Eventually they become sticks (peer pressure)

Slide 25

Slide 25 text

Each Organization is Different Find what works best

Slide 26

Slide 26 text

How to Measure Adoption? Measure everything

Slide 27

Slide 27 text

Does Service X Report Traces? •  Daily aggregation job •  Auto-book tickets •  Build a dashboard •  Pass/Fail: too easy to pass
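
A daily pass/fail job of this kind might look like the sketch below. The span format (a dict with a `service` key) and the function name `daily_report` are assumptions for the example; the talk does not describe how the actual job is implemented.

```python
from collections import Counter


def daily_report(known_services, spans):
    """Return {service: (span_count, passed)} for one day of spans."""
    counts = Counter(span["service"] for span in spans)
    # Pass/fail: a service passes if it reported at least one span that day.
    return {svc: (counts[svc], counts[svc] > 0) for svc in sorted(known_services)}
```

This also shows why the check is too easy to pass: a single span from one endpoint is enough, no matter how much of the service is left uninstrumented.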

Slide 28

Slide 28 text

Trace Quality Score •  Inspect traces – See a caller, but no spans •  Join with other data – Routing logs •  Auto-book tickets (carefully, not for everyone) – With detailed report
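
One way to turn the "see a caller, but no spans" check into a score is sketched below. The data model is assumed: each trace is a list of span dicts, and a client span names the service it called in a hypothetical `peer` field. The scoring formula is an illustration, not the metric used at UBER.

```python
from collections import defaultdict


def quality_scores(traces):
    """Return {callee: fraction of observed calls for which the callee emitted spans}."""
    seen = defaultdict(int)     # calls to the callee observed from the caller side
    covered = defaultdict(int)  # calls where the callee also reported spans
    for spans in traces:
        reporting = {span["service"] for span in spans}
        for span in spans:
            callee = span.get("peer")
            if callee is None:
                continue
            seen[callee] += 1
            if callee in reporting:
                covered[callee] += 1
    return {callee: covered[callee] / seen[callee] for callee in seen}
```

A service with a low score is being called within traces but rarely reports spans of its own, which makes it a candidate for a ticket with a detailed report.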

Slide 29

Slide 29 text

Trace Quality Metrics by Service

Slide 30

Slide 30 text

Thank You •  Jaeger –  https://github.com/uber/jaeger –  Blog: Evolving Distributed Tracing at UBER –  Blog: Take OpenTracing for a HotROD Ride •  OpenTracing: http://opentracing.io/ •  We are hiring •  @yurishkuro