Slide 1

Google Dapper: A Large-Scale Distributed Systems Tracing Infrastructure

Slide 2

About Us • Ryan Kennedy @rckenned - runs infrastructure for @yammer. • Anjali Shenoy @anjshenoy - infrastructure engineer @yammer.

Slide 3

Who has read this paper?

Slide 4

This is an experience paper. Not a research paper.

Slide 5

What is Dapper?

Slide 6

“Distributed Systems”

Slide 7

“Tracing Infrastructure”

Slide 8

“Large-Scale”

Slide 9

“A Large-Scale Distributed Systems Tracing Infrastructure”

Slide 10

“We built Dapper to provide Google’s developers with more information about the behavior of complex distributed systems.”

Slide 11

Distributed systems explained

Slide 12

No content

Slide 13

Benefits of a distributed system… • A collection of software services • Developed by different teams • Across different platforms • Using different programming languages

Slide 14

Downsides of a distributed system… • A collection of software services • Developed by different teams • Across different platforms • Using different programming languages

Slide 15

Engineering Context • Multiple services • Each service owned by a separate team • Continuous deployment of services

Slide 16

On-call troubles • Investigate the overall health of the system • Guess which service is at fault • Figure out why that service is at fault

Slide 17

Dapper: a large-scale distributed systems tracing infrastructure

Slide 18

Dapper’s Problem Space • End user: the on-call engineer • Bird’s-eye view into overall system health • Ability to drill down into a service and see why it’s holding up the train • Long-term pattern recognition

Slide 19

Design goals • Ubiquity • Low overhead • Application-level transparency • Scalability

Slide 20

Low overhead. But… how? • Sampling: 0.01% of requests for high-throughput systems. • Adaptive sampling was being deployed when the paper was published. • Out-of-band trace data collection
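The sampling bullet above can be sketched as head-based sampling: the coin flip happens once at the trace root, and every downstream service inherits that decision so a trace is either captured whole or not at all. The names below are illustrative, not Dapper's actual API.

```python
import random

# 0.01% of requests, per the talk; Dapper's production rate for
# high-throughput services.
SAMPLE_RATE = 0.0001

def should_sample(parent_sampled=None):
    """Inherit the parent's decision; only the trace root flips the coin."""
    if parent_sampled is not None:
        return parent_sampled
    return random.random() < SAMPLE_RATE
```

Because the decision is propagated rather than re-made per hop, every span of a sampled trace is recorded, which is what makes whole-trace reconstruction possible downstream.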

Slide 21

What is a trace? • A Dapper trace is a tree of spans, one for every RPC • Each span has its own set of annotations and records its parent span’s id • Annotation: application-specific data you want to send along with the trace
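The trace model above can be sketched in a few lines: spans arrive as a flat collection, and the parent span id on each one is enough to rebuild the tree. Class and function names here are hypothetical, chosen only to mirror the slide's vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span of the trace
    name: str
    annotations: List[dict] = field(default_factory=list)

def build_tree(spans):
    """Group spans by parent id so the RPC tree can be walked top-down."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)
    return children
```

The root is found under the `None` key, and each span's children under its own id, which is exactly the tree-of-spans shape the slide describes.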

Slide 22

No content

Slide 23

What about annotations? • An annotation is application-specific information you pass along with your span. • A span can have zero to many annotations. • Each annotation has a timestamp and either a textual value or key-value pairs.
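The two annotation shapes described above (timestamp plus free text, or timestamp plus key-value pairs) can be sketched as plain dicts; the helper names are made up for illustration.

```python
import time

def text_annotation(message):
    # Textual-value form: a timestamp plus a free-text message.
    return {"logged_at": time.time_ns(), "value": message}

def kv_annotation(**pairs):
    # Key-value form: a timestamp plus arbitrary application data.
    return {"logged_at": time.time_ns(), **pairs}
```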

Slide 24

Let’s talk about instrumentation

Slide 25

{
  "trace_id": "7021185255097625687",
  "spans": [{
    "span_id": "2186499883",
    "parent_span_id": "",
    "name": "groups.create",
    "start_time": 1395364621946662144,
    "duration": 1359471104,
    "annotations": [{
      "path_to_sql": {
        "sql": "INSERT into messages (…)",
        "start_node": "app/controllers/a_controller.rb:create",
        "path": "app/models/b.rb:create_message, app/models/c.rb:create_group_message"
      },
      "logged_at": 1395364623306275840
    }]
  }]
}
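A payload shaped like the example above is ordinary JSON, so pulling a quick latency overview out of it is a one-liner; the small inline trace below uses made-up values, and `span_durations` is a hypothetical helper, not part of any real tracing API.

```python
import json

# A miniature trace in the same shape as the slide's example (values invented).
payload = json.loads("""
{"trace_id": "1",
 "spans": [{"span_id": "a", "parent_span_id": "",
            "name": "groups.create",
            "start_time": 100, "duration": 42,
            "annotations": []}]}
""")

def span_durations(trace):
    """Map each span's name to its duration for a quick latency overview."""
    return {s["name"]: s["duration"] for s in trace["spans"]}
```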

Slide 26

Ok. Now what? • How do we effectively coalesce data in downstream systems? • Data for immediate perusal • Data for long-term pattern recognition

Slide 27

No content

Slide 28

Bigtable

Slide 29

Moar sampling… • Google’s prod clusters generate >1 TB of data/day. • Dapper end users want to query trace data ~2 weeks old

Slide 30

Dapper API • By Trace Id: load on demand • Bulk Access: leverage MapReduce jobs to provide access to billions of traces in parallel. • Indexed access: composite index => lookup by service name, host name, timestamp.
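The three access patterns listed above can be sketched as one tiny in-memory repository; the real Dapper API sits on top of Bigtable and MapReduce, so this class and its method names are purely illustrative stand-ins.

```python
class TraceRepo:
    def __init__(self):
        self._by_id = {}   # trace_id -> trace
        self._index = {}   # (service, host) -> [trace_ids], insertion-ordered

    def put(self, trace_id, service, host, trace):
        self._by_id[trace_id] = trace
        self._index.setdefault((service, host), []).append(trace_id)

    def by_trace_id(self, trace_id):
        """On-demand lookup by trace id."""
        return self._by_id.get(trace_id)

    def bulk(self):
        """Stand-in for the MapReduce bulk-access path: scan everything."""
        return iter(self._by_id.values())

    def indexed(self, service, host):
        """Composite-index lookup by service and host."""
        return [self._by_id[t] for t in self._index.get((service, host), [])]
```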

Slide 31

Dapper API Usage • Online web applications • Command line - on demand • One-off analytical tools

Slide 32

No content

Slide 33

No content

Slide 34

Dapper in Development • Performance • Correctness • Understanding • Testing

Slide 35

Exception monitoring • The exception monitoring service uses trace/span ids as metadata • UI reports link back to the corresponding trace
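The linkage described above amounts to stamping each exception report with the current trace and span ids so the monitoring UI can deep-link into the trace view; the function below is a hypothetical sketch of that shape, not Google's actual service.

```python
def report_exception(exc, trace_id, span_id):
    """Attach tracing ids as metadata to an exception report."""
    return {
        "error": type(exc).__name__,
        "message": str(exc),
        "trace_id": trace_id,  # lets the UI link back to the whole trace
        "span_id": span_id,    # pinpoints the RPC where the error occurred
    }
```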

Slide 36

Other uses… • Long tail latency • Inferring service dependencies • Layered and shared storage systems

Slide 37

Blind spots • Coalescing effects • Finding a root cause within a service • Tying kernel events to a trace

Slide 38

In conclusion • Best suited to dev/ops teams. • Practical: negligible performance impact • Keep the trace repo API open

Slide 39

Thank you, authors • Benjamin H. Sigelman • Luiz André Barroso • Mike Burrows • Pat Stephenson • Manoj Plakal • Donald Beaver • Saul Jaspan • Chandan Shanbhag

Slide 40

Thanks! Questions?