
PWL-SF#1 => Ryan Kennedy on Dapper, a Distributed Systems Tracing Infrastructure

Ryan Kennedy and Anjali Shenoy from Yammer Engineering kicked off our group by presenting the paper Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, by Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.

Papers_We_Love

March 26, 2014

Transcript

1. About Us
   • Ryan Kennedy @rckenned - runs infrastructure for @yammer.
   • Anjali Shenoy @anjshenoy - infrastructure engineer @yammer.
2. “We built Dapper to provide Google’s developers with more information about the behavior of complex distributed systems.”
3. Benefits of a distributed system…
   • A collection of software services
   • Developed by different teams
   • Across different platforms
   • Using different programming languages
4. Downsides of a distributed system…
   • A collection of software services
   • Developed by different teams
   • Across different platforms
   • Using different programming languages
5. Engineering Context
   • Multiple services
   • Each service manned by a separate team
   • Continuous deployment of services
6. On-call troubles
   • Investigate overall health of the system
   • Guess which service is at fault
   • Guess why that service is at fault
7. Dapper’s Problem Space
   • End user - the on-call engineer
   • Bird’s-eye view into overall system health
   • Ability to drill down into a service and see why it’s holding up the train
   • Long-term pattern recognition
8. Low overhead. But… how?
   • Sampling - 0.01% of requests for high-throughput systems (see the sketch below)
   • Adaptive sampling was being deployed at publish time
   • Out-of-band trace data collection
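A minimal sketch of that head-based sampling decision in Python (the header name, constant, and function names here are illustrative assumptions, not Dapper's actual implementation):

    import random

    SAMPLE_RATE = 0.0001  # 0.01% of requests, per the slide above

    def should_sample() -> bool:
        """Decide once, at the root of a request, whether to trace it."""
        return random.random() < SAMPLE_RATE

    def propagate(headers: dict) -> dict:
        """Carry the decision downstream so every service in the call
        tree makes the same choice (hypothetical header name)."""
        if "x-trace-sampled" not in headers:
            headers["x-trace-sampled"] = "1" if should_sample() else "0"
        return headers

Because the decision is made once per request and trace data is written out of band, untraced requests pay almost nothing.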
9. What is a trace?
   • A Dapper trace is a tree of spans - one for every RPC
   • Each span records its parent span and carries its own set of annotations
   • Annotation - application-specific data you want to send along with the trace
10. What about annotations?
   • An annotation is some application-specific information you pass along with your span
   • A span can have zero to many annotations
   • Each annotation has a timestamp and either a textual value or key-value pairs (see the sketch and the JSON example below)
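A minimal sketch of these structures in Python (class and field names are illustrative assumptions; they mirror the JSON on the next slide rather than Dapper's actual record format):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Annotation:
        timestamp: int                 # when the annotation was logged
        value: Optional[str] = None    # textual annotation, or...
        kv: Optional[dict] = None      # ...key-value pairs

    @dataclass
    class Span:
        trace_id: str        # shared by every span in one trace
        span_id: str         # unique within the trace
        parent_span_id: str  # empty for the root span; links spans into a tree
        name: str            # e.g. the RPC name, "groups.create"
        start_time: int
        duration: int
        annotations: list = field(default_factory=list)  # zero to many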
11. {
      "trace_id": "7021185255097625687",
      "spans": [{
        "span_id": "2186499883",
        "parent_span_id": "",
        "name": "groups.create",
        "start_time": 1395364621946662144,
        "duration": 1359471104,
        "annotations": [{
          "path_to_sql": {
            "sql": "INSERT into messages (…)",
            "start_node": "app/controllers/a_controller.rb:create",
            "path": "app/models/b.rb:create_message, app/models/c.rb:create_group_message"
          },
          "logged_at": 1395364623306275840
        }]
      }]
    }
12. Ok. Now what?
   • How to effectively coalesce data in downstream systems? (one way is sketched below)
   • Data for immediate perusal
   • Data for long-term pattern recognition
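One piece of that coalescing step: spans arrive out of band as flat records, so a downstream reader has to reassemble them into per-trace trees. A minimal sketch in Python (assuming each collected span dict carries its trace_id alongside the fields shown on the previous slide):

    from collections import defaultdict

    def assemble_traces(spans: list) -> dict:
        """Group flat span records into one tree per trace_id."""
        children = defaultdict(list)
        roots = {}
        for span in spans:
            if span["parent_span_id"] == "":
                roots[span["trace_id"]] = span  # empty parent marks the root
            else:
                children[span["parent_span_id"]].append(span)

        def attach(span):
            # Recursively hang each span under its parent.
            span["children"] = [attach(c) for c in children[span["span_id"]]]
            return span

        return {tid: attach(root) for tid, root in roots.items()}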
13. Moar sampling…
   • Google’s prod clusters generate >1TB of trace data/day
   • Dapper end users want to query trace data ~2 weeks old
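Rough back-of-envelope: at >1TB/day, a two-week query window means keeping on the order of 14TB of trace data online, which is why a second round of sampling happens at collection time on top of the per-request sampling.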
14. Dapper API
   • By Trace Id: load on demand
   • Bulk Access: leverage MapReduce jobs to provide access to billions of traces in parallel
   • Indexed Access: composite index => lookup by service name, host name, timestamp (interface sketched below)
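A rough sketch of those three access patterns as a Python interface (class and method names are hypothetical; the paper describes the behavior, not this API):

    from typing import Iterator

    class TraceRepository:
        """Hypothetical interface mirroring the slide's three access patterns."""

        def get_trace(self, trace_id: str) -> dict:
            """By trace id: load a single trace on demand."""
            raise NotImplementedError

        def scan_all(self) -> Iterator[dict]:
            """Bulk access: stream the whole corpus, e.g. into a MapReduce job."""
            raise NotImplementedError

        def lookup(self, service: str, host: str,
                   start_ts: int, end_ts: int) -> Iterator[dict]:
            """Indexed access: composite (service, host, timestamp) lookup."""
            raise NotImplementedError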
15. Dapper API Usage
   • Online web applications
   • Command line - on demand
   • One-off analytical tools
16. Blind spots
   • Coalescing effects
   • Finding a root cause within a service
   • Tying kernel events to a trace
17. In conclusion
   • Best use case: dev/ops teams
   • Practical - negligible performance impact
   • Keep the trace repo API open
18. Thank you, authors
   • Benjamin H. Sigelman
   • Luiz André Barroso
   • Mike Burrows
   • Pat Stephenson
   • Manoj Plakal
   • Donald Beaver
   • Saul Jaspan
   • Chandan Shanbhag