Set your sites on tracing

An overview of distributed tracing practice
@adrianfcole works at Pivotal, works on Zipkin

Introduction
Agenda: introduction, a typical Zipkin site, wrapping up
@adrianfcole #zipkin

@adrianfcole
• Spring Cloud at Pivotal
• focused on distributed tracing
• helped open Zipkin

What is Distributed Tracing?
Distributed tracing tracks production requests as they touch different parts of your architecture. Requests have a unique trace ID, which you can use to look up a trace diagram or the log entries related to it. Causal diagrams are easier to understand than scrolling through logs.

Example Trace Diagram
[Figure: trace of POST /things, with spans labelled Wire Send, Store and Async Store]

Why do I care?
- Reduce time in triage by contextualizing errors and delays
- Visualize latency, like time in my service vs waiting for other services
- Understand complex applications, like async code or microservices
- See your architecture with live dependency diagrams built from traces

Distributed Tracing Vocabulary
A Span is an individual operation that took place. A span contains timestamped events and tags.
A Trace is an end-to-end latency graph, composed of spans.
Tracers record spans and pass the context required to connect them into a trace.
Instrumentation uses a tracer to record a task, such as an HTTP request, as a span.

A Span is an individual operation
wombats:10.2.3.47:8080
Operation: POST /things
Events: Server Received a Request, Server Sent a Response
Tags: request-id abcd-ffe, http.status_code 202, http.url …&features=HD-uploads
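To make the span anatomy concrete, here is a minimal sketch of the span above as a Python data class, serialized with the field names of Zipkin's v2 JSON format. The class, the IDs and the timing values are made up for illustration. Reading "wombats" as the service name is an assumption, and the elided http.url tag is left out rather than guessed. In the v2 format, the "server received" and "server sent" events are implied by kind SERVER together with the timestamp and duration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """Illustrative span model; to_zipkin_v2 uses Zipkin v2 JSON field names."""
    trace_id: str        # shared by every span in the trace
    span_id: str         # unique to this operation
    name: str            # the operation
    service_name: str    # who recorded the span
    ipv4: str
    port: int
    timestamp_us: int    # start time, microseconds since the epoch
    duration_us: int     # elapsed time, microseconds
    kind: Optional[str] = None        # e.g. "SERVER" for a received request
    parent_id: Optional[str] = None   # absent on the root span
    tags: Dict[str, str] = field(default_factory=dict)

    def to_zipkin_v2(self) -> dict:
        doc = {
            "traceId": self.trace_id,
            "id": self.span_id,
            "name": self.name,
            "timestamp": self.timestamp_us,
            "duration": self.duration_us,
            "localEndpoint": {
                "serviceName": self.service_name,
                "ipv4": self.ipv4,
                "port": self.port,
            },
            "tags": dict(self.tags),
        }
        if self.kind:
            doc["kind"] = self.kind
        if self.parent_id:
            doc["parentId"] = self.parent_id
        return doc


# Made-up IDs and timings; endpoint and tags are taken from the slide.
span = Span(
    trace_id="463ac35c9f6413ad",
    span_id="a2fb4a1d1a96d312",
    name="post /things",
    service_name="wombats",
    ipv4="10.2.3.47",
    port=8080,
    timestamp_us=1551609722455000,
    duration_us=12000,
    kind="SERVER",
    tags={"request-id": "abcd-ffe", "http.status_code": "202"},
)
doc = span.to_zipkin_v2()
```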

Tracing is capturing important events
[Figure: trace of POST /things, with spans labelled Wire Send, Store and Async Store]

Tracers record time, duration and host
[Figure: trace of POST /things, with spans labelled Wire Send, Store and Async Store]
Tracers don’t decide what to record, instrumentation does. We’ll get to that.

Tracers send trace data out of process
Tracers propagate IDs in-band, to tell the receiver there’s a trace in progress.
Completed spans are reported out-of-band, to reduce overhead and allow for batching.
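The two paths can be sketched as follows. The header names are Zipkin's B3 propagation headers. Everything else (function names, the BatchingReporter class, the batch size) is hypothetical and only illustrates the idea. Real reporters usually flush from a background thread on a timer, and some tracers reuse the caller's span ID on the server side instead of creating a child as this sketch does.

```python
import secrets
from typing import Callable, Dict, List, Optional


def new_id() -> str:
    """Return a random 64-bit ID as 16 lower-hex characters."""
    return secrets.token_hex(8)


def inject(trace_id: str, span_id: str, headers: Dict[str, str]) -> None:
    """In-band: copy the trace context into the outgoing request headers."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    headers["X-B3-Sampled"] = "1"


def extract(headers: Dict[str, str]) -> Optional[Dict[str, str]]:
    """In-band: on the receiver, detect whether a trace is in progress.

    Simplification: real HTTP header lookup is case-insensitive.
    """
    trace_id = headers.get("X-B3-TraceId")
    span_id = headers.get("X-B3-SpanId")
    if trace_id is None or span_id is None:
        return None  # no trace in progress
    # Continue the trace: same trace ID, the caller's span becomes the parent.
    return {"traceId": trace_id, "parentId": span_id, "id": new_id()}


class BatchingReporter:
    """Out-of-band: queue completed spans and hand them over in batches."""

    def __init__(self, send: Callable[[List[dict]], None], batch_size: int = 100):
        self._send = send
        self._batch_size = batch_size
        self._pending: List[dict] = []

    def report(self, span: dict) -> None:
        self._pending.append(span)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)


# Usage: one hop between a caller and a receiver, then two spans reported.
sent_batches: List[List[dict]] = []
reporter = BatchingReporter(sent_batches.append, batch_size=2)

headers: Dict[str, str] = {}
inject(new_id(), new_id(), headers)   # caller side
context = extract(headers)            # receiver side

reporter.report({"traceId": headers["X-B3-TraceId"], "id": headers["X-B3-SpanId"]})
reporter.report(context)              # second span fills the batch and triggers a send
```

The request path only carries a few small headers, while the span data itself leaves the process later and in bulk.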

How do I turn on tracing?
A tracer is a utility library, similar to metrics or logging libraries. It is the mechanism used to trace an operation.
Instrumentation is framework-specific code that uses a tracer to collect details such as the HTTP URL and request timing.
Instrumentation must be configured and pointed to a tracing system for tracing to work. This is often done automatically with agents or frameworks like Spring Boot.
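The split between tracer and instrumentation can be sketched like this. Both the Tracer class and the traced_handler wrapper are hypothetical, not the API of any Zipkin library: the tracer only times an operation and reports it, while the instrumentation knows what an HTTP request looks like and decides which details to record.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


class Tracer:
    """Hypothetical minimal tracer: times an operation and reports it as a span."""

    def __init__(self, service_name: str, report: Callable[[dict], None]):
        self._service_name = service_name
        self._report = report

    @contextmanager
    def span(self, name: str) -> Iterator[Dict[str, str]]:
        tags: Dict[str, str] = {}
        started_at = time.time()
        started = time.perf_counter()
        try:
            yield tags  # instrumentation adds tags while the operation runs
        except Exception as error:
            tags["error"] = str(error) or type(error).__name__
            raise
        finally:
            elapsed_us = int((time.perf_counter() - started) * 1_000_000)
            self._report({
                "name": name,
                "localEndpoint": {"serviceName": self._service_name},
                "timestamp": int(started_at * 1_000_000),
                "duration": max(1, elapsed_us),
                "tags": tags,
            })


def traced_handler(tracer: Tracer, handler: Callable[[str, str], int]) -> Callable[[str, str], int]:
    """Instrumentation: wraps a handler that takes (method, path) and returns a status."""
    def wrapper(method: str, path: str) -> int:
        with tracer.span(f"{method} {path}") as tags:
            tags["http.method"] = method
            tags["http.path"] = path
            status = handler(method, path)
            tags["http.status_code"] = str(status)
            return status
    return wrapper


# Usage: spans are collected in a list here; a real setup would point
# the tracer at a tracing system instead.
spans: List[dict] = []
tracer = Tracer("wombats", spans.append)
handle = traced_handler(tracer, lambda method, path: 202)
status = handle("POST", "/things")
```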

Zipkin is a distributed tracing system

Zipkin can be as simple as one file listening on one port
$ curl -sSL https://zipkin.io/quickstart.sh | bash -s
$ SELF_TRACING_ENABLED=true java -jar zipkin.jar
[Zipkin ASCII-art banner]
:: Powered by Spring Boot :: (v2.1.3.RELEASE)
2019-03-03 10:42:02.455 INFO 25496 --- [ main] z.s.ZipkinServer : Starting ZipkinServer on MacBook-Pro-7.local with PID 25496 (/Users/acole/oss/sleuth-webmvc-example/zipkin.jar started by acole in /Users/acole/oss/sleuth-webmvc-example)
[snip]
$ curl -s localhost:9411/api/v2/services|jq .
[
  "gateway"
]
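With a server listening on port 9411, a span can also be reported by hand. The sketch below only builds the HTTP request, so it can be read and checked without a running server. It assumes the default local address, and the helper name, service name and span values are made up. POST /api/v2/spans with a JSON list of spans is Zipkin's v2 collector endpoint.

```python
import json
import secrets
import time
import urllib.request
from typing import List


def build_report_request(spans: List[dict], base_url: str = "http://localhost:9411") -> urllib.request.Request:
    """Build (but do not send) a POST of v2 JSON spans to a Zipkin server."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v2/spans",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A made-up span, timestamped now so that it shows up in a recent-traces query.
example_span = {
    "traceId": secrets.token_hex(8),
    "id": secrets.token_hex(8),
    "name": "post /things",
    "kind": "SERVER",
    "timestamp": int(time.time() * 1_000_000),
    "duration": 12000,
    "localEndpoint": {"serviceName": "wombats"},
}
request = build_report_request([example_span])

# To actually send it, with the server from the quickstart running:
# urllib.request.urlopen(request)
```

After sending, the services query shown above should list "wombats" next to "gateway".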

Zipkin Architecture
[Figure: architecture diagram, with logos for Amazon, Docker, Google, Kubernetes and Cloud Foundry]
Tracers report spans over HTTP, Kafka or RabbitMQ.
Servers collect spans, storing them in MySQL, Cassandra, or Elasticsearch.
Users query for traces via Zipkin’s Web UI or API.

A typical Zipkin site
Agenda: introduction, a typical Zipkin site, wrapping up

What is a Zipkin site
Site owner: End user who champions Zipkin alongside other roles in their company. Many site owners are part time, yet contribute back to open source.
Zipkin site: Production deployment of distributed tracing, which considers Zipkin format, instrumentation or backends strategic to their observability function.

What information do we collect on Zipkin sites
* Introduction of the company context and team on tracing
* System overview, from application through to visualization/analysis
* Site-specific data conventions, such as how services are named
* Why tracing is important, goals and service level agreements
* Status like adoption, ingestion and costs incurred

Why bother with tracing?
Ascend Money says: Measure latency improvements before and after refactoring the services. Identify non-conformant service communications that deviates from the design.
Hotels.com says: helps in pointing out the worst offenders and by making it easier to identify performance improvements such as network calls that could be done in parallel.
Netflix says: The business value is in providing operational visibility into the systems and enhance developer productivity.

What kind of infrastructure is involved?
Effective tracing matches the architecture and skillset of the site owners. Rarely do sites choose the same application and tracing infrastructure.

So, a site doesn’t only run Zipkin server?
Zipkin Server is the canonical backend which receives Zipkin format, and presents a UI. Some don’t run Zipkin server, or also run other products for various reasons.
* SaaS preference
* APM integration
* Hybrid setup

And.. applications don’t always use Zipkin libraries?!
Zipkin curates propagation and trace formats, which decouples sites from a mandate of using our code. By producing the same data, applications have more flexibility and choice.
* 3rd party libraries
* Proxy (service mesh) integration
* In-house custom tools

Let’s look at a site that once used Zipkin server
Hotels.com started with a Zipkin backend, but are transitioning to Expedia Haystack, which provides more features like adaptive alerting. https://github.com/ExpediaDotCom/haystack
Applications still emit data in Zipkin v2 format, which is forwarded to Haystack with a tool they created called Pitchfork.
Developers still use Zipkin on their laptops for local troubleshooting, as it is easy to run.

Let’s look at a site that didn’t initially use Zipkin server
Netflix created a Dapper-based tracing system to trace RPC calls involved in video streaming. This included framework libraries to produce trace headers and data.
As Spring Boot became prevalent, Zipkin became more useful, as it is built into the tracing library Spring Cloud Sleuth.
Netflix convert legacy spans into Zipkin v2 format in their Kafka/Flink pipeline. This allows traces to stitch together for query and analysis.

Let’s look at a site that never used Zipkin server
Infostellar’s architecture runs in Google Cloud, except ground station software that runs locally at an antenna site.
Many components trace with Zipkin libraries, some with OpenTracing, some homegrown. All use Zipkin’s B3 format for propagation.
Even when using Zipkin libraries, data is sent directly to Google Stackdriver for query and analysis. There’s no Zipkin server footprint at Infostellar.

Let’s look at a site that uses stock Zipkin server
Medidata is an entirely AWS architecture, using the zipkin-aws image, which allows HTTP and SQS span collection. They collect 100% of data into AWS-managed Elasticsearch storage.
While the Zipkin service is standard, Medidata has a service that reads trace data from Elasticsearch, comparing it with performance objectives in APIs and issuing alerts when performance degrades.
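A check like the one Medidata describes could be sketched as below. This is not their service: the function, the objectives table and the sample spans are hypothetical. It assumes spans have already been read from storage as Zipkin v2 JSON, where duration is in microseconds.

```python
from typing import Dict, Iterable, List


def find_slow_spans(spans: Iterable[dict], objectives_ms: Dict[str, int]) -> List[str]:
    """Return one alert message per span that exceeded its objective."""
    alerts: List[str] = []
    for span in spans:
        name = span.get("name")
        limit_ms = objectives_ms.get(name)
        duration_us = span.get("duration")
        if limit_ms is None or duration_us is None:
            continue  # no objective for this operation, or span still incomplete
        if duration_us > limit_ms * 1000:
            alerts.append(
                f"{name} took {duration_us / 1000:.1f} ms, objective is {limit_ms} ms"
            )
    return alerts


# Hypothetical objectives: operation name -> maximum latency in milliseconds.
objectives = {"post /things": 50, "get /things": 20}

stored_spans = [
    {"name": "post /things", "duration": 12000},    # 12 ms, within objective
    {"name": "get /things", "duration": 45000},     # 45 ms, too slow
    {"name": "get /health", "duration": 900000},    # no objective defined
]
alerts = find_slow_spans(stored_spans, objectives)
```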

Besides architecture, what’s different across sites?
Data collection policy: Typeform always provision request IDs. Infostellar use antenna, satellite and plan tags for business context. LINE add company-specific tags like phase and instance ID. Expedia Haystack scrubs secrets.
Data retention policy: Medidata retain 100% for 100 days. Netflix sample 100% of FIT experiments, 0.1% otherwise. SoundCloud retain a very low sample rate for 7 days.
Tracing adoption rate: LINE is only one team’s services, Ascend is <50%, Tyro is over 90%.
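A sampling policy of the "100% of experiments, 0.1% otherwise" kind can be sketched as a single function. This is an illustration, not Netflix's implementation: how a request is recognised as part of an experiment is site-specific and is reduced here to a hypothetical boolean. Deriving the decision from the trace ID keeps it stable for a given trace. In practice the decision is usually made once, where the trace starts, and then propagated with the trace context.

```python
def should_sample(trace_id: str, is_experiment: bool, rate_per_10k: int = 10) -> bool:
    """Decide whether to record a trace.

    trace_id is lower-hex; rate_per_10k=10 keeps 10 in 10,000 traces, i.e. 0.1%.
    """
    if is_experiment:
        return True  # always keep traces that belong to an experiment
    # Use the low 64 bits so that 64-bit and 128-bit trace IDs behave alike.
    bucket = int(trace_id[-16:], 16) % 10_000
    return bucket < rate_per_10k


kept = should_sample("0000000000000005", is_experiment=False)      # bucket 5
dropped = should_sample("00000000000003e8", is_experiment=False)   # bucket 1000
forced = should_sample("00000000000003e8", is_experiment=True)
```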

How do sites get started with tracing
Proxy: starting traces at a proxy can raise visibility of upstream and downstream. Typeform initialise a trace and request ID in their custom proxy.
Single service: Hotels.com recognised that even though tracing is a team sport, starting with a single service can still add value.
New Framework: Sites like Ascend rolled out tracing in new applications, as it was supported out of the box with Spring Boot (via Spring Cloud Sleuth).
Green Field: Infostellar engineers had previous experience with tracing, and built their platform with tracing in mind.

Wrapping Up
Agenda: introduction, a typical Zipkin site, wrapping up

Wrapping up
Contribute: https://cwiki.apache.org/confluence/display/ZIPKIN/Sites
Check our “Last Month In Zipkin”
Chat any time on Gitter: gitter.im/openzipkin/zipkin
github.com/openzipkin/zipkin
@adrianfcole #zipkin

Adrian Cole
