How to Properly Blame Things for Causing Latency - JFokus 2018

Slide 1

Slide 1 text

How to Properly Blame Things for Causing Latency An introduction to Distributed Tracing and Zipkin @adrianfcole works at Pivotal works on Zipkin

Slide 2

Slide 2 text

Introduction introduction understanding latency distributed tracing zipkin demo propagation wrapping up @adrianfcole #zipkin

Slide 3

Slide 3 text

@adrianfcole • spring cloud at pivotal • focused on distributed tracing • helped open zipkin

Slide 4

Slide 4 text

Understanding Latency introduction understanding latency distributed tracing zipkin demo propagation wrapping up @adrianfcole #zipkin

Slide 5

Slide 5 text

Understanding Latency Logging - recording events Metrics - data combined from measuring events Tracing - recording events with causal ordering Unifying theory: Everything is based on events credit: coda hale note: metrics take in events and emit events! ex a reading of requests per second is itself an event data combined from measuring events put more speciﬁcally: metrics are “statistical aggregates of properties of events which generate periodic events recording the instantaneous values of those aggregates"

Slide 6

Slide 6 text

Different tools Tracing Request scoped Logging Events Metrics Aggregatable* credit: peter bourgon Different focus often confused because they have things in common, like a timeline. start with logging: crappy error happened tracing: impact of that error metrics: how many errors of this type are happening in the system logs: discrete events: debug, error, audit, request details crappy error happened; tracing can tell you the impact of that error. for example did it cause a caller to fail or did it delay it? tracing: request-scope causal info: latency, queries, IDs metrics: gauge counter histogram; success failure or customer how many errors of this type are happening in this cluster? not all metrics are meaningfully aggregatable, ex percentiles or averages https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Slide 7

Slide 7 text

Let’s use latency to compare a few tools • Log - event (response time) • Metric - value (response time) • Trace - tree (response time) event value and tree are outputs of each corresponding system

Slide 8

Slide 8 text

Logs show response time [20/Apr/2017:14:19:07 +0000] "GET / HTTP/1.1" 200 7918 "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv: 1.8.1.11) Gecko/20061201 Firefox/2.0.0.11 (Ubuntu- feisty)" **0/95491** Look! this request took 95 milliseconds! often a ﬁeld or other to derive duration from logs. note there’s some complexity in this format, and often latency is timestamp math between events.

Slide 9

Slide 9 text

Metrics show response time Is 95 milliseconds slow? How fast were most requests at 14:19? context of a fact within the system. 95ms is indeed slow, but not critical. most requests were good at that time, even if the system had trouble 10 minutes prior can be work resource event customer metrics

Slide 10

Slide 10 text

What caused the request to take 95 milliseconds? Traces show response time Wire Send Store Async Store Wire Send POST /things POST /things åȇȇȇȇȇȇȇȇȇȇȇȇ95491 microseconds───────────────────────────å åȇȇȇȇȇȇȇȇȇȇȇȇ 557231 microseconds───────────å an error delayed the request, which would have otherwise been performant.

Slide 11

Slide 11 text

Log - easy to “grep”, manually read Metric - can identify trends Trace - identify cause across services First thoughts…. You can link together: For example add trace ID to logs

Slide 12

Slide 12 text

Distributed Tracing introduction understanding latency distributed tracing zipkin demo propagation wrapping up @adrianfcole #zipkin

Slide 13

Slide 13 text

Distributed Tracing commoditizes knowledge Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time. You can compare traces to understand why certain requests take longer than others.

Slide 14

Slide 14 text

Distributed Tracing Vocabulary A Span is an individual operation that took place. A span contains timestamped events and tags. A Trace is an end-to-end latency graph, composed of spans. Tracers records spans and passes context required to connect them into a trace Instrumentation uses a tracer to record a task such as an http request as a span

Slide 15

Slide 15 text

wombats:10.2.3.47:8080 A Span is an individual operation Server Received a Request POST /things Server Sent a Response Events Tags Operation remote.ipv4 1.2.3.4 http.request-id abcd-ffe http.request.size 15 MiB http.url …&features=HD-uploads

Slide 16

Slide 16 text

Tracing is logging important events Wire Send Store Async Store Wire Send POST /things POST /things

Slide 17

Slide 17 text

Tracers record time, duration and host Wire Send Store Async Store Wire Send POST /things POST /things Tracers don’t decide what to record, instrumentation does.. we’ll get to that

Slide 18

Slide 18 text

Tracers send trace data out of process Tracers propagate IDs in-band, to tell the receiver there’s a trace in progress Completed spans are reported out-of-band, to reduce overhead and allow for batching

Slide 19

Slide 19 text

Tracer == Instrumentation? A tracer is a utility library, similar to metrics or logging libraries. It is a mechanism uses to trace an operation. Instrumentation is the what and how. For example, instrumentation for ApacheHC and OkHttp record similar data with a tracer. How they do that is library specific.

Slide 20

Slide 20 text

Instrumentation decides what to record Instrumentation decides how to propagate state Instrumentation is usually invisible to users

Slide 21

Slide 21 text

Tracing affects your production requests Tracing affects your production requests, causing size and latency overhead. Tracers are carefully written to not cause applications to crash. Instrumentation is carefully written to not slow or overload your requests. - Tracers propagate structural data in-band, and the rest out-of-band - Instrumentation has data and sampling policy to manage volume - Often, layers such as HTTP have common instrumentation and/or models

Slide 22

Slide 22 text

Tracing Systems are Observability Tools Tracing systems collect, process and present data reported by tracers. - aggregate spans into trace trees - provide query and visualization focused on latency - have retention policy (usually days)

Slide 23

Slide 23 text

Protip: Tracing is not just for latency Some wins unrelated to latency - Understand your architecture - Find who’s calling deprecated services - Reduce time spent on triage

Slide 24

Slide 24 text

Zipkin introduction understanding latency distributed tracing zipkin demo propagation wrapping up @adrianfcole #zipkin

Slide 25

Slide 25 text

Zipkin is a distributed tracing system

Slide 26

Slide 26 text

Zipkin lives in GitHub Zipkin was created by Twitter in 2012 based on the Google Dapper paper. In 2015, OpenZipkin became the primary fork. OpenZipkin is an org on GitHub. It contains tracers, OpenApi spec, service components and docker images. https://github.com/openzipkin

Slide 27

Slide 27 text

Zipkin Architecture Amazon Azure Docker Google Kubernetes Mesos Spark Tracers report spans HTTP or Kafka. Servers collect spans, storing them in MySQL, Cassandra, or Elasticsearch. Users query for traces via Zipkin’s Web UI or Api. google: https://cloudplatform.googleblog.com/2016/12/Stackdriver-Trace-Zipkin-distributed-tracing-and-performance-analysis-for-everyone.html spark: https://engineering.pinterest.com/blog/distributed-tracing-pinterest-new-open-source-tools amazon X-Ray: https://github.com/openzipkin/brave/releases/tag/4.9.1 dealer.com have some interesting tools, too https://github.com/DealerDotCom/zipkin-elasticbeanstalk

Slide 28

Slide 28 text

Zipkin has starter architecture Tracing is new for a lot of folks. For many, the MySQL option is a good start, as it is familiar. services: storage: image: openzipkin/zipkin-mysql container_name: mysql ports: - 3306:3306 server: image: openzipkin/zipkin environment: - STORAGE_TYPE=mysql - MYSQL_HOST=mysql ports: - 9411:9411 depends_on: - storage

Slide 29

Slide 29 text

Zipkin can be as simple as a single file $ curl -SL 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec' > zipkin.jar $ SELF_TRACING_ENABLED=true java -jar zipkin.jar ******** ** ** * * ** ** ** ** ** ** ** ** ******** **** **** **** **** ****** **** *** **************************************************************************** ******* **** *** **** **** ** ** ***** ** ***** ** ** ** ** ** ** ** ** * *** ** **** ** ** ** ***** **** ** ** *** ****** ** ** ** ** ** ** ** :: Powered by Spring Boot :: (v1.5.4.RELEASE) 2016-08-01 18:50:07.098 INFO 8526 --- [ main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 8526 (/Users/acole/oss/sleuth-webmvc- example/zipkin.jar started by acole in /Users/acole/oss/sleuth-webmvc-example) —snip— $ curl -s localhost:9411/api/v2/services|jq . [ "gateway" ]

Slide 30

Slide 30 text

How data gets to Zipkin —> Looks easy right?

Slide 31

Slide 31 text

Brave: the most popular Zipkin Java tracer • Brave - OpenZipkin’s java library and instrumentation • Layers under projects like Ratpack, Dropwizard, Play • Spring Cloud Sleuth - automatic tracing for Spring Boot • Includes many common spring integrations • Starting in version 2, Sleuth is a layer over Brave! c, c#, erlang, javascript, go, php, python, ruby, too https://github.com/openzipkin/brave-webmvc-example https://github.com/openzipkin/sleuth-webmvc-example

Slide 32

Slide 32 text

Some notable open source tracing libraries • OpenCensus - Observability SDK (metrics, tracing, tags) • Most notably, gRPC’s tracing library • Includes exporters in Zipkin format and B3 propagation format • OpenTracing - trace instrumentation library api definitions • Bridge to Zipkin tracers available in Java, Go and PHP • SkyWalking - APM with a java agent developed in China • Work in progress to send trace data to zipkin Census stems from the metrics and tracing instrumentation and tooling that exist inside of Google (Dapper, for which it's used as a sidecar), and it will be replacing internal instrumentation at Google and the Stackdriver Trace SDKs once it matures.

Slide 33

Slide 33 text

Demo introduction understanding latency distributed tracing zipkin demo propagation wrapping up @adrianfcole #zipkin

Slide 34

Slide 34 text

A web browser calls a service that calls another. Zipkin will show how long the whole operation took, as well how much time was spent in each service. Distributed Tracing across multiple apps openzipkin/zipkin-js spring-cloud-sleuth

Slide 35

Slide 35 text

JavaScript referenced in index.html fetches an api request. The fetch function is traced via a Zipkin wrapper. zipkin-js JavaScript openzipkin/zipkin-js-example

Slide 36

Slide 36 text

Api requests are served by Spring Boot applications. Tracing of these are automatically performed by Spring Cloud Sleuth. Spring Cloud Sleuth Java openzipkin/sleuth-webmvc-example

Slide 37

Slide 37 text

Propagation introduction understanding latency distributed tracing zipkin demo propagation wrapping up @adrianfcole #zipkin

Slide 38

Slide 38 text

Under the covers, tracing code can be tricky // This is real code, but only one callback of Apache HC Span span = handler.nextSpan(req); CloseableHttpResponse resp = null; Throwable error = null; try (SpanInScope ws = tracer.withSpanInScope(span)) { return resp = protocolExec.execute(route, req, ctx, exec); } catch (IOException | HttpException | RuntimeException | Error e) { error = e; throw e; } finally { handler.handleReceive(resp, error, span); } Timing correctly Trace state Error callbacks Version woes managing spans ensures parent/child links are maintained. this allows the system to collate spans into a trace automatically.

Slide 39

Slide 39 text

Instrumentation Instrumentation record behavior of a request or a message. Instrumentation is applied use of Tracer libraries. They extract trace context from incoming messages, pass it through the process, allocating child spans for intermediate operations. Finally, they inject trace context onto outgoing messages so the process can repeat on the other side.

Slide 40

Slide 40 text

Propagation Instrumentation encode request-scoped state required for tracing to work. Services that use a compatible context format can understand their position in a trace. Regardless of libraries used, tracing can interop via propagation. Look at B3 and trace-context for example.

Slide 41

Slide 41 text

Propagation is the hardest part • In process - place state in scope and always remove • Across processes - inject state into message and out on the other side • Among other contexts - you may not be the only one

Slide 42

Slide 42 text

In process propagation • Scoping api - ensures state is visible to downstream code and always cleaned up. ex try/finally • Instrumentation - carries state to where it can be scoped • Async - you may have to stash it between callbacks • Queuing - if backlog is possible, you may have to attach it to the message even in-process Thread locals are basically key/value stores based on the thread id sometimes you can correlate by message or request id instead

Slide 43

Slide 43 text

Across process propagation • Headers - usually you can encode state into a header • some proxies will drop it • some services/clones may manipulate it • Envelopes - sometimes you have a custom message envelope • this implies coordination as it can make the message unreadable

Slide 44

Slide 44 text

Among other tracing implementations • In-process - you may be able to join their context • you may be able to read their data (ex thread local storage) • you may be able to correlate with it • Across process - you may be able to share a header • only works if your ID format can fit into theirs • otherwise you may have to push multiple headers

Slide 45

Slide 45 text

Wrapping Up introduction understanding latency distributed tracing zipkin demo wrapping up @adrianfcole #zipkin

Slide 46

Slide 46 text

Wrapping up Start by sending traces directly to a zipkin server. Grow into fanciness as you need it: sampling, streaming, etc Remember you are not alone! @adrianfcole #zipkin @zipkinproject gitter.im/openzipkin/zipkin

Slide 47

Slide 47 text

Example Tracing Flow log correlation metrics scope http request Reporter http request Recorder Trace Context Parser adrian’s not done with this image