Slide 1

Slide 1 text

An introduction to Distributed Tracing and Zipkin by Adrian Cole #ZipkinMeetup Microservices using NetflixOSS

Slide 2

Slide 2 text

Introduction introduction latency analysis distributed tracing zipkin and friends see also wrapping up

Slide 3

Slide 3 text

@adrianfcole • spring cloud at pivotal • focus on distributed tracing • helped open-source zipkin

Slide 4

Slide 4 text

Latency Analysis introduction latency analysis distributed tracing zipkin and friends see also wrapping up

Slide 5

Slide 5 text

Latency Analysis Microservice and data pipeline architectures are often a graph of components, distributed across a network. A call graph or data flow can become delayed or fail due to the nature of the operation, components, or edges between them. We want to understand our current architecture and troubleshoot latency problems, in production.

Slide 6

Slide 6 text

Why is POST /things slow? POST /things

Slide 7

Slide 7 text

Troubleshooting latency problems When was the event and how long did it take? Where did this happen? Which event was it? Is it abnormal?

Slide 8

Slide 8 text

When was the event and how long did it take? First log statement was at 15:31:29.103 GMT… last… 15:31:30.530 Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds
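The duration on the slide is just the difference between the two server timestamps. A minimal sketch of that arithmetic in Python (`duration_ms` is an illustrative helper, not part of any tracer; it assumes both timestamps share the same day and timezone):

```python
from datetime import datetime, timedelta

def duration_ms(received: str, sent: str) -> int:
    """Millisecond difference between two wall-clock timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(sent, fmt) - datetime.strptime(received, fmt)
    return delta // timedelta(milliseconds=1)

# The slide's server-side timestamps:
print(duration_ms("15:31:29.103", "15:31:30.530"))  # prints 1427
```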

Slide 9

Slide 9 text

wombats:10.2.3.47:8080 Server log says Client IP was 1.2.3.4 This is a shard in the wombats cluster, listening on 10.2.3.47:8080 Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds Where did this happen? peer.ipv4 1.2.3.4

Slide 10

Slide 10 text

wombats:10.2.3.47:8080 Which event was it? The HTTP response header had “request-id: abcd-ffe”. Is that what you mean? Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds peer.ipv4 1.2.3.4 http.request-id abcd-ffe

Slide 11

Slide 11 text

wombats:10.2.3.47:8080 Is it abnormal? I’ll check other logs for this request id and see what I can find out. Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds Well, average response time for POST /things in the last 2 days is 100ms peer.ipv4 1.2.3.4 http.request-id abcd-ffe

Slide 12

Slide 12 text

wombats:10.2.3.47:8080 Achieving understanding I searched the logs for others in that group… took about the same time. Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds Ok, looks like this client is in the experimental group for HD uploads peer.ipv4 1.2.3.4 http.request-id abcd-ffe http.request.size 15 MiB http.url …&features=HD-uploads

Slide 13

Slide 13 text

POST /things We find operations are often connected Client Sent: 15:31:28.500 Client Received: 15:31:31.000 Duration: 2500 milliseconds Server Received: 15:31:29.103 POST /things Server Sent: 15:31:30.530 Duration: 1427 milliseconds

Slide 14

Slide 14 text

but not all operations are on the critical path Wire Send Store Async Store Wire Send POST /things POST /things

Slide 15

Slide 15 text

and not all operations are relevant Wire Send Store Async Async Store Failed Wire Send POST /things POST /things KQueueArrayWrapper.kev UnboundedFuturePool-2 SelectorUtil.select LockSupport.parkNan ReferenceQueue.remove

Slide 16

Slide 16 text

Call graphs are increasingly complex Polyglot microservice and data flow architectures are increasingly easy to write and deploy.

Slide 17

Slide 17 text

Can we make troubleshooting wizard-free? We no longer need wizards to deploy complex architectures. We shouldn’t need wizards to troubleshoot them, either!

Slide 18

Slide 18 text

Distributed Tracing Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time. You can compare traces to understand why certain requests take longer than others.

Slide 19

Slide 19 text

Distributed Tracing introduction latency analysis distributed tracing zipkin and friends see also wrapping up

Slide 20

Slide 20 text

Distributed Tracing A Span is an individual operation that took place. A span contains timestamped events and tags. A Trace is an end-to-end latency graph, composed of spans.

Slide 21

Slide 21 text

wombats:10.2.3.47:8080 A Span is an individual operation Server Received POST /things Server Sent Events Tags Operation peer.ipv4 1.2.3.4 http.request-id abcd-ffe http.request.size 15 MiB http.url …&features=HD-uploads
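The span on this slide can be sketched as a small data structure: an operation name, timestamped events, and key/value tags. A toy Python model (field names and the `Span` class are illustrative, not Zipkin's wire format):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Span:
    """An individual operation: a name plus timestamped events and key/value tags."""
    operation: str
    events: List[Tuple[int, str]] = field(default_factory=list)  # (epoch micros, event name)
    tags: Dict[str, str] = field(default_factory=dict)

# The span from the slide, expressed in this toy model:
span = Span("POST /things")
span.events.append((1453735889103000, "Server Received"))
span.events.append((1453735890530000, "Server Sent"))
span.tags["peer.ipv4"] = "1.2.3.4"
span.tags["http.request-id"] = "abcd-ffe"
```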

Slide 22

Slide 22 text

A Trace is a graph of spans in context Wire Send Store Async Store Wire Send POST /things POST /things

Slide 23

Slide 23 text

Tracers create Spans Tracers execute in your production apps! They are written to not log too much, and to not cause applications to crash. - propagate structural data in-band, and the rest out-of-band - have instrumentation or sampling policy to manage volume - often include opinionated instrumentation of layers such as HTTP
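One concrete way tracers propagate structural data in-band is Zipkin's B3 HTTP headers, and a sampling policy decides which requests get traced at all. A minimal sketch, assuming a simple probabilistic sampler (the `X-B3-*` header names are real; `should_sample` and `inject_b3` are illustrative helpers, not any tracer's API):

```python
import random

def should_sample(rate: float = 0.01) -> bool:
    """A toy probabilistic sampling policy: keep roughly `rate` of traces."""
    return random.random() < rate

def inject_b3(headers, trace_id, span_id, parent_id=None, sampled=True):
    """Copy trace context into outgoing HTTP headers using Zipkin's B3 names."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    if parent_id is not None:
        headers["X-B3-ParentSpanId"] = parent_id
    headers["X-B3-Sampled"] = "1" if sampled else "0"
    return headers
```

The timing data itself (events, tags) is reported out-of-band to the tracing system; only these few ids travel with the request.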

Slide 24

Slide 24 text

Tracing Systems Tracing systems collect, process and present data reported by tracers. - aggregate spans into trace trees - provide query and visualization for latency analysis - have retention policy (usually days)
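Aggregating spans into trace trees amounts to grouping reported spans by parent id. A hedged Python sketch (`trace_tree` and the dict shape are illustrative, not how any particular collector stores data):

```python
from collections import defaultdict

def trace_tree(spans):
    """Index spans by parent id; collectors receive spans out of order,
    so the root may arrive after its children."""
    children = defaultdict(list)
    root = None
    for span in spans:
        parent = span.get("parentId")
        if parent is None:
            root = span
        else:
            children[parent].append(span)
    return root, children

# Spans from the earlier call-graph slide, reported out of order:
spans = [
    {"id": "b", "parentId": "a", "name": "Wire Send"},
    {"id": "a", "name": "POST /things"},
    {"id": "c", "parentId": "a", "name": "Store"},
]
root, children = trace_tree(spans)
```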

Slide 25

Slide 25 text

Zipkin and Friends introduction latency analysis distributed tracing zipkin and friends see also wrapping up

Slide 26

Slide 26 text

Zipkin is a distributed tracing system

Slide 27

Slide 27 text

Zipkin has a pluggable architecture Tracers collect timing data and transport it over HTTP or Kafka. Collectors store spans in MySQL or Cassandra. Users query for traces via Zipkin’s Web UI or API.

cassandra:
  image: openzipkin/zipkin-cassandra:1.30.2
  ports:
    - 9042:9042
query:
  image: openzipkin/zipkin-query:1.30.2
  environment:
    - TRANSPORT_TYPE=http
    - STORAGE_TYPE=cassandra
  ports:
    - 9411:9411
  links:
    - cassandra:storage
web:
  image: openzipkin/zipkin-web:1.30.2
  ports:
    - 8080:8080
  environment:
    - TRANSPORT_TYPE=http
  links:
    - query

Slide 28

Slide 28 text

Zipkin has a starter architecture Tracing is new for a lot of folks. For many, the MySQL option is a good start, as it is familiar.

mysql:
  image: openzipkin/zipkin-mysql:1.30.2
  ports:
    - 3306:3306
query:
  image: openzipkin/zipkin-java:0.4.4
  environment:
    - TRANSPORT_TYPE=http
    - STORAGE_TYPE=mysql
  ports:
    - 9411:9411
  links:
    - mysql:storage
web:
  image: openzipkin/zipkin-web:1.30.2
  ports:
    - 8080:8080
  environment:
    - TRANSPORT_TYPE=http
  links:
    - query

Slide 29

Slide 29 text

Zipkin can be as simple as a single file

$ curl -SL https://jcenter.bintray.com/io/zipkin/java/zipkin-server/0.4.4/zipkin-server-0.4.4-exec.jar > zipkin-server.jar
$ java -jar zipkin-server.jar
  (Spring Boot ASCII banner, v1.3.1.RELEASE)
2016-01-25 15:15:16.456  INFO 94716 --- [main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 94716 (/tmp/zipkin-server.jar started by acole in /tmp)
—snip—
$ curl -s localhost:9411/api/v1/services | jq .
[
  "zipkin-query"
]

Slide 30

Slide 30 text

Zipkin lives in GitHub Zipkin was created by Twitter in 2012. In 2015, OpenZipkin became the primary fork. OpenZipkin is an org on GitHub. It contains tracers, an OpenApi spec, service components, and Docker images. https://github.com/openzipkin https://gitter.im/openzipkin/zipkin

Slide 31

Slide 31 text

Zipkin-tracing-Zipkin demo because we can

Slide 32

Slide 32 text

Zipkin-Compatible Tracers Incomplete list, and in order of appearance!

Slide 33

Slide 33 text

Finagle includes a Trace api and common integrations Finagle automatically traces stacks like http, thrift, mysql, redis and memcached. Report to zipkin via Scribe by depending on finagle-zipkin. https://github.com/twitter/finagle Finagle Scala

Slide 34

Slide 34 text

Brave includes tracing apis and common integrations You can automatically trace with integrations such as Jersey filters or Apache HTTP client interceptors. Brave also includes tracing apis for custom instrumentation. https://github.com/openzipkin/brave Brave Java

Slide 35

Slide 35 text

HTrace is a tracing framework for use with distributed systems. Hadoop includes commands to enable automatic tracing via HTrace. ZipkinSpanReceiver can report via Kafka or Scribe. HTrace apis can also be used directly. https://github.com/apache/incubator-htrace HTrace C, Java

Slide 36

Slide 36 text

ZipkinTracer includes middleware for Rack and Faraday If an incoming request to Rack results in an outgoing request via Faraday, they will appear as two spans in the same trace. https://github.com/openzipkin/zipkin-tracer ZipkinTracer Ruby

Slide 37

Slide 37 text

ZipkinTracerModule includes a tracing client and IHttpModule integration By configuring ZipkinRequestContextModule, incoming requests will be automatically traced. You can also use ITracerClient for custom instrumentation. https://github.com/mdsol/Medidata.ZipkinTracerModule ZipkinTracerModule .Net

Slide 38

Slide 38 text

Spring Cloud Sleuth includes instrumentation for Spring Boot, and a streaming collector. Report to zipkin via HTTP by depending on spring-cloud-sleuth-zipkin. Spring Cloud Stream is also available, providing a more flexible and scalable pipeline. https://github.com/spring-cloud/spring-cloud-sleuth Spring Cloud Sleuth Java

Slide 39

Slide 39 text

pyramid_zipkin is a Pyramid tween to add Zipkin service spans. By including ‘pyramid_zipkin’, incoming requests will be automatically traced. You can also use create_headers_for_new_span for outgoing requests. https://github.com/Yelp/pyramid_zipkin pyramid_zipkin Python

Slide 40

Slide 40 text

see also introduction latency analysis distributed tracing zipkin and friends see also wrapping up

Slide 41

Slide 41 text

OpenTracing is an effort to clean up and de-risk distributed tracing instrumentation OpenTracing interfaces decouple instrumentation from vendor-specific dependencies and terminology. This allows applications to switch products with less effort. http://opentracing.io/ OpenTracing Go, Python, Java, JavaScript

Slide 42

Slide 42 text

A single configuration change to bind a Tracer implementation in main() or similar:

import "github.com/opentracing/opentracing-go"
import "github.com/tracer_x/tracerimpl"

func main() {
    // Bind tracerimpl to the opentracing system
    opentracing.InitGlobalTracer(
        tracerimpl.New(kTracerImplAccessToken))
    ... normal main() stuff ...
}

How does it work? Clean, vendor-neutral instrumentation code that naturally tells the story of a distributed operation:

import "github.com/opentracing/opentracing-go"

func AddContact(c *Contact) {
    sp := opentracing.StartTrace("AddContact")
    defer sp.Finish()
    sp.Info("Added contact: ", *c)
    subRoutine(sp, ...)
    ...
}

func subRoutine(parentSpan opentracing.Span, ...) {
    ...
    sp := opentracing.JoinTrace("subRoutine", parentSpan)
    defer sp.Finish()
    sp.Info("deferred work to subroutine")
    ...
}

Thanks, @el_bhs for the slide!

Slide 43

Slide 43 text

Spigo Go Simulate interesting architectures Generate large scale configurations Eventually stress test real tools https://github.com/adrianco/spigo

Slide 44

Slide 44 text

You can POST Spigo flows to zipkin

$ run.sh
—snip—
$ curl -s 192.168.99.100:9411/api/v1/spans -X POST --data @json_metrics/lamp_flow.json -H "Content-Type: application/json"
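The body such a POST carries is a JSON list of spans in Zipkin's v1 format. A minimal hand-built example in Python (the trace id and timestamps are illustrative values echoing the earlier slides; timestamps and durations are in microseconds, and "sr"/"ss" are Zipkin's Server Received / Server Sent annotations):

```python
import json

# One server-side span in Zipkin's v1 JSON format.
endpoint = {"serviceName": "wombats", "ipv4": "10.2.3.47", "port": 8080}
span = {
    "traceId": "463ac35c9f6413ad",
    "id": "463ac35c9f6413ad",
    "name": "post /things",
    "timestamp": 1453735889103000,   # Server Received, epoch micros
    "duration": 1427000,             # 1427 milliseconds, in micros
    "annotations": [
        {"timestamp": 1453735889103000, "value": "sr", "endpoint": endpoint},
        {"timestamp": 1453735890530000, "value": "ss", "endpoint": endpoint},
    ],
    "binaryAnnotations": [],
}
payload = json.dumps([span])  # the endpoint accepts a JSON list of spans
```

You could then POST `payload` to /api/v1/spans the same way the slide's curl command posts the Spigo flow file.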

Slide 45

Slide 45 text

Wrapping Up introduction latency analysis distributed tracing zipkin and friends see also wrapping up

Slide 46

Slide 46 text

Where to go from here Start by sending traces directly to a zipkin server. Grow into fanciness as you need it: sampling, Kafka, Spark, etc. Remember you are not alone! gitter.im/openzipkin/zipkin

Slide 47

Slide 47 text

Thank you! @adrianfcole Questions?

Slide 48

Slide 48 text

Microservices using NetflixOSS March 2nd: Building pipelines with Spinnaker by Gard Rimestad