Slide 1

Slide 1 text

An introduction to Distributed Tracing and Zipkin
How to Properly Blame Things for Causing Latency
Adrian Cole, Pivotal
@adrianfcole
© 2016 Pivotal

Slide 2

Slide 2 text

Introduction

Agenda: introduction, understanding latency, distributed tracing, zipkin, demo, wrapping up

@adrianfcole #zipkin

Slide 3

Slide 3 text

@adrianfcole
• spring cloud at pivotal
• focused on distributed tracing
• helped open zipkin

Slide 4

Slide 4 text

Understanding Latency

Slide 5

Slide 5 text

Understanding our architecture

Microservice and data pipeline architectures are often a graph of components, distributed across a network.

A call graph or data flow can become delayed or fail due to the nature of the operation, components, or edges between them.

We want to understand our current architecture and troubleshoot latency problems, in production.

Slide 6

Slide 6 text

Why is POST /things slow? POST /things

Slide 7

Slide 7 text

There are often two sides to the story

Client
  Sent:      15:31:28:500
  Received:  15:31:31:000
  Duration:  2500 milliseconds

Server (POST /things)
  Received:  15:31:29:103
  Sent:      15:31:30:530
  Duration:  1427 milliseconds
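The arithmetic behind the two durations can be sketched as follows. This is a minimal illustration, assuming the timestamps are formatted as hours:minutes:seconds:milliseconds and that the client and server clocks are comparable; the helper names `parse` and `millis_between` are hypothetical.

```python
from datetime import datetime, timedelta


def parse(ts: str) -> datetime:
    # Slide timestamps look like 15:31:28:500 (hours:minutes:seconds:millis).
    return datetime.strptime(ts, "%H:%M:%S:%f")


def millis_between(start: str, end: str) -> int:
    # Integer division by one millisecond avoids floating-point rounding.
    return (parse(end) - parse(start)) // timedelta(milliseconds=1)


client_ms = millis_between("15:31:28:500", "15:31:31:000")
server_ms = millis_between("15:31:29:103", "15:31:30:530")

# Time the client waited that the server's own measurement does not cover
# (for example the network, or queueing before the server saw the request).
outside_server_ms = client_ms - server_ms

print(client_ms, server_ms, outside_server_ms)
```

The gap between the two durations is the part of the story that neither side can explain on its own, which is why both views are worth recording.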

Slide 8

Slide 8 text

and not all operations are on the critical path

[Diagram: POST /things timeline with operations Wire Send, Store, Async Store, Wire Send]

Slide 9

Slide 9 text

and not all operations are relevant

[Diagram: POST /things timeline with operations Wire Send, Store, Async, Async Store, Failed, Wire Send, shown next to low-level frames: KQueueArrayWrapper.kev, UnboundedFuturePool-2, SelectorUtil.select, LockSupport.parkNan, ReferenceQueue.remove]

Slide 10

Slide 10 text

Service architecture isn’t this simple anymore

Single-server scenarios aren’t realistic or don’t fully explain latency.

Image credit: David Vignoni, Gnome-fs-server.svg

Slide 11

Slide 11 text

Can we make troubleshooting wizard-free?

We no longer need wizards to deploy complex architectures. We shouldn’t need wizards to troubleshoot them, either!

Slide 12

Slide 12 text

Distributed Tracing

Slide 13

Slide 13 text

Distributed Tracing commoditizes knowledge

Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time.

You can compare traces to understand why certain requests take longer than others.

Slide 14

Slide 14 text

Distributed Tracing Vocabulary

A Span is an individual operation that took place. A span contains timestamped events and tags.

A Trace is an end-to-end latency graph, composed of spans.

Slide 15

Slide 15 text

A Span is an individual operation

wombats:10.2.3.47:8080

Operation: POST /things
Events: Server Received, Server Sent
Tags:
  peer.ipv4          1.2.3.4
  http.request-id    abcd-ffe
  http.request.size  15 MiB
  http.url           …&features=HD-uploads
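The vocabulary can be sketched as a small data model. This is an illustration only, not Zipkin's actual model: the `Span` and `Trace` classes and their field names are hypothetical, and the event times are assumed offsets in microseconds borrowed from the server-side duration shown earlier in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Span:
    """One operation, with timestamped events and key/value tags."""
    name: str
    events: List[Tuple[int, str]] = field(default_factory=list)  # (micros, label)
    tags: Dict[str, str] = field(default_factory=dict)

    def duration_micros(self) -> int:
        # Duration is the distance between the earliest and latest event.
        times = [t for t, _ in self.events]
        return max(times) - min(times) if times else 0


@dataclass
class Trace:
    """An end-to-end latency graph, composed of spans."""
    spans: List[Span]


span = Span(
    name="POST /things",
    # Assumed offsets: the slide does not give times for these events.
    events=[(0, "Server Received"), (1_427_000, "Server Sent")],
    tags={
        "peer.ipv4": "1.2.3.4",
        "http.request-id": "abcd-ffe",
        "http.request.size": "15 MiB",
    },
)
trace = Trace(spans=[span])
print(span.duration_micros())
```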

Slide 16

Slide 16 text

Tracing Systems are Observability Tools

Tracing systems collect, process and present data reported by tracers.
- aggregate spans into trace trees
- provide query and visualization focused on latency
- have retention policy (usually days)
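Aggregating spans into trace trees can be sketched as below. This is a simplified illustration, not how the Zipkin server is implemented: it assumes each reported span is a dict carrying `traceId`, `id` and an optional `parentId`, and the function name `build_trace_trees` is hypothetical.

```python
from collections import defaultdict


def build_trace_trees(spans):
    """Group spans by trace ID, then nest each span under its parent."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["traceId"]].append(span)

    trees = {}
    for trace_id, members in by_trace.items():
        known_ids = {s["id"] for s in members}
        children = defaultdict(list)
        roots = []
        for s in members:
            parent = s.get("parentId")
            # A span with no parent (or a parent we never received) is a root.
            if parent is None or parent not in known_ids:
                roots.append(s)
            else:
                children[parent].append(s)

        def node(s):
            return {
                "name": s["name"],
                "children": [node(c) for c in children[s["id"]]],
            }

        trees[trace_id] = [node(r) for r in roots]
    return trees


reported = [
    {"traceId": "t1", "id": "a", "name": "post /things"},
    {"traceId": "t1", "id": "b", "parentId": "a", "name": "store"},
    {"traceId": "t1", "id": "c", "parentId": "a", "name": "async store"},
    {"traceId": "t2", "id": "d", "name": "get /things"},
]
trees = build_trace_trees(reported)
print(trees["t1"])
```

Treating a span whose parent never arrived as a root keeps partial traces visible instead of dropping them.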

Slide 17

Slide 17 text

ProTip: Tracing is not just for latency

Some wins unrelated to latency
- Understand your architecture
- Find services that aren’t used
- Reduce time spent on triage

Slide 18

Slide 18 text

Zipkin

Slide 19

Slide 19 text

Zipkin is a distributed tracing system

Slide 20

Slide 20 text

Zipkin lives in GitHub

Zipkin was created by Twitter in 2012. In 2015, OpenZipkin became the primary fork.

OpenZipkin is an org on GitHub. It contains tracers, an OpenAPI spec, service components and Docker images.

https://github.com/openzipkin

Slide 21

Slide 21 text

Zipkin Architecture

Tracers report spans via HTTP or Kafka.
Servers collect spans, storing them in MySQL, Cassandra, or Elasticsearch.
Users query for traces via Zipkin’s Web UI or API.

Platform frameworks for Zipkin:
- Bosh (Cloud Foundry)
- Docker (in Zipkin’s org)
- Kubernetes
- Mesos
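Reporting a span over HTTP could look roughly like the sketch below. It is an assumption-laden illustration rather than a real tracer: the field names follow my understanding of Zipkin's v1 JSON format and the `/api/v1/spans` path, both of which should be checked against the API spec in the OpenZipkin org, and the helpers `make_span` and `build_report_request` are hypothetical. The request is built but not sent.

```python
import json
import urllib.request


def make_span(trace_id, span_id, name, service, start_micros, end_micros, tags):
    # Assumed v1 shape: "sr"/"ss" mark Server Received and Server Sent.
    endpoint = {"serviceName": service}
    return {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": start_micros,
        "duration": end_micros - start_micros,
        "annotations": [
            {"timestamp": start_micros, "value": "sr", "endpoint": endpoint},
            {"timestamp": end_micros, "value": "ss", "endpoint": endpoint},
        ],
        "binaryAnnotations": [
            {"key": k, "value": v, "endpoint": endpoint} for k, v in tags.items()
        ],
    }


def build_report_request(spans, base_url="http://localhost:9411"):
    # Builds the POST without sending it; urllib.request.urlopen(request)
    # would deliver it to a running server.
    return urllib.request.Request(
        base_url + "/api/v1/spans",
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


span = make_span(
    trace_id="463ac35c9f6413ad",  # illustrative IDs, not from the slides
    span_id="463ac35c9f6413ad",
    name="post /things",
    service="wombats",
    start_micros=1_470_000_000_000_000,
    end_micros=1_470_000_001_427_000,
    tags={"peer.ipv4": "1.2.3.4"},
)
request = build_report_request([span])
print(request.get_method(), request.full_url)
```

In practice a tracer library handles this reporting for you; the sketch only shows what crosses the wire.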

Slide 22

Slide 22 text

Zipkin has a starter architecture

Tracing is new for a lot of folks. For many, the MySQL option is a good start, as it is familiar.

    services:
      storage:
        image: openzipkin/zipkin-mysql
        container_name: mysql
        ports:
          - 3306:3306
      server:
        image: openzipkin/zipkin
        environment:
          - STORAGE_TYPE=mysql
          - MYSQL_HOST=mysql
        ports:
          - 9411:9411
        depends_on:
          - storage

Slide 23

Slide 23 text

Zipkin can be as simple as a single file

    $ curl -SL 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec' > zipkin.jar
    $ SELF_TRACING_ENABLED=true java -jar zipkin.jar

      .   ____          _            __ _ _
     /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
    ( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
     \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
      '  |____| .__|_| |_|_| |_\__, | / / / /
     =========|_|==============|___/=/_/_/_/
     :: Spring Boot ::        (v1.4.0.RELEASE)

    2016-08-01 18:50:07.098  INFO 8526 --- [           main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 8526 (/Users/acole/oss/sleuth-webmvc-example/zipkin.jar started by acole in /Users/acole/oss/sleuth-webmvc-example)
    --snip--
    $ curl -s localhost:9411/api/v1/services|jq .
    [
      "zipkin-server"
    ]
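The same services query can be made from code. The sketch below mirrors the curl call above, assuming the endpoint returns a JSON array of service names as shown; the function `list_services` and its `opener` parameter are hypothetical, and the usage example substitutes a fake opener so that it runs without a server.

```python
import io
import json
import urllib.request


def list_services(base_url="http://localhost:9411", opener=urllib.request.urlopen):
    # GET /api/v1/services, as in the curl example, and decode the JSON list.
    with opener(base_url + "/api/v1/services") as response:
        return json.load(response)


def fake_opener(url):
    # Stand-in for a live server: returns the response shown on the slide.
    return io.BytesIO(b'["zipkin-server"]')


services = list_services(opener=fake_opener)
print(services)
```

Against a running server, calling `list_services()` with no arguments would use the real HTTP opener.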

Slide 24

Slide 24 text

Demo

Slide 25

Slide 25 text

Distributed Tracing across Spring Boot apps

Two Spring Boot (Java) services collaborate over HTTP. Zipkin will show how long the whole operation took, as well as how much time was spent in each service.

https://github.com/openzipkin/sleuth-webmvc-example
https://github.com/openzipkin/zipkin-js-example

Slide 26

Slide 26 text

Spring Cloud Sleuth (Java)

Web requests in the demo are served by Spring MVC controllers. Spring Cloud Sleuth traces them automatically.

Spring Cloud Sleuth reports to Zipkin via HTTP when the application depends on spring-cloud-sleuth-zipkin.

https://cloud.spring.io/spring-cloud-sleuth/

Slide 27

Slide 27 text

Wrapping Up

Slide 28

Slide 28 text

Wrapping up

Start by sending traces directly to a zipkin server.
Grow into fanciness as you need it: sampling, streaming, etc.
Remember you are not alone!

@adrianfcole #zipkin
gitter.im/spring-cloud/spring-cloud-sleuth
gitter.im/openzipkin/zipkin