Distributed Tracing and Zipkin at NetflixOSS Barcelona

Adrian Cole
January 25, 2016

My first presentation on zipkin took place at Betahaus in Barcelona, hosted by Schibsted at the Netflix OSS Meetup.

This deck gives an overview of latency analysis, and how tracers substitute for people gathering and organizing clues from server logs. The resulting tracing systems give you an upper hand in determining root cause, even as calls cascade across a network.

This being a first run, I learned some things about the deck that I'll change in later versions.

Firstly.. fewer slides. This is too many slides for 50 minutes, so some hard choices need to be made.

It might be better to move the discussion of tracers until after demoing zipkin. Not everyone is interested in tracers: multiple people asked if there was a Prana-style sidecar or similar that they could apply to an existing app to make it instrumented.

I think this could be better organized by demoing zipkin, showing an implicit tool like sleuth.. *then* discussing how that works (e.g. all the cool bits about tracers, and how they propagate data in-band vs out-of-band). I was missing a diagram on some of this, which I literally drew on paper there :)

Another thing the group noticed was that we could touch on automated analysis. Even if it's a single slide, it's worth mentioning what insights people are thinking about here, e.g. annotating notable spans.

Unsurprisingly, folks were unimpressed with zipkin's UI, even if the feedback was kind :P This is known, but no one has offered to help. At least we know it is still an issue!

Finally, and more important than the notes above, the crowd was quite patient and generous with feedback and ideas. A first deck is like telling jokes at amateur night (I think.. as I've not tried it). The NetflixOSS crowd in BCN rolled with my mistakes, gaps in slides and mental glitches. They were quite astute throughout and engaged with feedback. I'd love to come back and present here again!



Transcript

  1. An introduction to Distributed Tracing and Zipkin by Adrian Cole. #ZipkinMeetup Microservices using NetflixOSS
  2. Latency Analysis: Microservice and data pipeline architectures are often a graph of components, distributed across a network. A call graph or data flow can become delayed or fail due to the nature of the operation, components, or edges between them. We want to understand our current architecture and troubleshoot latency problems, in production.
  3. Troubleshooting latency problems: When was the event and how long did it take? Where did this happen? Which event was it? Is it abnormal?
  4. When was the event and how long did it take? First log statement was at 15:31:29.103 GMT… last… 15:31:30.530.
     Server Received: 15:31:29.103 POST /things
     Server Sent: 15:31:30.530
     Duration: 1427 milliseconds
  5. Where did this happen? Server log says the client IP was 1.2.3.4. This is a shard in the wombats cluster, listening on 10.2.3.47:8080.
     wombats:10.2.3.47:8080
     Server Received: 15:31:29.103 POST /things
     Server Sent: 15:31:30.530
     Duration: 1427 milliseconds
     peer.ipv4 1.2.3.4
  6. Which event was it? The http response header had "request-id: abcd-ffe"? Is that what you mean?
     wombats:10.2.3.47:8080
     Server Received: 15:31:29.103 POST /things
     Server Sent: 15:31:30.530
     Duration: 1427 milliseconds
     peer.ipv4 1.2.3.4
     http.request-id abcd-ffe
  7. Is it abnormal? I'll check other logs for this request id and see what I can find out. Well, average response time for POST /things in the last 2 days is 100ms.
     wombats:10.2.3.47:8080
     Server Received: 15:31:29.103 POST /things
     Server Sent: 15:31:30.530
     Duration: 1427 milliseconds
     peer.ipv4 1.2.3.4
     http.request-id abcd-ffe
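     In practice, "check other logs" means searching each related service's logs for that request id, along these lines (the log path is hypothetical):

     $ grep 'abcd-ffe' /var/log/wombats/*.log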
  8. Achieving understanding: I searched the logs for others in that group.. took about the same time. Ok, looks like this client is in the experimental group for HD uploads.
     wombats:10.2.3.47:8080
     Server Received: 15:31:29.103 POST /things
     Server Sent: 15:31:30.530
     Duration: 1427 milliseconds
     peer.ipv4 1.2.3.4
     http.request-id abcd-ffe
     http.request.size 15 MiB
     http.url …&features=HD-uploads
  9. We find operations are often connected.
     POST /things  Client Sent: 15:31:28.500  Client Received: 15:31:31.000  Duration: 2500 milliseconds
     POST /things  Server Received: 15:31:29.103  Server Sent: 15:31:30.530  Duration: 1427 milliseconds
  10. ...but not all operations are on the critical path.
      (diagram: a POST /things call whose server-side POST /things span includes Wire Send, Store, and Async Store operations)
  11. ...and not all operations are relevant.
      (diagram: the same trace with a failed Async Store, buried in noise like KQueueArrayWrapper.kev, UnboundedFuturePool-2, SelectorUtil.select, LockSupport.parkNan and ReferenceQueue.remove)
  12. Call graphs are increasingly complex: Polyglot microservice and data flow architectures are increasingly easy to write and deploy.
  13. Can we make troubleshooting wizard-free? We no longer need wizards to deploy complex architectures. We shouldn't need wizards to troubleshoot them, either!
  14. Distributed Tracing: Distributed tracing systems collect end-to-end latency graphs (traces) in near real-time. You can compare traces to understand why certain requests take longer than others.
  15. Distributed Tracing: A Span is an individual operation that took place. A span contains timestamped events and tags. A Trace is an end-to-end latency graph, composed of spans.
  16. A Span is an individual operation.
      Operation: POST /things (wombats:10.2.3.47:8080)
      Events: Server Received, Server Sent
      Tags: peer.ipv4 1.2.3.4, http.request-id abcd-ffe, http.request.size 15 MiB, http.url …&features=HD-uploads
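      In Zipkin's v1 JSON, a span along those lines might look roughly like this (ids and epoch-microsecond timestamps are illustrative; "sr" and "ss" are the Server Received and Server Sent events):

      [{
        "traceId": "5af7183fb1d4cf5f",
        "id": "5af7183fb1d4cf5f",
        "name": "post /things",
        "annotations": [
          {"timestamp": 1453735889103000, "value": "sr",
           "endpoint": {"serviceName": "wombats", "ipv4": "10.2.3.47", "port": 8080}},
          {"timestamp": 1453735890530000, "value": "ss",
           "endpoint": {"serviceName": "wombats", "ipv4": "10.2.3.47", "port": 8080}}
        ],
        "binaryAnnotations": [
          {"key": "peer.ipv4", "value": "1.2.3.4",
           "endpoint": {"serviceName": "wombats", "ipv4": "10.2.3.47", "port": 8080}}
        ]
      }]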
  17. A Trace is a graph of spans in context.
      (diagram: the same POST /things trace, with Wire Send, Store, and Async Store spans)
  18. Tracers create Spans. Tracers execute in your production apps! They are written to not log too much, and to not cause applications to crash. They:
      - propagate structural data in-band, and the rest out-of-band (example below)
      - have instrumentation or sampling policy to manage volume
      - often include opinionated instrumentation of layers such as HTTP
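      Concretely, Zipkin tracers propagate the in-band part as B3 headers on each outgoing call (the ids below are made up), while the timing data itself is reported to the tracing system out-of-band:

      $ curl -X POST http://wombats/things \
          -H 'X-B3-TraceId: 5af7183fb1d4cf5f' \
          -H 'X-B3-SpanId: 48485a3953bb6124' \
          -H 'X-B3-Sampled: 1'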
  19. Tracing Systems: Tracing systems collect, process and present data reported by tracers. They:
      - aggregate spans into trace trees
      - provide query and visualization for latency analysis
      - have retention policy (usually days)
  20. Zipkin has pluggable architecture. Tracers collect timing data and transport it over HTTP or Kafka. Collectors store spans in MySQL or Cassandra. Users query for traces via Zipkin's Web UI or API.

      cassandra:
        image: openzipkin/zipkin-cassandra:1.30.2
        ports:
          - 9042:9042
      query:
        image: openzipkin/zipkin-query:1.30.2
        environment:
          - TRANSPORT_TYPE=http
          - STORAGE_TYPE=cassandra
        ports:
          - 9411:9411
        links:
          - cassandra:storage
      web:
        image: openzipkin/zipkin-web:1.30.2
        ports:
          - 8080:8080
        environment:
          - TRANSPORT_TYPE=http
        links:
          - query
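      Assuming the snippet above is saved as docker-compose.yml, bringing the stack up and sanity-checking the query service might look like:

      $ docker-compose up -d
      $ curl -s localhost:9411/api/v1/services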
  21. Zipkin has a starter architecture. Tracing is new for a lot of folks. For many, the MySQL option is a good start, as it is familiar.

      mysql:
        image: openzipkin/zipkin-mysql:1.30.2
        ports:
          - 3306:3306
      query:
        image: openzipkin/zipkin-java:0.4.4
        environment:
          - TRANSPORT_TYPE=http
          - STORAGE_TYPE=mysql
        ports:
          - 9411:9411
        links:
          - mysql:storage
      web:
        image: openzipkin/zipkin-web:1.30.2
        ports:
          - 8080:8080
        environment:
          - TRANSPORT_TYPE=http
        links:
          - query
  22. Zipkin can be as simple as a single file:

      $ curl -SL https://jcenter.bintray.com/io/zipkin/java/zipkin-server/0.4.4/zipkin-server-0.4.4-exec.jar > zipkin-server.jar
      $ java -jar zipkin-server.jar

        .   ____          _            __ _ _
       /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
      ( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
       \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
        '  |____| .__|_| |_|_| |_\__, | / / / /
       =========|_|==============|___/=/_/_/_/
       :: Spring Boot ::        (v1.3.1.RELEASE)

      2016-01-25 15:15:16.456  INFO 94716 --- [           main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 94716 (/tmp/zipkin-server.jar started by acole in /tmp)
      —snip—

      $ curl -s localhost:9411/api/v1/services | jq .
      [
        "zipkin-query"
      ]
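      Once spans arrive, the same api answers trace queries; the serviceName parameter below matches the service listed above, and assumes at least one trace has been reported:

      $ curl -s 'localhost:9411/api/v1/traces?serviceName=zipkin-query&limit=1' | jq .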
  23. Zipkin lives in GitHub. Zipkin was created by Twitter in 2012. In 2015, OpenZipkin became the primary fork. OpenZipkin is an org on GitHub. It contains tracers, an OpenApi spec, service components and docker images. https://github.com/openzipkin https://gitter.im/openzipkin/zipkin
  24. Finagle (Scala) includes a Trace api and common integrations. Finagle automatically traces stacks like http, thrift, mysql, redis and memcached. Report to zipkin via Scribe by depending on finagle-zipkin. https://github.com/twitter/finagle
  25. Brave (Java) includes tracing apis and common integrations. You can automatically trace with integrations such as Jersey filters or Apache HTTP client interceptors. Brave also includes tracing apis for custom instrumentation. https://github.com/openzipkin/brave
  26. HTrace (C, Java) is a tracing framework for use with distributed systems. Hadoop includes commands to enable automatic tracing via HTrace. ZipkinSpanReceiver can report via Kafka or Scribe. HTrace apis can also be used directly. https://github.com/apache/incubator-htrace
  27. ZipkinTracer (Ruby) includes middleware for Rack and Faraday. If an incoming request to Rack results in an outgoing request via Faraday, they will be two spans in the same trace. https://github.com/openzipkin/zipkin-tracer
  28. ZipkinTracerModule (.NET) includes a tracing client and IHttpModule integration. By configuring ZipkinRequestContextModule, incoming requests will be automatically traced. You can also use ITracerClient for custom instrumentation. https://github.com/mdsol/Medidata.ZipkinTracerModule
  29. Spring Cloud Sleuth (Java) includes instrumentation for Spring Boot, and a streaming collector. Report to zipkin via HTTP by depending on spring-cloud-sleuth-zipkin. Spring Cloud Stream is also available, providing a more flexible and scalable pipeline. https://github.com/spring-cloud/spring-cloud-sleuth
  30. pyramid_zipkin (Python) is a Pyramid tween to add Zipkin service spans. By including 'pyramid_zipkin', incoming requests will be automatically traced. You can also use create_headers_for_new_span for outgoing requests. https://github.com/Yelp/pyramid_zipkin
  31. OpenTracing (Go, Python, Java, JavaScript) is an effort to clean up and de-risk distributed tracing instrumentation. OpenTracing interfaces decouple instrumentation from vendor-specific dependencies and terminology. This allows applications to switch products with less effort. http://opentracing.io/
  32. How does it work? A single configuration change to bind a Tracer implementation in main() or similar:

      import "github.com/opentracing/opentracing-go"
      import "github.com/tracer_x/tracerimpl"

      func main() {
          // Bind tracerimpl to the opentracing system
          opentracing.InitGlobalTracer(
              tracerimpl.New(kTracerImplAccessToken))
          ... normal main() stuff ...
      }

      ...and clean, vendor-neutral instrumentation code that naturally tells the story of a distributed operation:

      import "github.com/opentracing/opentracing-go"

      func AddContact(c *Contact) {
          sp := opentracing.StartTrace("AddContact")
          defer sp.Finish()
          sp.Info("Added contact: ", *c)
          subRoutine(sp, ...)
          ...
      }

      func subRoutine(parentSpan opentracing.Span, ...) {
          ...
          sp := opentracing.JoinTrace("subRoutine", parentSpan)
          defer sp.Finish()
          sp.Info("deferred work to subroutine")
          ...
      }

      Thanks, @el_bhs for the slide!
  33. You can POST Spigo flows to zipkin:

      $ run.sh
      —snip—
      $ curl -s 192.168.99.100:9411/api/v1/spans -X POST --data @json_metrics/lamp_flow.json -H "Content-Type: application/json"
  34. Where to go from here: Start by sending traces directly to a zipkin server. Grow into fanciness as you need it: sampling, kafka, spark etc. Remember you are not alone! gitter.im/openzipkin/zipkin