Introduction to Distributed Tracing and Zipkin at DevOpsDays Singapore

Adrian Cole
October 08, 2016

30-minute deck. The twist here is that the demo uses JavaScript, too.

http://www.devopsdays.org/events/2016-singapore/welcome/

Transcript

  1. © 2016 Pivotal
    An introduction to Distributed Tracing and Zipkin
    Adrian Cole, Pivotal
    @adrianfcole
    How to Properly Blame Things for Causing Latency

  2. Introduction
    introduction
    understanding latency
    distributed tracing
    zipkin
    demo
    wrapping up
    @adrianfcole
    #zipkin

  3. @adrianfcole
    • spring cloud at pivotal
    • focused on distributed tracing
    • helped open zipkin

  4. Understanding Latency
    introduction
    understanding latency
    distributed tracing
    zipkin
    demo
    wrapping up
    @adrianfcole
    #zipkin

  5. Understanding our architecture
    Microservice and data pipeline architectures are often a graph of
    components, distributed across a network.
    A call graph or data flow can become delayed or fail due to the nature of the
    operation, components, or edges between them.
    We want to understand our current architecture and troubleshoot latency
    problems, in production.

  6. Why is POST /things slow?
    POST /things

  7. POST /things
    There are often two sides to the story
    Client side: Sent 15:31:28:500, Received 15:31:31:000
    Duration: 2500 milliseconds
    Server side (POST /things): Received 15:31:29:103, Sent 15:31:30:530
    Duration: 1427 milliseconds
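
The two durations on this slide can be reproduced with a little arithmetic. The sketch below is illustrative only: the helper names `parse` and `millis_between` are made up for this example, and it assumes the client and server clocks are comparable, which is not guaranteed in practice.

```python
from datetime import datetime, timedelta


def parse(stamp: str) -> datetime:
    """Parse an HH:MM:SS:mmm timestamp as written on the slide."""
    return datetime.strptime(stamp, "%H:%M:%S:%f")


def millis_between(start: str, end: str) -> int:
    """Whole milliseconds elapsed between two slide-style timestamps."""
    return (parse(end) - parse(start)) // timedelta(milliseconds=1)


# Client view: from sending the request to receiving the response.
client_ms = millis_between("15:31:28:500", "15:31:31:000")  # 2500

# Server view: from receiving the request to sending the response.
server_ms = millis_between("15:31:29:103", "15:31:30:530")  # 1427

# Time the client waited that the server never saw (network, queues, etc.).
outside_server_ms = client_ms - server_ms  # 1073

print(client_ms, server_ms, outside_server_ms)
```

The 1073 ms gap is the part of the story that neither side can explain alone, which is why both views need to be recorded.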

  8. and not all operations are on the critical path
    Wire Send Store
    Async Store
    Wire Send
    POST /things
    POST /things

  9. and not all operations are relevant
    Wire Send Store Async
    Async Store Failed
    Wire Send
    POST /things
    POST /things
    KQueueArrayWrapper.kev
    UnboundedFuturePool-2
    SelectorUtil.select
    LockSupport.parkNan ReferenceQueue.remove

  10. Service architecture isn’t this simple anymore
    Single-server scenarios aren’t
    realistic or don’t fully explain
    latency.
    David Vignoni Gnome-fs-server.svg

  11. Can we make troubleshooting wizard-free?
    We no longer need wizards to
    deploy complex architectures.
    We shouldn’t need wizards to
    troubleshoot them, either!

  12. Distributed Tracing
    introduction
    understanding latency
    distributed tracing
    zipkin
    demo
    wrapping up
    @adrianfcole
    #zipkin

  13. Distributed Tracing commoditizes knowledge
    Distributed tracing systems collect end-to-end latency graphs
    (traces) in near real-time.
    You can compare traces to understand why certain requests
    take longer than others.

  14. Distributed Tracing Vocabulary
    A Span is an individual operation that took place. A span
    contains timestamped events and tags.
    A Trace is an end-to-end latency graph, composed of spans.

  15. wombats:10.2.3.47:8080
    A Span is an individual operation
    Operation: POST /things
    Events: Server Received, Server Sent
    Tags:
    peer.ipv4 1.2.3.4
    http.request-id abcd-ffe
    http.request.size 15 MiB
    http.url …&features=HD-uploads
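
One way to picture the span on this slide is as a small record holding an operation name, timestamped events, and tags. The sketch below is a simplified model, not Zipkin's actual data format: the `Span` class, its field names, and the microsecond offsets are assumptions made for illustration, while the tag values are copied from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Span:
    """Illustrative span: one operation with timestamped events and tags."""
    operation: str
    service: str
    # Each event is (timestamp in microseconds, event name).
    events: List[Tuple[int, str]] = field(default_factory=list)
    tags: Dict[str, str] = field(default_factory=dict)

    def duration_micros(self) -> int:
        """Time between the earliest and latest event, or 0 if fewer than two."""
        if len(self.events) < 2:
            return 0
        times = [timestamp for timestamp, _ in self.events]
        return max(times) - min(times)


span = Span(
    operation="POST /things",
    service="wombats",
    # Offsets are hypothetical; they mirror the 1427 ms server duration.
    events=[(0, "Server Received"), (1_427_000, "Server Sent")],
    tags={
        "peer.ipv4": "1.2.3.4",
        "http.request-id": "abcd-ffe",
        "http.request.size": "15 MiB",
        "http.url": "…&features=HD-uploads",
    },
)

print(span.operation, span.duration_micros())
```

Events answer "when did it happen", while tags answer "what was special about this request".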

  16. Tracing Systems are Observability Tools
    Tracing systems collect, process and present data reported by tracers.
    - aggregate spans into trace trees
    - provide query and visualization focused on latency
    - have retention policy (usually days)
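
The first bullet, aggregating spans into trace trees, can be sketched in a few lines. This is a toy illustration and not how Zipkin implements it: the function name `build_trace_tree`, the dictionary keys, and the sample spans are assumptions chosen for the example.

```python
from collections import defaultdict


def build_trace_tree(spans):
    """Return indented lines showing parent/child structure of one trace."""
    known_ids = {span["id"] for span in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parentId")
        # A span with no (known) parent is treated as a root.
        if parent is None or parent not in known_ids:
            roots.append(span)
        else:
            children[parent].append(span)

    def render(span, depth=0):
        lines = ["  " * depth + span["name"]]
        # Children are listed in the order they started.
        for child in sorted(children[span["id"]], key=lambda c: c["timestamp"]):
            lines.extend(render(child, depth + 1))
        return lines

    output = []
    for root in sorted(roots, key=lambda s: s["timestamp"]):
        output.extend(render(root))
    return output


# Spans may arrive out of order; timestamps here are hypothetical.
spans = [
    {"id": "c", "parentId": "b", "name": "store", "timestamp": 30},
    {"id": "a", "parentId": None, "name": "post /things (client)", "timestamp": 0},
    {"id": "b", "parentId": "a", "name": "post /things (server)", "timestamp": 10},
    {"id": "d", "parentId": "b", "name": "async store", "timestamp": 40},
]

tree_lines = build_trace_tree(spans)
print("\n".join(tree_lines))
```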

  17. ProTip: Tracing is not just for latency
    Some wins unrelated to latency
    - Understand your architecture
    - Find services that aren’t used
    - Reduce time spent on triage

  18. Zipkin
    introduction
    understanding latency
    distributed tracing
    zipkin
    demo
    wrapping up
    @adrianfcole
    #zipkin

  19. Zipkin is a distributed tracing system

  20. Zipkin lives in GitHub
    Zipkin was created by Twitter in 2012. In 2015, OpenZipkin
    became the primary fork.
    OpenZipkin is an org on GitHub. It contains tracers, an OpenAPI spec,
    service components and Docker images.
    https://github.com/openzipkin

  21. Zipkin Architecture
    Platform frameworks for Zipkin:
    Bosh (Cloud Foundry)
    Docker (in Zipkin’s org)
    Kubernetes
    Mesos
    Tracers report spans via HTTP or Kafka.
    Servers collect spans, storing them
    in MySQL, Cassandra, or
    Elasticsearch.
    Users query for traces via Zipkin’s
    Web UI or API.
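
To make "tracers report spans via HTTP" concrete, here is a rough sketch of what a hand-rolled report might look like. It assumes a Zipkin server on localhost:9411 and the v1 JSON span layout as I understand it from that era (`annotations` for events, `binaryAnnotations` for tags); the helper names and sample IDs are hypothetical, and a real tracer library would normally do this for you.

```python
import json
import urllib.request

# Assumption: a local Zipkin server on its default port.
ZIPKIN_URL = "http://localhost:9411/api/v1/spans"


def make_span(trace_id, span_id, name, start_micros, duration_micros, service):
    """Build one span as a dict in the assumed v1 JSON layout."""
    endpoint = {"serviceName": service, "ipv4": "10.2.3.47", "port": 8080}
    return {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": start_micros,
        "duration": duration_micros,
        # "sr" and "ss" stand for Server Received and Server Sent.
        "annotations": [
            {"timestamp": start_micros, "value": "sr", "endpoint": endpoint},
            {"timestamp": start_micros + duration_micros, "value": "ss",
             "endpoint": endpoint},
        ],
        "binaryAnnotations": [
            {"key": "http.request-id", "value": "abcd-ffe", "endpoint": endpoint},
        ],
    }


def build_request(spans, url=ZIPKIN_URL):
    """Prepare (but do not send) the HTTP POST carrying a list of spans."""
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def report(spans, url=ZIPKIN_URL):
    """Send the spans and return the HTTP status code."""
    with urllib.request.urlopen(build_request(spans, url), timeout=5) as response:
        return response.status


# Hypothetical IDs and an arbitrary epoch-microsecond start time.
span = make_span("463ac35c9f6413ad", "463ac35c9f6413ad", "post /things",
                 1475884800000000, 1427000, "wombats")
request = build_request([span])
print(request.get_method(), request.full_url)
```

Calling `report([span])` would perform the actual POST; it is left out of the example run so the sketch works without a server.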

  22. Zipkin has a starter architecture
    Tracing is new for a lot of
    folks.
    For many, the MySQL option
    is a good start, as it is familiar.
    services:
      storage:
        image: openzipkin/zipkin-mysql
        container_name: mysql
        ports:
          - 3306:3306
      server:
        image: openzipkin/zipkin
        environment:
          - STORAGE_TYPE=mysql
          - MYSQL_HOST=mysql
        ports:
          - 9411:9411
        depends_on:
          - storage

  23. Zipkin can be as simple as a single file
    $ curl -SL 'https://search.maven.org/remote_content?g=io.zipkin.java&a=zipkin-server&v=LATEST&c=exec' > zipkin.jar
    $ SELF_TRACING_ENABLED=true java -jar zipkin.jar
      .   ____          _            __ _ _
     /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
    ( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
     \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
      '  |____| .__|_| |_|_| |_\__, | / / / /
     =========|_|==============|___/=/_/_/_/
     :: Spring Boot ::        (v1.4.0.RELEASE)
    2016-08-01 18:50:07.098 INFO 8526 --- [ main] zipkin.server.ZipkinServer : Starting ZipkinServer on acole with PID 8526 (/Users/acole/oss/sleuth-webmvc-example/zipkin.jar started by acole in /Users/acole/oss/sleuth-webmvc-example)
    —snip—
    $ curl -s localhost:9411/api/v1/services|jq .
    [
    "zipkin-server"
    ]

  24. Demo
    introduction
    understanding latency
    distributed tracing
    zipkin
    demo
    wrapping up
    @adrianfcole
    #zipkin

  25. Two Spring Boot (Java) services collaborate over http.
    Zipkin will show how long the whole operation took, as
    well as how much time was spent in each service.
    https://github.com/openzipkin/sleuth-webmvc-example
    Distributed Tracing across Spring Boot apps
    https://github.com/openzipkin/zipkin-js-example

  26. Web requests in the demo are served by Spring MVC controllers.
    Tracing of these is performed automatically by Spring Cloud Sleuth.
    Spring Cloud Sleuth reports to Zipkin via HTTP by depending on
    spring-cloud-sleuth-zipkin.
    https://cloud.spring.io/spring-cloud-sleuth/
    Spring Cloud Sleuth Java

  27. Wrapping Up
    introduction
    understanding latency
    distributed tracing
    zipkin
    demo
    wrapping up
    @adrianfcole
    #zipkin

  28. Wrapping up
    Start by sending traces directly to a zipkin server.
    Grow into fanciness as you need it: sampling, streaming, etc.
    Remember you are not alone!
    @adrianfcole
    #zipkin
    gitter.im/spring-cloud/spring-cloud-sleuth
    gitter.im/openzipkin/zipkin
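
As one example of the "fanciness" mentioned above, sampling can be sketched as a decision derived from the trace ID, so that every service seeing the same trace reaches the same answer. This is a simplified illustration and not Zipkin's actual sampler: the function name `is_sampled` and the bucketing scheme are assumptions, and it presumes trace IDs are roughly uniformly random hex strings.

```python
def is_sampled(trace_id_hex: str, rate: float) -> bool:
    """Decide whether to keep a trace, based only on its ID and the rate."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Map the trace ID onto one of 10,000 buckets; keep the lowest ones.
    bucket = int(trace_id_hex, 16) % 10_000
    return bucket < int(rate * 10_000)


# Hypothetical trace ID; the same ID always yields the same decision.
trace_id = "463ac35c9f6413ad"
print(is_sampled(trace_id, 1.0), is_sampled(trace_id, 0.0))
```

Because the decision depends only on the trace ID, no coordination between services is needed to keep or drop a whole trace consistently.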
