Set your sites on tracing

Deck giving an overview of the Zipkin sites project, including what's important from different angles.
https://cwiki.apache.org/confluence/display/ZIPKIN/Sites

First presented at Comcast, at an observability summit run by Prabha (follow https://twitter.com/kmprabha)

Adrian Cole

March 08, 2019

Transcript

  1. Set your sites on tracing
    An overview of distributed tracing practice
    @adrianfcole
    works at Pivotal
    works on Zipkin


  2. Introduction
    introduction
    a typical zipkin site
    wrapping up
    @adrianfcole
    #zipkin


  3. @adrianfcole
    • spring cloud at pivotal
    • focused on distributed tracing
    • helped open zipkin


  4. What is Distributed Tracing?
    Distributed tracing tracks production requests as they touch
    different parts of your architecture.
    Requests have a unique trace ID, which you can use to look up a
    trace diagram, or log entries related to it.
    Causal diagrams are easier to understand than scrolling through logs.
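
As a rough illustration of the log lookup described above, the sketch below groups log lines by trace ID. The log format (a `trace_id=` field) and the sample lines are assumptions made for this example; Zipkin does not prescribe a log format.

```python
import re
from collections import defaultdict

# Assumed log format: each line carries "trace_id=<hex>" somewhere in it.
TRACE_ID = re.compile(r"trace_id=([0-9a-f]{16,32})")


def index_by_trace_id(lines):
    """Group log lines by the trace ID they mention."""
    index = defaultdict(list)
    for line in lines:
        match = TRACE_ID.search(line)
        if match:
            index[match.group(1)].append(line)
    return dict(index)


# Made-up log lines and IDs, for illustration only.
logs = [
    "10:42:02 gateway trace_id=463ac35c9f6413ad POST /things accepted",
    "10:42:02 backend trace_id=463ac35c9f6413ad stored thing",
    "10:42:03 gateway trace_id=5af7183fb1d4cf5f GET /health",
]
related = index_by_trace_id(logs)["463ac35c9f6413ad"]
```

Here `related` holds the two lines that belong to the same request, even though they were written by different services.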


  5. Example Trace Diagram
    Wire Send Store
    Async Store
    Wire Send
    POST /things
    POST /things


  6. Why do I care?
    - Reduce time in triage by contextualizing errors and delays
    - Visualize latency like time in my service vs waiting for other services
    - Understand complex applications like async code or microservices
    - See your architecture with live dependency diagrams built from traces
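
To make the last point concrete, here is a minimal sketch of how a dependency diagram could be derived from traces: join each span to its parent and count the calls that cross a service boundary. The simplified span shape (`id`, `parent_id`, `service`) and the service names are assumptions for this example, and this is not Zipkin's actual dependency linker.

```python
from collections import Counter


def dependency_links(spans):
    """Count calls between services by joining each span to its parent."""
    service_of = {span["id"]: span["service"] for span in spans}
    links = Counter()
    for span in spans:
        parent = service_of.get(span.get("parent_id"))
        child = span["service"]
        # Only count edges that cross from one service to another.
        if parent is not None and parent != child:
            links[(parent, child)] += 1
    return links


# Hypothetical spans from one or more traces.
spans = [
    {"id": "a", "parent_id": None, "service": "gateway"},
    {"id": "b", "parent_id": "a", "service": "backend"},
    {"id": "c", "parent_id": "b", "service": "storage"},
    {"id": "d", "parent_id": "a", "service": "backend"},
]
links = dependency_links(spans)
```

Each key in `links` is a (caller, callee) edge of the diagram, and the value is how often that call was seen.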


  7. Distributed Tracing Vocabulary
    A Span is an individual operation that took place. A span contains timestamped
    events and tags.
    A Trace is an end-to-end latency graph, composed of spans.
    A Tracer records spans and passes the context required to connect them into a trace.
    Instrumentation uses a tracer to record a task, such as an HTTP request, as a span.


  8. A Span is an individual operation
    wombats:10.2.3.47:8080
    Operation
    POST /things
    Events
    Server Received a Request
    Server Sent a Response
    Tags
    request-id abcd-ffe
    http.status_code 202
    http.url …&features=HD-uploads
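
For readers who want to see this span as data, the sketch below expresses it in Zipkin v2 JSON. The service, endpoint, operation and tags come from the slide; the IDs, timestamp and duration are made up, and the truncated `http.url` tag is left out rather than guessed. In v2 format the "Server Received" and "Server Sent" events are implied by `kind: SERVER` together with the timestamp and duration.

```python
import json

span = {
    "traceId": "463ac35c9f6413ad",   # made-up ID
    "id": "463ac35c9f6413ad",        # made-up ID
    "kind": "SERVER",                # implies server receive/send events
    "name": "post /things",
    "timestamp": 1551609722455000,   # made-up start, epoch microseconds
    "duration": 25000,               # made-up duration, microseconds
    "localEndpoint": {
        "serviceName": "wombats",
        "ipv4": "10.2.3.47",
        "port": 8080,
    },
    "tags": {                        # tag values are strings
        "request-id": "abcd-ffe",
        "http.status_code": "202",
    },
}

# Spans are reported as a JSON list, even when there is only one.
payload = json.dumps([span])
```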


  9. Tracing is capturing important events
    Wire Send Store
    Async Store
    Wire Send
    POST /things
    POST /things


  10. Tracers record time, duration and host
    Wire Send Store
    Async Store
    Wire Send
    POST /things
    POST /things
    Tracers don’t decide what to record, instrumentation does. We’ll get to that.


  11. Tracers send trace data out of process
    Tracers propagate IDs in-band,
    to tell the receiver there’s a trace in progress
    Completed spans are reported out-of-band,
    to reduce overhead and allow for batching
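
The two paths can be sketched as follows. In-band propagation here uses Zipkin's B3 header names; the context dictionary, the helper names and the `BatchingReporter` class are hypothetical, written only for this example. Real tracers usually flush batches on a background thread or timer, which this sketch leaves out.

```python
import secrets


def new_id():
    """Return a random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def inject(context, headers):
    """In-band: copy the trace context into outgoing request headers."""
    headers["X-B3-TraceId"] = context["trace_id"]
    headers["X-B3-SpanId"] = context["span_id"]
    if context.get("parent_id"):
        headers["X-B3-ParentSpanId"] = context["parent_id"]
    headers["X-B3-Sampled"] = "1" if context["sampled"] else "0"
    return headers


def extract(headers):
    """In-band: read the caller's trace context from incoming headers."""
    return {
        "trace_id": headers.get("X-B3-TraceId"),
        "span_id": headers.get("X-B3-SpanId"),
        "parent_id": headers.get("X-B3-ParentSpanId"),
        "sampled": headers.get("X-B3-Sampled") == "1",
    }


class BatchingReporter:
    """Out-of-band: buffer finished spans and hand them off in batches."""

    def __init__(self, send, batch_size=100):
        self.send = send
        self.batch_size = batch_size
        self.buffer = []

    def report(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.send(batch)


caller = {"trace_id": new_id(), "span_id": new_id(), "parent_id": None, "sampled": True}
received = extract(inject(caller, {}))

sent_batches = []
reporter = BatchingReporter(sent_batches.append, batch_size=2)
for name in ("a", "b", "c"):
    reporter.report({"name": name})
reporter.flush()
```

The receiver sees the same trace ID as the caller, while the three finished spans leave the process in two batches rather than three separate sends.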


  12. How do I turn on tracing?
    A tracer is a utility library, similar to metrics or logging libraries. It is the mechanism
    used to trace an operation.
    Instrumentation is framework-specific code that uses a tracer to collect details
    such as the http url and request timing.
    Instrumentation must be configured and pointed to a tracing system for tracing to
    work. This is often done automatically with agents or frameworks like Spring Boot.
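
A toy version of that split might look like the sketch below. The `Tracer` class and `traced_handler` function are hypothetical names for this example, not the API of any Zipkin library: the tracer only handles timing and reporting, the instrumentation decides which HTTP details to record, and the configuration step is the line that tells the tracer where to report.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer: records timing for whatever instrumentation asks it to."""

    def __init__(self, report):
        self.report = report

    @contextmanager
    def span(self, name):
        span = {"name": name, "tags": {}, "timestamp": time.time_ns() // 1000}
        start = time.perf_counter_ns()
        try:
            yield span
        except Exception as error:
            span["tags"]["error"] = str(error)
            raise
        finally:
            # Duration in microseconds, rounded up to at least 1.
            span["duration"] = max((time.perf_counter_ns() - start) // 1000, 1)
            self.report(span)


def traced_handler(tracer, handler):
    """Toy instrumentation: decides what to record about an HTTP request."""

    def wrapper(method, path):
        with tracer.span(f"{method.lower()} {path}") as span:
            span["tags"]["http.method"] = method
            span["tags"]["http.path"] = path
            status = handler(method, path)
            span["tags"]["http.status_code"] = str(status)
            return status

    return wrapper


finished = []
tracer = Tracer(finished.append)  # configuration: where spans get reported
handle = traced_handler(tracer, lambda method, path: 202)
status = handle("POST", "/things")
```

In a real deployment the `report` callable would send spans to a tracing system, and frameworks such as Spring Boot typically wire this up for you.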


  13. Zipkin is a distributed tracing system


  14. Zipkin can be as simple as one file listening on one port
    $ curl -sSL https://zipkin.io/quickstart.sh | bash -s
    $ SELF_TRACING_ENABLED=true java -jar zipkin.jar
    (Zipkin ASCII-art startup banner)
    :: Powered by Spring Boot :: (v2.1.3.RELEASE)
    2019-03-03 10:42:02.455 INFO 25496 --- [ main] z.s.ZipkinServer : Starting ZipkinServer on MacBook-Pro-7.local with PID 25496 (/Users/acole/oss/sleuth-webmvc-example/zipkin.jar started by acole in /Users/acole/oss/sleuth-webmvc-example)
    —snip—
    $ curl -s localhost:9411/api/v2/services | jq .
    [
      "gateway"
    ]


  15. Zipkin Architecture
    Amazon
    Docker
    Google
    Kubernetes
    Cloud Foundry
    Tracers report spans over HTTP,
    Kafka or RabbitMQ.
    Servers collect spans, storing them in
    MySQL, Cassandra, or Elasticsearch.
    Users query for traces via Zipkin’s
    Web UI or API.
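
As a small illustration of the HTTP reporting path, the sketch below builds the request a tracer might make to a Zipkin server's v2 span endpoint. The helper name `build_report_request` and the sample span are assumptions for this example, and the default URL assumes a server listening locally on port 9411. The request is built but deliberately not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_report_request(spans, base_url="http://localhost:9411"):
    """Build (but do not send) a POST of JSON-encoded spans."""
    return urllib.request.Request(
        url=f"{base_url}/api/v2/spans",
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up span, for illustration only.
spans = [
    {
        "traceId": "463ac35c9f6413ad",
        "id": "463ac35c9f6413ad",
        "name": "post /things",
        "timestamp": 1551609722455000,
        "duration": 25000,
        "localEndpoint": {"serviceName": "gateway"},
    }
]
request = build_report_request(spans)
# With a server running, urllib.request.urlopen(request) would send it.
```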


  16. A typical Zipkin site
    introduction
    a typical zipkin site
    wrapping up
    @adrianfcole
    #zipkin


  17. What is a Zipkin site?
    Site owner: End user who champions Zipkin as a part of additional roles in their
    company. Many site owners are part time, yet contribute back to open source.
    Zipkin site: Production deployment of distributed tracing, which considers Zipkin
    format, instrumentation or backends strategic to their observability function.


  18. What information do we collect on Zipkin sites?
    * Introduction of the company context and team on tracing
    * System overview from application until visualization/analysis
    * Site-specific data conventions such as how services are named
    * Why tracing is important, goals and service level agreements
    * Status such as adoption, ingestion and costs incurred


  19. Why bother with tracing?
    Ascend Money says: Measure latency improvements before and after
    refactoring the services. Identify non-conformant service communications that
    deviate from the design.
    Hotels.com says: helps in pointing out the worst offenders and by making it
    easier to identify performance improvements such as network calls that could
    be done in parallel.
    Netflix says: The business value is in providing operational visibility into the
    systems and enhance developer productivity.


  20. What kind of infrastructure is involved?
    Effective tracing matches
    the architecture and skillset
    of the site owners.
    Rarely do sites choose the
    same application and tracing
    infrastructure.


  21. So, a site doesn’t only run Zipkin server?
    Zipkin Server is the canonical backend which receives Zipkin
    format, and presents a UI. Some don’t run Zipkin server, or also run
    other products for various reasons.
    * SaaS preference
    * APM integration
    * Hybrid setup


  22. And.. applications don’t always use Zipkin libraries?!
    Zipkin curates propagation and trace formats which decouple
    sites from a mandate of using our code. By producing the same
    data, applications have more flexibility and choice.
    * 3rd party libraries
    * Proxy (service mesh) integration
    * In-house custom tools


  23. Let’s look at a site that once used Zipkin server
    Hotels.com started with a Zipkin backend, but are transitioning to Expedia
    Haystack, which provides more features like adaptive alerting.
    https://github.com/ExpediaDotCom/haystack
    Applications still emit data in Zipkin v2 format, which is forwarded to Haystack
    with a tool they created called Pitchfork. Developers still use Zipkin on their
    laptops for local troubleshooting, as it is easy to run.


  24. Let’s look at a site that didn’t initially use Zipkin server
    Netflix created a Dapper-based tracing system to trace RPC calls involved in video
    streaming. This included framework libraries to produce trace headers and data.
    As Spring Boot became prevalent, Zipkin became more useful as it is built into the
    tracing library Spring Cloud Sleuth.
    Netflix convert legacy spans into Zipkin v2 format in their Kafka/Flink pipeline. This
    allows traces to stitch together for query and analysis.


  25. Let’s look at a site that never used Zipkin server
    Infostellar's architecture runs in Google Cloud, except ground station software
    that runs locally at an antenna site.
    Many components trace with Zipkin libraries, some with OpenTracing, some
    homegrown. All use Zipkin’s B3 format for propagation.
    Even when using Zipkin libraries, data is sent directly to Google Stackdriver for
    query and analysis. There’s no Zipkin server footprint at Infostellar.


  26. Let’s look at a site that uses stock Zipkin server
    Medidata is an entirely AWS architecture, using the zipkin-aws
    image, which allows HTTP and SQS span collection. They collect 100%
    of data into AWS-managed Elasticsearch storage.
    While the zipkin service is standard, Medidata has a service that
    reads trace data from Elasticsearch, comparing it with performance
    objectives in APIs and issuing alerts when performance degrades.
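
The comparison step could look something like the sketch below. This is not Medidata's implementation: the function name, the objective table and the sample spans are all hypothetical, and reading the spans from Elasticsearch is left out entirely.

```python
def find_violations(spans, objectives_ms):
    """Return (name, duration_ms, objective_ms) for spans over their objective."""
    violations = []
    for span in spans:
        objective = objectives_ms.get(span["name"])
        duration_ms = span["duration"] / 1000  # Zipkin durations are in microseconds
        if objective is not None and duration_ms > objective:
            violations.append((span["name"], duration_ms, objective))
    return violations


# Hypothetical objective: POST /things should complete within 200 ms.
objectives_ms = {"post /things": 200}
spans = [
    {"name": "post /things", "duration": 150_000},
    {"name": "post /things", "duration": 450_000},
    {"name": "get /health", "duration": 900_000},  # no objective, ignored
]
violations = find_violations(spans, objectives_ms)
```

An alerting service would then turn each entry in `violations` into a notification, perhaps only after several occur in a row.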


  27. Besides architecture, what’s different across sites?
    Data collection policy: Typeform always provision request IDs. Infostellar use
    antenna, satellite and plan tags for business context. LINE add company-
    specific tags like phase and instance ID. Expedia Haystack scrubs secrets.
    Data retention policy: Medidata retain 100% for 100 days. Netflix sample
    100% of FIT experiments, 0.1% otherwise. SoundCloud retain a very low sample
    rate for 7 days.
    Tracing adoption rate: LINE traces only one team’s services, Ascend is under 50%,
    and Tyro is over 90%.
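
A retention policy of the "everything for experiments, a small fraction otherwise" kind can be sketched as below. This is an illustration under assumptions, not Netflix's implementation: the function name and the `is_experiment` flag are hypothetical, and deriving the decision from the trace ID is one common way to make every service reach the same answer. In practice the decision is usually made once at the start of a trace and propagated with it.

```python
def should_sample(trace_id, is_experiment, rate=0.001):
    """Keep every experiment trace; keep a fixed fraction of everything else."""
    if is_experiment:
        return True
    # Use the lower 64 bits of the trace ID so the decision is repeatable.
    bucket = int(trace_id[-16:], 16) % 10_000
    return bucket < rate * 10_000


# Made-up trace IDs, chosen to land on either side of the 0.1% cutoff.
kept = should_sample("0000000000000005", is_experiment=False)
dropped = should_sample("0000000000000fff", is_experiment=False)
experiment = should_sample("0000000000000fff", is_experiment=True)
```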


  28. How do sites get started with tracing?
    Proxy: starting traces at a proxy can raise visibility of upstream and
    downstream. Typeform initialise a trace and request ID in their custom proxy.
    Single service: Hotels.com recognised that even though tracing is a team sport, starting
    with a single service can still add value.
    New Framework: Sites like Ascend rolled out tracing in new applications as it was
    supported out of the box with Spring Boot (via Spring Cloud Sleuth).
    Green Field: Infostellar engineers had previous experience with tracing, and built
    their platform with tracing in mind.


  29. Wrapping Up
    @adrianfcole
    #zipkin
    introduction
    a typical zipkin site
    wrapping up


  30. Wrapping up
    Contribute https://cwiki.apache.org/confluence/display/ZIPKIN/Sites
    Check our “Last Month In Zipkin”
    Chat any time on Gitter
    @adrianfcole
    #zipkin
    gitter.im/openzipkin/zipkin
    github.com/openzipkin/zipkin
