Set your sites on tracing - Cloud Native Summit Wellington

Slide 1

Slide 1 text

Set your sites on tracing An overview of distributed tracing practice @adrianfcole works at Pivotal works on Zipkin

Slide 2

Slide 2 text

Introduction introduction a typical zipkin site wrapping up @adrianfcole #zipkin

Slide 3

Slide 3 text

@adrianfcole • spring cloud at pivotal • focused on distributed tracing • helped open zipkin

Slide 4

Slide 4 text

What is Distributed Tracing? Distributed tracing tracks production requests as they touch different parts of your architecture. Requests have a unique trace ID, which you can use to look up a trace diagram, or log entries related to it. Causal diagrams are easier to understand than scrolling through logs.
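
As a rough illustration of looking up log entries by trace ID, the sketch below groups log lines by the trace ID they mention. The log layout and the `traceId=` field are assumptions made for this example; Zipkin does not mandate a log format.

```python
import re
from collections import defaultdict

# Hypothetical log format: each line carries "traceId=<hex>" somewhere in it.
TRACE_ID = re.compile(r"traceId=([0-9a-f]{16,32})")

def index_by_trace(lines):
    """Group log lines by the trace ID they mention; lines without one are skipped."""
    index = defaultdict(list)
    for line in lines:
        match = TRACE_ID.search(line)
        if match:
            index[match.group(1)].append(line)
    return index

logs = [
    "12:00:01 api-proxy traceId=463ac35c9f6413ad GET /orders",
    "12:00:01 auth-api traceId=463ac35c9f6413ad token ok",
    "12:00:02 api-proxy traceId=5af7183fb1d4cf5f GET /health",
]
by_trace = index_by_trace(logs)
```

Here `by_trace["463ac35c9f6413ad"]` holds the two lines from different services that belong to the same request.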

Slide 5

Slide 5 text

Why do I care? - Reduce time in triage by contextualizing errors and delays - Visualize latency like time in my service vs waiting for other services - Understand complex applications like async code or microservices - See your architecture with live dependency diagrams built from traces

Slide 6

Slide 6 text

Example long request in Zipkin (a tracing system) https://twitter.com/zipkinproject https://github.com/openzipkin/zipkin This top line shows the service time, the part of response time a service owner can directly affect

Slide 7

Slide 7 text

We use it to solve our own latency problems. This dotted line is how much latency we took off worst-case request performance!

Slide 8

Slide 8 text

Example service diagram (of a dishwasher) https://github.com/SmartThingsOSS

Slide 9

Slide 9 text

How do I turn on tracing? A tracer is a utility library, similar to metrics or logging libraries. It is the mechanism used to trace an operation. Instrumentation is framework-specific code that uses a tracer to collect details such as the HTTP URL and request timing. Instrumentation must be configured and pointed to a tracing system for tracing to work. This is often done automatically with agents or frameworks like Spring Boot.
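
The split between tracer and instrumentation can be sketched with a toy example. `ToyTracer` and `handle_request` are hypothetical names invented for this sketch, not part of any Zipkin library; a real tracer would also report finished spans to a tracing system instead of keeping them in memory.

```python
import random
import time
from contextlib import contextmanager

class ToyTracer:
    """Hypothetical in-memory tracer; real tracers also report spans out of process."""

    def __init__(self):
        self.finished = []  # completed spans, as plain dicts
        self._stack = []    # spans currently in progress

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        span = {
            # A child shares its parent's trace ID; a root span starts a new trace.
            "traceId": parent["traceId"] if parent else "%016x" % random.getrandbits(64),
            "id": "%016x" % random.getrandbits(64),
            "parentId": parent["id"] if parent else None,
            "name": name,
            "tags": {},
        }
        start = time.perf_counter()
        self._stack.append(span)
        try:
            yield span
        finally:
            self._stack.pop()
            span["durationMicros"] = int((time.perf_counter() - start) * 1_000_000)
            self.finished.append(span)

tracer = ToyTracer()

def handle_request(url):
    """Plays the role of instrumentation: it knows framework details such as the URL."""
    with tracer.span("get") as span:
        span["tags"]["http.url"] = url
        with tracer.span("query-db"):
            time.sleep(0.001)  # stand-in for real work

handle_request("/orders/42")
```

After the call, `tracer.finished` holds two spans with the same trace ID: the inner `query-db` span first, then the outer `get` span that carries the `http.url` tag.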

Slide 10

Slide 10 text

Zipkin can be as simple as one file listening on one port
$ curl -sSL https://zipkin.apache.org/quickstart.sh | bash -s
$ SELF_TRACING_ENABLED=true java -jar zipkin.jar
(Zipkin ASCII-art startup banner)
:: Powered by Spring Boot :: (v2.1.4.RELEASE)
2019-05-16 14:52:44.695 INFO 30176 --- [ main] z.s.ZipkinServer : Starting ZipkinServer v2.13.0 on MacBook-Pro-7.local with PID 30176 (/private/tmp/zipkin.jar started by acole in /private/tmp)
--snip--
$ curl -s localhost:9411/api/v2/services|jq .
[
  "api-proxy",
  "auth-api",
  "phoenix"
]
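
To show what a locally running server accepts, here is a minimal sketch that builds one span in Zipkin v2 JSON shape and can post it to the v2 span endpoint. The helper names `make_span` and `post_spans`, the IDs, and the span name are made up for illustration; the actual POST is left commented out because it needs a server on port 9411.

```python
import json
import time
import urllib.request

def make_span(trace_id, span_id, name, service, duration_us, parent_id=None):
    """Build one span as a dict in Zipkin v2 JSON shape."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": int(time.time() * 1_000_000),  # epoch microseconds
        "duration": duration_us,                    # microseconds
        "localEndpoint": {"serviceName": service},
    }
    if parent_id:
        span["parentId"] = parent_id
    return span

def post_spans(spans, base_url="http://localhost:9411"):
    """POST a JSON list of spans to the server's v2 span endpoint."""
    request = urllib.request.Request(
        base_url + "/api/v2/spans",
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

span = make_span("463ac35c9f6413ad", "a2fb4a1d1a96d312", "get /orders", "api-proxy", 1500)
payload = json.dumps([span])
# post_spans([span])  # uncomment with a Zipkin server running locally
```

Once a span like this is stored, its service name should appear in the `/api/v2/services` listing queried above.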

Slide 11

Slide 11 text

A typical Zipkin site introduction a typical zipkin site wrapping up @adrianfcole #zipkin

Slide 12

Slide 12 text

What is a Zipkin site? Site owner: End user who champions Zipkin as a part of additional roles in their company. Many site owners are part time, yet contribute back to open source. Zipkin site: Production deployment of distributed tracing, which considers Zipkin format, instrumentation or backends strategic to their observability function. For example, some sites use tools like zipkin-php or zipkin-go to collect data, but export it to Google Stackdriver for analysis and visualization. Others use our data format in a tracing pipeline that gathers data from Zipkin tools as well as alternative or legacy internal ones. Some use Zipkin backends, but gather data with other agent technology, such as SkyWalking. By sharing real life setups, you can ideally understand the different approaches that coexist to solve the problem of tracing.

Slide 13

Slide 13 text

What information do we collect on Zipkin sites?
* Introduction of the company context and team on tracing
* System overview from application until visualization/analysis
* Site-specific data conventions such as how services are named
* Why tracing is important, goals and service level agreements
* Status like adoption, ingestion and costs incurred
System Overview
* instrumentation - approach, platforms supported
* data ingestion - formats, data pipeline, sampling
* data store and aggregation - data at rest, retention, indexing, cleansing
* realtime and batch analysis - techniques, visualizations, UI, tooling
Site-specific data conventions
* service name - what is the source of your Zipkin service name? Does it come from discovery? Is it used in other tools like metrics?
* site-specific tags - which tags do you rely on for search or aggregation? For example, do you add correlation or environment IDs to spans? Which are fixed cardinality?
Goals
* What near, middle and long term milestones exist?
* What value is the business looking to receive?
* What improvements are you looking to further?
* What other projects relate to your tracing goals?

Slide 14

Slide 14 text

Why bother with tracing?
Ascend Money says: Measure latency improvements before and after refactoring the services. Identify non-conformant service communications that deviate from the design.
Hotels.com says: helps in pointing out the worst offenders and by making it easier to identify performance improvements such as network calls that could be done in parallel.
Netflix says: The business value is in providing operational visibility into the systems and enhancing developer productivity.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95651348
https://cwiki.apache.org/confluence/display/ZIPKIN/Hotels.com
https://cwiki.apache.org/confluence/display/ZIPKIN/Netflix

Slide 15

Slide 15 text

What kind of infrastructure is involved? Effective tracing matches the architecture and skillset of the site owners. Sites have different application and tracing infrastructures.

Slide 16

Slide 16 text

So, a site doesn’t only run Zipkin server? Zipkin Server is the canonical backend, which receives Zipkin format and presents a UI. Some sites don’t run Zipkin server, or also run other products, for various reasons:
* SaaS preference: e.g. Infostellar use Google Stackdriver
* APM integration: e.g. hotels.com use a centralised Expedia Haystack, which also adds anomaly detection
* Hybrid (pipeline) setup: e.g. Yelp have “firehose” collectors which forward to a Zipkin server

Slide 17

Slide 17 text

And.. applications don’t always use Zipkin libraries?! Zipkin curates propagation and trace formats which decouple sites from a mandate of using our code. By producing the same data, applications have more flexibility and choice.
* 3rd party libraries - e.g. OpenCensus and some OpenTracing libraries
* Proxies (service mesh) integration - e.g. linkerd and envoy do not use Zipkin libraries for tracing, but they emit our formats
* In-house custom tools - e.g. utilities that are not yet open sourced or are too bespoke

Slide 18

Slide 18 text

Let’s look at a site that once used Zipkin server Hotels.com started with a Zipkin backend, but are transitioning to Expedia Haystack, which provides more features like adaptive alerting. https://github.com/ExpediaDotCom/haystack Applications still emit data in Zipkin v2 format, which is forwarded to Haystack with a tool they created called Pitchfork. Developers still use Zipkin on their laptops for local troubleshooting, as it is easy to run. Hotels.com is part of the Expedia Group and is a website for booking hotel rooms online. Its tracing team operates from London. https://cwiki.apache.org/confluence/display/ZIPKIN/Hotels.com

Slide 19

Slide 19 text

Let’s look at a site that didn’t initially use Zipkin server Netflix created a Dapper-based tracing system to trace RPC calls involved in video streaming. This included framework libraries to produce trace headers and data. As Spring Boot became prevalent, Zipkin became more useful as it is built into the tracing library Spring Cloud Sleuth. Netflix convert legacy spans into Zipkin v2 format in their Kafka/Flink pipeline. This allows traces to stitch together for query and analysis. Netflix is a video streaming service. Its tracing team operates from Silicon Valley. https://cwiki.apache.org/confluence/display/ZIPKIN/Netflix

Slide 20

Slide 20 text

Let’s look at a site that never used Zipkin server Infostellar architecture runs in Google Cloud, except ground station software that runs locally at an antenna site. Many components trace with Zipkin libraries, some with OpenTracing, some homegrown. All use Zipkin’s B3 format for propagation. Even when using Zipkin libraries, data is sent directly to Google Stackdriver for query and analysis. There’s no Zipkin server footprint at Infostellar. Infostellar (StellarStation) is a space communications infrastructure firm. Its tracing team operates from Tokyo. https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95655004
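
Because B3 propagation is just a set of HTTP headers, components built with different libraries can join the same trace. The following is a simplified sketch of the multi-header form; the function names and the dict-based context are assumptions for illustration, and details such as the debug flag and the single `b3` header are left out.

```python
import random

def new_id():
    """64-bit identifier as 16 lowercase hex characters."""
    return "%016x" % random.getrandbits(64)

def inject(context):
    """Turn a trace context dict into B3 multi-header form for an outgoing request."""
    headers = {
        "X-B3-TraceId": context["trace_id"],
        "X-B3-SpanId": context["span_id"],
    }
    if context.get("parent_id"):
        headers["X-B3-ParentSpanId"] = context["parent_id"]
    if context.get("sampled") is not None:
        headers["X-B3-Sampled"] = "1" if context["sampled"] else "0"
    return headers

def extract(headers):
    """Read B3 headers from an incoming request; header names are case-insensitive."""
    lower = {key.lower(): value for key, value in headers.items()}
    if "x-b3-traceid" not in lower or "x-b3-spanid" not in lower:
        return None  # no usable trace context on this request
    sampled = lower.get("x-b3-sampled")
    return {
        "trace_id": lower["x-b3-traceid"],
        "span_id": lower["x-b3-spanid"],
        "parent_id": lower.get("x-b3-parentspanid"),
        "sampled": None if sampled is None else sampled == "1",
    }

def child_of(context):
    """Start a child span: same trace ID, new span ID, parent is the caller's span."""
    return {
        "trace_id": context["trace_id"],
        "span_id": new_id(),
        "parent_id": context["span_id"],
        "sampled": context["sampled"],
    }

incoming = {
    "x-b3-traceid": "463ac35c9f6413ad",
    "x-b3-spanid": "a2fb4a1d1a96d312",
    "x-b3-sampled": "1",
}
outgoing = inject(child_of(extract(incoming)))
```

The outgoing headers keep the incoming trace ID and sampling decision, while the caller's span ID becomes the parent of a freshly generated span ID.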

Slide 21

Slide 21 text

Let’s look at a site that uses stock Zipkin server Medidata is an entirely AWS architecture, using the zipkin-aws image, which allows HTTP and SQS span collection. They collect 100% of data into AWS-managed Elasticsearch storage. While the zipkin service is standard, Medidata has a service that reads trace data from Elasticsearch, comparing it with performance objectives in APIs and issuing alerts when performance degrades. Medidata is the largest provider of software for clinical trials. https://cwiki.apache.org/confluence/display/ZIPKIN/Medidata
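
The idea of comparing trace data with performance objectives can be sketched as follows. This is not Medidata's implementation; the objective table, span names and thresholds are invented for illustration, and the spans are plain dicts in Zipkin v2 shape rather than results read from Elasticsearch.

```python
# Hypothetical objectives: span name -> maximum acceptable duration in milliseconds.
OBJECTIVES_MS = {"get /studies": 200, "post /forms": 500}

def find_violations(spans, objectives_ms=OBJECTIVES_MS):
    """Return (name, traceId, duration_ms) for each span slower than its objective."""
    violations = []
    for span in spans:
        limit = objectives_ms.get(span["name"])
        if limit is None:
            continue  # no objective declared for this operation
        duration_ms = span["duration"] / 1000  # Zipkin durations are in microseconds
        if duration_ms > limit:
            violations.append((span["name"], span["traceId"], duration_ms))
    return violations

spans = [
    {"traceId": "463ac35c9f6413ad", "name": "get /studies", "duration": 350_000},
    {"traceId": "5af7183fb1d4cf5f", "name": "get /studies", "duration": 120_000},
    {"traceId": "7d2f1c0a9b8e6f34", "name": "get /health", "duration": 900_000},
]
violations = find_violations(spans)
```

Only the first span is reported: it took 350 ms against a 200 ms objective, and its trace ID gives the alert a direct link back to the offending trace.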

Slide 22

Slide 22 text

ok it is stock++ Medidata wrote SLAP

Slide 23

Slide 23 text

Besides architecture, what’s different across sites? Data collection policy: Typeform always provision request IDs. Infostellar use antenna, satellite and plan tags for business context. LINE add company-specific tags like phase and instance ID. Expedia Haystack scrubs secrets. Data retention policy: Medidata retain 100% for 100 days. Netflix sample 100% of FIT experiments, 0.1% otherwise. SoundCloud retain a very low sample rate for 7 days. Tracing adoption rate: LINE is only one team’s services, Ascend <50%, Tyro is over 90%
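
Policies like these often reduce to small functions in the data path. The sketch below shows one way a "keep every experiment, sample the rest" rule and a tag scrubber could look; it is not how Netflix or Expedia implement theirs, and the function names, the deny-list and the `<redacted>` marker are assumptions.

```python
def should_sample(trace_id, is_experiment, base_rate=0.001):
    """Keep every experiment trace; keep roughly base_rate of the rest."""
    if is_experiment:
        return True
    # Decide from the trace ID so the same trace always gets the same answer.
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < base_rate * 2**64

# Hypothetical deny-list of tag names whose values must not be stored.
SECRET_TAGS = {"authorization", "password", "set-cookie"}

def scrub(tags):
    """Replace the values of secret-looking tags before a span leaves the process."""
    return {
        key: ("<redacted>" if key.lower() in SECRET_TAGS else value)
        for key, value in tags.items()
    }

kept = should_sample("ffffffffffffffff", is_experiment=True)
dropped = should_sample("ffffffffffffffff", is_experiment=False)
clean = scrub({"http.path": "/login", "Password": "hunter2"})
```

Deriving the decision from the trace ID keeps it repeatable for a given trace, which avoids storing partial traces when more than one component applies the same rule.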

Slide 24

Slide 24 text

How do sites get started with tracing? Proxy: starting traces at a proxy can raise visibility of upstream and downstream. Typeform initialise a trace and request ID in their custom proxy. Single service: hotels.com recognised that even though tracing is a team sport, starting with a single service can still add value. New Framework: Sites like Ascend rolled out tracing in new applications as it was supported out of the box with Spring Boot (via Spring Cloud Sleuth). Green Field: Infostellar engineers had previous experience with tracing, and built their platform with tracing in mind. https://cwiki.apache.org/confluence/display/ZIPKIN/Hotels.com https://cwiki.apache.org/confluence/display/ZIPKIN/Typeform

Slide 25

Slide 25 text

Wrapping Up @adrianfcole #zipkin introduction a typical zipkin site wrapping up

Slide 26

Slide 26 text

Wrapping up Contribute to our site documents. Chat any time on Gitter. @adrianfcole #zipkin gitter.im/openzipkin/zipkin github.com/openzipkin/zipkin