Finding a Needle in a Call Stack - Intro to Distributed Tracing

Slide 1

Slide 1 text

Finding a Needle in a Call Stack: Introduction to Distributed Tracing. By Josh Michielsen, Senior Software Engineer, Platform Engineering, Condé Nast International (@condenasteng)

Slide 2

Slide 2 text

About Me: Snr Software Engineer, Platform Engineering @ Condé Nast Intl. Live in Cambridge, UK. Cyclist, Photographer, Dog Lover. @jmickey_ jmichielsen jmickey mickey.dev

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Global Cloud Platform: Clusters in 4 Regions, 11 Markets, 120+ Million Monthly Pageviews, 14/34 Publications Migrated

Slide 5

Slide 5 text

Microservices

Slide 6

Slide 6 text

Benefits of Microservices
1. Independently Deployable & Scalable: Small services can be deployed and scaled quickly and independently.
2. Cross-functional Teams: Teams can specialise and focus on bringing business value in a specific area.
3. Polyglot Services: Teams can choose their technology stack, such as programming language, datastore, frameworks, etc.
4. Improved Fault Tolerance & Isolation: The loss of a single service can be less detrimental to the overall health of the system.

Slide 7

Slide 7 text

Slide 8

Slide 8 text

[Diagram: request path through multiple services, with latency labels 304ms, 472ms, 295ms, 543ms, 1393ms, 389ms]

Slide 9

Slide 9 text

[Diagram: request path through multiple services, with latency labels 304ms, 472ms, 295ms, 543ms, 1840ms]

Slide 10

Slide 10 text

Challenges of Debugging Microservices
1. What went wrong? What led to the behaviour the user is experiencing? Timeout? Error? Latency?
2. Where did it go wrong? Which service is causing the problem? Is it a database, queue, or third-party service?
3. Why did it happen? What set of circumstances led to the issue occurring?
4. What was the context? Contextualising logs and metrics with other metadata associated with the request and the state of the service.

Slide 11

Slide 11 text

Existing Methods
Metrics
● Error rates, latency, throughput, CPU, RAM, etc.
● Identify anomalies/irregularities.
Logging
● Record error messages and stack traces.
● Track unexpected code paths.
● Find the “why” of the issue.
Used together to debug issues. Metrics help you identify that an issue exists, and logging helps you understand what the issue is.
Problem: These tools only tell the story from the perspective of a single application.

Slide 12

Slide 12 text

Distributed Tracing

Slide 13

Slide 13 text

Definition: “Distributed tracing, also called distributed request tracing, is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.” Source: https://opentracing.io/docs/overview/what-is-tracing

Slide 14

Slide 14 text

Anatomy of a Trace
Terminology: ● Trace ● Span ● Tags ● Logs ● SpanContext ● Baggage
[Diagram: spans A to F plotted against a time axis; together they make up one trace]

Slide 15

Slide 15 text

[Diagram: spans A to F plotted against a time axis; together they make up one trace]
Trace: Representation of a request as it traverses a distributed system. A trace is a directed acyclic graph in which spans are the nodes and references are the edges between them.

Slide 16

Slide 16 text

[Diagram: spans A to F plotted against a time axis; together they make up one trace]
Trace: Representation of a request as it traverses a distributed system. A trace is a directed acyclic graph in which spans are the nodes and references are the edges between them.
Span: Individual unit of work done within a distributed system. Spans generally have a name, and a start and end timestamp.

Slide 17

Slide 17 text

[Diagram: spans A to F plotted against a time axis; together they make up one trace]
Trace: Representation of a request as it traverses a distributed system. A trace is a directed acyclic graph in which spans are the nodes and references are the edges between them.
Span: Individual unit of work done within a distributed system. Spans generally have a name, and a start and end timestamp.
A trace is a collection of one or more spans.
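The trace and span definitions above can be sketched as a small data structure. This is only an illustration in plain Python, not the OpenTracing or Jaeger data model: the class names, the `add_span` helper, and the parent/child layout of spans A to D are assumptions, since the slide does not say which span references which.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import uuid


@dataclass
class Span:
    """One unit of work: a name plus start and end timestamps."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # reference (edge) to the parent span; None for the root
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Trace:
    """A collection of one or more spans that share a trace_id."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def add_span(self, name, start, end, parent=None):
        # Hypothetical helper: creates a span and links it to its parent.
        span = Span(name, self.trace_id, uuid.uuid4().hex[:16],
                    parent.span_id if parent else None, start, end)
        self.spans.append(span)
        return span

    def root(self):
        return next(s for s in self.spans if s.parent_id is None)

    def children(self, span):
        return [s for s in self.spans if s.parent_id == span.span_id]


# Assumed layout: A is the root, B and C are children of A, D is a child of C.
trace = Trace(trace_id=uuid.uuid4().hex)
a = trace.add_span("A", 0.0, 1.0)
b = trace.add_span("B", 0.1, 0.5, parent=a)
c = trace.add_span("C", 0.5, 0.9, parent=a)
d = trace.add_span("D", 0.6, 0.8, parent=c)

for child in trace.children(trace.root()):
    print(child.name, round(child.duration, 3))
```

Because every span stores only a reference to its parent, the graph can be rebuilt from spans that arrive out of order and from different services.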

Slide 18

Slide 18 text

[Diagram: spans A and B plotted against a time axis]
Tags: Span-specific, user-defined key:value annotations. Can be used to query and filter. Example: http.method: “GET”, http.status: 200, app: “my-service”

Slide 19

Slide 19 text

[Diagram: spans A and B plotted against a time axis]
Tags: Span-specific, user-defined key:value annotations. Can be used to query and filter. Example: http.method: “GET”, http.status: 200, app: “my-service”
Logs: Contextualised logging messages. Used to capture debugging information and events that occurred within the span. Example: “level”: “info”, “method”: “GET”, “message”: “animals are better than humans”

Slide 20

Slide 20 text

[Diagram: spans A and B plotted against a time axis]
Tags: Span-specific, user-defined key:value annotations. Can be used to query and filter. Example: http.method: “GET”, http.status: 200, app: “my-service”
Logs: Contextualised logging messages. Used to capture debugging information and events that occurred within the span. Example: “level”: “info”, “method”: “GET”, “message”: “animals are better than humans”
SpanContext: Carries data across the network boundary. It holds both the immediate context (e.g. the parent span_id) and baggage, which is carried throughout the entire trace. Example: trace_id: “20092013”, span_id: “01071992”, baggage: user_id: “03092019”
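A minimal sketch of how tags, logs, and the SpanContext sit on a span, using the example values from the slide. The classes are a standalone toy, not a real tracing client; the method names `set_tag` and `log_kv` are chosen to echo the OpenTracing Python API, but everything else here is an assumption made for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    """The part of a span that crosses the network boundary."""
    trace_id: str
    span_id: str
    baggage: dict = field(default_factory=dict)  # carried through the whole trace


@dataclass
class Span:
    name: str
    context: SpanContext
    tags: dict = field(default_factory=dict)   # key:value annotations for querying
    logs: list = field(default_factory=list)   # timestamped events inside the span

    def set_tag(self, key, value):
        self.tags[key] = value
        return self

    def log_kv(self, fields):
        # Each log entry is stored with the time it was recorded.
        self.logs.append((time.time(), dict(fields)))
        return self


context = SpanContext(trace_id="20092013", span_id="01071992",
                      baggage={"user_id": "03092019"})
span = Span(name="B", context=context)

span.set_tag("http.method", "GET").set_tag("http.status", 200).set_tag("app", "my-service")
span.log_kv({"level": "info", "method": "GET",
             "message": "animals are better than humans"})
```

Tags describe the span as a whole and stay local to it, logs record moments within it, and only the SpanContext (including baggage) is meant to travel to downstream services.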

Slide 21

Slide 21 text

Context Propagation: Using HTTP request context to pass metadata between services to be utilised by the tracing instrumentation.
[Diagram: an edge service assigns a UUID, and { context } is passed along on every call between services A to F]
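Context propagation over HTTP can be sketched as an inject step on the caller and an extract step on the callee. The header names below (`x-trace-id`, `x-parent-span-id`, `x-baggage-*`) are hypothetical and chosen only for this example; real tracers define their own wire formats. The context is modelled as a plain dict to keep the sketch self-contained.

```python
import uuid

# Hypothetical header names, for illustration only.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"
BAGGAGE_PREFIX = "x-baggage-"


def inject(context, headers):
    """Caller side: copy the current span's context into outgoing headers."""
    headers[TRACE_HEADER] = context["trace_id"]
    headers[PARENT_HEADER] = context["span_id"]
    for key, value in context["baggage"].items():
        headers[BAGGAGE_PREFIX + key] = value
    return headers


def extract(headers):
    """Callee side: build a child context from incoming headers.

    If no trace header is present (e.g. at the edge service), a new
    trace is started with a freshly generated UUID.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    baggage = {k[len(BAGGAGE_PREFIX):]: v
               for k, v in lowered.items() if k.startswith(BAGGAGE_PREFIX)}
    return {
        "trace_id": lowered.get(TRACE_HEADER) or uuid.uuid4().hex,
        "parent_id": lowered.get(PARENT_HEADER),
        "span_id": uuid.uuid4().hex[:16],
        "baggage": baggage,
    }


# Edge service: no incoming context, so a new trace begins here.
edge = extract({})
edge["baggage"]["user_id"] = "03092019"

# Edge service calls a downstream service, which continues the same trace.
outgoing = inject(edge, {})
downstream = extract(outgoing)
print(downstream["trace_id"] == edge["trace_id"])
```

Every service repeats the same extract-then-inject cycle, which is how the trace_id and baggage assigned at the edge reach the last service in the chain.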

Slide 22

Slide 22 text

OpenTracing API

Slide 23

Slide 23 text

Overview: Standard, vendor neutral API specification for instrumentation.
1. Forms a common language around tracing.
2. Allows interchangeability of distributed tracing platforms.
3. Provides a common set of client libraries for various languages.
4. Supported by the CNCF.

Slide 24

Slide 24 text

Slide 25

Slide 25 text

OpenTracing & OpenCensus are merging to form the OpenTelemetry standard. Early September: client libraries at feature parity. Early November: both previous projects officially sunset.

Slide 26

Slide 26 text

Jaeger Tracing

Slide 27

Slide 27 text

Overview: Open-source distributed tracing platform.
1. Inspired by Google Dapper and OpenZipkin.
2. Created by Uber in 2015 and donated to the CNCF in 2017.
3. Compliant with both OpenTracing and OpenCensus.

Slide 28

Slide 28 text

Overview (cont.)
● Go backend
● React frontend
● Multiple storage options: Cassandra, ElasticSearch, In-Memory
● Compatible with Apache Kafka for backpressure management.

Slide 29

Slide 29 text

Jaeger Architecture
[Diagram: components are the Application with its Jaeger Client, Jaeger Agent, Jaeger Collector, DB, Spark Jobs, Jaeger Query, and UI. Labelled flows: Spans (UDP), Push, Control Flow: Adaptive Sampling, Control Flow: Poll (sampling, etc.)]

Slide 30

Slide 30 text

A Note on Sampling
“In statistics, quality assurance, and survey methodology, sampling is the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population.” Wikipedia, https://en.wikipedia.org/wiki/Sampling_(statistics)
Probabilistic Sampling: Random sampling decision. E.g. 0.1%, or a 1 in 1000 chance of being sampled. (Default)
Rate Limiting: “Leaky Bucket” rate limiting. E.g. 20 per second.
Constant: Same decision for all traces. Either all or nothing.
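The three sampling strategies can be sketched in a few lines each. This is not Jaeger's implementation: the class names and the `is_sampled` method are assumptions, and the rate limiter uses a simple credit-refill (token bucket style) scheme as a stand-in for the leaky bucket named on the slide.

```python
import random
import time


class ConstSampler:
    """Same decision for every trace: sample all or nothing."""

    def __init__(self, decision):
        self.decision = bool(decision)

    def is_sampled(self):
        return self.decision


class ProbabilisticSampler:
    """Random decision: each trace is sampled with probability `rate`."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate
        self.rng = rng or random.Random()

    def is_sampled(self):
        return self.rng.random() < self.rate


class RateLimitingSampler:
    """Allows roughly `max_per_second` sampled traces per second.

    Credits refill continuously over time and each sampled trace
    spends one credit (an assumed stand-in for a leaky bucket).
    """

    def __init__(self, max_per_second, clock=time.monotonic):
        self.rate = float(max_per_second)
        self.clock = clock
        self.credits = self.rate
        self.last = clock()

    def is_sampled(self):
        now = self.clock()
        # Refill credits for the elapsed time, capped at one second's worth.
        self.credits = min(self.rate, self.credits + (now - self.last) * self.rate)
        self.last = now
        if self.credits >= 1.0:
            self.credits -= 1.0
            return True
        return False


# Example: 1 in 1000 traces, and at most 20 traces per second.
probabilistic = ProbabilisticSampler(0.001)
rate_limited = RateLimitingSampler(20)
print(probabilistic.is_sampled(), rate_limited.is_sampled())
```

Probabilistic sampling keeps the sampled share proportional to traffic, while rate limiting caps the absolute volume regardless of load; the constant sampler is mostly useful in development, where every trace is wanted.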

Slide 31

Slide 31 text

Jaeger vs. Zipkin
Jaeger
● Written in Go
● 8.6k stars
● Compatible with Zipkin client libraries
● Kubernetes Operator & official Helm chart
● CNCF project
Zipkin
● Written in Java
● Released in 2012 by Twitter
● 11k stars
● Single binary deployment

Slide 32

Slide 32 text

Demo

Slide 33

Slide 33 text

Resources
Jaeger: https://jaegertracing.io
Hot R.O.D Example: https://github.com/jaegertracing/jaeger/tree/master/examples/hotrod
Hot R.O.D Tutorial: http://bit.ly/hotrod-tutorial
OpenTracing Tutorial: https://github.com/yurishkuro/opentracing-tutorial
Framework Libraries: https://github.com/opentracing-contrib/
Supported Languages: https://opentracing.io/docs/supported-languages/

Slide 34

Slide 34 text

Thanks for Listening!

Slide 35

Slide 35 text

Thanks for Listening! @jmickey_ jmichielsen jmickey mickey.dev