Observability 3 ways at SCaLE 17x

This was an updated variant of my https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing deck given at the Southern California Linux Expo

https://www.socallinuxexpo.org/scale/17x/presentations/observability-3-ways-logging-metrics-and-distributed-tracing

What's different are a couple more advice slides, the last being a challenge to positioning logging, tracing and metrics as solutions. I also tried to emphasize that "3 ways" is not the same as "the only 3 ways".

Adrian Cole

March 09, 2019

Transcript

  1. Observability 3 ways
    Logging, Metrics and Tracing
    @adrianfcole
    works at Pivotal
    works on Zipkin
    let’s talk about observability.. how we understand production.. through three common tools

  2. Unifying theory
    • Logging - recording events
    • Metrics - data combined from measuring events
    • Tracing - recording events with causal ordering
    Everything is based on events
    credit: coda hale
    note: metrics take in events and emit events! ex a reading of requests per second is itself an event

    “Data combined from measuring events”, put more specifically: metrics are “statistical aggregates of properties of events which generate periodic events recording the instantaneous values of those aggregates”.
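    As a minimal sketch of that note (hypothetical names, not from the deck): request events increment a counter, and a scheduled task periodically emits a requests-per-second reading, which is itself an event.

    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch: a counter aggregates request events, and a scheduled
    // task emits a periodic "requests per second" reading, itself an event.
    public class RequestsPerSecond {
      private final LongAdder requests = new LongAdder();

      public void onRequest() { // called once per request event
        requests.increment();
      }

      public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
          long perSecond = requests.sumThenReset();
          System.out.println("requests.per.second=" + perSecond); // the reading is an event
        }, 1, 1, TimeUnit.SECONDS);
      }
    }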

  3. Focal areas
    Tracing - request scoped
    Logging - events
    Metrics - aggregatable*
    credit: peter bourgon
    often confused because they have things in common, like a timeline.

    start with logging: a crappy error happened
    logs: discrete events: debug, error, audit, request details

    tracing can tell you the impact of that error. for example did it cause a caller to fail or did it delay it?
    tracing: request-scoped causal info: latency, queries, IDs

    metrics: how many errors of this type are happening in the system?
    metrics: gauge, counter, histogram; success, failure or customer

    not all metrics are meaningfully aggregatable, ex percentiles or averages

    https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

  4. Let’s use latency to
    compare a few tools
    • Log - event (response time)
    • Metric - value (response time)
    • Trace - tree (response time)
    event, value and tree are the outputs of each corresponding system

  5. Logs show response time
    [20/Apr/2017:11:07:07 +0000] "GET / HTTP/1.1" 200
    7918 "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:
    1.8.1.11) Gecko/20061201 Firefox/2.0.0.11 (Ubuntu-
    feisty)" **0/95491**
    Look! this request took 95 milliseconds!
    often there's a field (or something similar) from which to derive duration from logs.

    note there’s some complexity in this format, and often latency is timestamp math between events.
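    For instance, a hedged sketch (the field position and names are assumptions about this particular format, not something the deck states) of turning that trailing 0/95491 field, recorded in microseconds, into milliseconds:

    // Hypothetical sketch: pull the trailing "0/95491" serve-time field out of
    // an access log line and convert the microsecond part to milliseconds.
    public class AccessLogDuration {
      static long durationMillis(String logLine) {
        String lastField = logLine.substring(logLine.lastIndexOf(' ') + 1); // e.g. "0/95491"
        long micros = Long.parseLong(lastField.substring(lastField.indexOf('/') + 1));
        return micros / 1000; // 95491 microseconds -> 95 ms
      }
    }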

  6. Metrics show response time
    Is 95 milliseconds slow?
    How fast were most
    requests at 11:07?
    95ms wouldn't have set off an alert if we are monitoring max. response time can also be complicated, as it is not a normal distribution (long tail) and can be multimodal (multiple simultaneous peak latencies)

  7. Alert on max, performance
    tune to high percentiles.
    @jon_k_schneider
    I can't control the processes which cause max (GC, VM pauses, etc., which happen less than 1% of the time), but I can control what might happen at the 99th percentile and can use that statistic for performance analysis.
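    A minimal sketch of the distinction (nearest-rank p99, hypothetical helper, not from the deck): both statistics come from the same observed durations, but they answer different questions.

    import java.util.Arrays;

    // Hypothetical sketch: from one set of observed durations, max is what you
    // alert on, while p99 is the statistic you performance-tune against.
    public class LatencyStats {
      static void report(long[] durationsMs) {
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        long max = sorted[sorted.length - 1];
        long p99 = sorted[(int) Math.ceil(sorted.length * 0.99) - 1]; // nearest-rank percentile
        System.out.println("max=" + max + "ms p99=" + p99 + "ms");
      }
    }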

  8. What caused the request to take 95
    milliseconds?
    [Trace waterfall: the full request spans 95491 microseconds; a nested child span accounts for 55231 microseconds, with smaller spans below it]
    Traces show response time
    an error delayed the request, which would have otherwise been performant.

  9. First thoughts…
    • Log - easy to “grep”, manually read
    • Metric - can identify trends
    • Trace - identify cause across services

  10. How do you write timing
    code?
    • Log - time and write formatted or structured logs
    • Metric - time and store the number
    • Trace - start, propagate and finish a “span”
    Jargon alert! span == one operation in the call graph

  11. Logging response time
    Find the thing you want and time it, format the result
    into a log statement.
    long tookMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
    logger.log(response.request().url() + " (" + tookMs + "ms"
        + (!logHeaders ? ", " + bodySize + " body" : "") + ')');
    often a global reference to a logger, as events share no state between them

    Code is simplified and uncompilable variant of okhttp request logger

  12. Metric’ing response time
    Initialize something to record duration and add to it
    def apply(request: Req, service: Service[Req, Rep]): Future[Rep] = {
      val sample = Timer.start(Clock.SYSTEM)
      service(request).respond { response =>
        sample.stop(
          Metrics.timer("request.latency", "code", response.status())
        )
      }
    }
    often an endpoint-scoped reference to a stat; the api is often limited to the value type

    Code is simplified and uncompilable variant of finagle tracing using micrometer

  13. Tracing response time
    Create and manage a span. Pass it on via headers
    Span span = handler.handleReceive(extractor, httpRequest);
    try {
      chain.doFilter(httpRequest, httpResponse);
    } finally {
      servlet.handleAsync(handler, httpRequest, httpResponse, span);
    }
    managing spans ensures parent/child links are maintained. this allows the system to collate spans into a trace automatically.

    Code is simplified and uncompilable variant of servlet tracing

  14. Impact of timing code
    • Log - ubiquitous apis, but requires coordination
    • Metric - easy, but least context
    • Trace - hardest, as identifiers must be passed
    within and between services
    logging and metrics require less state, so are easier to program

  15. Should you write timing
    code?
    • Frameworks usually have metrics built-in
    • Many frameworks have tracing built-in
    • Lots of edge cases in this sort of code!
    edge cases like clock skew, sparse traces, overhead management

  16. How to not see tracing
    code?
    • Buddy - another process intercepts yours
    • Agent - code patches code
    • Framework - code intercepts or configures code

  17. Buddy tracing
    Use a service mesh to trace around your services
    [Diagram: a web service processing requests, with the mesh propagating context headers around it]
    Image is from linkerd blog https://blog.buoyant.io/2017/03/14/a-service-mesh-for-kubernetes-part-vii-distributed-tracing-made-easy/

    without propagation of context, stitching traces together is heuristic, if possible. propagation supports other things like retry budgets or routing overrides
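    A hedged sketch of what that propagation amounts to inside the application (B3 header names assumed here; the mesh handles the rest): copy the trace context headers from the inbound request onto any outbound request.

    import java.net.http.HttpRequest;
    import javax.servlet.http.HttpServletRequest;

    // Hypothetical sketch: in a buddy/mesh setup the app's only tracing job is
    // to forward the context headers so the mesh can stitch hops into one trace.
    public class ContextForwarder {
      static final String[] B3_HEADERS = {
          "X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId", "X-B3-Sampled"};

      static HttpRequest.Builder forward(HttpServletRequest in, HttpRequest.Builder out) {
        for (String name : B3_HEADERS) {
          String value = in.getHeader(name);
          if (value != null) out.header(name, value);
        }
        return out;
      }
    }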

  18. Agent tracing
    We have ways of making code traced..
    if ("spark/webserver/JettyHandler".equals(className)) {
    ClassPool cp = new ClassPool();
    cp.appendClassPath(new LoaderClassPath(loader));
    CtClass ct = cp.makeClass(new ByteArrayInputStream(classfileBuffer));
    CtMethod ctMethod = ct.getDeclaredMethod("doHandle");
    ctMethod.insertBefore("{ $4.setHeader(\”TraceId\", MagicTraceId.get()); }");
    return ct.toBytecode();
    }
    Code is from https://github.com/bsideup/javaagent-boilerplate

    Most APMs use agents. This allows tracing of libraries that never planned to be traced, and more advanced propagation, as system or otherwise difficult classes can be modified.

  19. Framework Tracing
    Framework code configures libraries
    @Configuration
    @AutoConfigureAfter(TraceAutoConfiguration.class)
    @ConditionalOnClass(HystrixCommand.class)
    @ConditionalOnBean(Tracer.class)
    public class SleuthHystrixAutoConfiguration {
      @Bean
      SleuthHystrixConcurrencyStrategy sleuthHystrixConcurrencyStrategy(
          Tracer tracer, TraceKeys traceKeys) {
        return new SleuthHystrixConcurrencyStrategy(tracer, traceKeys);
      }
    }
    Code is from https://github.com/spring-cloud/spring-cloud-sleuth

    Users trust frameworks to set up tracing on their behalf. Usually there are ways to customize data. Frameworks are limited by version compatibility and the hooks that are available.

  20. How is timing data
    shipped?
    • Log - pull raw events into a parsing pipeline
    • Metric - report duration buckets near-real time
    • Trace - report spans near-real time
    logs usually imply disk writes and/or store-and-forward pipelines, which can make data available minutes later.

    There are different architectures behind these services, so these are just tendencies. Ex: all three are sometimes placed over the same pipeline. Also, sometimes there's buffering in log pipelines for recent data.

  21. Parsing latency from events
    Identify the pattern and parse into indexable fields
    input {
      file {
        path => "/var/log/http.log"
      }
    }
    filter {
      grok {
        match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
      }
    }
    an example of reading a file and parsing duration from it. also includes other fields so you can roll up by IP

    config is from the Logstash grok filter plugin docs

  22. Bucketing duration
    define boundaries up front…
    boundaries[0] = 1;     // 0 to < 1ms
    boundaries[1] = 1000;  // 1ms to < 1s
    boundaries[2] = 50000; // 1s to < 50s
    add values by incrementing count in a bucket
    for (int i = 0; i < boundaries.length; i++) {
      if (duration < boundaries[i]) {
        bucket[i]++;
        return;
      }
    }
    bucket[boundaries.length]++; // overflow! (bucket has one more slot than boundaries)
    Code is simplified from google-instrumentation, which is an opinionated way not shared by others. Some prefer recording into an ordinary histogram, later using that to
    make a cumulative one.

    static bucketing isn’t always used, HDRHistogram is dynamic for example.

  23. Shipping spans
    Spans represent operations and are structured
    !""""""""""""""""""""""""""""""""#
    $ ""% structure and report span &""""▶
    $ $
    ${ $
    $ "traceId": "aa", $
    $ "id": "6b", $
    $ "name": "get", $
    $ "timestamp": 1483945573944000,$
    $ "duration": 95491, $
    $ "annotations": [ $
    $--snip-- $
    '""""""""""""""""""""""""""""""""(
    This structure can change, but instrumentation code can live forever!

    json is in zipkin format

  24. How timing data grows
    • Log - grows with traffic and verbosity
    • Metric - fixed wrt traffic
    • Trace - grows with traffic
    ex trace data doesn't necessarily grow based on garbage collection or audit events

  25. Means to reduce volume
    • Log - don’t log irrelevant data, filtering
    • Metric - read-your-writes, coarser grain
    • Trace - sampling, but needs to be consistent
    Each have different retention, too!
    logging and tracing generally increase volume with traffic

    log systems can lose 10% and there’s often no way to say which 10% will be dropped

    you typically can’t get per-request details from a metric

    people can trust tracing too much and not expect data to ever be absent

    sampling is also possible in metrics, but it can interfere with aggregations.

    accidental sampling: a lot of folks ship an unholy amount of data at statsd over UDP without recording packet loss - coda
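    A minimal sketch of what “consistent” means for trace sampling (hypothetical class, B3 header name assumed): the keep-or-drop decision is made once at the trace root and propagated, rather than each service sampling independently.

    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical sketch: decide once per trace, then honor the propagated
    // decision downstream so every service keeps or drops the same traces.
    public class ConsistentSampler {
      final float probability; // e.g. 0.01f keeps roughly 1% of traces

      ConsistentSampler(float probability) {
        this.probability = probability;
      }

      /** At the trace root: no inbound decision, so make one. */
      boolean decide() {
        return ThreadLocalRandom.current().nextFloat() < probability;
      }

      /** Downstream: honor the propagated decision, e.g. the X-B3-Sampled header. */
      boolean decide(String inboundSampled) {
        return inboundSampled != null ? "1".equals(inboundSampled) : decide();
      }
    }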

  26. Stitching “all” 3 together
    [Diagram: Metrics, Logging, and Tracing joined by shared identifiers: trace ID, pod, and service name]
    with all three you can identify knock-on effects of things like pauses or deployment events
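    One common way to get a shared identifier into the logs is to put the trace ID in the logging context; a minimal sketch assuming SLF4J's MDC (not something the deck shows):

    import org.slf4j.MDC;

    // Hypothetical sketch: put the trace ID into the logging context at the start
    // of each request so every log line can be joined back to its trace.
    public class LogTraceCorrelation {
      static void onRequestStart(String traceId) {
        MDC.put("traceId", traceId);
      }

      static void onRequestEnd() {
        MDC.remove("traceId");
      }
    }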

  27. Correlating Metrics and Tracing Data
    https://medium.com/observability/want-to-debug-latency-7aa48ecbe8f7

  28. Leverage strengths while
    understanding weaknesses
    • Log - monoliths, black boxes, exceptional cases
    • Metric - identify patterns and/or alert
    • Trace - distributed services “why is this slow”
    all tools are sometimes needed.

    sometimes you have software you can't affect ("black boxes"); scraping its logs can help.

  29. Logging, metrics & distributed tracing 
    — These are problems, not solutions!
    @autoletics
    Instrumenting simulated episodes derived from real systems is not new, but it is rarely discussed. We need to consider all options, even if some feel easier to reach.

    https://medium.com/@autoletics/logging-metrics-distributed-tracing-these-are-problems-not-solutions-d528c09a4dbe

    https://medium.com/@autoletics/data-in-motion-reconstructing-the-past-4859749983bf

    other simulation focused attempts https://github.com/adrianco/spigo

    canopy decoupled recording from model creation https://blog.acolyer.org/2017/11/22/canopy-an-end-to-end-performance-tracing-and-analysis-system/

  30. Was this helpful?
    If so, thank folks who helped with this!
    @munroenic @basvanbeek @bogdandrutu @jeanneretph
    @peterbourgon @felix_b @abhik5ingh @coda
    @jon_k_schneider @autoletics
    https://peter.bourgon.org/blog/
    If not, blame me, @adrianfcole
    Peter’s blog led to this talk. read it!

    https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

    The others offered time reviewing, sometimes in multiple passes, each of which improved this content.
