Observability 3 ways at SCaLE 17x

This was an updated variant of my https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing deck given at the Southern California Linux Expo

https://www.socallinuxexpo.org/scale/17x/presentations/observability-3-ways-logging-metrics-and-distributed-tracing

What's different are a couple more advice slides, the last being a challenge to positioning logging, tracing and metrics as solutions. I also tried to emphasize that "3 ways" is not the same as "the only 3 ways".

Adrian Cole

March 09, 2019

Transcript

  1. Observability 3 ways
    Logging, Metrics and Tracing
    @adrianfcole
    works at Pivotal
    works on Zipkin
    let’s talk about observability.. how we understand production.. through three common tools

  2. Unifying theory
    • Logging - recording events
    • Metrics - data combined from measuring events
    • Tracing - recording events with causal ordering
    Everything is based on events
    credit: coda hale
    note: metrics take in events and emit events! ex a reading of requests per second is itself an event

    “Data combined from measuring events”, put more specifically: metrics are “statistical aggregates of properties of events which generate periodic events recording the instantaneous values of those aggregates”.
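    As a minimal sketch of that note (hypothetical names, not from the deck): request events increment a counter, and a scheduled task periodically emits a requests-per-second reading, which is itself an event.

    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch: a counter aggregates request events, and a scheduled
    // task emits a periodic "requests per second" reading, itself an event.
    public class RequestsPerSecond {
      private final LongAdder requests = new LongAdder();

      public void onRequest() { // called once per request event
        requests.increment();
      }

      public void start() {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
          long perSecond = requests.sumThenReset();
          System.out.println("requests.per.second=" + perSecond); // the reading is an event
        }, 1, 1, TimeUnit.SECONDS);
      }
    }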

  3. Focal areas
    Tracing - request scoped
    Logging - events
    Metrics - aggregatable*
    credit: peter bourgon
    often confused because they have things in common, like a timeline.

    start with logging: a crappy error happened
    logs: discrete events: debug, error, audit, request details

    tracing can tell you the impact of that error. for example did it cause a caller to fail or did it delay it?
    tracing: request-scoped causal info: latency, queries, IDs

    metrics: how many errors of this type are happening in the system?
    metrics: gauge, counter, histogram; success, failure or customer

    not all metrics are meaningfully aggregatable, ex percentiles or averages

    https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

  4. Let’s use latency to
    compare a few tools
    • Log - event (response time)
    • Metric - value (response time)
    • Trace - tree (response time)
    event, value and tree are the outputs of each corresponding system

  5. Logs show response time
    [20/Apr/2017:11:07:07 +0000] "GET / HTTP/1.1" 200
    7918 "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:
    1.8.1.11) Gecko/20061201 Firefox/2.0.0.11 (Ubuntu-
    feisty)" **0/95491**
    Look! this request took 95 milliseconds!
    often there's a field (or something similar) from which to derive duration from logs.

    note there’s some complexity in this format, and often latency is timestamp math between events.
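    For instance, a hedged sketch (the field position and names are assumptions about this particular format, not something the deck states) of turning that trailing 0/95491 field, recorded in microseconds, into milliseconds:

    // Hypothetical sketch: pull the trailing "0/95491" serve-time field out of
    // an access log line and convert the microsecond part to milliseconds.
    public class AccessLogDuration {
      static long durationMillis(String logLine) {
        String lastField = logLine.substring(logLine.lastIndexOf(' ') + 1); // e.g. "0/95491"
        long micros = Long.parseLong(lastField.substring(lastField.indexOf('/') + 1));
        return micros / 1000; // 95491 microseconds -> 95 ms
      }
    }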

  6. Metrics show response time
    Is 95 milliseconds slow?
    How fast were most
    requests at 11:07?
    95ms wouldn't have set off an alert if we are monitoring max. response time can also be complicated, as it is not a normal distribution (long tail) and can be multimodal (multiple simultaneous peak latencies)

  7. Alert on max, performance
    tune to high percentiles.
    @jon_k_schneider
    I can't control the processes which cause max (GC, VM pauses, etc., which happen less than 1% of the time), but I can control what might happen at the 99th percentile and can use that statistic for performance analysis.
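    A minimal sketch of the distinction (nearest-rank p99, hypothetical helper, not from the deck): both statistics come from the same observed durations, but they answer different questions.

    import java.util.Arrays;

    // Hypothetical sketch: from one set of observed durations, max is what you
    // alert on, while p99 is the statistic you performance-tune against.
    public class LatencyStats {
      static void report(long[] durationsMs) {
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        long max = sorted[sorted.length - 1];
        long p99 = sorted[(int) Math.ceil(sorted.length * 0.99) - 1]; // nearest-rank percentile
        System.out.println("max=" + max + "ms p99=" + p99 + "ms");
      }
    }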

  8. What caused the request to take 95
    milliseconds?
    [Trace waterfall: the full request spans 95491 microseconds; a nested child span accounts for 55231 microseconds, with smaller spans below it]
    Traces show response time
    an error delayed the request, which would have otherwise been performant.

  9. First thoughts…
    • Log - easy to “grep”, manually read
    • Metric - can identify trends
    • Trace - identify cause across services

  10. How do you write timing
    code?
    • Log - time and write formatted or structured logs
    • Metric - time and store the number
    • Trace - start, propagate and finish a “span”
    Jargon alert! span == one operation in the call graph

  11. Logging response time
    Find the thing you want and time it, format the result
    into a log statement.
    long tookMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
    logger.log(response.request().url() + " (" + tookMs + "ms"
        + (!logHeaders ? ", " + bodySize + " body" : "") + ')');
    often a global reference to a logger, as events share no state between them

    Code is simplified and uncompilable variant of okhttp request logger

  12. Metric’ing response time
    Initialize something to record duration and add to it
    def apply(request: Req, service: Service[Req, Rep]): Future[Rep] = {
      val sample = Timer.start(Clock.SYSTEM)
      service(request).respond { response =>
        sample.stop(
          Metrics.timer("request.latency", "code", response.status())
        )
      }
    }
    often an endpoint-scoped reference to a stat; the api is often limited to the value type

    Code is simplified and uncompilable variant of finagle tracing using micrometer

  13. Tracing response time
    Create and manage a span. Pass it on via headers
    Span span = handler.handleReceive(extractor, httpRequest);
    try {
      chain.doFilter(httpRequest, httpResponse);
    } finally {
      servlet.handleAsync(handler, httpRequest, httpResponse, span);
    }
    managing spans ensures parent/child links are maintained. this allows the system to collate spans into a trace automatically.

    Code is simplified and uncompilable variant of servlet tracing

  14. Impact of timing code
    • Log - ubiquitous apis, but requires coordination
    • Metric - easy, but least context
    • Trace - hardest, as identifiers must be passed
    within and between services
    logging and metrics require less state, so are easier to program

  15. Should you write timing
    code?
    • Frameworks usually have metrics built-in
    • Many frameworks have tracing built-in
    • Lots of edge cases in this sort of code!
    edge cases like clock skew, sparse traces, overhead management

  16. How to not see tracing
    code?
    • Buddy - another process intercepts yours
    • Agent - code patches code
    • Framework - code intercepts or configures code

  17. Buddy tracing
    Use a service mesh to trace around your services
    [Diagram: a web service processing requests, with the mesh propagating context headers around it]
    Image is from linkerd blog https://blog.buoyant.io/2017/03/14/a-service-mesh-for-kubernetes-part-vii-distributed-tracing-made-easy/

    without propagation of context, stitching traces together is heuristic, if possible. propagation supports other things like retry budgets or routing overrides
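    A hedged sketch of what that propagation amounts to inside the application (B3 header names assumed here; the mesh handles the rest): copy the trace context headers from the inbound request onto any outbound request.

    import java.net.http.HttpRequest;
    import javax.servlet.http.HttpServletRequest;

    // Hypothetical sketch: in a buddy/mesh setup the app's only tracing job is
    // to forward the context headers so the mesh can stitch hops into one trace.
    public class ContextForwarder {
      static final String[] B3_HEADERS = {
          "X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId", "X-B3-Sampled"};

      static HttpRequest.Builder forward(HttpServletRequest in, HttpRequest.Builder out) {
        for (String name : B3_HEADERS) {
          String value = in.getHeader(name);
          if (value != null) out.header(name, value);
        }
        return out;
      }
    }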

  18. Agent tracing
    We have ways of making code traced..
    if ("spark/webserver/JettyHandler".equals(className)) {
    ClassPool cp = new ClassPool();
    cp.appendClassPath(new LoaderClassPath(loader));
    CtClass ct = cp.makeClass(new ByteArrayInputStream(classfileBuffer));
    CtMethod ctMethod = ct.getDeclaredMethod("doHandle");
    ctMethod.insertBefore("{ $4.setHeader(\”TraceId\", MagicTraceId.get()); }");
    return ct.toBytecode();
    }
    Code is from https://github.com/bsideup/javaagent-boilerplate

    Most APMs use agents. This allows tracing of libraries that never planned to be traced, and more advanced propagation, as system or otherwise difficult classes can be modified.

  19. Framework Tracing
    Framework code configures libraries
    @Configuration
    @AutoConfigureAfter(TraceAutoConfiguration.class)
    @ConditionalOnClass(HystrixCommand.class)
    @ConditionalOnBean(Tracer.class)
    public class SleuthHystrixAutoConfiguration {
      @Bean
      SleuthHystrixConcurrencyStrategy sleuthHystrixConcurrencyStrategy(
          Tracer tracer, TraceKeys traceKeys) {
        return new SleuthHystrixConcurrencyStrategy(tracer, traceKeys);
      }
    }
    Code is from https://github.com/spring-cloud/spring-cloud-sleuth

    Users trust frameworks to set up tracing on their behalf. Usually there are ways to customize data. Frameworks are limited by version compatibility and the hooks that are available.

  20. How is timing data
    shipped?
    • Log - pull raw events into a parsing pipeline
    • Metric - report duration buckets near-real time
    • Trace - report spans near-real time
    logs usually imply disk writes and/or store-and-forward pipelines, which can make data available minutes later.

    There are different architectures behind these services, so these are just tendencies. Ex: all three are sometimes placed over the same pipeline. Also, sometimes there's buffering in log pipelines for recent data.

  21. Parsing latency from events
    Identify the pattern and parse into indexable fields
    input {
      file {
        path => "/var/log/http.log"
      }
    }
    filter {
      grok {
        match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
      }
    }
    an example of reading a file and parsing duration from it. also includes other fields so you can roll up by IP

    config is from the Logstash grok filter plugin docs

  22. Bucketing duration
    define boundaries up front…
    boundaries[0] = 1;     // 0 to < 1ms
    boundaries[1] = 1000;  // 1ms to < 1s
    boundaries[2] = 50000; // 1s to < 50s
    add values by incrementing count in a bucket
    for (int i = 0; i < boundaries.length; i++) {
      if (duration < boundaries[i]) {
        bucket[i]++;
        return;
      }
    }
    bucket[boundaries.length]++; // overflow! (bucket has one more slot than boundaries)
    Code is simplified from google-instrumentation, which is an opinionated way not shared by others. Some prefer recording into an ordinary histogram, later using that to
    make a cumulative one.

    static bucketing isn’t always used, HDRHistogram is dynamic for example.

  23. Shipping spans
    Spans represent operations and are structured
    !""""""""""""""""""""""""""""""""#
    $ ""% structure and report span &""""▶
    $ $
    ${ $
    $ "traceId": "aa", $
    $ "id": "6b", $
    $ "name": "get", $
    $ "timestamp": 1483945573944000,$
    $ "duration": 95491, $
    $ "annotations": [ $
    $--snip-- $
    '""""""""""""""""""""""""""""""""(
    This structure can change, but instrumentation code can live forever!

    json is in zipkin format

  24. How timing data grows
    • Log - grows with traffic and verbosity
    • Metric - fixed wrt traffic
    • Trace - grows with traffic
    ex trace data doesn't necessarily grow based on garbage collection or audit events

  25. Means to reduce volume
    • Log - don’t log irrelevant data, filtering
    • Metric - read-your-writes, coarser grain
    • Trace - sampling, but needs to be consistent
    Each have different retention, too!
    logging and tracing generally increase volume with traffic

    log systems can lose 10% and there’s often no way to say which 10% will be dropped

    you typically can’t get per-request details from a metric

    people can trust tracing too much and not expect data to ever be absent

    sampling is also possible in metrics, but it can interfere with aggregations.

    accidental sampling: a lot of folks ship an unholy amount of data at statsd over UDP without recording packet loss - coda
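    A minimal sketch of what “consistent” means for trace sampling (hypothetical class, B3 header name assumed): the keep-or-drop decision is made once at the trace root and propagated, rather than each service sampling independently.

    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical sketch: decide once per trace, then honor the propagated
    // decision downstream so every service keeps or drops the same traces.
    public class ConsistentSampler {
      final float probability; // e.g. 0.01f keeps roughly 1% of traces

      ConsistentSampler(float probability) {
        this.probability = probability;
      }

      /** At the trace root: no inbound decision, so make one. */
      boolean decide() {
        return ThreadLocalRandom.current().nextFloat() < probability;
      }

      /** Downstream: honor the propagated decision, e.g. the X-B3-Sampled header. */
      boolean decide(String inboundSampled) {
        return inboundSampled != null ? "1".equals(inboundSampled) : decide();
      }
    }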

  26. Stitching “all” 3 together
    [Diagram: Metrics, Logging, and Tracing joined by shared identifiers: trace ID, pod, and service name]
    with all three you can identify knock-on effects of things like pauses or deployment events
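    One common way to get a shared identifier into the logs is to put the trace ID in the logging context; a minimal sketch assuming SLF4J's MDC (not something the deck shows):

    import org.slf4j.MDC;

    // Hypothetical sketch: put the trace ID into the logging context at the start
    // of each request so every log line can be joined back to its trace.
    public class LogTraceCorrelation {
      static void onRequestStart(String traceId) {
        MDC.put("traceId", traceId);
      }

      static void onRequestEnd() {
        MDC.remove("traceId");
      }
    }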

  27. Correlating Metrics and Tracing Data
    https://medium.com/observability/want-to-debug-latency-7aa48ecbe8f7

  28. Leverage strengths while
    understanding weaknesses
    • Log - monoliths, black boxes, exceptional cases
    • Metric - identify patterns and/or alert
    • Trace - distributed services “why is this slow”
    all tools are sometimes needed.

    sometimes you have software you can't affect ("black boxes"); scraping its logs can help.

  29. Logging, metrics & distributed tracing 
    — These are problems, not solutions!
    @autoletics
    Instrumenting simulated episodes derived from real systems is not new, but it is rarely discussed. We need to consider all options, even if some feel easier to reach.

    https://medium.com/@autoletics/logging-metrics-distributed-tracing-these-are-problems-not-solutions-d528c09a4dbe

    https://medium.com/@autoletics/data-in-motion-reconstructing-the-past-4859749983bf

    other simulation focused attempts https://github.com/adrianco/spigo

    canopy decoupled recording from model creation https://blog.acolyer.org/2017/11/22/canopy-an-end-to-end-performance-tracing-and-analysis-system/

  30. Was this helpful?
    If so, thank folks who helped with this!
    @munroenic @basvanbeek @bogdandrutu @jeanneretph
    @peterbourgon @felix_b @abhik5ingh @coda
    @jon_k_schneider @autoletics
    https://peter.bourgon.org/blog/
    If not, blame me, @adrianfcole
    Peter’s blog led to this talk. read it!

    https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

    The others offered time reviewing, sometimes in multiple passes, each of which improved this content.
