Observability 3 ways: logging metrics and tracing

Observability 3 ways: logging metrics and tracing

Talk given at dotscale exploring three common observability systems. This talk was inspired by https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

5b0bda421e4500b706c5c42d20c3a632?s=128

Adrian Cole

April 24, 2017
Tweet

Transcript

  1. Observability 3 ways Logging, Metrics and Tracing @adrianfcole works at

    Pivotal works on Zipkin let’s talk about observability.. how we understand production.. through three common tools
  2. Unifying theory • Logging - recording events • Metrics -

    data combined from measuring events • Tracing - recording events with causal ordering Everything is based on events credit: coda hale note: metrics take in events and emit events! ex a reading of requests per second is itself an event data combined from measuring events put more specifically: metrics are “statistical aggregates of properties of events which generate periodic events recording the instantaneous values of those aggregates"
  3. Tracing Request scoped Logging Events Metrics Aggregatable* credit: peter bourgon

    Focal areas often confused because they have things in common, like a timeline. start with logging: crappy error happened tracing: impact of that error metrics: how many errors of this type are happening in the system logs: discrete events: debug, error, audit, request details crappy error happened; tracing can tell you the impact of that error. for example did it cause a caller to fail or did it delay it? tracing: request-scope causal info: latency, queries, IDs metrics: gauge counter histogram; success failure or customer how many errors of this type are happening in this cluster? not all metrics are meaningfully aggregatable, ex percentiles or averages https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
  4. Let’s use latency to compare a few tools • Log

    - event (response time) • Metric - value (response time) • Trace - tree (response time) event value and tree are outputs of each corresponding system
  5. Logs show response time [20/Apr/2017:14:19:07 +0000] "GET / HTTP/1.1" 200

    7918 "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv: 1.8.1.11) Gecko/20061201 Firefox/2.0.0.11 (Ubuntu- feisty)" **0/95491** Look! this request took 95 milliseconds! often a field or other to derive duration from logs. note there’s some complexity in this format, and often latency is timestamp math between events.
  6. Metrics show response time Is 95 milliseconds slow? How fast

    were most requests at 14:19? context of a fact within the system. 95ms is indeed slow, but not critical. most requests were good at that time, even if the system had trouble 10 minutes prior can be work resource event customer metrics
  7. What caused the request to take 95 milliseconds? åȇȇȇȇȇȇȇȇȇȇȇȇ95491 microsecondsʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒʒå

    ʔ䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢ʔ ʔ䡢䡢䡢䡢䡢䡢䡢䡢䡢ʔ åȇȇȇȇȇȇȇȇȇȇȇȇ 557231 microsecondsʒʒʒʒʒʒʒʒʒʒʒå ʔ䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢ʔ ʔ䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢ʔ ʔ䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢䡢ʔ Traces show response time an error delayed the request, which would have otherwise been performant.
  8. First thoughts… • Log - easy to “grep”, manually read

    • Metric - can identify trends • Trace - identify cause across services
  9. How do you write timing code? • Log - time

    and write formatted or structured logs • Metric - time and store the number • Trace - start, propagate and finish a “span” Jargon alert! span == one operation in the call graph
  10. Logging response time Find the thing you want and time

    it, format the result into a log statement. long tookMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs); logger.log("<-- " + response.code() + ' ' + response.message() + ' ' + response.request().url() + " (" + tookMs + "ms" + (!logHeaders ? ", " + bodySize + " body" : "") + ')'); often global reference to a logger, as events share no state between them Code is simplified and uncompilable variant of okhttp request logger
  11. Metric’ing response time Initialize something to record duration and add

    to it val latencyStat = statsReceiver.stat("request_latency_us") def apply(request: Req, service: Service[Req, Rep]): Future[Rep] = { val elapsed = Stopwatch.start() service(request).respond { response => latencyStat.add(elapsed().inUnit(MICROSECONDS)) } } often endpoint-scoped reference to a stat, api often limited to the value type Code is simplified and uncompilable variant of finagle tracing
  12. Tracing response time Create and manage a span. Pass it

    on via headers ConnectionManager::mutateRequestHeaders( --snip-- if (tracing_decision.is_tracing) { active_span_ = tracer_.startSpan(*this, *request_headers_, request_info_); } } ConnectionManager::onResponseStream( --snip-- if (active_span_) { tracer_.finalizeSpan(*active_span_, *request_headers_, request_info_); } } managing spans ensures parent/child links are maintained. this allows the system to collate spans into a trace automatically. Code is simplified and uncompilable variant of lyft envoy tracing
  13. Impact of timing code • Log - ubiquitous apis, but

    requires coordination • Metric - easy, but least context • Trace - hardest, as identifiers must be passed within and between services logging and metrics require less state, so are easier to program
  14. Should you write timing code? • Frameworks usually have metrics

    built-in • Many frameworks have tracing built-in • Lots of edge cases in this sort of code! edge cases like clock skew, sparse traces, overhead management
  15. How is timing data shipped? • Log - pull raw

    events into a parsing pipeline • Metric - report duration buckets near-real time • Trace - report spans near-real time logs usually imply disk writes and/or store forward pipelines which can be available minutes later. There are different architectures behind services, so these are just tendencies. Ex all three are sometimes placed over the same pipeline. Also, sometimes there’s buffering in log pipelines for recent data,.
  16. Parsing latency from events Identify the pattern and parse into

    indexable fields input { file { path => "/var/log/http.log" } } filter { grok { match => { "message" => "%{IP:client} %{WORD:method} % {URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" } } } an example of reading a file and parsing duration from it. also includes other fields so you can rollup by ip config is from elasticsearch grok plugin docs
  17. Bucketing duration define boundaries up front… boundaries[0] = 1; //

    0 to < 1ms boundaries[1] = 1000; // 1ms to < 1s boundaries[2] = 50000; // 1s to < 50s add values by incrementing count in a bucket for (int i = 0; i < boundaries.length; i++) { if (duration < boundaries[i]) { bucket[i]++; return; } } bucket[boundaries.length]++; // overflow! Code is simplified from google-instrumentation static bucketing isn’t always used, HDRHistogram is dynamic for example.
  18. Shipping spans Spans represent operations and are structured !""""""""""""""""""""""""""""""""# $

    ""% structure and report span &""""▶ $ $ ${ $ $ "traceId": "aa", $ $ "id": "6b", $ $ "name": "get", $ $ "timestamp": 1483945573944000,$ $ "duration": 95491, $ $ "annotations": [ $ $--snip-- $ '""""""""""""""""""""""""""""""""( This structure can change, but instrumentation code can live forever! json is in zipkin format
  19. How timing data grows • Log - grows with traffic

    and verbosity • Metric - fixed wrt traffic • Trace - grows with traffic ex trace data doesn't necessarily grow based on garbage collection or audit events
  20. Means to reduce volume • Log - don’t log irrelevant

    data, filtering • Metric - read-your-writes, coarser grain • Trace - sampling, but needs to be consistent Each have different retention, too! logging and tracing generally increase volume with traffic log systems can lose 10% and there’s often no way to say which 10% will be dropped you typically can’t get per-request details from a metric people can trust tracing too much and not expect data to ever be absent sampling is also possible in metrics, but it can interfere with aggregations. accidental sampling a lot of folks ship an unholy amount of data at statsd over UDP w/o recording packet loss - coda
  21. Stitching all 3 together Trace ID Cluster RPC Name Metrics

    Logging Tracing with all three you can identify knock-on effects of things like pauses or deployment events
  22. Leverage strengths while understanding weaknesses • Log - monoliths, black

    boxes, exceptional cases • Metric - identify patterns and/or alert • Trace - distributed services “why is this slow” all tools are sometimes needed. sometimes you have software you can’t affect “black boxes” scraping logs can help.
  23. Was this helpful? If so, thank folks who helped with

    this! @adrianfcole @munroenic @basvanbeek @bogdandrutu @jeanneretph If not, blame me, @peterbourgon @felix_b @abhik5ingh https://peter.bourgon.org/blog/ @coda Peter’s blog led to this talk. read it! https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html The others offered time reviewing, sometimes multiple passes. Each of which improved this content.