The Problem with Pre-Aggregated Metrics

Christine Yen
December 07, 2018

Pre-aggregated metrics and time series form the backbone of many monitoring setups and have many redeeming qualities, but simply aren't sufficient for capturing the many ways things can go wrong in modern or complex systems. Problems inherent in the concepts behind and implementation of pre-aggregated metrics prevent them from being effective for any sort of debugging or diagnostics; we'll talk about why that is, and what techniques we should be leaning on instead.

Presented at YOW! 2018.

Transcript

  1. The Problem with PRE-AGGREGATED METRICS
    A story about the past and why observability is the glorious future
    @cyen @honeycombio
  2. Question → Graph → Counters
    Has my app been getting traction?
    [Figure: a graph backed by one counter per time bucket, :00 through :04, with counter values 10, 9, 7, 6 and 2]
  3. The Problem with PRE-AGGREGATED METRICS
  4. PART 1: THE PROBLEM WITH "PRE"
    "Why is my API request volume increasing? Where are they coming from?"
    { "action": "write", "client_platform": "iOS" }
  5. PART 1: THE PROBLEM WITH "PRE"
    "Why is my API request volume increasing? Where are they coming from?"
    { "action": "write", "client_platform": "iOS" }
    • APIRequest
  6. PART 1: THE PROBLEM WITH "PRE"
    "Why is my API request volume increasing? Where are they coming from?"
    { "action": "write", "client_platform": "iOS" }
    • APIRequest
    • APIRequest.Write.iOS
    • APIRequest.Read.iOS
    • APIRequest.Write.Android
    • APIRequest.Read.Android
  7. PART 1: THE PROBLEM WITH "PRE"
    { "endpoint": "/objects/b3C0sj4", "method": "PUT", "status": 201, "app_id": "8b3jOsmH4", "client_platform": "iOS", "request_dur_ms": 32.153, "mongodb_dur_ms": 29.83, "build_id": 1325, "hostname": "appserver50" }
  8. PART 1: THE PROBLEM WITH "PRE"
    WAYS TO BE LET DOWN BY YOUR PAST SELF
    • Limit yourself to questions your past self found interesting
    • (+ to an older understanding of your system)
  9. PART 1: THE PROBLEM WITH "PRE"
    WAYS TO BE LET DOWN BY YOUR PAST SELF
    • Limit yourself to questions your past self found interesting
    • (+ to an older understanding of your system)
    • Create too many low-signal metrics, dashboards, and alerts
  10. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" }
    Graph:
    Count API Requests
    Count API Requests with "action": "write"
    Count API Requests with "…platform": "ios"
    Count API Requests with "action": "write" and "…platform": "ios"
  11. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" }
    Graph / Storage Illustration (time buckets :00 to :03):
    Count API Requests / +1
    Count API Requests with "action": "write" / +1
    Count API Requests with "…platform": "ios" / +1
    Count API Requests with "action": "write" and "…platform": "ios" / +1
  12. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" }
    Graph / Identifier / Storage Illustration (time buckets :00 to :03):
    Count API Requests / APIRequest / +1
    Count API Requests with "action": "write" / APIRequest.Write / +1
    Count API Requests with "…platform": "ios" / APIRequest.iOS / +1
    Count API Requests with "action": "write" and "…platform": "ios" / APIRequest.Write.iOS / +1
  13. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" }
    Graph / Identifier / Storage Illustration (time buckets :00 to :03):
    Count API Requests / APIRequest / +1
    Count API Requests with "action": "write" / APIRequest.Write / +1
    Count API Requests with "…platform": "ios" / APIRequest.iOS / +1
    Count API Requests with "action": "write" and "…platform": "ios" / APIRequest.Write.iOS / +1
    Callouts: write, iOS
  14. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" }
    APIRequest, APIRequest.Write, APIRequest.iOS, APIRequest.Write.iOS
  15. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" }
    APIRequest, APIRequest.Write, APIRequest.iOS, APIRequest.Write.iOS
    APIRequest.Write, APIRequest.Read, APIRequest.iOS, APIRequest.Android, APIRequest.Write.iOS, APIRequest.Write.Android, APIRequest.Read.iOS, APIRequest.Read.Android
  16. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS", "sdk_version": "v2" }
    APIRequest, APIRequest.Write, APIRequest.iOS, APIRequest.Write.iOS
    APIRequest.Write, APIRequest.Read, APIRequest.iOS, APIRequest.Android, APIRequest.Write.iOS, APIRequest.Write.Android, APIRequest.Read.iOS, APIRequest.Read.Android
  17. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS", "sdk_version": "v2" }
    APIRequest, APIRequest.Write, APIRequest.iOS, APIRequest.Write.iOS, APIRequest.SDKv2, APIRequest.iOS.SDKv2, APIRequest.Write.SDKv2, APIRequest.Write.iOS.SDKv2
    APIRequest.Write, APIRequest.Read, APIRequest.iOS, APIRequest.Android, APIRequest.Write.iOS, APIRequest.Write.Android, APIRequest.Read.iOS, APIRequest.Read.Android
  18. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS", "sdk_version": "v2" }
    APIRequest, APIRequest.Write, APIRequest.iOS, APIRequest.Write.iOS, APIRequest.SDKv2, APIRequest.iOS.SDKv2, APIRequest.Write.SDKv2, APIRequest.Write.iOS.SDKv2
    APIRequest.Write, APIRequest.Read, APIRequest.iOS, APIRequest.Android, APIRequest.Write.iOS, APIRequest.Write.Android, APIRequest.Read.iOS, APIRequest.Read.Android
    APIRequest.Write.iOS.SDKv1, APIRequest.Write.iOS.SDKv2, APIRequest.Write.Android.SDKv1, APIRequest.Write.Android.SDKv2, APIRequest.Read.iOS.SDKv1, APIRequest.Read.iOS.SDKv2, APIRequest.Read.Android.SDKv1, APIRequest.Read.Android.SDKv2
    APIRequest.iOS.SDKv1, APIRequest.iOS.SDKv2, APIRequest.Android.SDKv1, APIRequest.Android.SDKv2, APIRequest.Write.SDKv1, APIRequest.Write.SDKv2, APIRequest.Read.SDKv1, APIRequest.Read.SDKv2, APIRequest.SDKv1, APIRequest.SDKv2
  19. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" } ("write" becomes "update")
    APIRequest, APIRequest.iOS, APIRequest.Update, APIRequest.Update.iOS
  20. PART 2: THE PROBLEM WITH "AGGREGATED"
    { "action": "write", "client_platform": "iOS" } ("write" becomes "update")
    APIRequest, APIRequest.iOS, APIRequest.Update, APIRequest.Update.iOS
    APIRequest.Write, APIRequest.Read, APIRequest.Update, APIRequest.iOS, APIRequest.Android, APIRequest.Write.iOS, APIRequest.Write.Android, APIRequest.Read.iOS, APIRequest.Read.Android, APIRequest.Update.iOS, APIRequest.Update.Android
  21. PART 2: THE PROBLEM WITH "AGGREGATED"
    API Request: processed on host appserver03
    APIRequest, APIRequest.appserver01, APIRequest.appserver02, APIRequest.appserver03, APIRequest.appserver04, APIRequest.appserver05, APIRequest.appserver06, APIRequest.appserver07, APIRequest.appserver08, APIRequest.appserver09
  22. PART 2: THE PROBLEM WITH "AGGREGATED"
    API Request: processed on host appserver03
    APIRequest, APIRequest.appserver01, APIRequest.appserver02, APIRequest.appserver03, APIRequest.appserver04, APIRequest.appserver05, APIRequest.appserver06, APIRequest.appserver07, APIRequest.appserver08, APIRequest.appserver09
    APIRequest.pushserver01, APIRequest.pushserver02, APIReque, APIReque, APIReque, APIReque, APIReque, APIReque
  23. PART 2: THE PROBLEM WITH "AGGREGATED"
    API Request: processed on host appserver03
    APIRequest, APIRequest.appserver01, APIRequest.appserver02, APIRequest.appserver03, APIRequest.appserver04, APIRequest.appserver05, APIRequest.appserver06, APIRequest.appserver07, APIRequest.appserver08, APIRequest.appserver09
    APIRequest.pushserver01, APIRequest.pushserver02, APIReque, APIReque, APIReque, APIReque, APIReque, APIReque
    IRequest.appserver341, IRequest.appserver342, IRequest.appserver343, IRequest.appserver344, IRequest.pushserver19, IRequest.pushserver20, IRequest.pushserver21, IRequest.pushserver22
    APIRequest.pushserver21, APIRequest.server115, APIRequest.server116, APIRequest.server117, APIRequest.server118, APIRequest.server119, APIRequest.server120, APIRequest.server121, APIRequest.serv, APIRequest.serv, APIRequest.serv, APIRequest.serv, APIRequest.serv, APIRequest.serv, APIRequest.serv
    (╯°□°)╯︵ ┻━┻
  24. PART 2: THE PROBLEM WITH "AGGREGATED"
    [Figure: storage grid with one row per build (MyApp.build151, MyApp.build152, MyApp.build154, MyApp.build157) and one column per time bucket (:00 through 23:59); most cells hold 0]
  25. PART 2: THE PROBLEM WITH "AGGREGATED"
    [Figure: the same storage grid with one row per identifier (MyApp.0xc42000fc0, MyApp.0xc420079af, MyApp.0x0012ff74c, MyApp.0x471cb03e7, ...) and one column per time bucket (:00 through 23:59)]
  26. PART 2: THE PROBLEM WITH "AGGREGATED"
    HOW TO EXPLODE YOUR PRE-AGGREGATED METRICS STORAGE
    • Increase dimensionality: break down by (or tag) some new attribute
      ▸ (e.g. major SDK versions)
  27. PART 2: THE PROBLEM WITH "AGGREGATED"
    HOW TO EXPLODE YOUR PRE-AGGREGATED METRICS STORAGE
    • Increase dimensionality: break down by (or tag) some new attribute
      ▸ (e.g. major SDK versions)
    • Increase cardinality: track lots of unique values
      ▸ (e.g. userID, app version, user agent, hostname)
  28. PART 2: THE PROBLEM WITH "AGGREGATED"
    HOW TO EXPLODE YOUR PRE-AGGREGATED METRICS STORAGE
    • Increase dimensionality: break down by (or tag) some new attribute
      ▸ (e.g. major SDK versions)
    • Increase cardinality: track lots of unique values
      ▸ (e.g. userID, app version, user agent, hostname)
    where the good stuff is
  29. PART 3: THE PROBLEM WITH "METRICS"
    Memory climbed/fell… twice.
    Our request rate fell off a cliff!
    The number of errors spiked!
  30. PART 3: THE PROBLEM WITH "METRICS"
    Memory climbed/fell… twice.
    Our request rate fell off a cliff!
    The number of errors spiked!
    Whoa! What caused that spike?
    Was it the same reason both times?
    For all requests or a certain type?
  31. PART 3: THE PROBLEM WITH "METRICS"
    latform": "ios", "error": true } { "client_platform": "ios", "error": false } { "client_p
  32. PART 3: THE PROBLEM WITH "METRICS"
    latform": "ios", "error": true } { "client_platform": "ios", "error": false } { "client_p
    The overall error rate for my API is 0.012!
  33. PART 3: THE PROBLEM WITH "METRICS"
    latform": "ios", "error": true } { "client_platform": "ios", "error": false } { "client_p
    The overall error rate for my API is 0.012!
  34. PART 3: THE PROBLEM WITH "METRICS"
    The overall error rate for my API is 0.012!
    That one! Right there!
  35. PART 3: THE PROBLEM WITH "METRICS"
    API requests with… roundtrip_sec < 10 and sdk = "iOS"
    The overall error rate for my API is ____
  36. PART 3: THE PROBLEM WITH "METRICS"
    API requests with… roundtrip_sec < 10 and sdk = "iOS"
    The overall error rate for my API is ____
  37. PART 2: THE PROBLEM WITH "AGGREGATED"
    WAYS TO DESTROY YOUR FAITH IN METRICS (AND HURT YOUR CREDIBILITY)
  38. PART 2: THE PROBLEM WITH "AGGREGATED"
    WAYS TO DESTROY YOUR FAITH IN METRICS (AND HURT YOUR CREDIBILITY)
    • Draw conclusions from metrics you have (vs the metrics you want)
  39. PART 2: THE PROBLEM WITH "AGGREGATED"
    WAYS TO DESTROY YOUR FAITH IN METRICS (AND HURT YOUR CREDIBILITY)
    • Draw conclusions from metrics you have (vs the metrics you want)
    • Assume that if your metrics don’t show something, it’s not happening
  40. DETECT THE FUTURE, NOW WITH VERBS!
    • Use dashboards, not too many; use KPIs that matter
      ▸ … but allow getting specific: arbitrary dimensionality and cardinality
  41. DETECT THE FUTURE, NOW WITH VERBS!
    • Use dashboards, not too many; use KPIs that matter
      ▸ … but allow getting specific: arbitrary dimensionality and cardinality
    • Retain the raw data (read-time aggregation > pre-aggregation)
      ▸ … otherwise that calculated number is all you’ll get
  42. REFINE THE FUTURE, NOW WITH VERBS!
    • Speed matters when testing out hypotheses
      ▸ … something we’re getting better and better at now
  43. REFINE THE FUTURE, NOW WITH VERBS!
    • Speed matters when testing out hypotheses
      ▸ … something we’re getting better and better at now
    • Capture as much context as you might want later
      ▸ … to track down the new strange problems today
  44. Let’s stop talking about PRE-AGGREGATED METRICS and think about what our tools should do to keep up with our brand-new questions
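
The contrast the slides draw between write-time counters (slides 10 to 18) and read-time aggregation over retained raw events (slide 41) can be sketched roughly as follows. This is a minimal illustration in Python, not the speaker's or any vendor's implementation: the sample events, the `DIMENSIONS` list, and the `counter_keys`, `pre_aggregate` and `error_rate` helpers are all hypothetical names made up for the example, and the events simply combine fields that appear on the slides (`action`, `client_platform`, `sdk_version`, `error`).

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw events, shaped like the JSON payloads shown on the slides.
EVENTS = [
    {"action": "write", "client_platform": "iOS", "sdk_version": "v2", "error": True},
    {"action": "write", "client_platform": "iOS", "sdk_version": "v1", "error": False},
    {"action": "read", "client_platform": "Android", "sdk_version": "v2", "error": False},
    {"action": "read", "client_platform": "iOS", "sdk_version": "v2", "error": False},
]

# Assumed set of attributes chosen in advance for breakdowns.
DIMENSIONS = ["action", "client_platform", "sdk_version"]


def counter_keys(event, dimensions=DIMENSIONS):
    """List every pre-aggregated counter a single event increments.

    One key per subset of the dimensions, so the count is 2 ** len(dimensions).
    """
    keys = []
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            keys.append(".".join(["APIRequest"] + [str(event[d]) for d in dims]))
    return keys


def pre_aggregate(events):
    """Write-time aggregation: only the counts are kept, not the events."""
    counters = Counter()
    for event in events:
        counters.update(counter_keys(event))
    return counters


def error_rate(events, **filters):
    """Read-time aggregation: filter raw events on any fields, then compute."""
    matching = [e for e in events if all(e.get(k) == v for k, v in filters.items())]
    if not matching:
        return None  # no matching events, so no rate to report
    return sum(e["error"] for e in matching) / len(matching)
```

With three dimensions, `counter_keys(EVENTS[0])` yields eight keys, the same count as the identifiers listed for the event on slide 17, and each further dimension doubles it. The `error_rate` helper, by contrast, accepts any combination of fields at query time, for example `error_rate(EVENTS, client_platform="iOS", action="write")`, because the raw events were retained rather than reduced to counts.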