2022-10-19 JavaOne - To Production and Beyond: Metrics with Micrometer

Jonatan Ivanov

November 19, 2022

Transcript

  1. To Production and Beyond: Metrics with Micrometer
    Erin Schnabel, Jonatan Ivanov
    2022-10-19
  2. What is Observability?
    How well we can understand the internals of a system based on its outputs
    (Providing meaningful information about what happens inside)
  3. Various Opinions
    • 3 pillars: Logging, Metrics, Distributed Tracing
    • TEMPLE: Tracing, Change Events, Metrics, Profiles, Exceptions
    • Arbitrary Wide Events, Signals
    But what about:
    • /health, /info, etc.
    • Service Registry/Discoverability
    • API Discoverability
  4. Why do we need Observability?
    Today's systems are insanely complex (cloud)
    (Death Star Architecture, Big Ball of Mud)
  5. Why do we need Observability?
    • Environments can be chaotic
      You turn a knob here a little and services are going down there
    • We need to deal with unknown unknowns
      We can’t know everything
    • Things can be perceived differently by observers
      Everything is broken for the users but seems ok to you
  6. Metrics
    • Primary tool for measuring availability (SLI)
    • Cheap, compact, efficient
    • Cardinality is a concern, but…
  7. Time Series: How does data change over time?
    • Time is the primary (x) axis
    • Data collected at regular intervals
    • Appended to the end of a series
  8. Time Series: Looking for trends…

  9. Time Series: Aggregation across dimensions
    Hierarchical
    job.instance.monsters.dice_rolls.die.face=value
    job.instance.monsters.dice_rolls.d10.01=value
    job.instance.monsters.dice_rolls.d8.08=value
    job.instance.monsters.dice_rolls.d6.03=value
    …
    *.*.monsters.dice_rolls.d6.*=value
    *.*.monsters.dice_rolls.d8.*=value
    Is that the right order? What if there is another…
  10. Time Series: Aggregation across dimensions
    Dimensional
    dice_rolls_total {
      die="d10"
      face="01"
      instance="..."
      job="..."
    } 77062
    • More flexible (additional / missing data)
    • Cardinality: die (6 values), face (20 values), instance (3 values), job (3 values) = 6*20*3*3 = 1080 series
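
The cardinality arithmetic on the dimensional slide can be checked with a small sketch. It uses only the JDK, the class and method names are hypothetical, and it assumes the worst case in which every combination of label values actually occurs; the label counts are the ones quoted on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: worst-case number of time series for one metric name.
final class SeriesCardinality {

    // Multiplies the number of distinct values of every label.
    static long seriesCount(Map<String, Integer> distinctValuesPerLabel) {
        long total = 1;
        for (int distinct : distinctValuesPerLabel.values()) {
            total = Math.multiplyExact(total, (long) distinct);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> labels = new LinkedHashMap<>();
        labels.put("die", 6);
        labels.put("face", 20);
        labels.put("instance", 3);
        labels.put("job", 3);
        System.out.println(seriesCount(labels)); // 6 * 20 * 3 * 3 = 1080
    }
}
```

Adding one more label with, say, 10 distinct values would multiply the result by 10, which is why cardinality is a concern for dimensional metrics.
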
  11. Micrometer
    • Popular Metrics library on the JVM
    • Like SLF4J, but for metrics
    • Simple API
    • Supports the most popular metric backends
    • Support for lots of third-party libraries/frameworks
      Spring, Quarkus, Micronaut, Helidon, etc.
  12. Like SLF4J, but for metrics…
    AppOptics, Atlas, Azure Monitor, CloudWatch (AWS), Datadog, Dynatrace, Elastic,
    Ganglia, Graphite, Humio, InfluxDB, JMX, KairosDB, New Relic, OpenTSDB, OTLP,
    Prometheus, SignalFx, Stackdriver (GCP), StatsD, Wavefront (VMware)
  13. Counter → Increment Only
    1. registry.counter("example.prime.number", "type", "prime");
    2. Counter.builder("counter")
         .baseUnit("beans") // optional
         .description("a description") // optional
         .tags("region", "test") // optional
         .register(registry);
    3. @Counted(value = "metric.all", extraTags = { "region", "test" }) **
  14. Micrometer: Counter
    registry.counter("dice.rolls", "die", d, "face", f).increment();
    Prometheus:
    dice_rolls_total {
      die="d10"
      face="01"
      instance="..."
      job="..."
    } 77062
    Rate of change:
    • Counter reset
    • Process restart
    Rate of change to value of counter in last 10 minutes:
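
Why counter resets matter for a rate of change can be sketched in plain Java. This is an illustration under assumptions, not Prometheus's actual algorithm (it does no extrapolation or staleness handling); the class name, method names, and sample values are hypothetical. It treats any drop in the counter value as a reset, for example after a process restart.

```java
// Hypothetical sketch: total increase of a monotonic counter across samples,
// treating any drop in value as a counter reset (e.g. a process restart).
final class CounterIncrease {

    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double delta = samples[i] - samples[i - 1];
            // After a reset the counter restarts from zero, so the new value
            // itself is taken as the increase since the reset.
            total += (delta >= 0) ? delta : samples[i];
        }
        return total;
    }

    // Average per-second rate over the window covered by the samples.
    static double ratePerSecond(double[] samples, double windowSeconds) {
        return increase(samples) / windowSeconds;
    }

    public static void main(String[] args) {
        // Counter scraped five times; the process restarts before the fourth scrape.
        double[] samples = {100, 130, 160, 20, 50};
        System.out.println(increase(samples));           // 30 + 30 + 20 + 30 = 110.0
        System.out.println(ratePerSecond(samples, 600)); // 110 increments over 10 minutes
    }
}
```

A naive `last - first` on the same samples would give -50, which is why the raw counter value is rarely graphed directly.
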
  15. A rate is an average over an interval: the size of the interval matters
    Note: Aggregation and precision
    Intervals shown: 1 minute, 30 minutes
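
The effect of the interval size can be shown with a small, hypothetical example: the same per-minute increments averaged over the last 1 minute and over the last 30 minutes. The class name and the numbers are made up for illustration only.

```java
import java.util.Arrays;

// Hypothetical sketch: the same data averaged over two different window sizes.
final class RateWindows {

    // Average per-minute rate over the last `windowMinutes` entries.
    static double averageRate(int[] perMinuteIncrements, int windowMinutes) {
        if (windowMinutes < 1 || windowMinutes > perMinuteIncrements.length) {
            throw new IllegalArgumentException("window must fit inside the data");
        }
        long sum = 0;
        for (int i = perMinuteIncrements.length - windowMinutes; i < perMinuteIncrements.length; i++) {
            sum += perMinuteIncrements[i];
        }
        return (double) sum / windowMinutes;
    }

    public static void main(String[] args) {
        int[] increments = new int[30];
        Arrays.fill(increments, 10); // steady 10 events per minute...
        increments[29] = 300;        // ...with a spike in the most recent minute
        System.out.println(averageRate(increments, 1));  // 300.0: the spike is fully visible
        System.out.println(averageRate(increments, 30)); // (29 * 10 + 300) / 30, the spike is smoothed out
    }
}
```

A short window is precise but noisy; a long window is smooth but can hide short spikes.
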
  16. Distribution Summary → .record(value), not time
    1. registry.summary("name", "tagName", "tagValue")
    2. DistributionSummary.builder("response.size")
         .description("a description") // optional
         .baseUnit("bytes") // optional
         .tags("tagName", "tagValue") // optional
         .register(registry);
    Minimal information from a histogram: sum, count, max
    • sum / count = aggregable average
    • max is a decaying signal over a larger time window
    Aggregable: Raw parts can be safely recombined across dimensions
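
What "aggregable" means for sum, count, and max can be sketched with plain Java. The `SummaryParts` type and the numbers are hypothetical and not part of Micrometer; the sketch assumes non-negative recorded values and ignores the time-decay of max.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value type: the raw parts one instance reports for a summary.
final class SummaryParts {
    final double sum;
    final long count;
    final double max;

    SummaryParts(double sum, long count, double max) {
        this.sum = sum;
        this.count = count;
        this.max = max;
    }

    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    // Raw parts recombine safely: add sums, add counts, take the largest max.
    static SummaryParts combine(List<SummaryParts> parts) {
        double sum = 0.0;
        long count = 0;
        double max = 0.0; // assumes recorded values are never negative
        for (SummaryParts p : parts) {
            sum += p.sum;
            count += p.count;
            max = Math.max(max, p.max);
        }
        return new SummaryParts(sum, count, max);
    }

    public static void main(String[] args) {
        SummaryParts a = new SummaryParts(1_000, 10, 400); // mean 100
        SummaryParts b = new SummaryParts(9_000, 30, 900); // mean 300
        SummaryParts all = combine(Arrays.asList(a, b));
        System.out.println(all.mean());                // 10000 / 40 = 250.0
        System.out.println((a.mean() + b.mean()) / 2); // 200.0, not the true average
        System.out.println(all.max);                   // 900.0
    }
}
```

Averaging the two per-instance means gives 200, while recombining the raw parts gives the correct 250, because instance `b` handled three times as many values.
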
  17. A little aside, just for context…
    Attack types:
    • AC: Armor class
    • DC: Difficulty class
    : oneRound:
    Troll(LARGE GIANT){AC:15,HP:84(8d10+40),STR:18(+4),DEX:13(+1),CON:20(+5),INT:7(-2),WIS:9(-1),CHA:7(-2),CR:
    Pit Fiend(LARGE FIEND){AC:19,HP:300(24d10+168),STR:26(+8),DEX:14(+2),CON:24(+7),INT:22(+6),WIS:18(+4),CHA:
    : attack: miss: Troll(36) -> Pit Fiend(100)
    : attack: miss: Troll(36) -> Pit Fiend(100)
    : attack: hit> Troll(36) -> Pit Fiend(97) for 9 damage using Claws[7hit,11(2d6+4)|slashing]
    : attack: hit> Pit Fiend(97) -> Troll(10) for 22 damage using Bite[14hit,22(4d6+8)|piercing]
    : attack: MISS: Pit Fiend(97) -> Troll(10)
    : attack: HIT> Pit Fiend(97) -> Troll(0) for 34 damage using Mace[14hit,15(2d6+8)|bludgeoning]
    : oneRound: survivors
    Pit Fiend(LARGE FIEND){AC:19,HP:300(24d10+168),STR:26(+8),DEX:14(+2),CON:24(+7),INT:22(+6),WIS:18(+4),CHA:
    Special:
    • MISS: natural 1
    • HIT: natural 20
  18. Distribution Summary: known unknowns
    registry.histogram("encounter.rounds",
        new Tag("numCombatants", e.getNumCombatants()),
        new Tag("targetSelector", e.getSelector()),
        new Tag("sizeDelta", e.getSizeDelta())
    ).update(totalRounds);
    encounter_rounds_count{
      numCombatants="03",
      sizeDelta="00",
      targetSelector="BiggestFirst",
      instance="...",
      job="..."
    }
    registry.histogram("round.attacks",
        new Tag("hitOrMiss", event.hitOrMiss()),
        new Tag("attackType", event.getAttackType()),
        new Tag("damageType", event.getType())
    ).update(event.getDamageAmount());
    round_attacks_count{
      attackType="attack-ac",
      damageType="acid",
      hitOrMiss="critical hit",
      instance="...",
      job="..."
    }
  19. Distribution Summary

  20. Timer → .record(duration)
    1. registry.timer("my.timer", "tagName", "tagValue");
    2. Timer.builder("my.timer")
         .description("description") // optional
         .tags("tagName", "tagValue") // optional
         .register(registry);
    3. @Timed(value = "call", extraTags = {"tagName", "tagValue"}) **
  21. Working with Timers
    1. timer.record(() -> noReturnValue());
    2. timer.recordCallable(() -> returnValue());
    3. Runnable r = timer.wrap(() -> noReturnValue());
    4. Callable c = timer.wrap(() -> returnValue());
    5. Timer.Sample s = Timer.start(registry); doStuff(); s.stop(timer);
  22. Timing HTTP requests
    Example from Quarkus; Spring does something similar.
    Timer.Sample sample = requestMetric.getSample(); // stored
    Timer.Builder builder = Timer.builder("http.server.requests")
        .tags(Tags.of(
            VertxMetricsTags.method(requestMetric.request().method()),
            HttpCommonTags.uri(path, response.statusCode()),
            VertxMetricsTags.outcome(response),
            HttpCommonTags.status(response.statusCode())));
    sample.stop(builder.register(registry));
  23. Gauge → increase/decrease, observed value
    1. List<String> list = registry.gauge("list.size", Tags.of(...), new ArrayList<>(), List::size);
    2. Gauge.builder("jvm.threads.peak", threadBean, ThreadMXBean::getPeakThreadCount)
         .tags("region", "test")
         .description("The peak live thread count...")
         .baseUnit(BaseUnits.THREADS)
         .register(registry);
  24. LongTaskTimer → active count, longest active task
    1. registry.more().longTaskTimer("long.task", "region", "test");
    2. LongTaskTimer.builder("long.task")
         .description("a description") // optional
         .tags("region", "test") // optional
         .register(registry);
    3. @Timed(value = "long.task", longTask = true, … ) **
  25. What’s (kind of) new in Micrometer?
    1.9.0 - 2022 May
    • OTLP Registry (OpenTelemetry)
    • HighCardinalityTagsDetector
    • Exemplars (Prometheus)
  26. Exemplars (Prometheus)
    • “Metadata” that you can attach to your metrics
    • Updated at measurement time (sampled)
    • They are not tags (high cardinality)
    • Usually traceId and spanId
    • Correlate Metrics to Distributed Tracing and Logs
    • Available for Counter and Histogram buckets
  27. None
  28. None
  29. None
  30. None
  31. What’s new in Micrometer?
    1.10.0 - 2022 November (RC is out)
    • Micrometer Tracing (Sleuth w/o Spring deps.)
    • Micrometer Docs Generator
    • Micrometer Context Propagation
    • Observation API (micrometer-core)
  32. You want to instrument your application…
    • Add logs (application logs)
    • Add metrics
      ◦ Increment Counters
      ◦ Start/Stop Timers
    • Add Distributed Tracing
      ◦ Start/Stop Spans
      ◦ Log Correlation
      ◦ Context Propagation
  33. Observation API (Micrometer 1.10)
    Observation observation = Observation.start("talk", registry);
    try { // TODO: scope
        Thread.sleep(1000);
    }
    catch (Exception exception) {
        observation.error(exception);
        throw exception;
    }
    finally { // TODO: attach tags (key-value)
        observation.stop();
    }
  34. Observation API (Micrometer 1.10)
    ObservationRegistry registry = ObservationRegistry.create();
    registry.observationConfig()
        .observationHandler(new MeterHandler(...))
        .observationHandler(new TracingHandler(...))
        .observationHandler(new LoggingHandler(...))
        .observationHandler(new AuditEventHandler(...));
    Observation observation = Observation.start("talk", registry);
    // let the fun begin…
    observation.stop();
  35. Observation API (Micrometer 1.10)
    Observation.createNotStarted("talk", registry)
        .lowCardinalityKeyValue("conference", "J1")
        .highCardinalityKeyValue("uid", userId)
        .observe(this::talk);
    @Observed
  36. DEMO

  37. Thank you!
    Follow us on Twitter: @ebullientworks @jonatan_ivanov
    Learn more: SpringOne December 6-8
  38. FAQ
    • Why is measuring avg/median not a good idea?
    • Why is measuring only TP95 (or TP99) not a good idea?
    • Why does avg(TP95) not make sense?
    • Why should you measure max?
    • What problems can high cardinality cause in metrics?
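
One of the FAQ questions, why avg(TP95) does not make sense, can be illustrated with a small plain-Java sketch. The class name, the nearest-rank percentile method, and the latencies are assumptions chosen for illustration; they are not taken from the talk.

```java
import java.util.Arrays;

// Hypothetical sketch: averaging per-instance percentiles vs. the percentile
// of all requests taken together.
final class PercentileAveraging {

    // Nearest-rank percentile computed on a sorted copy of the data.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Builds an array with `count` copies of the same latency.
    static double[] repeated(double latencyMillis, int count) {
        double[] values = new double[count];
        Arrays.fill(values, latencyMillis);
        return values;
    }

    public static void main(String[] args) {
        double[] busyFast = repeated(10, 1000);  // instance A: 1000 requests at 10 ms
        double[] idleSlow = repeated(1000, 10);  // instance B: 10 requests at 1000 ms

        double averageOfP95 = (percentile(busyFast, 95) + percentile(idleSlow, 95)) / 2;

        double[] all = Arrays.copyOf(busyFast, busyFast.length + idleSlow.length);
        System.arraycopy(idleSlow, 0, all, busyFast.length, idleSlow.length);

        System.out.println(averageOfP95);        // 505.0
        System.out.println(percentile(all, 95)); // 10.0
    }
}
```

The averaged value (505 ms) describes no real request and ignores that instance B served about 1% of the traffic; percentiles have to be computed from recombinable data such as histogram buckets, not averaged after the fact.
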
  39. Logging - Metrics - Distributed Tracing - And More…
    • Logging: What happened (why)? → Emitting events
    • Metrics: What is the context? → Aggregating data
    • Distributed Tracing: Why did it happen? → Recording events with causal ordering
    • And More…: /health, /info, etc.; Events/Signals; Service Registry/Discoverability, API Discoverability
  40. Examples
    • Logging: Processing took 140ms
    • Metrics: P99.999: 140ms, Max: 150ms
    • Distributed Tracing: DB was slow (a lot of data was requested)
    • Logging: Processing failed (stacktrace?)
    • Metrics: The error rate is 0.001/sec, 2 errors in the last 30 minutes
    • Distributed Tracing: DB call failed (invalid input)
  41. Meters
    Counter; Timer, LongTaskTimer; DistributionSummary; Gauge
    How they help answer:
    • How will you know if you've deployed a bug that affects your users?
    • Or if your last change caused significant performance degradation?
    • How can you know when network issues arise?
    • Or one of your dependencies goes down?
  42. Distribution Summary

  43. FunctionCounter → gauge-like counter function
    1. registry.more().counter(...)
    2. FunctionCounter.builder("hibernate.transactions", statistics,
           s -> s.getTransactionCount() - s.getSuccessfulTransactionCount())
         .tags("entityManagerFactory", sessionFactoryName)
         .tags("result", "failure")
         .description("The number of transactions we know to have failed")
         .register(registry);
  44. FunctionTimer → gauge-like, counter & duration function
    1. registry.more().timer(...);
    2. FunctionTimer.builder("cache.gets.latency", cache,
           c -> c.getLocalMapStats().getGetOperationCount(),
           c -> c.getLocalMapStats().getTotalGetLatency(),
           TimeUnit.NANOSECONDS)
         .tags("name", cache.getName())
         .description("Cache gets")
         .register(registry);