2022-10-19 JavaOne - To Production and Beyond: Metrics with Micrometer

To Production and Beyond: Metrics with Micrometer
Erin Schnabel, Jonatan Ivanov
2022-10-19

What is Observability?
How well we can understand the internals of a system based on its outputs
(providing meaningful information about what happens inside)

Various Opinions
• 3 pillars: Logging, Metrics, Distributed Tracing
• TEMPLE: Tracing, Change Events, Metrics, Profiles, Exceptions
• Arbitrary Wide Events, Signals
But what about:
• /health, /info, etc.
• Service Registry/Discoverability
• API Discoverability

Why do we need Observability?
Today's systems are insanely complex (cloud)
(Death Star Architecture, Big Ball of Mud)

Why do we need Observability?
• Environments can be chaotic
  ◦ You turn a knob here a little and services are going down there
• We need to deal with unknown unknowns
  ◦ We can’t know everything
• Things can be perceived differently by observers
  ◦ Everything is broken for the users but seems ok to you

Metrics
• Primary tool for measuring availability (SLI)
• Cheap
• Compact
• Efficient
• Cardinality is a concern, but…

Time Series: How does data change over time?
• Time is the primary (x) axis
• Data collected at regular intervals
• Appended to the end of a series

Time Series: Looking for trends…

Time Series: Aggregation across dimensions
Hierarchical
    job.instance.monsters.dice_rolls.die.face=value
    job.instance.monsters.dice_rolls.d10.01=value
    job.instance.monsters.dice_rolls.d8.08=value
    job.instance.monsters.dice_rolls.d6.03=value
    …
    *.*.monsters.dice_rolls.d6.*=value
    *.*.monsters.dice_rolls.d8.*=value
Is that the right order? What if there is another…

Time Series: Aggregation across dimensions
Dimensional
    dice_rolls_total{die="d10", face="01", instance="...", job="..."} 77062
• More flexible (additional / missing data)
• Cardinality: die (6 values), face (20 values), instance (3 values), job (3 values) = 6*20*3*3 = 1080 series
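The 1080 figure above is the worst case: the product of the distinct values of every tag. A minimal plain-JDK sketch of that arithmetic follows; it is not Micrometer API, and the class name `CardinalityEstimate` and the extra `userId` tag are hypothetical, added only to show how a single unbounded tag multiplies the series count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: worst-case series count is the product of distinct values per tag.
final class CardinalityEstimate {

    // Multiplies the number of distinct values of every tag key.
    static long seriesCount(Map<String, Integer> distinctValuesPerTag) {
        long total = 1;
        for (int values : distinctValuesPerTag.values()) {
            total = Math.multiplyExact(total, (long) values);
        }
        return total;
    }

    // Tag counts taken from the dice_rolls_total example.
    static Map<String, Integer> diceRollTags() {
        Map<String, Integer> tags = new LinkedHashMap<>();
        tags.put("die", 6);
        tags.put("face", 20);
        tags.put("instance", 3);
        tags.put("job", 3);
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(seriesCount(diceRollTags())); // 6 * 20 * 3 * 3

        Map<String, Integer> withUser = diceRollTags();
        withUser.put("userId", 10_000); // hypothetical high-cardinality tag
        System.out.println(seriesCount(withUser));
    }
}
```

Not every combination necessarily occurs in practice (a d6 never rolls a 20), so the real series count can be lower than the product.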

Micrometer
• Popular Metrics library on the JVM
• Like SLF4J, but for metrics
• Simple API
• Supports the most popular metric backends
• Support for lots of third-party libraries/frameworks
  ◦ Spring, Quarkus, Micronaut, Helidon, etc.

Like SLF4J, but for metrics
AppOptics, Atlas, Azure Monitor, CloudWatch (AWS), Datadog, Dynatrace, Elastic,
Ganglia, Graphite, Humio, InfluxDB, JMX, KairosDB, New Relic, OpenTSDB, OTLP,
Prometheus, SignalFx, Stackdriver (GCP), StatsD, Wavefront (VMware), …

Counter → Increment Only
1. registry.counter("example.prime.number", "type", "prime");
2. Counter.builder("counter")
       .baseUnit("beans") // optional
       .description("a description") // optional
       .tags("region", "test") // optional
       .register(registry);
3. @Counted(value = "metric.all", extraTags = { "region", "test" }) **

Micrometer: Counter
    registry.counter("dice.rolls", "die", d, "face", f).increment();
Prometheus:
    dice_rolls_total{die="d10", face="01", instance="...", job="..."} 77062
Rate of change:
• Counter reset
• Process restart
Rate of change to value of counter in last 10 minutes:
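Why a counter reset does not ruin a rate can be sketched in plain Java. This is a simplified illustration of the idea (a drop in value is treated as a restart from zero), assuming the samples arrive in time order. It is not Prometheus's actual implementation, which also extrapolates at the window edges; the class name and sample values are made up.

```java
// Sketch: total increase of a counter across samples, tolerating resets.
final class CounterIncrease {

    // samples: counter values scraped in time order.
    static double increase(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double delta = samples[i] - samples[i - 1];
            // A drop means the process restarted and the counter began again at 0,
            // so the whole new value counts as increase.
            total += (delta >= 0) ? delta : samples[i];
        }
        return total;
    }

    // Average rate over the window covered by the samples.
    static double ratePerSecond(double[] samples, double windowSeconds) {
        return increase(samples) / windowSeconds;
    }

    public static void main(String[] args) {
        double[] samples = {100, 160, 220, 30, 90}; // restart between 220 and 30
        System.out.println(increase(samples));           // 60 + 60 + 30 + 60
        System.out.println(ratePerSecond(samples, 600)); // over a 10 minute window
    }
}
```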

A rate is an average over an interval: the size of the interval matters
Note: Aggregation and precision
(Graph panels: 1 minute, 30 minutes)
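To make the effect of the interval concrete, here is a small plain-JDK sketch with invented data: a counter that grows by 10 per minute for 29 minutes and then by 300 in the last minute. The class name and numbers are assumptions for illustration only.

```java
import java.util.Arrays;

// Sketch: the same data gives very different rates for different window sizes.
final class RateWindow {

    // perMinuteIncrements[i] = how much the counter grew during minute i.
    static double averagePerMinute(int[] perMinuteIncrements, int lastMinutes) {
        long sum = 0;
        int from = perMinuteIncrements.length - lastMinutes;
        for (int i = from; i < perMinuteIncrements.length; i++) {
            sum += perMinuteIncrements[i];
        }
        return (double) sum / lastMinutes;
    }

    // Invented data: steady traffic with a burst in the final minute.
    static int[] sampleData() {
        int[] increments = new int[30];
        Arrays.fill(increments, 10);
        increments[29] = 300;
        return increments;
    }

    public static void main(String[] args) {
        int[] data = sampleData();
        System.out.println(averagePerMinute(data, 1));  // the burst dominates
        System.out.println(averagePerMinute(data, 30)); // the burst is smoothed out
    }
}
```

A short window reacts quickly but is noisy; a long window is smooth but hides spikes.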

Distribution Summary → .record(value), not time
1. registry.summary("name", "tagName", "tagValue")
2. DistributionSummary.builder("response.size")
       .description("a description") // optional
       .baseUnit("bytes") // optional
       .tags("tagName", "tagValue") // optional
       .register(registry);
Minimal information from a histogram: sum, count, max
• sum / count = aggregable average
• max is a decaying signal over a larger time window
Aggregable: Raw parts can be safely recombined across dimensions
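What "aggregable" means can be shown with a small plain-JDK sketch: merging the raw parts (sum, count, max) of two instances gives the true overall average, while averaging two pre-computed averages does not. The `Snapshot` type and the numbers are hypothetical and are not Micrometer classes.

```java
// Sketch: why sum and count are "aggregable" while a pre-computed average is not.
final class AggregableAverage {

    // The raw parts named above: count, sum, max.
    static final class Snapshot {
        final long count;
        final double sum;
        final double max;

        Snapshot(long count, double sum, double max) {
            this.count = count;
            this.sum = sum;
            this.max = max;
        }

        double mean() {
            return count == 0 ? 0.0 : sum / count;
        }

        // Recombining raw parts across a dimension (here: two instances).
        Snapshot merge(Snapshot other) {
            return new Snapshot(count + other.count, sum + other.sum, Math.max(max, other.max));
        }
    }

    public static void main(String[] args) {
        Snapshot a = new Snapshot(1, 1000.0, 1000.0);  // one large response
        Snapshot b = new Snapshot(99, 9900.0, 100.0);  // many small responses
        System.out.println(a.merge(b).mean());         // mean over all 100 recorded values
        System.out.println((a.mean() + b.mean()) / 2); // average of averages: weights are lost
    }
}
```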

A little aside, just for context…
Attack types:
• AC: Armor class
• DC: Difficulty class
Special:
• MISS: natural 1
• HIT: natural 20
    : oneRound:
        Troll(LARGE GIANT){AC:15,HP:84(8d10+40),STR:18(+4),DEX:13(+1),CON:20(+5),INT:7(-2),WIS:9(-1),CHA:7(-2),CR:
        Pit Fiend(LARGE FIEND){AC:19,HP:300(24d10+168),STR:26(+8),DEX:14(+2),CON:24(+7),INT:22(+6),WIS:18(+4),CHA:
    : attack: miss: Troll(36) -> Pit Fiend(100)
    : attack: miss: Troll(36) -> Pit Fiend(100)
    : attack: hit> Troll(36) -> Pit Fiend(97) for 9 damage using Claws[7hit,11(2d6+4)|slashing]
    : attack: hit> Pit Fiend(97) -> Troll(10) for 22 damage using Bite[14hit,22(4d6+8)|piercing]
    : attack: MISS: Pit Fiend(97) -> Troll(10)
    : attack: HIT> Pit Fiend(97) -> Troll(0) for 34 damage using Mace[14hit,15(2d6+8)|bludgeoning]
    : oneRound: survivors
        Pit Fiend(LARGE FIEND){AC:19,HP:300(24d10+168),STR:26(+8),DEX:14(+2),CON:24(+7),INT:22(+6),WIS:18(+4),CHA:

Distribution Summary: known unknowns
    registry.histogram("encounter.rounds",
            new Tag("numCombatants", e.getNumCombatants()),
            new Tag("targetSelector", e.getSelector()),
            new Tag("sizeDelta", e.getSizeDelta())
        ).update(totalRounds);
    encounter_rounds_count{ numCombatants="03", sizeDelta="00", targetSelector="BiggestFirst", instance="...", job="...", }
    registry.histogram("round.attacks",
            new Tag("hitOrMiss", event.hitOrMiss()),
            new Tag("attackType", event.getAttackType()),
            new Tag("damageType", event.getType())
        ).update(event.getDamageAmount());
    round_attacks_count{ attackType="attack-ac", damageType="acid", hitOrMiss="critical hit", instance="...", job="..." }

Distribution Summary

Timer → .record(duration)
1. registry.timer("my.timer", "tagName", "tagValue");
2. Timer.builder("my.timer")
       .description("description") // optional
       .tags("tagName", "tagValue") // optional
       .register(registry);
3. @Timed(value = "call", extraTags = {"tagName", "tagValue"}) **

Working with Timers
1. timer.record(() -> noReturnValue());
2. timer.recordCallable(() -> returnValue());
3. Runnable r = timer.wrap(() -> noReturnValue());
4. Callable c = timer.wrap(() -> returnValue());
5. Sample s = Timer.start(registry); doStuff; s.stop(timer);
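Conceptually, recording with a timer means "measure the elapsed time around a piece of work and keep count, total, and max". The plain-JDK sketch below illustrates that idea for the first two variants. `SimpleTimer` and `recordSupplier` are hypothetical names, not Micrometer's implementation, and the sketch is not thread-safe.

```java
import java.util.function.Supplier;

// Sketch: a minimal timer that tracks count, total time, and max (not thread-safe).
final class SimpleTimer {
    private long count;
    private long totalNanos;
    private long maxNanos;

    // Times a task without a return value.
    void record(Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            update(System.nanoTime() - start); // recorded even if the task throws
        }
    }

    // Times a task and passes its return value through.
    <T> T recordSupplier(Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            update(System.nanoTime() - start);
        }
    }

    private void update(long elapsedNanos) {
        count++;
        totalNanos += elapsedNanos;
        maxNanos = Math.max(maxNanos, elapsedNanos);
    }

    long count() { return count; }
    long totalNanos() { return totalNanos; }
    long maxNanos() { return maxNanos; }

    public static void main(String[] args) {
        SimpleTimer timer = new SimpleTimer();
        timer.record(() -> System.out.println("no return value"));
        String result = timer.recordSupplier(() -> "return value");
        System.out.println(result + ", count=" + timer.count());
    }
}
```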

Timing HTTP requests
Example from Quarkus; Spring does something similar.
    Timer.Sample sample = requestMetric.getSample(); // stored
    Timer.Builder builder = Timer.builder("http.server.requests")
        .tags(Tags.of(
            VertxMetricsTags.method(requestMetric.request().method()),
            HttpCommonTags.uri(path, response.statusCode()),
            VertxMetricsTags.outcome(response),
            HttpCommonTags.status(response.statusCode())));
    sample.stop(builder.register(registry));

Gauge → increase/decrease, observed value
1. List<String> list = registry.gauge("list.size", Tags.of(...), new ArrayList<>(), List::size);
2. Gauge.builder("jvm.threads.peak", threadBean, ThreadMXBean::getPeakThreadCount)
       .tags("region", "test")
       .description("The peak live thread count...")
       .baseUnit(BaseUnits.THREADS)
       .register(registry);
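"Observed value" means the gauge stores nothing itself: it asks the observed object for its current value whenever it is read. A minimal plain-JDK sketch of that idea follows. `SimpleGauge` is a hypothetical class, and unlike this sketch, Micrometer by default holds only a weak reference to the observed object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch: a gauge samples the observed object on every read.
final class SimpleGauge<T> {
    private final T target;                         // strong reference, a simplification
    private final ToDoubleFunction<T> valueFunction;

    SimpleGauge(T target, ToDoubleFunction<T> valueFunction) {
        this.target = target;
        this.valueFunction = valueFunction;
    }

    // Computes the value at read time; it can go up or down between reads.
    double value() {
        return valueFunction.applyAsDouble(target);
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<>();
        SimpleGauge<List<String>> gauge = new SimpleGauge<>(list, List::size);

        list.add("a");
        list.add("b");
        System.out.println(gauge.value()); // reflects two elements

        list.remove("a");
        System.out.println(gauge.value()); // reflects one element
    }
}
```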

LongTaskTimer → active count, longest active task
1. registry.more().longTaskTimer("long.task", "region", "test");
2. LongTaskTimer.builder("long.task")
       .description("a description") // optional
       .tags("region", "test") // optional
       .register(registry);
3. @Timed(value = "long.task", longTask = true, … ) **

What’s (kind of) new in Micrometer?
1.9.0 - 2022 May
• OTLP Registry (OpenTelemetry)
• HighCardinalityTagsDetector
• Exemplars (Prometheus)

Exemplars (Prometheus)
• “Metadata” that you can attach to your metrics
• Updated at measurement time (sampled)
• They are not tags (high cardinality)
• Usually traceId and spanId
• Correlate Metrics to Distributed Tracing and Logs
• Available for Counter and Histogram buckets

What’s new in Micrometer?
1.10.0 - 2022 November (RC is out)
• Micrometer Tracing (Sleuth w/o Spring deps.)
• Micrometer Docs Generator
• Micrometer Context Propagation
• Observation API (micrometer-core)

You want to instrument your application…
• Add logs (application logs)
• Add metrics
  ◦ Increment Counters
  ◦ Start/Stop Timers
• Add Distributed Tracing
  ◦ Start/Stop Spans
  ◦ Log Correlation
  ◦ Context Propagation

Observation API (Micrometer 1.10)
    Observation observation = Observation.start("talk", registry);
    try { // TODO: scope
        Thread.sleep(1000);
    }
    catch (Exception exception) {
        observation.error(exception);
        throw exception;
    }
    finally { // TODO: attach tags (key-value)
        observation.stop();
    }

Observation API (Micrometer 1.10)
    ObservationRegistry registry = ObservationRegistry.create();
    registry.observationConfig()
        .observationHandler(new MeterHandler(...))
        .observationHandler(new TracingHandler(...))
        .observationHandler(new LoggingHandler(...))
        .observationHandler(new AuditEventHandler(...));
    Observation observation = Observation.start("talk", registry);
    // let the fun begin…
    observation.stop();
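The idea behind registering several handlers is "instrument once, let each handler decide what to do with start and stop". The plain-JDK sketch below illustrates only that fan-out pattern. The `Handler` interface, `RecordingHandler`, and `observe` are hypothetical stand-ins and do not reproduce Micrometer's actual types or lifecycle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: instrument once, let every registered handler react to start/stop.
final class HandlerFanOut {

    // Hypothetical callback interface, not a Micrometer type.
    interface Handler {
        void onStart(String name);
        void onStop(String name);
    }

    // Stand-in for a metrics, tracing, or logging handler: it only records what it saw.
    static final class RecordingHandler implements Handler {
        private final String label;
        private final List<String> events;

        RecordingHandler(String label, List<String> events) {
            this.label = label;
            this.events = events;
        }

        @Override
        public void onStart(String name) {
            events.add(label + ":start:" + name);
        }

        @Override
        public void onStop(String name) {
            events.add(label + ":stop:" + name);
        }
    }

    // Notifies all handlers around the unit of work; stop fires even if the work throws.
    static void observe(String name, List<Handler> handlers, Runnable work) {
        handlers.forEach(h -> h.onStart(name));
        try {
            work.run();
        } finally {
            handlers.forEach(h -> h.onStop(name));
        }
    }

    static List<String> demo() {
        List<String> events = new ArrayList<>();
        List<Handler> handlers = Arrays.asList(
                new RecordingHandler("metrics", events),
                new RecordingHandler("tracing", events));
        observe("talk", handlers, () -> events.add("work"));
        return events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```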

Observation API (Micrometer 1.10)
    Observation.createNotStarted("talk", registry)
        .lowCardinalityKeyValue("conference", "J1")
        .highCardinalityKeyValue("uid", userId)
        .observe(this::talk);
    @Observed

Thank you!
Follow us on Twitter: @ebullientworks, @jonatan_ivanov
Learn more: SpringOne, December 6-8

FAQ
• Why is measuring avg/median not a good idea?
• Why is measuring only TP95 (or TP99) not a good idea?
• Why does avg(TP95) not make sense?
• Why should you measure max?
• What problems can high cardinality cause in metrics?
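One way to see why avg(TP95) does not make sense: percentiles computed per instance cannot be recombined, because the request volume behind each one is lost. The plain-JDK sketch below uses invented latencies and a simple nearest-rank percentile; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Sketch: averaging per-instance percentiles vs. the percentile of all requests.
final class PercentileAveraging {

    // Nearest-rank percentile: the value at rank ceil(percent / 100 * n) in sorted order.
    static double percentile(double[] values, int percent) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (percent * sorted.length + 99) / 100; // integer ceiling
        return sorted[Math.max(rank, 1) - 1];
    }

    static double[] filled(int n, double value) {
        double[] out = new double[n];
        Arrays.fill(out, value);
        return out;
    }

    static double[] concat(double[] a, double[] b) {
        double[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        double[] busy = filled(1000, 10.0);  // busy instance: 1000 requests at 10 ms
        double[] idle = filled(10, 1000.0);  // nearly idle instance: 10 requests at 1000 ms

        double averaged = (percentile(busy, 95) + percentile(idle, 95)) / 2;
        double combined = percentile(concat(busy, idle), 95);

        System.out.println("avg(TP95) = " + averaged);
        System.out.println("TP95 over all requests = " + combined);
    }
}
```

The averaged figure weights the nearly idle instance as heavily as the busy one, so it describes neither instance nor the system as a whole.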

Logging - Metrics - Distributed Tracing - And More…
• Logging: What happened (why)? → Emitting events
• Metrics: What is the context? → Aggregating data
• Distributed Tracing: Why did it happen? → Recording events with causal ordering
• And More…
  ◦ /health, /info, etc.
  ◦ Events/Signals
  ◦ Service Registry/Discoverability, API Discoverability

Examples
• Logging: Processing took 140ms
• Metrics: P99.999: 140ms, Max: 150ms
• Distributed Tracing: DB was slow (lot of data was requested)
• Logging: Processing failed (stacktrace?)
• Metrics: The error rate is 0.001/sec, 2 errors in the last 30 minutes
• Distributed Tracing: DB call failed (invalid input)

Meters
• Counter
• Timer, LongTaskTimer
• DistributionSummary
• Gauge
How they help answer:
• How will you know if you've deployed a bug that affects your users?
• Or if your last change caused significant performance degradation?
• How can you know when network issues arise?
• Or one of your dependencies goes down?

Distribution Summary

FunctionCounter → gauge-like counter function
1. registry.more().counter(...)
2. FunctionCounter.builder("hibernate.transactions", statistics,
           s -> s.getTransactionCount() - s.getSuccessfulTransactionCount())
       .tags("entityManagerFactory", sessionFactoryName)
       .tags("result", "failure")
       .description("The number of transactions we know to have failed")
       .register(registry);

FunctionTimer → gauge-like, counter & duration function
1. registry.more().timer(...);
2. FunctionTimer.builder("cache.gets.latency", cache,
           c -> c.getLocalMapStats().getGetOperationCount(),
           c -> c.getLocalMapStats().getTotalGetLatency(),
           TimeUnit.NANOSECONDS)
       .tags("name", cache.getName())
       .description("Cache gets")
       .register(registry);
