Slide 1

Slide 1 text

Erin Schnabel, Jonatan Ivanov. 2022-10-19. To Production and Beyond: Metrics with Micrometer

Slide 2

Slide 2 text

What is Observability? How well we can understand the internals of a system based on its outputs (Providing meaningful information about what happens inside)

Slide 3

Slide 3 text

Various Opinions. 3 pillars: Logging, Metrics, Distributed Tracing. TEMPLE: Tracing, (Change) Events, Metrics, Profiles, Logs, Exceptions. Arbitrary Wide Events, Signals. But what about: /health, /info, etc., Service Registry/Discoverability, API Discoverability?

Slide 4

Slide 4 text

Why do we need Observability? Today's systems are insanely complex (cloud) (Death Star Architecture, Big Ball of Mud)

Slide 5

Slide 5 text

Why do we need Observability? Environments can be chaotic: you turn a knob a little over here and services go down over there. We need to deal with unknown unknowns; we can't know everything. Things can be perceived differently by different observers: everything is broken for the users but seems OK to you.

Slide 6

Slide 6 text

Metrics Primary tool for measuring availability (SLI) Cheap Compact Efficient Cardinality is a concern, but…

Slide 7

Slide 7 text

Time Series: How does data change over time? Time is the primary (x) axis Data collected at regular intervals Appended to the end of a series

Slide 8

Slide 8 text

Time Series: Looking for trends…

Slide 9

Slide 9 text

Time Series: Aggregation across dimensions Hierarchical job.instance.monsters.dice_rolls.die.face=value job.instance.monsters.dice_rolls.d10.01=value job.instance.monsters.dice_rolls.d8.08=value job.instance.monsters.dice_rolls.d6.03=value … *.*.monsters.dice_rolls.d6.*=value *.*.monsters.dice_rolls.d8.*=value Is that the right order? What if there is another…

Slide 10

Slide 10 text

Time Series: Aggregation across dimensions Dimensional dice_rolls_total { die="d10" face="01" instance="..." job="..." } 77062 More flexible (additional / missing data) Cardinality: die (6 values), face (20 values), instance (3 values), job (3 values) = 6*20*3*3 = 1080 series

Slide 11

Slide 11 text

Micrometer Popular Metrics library on the JVM Like SLF4J, but for metrics Simple API Supports the most popular metric backends Support for lots of third-party libraries/frameworks Spring, Quarkus, Micronaut, Helidon, etc.

Slide 12

Slide 12 text

Like SLF4J, but for metrics … Ganglia Graphite Humio InfluxDB JMX KairosDB New Relic OpenTSDB OTLP Prometheus SignalFx Stackdriver (GCP) StatsD Wavefront (VMware) AppOptics Atlas Azure Monitor CloudWatch (AWS) Datadog Dynatrace Elastic

Slide 13

Slide 13 text

Counter → Increment Only

1. registry.counter("example.prime.number", "type", "prime");

2. Counter.builder("counter")
       .baseUnit("beans")            // optional
       .description("a description") // optional
       .tags("region", "test")       // optional
       .register(registry);

3. @Counted(value = "metric.all", extraTags = { "region", "test" }) **
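
As a small illustration of the two programmatic forms above, a minimal, self-contained sketch using Micrometer's in-memory SimpleMeterRegistry; the meter names and tag values here are placeholders, not from the talk's demo:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CounterExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry(); // in-memory registry, handy for demos/tests

        // Shorthand form: name followed by tag key/value pairs
        Counter primes = registry.counter("example.prime.number", "type", "prime");
        primes.increment();

        // Builder form with optional metadata
        Counter beans = Counter.builder("example.beans")
                .baseUnit("beans")
                .description("a description")
                .tags("region", "test")
                .register(registry);
        beans.increment(3);

        System.out.println(primes.count() + " / " + beans.count()); // 1.0 / 3.0
    }
}

The annotation form (3) additionally needs aspect/framework wiring to take effect, which is presumably what the ** footnote refers to.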

Slide 14

Slide 14 text

Micrometer: Counter

registry.counter("dice.rolls", "die", d, "face", f).increment();

Prometheus:
dice_rolls_total{die="d10", face="01", instance="...", job="..."} 77062

Rate of change: counter reset (process restart)
Rate of change of the counter value in the last 10 minutes:

Slide 15

Slide 15 text

A rate is an average over an interval: the size of the interval matters. Note: aggregation and precision, 1-minute vs 30-minute windows.

Slide 16

Slide 16 text

Distribution Summary → .record(value), not time

1. registry.summary("name", "tagName", "tagValue");

2. DistributionSummary.builder("response.size")
       .description("a description") // optional
       .baseUnit("bytes")            // optional
       .tags("tagName", "tagValue")  // optional
       .register(registry);

Minimal information from a histogram: sum, count, max
● sum / count = aggregable average
● max is a decaying signal over a larger time window
Aggregable: raw parts can be safely recombined across dimensions
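
A minimal sketch of recording a distribution summary and reading back the sum, count, and max described above; the response sizes are made-up values and SimpleMeterRegistry is used only to keep the example self-contained:

import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ResponseSizeExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        DistributionSummary responseSize = DistributionSummary.builder("response.size")
                .baseUnit("bytes")
                .description("size of HTTP responses")
                .register(registry);

        // record amounts (sizes), not durations
        responseSize.record(512);
        responseSize.record(2048);
        responseSize.record(128);

        // sum / count is the aggregable average; max decays over a time window
        System.out.printf("count=%d sum=%.0f avg=%.1f max=%.0f%n",
                responseSize.count(), responseSize.totalAmount(),
                responseSize.mean(), responseSize.max());
    }
}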

Slide 17

Slide 17 text

A little aside, just for context… Attack types: AC: Armor class, DC: Difficulty class

: oneRound: Troll(LARGE GIANT){AC:15,HP:84(8d10+40),STR:18(+4),DEX:13(+1),CON:20(+5),INT:7(-2),WIS:9(-1),CHA:7(-2),CR:
  Pit Fiend(LARGE FIEND){AC:19,HP:300(24d10+168),STR:26(+8),DEX:14(+2),CON:24(+7),INT:22(+6),WIS:18(+4),CHA:
: attack: miss: Troll(36) -> Pit Fiend(100)
: attack: miss: Troll(36) -> Pit Fiend(100)
: attack: hit> Troll(36) -> Pit Fiend(97) for 9 damage using Claws[7hit,11(2d6+4)|slashing]
: attack: hit> Pit Fiend(97) -> Troll(10) for 22 damage using Bite[14hit,22(4d6+8)|piercing]
: attack: MISS: Pit Fiend(97) -> Troll(10)
: attack: HIT> Pit Fiend(97) -> Troll(0) for 34 damage using Mace[14hit,15(2d6+8)|bludgeoning]
: oneRound: survivors Pit Fiend(LARGE FIEND){AC:19,HP:300(24d10+168),STR:26(+8),DEX:14(+2),CON:24(+7),INT:22(+6),WIS:18(+4),CHA:

Special: MISS: natural 1, HIT: natural 20

Slide 18

Slide 18 text

Distribution Summary: known unknowns

registry.histogram("encounter.rounds",
        new Tag("numCombatants", e.getNumCombatants()),
        new Tag("targetSelector", e.getSelector()),
        new Tag("sizeDelta", e.getSizeDelta())
    ).update(totalRounds);

encounter_rounds_count{numCombatants="03", sizeDelta="00", targetSelector="BiggestFirst", instance="...", job="..."}

registry.histogram("round.attacks",
        new Tag("hitOrMiss", event.hitOrMiss()),
        new Tag("attackType", event.getAttackType()),
        new Tag("damageType", event.getType())
    ).update(event.getDamageAmount());

round_attacks_count{attackType="attack-ac", damageType="acid", hitOrMiss="critical hit", instance="...", job="..."}
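
The snippets above use a histogram API that takes Tag objects and update() (the style used in the demo application). A roughly equivalent sketch with Micrometer's DistributionSummary is shown below; the accessors on e are assumed to return strings, as their use as tag values implies:

DistributionSummary.builder("encounter.rounds")
        .tags("numCombatants", e.getNumCombatants(),
              "targetSelector", e.getSelector(),
              "sizeDelta", e.getSizeDelta())
        .register(registry)      // returns the existing meter if this name + tags were registered before
        .record(totalRounds);    // record() instead of update()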

Slide 19

Slide 19 text

Distribution Summary

Slide 20

Slide 20 text

Timer → .record(duration)

1. registry.timer("my.timer", "tagName", "tagValue");

2. Timer.builder("my.timer")
       .description("a description")  // optional
       .tags("tagName", "tagValue")   // optional
       .register(registry);

3. @Timed(value = "call", extraTags = {"tagName", "tagValue"}) **

Slide 21

Slide 21 text

Working with Timers

1. timer.record(() -> noReturnValue());
2. timer.recordCallable(() -> returnValue());
3. Runnable r = timer.wrap(() -> noReturnValue());
4. Callable c = timer.wrap(() -> returnValue());
5. Timer.Sample s = Timer.start(registry);
   doStuff();
   s.stop(timer);
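
Putting options 1 and 5 together, a minimal runnable sketch; doWork() is a stand-in for real work and SimpleMeterRegistry keeps it self-contained:

import java.util.concurrent.TimeUnit;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class TimerExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("my.timer")
                .tags("tagName", "tagValue")
                .register(registry);

        // 1. record a block of code directly
        timer.record(TimerExample::doWork);

        // 5. sample style: start now, decide which timer to stop against later
        Timer.Sample sample = Timer.start(registry);
        doWork();
        sample.stop(timer);

        System.out.println(timer.count() + " timings, max "
                + timer.max(TimeUnit.MILLISECONDS) + " ms");
    }

    static void doWork() {
        // stand-in for real work
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}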

Slide 22

Slide 22 text

Timing HTTP requests
Example from Quarkus; Spring does something similar…

Timer.Sample sample = requestMetric.getSample(); // stored

Timer.Builder builder = Timer.builder("http.server.requests")
    .tags(Tags.of(
        VertxMetricsTags.method(requestMetric.request().method()),
        HttpCommonTags.uri(path, response.statusCode()),
        VertxMetricsTags.outcome(response),
        HttpCommonTags.status(response.statusCode())));

sample.stop(builder.register(registry));

Slide 23

Slide 23 text

Gauge → increase/decrease, observed value

1. List list = registry.gauge("list.size", Tags.of(...), new ArrayList<>(), List::size);

2. Gauge.builder("jvm.threads.peak", threadBean, ThreadMXBean::getPeakThreadCount)
       .tags("region", "test")
       .description("The peak live thread count...")
       .baseUnit(BaseUnits.THREADS)
       .register(registry);
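
A minimal sketch of gauging the size of a collection the application already maintains (the queue and meter name are hypothetical); note that a gauge is polled when the value is observed rather than updated explicitly:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class GaugeExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Queue<String> jobs = new ConcurrentLinkedQueue<>();

        // the gauge calls size() on the queue whenever the value is observed
        Gauge.builder("jobs.queued", jobs, Queue::size)
                .description("number of jobs waiting to be processed")
                .register(registry);

        jobs.add("a");
        jobs.add("b");
        System.out.println(registry.get("jobs.queued").gauge().value()); // 2.0
    }
}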

Slide 24

Slide 24 text

LongTaskTimer → active count, longest active task

1. registry.more().longTaskTimer("long.task", "region", "test");

2. LongTaskTimer.builder("long.task")
       .description("a description") // optional
       .tags("region", "test")       // optional
       .register(registry);

3. @Timed(value = "long.task", longTask = true, … ) **
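
A minimal usage sketch: unlike a regular Timer, a LongTaskTimer reports on tasks while they are still running (active count, duration so far). Thread.sleep stands in for long-running work:

import io.micrometer.core.instrument.LongTaskTimer;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class LongTaskTimerExample {
    public static void main(String[] args) throws InterruptedException {
        MeterRegistry registry = new SimpleMeterRegistry();
        LongTaskTimer longTask = LongTaskTimer.builder("long.task")
                .tags("region", "test")
                .register(registry);

        LongTaskTimer.Sample sample = longTask.start();                 // task begins
        Thread.sleep(100);                                              // long-running work
        System.out.println("active tasks: " + longTask.activeTasks()); // 1, while still running
        sample.stop();                                                  // task ends
    }
}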

Slide 25

Slide 25 text

What’s (kind of) new in Micrometer? 1.9.0 (May 2022): OTLP Registry (OpenTelemetry), HighCardinalityTagsDetector, Exemplars (Prometheus)

Slide 26

Slide 26 text

Exemplars (Prometheus) “Metadata” that you can attach to your metrics Updated at measurement time (sampled) They are not tags (high cardinality) Usually traceId and spanId Correlate Metrics to Distributed Tracing and Logs Available for Counter and Histogram buckets

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

What’s new in Micrometer? 1.10.0 (November 2022, RC is out): Micrometer Tracing (Sleuth w/o Spring deps.), Micrometer Docs Generator, Micrometer Context Propagation, Observation API (micrometer-core)

Slide 32

Slide 32 text

You want to instrument your application… ● Add logs (application logs) ● Add metrics ○ Increment Counters ○ Start/Stop Timers ● Add Distributed Tracing ○ Start/Stop Spans ○ Log Correlation ○ Context Propagation

Slide 33

Slide 33 text

Observation API (Micrometer 1.10)

Observation observation = Observation.start("talk", registry);
try {
    // TODO: scope
    Thread.sleep(1000);
} catch (Exception exception) {
    observation.error(exception);
    throw exception;
} finally {
    // TODO: attach tags (key-value)
    observation.stop();
}

Slide 34

Slide 34 text

Observation API (Micrometer 1.10)

ObservationRegistry registry = ObservationRegistry.create();
registry.observationConfig()
    .observationHandler(new MeterHandler(...))
    .observationHandler(new TracingHandler(...))
    .observationHandler(new LoggingHandler(...))
    .observationHandler(new AuditEventHandler(...));

Observation observation = Observation.start("talk", registry);
// let the fun begin…
observation.stop();
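
The handlers on this slide (MeterHandler, TracingHandler, etc.) are illustrative names; below is a hedged sketch of what a trivial custom handler could look like, assuming Micrometer 1.10's ObservationHandler contract:

import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationHandler;
import io.micrometer.observation.ObservationRegistry;

public class SimpleLoggingHandler implements ObservationHandler<Observation.Context> {
    @Override
    public void onStart(Observation.Context context) {
        System.out.println("started: " + context.getName());
    }

    @Override
    public void onStop(Observation.Context context) {
        System.out.println("stopped: " + context.getName());
    }

    @Override
    public boolean supportsContext(Observation.Context context) {
        return true; // react to every observation
    }

    public static void main(String[] args) {
        ObservationRegistry registry = ObservationRegistry.create();
        registry.observationConfig().observationHandler(new SimpleLoggingHandler());

        Observation.createNotStarted("talk", registry).observe(() -> { /* instrumented work */ });
    }
}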

Slide 35

Slide 35 text

Observation API (Micrometer 1.10)

Observation.createNotStarted("talk", registry)
    .lowCardinalityKeyValue("conference", "J1")
    .highCardinalityKeyValue("uid", userId)
    .observe(this::talk);

@Observed

Slide 36

Slide 36 text

DEMO

Slide 37

Slide 37 text

Thank you! Follow us on Twitter: @ebullientworks @jonatan_ivanov Learn more: SpringOne December 6-8

Slide 38

Slide 38 text

FAQ Why is measuring avg/median not a good idea? Why is measuring only TP95 (or TP99) not a good idea? Why does avg(TP95) not make sense? Why should you measure max? What problems can high cardinality cause in metrics?

Slide 39

Slide 39 text

Logging - Metrics - Distributed Tracing - And More… Logging: What happened (and why)? → Emitting events. Metrics: What is the context? → Aggregating data. Distributed Tracing: Why did it happen? → Recording events with causal ordering. And More… /health, /info, etc., Events/Signals, Service Registry/Discoverability, API Discoverability

Slide 40

Slide 40 text

Examples. Logging: Processing took 140ms. Metrics: P99.999: 140ms, Max: 150ms. Distributed Tracing: DB was slow (a lot of data was requested). Logging: Processing failed (stacktrace?). Metrics: The error rate is 0.001/sec, 2 errors in the last 30 minutes. Distributed Tracing: DB call failed (invalid input).

Slide 41

Slide 41 text

Meters: Counter, Timer, LongTaskTimer, DistributionSummary, Gauge. How they help answer: How will you know if you've deployed a bug that affects your users? Or if your last change caused significant performance degradation? How can you know when network issues arise? Or when one of your dependencies goes down?

Slide 42

Slide 42 text

Distribution Summary

Slide 43

Slide 43 text

FunctionCounter → gauge-like counter function

1. registry.more().counter(...)

2. FunctionCounter.builder("hibernate.transactions", statistics,
           s -> s.getTransactionCount() - s.getSuccessfulTransactionCount())
       .tags("entityManagerFactory", sessionFactoryName)
       .tags("result", "failure")
       .description("The number of transactions we know to have failed")
       .register(registry);
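
A smaller, self-contained sketch of the same idea: the counter tracks a monotonically increasing value the application already keeps (here a hypothetical AtomicLong), and Micrometer polls it instead of being incremented directly:

import java.util.concurrent.atomic.AtomicLong;
import io.micrometer.core.instrument.FunctionCounter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class FunctionCounterExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        AtomicLong messagesSeen = new AtomicLong();

        FunctionCounter.builder("messages.seen", messagesSeen, AtomicLong::doubleValue)
                .description("messages observed so far")
                .register(registry);

        // the application updates its own state; the meter just reads it
        messagesSeen.incrementAndGet();
    }
}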

Slide 44

Slide 44 text

FunctionTimer → gauge-like, counter & duration function

1. registry.more().timer(...);

2. FunctionTimer.builder("cache.gets.latency", cache,
           c -> c.getLocalMapStats().getGetOperationCount(),
           c -> c.getLocalMapStats().getTotalGetLatency(),
           TimeUnit.NANOSECONDS)
       .tags("name", cache.getName())
       .description("Cache gets")
       .register(registry);
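
A hedged sketch of the same pattern with an application-maintained stats object (RequestStats is hypothetical): the two functions report how many timings happened and how much total time they took, and Micrometer polls them at publish time:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import io.micrometer.core.instrument.FunctionTimer;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class FunctionTimerExample {
    static class RequestStats {                    // hypothetical stats the application keeps anyway
        final AtomicLong count = new AtomicLong();
        final AtomicLong totalMillis = new AtomicLong();
    }

    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        RequestStats stats = new RequestStats();

        FunctionTimer.builder("requests.handled", stats,
                        s -> s.count.get(),        // count function
                        s -> s.totalMillis.get(),  // total time function
                        TimeUnit.MILLISECONDS)     // unit of the total time function
                .register(registry);

        // the application records into its own structure; the meter only reads it
        stats.count.incrementAndGet();
        stats.totalMillis.addAndGet(42);
    }
}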