Change Events, Metrics, Profiles, Exceptions
Arbitrary Wide Events, Signals
But what about:
• /health, /info, etc.
• Service Registry/Discoverability
• API Discoverability
turn a knob here a little and services are going down there
We need to deal with unknown unknowns
• We can’t know everything
• Things can be perceived differently by observers
• Everything is broken for the users but seems ok to you
for metrics
• Simple API
• Supports the most popular metric backends
• Support for lots of third-party libraries/frameworks: Spring, Quarkus, Micronaut, Helidon, etc.
die="d10" face="01" instance="..." job="..." } 77062
Rate of change:
• Counter reset
• Process restart
Rate of change of the counter's value in the last 10 minutes:
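The interaction between rate of change and counter resets can be sketched in plain Java. This is a simplified illustration, not how any particular backend computes it: the `CounterRate` class, the `Sample` record, and the sample values are all hypothetical, and no extrapolation to window boundaries is attempted. The idea shown is that a drop in a monotonic counter is read as a reset (for example a process restart), so the value after the drop counts as fresh increase.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Micrometer or Prometheus.
class CounterRate {

    /** One scraped sample: timestamp in seconds and the counter's value at that time. */
    record Sample(long epochSeconds, double value) {}

    /**
     * Total increase across the samples. A drop in value is treated as a
     * counter reset, so the post-reset value counts as new increase
     * instead of producing a negative delta.
     */
    static double increase(List<Sample> samples) {
        double total = 0.0;
        for (int i = 1; i < samples.size(); i++) {
            double previous = samples.get(i - 1).value();
            double current = samples.get(i).value();
            total += (current >= previous) ? current - previous : current;
        }
        return total;
    }

    /** Average per-second rate between the first and last sample. */
    static double ratePerSecond(List<Sample> samples) {
        if (samples.size() < 2) {
            return 0.0;
        }
        long elapsed = samples.get(samples.size() - 1).epochSeconds()
                - samples.get(0).epochSeconds();
        return elapsed > 0 ? increase(samples) / elapsed : 0.0;
    }

    public static void main(String[] args) {
        // Ten minutes of made-up samples, scraped every 200 seconds.
        List<Sample> tenMinutes = List.of(
                new Sample(0, 100),
                new Sample(200, 160),
                new Sample(400, 20),   // process restarted: counter began again from 0
                new Sample(600, 80));
        System.out.println(increase(tenMinutes));      // 60 + 20 + 60 = 140.0
        System.out.println(ratePerSecond(tenMinutes)); // 140 / 600 seconds
    }
}
```

Without the reset handling, the naive difference between the last and first sample (80 - 100) would be negative, which is why the raw counter value is rarely useful on its own.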
2. DistributionSummary.builder("response.size")
       .description("a description") // optional
       .baseUnit("bytes") // optional
       .tags("tagName", "tagValue") // optional
       .register(registry);

Minimal information from a histogram: sum, count, max
• sum / count = aggregable average
• max is a decaying signal over a larger time window
Aggregable: Raw parts can be safely recombined across dimensions
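What "aggregable" means for sum and count can be shown with a small plain-Java sketch. The `AggregableAverage` class, the `Summary` record, and the numbers are made up for illustration and are not Micrometer types. The sketch merges the raw parts from two instances and compares the result with a naive average of the per-instance averages.

```java
// Hypothetical types for illustration; not part of Micrometer.
class AggregableAverage {

    /** The minimal histogram data listed above: sum, count, and max. */
    record Summary(double sum, long count, double max) {

        double mean() {
            return count == 0 ? 0.0 : sum / count;
        }

        /** Recombine raw parts: sums and counts add, the larger max wins. */
        Summary merge(Summary other) {
            return new Summary(sum + other.sum, count + other.count,
                    Math.max(max, other.max));
        }
    }

    public static void main(String[] args) {
        Summary instanceA = new Summary(1_000, 10, 300);   // mean 100
        Summary instanceB = new Summary(18_000, 90, 250);  // mean 200

        Summary merged = instanceA.merge(instanceB);
        double meanOfMeans = (instanceA.mean() + instanceB.mean()) / 2;

        System.out.println(merged.mean());  // 19000 / 100 = 190.0, the true average
        System.out.println(meanOfMeans);    // 150.0, ignores that B served 9x the requests
        System.out.println(merged.max());   // 300.0
    }
}
```

The merged average is correct because sum and count keep the weighting that a precomputed average has already thrown away.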
Updated at measurement time (sampled)
They are not tags (high cardinality)
Usually traceId and spanId
Correlate Metrics to Distributed Tracing and Logs
Available for Counter and Histogram buckets
measuring only TP95 (or TP99) is not a good idea?
Why does avg(TP95) not make sense?
Why should you measure max?
What problems can high cardinality cause in metrics?
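One way to explore the avg(TP95) and max questions is a small plain-Java experiment. Everything here is an assumption made for illustration: the `PercentileAveraging` class, the nearest-rank percentile definition, and the two made-up instances (a busy one that is always fast and a nearly idle one that is always slow).

```java
import java.util.Arrays;
import java.util.stream.DoubleStream;

// Hypothetical experiment; the data and the percentile definition are assumptions.
class PercentileAveraging {

    /** Nearest-rank percentile: smallest value with at least percent% of samples at or below it. */
    static double percentile(double[] samples, int percent) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = Math.max(1, (percent * sorted.length + 99) / 100); // ceil(percent * n / 100)
        return sorted[rank - 1];
    }

    /** Builds an array holding the same latency value the given number of times. */
    static double[] repeat(double value, int times) {
        double[] out = new double[times];
        Arrays.fill(out, value);
        return out;
    }

    public static void main(String[] args) {
        double[] busyInstance = repeat(10, 1000);  // 1000 requests at 10 ms
        double[] idleInstance = repeat(1000, 10);  // 10 requests at 1000 ms
        double[] all = DoubleStream.concat(
                Arrays.stream(busyInstance), Arrays.stream(idleInstance)).toArray();

        double averagedP95 = (percentile(busyInstance, 95) + percentile(idleInstance, 95)) / 2;
        double overallP95 = percentile(all, 95);
        double max = Arrays.stream(all).max().getAsDouble();

        System.out.println(averagedP95); // 505.0, a latency no request ever had
        System.out.println(overallP95);  // 10.0, the slow requests sit above the 95th percentile
        System.out.println(max);         // 1000.0, the worst case that TP95 alone hides
    }
}
```

Averaging the two percentiles weights both instances equally and loses the traffic distribution, while the overall TP95 hides the slow 1% entirely, which is one argument for recording max alongside it.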
What happened (why)? → Emitting events
Metrics
What is the context? → Aggregating data
Distributed Tracing
Why did it happen? → Recording events with causal ordering
And More…
• /health, /info, etc.
• Events/Signals
• Service Registry/Discoverability, API Discoverability
ms
Distributed Tracing: DB was slow (a lot of data was requested)
Logging: Processing failed (stacktrace?)
Metrics: The error rate is 0.001/sec (2 errors in the last 30 minutes)
Distributed Tracing: DB call failed (invalid input)
will you know if you've deployed a bug that affects your users?
Or if your last change caused significant performance degradation?
How can you know when network issues arise?
Or one of your dependencies goes down?
Meters
        s -> s.getTransactionCount() - s.getSuccessfulTransactionCount())
    .tags("entityManagerFactory", sessionFactoryName)
    .tags("result", "failure")
    .description("The number of transactions we know to have failed")
    .register(registry);
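The fragment above derives a counter value from an existing statistics object instead of incrementing a counter by hand. The idea can be sketched in plain Java without any metrics library. The names `FunctionCounterSketch`, `TransactionStats`, and `DerivedCounter` are hypothetical stand-ins, not Micrometer or Hibernate types, and details a real library would handle (registration, tags, holding the source weakly) are left out.

```java
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of a function-backed counter; not a library API.
class FunctionCounterSketch {

    /** Stand-in for a statistics object that maintains its own monotonic totals. */
    static final class TransactionStats {
        private long transactionCount;
        private long successfulTransactionCount;

        void record(boolean success) {
            transactionCount++;
            if (success) {
                successfulTransactionCount++;
            }
        }

        long getTransactionCount() { return transactionCount; }
        long getSuccessfulTransactionCount() { return successfulTransactionCount; }
    }

    /** Reads its value from the source on demand instead of being incremented directly. */
    static final class DerivedCounter<T> {
        private final T source;
        private final ToDoubleFunction<T> valueFunction;

        DerivedCounter(T source, ToDoubleFunction<T> valueFunction) {
            this.source = source;
            this.valueFunction = valueFunction;
        }

        double count() {
            return valueFunction.applyAsDouble(source);
        }
    }

    public static void main(String[] args) {
        TransactionStats stats = new TransactionStats();
        // Same lambda shape as the fragment: failures = total - successful.
        DerivedCounter<TransactionStats> failures = new DerivedCounter<>(stats,
                s -> s.getTransactionCount() - s.getSuccessfulTransactionCount());

        stats.record(true);
        stats.record(false);
        stats.record(false);
        System.out.println(failures.count()); // 2.0
    }
}
```

Because the value is computed when it is read, the instrumented code stays untouched; the trade-off is that the function must only ever return a non-decreasing value for the result to behave like a counter.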