2022-07-11 SpringOne Tour - Observability

Observability: Beyond the three pillars with Spring
Jonatan Ivanov, 2022-04-27
Copyright © 2022 VMware, Inc. or its affiliates.

About Me
- @jonatan_ivanov
- develotters.com
- Seattle Java User Group
- Spring Team @ VMware
- Micrometer
- Spring Cloud Sleuth
- “Spring Observability”

Disclaimer
This presentation may contain product features or functionality that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined. The information in this presentation is for informational purposes only and may not be incorporated into any contract. There is no commitment or obligation to deliver any items presented herein.

Agenda
- What is Observability?
- Why do we need it?
- “The Three Pillars” (with examples)
  - Logging
  - Metrics
  - Distributed Tracing
- How to implement it with Spring?
- “Non-conventional” Observability
- Q&A

What is Observability? Why do we need it?

What is Observability?
“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” … “A system is said to be observable if [...] the current state can be estimated using only the information from outputs.” (Wikipedia)

What is Observability?
How well we can understand the internals of a system based on its outputs (providing meaningful information about what happens inside)

What is Observability?
- Being able to ask arbitrary questions without knowing ahead of time what you want to ask
- Turning data points and context into insights
- Being able to quickly troubleshoot problems with no prior knowledge (unknown unknowns)

Why do we need Observability?
- Today's systems are insanely complex (cloud)
- (Death Star Architecture, Big Ball of Mud)

Why do we need Observability?
Complexity (cloud): LAMP stack vs. Cloud Environments
- We need to face unknown unknowns
- We might not know where our apps are
- We might not know how many instances we have (or what versions)
- We can’t modify/debug/etc. them
- Something is always broken (Fallacies of Distributed Computing)
- Like sending rovers to Mars: You can’t touch/modify them after launch

Why do we need Observability?
- Chaos
  - Environments can be chaotic
  - You turn a knob here a little and services are going down there
- Unknown Unknowns
  - We can’t know everything, we need to deal with unknown unknowns
  - “This should be impossible!”, “That will never happen!”
- Relativity
  - The same thing can be perceived differently by different observers
  - Everything is broken for the users but the server side seems ok

Why do we need Observability?
Continuous Improvement: if you want to improve something, you need to be able to measure it first
- How many resources do you utilize (cpu, ram, io, etc.)?
- What are your throughput/latency (max.) patterns?
- How frequently do you deploy?
- How long does it take for the code to go live?
- How long does it take to troubleshoot an issue or recover from an outage?
- How often are you paged?

Why do we need Observability?
Opens the door for advanced capabilities
- Chaos Engineering
- Anomaly Detection
- Feature flags
- A/B Testing
- Auto-tuning
- Adaptive Apps

“The Three Pillars” (The most popular approach)

Logging - Metrics - Distributed Tracing
- Logging: What happened?
  - Emitting events
  - Easy to read (grep)
  - INFO/WARN/ERROR/…
  - Stacktraces
- Metrics: What is the context?
  - Measure-and-Combine data
  - Aggregatable
  - Can identify trends
  - Not traffic-sensitive (usually)
- Distributed Tracing: Why did it happen?
  - Recording events
  - With causal ordering
  - Can identify cause across apps
  - Context Propagation (later)

Example: Latency
- Logging
  - “Processing a request took 140ms.”
  - Is it bad? Is it good? What is the context?
- Metrics
  - “99.999% of the requests were faster than 140ms.”
  - “The max was 150ms.”
  - So it’s quite bad. But why was this slow?
- Distributed Tracing
  - “Service A called Service B.”
  - “Service B called the DB.”
  - “The services were ok.”
  - “The network was ok.”
  - “The DB was slow.”
  - “Because somebody requested a lot of data.”
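
The metric statements in the latency example can be derived from raw samples. The following is a minimal sketch, assuming a nearest-rank percentile over an in-memory array; the class name, method names, and sample values are hypothetical, and real metrics libraries typically aggregate into histograms instead of keeping every sample.

```java
import java.util.Arrays;

final class LatencySummary {

    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double percentile) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static long max(long[] samplesMs) {
        return Arrays.stream(samplesMs).max().orElseThrow();
    }

    public static void main(String[] args) {
        // Hypothetical request durations in milliseconds.
        long[] samplesMs = {20, 25, 30, 35, 40, 45, 50, 60, 140, 150};
        System.out.println("p90 = " + percentile(samplesMs, 90) + "ms");
        System.out.println("max = " + max(samplesMs) + "ms");
    }
}
```

A single log line ("took 140ms") only becomes meaningful next to numbers like these, which is the point the slide makes about context.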

Example: Error
- Logging
  - “Request processing failed.”
  - “Here’s the stacktrace.”
  - Is it bad? (Well, it failed.) How bad? How many of them failed? What is the context?
- Metrics
  - “The error rate is 0.001/sec.”
  - “We had 2 errors recently.”
  - So it’s not that bad. But why did this happen?
- Distributed Tracing
  - “Service A called Service B.”
  - “Service B called the DB.”
  - “The services were ok.”
  - “The network was ok.”
  - “The DB call failed.”
  - “Because of invalid input.”

Logging

Logging 101 - Types of logs
- Application logs: classic DEBUG/INFO/WARN/ERROR events (+stacktraces)
- Payload logs: Raw request and response pairs
- GC logs: GC events (JEP 271 - Unified GC Logging)
- Access logs: Logs from the underlying HTTP server (e.g.: Tomcat)
  - Who called our service and when
  - What request (HTTP method, headers, path, query)
  - Response status, processing time, payload sizes etc.
- (audit logs, metrics in logs, trace logs)

Logging with Spring: SLF4J + Logback
SLF4J with Logback comes pre-configured but you can replace Logback
- SLF4J - Simple Logging Façade for Java
  - Simple API for various logging libraries
  - Allows you to plug in the desired logging library
- Logback
  - Modern logging library
  - Natively implements the SLF4J API
- If you want Log4j2 instead of Logback:
  - exclude spring-boot-starter-logging
  - add spring-boot-starter-log4j2

Logging with Spring: Payload, Access, GC
- Payload logs: Logbook + logbook-spring-boot-starter (auto-configured)
- Access logs:
  - server.tomcat.accesslog.enabled=true
  - server.tomcat.basedir=logs
  - server.tomcat.accesslog.pattern=...
  - server.jetty.accesslog.enabled=true
  - server.undertow.accesslog.enabled=true
  - + logback-access (if you want to use Logback, needs to be configured)
- GC logs: JVM args

Metrics

Metrics 101
- Time series data: data that changes over time
- Trends, context, anomaly detection, visualization, alerting
- Various Backends
  - Publishing: Client Pushes vs. Server Polls
  - Dimensionality: Dimensional vs. Hierarchical
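
The "Dimensional vs. Hierarchical" distinction is mostly about where the dimensions of a measurement live. The sketch below, with hypothetical class, method, and metric names, renders the same counter both ways: flattened into a dotted name, and as a stable name with key-value tags. It is only an illustration of the naming idea and does not reproduce the exact format of any particular backend.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class MetricNaming {

    // Hierarchical: every dimension value is flattened into the name itself.
    static String hierarchical(String name, Map<String, String> tags) {
        String suffix = new TreeMap<>(tags).values().stream()
                .collect(Collectors.joining("."));
        return suffix.isEmpty() ? name : name + "." + suffix;
    }

    // Dimensional: the name stays stable, dimensions travel as key-value tags.
    static String dimensional(String name, Map<String, String> tags) {
        String labels = new TreeMap<>(tags).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + labels + "}";
    }

    public static void main(String[] args) {
        Map<String, String> tags = Map.of("method", "GET", "status", "200");
        System.out.println(hierarchical("http.requests", tags));
        System.out.println(dimensional("http.requests", tags));
    }
}
```

With tags kept separate, a backend can aggregate across one dimension (for example all statuses of GET requests) without parsing names.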

Metrics with Spring: Micrometer
- Popular Metrics library on the JVM
- Like SLF4J, but for metrics
- Simple API
- Supports the most popular metric backends
- Comes with spring-boot-actuator
- Spring projects are instrumented using Micrometer
- A lot of third-party libraries use Micrometer

Micrometer - Like SLF4J, but for metrics
AppOptics, Atlas, Azure Monitor, CloudWatch (AWS), Datadog, Dynatrace, Elastic, Ganglia, Graphite, Humio, InfluxDB, JMX, KairosDB, New Relic, OpenTSDB, Prometheus, SignalFx, Stackdriver (GCP), StatsD, Wavefront* (VMware)
(/actuator/metrics)
*VMware Tanzu Observability by Wavefront

Distributed Tracing

Distributed Tracing 101

Distributed Tracing 101 - Correlation
[Diagram: TraceId: 123, shown three times]

Distributed Tracing 101 - Span and Trace
[Diagram: spans A, B, C, D, E, F; TraceId: 123]

Distributed Tracing 101 - Span and Trace
Span (basic unit of work)
- SpanId, ParentSpanId, TraceId
- Timestamps (start/stop)
- Events (annotations) with timestamps
- Tags (key-value pairs)
- ProcessId
- Local IP, Remote IP
+ Log correlation (and context propagation)
+ Visualization
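
The span fields listed above are enough to rebuild the call tree of a trace. The following is a minimal sketch, assuming a simplified span with only identifiers and start/stop timestamps; the class names and the three sample spans are hypothetical and do not reflect the data model of any specific tracing library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TraceTree {

    // A span reduced to its identifiers and timestamps.
    static final class Span {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null for the root span
        final long startMs;
        final long stopMs;

        Span(String traceId, String spanId, String parentSpanId, long startMs, long stopMs) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.startMs = startMs;
            this.stopMs = stopMs;
        }

        long durationMs() {
            return stopMs - startMs;
        }
    }

    // Groups the spans of one trace by their parent span id.
    static Map<String, List<Span>> childrenByParent(List<Span> spans) {
        Map<String, List<Span>> children = new LinkedHashMap<>();
        for (Span span : spans) {
            if (span.parentSpanId != null) {
                children.computeIfAbsent(span.parentSpanId, k -> new ArrayList<>()).add(span);
            }
        }
        return children;
    }

    // Renders a span and its descendants, indented by depth.
    static String render(Span span, Map<String, List<Span>> children, int depth) {
        StringBuilder out = new StringBuilder("  ".repeat(depth))
                .append(span.spanId).append(" (").append(span.durationMs()).append("ms)\n");
        for (Span child : children.getOrDefault(span.spanId, List.of())) {
            out.append(render(child, children, depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("123", "A", null, 0, 100),
                new Span("123", "B", "A", 10, 90),
                new Span("123", "C", "B", 20, 40));
        System.out.print(render(spans.get(0), childrenByParent(spans), 0));
    }
}
```

This is roughly what a tracing UI does when it visualizes a trace: the shared TraceId selects the spans, and ParentSpanId gives the causal ordering.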

Distributed Tracing with Spring: Spring Cloud Sleuth
- Distributed Tracing Support for Spring
- Provides an abstraction layer on top of tracing libraries (3.x)
  - Brave (OpenZipkin), default
  - OpenTelemetry (CNCF), experimental
- Log Correlation + Context Propagation
- Instrumentation for Spring Projects (and your application)
- Instrumentation for third-party libraries (through Brave and OTel)
- Supports various backends (through Brave and OTel)
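
Context propagation boils down to copying the trace context into outgoing requests and reading it back on the receiving side. The sketch below uses the B3 header names from OpenZipkin (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId) with a plain map standing in for HTTP headers. The class and method names are hypothetical, and it assumes the receiver always starts a child span; it is not how Sleuth or Brave implement propagation internally.

```java
import java.util.HashMap;
import java.util.Map;

final class B3Propagation {

    static final String TRACE_ID = "X-B3-TraceId";
    static final String SPAN_ID = "X-B3-SpanId";
    static final String PARENT_SPAN_ID = "X-B3-ParentSpanId";

    // Caller side: copy the current trace context into the outgoing headers.
    static Map<String, String> inject(String traceId, String spanId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_ID, traceId);
        headers.put(SPAN_ID, spanId);
        return headers;
    }

    // Receiver side: keep the trace id, make the caller's span the parent,
    // and continue with a new span id (assumption: always a child span).
    static Map<String, String> childContext(Map<String, String> incoming, String newSpanId) {
        Map<String, String> context = new HashMap<>();
        context.put(TRACE_ID, incoming.get(TRACE_ID));
        context.put(PARENT_SPAN_ID, incoming.get(SPAN_ID));
        context.put(SPAN_ID, newSpanId);
        return context;
    }

    public static void main(String[] args) {
        Map<String, String> outgoing = inject("123", "a");
        Map<String, String> serverSide = childContext(outgoing, "b");
        System.out.println(serverSide);
    }
}
```

Log correlation follows from the same context: once the trace and span ids are available on the receiving thread, they can be added to every log line.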

All-In-One: Observation API (Micrometer.next)
    Observation observation = Observation.start("test", registry);
    try {
        // TODO: scope
        Thread.sleep(1000);
    }
    catch (Exception exception) {
        observation.error(exception);
        throw exception;
    }
    finally {
        // TODO: attach tags
        observation.stop();
    }

    observation.observeChecked(() -> Thread.sleep(1000));

“Non-conventional” Observability

“Non-conventional” Observability
Is there anything else beyond Logging + Metrics + Tracing?
We are looking for:
- outputs (that provide)
- meaningful information
- about what’s inside of our system

Spring Boot Actuator
auditevents, beans, caches, conditions, configprops, env, flyway, health (k8s probes), heap/thread dump, httptrace, info, integrationgraph, jolokia, logfile, loggers, liquibase, metrics, traces, mappings, prometheus, quartz, scheduledtasks, sessions, shutdown, startup

Health Endpoint
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "H2",
        "validationQuery": "isValid()"
      }
    },
    [...]
  }
}
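
The top-level "status" of the health response is derived from the statuses of its components. The following is a minimal sketch of that aggregation, assuming the severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, where the most severe component status wins. The class and enum here are hypothetical stand-ins, not Spring Boot types, and the real aggregation is configurable.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

final class HealthAggregation {

    // Declared from most to least severe, so the lowest ordinal wins.
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    // Returns the most severe status, or UNKNOWN when there are no components.
    static Status aggregate(Collection<Status> componentStatuses) {
        return componentStatuses.stream()
                .min(Comparator.naturalOrder())
                .orElse(Status.UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(aggregate(List.of(Status.UP, Status.UP)));
        System.out.println(aggregate(List.of(Status.UP, Status.DOWN)));
    }
}
```

Under this rule a single failing component (for example the database) is enough to flip the whole service to DOWN, which is what probes and load balancers usually react to.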

Health Endpoint
{
  "status": "UP",
  "components": {
    [...]
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 1000240963584,
        "free": 764043239424,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    }
  }
}

Health Endpoint
{
  "status": "UP",
  "components": {
    [...]
    "tealeafService": {
      "status": "UP",
      "details": {
        "components": { ... }
      }
    },
    "waterService": {
      "status": "UP",
      "details": {
        "components": { ... }
      }
    }
  }
}

"git": { "branch": "main", "commit": { "id": "96c9ebe", "time": "2022-04-07T19:19:19Z"
} }, "build": { "artifact": "tea-service", "name": "tea-service", "time": "2022-04-07T19:19:35.153Z", "version": "96c9ebe.1649359173515", // 1.2.3 "group": "org.example.teahouse" } Info Endpoint

"java": { "vendor": "Eclipse Adoptium", "version": "18", "runtime": { "name":
"OpenJDK Runtime Environment", "version": "18+36" }, "jvm": { "name": "OpenJDK 64-Bit Server VM", "vendor": "Eclipse Adoptium", "version": "18+36" } }, "environment": { "activeProfiles": [ "local" ] } Info Endpoint

"memory": { "total": 268435456, "max": 268435456, "free": 149509024 }, "cpu":
{ "availableProcessors": 16 }, "gcs": [ { "name": "G1 Young Generation", "memoryPoolNames": [ ... ] }, { "name": "G1 Old Generation", "memoryPoolNames": [ ... ] } ] Info Endpoint

"user": { "timezone": "UTC", "country": "US", "language": "en", "dir": "~/GitHub/teahouse/tea-service"
}, "os": { "arch": "x86_64", "name": "Mac OS X", "version": "latest :)" }, "network": { "host": "my-hostname", "ip": "192.168.0.100" }, "startTime": "2022-04-07T19:19:36.898Z", "uptime": "PT15M31.094729S", "heartbeat": "2022-04-07T19:35:07.992731Z" Info Endpoint

Info Endpoint
- How to contact the dev team, where is the repo of the project?
- Cloud
  - instanceId and type
  - image version
  - region, account, cloud provider
- TLS Certificate Chain
  - subject, issuer
  - validity (expiration date) -> health check?
  - signature algorithm
- You can create your own endpoint
  - Dependencies used at runtime; Dependency lock files
  - /whoami: username + roles
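
The "validity (expiration date) -> health check?" idea can be sketched as a small pure function. The example below assumes the expiration instant has already been read from the certificate (in real code it could come from X509Certificate.getNotAfter().toInstant()); the class, enum, and the 30-day warning window are hypothetical choices for illustration.

```java
import java.time.Duration;
import java.time.Instant;

final class CertificateExpiryCheck {

    enum Validity { VALID, EXPIRING_SOON, EXPIRED }

    // Classifies a certificate by how close `now` is to its expiration.
    static Validity check(Instant notAfter, Instant now, Duration warningWindow) {
        if (!now.isBefore(notAfter)) {
            return Validity.EXPIRED;
        }
        if (!now.plus(warningWindow).isBefore(notAfter)) {
            return Validity.EXPIRING_SOON;
        }
        return Validity.VALID;
    }

    public static void main(String[] args) {
        Instant notAfter = Instant.parse("2022-12-31T00:00:00Z");
        Instant now = Instant.parse("2022-12-20T00:00:00Z");
        System.out.println(check(notAfter, now, Duration.ofDays(30)));
    }
}
```

A health indicator could map EXPIRED to a failing status and EXPIRING_SOON to a warning, so an expiring certificate shows up before it causes an outage.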

Service Registry/Discoverability
- How many service instances do we have (by environment)?
- What versions are deployed (by environment)?
- Where are they?
  - host/ip, port
  - instanceId, region, account, cloud provider, etc.
- Service starts/stops (deployments, restarts)?

Eureka

Tanzu Application Live View

API Discoverability
How can I call this service?
- Spring REST Docs
  - Generates docs from tests and hand-written docs
- Spring Cloud Contract + Pact Broker
  - Consumer Driven Contracts (test client-server contract)
  - You know when you break your clients
- Swagger / OpenAPI + ReDoc
  - API spec, docs
  - API browser + client
- Spring HATEOAS + HAL Explorer
  - Add links to your resources (other resources or operations)
  - API browser + client

Spring HATEOAS
{
  "id": "6b55663a",
  [...]
  "_links": {
    "self": { "href": "/tealeaves/6b55663a" },
    "search": { "href": "/tealeaves/search/findByName?name=sencha" },
    "collection": { "href": "/tealeaves" }
  }
}

Spring HATEOAS
{
  "_embedded": {
    "tealeaves": [...]
  },
  "_links": {
    "first": { "href": "/tealeaves?page=0&size=5" },
    "prev": { "href": "/tealeaves?page=0&size=5" },
    "self": { "href": "/tealeaves?page=1&size=5" },
    "next": { "href": "/tealeaves?page=2&size=5" },
    "last": { "href": "/tealeaves?page=2&size=5" }
  },
  "page": {
    "size": 5,
    "totalElements": 15,
    "totalPages": 3,
    "number": 1
  }
}

Questions/Feedback?
- Contact me on Twitter: @jonatan_ivanov
- Visit my blog: develotters.com
- Try it on your own: github.com/jonatan-ivanov/teahouse
© 2022 Spring. A VMware-backed project.
