2022-07-11 SpringOne Tour - Observability

Slide 1

Slide 1 text

Slide 2

Slide 2 text

About Me - @jonatan_ivanov - develotters.com - Seattle Java User Group - Spring Team @ VMware - Micrometer - Spring Cloud Sleuth - “Spring Observability”

Slide 3

Slide 3 text

Disclaimer: This presentation may contain product features or functionality that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined. The information in this presentation is for informational purposes only and may not be incorporated into any contract. There is no commitment or obligation to deliver any items presented herein.

Slide 4

Slide 4 text

Agenda
- What is Observability?
- Why do we need it?
- “The Three Pillars” (with examples)
  - Logging
  - Metrics
  - Distributed Tracing
- How to implement it with Spring?
- “Non-conventional” Observability
- Q&A

Slide 5

Slide 5 text

What is Observability? Why do we need it?

Slide 6

Slide 6 text

What is Observability? “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” … “A system is said to be observable if [...] the current state can be estimated using only the information from outputs.” (Wikipedia)

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

What is Observability? How well we can understand the internals of a system based on its outputs (Providing meaningful information about what happens inside)

Slide 9

Slide 9 text

What is Observability?
- Being able to ask arbitrary questions without knowing ahead of time what you want to ask
- Turning data points and context into insights
- Being able to quickly troubleshoot problems with no prior knowledge (unknown unknowns)

Slide 10

Slide 10 text

Why do we need Observability? Today's systems are insanely complex (cloud) (Death Star Architecture, Big Ball of Mud)

Slide 11

Slide 11 text

Why do we need Observability?
Complexity (cloud): LAMP stack vs. Cloud Environments
- We need to face unknown unknowns
- We might not know where our apps are
- We might not know how many instances we have (or what versions)
- We can’t modify/debug/etc. them
- Something is always broken (Fallacies of Distributed Computing)
- Like sending rovers to Mars: You can’t touch/modify them after launch

Slide 12

Slide 12 text

Why do we need Observability?
Chaos
- Environments can be chaotic
- You turn a knob a little here and services go down over there
Unknown Unknowns
- We can’t know everything, we need to deal with unknown unknowns
- “This should be impossible!”, “That will never happen!”
Relativity
- The same thing can be perceived differently by different observers
- Everything is broken for the users but the server side seems ok

Slide 13

Slide 13 text

Why do we need Observability?
Continuous Improvement
If you want to improve something, you need to be able to measure it first
- How many resources do you utilize (cpu, ram, io, etc.)?
- What are your throughput/latency (max.) patterns?
- How frequently do you deploy?
- How long does it take for the code to go live?
- How long does it take to troubleshoot an issue or recover from an outage?
- How often are you paged?

Slide 14

Slide 14 text

Why do we need Observability?
Opens the door for advanced capabilities
- Chaos Engineering
- Anomaly Detection
- Feature flags
- A/B Testing
- Auto-tuning
- Adaptive Apps

Slide 15

Slide 15 text

“The Three Pillars” (The most popular approach)

Slide 16

Slide 16 text

Logging - Metrics - Distributed Tracing
Logging: What happened?
- Emitting events
- Easy to read (grep)
- INFO/WARN/ERROR/…
- Stacktraces
Metrics: What is the context?
- Measure-and-Combine data
- Aggregatable
- Can identify trends
- Not traffic-sensitive (usually)
Distributed Tracing: Why did it happen?
- Recording events
- With causal ordering
- Can identify cause across apps
- Context Propagation (later)

Slide 17

Slide 17 text

Example: Latency
Logging
- “Processing a request took 140ms.”
- Is it bad? Is it good? What is the context?
Metrics
- “99.999% of the requests were faster than 140ms.”
- “The max was 150ms.”
- So it’s quite bad. But why was this slow?
Distributed Tracing
- “Service A called Service B.”
- “Service B called the DB.”
- “The services were ok.”
- “The network was ok.”
- “The DB was slow.”
- “Because somebody requested a lot of data.”
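The metric statements in this example ("N% of the requests were faster than X", "the max was Y") are aggregates over many individual latencies. As a rough illustration only, here is a minimal Java sketch of how such numbers could be derived from raw samples using the nearest-rank percentile method. The class name, method names, and sample values are hypothetical and not taken from the talk; real metrics libraries typically use histograms or other approximations instead of sorting every sample.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Micrometer or any library on the slides.
class LatencySummary {

    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, double percentile) {
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Largest observed latency.
    static long max(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).max().orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        // Made-up sample data: most requests are fast, two are slow.
        long[] latencies = {20, 25, 30, 35, 40, 45, 50, 60, 140, 150};
        System.out.println("p50 = " + percentile(latencies, 50) + "ms");
        System.out.println("p90 = " + percentile(latencies, 90) + "ms");
        System.out.println("max = " + max(latencies) + "ms");
    }
}
```

With numbers like these, a single log line saying "took 140ms" gains context: it sits far above the median, which is what makes it worth tracing.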

Slide 18

Slide 18 text

Example: Error
Logging
- “Request processing failed.”
- “Here’s the stacktrace.”
- Is it bad? (Well, it failed.) How bad? How many of them failed? What is the context?
Metrics
- “The error rate is 0.001/sec.”
- “We had 2 errors recently.”
- So it’s not that bad. But why did this happen?
Distributed Tracing
- “Service A called Service B.”
- “Service B called the DB.”
- “The services were ok.”
- “The network was ok.”
- “The DB call failed.”
- “Because of invalid input.”

Slide 19

Slide 19 text

Logging

Slide 20

Slide 20 text

Logging 101 - Types of logs
- Application logs: classic DEBUG/INFO/WARN/ERROR events (+stacktraces)
- Payload logs: Raw request and response pairs
- GC logs: GC events (JEP 271 - Unified GC Logging)
- Access logs: Logs from the underlying HTTP server (e.g.: Tomcat)
  - Who called our service and when
  - What request (HTTP method, headers, path, query)
  - Response status, processing time, payload sizes etc.
(audit logs, metrics in logs, trace logs)

Slide 21

Slide 21 text

Logging with Spring: SLF4J + Logback
SLF4J with Logback comes pre-configured but you can replace Logback
SLF4J - Simple Logging Façade for Java
- Simple API for various logging libraries
- Lets you plug in the desired logging library
Logback
- Modern logging library
- Natively implements the SLF4J API
If you want Log4j2 instead of Logback:
- exclude spring-boot-starter-logging, add spring-boot-starter-log4j2
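For the Log4j2 swap, a Maven build might look roughly like the fragment below. This is a sketch assuming a Maven project that inherits dependency versions from the Spring Boot parent or BOM; it is not taken from the slides, and a Gradle build would express the same exclusion differently.

```xml
<!-- Sketch: remove the default Logback-based logging starter... -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<!-- ...and add the Log4j2 starter in its place. -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-log4j2</artifactId>
</dependency>
```

Application code that logs through the SLF4J API should not need to change, which is the point of the façade.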

Slide 22

Slide 22 text

Logging with Spring: Payload, Access, GC
Payload logs: Logbook + logbook-spring-boot-starter (auto-configured)
Access logs:
- server.tomcat.accesslog.enabled=true
- server.tomcat.basedir=logs
- server.tomcat.accesslog.pattern=...
- server.jetty.accesslog.enabled=true
- server.undertow.accesslog.enabled=true
- + logback-access (if you want to use Logback, needs to be configured)
GC logs: JVM args

Slide 23

Slide 23 text

Metrics

Slide 24

Slide 24 text

Metrics 101
- Time series data: data that changes over time
- Trends, context, anomaly detection, visualization, alerting
- Various Backends
- Publishing: Client Pushes vs. Server Polls
- Dimensionality: Dimensional vs. Hierarchical
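To make the "Dimensional vs. Hierarchical" distinction concrete, here is a small Java sketch that renders the same measurement identity both ways. The class, the method names, and the output formats are hypothetical and chosen only for illustration; actual backends and libraries define their own naming conventions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration; real metrics libraries model names and tags differently.
class MetricNaming {

    // Dimensional: one metric name plus key-value tags that a backend can filter or group by.
    static String dimensional(String name, Map<String, String> tags) {
        String joined = new TreeMap<String, String>(tags).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + joined + "}";
    }

    // Hierarchical: tag values (ordered by tag key) are flattened into one dotted name,
    // so the keys are lost and every combination becomes its own metric.
    static String hierarchical(String name, Map<String, String> tags) {
        String suffix = new TreeMap<String, String>(tags).values().stream()
                .collect(Collectors.joining("."));
        return suffix.isEmpty() ? name : name + "." + suffix;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new HashMap<>();
        tags.put("method", "GET");
        tags.put("status", "200");
        System.out.println(dimensional("http.requests", tags));
        System.out.println(hierarchical("http.requests", tags));
    }
}
```

The dimensional form keeps "method" and "status" queryable as separate dimensions, while the hierarchical form relies on the position within the name to carry that meaning.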

Slide 25

Slide 25 text

Metrics with Spring: Micrometer
- Popular Metrics library on the JVM
- Like SLF4J, but for metrics
- Simple API
- Supports the most popular metric backends
- Comes with spring-boot-actuator
- Spring projects are instrumented using Micrometer
- A lot of third-party libraries use Micrometer

Slide 26

Slide 26 text

Micrometer - Like SLF4J, but for metrics
AppOptics, Atlas, Azure Monitor, CloudWatch (AWS), Datadog, Dynatrace, Elastic, Ganglia, Graphite, Humio, InfluxDB, JMX, KairosDB, New Relic, OpenTSDB, Prometheus, SignalFx, Stackdriver (GCP), StatsD, Wavefront* (VMware), (/actuator/metrics)
*VMware Tanzu Observability by Wavefront

Slide 27

Slide 27 text

Distributed Tracing

Slide 28

Slide 28 text

Distributed Tracing 101

Slide 29

Slide 29 text

Distributed Tracing 101 - Correlation
[Diagram; label: TraceId: 123, shown three times]

Slide 30

Slide 30 text

Distributed Tracing 101 - Span and Trace
[Diagram; spans labeled A, B, C, D, E, F; TraceId: 123]

Slide 31

Slide 31 text

Distributed Tracing 101 - Span and Trace
Span (basic unit of work)
- SpanId, ParentSpanId, TraceId
- Timestamps (start/stop)
- Events (annotations) with timestamps
- Tags (key-value pairs)
- ProcessId
- Local IP, Remote IP
+ Log correlation (and context propagation)
+ Visualization
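The relationship between SpanId, ParentSpanId, and TraceId can be sketched in a few lines of Java. Everything below is a hypothetical, heavily simplified model written for illustration (including the toy counter used for ids); it is not how Brave, OpenTelemetry, or Spring Cloud Sleuth implement spans.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified span; real tracers have much richer models.
class SketchSpan {
    private static final AtomicLong NEXT_ID = new AtomicLong(1); // toy id source, not random

    final String traceId;       // shared by every span in the same trace
    final String spanId;        // unique per unit of work
    final String parentSpanId;  // null for the root span
    final String name;
    final Map<String, String> tags = new LinkedHashMap<>(); // key-value pairs
    final long startNanos = System.nanoTime();
    long stopNanos;

    private SketchSpan(String traceId, String parentSpanId, String name) {
        this.traceId = traceId;
        this.spanId = Long.toHexString(NEXT_ID.getAndIncrement());
        this.parentSpanId = parentSpanId;
        this.name = name;
    }

    // Starts a new trace: a root span with a fresh trace id and no parent.
    static SketchSpan newTrace(String name) {
        return new SketchSpan(Long.toHexString(NEXT_ID.getAndIncrement()), null, name);
    }

    // A child keeps the trace id and records this span as its parent.
    SketchSpan child(String name) {
        return new SketchSpan(this.traceId, this.spanId, name);
    }

    void stop() {
        stopNanos = System.nanoTime();
    }

    // Prefix a log line could carry so that logs can be correlated with the trace.
    String logPrefix() {
        return "[" + traceId + "," + spanId + "]";
    }

    public static void main(String[] args) {
        SketchSpan a = SketchSpan.newTrace("service-a");
        SketchSpan b = a.child("service-b");
        b.tags.put("db.system", "h2"); // hypothetical tag
        b.stop();
        a.stop();
        System.out.println(a.logPrefix() + " handled request");
        System.out.println(b.logPrefix() + " parent=" + b.parentSpanId);
    }
}
```

Context propagation then amounts to carrying the trace id and the current span id across process boundaries so the receiving side can create its child span.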

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

Distributed Tracing with Spring: Spring Cloud Sleuth
- Distributed Tracing Support for Spring
- Provides an abstraction layer on top of tracing libraries (3.x)
  - Brave (OpenZipkin), default
  - OpenTelemetry (CNCF), experimental
- Log Correlation + Context Propagation
- Instrumentation for Spring Projects (and your application)
- Instrumentation for third-party libraries (through Brave and OTel)
- Supports various backends (through Brave and OTel)

Slide 34

Slide 34 text

All-In-One: Observation API (Micrometer.next)

Observation observation = Observation.start("test", registry);
try {
    // TODO: scope
    Thread.sleep(1000);
}
catch (Exception exception) {
    observation.error(exception);
    throw exception;
}
finally {
    // TODO: attach tags
    observation.stop();
}

observation.observeChecked(() -> Thread.sleep(1000));

Slide 35

Slide 35 text

“Non-conventional” Observability

Slide 36

Slide 36 text

“Non-conventional” Observability
Is there anything else beyond Logging + Metrics + Tracing?
We are looking for:
- outputs (that provide)
- meaningful information
- about what’s inside of our system

Slide 37

Slide 37 text

Spring Boot Actuator
auditevents; beans; caches; conditions; configprops; env; flyway; health (k8s probes); heap/thread dump; httptrace; info; integrationgraph; jolokia; logfile; loggers; liquibase; metrics, traces; mappings; prometheus; quartz; scheduledtasks; sessions; shutdown; startup

Slide 38

Slide 38 text

Health Endpoint
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "H2",
        "validationQuery": "isValid()"
      }
    },
    [...]
  }
}

Slide 39

Slide 39 text

Health Endpoint
{
  "status": "UP",
  "components": {
    [...]
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 1000240963584,
        "free": 764043239424,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    }
  }
}

Slide 40

Slide 40 text

Health Endpoint
{
  "status": "UP",
  "components": {
    [...]
    "tealeafService": {
      "status": "UP",
      "details": {
        "components": { ... }
      }
    },
    "waterService": {
      "status": "UP",
      "details": {
        "components": { ... }
      }
    }
  }
}

Slide 41

Slide 41 text

"git": { "branch": "main", "commit": { "id": "96c9ebe", "time": "2022-04-07T19:19:19Z" } }, "build": { "artifact": "tea-service", "name": "tea-service", "time": "2022-04-07T19:19:35.153Z", "version": "96c9ebe.1649359173515", // 1.2.3 "group": "org.example.teahouse" } Info Endpoint

Slide 42

Slide 42 text

"java": { "vendor": "Eclipse Adoptium", "version": "18", "runtime": { "name": "OpenJDK Runtime Environment", "version": "18+36" }, "jvm": { "name": "OpenJDK 64-Bit Server VM", "vendor": "Eclipse Adoptium", "version": "18+36" } }, "environment": { "activeProfiles": [ "local" ] } Info Endpoint

Slide 43

Slide 43 text

"memory": { "total": 268435456, "max": 268435456, "free": 149509024 }, "cpu": { "availableProcessors": 16 }, "gcs": [ { "name": "G1 Young Generation", "memoryPoolNames": [ ... ] }, { "name": "G1 Old Generation", "memoryPoolNames": [ ... ] } ] Info Endpoint

Slide 44

Slide 44 text

"user": { "timezone": "UTC", "country": "US", "language": "en", "dir": "~/GitHub/teahouse/tea-service" }, "os": { "arch": "x86_64", "name": "Mac OS X", "version": "latest :)" }, "network": { "host": "my-hostname", "ip": "192.168.0.100" }, "startTime": "2022-04-07T19:19:36.898Z", "uptime": "PT15M31.094729S", "heartbeat": "2022-04-07T19:35:07.992731Z" Info Endpoint

Slide 45

Slide 45 text

Info Endpoint
- How to contact the dev team, where is the repo of the project?
- Cloud
  - instanceId and type
  - image version
  - region, account, cloud provider
- TLS Certificate Chain
  - subject, issuer
  - validity (expiration date) -> health check?
  - signature algorithm
You can create your own endpoint
- Dependencies used at runtime; Dependency lock files
- /whoami: username + roles
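The "validity (expiration date) -> health check?" idea could be backed by a check along the following lines. This is only a sketch of the decision logic under stated assumptions: the class, the enum, and the warning-window parameter are hypothetical, and wiring it into an actual Spring Boot health indicator is left out. In a real application the expiry instant could come from `X509Certificate.getNotAfter().toInstant()`.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical decision logic for a certificate-expiry health check.
class CertificateExpiryCheck {

    enum Status { UP, DOWN }

    // Reports DOWN once the certificate has expired or will expire
    // within the given warning window; UP otherwise.
    static Status check(Instant notAfter, Instant now, Duration warningWindow) {
        Duration remaining = Duration.between(now, notAfter);
        return remaining.compareTo(warningWindow) > 0 ? Status.UP : Status.DOWN;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration window = Duration.ofDays(14); // assumed threshold, pick what fits your rotation process
        System.out.println(check(now.plus(Duration.ofDays(30)), now, window));
        System.out.println(check(now.plus(Duration.ofDays(7)), now, window));
    }
}
```

Passing `now` in as a parameter instead of reading the clock inside the method keeps the check easy to test.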

Slide 46

Slide 46 text

Service Registry/Discoverability
- How many service instances do we have (by environment)?
- What versions are deployed (by environment)?
- Where are they?
  - host/ip, port
  - instanceId, region, account, cloud provider, etc.
- Service starts/stops (deployments, restarts)?

Slide 47

Slide 47 text

Eureka

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Tanzu Application Live View

Slide 51

Slide 51 text

Tanzu Application Live View

Slide 52

Slide 52 text

API Discoverability
How can I call this service?
Spring REST Docs
- Generates docs from tests and hand-written docs
Spring Cloud Contract + Pact Broker
- Consumer Driven Contracts (test client-server contract)
- You know when you break your clients
Swagger / OpenAPI + ReDoc
- API spec, docs
- API browser + client
Spring HATEOAS + HAL Explorer
- Add links to your resources (other resources or operations)
- API browser + client

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Spring HATEOAS
{
  "id": "6b55663a",
  [...]
  "_links": {
    "self": { "href": "/tealeaves/6b55663a" },
    "search": { "href": "/tealeaves/search/findByName?name=sencha" },
    "collection": { "href": "/tealeaves" }
  }
}

Slide 55

Slide 55 text

Spring HATEOAS
{
  "_embedded": {
    "tealeaves": [...]
  },
  "_links": {
    "first": { "href": "/tealeaves?page=0&size=5" },
    "prev": { "href": "/tealeaves?page=0&size=5" },
    "self": { "href": "/tealeaves?page=1&size=5" },
    "next": { "href": "/tealeaves?page=2&size=5" },
    "last": { "href": "/tealeaves?page=2&size=5" }
  },
  "page": {
    "size": 5,
    "totalElements": 15,
    "totalPages": 3,
    "number": 1
  }
}
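The navigation links in a paged response like this follow directly from the page metadata (size, totalElements, number). The Java sketch below shows that arithmetic with a hypothetical helper class; it is not the Spring HATEOAS API, which assembles such links for you. Leaving out "prev" on the first page and "next" on the last page is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that derives paging links from page metadata (zero-based page numbers).
class PageLinks {

    static Map<String, String> links(String basePath, int number, int size, long totalElements) {
        int totalPages = (int) ((totalElements + size - 1) / size); // ceiling division
        int last = Math.max(totalPages - 1, 0);

        Map<String, String> links = new LinkedHashMap<>();
        links.put("first", href(basePath, 0, size));
        if (number > 0) {
            links.put("prev", href(basePath, number - 1, size));
        }
        links.put("self", href(basePath, number, size));
        if (number < last) {
            links.put("next", href(basePath, number + 1, size));
        }
        links.put("last", href(basePath, last, size));
        return links;
    }

    private static String href(String basePath, int page, int size) {
        return basePath + "?page=" + page + "&size=" + size;
    }

    public static void main(String[] args) {
        // Same metadata as the slide: page 1 of 3, 5 items per page, 15 items in total.
        links("/tealeaves", 1, 5, 15).forEach((rel, href) -> System.out.println(rel + " -> " + href));
    }
}
```

A client that follows these links never has to compute page numbers itself, which is the discoverability benefit the slide is pointing at.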