2022-04-14 Devnexus - Observability

Jonatan Ivanov

May 02, 2022

Transcript

  1. Observability - Beyond the three pillars with Spring

    Jonatan Ivanov, 2022-04-14
    Copyright © 2022 VMware, Inc. or its affiliates.
  2. About Me

    - @jonatan_ivanov
    - develotters.com
    - Seattle Java User Group
    - Spring Team @ VMware
    - Micrometer
    - Spring Cloud Sleuth
    - “Spring Observability”
  3. Disclaimer

    This presentation may contain product features or functionality that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new features/functionality/technology discussed or presented, have not been determined. The information in this presentation is for informational purposes only and may not be incorporated into any contract. There is no commitment or obligation to deliver any items presented herein.
  4. Agenda

    - What is Observability?
    - Why do we need it?
    - “The Three Pillars” (with examples)
      - Logging
      - Metrics
      - Distributed Tracing
    - How to implement it with Spring?
    - “Non-conventional” Observability
    - Q&A
  5. What is Observability?

    “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” … “A system is said to be observable if [...] the current state can be estimated using only the information from outputs.” (Wikipedia)
  6. What is Observability?

    How well we can understand the internals of a system based on its outputs
    (Providing meaningful information about what happens inside)
  7. What is Observability?

    - Being able to ask arbitrary questions without knowing ahead of time what you want to ask
    - Turning data points and context into insights
    - Being able to quickly troubleshoot problems with no prior knowledge (unknown unknowns)
  8. Why do we need Observability?

    Today's systems are insanely complex (cloud)
    (Death Star Architecture, Big Ball of Mud)
  9. Why do we need Observability?

    Complexity (cloud): LAMP stack vs. Cloud Environments
    - We need to face unknown unknowns
    - We might not know where our apps are
    - We might not know how many instances we have (or what versions)
    - We can’t modify/debug/etc. it
    - Something is always broken (Fallacies of Distributed Computing)
    - Like sending rovers to Mars: You can’t touch/modify them after launch
  10. Why do we need Observability?

    Chaos
    - Environments can be chaotic
    - You turn a knob here a little and services are going down there

    Unknown Unknowns
    - We can’t know everything, we need to deal with unknown unknowns
    - “This should be impossible!”, “That will never happen!”

    Relativity
    - The same thing can be perceived differently by different observers
    - Everything is broken for the users but the server side seems ok
  11. Why do we need Observability?

    Continuous Improvement
    If you want to improve something, you need to be able to measure it first
    - How many resources do you utilize (cpu, ram, io, etc.)?
    - What are your throughput/latency (max.) patterns?
    - How frequently do you deploy?
    - How long does it take for the code to go live?
    - How long does it take to troubleshoot an issue or recover from an outage?
    - How often are you paged?
  12. Why do we need Observability?

    Opens the door for advanced capabilities
    - Chaos Engineering
    - Anomaly Detection
    - Feature flags
    - A/B Testing
    - Auto-tuning
    - Adaptive Apps
  13. Logging - Metrics - Distributed Tracing

    Logging: What happened?
    - Emitting events
    - Easy to read (grep)
    - INFO/WARN/ERROR/…
    - Stacktraces

    Metrics: What is the context?
    - Measure-and-Combine data
    - Aggregatable
    - Can identify trends
    - Not traffic-sensitive (usually)

    Distributed Tracing: Why did it happen?
    - Recording events
    - With causal ordering
    - Can identify cause across apps
    - Context Propagation (later)
  14. Example: Latency

    Logging
    - “Processing a request took 140ms.”
    - Is it bad? Is it good? What is the context?

    Metrics
    - “99.999% of the requests were faster than 140ms.”
    - “The max was 150ms.”
    - So it’s quite bad. But why was this slow?

    Distributed Tracing
    - “Service A called Service B.”
    - “Service B called the DB.”
    - “The services were ok.”
    - “The network was ok.”
    - “The DB was slow.”
    - “Because somebody requested a lot of data.”
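The percentile and max figures quoted in the latency example can be derived from raw samples. The following is a minimal sketch using the nearest-rank method; the `LatencySummary` class and the sample values are made up for illustration, and real metrics libraries usually approximate percentiles with histograms rather than sorting every sample.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Micrometer or any other library.
final class LatencySummary {

    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, double percentile) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    // The slowest request seen; a single outlier is enough to move this value.
    static long max(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).max().orElseThrow();
    }

    public static void main(String[] args) {
        // Made-up request durations in milliseconds.
        long[] samples = {12, 15, 11, 140, 13, 14, 150, 12, 16, 13};
        System.out.println("p50=" + percentile(samples, 50) + "ms");
        System.out.println("p90=" + percentile(samples, 90) + "ms");
        System.out.println("max=" + max(samples) + "ms");
    }
}
```

Comparing a single logged duration against these aggregates is what gives it context: the same 140ms is unremarkable if it is the median and alarming if it sits near the max.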
  15. Example: Error

    Logging
    - “Request processing failed.”
    - “Here’s the stacktrace.”
    - Is it bad? (Well, it failed.) How bad? How many of them failed? What is the context?

    Metrics
    - “The error rate is 0.001/sec.”
    - “We had 2 errors recently.”
    - So it’s not that bad. But why did this happen?

    Distributed Tracing
    - “Service A called Service B.”
    - “Service B called the DB.”
    - “The services were ok.”
    - “The network was ok.”
    - “The DB call failed.”
    - “Because of invalid input.”
  16. Logging 101 - Types of logs

    - Application logs: classic DEBUG/INFO/WARN/ERROR events (+stacktraces)
    - Payload logs: Raw request and response pairs
    - GC logs: GC events (JEP 271 - Unified GC Logging)
    - Access logs: Logs from the underlying HTTP server (e.g.: Tomcat)
      - Who called our service and when
      - What request (HTTP method, headers, path, query)
      - Response status, processing time, payload sizes etc.
    - (audit logs, metrics in logs, trace logs)
  17. Logging 101 - Application Logs (SLF4J)

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Lombok: @Slf4j
    Logger LOGGER = LoggerFactory.getLogger(Car.class);

    LOGGER.warn("Low Battery!");
    LOGGER.debug("Starting search for charging stations...");
    LOGGER.trace("Sending search request...");
    LOGGER.trace("Receiving search response...");
    LOGGER.info("Located nearest charging station.");
    LOGGER.debug("Exiting gas station search.");
    LOGGER.error("Nooooooo...", exception);

    // Example output line:
    2022-04-14 15:45:00.500 WARN org.example.Car: Low Battery!
  18. Logging 101 - Access Logs (Tomcat)

    2022-04-14 15:45:00.500 method="GET" url="/tea/sencha" query="?size=xxxl" protocol="HTTP/1.1" statusCode="500" duration="46" remoteIp="0:0:0:0:0:0:0:1" localIp="192.168.0.100" localPort="8090" userAgent="Mozilla/5.0 …"
  19. Logging 101 - Payload Logs

    "origin": "local",
    "type": "request",
    "correlation": "93f7e21e34e0564c",
    "protocol": "HTTP/1.1",
    "remote": "localhost",
    "method": "GET",
    "uri": "http://localhost:8091/waters/search/findBySize?size=xxxl",
    "host": "localhost",
    "path": "/waters/search/findBySize",
    "scheme": "http",
    "port": "8091",
    "headers": {
      "accept": [ "application/json" ],
      "host": [ "localhost:8091" ],
      "user-agent": [ "okhttp/4.x" ]
    }
  20. Logging 101 - Payload Logs

    "origin": "remote",
    "type": "response",
    "correlation": "93f7e21e34e0564c",
    "duration": 17,
    "protocol": "HTTP/1.1",
    "status": 404,
    "headers": {
      "content-type": [ "application/json" ]
    },
    "body": {
      "status": 404,
      "error": "Not Found",
      "path": "/waters/search/findBySize"
    }
  21. Logging 101 - GC Logs

    [15:45:00.688][0.384s][start ] Pause Young (Normal) (G1 Evacuation Pause)
    [15:45:00.688][0.384s][task ] Using 6 workers of 13 for evacuation
    [15:45:00.692][0.388s][phases] Pre Evacuate Collection Set: 0.1ms
    [15:45:00.692][0.388s][phases] Merge Heap Roots: 0.0ms
    [15:45:00.692][0.388s][phases] Evacuate Collection Set: 3.1ms
    [15:45:00.692][0.388s][phases] Post Evacuate Collection Set: 0.4ms
    [15:45:00.692][0.388s][phases] Other: 0.6ms
    [15:45:00.692][0.388s][heap ] Eden regions: 23->0(26)
    [15:45:00.692][0.388s][heap ] Survivor regions: 0->3(3)
    [15:45:00.692][0.388s][heap ] Old regions: 0->5
    [15:45:00.692][0.388s][heap ] Archive regions: 2->2
    [15:45:00.692][0.388s][heap ] Humongous regions: 2->1
    [15:45:00.692][0.388s][metasp] Metaspace: 2692K(2816K)->2692K(2816K) NonClass: 2369K(2432K)->2369K(2432K) Class: 322K(384K)->322K(384K)
    [15:45:00.692][0.388s][gc ] Pause Young (Normal) (G1 Evacuation Pause) 25M->9M(256M) 4.701ms
    [15:45:00.692][0.388s][cpu ] User=0.01s Sys=0.01s Real=0.01s
  22. Logging with Spring: SLF4J + Logback

    SLF4J with Logback comes pre-configured but you can replace Logback

    SLF4J
    - Simple Logging Façade for Java
    - Simple API for various logging libraries
    - Allows you to plug in the desired logging library

    Logback
    - Modern logging library
    - Natively implements the SLF4J API

    If you want Log4j2 instead of Logback:
    - exclude spring-boot-starter-logging
    - add spring-boot-starter-log4j2
  23. Logging with Spring: Payload, Access, GC

    - Payload logs: Logbook + logbook-spring-boot-starter (auto-configured)
    - Access logs:
      server.tomcat.accesslog.enabled=true
      server.tomcat.basedir=logs
      server.tomcat.accesslog.pattern=...
      server.jetty.accesslog.enabled=true
      server.undertow.accesslog.enabled=true
      + logback-access (if you want to use Logback, needs to be configured)
    - GC logs: JVM args
  24. Metrics 101

    - Time series data: data that changes over time
    - Trends, context, anomaly detection, visualization, alerting
    - Various Backends
      - Publishing: Client Pushes vs. Server Polls
      - Dimensionality: Dimensional vs. Hierarchical
  25. Metrics 101 - Meters

    - Counter: Rate/Frequency change of an event
      - Rate of incoming HTTP requests over time
      - Rates of cache hits/misses
    - Gauge: Current value of something (async)
      - CPU temperature over time
      - Queue size over time
    - Timer/DistributionSummary: Composite count, sum, max, histogram, percentiles
      - Duration of HTTP requests over time (latency)
      - Distributions of bytes sent/received
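To make the difference between the three meter kinds concrete, here is a minimal plain-Java sketch. `MeterSketch` and its nested classes are simplified stand-ins written for this illustration; they are not Micrometer's classes and leave out tags, histograms and percentiles.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.DoubleSupplier;

// Simplified stand-ins for the three meter kinds; these are NOT Micrometer's classes.
final class MeterSketch {

    // Counter: only ever goes up; a rate is derived from two readings.
    static final class Counter {
        private final AtomicLong count = new AtomicLong();
        void increment() { count.incrementAndGet(); }
        long count() { return count.get(); }
    }

    // Rate of change between two counter readings taken intervalSeconds apart.
    static double ratePerSecond(long earlierCount, long laterCount, double intervalSeconds) {
        return (laterCount - earlierCount) / intervalSeconds;
    }

    // Gauge: samples the current value on demand instead of being updated.
    static final class Gauge {
        private final DoubleSupplier source;
        Gauge(DoubleSupplier source) { this.source = source; }
        double value() { return source.getAsDouble(); }
    }

    // Timer: a composite of count, total and max over the recorded durations.
    static final class Timer {
        private long count;
        private long totalMillis;
        private long maxMillis;

        synchronized void record(long millis) {
            count++;
            totalMillis += millis;
            maxMillis = Math.max(maxMillis, millis);
        }
        synchronized long count() { return count; }
        synchronized long totalMillis() { return totalMillis; }
        synchronized long maxMillis() { return maxMillis; }
    }

    public static void main(String[] args) {
        Counter requests = new Counter();
        for (int i = 0; i < 30; i++) {
            requests.increment();
        }
        System.out.println("rate=" + ratePerSecond(0, requests.count(), 10.0) + "/s");

        Queue<String> queue = new ArrayDeque<>();
        Gauge queueSize = new Gauge(() -> queue.size());
        queue.add("job");
        System.out.println("queue size=" + queueSize.value());

        Timer latency = new Timer();
        latency.record(17);
        latency.record(46);
        System.out.println("count=" + latency.count() + " max=" + latency.maxMillis() + "ms");
    }
}
```

The gauge takes a supplier rather than a value because it is read when the metrics are published, not when the underlying state changes.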
  26. Metrics with Spring: Micrometer

    - Popular Metrics library on the JVM
    - Like SLF4J, but for metrics
    - Simple API
    - Supports the most popular metric backends
    - Comes with spring-boot-actuator
    - Spring projects are instrumented using Micrometer
    - A lot of third-party libraries use Micrometer
  27. Micrometer - Like SLF4J, but for metrics

    AppOptics, Atlas, Azure Monitor, CloudWatch (AWS), Datadog, Dynatrace, Elastic, Ganglia, Graphite, Humio, InfluxDB, JMX, KairosDB, New Relic, OpenTSDB, Prometheus, SignalFx, Stackdriver (GCP), StatsD, Wavefront* (VMware)
    (/actuator/metrics)

    *VMware Tanzu Observability by Wavefront
  28. Distributed Tracing 101 - Span and Trace

    Span (basic unit of work)
    - SpanId, ParentSpanId, TraceId
    - Timestamps (start/stop)
    - Events (annotations) with timestamps
    - Tags (key-value pairs)
    - ProcessId
    - Local IP, Remote IP

    + Log correlation (and context propagation)
    + Visualization
  29. Distributed Tracing 101 - Log Correlation

    2022-04-14 15:45:00.500 [1,2] INFO Request received…
    2022-04-14 15:45:00.506 [1,2] DEBUG Request is valid
    2022-04-14 15:45:00.508 [1,2] INFO Sending a request…
    2022-04-14 15:45:00.530 [1,2] INFO Response received
    2022-04-14 15:45:00.532 [1,2] INFO Sending response

    2022-04-14 15:45:00.510 [1,3] INFO Request received…
    2022-04-14 15:45:00.516 [1,3] DEBUG Request is valid
    2022-04-14 15:45:00.518 [1,3] INFO Calling the DB…
    2022-04-14 15:45:00.524 [1,3] INFO ResultSet received
    2022-04-14 15:45:00.528 [1,3] INFO Sending response
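Log correlation pays off once the lines of several services land in one place: the IDs let you stitch a single request back together. The sketch below assumes the bracketed token in the sample lines is `[traceId,spanId]` and that every line starts with a fixed-width timestamp; `LogCorrelation` and its methods are hypothetical names for this illustration, not part of Spring Cloud Sleuth.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Spring Cloud Sleuth.
final class LogCorrelation {

    // Assumption: the bracketed token is "[traceId,spanId]".
    private static final Pattern IDS = Pattern.compile("\\[(\\w+),(\\w+)\\]");

    // All lines of one trace in time order. Lines start with a fixed-width
    // timestamp, so sorting the strings also sorts them by time.
    static List<String> linesOfTrace(List<String> lines, String traceId) {
        return lines.stream()
                .filter(line -> {
                    Matcher matcher = IDS.matcher(line);
                    return matcher.find() && matcher.group(1).equals(traceId);
                })
                .sorted()
                .collect(Collectors.toList());
    }

    // Lines grouped by span ID, i.e. by the unit of work that emitted them.
    static Map<String, List<String>> groupBySpan(List<String> lines) {
        Map<String, List<String>> bySpan = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = IDS.matcher(line);
            if (matcher.find()) {
                bySpan.computeIfAbsent(matcher.group(2), key -> new ArrayList<>()).add(line);
            }
        }
        return bySpan;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "2022-04-14 15:45:00.530 [1,2] INFO Response received",
                "2022-04-14 15:45:00.510 [1,3] INFO Request received",
                "2022-04-14 15:45:00.500 [1,2] INFO Request received");
        linesOfTrace(lines, "1").forEach(System.out::println);
    }
}
```

Filtering by trace ID interleaves the lines of both services into one timeline, which is how the call from the first service to the second becomes visible in plain logs.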
  30. Distributed Tracing with Spring: Spring Cloud Sleuth

    - Distributed Tracing Support for Spring
    - Provides an abstraction layer on top of tracing libraries (3.x)
      - Brave (OpenZipkin), default
      - OpenTelemetry (CNCF), experimental
    - Log Correlation + Context Propagation
    - Instrumentation for Spring Projects (and your application)
    - Instrumentation for third-party libraries (through Brave and OTel)
    - Supports various backends (through Brave and OTel)
  31. All-In-One: Observation API (Micrometer.next)

    Observation observation = Observation.start("test", registry);
    try {
        // TODO: scope
        Thread.sleep(1000);
    }
    catch (Exception exception) {
        observation.error(exception);
        throw exception;
    }
    finally {
        // TODO: attach tags
        observation.stop();
    }

    observation.observeChecked(() -> Thread.sleep(1000));
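The try/catch/finally shape of the snippet above is what a callback-style helper such as `observeChecked` saves you from repeating. As a rough illustration of that idea only, here is a plain-Java sketch that times a piece of work, remembers the error if one was thrown, and rethrows it. `ObservationSketch`, `Event` and `CheckedRunnable` are hypothetical names; this is not the Micrometer Observation API and it does not publish metrics or spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; NOT the Micrometer Observation API.
final class ObservationSketch {

    // Like Runnable, but allowed to throw checked exceptions (e.g. Thread.sleep).
    interface CheckedRunnable {
        void run() throws Exception;
    }

    // One finished observation: its name, its duration and the error, if any.
    static final class Event {
        final String name;
        final long durationNanos;
        final Throwable error; // null when the work completed normally

        Event(String name, long durationNanos, Throwable error) {
            this.name = name;
            this.durationNanos = durationNanos;
            this.error = error;
        }
    }

    private final List<Event> events = new ArrayList<>();

    // Runs the work, always records an event, and rethrows any exception.
    void observeChecked(String name, CheckedRunnable work) throws Exception {
        long start = System.nanoTime();
        Throwable error = null;
        try {
            work.run();
        }
        catch (Exception exception) {
            error = exception;
            throw exception;
        }
        finally {
            events.add(new Event(name, System.nanoTime() - start, error));
        }
    }

    List<Event> events() {
        return events;
    }

    public static void main(String[] args) throws Exception {
        ObservationSketch sketch = new ObservationSketch();
        sketch.observeChecked("test", () -> Thread.sleep(10));
        Event event = sketch.events().get(0);
        System.out.println(event.name + " took " + event.durationNanos + "ns");
    }
}
```

In this sketch the recorded events are simply kept in a list; a real implementation would hand them to handlers that turn one observation into a timer sample, a span and a log entry.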
  32. “Non-conventional” Observability

    Is there anything else beyond Logging + Metrics + Tracing?

    We are looking for:
    - outputs (that provide)
    - meaningful information
    - about what’s inside of our system
  33. Spring Boot Actuator

    auditevents, beans, caches, conditions, configprops, env, flyway, health (k8s probes), heap/thread dump, httptrace, info, integrationgraph, jolokia, logfile, loggers, liquibase, metrics, traces, mappings, prometheus, quartz, scheduledtasks, sessions, shutdown, startup
  34. Health Endpoint

    {
      "status": "UP",
      "components": {
        "db": {
          "status": "UP",
          "details": { "database": "H2", "validationQuery": "isValid()" }
        },
        [...]
      }
    }
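The top-level "status" in the response above summarizes the statuses of the individual components. A minimal sketch of one possible aggregation rule follows, assuming only UP and DOWN exist and that a single DOWN component makes the whole application DOWN. `HealthSketch` and its members are hypothetical names; Spring Boot's actual aggregation knows more statuses and is configurable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch for illustration; not Spring Boot's health aggregation.
final class HealthSketch {

    // Assumption: only two statuses exist in this simplified model.
    enum Status { UP, DOWN }

    // Overall status is UP only if every component is UP.
    // With no components at all this returns UP.
    static Status aggregate(Map<String, Status> components) {
        boolean allUp = components.values().stream()
                .allMatch(status -> status == Status.UP);
        return allUp ? Status.UP : Status.DOWN;
    }

    public static void main(String[] args) {
        Map<String, Status> components = new LinkedHashMap<>();
        components.put("db", Status.UP);
        components.put("ping", Status.UP);
        System.out.println("overall=" + aggregate(components));

        components.put("db", Status.DOWN);
        System.out.println("overall=" + aggregate(components));
    }
}
```

This is also why a health endpoint works as a Kubernetes probe target: the orchestrator needs only the single aggregated answer, while humans can drill into the components.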
  35. Health Endpoint

    {
      "status": "UP",
      "components": {
        [...]
        "diskSpace": {
          "status": "UP",
          "details": { "total": 1000240963584, "free": 764043239424, "threshold": 10485760, "exists": true }
        },
        "ping": { "status": "UP" }
      }
    }
  36. Health Endpoint

    {
      "status": "UP",
      "components": {
        [...]
        "tealeafService": {
          "status": "UP",
          "details": { "components": { ... } }
        },
        "waterService": {
          "status": "UP",
          "details": { "components": { ... } }
        }
      }
    }
  37. Info Endpoint

    "git": {
      "branch": "main",
      "commit": { "id": "96c9ebe", "time": "2022-04-07T19:19:19Z" }
    },
    "build": {
      "artifact": "tea-service",
      "name": "tea-service",
      "time": "2022-04-07T19:19:35.153Z",
      "version": "96c9ebe.1649359173515", // 1.2.3
      "group": "org.example.teahouse"
    }
  38. Info Endpoint

    "java": {
      "vendor": "Eclipse Adoptium",
      "version": "18",
      "runtime": { "name": "OpenJDK Runtime Environment", "version": "18+36" },
      "jvm": { "name": "OpenJDK 64-Bit Server VM", "vendor": "Eclipse Adoptium", "version": "18+36" }
    },
    "environment": {
      "activeProfiles": [ "local" ]
    }
  39. Info Endpoint

    "memory": { "total": 268435456, "max": 268435456, "free": 149509024 },
    "cpu": { "availableProcessors": 16 },
    "gcs": [
      { "name": "G1 Young Generation", "memoryPoolNames": [ ... ] },
      { "name": "G1 Old Generation", "memoryPoolNames": [ ... ] }
    ]
  40. Info Endpoint

    "user": {
      "timezone": "UTC",
      "country": "US",
      "language": "en",
      "dir": "~/GitHub/teahouse/tea-service"
    },
    "os": { "arch": "x86_64", "name": "Mac OS X", "version": "latest :)" },
    "network": { "host": "my-hostname", "ip": "192.168.0.100" },
    "startTime": "2022-04-07T19:19:36.898Z",
    "uptime": "PT15M31.094729S",
    "heartbeat": "2022-04-07T19:35:07.992731Z"
  41. Info Endpoint

    - How to contact the dev team, where is the repo of the project?
    - Cloud
      - instanceId and type
      - image version
      - region, account, cloud provider
    - TLS Certificate Chain
      - subject, issuer
      - validity (expiration date) -> health check?
      - signature algorithm
    - You can create your own endpoint
      - Dependencies used at runtime; Dependency lock files
      - /whoami: username + roles
  42. Service Registry/Discoverability

    - How many service instances do we have (by environment)?
    - What versions are deployed (by environment)?
    - Where are they?
      - host/ip, port
      - instanceId, region, account, cloud provider, etc.
    - Service starts/stops (deployments, restarts)?
  43. API Discoverability

    How can I call this service?
    - Spring REST Docs
      - Generates docs from tests and hand-written docs
    - Spring Cloud Contract + Pact Broker
      - Consumer Driven Contracts (test client-server contract)
      - You know when you break your clients
    - Swagger / OpenAPI + ReDoc
      - API spec, docs
      - API browser + client
    - Spring HATEOAS + HAL Explorer
      - Add links to your resources (other resources or operations)
      - API browser + client
  44. Spring HATEOAS

    {
      "id": "6b55663a",
      [...]
      "_links": {
        "self": { "href": "/tealeaves/6b55663a" },
        "search": { "href": "/tealeaves/search/findByName?name=sencha" },
        "collection": { "href": "/tealeaves" }
      }
    }
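Links like these are computed by the server rather than hard-coded by the client. As a sketch of the same idea for a paged collection, the navigation links (first, prev, self, next, last) could be derived from the page number, the page size and the total number of elements. `PageLinks` is a hypothetical helper written for this illustration; it is not the Spring HATEOAS API, which builds such links for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the Spring HATEOAS API.
final class PageLinks {

    // Number of pages needed for totalElements items (rounding up).
    static int totalPages(long totalElements, int size) {
        return (int) ((totalElements + size - 1) / size);
    }

    // Navigation links for a zero-based page number, in a stable order.
    // "prev" and "next" are only added when such a page exists.
    static Map<String, String> links(String basePath, int number, int size, long totalElements) {
        int last = Math.max(totalPages(totalElements, size) - 1, 0);
        Map<String, String> links = new LinkedHashMap<>();
        links.put("first", href(basePath, 0, size));
        if (number > 0) {
            links.put("prev", href(basePath, number - 1, size));
        }
        links.put("self", href(basePath, number, size));
        if (number < last) {
            links.put("next", href(basePath, number + 1, size));
        }
        links.put("last", href(basePath, last, size));
        return links;
    }

    private static String href(String basePath, int page, int size) {
        return basePath + "?page=" + page + "&size=" + size;
    }

    public static void main(String[] args) {
        // 15 elements in pages of 5, currently on the middle page.
        links("/tealeaves", 1, 5, 15)
                .forEach((rel, link) -> System.out.println(rel + " -> " + link));
    }
}
```

Because the client only follows the relations it is given, it never has to know how page numbers are encoded in the URL, which is the discoverability benefit this part of the talk is about.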
  45. Spring HATEOAS

    {
      "_embedded": { "tealeaves": [...] },
      "_links": {
        "first": { "href": "/tealeaves?page=0&size=5" },
        "prev": { "href": "/tealeaves?page=0&size=5" },
        "self": { "href": "/tealeaves?page=1&size=5" },
        "next": { "href": "/tealeaves?page=2&size=5" },
        "last": { "href": "/tealeaves?page=2&size=5" }
      },
      "page": { "size": 5, "totalElements": 15, "totalPages": 3, "number": 1 }
    }
  46. Questions/Feedback?

    - Contact me on Twitter: @jonatan_ivanov
    - Visit my blog: develotters.com
    - Try it on your own: github.com/jonatan-ivanov/teahouse

    © 2022 Spring. A VMware-backed project.