Slide 1

Slide 1 text

Observability in Dynamic and Distributed Environments
Alexander Schwartz, Principal IT Consultant
OOP Munich, 2019-01-24

Slide 2

Slide 2 text

Observability in Dynamic and Distributed Environments
© msg | January 2019 | Observability in Dynamic and Distributed Environments | Alexander Schwartz

1 Aspects of Observability
2 Health Status Information
3 Alerts and Insights with Metrics
4 Hunting Root Causes and Latencies with Traces
5 Logs as a Stream of Events
6 Putting it all together

Slide 3

Slide 3 text

About me – Principal IT Consultant @ msg Travel & Logistics
• 15+ years Java
• 7 years PL/SQL
• 7 years online loans
• 3.5 years online banking
• 7 years IT consulting
• 600+ geocaches
@ahus1de

Slide 4

Slide 4 text

Section 1: Aspects of Observability

Slide 5

Slide 5 text

Aspects of Observability
Distributed and dynamic environments
• Services scale up and down to offer the capacity needed
• Networks between services can (and will) fail unexpectedly and are sometimes slow
• Topologies change
• Different administrators for different services and infrastructure elements
1. https://de.wikipedia.org/wiki/Fallacies_of_Distributed_Computing

Slide 6

Slide 6 text

Aspects of Observability
Overview Observability

Monitoring tells me something is broken (symptom). It's the source of alerting if an immediate action by a human is required.
Observability is everything else I need to find out why something is not working.

• Known Knowns: situation known at time of development
• Known Unknowns: possible situation, exact parameters unclear
• Unknown Unknowns: unexpected situation, analysis based on context information
• Unknown Knowns: situations that were known to third-party developers

Slide 7

Slide 7 text

Aspects of Observability
Using the information for Monitoring and Root Cause Analysis

[Figure: sources of information placed between "Monitoring & Alerting" and "Analysis of root causes", along the categories Known Knowns, Known Unknowns, Unknown Unknowns and Unknown Knowns]
• Status Information (Health Check)
• Logs (Events) of categories ERROR/WARN
• Tracing (dependencies, error causes, latencies)
• Metrics (counters, gauges, error rates, execution times)
• Logs (Events) of categories INFO/DEBUG

Slide 8

Slide 8 text

Section 2: Health Status Information

Slide 9

Slide 9 text

Health and other Status Information
Intention for Status Information
• Deliver information before requests are processed by the system (to find misconfigurations early in distributed systems)
• Actionable for the receiver of the information
• Observe situations that can't be remedied by the service itself
• Category of "known knowns" (when implemented by the team or a well-known third-party library) or "unknown knowns" (if it is a surprise feature in a third-party library)

Examples:
Status check | Actions to remedy
Database reachable | Check database, check application configuration, check network
Connectivity check | Check configuration, check network
Circuit breaker open | Check network, check if remote service is too slow, check for more requests than usual

Slide 10

Slide 10 text

Health and other Status Information
Spring Boot Health Information

Type | Question | Action
Endpoint to deliver a global health status | Is the application degraded or in an error state? | Depending on the failed sub-system a specific action is required.

Observation:
• Service health depends on external services that might still be starting/recovering
• Restarts don't help to remedy the situation
• Although the application is degraded, the application might have fallbacks
• Some Spring Boot health checks perform remote network calls and might take an unacceptably long time to complete if network problems occur

Slide 11

Slide 11 text

Health and other Status Information
Kubernetes Liveness and Readiness Probes

Type | Question | Action
Readiness Probe (when starting) | Has the pod started correctly? | After a timeout a non-ready pod will be killed and restarted. Once it is ready it will be added to the load balancer.
Readiness Probe (when running) | Can the pod serve requests? | Enable or disable the pod in the load balancer.
Liveness Probe | Is the pod in a deterministic state? | Kill and restart the pod when this state persists.

Observation 1: Using the same probes for different actions in different situations is a bad idea.
Observation 2: Spring Boot Health Information doesn't match the Kubernetes readiness/liveness concept; therefore implement different endpoints that answer the specific questions Kubernetes asks.

Slide 12

Slide 12 text

Health and other Status Information
Best practices for Health Indicators

When collecting health indications:
• Implement health indicators for all remote systems and configurations
• Make the health indicators complete in a predetermined time (short timeouts)
• Archive the status indicator of each subsystem in your metrics/monitoring system

When triggering automatic actions:
• Ask specific questions
• Provide the answer as an aggregation of one or more health indicators
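
The two groups of practices above can be sketched with plain JDK classes. This is only an illustration under assumptions, not the Spring Boot or Kubernetes API: the class `HealthSketch`, the indicator names `database` and `remoteService`, and the 100 ms budget are all hypothetical. Each indicator gets a fixed time budget, and a readiness answer is aggregated from just the indicators that matter for that specific question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class HealthSketch {

    /** Runs one indicator; a timeout or any failure counts as "down". */
    static boolean checkWithTimeout(ExecutorService pool, Callable<Boolean> indicator, long timeoutMillis) {
        Future<Boolean> future = pool.submit(indicator);
        try {
            return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
            return false;
        } catch (Exception e) {
            return false; // timeout, or the indicator itself threw
        } finally {
            future.cancel(true); // stop a check that is still running
        }
    }

    /** Evaluates the indicators one after another, each within the given time budget. */
    static Map<String, Boolean> collect(Map<String, Callable<Boolean>> indicators, long timeoutMillis) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            Map<String, Boolean> status = new LinkedHashMap<>();
            for (Map.Entry<String, Callable<Boolean>> e : indicators.entrySet()) {
                status.put(e.getKey(), checkWithTimeout(pool, e.getValue(), timeoutMillis));
            }
            return status;
        } finally {
            pool.shutdownNow();
        }
    }

    /** Specific question "can this instance serve requests?": only the named indicators count. */
    static boolean ready(Map<String, Boolean> status, String... required) {
        for (String name : required) {
            if (!Boolean.TRUE.equals(status.get(name))) {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical setup: a fast database check and a remote check that hangs. */
    static Map<String, Boolean> demo() {
        Map<String, Callable<Boolean>> indicators = new LinkedHashMap<>();
        indicators.put("database", () -> true);
        indicators.put("remoteService", () -> { Thread.sleep(5_000); return true; });
        return collect(indicators, 100);
    }

    public static void main(String[] args) {
        Map<String, Boolean> status = demo();
        System.out.println("status = " + status);
        System.out.println("ready (database only) = " + ready(status, "database"));
    }
}
```

The per-indicator results in `status` are what would be archived as metrics, while `ready(...)` is the aggregated answer to one specific question. A liveness answer would typically be built from a different, smaller set of indicators.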

Slide 13

Slide 13 text

Section 3: Alerts and Insights with Metrics

Slide 14

Slide 14 text

Insights with Metrics
Micrometer

Micrometer [maɪˈkrɒm.ɪ.tər] is a facade (API) to collect metrics independent of the storage of your metrics ("SLF4J, but for metrics").
• Multidimensional metrics
• Integrations for libraries and backends (Prometheus, Datadog, Ganglia, Graphite, JMX, New Relic, …)
• Ready to use with Spring Boot 1.x and 2.x
• Can be used stand-alone and together with other frameworks
• Version 1.0 released at the same time as Spring Boot 2.0
• Version 1.1 released at the same time as Spring Boot 2.1

Homepage: https://micrometer.io/
License: Apache 2.0

Slide 15

Slide 15 text

Insights with Metrics
Micrometer API

Base for all metrics: name; optional: tags and description

Some metric types:
• Counter: e.g. number of successful and unsuccessful calls
• Gauge: e.g. number of currently active database connections
• Timer: e.g. duration of calls

    import io.micrometer.core.instrument.Counter;

    // "registry" is the application's MeterRegistry
    Counter myOperationSuccess = Counter
        .builder("myOperation")
        .description("a description for humans")
        .tags("result", "success")
        .register(registry);

    // count one successful call
    myOperationSuccess.increment();

Slide 16

Slide 16 text

Insights with Metrics
Micrometer API

Derived metrics:
• Rate: e.g. calls per second
• Percentile: e.g. 90% of all calls faster than X ms
• Histogram: e.g. X calls in the interval of 50 ms to 100 ms

Histograms can be aggregated over instances, percentiles can't!
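
A small worked example may help to see why histograms aggregate and percentiles don't. The sketch below uses plain Java rather than Micrometer; the class name, the bucket bounds and the sample data (one instance with 990 fast calls, one with 10 slow calls) are assumptions made for illustration. Bucket counts of two instances are simply added, and the percentile is then read from the merged histogram.

```java
import java.util.Arrays;

final class HistogramMergeSketch {

    // Upper bucket bounds in milliseconds; the last bucket catches everything slower.
    static final long[] BOUNDS_MS = {50, 100, 200, 500, Long.MAX_VALUE};

    /** Counts how many durations fall into each bucket (non-cumulative counts). */
    static long[] histogram(long[] durationsMs) {
        long[] counts = new long[BOUNDS_MS.length];
        for (long d : durationsMs) {
            for (int i = 0; i < BOUNDS_MS.length; i++) {
                if (d <= BOUNDS_MS[i]) {
                    counts[i]++;
                    break;
                }
            }
        }
        return counts;
    }

    /** Aggregating two instances means adding the counts bucket by bucket. */
    static long[] merge(long[] a, long[] b) {
        long[] sum = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }

    /** Upper bound of the bucket that contains the given percentile (1..100). */
    static long percentileUpperBound(long[] counts, int percent) {
        long total = Arrays.stream(counts).sum();
        long rank = (percent * total + 99) / 100; // integer ceiling of percent% of total
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return BOUNDS_MS[i];
            }
        }
        return BOUNDS_MS[BOUNDS_MS.length - 1];
    }

    /** Helper for the demo data: the same duration repeated n times. */
    static long[] repeat(long valueMs, int times) {
        long[] values = new long[times];
        Arrays.fill(values, valueMs);
        return values;
    }

    public static void main(String[] args) {
        long[] instanceA = histogram(repeat(40, 990));  // busy instance, fast calls
        long[] instanceB = histogram(repeat(400, 10));  // idle instance, slow calls
        System.out.println("p90 A      <= " + percentileUpperBound(instanceA, 90) + " ms");
        System.out.println("p90 B      <= " + percentileUpperBound(instanceB, 90) + " ms");
        System.out.println("p90 merged <= " + percentileUpperBound(merge(instanceA, instanceB), 90) + " ms");
    }
}
```

With this data the per-instance 90th percentiles fall into the 50 ms and the 500 ms bucket. Averaging them would suggest roughly 275 ms, although 99% of all calls were faster than 50 ms. The merged histogram gives the correct bucket because it keeps the call counts.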

Slide 17

Slide 17 text

Insights with Metrics
Micrometer Architecture

[Figure: Micrometer architecture; arrows show the direction of data flow. Components shown: Application, Your Code, 3rd Party Libraries, Adapter, Micrometer Core, Meter Registry, Metrics Backend]

Slide 18

Slide 18 text

Insights with Metrics
Example Infrastructure

[Figure: example infrastructure; arrows show the direction of data flow. Components shown: several Applications, Prometheus (metrics backend), Grafana, Alert Manager]

Slide 19

Slide 19 text

Insights with Metrics
Prometheus for distributed environments

Responsibilities:
• Collect and store metrics
• Handle queries for dashboards
• Alarms, trend calculations

Well suited for:
• Discovering services via service discovery in dynamic environments
• Adding metadata to collected metrics (stage, datacenter, cluster, node)
• Blending infrastructure metrics with application metrics including business metrics

Homepage: https://prometheus.io
License: Apache 2.0

Slide 20

Slide 20 text

Insights with Metrics
Using Metrics to build Monitoring and Alerting

• Plain metric
    http_server_requests_seconds_count
• Metric filtered by label
    http_server_requests_seconds_count{status!='200'}
• Rate of calls in 5-minute interval (moving average)
    rate(http_server_requests_seconds_count{status!='200'}[5m])
• Error percentage per URI used for alerting
    sum by (uri) (rate(http_server_requests_seconds_count{status!='200'}[5m]))
      / sum by (uri) (rate(http_server_requests_seconds_count[5m])) > 0.01
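
The arithmetic behind the last query can be followed with a simplified sketch for a single URI. It is not how Prometheus is implemented: the real `rate()` also handles counter resets and extrapolates at the window boundaries, which this sketch ignores. The class name and all sample values are made up for illustration.

```java
final class ErrorRatioSketch {

    /** Per-second increase between two samples of an ever-increasing counter. */
    static double rate(double earlier, double later, double windowSeconds) {
        return (later - earlier) / windowSeconds;
    }

    /** True when the share of failed calls exceeds the threshold (0.01 means 1%). */
    static boolean shouldAlert(double errorRate, double totalRate, double threshold) {
        return totalRate > 0 && errorRate / totalRate > threshold;
    }

    public static void main(String[] args) {
        double window = 300; // 5 minutes in seconds

        // Hypothetical counter samples at the start and at the end of the window.
        double totalRate = rate(12_000, 15_000, window); // 3000 calls in the window
        double errorRate = rate(40, 100, window);        // 60 of them with status != 200

        System.out.println("calls/s  = " + totalRate);
        System.out.println("errors/s = " + errorRate);
        System.out.println("alert    = " + shouldAlert(errorRate, totalRate, 0.01));
    }
}
```

In this example 60 of 3000 calls failed, which is 2% and therefore above the 1% threshold used in the query.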

Slide 21

Slide 21 text

Insights with Metrics
Best practices for Metrics
• Use the metrics your framework provides
• Set up the metrics environment as early as the testing stages
• When something is being logged, it is probably a good idea to create a metric for it
• Collecting metrics is more runtime-efficient than writing the same information to a log

Slide 22

Slide 22 text

Section 4: Hunting Root Causes and Latencies with Traces

Slide 23

Slide 23 text

Hunting Root Causes and Latencies with Traces
Analyzing errors and latencies

The server doesn't respond!
Sometimes it is so slow!

Slide 24

Slide 24 text

Hunting Root Causes and Latencies with Traces
A better way to explain why tail latency matters
Jaana B. Dogan (@rakyll)
1. https://twitter.com/rakyll/status/1045075510538035200

Slide 25

Slide 25 text

Hunting Root Causes and Latencies with Traces
Concept of Google's Dapper

[Figure: chain of service calls, annotated with]
• assign Trace-ID
• pass Trace-ID and additionally a Span-ID
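
The idea of assigning a Trace-ID once and passing it on together with a Span-ID can be sketched in a few lines of plain Java. The header names follow the B3 convention used by Zipkin; the class and method names are hypothetical, and the sketch simplifies what a real tracer such as Brave does (no sampling flags, no timing, header names treated as case-sensitive).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class B3PropagationSketch {

    static final String TRACE_ID = "X-B3-TraceId";
    static final String SPAN_ID = "X-B3-SpanId";
    static final String PARENT_SPAN_ID = "X-B3-ParentSpanId";

    /** Random 64-bit ID rendered as 16 lower-case hex characters. */
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    /**
     * Builds the tracing headers for an outgoing call from the headers of the
     * request that is currently being handled.
     */
    static Map<String, String> outgoingHeaders(Map<String, String> incoming) {
        Map<String, String> out = new LinkedHashMap<>();
        String traceId = incoming.get(TRACE_ID);
        if (traceId == null) {
            // First node of the call chain: assign a new Trace-ID.
            out.put(TRACE_ID, newId());
        } else {
            // Later nodes: keep the Trace-ID and remember the caller's span as parent.
            out.put(TRACE_ID, traceId);
            String parentSpan = incoming.get(SPAN_ID);
            if (parentSpan != null) {
                out.put(PARENT_SPAN_ID, parentSpan);
            }
        }
        // Every hop gets its own Span-ID.
        out.put(SPAN_ID, newId());
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> firstHop = outgoingHeaders(new LinkedHashMap<>());
        Map<String, String> secondHop = outgoingHeaders(firstHop);
        System.out.println("first hop:  " + firstHop);
        System.out.println("second hop: " + secondHop);
    }
}
```

Because every hop shares the Trace-ID and records its parent span, a server that collects the spans can rebuild the call tree of a single request.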

Slide 26

Slide 26 text

Hunting Root Causes and Latencies with Traces
Zipkin Server receives information and provides a Web UI

For each Trace-/Span-ID:
• start/end time on the client
• start/end time on the server
• tags and logs

Slide 27

Slide 27 text

Hunting Root Causes and Latencies with Traces
Web UI Zipkin

Slide 28

Slide 28 text

Hunting Root Causes and Latencies with Traces
Sampling Traces
• In production only a percentage of requests is traced to save data and performance. The percentage is enough to trace errors and latencies, but you won't be able to analyze all requests.
• The first node decides which calls are traced.
• You can force tracing, for example by adding a special header to the incoming request (don't allow that header to be sent from the public Internet, to prevent denial-of-service attacks).
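
A sampling decision at the first node could look roughly like the following sketch. The class, the method and its parameters are assumptions for illustration and not the API of Zipkin or Brave. Deriving the decision from the Trace-ID keeps it reproducible, and a forced trace is only honoured when the request comes from a trusted network.

```java
final class SamplingSketch {

    /**
     * Decides at the first node whether a request is traced; later nodes are
     * expected to follow this decision instead of deciding again.
     *
     * @param traceId            numeric Trace-ID of the request
     * @param sampleRate         share of requests to trace, e.g. 0.01 for 1%
     * @param forceHeaderPresent true if the caller sent the "force tracing" header
     * @param trustedSource      true if the request did not come from the public Internet
     */
    static boolean shouldTrace(long traceId, double sampleRate,
                               boolean forceHeaderPresent, boolean trustedSource) {
        if (forceHeaderPresent && trustedSource) {
            return true; // explicit request for a trace, e.g. from a developer's browser
        }
        // Map the Trace-ID to one of 10,000 slots and trace the first slots only.
        long slot = Math.floorMod(traceId, 10_000L);
        return slot < Math.round(sampleRate * 10_000);
    }

    public static void main(String[] args) {
        System.out.println(shouldTrace(5L, 0.01, false, false));
        System.out.println(shouldTrace(5_000L, 0.01, false, false));
        System.out.println(shouldTrace(5_000L, 0.01, true, true));
        System.out.println(shouldTrace(5_000L, 0.01, true, false));
    }
}
```

In B3 propagation the decision is usually passed downstream in the `X-B3-Sampled` header, so that all nodes either record or skip the same request.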

Slide 29

Slide 29 text

Hunting Root Causes and Latencies with Traces
Zipkin Browser-Plugin for Chrome
• Add trace-ID to HTTP headers at client
• Direct link to traces view in a new tab

Slide 30

Slide 30 text

Hunting Root Causes and Latencies with Traces
Best practices for Tracing
• Trace everything in development and testing
• Trace a small percentage or only on-demand in production
• Use tracing to investigate incidents and locate the relevant service ("unknown unknowns")
• Use the generated dependency graph to verify and optimize dependencies

Slide 31

Slide 31 text

Section 5: Logs as a Stream of Events

Slide 32

Slide 32 text

Logs as a Stream of Events
Welcome to Logging Hell!

Code:
    for (Invoice i : invoiceRespository.findAll()) {
        i.calculateTotal();
    }

Log:
    07:26:00.595 d.a.t.d.Invoice ERROR - can't load item ID 4711

Slide 33

Slide 33 text

Logs as a Stream of Events
Logs with Context Information

    import org.apache.logging.log4j.ThreadContext;

    for (Invoice i : respository.findAll()) {
        // every log entry written while this invoice is processed carries its ID
        ThreadContext.put(INVOICE_ID, Long.toString(i.getId()));
        try {
            i.calculateTotal();
        } finally {
            // always clean up, otherwise the ID leaks into unrelated log entries
            ThreadContext.remove(INVOICE_ID);
        }
    }

Slide 34

Slide 34 text

Logs as a Stream of Events
Configuring the Logging Pattern

Configuration: [not captured in the slide text]

Log output:
    08:39:42.969 {invoiceId=1} ... - can't load purchase ID 4711

Slide 35

Slide 35 text

Logs as a Stream of Events
Log output for Web Applications

    08:52:54.276 {http.method=GET, http.url=http://localhost:8080/api/startBillingRun, invoiceId=1, user=Theo Tester} ERROR d.a.t.d.Invoice - can't load purchase ID 4711

Further fields that might be of interest:
• Client IP address
• Part of the session ID
• Browser name (User Agent)

Slide 36

Slide 36 text

Logs as a Stream of Events
Adding correlation IDs to log entries

Additional Tracing HTTP Header:
    GET /api/callback HTTP/1.1
    Host: localhost:8080
    ...
    X-B3-SpanId: 34e628fc44c0cff1
    X-B3-TraceId: a72f03509a36daae
    ...

Additional Information in the Log:
    09:20:05.840 {X-B3-SpanId=34e628fc44c0cff1, X-B3-TraceId=a72f03509a36daae, ..., invoiceId=1} ERROR d.a.t.d.Invoice - can't load purchase ID 4711

Slide 37

Slide 37 text

Logs as a Stream of Events
Per-Request-Debugging-Logging
• Zipkin passes on the header X-B3-Flags, which can encode additional information
• The Zipkin Chrome plugin passes the value "1" (debug=true)
• A servlet filter can pick up this value and pass it to the Thread Context of Log4j2 (for example set X-B3-Flags-debug to true)
• A matching Log4j2 configuration activates the trace level for requests that carry this HTTP header

Slide 38

Slide 38 text

Logs as a Stream of Events
Aggregate logs in a central, searchable store
• Create a searchable store of logs (rotate after a fixed amount of logs).
• JSON log entries are the simplest format that can be parsed.
• All nodes ship their logs to a central store (asynchronously).
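
To show what such a JSON log entry could look like, here is a minimal sketch in plain Java. In practice a JSON layout of the logging framework would produce this; the class name, the field names `level`, `logger` and `message`, and the sample values are assumptions. A timestamp is left out to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JsonLogLineSketch {

    /** Escapes a value so it can be placed inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c)); // other control characters
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Renders one log event, including its context fields, as a single JSON line. */
    static String toJsonLine(String level, String logger, String message, Map<String, String> context) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("level", level);
        fields.put("logger", logger);
        fields.put("message", message);
        fields.putAll(context); // a context key with the same name would overwrite a field above

        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> context = new LinkedHashMap<>();
        context.put("X-B3-TraceId", "a72f03509a36daae");
        context.put("invoiceId", "1");
        System.out.println(toJsonLine("ERROR", "d.a.t.d.Invoice", "can't load purchase ID 4711", context));
    }
}
```

Because every context value becomes its own JSON field, the central store can index fields such as the trace ID or the invoice ID and search for them directly, without parsing a free-text pattern.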

Slide 39

Slide 39 text

Logs as a Stream of Events
Best practices for Logging
• Use context information
• Have a central searchable storage for your logs
• Use JSON as a log format
• Enable log level configuration on request level
• Include correlation IDs in the error messages shown to users so you can search the logs

Slide 40

Slide 40 text

Section 6: Putting it all together

Slide 41

Slide 41 text

Putting it all together
Status Information
• Deliver actionable information even when no requests are processed
• Essential to handle "known knowns" problems ahead of users reporting problems
• Implement health indicators for all remote systems and configurations
• Ask specific questions to trigger actions

Slide 42

Slide 42 text

Putting it all together
Metrics
• Look at response time percentiles and error rates to spot problems
• Essential to handle "known known" and "known unknown" problems ahead of users reporting problems
• Frameworks already offer standard metrics
• Adding a custom metric to an application sometimes takes only one line of code
• Aggregate metrics over nodes and environments and calculate historical trends
• Treat your status information as metrics to use the same metrics and alerting pipeline

Slide 43

Slide 43 text

Putting it all together
Tracing
• Use it to find the root causes of latencies and errors
• Essential to handle "unknown unknowns" during incident investigation
• Call trees can be drawn automatically from trace information
• It provides correlation IDs for your logs for free
• On top of tracing you can enable per-request debug logs

Slide 44

Slide 44 text

Putting it all together
Logging
• Use levels error/warn to signal problems ("known knowns", "unknown knowns")
• Use levels info/debug to analyze incidents
• Aggregate the logs with context, starting with the correlation ID and request metadata
• Make them searchable by logging in JSON format and aggregating all logs in a central store (Graylog or Kibana)

Slide 50

Slide 50 text

Putting it all together
Things that changed in distributed environments

Type | Static Nodes | Dynamic Multi-Node / Distributed
Correlation of events | Session ID or thread ID | Correlation ID passed between processes and hosts
Searching logs | Log files and manually configured log level | Central log server with searchable logs, dynamic log levels
Metrics and Monitoring | Manual monitoring configured by operations | Linked to Service Discovery, metrics added by application team autonomously
Status Information | Preflight check when starting the application | Used for automated restarts and scaling, configure load balancers, check remote connections
Tracing | Root cause usually within the edge node | Root cause for latency/fault usually not in edge node, tracing will point to the problem/dependency

Some of the previous best practices have become essential practices in a distributed environment.

Slide 51

Slide 51 text

Links
• Micrometer: https://micrometer.io
• Prometheus: https://prometheus.io
• Grafana: https://grafana.com
• Log4j: https://logging.apache.org/log4j
• Graylog: https://www.graylog.org/
• Zipkin, Brave: https://github.com/openzipkin
• Google SRE Book (Chapter "Monitoring Distributed Systems"): https://landing.google.com/sre/

Additional Slides:
• https://www.ahus1.de/post/micrometer
• https://www.ahus1.de/post/logging-and-tracing
• https://www.ahus1.de/post/prometheus-and-grafana

@ahus1de

Slide 52

Slide 52 text

Alexander Schwartz
Principal IT Consultant
+49 171 5625767
alexander.schwartz@msg.group
@ahus1de

msg systems ag (Headquarters)
Robert-Buerkle-Str. 1, 85737 Ismaning, Germany
www.msg.group