Observability in dynamic and distributed Microservice Environments

To identify and analyse problems, applications need to provide status and runtime information. Distributed and dynamic microservice environments need to standardize this to enable efficient operations.
This talk presents four areas that are part of observability: status information, logs, metrics, and traces.
It explains each concept and illustrates it with a practical technology example, for example Spring Boot Actuator for status information, Log4j for logs, and Micrometer and Prometheus for metrics.

Alexander Schwartz

January 24, 2019

Transcript

  1. Observability in Dynamic and Distributed Environments

    Alexander Schwartz, Principal IT Consultant
    OOP Munich, 2019-01-24
  2. Observability in Dynamic and Distributed Environments

    © msg | January 2019 | Observability in Dynamic and Distributed Environments | Alexander Schwartz

    1 Aspects of Observability
    2 Health Status Information
    3 Alerts and Insights with Metrics
    4 Hunting Root Causes and Latencies with Traces
    5 Logs as a Stream of Events
    6 Putting it all together
  3. About me: Principal IT Consultant @ msg Travel & Logistics

    • 15+ years Java
    • 7 years PL/SQL
    • 7 years online loans
    • 3.5 years online banking
    • 7 years IT consulting
    • 600+ geocaches
    • @ahus1de
  5. Aspects of Observability: Distributed and dynamic environments

    • Services scale up and down to offer the capacity needed
    • Networks between services can (and will) fail unexpectedly and are sometimes slow
    • Topologies change
    • Different administrators for different services and infrastructure elements

    1. https://de.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
  6. Aspects of Observability: Overview Observability

    Monitoring tells me something is broken (symptom). It is the source of alerting if an immediate action by a human is required.
    Observability is everything else I need to find out why something is not working.

    • Known knowns: situation known at time of development
    • Known unknowns: possible situation, exact parameters unclear
    • Unknown unknowns: unexpected situation, analysis based on context information
    • Unknown knowns: situations that were known to 3rd-party developers
  7. Aspects of Observability: Using the information for Monitoring and Root Cause Analysis

    Diagram relating the information sources to the categories (known knowns + unknown knowns, known unknowns, unknown unknowns) and to their use, from monitoring & alerting to analysis of root causes:
    • Status Information (Health Check)
    • Logs (Events) of categories ERROR/WARN
    • Metrics (counters, gauges, error rates, execution times)
    • Tracing (dependencies, error causes, latencies)
    • Logs (Events) of categories INFO/DEBUG
  9. Health and other Status Information: Intention for Status Information

    • Deliver information before requests are processed by the system (to find misconfigurations early in distributed systems)
    • Actionable for the receiver of the information
    • Observe situations that can't be remedied by the service itself
    • Category of "known knowns" (when implemented by the team or a well-known 3rd-party library) or "unknown knowns" (if it is a surprise feature in a 3rd-party library)

    Examples (status check: actions to remedy):
    • Database reachable: check database, check application configuration, check network
    • Connectivity check: check configuration, check network
    • Circuit breaker open: check network, check if remote service is too slow, check for more requests than usual
  10. Health and other Status Information: Spring Boot Health Information

    Type: Endpoint to deliver a global health status
    Question: Is the application degraded or in an error state?
    Action: Depending on the failed sub-system a specific action is required.

    Observation:
    • Service health depends on external services that might still be starting/recovering
    • Restarts don't help to remedy the situation
    • Although the application is degraded, the application might have fallbacks
    • Some Spring Boot health checks perform remote network calls and might take an unacceptably long time to complete if network problems occur
  11. Health and other Status Information: Kubernetes Liveness and Readiness Probes

    Type: Readiness Probe (when starting)
    Question: Has the pod started correctly?
    Action: After a timeout a non-ready pod will be killed and restarted. Once it is ready it will be added to the load balancer.

    Type: Readiness Probe (when running)
    Question: Can the pod serve requests?
    Action: Enable or disable the pod in the load balancer.

    Type: Liveness Probe
    Question: Is the pod in a deterministic state?
    Action: Kill and restart the pod when this state persists.

    Observation 1: Using the same probes for different actions in different situations is a bad idea
    Observation 2: Spring Boot Health Information doesn't match the Kubernetes readiness/liveness concept; therefore implement different endpoints to provide answers specific to Kubernetes' questions
  12. Health and other Status Information: Best practices for Health Indicators

    When collecting health indications:
    • Implement health indicators for all remote systems and configurations
    • Make the health indicators complete in a predetermined time (short timeouts)
    • Archive the status indicator of each subsystem in your metrics/monitoring system

    When triggering automatic actions:
    • Ask specific questions
    • Provide an answer that aggregates one or several health indicators
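
The practices on this slide (short timeouts, keeping each subsystem's status, answering one specific question by aggregation) can be sketched in plain Java. This is only an illustration under assumptions: the class `HealthAggregator`, its methods, and the indicator names are hypothetical and are not the Spring Boot Actuator API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, not part of Spring Boot or Kubernetes.
public class HealthAggregator {

    public enum Status { UP, DOWN }

    // Runs one indicator asynchronously and reports DOWN if it fails
    // or does not answer within the timeout. A timed-out check keeps
    // running in the background; only the waiting is bounded.
    public static Status checkWithTimeout(Supplier<Boolean> indicator, long timeoutMillis) {
        CompletableFuture<Boolean> check = CompletableFuture.supplyAsync(indicator);
        try {
            return Boolean.TRUE.equals(check.get(timeoutMillis, TimeUnit.MILLISECONDS))
                    ? Status.UP : Status.DOWN;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Status.DOWN;
        } catch (ExecutionException | TimeoutException e) {
            return Status.DOWN;
        }
    }

    // Evaluates the indicators one after another and keeps every individual
    // result, so each subsystem status could also be archived as a metric.
    public static Map<String, Status> checkAll(Map<String, Supplier<Boolean>> indicators,
                                               long timeoutMillis) {
        Map<String, Status> results = new LinkedHashMap<>();
        indicators.forEach((name, indicator) ->
                results.put(name, checkWithTimeout(indicator, timeoutMillis)));
        return results;
    }

    // Answers one specific question by aggregating only the indicators
    // that are relevant for that question.
    public static boolean isReady(Map<String, Status> results, String... required) {
        for (String name : required) {
            if (results.get(name) != Status.UP) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Supplier<Boolean>> indicators = new LinkedHashMap<>();
        indicators.put("database", () -> true);
        indicators.put("remoteService", () -> {
            // simulates a remote call that hangs longer than the timeout
            try {
                Thread.sleep(2_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return true;
        });
        Map<String, Status> results = checkAll(indicators, 200);
        System.out.println(results);
        System.out.println("ready: " + isReady(results, "database"));
    }
}
```

Because readiness here only asks about the database, a slow remote service degrades the overall picture without taking the instance out of the load balancer. Which indicators belong to which question is a design decision for each service.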
  14. Insights with Metrics: Micrometer

    Micrometer [maɪˈkrɒm.ɪ.tər] is a facade (API) to collect metrics independent of the storage of your metrics ("SLF4J, but for metrics").
    • Multidimensional metrics
    • Integrations for libraries and backends (Prometheus, Datadog, Ganglia, Graphite, JMX, New Relic, …)
    • Ready to use with Spring Boot 1.x and 2.x
    • Can be used stand-alone and together with other frameworks
    • Version 1.0 released at the same time as Spring Boot 2.0
    • Version 1.1 released at the same time as Spring Boot 2.1

    Homepage: https://micrometer.io/
    License: Apache 2.0
  15. Insights with Metrics: Micrometer API

    Base for all metrics: name; optional: tags and description

    Some metric types:
    • Counter: e.g. number of successful and unsuccessful calls
    • Gauge: e.g. number of currently active database connections
    • Timer: e.g. duration of calls

        import io.micrometer.core.instrument.Counter;

        // registry is the application's MeterRegistry
        Counter myOperationSuccess = Counter
            .builder("myOperation")
            .description("a description for humans")
            .tags("result", "success")
            .register(registry);
        myOperationSuccess.increment(); // count one successful call
  16. Insights with Metrics: Micrometer API

    Derived metrics:
    • Rate: e.g. calls per second
    • Percentile: e.g. 90% of all calls faster than X ms
    • Histogram: e.g. X calls in the interval of 50 ms to 100 ms

    Histograms can be aggregated over instances, percentiles can't!
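
Why histograms aggregate and percentiles don't can be illustrated with a small stand-alone sketch. The bucket bounds and call counts below are invented for illustration, and `HistogramAggregation` is a hypothetical class, not part of Micrometer or Prometheus.

```java
import java.util.Arrays;

// Hypothetical illustration, independent of any metrics library.
public class HistogramAggregation {

    // Upper bounds of the buckets in milliseconds; the last bucket is open-ended.
    static final double[] UPPER_BOUNDS_MS = {50, 100, 200, 500, Double.POSITIVE_INFINITY};

    // Histograms of several instances are merged by adding the counts bucket by bucket.
    public static long[] merge(long[]... perInstanceCounts) {
        long[] total = new long[UPPER_BOUNDS_MS.length];
        for (long[] counts : perInstanceCounts) {
            for (int i = 0; i < total.length; i++) {
                total[i] += counts[i];
            }
        }
        return total;
    }

    // Returns the upper bound of the bucket that contains the requested quantile,
    // or NaN for an empty histogram.
    public static double quantileUpperBound(long[] counts, double quantile) {
        long total = Arrays.stream(counts).sum();
        if (total == 0) {
            return Double.NaN;
        }
        long rank = (long) Math.ceil(quantile * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return UPPER_BOUNDS_MS[i];
            }
        }
        return UPPER_BOUNDS_MS[UPPER_BOUNDS_MS.length - 1];
    }

    public static void main(String[] args) {
        long[] fastInstance = {95, 3, 1, 1, 0};    // 100 calls, mostly below 50 ms
        long[] slowInstance = {10, 10, 20, 55, 5}; // 100 calls, mostly 200 to 500 ms

        double p90Fast = quantileUpperBound(fastInstance, 0.9); // 50.0
        double p90Slow = quantileUpperBound(slowInstance, 0.9); // 500.0
        double p90Merged = quantileUpperBound(merge(fastInstance, slowInstance), 0.9); // 500.0

        System.out.println("average of per-instance p90: " + (p90Fast + p90Slow) / 2);
        System.out.println("p90 of merged histogram:     " + p90Merged);
    }
}
```

Averaging the two per-instance values gives 275 ms, while the merged histogram shows that the 90th percentile of all calls lies in the bucket up to 500 ms. Only the bucket counts carry enough information to be combined.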
  17. Insights with Metrics: Micrometer Architecture

    Diagram (direction of the arrows: data flow) with the elements: Your Code, 3rd Party Libraries, Adapter, Micrometer Core, Meter Registry, Application, Metrics Backend
  18. Insights with Metrics: Example Infrastructure

    Diagram (direction of the arrows: data flow) with the elements: Application (shown three times), Metrics Backend, Prometheus, Grafana, Alert Manager
  19. Insights with Metrics: Prometheus for distributed environments

    Responsibilities:
    • Collect and store metrics
    • Handle inquiries for dashboards
    • Alarms, trend calculations

    Well suited for:
    • Discovering services via service discovery in dynamic environments
    • Adding metadata to collected metrics (stage, datacenter, cluster, node)
    • Blending infrastructure metrics with application metrics including business metrics

    Homepage: https://prometheus.io
    License: Apache 2.0
  20. Insights with Metrics: Using Metrics to build Monitoring and Alerting

    • Plain metric:
        http_server_requests_seconds_count
    • Metric filtered by label:
        http_server_requests_seconds_count{status!='200'}
    • Rate of calls in 5-minute interval (moving average):
        rate(http_server_requests_seconds_count{status!='200'}[5m])
    • Error percentage per URI used for alerting:
        sum by (uri) (rate(http_server_requests_seconds_count{status!='200'}[5m]))
          / sum by (uri) (rate(http_server_requests_seconds_count[5m])) > 0.01
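
The arithmetic behind the alerting expression can be spelled out in a few lines of Java. This is a simplified sketch with invented sample values: the class `ErrorRatio` is hypothetical, and it ignores details such as counter resets that Prometheus takes care of in `rate()`.

```java
// Hypothetical illustration of the calculation, not a Prometheus client.
public class ErrorRatio {

    // Per-second increase of a monotonically increasing counter
    // between two samples taken intervalSeconds apart.
    public static double rate(double earlier, double later, double intervalSeconds) {
        return (later - earlier) / intervalSeconds;
    }

    // True if the share of failed calls exceeds the threshold (0.01 = 1 %).
    // Without any traffic there is nothing to alert on.
    public static boolean shouldAlert(double errorRate, double totalRate, double threshold) {
        if (totalRate <= 0) {
            return false;
        }
        return errorRate / totalRate > threshold;
    }

    public static void main(String[] args) {
        // Assumed counter samples at the start and end of a 5-minute window.
        double totalRate = rate(100, 400, 300); // all calls per second
        double errorRate = rate(0, 6, 300);     // calls with status != 200 per second
        System.out.println("alert: " + shouldAlert(errorRate, totalRate, 0.01));
    }
}
```

Dividing two rates makes the alert independent of the absolute traffic level, which is why the expression on the slide compares a ratio against 0.01 rather than an absolute error count.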
  21. Insights with Metrics: Best practices for Metrics

    • Use the metrics your framework provides
    • Set up the metrics environment as early as the testing stages
    • When something is being logged, it is probably a good idea to create a metric for it
    • Collecting a metric is more runtime-efficient than writing a log entry
  23. Hunting Root Causes and Latencies with Traces: Analyzing errors and latencies

    "The server doesn't respond!"
    "Sometimes it is so slow!"
  24. Hunting Root Causes and Latencies with Traces: A better way to explain why tail latency matters

    Jaana B. Dogan (@rakyll)
    1. https://twitter.com/rakyll/status/1045075510538035200
  25. Hunting Root Causes and Latencies with Traces: Concept of Google's Dapper

    • Assign Trace-ID
    • Pass Trace-ID and additionally a Span-ID
  26. Hunting Root Causes and Latencies with Traces: Zipkin Server receives information and provides a Web UI

    For each Trace-/Span-ID:
    • start/end time client
    • start/end time server
    • tags and logs
  27. Hunting Root Causes and Latencies with Traces: Web UI Zipkin
  28. Hunting Root Causes and Latencies with Traces: Sampling Traces

    • In production only a percentage of requests is traced to limit data volume and performance overhead. The percentage is enough to trace errors and latencies, but you won't be able to analyze all requests.
    • The first node decides which calls are being traced.
    • You can force tracing by adding, for example, a special header to the incoming request (don't allow that header to be sent from the public Internet, to prevent denial-of-service attacks).
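
The sampling decision described on this slide could look roughly like the following sketch. The class `TraceSampler`, the header name `X-Force-Trace`, and the `fromTrustedNetwork` flag are assumptions made for illustration; tracing libraries such as Brave ship their own samplers.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sampler for the first node of a call chain.
public class TraceSampler {

    // Assumed header name for forcing a trace; not a Zipkin standard header.
    public static final String FORCE_HEADER = "X-Force-Trace";

    private final double sampleRate; // fraction of requests to trace, 0.0 to 1.0

    public TraceSampler(double sampleRate) {
        if (sampleRate < 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be between 0.0 and 1.0");
        }
        this.sampleRate = sampleRate;
    }

    // Decides once, on the first node, whether this request is traced.
    // Downstream services are expected to reuse the decision.
    public boolean shouldTrace(Map<String, String> headers, boolean fromTrustedNetwork) {
        // The force header is honoured for trusted callers only, so that
        // clients on the public Internet cannot switch on tracing for every call.
        if (fromTrustedNetwork && "1".equals(headers.get(FORCE_HEADER))) {
            return true;
        }
        // nextDouble() is in [0, 1): a rate of 0.0 never samples, 1.0 always does.
        return ThreadLocalRandom.current().nextDouble() < sampleRate;
    }

    public static void main(String[] args) {
        TraceSampler production = new TraceSampler(0.01);
        System.out.println(production.shouldTrace(Map.of(FORCE_HEADER, "1"), true));
    }
}
```

Stripping the header at the edge proxy would be an alternative to the `fromTrustedNetwork` flag; either way the point is that the force switch must not be reachable from outside.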
  29. Hunting Root Causes and Latencies with Traces: Zipkin Browser-Plugin for Chrome

    • Add trace-ID to HTTP headers at client
    • Direct link to traces view in a new tab
  30. Hunting Root Causes and Latencies with Traces: Best practices for Tracing

    • Trace everything in development and testing
    • Trace a small percentage or only on-demand in production
    • Use tracing to investigate incidents and locate the relevant service ("unknown unknowns")
    • Use the generated dependency graph to verify and optimize dependencies
  32. Logs as a Stream of Events: Welcome to Logging Hell!

    Code:
        for (Invoice i : invoiceRepository.findAll()) {
            i.calculateTotal();
        }

    Log:
        07:26:00.595 d.a.t.d.Invoice ERROR - can't load item ID 4711
  33. Logs as a Stream of Events: Logs with Context Information

        import org.apache.logging.log4j.ThreadContext;

        for (Invoice i : repository.findAll()) {
            // attach the invoice ID to every log entry written while processing this invoice
            ThreadContext.put(INVOICE_ID, Long.toString(i.getId()));
            try {
                i.calculateTotal();
            } finally {
                // always clean up so the ID doesn't leak into unrelated log entries
                ThreadContext.remove(INVOICE_ID);
            }
        }
  34. Logs as a Stream of Events: Configuring the Logging Pattern

    Configuration:
        <PatternLayout pattern="%d{HH:mm:ss.SSS} %X %-5level ..."/>

    Log output:
        08:39:42.969 {invoiceId=1} ... - can't load purchase ID 4711
  35. Logs as a Stream of Events: Log output for Web Applications

        08:52:54.276 {http.method=GET, http.url=http://localhost:8080/api/startBillingRun, invoiceId=1, user=Theo Tester} ERROR d.a.t.d.Invoice - can't load purchase ID 4711

    Further fields that might be of interest:
    • Client IP address
    • Part of the session ID
    • Browser name (User Agent)
  36. Logs as a Stream of Events: Adding correlation IDs to log entries

    Additional tracing HTTP headers:
        GET /api/callback HTTP/1.1
        Host: localhost:8080
        ...
        X-B3-SpanId: 34e628fc44c0cff1
        X-B3-TraceId: a72f03509a36daae
        ...

    Additional information in the log:
        09:20:05.840 { X-B3-SpanId=34e628fc44c0cff1, X-B3-TraceId=a72f03509a36daae, ..., invoiceId=1} ERROR d.a.t.d.Invoice - can't load purchase ID 4711
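
How such IDs travel from one service to the next can be sketched without any tracing library. In a real application a library such as Brave generates and propagates the headers; the class `CorrelationIds` below is a hypothetical stand-in that only shows the idea: the trace ID stays the same along the call chain, while every hop gets a new span ID and records the caller's span as its parent.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of B3-style header propagation.
public class CorrelationIds {

    // Creates a random 64-bit ID rendered as 16 lowercase hex characters.
    public static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // Headers for the first hop: trace ID and span ID are freshly generated.
    public static Map<String, String> startTrace() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", newId());
        headers.put("X-B3-SpanId", newId());
        return headers;
    }

    // Headers for an outgoing call: same trace ID, new span ID,
    // and the caller's span ID recorded as parent.
    public static Map<String, String> nextHop(Map<String, String> incoming) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", incoming.get("X-B3-TraceId"));
        headers.put("X-B3-ParentSpanId", incoming.get("X-B3-SpanId"));
        headers.put("X-B3-SpanId", newId());
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> first = startTrace();
        Map<String, String> second = nextHop(first);
        System.out.println(first);
        System.out.println(second);
    }
}
```

Putting the incoming values into the logging context of each service is what lets a search for one trace ID return the log entries of all services involved in a request.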
  37. Logs as a Stream of Events: Per-Request-Debugging-Logging

    • Zipkin passes on the header X-B3-Flags, which can encode additional information
    • The Zipkin Chrome plugin passes the value "1" (debug=true)
    • A servlet filter can pick up this value and pass it to the Thread Context of Log4j2 (for example set X-B3-Flags-debug to true)
    • The following Log4j2 configuration activates the trace level for requests that have this HTTP header

        <DynamicThresholdFilter key="X-B3-Flags-debug" onMatch="ACCEPT"
                                defaultThreshold="warn" onMismatch="NEUTRAL">
          <KeyValuePair key="true" value="trace"/>
        </DynamicThresholdFilter>
  38. Logs as a Stream of Events: Aggregate logs in a central, searchable store

    • All nodes ship their logs to a central store (asynchronously).
    • Create a searchable store of logs (rotate after a fixed amount of logs).
    • JSON log entries are the simplest format that can be parsed.
  39. Logs as a Stream of Events: Best practices for Logging

    • Use context information
    • Have a central searchable storage for your logs
    • Use JSON as a log format
    • Enable log level configuration on request level
    • Include correlation IDs in users' error messages so you can search the logs
  41. Putting it all together: Status Information

    • Deliver actionable information even when no requests are processed
    • Essential to handle "known knowns" problems ahead of users reporting problems
    • Implement health indicators for all remote systems and configurations
    • Ask specific questions to trigger actions
  42. Putting it all together: Metrics

    • Look at response time percentiles and error rates to spot problems
    • Essential to handle "known knowns" and "known unknowns" problems ahead of users reporting problems
    • Frameworks already offer standard metrics
    • Adding a custom metric to an application sometimes takes only one line of code
    • Aggregate metrics over nodes and environments and calculate historical trends
    • Treat your status information as metrics to use the same metrics and alerting pipeline
  43. Putting it all together: Tracing

    • Use it to find the root causes of latencies and errors
    • Essential to handle "unknown unknowns" during incident investigation
    • Call trees can be drawn automatically from trace information
    • It provides correlation IDs for your logs for free
    • On top of tracing you can enable per-request debug logs
  44. Putting it all together: Logging

    • Use levels error/warn to signal problems ("known knowns", "unknown knowns")
    • Use levels info/debug to analyze incidents
    • Aggregate the logs with context, starting with the correlation ID and request metadata
    • Make them searchable by logging in JSON format and aggregating all logs in a central store (Graylog or Kibana)
  50. Putting it all together: Things that changed in distributed environments

    Correlation of events
      Static nodes: Session ID or thread ID
      Dynamic multi-node / distributed: Correlation ID passed between processes and hosts
    Searching logs
      Static nodes: Log files and manually configured log level
      Dynamic multi-node / distributed: Central log server with searchable logs, dynamic log levels
    Metrics and Monitoring
      Static nodes: Manual monitoring configured by operations
      Dynamic multi-node / distributed: Linked to service discovery, metrics added by application team autonomously
    Status Information
      Static nodes: Preflight check when starting the application
      Dynamic multi-node / distributed: Used for automated restarts and scaling, configure load balancers, check remote connections
    Tracing
      Static nodes: Root cause usually within the edge node
      Dynamic multi-node / distributed: Root cause for latency/fault usually not in edge node, tracing will point to the problem/dependency

    Some of the previous best practices turned into essential practices in a distributed environment.
  51. Links

    • Micrometer: https://micrometer.io
    • Prometheus: https://prometheus.io
    • Grafana: https://grafana.com
    • Log4j: https://logging.apache.org/log4j
    • Graylog: https://www.graylog.org/
    • Zipkin, Brave: https://github.com/openzipkin
    • Google SRE Book (chapter "Monitoring Distributed Systems"): https://landing.google.com/sre/
    • Additional slides:
      https://www.ahus1.de/post/micrometer
      https://www.ahus1.de/post/logging-and-tracing
      https://www.ahus1.de/post/prometheus-and-grafana

    @ahus1de
  52. Alexander Schwartz, Principal IT Consultant

    +49 171 5625767
    alexander.schwartz@msg.group
    @ahus1de
    msg systems ag (Headquarters), Robert-Buerkle-Str. 1, 85737 Ismaning, Germany
    www.msg.group