Observability from 0 to 100

Nico Krijnen
September 20, 2022

We all know that our systems need to be more observable, but how do you get to valuable insights? Can you really get up and running as fast as vendors would have you believe? And where is it worth spending extra time to get things right? Observability capabilities have grown rapidly over the last few years. Many vendors and tools exist in this field, with offerings that vary in customizability and out-of-the-box capabilities. In this talk we'll take a look at what you need to do to get observability going for your platform and applications.

We go beyond the marketing talk and see what it is like to build up proper observability. To make it concrete, we'll do that for a Kubernetes cluster running JVM applications, for which we will ship logs, metrics and traces to an Elastic stack. We'll manage all the configuration as infrastructure as code with the CDK for Terraform. Along the way we will share some key ingredients we discovered that make your observability setup more effective and downright simpler. Expect to discover what you need to do, not just at the infrastructure level, but also in application code and, most importantly, at the organizational level, so you can expose everything to the right people.

Transcript

  1. Observability, Monitoring, APM, Security Analytics / SIEM

    • Observability: property of a system, like testability
    • Monitoring: act of observing a system over time
    • APM: monitoring the performance and availability of an application
    • Security Analytics / SIEM: threat detection and proactive security measures
  2. Logs

    • Process: Logstash, Filebeat, Fluentd, FluentBit, Vector, Elastic Agent, ...
    • Store & Visualize: Elasticsearch + Kibana, Datadog, Splunk, ...
  4. Vendor Agents

    • Log & metrics collection on the host: tailing log files and collecting host metrics
    • Metrics & trace collection in-process: adding instrumentation inside your application
    Diagram labels: In-process agent, Process, Agent, Host
  5. Observability – Ingredients – Make it work – OpenTelemetry – Takeaway

    Make it work: Understanding & Buy-in
  6. Know your needs and budget

    • Easy to produce terabytes of data
    • Cost rises with amount of data & retention
    • Know your budget & what that gets you
    • Know what your organization needs
    • Make sure you have support
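
To make the "cost rises with amount of data & retention" point concrete, here is a rough back-of-the-envelope sketch. The function names, the replica count and all numbers are assumptions for illustration; real costs also depend on compression, index overhead and your vendor's pricing model.

```python
def estimate_storage_gb(daily_ingest_gb, retention_days, replicas=1):
    """Rough steady-state storage: each retained day is stored once per copy.

    Ignores compression and index overhead (assumption for simplicity).
    """
    copies = 1 + replicas  # primary copy plus replicas
    return daily_ingest_gb * retention_days * copies


def estimate_monthly_cost(storage_gb, price_per_gb_month):
    """Storage cost only; price_per_gb_month is a hypothetical flat rate."""
    return storage_gb * price_per_gb_month


# Hypothetical example: 50 GB/day, kept for 30 days, with one replica.
storage = estimate_storage_gb(daily_ingest_gb=50, retention_days=30, replicas=1)
cost = estimate_monthly_cost(storage, price_per_gb_month=0.10)
print(f"{storage} GB stored, roughly {cost:.2f} per month")
```

Doubling either the daily volume or the retention period doubles the result, which is why both are worth agreeing on before scaling out.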
  7. Start small

    • Spike: get the whole chain working for one service
    • Validate and fine-tune
    • Scale out: implement the blueprint for all services
  8. Elastic Cloud [Azure Deployment in West Europe (NL) managed by Elastic]

    People:
    • Developer / Operator [Person]: Analyze logs, metrics, traces. Quickly find cause of issues.
    • Admin [Person]: Manage configuration
    • Business / PO [Person]: Analyze usage. Gather data to make decisions.
    Elastic Cloud:
    • Elastic Fleet [Software System]: Central management of configuration for log & metric collection agents.
    • Elasticsearch [Software System]: Search & analytics engine. Also stores configuration data.
    • Kibana [Software System]: Centralized logs, metrics, traces & Management UI
    NON-PROD [k8s cluster]:
    • Elastic Agent [Container: DaemonSet Pod = one pod per worker node]: Run FileBeat, MetricBeat, APM server with configuration as defined in Elastic Fleet Policy. Exposes APM Server port 8200 on host.
    • aks-agentpool-vmss0, aks-agentpool-vmss1, aks-agentpool-vmss2 [k8s worker node], each with an Elastic Agent [DaemonSet]
    • some-service [Container: Pod]: Elastic APM JVM Agent is embedded in Docker image
    • other-server [Container: Pod]: Elastic APM JVM Agent is embedded in Docker image
    PROD [k8s cluster]: Idem
    Relationships:
    • Send JVM metrics and traces to APM Server [HTTP @ HOST-IP:8200]
    • Fetch desired config & Send logs, metrics, traces to Fleet Server [HTTPS]
    • Ingest logs, metrics, traces [HTTPS]
    • Configure log, metrics, trace collection policies [HTTPS]
    • Search & analyze logs, metrics, traces [HTTPS]
    • Fleet enrollment token for PROD agent policy
    • Fleet enrollment token for NON-PROD agent policy
    Legend: Person, External Software System, Infra Container, Application Container
  10. aks-agentpool-vmss* [k8s worker node (host)]

    Elastic Agent [Container]:
    • Elastic Agent [Component: Process]: Manage collectors & shippers. Starts & provides configuration.
    • Elastic Filebeat [Component: Process]: Lightweight log-shipper. Tail pod container logfiles [Volume mount]: /var/log/containers/*${kubernetes.container.id}.log. Ship logs [HTTPS].
    • Elastic Metricbeat [Component: Process]: Lightweight metric-collector & shipper. Collect metrics [JSON/HTTP] from the K8s API. Ship metrics [HTTPS].
    • Elastic APM Server [Component: Process]: Lightweight trace-collector & shipper. Ship metrics & traces [HTTPS].
    some-service [Container]:
    • Java VM [Process]
    • Elastic APM Agent [Component: Java Agent]: Wraps Java application and automatically instruments JVM, Spring, Logging, etc. to enrich log output and collect traces & metrics. Send JVM metrics and traces to APM Server [HTTP @ HOST-IP:8200].
    • Application [Component: Spring Boot Application]
    Elastic Cloud [Azure Deployment managed by Elastic]:
    • Elastic Fleet [Software System]: Central management of configuration for log & metric collection agents. Loads policy config & poll for policy changes [HTTPS].
    Legend: External Software System, Infra Component, Application Component
  11. Elastic Agent [Component: Process]

    • Elastic Agent [Component: Process]: Manage collectors & shippers. Starts & provides configuration.
    • Elastic Fleet [Software System], in Elastic Cloud [Azure Deployment managed by Elastic]: Central management of configuration for log & metric collection agents.
    • Loads policy config & poll for policy changes [HTTPS]
  12. Elastic Agent [Container]

    Elastic Agent [Container]:
    • Elastic Agent [Component: Process]: Manage collectors & shippers. Starts & provides configuration.
    • Elastic Filebeat [Component: Process]: Lightweight log-shipper. Tail pod container logfiles [Volume mount]: /var/log/containers/*${kubernetes.container.id}.log. Ship logs [HTTPS].
    • Elastic Metricbeat [Component: Process]: Lightweight metric-collector & shipper. Collect metrics [JSON/HTTP] from the K8s API. Ship metrics [HTTPS].
    • Elastic APM Server [Component: Process]: Lightweight trace-collector & shipper. Ship metrics & traces [HTTPS].
    some-service (partly cropped on the slide):
    • Java VM [Process]
    • Elastic APM Agent [Component: Java Agent]: Wraps Java application and automatically instruments JVM, Spring, Logging, etc. to enrich log output and collect traces & metrics. Send JVM metrics and traces to APM Server [HTTP @ HOST-IP:8200].
    • Application [Component: Spring Boot Application]
  14. aks-agentpool-vmss* [k8s worker node (host)], shown twice with the same contents

    Elastic Agent [Container]:
    • Elastic Filebeat [Component: Process]: Lightweight log-shipper. Ship logs [HTTPS].
    • Elastic Metricbeat [Component: Process]: Lightweight metric-collector & shipper. Ship metrics [HTTPS].
    • Elastic APM Server [Component: Process]: Lightweight trace-collector & shipper. Ship metrics & traces [HTTPS].
    some-service [Container]:
    • Java VM [Process]
    • Elastic APM Agent [Component: Java Agent]: Wraps Java application and automatically instruments JVM, Spring, Logging, etc. to enrich log output and collect traces & metrics. Send JVM metrics and traces to APM Server [HTTP @ HOST-IP:8200].
    • Application [Component: Spring Boot Application]
  15. Instrumentation

    aks-agentpool-vmss* [k8s worker node (host)]
    Elastic Agent [Container]:
    • Elastic Agent [Component: Process]: Manage collectors & shippers. Starts & provides configuration.
    • Elastic Filebeat [Component: Process]: Lightweight log-shipper. Tail pod container logfiles [Volume mount]: /var/log/containers/*${kubernetes.container.id}.log. Ship logs [HTTPS].
    • Elastic Metricbeat [Component: Process]: Lightweight metric-collector & shipper. Collect metrics [JSON/HTTP] from the K8s API. Ship metrics [HTTPS].
    • Elastic APM Server [Component: Process]: Lightweight trace-collector & shipper. Ship metrics & traces [HTTPS].
    some-service [Container]:
    • Java VM [Process]
    • Elastic APM Agent [Component: Java Agent]: Wraps Java application and automatically instruments JVM, Spring, Logging, etc. to enrich log output and collect traces & metrics. Send JVM metrics and traces to APM Server [HTTP @ HOST-IP:8200].
    • Application [Component: Spring Boot Application]
  16. Optimize query results

    • Add metadata to help you filter:
    • Service name
    • Environment (dev, staging, prod)
    • Kubernetes metadata
    • Cloud metadata
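
As a minimal sketch of attaching such metadata at the source, the formatter below emits every log record as one JSON object with a fixed set of extra fields. The class name, the helper and the field names (`service.name`, `service.environment`) are illustrative assumptions, not what the talk's setup uses; in the architecture described in this deck, Kubernetes and cloud metadata would typically be added by the collection agent rather than by the application.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with static metadata fields."""

    def __init__(self, static_fields):
        super().__init__()
        self.static_fields = dict(static_fields)

    def format(self, record):
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Same metadata on every line, so queries can filter on it.
        event.update(self.static_fields)
        return json.dumps(event)


def build_logger(name, service_name, environment):
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter({
        "service.name": service_name,
        "service.environment": environment,
    }))
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("orders", service_name="some-service", environment="staging")
log.info("order %s created", "A-1")
```

Because every service emits the same shape, one query such as "environment is prod and service is some-service" works across all of them.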
  17. Make sure no data is dropped

    object mapping for [json.data] tried to parse field [data] as object, but found a concrete value"}, dropping event!

    {"json": {"data": {"gtin": "...", "product":"..." } } }
    {"json": {"data": "..."} }
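
The two events above conflict because the same field, `json.data`, is an object in one and a plain string in the other. One way to avoid the dropped event, sketched here under the assumption that you can normalize events before they are shipped, is to make sure the field always has the same shape. The function name and the `value` key are hypothetical choices for illustration.

```python
def normalize_data_field(event):
    """Return an event in which json.data, if present, is always an object.

    Concrete values (strings, numbers, ...) are wrapped as {"value": ...};
    "value" is an arbitrary key chosen for this sketch.
    """
    payload = event.get("json")
    if not isinstance(payload, dict) or "data" not in payload:
        return event  # nothing to normalize
    data = payload["data"]
    if isinstance(data, dict):
        return event  # already an object
    return {**event, "json": {**payload, "data": {"value": data}}}


print(normalize_data_field({"json": {"data": "plain text"}}))
print(normalize_data_field({"json": {"data": {"gtin": "...", "product": "..."}}}))
```

An alternative with the same effect is to agree on distinct field names per shape, so that no two services ever write different types to the same field.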
  18. Tune capacity - retention policies

    • Not all data is equally valuable
    • Your apps vs system and platform
    • Logs, Metrics, Traces
  19. Tune capacity - rollup jobs

    • Drop what you don't need
    • Reduce metric density after some time
    • Capture all failure traces, but only sample % of normal traces
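
The last point, keeping every failure trace but only a percentage of normal ones, can be sketched as a simple decision function. This is an illustration of the idea only, not how any particular APM product implements sampling; it assumes the decision is made once the outcome of the trace is known, and the function name and parameters are hypothetical.

```python
import hashlib


def keep_trace(trace_id, failed, sample_rate):
    """Keep all failed traces; keep roughly `sample_rate` of the others.

    Hashing the trace id makes the decision deterministic, so every
    component that sees the same id reaches the same verdict.
    """
    if failed:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


# Hypothetical usage: keep 10% of successful traces.
kept = sum(keep_trace(f"trace-{i}", failed=False, sample_rate=0.1) for i in range(10_000))
print(f"kept {kept} of 10000 normal traces")
```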
  20. Tune capacity - data tiering

    • Benefit of observability data: it's all time series
    • Lifecycle policy
    • Hot, expensive storage for latest data
    • Warm, cold and frozen for data as it gets older
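
Because the data is time series, a lifecycle policy boils down to a mapping from data age to storage tier. The sketch below shows only that mapping; the age thresholds are made-up assumptions, and in an Elastic stack you would express this as a lifecycle policy on the cluster rather than in application code.

```python
from datetime import timedelta

# Hypothetical thresholds, ordered from oldest tier to newest.
TIER_MIN_AGE = [
    ("frozen", timedelta(days=90)),
    ("cold", timedelta(days=30)),
    ("warm", timedelta(days=7)),
    ("hot", timedelta(0)),
]


def tier_for_age(age):
    """Return the storage tier for data of the given age."""
    for tier, min_age in TIER_MIN_AGE:
        if age >= min_age:
            return tier
    raise ValueError("age must not be negative")


for days in (1, 10, 45, 365):
    print(days, "days old:", tier_for_age(timedelta(days=days)))
```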
  21. OpenTelemetry maturity (https://opentelemetry.io/docs/reference/specification/status/)

    • Tracing: API stable, feature-freeze; SDK stable; Protocol stable
    • Metrics: API stable; SDK mixed; Protocol stable
    • Logging: API draft; SDK draft; Protocol stable
  23. Key takeaway

    • Observability can be complex, but it doesn't have to be.
    • Uniform JSON logging will help you keep things simple.
    • Don't forget who you're doing this for: increase understanding of your systems so you can provide better experiences.