Improve Monitoring and Observability for Kubernetes with OSS tools

Slide 1

Slide 1 text

@nileshgule Improve Monitoring and Observability for Kubernetes with OSS tools

Slide 2

Slide 2 text

Nilesh Gule
ARCHITECT | MICROSOFT MVP | First Docker Captain in Singapore
“Code with Passion and Strive for Excellence”
nileshgule | @nileshgule | Nilesh Gule | NileshGule
www.handsonarchitect.com
https://www.youtube.com/@nilesh-gule

Slide 3

Slide 3 text


Slide 4

Slide 4 text

CNCF Cloud Native Trail Map https://github.com/cncf/trailmap

Slide 5

Slide 5 text

CNCF Observability landscape https://landscape.cncf.io

Slide 6

Slide 6 text

CNCF Observability Radar https://radar.cncf.io/2020-09-observability

Slide 7

Slide 7 text

3 Pillars of Observability: Logs, Metrics, Traces

Slide 8

Slide 8 text

Centralized Logging

Slide 9

Slide 9 text

Why centralized logging

Container based workloads
❑ Application specific
  ❖ Long term log retention for compliance reasons
  ❖ Workloads scheduled on different nodes during application restarts / updates
  ❖ Autoscaling workloads
❑ Kubernetes upgrades
  ❖ Auto healing can reschedule workloads
  ❖ Underlying nodes added / deleted during cluster scaling
  ❖ Underlying nodes replaced during cluster upgrades

PaaS / Serverless services
❖ Not much control over underlying infra
❖ Relies on cloud provider specific logging and monitoring solution

Slide 10

Slide 10 text

Financial Services App: Loki integration
Services: backend-service, account-service, authentication-service, forex-service, transaction-service
Pipeline: Log collector; Log storage; Log search, visualise, dashboards
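A log collector generally works best when each service writes one structured line per event to standard output. The sketch below shows one way a service such as forex-service might do that with Python's standard `logging` module. The field names (`ts`, `level`, `service`, `message`) and the helper `build_logger` are assumptions made for illustration; the slides do not say how the demo services format their logs.

```python
import io
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            # "service" is attached by the LoggerAdapter below (hypothetical field).
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str, stream=None) -> logging.LoggerAdapter:
    """Return a logger that writes JSON lines to stdout (or the given stream)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    # The adapter adds the service name to every record it emits.
    return logging.LoggerAdapter(logger, {"service": service})


if __name__ == "__main__":
    log = build_logger("forex-service")
    log.info("exchange rates refreshed")
```

Because every line carries the service name, a query in the log search stage can filter by service even after pods have been rescheduled onto other nodes.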

Slide 11

Slide 11 text

Demo 1 – Log Aggregation with Loki

Slide 12

Slide 12 text

Metrics

Slide 13

Slide 13 text

Why Metrics

Container based workloads
• Application specific
  • Monitor resource usage
  • Monitor scaling needs
  • Monitor anomalies / outliers
• Kubernetes platform level
  • Monitor cluster resources (CPU / RAM)
  • API health
  • Autoscaling

PaaS / Serverless services
• Monitor resource usage
• Scaling
• Bottlenecks

Slide 14

Slide 14 text

Prometheus Architecture

Slide 15

Slide 15 text

Demo 2 – Metrics using Prometheus & Grafana

Slide 16

Slide 16 text

Financial Services App: Prometheus integration
Services: backend-service, account-service, authentication-service, forex-service, transaction-service
Components: service-monitor
Pipeline: Scrape Metrics; Metrics storage; visualise, dashboards
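Prometheus scrapes metrics as plain text over HTTP. To make the "Scrape Metrics" stage concrete, the sketch below hand-rolls a tiny counter and renders it in the Prometheus text exposition format. It is an illustration only: the `Counter` class here is a hypothetical stand-in, not the official client library, and the metric name `http_requests_total` and its labels are assumptions rather than names taken from the demo app.

```python
from collections import defaultdict


class Counter:
    """A minimal labelled counter (illustrative, not the official client)."""

    def __init__(self, name: str, help_text: str, label_names: tuple):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Build the key in declared label order so rendering is stable.
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self) -> str:
        """Return the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key in sorted(self._values):
            # Assumes label values contain no quotes, backslashes or newlines.
            labels = ",".join(
                f'{n}="{v}"' for n, v in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{labels}}} {self._values[key]}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests = Counter(
        "http_requests_total", "Total HTTP requests.", ("service", "status")
    )
    requests.inc(service="account-service", status="200")
    requests.inc(service="account-service", status="200")
    requests.inc(service="account-service", status="500")
    print(requests.render(), end="")
```

A real service would normally use a maintained client library and serve this text on an HTTP endpoint (commonly `/metrics`), which the service-monitor resource then points Prometheus at.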

Slide 17

Slide 17 text

Distributed Tracing

Slide 18

Slide 18 text

Why Distributed Tracing
• Distributed Tracing
  • Understanding complex systems
  • Performance monitoring and optimizations
  • Debugging and problem resolution

Slide 19

Slide 19 text

Financial Services App: Jaeger integration
Services: backend-service, account-service, authentication-service, forex-service, transaction-service
Components: Jaeger Operator
Pipeline: Distributed Traces; Visualise Traces
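A trace can only span several services if each call carries the trace identifier forward. The sketch below illustrates that idea using the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`). Whether the demo app propagates context with this header or with another format is not stated in the slides, so treat the header choice and the helper names `new_traceparent` and `child_traceparent` as assumptions.

```python
import re
import secrets

# version "00", 32 hex trace id, 16 hex parent span id, 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: fresh trace id, fresh span id, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Continue the caller's trace with a new span id.

    Falls back to starting a new trace when the incoming header is
    missing or malformed.
    """
    match = TRACEPARENT_RE.match(parent or "")
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    # backend-service starts the trace and calls forex-service.
    incoming = new_traceparent()
    outgoing = child_traceparent(incoming)
    print("backend-service ->", incoming)
    print("forex-service   ->", outgoing)
```

Both headers share the same trace id, which is what lets a tracing backend stitch the spans from different services into one trace. In practice an instrumentation library would usually handle this propagation rather than hand-written code.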

Slide 20

Slide 20 text

Demo 3 – Distributed Tracing using Jaeger

Slide 21

Slide 21 text

End to End Observability
Services: backend-service, account-service, authentication-service, forex-service, transaction-service

Slide 22

Slide 22 text

Analogy: use the right tool for the right purpose

Slide 23

Slide 23 text

Summary

Modern day cloud native applications need new ways to address observability & monitoring
✓ Use best-of-class tools for the given use case
✓ Rely on open standards (e.g. OpenTelemetry)
✓ Build portable observability systems (e.g. hybrid cloud migration)

Log Aggregation
✓ Loki helps in centralized logging
✓ Grafana is used to visualize logs and build dashboards

Metrics
✓ Prometheus provides easy to use metrics for platforms and applications
✓ Grafana provides visualization capabilities to build intuitive dashboards

Distributed Tracing
✓ Jaeger provides distributed tracing capabilities

Slide 24

Slide 24 text

Some Recommendations

Challenges | Tools
♣ Too many agents | ∞ OpenTelemetry collector
♣ Instrumentation, vendor lock-in | ∞ OpenTelemetry, OpenMetrics
♣ Cloud native logs | ∞ Fluent Bit / Fluentd, OpenSearch, Loki
♣ Cloud native metrics | ∞ Prometheus, Cortex, Thanos
♣ Cloud native traces | ∞ OpenTelemetry, Jaeger, Grafana
♣ Single pane of glass, correlation | ∞ Grafana

Slide 25

Slide 25 text

References

Log Aggregation
❖ Grafana Loki

Monitoring & Alerting
❖ Prometheus
❖ Grafana
❖ Kube Prometheus stack
❖ Houssem Dellai – Prometheus & Grafana for monitoring Kubernetes

Distributed Tracing
❖ Jaeger Tracing

Slide 26

Slide 26 text

Source Code & slide deck

Financial Services Demo
https://github.com/infofractionalservices/microservices/tree/docker_build_fixes

Slide deck
https://speakerdeck.com/nileshgule/
https://www.slideshare.net/nileshgule/

Slide 27

Slide 27 text

Q&A