Implementing Observability Practices on Kubernetes

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Speakers
Che Wei Lin a.k.a. johnlin
- Site Reliability Engineer @ LINE Taiwan - SRE Team
- Loves Linux Networking, Distributed Systems & Open Source
Wei
- Site Reliability Engineer @ LINE Taiwan - SRE Team
- A Gopher, interested in Kubernetes and Serverless
- Sometimes contributes to Open Source projects

Slide 3

Slide 3 text

Agenda
- Traditional Monitoring Practices & Background
- Overview of Observability Platform
- Adoption Status & Current Practices
- Challenges of Adoption
- Sharing Use Cases

Slide 4

Slide 4 text

Traditional Monitoring Practices Background introduction

Slide 5

Slide 5 text

Previous (K8s) Adoption Status
October 2020
- Projects: 20+
- App Configs: 130+
- K8s Clusters: 50+

Slide 6

Slide 6 text

Current (K8s) Adoption Status
October 2021
- Projects: 20 to 40 (+100%)
- K8s Clusters: 50 to 70 (+40%)
- App Configs: 130 to 650+ (+500%)

Slide 7

Slide 7 text

Background
Verda Infrastructure
- Verda is a private cloud service for LINE
- VKS: Verda Kubernetes Service
- VES: Verda Elasticsearch Service
- VOS: Verda Object Storage
- Most application teams use the in-house VKS service for hosting applications

Slide 8

Slide 8 text

Traditional Monitoring Practice
Traditional Approaches
For Self-hosting Prometheus, Grafana
- Prometheus long-term storage problem (e.g., 6 months of metrics)
- Each team may have its own Grafana dashboards, which causes redundant work
For Self-hosting ELK stack
- Needs an ELK stack expert to manage performance and indices
Built-in LogSender & Prometheus Are Not Enough
- Logs: Collect container logs with Fluentd and send them to Verda Elasticsearch Service (VES)
- Metrics: Prometheus monitors at the cluster level
- Traces: Few teams adopt Zipkin for distributed request tracing

Slide 9

Slide 9 text

Overview of Observability Platform

Slide 10

Slide 10 text

Observability Platform
Goal of Observability Platform: Pooling of system knowledge for the organization
- Use the same data sources and dashboards to discuss the behavior of events
- The process of adding indicators to an application is the same, so experience can be shared
- Reduce communication costs caused by differences in tools
Everyone Uses the Same Tools & Dashboards
- A single managed Grafana instance with multiple organizations for all projects
- Provides integrated logs, metrics, and tracing data sources
- Built-in commonly used dashboards

Slide 11

Slide 11 text

Multi-tenancy
Project-based Permission Control
[Diagram: Grafana grants users permissions by mail group, giving them access to the indicators of Project A or Project B]

Slide 12

Slide 12 text

Architecture Overview

Slide 13

Slide 13 text

Grafana Loki
[Architecture diagram: Promtail in each project VKS cluster (Project A, Project B) collects workload logs and writes them over the Verda network, tagged with its tenant ID ({Tenant ID: A}, {Tenant ID: B}), to Loki in the Observability Cluster; Loki writes to Verda Object Storage under Bucket / {Tenant ID}; Grafana reads logs from Loki with LogQL, scoped by {Tenant ID}]
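The tenant separation on this slide lines up with Loki's multi-tenant mode, in which each request carries its tenant ID in the `X-Scope-OrgID` HTTP header. Below is a minimal sketch of what a write for tenant A could look like. The Loki URL and the label set are hypothetical (the slides do not give them), and the request is only built, not sent; in the setup described here Promtail performs this step.

```python
import json
import time
import urllib.request

# Hypothetical endpoint: the real Loki address is not given in the slides.
LOKI_PUSH_URL = "http://loki.example.internal:3100/loki/api/v1/push"


def build_push_request(tenant_id, labels, line, ts_ns=None):
    """Build (but do not send) a Loki push request scoped to one tenant."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    payload = {
        "streams": [
            {
                "stream": labels,                 # stream labels, e.g. {"app": "demo"}
                "values": [[str(ts_ns), line]],   # [timestamp in ns as a string, log line]
            }
        ]
    }
    return urllib.request.Request(
        LOKI_PUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Scope-OrgID": tenant_id,           # tenant ID used by Loki in multi-tenant mode
        },
        method="POST",
    )


# Example with assumed values: one log line written on behalf of tenant "A".
req = build_push_request("A", {"app": "demo"}, "hello from project A", ts_ns=1)
print(req.get_method(), req.full_url)
```

Reads work the same way: a query sent with a given `X-Scope-OrgID` only sees that tenant's logs, which is what makes one shared Loki usable by many projects.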

Slide 14

Slide 14 text

Prometheus & Thanos
[Architecture diagram: Prometheus + Thanos and a Thanos Query in each project VKS cluster (Project A, Project B); Thanos Query A, Thanos Query B, and Thanos Store instances in the Observability Cluster; Grafana queries with PromQL through Thanos Query; Thanos Store reads Bucket A and Bucket B in Verda Object Storage; clusters are connected over the Verda network]

Slide 15

Slide 15 text

Grafana Tempo & OpenTelemetry Collector
[Architecture diagram: in each project VKS cluster (Project A, Project B), pods with a Jaeger sidecar or API/SDK/Exporters send traces to an OpenTelemetry Collector; the Observability Cluster runs OpenTelemetry Collector A and B, Kafka, and Tempo; Tempo stores traces in Verda Object Storage under Bucket / {Tenant ID}; Grafana reads from Tempo; clusters are connected over the Verda network]

Slide 16

Slide 16 text

Storage Usage
Telemetry Data
- Logs: 30 TB, 20M objects
- Metrics: 60 TB, 360K objects
- Traces: 20 TB, 1M objects

Slide 17

Slide 17 text

Managed Ingress Controller
Traefik Kubernetes Ingress Provider
- Automatically installed in the clusters of each project
- Part of the observability platform: integrated sidecar for collecting traces and exposing metrics
- Traefik as a Kubernetes ingress provider

Slide 18

Slide 18 text

Managed Ingress Controller
[Architecture diagram: network traffic enters each project VKS cluster (Project A, Project B) through a Verda LB and reaches the Traefik ingress controller with its Jaeger sidecar; traces are written to an OpenTelemetry Collector in the cluster and on to OpenTelemetry Collector A / B in the Observability Cluster over the Verda network]

Slide 19

Slide 19 text

Adoption Status & Current Practices

Slide 20

Slide 20 text

Adoption Status
Observability Platform Project Adopted
- LINE SPOT
- LINE Shopping
- LINE Travel
- LINE HUB
- LINE MUSIC TW
- LINE TODAY
- OAPLUS
- Sticker TW

Slide 21

Slide 21 text

Adoption Status
Statistics of Grafana Users
- Organizations: 34
- Active users (30d): 120+
- Users: 200+

Slide 22

Slide 22 text

Adoption Status
Statistics of Grafana Across all Organizations
- Dashboards: 750+
- Alerts: 300
- Data Sources: 200

Slide 23

Slide 23 text

Adoption Status
Statistics of Data Sources
[Bar chart: number of data sources by type, axis from 0 to 40. Types: MySQL, Elasticsearch, InfluxDB, Grafana Tempo, Grafana Loki, Prometheus]

Slide 24

Slide 24 text

Adoption Status
Statistics of Telemetry Data (hourly, weekly)
- Log Lines: 300M / h, 40B+ / week
- Metric Series: 16M
- Trace Spans: 200M / h, 20B+ / week

Slide 25

Slide 25 text

How to manage agents for more than 70 clusters

Slide 26

Slide 26 text

Current Practices: Argo CD ApplicationSet
Manage Multiple Argo CD Apps
[Diagram: Argo CD reads from Git and deploys to Cluster A, Cluster B, and Cluster C]
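The ApplicationSet practice on this slide can be sketched as a manifest that stamps out one Argo CD Application per registered cluster. The slides do not say which generator or repository layout is used, so the cluster generator, the agent name, the Git URL, the path, and the namespaces below are all assumptions for illustration. The sketch builds the manifest as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts.

```python
import json


def agent_applicationset(agent, repo_url, path, namespace):
    """Return an ApplicationSet manifest that creates one Application per cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        # Assumption: Argo CD runs in the "argocd" namespace.
        "metadata": {"name": agent, "namespace": "argocd"},
        "spec": {
            # Cluster generator: yields {{name}} and {{server}} for every
            # cluster registered in Argo CD.
            "generators": [{"clusters": {}}],
            "template": {
                "metadata": {"name": "{{name}}-" + agent},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": repo_url,
                        "targetRevision": "HEAD",
                        "path": path,
                    },
                    "destination": {
                        "server": "{{server}}",
                        "namespace": namespace,
                    },
                },
            },
        },
    }


# Hypothetical values: a Promtail agent kept in an internal Git repository.
manifest = agent_applicationset(
    agent="promtail",
    repo_url="https://git.example.internal/sre/agents.git",
    path="promtail",
    namespace="observability",
)
print(json.dumps(manifest, indent=2))
```

With this shape, registering a new cluster in Argo CD is enough for the agent to be rolled out there, which is the property that matters when there are more than 70 clusters.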

Slide 27

Slide 27 text

Current Practices: Telemetry Data
- Logs: A record of an event that happened within an application
- Metrics: Numerical representations of data that determine a service or component’s overall behavior over time
- Traces: The entire path of a request as it moves through all the nodes of a distributed system
[Diagram: links between Metrics, Traces, and Logs: Exemplars, Split view with labels, Metric queries, Span metrics processor, Trace to logs, Followed Trace ID]

Slide 28

Slide 28 text

Challenges of Adoption

Slide 29

Slide 29 text

Challenges of Adoption
Challenges of Adopting from Application Team
- Introducing new languages (LogQL, PromQL)
- Instrumenting for metrics & tracing may require retrofitting existing infrastructure

Slide 30

Slide 30 text

How to deal with these challenges Internal Hands-on Workshops for Teams

Slide 31

Slide 31 text

How to deal with these challenges
Built-in Default Dashboards on Grafana
- Blackbox/Uptime Dashboards
- Kubernetes Events/Deployment/Jobs
- Traefik Latency/QPS Dashboards

Slide 32

Slide 32 text

How to deal with these challenges
Provide a managed ingress controller that already exports metrics & traces
Out-of-the-box features
- Built-in TLS certificates & common Traefik middlewares
- Integrated structured logs, tracing agent, and exposed metrics

Slide 33

Slide 33 text

Sharing Use Cases

Slide 34

Slide 34 text

Use Case #1
Interactive Observation with Different Indicators
- Alerts: Receive alerts from Slack
- Metrics: Check the metrics time range to locate logs
- Logs: Inspect logs
- Log Metrics Query: Use LogQL to calculate the average elapsed time

Slide 35

Slide 35 text

Alerts to Metrics

Slide 36

Slide 36 text

Locate Logs

Slide 37

Slide 37 text

Metric Queries LogQL to Calculate the Average Elapsed Time - Elapsed time peaks in the specific pod - Check the last request before restart
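To make the calculation on this slide concrete, here is a small sketch of what a LogQL metric query over an unwrapped field computes: the mean of a numeric log field, grouped per pod. The log format, the `elapsed_ms` field, and the `namespace`/`pod` labels are assumptions; the actual query used in the talk is not shown in the slide text.

```python
import json
from collections import defaultdict

# A LogQL query of roughly this shape would do the same inside Loki.
# Label and field names are assumptions, not taken from the talk.
LOGQL = 'avg_over_time({namespace="demo"} | json | unwrap elapsed_ms [5m]) by (pod)'


def average_elapsed_by_pod(entries):
    """entries: iterable of (pod, json_log_line). Returns {pod: mean elapsed_ms}."""
    totals = defaultdict(lambda: [0.0, 0])  # pod -> [sum, count]
    for pod, line in entries:
        try:
            elapsed = float(json.loads(line)["elapsed_ms"])
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that are not JSON or lack a usable value
        totals[pod][0] += elapsed
        totals[pod][1] += 1
    return {pod: total / count for pod, (total, count) in totals.items()}


# Made-up sample lines for two pods.
entries = [
    ("pod-a", '{"msg": "done", "elapsed_ms": 100}'),
    ("pod-a", '{"msg": "done", "elapsed_ms": 300}'),
    ("pod-b", '{"msg": "done", "elapsed_ms": 50}'),
    ("pod-b", "not json"),
]
result = average_elapsed_by_pod(entries)
print(result)  # {'pod-a': 200.0, 'pod-b': 50.0}
```

Grouping by pod is what lets a single slow pod stand out, as in the "elapsed time peaks in the specific pod" observation.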

Slide 38

Slide 38 text

Use Case #2
Interactive Observation with Different Indicators
- Metrics: Find the 500 errors from metrics
- Logs: Link to the corresponding access logs
- Traces: Open the trace viewer for a request
- Logs: Jump to the application logs in a span of the trace

Slide 39

Slide 39 text

Increased error rate of HTTP requests Error in metrics
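As a rough illustration of the signal behind this slide, the sketch below computes the share of 5xx responses from two snapshots of a per-status-code request counter. The talk does not show the query it used; with the managed Traefik ingress, a PromQL expression of the shape in the comment is one plausible equivalent, and the metric and label names there are an assumption.

```python
# One plausible PromQL equivalent (metric and label names are assumptions):
#   sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
#     / sum(rate(traefik_service_requests_total[5m]))


def error_ratio(start, end):
    """start/end: {status_code: counter_value} snapshots of a request counter.

    Returns the fraction of requests in the window that ended in a 5xx code.
    """
    total = 0.0
    errors = 0.0
    for code, end_value in end.items():
        delta = end_value - start.get(code, 0.0)
        if delta < 0:
            # Counter reset (e.g. pod restart): count from zero.
            delta = end_value
        total += delta
        if code.startswith("5"):
            errors += delta
    return errors / total if total else 0.0


# Made-up counter values five minutes apart.
before = {"200": 1000, "500": 10}
after = {"200": 1900, "500": 110}
ratio = error_ratio(before, after)
print(ratio)  # 0.1
```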

Slide 40

Slide 40 text

Metrics to Logs Sync All Views to This Time Range

Slide 41

Slide 41 text

Filter logs by LogQL Locate Log

Slide 42

Slide 42 text

Logs to Traces Find the Trace from Logs
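Jumping from a log line to its trace depends on the trace ID being present in the log text. In Grafana this is typically configured as a derived field on the Loki data source, which applies a regular expression with one capture group to each line. The sketch below shows the same idea in Python; the `trace_id=` log format is hypothetical, since the talk's log format is not given.

```python
import re

# Assumed log format: key=value pairs with a hex trace ID of 16 to 32 characters.
TRACE_ID_RE = re.compile(r"trace_id=([0-9a-f]{16,32})")


def extract_trace_id(line):
    """Return the trace ID found in a log line, or None if there is none."""
    match = TRACE_ID_RE.search(line)
    return match.group(1) if match else None


# Made-up access log line.
line = 'level=error msg="upstream failed" status=500 trace_id=4bf92f3577b34da6a3ce929d0e0e4736'
trace_id = extract_trace_id(line)
print(trace_id)
```

The extracted value is what the trace viewer is then opened with, so application and access logs need to emit the ID consistently.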

Slide 43

Slide 43 text

Traces to Logs Checking the stack trace of error

Slide 44

Slide 44 text

Roadmap
- Logs: Grafana Loki with Promtail
- Metrics: Prometheus with Thanos
- Traces: Grafana Tempo with OpenTelemetry Collector
- Metrics to Traces: Prometheus with Exemplars
- Traces to Metrics: OpenTelemetry Collector with Span Metrics Processor
Next Steps
- Error Tracking: Sentry
- Web Vitals

Slide 45

Slide 45 text

Summary
- The observability platform brings consolidated dashboards and pooling of system knowledge across the organization
- Current observability practices help teams reach actionable insights from data faster
- Traditional monitoring practices are not enough for teams debugging a distributed system
Key Takeaways
- Adoption challenges, experience sharing, and use case sharing

Slide 46

Slide 46 text

Thank you