Implementing Observability Practices on Kubernetes

Che-Wei Lin (johnlin)
LINE Taiwan / SRE / Site Reliability Engineer
Jun-Wei Wan
LINE Taiwan / SRE / Site Reliability Engineer



November 11, 2021

LINE DEVDAY 2021


  1. (Title slide)
  2. Speakers - Che-Wei Lin, a.k.a. johnlin - Site Reliability Engineer @ LINE Taiwan - SRE

    Team - Loves Linux networking, distributed systems & open source - Jun-Wei Wan - Site Reliability Engineer @ LINE Taiwan - SRE Team - A Gopher, interested in Kubernetes and serverless; sometimes contributes to open-source projects
  3. Agenda - Traditional Monitoring Practices & Background - Overview of

    Observability Platform - Adoption Status & Current Practices - Challenges of Adoption - Sharing Use Cases
  4. Traditional Monitoring Practices Background introduction

  5. Previous (K8s) Adoption Status (October 2020) - Projects: 20+

    - App Configs: 130+ - K8s Clusters: 50+
  6. Current (K8s) Adoption Status (October 2021) - Projects: 20 → 40 (+100%)

    - K8s Clusters: 50 → 70 (+40%) - App Configs: 130 → 650+ (+500%)
  7. Background Verda Infrastructure - Verda is LINE's private cloud service

    - VKS: Verda Kubernetes Service - VES: Verda Elasticsearch Service - VOS: Verda Object Storage - Most application teams host their applications on the in-house VKS service
  8. Traditional Monitoring Practices Built-in LogSender & Prometheus Are Not Enough - Logs: collect container

    logs with Fluentd and send them to Verda Elasticsearch Service (VES) - Metrics: Prometheus monitors at the cluster level - Traces: few teams adopt Zipkin for distributed request tracing For self-hosted Prometheus & Grafana - Prometheus long-term storage problem (e.g. 6 months of metrics) - Each team maintaining its own Grafana dashboards causes redundant work For a self-hosted ELK stack - Needs an ELK expert to manage performance and indices
  9. Overview of Observability Platform

  10. Observability Platform Goals of the Observability Platform - Use the same

    data sources and dashboards to discuss the behavior of events - Importing indicators into an application follows the same process everywhere, so experience can be shared - Reduce communication costs caused by differences in tools - Pool system knowledge across the organization Everyone Uses the Same Tools & Dashboards - A managed single Grafana instance with multiple organizations for all projects - Provides integrated logs, metrics, and tracing data sources - Built-in commonly used dashboards
  11. Multi-tenancy Project-based Permission Control - Grafana grants permissions by mail group

    - Users see only the indicators of their own project (Project A users see Project A's indicators; likewise for Project B)
  12. Architecture Overview

  13. Grafana Loki - Promtail runs in each project's VKS cluster and writes

    logs to Loki in the observability cluster, tagged with a tenant ID (Tenant ID: A for Project A, Tenant ID: B for Project B) - Loki stores each tenant's logs in a per-tenant bucket (Bucket/{Tenant ID}) on Verda Object Storage - Grafana reads the logs back through LogQL queries scoped by tenant ID
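Per-tenant log shipping like this is typically wired up in Promtail's client configuration; a minimal sketch, where the Loki URL and tenant name are illustrative rather than taken from the deck:

```yaml
# Promtail client section: ship logs to Loki tagged with a tenant ID.
# URL and tenant_id are hypothetical examples.
clients:
  - url: http://loki.observability.example/loki/api/v1/push
    tenant_id: project-a   # sent as the X-Scope-OrgID header on each push
```

Loki then keys storage (Bucket/{Tenant ID}) and queries on that tenant ID, which is what scopes each project's LogQL reads.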
  14. Prometheus & Thanos - Each project's VKS cluster runs Prometheus with a Thanos

    sidecar, uploading blocks to its own bucket (Bucket A, Bucket B) on Verda Object Storage - Per-project Thanos Query instances (Thanos Query A, Thanos Query B) in the observability cluster read from Thanos Store gateways over those buckets - Grafana issues PromQL queries through a top-level Thanos Query that fans out to the per-project queriers
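The per-project buckets are declared in the Thanos object storage configuration (`objstore.yml`) mounted into the sidecar and store gateway; a sketch assuming S3-compatible access to Verda Object Storage, with the bucket name, endpoint, and credentials as placeholders:

```yaml
# objstore.yml for one project's Thanos sidecar / store gateway.
# Bucket name, endpoint, and credentials are placeholders.
type: S3
config:
  bucket: thanos-project-a          # e.g. "Bucket A" in the diagram
  endpoint: objectstorage.example.com
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>
```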
  15. Grafana Tempo & OpenTelemetry Collector - Applications instrumented with the OpenTelemetry API/SDK/exporters run as pods

    with a Jaeger sidecar in each project's VKS cluster - A per-project OpenTelemetry Collector (A, B) in the observability cluster receives the spans and writes them to Kafka - Tempo reads from Kafka and stores traces in Bucket/{Tenant ID} on Verda Object Storage - Grafana reads the traces back from Tempo
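A minimal OpenTelemetry Collector pipeline matching this flow might receive spans from the Jaeger sidecars and export them to Kafka for Tempo to consume; the broker address and topic name below are illustrative, not from the deck:

```yaml
# OpenTelemetry Collector: Jaeger-format spans in, Kafka out.
# Broker address and topic are hypothetical.
receivers:
  jaeger:
    protocols:
      thrift_compact: {}   # UDP 6831, the Jaeger agent/sidecar default
exporters:
  kafka:
    brokers: ["kafka.observability.example:9092"]
    topic: otlp_spans
    encoding: otlp_proto
service:
  pipelines:
    traces:
      receivers: [jaeger]
      exporters: [kafka]
```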
  16. Storage Usage Telemetry Data - Logs: 20M objects, 30 TB

    - Metrics: 360K objects, 60 TB - Traces: 1M objects, 20 TB
  17. Managed Ingress Controller Traefik as a Kubernetes Ingress Provider - Automatically installed

    in the clusters of each project - Part of the observability platform: an integrated sidecar collects traces, and metrics are exposed out of the box
  18. Managed Ingress Controller - Verda LB routes network traffic to the Traefik ingress

    controller in each project's VKS cluster - A Jaeger sidecar next to each Traefik instance writes traces to the project's OpenTelemetry Collector (A, B) in the observability cluster
  19. Adoption Status & Current Practices

  20. Adoption Status - Projects that have adopted the observability platform include LINE SPOT and LINE Shopping

  21. Adoption Status Statistics of Grafana Users - Organizations: 34 - Active users

    (30d): 120+ - Users: 200+
  22. Adoption Status Statistics of Grafana Across All Organizations - Dashboards: 750+

    - Alerts: 300 - Data Sources: 200
  23. Adoption Status Statistics of Data Sources (bar chart, 0–40 scale):

    Prometheus, Grafana Loki, Grafana Tempo, InfluxDB, Elasticsearch, MySQL
  24. Adoption Status Statistics of Telemetry Data - Log Lines: 300M / h,

    40B+ / week - Metric Series: 16M - Trace Spans: 200M / h, 20B+ / week
  25. How to manage agents for more than 70 clusters?

  26. Current Practices: Argo CD ApplicationSet - Manage multiple Argo CD Applications from one definition:

    Argo CD reads manifests from Git and deploys them to Cluster A, Cluster B, and Cluster C
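An ApplicationSet with the cluster generator stamps out one Argo CD Application per registered cluster, which is how a single definition can cover 70+ clusters; the repo URL, path, and namespace below are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: observability-agents
spec:
  generators:
    - clusters: {}   # one Application per cluster registered in Argo CD
  template:
    metadata:
      name: '{{name}}-observability-agents'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/sre/observability-agents.git  # hypothetical
        targetRevision: main
        path: manifests
      destination:
        server: '{{server}}'     # filled in per target cluster
        namespace: observability
      syncPolicy:
        automated:
          prune: true
```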
  27. Current Practices: Telemetry Data - Logs: a record of an

    event that happened within an application - Metrics: a numerical representation of data, used to determine a service's or component's overall behavior over time - Traces: the entire path of a request as it moves through all the nodes of a distributed system - Correlating the three: metrics link to traces via exemplars, traces link to logs by following the trace ID, logs split-view with labels into metric queries, and the span metrics processor derives metrics from traces
  28. Challenges of Adoption

  29. Challenges of Adoption From the Application Teams' Perspective - Introducing

    new query languages (LogQL, PromQL) - Instrumenting for metrics & tracing may require retrofitting existing infrastructure
  30. How to deal with these challenges - Internal hands-on workshops

  31. How to deal with these challenges Built-in Default Dashboards on

    Grafana - Blackbox/Uptime Dashboards - Kubernetes Events/Deployment/Jobs - Traefik Latency/QPS Dashboards
  32. How to deal with these challenges Provide a managed ingress controller

    that already exports metrics & traces - Out-of-the-box features: built-in TLS certificates & common Traefik middlewares - Integrated structured logs, a tracing agent, and exposed metrics
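In Traefik v2, exporting metrics and traces is a matter of static configuration; a sketch of the relevant sections, assuming a Jaeger agent sidecar listening on the default UDP port:

```yaml
# Traefik v2 static configuration: Prometheus metrics + Jaeger tracing.
metrics:
  prometheus:
    entryPoint: metrics        # scrape endpoint exposed on a dedicated entry point
tracing:
  jaeger:
    samplingType: const
    samplingParam: 1.0         # sample every request (illustrative choice)
    localAgentHostPort: "localhost:6831"   # the Jaeger sidecar from the diagram
```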
  33. Sharing Use Cases

  34. Use Case #1 Interactive Observation with Different Indicators - Receive alerts

    in Slack - Check the metrics time range to locate logs - Inspect the logs - Use LogQL to calculate the average elapsed time
  35. Alerts to Metrics

  36. Locate Logs

  37. Metric Queries LogQL to Calculate the Average Elapsed Time -

    Elapsed time peaks in a specific pod - Check the last request before the restart
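A LogQL query of this shape can compute the average elapsed time per pod; the selector labels and the `elapsed_ms` field are hypothetical, standing in for whatever the application's structured logs actually emit:

```logql
# average elapsed time per pod over 5-minute windows
avg_over_time(
  {namespace="project-a", app="checkout"} | json | unwrap elapsed_ms [5m]
) by (pod)
```

The `| json` stage parses each structured log line, and `unwrap` turns the extracted field into a sample value the range aggregation can average.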
  38. Use Case #2 Interactive Observation with Different Indicators - Find

    500 errors in the metrics - Link to the corresponding access logs - Open the trace viewer for a request - Jump to the application logs from a span of the trace
  39. Increased error rate of HTTP requests - Error visible in metrics
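The spike in this use case can be expressed as a 5xx error-rate ratio; a sketch assuming Traefik's exported request counter (`traefik_service_requests_total` is Traefik v2's default metric name, but the label set depends on the setup):

```promql
# share of HTTP requests answered with a 5xx status, over 5-minute windows
sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
  /
sum(rate(traefik_service_requests_total[5m]))
```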

  40. Metrics to Logs Sync All Views to This Time Range

  41. Filter logs by LogQL Locate Log

  42. Logs to Traces Find the Trace from Logs

  43. Traces to Logs - Checking the stack trace of the error

  44. Roadmap Next Steps - Logs: Grafana Loki with Promtail

    - Metrics: Prometheus with Thanos - Traces: Grafana Tempo with OpenTelemetry Collector - Metrics to Traces: Prometheus with exemplars - Traces to Metrics: OpenTelemetry Collector with the span metrics processor - Error Tracking: Sentry, Web Vitals
  45. Summary Key Takeaways - Traditional monitoring practices are not enough

    for teams debugging a distributed system - The observability platform brings consolidated dashboards and pools system knowledge across the organization - Current observability practices help teams reach actionable insights from their data faster - Adoption challenges, experience, and use cases were shared
  46. Thank you