
LINE TECHPULSE 2022 - Implementing Observability Practices on Kubernetes


Implementing Observability Practices on Kubernetes, by Che-Wei Lin and Wan Jun-Wei (SRE @ LINE), LINE TECHPULSE 2022, https://techpulse.line.me/


LINE Developers Taiwan

January 21, 2022



  1. Che-Wei Lin / Wan Jun-Wei / SRE: Implementing Observability Practices on Kubernetes
  2. Speakers
     › Che-Wei Lin, a.k.a. johnlin: Site Reliability Engineer @ LINE Taiwan, SRE Team. Loves Linux networking, distributed systems & open source
     › Wei: Site Reliability Engineer @ LINE Taiwan, SRE Team. A Gopher interested in Kubernetes and Serverless; sometimes contributes to open source projects
  3. Agenda
     › Traditional Monitoring Practices & Background
     › Overview of Observability Platform
     › Adoption Status & Current Practices
     › Challenges of Adoption
     › Sharing Use Cases
  4. Traditional Monitoring Practices

  5. Previous (K8s) Adoption Status (October 2020)
     › Projects: 20+
     › App Configs: 130+
     › K8s Clusters: 50+
  6. Current (K8s) Adoption Status (October 2021)
     › Projects: +100% (20 → 40)
     › K8s Clusters: +40% (50 → 70)
     › App Configs: +500% (130 → 650+)
  7. Background: Verda Infrastructure
     › Verda is LINE's private cloud service
     › VKS: Verda Kubernetes Service
     › VES: Verda Elasticsearch Service
     › VOS: Verda Object Storage
     › Most application teams host their applications on the in-house VKS
  8. Traditional Monitoring Practices
     Built-in LogSender & Prometheus are not enough
     › Logs: collect container logs with Fluentd and send them to Verda Elasticsearch Service (VES)
     › Metrics: Prometheus monitors at the cluster level
     › Traces: few teams adopt Zipkin for distributed request tracing
     For self-hosted Prometheus & Grafana
     › Prometheus long-term storage problem (e.g. keeping 6 months of metrics)
     › Each team maintains its own Grafana dashboards, causing redundant work
     For a self-hosted ELK stack
     › Needs an ELK-stack expert to manage performance and indices
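The built-in log pipeline described above, Fluentd tailing container logs and shipping them to VES, can be sketched roughly as a minimal Fluentd configuration; the paths, tag, host name, and plugin options here are illustrative assumptions, not the team's actual setup:

```
<source>
  @type tail                            # tail container log files on the node
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  <parse>
    @type json
  </parse>
</source>

<match kubernetes.**>
  @type elasticsearch                   # fluent-plugin-elasticsearch
  host ves.example.internal             # assumed VES endpoint
  port 9200
  logstash_format true                  # time-based indices, e.g. logstash-2022.01.21
</match>
```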
  9. Overview of Observability Platform

  10. Observability Platform
      Goals of the Observability Platform
      › Use the same data sources and dashboards to discuss the behavior of events
      › Importing indicators into an application follows the same process everywhere, so experience can be shared
      › Reduce communication costs caused by differences in tooling
      Everyone uses the same tools & dashboards: pooling of system knowledge for the organization
      › Provides integrated logs, metrics, and tracing data sources
      › Built-in commonly used dashboards
      Reduce time spent repeatedly deploying monitoring facilities
      › Managed single Grafana instance
      › Multi-tenant for organizations and projects
  11. Architecture Overview

  12. Managed Ingress Controller

  13. Grafana Tempo & OpenTelemetry Collector

  14. Prometheus & Thanos

  15. Grafana Loki

  16. Adoption Status & Current Practices

  17. Adoption Status: projects adopting the Observability Platform include LINE HUB and LINE TRAVEL

  18. Adoption Status: Statistics of Grafana Users
      › Organizations: 34
      › Users: 200+
      › Active users (30d): 120+
  19. Adoption Status: Statistics of Grafana Across All Organizations
      › Dashboards: 750+
      › Alerts: 300
      › Data Sources: 200
  20. Adoption Status: Statistics of Data Sources
      (bar chart of data-source counts: Prometheus, Grafana Loki, Grafana Tempo, InfluxDB, Elasticsearch, MySQL; scale 0 to 40)
  21. Adoption Status: Statistics of Telemetry Data (hourly / weekly)
      › Log lines: 300M / h, 40B+ / week
      › Metric series: 16M
      › Trace spans: 200M / h, 20B+ / week
  22. Adoption Status: Storage Usage
      › Logs: 30 TB (20M objects)
      › Metrics: 60 TB (360K objects)
      › Traces: 20 TB (1M objects)
  23. How to manage agents for more than 70 clusters?

  24. Current Practices: Argo CD ApplicationSet
      Manage multiple Argo CD Applications
      (diagram: Argo CD syncs manifests from Git to Cluster A, Cluster B, and Cluster C)
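One way to fan the monitoring agents out to all 70+ clusters from a single definition is Argo CD's ApplicationSet cluster generator, sketched below; the repo URL, names, and namespaces are assumptions for illustration, not the team's actual values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: observability-agents
  namespace: argocd
spec:
  generators:
    - clusters: {}                # one Application per cluster registered in Argo CD
  template:
    metadata:
      name: 'agents-{{name}}'     # e.g. agents-cluster-a
    spec:
      project: default
      source:
        repoURL: https://example.com/sre/observability-agents.git  # assumed repo
        targetRevision: main
        path: manifests
      destination:
        server: '{{server}}'      # API server URL of the generated cluster
        namespace: monitoring
      syncPolicy:
        automated:
          prune: true             # keep deployed agents in sync with Git
```

Adding a cluster to Argo CD is then enough to get the full agent stack deployed there automatically.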
  25. Current Practices: Telemetry Data
      Logs, metrics & traces
      › Logs: a record of an event that happened within an application
      › Metrics: a numerical representation of data, used to determine a service or component's overall behavior over time
      › Traces: the entire path of a request as it moves through all the nodes of a distributed system
      (diagram: correlations between the three signals: exemplars, split view with labels, metric queries, span metrics processor, trace to logs, followed trace ID)
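As an example of the exemplars correlation above: a latency quantile over a histogram can carry exemplars whose trace IDs Grafana links straight from the metrics panel to the trace viewer. The metric name below is an assumed common convention, not necessarily the one used on the platform:

```promql
# p99 request latency; exemplars attached to the buckets hold trace IDs
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
```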
  26. Challenges of Adoption

  27. Challenges of Adoption
      Challenges for application teams adopting the platform
      › Introducing new query languages (LogQL, PromQL)
      › Instrumenting for metrics & tracing may require retrofitting existing infrastructure
  28. How to deal with these challenges: internal hands-on workshops

  29. How to deal with these challenges: built-in default dashboards on Grafana
      › Blackbox/Uptime dashboards
      › Kubernetes Events/Deployments/Jobs
      › Traefik Latency/QPS dashboards
  30. How to deal with these challenges: provide a managed ingress controller that already exports metrics & traces
      Out-of-the-box features
      › Built-in TLS certificates & common Traefik middlewares
      › Integrated structured logs, tracing agent, and exposed metrics
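A managed Traefik instance with metrics, tracing, and structured access logs enabled might use a static configuration along these lines; the collector endpoint and sampling values are assumptions, and span export here goes through the OpenTelemetry Collector's Jaeger-compatible receiver:

```yaml
metrics:
  prometheus:
    addEntryPointsLabels: true    # per-entrypoint request metrics
    addServicesLabels: true       # per-service request metrics
tracing:
  jaeger:
    samplingType: const
    samplingParam: 1.0            # sample every request (tune down in production)
    collector:
      endpoint: http://otel-collector.monitoring:14268/api/traces  # assumed endpoint
accessLog:
  format: json                    # structured access logs, ready for Loki
```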
  31. Sharing Use Cases

  32. Use Case: Interactive Observation with Different Indicators
      › Metrics: find a 500 error in the metrics
      › Logs: link to the corresponding access logs
      › Traces: open the trace viewer for a request
      › Logs: jump to the application logs within a span of the trace
  33. Increased error rate of HTTP requests (the error appears in metrics)
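The kind of query behind an error-rate panel like this one; the metric and label names are assumed conventions rather than the platform's actual series:

```promql
# ratio of 5xx responses to all responses over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```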

  34. Metrics to Logs: sync all views to this time range

  35. Filter logs with LogQL to locate the log line
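A LogQL filter of the kind used to locate the failing requests; the label selector and JSON field name are assumptions for illustration:

```logql
# narrow a stream to JSON access logs with status 500
{namespace="line-hub", app="gateway"} | json | status = 500
```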

  36. Logs to Traces: find the trace from the logs

  37. Traces to Logs: checking the stack trace of the error

  38. Roadmap: Next Steps
      › Error Tracking: Sentry, Web Vitals
      › Metrics to Traces: Prometheus Exemplars
      › Logs: Grafana Loki, Promtail
      › Traces to Metrics: OpenTelemetry Collector with Span Metrics Processor
      › Traces: Grafana Tempo, OpenTelemetry Collector
      › Metrics: Prometheus
  39. Summary: Key Takeaways
      › The observability platform brings consolidated dashboards and pooling of system knowledge across the organization
      › Current observability practices help teams reach actionable insights from their data faster
      › Traditional monitoring practices are not enough for teams debugging a distributed system
      › Adoption comes with challenges; sharing experience and use cases helps overcome them
  40. Thank you