Implementing Observability Practices on Kubernetes

Che-Wei Lin (johnlin)
LINE Taiwan / SRE / Site Reliability Engineer
Jun-Wei Wan
LINE Taiwan / SRE / Site Reliability Engineer

https://linedevday.linecorp.com/2021/ja/sessions/25
https://linedevday.linecorp.com/2021/en/sessions/25
https://linedevday.linecorp.com/2021/ko/sessions/25

LINE DEVDAY 2021

November 11, 2021

Transcript

  1. Speakers
    Che Wei Lin a.k.a. johnlin
    - Site Reliability Engineer @ LINE Taiwan
    - SRE Team
    - Loves Linux Networking, Distributed Systems & Open Source
    Wei
    - Site Reliability Engineer @ LINE Taiwan
    - SRE Team
    - A Gopher, interested in Kubernetes and Serverless
    - Sometimes contributes to open source projects

  2. Agenda
    - Traditional Monitoring Practices & Background
    - Overview of Observability Platform
    - Adoption Status & Current Practices
    - Challenges of Adoption
    - Sharing Use Cases

  3. Traditional Monitoring Practices
    Background introduction

  4. Previous (K8s) Adoption Status
    October 2020
    - Projects: 20+
    - App Configs: 130+
    - K8s Clusters: 50+

  5. Current (K8s) Adoption Status
    October 2021
    - Projects: 20 → 40 (+100%)
    - K8s Clusters: 50 → 70 (+40%)
    - App Configs: 130 → 650+ (+500%)

  6. Background
    Verda Infrastructure
    - Verda is a private cloud service for LINE
    - VKS: Verda Kubernetes Service
    - VES: Verda Elasticsearch Service
    - VOS: Verda Object Storage
    - Most application teams host their applications on the in-house VKS service

  7. Traditional Monitoring Practice
    Traditional Approaches
    For Self-hosted Prometheus and Grafana
    - Prometheus long-term storage is a problem (e.g., keeping 6 months of metrics)
    - Each team may maintain its own Grafana dashboards, which causes redundant work
    For Self-hosted ELK Stack
    - Needs an ELK stack expert to manage performance and indices
    Built-in LogSender & Prometheus Are Not Enough
    - Logs: Container logs are collected with Fluentd and sent to Verda Elasticsearch Service (VES)
    - Metrics: Prometheus monitors at the cluster level
    - Traces: A few teams have adopted Zipkin for distributed request tracing

  8. Overview of Observability Platform

  9. Observability Platform
    Goal of Observability Platform
    Pooling of System Knowledge for the Organization
    - Use the same data sources and dashboards to discuss the behavior of events
    - The process of instrumenting applications with indicators is the same, so experience can be shared
    - Reduce communication costs caused by differences in tools
    Everyone Uses the Same Tools & Dashboards
    - A single managed Grafana instance with multiple organizations for all projects
    - Provides integrated logs, metrics, and tracing data sources
    - Built-in commonly used dashboards

  10. Multi-tenancy
    Project-based Permission Control
    [Diagram: Users; Grafana grants permissions by mail group; Indicator of Project A; Indicator of Project B]

  11. Architecture Overview

  12. Grafana Loki
    [Diagram: In the Project A and Project B VKS clusters, Promtail collects workload logs with its own tenant ID ({Tenant ID: A}, {Tenant ID: B}) and writes them over the Verda Network to Loki in the Observability Cluster. Loki writes to Verda Object Storage under Bucket / {Tenant ID}. Grafana reads from Loki with LogQL {Tenant ID}.]
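
    The tenant separation on this slide can be sketched as follows. In multi-tenant mode, Loki identifies the tenant by the `X-Scope-OrgID` HTTP header on its push endpoint (`/loki/api/v1/push`). The host name, tenant IDs and labels below are hypothetical, and on the platform it is Promtail that sends these requests, not hand-written code. This is only an illustration of what a tenant-scoped write looks like.

```python
import json

LOKI_PUSH_PATH = "/loki/api/v1/push"  # Loki's HTTP push endpoint


def build_loki_push(base_url, tenant_id, labels, line, ts_ns):
    """Return (url, headers, body) for one log line pushed as the given tenant."""
    headers = {
        "Content-Type": "application/json",
        # Loki reads the tenant from this header when multi-tenancy is enabled.
        "X-Scope-OrgID": tenant_id,
    }
    payload = {
        "streams": [
            {
                "stream": labels,                # label set identifying the stream
                "values": [[str(ts_ns), line]],  # [timestamp in ns as a string, log line]
            }
        ]
    }
    return base_url + LOKI_PUSH_PATH, headers, json.dumps(payload).encode("utf-8")


# Hypothetical endpoint and tenants, mirroring "Promtail {Tenant ID: A}" and "{Tenant ID: B}".
url, headers_a, body_a = build_loki_push(
    "http://loki.example.internal:3100", "project-a",
    {"app": "checkout", "cluster": "project-a-vks"},
    "order created", 1636600000000000000)
_, headers_b, _ = build_loki_push(
    "http://loki.example.internal:3100", "project-b",
    {"app": "search", "cluster": "project-b-vks"},
    "query served", 1636600000000000000)
```

    Both writes go to the same Loki endpoint; only the header differs, which is what lets one Loki deployment keep each project's logs apart.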

  13. Prometheus & Thanos
    [Diagram: The Project A and Project B VKS clusters each run Prometheus + Thanos with a Thanos Query. The Observability Cluster runs Grafana, Thanos Query A, Thanos Query B and Thanos Store, backed by Bucket A and Bucket B in Verda Object Storage. Grafana sends PromQL to Thanos Query; the read paths cross the Verda Network to the project clusters and go through Thanos Store to the buckets.]
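
    As a rough illustration of the read path, Thanos Query exposes the Prometheus-compatible HTTP API, so a client such as Grafana can send PromQL to `/api/v1/query`. The sketch below only builds such a request URL; the host name and the PromQL expression are hypothetical, and `dedup` is the Thanos query parameter that controls replica deduplication.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def thanos_instant_query_url(base_url, promql, dedup=True):
    """Build an instant-query URL for a Thanos Query endpoint."""
    params = {
        "query": promql,                        # the PromQL expression
        "dedup": "true" if dedup else "false",  # merge series from replicated Prometheus instances
    }
    return f"{base_url}/api/v1/query?{urlencode(params)}"


# Hypothetical Thanos Query address for project A.
url = thanos_instant_query_url(
    "http://thanos-query-a.example.internal:9090",
    'sum(up{job="kubelet"}) by (cluster)',
)
```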

  14. Grafana Tempo & OpenTelemetry Collector
    [Diagram: In the Project A and Project B VKS clusters, pods with a Jaeger sidecar and API/SDK/Exporters send traces to an OpenTelemetry Collector in the cluster. The Observability Cluster runs OpenTelemetry Collector A, OpenTelemetry Collector B, Kafka, Tempo and Grafana, with Verda Object Storage organized as Bucket / {Tenant ID}. Write paths cross the Verda Network from the project clusters into the Observability Cluster; Grafana reads traces through Tempo.]

  15. Storage Usage
    Telemetry Data
    - Logs: 30 TB, 20M objects
    - Metrics: 60 TB, 360K objects
    - Traces: 20 TB, 1M objects

  16. Managed Ingress Controller
    Traefik Kubernetes Ingress Provider
    - Traefik as a Kubernetes ingress provider
    - Automatically installed in the clusters of each project
    - Part of the observability platform: integrated sidecar for collecting traces, plus exposed metrics

  17. Managed Ingress Controller
    [Diagram: In the Project A and Project B VKS clusters, network traffic enters through a Verda LB to the Traefik ingress controller, which runs with a Jaeger sidecar. An OpenTelemetry Collector in each cluster writes over the Verda Network to OpenTelemetry Collector A or OpenTelemetry Collector B in the Observability Cluster.]

  18. Adoption Status & Current Practices

  19. Adoption Status
    Projects That Have Adopted the Observability Platform
    - LINE SPOT
    - LINE Shopping
    - LINE Travel
    - LINE HUB
    - LINE MUSIC TW
    - LINE TODAY
    - OAPLUS
    - Sticker TW

  20. Adoption Status
    Statistics of Grafana Users
    - Organizations: 34
    - Users: 200+
    - Active users (30d): 120+

  21. Adoption Status
    Statistics of Grafana Across All Organizations
    - Dashboards: 750+
    - Alerts: 300
    - Data Sources: 200

  22. Adoption Status
    Statistics of Data Sources
    [Bar chart: number of data sources by type, axis from 0 to 40: MySQL, Elasticsearch, InfluxDB, Grafana Tempo, Grafana Loki, Prometheus]

  23. Adoption Status
    Statistics of Telemetry Data
    - Log Lines: 300M / h, 40B+ / week
    - Metric Series: 16M
    - Trace Spans: 200M / h, 20B+ / week

  24. How to manage agents for more than 70 clusters

  25. Current Practices: Argo CD ApplicationSet
    Manage Multiple Argo CD Apps
    [Diagram: Git → Argo CD → Cluster A, Cluster B, Cluster C]
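
    One way to picture this is an ApplicationSet that templates one Argo CD Application per registered cluster. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The talk does not say which generator is used, so the cluster generator here is an assumption, as are the repository URL, path, namespace and the choice of Promtail as the agent.

```python
import json


def agent_applicationset(repo_url, path, namespace="observability"):
    """Build an ApplicationSet that deploys one agent Application per cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        "metadata": {"name": "promtail-agents"},
        "spec": {
            # Empty cluster generator: one parameter set per cluster known to Argo CD.
            "generators": [{"clusters": {}}],
            "template": {
                # {{name}} and {{server}} are filled in by the cluster generator.
                "metadata": {"name": "{{name}}-promtail"},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": repo_url,
                        "targetRevision": "HEAD",
                        "path": path,
                    },
                    "destination": {"server": "{{server}}", "namespace": namespace},
                    "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
                },
            },
        },
    }


# Hypothetical Git repository that holds the agent manifests.
manifest = agent_applicationset("https://git.example.internal/sre/agents.git", "promtail")
print(json.dumps(manifest, indent=2))
```

    With this shape, registering a new cluster in Argo CD is enough for the agent to be rolled out to it, with no per-cluster Application to write by hand.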

  26. Current Practices: Telemetry Data
    - Logs: A record of an event that happened within an application
    - Metrics: Numerical representation of data, used to determine a service or component's overall behavior over time
    - Traces: The entire path of a request as it moves through all the nodes of a distributed system
    [Diagram: links between Metrics, Traces and Logs: Exemplars; Split view with labels; Metric queries; Span metrics processor; Trace to logs; Followed Trace ID]
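
    Following a trace ID between logs and traces depends on the ID being present in each log line. Here is a minimal sketch of that idea, assuming JSON-structured logs with a hypothetical `trace_id` field; the regular expression plays the role that a log-to-trace link pattern would play in the log viewer, and is not taken from the talk.

```python
import json
import re

# Hypothetical pattern: a 32-hex-digit trace ID stored under "trace_id".
TRACE_ID_PATTERN = re.compile(r'"trace_id":\s*"([0-9a-f]{32})"')


def format_log(message, trace_id, **fields):
    """Serialize one structured log line that carries the current trace ID."""
    return json.dumps({"msg": message, "trace_id": trace_id, **fields})


def extract_trace_id(log_line):
    """Return the trace ID found in a log line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


line = format_log("payment failed", "4bf92f3577b34da6a3ce929d0e0e4736", status=500)
trace_id = extract_trace_id(line)
```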

  27. Challenges of Adoption

  28. Challenges of Adoption
    Adoption Challenges for Application Teams
    - Introducing new query languages (LogQL, PromQL)
    - Instrumenting for metrics & tracing may require retrofitting existing infrastructure

  29. How to deal with these challenges
    Internal Hands-on Workshops for Teams

  30. How to deal with these challenges
    Built-in Default Dashboards on Grafana
    - Blackbox/Uptime Dashboards
    - Kubernetes Events/Deployment/Jobs
    - Traefik Latency/QPS Dashboards

  31. How to deal with these challenges
    Provide a managed ingress controller that already exports metrics & traces
    Out-of-the-box Features
    - Built-in TLS certificates & common Traefik middlewares
    - Integrated structured logs, tracing agent and exposed metrics

  32. Sharing Use Cases

  33. Use Case #1
    Interactive Observation with Different Indicators
    - Alerts: Received alerts from Slack
    - Metrics: Check the metrics time range to locate logs
    - Logs: Inspect logs
    - Log Metrics Query: Use LogQL to calculate the average elapsed time

  34. Alerts to Metrics

  35. Metric Queries
    LogQL to Calculate the Average Elapsed Time
    - Elapsed time peaks in the specific pod
    - Check the last request before restart
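
    The slide does not show the query itself, so the following is only a sketch of what such a LogQL metric query could look like, together with a small local computation of the same per-pod average. The stream selector, the `pod` label and the `elapsed_ms` field are assumptions; `avg_over_time` with `unwrap` is LogQL's way of averaging a numeric value extracted from log lines.

```python
import json
from collections import defaultdict

# LogQL sketch: parse JSON logs, unwrap a numeric field, average it per pod.
LOGQL_AVG_ELAPSED = (
    'avg_over_time({namespace="shop", container="api"} '
    '| json | unwrap elapsed_ms [5m]) by (pod)'
)


def average_elapsed_by_pod(log_entries):
    """log_entries: iterable of (pod, json_log_line). Returns {pod: mean elapsed_ms}."""
    samples = defaultdict(list)
    for pod, line in log_entries:
        samples[pod].append(float(json.loads(line)["elapsed_ms"]))
    return {pod: sum(values) / len(values) for pod, values in samples.items()}


# Made-up sample lines: one pod is clearly slower than the other.
entries = [
    ("api-0", '{"path": "/cart", "elapsed_ms": 20}'),
    ("api-0", '{"path": "/cart", "elapsed_ms": 40}'),
    ("api-1", '{"path": "/cart", "elapsed_ms": 900}'),
]
averages = average_elapsed_by_pod(entries)
```

    Grouping by pod is what makes a peak in one specific pod stand out instead of being averaged away across the whole deployment.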

  36. Use Case #2
    Interactive Observation with Different Indicators
    - Metrics: Find out 500 errors from metrics
    - Logs: Link to the corresponding access logs
    - Traces: Open the trace viewer for a request
    - Logs: Jump to the application logs in a span of the trace

  37. Increased error rate of HTTP requests
    Error in metrics
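
    The slide does not show the query behind the panel. As an illustration, assuming the HTTP metrics come from the managed Traefik ingress controller, an error-rate expression could divide the rate of 5xx responses by the rate of all responses. `traefik_service_requests_total` is Traefik's request counter with a `code` label; the 5-minute window and the sample counts are made up.

```python
# PromQL sketch: share of 5xx responses among all requests over 5 minutes.
PROMQL_5XX_RATIO = (
    'sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) '
    '/ sum(rate(traefik_service_requests_total[5m]))'
)


def error_ratio(requests_by_code):
    """Compute the same ratio locally from a {status_code: request_count} mapping."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 0.0  # no traffic, so no errors to report
    errors = sum(n for code, n in requests_by_code.items() if code.startswith("5"))
    return errors / total


ratio = error_ratio({"200": 940, "404": 10, "500": 45, "503": 5})
```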

  38. Metrics to Logs
    Sync All Views to This Time Range

  39. Filter logs by LogQL
    Locate Log

  40. Logs to Traces
    Find the Trace from Logs

  41. Traces to Logs
    Checking the stack trace of error

  42. Roadmap
    Next Steps
    - Error Tracking: Sentry
    - Web Vitals
    - Metrics to Traces: Prometheus with Exemplars
    - Logs: Grafana Loki with Promtail
    - Traces to Metrics: OpenTelemetry Collector with Span Metrics Processor
    - Traces: Grafana Tempo with OpenTelemetry Collector
    - Metrics: Prometheus with Thanos

  43. Summary
    Key Takeaways
    - The observability platform brings consolidated dashboards and pooling of system knowledge across the organization
    - Current observability practices speed up how teams get actionable insights from data
    - Traditional monitoring practices are not enough for teams debugging in a distributed system
    - Adoption challenges, experience sharing and use case sharing
