Grafana Alloy Best Practice

Speaker: Eric Huang
Event: COSCUP 2024

LINE Developers Taiwan

August 04, 2024

Transcript

  1. Eric Huang
    LINE Taiwan / SRE
    • 2021: E-SUN bank
    • 2022: LINE Taiwan
    • Kubernetes, Rust, eBPF
    • titaneric / chen-yi-huang
  2. Real User Monitoring (RUM)
    • Collect metrics from the client side and browser
    • Monitor web application performance
    • Discover errors
    • Track user behavior (session)
    Source: Web Vitals, User-centric performance metrics, Grafana Faro OSS
  3. (Distributed) Tracing
    • Represent the full journey of a request through a distributed environment
    • Improve the visibility of the app
    • Diagnose the source of errors
    Source: Observability primer | OpenTelemetry
  4. Grafana Alloy
    “Alloy is a flexible, high performance, vendor-neutral distribution of the OpenTelemetry Collector”
    Key features:
    • Custom components
    • Chained components
    • Debugging utilities
    Adopt the faro.receiver component with the Faro SDK
    Source: Grafana Alloy | Grafana Alloy documentation
  5. (No text transcribed for this slide.)

  6. Grafana Faro Web SDK
    “Grafana Faro includes a highly configurable web SDK for real user monitoring that instruments browser frontend applications to capture observability signals.”
    Key features:
    • Monitoring application performance
    • Captures errors, logs, user activity
    • Instrument performance and observe full stack
    Source: Grafana Faro OSS | Web SDK for real user monitoring (RUM)
  7. (No text transcribed for this slide.)

  8. End-to-End Tracing
    Spans include:
    • frontend app (Next.js)
    • ingress controller (Traefik)
    • web framework (Flask)
    • HTTP client library (requests)
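To illustrate how one trace can span the frontend, ingress controller, web framework, and HTTP client, here is a minimal sketch of W3C Trace Context propagation. It is not the speaker's implementation: the function names and the hop list are hypothetical, and real deployments would rely on the OpenTelemetry instrumentation of each component. The point is only that every hop keeps the same trace ID and mints a new span ID.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace, as the frontend app would (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent):
    """Keep the trace ID and flags, but mint a new span ID for the next hop."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def simulate_hops(hops):
    """Return (hop, traceparent) pairs for a request crossing each hop in order."""
    header = new_traceparent()
    spans = []
    for hop in hops:
        spans.append((hop, header))
        header = child_traceparent(header)
    return spans


if __name__ == "__main__":
    for hop, header in simulate_hops(["nextjs", "traefik", "flask", "requests"]):
        print(f"{hop:10s} {header}")
```

Running the script prints four headers that share the middle trace ID field while the span ID field changes at every hop, which is what lets a tracing backend stitch the spans into one end-to-end trace.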
  9. Requirements
    Must have:
    • Adopt the present observability platform
    • Easy, automatic deployment of the Alloy service
    • Control the traffic load sent from real users
    Nice to have:
    • Easy application for new tenants
    • Slack workflow
    • Sample code for SSR and CSR apps
    • Next.js-based sample app
  10. Why adopt a gateway instead of an individual ingress for each cluster?
    • Unified traffic control by SRE
    • Decouple the business logic and telemetry traffic
    • Easy deployment for Alloy
    • Cut down the Security Review procedure
  11. Why choose Contour instead of Traefik or another ingress controller?
    • Contour is more performant and consumes less memory
    • Envoy Gateway was considered, but the k8s version is not compatible
  12. Challenges
    • Handle a large amount of incoming traffic
    • Load test and tuning for Contour and Alloy
    • 3 levels of protection
      1. Client side sampling
      2. Contour rate limit
      3. Grafana Alloy rate limit
    • Increasing load from Loki and Tempo
    • Continuously tuning for Loki and Tempo
    • Individual rate limit for each tenant
    Load Test Report:
    • Alloy: 1500 RPS (1 core, 1Gi)
    • Envoy: 10000 connections (3 cores, 1Gi)
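The ideas of client-side sampling and per-tenant rate limiting can be sketched as follows. This is a conceptual illustration only, not how the Faro SDK, Contour, or Alloy implement these features: the names `session_is_sampled`, `TokenBucket`, and `TenantLimiter` are hypothetical, and the token-bucket algorithm is an assumption chosen because it is a common way to express a "rate plus burst" limit.

```python
import hashlib
import time


def session_is_sampled(session_id, rate):
    """Deterministically keep roughly `rate` (0.0 to 1.0) of all sessions."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantLimiter:
    """Keep one independent token bucket per tenant."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow(self, tenant, now=None):
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = self.buckets[tenant] = TokenBucket(self.rate, self.burst, now)
        return bucket.allow(now)


if __name__ == "__main__":
    limiter = TenantLimiter(rate=1, burst=2)
    print([limiter.allow("tenant-a", now=0.0) for _ in range(3)])
    print(limiter.allow("tenant-b", now=0.0))
```

Hashing the session ID keeps the sampling decision stable for the whole session, so a sampled session is either fully recorded or fully dropped. Separate buckets per tenant mean one noisy tenant exhausts only its own budget.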
  13. Challenges
    • Web vitals are stored in Loki instead of Prometheus
    • Adopt Loki Rulers to ingest Loki query results into Prometheus
    • Faster loading for the real user monitoring dashboard
    • Constrained trace propagation in the present architecture
    • Upgrade or update the trace propagation in the intermediate components
    • Block trace propagation headers from the API gateway
    • Add an allowlist for trace context headers (e.g., TraceParent, Uber-Trace-Id)
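The allowlist idea for trace context headers can be sketched as a small filter. This is a hypothetical illustration, not the gateway's actual configuration: `filter_trace_headers` and both header sets are assumptions, with the B3 header names taken from the Zipkin B3 propagation format and the allowlist taken from the two examples on the slide.

```python
# Trace propagation headers this sketch recognizes (assumed, not exhaustive).
KNOWN_TRACE_HEADERS = frozenset({
    "traceparent", "tracestate",           # W3C Trace Context
    "uber-trace-id",                       # Jaeger
    "b3", "x-b3-traceid", "x-b3-spanid",   # Zipkin B3
    "x-b3-parentspanid", "x-b3-sampled", "x-b3-flags",
})

# Headers allowed to pass, following the examples named on the slide.
ALLOWED_TRACE_HEADERS = frozenset({"traceparent", "uber-trace-id"})


def filter_trace_headers(headers, allowed=ALLOWED_TRACE_HEADERS):
    """Drop trace propagation headers that are not on the allowlist.

    Header names are compared case-insensitively; headers unrelated to
    tracing pass through unchanged.
    """
    kept = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in KNOWN_TRACE_HEADERS and lowered not in allowed:
            continue
        kept[name] = value
    return kept


if __name__ == "__main__":
    incoming = {
        "TraceParent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
        "Uber-Trace-Id": "abc:def:0:1",
        "X-B3-TraceId": "123",
        "Content-Type": "application/json",
    }
    print(sorted(filter_trace_headers(incoming)))
```

With the default allowlist, the `X-B3-TraceId` header is dropped while `TraceParent`, `Uber-Trace-Id`, and the non-tracing `Content-Type` header pass through.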
  14. • Upgrade Traefik to v3.0 to adopt OpenTelemetry
    • Resolve the issue of unbalanced requests to the OTEL collector
    • Zero-code instrumentation by eBPF (e.g., Grafana Beyla)
    • Continuously tuning for Tempo, Loki, and Alloy