Slide 1

Slide 1 text

Grafana Alloy Best Practice

Slide 2

Slide 2 text

Eric Huang
LINE Taiwan / SRE
2021: E-SUN bank
2022: LINE Taiwan
Kubernetes, Rust, eBPF
titaneric
chen-yi-huang

Slide 3

Slide 3 text

01 Introduction

Slide 4

Slide 4 text

Real User Monitoring (RUM)
• Collect metrics from the client side and the browser
• Monitor web application performance
• Discover errors
• Track user behavior (sessions)
Source: Web Vitals, User-centric performance metrics, Grafana Faro OSS

Slide 5

Slide 5 text

(Distributed) Tracing
• Represent the full journey of a request through a distributed environment
• Improve the visibility of the app
• Diagnose the source of errors
Source: Observability primer | OpenTelemetry
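
To make "the full journey of a request" concrete, here is a minimal sketch of how a trace can stay connected across services using a W3C Trace Context `traceparent` header. The names `SpanContext`, `start_trace`, `parse_traceparent`, and `child_of` are hypothetical helpers for illustration, not part of any SDK mentioned in this talk, and the parser only accepts the four-field version-00 style layout.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# W3C Trace Context layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical container for the IDs carried between services."""
    trace_id: str  # 32 hex chars, shared by every span of one request
    span_id: str   # 16 hex chars, unique per span
    sampled: bool

    def to_traceparent(self) -> str:
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id}-{self.span_id}-{flags}"


def start_trace(sampled: bool = True) -> SpanContext:
    """Start a new trace at the edge, for example in the browser."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_of(parent: SpanContext) -> SpanContext:
    """Create a span in the next service: same trace ID, new span ID."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

Each hop parses the incoming header, creates a child with `child_of`, and sends the child's `to_traceparent()` value downstream, so every span ends up sharing one trace ID.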

Slide 6

Slide 6 text

(Distributed) Tracing
Source: COSCUP 2024

Slide 7

Slide 7 text

Grafana Alloy
“Alloy is a flexible, high performance, vendor-neutral distribution of the OpenTelemetry Collector”
Key features:
• Custom components
• Chained components
• Debugging utilities
Adopt faro.receiver component with Faro SDK
Source: Grafana Alloy | Grafana Alloy documentation

Slide 8

Slide 8 text


Slide 9

Slide 9 text

Grafana Faro Web SDK
“Grafana Faro includes a highly configurable web SDK for real user monitoring that instruments browser frontend applications to capture observability signals.”
Key features:
• Monitoring applications performance
• Captures errors, logs, user activity
• Instrument performance and observe full stack
Source: Grafana Faro OSS | Web SDK for real user monitoring (RUM)

Slide 10

Slide 10 text


Slide 11

Slide 11 text

02 Results

Slide 12

Slide 12 text

End-to-End Tracing
Spans include:
• frontend app (Next.js)
• ingress controller (Traefik)
• web framework (Flask)
• HTTP client library (requests)

Slide 13

Slide 13 text

RUM dashboards
Official RUM dashboard: Loki datasource

Slide 14

Slide 14 text

RUM dashboards
Improved RUM dashboard: Prometheus datasource

Slide 15

Slide 15 text

Session/Trace Explore

Slide 16

Slide 16 text

Session Detail

Slide 17

Slide 17 text

03 Present Architecture

Slide 18

Slide 18 text

Technology stack
Infra managed by JP, KR
Serve managed by TW SRE

Slide 19

Slide 19 text

Architecture (User)

Slide 20

Slide 20 text

Architecture (SRE)

Slide 21

Slide 21 text

04 How to design?

Slide 22

Slide 22 text

Requirements
Must have:
• Adopt the present observability platform
• Easy, automatic deployment of the Alloy service
• Control the traffic load sent from real users
Nice to have:
• Easy application for new tenants
• Slack workflow
• Sample code for SSR and CSR apps
• Next.js-based sample app

Slide 23

Slide 23 text

Alloy Architecture (User)

Slide 24

Slide 24 text

Alloy Architecture (SRE)

Slide 25

Slide 25 text

Why?
• Adopt a gateway instead of an individual ingress for each cluster?
  • Unified traffic control by SRE
  • Decouple the business logic and telemetry traffic
  • Easy deployment for Alloy
  • Cut down the Security Review procedure

Slide 26

Slide 26 text

Why?
• Choose Contour instead of Traefik or another ingress controller?
  • Contour is more performant and consumes less memory
  • Envoy Gateway was considered, but the k8s version is not compatible

Slide 27

Slide 27 text

Challenges
• Handle a large amount of incoming traffic
  • Load test and tuning for Contour and Alloy
  • 3 levels of protection
    1. Client-side sampling
    2. Contour rate limit
    3. Grafana Alloy rate limit
• Increasing load from Loki and Tempo
  • Continuous tuning for Loki and Tempo
  • Individual rate limit for each tenant
Load Test Report:
Alloy: 1500 RPS (1 core, 1Gi)
Envoy: 10000 connections (3 cores, 1Gi)
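
The protection levels above can be sketched in a few lines. This is a conceptual illustration only: the real limits live in the Faro SDK settings, the Contour configuration, and the Alloy configuration, none of which are shown in the slides. The names `session_is_sampled`, `TokenBucket`, and `TenantRateLimiter` are hypothetical, and the token bucket is a common rate-limiting technique chosen for the example, not necessarily what Contour or Alloy use internally.

```python
from __future__ import annotations

import hashlib
import time
from typing import Optional


def session_is_sampled(session_id: str, sample_rate: float) -> bool:
    """Level 1 idea: decide once per session, deterministically."""
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a fraction of the unit interval.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < sample_rate


class TokenBucket:
    """Levels 2 and 3 idea: allow `rate` requests/second with bursts."""

    def __init__(self, rate: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TenantRateLimiter:
    """One bucket per tenant, so a noisy tenant cannot starve the others."""

    def __init__(self, limits: dict[str, tuple[float, float]]) -> None:
        # limits maps tenant name -> (requests per second, burst size)
        self._buckets = {t: TokenBucket(r, b) for t, (r, b) in limits.items()}

    def allow(self, tenant: str, now: Optional[float] = None) -> bool:
        bucket = self._buckets.get(tenant)
        if bucket is None:
            return False  # assumption: unknown tenants are rejected
        return bucket.allow(time.monotonic() if now is None else now)
```

Sampling on a hash of the session ID keeps or drops a whole session together, which keeps session views complete for the sessions that are kept.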

Slide 28

Slide 28 text

Challenges
• Web vitals are stored in Loki instead of Prometheus
  • Adopt the Loki ruler to ingest Loki query results into Prometheus
  • Faster loading for the real user monitoring dashboard
• Constrained trace propagation in the present architecture
  • Upgrade or update the trace propagation in the intermediaries
  • Block trace propagation header from API gateway
  • Add an allowed list for trace context headers (e.g., TraceParent, Uber-Trace-Id)
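
The header allowed list can be illustrated with a small sketch. It assumes a gateway that forwards only explicitly allowed headers; `BASE_ALLOWED_HEADERS`, `TRACE_CONTEXT_HEADERS`, and `filter_headers` are hypothetical names, and the base list is made up for the example. Only the two trace headers named on the slide are included.

```python
from __future__ import annotations

# Hypothetical headers the gateway already forwards (lowercase for matching).
BASE_ALLOWED_HEADERS = frozenset({"content-type", "authorization"})

# Trace context headers named on the slide, lowercased.
TRACE_CONTEXT_HEADERS = frozenset({"traceparent", "uber-trace-id"})


def filter_headers(headers: dict[str, str], allowed: frozenset[str]) -> dict[str, str]:
    """Keep only headers whose name is in the allowed list.

    HTTP header names are case-insensitive, so names are compared in
    lowercase while the original spelling is preserved in the result.
    """
    return {name: value for name, value in headers.items() if name.lower() in allowed}


incoming = {
    "Content-Type": "application/json",
    "TraceParent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "Uber-Trace-Id": "0af7651916cd43dd:b7ad6b7169203331:0:1",
    "X-Internal-Debug": "1",
}

# Without the trace allowed list, the trace context is dropped at the gateway.
without_trace = filter_headers(incoming, BASE_ALLOWED_HEADERS)

# With it, the downstream service can continue the same trace.
with_trace = filter_headers(incoming, BASE_ALLOWED_HEADERS | TRACE_CONTEXT_HEADERS)
```

With the base list alone the trace breaks at the gateway; adding the trace context headers lets the backend spans join the trace started in the browser, while unrelated headers are still dropped.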

Slide 29

Slide 29 text

Challenges
Source: DevOpsDays Taipei 2024
Source: DevOpsDays Taipei 2023

Slide 30

Slide 30 text

05 Future work

Slide 31

Slide 31 text

• Upgrade Traefik to v3.0 to adopt OpenTelemetry
• Resolve the issue of unbalanced requests to the OTEL collector
• Zero-code instrumentation with eBPF (e.g., Grafana Beyla)
• Continuous tuning for Tempo, Loki, and Alloy

Slide 32

Slide 32 text

06 Q&A