Eric Huang
LINE Taiwan / SRE
2021: E.SUN Bank
2022: LINE Taiwan
Kubernetes, Rust, eBPF
titaneric chen-yi-huang
01
Introduction
Real User Monitoring (RUM)
• Collect metrics from the client side & browser (sketch below)
• Monitor web application performance
• Discover errors
• Track user behavior (sessions)
Source: Web Vitals, User-centric performance metrics, Grafana Faro OSS
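To make the client-side collection concrete, here is a minimal sketch using the web-vitals library cited above; the report helper and the /rum/collect endpoint are illustrative placeholders, and in this setup the Faro SDK normally captures and ships these measurements for you.
```typescript
// Minimal sketch: capture Core Web Vitals in the browser and beacon them out.
// '/rum/collect' is a placeholder endpoint; with Grafana Faro the SDK collects
// and ships these measurements automatically.
import { onCLS, onINP, onLCP, type Metric } from 'web-vitals';

function report(metric: Metric) {
  // sendBeacon survives page unload, which matters for RUM payloads.
  navigator.sendBeacon(
    '/rum/collect',
    JSON.stringify({ name: metric.name, value: metric.value, id: metric.id })
  );
}

onCLS(report); // Cumulative Layout Shift
onINP(report); // Interaction to Next Paint
onLCP(report); // Largest Contentful Paint
```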
(Distributed) Tracing
• Represent the full journey of a request through a distributed environment (sketch below)
• Improve the visibility of the app
• Diagnose the source of errors
Source: Observability primer | OpenTelemetry
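As a sketch of one hop in that journey, the snippet below uses the OpenTelemetry JavaScript API to wrap a frontend request in a span; the tracer name, route, and attribute values are illustrative, and a tracer provider is assumed to be registered elsewhere (for example by Faro's tracing instrumentation).
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// A tracer provider is assumed to be registered elsewhere
// (e.g. by Faro's TracingInstrumentation or the OTel Web SDK).
const tracer = trace.getTracer('frontend');

async function placeOrder(cartId: string) {
  // Each hop of the request records its own span; the trace context is
  // propagated downstream via HTTP headers so all spans join one trace.
  return tracer.startActiveSpan('placeOrder', async (span) => {
    try {
      const res = await fetch('/api/orders', {
        method: 'POST',
        body: JSON.stringify({ cartId }),
      });
      span.setAttribute('http.response.status_code', res.status);
      return await res.json();
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```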
(Distributed) Tracing
Source: COSCUP 2024
Grafana Alloy
“Alloy is a flexible, high performance, vendor-neutral distribution of the OpenTelemetry Collector”
Key features:
• Custom components
• Chained components
• Debugging utilities
Adopt the faro.receiver component with the Faro SDK
Source: Grafana Alloy | Grafana Alloy documentation
Grafana Faro Web SDK
“Grafana Faro includes a highly configurable web SDK for real user monitoring that instruments browser frontend applications to capture observability signals.”
Key features:
• Monitor application performance
• Capture errors, logs, and user activity
• Instrument performance and observe the full stack
Source: Grafana Faro OSS | Web SDK for real user monitoring (RUM)
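A minimal initialization sketch for the Faro Web SDK pointed at an Alloy faro.receiver endpoint; the URL and app metadata are placeholders, and the tracing instrumentation is optional.
```typescript
import { getWebInstrumentations, initializeFaro } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';

const faro = initializeFaro({
  // Placeholder: the endpoint exposed by Alloy's faro.receiver component.
  url: 'https://alloy.example.com/collect',
  app: { name: 'frontend', version: '1.0.0' },
  instrumentations: [
    // Defaults: errors, console logs, web vitals, session tracking.
    ...getWebInstrumentations(),
    // Optional: link RUM sessions to backend traces via OpenTelemetry.
    new TracingInstrumentation(),
  ],
});

// Custom signals can also be pushed explicitly.
faro.api.pushLog(['checkout page loaded']);
faro.api.pushError(new Error('example handled error'));
```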
RUM dashboards
Official RUM dashboard: Loki datasource
RUM dashboards
Improved RUM dashboard: Prometheus datasource
Session/Trace Explore
Session Detail
03
Present Architecture
Technology stack
Infra managed by JP and KR teams
Service managed by TW SRE
Architecture (User)
Architecture (SRE)
04
How to design?
Requirements
Must have:
• Adopt the present observability platform
• Easy to deploy the Alloy service automatically
• Control the traffic load sent from real users
Nice to have:
• Easy application process for new tenants
• Slack workflow
• Sample code for SSR and CSR apps (see the Next.js sketch below)
• Next.js based sample app
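For the SSR/CSR sample code items above, a hypothetical Next.js (App Router) client component could guard the Faro initialization so it only runs in the browser; the file name, endpoint, and app metadata are placeholders.
```typescript
'use client';
// app/faro-init.tsx (hypothetical path): initializes Faro only on the client,
// so it is safe to render from an SSR layout.
import { useEffect } from 'react';
import { initializeFaro } from '@grafana/faro-web-sdk';

export default function FaroInit() {
  useEffect(() => {
    initializeFaro({
      url: 'https://alloy.example.com/collect', // placeholder endpoint
      app: { name: 'nextjs-sample', version: '1.0.0' },
    });
  }, []);
  return null;
}
```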
Alloy Architecture (User)
Alloy Architecture (SRE)
Why adopt a gateway instead of an individual ingress for each cluster?
• Unified traffic control by SRE
• Decouple business logic and telemetry traffic
• Easy deployment of Alloy
• Cut down on the security review procedure
Why choose Contour instead of Traefik or other ingress controllers?
• Contour is more performant and consumes less memory
• Envoy Gateway was considered, but the Kubernetes version is not compatible
Challenges
• Handle a large amount of incoming traffic
• Load test and tune Contour and Alloy
• 3 levels of protection (client-side sampling sketch below):
1. Client-side sampling
2. Contour rate limit
3. Grafana Alloy rate limit
• Increasing load on Loki and Tempo
• Continuous tuning of Loki and Tempo
• Individual rate limits for each tenant
Load Test Report:
Alloy: 1,500 RPS (1 core, 1 Gi)
Envoy: 10,000 connections (3 cores, 1 Gi)
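A simple sketch of the client-side sampling level, assuming the decision is made once per page load before the SDK is initialized; the 10% rate and the endpoint are illustrative values, not the production settings.
```typescript
import { initializeFaro } from '@grafana/faro-web-sdk';

// Illustrative sampling rate; the real value is tuned against the Contour
// and Alloy rate limits downstream.
const SAMPLE_RATE = 0.1;

// Only a fraction of page loads ship telemetry at all, which cuts the
// traffic before it ever reaches the gateway.
if (Math.random() < SAMPLE_RATE) {
  initializeFaro({
    url: 'https://alloy.example.com/collect', // placeholder endpoint
    app: { name: 'frontend', version: '1.0.0' },
  });
}
```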
Challenges
• Web Vitals are stored in Loki instead of Prometheus
• Adopt Loki rulers to ingest Loki query results into Prometheus
• Faster loading for the real user monitoring dashboard
• Constrained trace propagation in the present architecture (propagation sketch below)
• Upgrade or update trace propagation in the intermediate services
• Trace propagation headers blocked by the API gateway
• Add an allow list for trace context headers (e.g., traceparent, uber-trace-id)
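To show what the allow list has to let through, the sketch below uses the OpenTelemetry JavaScript propagation API to inject the active trace context into outgoing headers; Faro's tracing instrumentation does this automatically for fetch/XHR, a global propagator is assumed to be registered, and uber-trace-id only appears when a Jaeger propagator is in use.
```typescript
import { context, propagation } from '@opentelemetry/api';

async function callBackend() {
  // Manually inject the active trace context into outgoing request headers.
  // The API gateway in between must forward these headers unchanged,
  // otherwise frontend and backend spans end up in separate traces.
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  // headers now contains e.g. { traceparent: '00-<trace-id>-<span-id>-01' },
  // plus uber-trace-id when a Jaeger propagator is registered.
  return fetch('/api/orders', { headers });
}
```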
• Upgrade Traefik to v3.0 to adopt OpenTelemetry
• Resolve the issue of unbalanced requests to the OTel Collector
• Zero-code instrumentation with eBPF (e.g., Grafana Beyla)
• Continuous tuning of Tempo, Loki, and Alloy