
The pod knows where you're from


Meetup talk about DSR and source IP preservation, including why this matters for AI inference.

References: https://github.com/iorgreths/meetup-ebpf-dsr-source-ip/blob/main/REFERENCES.md
DEMOS: https://github.com/iorgreths/meetup-ebpf-dsr-source-ip/tree/main/demos


Marcel Gredler

March 09, 2026


Transcript

  1. 06.03.2026 eBPF Vienna - Why Source IP Matters

     Security: rate limiting, IP-based access control, blocking attackers
     Compliance: audit logs require real client IPs
     Geolocation: geo-based routing and content delivery
     Analytics: real visitor tracking and metrics
     Communication: clients identified by their IP, with communication initiated directly by the server
     AI: the AI's response is more costly than the request

     Without source IP, your pods see the node's IP instead of the client's IP.
  2. What Happens by Default?

     With externalTrafficPolicy: Cluster (the default):

     [Diagram: Client 1.2.3.4 sends a request to VIP 185.150.8.128, which lands on Node A (no local pod). kube-proxy forwards the packet to Node B and SNATs the source to Node A's IP (85.217.172.142). 😢 The pod sees Node A's IP, not the client's IP.]
  3. The SNAT Problem

     When traffic crosses nodes, kube-proxy performs Source NAT:

     Original:  Client IP (1.2.3.4) → VIP → Node A → Node B → Pod
     ↓ SNAT happens! ↓
     Pod sees:  Node A IP (85.217.172.142) ❌

     Why does this happen? kube-proxy needs return traffic to come back through the same path. Without SNAT, the pod would reply directly to the client, and the asymmetric routing would fail.
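The forced return path can be sketched as a toy connection-tracking model. This is illustrative Python, not kube-proxy's actual iptables/conntrack machinery; all names and addresses are taken from the example above:

```python
# Toy model of SNAT on Node A: the translation is recorded in a local
# connection table, so only Node A can undo it for the reply. That is
# why return traffic must flow back through the same node.

NODE_A_IP = "85.217.172.142"
conntrack: dict[tuple, tuple] = {}   # rewritten 4-tuple -> original source

def snat_forward(pkt: dict) -> dict:
    """Rewrite the packet source to Node A before forwarding to Node B."""
    key = (NODE_A_IP, pkt["sport"], pkt["dst"], pkt["dport"])
    conntrack[key] = (pkt["src"], pkt["sport"])
    return {**pkt, "src": NODE_A_IP}

pkt = {"src": "1.2.3.4", "sport": 56789, "dst": "10.0.1.50", "dport": 8080}
fwd = snat_forward(pkt)
# fwd["src"] is now Node A's IP: the pod never sees the client.
# If the pod replied straight to 1.2.3.4, nothing would translate the
# source back, and the client would reject the unknown connection.
```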
  4. Option 1: externalTrafficPolicy: Local

     How it works: traffic only goes to pods on the same node that received it.

     apiVersion: v1
     kind: Service
     metadata:
       name: my-service
     spec:
       type: LoadBalancer
       externalTrafficPolicy: Local  # 👈 The magic setting
       ports:
         - port: 80
  5. Option 1: externalTrafficPolicy: Local | Flow

     [Diagram: with externalTrafficPolicy: Local, traffic that hits Node A (no local pod) fails the health check and the connection is refused ❌; traffic that hits Node B (has the pod) is delivered directly, and the pod sees the client IP 1.2.3.4 ✅.]
  6. Option 1: externalTrafficPolicy: Local | Tradeoff

     Pros:
     ✅ Source IP preserved
     ✅ No app changes needed
     ✅ Works for any L4 protocol
     Cons:
     ❌ Traffic only to nodes with pods
     ❌ Uneven load distribution
     ❌ Services fail if no local pod

     The fundamental problem: you trade high availability for source IP preservation.
     "Workaround": you would need a pod on every node (plus node health checks for the VIP/LB/etc.).
  7. Option 2: L7 Headers (X-Forwarded-For)

     Use an Ingress Controller that adds HTTP headers:

     # Traefik, NGINX Ingress, etc. add these headers
     X-Forwarded-For: 1.2.3.4
     Forwarded: for=1.2.3.4

     Pros:
     ✅ Works with any traffic policy
     ✅ Well-supported by apps
     ✅ Multiple proxies can chain
     Cons:
     ❌ Only HTTP/HTTPS
     ❌ Can't use for TCP/UDP services
     ❌ Headers can be spoofed
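Because the header can be spoofed, an app has to walk the chain from the right past its own trusted proxies. A minimal sketch (the function name and trusted-proxy set are illustrative, not from any particular framework):

```python
# Recover the client IP from an X-Forwarded-For chain. Walk from the
# right, skipping proxies we trust; the first untrusted address is the
# best candidate for the real client. Anything further left was
# attacker-controllable input.

def client_ip_from_xff(xff: str, trusted_proxies: set[str]) -> str:
    hops = [h.strip() for h in xff.split(",")]
    for ip in reversed(hops):
        if ip not in trusted_proxies:
            return ip
    return hops[0]  # every hop trusted: fall back to the leftmost entry

# A client that tried to spoof "9.9.9.9" before reaching our proxy 10.0.0.5:
ip = client_ip_from_xff("9.9.9.9, 1.2.3.4, 10.0.0.5", {"10.0.0.5"})
# ip == "1.2.3.4" -- the spoofed leftmost entry is ignored
```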
  8. What About Proxy Protocol?

     Proxy Protocol prepends client info to the TCP stream:

     PROXY TCP4 1.2.3.4 185.150.8.128 56789 8080\r\n
     <actual TCP data follows>

     The catch: it doesn't solve cross-node forwarding!
     - Proxy Protocol works LB → first node ✅
     - But kube-proxy still SNATs when forwarding Node A → Node B ❌
     - Still requires externalTrafficPolicy: Local to work properly
     Proxy Protocol is a companion to the Local policy, not a replacement.
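The v1 header shown above is a single human-readable CRLF-terminated line, which makes it easy to sketch a parser (a minimal sketch of the v1 text format only; real deployments also need the binary v2 format and stricter validation per the HAProxy spec):

```python
# Parse a Proxy Protocol v1 header: one ASCII line that the proxy
# prepends before the application's own bytes.

def parse_proxy_v1(stream: bytes) -> tuple[dict, bytes]:
    header, _, payload = stream.partition(b"\r\n")
    parts = header.decode("ascii").split(" ")
    if parts[0] != "PROXY":
        raise ValueError("not a Proxy Protocol v1 header")
    proto, src_ip, dst_ip, src_port, dst_port = parts[1:6]
    info = {"proto": proto,
            "src": (src_ip, int(src_port)),
            "dst": (dst_ip, int(dst_port))}
    return info, payload

info, payload = parse_proxy_v1(
    b"PROXY TCP4 1.2.3.4 185.150.8.128 56789 8080\r\nGET / HTTP/1.1\r\n")
# info["src"] is ("1.2.3.4", 56789); payload starts with the real request
```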
  9. Summary of Current Options

     Approach                        | HA | L4 Support   | No App Changes  | Source IP
     Cluster (default)               | ✅ | ✅           | ✅              | ❌ Lost
     externalTrafficPolicy: Local    | ❌ | ✅           | ✅              | ✅
     X-Forwarded-For                 | ✅ | ❌ HTTP only | ⚠️ Parse header | ✅
     Proxy Protocol + e.T.P.         | ❌ | ✅           | ⚠️ Parse PP     | ✅

     🤔 Can we have it all? HA + L4 + no app changes + source IP?
  10. What is DSR?

      Direct Server Return (DSR) is a load balancing technique where:
      1. Request path: Client → LB → Node A → Node B (pod)
      2. Response path: Pod → directly to Client (bypasses Node A!)

      Why does this help?
      - The pod can reply directly because it knows the original client IP
      - No SNAT needed → source IP is preserved at the packet level
      - Works with externalTrafficPolicy: Cluster → full HA
  11. Why Cilium?

      Traditional kube-proxy: uses iptables/IPVS, must SNAT for routing, no DSR support.
      Cilium with eBPF: replaces kube-proxy entirely, native DSR support, preserves the source IP in the packet.
      DSR tunneling: encapsulates DSR metadata via Geneve.

      # cilium-values.yaml
      kubeProxyReplacement: true
      loadBalancer:
        mode: dsr
        dsrDispatch: geneve
  12. DSR Packet Flow

      [Diagram: with externalTrafficPolicy: Cluster + DSR, the request from Client 1.2.3.4 hits the NLB (185.150.8.128) and is forwarded (round-robin) to Node A. Cilium's eBPF datapath intercepts it, selects the backend on Node B, and encapsulates the packet in a Geneve tunnel with the inner source still 1.2.3.4. Node B decapsulates and delivers the packet with the original client IP ✅; the pod's reply goes directly to the client with the NLB IP as source, bypassing Node A.]

      DSR considerations:
      - Most NLBs are stateful and use Destination NAT (DNAT)
      - DSR needs the asymmetric return path: the backend sends replies with the NLB's IP as source, which looks like IP spoofing, so the NLB must be stateless or spoof protection must be disabled
      - Another alternative is the Border Gateway Protocol (BGP)
  13. The Geneve Magic

      When traffic must cross nodes, Cilium:
      1. Encapsulates the original packet in a Geneve tunnel
      2. Preserves the original source IP in the inner packet
      3. Adds DSR metadata (the NLB IP for the reply)

      ┌─────────────────────────────────────────────────────────┐
      │ Outer IP: src=Node A, dst=Node B                        │
      ├─────────────────────────────────────────────────────────┤
      │ Geneve Header: DSR info (NLB IP: 185.150.8.128)         │
      ├─────────────────────────────────────────────────────────┤
      │ Inner IP: src=1.2.3.4 (Client!)  dst=Pod IP             │
      └─────────────────────────────────────────────────────────┘

      The pod sees 1.2.3.4 as the source IP, and replies using the NLB IP as source!
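The layering above can be modeled as nested records. This is a toy sketch of the layering only, not real Geneve framing; the field names are illustrative:

```python
# Toy model of Cilium's DSR encapsulation: the outer header routes
# node-to-node, the Geneve metadata carries the NLB IP for the reply,
# and the inner packet keeps the client's source address untouched.

def encapsulate(packet: dict, node_a: str, node_b: str, nlb_ip: str) -> dict:
    return {
        "outer":  {"src": node_a, "dst": node_b},   # node-to-node hop
        "geneve": {"dsr_nlb_ip": nlb_ip},           # reply-source metadata
        "inner":  packet,                           # src is still the client
    }

frame = encapsulate({"src": "1.2.3.4", "dst": "10.0.1.50", "dport": 8080},
                    node_a="85.217.172.142", node_b="85.217.173.6",
                    nlb_ip="185.150.8.128")
# frame["inner"]["src"] is still "1.2.3.4": the pod sees the real client,
# and frame["geneve"] tells Node B which source IP to use for the reply.
```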
  14. Configuration

      1. K8s cluster setup:

      # Example: Kind cluster
      # No default CNI and no kube-proxy
      kind: Cluster
      apiVersion: "kind.x-k8s.io/v1alpha4"
      networking:
        disableDefaultCNI: true
        kubeProxyMode: none

      2. Cilium Helm values:

      kubeProxyReplacement: true
      routingMode: tunnel
      tunnelProtocol: geneve

      loadBalancer:
        mode: dsr
        dsrDispatch: geneve
  15. Verification: BPF LB Tables

      $ cilium bpf lb list
      SERVICE ADDRESS           BACKEND ADDRESS
      85.217.173.6:31000/TCP    0.0.0.0:0 (16) (0) [NodePort, dsr]
      0.0.0.0:31000/TCP         192.168.1.36:8080/TCP (15) (1)

      The [NodePort, dsr] flag confirms DSR is active!

      Hubble flow evidence:
      # Traffic from external client to pod
      1.2.3.4:9063 (world) -> tcp-echo:8080 to-overlay FORWARDED
      # to-overlay = sent via Geneve tunnel with DSR
  16. Reply Rewrite

      When the pod replies, eBPF rewrites the source:

      Without DSR: src: 10.0.1.50 (Pod IP)         dst: 1.2.3.4 (Client)  ❌ Client rejects!
      With DSR:    src: 185.150.8.128 (Service IP) dst: 1.2.3.4 (Client)  ✅ Client accepts!

      The reply bypasses Node A entirely → Direct Server Return.
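The rewrite itself is a one-field swap, which a toy sketch makes concrete (illustrative Python, not the eBPF program; names are assumptions):

```python
# Toy model of the DSR reply rewrite: before the reply leaves the node,
# the datapath replaces the pod's source address with the service (NLB)
# IP carried in the Geneve DSR metadata.

def rewrite_reply(reply: dict, dsr_nlb_ip: str) -> dict:
    """Return the reply as the client must see it: same destination,
    but sourced from the service IP the client originally contacted."""
    return {**reply, "src": dsr_nlb_ip}

reply = {"src": "10.0.1.50", "dst": "1.2.3.4", "sport": 8080}
wire = rewrite_reply(reply, dsr_nlb_ip="185.150.8.128")
# wire["src"] == "185.150.8.128": the client's TCP stack matches the
# 4-tuple it opened against the service IP and accepts the packet.
```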
  17. The Full Picture

      Feature                | Default | e.T.P.: Local | X-Forwarded-For | DSR
      High Availability      | ✅      | ❌            | ✅              | ✅
      Source IP Preserved    | ❌      | ✅            | ✅              | ✅
      Works for L4 (TCP/UDP) | ✅      | ✅            | ❌              | ✅
      No App Changes         | ✅      | ✅            | ⚠️              | ✅

      DSR = best of all worlds 🎉
  18. SNAT Mode: Return Path Through LB Node

      Request path (Client → Server):
      Client (192.168.1.202) ───100 Mbps───▶ LB Node (192.168.1.11) ───1 Gbps───▶ Pod Node (192.168.1.13)

      Response path (Server → Client): ⚠ BOTTLENECK, 100 Mbps limit
      Client (192.168.1.202) ◀───100 Mbps─── LB Node (192.168.1.11) ◀───1 Gbps─── Pod Node (192.168.1.13)

      All response traffic must traverse the LB node. With a 100 Mbps port limit, the 1 Gbps backend→LB link floods the LB node's ingress, causing TCP congestion collapse (5,431 retransmits) and an effective throughput of only 10.6 Mbps.
  19. DSR Mode: Direct Return to Client

      Request path (Client → Server):
      Client (192.168.1.202) ───100 Mbps───▶ LB Node (192.168.1.11) ───1 Gbps───▶ Pod Node (192.168.1.13)

      Response path (Server → Client): ✓ DIRECT, 1 Gbps!
      Client (192.168.1.202) ◀═══ 1 Gbps DIRECT ═══ Pod Node (192.168.1.13)
      LB Node: bypassed for responses

      Response traffic goes directly from backend to client at the full 1 Gbps, completely bypassing the 100 Mbps LB node. Result: 936 Mbps throughput.
  20. Why DSR Matters for AI Inference

      AI inference has fundamentally asymmetric traffic: small requests, large responses.

      The asymmetry problem:
      Direction           | Size
      Request (inbound)   | ~1-50 KB
      Response (outbound) | ~10 KB - 10 MB
      Response traffic can be 10-1000x larger than request traffic, depending on the workload.

      Inference workload examples:
      Workload         | In          | Out
      LLM streaming    | 1-50 KB     | 10-100 KB
      Image gen        | ~1 KB       | 1-10 MB
      Batch embeddings | ~10 KB      | ~6 MB
      Video analysis   | ~2 MB/frame | ~50 KB

      *Sizes derived from typical API response formats. Image generation based on 1024×1024 PNG output. LLM token sizes based on ~4 chars/token average.
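The asymmetry can be made concrete with the figures from the table (the exact byte counts below are illustrative picks from the quoted ranges, not measurements):

```python
# Egress/ingress ratios for the inference workloads above, using
# representative sizes inside the ranges quoted on the slide.

WORKLOADS = {                        # (bytes in, bytes out)
    "llm_streaming":    (25_000, 55_000),
    "image_gen":        (1_000, 5_000_000),
    "batch_embeddings": (10_000, 6_000_000),
}

ratios = {name: out / inp for name, (inp, out) in WORKLOADS.items()}
# image_gen emits ~5000x more bytes than it receives. Under SNAT every
# one of those response bytes crosses the LB node; under DSR, none do.
```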
  21. Why DSR Matters for AI Inference (2)

      With SNAT, all inference response traffic is funneled through the LB node's NIC. GPU nodes are expensive; their output shouldn't be bottlenecked by a non-GPU LB node. DSR eliminates this funnel entirely.
  22. N:1 Oversubscription: The Real-World Problem

      Even at uniform link speeds, multiple workers saturate the LB node's return path.

      SNAT: workers share LB bandwidth
      Workers                  | Status
      1 worker (500 Mbps)      | ✅ OK (< 1 Gbps)
      2 workers (1,000 Mbps)   | ⚠️ Marginal (= 1 Gbps)
      3+ workers (1,500+ Mbps) | ❌ Congestion collapse

      DSR: workers respond directly
      Workers   | Throughput
      1 worker  | Up to 1 Gbps
      2 workers | Up to 2 Gbps aggregate
      N workers | Up to N × 1 Gbps (linear)

      The LB node only handles small inbound requests.
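The scaling in the two tables reduces to a one-line formula, sketched here with the slide's numbers (500 Mbps per worker, 1 Gbps LB NIC):

```python
# N:1 oversubscription in one function: under SNAT, aggregate egress is
# capped by the LB node's NIC; under DSR it scales linearly with N.

LB_NIC_GBPS = 1.0
WORKER_GBPS = 0.5   # each worker emits 500 Mbps of responses

def egress_gbps(n_workers: int, dsr: bool) -> float:
    offered = n_workers * WORKER_GBPS
    return offered if dsr else min(offered, LB_NIC_GBPS)

snat = [egress_gbps(n, dsr=False) for n in (1, 2, 3, 4)]  # caps at 1.0
dsr  = [egress_gbps(n, dsr=True)  for n in (1, 2, 3, 4)]  # grows linearly
# In practice SNAT past the cap is worse than min() suggests: TCP
# congestion collapse drives goodput below line rate (the 10.6 Mbps demo).
```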
  23. N:1 Oversubscription (2)

      This is what our DEMO simulates. We limited the LB node to 100 Mbps to demonstrate what happens when return traffic exceeds LB capacity. In production, the same effect occurs naturally when N worker nodes × response bandwidth > LB node NIC capacity. DSR throughput scales linearly with worker count; SNAT is capped at the LB node's bandwidth.
  24. Summary

      The problem: externalTrafficPolicy: Cluster + cross-node traffic = SNAT ⇒ source IP lost.

      The tradeoffs:
      - Local policy: preserves IP but sacrifices HA
      - L7 headers: HTTP only, requires header parsing
      - Proxy Protocol: still needs the Local policy

      The solution: Cilium DSR gives HA + L4 + source IP preservation + no app changes, using Geneve encapsulation to carry DSR metadata across nodes.
  25. When Does DSR Shine?

      Best for DSR:
      - AI/ML inference (asymmetric traffic)
      - API gateways, CDN, streaming
      - Image/video generation services
      - Multiple workers behind one LB
      - Mixed link speeds in the cluster
      - Preserving the client source IP

      SNAT still OK for:
      - Symmetric traffic patterns (most web HTTPS)
      - Uniform high-speed links, few workers
      - Simple debugging (single path)
      - When the client IP is not needed
  26. References

      - Cilium Documentation: DSR
      - Kubernetes: Source IP
      - HAProxy: Proxy Protocol Spec
      - Exoscale SKS Documentation
      - Cilium Hetzner Performance Testing
      - Architecture: The 10G eBPF Edge with OPNsense & Istio-Ready Cilium DSR
      - Bringing eBPF and Cilium to GKE
      - Amazon Titan Image Generator G1 Models
      - OpenAI: How to Count Tokens with Tiktoken