
The pod knows where you're from


Meetup talk about DSR and source IP preservation, including why this matters for AI inference.

References: https://github.com/iorgreths/meetup-ebpf-dsr-source-ip/blob/main/REFERENCES.md
DEMOS: https://github.com/iorgreths/meetup-ebpf-dsr-source-ip/tree/main/demos


Marcel Gredler

March 09, 2026


Transcript

  1. 06.03.2026 eBPF Vienna - Why Source IP Matters

     Security: rate limiting, IP-based access control, blocking attackers
     Compliance: audit logs require real client IPs
     Geolocation: geo-based routing and content delivery
     Analytics: real visitor tracking and metrics
     Communication: clients identified by their IP, with communication initiated directly by the server
     AI: the AI's response is more costly than the request

     Without source IP, your pods see the node's IP instead of the client's IP.
  2. What Happens by Default?

     With externalTrafficPolicy: Cluster (the default):

     [Diagram: Client 1.2.3.4 sends a request to VIP 185.150.8.128, which lands on Node A (no local pod). kube-proxy forwards the packet to Node B and SNATs the source to Node A's IP (85.217.172.142). 😢 The pod sees Node A's IP, not the client's IP.]
  3. The SNAT Problem

     When traffic crosses nodes, kube-proxy performs Source NAT:

     Original:  Client IP (1.2.3.4) → VIP → Node A → Node B → Pod
     ↓ SNAT happens! ↓
     Pod sees:  Node A IP (85.217.172.142) ❌

     Why does this happen? kube-proxy needs return traffic to come back through the same path. Without SNAT, the pod would reply directly to the client, and the asymmetric routing would fail.
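The forced return path can be sketched as a toy connection-tracking model. This is illustrative Python, not kube-proxy's actual iptables/conntrack machinery; all names and addresses are taken from the example above:

```python
# Toy model of SNAT on Node A: the translation is recorded in a local
# connection table, so only Node A can undo it for the reply. That is
# why return traffic must flow back through the same node.

NODE_A_IP = "85.217.172.142"
conntrack: dict[tuple, tuple] = {}   # rewritten 4-tuple -> original source

def snat_forward(pkt: dict) -> dict:
    """Rewrite the packet source to Node A before forwarding to Node B."""
    key = (NODE_A_IP, pkt["sport"], pkt["dst"], pkt["dport"])
    conntrack[key] = (pkt["src"], pkt["sport"])
    return {**pkt, "src": NODE_A_IP}

pkt = {"src": "1.2.3.4", "sport": 56789, "dst": "10.0.1.50", "dport": 8080}
fwd = snat_forward(pkt)
# fwd["src"] is now Node A's IP: the pod never sees the client.
# If the pod replied straight to 1.2.3.4, nothing would translate the
# source back, and the client would reject the unknown connection.
```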
  4. Option 1: externalTrafficPolicy: Local

     How it works: traffic only goes to pods on the same node that received it.

     apiVersion: v1
     kind: Service
     metadata:
       name: my-service
     spec:
       type: LoadBalancer
       externalTrafficPolicy: Local  # 👈 The magic setting
       ports:
         - port: 80
  5. Option 1: externalTrafficPolicy: Local | Flow

     [Diagram: with externalTrafficPolicy: Local, traffic that hits Node A (no local pod) fails the health check and the connection is refused ❌; traffic that hits Node B (has the pod) is delivered directly, and the pod sees the client IP 1.2.3.4 ✅.]
  6. Option 1: externalTrafficPolicy: Local | Tradeoff

     Pros:
     ✅ Source IP preserved
     ✅ No app changes needed
     ✅ Works for any L4 protocol
     Cons:
     ❌ Traffic only to nodes with pods
     ❌ Uneven load distribution
     ❌ Services fail if no local pod

     The fundamental problem: you trade high availability for source IP preservation.
     "Workaround": you would need a pod on every node (plus node health checks for the VIP/LB/etc.).
  7. Option 2: L7 Headers (X-Forwarded-For)

     Use an Ingress Controller that adds HTTP headers:

     # Traefik, NGINX Ingress, etc. add these headers
     X-Forwarded-For: 1.2.3.4
     Forwarded: for=1.2.3.4

     Pros:
     ✅ Works with any traffic policy
     ✅ Well-supported by apps
     ✅ Multiple proxies can chain
     Cons:
     ❌ Only HTTP/HTTPS
     ❌ Can't use for TCP/UDP services
     ❌ Headers can be spoofed
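Because the header can be spoofed, an app has to walk the chain from the right past its own trusted proxies. A minimal sketch (the function name and trusted-proxy set are illustrative, not from any particular framework):

```python
# Recover the client IP from an X-Forwarded-For chain. Walk from the
# right, skipping proxies we trust; the first untrusted address is the
# best candidate for the real client. Anything further left was
# attacker-controllable input.

def client_ip_from_xff(xff: str, trusted_proxies: set[str]) -> str:
    hops = [h.strip() for h in xff.split(",")]
    for ip in reversed(hops):
        if ip not in trusted_proxies:
            return ip
    return hops[0]  # every hop trusted: fall back to the leftmost entry

# A client that tried to spoof "9.9.9.9" before reaching our proxy 10.0.0.5:
ip = client_ip_from_xff("9.9.9.9, 1.2.3.4, 10.0.0.5", {"10.0.0.5"})
# ip == "1.2.3.4" -- the spoofed leftmost entry is ignored
```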
  8. What About Proxy Protocol?

     Proxy Protocol prepends client info to the TCP stream:

     PROXY TCP4 1.2.3.4 185.150.8.128 56789 8080\r\n
     <actual TCP data follows>

     The catch: it doesn't solve cross-node forwarding!
     - Proxy Protocol works LB → first node ✅
     - But kube-proxy still SNATs when forwarding Node A → Node B ❌
     - Still requires externalTrafficPolicy: Local to work properly
     Proxy Protocol is a companion to the Local policy, not a replacement.
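The v1 header shown above is a single human-readable CRLF-terminated line, which makes it easy to sketch a parser (a minimal sketch of the v1 text format only; real deployments also need the binary v2 format and stricter validation per the HAProxy spec):

```python
# Parse a Proxy Protocol v1 header: one ASCII line that the proxy
# prepends before the application's own bytes.

def parse_proxy_v1(stream: bytes) -> tuple[dict, bytes]:
    header, _, payload = stream.partition(b"\r\n")
    parts = header.decode("ascii").split(" ")
    if parts[0] != "PROXY":
        raise ValueError("not a Proxy Protocol v1 header")
    proto, src_ip, dst_ip, src_port, dst_port = parts[1:6]
    info = {"proto": proto,
            "src": (src_ip, int(src_port)),
            "dst": (dst_ip, int(dst_port))}
    return info, payload

info, payload = parse_proxy_v1(
    b"PROXY TCP4 1.2.3.4 185.150.8.128 56789 8080\r\nGET / HTTP/1.1\r\n")
# info["src"] is ("1.2.3.4", 56789); payload starts with the real request
```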
  9. Summary of Current Options

     Approach                        | HA | L4 Support   | No App Changes  | Source IP
     Cluster (default)               | ✅ | ✅           | ✅              | ❌ Lost
     externalTrafficPolicy: Local    | ❌ | ✅           | ✅              | ✅
     X-Forwarded-For                 | ✅ | ❌ HTTP only | ⚠️ Parse header | ✅
     Proxy Protocol + e.T.P.         | ❌ | ✅           | ⚠️ Parse PP     | ✅

     🤔 Can we have it all? HA + L4 + no app changes + source IP?
  10. What is DSR?

      Direct Server Return (DSR) is a load balancing technique where:
      1. Request path: Client → LB → Node A → Node B (pod)
      2. Response path: Pod → directly to Client (bypasses Node A!)

      Why does this help?
      - The pod can reply directly because it knows the original client IP
      - No SNAT needed → source IP is preserved at the packet level
      - Works with externalTrafficPolicy: Cluster → full HA
  11. Why Cilium?

      Traditional kube-proxy: uses iptables/IPVS, must SNAT for routing, no DSR support.
      Cilium with eBPF: replaces kube-proxy entirely, native DSR support, preserves the source IP in the packet.
      DSR tunneling: encapsulates DSR metadata via Geneve.

      # cilium-values.yaml
      kubeProxyReplacement: true
      loadBalancer:
        mode: dsr
        dsrDispatch: geneve
  12. DSR Packet Flow

      [Diagram: with externalTrafficPolicy: Cluster + DSR, the request from Client 1.2.3.4 hits the NLB (185.150.8.128) and is forwarded (round-robin) to Node A. Cilium's eBPF datapath intercepts it, selects the backend on Node B, and encapsulates the packet in a Geneve tunnel with the inner source still 1.2.3.4. Node B decapsulates and delivers the packet with the original client IP ✅; the pod's reply goes directly to the client with the NLB IP as source, bypassing Node A.]

      DSR considerations:
      - Most NLBs are stateful and use Destination NAT (DNAT)
      - DSR needs the asymmetric return path: the backend sends replies with the NLB's IP as source, which looks like IP spoofing, so the NLB must be stateless or spoof protection must be disabled
      - Another alternative is the Border Gateway Protocol (BGP)
  13. The Geneve Magic

      When traffic must cross nodes, Cilium:
      1. Encapsulates the original packet in a Geneve tunnel
      2. Preserves the original source IP in the inner packet
      3. Adds DSR metadata (the NLB IP for the reply)

      ┌─────────────────────────────────────────────────────────┐
      │ Outer IP: src=Node A, dst=Node B                        │
      ├─────────────────────────────────────────────────────────┤
      │ Geneve Header: DSR info (NLB IP: 185.150.8.128)         │
      ├─────────────────────────────────────────────────────────┤
      │ Inner IP: src=1.2.3.4 (Client!)  dst=Pod IP             │
      └─────────────────────────────────────────────────────────┘

      The pod sees 1.2.3.4 as the source IP, and replies using the NLB IP as source!
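The layering above can be modeled as nested records. This is a toy sketch of the layering only, not real Geneve framing; the field names are illustrative:

```python
# Toy model of Cilium's DSR encapsulation: the outer header routes
# node-to-node, the Geneve metadata carries the NLB IP for the reply,
# and the inner packet keeps the client's source address untouched.

def encapsulate(packet: dict, node_a: str, node_b: str, nlb_ip: str) -> dict:
    return {
        "outer":  {"src": node_a, "dst": node_b},   # node-to-node hop
        "geneve": {"dsr_nlb_ip": nlb_ip},           # reply-source metadata
        "inner":  packet,                           # src is still the client
    }

frame = encapsulate({"src": "1.2.3.4", "dst": "10.0.1.50", "dport": 8080},
                    node_a="85.217.172.142", node_b="85.217.173.6",
                    nlb_ip="185.150.8.128")
# frame["inner"]["src"] is still "1.2.3.4": the pod sees the real client,
# and frame["geneve"] tells Node B which source IP to use for the reply.
```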
  14. Configuration

      1. K8s cluster setup:

      # Example: Kind cluster
      # No default CNI and no kube-proxy
      kind: Cluster
      apiVersion: "kind.x-k8s.io/v1alpha4"
      networking:
        disableDefaultCNI: true
        kubeProxyMode: none

      2. Cilium Helm values:

      kubeProxyReplacement: true
      routingMode: tunnel
      tunnelProtocol: geneve

      loadBalancer:
        mode: dsr
        dsrDispatch: geneve
  15. Verification: BPF LB Tables

      $ cilium bpf lb list
      SERVICE ADDRESS           BACKEND ADDRESS
      85.217.173.6:31000/TCP    0.0.0.0:0 (16) (0) [NodePort, dsr]
      0.0.0.0:31000/TCP         192.168.1.36:8080/TCP (15) (1)

      The [NodePort, dsr] flag confirms DSR is active!

      Hubble flow evidence:
      # Traffic from external client to pod
      1.2.3.4:9063 (world) -> tcp-echo:8080 to-overlay FORWARDED
      # to-overlay = sent via Geneve tunnel with DSR
  16. Reply Rewrite

      When the pod replies, eBPF rewrites the source:

      Without DSR: src: 10.0.1.50 (Pod IP)         dst: 1.2.3.4 (Client)  ❌ Client rejects!
      With DSR:    src: 185.150.8.128 (Service IP) dst: 1.2.3.4 (Client)  ✅ Client accepts!

      The reply bypasses Node A entirely → Direct Server Return.
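The rewrite itself is a one-field swap, which a toy sketch makes concrete (illustrative Python, not the eBPF program; names are assumptions):

```python
# Toy model of the DSR reply rewrite: before the reply leaves the node,
# the datapath replaces the pod's source address with the service (NLB)
# IP carried in the Geneve DSR metadata.

def rewrite_reply(reply: dict, dsr_nlb_ip: str) -> dict:
    """Return the reply as the client must see it: same destination,
    but sourced from the service IP the client originally contacted."""
    return {**reply, "src": dsr_nlb_ip}

reply = {"src": "10.0.1.50", "dst": "1.2.3.4", "sport": 8080}
wire = rewrite_reply(reply, dsr_nlb_ip="185.150.8.128")
# wire["src"] == "185.150.8.128": the client's TCP stack matches the
# 4-tuple it opened against the service IP and accepts the packet.
```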
  17. The Full Picture

      Feature                | Default | e.T.P.: Local | X-Forwarded-For | DSR
      High Availability      | ✅      | ❌            | ✅              | ✅
      Source IP Preserved    | ❌      | ✅            | ✅              | ✅
      Works for L4 (TCP/UDP) | ✅      | ✅            | ❌              | ✅
      No App Changes         | ✅      | ✅            | ⚠️              | ✅

      DSR = best of all worlds 🎉
  18. SNAT Mode: Return Path Through LB Node

      Request path (Client → Server):
      Client (192.168.1.202) ───100 Mbps───▶ LB Node (192.168.1.11) ───1 Gbps───▶ Pod Node (192.168.1.13)

      Response path (Server → Client): ⚠ BOTTLENECK, 100 Mbps limit
      Client (192.168.1.202) ◀───100 Mbps─── LB Node (192.168.1.11) ◀───1 Gbps─── Pod Node (192.168.1.13)

      All response traffic must traverse the LB node. With a 100 Mbps port limit, the 1 Gbps backend→LB link floods the LB node's ingress, causing TCP congestion collapse (5,431 retransmits) and an effective throughput of only 10.6 Mbps.
  19. DSR Mode: Direct Return to Client

      Request path (Client → Server):
      Client (192.168.1.202) ───100 Mbps───▶ LB Node (192.168.1.11) ───1 Gbps───▶ Pod Node (192.168.1.13)

      Response path (Server → Client): ✓ DIRECT, 1 Gbps!
      Client (192.168.1.202) ◀═══ 1 Gbps DIRECT ═══ Pod Node (192.168.1.13)
      LB Node: bypassed for responses

      Response traffic goes directly from backend to client at the full 1 Gbps, completely bypassing the 100 Mbps LB node. Result: 936 Mbps throughput.
  20. Why DSR Matters for AI Inference

      AI inference has fundamentally asymmetric traffic: small requests, large responses.

      The asymmetry problem:
      Direction           | Size
      Request (inbound)   | ~1-50 KB
      Response (outbound) | ~10 KB - 10 MB
      Response traffic can be 10-1000x larger than request traffic, depending on the workload.

      Inference workload examples:
      Workload         | In          | Out
      LLM streaming    | 1-50 KB     | 10-100 KB
      Image gen        | ~1 KB       | 1-10 MB
      Batch embeddings | ~10 KB      | ~6 MB
      Video analysis   | ~2 MB/frame | ~50 KB

      *Sizes derived from typical API response formats. Image generation based on 1024×1024 PNG output. LLM token sizes based on ~4 chars/token average.
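The asymmetry can be made concrete with the figures from the table (the exact byte counts below are illustrative picks from the quoted ranges, not measurements):

```python
# Egress/ingress ratios for the inference workloads above, using
# representative sizes inside the ranges quoted on the slide.

WORKLOADS = {                        # (bytes in, bytes out)
    "llm_streaming":    (25_000, 55_000),
    "image_gen":        (1_000, 5_000_000),
    "batch_embeddings": (10_000, 6_000_000),
}

ratios = {name: out / inp for name, (inp, out) in WORKLOADS.items()}
# image_gen emits ~5000x more bytes than it receives. Under SNAT every
# one of those response bytes crosses the LB node; under DSR, none do.
```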
  21. Why DSR Matters for AI Inference (2)

      With SNAT, all inference response traffic is funneled through the LB node's NIC. GPU nodes are expensive; their output shouldn't be bottlenecked by a non-GPU LB node. DSR eliminates this funnel entirely.
  22. N:1 Oversubscription: The Real-World Problem

      Even at uniform link speeds, multiple workers saturate the LB node's return path.

      SNAT: workers share LB bandwidth
      Workers                  | Status
      1 worker (500 Mbps)      | ✅ OK (< 1 Gbps)
      2 workers (1,000 Mbps)   | ⚠️ Marginal (= 1 Gbps)
      3+ workers (1,500+ Mbps) | ❌ Congestion collapse

      DSR: workers respond directly
      Workers   | Throughput
      1 worker  | Up to 1 Gbps
      2 workers | Up to 2 Gbps aggregate
      N workers | Up to N × 1 Gbps (linear)

      The LB node only handles small inbound requests.
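The scaling in the two tables reduces to a one-line formula, sketched here with the slide's numbers (500 Mbps per worker, 1 Gbps LB NIC):

```python
# N:1 oversubscription in one function: under SNAT, aggregate egress is
# capped by the LB node's NIC; under DSR it scales linearly with N.

LB_NIC_GBPS = 1.0
WORKER_GBPS = 0.5   # each worker emits 500 Mbps of responses

def egress_gbps(n_workers: int, dsr: bool) -> float:
    offered = n_workers * WORKER_GBPS
    return offered if dsr else min(offered, LB_NIC_GBPS)

snat = [egress_gbps(n, dsr=False) for n in (1, 2, 3, 4)]  # caps at 1.0
dsr  = [egress_gbps(n, dsr=True)  for n in (1, 2, 3, 4)]  # grows linearly
# In practice SNAT past the cap is worse than min() suggests: TCP
# congestion collapse drives goodput below line rate (the 10.6 Mbps demo).
```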
  23. N:1 Oversubscription (2)

      This is what our DEMO simulates. We limited the LB node to 100 Mbps to demonstrate what happens when return traffic exceeds LB capacity. In production, the same effect occurs naturally when N worker nodes × response bandwidth > LB node NIC capacity. DSR throughput scales linearly with worker count; SNAT is capped at the LB node's bandwidth.
  24. Summary

      The problem: externalTrafficPolicy: Cluster + cross-node traffic = SNAT ⇒ source IP lost.

      The tradeoffs:
      - Local policy: preserves IP but sacrifices HA
      - L7 headers: HTTP only, requires header parsing
      - Proxy Protocol: still needs the Local policy

      The solution: Cilium DSR gives HA + L4 + source IP preservation + no app changes, using Geneve encapsulation to carry DSR metadata across nodes.
  25. When Does DSR Shine?

      Best for DSR:
      - AI/ML inference (asymmetric traffic)
      - API gateways, CDN, streaming
      - Image/video generation services
      - Multiple workers behind one LB
      - Mixed link speeds in the cluster
      - Preserving the client source IP

      SNAT still OK for:
      - Symmetric traffic patterns (most web HTTPS)
      - Uniform high-speed links, few workers
      - Simple debugging (single path)
      - When the client IP is not needed
  26. References

      - Cilium Documentation: DSR
      - Kubernetes: Source IP
      - HAProxy: Proxy Protocol Spec
      - Exoscale SKS Documentation
      - Cilium Hetzner Performance Testing
      - Architecture: The 10G eBPF Edge with OPNsense & Istio-Ready Cilium DSR
      - Bringing eBPF and Cilium to GKE
      - Amazon Titan Image Generator G1 Models
      - OpenAI: How to Count Tokens with Tiktoken