What You Should Know About eBPF Networking on Kubernetes

John Lin
Site Reliability Engineer, LINE Taiwan
Event: LINE x KCD Taiwan Meetup #49

LINE Developers Taiwan

September 01, 2022

Transcript

  1. About Me
      • John Lin, Site Reliability Engineer at LINE Taiwan
      • Previously Tencent
      • Kubernetes, Observability & Networking
      • Follow me on Twitter: @johnlin__
      LINE x KCD Taiwan Meetup 2
  2. We're Hiring Site Reliability Engineers
      Taipei | LINE Taiwan | Engineering System | Remote | Full-time
      https://careers.linecorp.com/jobs/1330
  3. Qualifications and Skills
      • Kubernetes, Cloud Native
      • On-premise Private Cloud
      • Observability
        • Having a better behind-the-scenes view to gain insights into a distributed system
      • Troubleshooting
        • Tracking down the root causes of problems
  4. Last talk at COSCUP ...
      • We looked at eBPF-based container networking
        • 4 types of datapath in Kubernetes networking
        • eBPF development
        • eBPF program types
      • Some things were glossed over
        • Cilium features and their applications
        • The “sidecarless” model in the world of service mesh
      • Today's talk
        • Notable Cilium features on Kubernetes
        • Sidecar model & sidecarless model
  5. Outline
      • Quick Recap
        • 4 types of datapath
      • Cilium features
        • kube-proxy replacement
        • DSR (Direct Server Return)
        • Egress IP Gateway
      • Sidecarless model - Cilium service mesh
      • Considerations of Adopting
  6. Quick Recap
      • The Linux Networking Subsystem
        • Linux Kernel Networking
        • iptables
  7. Quick Recap
      • Early Points in Kernel Networking
        • XDP (eXpress Data Path)
        • TC ingress/egress
  8. Quick Recap
      • Early Points in Kernel Networking
        • XDP (eXpress Data Path)
        • TC ingress/egress
      • Not a kernel bypass
  9. Quick Recap
      • eBPF will be a game changer for Kubernetes networking
      • Cilium is currently the most outstanding eBPF CNI project
      • Cilium's 5 important BPF programs
        • tc/BPF, XDP/BPF
        • Cgroup Socket/BPF, Socket Option/BPF, Socket Map/BPF
      • Several notable Cilium features
  10. iptables Skills Matter
      • Kubernetes networking: default networking policy enforcement is done with iptables
      • Early-stage Docker networking manipulation tool: pipework
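  11. kube-proxy (iptables) Replacement
      • The Cilium 1.6 release brought the final piece of the puzzle: the removal of kube-proxy
      • The latency of the eBPF service implementation does not grow as the number of Kubernetes Services grows
      • The eBPF service implementation avoids the non-deterministic behavior of packets traversing and matching many rules
      • A large number of iptables rules are removed from the node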
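The scaling difference claimed above can be illustrated with a toy Python model. It is not kube-proxy's or Cilium's code: it assumes a kube-proxy-style rule list that is scanned in order until one rule matches, versus an eBPF-style hash map keyed by (ClusterIP, port). The Service addresses, backends, and returned check counts are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (virtual IP, port) identifies a hypothetical ClusterIP Service
ServiceKey = Tuple[str, int]


def build_services(n: int) -> Dict[ServiceKey, List[str]]:
    """Create n hypothetical Services, each with two backend pod IPs."""
    services: Dict[ServiceKey, List[str]] = {}
    for i in range(n):
        hi, lo = divmod(i, 256)
        services[(f"10.96.{hi}.{lo}", 80)] = [f"10.0.{hi}.{lo}", f"10.1.{hi}.{lo}"]
    return services


def iptables_style_lookup(
    chain: List[Tuple[ServiceKey, List[str]]], key: ServiceKey
) -> Tuple[Optional[List[str]], int]:
    """Walk the rules in order until one matches; cost grows with the number of Services."""
    rules_checked = 0
    for rule_key, backends in chain:
        rules_checked += 1
        if rule_key == key:
            return backends, rules_checked
    return None, rules_checked


def hash_map_lookup(
    service_map: Dict[ServiceKey, List[str]], key: ServiceKey
) -> Tuple[Optional[List[str]], int]:
    """One keyed lookup, like a BPF hash map; cost does not depend on the Service count."""
    return service_map.get(key), 1
```

With `services = build_services(5000)`, finding `("10.96.19.135", 80)` (the last Service created) takes 5000 rule checks via `iptables_style_lookup(list(services.items()), key)` but a single lookup via `hash_map_lookup(services, key)`. Real iptables chains and BPF maps have other costs, so treat this only as an illustration of linear versus constant-time lookup.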
  12. kube-proxy (iptables) Replacement
      • What is the performance impact of replacing iptables rules?
      • North-South Traffic (N-S Traffic)
        • NodePort
        • ExternalIPs
        • LoadBalancer
  13. kube-proxy (iptables) Replacement
      • What is the performance impact of replacing iptables rules?
        • NAT is done in tc/BPF before packets enter netfilter
      • North-South Traffic (N-S Traffic)
        • NodePort
        • ExternalIPs
        • LoadBalancer
  14. kube-proxy (iptables) Replacement
      • What is the performance impact of replacing iptables rules?
        • NAT is done in tc/BPF before packets enter netfilter
      • North-South Traffic (N-S Traffic)
        • NodePort
        • ExternalIPs
        • LoadBalancer
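  16. kube-proxy (iptables) Replacement
      • What is the performance impact of replacing iptables rules?
        • NAT is done in tc/BPF before packets enter netfilter
      • North-South Traffic (N-S Traffic)
        • NodePort, ExternalIPs & LoadBalancer
      • Why not do it in XDP/BPF?
        • tc/BPF has better packet-mangling capabilities
        • tc/BPF can be attached on both ingress and egress
        • Implementing it before the skb is allocated is considerably harder
        • struct __sk_buff (tc/BPF) vs. struct xdp_md (XDP/BPF)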
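  17. kube-proxy (iptables) Replacement
      • What is the performance impact of replacing iptables rules?
        • NAT is done in tc/BPF before packets enter netfilter
      • North-South Traffic (N-S Traffic)
        • NodePort, ExternalIPs & LoadBalancer
      • Why not do it in XDP/BPF? Cilium 1.8 already supports it
        • tc/BPF has better packet-mangling capabilities
        • tc/BPF can be attached on both ingress and egress
        • Implementing it before the skb is allocated is considerably harder
        • struct __sk_buff (tc/BPF) vs. struct xdp_md (XDP/BPF)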
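As a rough illustration of why hooks in both directions matter for NodePort, the Python sketch below models a DNAT on the ingress path and the matching reverse NAT on the egress path. It assumes the backend pod runs on the same node (so no SNAT is needed) and uses hypothetical addresses, a hypothetical `NODEPORT_SERVICES` table, and a plain dict in place of a BPF map; a real tc/BPF program would rewrite headers through `struct __sk_buff` instead.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Hypothetical Service table: (node IP, NodePort) -> (backend pod IP, target port)
NODEPORT_SERVICES: Dict[Tuple[str, int], Tuple[str, int]] = {
    ("192.168.1.10", 30080): ("10.0.1.5", 8080),
}

# Reverse-NAT state recorded on ingress: (client IP, client port, pod IP, pod port) -> frontend
nat_table: Dict[Tuple[str, int, str, int], Tuple[str, int]] = {}


def tc_ingress(pkt: Packet) -> Packet:
    """DNAT: rewrite (node IP, NodePort) to the backend and remember the original frontend."""
    backend = NODEPORT_SERVICES.get((pkt.dst_ip, pkt.dst_port))
    if backend is None:
        return pkt  # not a NodePort packet; pass through unchanged
    nat_table[(pkt.src_ip, pkt.src_port, *backend)] = (pkt.dst_ip, pkt.dst_port)
    return replace(pkt, dst_ip=backend[0], dst_port=backend[1])


def tc_egress(pkt: Packet) -> Packet:
    """Reverse NAT on the reply: restore the frontend address as the source."""
    frontend = nat_table.get((pkt.dst_ip, pkt.dst_port, pkt.src_ip, pkt.src_port))
    if frontend is None:
        return pkt  # no recorded translation for this flow
    return replace(pkt, src_ip=frontend[0], src_port=frontend[1])
```

Without the egress step, the client would receive a reply from `10.0.1.5:8080` instead of the `192.168.1.10:30080` it connected to, and would not match it to its connection.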
  18. kube-proxy (iptables) Replacement
      • East-West Traffic (E-W Traffic)
        • ClusterIP
      • What is the performance impact of replacing iptables rules?
  19. kube-proxy (iptables) Replacement
      • East-West Traffic (E-W Traffic)
        • ClusterIP
      • What is the performance impact of replacing iptables rules?
        • NAT is done at the socket level, before netfilter, with Socket Option and Socket Map/BPF
        • Address translation happens at the socket connect(2) system call
        • This avoids per-packet NAT at a lower layer (IP), which relies on conntrack and causes problems when there are large numbers of connections
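The contrast between socket-level and per-packet translation can be sketched in Python. `socket_level` picks a backend once, as if at `connect(2)` time, and every later packet simply goes to it; `per_packet_nat` rewrites each packet and keeps one conntrack entry per connection so that a flow sticks to one backend. The `SERVICES` table, random backend selection, and lookup counters are hypothetical and do not reproduce Cilium's implementation.

```python
import random
from typing import Dict, List, Tuple

Addr = Tuple[str, int]

# Hypothetical ClusterIP Service: virtual address -> backend pod addresses
SERVICES: Dict[Addr, List[Addr]] = {
    ("10.96.0.10", 80): [("10.0.1.5", 8080), ("10.0.2.7", 8080)],
}


def socket_level(client: Addr, dst: Addr, n_packets: int) -> Tuple[List[Addr], int]:
    """Pick a backend once, at connect time; every packet is then sent to it directly."""
    lookups = 0
    backends = SERVICES.get(dst)
    if backends is not None:
        dst = random.choice(backends)  # the socket now points at a real backend
        lookups += 1
    return [dst] * n_packets, lookups


def per_packet_nat(
    client: Addr, dst: Addr, n_packets: int, conntrack: Dict[Tuple[Addr, Addr], Addr]
) -> Tuple[List[Addr], int]:
    """Rewrite each packet's destination; conntrack keeps the flow on one backend."""
    lookups = 0
    sent: List[Addr] = []
    for _ in range(n_packets):
        backends = SERVICES.get(dst)
        if backends is None:
            sent.append(dst)
            continue
        lookups += 1
        flow = (client, dst)
        if flow not in conntrack:  # one entry per connection: grows with connection count
            conntrack[flow] = random.choice(backends)
        sent.append(conntrack[flow])
    return sent, lookups
```

For a 100-packet connection, `socket_level` does one translation while `per_packet_nat` does 100 and leaves a conntrack entry behind; opening 1000 more connections adds 1000 more entries to the table.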
  20. Keep Client Source IP Address
      • Kubernetes Networking[1]:
        • L7: X-Forwarded-For (XFF) header
        • L4: Proxy Protocol
        • Service Type: NodePort, externalTrafficPolicy=Local
          • Local: traffic is not forwarded across nodes, so one SNAT is skipped. Latency is lower, but load balancing is uneven.
      [1] Keep Client Source IP post on CNTUG
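For the L4 option, the PROXY protocol (version 1 is the human-readable form defined by HAProxy) has the proxy or load balancer prepend one text line carrying the original client and destination addresses before the proxied bytes, so the backend can recover the client IP even after NAT. A minimal Python sketch of building and parsing that line, handling only the `TCP4`/`TCP6` forms rather than the whole specification:

```python
import ipaddress
from typing import NamedTuple, Tuple


class ProxyHeader(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def build_v1(h: ProxyHeader) -> bytes:
    """Encode the original connection as 'PROXY TCP4|TCP6 <src> <dst> <sport> <dport>\\r\\n'."""
    family = "TCP4" if ipaddress.ip_address(h.src_ip).version == 4 else "TCP6"
    return f"PROXY {family} {h.src_ip} {h.dst_ip} {h.src_port} {h.dst_port}\r\n".encode("ascii")


def parse_v1(data: bytes) -> Tuple[ProxyHeader, bytes]:
    """Split a v1 header off the start of a stream; return (header, remaining payload)."""
    line, sep, rest = data.partition(b"\r\n")
    if not sep or len(line) + 2 > 107:  # a v1 header is at most 107 bytes including CRLF
        raise ValueError("missing or oversized PROXY v1 header")
    parts = line.decode("ascii").split(" ")
    if len(parts) != 6 or parts[0] != "PROXY" or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("unsupported PROXY v1 header")
    _, _, src, dst, sport, dport = parts
    ipaddress.ip_address(src)  # raises ValueError on a malformed address
    ipaddress.ip_address(dst)
    return ProxyHeader(src, dst, int(sport), int(dport)), rest
```

The backend has to be configured to expect the header; a server that does not parse it would treat the `PROXY ...` line as application data.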
  21. Without DSR (SNAT)
      • Service Type: NodePort, externalTrafficPolicy=Cluster
      • Step 2 performs an SNAT, so the client source IP changes
      • Steps 3 and 4 return along the original request path
  22. Direct Server Return
      • eBPF with DSR
      • No SNAT happens in step 2, so the client source IP is preserved
      • Step 3 replies directly to the client
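The difference between the two diagrams can be traced with a small Python model of one request to a NodePort on node A whose backend pod lives on another node. Addresses are hypothetical, ports are omitted, and how the DSR datapath carries the service address to the backend's node is not modeled; the sketch only mirrors the steps listed on the slides.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Pkt:
    src: str
    dst: str


# Hypothetical addresses: external client, receiving node A, backend pod on node B
CLIENT, NODE_A, POD = "203.0.113.7", "192.168.1.10", "10.0.2.7"


def trace(dsr: bool) -> List[Pkt]:
    """Return the packets on the wire, in order, for one request and its reply."""
    path = [Pkt(CLIENT, NODE_A)]                # step 1: client -> NodePort on node A
    fwd = replace(path[0], dst=POD)             # DNAT to the backend pod
    if not dsr:
        fwd = replace(fwd, src=NODE_A)          # SNAT: the client IP is lost
    path.append(fwd)                            # step 2: node A -> pod
    reply = Pkt(POD, fwd.src)                   # the pod answers whatever source it saw
    if dsr:
        path.append(replace(reply, src=NODE_A)) # step 3: straight back to the client
    else:
        path.append(reply)                      # step 3: back to node A
        path.append(Pkt(NODE_A, CLIENT))        # step 4: node A reverse-NATs to the client
    return path
```

`trace(False)[1].src` is node A's address, so the pod never sees the client, and the reply takes four steps; `trace(True)[1].src` is the client's address and the reply takes three.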
  23. Egress IP Problem
      • North-south traffic requirements of enterprise / on-premises users[2]
        • Egress traffic with predictable IP addresses
      [2] Envoy Gateway post on CNTUG
  24. Egress IP Gateway
      • Possible Solutions
        1. Service Mesh, e.g. istio-egressgateway
        2. Set up a proxy inside/outside of the cluster
           • Envoy as Proxy
           • Ingress Controller + Kubernetes Service ExternalIPs
        3. NAT gateway
           • Cloud Provider
  25. Egress IP Gateway
      • First released as a beta feature in Cilium 1.10: CiliumEgressNATPolicy CRD
      • Promoted to stable in Cilium 1.12: CiliumEgressGatewayPolicy CRD
      • eBPF-based SNAT
      • Single Point of Failure?
        • Egress Gateway High Availability (HA), which supports multiple egress nodes
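Conceptually, an egress gateway policy selects some pods and some destination CIDRs, and matching traffic leaves the cluster SNATed to a fixed egress IP. The Python sketch below models only that matching decision with the standard `ipaddress` module. The `EgressPolicy` fields and `select_egress_ip` are hypothetical and are not the `CiliumEgressGatewayPolicy` schema.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EgressPolicy:
    pod_labels: Dict[str, str]    # hypothetical selector: every label must match
    destination_cidrs: List[str]  # destinations that must see a predictable source IP
    egress_ip: str                # source IP the external service will see


def select_egress_ip(
    policies: List[EgressPolicy], labels: Dict[str, str], dst_ip: str
) -> Optional[str]:
    """Return the egress IP to SNAT to, or None to keep the default node-IP path."""
    dst = ipaddress.ip_address(dst_ip)
    for policy in policies:
        labels_match = all(labels.get(k) == v for k, v in policy.pod_labels.items())
        cidr_match = any(dst in ipaddress.ip_network(c) for c in policy.destination_cidrs)
        if labels_match and cidr_match:
            return policy.egress_ip
    return None
```

With a single policy `EgressPolicy({"app": "billing"}, ["198.51.100.0/24"], "192.168.1.200")`, a pod labeled `app=billing` calling `198.51.100.20` gets `192.168.1.200`, while the same pod calling `8.8.8.8`, or a pod labeled `app=web`, falls through to `None`.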
  26. What problems does a service mesh solve?
      • Provides observability, reliability, and security features[3]
      • Non-invasive for the application
        • The app doesn't need to implement them (auto-instrumented o11y)
      • Transparent to the application
        • The app isn't aware of the service mesh
      [3] Service Mesh post on 矽谷牛的耕田筆記
  27. Concerns & Problems
      • Processing overhead: a sidecar proxy is injected per pod
      • Additional infrastructure complexity
      • Higher latency and operational costs
      • Configuration design complexity and test validity
        • Verifying the service mesh control plane configuration and updates
  28. Sidecarless model - Cilium service mesh
      • From an eBPF advocate's perspective
        • An injected sidecar makes traffic traverse the networking stack 2 extra times, per pod
  30. Sidecarless model - Cilium service mesh
      • From an eBPF advocate's perspective
        • An injected sidecar makes traffic traverse the networking stack 2 extra times, per pod
      • Early stage of optimization
        • If two pods are on the same node, the networking between them can be optimized (Socket Option, Socket Map/BPF)
  31. Sidecarless model - Cilium service mesh
      • Socket-level redirection (instead of network-level redirection)
      • Cilium 1.11: Sidecar-free Service Mesh datapath introduced (instead of the sidecar model)
        • Per-node proxy model
      • Cilium 1.12: a new option for users
  32. Concerns & Problems
      • Per-node proxy[5]
        • Applications are now vulnerable to “noisy neighbor” traffic
        • The blast radius of a proxy is large: proxy failures and upgrades
      • “eBPF, sidecars, and the future of the service mesh” from Buoyant
        • Proxy resource consumption is now highly variable
        • The security story is now far more complex
      [5] Twitter thread
  33. Concerns & Problems
      • Improve the proxy to support multi-tenancy?[6]
        • The complexity is not worth it
      [6] Twitter thread
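  34. Considerations of Adopting
      Adopting eBPF
      • Large Kubernetes clusters with a huge number of Services (>1000)
      • Applications with network performance requirements such as real-time responsiveness and high connection counts
        • Gaming, live streaming, etc.: business scenarios that demand real-time responsiveness or many concurrent online users
      • The network needs certain flexible designs that are hard to implement with iptables
        • Egress IP gateway
        • Keep client source IP, DSR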
  35. Considerations of Adopting
      Adopting Service Mesh in Enterprise
      • Requirements for observability, reliability, and security
        • Proxy-level and service-level metrics / distributed traces / access logs
        • Timeouts, retries, circuit breaking
        • Encrypted traffic between services: mutual TLS
      • The team's overall ability to operate and manage a service mesh
        • Service mesh configuration design complexity
        • Effectiveness of service mesh configuration testing, and verification of its reliability
        • Service mesh update or upgrade plans
  36. Summary & Takeaways
      "What JavaScript is to the browser, eBPF is to the Linux kernel." (Thomas Graf, KubeCon + CloudNativeCon Europe 2022)
      • Cilium notable features
        • kube-proxy replacement
        • DSR
        • Egress IP gateway
        • Sidecar-free Service Mesh
      • Considerations of adopting eBPF & Service Mesh