Slide 1

Slide 1 text

What You Should Know About eBPF Networking on Kubernetes
John Lin, Site Reliability Engineer, LINE Taiwan
@ LINE x KCD Taiwan Meetup #49

Slide 2

Slide 2 text

About Me
• John Lin, Site Reliability Engineer at LINE Taiwan
• Previously Tencent
• Kubernetes, Observability & Networking
• Follow me on Twitter: @johnlin__
LINE x KCD Taiwan Meetup 2

Slide 3

Slide 3 text

We're Hiring: Site Reliability Engineers
Taipei | LINE Taiwan | Engineering System | Remote | Full-time
https://careers.linecorp.com/jobs/1330

Slide 4

Slide 4 text

Qualifications and Skills
• Kubernetes, Cloud Native
• On-premises Private Cloud
• Observability
  • A better behind-the-scenes view that yields insights into a distributed system
• Troubleshooting
  • Tracking down the root causes of problems

Slide 5

Slide 5 text

Last talk at COSCUP ...
• We looked at eBPF-based container networking
  • 4 types of datapath in Kubernetes networking
  • eBPF development
  • eBPF program types
• Some things were glossed over
  • Cilium features and their applications
  • The “sidecarless” model in the world of service mesh
• Today's talk
  • Notable Cilium features on Kubernetes
  • Sidecar model & sidecarless model

Slide 6

Slide 6 text

Why Networking?

Slide 7

Slide 7 text

Outline
• Quick Recap
  • 4 types of datapath
• Cilium features
  • kube-proxy replacement
  • DSR (Direct Server Return)
  • Egress IP Gateway
  • Sidecarless model - Cilium service mesh
• Considerations for Adoption

Slide 8

Slide 8 text

Quick Recap
• The Linux Networking Subsystem
• Linux Kernel Networking
  • iptables

Slide 9

Slide 9 text

Quick Recap
• Kernel Modules for Networking
  • Open vSwitch, IPVS

Slide 10

Slide 10 text

Quick Recap
• Kernel-Bypass Networking
  • DPDK

Slide 11

Slide 11 text

Quick Recap
• Early Hook Points in Kernel Networking
  • XDP (eXpress Data Path)
  • TC ingress/egress

Slide 12

Slide 12 text

Quick Recap
• Early Hook Points in Kernel Networking
  • XDP (eXpress Data Path)
  • TC ingress/egress
• Not kernel bypass

Slide 13

Slide 13 text

Quick Recap
• eBPF will be a game changer for Kubernetes networking
• Cilium is currently the most outstanding eBPF CNI project
• Cilium's 5 important BPF program types
  • tc/BPF, XDP/BPF
  • Cgroup Socket/BPF, Socket Option/BPF, Socket Map/BPF
• Several notable Cilium features

Slide 14

Slide 14 text

Cilium Features
kube-proxy replacement

Slide 15

Slide 15 text

iptables Skills Matter
• Kubernetes networking: iptables is the default mechanism for enforcing networking policy
• Early-stage Docker networking manipulation tool: pipework

Slide 16

Slide 16 text

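kube-proxy (iptables) Replacement
• The Cilium 1.6 release delivered the last piece of the puzzle: removing kube-proxy
• With the eBPF service implementation, latency does not grow as the number of Kubernetes Services grows
• The eBPF service implementation avoids the non-deterministic behavior of packets traversing and matching many rules
• A large number of iptables rules are removed from each Node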
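A toy Python model (not Cilium or kube-proxy code) can make the scaling point above concrete. It assumes each Service is a single rule in one chain scanned top to bottom, and that the eBPF datapath does one hash-map lookup keyed by (VIP, port, protocol); real kube-proxy creates several chains and rules per Service, and Cilium's BPF maps carry more fields. `build_services`, `iptables_lookup`, and `ebpf_lookup` are hypothetical names for this sketch.

```python
from typing import Dict, List, Optional, Tuple

# (virtual IP, port, protocol): a simplified Service key (assumption of this model).
ServiceKey = Tuple[str, int, str]


def build_services(n: int) -> List[ServiceKey]:
    """Hypothetical ClusterIPs, one TCP port-80 Service each."""
    return [(f"10.96.{i // 256}.{i % 256}", 80, "TCP") for i in range(n)]


def iptables_lookup(chain: List[ServiceKey], key: ServiceKey) -> Tuple[Optional[int], int]:
    """Linear scan, like a rule chain: returns (matched index, rules evaluated)."""
    for index, rule in enumerate(chain):
        if rule == key:
            return index, index + 1
    return None, len(chain)


def ebpf_lookup(svc_map: Dict[ServiceKey, int], key: ServiceKey) -> Tuple[Optional[int], int]:
    """One hash-map lookup, regardless of how many Services exist."""
    return svc_map.get(key), 1


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        chain = build_services(n)
        svc_map = {key: i for i, key in enumerate(chain)}
        last = chain[-1]  # worst case for the linear scan
        print(n, iptables_lookup(chain, last)[1], ebpf_lookup(svc_map, last)[1])
```

Running it prints, for 100, 1,000, and 10,000 Services, the number of rules evaluated for the worst-case Service next to the single map lookup.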

Slide 17

Slide 17 text

kube-proxy (iptables) Replacement
• What is the performance impact of replacing iptables rules?
• North-south traffic (N-S Traffic)
  • NodePort
  • ExternalIPs
  • LoadBalancer

Slide 18

Slide 18 text

kube-proxy (iptables) Replacement
• What is the performance impact of replacing iptables rules?
  • NAT is done in tc/BPF before packets enter netfilter
• North-south traffic (N-S Traffic)
  • NodePort
  • ExternalIPs
  • LoadBalancer

Slide 21

Slide 21 text

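kube-proxy (iptables) Replacement
• What is the performance impact of replacing iptables rules?
  • NAT is done in tc/BPF before packets enter netfilter
• North-south traffic (N-S Traffic)
  • NodePort, ExternalIPs & LoadBalancer
• Why not do it in XDP/BPF?
  • tc/BPF has better packet mangling capabilities
  • tc/BPF can be attached on both ingress and egress
  • Implementing it before skb allocation is much more difficult
  • Context structs: struct __sk_buff (tc) vs. struct xdp_md (XDP)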

Slide 22

Slide 22 text

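kube-proxy (iptables) Replacement
• What is the performance impact of replacing iptables rules?
  • NAT is done in tc/BPF before packets enter netfilter
• North-south traffic (N-S Traffic)
  • NodePort, ExternalIPs & LoadBalancer
• Why not do it in XDP/BPF? (Cilium 1.8 already supports it)
  • tc/BPF has better packet mangling capabilities
  • tc/BPF can be attached on both ingress and egress
  • Implementing it before skb allocation is much more difficult
  • Context structs: struct __sk_buff (tc) vs. struct xdp_md (XDP)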
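The hook ordering behind "NAT in tc/BPF before netfilter" and "why not XDP" can be summarized as a small data model. This is a simplified sketch of the receive path only, assuming native-mode XDP; the `Hook` type, the hook names, and the helper functions are hypothetical, while the context structs (`struct xdp_md`, `struct __sk_buff`) are the ones named on the slide.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Hook:
    name: str
    is_ebpf: bool
    ctx: str                    # context struct an attached eBPF program receives
    skb_allocated: bool         # has the kernel built a struct sk_buff yet?
    directions: FrozenSet[str]  # traffic directions the hook can be attached to


# Receive path, in the order a packet meets these hooks (heavily simplified).
RX_PATH: List[Hook] = [
    Hook("xdp", True, "struct xdp_md", False, frozenset({"ingress"})),
    Hook("tc", True, "struct __sk_buff", True, frozenset({"ingress", "egress"})),
    Hook("netfilter_prerouting", False, "n/a (iptables rules)", True, frozenset({"ingress"})),
]


def order(name: str) -> int:
    """Position of a hook on the receive path (0 = earliest)."""
    return [h.name for h in RX_PATH].index(name)


def ebpf_hooks_for(direction: str) -> List[str]:
    """eBPF hooks in this model that can observe traffic in the given direction."""
    return [h.name for h in RX_PATH if h.is_ebpf and direction in h.directions]


if __name__ == "__main__":
    # NAT done at tc (or XDP) happens before iptables sees the packet.
    print(order("tc") < order("netfilter_prerouting"))
    print(ebpf_hooks_for("ingress"), ebpf_hooks_for("egress"))
```

The model makes the trade-off visible: XDP runs earliest but has no skb and only sees received packets, whereas tc sees an skb on both directions.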

Slide 23

Slide 23 text

kube-proxy (iptables) Replacement
• East-west traffic (E-W Traffic)
  • ClusterIP
• What is the performance impact of replacing iptables rules?

Slide 24

Slide 24 text

kube-proxy (iptables) Replacement
• East-west traffic (E-W Traffic)
  • ClusterIP
• What is the performance impact of replacing iptables rules?
  • NAT is done at the socket level, before packets enter netfilter, with Socket Option and Socket Map/BPF
  • Address translation happens in the socket connect(2) system call
  • This avoids per-packet NAT at the lower (IP) layer, which depends on conntrack and causes problems with large numbers of connections
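The difference between translating once per connection and once per packet can be counted with a toy simulation. Everything here is an assumption for illustration: the Service table, the SHA-256 backend hash, and the `pick_backend`/`simulate` helpers are hypothetical, and the model tracks only the state needed for Service translation, not any other connection tracking a real node may perform.

```python
import hashlib
from typing import Dict, List, Tuple

Addr = Tuple[str, int]

# Hypothetical Service table: ClusterIP:port -> backend pod addresses.
SERVICES: Dict[Addr, List[Addr]] = {
    ("10.96.0.10", 80): [("10.0.1.5", 8080), ("10.0.2.7", 8080)],
}


def pick_backend(vip: Addr, flow_id: str) -> Addr:
    """Deterministically hash a flow onto one of the Service's backends."""
    backends = SERVICES[vip]
    digest = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
    return backends[digest % len(backends)]


def simulate(connections: int, packets_per_conn: int) -> Dict[str, int]:
    """Count address rewrites and NAT state for both approaches."""
    vip = ("10.96.0.10", 80)
    socket_rewrites = 0
    packet_rewrites = 0
    conntrack: Dict[str, Addr] = {}  # per-packet NAT must remember each flow's backend

    for c in range(connections):
        flow_id = f"10.0.3.9:{40000 + c}"  # hypothetical client IP:port

        # Socket-level LB: the destination is rewritten once, inside connect(2);
        # every later packet is already addressed to the chosen backend.
        pick_backend(vip, flow_id)
        socket_rewrites += 1

        # Per-packet NAT: every packet is rewritten, using conntrack for consistency.
        for _ in range(packets_per_conn):
            if flow_id not in conntrack:
                conntrack[flow_id] = pick_backend(vip, flow_id)
            packet_rewrites += 1

    return {
        "socket_rewrites": socket_rewrites,
        "packet_rewrites": packet_rewrites,
        "conntrack_entries": len(conntrack),
    }


if __name__ == "__main__":
    print(simulate(connections=1000, packets_per_conn=50))
```

In this model the socket-level approach does work proportional to connections, while per-packet NAT does work proportional to packets and holds one conntrack entry per flow.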

Slide 25

Slide 25 text

Cilium Features
DSR (Direct Server Return)

Slide 26

Slide 26 text

Keep Client Source IP Address
• Kubernetes Networking [1]:
  • L7: X-Forwarded-For (XFF) header
  • L4: Proxy Protocol
  • Service Type: NodePort, externalTrafficPolicy=Local
    • Local: traffic is not forwarded across nodes, so one fewer SNAT happens. Latency is lower, but load balancing is uneven
[1] Keep Client Source IP post on CNTUG
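For the L4 and L7 options above, a minimal Python sketch shows what each one carries: a PROXY protocol v1 text header announcing the original client address, and a helper that recovers the client from an X-Forwarded-For header. The header layout follows the PROXY protocol v1 text format; the function names and the trusted-proxy list are assumptions of this sketch.

```python
import ipaddress
from typing import Optional, Set


def proxy_v1_header(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Build a PROXY protocol v1 (text) header announcing the original client."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if src.version != dst.version:
        raise ValueError("source and destination must be the same IP family")
    family = "TCP4" if src.version == 4 else "TCP6"
    return f"PROXY {family} {src} {dst} {src_port} {dst_port}\r\n".encode("ascii")


def client_from_xff(xff: str, trusted_proxies: Set[str]) -> Optional[str]:
    """Walk X-Forwarded-For right to left and return the first untrusted hop.

    Left-most entries can be forged by the client, so only hops appended
    by proxies you control are skipped.
    """
    hops = [hop.strip() for hop in xff.split(",") if hop.strip()]
    for hop in reversed(hops):
        if hop not in trusted_proxies:
            return hop
    return None


if __name__ == "__main__":
    print(proxy_v1_header("203.0.113.7", "10.0.0.5", 51234, 443))
    print(client_from_xff("198.51.100.1, 203.0.113.7, 10.0.0.2", {"10.0.0.2"}))
```

The PROXY header is written by the L4 proxy at the start of the TCP stream, so the backend must be configured to expect it; XFF only works for HTTP traffic.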

Slide 27

Slide 27 text

Without DSR (SNAT)
• Service Type: NodePort, externalTrafficPolicy=Cluster
• Step 2 performs an SNAT, so the client source IP changes
• Steps 3 and 4 return along the original request path

Slide 28

Slide 28 text

Direct Server Return
• eBPF with DSR
• No SNAT happens in Step 2, so the client source IP is preserved
• Step 3 replies directly to the client
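The two packet walks (SNAT with externalTrafficPolicy=Cluster versus DSR) can be traced as source/destination pairs. This toy model uses hypothetical addresses and simply assumes the backend's node already knows the original NodePort address, so it can rewrite the reply's source; how Cilium conveys that information is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


# Hypothetical addresses for illustration.
CLIENT = "203.0.113.7:51234"
NODEPORT = "192.0.2.11:30080"    # node 1: where the client's request arrives
NODE1_SNAT = "192.0.2.11:61000"  # source address/port node 1 picks when it SNATs
BACKEND = "10.0.2.7:8080"        # backend pod, running on node 2


def without_dsr() -> Tuple[str, List[Packet]]:
    """externalTrafficPolicy=Cluster with SNAT: returns (source the pod sees, packets)."""
    path = [
        Packet(CLIENT, NODEPORT),     # 1: client -> node 1
        Packet(NODE1_SNAT, BACKEND),  # 2: node 1 DNATs and SNATs, forwards to node 2
        Packet(BACKEND, NODE1_SNAT),  # 3: reply returns to node 1
        Packet(NODEPORT, CLIENT),     # 4: node 1 reverses both NATs, replies to client
    ]
    return path[1].src, path


def with_dsr() -> Tuple[str, List[Packet]]:
    """DSR: node 1 only DNATs; the backend's node answers the client itself."""
    path = [
        Packet(CLIENT, NODEPORT),  # 1: client -> node 1
        Packet(CLIENT, BACKEND),   # 2: DNAT only, client source IP preserved
        Packet(NODEPORT, CLIENT),  # 3: reply leaves node 2 directly, source set to NodePort
    ]
    return path[1].src, path


if __name__ == "__main__":
    print("without DSR, pod sees", without_dsr()[0])
    print("with DSR, pod sees", with_dsr()[0])
```

In both cases the client receives a reply from the same NodePort address; only the DSR path lets the pod see the real client and saves the return hop through node 1.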

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Cilium Features
Egress IP Gateway

Slide 31

Slide 31 text

Egress IP Problem
• North-south traffic requirements of enterprise / on-premises users [2]
  • Egress traffic with predictable IP addresses
[2] Envoy Gateway post on CNTUG

Slide 32

Slide 32 text

Egress IP Gateway
• Possible solutions
  1. Service mesh, e.g. istio-egressgateway
  2. Set up a proxy inside/outside of the cluster
     • Envoy as an L7/L4 proxy
  3. NAT gateway
     • Cloud provider

Slide 33

Slide 33 text

Egress IP Gateway
• First released as a beta feature in Cilium 1.10: CiliumEgressNATPolicy CRD
• Promoted to stable in Cilium 1.12: CiliumEgressGatewayPolicy CRD
• eBPF-based SNAT
• Single point of failure?
  • Egress Gateway High Availability (HA), which supports multiple egress nodes
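Conceptually, an egress gateway policy selects pods and destination networks, then SNATs matching traffic to a fixed egress IP. The sketch below models only that decision. The `EgressPolicy` fields are hypothetical and are not the CiliumEgressGatewayPolicy schema, and spreading flows over several egress IPs with a CRC32 hash is an assumption used to illustrate the multi-node HA idea.

```python
import ipaddress
import zlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EgressPolicy:
    """Hypothetical, simplified policy; NOT the CiliumEgressGatewayPolicy schema."""
    pod_labels: Dict[str, str]    # pods must carry all of these labels
    destination_cidrs: List[str]  # traffic to these networks is redirected
    egress_ips: List[str]         # one per gateway node; more than one models HA


def egress_source_ip(policies: List[EgressPolicy], pod_ip: str,
                     pod_labels: Dict[str, str], dst_ip: str) -> Optional[str]:
    """Return the source IP to SNAT to, or None if no policy applies."""
    dst = ipaddress.ip_address(dst_ip)
    for policy in policies:
        labels_match = all(pod_labels.get(k) == v for k, v in policy.pod_labels.items())
        cidr_match = any(dst in ipaddress.ip_network(c) for c in policy.destination_cidrs)
        if labels_match and cidr_match:
            # Stable hash: a given flow always leaves through the same egress IP.
            flow = f"{pod_ip}->{dst_ip}".encode()
            return policy.egress_ips[zlib.crc32(flow) % len(policy.egress_ips)]
    return None  # no egress-gateway SNAT; the default datapath applies


if __name__ == "__main__":
    billing = EgressPolicy({"app": "billing"}, ["198.51.100.0/24"], ["192.0.2.100"])
    print(egress_source_ip([billing], "10.0.1.5", {"app": "billing"}, "198.51.100.20"))
    print(egress_source_ip([billing], "10.0.1.5", {"app": "web"}, "198.51.100.20"))
```

The predictable source IP is what lets an external firewall or partner allow-list the cluster's traffic.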

Slide 34

Slide 34 text

Cilium Features
Sidecarless model - Cilium service mesh

Slide 35

Slide 35 text

What problems does a service mesh solve?
• Provides observability, reliability, and security features [3]
• Non-invasive for the application
  • The app doesn't need to implement these itself (auto-instrumented o11y)
• Transparent to the application
  • The app isn't aware of the service mesh
[3] Service Mesh post on 矽谷牛的耕田筆記

Slide 36

Slide 36 text

Concerns & Problems
• Processing overhead: a sidecar proxy is injected per pod
• Additional infrastructure complexity
• Higher latency and operational costs
• Configuration design complexity and test validity
  • Verifying the service mesh control plane configuration and updates

Slide 37

Slide 37 text

Sidecarless model - Cilium service mesh
• From an eBPF advocate's perspective
  • Injecting a sidecar means traversing the networking stack 2 more times, per pod

Slide 39

Slide 39 text

Sidecarless model - Cilium service mesh
• From an eBPF advocate's perspective
  • Injecting a sidecar means traversing the networking stack 2 more times, per pod
• Early stage of optimization
  • If two pods are on the same node, the networking between them can be optimized (Socket Option, Socket Map/BPF)
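The "2 extra networking stack traversals per pod" claim and the same-node socket optimization can be checked with a rough counting model. The counting rule is an assumption of this sketch: each hop between two sockets that goes through the TCP/IP stack costs one pass when leaving and one when arriving, and a socket-level (sockmap-style) redirect between local sockets costs none. `stack_traversals` is a hypothetical helper.

```python
def stack_traversals(sidecar: bool, same_node: bool, socket_redirect: bool) -> int:
    """Rough count of TCP/IP stack passes for one request from pod A to pod B."""

    def hop(local: bool) -> int:
        # A redirect between local sockets skips the stack; otherwise leave + arrive.
        return 0 if (local and socket_redirect) else 2

    passes = 0
    if sidecar:
        passes += hop(local=True)   # app A -> sidecar A over loopback
    passes += hop(local=same_node)  # pod A -> pod B
    if sidecar:
        passes += hop(local=True)   # sidecar B -> app B over loopback
    return passes


if __name__ == "__main__":
    base = stack_traversals(sidecar=False, same_node=False, socket_redirect=False)
    meshed = stack_traversals(sidecar=True, same_node=False, socket_redirect=False)
    print(base, meshed, (meshed - base) // 2)  # 2 6 2: two extra passes per pod
```

Under these assumptions, socket-level redirection removes the loopback passes entirely, and for two pods on the same node the request never touches the TCP/IP stack.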

Slide 40

Slide 40 text

Sidecarless model - Cilium service mesh
• Socket-level redirection (instead of network-level redirection)
• Cilium 1.11: the Sidecar-free Service Mesh datapath was introduced (instead of the sidecar model)
  • Per-node proxy model
• Cilium 1.12: a new option for users

Slide 41

Slide 41 text

Concerns & Problems
• Per-node proxy [5]
  • Applications are now vulnerable to “noisy neighbor” traffic
  • The blast radius of a proxy is large: proxy failures and upgrades
  • Proxy resource consumption is now highly variable
  • The security story is now far more complex
• See also: “eBPF, sidecars, and the future of the service mesh” from Buoyant
[5] Twitter thread
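The blast-radius and noisy-neighbor concerns can be illustrated by comparing how many pods lose their proxy when a single proxy fails or restarts for an upgrade. The pod placement and the naming convention (`<app>-<n>` pods, proxies named after their pod or their node) are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical placement: node -> pods scheduled on it ("<app>-<n>").
PLACEMENT: Dict[str, List[str]] = {
    "node-a": ["checkout-1", "cart-1", "search-1", "search-2"],
    "node-b": ["checkout-2", "cart-2"],
}


def pods_affected(model: str, failed_proxy: str) -> List[str]:
    """Pods that lose their proxy when one proxy fails or restarts.

    Sidecar model: a proxy is named after its pod.
    Per-node model: a proxy is named after its node.
    """
    if model == "sidecar":
        return [pod for pods in PLACEMENT.values() for pod in pods if pod == failed_proxy]
    if model == "per-node":
        return list(PLACEMENT.get(failed_proxy, []))
    raise ValueError(f"unknown model: {model}")


def apps_sharing_proxy(node: str) -> Set[str]:
    """Applications whose traffic shares one per-node proxy (noisy-neighbor exposure)."""
    return {pod.rsplit("-", 1)[0] for pod in PLACEMENT.get(node, [])}


if __name__ == "__main__":
    print(pods_affected("sidecar", "search-1"))
    print(pods_affected("per-node", "node-a"))
    print(sorted(apps_sharing_proxy("node-a")))
```

A sidecar failure touches one pod, while a per-node proxy failure or upgrade touches every pod on that node, across unrelated applications.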

Slide 42

Slide 42 text

Concerns & Problems
• Improve the proxy to support multi-tenancy? [6]
  • The complexity is not worth it
[6] Twitter thread

Slide 43

Slide 43 text

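Considerations for Adoption
Adopting eBPF
• Large Kubernetes clusters with a huge number of Services (>1000)
• Applications with real-time or high-connection-count networking performance requirements
  • Business scenarios such as gaming and live streaming that demand real-time responsiveness or many concurrent online users
• Networking that needs flexible designs which are hard to implement with iptables
  • Egress IP gateway
  • Keep client source IP, DSR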

Slide 44

Slide 44 text

Considerations for Adoption
Adopting Service Mesh in the Enterprise
• Requirements for observability, reliability, and security
  • Proxy-level and service-level metrics / distributed traces / access logs
  • Timeout, retry, circuit breaking
  • Encrypted traffic between services: mutual TLS
• The team's overall ability to operate and manage a service mesh
  • Service mesh configuration design complexity
  • Validity of service mesh configuration tests, and verification of reliability
  • Service mesh update or upgrade plan
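Timeout, retry, and circuit breaking are what a mesh proxy applies on the application's behalf. The Python sketch below shows the general technique (a consecutive-failure circuit breaker wrapped around a retry loop); it is not how Envoy or any particular mesh implements it. `CircuitBreaker`, `call_with_retries`, and their parameters are hypothetical, and the caller-supplied function is assumed to enforce the timeout it is given.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Consecutive-failure breaker: closed -> open -> half-open after cool-down -> closed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: let calls through
        # Open: reject until the cool-down elapses, then allow a trial (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None  # success closes the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open the circuit


def call_with_retries(fn: Callable[[float], T], breaker: CircuitBreaker,
                      attempts: int = 3, timeout: float = 0.5) -> T:
    """Retry fn(timeout) on timeouts/connection errors, consulting the breaker first."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(timeout)
        except (TimeoutError, ConnectionError) as err:
            breaker.record(False)
            last_error = err
            continue
        breaker.record(True)
        return result
    raise last_error or ValueError("attempts must be >= 1")
```

Injecting `clock` lets tests drive the open and half-open transitions without sleeping; a real client would usually also add backoff between retries.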

Slide 45

Slide 45 text

Summary & Takeaways
“What JavaScript is to the browser, eBPF is to the Linux kernel” (Thomas Graf, KubeCon + CloudNativeCon Europe 2022)
• Notable Cilium features
  • kube-proxy replacement
  • DSR
  • Egress IP gateway
  • Sidecar-free Service Mesh
• Considerations for adopting eBPF & Service Mesh
