Slide 1

Slide 1 text

Linux at Cloudflare Marek Majkowski

Slide 2

Slide 2 text

Global network, speed and security

Slide 3

Slide 3 text

Edge Core

Slide 4

Slide 4 text

Edge Network - locations

Slide 5

Slide 5 text

Edge Network - anycast network

Slide 6

Slide 6 text

Edge Network - software management
● Uniform configuration everywhere
● No virtualization, no containers, raw metal
● Thousands of IP addresses/subnets (anycast)
● Multiple applications
  ○ HTTP (HTTP, TLS 1.3, HTTP/2, QUIC)
  ○ DNS (Auth, Resolver)
  ○ Other

Slide 7

Slide 7 text

https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/

Slide 8

Slide 8 text

https://twitter.com/eastdakota/statu https://blog.cloudflare.com/arm-tak

Slide 9

Slide 9 text

Edge Network - uniform stack anycast origin fetch

Slide 10

Slide 10 text

Edge Network - uniform software

Slide 11

Slide 11 text

Edge Network - uniform software

Slide 12

Slide 12 text

DDoS mitigation
● iptables: xt_bpf, connlimit, hashlimits, ipsets
● SYN cookies
● SO_FILTER
● XDP (locks!)
● https://lists.openwall.net/netdev/2019/02/22/87
● https://patchwork.ozlabs.org/cover/998940/
● https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
● https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf

Slide 13

Slide 13 text

DDoS mitigation - XDP
● Implementing token buckets without locking is hard
● More concurrency primitives (cmpxchg16b?)
● "Traffic policing in eBPF: applying token bucket algorithm"
  ○ http://vger.kernel.org/lpc-bpf.html#session-9
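A token bucket holds two fields that must change together: the token count and the last-refill timestamp. The sketch below shows the usual lock-free shape (load, compute, compare-and-swap, retry). It is an illustration only, not the XDP code from the talk. `AtomicCell` is a hypothetical stand-in for a single wide atomic word of the kind cmpxchg16b would provide, and the lock inside it exists only to simulate one atomic instruction in Python.

```python
import threading
from typing import NamedTuple


class BucketState(NamedTuple):
    tokens: int   # tokens currently available
    last_ns: int  # time of the last refill, in nanoseconds


class AtomicCell:
    """Hypothetical stand-in for a 128-bit atomic word (not a real eBPF API)."""

    def __init__(self, value: BucketState) -> None:
        self._value = value
        self._guard = threading.Lock()  # simulates the atomicity of one CAS

    def load(self) -> BucketState:
        return self._value

    def compare_and_swap(self, expected: BucketState, new: BucketState) -> bool:
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def try_consume(cell: AtomicCell, now_ns: int, rate_per_s: int, burst: int) -> bool:
    """Take one token if available; retry when another CPU updated the state."""
    while True:
        old = cell.load()
        elapsed = max(0, now_ns - old.last_ns)
        refill = elapsed * rate_per_s // 1_000_000_000
        tokens = min(burst, old.tokens + refill)
        # The rounding remainder is dropped for brevity.
        last_ns = now_ns if refill > 0 else old.last_ns
        if tokens == 0:
            return False  # over the limit: the packet would be dropped
        if cell.compare_and_swap(old, BucketState(tokens - 1, last_ns)):
            return True
```

Both fields are published in a single swap, which is why a 16-byte compare-and-exchange is attractive: with two separate 8-byte updates another CPU could observe a new token count paired with an old timestamp.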

Slide 14

Slide 14 text

Load balancing in XDP
● XDP
● socket lookup
● SYN cookies
● L4LB
● https://lists.openwall.net/netdev/2019/02/22/87
● https://patchwork.ozlabs.org/cover/998940/
● https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
● https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf

Slide 15

Slide 15 text

Socket dispatch

Slide 16

Slide 16 text

Socket dispatch - DoS considerations
● The case of 30k UDP sockets
● Solution: eBPF token bucket in SO_FILTER

Diagram labels: 192.0.2.2:53, 192.0.2.1:53, 192.0.2.0:53, 0.0.0.0:53 + SO_FILTER

Slide 17

Slide 17 text

Socket dispatch - zero downtime restart for QUIC
● Connected UDP sockets for zero-downtime server restart

:443
  10.0.0.1 --> 192.0.2.0:443
  10.0.0.2 --> 192.0.2.0:443
:443
  10.2.0.9 --> 192.0.2.0:443
  10.3.0.8 --> 192.0.2.0:443
:443
:443
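The connected-socket idea can be tried on loopback. This is a minimal sketch assuming Linux UDP lookup behavior and a usable 127.0.0.1; it is not the server code from the talk. The helper names are made up for the example, and an ephemeral port stands in for :443.

```python
import socket


def bound_udp(addr):
    """UDP socket bound to addr, with SO_REUSEADDR so the port can be shared."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(addr)
    s.settimeout(2.0)
    return s


def demo():
    # Unconnected socket: receives every flow that has no dedicated socket.
    listener = bound_udp(("127.0.0.1", 0))  # port 0: kernel picks a free port
    server_addr = listener.getsockname()

    client_a = bound_udp(("127.0.0.1", 0))
    client_b = bound_udp(("127.0.0.1", 0))

    # Pin client A's flow to a connected socket on the same local address.
    flow_a = bound_udp(server_addr)
    flow_a.connect(client_a.getsockname())

    client_a.sendto(b"from A", server_addr)
    client_b.sendto(b"from B", server_addr)

    got_a = flow_a.recv(64)    # the connected socket wins for A's 4-tuple
    got_b = listener.recv(64)  # everything else still reaches the listener

    for s in (listener, flow_a, client_a, client_b):
        s.close()
    return got_a, got_b
```

Because existing flows stay pinned to the sockets that own them, a new server generation can open its own unconnected socket and take only new flows. Packets that arrive between `bind()` and `connect()` on the new socket can still be misdelivered, which this sketch ignores.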

Slide 18

Slide 18 text

Socket dispatch - AnyIP
● heavy user of AnyIP https://blog.cloudflare.com/how-we-built-spectrum/
● SO_BINDTOPREFIX http://patchwork.ozlabs.org/patch/602916/
● TPROXY https://blog.cloudflare.com/how-we-built-spectrum/
● TPROXY UDP

              Single IP   Many subnets
Single port   bind()      SO_BINDTOPREFIX
Many ports    TPROXY      TPROXY

Slide 19

Slide 19 text

Better socket dispatch
● Can we have "inet_lookup" in eBPF? - solve handover
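To make the question concrete, here is a toy user-space model of what a programmable lookup could decide: map a destination (prefix, port range) to a socket. Everything in it is hypothetical, including the `Rule` type, the socket names, and the longest-prefix tie-break. It does not describe an existing kernel interface.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    prefix: ipaddress.IPv4Network
    ports: range
    socket_name: str  # stands in for a reference to a listening socket


def lookup(rules: List[Rule], dst_ip: str, dst_port: int) -> Optional[str]:
    """Return the socket for a packet, preferring the most specific prefix."""
    ip = ipaddress.ip_address(dst_ip)
    matches = [r for r in rules if ip in r.prefix and dst_port in r.ports]
    if not matches:
        return None  # fall through to the regular socket lookup
    return max(matches, key=lambda r: r.prefix.prefixlen).socket_name


RULES = [
    Rule(ipaddress.ip_network("192.0.2.0/24"), range(53, 54), "dns"),
    Rule(ipaddress.ip_network("192.0.2.7/32"), range(53, 54), "dns-canary"),
    Rule(ipaddress.ip_network("198.51.100.0/24"), range(1, 65536), "proxy"),
]
```

A single rule table of this kind covers every cell of the AnyIP matrix (single or many ports, single IP or many subnets). Handover between server generations then becomes a matter of swapping which socket a rule points at.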

Slide 20

Slide 20 text

Transmission path - TPROXY UDP send()
● Sending packets is hard

                                     src IP    src port   dst IP     dst port
connected socket                     -         -          -          -
bind(INADDR_ANY)                     auto      -          selected   selected
bind(INADDR_ANY) + IP_PKTINFO        PKTINFO   -          selected   selected
bind(127.0.0.1, X) + IP_PKTINFO      PKTINFO   in bind    selected   selected
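The IP_PKTINFO rows of the table correspond to passing a per-packet source address as ancillary data on `sendmsg()`. A minimal sketch for Linux follows. The helper names are made up for the example, and the fallback constant 8 is the Linux value of IP_PKTINFO, used when the Python build does not export it.

```python
import socket
import struct

# Linux value of IP_PKTINFO; older Python versions do not export the constant.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)


def pktinfo_cmsg(src_ip: str, ifindex: int = 0):
    """Build the control message asking the kernel to send from src_ip."""
    # struct in_pktinfo { int ipi_ifindex; in_addr ipi_spec_dst; in_addr ipi_addr; }
    # On transmit, ipi_spec_dst carries the desired source address.
    payload = struct.pack("=i4s4s", ifindex, socket.inet_aton(src_ip), bytes(4))
    return (socket.IPPROTO_IP, IP_PKTINFO, payload)


def send_from(sock: socket.socket, data: bytes, src_ip: str, dst) -> int:
    """Send one datagram to dst, choosing src_ip for this packet only."""
    return sock.sendmsg([data], [pktinfo_cmsg(src_ip)], 0, dst)
```

Usage would look like `send_from(sock, b"reply", "192.0.2.1", ("10.0.0.1", 4242))` on an unconnected socket. The source port still comes from the socket's own binding, which is the "in bind" column in the last row of the table.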

Slide 21

Slide 21 text

accept() - EPOLL_RR by Jason Baron
● https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
● https://www.mail-archive.com/[email protected]/msg8316

Slide 22

Slide 22 text

SOCKMAP for TCP splicing + kTLS
● https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/
● Compared with: splice, sendfile

Slide 23

Slide 23 text

Tuning for the Internet
● Custom initcwnd, BPF_SOCK_OPT
● BBR
● TCP_NOTSENT_LOWAT
● TCP Fast Open
● ECN
● TCP Multipath
● QUIC - tuning for UDP, like UDP GSO
● More introspection - Listen Drops
https://blog.cloudflare.com/http-2-prioritization-with-nginx/
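Several of these knobs are plain per-socket options. The sketch below tries three of them and reports which ones the running kernel accepted. The values (16 KiB low-water mark, Fast Open queue of 128) are illustrative assumptions rather than settings from the talk, and the numeric fallbacks are the Linux option numbers.

```python
import socket

# Linux option numbers, used when the Python build does not export the names.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)
TCP_FASTOPEN = getattr(socket, "TCP_FASTOPEN", 23)
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def tune(sock, notsent_lowat=16384, tfo_queue=128, congestion=b"bbr"):
    """Try each option and return {option name: whether the kernel accepted it}."""
    attempts = {
        "TCP_NOTSENT_LOWAT": (TCP_NOTSENT_LOWAT, notsent_lowat),
        "TCP_FASTOPEN": (TCP_FASTOPEN, tfo_queue),
        "TCP_CONGESTION": (TCP_CONGESTION, congestion),
    }
    applied = {}
    for name, (option, value) in attempts.items():
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = True
        except OSError:
            # For example, the bbr module is not loaded or the OS lacks the option.
            applied[name] = False
    return applied
```

The function would typically be called on a listening socket before `listen()`. A custom initcwnd is not covered here because it is not a plain socket option; it is set per route or from a BPF program.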

Slide 24

Slide 24 text

Prometheus - ebpf_exporter
● Backend for Prometheus
● Allowing more detailed event views - like histograms for block I/O
● https://blog.cloudflare.com/introducing-ebpf_exporter/
● Matt Bostock's SREcon17 talk
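A latency histogram of this kind usually counts samples in power-of-two buckets and exposes them to Prometheus as cumulative "less than or equal" counts. The sketch below shows only that arithmetic in user space. It is not ebpf_exporter's code, and the function names are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def log2_bucket(value_us: int) -> int:
    """Inclusive upper bound of the power-of-two bucket that holds value_us."""
    if value_us <= 1:
        return 1
    return 1 << (value_us - 1).bit_length()


def to_cumulative(samples_us: Iterable[int]) -> List[Tuple[int, int]]:
    """Convert raw latencies into (le, cumulative count) pairs."""
    counts = Counter(log2_bucket(v) for v in samples_us)
    total = 0
    buckets = []
    for le in sorted(counts):
        total += counts[le]
        buckets.append((le, total))
    return buckets
```

For example, `to_cumulative([1, 3, 4, 5, 1000])` yields `[(1, 1), (4, 3), (8, 4), (1024, 5)]`. A scraper only needs the handful of bucket counters, not every event.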

Slide 25

Slide 25 text

Upgrading kernel
● LTS, inertia
● out-of-tree drivers age fast (we were stuck on 3.18)
  ○ fusion-io, sfc / ixgbe, netmap, glb-redirect
  ○ custom patches (EPOLL_RR, SO_BINDTOPREFIX, XDP features)
● hardware issues
  ○ microcode bugs
  ○ driver regressions
● regressions
  ○ SO_FILTER https://www.spinics.net/lists/netdev/msg555565.html
  ○ nf_conncount https://www.spinics.net/lists/netfilter-devel/msg57316.html
  ○ nf_nat_cleanup_conntrack https://bugzilla.kernel.org/show_bug.cgi?id=196821
  ○ systemd disabling TSO https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/

Slide 26

Slide 26 text

BPF is everywhere
● XDP for DDoS
● XDP for load balancing
● xt_bpf on iptables
● SO_FILTER for application DDoS
● SOCKMAP + kTLS within application
● ebpf_exporter for metrics
● future: BPF_SOCK_OPTS
● future: cgroups
● future: SO_BINDTOBPF ?

Slide 27

Slide 27 text

Thank you!

Slide 29

Slide 29 text

Kernel bypass for DDoS
● https://blog.cloudflare.com/kernel-bypass/
● https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/

Slide 30

Slide 30 text

Kernel bypass for application
Chart labels: CPU, I/O 10%?

Slide 31

Slide 31 text

Kernel bypass for application
● Rarely kernel-bound
● Kernel features
  ○ iptables for DDoS
  ○ SYN cookies
  ○ RFC4821 tcp_mtu_probing
  ○ BBR
  ○ Kernel Debuggability
  ○ tcpdump, sampling

Slide 33

Slide 33 text

Core
● A couple of large locations
● Marathon + Mesos (custom load balancers, discovery)
● Kubernetes (vxlan)
● Kafka, ClickHouse, Ceph, HBase, PostgreSQL / CitusDB
https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/

Slide 35

Slide 35 text

Dual 25Gbps OCP

Slide 36

Slide 36 text

Bonded 50Gbps?

Slide 37

Slide 37 text

What about NIC tuning, MTU and QoS

                          traffic type       MTU     QoS: saturation
north (public) anycast    eyeball requests   1500    inbound: attack; outbound: traffic spike
south (public) origin     origin pulls       1500    -
east-west - L4LB          inbound requests   1544    -
east-west - cache         cache traffic      jumbo   hot assets

● MTU is hard
● LRO can get disabled on large MSS
● tc qdiscs work better on a physical device

Slide 40

Slide 40 text

dm-crypt/LUKS saga https://www.spinics.net/lists/dm-crypt/msg07517.html

Slide 41

Slide 41 text

xfs_reclaim https://marc.info/?l=linux-xfs&m=154345788829830&w=2

Slide 44

Slide 44 text

microcode bugs ● https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/

Slide 45

Slide 45 text

mprotect() race https://github.com/torvalds/linux/commit/e86f15ee64d8ee4

Slide 46

Slide 46 text

microcode bugs ● RETPOLINE