Linux at Cloudflare

majek04
March 22, 2019

Transcript

  1. Linux at Cloudflare Marek Majkowski

  2. Global network, speed and security

  3. Edge Core

  4. Edge Network - locations

  5. Edge Network - anycast network

  6. Edge Network - software management
    • Uniform configuration everywhere
    • No virtualization, no containers, raw metal
    • Thousands of IP addresses/subnets (anycast)
    • Multiple applications
      ◦ HTTP (HTTP, TLS 1.3, HTTP2, QUIC)
      ◦ DNS (Auth, Resolver)
      ◦ Other
  7. https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/

  8. https://twitter.com/eastdakota/statu https://blog.cloudflare.com/arm-tak

  9. Edge Network - uniform stack anycast origin fetch

  10. Edge Network - uniform software

  11. Edge Network - uniform software

  12. DDoS mitigation
    • iptables: xt_bpf, connlimit, hashlimits, ipsets
    • syn cookies
    • SO_FILTER
    • XDP (locks!)
    https://lists.openwall.net/netdev/2019/02/22/87
    https://patchwork.ozlabs.org/cover/998940/
    https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf
  13. DDoS mitigation - XDP
    • Implementing token buckets without locking is hard
    • More concurrency primitives (cmpxchg16b?)
    • "Traffic policing in eBPF: applying token bucket algorithm"
      ◦ http://vger.kernel.org/lpc-bpf.html#session-9
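
Slide 13 says a token bucket is hard to implement without locks. One likely reason (an assumption here, not something the slides spell out) is that a bucket has two fields, the token count and the last refill time, that must change together, which is where a wider compare-and-swap such as cmpxchg16b would help. The Python sketch below only emulates that idea: `Cell.compare_and_swap` is a hypothetical stand-in for an atomic instruction, and `allow`, `rate_per_s`, `burst` and `max_retries` are illustrative names, not taken from the talk.

```python
NS_PER_S = 1_000_000_000


class Cell:
    """Hypothetical stand-in for one atomically updated memory cell."""

    def __init__(self, value):
        self.value = value

    def compare_and_swap(self, expected, new):
        # Emulates an atomic CAS: store `new` only if nobody changed the cell.
        if self.value == expected:
            self.value = new
            return True
        return False


def allow(cell, now_ns, rate_per_s, burst, max_retries=8):
    """Return True if one packet may pass, False if it should be dropped.

    The cell holds (tokens, last_ns). Tokens are kept in units of 1/NS_PER_S
    of a token so that short intervals do not round the refill down to zero.
    """
    for _ in range(max_retries):
        old = cell.value
        tokens, last_ns = old
        elapsed = max(0, now_ns - last_ns)
        tokens = min(burst * NS_PER_S, tokens + elapsed * rate_per_s)
        if tokens < NS_PER_S:
            return False  # empty bucket: drop without touching shared state
        new = (tokens - NS_PER_S, max(now_ns, last_ns))
        if cell.compare_and_swap(old, new):
            return True
        # Another CPU updated the bucket first; recompute from the new state.
    return False  # assumption: fail closed when contention persists


bucket = Cell((5 * NS_PER_S, 0))  # full bucket of 5 tokens at t=0
burst_result = [allow(bucket, 0, rate_per_s=1, burst=5) for _ in range(6)]
# burst_result == [True, True, True, True, True, False]
later_result = [allow(bucket, 2 * NS_PER_S, rate_per_s=1, burst=5) for _ in range(3)]
# later_result == [True, True, False]
```

Both fields travel in one compare-and-swap, so a concurrent update is detected and retried rather than half-applied. The retry loop is bounded because an eBPF program cannot loop indefinitely.
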
  14. L4LB
    Load balancing in XDP
    • XDP
    • socket lookup
    • syn cookies
    https://lists.openwall.net/netdev/2019/02/22/87
    https://patchwork.ozlabs.org/cover/998940/
    https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf
  15. Socket dispatch

  16. Socket dispatch - DoS considerations
    • The case of 30k UDP sockets
    • Solution: ebpf token bucket in SO_FILTER
    192.0.2.2:53
    192.0.2.1:53
    192.0.2.0:53
    0.0.0.0:53 + SO_FILTER
  17. Socket dispatch - zero downtime restart for QUIC
    • Connected UDP sockets for zero-downtime server restart
    :443
    10.0.0.1 --> 192.0.2.0:443
    10.0.0.2 --> 192.0.2.0:443
    :443
    10.2.0.9 --> 192.0.2.0:443
    10.3.0.8 --> 192.0.2.0:443
    :443
    :443
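
Slide 17 builds on how Linux picks a UDP socket for an incoming datagram: a socket connect()ed to the sending peer is a more specific match than an unconnected socket on the same local address and port. Below is a minimal loopback sketch of that behaviour, assuming Linux and SO_REUSEADDR on both sockets. The helpers `listener` and `flow_socket` are hypothetical names, and this is not presented as Cloudflare's actual restart procedure.

```python
import socket


def listener(addr):
    """Unconnected UDP socket: receives datagrams from any peer."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(addr)
    s.settimeout(2)
    return s


def flow_socket(local_addr, peer_addr):
    """Connected UDP socket sharing local_addr, dedicated to one peer."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(local_addr)
    s.connect(peer_addr)
    s.settimeout(2)
    return s


server = listener(("127.0.0.1", 0))
server_addr = server.getsockname()

client_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client_a.bind(("127.0.0.1", 0))
client_b.bind(("127.0.0.1", 0))

# The first packet of flow A lands on the unconnected listener.
client_a.sendto(b"a1", server_addr)
first, peer_a = server.recvfrom(64)

# Pin flow A to its own connected socket, as an old server process might.
flow_a = flow_socket(server_addr, peer_a)

client_a.sendto(b"a2", server_addr)  # existing flow
client_b.sendto(b"b1", server_addr)  # new flow
on_flow_a = flow_a.recv(64)
on_listener, peer_b = server.recvfrom(64)
```

In a restart scenario the old process would keep sockets like `flow_a` while a new process takes over the unconnected `:443` socket. Note the window between bind() and connect() in `flow_socket`, during which the new socket is itself unconnected and may receive packets from other peers.
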
  18. Socket dispatch - AnyIP
    • heavy user of AnyIP https://blog.cloudflare.com/how-we-built-spectrum/
    • SO_BINDTOPREFIX http://patchwork.ozlabs.org/patch/602916/
    • TPROXY https://blog.cloudflare.com/how-we-built-spectrum/
    • TPROXY UDP

                   Single IP    Many subnets
    Single port    bind()       SO_BINDTOPREFIX
    Many ports     TPROXY       TPROXY
  19. Better socket dispatch
    • Can we have "inet_lookup" in ebpf? - solve handover
  20. Transmission path - TPROXY UDP send()
    • Sending packets is hard

                                       src IP     src port   dst IP     dst port
    connected socket                   -          -          -          -
    bind(INADDR_ANY)                   auto       -          selected   selected
    bind(INADDR_ANY) + IP_PKTINFO      PKTINFO    -          selected   selected
    bind(127.0.0.1, X) + IP_PKTINFO    PKTINFO    in bind    selected   selected
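
The `bind(INADDR_ANY) + IP_PKTINFO` row of slide 20 can be tried on loopback. The sketch below assumes Linux, where every 127.0.0.0/8 address is local, and passes the source address per packet as ancillary data to sendmsg(). The helper `send_from` and the address 127.0.0.2 are illustrative choices, not from the talk.

```python
import socket
import struct

# Linux value of IP_PKTINFO, used if this Python build lacks the constant.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)


def send_from(sock, payload, src_ip, dst):
    """Send one datagram to dst, asking the kernel to use src_ip as source."""
    # struct in_pktinfo { int ipi_ifindex; in_addr ipi_spec_dst; in_addr ipi_addr; }
    pktinfo = struct.pack(
        "=i4s4s",
        0,                            # ipi_ifindex: 0 lets routing pick the device
        socket.inet_aton(src_ip),     # ipi_spec_dst: source address for this packet
        socket.inet_aton("0.0.0.0"),  # ipi_addr: left zero here (assumed unused on send)
    )
    cmsg = [(socket.IPPROTO_IP, IP_PKTINFO, pktinfo)]
    return sock.sendmsg([payload], cmsg, 0, dst)


receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2)

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.bind(("0.0.0.0", 0))  # source IP not fixed by bind()

send_from(sender, b"hello", "127.0.0.2", receiver.getsockname())
data, (seen_ip, seen_port) = receiver.recvfrom(64)
```

The receiver should see the packet arriving from 127.0.0.2, while the source port is still whatever the sender's bind() selected. The destination address and port are given explicitly on each send.
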
  21. accept() - EPOLL_RR by Jason Baron
    https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
    https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg8316
    EPOLL_RR

  22. SOCKMAP for TCP splicing + kTLS https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/ splice sendfile

  23. Tuning for the Internet
    • Custom initcwnd, BPF_SOCK_OPT
    • BBR
    • TCP_NOTSENT_LOWAT
    • TCP Fast Open
    • ECN
    • TCP Multipath
    • QUIC - tuning for UDP, like UDP GSO
    • More introspection - Listen Drops
    https://blog.cloudflare.com/http-2-prioritization-with-nginx/
  24. Prometheus - ebpf_exporter
    • Backend for Prometheus
    • Allowing more detailed event views - like histograms for block I/O
    • https://blog.cloudflare.com/introducing-ebpf_exporter/
    Matt Bostock's SREcon17 talk
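
Slide 24 mentions histograms for block I/O. As a rough illustration of what such a histogram holds, the sketch below groups latencies into power-of-two buckets, a common choice in eBPF tooling. The bucketing scheme, the names `log2_bucket` and `histogram`, and the sample latencies are all assumptions made for this example. None of it is ebpf_exporter's own code.

```python
from collections import Counter


def log2_bucket(value_us):
    """Return the upper bound of the power-of-two bucket holding value_us."""
    if value_us <= 1:
        return 1
    return 1 << (value_us - 1).bit_length()


def histogram(latencies_us):
    """Count samples per bucket, keyed by the bucket's upper bound in µs."""
    counts = Counter(log2_bucket(v) for v in latencies_us)
    return dict(sorted(counts.items()))


samples_us = [3, 4, 5, 90, 100, 128, 129, 2000]  # made-up block I/O latencies
hist = histogram(samples_us)
# hist == {4: 2, 8: 1, 128: 3, 256: 1, 2048: 1}
```

Compared with a single average, the buckets keep slow outliers such as the 2000 µs sample visible while needing only a handful of counters.
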
  25. Upgrading kernel
    • LTS, inertia
    • out-of-tree drivers age fast (we were stuck on 3.18)
      ◦ fusion-io, sfc / ixgbe, netmap, glb-redirect
      ◦ custom patches (EPOLL_RR, SO_BINDTOPREFIX, XDP features)
    • hardware issues
      ◦ microcode bugs
      ◦ driver regressions
    • regressions
      ◦ SO_FILTER https://www.spinics.net/lists/netdev/msg555565.html
      ◦ nf_conncount https://www.spinics.net/lists/netfilter-devel/msg57316.html
      ◦ nf_nat_cleanup_conntrack https://bugzilla.kernel.org/show_bug.cgi?id=196821
      ◦ systemd disabling TSO https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/
  26. BPF is everywhere
    • XDP for DDoS
    • XDP for load balancing
    • xt_bpf on iptables
    • SO_FILTER for application DDoS
    • SOCKMAP + kTLS within application
    • ebpf_exporter for metrics
    • future: BPF_SOCK_OPTS
    • future: cgroups
    • future: SO_BINDTOBPF ?
  27. Thank you!

  28. None
  29. Kernel bypass for DDoS https://blog.cloudflare.com/kernel-bypass/ https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/

  30. CPU Kernel bypass for application I/O 10%?

  31. Kernel bypass for application
    • Rarely kernel-bound
    • Kernel features
      ◦ iptables for DDoS
      ◦ SYN cookies
      ◦ RFC4821 tcp_mtu_probing
      ◦ BBR
      ◦ Kernel Debuggability
      ◦ tcpdump, sampling
  32. None
  33. Core
    • A couple of large locations
    • Marathon + Mesos (custom load balancers, discovery)
    • Kubernetes (vxlan)
    • Kafka, ClickHouse, Ceph, HBase, PostgreSQL / CitusDB
    https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
  34. None
  35. Dual 25Gbps OCP

  36. Bonded 50Gbps?

  37. What about NIC tuning, MTU and QoS

                              traffic type       MTU     QoS: saturation
    north (public) anycast    eyeball requests   1500    inbound: attack; outbound: traffic spike
    south (public) origin     origin pulls       1500    -
    east-west - L4LB          inbound requests   1544    -
    east-west - cache         cache traffic      jumbo   hot assets

    • MTU is hard
    • LRO can get disabled on large MSS
    • tc qdiscs work better on a physical device
  38. None
  39. None
  40. dm-crypt/LUKS saga https://www.spinics.net/lists/dm-crypt/msg07517.html

  41. xfs_reclaim https://marc.info/?l=linux-xfs&m=154345788829830&w=2

  42. None
  43. None
  44. microcode bugs • https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/

  45. mprotect() race https://github.com/torvalds/linux/commit/e86f15ee64d8ee4

  46. microcode bugs • RETPOLINE