Linux at Cloudflare

majek04
March 22, 2019

Transcript

  1. Linux at Cloudflare Marek Majkowski

  2. Global network, speed and security

  3. Edge Core

  4. Edge Network - locations

  5. Edge Network - anycast network

  6. Edge Network - software management
    • Uniform configuration everywhere
    • No virtualization, no containers, raw metal
    • Thousands of IP addresses/subnets (anycast)
    • Multiple applications
      ◦ HTTP (HTTP, TLS 1.3, HTTP2, QUIC)
      ◦ DNS (Auth, Resolver)
      ◦ Other
  7. https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/

  8. https://twitter.com/eastdakota/statu https://blog.cloudflare.com/arm-tak

  9. Edge Network - uniform stack anycast origin fetch

  10. Edge Network - uniform software

  11. Edge Network - uniform software

  12. DDoS mitigation
    • iptables: xt_bpf, connlimit, hashlimits, ipsets
    • syn cookies
    • SO_FILTER
    • XDP (locks!)
    • https://lists.openwall.net/netdev/2019/02/22/87
    • https://patchwork.ozlabs.org/cover/998940/
    • https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    • https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf
  13. DDoS mitigation - XDP
    • Implementing token buckets without locking is hard
    • More concurrency primitives (cmpxchg16b?)
    • "Traffic policing in eBPF: applying token bucket algorithm"
      ◦ http://vger.kernel.org/lpc-bpf.html#session-9
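
The token bucket mentioned on this slide can be sketched in a few lines. This is plain Python, not eBPF, and the class name, fields, and parameter values are assumptions made for illustration. It only shows the algorithm and marks the read-modify-write step that would need a lock or a wide atomic operation when several CPUs share one bucket, which may be why the slide asks about cmpxchg16b.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical single-threaded token bucket (illustration only)."""
    rate: float          # tokens added per second
    burst: float         # bucket capacity
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # The update of (tokens, last) above and the spend below form one
        # read-modify-write; with a shared bucket it would have to be atomic.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A full bucket of 3 tokens lets 3 packets through, then drops the 4th.
bucket = TokenBucket(rate=2.0, burst=3.0, tokens=3.0, last=0.0)
burst_result = [bucket.allow(now=0.0) for _ in range(4)]
# One second later, 2 tokens have been refilled, so the next packet passes.
later_result = bucket.allow(now=1.0)
```

Time is passed in explicitly so the behaviour is deterministic; a packet-path implementation would read a monotonic clock instead.
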
  14. L4LB - Load balancing in XDP
    • XDP
    • socket lookup
    • syn cookies
    • https://lists.openwall.net/netdev/2019/02/22/87
    • https://patchwork.ozlabs.org/cover/998940/
    • https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    • https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf
  15. Socket dispatch socket dispatch

  16. Socket dispatch - DoS considerations
    • The case of 30k UDP sockets
    • Solution: eBPF token bucket in SO_FILTER
    • Diagram labels: 192.0.2.2:53, 192.0.2.1:53, 192.0.2.0:53, 0.0.0.0:53 + SO_FILTER
  17. Socket dispatch - zero downtime restart for QUIC
    • Connected UDP sockets for zero-downtime server restart
    • Diagram labels: :443; 10.0.0.1 --> 192.0.2.0:443; 10.0.0.2 --> 192.0.2.0:443; :443; 10.2.0.9 --> 192.0.2.0:443; 10.3.0.8 --> 192.0.2.0:443; :443; :443
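
The dispatch cases on the socket dispatch slides (wildcard bind, bind to a specific address, connected UDP socket) can be modelled as a scoring lookup. The sketch below is a simplified model, not the kernel's code: the `Sock` type, the score weights, and the client ports are assumptions made for illustration. It captures one idea only: a more specific socket wins, so a connected socket receives its own flow while everything else falls through to the listening sockets.

```python
from typing import NamedTuple, Optional, Tuple

Addr = Tuple[str, int]


class Sock(NamedTuple):
    name: str
    local_ip: str                # "0.0.0.0" means a wildcard bind
    local_port: int
    peer: Optional[Addr] = None  # (ip, port) when the socket is connected


def score(sock: Sock, src: Addr, dst: Addr) -> int:
    """Return a match score, or -1 if the socket cannot receive the packet."""
    dst_ip, dst_port = dst
    if sock.local_port != dst_port:
        return -1
    points = 0
    if sock.local_ip != "0.0.0.0":
        if sock.local_ip != dst_ip:
            return -1
        points += 1  # a specific bind beats the wildcard
    if sock.peer is not None:
        if sock.peer != src:
            return -1
        points += 2  # a connected socket beats any unconnected one
    return points


def dispatch(socks, src: Addr, dst: Addr) -> Optional[str]:
    """Pick the best-scoring socket name, or None if nothing matches."""
    best = max(socks, key=lambda s: score(s, src, dst))
    return best.name if score(best, src, dst) >= 0 else None


socks = [
    Sock("wildcard", "0.0.0.0", 443),
    Sock("bound", "192.0.2.0", 443),
    Sock("conn-10.0.0.1", "192.0.2.0", 443, ("10.0.0.1", 5000)),
]
to_connected = dispatch(socks, ("10.0.0.1", 5000), ("192.0.2.0", 443))
to_bound = dispatch(socks, ("10.0.0.2", 6000), ("192.0.2.0", 443))
to_wildcard = dispatch(socks, ("10.0.0.2", 6000), ("192.0.2.9", 443))
no_match = dispatch(socks, ("10.0.0.2", 6000), ("192.0.2.0", 53))
```

In this model an established flow stays with the socket that is connected to it, which is the property the zero-downtime restart slide relies on.
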
  18. Socket dispatch - AnyIP
    • heavy user of AnyIP https://blog.cloudflare.com/how-we-built-spectrum/
    • SO_BINDTOPREFIX http://patchwork.ozlabs.org/patch/602916/
    • TPROXY https://blog.cloudflare.com/how-we-built-spectrum/
    • TPROXY UDP

                   Single IP   Many subnets
    Single port    bind()      SO_BINDTOPREFIX
    Many ports     TPROXY      TPROXY
  19. Better socket dispatch
    • Can we have "inet_lookup" in eBPF? - solve handover
  20. Transmission path - TPROXY UDP send()
    • Sending packets is hard

                                       src IP    src port   dst IP     dst port
    connected socket                   -         -          -          -
    bind(INADDR_ANY)                   auto      -          selected   selected
    bind(INADDR_ANY) + IP_PKTINFO      PKTINFO   -          selected   selected
    bind(127.0.0.1, X) + IP_PKTINFO    PKTINFO   in bind    selected   selected
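
The IP_PKTINFO rows in the table above refer to choosing the source address per datagram through ancillary data. A minimal sketch of building that control message on Linux follows. The helper names are hypothetical, and the fallback value 8 is the Linux constant, used only when the Python build does not export `socket.IP_PKTINFO`.

```python
import socket
import struct

# Linux value from <linux/in.h>; some Python builds do not export the constant.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)


def pktinfo_cmsg(src_ip: str, ifindex: int = 0):
    """Build the ancillary message asking the kernel to send from src_ip."""
    # struct in_pktinfo {
    #     unsigned int   ipi_ifindex;   /* outgoing interface, 0 = any  */
    #     struct in_addr ipi_spec_dst;  /* source address when sending  */
    #     struct in_addr ipi_addr;      /* unused when sending          */
    # };
    payload = struct.pack(
        "@I4s4s",
        ifindex,
        socket.inet_aton(src_ip),
        socket.inet_aton("0.0.0.0"),
    )
    return (socket.IPPROTO_IP, IP_PKTINFO, payload)


def send_from(sock: socket.socket, data: bytes, src_ip: str, dst) -> int:
    """Send one datagram to dst with src_ip as the requested source address."""
    return sock.sendmsg([data], [pktinfo_cmsg(src_ip)], 0, dst)


level, kind, payload = pktinfo_cmsg("192.0.2.1")
```

`send_from` is defined but not called here, because the source address must be local to the sending host for the kernel to accept it.
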
  21. accept() - EPOLL_RR by Jason Baron https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/ https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg8316 EPOLL_RR

  22. SOCKMAP for TCP splicing + kTLS https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/ splice sendfile

  23. Tuning for the Internet
    • Custom initcwnd, BPF_SOCK_OPT
    • BBR
    • TCP_NOTSENT_LOWAT
    • TCP Fast Open
    • ECN
    • TCP Multipath
    • QUIC - tuning for UDP, like UDP GSO
    • More introspection - Listen Drops
    • https://blog.cloudflare.com/http-2-prioritization-with-nginx/
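
Two of the items above, the congestion control algorithm and TCP_NOTSENT_LOWAT, are per-socket options that an application can set itself. The sketch below assumes Linux; the `tune` helper, the 16384-byte threshold, and the fallback constants are illustrative choices, not values from the talk. Each option is attempted independently, because BBR in particular is only available when the kernel provides it.

```python
import socket

# Fallbacks are the Linux values from <netinet/tcp.h>.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def tune(sock: socket.socket, congestion: bytes = b"bbr",
         notsent_lowat: int = 16384) -> dict:
    """Try to apply each option; report which ones the kernel accepted."""
    wanted = {
        "congestion": (TCP_CONGESTION, congestion),
        "notsent_lowat": (TCP_NOTSENT_LOWAT, notsent_lowat),
    }
    applied = {}
    for name, (option, value) in wanted.items():
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = True
        except OSError:
            # For example: the bbr module is not loaded, or the platform
            # does not support the option.
            applied[name] = False
    return applied


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    result = tune(s)
```

The returned dictionary makes it easy to log which settings took effect on a given host instead of failing on the first unsupported one.
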
  24. Prometheus - ebpf_exporter
    • Backend for Prometheus
    • Allowing more detailed event views - like histograms for block I/O
    • https://blog.cloudflare.com/introducing-ebpf_exporter/
    • Matt Bostock's SREcon17 talk
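
To make the histogram idea concrete, the sketch below buckets latency samples by powers of two and renders them as cumulative Prometheus-style `le` buckets. It is an illustration only: the metric name `bio_latency_us` and both function names are hypothetical, and the output of ebpf_exporter itself may be laid out differently.

```python
from collections import Counter


def log2_bucket(value: int) -> int:
    """Index k of the smallest power-of-two bucket with value <= 2**k."""
    if value <= 1:
        return 0
    return (value - 1).bit_length()


def to_prometheus(samples, name: str = "bio_latency_us"):
    """Render samples as cumulative histogram lines in exposition style."""
    counts = Counter(log2_bucket(v) for v in samples)
    lines, running = [], 0
    for k in range(max(counts) + 1):
        running += counts.get(k, 0)  # Prometheus buckets are cumulative
        lines.append(f'{name}_bucket{{le="{2 ** k}"}} {running}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {running}')
    lines.append(f"{name}_count {running}")
    return lines


out = to_prometheus([1, 3, 4, 100, 900])
```

Power-of-two buckets keep the per-event work to a bit-length computation and an increment, while still showing the shape of the latency distribution.
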
  25. Upgrading kernel
    • LTS, inertia
    • out-of-tree drivers age fast (we were stuck on 3.18)
      ◦ fusion-io, sfc / ixgbe, netmap, glb-redirect
      ◦ custom patches (EPOLL_RR, SO_BINDTOPREFIX, XDP features)
    • hardware issues
      ◦ microcode bugs
      ◦ driver regressions
    • regressions
      ◦ SO_FILTER https://www.spinics.net/lists/netdev/msg555565.html
      ◦ nf_conncount https://www.spinics.net/lists/netfilter-devel/msg57316.html
      ◦ nf_nat_cleanup_conntrack https://bugzilla.kernel.org/show_bug.cgi?id=196821
      ◦ systemd disabling TSO https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/
  26. BPF is everywhere
    • XDP for DDoS
    • XDP for load balancing
    • xt_bpf on iptables
    • SO_FILTER for application DDoS
    • SOCKMAP + kTLS within application
    • ebpf_exporter for metrics
    • future: BPF_SOCK_OPTS
    • future: cgroups
    • future: SO_BINDTOBPF ?
  27. Thank you!

  28. None
  29. Kernel bypass for DDoS https://blog.cloudflare.com/kernel-bypass/ https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/

  30. CPU Kernel bypass for application I/O 10%?

  31. Kernel bypass for application
    • Rarely kernel-bound
    • Kernel features
      ◦ iptables for DDoS
      ◦ SYN cookies
      ◦ RFC4821 tcp_mtu_probing
      ◦ BBR
      ◦ Kernel Debuggability
      ◦ tcpdump, sampling
  32. None
  33. Core
    • A couple of large locations
    • Marathon + Mesos (custom load balancers, discovery)
    • Kubernetes (vxlan)
    • Kafka, ClickHouse, Ceph, HBase, PostgreSQL / CitusDB
    • https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
  34. None
  35. Dual 25Gbps OCP

  36. Bonded 50Gbps?

  37. What about NIC tuning, MTU and QoS

    traffic                   type               MTU     QoS: saturation
    north (public), anycast   eyeball requests   1500    inbound: attack; outbound: traffic spike
    south (public), origin    origin pulls       1500    -
    east-west - L4LB          inbound requests   1544    -
    east-west - cache         cache traffic      jumbo   hot assets

    • MTU is hard
    • LRO can get disabled on large MSS
    • tc qdiscs work better on a physical device
  38. None
  39. None
  40. dm-crypt/LUKS saga https://www.spinics.net/lists/dm-crypt/msg07517.html

  41. xfs_reclaim https://marc.info/?l=linux-xfs&m=154345788829830&w=2

  42. None
  43. None
  44. microcode bugs • https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/

  45. mprotect() race https://github.com/torvalds/linux/commit/e86f15ee64d8ee4

  46. microcode bugs • RETPOLINE