Linux at Cloudflare

majek04
March 22, 2019

Transcript

  1. Edge Network - software management
     • Uniform configuration everywhere
     • No virtualization, no containers, raw metal
     • Thousands of IP addresses/subnets (anycast)
     • Multiple applications
       ◦ HTTP (HTTP, TLS 1.3, HTTP2, QUIC)
       ◦ DNS (Auth, Resolver)
       ◦ Other
  2. DDoS mitigation
     • iptables: xt_bpf, connlimit, hashlimits, ipsets
     • SYN cookies
     • SO_FILTER
     • XDP (locks!)
     • https://lists.openwall.net/netdev/2019/02/22/87
     • https://patchwork.ozlabs.org/cover/998940/
     • https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
     • https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf
  3. DDoS mitigation - XDP
     • Implementing token buckets without locking is hard
     • More concurrency primitives (cmpxchg16b?)
     • "Traffic policing in eBPF: applying token bucket algorithm"
       ◦ http://vger.kernel.org/lpc-bpf.html#session-9
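
One way to see why a lock-free token bucket is awkward: the bucket has two fields (token count and last-refill time) that must change together. The sketch below packs both into one 64-bit word and updates it in a compare-and-swap retry loop. It is a model in Python, not XDP code: `Cell.compare_and_swap` is a single-threaded stand-in for an atomic instruction, and `RATE`, `BURST` and the 32/32-bit split are assumptions made for illustration. A wider primitive such as cmpxchg16b would presumably relax the truncation to 32 bits per field.

```python
RATE = 1000            # tokens added per second (hypothetical value)
BURST = 2000           # bucket capacity (hypothetical value)
MASK32 = 0xFFFFFFFF


def pack(tokens, stamp_ms):
    """Pack both fields into one 64-bit word so a single CAS covers them."""
    return ((tokens & MASK32) << 32) | (stamp_ms & MASK32)


def unpack(word):
    """Return (tokens, stamp_ms) from a packed word."""
    return word >> 32, word & MASK32


class Cell:
    """Stand-in for a 64-bit memory word with an atomic compare-and-swap."""

    def __init__(self, value):
        self.value = value

    def compare_and_swap(self, expected, new):
        # On real hardware this would be one atomic instruction.
        if self.value != expected:
            return False
        self.value = new
        return True


def allow(cell, now_ms):
    """Try to take one token; retry if the word changed underneath us."""
    while True:
        old = cell.value
        tokens, stamp = unpack(old)
        elapsed = (now_ms - stamp) & MASK32       # wrap-safe time delta
        refill = elapsed * RATE // 1000
        if refill:
            # Sub-token remainders are dropped here for brevity.
            tokens = min(BURST, tokens + refill)
            stamp = now_ms & MASK32
        allowed = tokens > 0
        new = pack(tokens - 1 if allowed else tokens, stamp)
        if cell.compare_and_swap(old, new):
            return allowed
```

For example, a bucket created with `Cell(pack(2, 0))` admits two packets at time 0, rejects the third, and admits again once `now_ms` has advanced.
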
  4. Socket dispatch - DoS considerations
     • The case of 30k UDP sockets
     • Solution: eBPF token bucket in SO_FILTER
     • Diagram labels:
       ◦ 192.0.2.2:53
       ◦ 192.0.2.1:53
       ◦ 192.0.2.0:53
       ◦ 0.0.0.0:53 + SO_FILTER
  5. Socket dispatch - zero downtime restart for QUIC
     • Connected UDP sockets for zero-downtime server restart
     • Diagram labels:
       ◦ :443
       ◦ 10.0.0.1 --> 192.0.2.0:443
       ◦ 10.0.0.2 --> 192.0.2.0:443
       ◦ :443
       ◦ 10.2.0.9 --> 192.0.2.0:443
       ◦ 10.3.0.8 --> 192.0.2.0:443
       ◦ :443
       ◦ :443
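
The connected-socket idea on the slide above can be tried on loopback. This is a minimal sketch that assumes Linux socket lookup behaviour, where a connected UDP socket is preferred over an unconnected one bound to the same local address and port. The helper names (`udp_socket`, `demo`) and the payloads are hypothetical; the real server presumably does this per QUIC flow so that an old process keeps its established flows while a new process takes new ones.

```python
import socket


def udp_socket(local):
    """UDP socket bound to `local`; SO_REUSEADDR lets several share the port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.settimeout(2.0)
    s.bind(local)
    return s


def demo():
    listener = udp_socket(("127.0.0.1", 0))     # plays the unconnected ":443"
    local = listener.getsockname()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2.0)
    flow = None
    try:
        client.sendto(b"first", local)
        first, peer = listener.recvfrom(2048)   # a new flow hits the listener

        # Pin the flow: same local address and port, connected to the peer.
        flow = udp_socket(local)
        flow.connect(peer)

        client.sendto(b"second", local)
        second = flow.recv(2048)                # later packets arrive here

        # The listener should not have seen the second datagram.
        listener.settimeout(0.2)
        try:
            listener.recvfrom(2048)
            leaked = True
        except socket.timeout:
            leaked = False
        return first, second, leaked
    finally:
        for s in (listener, client, flow):
            if s is not None:
                s.close()
```

There is a window between `recvfrom()` and `connect()` in which further datagrams of the same flow still reach the listener, so a real server would need to handle packets arriving on either socket.
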
  6. Socket dispatch - AnyIP
     • Heavy user of AnyIP https://blog.cloudflare.com/how-we-built-spectrum/
     • SO_BINDTOPREFIX http://patchwork.ozlabs.org/patch/602916/
     • TPROXY https://blog.cloudflare.com/how-we-built-spectrum/
     • TPROXY UDP

                     Single IP   Many subnets
       Single port   bind()      SO_BINDTOPREFIX
       Many ports    TPROXY      TPROXY
  7. Transmission path - TPROXY UDP send()
     • Sending packets is hard

                                        src IP   src port  dst IP    dst port
       connected socket                 -        -         -         -
       bind(INADDR_ANY)                 auto     -         selected  selected
       bind(INADDR_ANY) + IP_PKTINFO    PKTINFO  -         selected  selected
       bind(127.0.0.1, X) + IP_PKTINFO  PKTINFO  in bind   selected  selected
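
The IP_PKTINFO rows of the table above correspond to passing a `struct in_pktinfo` as ancillary data on each send. The sketch below shows how that message can be built from Python on Linux. The helper names are hypothetical, the fallback value 8 is the Linux constant (used only when the `socket` module does not export `IP_PKTINFO`), and the chosen source address must be one the host is allowed to send from.

```python
import socket
import struct

# Linux value of IP_PKTINFO, used when this Python build lacks the name.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)


def pktinfo_cmsg(src_ip, ifindex=0):
    """Build the ancillary message carrying struct in_pktinfo.

    struct in_pktinfo { int ipi_ifindex;
                        struct in_addr ipi_spec_dst;   // source to use
                        struct in_addr ipi_addr; };    // ignored on send
    """
    payload = struct.pack(
        "i4s4s",
        ifindex,
        socket.inet_aton(src_ip),
        socket.inet_aton("0.0.0.0"),
    )
    return (socket.IPPROTO_IP, IP_PKTINFO, payload)


def send_from(sock, data, src_ip, dst):
    """Send one datagram to `dst`, selecting the source IP for this packet."""
    return sock.sendmsg([data], [pktinfo_cmsg(src_ip)], 0, dst)
```

With this approach only the source IP varies per packet; as the table notes, the source port still comes from whatever the socket was bound to.
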
  8. Tuning for the Internet
     • Custom initcwnd, BPF_SOCK_OPT
     • BBR
     • TCP_NOTSENT_LOWAT
     • TCP Fast Open
     • ECN
     • TCP Multipath
     • QUIC - tuning for UDP, like UDP GSO
     • More introspection - Listen Drops
     • https://blog.cloudflare.com/http-2-prioritization-with-nginx/
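
Two of the tunables listed above, TCP_NOTSENT_LOWAT and the congestion control algorithm, can be set per socket. The following is a sketch for Linux only: the function name `tune`, the 16 KiB threshold and the numeric fallbacks are assumptions for illustration, and the slide does not say which values are used in production. Selecting `bbr` fails where the module is not available or not permitted, so failures are reported rather than raised.

```python
import socket

# Linux option numbers, used when the socket module lacks the names.
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)


def tune(sock, lowat=16384, cc=b"bbr"):
    """Apply per-socket tunables; return the values the kernel accepted."""
    applied = {}
    options = (
        ("notsent_lowat", TCP_NOTSENT_LOWAT, lowat),
        ("congestion", TCP_CONGESTION, cc),
    )
    for name, opt, value in options:
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
        except OSError:
            applied[name] = None    # option or algorithm unavailable here
    return applied
```

A caller would typically run `tune(conn)` on each accepted connection and log any entry that came back as `None`.
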
  9. Prometheus - ebpf_exporter
     • Backend for Prometheus
     • Allowing more detailed event views - like histograms for block I/O
     • https://blog.cloudflare.com/introducing-ebpf_exporter/
     • Matt Bostock's SREcon17 talk
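
To make the histogram bullet concrete, here is a sketch of power-of-two latency bucketing rendered as cumulative Prometheus buckets. It is not ebpf_exporter code: in the real tool the counting happens in an eBPF program in the kernel, and the metric name `io_latency`, the microsecond unit and the bucket count are assumptions for illustration.

```python
def log2_bucket(value_us):
    """Smallest k such that value_us <= 2**k (values below 1 count as 1)."""
    return (max(int(value_us), 1) - 1).bit_length()


def histogram(samples_us, buckets=8):
    """Count samples per power-of-two bucket; the last slot is overflow."""
    counts = [0] * (buckets + 1)
    for value in samples_us:
        counts[min(log2_bucket(value), buckets)] += 1
    return counts


def prometheus_lines(name, counts):
    """Render counts as cumulative buckets in Prometheus text format."""
    lines, running = [], 0
    for k, count in enumerate(counts[:-1]):
        running += count
        lines.append(f'{name}_bucket{{le="{2 ** k}"}} {running}')
    running += counts[-1]                      # overflow only shows in +Inf
    lines.append(f'{name}_bucket{{le="+Inf"}} {running}')
    return lines
```

Calling `prometheus_lines("io_latency", histogram(samples))` yields one line per bucket, which is the shape Prometheus needs to compute quantiles over block I/O latency.
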
  10. Upgrading kernel
      • LTS, inertia
      • Out-of-tree drivers age fast (we were stuck on 3.18)
        ◦ fusion-io, sfc / ixgbe, netmap, glb-redirect
        ◦ custom patches (EPOLL_RR, SO_BINDTOPREFIX, XDP features)
      • Hardware issues
        ◦ microcode bugs
        ◦ driver regressions
      • Regressions
        ◦ SO_FILTER https://www.spinics.net/lists/netdev/msg555565.html
        ◦ nf_conncount https://www.spinics.net/lists/netfilter-devel/msg57316.html
        ◦ nf_nat_cleanup_conntrack https://bugzilla.kernel.org/show_bug.cgi?id=196821
        ◦ systemd disabling TSO https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/
  11. BPF is everywhere
      • XDP for DDoS
      • XDP for load balancing
      • xt_bpf on iptables
      • SO_FILTER for application DDoS
      • SOCKMAP + kTLS within application
      • ebpf_exporter for metrics
      • future: BPF_SOCK_OPTS
      • future: cgroups
      • future: SO_BINDTOBPF ?
  12. Kernel bypass for application
      • Rarely kernel-bound
      • Kernel features
        ◦ iptables for DDoS
        ◦ SYN cookies
        ◦ RFC4821 tcp_mtu_probing
        ◦ BBR
        ◦ Kernel Debuggability
        ◦ tcpdump, sampling
  13. Core
      • A couple of large locations
      • Marathon + Mesos (custom load balancers, discovery)
      • Kubernetes (vxlan)
      • Kafka, ClickHouse, Ceph, HBase, PostgreSQL / CitusDB
      • https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
  14. What about NIC tuning, MTU and QoS

                                traffic type       MTU     QoS: saturation
        north (public) anycast  eyeball requests   1500    inbound: attack, outbound: traffic spike
        south (public) origin   origin pulls       1500    -
        east-west - L4LB        inbound requests   1544    -
        east-west - cache       cache traffic      jumbo   hot assets

      • MTU is hard
      • LRO can get disabled on large MSS
      • tc qdiscs work better on a physical device