Linux at Cloudflare

majek04

March 22, 2019

Transcript

  1. Linux at Cloudflare
    Marek Majkowski

  2. Global network, speed and security

  3. Edge Core

  4. Edge Network - locations

  5. Edge Network - anycast network

  6. Edge Network - software management
    ● Uniform configuration everywhere
    ● No virtualization, no containers, raw metal
    ● Thousands of IP addresses/subnets (anycast)
    ● Multiple applications
    ○ HTTP (HTTP, TLS 1.3, HTTP2, QUIC)
    ○ DNS (Auth, Resolver)
    ○ Other

  7. https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/

  8. https://twitter.com/eastdakota/statu
    https://blog.cloudflare.com/arm-tak

  9. Edge Network - uniform stack
    anycast
    origin fetch

  10. Edge Network - uniform software

  11. Edge Network - uniform software

  12. DDoS mitigation
    iptables: xt_bpf, connlimit, hashlimits, ipsets
    syn cookies
    SO_FILTER
    XDP
    (locks!)
    https://lists.openwall.net/netdev/2019/02/22/87
    https://patchwork.ozlabs.org/cover/998940/
    https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf

  13. DDoS mitigation - XDP
    ● Implementing token buckets without locking is hard
    ● More concurrency primitives (cmpxchg16b?)
    ● "Traffic policing in eBPF: applying token bucket algorithm"
    ○ http://vger.kernel.org/lpc-bpf.html#session-9
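
For readers unfamiliar with the algorithm, here is a minimal single-threaded token bucket sketch in Python. It is not the XDP/eBPF implementation the slide refers to; the class name, parameters and rates are illustrative assumptions. It mainly shows that two fields (`tokens` and `last`) have to be updated together, which is what makes a lock-free version across CPUs hard.

```python
import time


class TokenBucket:
    """Single-threaded token bucket (illustrative sketch, not lock-free)."""

    def __init__(self, rate, burst, now=None):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum number of stored tokens
        self.tokens = float(burst)   # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        """Return True if a packet costing `cost` tokens may pass."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill and timestamp update must happen as one step; with several
        # CPUs this pair is the part that needs a lock or a wide atomic.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing `now` explicitly keeps the sketch deterministic; in normal use the monotonic clock is read on every call.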

  14. L4LB Load balancing in XDP
    XDP
    socket lookup
    syn cookies
    https://lists.openwall.net/netdev/2019/02/22/87
    https://patchwork.ozlabs.org/cover/998940/
    https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf

  15. Socket dispatch
    socket dispatch

  16. Socket dispatch - DoS considerations
    ● The case of 30k UDP sockets
    ● Solution: ebpf token bucket in SO_FILTER
    192.0.2.2:53
    192.0.2.1:53
    192.0.2.0:53
    0.0.0.0:53
    + SO_FILTER
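
To show where such a filter hooks in, the sketch below attaches a socket filter to a UDP socket from Python. It is a stand-in under stated assumptions: it uses a one-instruction classic BPF program with `SO_ATTACH_FILTER` that drops everything, not the eBPF token bucket from the slide, and it assumes Linux (the fallback constant `26` and the struct layouts are Linux-specific). The function name is hypothetical.

```python
import ctypes
import socket
import struct

# Linux value from <asm-generic/socket.h>; used if Python lacks the constant.
SO_ATTACH_FILTER = getattr(socket, "SO_ATTACH_FILTER", 26)


def attach_drop_all(sock):
    """Attach a classic BPF program that drops every packet on `sock`."""
    # struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }
    # 0x06 is BPF_RET | BPF_K; returning 0 keeps 0 bytes, i.e. drops.
    insns = struct.pack("HBBI", 0x06, 0, 0, 0)
    buf = ctypes.create_string_buffer(insns, len(insns))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HP", len(insns) // 8, ctypes.addressof(buf))
    # The kernel copies the program during the call, so `buf` only needs
    # to stay alive until setsockopt() returns.
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)
```

A real rate limiter would replace the single `ret 0` instruction with a program that decides per packet; the attachment point on the socket stays the same.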

  17. Socket dispatch - zero downtime restart for QUIC
    ● Connected UDP sockets for zero-downtime server restart
    :443
    10.0.0.1 --> 192.0.2.0:443
    10.0.0.2 --> 192.0.2.0:443
    :443
    10.2.0.9 --> 192.0.2.0:443
    10.3.0.8 --> 192.0.2.0:443
    :443
    :443
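
The idea of pinning an existing flow to a connected UDP socket can be sketched as follows. This is an illustration, not the production code: the helper names are hypothetical, it uses `SO_REUSEADDR` so that several sockets can share the local address, and it relies on the Linux behaviour that a connected socket matching the full 4-tuple is preferred over an unconnected one.

```python
import socket


def udp_listener(addr):
    """Unconnected UDP socket that receives datagrams from new peers."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(addr)
    return s


def connected_flow(local_addr, peer_addr):
    """UDP socket on the same local address, pinned to a single peer."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(local_addr)
    # After connect() only datagrams from peer_addr reach this socket,
    # and they are matched here before any unconnected socket.
    s.connect(peer_addr)
    return s
```

Between `bind()` and `connect()` the new socket is briefly unconnected and can receive datagrams meant for the listener, so a real server needs to handle that window.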

  18. Socket dispatch - AnyIP
    ● heavy user of AnyIP https://blog.cloudflare.com/how-we-built-spectrum/
    ● SO_BINDTOPREFIX http://patchwork.ozlabs.org/patch/602916/
    ● TPROXY https://blog.cloudflare.com/how-we-built-spectrum/
    ● TPROXY UDP

                   Single IP    Many subnets
    Single port    bind()       SO_BINDTOPREFIX
    Many ports     TPROXY       TPROXY

  19. Better socket dispatch
    socket dispatch
    ● Can we have "inet_lookup" in ebpf? - solve handover

  20. Transmission path - TPROXY UDP send()
    ● Sending packets is hard
                                        src IP     src port   dst IP     dst port
    connected socket                    -          -          -          -
    bind(INADDR_ANY)                    auto       -          selected   selected
    bind(INADDR_ANY) + IP_PKTINFO       PKTINFO    -          selected   selected
    bind(127.0.0.1, X) + IP_PKTINFO     PKTINFO    in bind    selected   selected
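
As an illustration of the `IP_PKTINFO` rows, the sketch below picks the source IP per datagram through ancillary data on `sendmsg()`. It assumes Linux; the helper name is hypothetical, and the fallback value `8` for `IP_PKTINFO` is the Linux constant in case the Python build does not export it.

```python
import socket
import struct

# Linux value from <linux/in.h>; used if Python lacks the constant.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)


def send_from(sock, payload, dst, src_ip):
    """Send one UDP datagram to `dst`, asking the kernel to use `src_ip`."""
    # struct in_pktinfo { int ipi_ifindex; struct in_addr ipi_spec_dst;
    #                     struct in_addr ipi_addr; }
    # On transmit, ipi_spec_dst carries the desired source address.
    pktinfo = struct.pack("=i4s4s", 0, socket.inet_aton(src_ip), b"\0" * 4)
    ancdata = [(socket.IPPROTO_IP, IP_PKTINFO, pktinfo)]
    return sock.sendmsg([payload], ancdata, 0, dst)
```

The source address still has to be one the host is allowed to send from; the source port comes from the socket's bind, as in the table.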

  21. accept() - EPOLL_RR by Jason Baron
    https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
    https://www.mail-archive.com/[email protected]/msg8316
    EPOLL_RR

  22. SOCKMAP for TCP splicing + kTLS
    https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/
    splice sendfile

  23. Tuning for the Internet
    ● Custom initcwnd, BPF_SOCK_OPT
    ● BBR
    ● TCP_NOTSENT_LOWAT
    ● TCP Fast Open
    ● ECN
    ● TCP Multipath
    ● QUIC - tuning for UDP, like UDP GSO
    ● More introspection - Listen Drops
    https://blog.cloudflare.com/http-2-prioritization-with-nginx/
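
Two of the knobs above can be set per socket from user space. The sketch below is a hedged illustration on Linux: the function name and the 16 KiB threshold are assumptions chosen for the example, not values from the talk, and selecting BBR only succeeds if the kernel has that congestion control available.

```python
import socket

# Linux values from <linux/tcp.h>; used if Python lacks the constants.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)
TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)


def tune_tcp_socket(sock, notsent_lowat=16384, congestion="bbr"):
    """Apply per-socket tuning; return True if the congestion control was set."""
    # Limit how much unsent data may queue in the socket buffer, so the
    # application decides late which bytes go out next.
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, notsent_lowat)
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, congestion.encode())
    except OSError:
        # The requested algorithm is not available or not permitted.
        return False
    return True
```

System-wide defaults for the same settings live in sysctls; setting them per socket keeps the change local to one application.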

  24. Prometheus - ebpf_exporter
    ● Backend for Prometheus
    ● Allowing more detailed event views - like histograms for block I/O
    ● https://blog.cloudflare.com/introducing-ebpf_exporter/
    Matt Bostock's SREcon17 talk

  25. Upgrading kernel
    ● LTS, inertia
    ● out-of-tree drivers age fast (we were stuck on 3.18)
    ○ fusion-io, sfc / ixgbe, netmap, glb-redirect
    ○ custom patches (EPOLL_RR, SO_BINDTOPREFIX, XDP features)
    ● hardware issues
    ○ microcode bugs
    ○ driver regressions
    ● regressions
    ○ SO_FILTER https://www.spinics.net/lists/netdev/msg555565.html
    ○ nf_conncount https://www.spinics.net/lists/netfilter-devel/msg57316.html
    ○ nf_nat_cleanup_conntrack https://bugzilla.kernel.org/show_bug.cgi?id=196821
    ○ systemd disabling TSO https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/

  26. BPF is everywhere
    ● XDP for DDoS
    ● XDP for load balancing
    ● xt_bpf on iptables
    ● SO_FILTER for application DDoS
    ● SOCKMAP + kTLS within application
    ● ebpf_exporter for metrics

    ● future: BPF_SOCK_OPTS
    ● future: cgroups
    ● future: SO_BINDTOBPF ?

  27. Thank you!

  29. Kernel bypass for DDoS
    https://blog.cloudflare.com/kernel-bypass/
    https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/

  30. Kernel bypass for application
    CPU
    I/O
    10%?

  31. Kernel bypass for application
    ● Rarely kernel-bound
    ● Kernel features
    ○ iptables for DDoS
    ○ SYN cookies
    ○ RFC4821 tcp_mtu_probing
    ○ BBR
    ○ Kernel Debuggability
    ○ tcpdump, sampling

  33. Core
    ● A couple of large locations
    ● Marathon + Mesos (custom load balancers, discovery)
    ● Kubernetes (vxlan)
    ● Kafka, ClickHouse, Ceph, HBase, PostgreSQL / CitusDB
    https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/

  35. Dual 25Gbps OCP

  36. Bonded 50Gbps?

  37. What about NIC tuning, MTU and QoS
                                 traffic type       MTU     QoS: saturation
    north (public) anycast       eyeball requests   1500    inbound: attack
                                                            outbound: traffic spike
    south (public) origin        origin pulls       1500    -
    east-west - L4LB             inbound requests   1544    -
    east-west - cache            cache traffic      jumbo   hot assets
    ● MTU is hard
    ● LRO can get disabled on large MSS
    ● tc qdiscs work better on a physical device

  40. dm-crypt/LUKS saga
    https://www.spinics.net/lists/dm-crypt/msg07517.html

  41. xfs_reclaim
    https://marc.info/?l=linux-xfs&m=154345788829830&w=2

  44. microcode bugs
    ● https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/

  45. mprotect() race
    https://github.com/torvalds/linux/commit/e86f15ee64d8ee4

  46. microcode bugs
    ● RETPOLINE
