Linux at Cloudflare

majek04
March 22, 2019


Transcript

  1. Linux at Cloudflare
    Marek Majkowski


  2. Global network, speed and security


  3. Edge Network - locations


  4. Edge Network - anycast network


  5. Edge Network - software management
    ● Uniform configuration everywhere
    ● No virtualization, no containers, raw metal
    ● Thousands of IP addresses/subnets (anycast)
    ● Multiple applications
    ○ HTTP (HTTP, TLS 1.3, HTTP2, QUIC)
    ○ DNS (Auth, Resolver)
    ○ Other


  6. https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/


  7. https://twitter.com/eastdakota/statu
    https://blog.cloudflare.com/arm-tak


  8. Edge Network - uniform stack
    anycast
    origin fetch


  9. Edge Network - uniform software


  10. Edge Network - uniform software


  11. DDoS mitigation
    iptables: xt_bpf, connlimit, hashlimits, ipsets
    SYN cookies
    SO_FILTER
    XDP
    (locks!)
    https://lists.openwall.net/netdev/2019/02/22/87
    https://patchwork.ozlabs.org/cover/998940/
    https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf


  12. DDoS mitigation - XDP
    ● Implementing token buckets without locking is hard
    ● More concurrency primitives (cmpxchg16b?)
    ● "Traffic policing in eBPF: applying token bucket algorithm"
    ○ http://vger.kernel.org/lpc-bpf.html#session-9
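
The token bucket mentioned above can be sketched in a few lines. This is a single-threaded Python illustration of the algorithm only, not eBPF and not the code from the talk; the class name, field names and the rate/burst values are assumptions made for the example. The refill-then-spend step in `allow()` is a read-modify-write of two fields (`tokens` and `last`), which is the part that becomes hard to do without a lock when several CPUs run the same program at once.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical token bucket; names and units are illustrative."""
    rate: float          # tokens added per second
    burst: float         # maximum number of tokens the bucket can hold
    tokens: float = 0.0  # tokens currently available
    last: float = 0.0    # timestamp of the previous refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A full bucket of 3 tokens, refilled at 2 tokens per second.
bucket = TokenBucket(rate=2.0, burst=3.0, tokens=3.0)
decisions = [bucket.allow(now=0.0) for _ in range(4)]  # burst, then drop
later = bucket.allow(now=1.0)                          # refilled after 1 s
```

The clock is passed in as `now` so the behaviour is deterministic; a packet filter would read a monotonic timestamp instead.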


  13. L4LB - Load balancing in XDP
    XDP
    socket lookup
    SYN cookies https://lists.openwall.net/netdev/2019/02/22/87
    https://patchwork.ozlabs.org/cover/998940/
    https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible
    https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf


  14. Socket dispatch
    socket dispatch


  15. Socket dispatch - DoS considerations
    ● The case of 30k UDP sockets
    ● Solution: eBPF token bucket in SO_FILTER
    192.0.2.2:53
    192.0.2.1:53
    192.0.2.0:53
    0.0.0.0:53
    + SO_FILTER


  16. Socket dispatch - zero downtime restart for QUIC
    ● Connected UDP sockets for zero-downtime server restart
    :443
    10.0.0.1 --> 192.0.2.0:443
    10.0.0.2 --> 192.0.2.0:443
    :443
    10.2.0.9 --> 192.0.2.0:443
    10.3.0.8 --> 192.0.2.0:443
    :443
    :443
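
One reading of the diagram above: a connected UDP socket matches a full 4-tuple, so Linux prefers it over an unconnected socket bound to the same address and port. Existing flows can stay pinned to their connected sockets while a separate unconnected socket takes packets from new peers. The sketch below demonstrates only that lookup preference on loopback. It assumes Linux, where `SO_REUSEADDR` lets several UDP sockets share one address and port; the helper name and the ephemeral ports are made up for the example, and it does not cover the races a real restart has to handle.

```python
import socket


def udp_socket(local):
    """UDP socket with SO_REUSEADDR set, bound to `local` (host, port)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(local)
    s.settimeout(1.0)
    return s


# Unconnected socket: receives packets from peers with no dedicated socket.
listener = udp_socket(("127.0.0.1", 0))
server_addr = listener.getsockname()

# Two clients with distinct source ports.
known = udp_socket(("127.0.0.1", 0))
new = udp_socket(("127.0.0.1", 0))

# Connected socket on the same local address and port, pinned to one client.
flow = udp_socket(server_addr)
flow.connect(known.getsockname())

known.sendto(b"existing flow", server_addr)
new.sendto(b"new flow", server_addr)

flow_data = flow.recv(2048)                 # delivered to the connected socket
listener_data, _ = listener.recvfrom(2048)  # falls through to the listener

for s in (listener, known, new, flow):
    s.close()
```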


  17. Socket dispatch - AnyIP
    ● heavy user of AnyIP https://blog.cloudflare.com/how-we-built-spectrum/
    ● SO_BINDTOPREFIX http://patchwork.ozlabs.org/patch/602916/
    ● TPROXY https://blog.cloudflare.com/how-we-built-spectrum/
    ● TPROXY UDP

                   Single IP   Many subnets
    Single port    bind()      SO_BINDTOPREFIX
    Many ports     TPROXY      TPROXY


  18. Better socket dispatch
    socket dispatch
    ● Can we have "inet_lookup" in eBPF? - solve handover


  19. Transmission path - TPROXY UDP send()
    ● Sending packets is hard

                                       src IP    src port   dst IP     dst port
    connected socket                   -         -          -          -
    bind(INADDR_ANY)                   auto      -          selected   selected
    bind(INADDR_ANY) + IP_PKTINFO      PKTINFO   -          selected   selected
    bind(127.0.0.1, X) + IP_PKTINFO    PKTINFO   in bind    selected   selected
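
The "bind(INADDR_ANY) + IP_PKTINFO" case can be sketched as follows: the source port comes from `bind()`, the destination is chosen per call, and the source IP is supplied in an `IP_PKTINFO` control message. This is a Linux-only illustration on loopback, not code from the talk; the helper name `sendto_from` and the address 127.0.0.2 are assumptions for the example, and the constant falls back to the Linux value 8 when the Python build does not export `socket.IP_PKTINFO`.

```python
import socket
import struct

# Assumption: 8 is the Linux value, used if Python does not export the name.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)


def sendto_from(sock, payload, src_ip, dst):
    """Send `payload` to `dst`, asking the kernel to use `src_ip` as source."""
    # struct in_pktinfo { int ipi_ifindex; in_addr ipi_spec_dst; in_addr ipi_addr; }
    # ipi_spec_dst carries the source address on the send path.
    pktinfo = struct.pack("i4s4s", 0, socket.inet_aton(src_ip), b"\0" * 4)
    cmsg = [(socket.IPPROTO_IP, IP_PKTINFO, pktinfo)]
    return sock.sendmsg([payload], cmsg, 0, dst)


receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(1.0)

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.bind(("0.0.0.0", 0))  # bind(INADDR_ANY): fixes only the source port
bound_port = sender.getsockname()[1]

# Any address in 127.0.0.0/8 is local on Linux, so 127.0.0.2 is a valid source.
sendto_from(sender, b"ping", "127.0.0.2", receiver.getsockname())
data, (src_ip, src_port) = receiver.recvfrom(2048)

sender.close()
receiver.close()
```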


  20. accept() - EPOLL_RR by Jason Baron
    https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
    https://www.mail-archive.com/[email protected]/msg8316
    EPOLL_RR


  21. SOCKMAP for TCP splicing + kTLS
    https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/
    splice sendfile


  22. Tuning for the Internet
    ● Custom initcwnd, BPF_SOCK_OPS
    ● BBR
    ● TCP_NOTSENT_LOWAT
    ● TCP Fast Open
    ● ECN
    ● TCP Multipath
    ● QUIC - tuning for UDP, like UDP GSO
    ● More introspection - Listen Drops
    https://blog.cloudflare.com/http-2-prioritization-with-nginx/
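
Several of the knobs above are per-socket options. As a small illustration, `TCP_NOTSENT_LOWAT` can be set with `setsockopt()` and read back. The 16 KiB threshold is an arbitrary value chosen for the example, not a recommendation from the talk, and the constant falls back to the Linux value 25 when the Python build does not export it.

```python
import socket

# Assumption: 25 is the Linux value, used if Python does not export the name.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Limit how much not-yet-sent data may sit in the socket write queue before
# the socket stops being reported as writable.
sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 16 * 1024)
lowat = sock.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT)

sock.close()
```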


  23. Prometheus - ebpf_exporter
    ● Backend for Prometheus
    ● Allowing more detailed event views - like histograms for block I/O
    ● https://blog.cloudflare.com/introducing-ebpf_exporter/
    Matt Bostock's SREcon17 talk


  24. Upgrading kernel
    ● LTS, inertia
    ● out-of-tree drivers age fast (we were stuck on 3.18)
    ○ fusion-io, sfc / ixgbe, netmap, glb-redirect
    ○ custom patches (EPOLL_RR, SO_BINDTOPREFIX, XDP features)
    ● hardware issues
    ○ microcode bugs
    ○ driver regressions
    ● regressions
    ○ SO_FILTER https://www.spinics.net/lists/netdev/msg555565.html
    ○ nf_conncount https://www.spinics.net/lists/netfilter-devel/msg57316.html
    ○ nf_nat_cleanup_conntrack https://bugzilla.kernel.org/show_bug.cgi?id=196821
    ○ systemd disabling TSO https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/


  25. BPF is everywhere
    ● XDP for DDoS
    ● XDP for load balancing
    ● xt_bpf on iptables
    ● SO_FILTER for application DDoS
    ● SOCKMAP + kTLS within application
    ● ebpf_exporter for metrics

    ● future: BPF_SOCK_OPS
    ● future: cgroups
    ● future: SO_BINDTOBPF ?


  26. Kernel bypass for DDoS
    https://blog.cloudflare.com/kernel-bypass/
    https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/


  27. Kernel bypass for application
    CPU
    I/O
    10%?


  28. Kernel bypass for application
    ● Rarely kernel-bound
    ● Kernel features
    ○ iptables for DDoS
    ○ SYN cookies
    ○ RFC4821 tcp_mtu_probing
    ○ BBR
    ○ Kernel Debuggability
    ○ tcpdump, sampling


  29. Core
    ● A couple of large locations
    ● Marathon + Mesos (custom load balancers, discovery)
    ● Kubernetes (vxlan)
    ● Kafka, ClickHouse, Ceph, HBase, PostgreSQL / CitusDB
    https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/


  30. Dual 25Gbps OCP


  31. Bonded 50Gbps?


  32. What about NIC tuning, MTU and QoS
                              traffic type       MTU     QoS: saturation
    north (public) anycast    eyeball requests   1500    inbound: attack
                                                         outbound: traffic spike
    south (public) origin     origin pulls       1500    -
    east-west - L4LB          inbound requests   1544    -
    east-west - cache         cache traffic      jumbo   hot assets
    ● MTU is hard
    ● LRO can get disabled on large MSS
    ● tc qdiscs work better on the physical device


  33. dm-crypt/LUKS saga
    https://www.spinics.net/lists/dm-crypt/msg07517.html


  34. xfs_reclaim
    https://marc.info/?l=linux-xfs&m=154345788829830&w=2


  35. microcode bugs
    ● https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/


  36. mprotect() race
    https://github.com/torvalds/linux/commit/e86f15ee64d8ee4


  37. microcode bugs
    ● RETPOLINE
