Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Speaker
    - Haruki Okada
    - Senior Software Engineer
    - Develops the company-wide Apache Kafka platform

  2. Apache Kafka
    - Distributed messaging middleware
    - One of the most popular pieces of middleware at LINE
    - Highly scalable
    - Multi-tenancy
    - ACL
    - Quota

  3. Apache Kafka
    Kafka cluster
    [Diagram: producers send messages to the brokers in the Kafka cluster; consumers receive messages from the brokers]

  4. Apache Kafka at LINE
    Shared Kafka cluster
    - We adopt a multi-tenant model
    [Diagram: producers and consumers of multiple services share the same brokers]

  5. Apache Kafka at LINE
    - User services: over 100
    - Peak message inflow: 15M messages / sec
    - Daily in + out traffic: 1.4 PB

  6. Efforts to make our cluster highly reliable and performant
    - refs: “Reliability Engineering Behind The Most Trusted Kafka Platform”
      (LINE DEVELOPER DAY 2019)
    - We have put a lot of engineering effort into:
      - measuring / optimizing our cluster’s performance
      - troubleshooting

  7. Agenda
    - Overview of the phenomenon
    - Investigation
    - Solution

  8. Phenomenon
    - One day, we were in the middle of restarting Kafka brokers to roll out a config update
    - => Produce requests from one of the producer services started to time out

  9. Produce request timeout
    - Messages could not be delivered to brokers within the configured timeout
    - A critical problem, as it implies data loss

    org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic-46:600000 ms has passed since batch creation
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:98)
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:67)
        at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:30)
        ...

  10. Started investigation, but…
    - Suspect 1: produce-request response time degradation (e.g. replication delay)
      => No. Response time was fairly normal
    - Suspect 2: continuous produce-request rejection (e.g. multiple replicas are dead)
      => No. There was no such error log on the producer

  11. Mysterious Observation
    [Diagram: the producer saw insanely high latency on produce requests, while the broker's response time was normal]

  12. Mysterious Observation
    [Diagram: only requests to the restarted broker showed high latency; requests from the same producer to the other brokers were normal]

  13. Should handle it ASAP anyway
    - We had to deal with the ongoing data loss ASAP, even though we were not sure about the root cause
    - => Tried restarting the broker again
    - => It worked like a charm… (and we lost the environment for investigation)

  14. Different scope of measuring request duration at producer vs broker
    - The “time until the request is completely sent out” is included only in the producer-side latency metric, not in the broker-side one

  15. Hypothesis: the producer’s I/O thread got stuck
    - We concluded that detailed producer-side information is necessary to verify the hypothesis
      - Kafka-client DEBUG-level logs
      - async-profiler profiles of the JVM
    - We asked the service developers to cooperate with an experiment on the production environment
      - 1. Be prepared to monitor and take profiles
      - 2. Restart the same broker again (possibly several times)

  16. Attempt to reproduce: FAILED
    - Unfortunately, the phenomenon didn’t reproduce even after several attempts
    - It turned out that verifying the hypothesis was quite hard, not only because we lacked clues to trace client-side behavior, but also because of the low reproducibility

  17. Hypothesis: Network performance degradation
    - But ping latency was quite normal
    - Also, there was no:
      - increase in TCP retransmissions
      - network bandwidth saturation

  18. In the meantime…
    The problem happened again
    - On a different service (producer) and a different broker
    - Exactly the same phenomenon as the previous occurrence:
      - The producer’s requests were delayed while the broker’s performance was completely sane
      - It started right after a broker was restarted
      - It recovered after another restart

  19. Can't we get any clues from these occurrences?
    - We now had multiple instances of the phenomenon
    - We started to look for a pattern common to every occurrence that could possibly cause sending produce requests to be delayed

  20. Many TCP SYN Cookies were sent
    - Finally, we found that “node_netstat_TcpExt_SyncookiesSent” spiked upon broker restart, and only on the brokers that triggered the phenomenon (a sketch of reading this counter directly follows below)
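
    A minimal sketch (not our monitoring setup; the metric above comes from node_exporter) of reading the same SyncookiesSent counter directly from /proc/net/netstat on a broker host:

    def read_tcpext_counter(name: str, path: str = "/proc/net/netstat") -> int:
        # /proc/net/netstat holds pairs of lines per group: a "TcpExt: field..."
        # header line followed by a "TcpExt: value..." line with the counters.
        with open(path) as f:
            lines = f.read().splitlines()
        for header, values in zip(lines[::2], lines[1::2]):
            if header.startswith("TcpExt:"):
                fields = dict(zip(header.split()[1:], values.split()[1:]))
                return int(fields[name])
        raise KeyError(name)

    if __name__ == "__main__":
        # A value that jumps right after a broker restart is the signal we saw.
        print("SyncookiesSent:", read_tcpext_counter("SyncookiesSent"))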

  21. (Study) TCP SYN Cookies:
    TCP 3-way handshake
    [Sequence diagram: the client sends SYN; the server stores the SYN info in the SYN queue and replies SYN+ACK; the client replies ACK; the server retrieves the SYN info, removes it from the queue, and the connection is established]

  22. (Study) TCP SYN Cookies:
    SYN flood attack
    [Sequence diagram: the attacker sends a SYN; the server stores the SYN info and replies SYN+ACK; the attacker ignores the SYN+ACK, so the entry is left in the “half-open” state and is never removed; repeated SYNs make the SYN queue FULL]

  23. (Study) TCP SYN Cookies
    [Sequence diagram: the SYN queue is full, so on receiving a SYN the server 1. calculates “SYN Cookies”, 2. embeds them in the SYN+ACK as the TCP initial sequence number, and 3. replies and forgets; when the client’s ACK arrives, the server decodes the SYN cookies and completes the handshake, so the connection is established without using the SYN queue (toy illustration below)]
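
    To make the “calculate, embed, forget, decode” flow concrete, here is a toy illustration. It is not the kernel's actual algorithm (which also encodes the MSS and uses its own hash); it only shows the stateless idea: the cookie is derived from the connection 4-tuple, a secret, and a coarse time counter, so the server can verify the returning ACK without having stored anything.

    import hashlib
    import time

    SECRET = b"per-boot server secret"   # illustrative only

    def syn_cookie(saddr: str, sport: int, daddr: str, dport: int, tslot: int) -> int:
        # Derive a 32-bit value that stands in for the initial sequence number.
        msg = f"{saddr}:{sport}->{daddr}:{dport}@{tslot}".encode()
        return int.from_bytes(hashlib.sha256(SECRET + msg).digest()[:4], "big")

    def make_synack_isn(saddr, sport, daddr, dport):
        tslot = int(time.time()) // 64    # coarse timestamp, limits cookie lifetime
        return syn_cookie(saddr, sport, daddr, dport, tslot), tslot

    def ack_completes_handshake(ack_seq, saddr, sport, daddr, dport, tslot) -> bool:
        # The client ACKs our ISN + 1; recompute the cookie instead of looking up state.
        expected = (syn_cookie(saddr, sport, daddr, dport, tslot) + 1) % 2**32
        return ack_seq % 2**32 == expected

    isn, tslot = make_synack_isn("10.0.0.1", 45678, "10.0.0.2", 9092)
    print(ack_completes_handshake(isn + 1, "10.0.0.1", 45678, "10.0.0.2", 9092, tslot))  # True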

  24. Restarting Kafka caused SYN flood
    [Diagram: many producers and consumers connected to the brokers]
  25. Restarting Kafka caused SYN flood
    [Diagram: one broker is restarted; its clients fail over to the other brokers]

  26. Restarting Kafka caused SYN flood
    [Diagram: when the restarted broker comes back, the clients reconnect to the original broker]
    - => This naturally caused a SYN flood
      (the broker had over 18K client connections)

  27. Drawback of SYN Cookies:
    (Study) TCP window
    [Diagram: the receiver's (server's) ACK advertises the receiver's new window size]
    - The sender can send packets without waiting for an ACK, up to the receiver's window size

  28. Drawback of SYN Cookies:
    (Study) Window size limitation
    - The maximum window size is 65535 (per the TCP specification, the window field is 16 bits)
    - The client has to wait for the server's ACK at least every 65535 bytes
    - => This could become a bottleneck if the round-trip time is large (see the back-of-the-envelope calculation below)
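
    A back-of-the-envelope calculation (illustrative numbers, not measurements) of why the 65535-byte cap only matters when the round-trip time is large: a sender that must stop after every window's worth of bytes is bounded by window_size / RTT.

    def max_throughput_mb_per_s(window_bytes: int, rtt_seconds: float) -> float:
        # Upper bound: at most one full window per round trip.
        return window_bytes / rtt_seconds / 1e6

    print(max_throughput_mb_per_s(65535, 0.001))   # ~65 MB/s at 1 ms RTT  -> plenty inside a DC
    print(max_throughput_mb_per_s(65535, 0.050))   # ~1.3 MB/s at 50 ms RTT -> a real bottleneck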

  29. Drawback of SYN Cookies:
    (Study) Window scaling
    [Sequence diagram: the client's SYN carries the client's window scaling factor, and the server's SYN+ACK carries the server's; i.e. the “window scaling factor” is exchanged during the handshake, with the SYN info stored in the SYN queue as usual]

  30. Drawback of SYN Cookies:
    (Study) Window scaling
    - The window size is extended beyond 65535 by multiplying it by the window-scaling factor:
      actual_window_size = window_size * 2^window_scaling_factor
    - => This improves throughput, especially on networks with a large round-trip time

  31. Drawback of SYN Cookies:
    (Study) Window scaling
    (Example: server's window scaling factor = 1)
    - If the server's available receive buffer is 102400 bytes, it advertises 102400 >> 1 = 51200 as the window size
    - The client restores the window size as 51200 << 1 = 102400
    - => The client can send up to 102400 bytes

  32. Drawback of SYN Cookies:
    (Study) Window scaling drop
    [Sequence diagram: the client's SYN carries its window scaling factor, but the server's SYN+ACK comes back with window scaling not available, and the connection is established with window scaling dropped]
    - 1. Calculate “SYN Cookies”
    - 2. Embed them in the SYN+ACK as the TCP initial sequence number
      => But the sequence number is only 32 bits: there is not enough space to store the window scaling factor
      => The window scaling option is dropped
    - This may cause a smaller window size than expected => degraded throughput

  33. Then, were SYN cookies the culprit? WAIT !!
    When TCP timestamp is enabled:
    [Sequence diagram: both the client and the server have TCP timestamp enabled; the window scaling factors are exchanged in the SYN and SYN+ACK and the connection is established normally]
    - 1. Calculate “SYN Cookies”
    - 2. Embed them in the SYN+ACK as the TCP initial sequence number
      => The sequence number is only 32 bits, but if TCP timestamp is enabled, the timestamp is used as extra space to store the window scaling factor !
    - 3. Reply and forget

  34. Then, were SYN cookies the culprit? WAIT !!
    Even if window scaling is dropped:
    - Window scaling is a mechanism to extend the TCP window size beyond 65535
    - It is effective especially on networks with a large round-trip time
    - However, 65535 should be fairly enough in our environment
      - Typical network latency is < 1ms

  35. Let's experiment
    - We needed to check how the window-scaling drop could affect throughput
    - We ran an experiment using a Kafka cluster in our dev environment (sysctl sketch below):
      - 1. Force every TCP connection to be established via SYN cookies
        - This can be done by setting the `net.ipv4.tcp_syncookies` kernel parameter to 2
      - 2. Restart a broker
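
    A minimal sketch of the sysctl change used for the experiment, writing through procfs (equivalent to `sysctl -w net.ipv4.tcp_syncookies=2`); it must run as root on the broker host, and the original value should be restored afterwards.

    SYSCTL_PATH = "/proc/sys/net/ipv4/tcp_syncookies"

    def get_syncookies_mode() -> int:
        with open(SYSCTL_PATH) as f:
            return int(f.read())

    def set_syncookies_mode(mode: int) -> None:
        # 0 = disabled, 1 = only on SYN-queue overflow (common default), 2 = always
        with open(SYSCTL_PATH, "w") as f:
            f.write(str(mode))

    original = get_syncookies_mode()
    set_syncookies_mode(2)        # force every new connection through SYN cookies
    # ... restart the broker and observe the producer ...
    set_syncookies_mode(original)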

  36. Reproduced even though TCP timestamp was still enabled
    - Our experimental producer app reported nearly 500ms request latency
    - While we didn't see such high response times on the broker side
    - Totally unexpected, as TCP timestamp was still enabled
      => window scaling should have been preserved

  37. Observation from tcpdump
    - Surprisingly, the producer was waiting for an ACK from the broker for every packet
    - This implies the window size advertised by the broker was small, so the producer couldn't send another packet until it received the ACK for the in-flight one
    - The advertised window size was 789
      - The actual window size would be 789 * 2^window_scaling_factor

  38. Observation from tcpdump on SYN+ACK
    - The dump of the SYN+ACK showed window scaling factor = 1
    - So the window size was actually very small: 789 * 2^1 = 1578
    - Next, we checked the broker's socket status to investigate why the window size advertised by the broker was so small

  39. Observation from ss
    - ss is a utility to check socket statistics
    - We can get detailed per-socket statistics by specifying the `-i` option. For example:

      info                                          field in ss output
      Window scaling factor of sender / receiver    wscale:
      Congestion window size                        cwnd:
      Whether the timestamp option is set           ts

  40. Observation from ss
    - Here, ss also showed a surprising result
    - The receiver (broker)'s window scaling factor on the established socket was 7
    - While the broker had conveyed 1 to the producer in the SYN+ACK
    - => These should of course be the same; otherwise the window size cannot be calculated correctly

  41. Which scaling factor is true?
    - We wondered which one (1 or 7) was actually used to calculate the window size on the broker side
    - After reading the Linux kernel (3.10) source code, we found the calculation is done in the `tcp_select_window` kernel function
    - => We wanted to dump the actually-used window scaling factor by hooking the function's arguments

  42. How can we hook kernel functions?
    BPF + kprobe !
    - BPF is a Linux feature that allows executing user-defined programs inside the kernel
    - BPF programs can be triggered by various events
      - kprobe : fired when a kernel function is called
    - There is a toolkit called bcc (BPF Compiler Collection) that lets us write BPF programs in C (minimal sketch below)
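
    A much-simplified sketch of the approach (not the actual tcp_wnd.py, which is linked on the next slide): attach a kprobe to tcp_select_window() with bcc and print the window scaling factors the kernel holds for that socket. It assumes root privileges and a kernel/bcc combination on which the symbol is kprobe-able.

    from bcc import BPF

    prog = r"""
    #include <uapi/linux/ptrace.h>
    #include <net/sock.h>
    #include <linux/tcp.h>

    int kprobe__tcp_select_window(struct pt_regs *ctx, struct sock *sk)
    {
        // struct tcp_sock embeds struct sock as its first member, so this cast is valid.
        struct tcp_sock *tp = (struct tcp_sock *)sk;

        // snd_wscale / rcv_wscale are bitfields, so copy the whole rx_opt struct
        // instead of taking a bitfield's address.
        struct tcp_options_received opts = {};
        bpf_probe_read(&opts, sizeof(opts), &tp->rx_opt);

        bpf_trace_printk("snd_wscale=%d rcv_wscale=%d\n",
                         opts.snd_wscale, opts.rcv_wscale);
        return 0;
    }
    """

    # Compiles the program, attaches the kprobe, and streams the trace pipe.
    BPF(text=prog).trace_print()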

  43. Which scaling factor is true?
    - We wrote a bcc script to dump the actual window scaling factor used for the window calculation
      - https://github.com/ocadaruma/mybcc/blob/main/tools/tcp_wnd.py
    - As a result, we found the broker actually used `7` to calculate the window size, while the producer assumed `1`, the value included in the SYN+ACK

  44. The discrepancy was due to a Linux kernel bug
    - We read through the Linux kernel (3.10) source code again
    - We found that the logic to determine the window scaling factor was slightly different between the following locations:
      - Constructing the SYN+ACK packet
      - (In case of SYN flood) Decoding the SYN cookies and completing the handshake
    - => This difference caused an inconsistent window scaling factor, but only when a connection was established through SYN cookies
    - It was a Linux kernel bug (fixed in 5.10)
      - https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417

  45. Okay, all mysteries are solved.
    Overall flow:
    - 1. Restarting a broker caused a SYN flood due to reconnections from many clients
    - 2. Consequently, some unlucky producers' TCP connections were established through SYN cookies

  46. Overall flow:
    - 3. Due to the kernel bug, the window scaling factor became inconsistent
    [Sequence diagram: the broker initially calculates its window scaling factor as 1 and sends it in the SYN+ACK; the producer remembers 1; due to the Linux kernel bug, the broker's window scaling factor is later overridden with 7]

  47. Overall flow:
    - 4. The window-scaling discrepancy broke the window size calculation (worked numbers below)
    [Diagram:
      Broker: 1. calculates the new window size, 2. applies window scaling:
        advertised_window_size = window_size >> 7
      Producer: 1. restores the original window size from the advertised one, 2. applies window scaling:
        window_size = advertised_window_size << 1
      => the producer ends up with a 64-times-smaller window size]
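
    Toy arithmetic only (no kernel code) to reproduce the shrink with the numbers we observed: the broker scales its real window down by its factor of 7, while the producer scales the advertised value back up by the factor of 1 it learned from the SYN+ACK. The real window value is illustrative.

    real_window = 101_000                 # broker's actual receive window (illustrative)

    advertised = real_window >> 7         # what the broker puts in the TCP header: 789
    producer_view = advertised << 1       # what the producer reconstructs: 1578

    print(advertised, producer_view)      # 789 1578 -- matches the tcpdump observation
    print(real_window // producer_view)   # 64 -> the effective window is 64x too small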

  48. Overall flow:
    - 5. The producer waits for an ACK from the broker for every packet due to the too-small window size
    - 6. The producer starts to take an insanely long time to completely send out requests
    - 7. Produce request timeout

  49. Solution
    - Disable TCP SYN cookies
      - On our Kafka platform, a real “SYN flood attack” is not going to happen
        - Even though there was a SYN flood, all the connections were of course valid
      - Even if SYNs are dropped, clients will soon retry the SYN
    - Drawback:
      - Extra delay is added until the connection succeeds when the SYN queue is full

  50. Solution
    - In addition, we should increase the Kafka socket's listen backlog size (see the sketch below)
      - To avoid causing a SYN flood in the first place
    - As of 2021-09-29, Kafka's backlog size is hardcoded (50)
      => We can't configure it
    - We reported a KIP and the discussion is ongoing:
      - KIP-764: Configurable backlog size for creating Acceptor
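
    An illustration of the knob KIP-764 asks for, using a hypothetical plain TCP server (not Kafka's Acceptor): the backlog passed to listen() bounds the kernel's queues for incoming connections and is further capped by net.core.somaxconn, which is why a hardcoded 50 is easy to overflow during a mass reconnect.

    import socket

    BACKLOG = 4096   # the kind of value we would like to be able to configure

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 19092))   # hypothetical port
    srv.listen(BACKLOG)            # effective backlog = min(BACKLOG, net.core.somaxconn)
    print("listening with requested backlog", BACKLOG)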

  51. Conclusion
    - On a large-scale Kafka platform that many clients connect to, a SYN flood may happen
    - On Linux kernel versions < 5.10, TCP SYN cookies may cause significant network throughput degradation due to an inconsistent window scaling factor
    - tcpdump, ss, and bcc (BPF) greatly helped us investigate network-stack issues
