
Investigating Request Delay in a Large-Scale Kafka Cluster Caused by TCP

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. Apache Kafka - Distributed messaging middleware - One of the most widely used pieces of middleware at LINE - Highly scalable - Multi-tenancy (ACL, Quota)
  2. Apache Kafka [Diagram: producer clients send messages to brokers in the Kafka cluster; consumer clients receive them]
  3. Apache Kafka at LINE - Shared Kafka cluster - We adopt a multi-tenant model: many producers and consumers share the same brokers
  4. Apache Kafka at LINE - User services: over 100 - Peak message inflow: 15M messages/sec - Daily in + out traffic: 1.4 PB
  5. Efforts to make our cluster highly reliable and performant - refs: “Reliability Engineering Behind The Most Trusted Kafka Platform” (LINE DEVELOPER DAY 2019) - We put a lot of engineering effort into: - measuring / optimizing our cluster’s performance - troubleshooting
  6. Phenomenon - One day, while we were in the middle of restarting Kafka brokers to roll out a config update => producers started to report errors (next slide)
  7. Produce request timeout - Messages cannot be delivered to brokers within the configured timeout - Critical problem, as it implies data loss:
     org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic-46:600000 ms has passed since batch creation
       at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:98)
       at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:67)
       at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:30)
       at ...
  8. Started investigation, but… - Suspect 1: produce-request response time degradation (e.g. replication delay) => No. Response time was fairly normal - Suspect 2: continuous produce-request rejection (e.g. multiple replicas are dead) => No such error log on the producer
  9. Should handle it ASAP anyway - We had to handle the ongoing data loss ASAP even though we were not sure about the root cause - => Tried restarting the broker again - => Worked like a charm… (and we lost the environment for investigation)
  10. Different scope of measuring request duration at producer vs broker - “Time till complete sending request” is only included in the producer-side latency metric, not the broker-side one
  11. Hypothesis: Producer’s I/O thread got stuck - We concluded that detailed producer-side information is necessary to verify the hypothesis: - Kafka client’s DEBUG-level logs - JVM profiles from async-profiler - We asked for the service developers’ cooperation for an experiment on the production environment: - 1. Be prepared for monitoring and taking profiles - 2. Restart the same broker again (possibly several times)
  12. Attempt to reproduce: FAILED - Unfortunately, the phenomenon didn’t reproduce even after several attempts - It turned out that verifying the hypothesis is quite hard, not only because of the lack of clues for tracing client-side behavior but also because of the low reproducibility
  13. Hypothesis: Network performance degradation - But ping latency was quite normal - Also there was no: - increase in TCP retransmissions - network bandwidth saturation
  14. In the meantime… The problem happened again - On a different service (producer) and a different broker - Exactly the same phenomenon as the previous occurrence: - Producer requests were delayed while the broker’s performance was completely sane - Started right after a broker was restarted - Recovered by another restart
  15. Can't we get any clues from these occurrences? - We now had multiple instances of the phenomenon - We started to look for a pattern common to every occurrence that could cause sending produce requests to be delayed
  16. Many TCP SYN cookies were sent - Finally, we found that “node_netstat_TcpExt_SyncookiesSent” spiked upon broker restart, and only on the brokers that triggered the phenomenon
  17. (Study) TCP SYN cookies: TCP 3-way handshake - Client sends SYN => server stores the SYN info in its SYN queue and replies with SYN+ACK - Client sends ACK => server retrieves the SYN info, removes it from the queue, and the connection is established
  18. (Study) TCP SYN cookies: SYN flood attack - An attacker sends many SYNs and ignores the SYN+ACK replies - The connections are left in the “half-open” state, so their SYN info is never retrieved and removed from the queue - => The SYN queue becomes FULL
  19. (Study) TCP SYN cookies - When the SYN queue is FULL, the server: - 1. Calculates a “SYN cookie” - 2. Embeds it in the SYN+ACK as the TCP initial sequence number - 3. Replies and forgets (stores nothing in the queue) - When the client’s ACK arrives, the server: - 1. Decodes the SYN cookie - 2. Completes the handshake => Established (a toy sketch of the idea follows)
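    A toy sketch of the flow above (hedged: this is NOT the Linux kernel's actual SYN cookie algorithm, just an illustration of "calculate, embed in the ISN, reply and forget, decode"). The secret, the hash choice and the 3-bit MSS index are assumptions made for the example:

      # Toy SYN cookie: derive a verifiable 32-bit value from the 4-tuple,
      # a coarse time slot and a server-side secret, and use it as the ISN
      # so the server keeps no per-connection state.
      import hashlib
      import time

      SECRET = b"server-local-secret"   # stand-in for the kernel's secret keys

      def _hash(saddr, sport, daddr, dport, tslot):
          msg = f"{saddr}:{sport}-{daddr}:{dport}-{tslot}".encode()
          return int.from_bytes(hashlib.sha256(SECRET + msg).digest()[:4], "big")

      def make_cookie(saddr, sport, daddr, dport, mss_idx):
          """Steps 1-2: calculate the cookie and use it as the initial sequence number."""
          tslot = int(time.time()) >> 6   # 64-second time slot
          return ((_hash(saddr, sport, daddr, dport, tslot) & 0x1FFFFFFF) << 3) | (mss_idx & 0x7)

      def check_cookie(cookie, saddr, sport, daddr, dport):
          """On ACK: decode the cookie; return the encoded MSS index, or None if invalid."""
          mss_idx = cookie & 0x7
          for back in range(2):           # accept the current and the previous time slot
              tslot = (int(time.time()) >> 6) - back
              expected = ((_hash(saddr, sport, daddr, dport, tslot) & 0x1FFFFFFF) << 3) | mss_idx
              if expected == cookie:
                  return mss_idx
          return None

      # "Reply and forget": nothing is stored; the ACK alone lets the server
      # validate the handshake and recover the encoded connection info.
      cookie = make_cookie("10.0.0.1", 54321, "10.0.0.2", 9092, mss_idx=5)
      print(check_cookie(cookie, "10.0.0.1", 54321, "10.0.0.2", 9092))   # -> 5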
  20. Restarting Kafka caused a SYN flood - When a broker is restarted, its producers and consumers fail over to the other brokers
  21. Restarting Kafka caused a SYN flood - Once the broker is back, the clients reconnect to the original broker => this naturally caused a SYN flood (the broker had over 18K client connections)
  22. Drawback of SYN cookies: (Study) TCP window - The sender can send packets without waiting for an ACK, up to the receiver’s window size - Each ACK from the receiver advertises the receiver’s new window size
  23. Drawback of SYN cookies: (Study) Window size limitation - Max window size: 65535 (the window field is 16 bits in the TCP specification) - The client has to wait for the server’s ACK at least every 65535 bytes => could become a bottleneck if the round-trip time is large
  24. Drawback of SYN cookies: (Study) Window scaling - Client and server exchange their “window scaling factors” during the handshake: the client’s in the SYN, the server’s in the SYN+ACK
  25. Drawback of SYN cookies: (Study) Window scaling - Extends the window size beyond 65535 by applying the window-scaling factor: actual_window_size = window_size * 2^window_scaling - => Improves throughput, especially on networks with a large round-trip time
  26. Drawback of SYN cookies: (Study) Window scaling - (Example: window scaling factor = 1) - If the server’s available receive buffer is 102400 bytes => it advertises 102400 >> 1 = 51200 as the window size - The client restores the window size as 51200 << 1 = 102400 => it can send up to 102400 bytes (see the sketch below)
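    A minimal sketch of the scaling arithmetic in the example above (illustrative only; the real logic lives in the kernel's TCP stack):

      # Window-scaling arithmetic: the receiver shifts its window down before
      # putting it in the 16-bit field, the sender shifts it back up.
      def advertise(window_size: int, scaling_factor: int) -> int:
          """Value the receiver puts on the wire."""
          return window_size >> scaling_factor

      def restore(advertised: int, scaling_factor: int) -> int:
          """Window the sender assumes it may fill."""
          return advertised << scaling_factor

      adv = advertise(102400, 1)       # 51200 goes on the wire
      print(adv, restore(adv, 1))      # 51200 102400 -> sender may send up to 102400 bytes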
  27. Drawback of SYN cookies: (Study) Window scaling drop - With SYN cookies the server: 1. Calculates the “SYN cookie” 2. Embeds it in the SYN+ACK as the TCP initial sequence number - => But the sequence number is only 32 bits: not enough space to also store the window scaling factor => window scaling is dropped (“Window scaling not available” for the connection) - May cause a smaller window size than expected => degraded throughput
  28. Then, were SYN cookies the culprit? WAIT !! - When TCP timestamp is enabled on both client and server: - The sequence number is only 32 bits, but the TCP timestamp option can be used as extra space to store the window scaling factor ! - So the server can still “reply and forget” without dropping window scaling
  29. Then, were SYN cookies the culprit? WAIT !! - Even if window scaling were dropped: - Window scaling is a mechanism to extend the TCP window size beyond 65535 - It is effective especially on networks with a large round-trip time - However, 65535 should be fairly enough in our environment - Typical network latency is < 1ms
  30. Let’s experiment - We needed to check how dropping window scaling could affect throughput - We did an experiment using a Kafka cluster in our dev environment: - 1. Force every TCP connection to be established via SYN cookies - This can be done by setting the `net.ipv4.tcp_syncookies` kernel parameter to 2 (see the sketch below) - 2. Restart a broker
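    A minimal sketch of step 1 on the broker host, assuming root access (equivalent to `sysctl -w net.ipv4.tcp_syncookies=2`):

      # Force SYN cookies for every connection by writing 2 to the sysctl
      # via procfs (1 = only on SYN-queue overflow, 2 = unconditionally).
      from pathlib import Path

      param = Path("/proc/sys/net/ipv4/tcp_syncookies")
      print("before:", param.read_text().strip())
      param.write_text("2\n")
      print("after:", param.read_text().strip())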
  31. Reproduced, even though TCP timestamp was still enabled - Our experimental producer app reported nearly 500ms request latency - While we didn’t see such high response times on the broker side - Totally unexpected, as TCP timestamp was still enabled => window scaling should have been preserved
  32. Observation from tcpdump - Surprisingly, the producer was waiting for an ACK from the broker for every packet - This implies the window size advertised by the broker was small, so the producer couldn’t send out another packet until it received the ACK for the in-flight one - The advertised window size was 789 - The actual window size will be 789 * 2^window_scaling
  33. Observation from tcpdump on SYN+ACK - The dump of the SYN+ACK showed window scaling factor = 1 - Oh, then the window size was actually very small ! - 789 * 2^1 = 1578 bytes - Next, we should check the broker’s socket status to investigate why the window size advertised by the broker was so small
  34. Observation from ss - ss is a utility to check socket statistics - We can get detailed per-socket statistics by specifying the `-i` option (see the sketch below); fields of interest: - wscale:<snd_wscale>,<rcv_wscale> : window scaling factors of sender / receiver - cwnd: congestion window size - ts : whether the timestamp option is set
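    A small helper sketch (not from the talk) that shells out to ss and prints the detail lines containing wscale; the port 9092 filter is just an assumed Kafka listener port:

      # Dump ss -i detail lines (wscale, cwnd, ts, ...) for sockets on one port.
      import subprocess

      def dump_wscale(port: int = 9092) -> None:
          out = subprocess.run(
              ["ss", "-tin", f"sport = :{port}"],
              capture_output=True, text=True, check=True,
          ).stdout
          for line in out.splitlines():
              if "wscale" in line:
                  print(line.strip())

      if __name__ == "__main__":
          dump_wscale()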
  35. Observation from ss - Here, ss also showed a surprising result - The receiver (broker)’s window scaling factor on the established socket was 7 - While the broker had conveyed 1 to the producer through the SYN+ACK - => These should of course be the same, so that the window size is calculated correctly
  36. Which scaling factor is true? - We wondered which one (1 or 7) is actually used to calculate the window size on the broker side - After reading the Linux kernel (3.10) source code, we found this is done by the `tcp_select_window` kernel function - => We wanted to dump the actually-used window-scaling factor by hooking the function’s arguments
  37. How can we hook kernel functions? BPF + kprobe ! - BPF is a Linux feature that enables executing user-defined programs inside the kernel - BPF programs can be triggered by various events - kprobe : triggers when a kernel function is called - There is a toolkit called bcc (BPF Compiler Collection) which lets us write BPF programs in C and drive them from Python (see the sketch below)
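    A minimal bcc sketch of the hooking pattern (not the tcp_wnd.py script linked on the next slide, which additionally reads the window-scaling factor out of the tcp_sock). It assumes tcp_select_window is not inlined and therefore visible to kprobes:

      # Attach a kprobe to tcp_select_window() and log which socket it runs for.
      from bcc import BPF

      bpf_text = r"""
      #include <net/sock.h>

      int kprobe__tcp_select_window(struct pt_regs *ctx, struct sock *sk)
      {
          u16 lport = sk->__sk_common.skc_num;    // local port (host byte order)
          u16 dport = sk->__sk_common.skc_dport;  // remote port (network byte order)
          dport = (dport >> 8) | ((dport & 0xff) << 8);
          bpf_trace_printk("tcp_select_window lport=%d dport=%d\n", lport, dport);
          return 0;
      }
      """

      b = BPF(text=bpf_text)
      print("Tracing tcp_select_window()... Hit Ctrl-C to end.")
      b.trace_print()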
  38. Which scaling factor is true? - We wrote a bcc script to dump the actual window scaling factor used for the window calculation - https://github.com/ocadaruma/mybcc/blob/main/tools/tcp_wnd.py - As a result, we found the broker actually used `7` to calculate the window size, while the producer assumed it was `1`, the value included in the SYN+ACK
  39. The discrepancy was due to a Linux kernel bug - We read through the Linux kernel (3.10) source code again - We found that the logic to determine the window scaling factor was slightly different between the following locations: - Constructing the SYN+ACK packet - (In case of SYN flood) Decoding the SYN cookie and completing the handshake - => This difference causes inconsistent window scaling only when a connection is established through SYN cookies - It was a Linux kernel bug (fixed in 5.10): - https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417
  40. Okay, all mysteries are solved. Overall flow: - 1. Restarting a broker caused a SYN flood due to reconnections from many clients - 2. Consequently, some unlucky producers’ TCP connections were established through SYN cookies
  41. Overall flow: - 3. Due to the kernel bug, window scaling became inconsistent - The broker initially calculates its window scaling factor as 1 and sends it in the SYN+ACK; the producer remembers 1 - When the handshake is completed via SYN cookies, the kernel bug overrides the broker’s factor with 7
  42. Overall flow: - 4. The window-scaling discrepancy broke the window size calculation (see the sketch below) - Broker: 1. Calculates the new window size 2. Applies window scaling: advertised_window_size = window_size >> 7 - Producer: 1. Calculates the original window size from the advertised one 2. Applies window scaling: window_size = advertised_window_size << 1 - => 64 times smaller window size than intended
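    The same advertise/restore arithmetic as before, but with the mismatched factors; the 101000-byte receive window is an assumed example value, chosen so the on-wire number matches the 789 seen in the tcpdump:

      # Broker scales down with 7, producer scales back up with 1:
      # the producer ends up with a window 2^(7-1) = 64 times too small.
      real_window = 101000                 # assumed broker receive window (bytes)
      advertised = real_window >> 7        # 789 goes on the wire
      producer_view = advertised << 1      # 1578 -> what the producer believes
      print(advertised, producer_view, real_window // producer_view)   # 789 1578 64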
  43. Overall flow: - 5. The producer waits for an ACK from the broker for every packet due to the too-small window size - 6. The producer starts to take an insanely long time to finish sending out requests - 7. Produce request timeout
  44. Solution - Disable TCP SYN cookies (net.ipv4.tcp_syncookies = 0) - On our Kafka platform, a real “SYN flood attack” is not going to happen - Even though there was a SYN flood, all of the connections were of course valid - Even if SYNs are dropped, clients will soon retry the SYN - Drawback: - Extra delay is added until the connection succeeds when the SYN queue is full
  45. Solution - In addition, we should increase the Kafka socket’s listen backlog size - To avoid causing a SYN flood in the first place (see the sketch below) - As of 2021-09-29, Kafka’s backlog size is hardcoded (50) => we can’t configure it - We reported a KIP and discussion is ongoing: - KIP-764: Configurable backlog size for creating Acceptor
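    Not Kafka's code, just a generic illustration of what the listen backlog is at the socket API level; the port and backlog values are arbitrary examples:

      # The backlog argument to listen() bounds the queue of connections that
      # have completed the handshake but have not been accept()ed yet; a larger
      # value helps absorb reconnection bursts like the one after a broker restart.
      import socket

      srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      srv.bind(("0.0.0.0", 9092))   # example port only
      srv.listen(4096)              # vs. the hardcoded 50 mentioned on the slide
      print("listening with a 4096-entry backlog")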
  46. Conclusion - In a large-scale Kafka platform that many clients connect to, a SYN flood may happen - On Linux kernel versions < 5.10, TCP SYN cookies may cause significant network throughput degradation due to an inconsistent window scaling factor - tcpdump, ss, and bcc (BPF) greatly helped us investigate the network stack issue