refs: “Reliability Engineering Behind The Most Trusted Kafka Platform” (LINE DEVELOPER DAY 2019) - We put a lot of engineering effort into: - measuring / optimizing our cluster’s performance - troubleshooting
Produce request timeout: records cannot be delivered to brokers within the configured timeout
org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic-46:600000 ms has passed since batch creation
    at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:98)
    at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:67)
    at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:30)
    at ...
… (multiple replicas are dead) => No such error log on the producer - Suspect 1: produce-request response time degradation (e.g. replication delay) => No. Response time was fairly normal
We were not sure about the root cause, but it had to be handled ASAP anyway - => Tried restarting the broker again - => Worked like a charm… (and we lost the environment for investigation)
Hypothesis: Producer’s I/O thread got stuck - To verify the hypothesis, we needed: - Kafka client’s DEBUG-level logs - JVM profiles from async-profiler - We asked for the service developers’ cooperation for an experiment on the production environment: - 1. Be prepared for monitoring and taking profiles - 2. Restart the same broker again (possibly several times)
Attempt to reproduce: FAILED - It turned out that verifying the hypothesis is quite hard, not only because of the lack of clues for tracing the client side’s behavior, but also because of low reproducibility
We eventually got multiple instances of the phenomenon - We started to find a similar pattern in every occurrence which could plausibly delay sending produce requests
SYN cookies: when the SYN queue is FULL, the server completes handshakes statelessly - On receiving a SYN: 1. Calculate “SYN cookies” 2. Embed it in the SYN+ACK as the TCP initial sequence number 3. Reply and forget - On receiving the client’s ACK: 1. Decode the SYN cookies 2. Complete the handshake (connection established)
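- As a rough illustration of the “reply and forget” idea, here is a conceptual Python sketch (hypothetical hashing scheme; the kernel’s real SYN-cookie algorithm differs and also encodes an MSS index into the cookie):

```python
import hashlib
import time

SECRET = b"per-boot-random-secret"  # hypothetical stand-in for the kernel's secret

def syn_cookie(src, sport, dst, dport, counter):
    # Keyed hash of the connection 4-tuple plus a slowly moving counter,
    # truncated to 32 bits so it fits into the TCP initial sequence number.
    h = hashlib.sha256(SECRET + f"{src}:{sport}->{dst}:{dport}@{counter}".encode())
    return int.from_bytes(h.digest()[:4], "big")

def make_syn_ack_isn(src, sport, dst, dport):
    # "Reply and forget": the server stores nothing; all state lives in this number.
    return syn_cookie(src, sport, dst, dport, int(time.time()) // 64)

def validate_final_ack(src, sport, dst, dport, ack_no):
    # On the final ACK, recompute the cookie and compare it with (ack_no - 1);
    # accept the current and previous counter values to tolerate the clock moving on.
    for age in (0, 1):
        counter = int(time.time()) // 64 - age
        if syn_cookie(src, sport, dst, dport, counter) == ((ack_no - 1) & 0xFFFFFFFF):
            return True
    return False
```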
TCP flow control: the receiver (server) advertises its window size, and the sender can send packets without waiting for an ACK up to the receiver’s window size - Each ACK advertises the receiver’s new window size
MAX window size: 65535 (TCP specification) - The client has to wait for the server’s ACK at least every 65535 bytes => could become a bottleneck if the round-trip time is large
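- To see what the 65535-byte cap means in practice, a back-of-the-envelope bound (illustrative numbers only): at most one window’s worth of data can be in flight per round trip:

```python
def max_throughput_bytes_per_sec(window_bytes: int, rtt_sec: float) -> float:
    # At most `window_bytes` can be in flight per round trip.
    return window_bytes / rtt_sec

# Without window scaling, the window is capped at 65535 bytes (TCP spec).
print(max_throughput_bytes_per_sec(65535, 0.001))  # ~65 MB/s at 1 ms RTT
print(max_throughput_bytes_per_sec(65535, 0.050))  # ~1.3 MB/s at 50 ms RTT
```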
Drawback of SYN cookies: window scaling drop - The cookie is embedded in the SYN+ACK as the TCP initial sequence number, but the sequence number is only 32 bits: not enough space to also store the window scaling factor => window scaling is dropped - May cause a smaller window size than expected => degraded throughput
Exception: when TCP timestamp is enabled (on both client and server) - The sequence number is only 32 bits, but if the TCP timestamp option is enabled, it is used as extra space to store the window scaling factor - The server can still “reply and forget”, and window scaling is preserved
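- A conceptual sketch of the trick (hypothetical bit layout, not the kernel’s exact encoding): a few low-order bits of the timestamp value carry the window scaling factor, and the server recovers it from the timestamp the client echoes back:

```python
def pack_tsval(clock_ms: int, wscale: int) -> int:
    # Stash the 4-bit window scaling factor (TCP allows 0..14) in the low bits.
    assert 0 <= wscale < 16
    return ((clock_ms & 0x0FFFFFFF) << 4) | wscale

def unpack_wscale(tsval: int) -> int:
    # Recover the stashed window scaling factor from the echoed timestamp.
    return tsval & 0xF

tsval = pack_tsval(clock_ms=123_456, wscale=7)
assert unpack_wscale(tsval) == 7
```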
Even if window-scaling is dropped, it should not matter much here: - Window-scaling is a mechanism to extend the TCP window size beyond 65535 - Effective especially on a network with a large round-trip time - However, 65535 should be fairly enough in our environment - Typical network latency is < 1 ms
To check whether SYN cookies could affect the throughput, we did an experiment using a Kafka cluster in the dev environment: - 1. Enforce every TCP connection to be established via SYN cookies - This can be done by setting the `net.ipv4.tcp_syncookies` kernel parameter to 2 (sketch below) - 2. Restart a broker
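- A minimal sketch of step 1, assuming root on the broker host (equivalent to `sysctl -w net.ipv4.tcp_syncookies=2`; the value 2 means “send SYN cookies unconditionally”, not only when the SYN queue overflows):

```python
# Force every new TCP connection to be established via SYN cookies.
with open("/proc/sys/net/ipv4/tcp_syncookies", "w") as f:
    f.write("2")
```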
The producer waited for an ACK from the broker for every packet - This implies the advertised window size from the broker was small, so the producer couldn’t send out another packet until it received the ACK for the in-flight packet - The advertised window size was 789 - The actual window size will be 789 * 2^window_scaling
So the actual window size was indeed very small! - 789 * 2^1 = 1578 - The dump of the SYN+ACK showed a window scaling factor of 1 - Next, we checked the broker’s socket status to investigate why the advertised window size from the broker was so small
ss: a utility to inspect socket statistics - We can check detailed socket statistics by specifying the `-i` option, for example: - wscale:<snd>,<rcv> : window scaling factors of sender / receiver - cwnd:<n> : congestion window size - ts : timestamp option is set
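- For example, a small sketch that shells out to `ss -ti` (assuming iproute2’s ss is installed) and pulls out the negotiated window scaling factors:

```python
import re
import subprocess

# Print the window scaling factors for TCP sockets reported by `ss -ti`.
out = subprocess.run(["ss", "-ti"], capture_output=True, text=True).stdout
for line in out.splitlines():
    m = re.search(r"wscale:(\d+),(\d+)", line)
    if m:
        print(f"snd_wscale={m.group(1)} rcv_wscale={m.group(2)}")
```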
We needed to check which window scaling factor (1 or 7) is actually used to calculate the window size on the broker side - After reading the Linux kernel (3.10) source code, we found this is done by the `tcp_select_window` kernel function - => We wanted to dump the actually-used window-scaling factor by hooking the function’s args
- BPF is a Linux feature that enables executing user-defined programs inside the kernel - BPF programs can be triggered by various events - kprobe: fired when a kernel function is called - There is a toolkit called bcc (BPF Compiler Collection) which lets us write BPF programs in C (driven from Python)
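- A minimal bcc sketch of the kprobe mechanism, just tracing calls into `tcp_select_window` (assuming the symbol is visible to kprobes, i.e. not inlined); the actual tool we used, linked below, additionally reads out the window-scaling factor:

```python
from bcc import BPF

# Trace every call into the kernel's tcp_select_window() via a kprobe.
prog = r"""
#include <uapi/linux/ptrace.h>

int trace_select_window(struct pt_regs *ctx) {
    bpf_trace_printk("tcp_select_window called\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_select_window", fn_name="trace_select_window")
b.trace_print()  # stream output from the kernel trace pipe
```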
We wrote a bcc script to dump the actual window scaling factor used for window calculation - https://github.com/ocadaruma/mybcc/blob/main/tools/tcp_wnd.py - As a result, we found the broker actually used `7` to calculate the window size, while the producer assumed `1`, which was what the SYN+ACK carried
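- Concretely, the same advertised value of 789 means a very different window on each side:

```python
advertised = 789              # raw window field seen in the packet capture
wscale_in_syn_ack = 1         # what the producer remembered from the SYN+ACK
wscale_used_by_broker = 7     # what the broker's kernel actually used

print(advertised << wscale_in_syn_ack)      # 1578 bytes: the producer's view
print(advertised << wscale_used_by_broker)  # 100992 bytes: what the broker intended
```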
We read through the Linux kernel (3.10) source code again - We found that the logic to determine the window scaling factor was slightly different between the following locations: - Constructing the SYN+ACK packet - (In case of SYN flood) Decoding the SYN cookies and completing the handshake - => This difference caused inconsistent window scaling only when a connection is established through SYN cookies - It was a Linux kernel bug (fixed in 5.10): - https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417
1. Restarting a broker caused a SYN flood due to reconnections from many clients - 2. Consequently, some unlucky producers’ TCP connections were established through SYN cookies
3. Due to the kernel’s bug, the window scaling got inconsistent - The window scaling factor was initially calculated as 1 (which the producer remembered from the SYN+ACK), but due to the Linux kernel’s bug it was overridden with 7 on the broker side
Producer had to wait for the broker’s ACK for every packet due to the too-small window size - 6. Producer started to take an insanely long time to complete sending out requests - 7. Produce request timeout
Mitigation: disable SYN cookies - Since ours is an internal platform, a real “SYN flood attack” is not going to happen - Though there was a SYN flood, of course all the connections are valid - Even if SYNs are dropped, clients will retry the SYN soon - Drawback: - Extra delay will be added until a successful connection in case the SYN queue is full
Another mitigation: increase the socket’s listen backlog size - To avoid causing the SYN flood in the first place - As of 2021-09-29, Kafka’s backlog size is hardcoded (50) => we can’t configure it - Reported a KIP, and the discussion is ongoing: - KIP-764: Configurable backlog size for creating Acceptor
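- For reference, the backlog in question is the argument to `listen()`; a minimal illustration (plain Python sockets, not Kafka’s code, example port only):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 9092))
# The backlog caps how many pending (not yet accepted) connections the kernel
# queues for this listening socket; Kafka's Acceptor currently passes a hardcoded 50.
# Note: Linux silently caps this value at net.core.somaxconn.
srv.listen(50)
```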
Summary: - On a Kafka platform that many clients connect to, a SYN flood may happen - In Linux kernel versions < 5.10, TCP SYN cookies may cause significant network throughput degradation due to an inconsistent window scaling factor - tcpdump, ss, and bcc (BPF) greatly helped us investigate the network stack issue