refs: “Reliability Engineering Behind The Most Trusted Kafka Platform” (LINE DEVELOPER DAY 2019) - We put a lot of engineering effort into: - measuring / optimizing our cluster’s performance - troubleshooting
Produce request timeout: records cannot be delivered to brokers within the configured timeout
org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic-46:600000 ms has passed since batch creation
    at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:98)
    at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:67)
    at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:30)
    at ...
… (multiple replicas are dead) => No such error log on the producer - Suspect 1: produce-request response time degradation (e.g. replication delay) => No. Response time was fairly normal
We were not sure about the root cause, but it had to be handled ASAP anyway - => Tried restarting the broker again - => Worked like a charm… (and we lost the environment for investigation)
Hypothesis: Producer’s I/O thread got stuck - To verify the hypothesis, we needed: - Kafka client’s DEBUG-level logs - JVM profiles from async-profiler - We asked for the service developers’ cooperation for an experiment on the production environment: - 1. Be prepared for monitoring and taking profiles - 2. Restart the same broker again (possibly several times)
Attempt to reproduce: FAILED - It turned out that verifying the hypothesis is quite hard, not only because of the lack of clues for tracing the client side’s behavior, but also because of low reproducibility
We eventually got multiple instances of the phenomenon - We started to find a similar pattern in every occurrence which could plausibly delay sending produce requests
SYN cookies: when the SYN queue is FULL, the server completes handshakes statelessly - On receiving a SYN: 1. Calculate “SYN cookies” 2. Embed it in the SYN+ACK as the TCP initial sequence number 3. Reply and forget - On receiving the client’s ACK: 1. Decode the SYN cookies 2. Complete the handshake (connection established)
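- As a rough illustration of the “reply and forget” idea, here is a conceptual Python sketch (hypothetical hashing scheme; the kernel’s real SYN-cookie algorithm differs and also encodes an MSS index into the cookie):

```python
import hashlib
import time

SECRET = b"per-boot-random-secret"  # hypothetical stand-in for the kernel's secret

def syn_cookie(src, sport, dst, dport, counter):
    # Keyed hash of the connection 4-tuple plus a slowly moving counter,
    # truncated to 32 bits so it fits into the TCP initial sequence number.
    h = hashlib.sha256(SECRET + f"{src}:{sport}->{dst}:{dport}@{counter}".encode())
    return int.from_bytes(h.digest()[:4], "big")

def make_syn_ack_isn(src, sport, dst, dport):
    # "Reply and forget": the server stores nothing; all state lives in this number.
    return syn_cookie(src, sport, dst, dport, int(time.time()) // 64)

def validate_final_ack(src, sport, dst, dport, ack_no):
    # On the final ACK, recompute the cookie and compare it with (ack_no - 1);
    # accept the current and previous counter values to tolerate the clock moving on.
    for age in (0, 1):
        counter = int(time.time()) // 64 - age
        if syn_cookie(src, sport, dst, dport, counter) == ((ack_no - 1) & 0xFFFFFFFF):
            return True
    return False
```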
TCP flow control: the receiver (server) advertises its window size, and the sender can send packets without waiting for an ACK up to the receiver’s window size - Each ACK advertises the receiver’s new window size
MAX window size: 65535 (TCP specification) - The client has to wait for the server’s ACK at least every 65535 bytes => could become a bottleneck if the round-trip time is large
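- To see what the 65535-byte cap means in practice, a back-of-the-envelope bound (illustrative numbers only): at most one window’s worth of data can be in flight per round trip:

```python
def max_throughput_bytes_per_sec(window_bytes: int, rtt_sec: float) -> float:
    # At most `window_bytes` can be in flight per round trip.
    return window_bytes / rtt_sec

# Without window scaling, the window is capped at 65535 bytes (TCP spec).
print(max_throughput_bytes_per_sec(65535, 0.001))  # ~65 MB/s at 1 ms RTT
print(max_throughput_bytes_per_sec(65535, 0.050))  # ~1.3 MB/s at 50 ms RTT
```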
Drawback of SYN cookies: window scaling drop - The cookie is embedded in the SYN+ACK as the TCP initial sequence number, but the sequence number is only 32 bits: not enough space to also store the window scaling factor => window scaling is dropped - May cause a smaller window size than expected => degraded throughput
Exception: when TCP timestamp is enabled (on both client and server) - The sequence number is only 32 bits, but if the TCP timestamp option is enabled, it is used as extra space to store the window scaling factor - The server can still “reply and forget”, and window scaling is preserved
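- A conceptual sketch of the trick (hypothetical bit layout, not the kernel’s exact encoding): a few low-order bits of the timestamp value carry the window scaling factor, and the server recovers it from the timestamp the client echoes back:

```python
def pack_tsval(clock_ms: int, wscale: int) -> int:
    # Stash the 4-bit window scaling factor (TCP allows 0..14) in the low bits.
    assert 0 <= wscale < 16
    return ((clock_ms & 0x0FFFFFFF) << 4) | wscale

def unpack_wscale(tsval: int) -> int:
    # Recover the stashed window scaling factor from the echoed timestamp.
    return tsval & 0xF

tsval = pack_tsval(clock_ms=123_456, wscale=7)
assert unpack_wscale(tsval) == 7
```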
Even if window-scaling is dropped, it should not matter much here: - Window-scaling is a mechanism to extend the TCP window size beyond 65535 - Effective especially on a network with a large round-trip time - However, 65535 should be fairly enough in our environment - Typical network latency is < 1 ms
To check whether SYN cookies could affect the throughput, we did an experiment using a Kafka cluster in the dev environment: - 1. Enforce every TCP connection to be established via SYN cookies - This can be done by setting the `net.ipv4.tcp_syncookies` kernel parameter to 2 (sketch below) - 2. Restart a broker
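- A minimal sketch of step 1, assuming root on the broker host (equivalent to `sysctl -w net.ipv4.tcp_syncookies=2`; the value 2 means “send SYN cookies unconditionally”, not only when the SYN queue overflows):

```python
# Force every new TCP connection to be established via SYN cookies.
with open("/proc/sys/net/ipv4/tcp_syncookies", "w") as f:
    f.write("2")
```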
The producer waited for an ACK from the broker for every packet - This implies the advertised window size from the broker was small, so the producer couldn’t send out another packet until it received the ACK for the in-flight packet - The advertised window size was 789 - The actual window size will be 789 * 2^window_scaling
So the actual window size was indeed very small! - 789 * 2^1 = 1578 - The dump of the SYN+ACK showed a window scaling factor of 1 - Next, we checked the broker’s socket status to investigate why the advertised window size from the broker was so small
ss: a utility to inspect socket statistics - We can check detailed socket statistics by specifying the `-i` option, for example: - wscale:<snd>,<rcv> : window scaling factors of sender / receiver - cwnd:<n> : congestion window size - ts : timestamp option is set
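- For example, a small sketch that shells out to `ss -ti` (assuming iproute2’s ss is installed) and pulls out the negotiated window scaling factors:

```python
import re
import subprocess

# Print the window scaling factors for TCP sockets reported by `ss -ti`.
out = subprocess.run(["ss", "-ti"], capture_output=True, text=True).stdout
for line in out.splitlines():
    m = re.search(r"wscale:(\d+),(\d+)", line)
    if m:
        print(f"snd_wscale={m.group(1)} rcv_wscale={m.group(2)}")
```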
We needed to check which window scaling factor (1 or 7) is actually used to calculate the window size on the broker side - After reading the Linux kernel (3.10) source code, we found this is done by the `tcp_select_window` kernel function - => We wanted to dump the actually-used window-scaling factor by hooking the function’s args
- BPF is a Linux feature that enables executing user-defined programs inside the kernel - BPF programs can be triggered by various events - kprobe: fired when a kernel function is called - There is a toolkit called bcc (BPF Compiler Collection) which lets us write BPF programs in C (driven from Python)
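- A minimal bcc sketch of the kprobe mechanism, just tracing calls into `tcp_select_window` (assuming the symbol is visible to kprobes, i.e. not inlined); the actual tool we used, linked below, additionally reads out the window-scaling factor:

```python
from bcc import BPF

# Trace every call into the kernel's tcp_select_window() via a kprobe.
prog = r"""
#include <uapi/linux/ptrace.h>

int trace_select_window(struct pt_regs *ctx) {
    bpf_trace_printk("tcp_select_window called\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_select_window", fn_name="trace_select_window")
b.trace_print()  # stream output from the kernel trace pipe
```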
We wrote a bcc script to dump the actual window scaling factor used for window calculation - https://github.com/ocadaruma/mybcc/blob/main/tools/tcp_wnd.py - As a result, we found the broker actually used `7` to calculate the window size, while the producer assumed `1`, which was what the SYN+ACK carried
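- Concretely, the same advertised value of 789 means a very different window on each side:

```python
advertised = 789              # raw window field seen in the packet capture
wscale_in_syn_ack = 1         # what the producer remembered from the SYN+ACK
wscale_used_by_broker = 7     # what the broker's kernel actually used

print(advertised << wscale_in_syn_ack)      # 1578 bytes: the producer's view
print(advertised << wscale_used_by_broker)  # 100992 bytes: what the broker intended
```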
We read through the Linux kernel (3.10) source code again - We found that the logic to determine the window scaling factor was slightly different between the following locations: - Constructing the SYN+ACK packet - (In case of SYN flood) Decoding the SYN cookies and completing the handshake - => This difference caused inconsistent window scaling only when a connection is established through SYN cookies - It was a Linux kernel bug (fixed in 5.10): - https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417
1. Restarting a broker caused a SYN flood due to reconnections from many clients - 2. Consequently, some unlucky producers’ TCP connections were established through SYN cookies
3. Due to the kernel’s bug, the window scaling got inconsistent - The window scaling factor was initially calculated as 1 (which the producer remembered from the SYN+ACK), but due to the Linux kernel’s bug it was overridden with 7 on the broker side
Producer had to wait for the broker’s ACK for every packet due to the too-small window size - 6. Producer started to take an insanely long time to complete sending out requests - 7. Produce request timeout
Mitigation: disable SYN cookies - Since ours is an internal platform, a real “SYN flood attack” is not going to happen - Though there was a SYN flood, of course all the connections are valid - Even if SYNs are dropped, clients will retry the SYN soon - Drawback: - Extra delay will be added until a successful connection in case the SYN queue is full
Another mitigation: increase the socket’s listen backlog size - To avoid causing the SYN flood in the first place - As of 2021-09-29, Kafka’s backlog size is hardcoded (50) => we can’t configure it - Reported a KIP, and the discussion is ongoing: - KIP-764: Configurable backlog size for creating Acceptor
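- For reference, the backlog in question is the argument to `listen()`; a minimal illustration (plain Python sockets, not Kafka’s code, example port only):

```python
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("0.0.0.0", 9092))
# The backlog caps how many pending (not yet accepted) connections the kernel
# queues for this listening socket; Kafka's Acceptor currently passes a hardcoded 50.
# Note: Linux silently caps this value at net.core.somaxconn.
srv.listen(50)
```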
Summary: - On a Kafka platform that many clients connect to, a SYN flood may happen - In Linux kernel versions < 5.10, TCP SYN cookies may cause significant network throughput degradation due to an inconsistent window scaling factor - tcpdump, ss, and bcc (BPF) greatly helped us investigate the network stack issue