Slide 1

Slide 1 text

Mysterious producer slowdown in a large scale Kafka platform
Haruki Okada

Slide 2

Slide 2 text

Speaker
● Haruki Okada
● Senior software engineer at LINE Corporation
● Developing company-wide Apache Kafka platform

Slide 3

Slide 3 text

Apache Kafka at LINE
● Multi-tenancy
  ● Many services connect to our shared Kafka cluster
● Handle huge production rate
  ● 15M messages / sec
  ● 7.5GB / sec
(Diagram: shared Kafka cluster serving LINE Ads, LINE NEWS, LINE Official Account, …)

Slide 4

Slide 4 text

Phenomenon
One day…
org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic-46:600000 ms has passed since batch creation
  at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:98)
  at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:67)
  at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:30)
  at ...
=> Producer lost some messages 😨

Slide 5

Slide 5 text

Possible causes of batch expiration
● Case 1: Broker returns error responses for produce requests continuously
  ● e.g. NotEnoughReplicas, NotLeaderForPartition, …
  ● => No. There are no client logs indicating error responses
● Case 2: Produce request latency is too high against the production rate

Slide 6

Slide 6 text

Indeed, request latency degraded
● The producer's Kafka client metrics (via JMX) showed insane request latency
● But there was no such high latency recorded on the broker
● => HOW?
● Another observation:
  ● Such high (client-side) produce latency happened only against a specific broker
  ● The producer was able to produce to other brokers

Slide 7

Slide 7 text

Anyway, we had to handle it ASAP
● Message loss was ongoing on the producer, so we had to handle it ASAP
● We were still not sure about the root cause, but the phenomenon started right after we restarted the Kafka process on the broker
● => We decided to restart the broker again.
● => Huh, it worked like a charm… (and we lost the environment for investigation)

Slide 8

Slide 8 text

Current observation
● The producer started to experience slow request latency right after we restarted a broker
  ● While no such slow latency was observed on the broker side
● Possible causes:
  ● Case 1: The producer slowed down entirely (e.g. Sender thread blocking)
    ● => No, because the producer could produce to other brokers
  ● Case 2: Network performance degraded

Slide 9

Slide 9 text

How request-latency is measured
● As we can see, the "time to complete sending the request" is only included in the client-side latency
● => A broker/producer latency discrepancy can happen if sending out the request is delayed (rough model below)
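To make that concrete, here is a minimal model of the two measurement windows (illustrative names and numbers only, not actual Kafka metric code): the client measures from the moment it starts writing the request, while the broker only starts counting once the full request has been read off the socket.

```python
# Illustrative model of the two measurement windows (not Kafka code).
# Timestamps are in seconds on a shared clock.

def client_side_latency(send_start, response_received):
    # The client measures from when it starts writing the request
    # until the response comes back -- slow sending is included.
    return response_received - send_start

def broker_side_latency(request_fully_read, response_sent):
    # The broker measures only after the whole request has been read
    # from the socket, so a slow inbound transfer is invisible here.
    return response_sent - request_fully_read

# Example: writing the request out takes 480 ms (tiny TCP window),
# while broker-side processing takes only 5 ms.
print(client_side_latency(0.000, 0.486))   # ~0.486 s seen by the producer
print(broker_side_latency(0.480, 0.485))   # ~0.005 s seen by the broker
```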

Slide 10

Slide 10 text

Was network slow?
● No. It was as fast as usual

Slide 11

Slide 11 text

Any other clues?
● Found a significant spike of "node_netstat_TcpExt_SyncookiesSent" on the broker
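For reference, this counter comes from the `TcpExt` section of `/proc/net/netstat`, the same source node_exporter scrapes; a small sketch to read it directly:

```python
# Minimal sketch: read TcpExt counters (incl. SyncookiesSent) straight
# from /proc/net/netstat.
def read_tcpext():
    with open("/proc/net/netstat") as f:
        lines = f.readlines()
    # The file comes in (header line, value line) pairs per section.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            keys = header.split()[1:]
            vals = [int(v) for v in values.split()[1:]]
            return dict(zip(keys, vals))
    return {}

counters = read_tcpext()
print(counters.get("SyncookiesSent"),
      counters.get("SyncookiesRecv"),
      counters.get("SyncookiesFailed"))
```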

Slide 12

Slide 12 text

TCP SYN cookies
● A mechanism to deal with SYN flood attacks
  ● Attackers try to open massive numbers of TCP connections to a server from many (likely spoofed) source IPs
  ● The server's SYN queue fills up, then "innocent" clients may be refused => Denial of service
● SYN cookies are a "stateless" way to handle SYN packets
  ● In case of a flood, the server embeds the necessary info into the SYN+ACK packet by carefully choosing the initial sequence number, then replies and forgets (without storing anything in the SYN queue)
  ● When an "innocent" client replies with ACK, the server decodes the embedded info from it (toy sketch below)
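A toy sketch of the "reply and forget" idea, purely for illustration; the real kernel algorithm (net/ipv4/syncookies.c) uses a keyed hash over addresses, ports, and a coarse timestamp, and the helper names and secret below are made up:

```python
# Toy illustration of stateless SYN cookies -- NOT the kernel algorithm.
import hashlib, time

SECRET = "server-local-secret"
MSS_TABLE = [536, 1300, 1440, 1460]     # index gets encoded into the cookie

def make_cookie(saddr, daddr, sport, dport, client_isn, mss_index):
    # Coarse time so old cookies stop validating after a while.
    t = int(time.time()) >> 6
    data = f"{SECRET}|{saddr}:{sport}|{daddr}:{dport}|{client_isn}|{t}"
    hashed = int.from_bytes(hashlib.sha256(data.encode()).digest()[:4], "big")
    return ((hashed & ~0x3) | mss_index) & 0xFFFFFFFF   # 32-bit "ISN"

def check_cookie(cookie, saddr, daddr, sport, dport, client_isn):
    # The client's final ACK echoes the cookie back, so the server can
    # validate it and recover the MSS -- without keeping any state.
    for mss_index, mss in enumerate(MSS_TABLE):
        if make_cookie(saddr, daddr, sport, dport, client_isn, mss_index) == cookie:
            return mss
    return None   # bogus ACK: no matching cookie

c = make_cookie("10.0.0.1", "10.0.0.2", 51234, 9092, 1000, mss_index=3)
print(check_cookie(c, "10.0.0.1", "10.0.0.2", 51234, 9092, 1000))   # 1460
```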

Slide 13

Slide 13 text

Kafka's PLE caused the "flood"
● Upon restarting a broker process, we execute preferred leader election (PLE) after it becomes in-sync
● Many clients try to open TCP connections in a short period
● => The Linux kernel detected a "flood" 😇

Slide 14

Slide 14 text

Drawback of SYN cookies
● It's known that some TCP options (especially the window scaling factor) can be dropped if TCP timestamps are disabled, which causes throughput degradation
  ● Since the TCP sequence number is a 32-bit field, it's not enough to fully store the encoded TCP options
  ● But the Linux kernel can additionally use the TCP timestamp field to store the full info if timestamps are enabled (rough sketch below)
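Roughly how the otherwise-dropped options can be packed into the low bits of the TSval timestamp; this is a simplified sketch of the idea in net/ipv4/syncookies.c, and the constant names here are illustrative:

```python
# Simplified: pack wscale / SACK / ECN into the low 6 bits of TSval.
TS_OPT_WSCALE_MASK = 0x0F      # 4 bits: window scaling factor
TS_OPT_SACK        = 1 << 4    # SACK permitted
TS_OPT_ECN         = 1 << 5    # ECN ok

def encode_ts(now_ms, wscale, sack_ok, ecn_ok):
    ts = now_ms & ~0x3F                      # keep the timestamp in the high bits
    ts |= wscale & TS_OPT_WSCALE_MASK
    if sack_ok:
        ts |= TS_OPT_SACK
    if ecn_ok:
        ts |= TS_OPT_ECN
    return ts & 0xFFFFFFFF

def decode_ts(tsecr):
    # When the client echoes TSval back in TSecr, the server recovers the
    # options without having kept any per-connection state.
    return {"wscale": tsecr & TS_OPT_WSCALE_MASK,
            "sack_ok": bool(tsecr & TS_OPT_SACK),
            "ecn_ok": bool(tsecr & TS_OPT_ECN)}

print(decode_ts(encode_ts(123456789, wscale=7, sack_ok=True, ecn_ok=False)))
```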

Slide 15

Slide 15 text

Sounds suspicious? But…
● TCP timestamps are enabled in our environment by default
  ● So SYN cookies should not drop any TCP options
● Also, even if the window-scaling option is dropped, it should not cause such significant throughput degradation
  ● Window scaling is a way to extend the TCP window size beyond 2^16 (uint16 max) by multiplying the window size by 2^scaling_factor
  ● Effective on high-latency connections (like the internet) by reducing ACK round-trips
  ● But 2^16 should be sufficient in our environment (i.e. network latency is < 1 ms), as the arithmetic below shows
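Back-of-the-envelope arithmetic for that last point (the RTT values are illustrative): a single connection's throughput is roughly bounded by window / RTT, so an unscaled 64 KiB window is plenty at sub-millisecond latency.

```python
# Max single-connection throughput is roughly receive_window / RTT.
def max_throughput_mb_per_sec(window_bytes, rtt_seconds):
    return window_bytes / rtt_seconds / 1e6

# Unscaled 64 KiB window on an in-DC link with ~1 ms RTT:
print(max_throughput_mb_per_sec(2**16, 0.001))     # ~65 MB/s per connection

# The same window on a ~100 ms internet path -- here scaling really matters:
print(max_throughput_mb_per_sec(2**16, 0.1))       # ~0.65 MB/s
print(max_throughput_mb_per_sec(2**16 << 7, 0.1))  # ~84 MB/s with wscale=7
```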

Slide 16

Slide 16 text

Let's experiment on the dev env
● Though we hardly believed SYN cookies could be the cause at that moment, digging into it seemed worthwhile at least
● Experiment: force SYN cookies to be used for every incoming TCP connection
  ● Can be done by setting `net.ipv4.tcp_syncookies = 2` (sketch below)
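A minimal sketch of the experiment setup, assuming root on a disposable dev box (`2` means "always use SYN cookies", `1` is the default "only on SYN queue overflow", `0` is off):

```python
# Force SYN cookies for every incoming connection, then restore the default.
SYSCTL = "/proc/sys/net/ipv4/tcp_syncookies"

def set_syncookies(mode: int):
    # 0 = disabled, 1 = only on SYN queue overflow (default), 2 = always
    with open(SYSCTL, "w") as f:
        f.write(str(mode))

set_syncookies(2)   # reproduce: every connection goes through SYN cookies
# ... run the producer workload, capture tcpdump ...
set_syncookies(1)   # back to the default
```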

Slide 17

Slide 17 text

Yay, reproduced!
● As the image below shows, we observed nearly 500 ms request latency on the producer side
● But we found no such high latency on the broker side

Slide 18

Slide 18 text

Observation: waiting for ACKs every time
● We succeeded in taking a tcpdump this time
● Surprisingly, the tcpdump shows:
  ● The producer is waiting for an ACK from the broker for every packet
  ● Also, every ACK from the broker is delayed by 40 ms
    ● Likely because the producer is generating too-small packets (TCP delayed ACK)
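Rough arithmetic on why this hurts so much (the request and window sizes below are hypothetical, chosen only to show the order of magnitude): each stall costs up to the ~40 ms delayed-ACK timer, so even a dozen stalls per request adds roughly half a second.

```python
# If the producer's view of the window is tiny, it stalls after every small
# chunk, and each stall costs up to the delayed-ACK timer (~40 ms here).
DELAYED_ACK_S = 0.040

def send_time(request_bytes, usable_window_bytes, delayed_ack=DELAYED_ACK_S):
    stalls = request_bytes // usable_window_bytes
    return stalls * delayed_ack

# e.g. a ~100 KB produce request with only ~8 KB of usable window:
print(send_time(100_000, 8_000))   # ~0.48 s spent just waiting for delayed ACKs
```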

Slide 19

Slide 19 text

Observation: window-scaling factor discrepancy
● Also, we can see that the window-scaling factor advertised on the SYN+ACK from the broker is 1
● While `ss` shows the broker's wscale as 7 !!
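For reference, a small sketch of how to check the kernel's view of the negotiated wscale with `ss` (port 9092 is assumed as the broker's listener here):

```python
# Parse `ss -ti` output for the broker's connections and print the wscale
# pair it reports ("wscale:<snd_wscale>,<rcv_wscale>").
import re
import subprocess

out = subprocess.run(["ss", "-tin", "sport", "=", ":9092"],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    m = re.search(r"wscale:(\d+),(\d+)", line)
    if m:
        print("snd_wscale:", m.group(1), "rcv_wscale:", m.group(2))
```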

Slide 20

Slide 20 text

Which scaling factor is true?
● Wrote a bcc (BPF Compiler Collection) script to dump the "actual" window scaling factor that the broker uses, by attaching a kprobe to the `__tcp_select_window` function
  ● https://github.com/ocadaruma/mybcc/blob/main/tools/tcp_wnd.py
● As a result, we found the broker actually uses `7` as the scaling factor to calculate the window size, while the scaling factor advertised to the producer is `1`
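A stripped-down sketch of the approach (the full tool is the linked tcp_wnd.py): this version only logs which connection `__tcp_select_window` is called for; extracting the scaling factor itself needs the extra struct-walking done in the linked script. Requires root and kernel headers.

```python
# Attach a kprobe to __tcp_select_window via bcc and trace its calls.
from bcc import BPF

prog = r"""
#include <net/sock.h>
#include <bcc/proto.h>

int kprobe____tcp_select_window(struct pt_regs *ctx, struct sock *sk) {
    u16 sport = sk->__sk_common.skc_num;    // host byte order
    u16 dport = sk->__sk_common.skc_dport;  // network byte order
    bpf_trace_printk("tcp_select_window sport=%d dport=%d\n",
                     sport, ntohs(dport));
    return 0;
}
"""

b = BPF(text=prog)
# The kprobe__ prefix makes bcc auto-attach to __tcp_select_window.
b.trace_print()
```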

Slide 21

Slide 21 text

How can this discrepancy happen?
● A Linux kernel bug
  ● https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417
● The kernel calculated the initial window differently when a connection was established through SYN cookies
● It caused an inconsistency between the advertised scaling factor and the actual one after establishment

Slide 22

Slide 22 text

Okay, the mystery is solved now:
● 1. Preferred leader election upon broker restart causes many clients to open connections in a short period, so the SYN queue fills up
● => 2. Meanwhile, some producer connections fall back to being established through SYN cookies
● => 3. Due to the kernel bug, the window scaling factor became inconsistent between broker and producer. Broker: `advertised_window = window >> 7`, Producer: `window = advertised_window << 1`
● => 4. The producer waits for ACKs from the broker too frequently due to the small window size. At the same time, the broker delays ACKs because not enough data is received.
● => 5. The producer starts to take an insanely long time to complete sending out requests.
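Putting numbers on step 3 (the window size is just an example value): the broker divides its real window by 2^7 before advertising it, but the producer only multiplies by 2^1 when reading it back, so the producer's usable window ends up 64x smaller than intended.

```python
# The mismatch in numbers: broker advertises window >> 7, producer
# reconstructs advertised << 1.
real_window = 512 * 1024                  # e.g. ~512 KiB of receive space

advertised = real_window >> 7             # what the broker puts on the wire
producer_view = advertised << 1           # what the producer reconstructs

print(advertised)                         # 4096
print(producer_view)                      # 8192 bytes usable, not ~512 KiB
print(real_window // producer_view)       # 64x smaller than intended
```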

Slide 23

Slide 23 text

Solution
● Disable TCP SYN cookies
  ● Even if SYNs are dropped when the SYN queue is full, clients will reconnect soon (SYN retries)
  ● Drawback: extra delay will be added until a successful connection
● (Additionally) Increase the listen socket's backlog size
  ● As of Kafka 2.8.1, the backlog size is a hardcoded value (50), so we can't configure it
  ● Reported and under discussion in KIP-764: Configurable backlog size for creating Acceptor
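A minimal sketch of the first mitigation on the broker host (assumes root); note that raising `net.core.somaxconn` alone does not help as long as Kafka's hardcoded backlog of 50 is the smaller of the two values, since the kernel caps the effective backlog at the minimum of both.

```python
# Disable SYN cookies: overflowing SYNs are simply dropped and the clients
# retransmit them (SYN retries) instead of falling back to cookies.
with open("/proc/sys/net/ipv4/tcp_syncookies", "w") as f:
    f.write("0")
```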

Slide 24

Slide 24 text

Conclusion
● Depending on the Linux kernel version, TCP SYN cookies may cause significant network throughput degradation due to an inconsistent window scaling factor
● In a large scale Kafka platform that handles many connections, we may face issues in the kernel's network stack
● tcpdump, ss, and bcc (BPF) greatly helped us investigate network problems

Slide 25

Slide 25 text

THANK YOU