Mysterious producer slowdown in a large scale Kafka platform

Mysterious producer slowdown in a large scale Kafka platform Haruki
Okada

• Haruki Okada • Senior software engineer at LINE Corporation
• Developing company-wide Apache Kafka platform Speaker

• Multi-tenancy • Many services connect our  shared Kafka cluster
• Handle huge production rate • 15M messages / sec • 7.5GB / sec Apache Kafka at LINE Kafka  cluster LINE Ads LINE NEWS LINE Official  Account . . . . .

Phenomenon org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic—46:600000 ms has passed
since batch creation at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java: 98) at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:67) at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:30) at ... => Producer lost some messages 😨 One day…

• Case 1: Broker returns error response for produce requests
continuously • e.g. NotEnoughReplicas, NotLeaderForPartition,… • => No. There’s no client log which indicates error response • Case 2: Produce request latency is too high against production rate Possible cause of batch expiration

Indeed, request latency degraded • Producer’s Kafka clients metrics (via
JMX) showed insane request latency • But there’s no such high latency recorded on the broker • => HOW ? • Another observation: • such high (client-side) produce latency happened only against specific broker • Producer was able to produce to other brokers

Anyways, should handle ASAP • Message-loss was ongoing on producer,
so we had to handle it ASAP • We were still not sure about the root cause, but the phenomenon started right after  we restarted Kafka process on the broker • => We decided to restart the broker again. • => Hun, worked like a charm…  (and lost environment for investigation)

• Producer started to experience slow request latency right after
we restarted a broker • While there’s no such slow latency observed on broker side • Possible cause: • Case 1: Producer slowed down entirely (e.g. Sender thread blocking) • => No, because the producer could produce to other brokers • Case 2: Network performance degraded Current observation

How request-latency is measured • As we can see, ”time
to complete sending request” only included in client-side’s latency • => Broker/Producer latency discrepancy could happen if sending-out request delays

Was network slow? • No. Was fast enough as usual

Any other clues? • Found significant “node_netstat_TcpExt_SyncookiesSent” spike  on the
broker

TCP SYN cookies • A mechanism to deal with SYN
flood attack • Attackers try to open massive TCP connections to a server from many (likely spoofed) source IPs • Server’s SYN queue will be filled up, then “innocent” clients may be refused  => Denial of service • SYN cookies is a “stateless” way to handle SYN packets • In case of flood, server embeds necessary info into SYN+ACK packet by carefully  choosing initial-sequence-number, then reply and forget (without storing in SYN queue) • When “innocent” clients replied “ACK”, server decodes embedded info from it

Kafka’s PLE caused “flood” • Upon restarting broker process, we
execute preferred leader election (PLE) after it became in-sync • Many clients try to open TCP connection in a short period • => Linux kernel detected “flood” 😇

Drawback of SYN cookies • It’s known that some TCP
options (especially window scaling-factor) can be dropped  if TCP timestamp is disabled, and causes throughput degradation • Since TCP sequence-number is a 32bit field, it’s not enough to fully store  encoded TCP options • But linux kernel can use TCP timestamp field additionally to store full info if it’s enabled

Sounds suspicious? but… • TCP timestamp is enabled on our
environment by default • SYN cookies should not drop any TCP options • Also, even if window-scaling option is dropped, it should not cause such significant  throughput degradation • Window-scaling is a way to extend TCP window size beyond 2^16 (uint16-max)  by multiplying window size by 2^scaling_factor • Effective on large-latency connection (like internet) by reducing acks • But 2^16 should be sufficient on our env (i.e. network latency is < 1ms)

Let’s experiment on dev-env • Though we hardly believe SYN
cookies can be the cause at that moment,  digging into it sounds somewhat worthwhile at least • Experiment: Force SYN cookies to be used for every incoming TCP connections • Can be done by setting `net.ipv4.tcp_syncookies = 2`

Yay, reproduced ! • As below image shows, observed nearly
500ms request latency on producer side • But we found no such high latency on broker side

Observation:  Waiting acks every time • We succeeded to take
tcpdump this time • Surprisingly, tcpdump shows: • producer is waiting ack from broker for every packets • Also, every acks from broker delayed for 40ms • Likely because producer is generating too small packets (TCP delayed ack)

Observation:  Window-scaling factor discrepancy • Also, we can see that
advertised window-scaling factor on SYN+ACK from broker is 1 • While `ss` shows broker’s wscale as 7 !!

Which scaling factor is true? • Wrote a bcc (BPF
Compiler Collection) script to dump “actual” window scaling factor  that broker uses by attaching kprobe for `__tcp_select_window` function • https://github.com/ocadaruma/mybcc/blob/main/tools/tcp_wnd.py • As the result, we found broker actually uses `7` for scaling-factor to calculate  window size, while advertised scaling-factor (to producer) is `1`

How can this discrepancy happen? • Linux kernel’s bug •
https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417 • Kernel calculated initial window differently when a connection is established through  SYN cookies • It caused inconsistency in advertised scaling-factor and actual one after establishment

Okay, mystery is solved now: • 1. Preferred leader election
upon broker-restart causes many clients to open connection in a  short period, then SYN queue filled up • => 2. Meanwhile, some producer connections fall-back to be established through SYN cookies • => 3. Due to kernel bug, window scaling-factor became inconsistent between broker/producer.  Broker: `advertised_window = window >> 7`, Producer: `window = advertised_window << 1` • => 4. Producer waits ack from broker too frequently due to small window size.  At the same time, broker delayed acks because sufficient data isn’t received. • => 5. Producer starts to take insanely long time to complete sending out request.

Solution • Disable TCP SYN cookies • Even if SYN’s
are dropped when SYN queue full, clients will reconnect soon (SYN retries) • Drawback: • Extra delay will be added until successful connection • (Additionally) Increase listen socket’s backlog size • As of Kafka 2.8.1, backlog size is hardcoded value (50) so we can’t configure it • Reported and discussing in KIP-764: Configurable backlog size for creating Acceptor

Conclusion • Depending on Linux kernel version, TCP SYN cookies
may cause significant network throughput  degradation due to inconsistent window scaling factor • In a large scale Kafka platform that handles many connections,  we might face kernel’s network stack issue • tcpdump, ss, bcc (BPF) greatly help us to investigate network’s problem

THANK YOU

Mysterious producer slowdown in a large scale K...

Mysterious producer slowdown in a large scale Kafka platform

LINE Developers

More Decks by LINE Developers

Other Decks in Technology

Featured

Transcript

Mysterious producer slowdown in a large scale Kafka platform Haruki

• Haruki Okada • Senior software engineer at LINE Corporation

• Multi-tenancy • Many services connect our  shared Kafka cluster

Phenomenon org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic—46:600000 ms has passed

• Case 1: Broker returns error response for produce requests

Indeed, request latency degraded • Producer’s Kafka clients metrics (via

Anyways, should handle ASAP • Message-loss was ongoing on producer,

• Producer started to experience slow request latency right after

How request-latency is measured • As we can see, ”time

Was network slow? • No. Was fast enough as usual

Any other clues? • Found significant “node_netstat_TcpExt_SyncookiesSent” spike  on the

TCP SYN cookies • A mechanism to deal with SYN

Kafka’s PLE caused “flood” • Upon restarting broker process, we

Drawback of SYN cookies • It’s known that some TCP

Sounds suspicious? but… • TCP timestamp is enabled on our

Let’s experiment on dev-env • Though we hardly believe SYN

Yay, reproduced ! • As below image shows, observed nearly

Observation:  Waiting acks every time • We succeeded to take

Observation:  Window-scaling factor discrepancy • Also, we can see that

Which scaling factor is true? • Wrote a bcc (BPF

How can this discrepancy happen? • Linux kernel’s bug •

Okay, mystery is solved now: • 1. Preferred leader election

Solution • Disable TCP SYN cookies • Even if SYN’s

Conclusion • Depending on Linux kernel version, TCP SYN cookies

THANK YOU