
Mysterious producer slowdown in a large scale Kafka platform

Speaker: Haruki Okada
LINE Messaging Platform Development Department IMF Part Team

※This material was presented at the following event.


LINE Developers

September 24, 2021


  1. Mysterious producer slowdown in a large scale Kafka platform
     Haruki Okada
  2. Speaker
     • Haruki Okada
     • Senior software engineer at LINE Corporation
     • Developing the company-wide Apache Kafka platform
  3. Apache Kafka at LINE
     • Multi-tenancy: many services (LINE Ads, LINE NEWS, LINE Official Account, …) connect to our shared Kafka cluster
     • Handles a huge production rate: 15M messages / sec, 7.5GB / sec
  4. Phenomenon
     One day…
     org.apache.kafka.common.errors.TimeoutException: Expiring 59 record(s) for xxx-topic-46:600000 ms has passed since batch creation
         at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:98)
         at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:67)
         at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:30)
         at ...
     => Producer lost some messages 😨
  5. Possible causes of batch expiration
     • Case 1: Broker returns error responses for produce requests continuously
       • e.g. NotEnoughReplicas, NotLeaderForPartition, …
       • => No. There's no client log which indicates an error response
     • Case 2: Produce request latency is too high relative to the production rate
  6. Indeed, request latency degraded
     • The producer's Kafka client metrics (via JMX) showed insane request latency
     • But no such high latency was recorded on the broker
     • => HOW?
     • Another observation:
       • Such high (client-side) produce latency happened only against one specific broker
       • The producer was able to produce to other brokers normally
  7. Anyway, we had to handle it ASAP
     • Message loss was ongoing on the producer, so we had to act immediately
     • We were still not sure about the root cause, but the phenomenon started right after we restarted the Kafka process on the broker
     • => We decided to restart the broker again
     • => Huh, it worked like a charm… (and we lost the environment for investigation)
  8. Current observations
     • The producer started to experience slow request latency right after we restarted a broker
     • Meanwhile, no such slow latency was observed on the broker side
     • Possible causes:
       • Case 1: The producer slowed down entirely (e.g. Sender thread blocking)
         • => No, because the producer could produce to other brokers
       • Case 2: Network performance degraded
  9. How request latency is measured
     • "Time to complete sending the request" is included only in the client-side latency
     • => A broker/producer latency discrepancy can happen if sending out the request is delayed
  10. Was the network slow?
     • No. It was as fast as usual

  11. Any other clues?
     • Found a significant "node_netstat_TcpExt_SyncookiesSent" spike on the broker

  12. TCP SYN cookies
     • A mechanism to deal with SYN flood attacks
       • Attackers try to open massive numbers of TCP connections to a server from many (likely spoofed) source IPs
       • The server's SYN queue fills up, then "innocent" clients may be refused => denial of service
     • SYN cookies are a "stateless" way to handle SYN packets
       • In case of a flood, the server embeds the necessary info into the SYN+ACK packet by carefully choosing the initial sequence number, then replies and forgets (without storing an entry in the SYN queue)
       • When an "innocent" client replies with an ACK, the server decodes the embedded info from it
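The stateless trick can be sketched with a toy model. This is purely an illustration under assumed details: the real kernel encodes an MSS table index, a time counter, and a per-boot secret into the ISN using its own hash construction, not the one below.

```python
import hashlib
import time

SECRET = b"per-boot-secret"          # the kernel keeps a secret like this per boot
MSS_TABLE = [536, 1300, 1440, 1460]  # candidate MSS values; the index fits in 2 bits

def make_cookie(src, dst, client_isn, mss_index, minute):
    """Fold connection info into a 32-bit 'initial sequence number'."""
    digest = hashlib.sha256(
        SECRET + f"{src}|{dst}|{client_isn}|{minute}".encode()
    ).digest()
    base = int.from_bytes(digest[:4], "big") & ~0x3  # reserve low 2 bits for the index
    return (base | mss_index) & 0xFFFFFFFF

def check_cookie(src, dst, client_isn, cookie, minute):
    """On the final ACK, recover the MSS without any stored state."""
    for idx in range(len(MSS_TABLE)):
        if make_cookie(src, dst, client_isn, idx, minute) == cookie:
            return MSS_TABLE[idx]
    return None  # stale or forged cookie

minute = int(time.time()) // 60
cookie = make_cookie("10.0.0.1", "10.0.0.2", 12345, mss_index=2, minute=minute)
print(check_cookie("10.0.0.1", "10.0.0.2", 12345, cookie, minute))  # -> 1440
```

The key property this models: the server stores nothing between SYN and ACK, so a flood cannot exhaust its memory, but only whatever fits into the 32-bit ISN survives the round trip.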
  13. Kafka's PLE caused the "flood"
     • Upon restarting the broker process, we execute preferred leader election (PLE) after the broker becomes in-sync
     • Many clients then try to open TCP connections in a short period
     • => The Linux kernel detected a "flood" 😇
  14. Drawback of SYN cookies
     • It's known that some TCP options (especially the window scaling factor) can be dropped if TCP timestamps are disabled, which causes throughput degradation
       • Since the TCP sequence number is a 32-bit field, it's not large enough to fully store the encoded TCP options
       • But the Linux kernel can additionally use the TCP timestamp field to store the full info if timestamps are enabled
  15. Sounds suspicious? But…
     • TCP timestamps are enabled in our environment by default
       • So SYN cookies should not drop any TCP options
     • Also, even if the window scaling option were dropped, it should not cause such significant throughput degradation
       • Window scaling extends the TCP window size beyond 2^16 (uint16 max) by multiplying the window size by 2^scaling_factor
       • It is effective on large-latency connections (like the internet) by reducing acks
       • But 2^16 should be sufficient in our environment (network latency is < 1 ms)
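The arithmetic behind "2^16 should be sufficient" can be made concrete; the 1 ms round trip below is our rough intra-datacenter assumption:

```python
# One TCP connection's throughput is bounded by window_size / RTT.
MAX_UNSCALED_WINDOW = 2**16 - 1   # 65535 bytes: the largest raw uint16 window
rtt = 0.001                       # assumed ~1 ms intra-datacenter round trip

# Without window scaling, a single connection still sustains ~65 MB/s:
print(MAX_UNSCALED_WINDOW / rtt / 1e6)         # ~65.5 (MB/s)

# With scaling factor 7, the ceiling grows by 2**7, to ~8.4 GB/s:
print((MAX_UNSCALED_WINDOW << 7) / rtt / 1e6)  # ~8388 (MB/s)
```

So at sub-millisecond latency, even an unscaled window should comfortably carry a single producer's traffic, which is why the dropped-option theory alone looked insufficient.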
  16. Let's experiment on the dev env
     • Though we could hardly believe SYN cookies were the cause at that moment, digging into it seemed worthwhile at least
     • Experiment: force SYN cookies to be used for every incoming TCP connection
       • Can be done by setting `net.ipv4.tcp_syncookies = 2`
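The experiment toggle, assuming a typical Linux sysctl setup (needs root):

```shell
# Force SYN cookies for EVERY incoming connection (dev-env experiment only).
# Meanings: 0 = never send, 1 = send only when the SYN queue overflows, 2 = always send.
sysctl -w net.ipv4.tcp_syncookies=2

# Revert to the usual default after the experiment:
sysctl -w net.ipv4.tcp_syncookies=1
```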
  17. Yay, reproduced!
     • As the image below shows, we observed nearly 500 ms request latency on the producer side
     • But we found no such high latency on the broker side
  18. Observation: waiting for acks every time
     • We succeeded in taking a tcpdump this time
     • Surprisingly, the tcpdump shows:
       • The producer is waiting for an ack from the broker for every packet
       • Also, every ack from the broker is delayed by 40 ms
         • Likely because the producer is generating too-small packets (TCP delayed ack)
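A back-of-the-envelope check that this behavior is enough to explain ~500 ms per request. The request size and effective window below are illustrative assumptions; the 40 ms delayed-ack wait is what the capture showed:

```python
DELAYED_ACK = 0.040           # delayed-ack wait observed per packet (seconds)
request_bytes = 200_000       # assumed size of one produce request
effective_window = 16 * 1024  # assumed (shrunken) bytes in flight per round trip

# The producer stalls for one delayed ack per window's worth of data:
round_trips = -(-request_bytes // effective_window)  # ceil division -> 13
print(round_trips * DELAYED_ACK)                     # ~0.52 s, matching ~500 ms
```

With a healthy multi-megabyte window the whole request would fit in one round trip, so the stall disappears entirely.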
  19. Observation: window scaling factor discrepancy
     • We can also see that the advertised window scaling factor in the SYN+ACK from the broker is 1
     • While `ss` shows the broker's wscale as 7 !!
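The `ss` check looks roughly like this (the broker address is hypothetical); the `wscale:x,y` field in the output is the negotiated send,receive scale pair:

```shell
# -t: TCP sockets, -i: show internal TCP info (rto, cwnd, wscale, ...).
# Look for 'wscale:7,7' -- or a mismatching pair -- in the output.
ss -ti dst 10.0.0.2
```

The advertised factor, by contrast, is only visible in the handshake itself, which is why the tcpdump of the SYN/SYN+ACK exchange was needed to spot the discrepancy.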
  20. Which scaling factor is true?
     • Wrote a bcc (BPF Compiler Collection) script to dump the "actual" window scaling factor that the broker uses, by attaching a kprobe to the `__tcp_select_window` function
       • https://github.com/ocadaruma/mybcc/blob/main/tools/tcp_wnd.py
     • As a result, we found the broker actually uses `7` as the scaling factor to calculate the window size, while the scaling factor advertised to the producer is `1`
  21. How can this discrepancy happen?
     • A Linux kernel bug
       • https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417
     • The kernel calculated the initial window differently when a connection was established through SYN cookies
     • This caused an inconsistency between the advertised scaling factor and the actual one after establishment
  22. Okay, the mystery is solved now:
     • 1. Preferred leader election upon broker restart causes many clients to open connections in a short period, filling up the SYN queue
     • => 2. Meanwhile, some producer connections fall back to being established through SYN cookies
     • => 3. Due to the kernel bug, the window scaling factor becomes inconsistent between broker and producer. Broker: `advertised_window = window >> 7`, Producer: `window = advertised_window << 1`
     • => 4. The producer waits for acks from the broker too frequently due to the small window size. At the same time, the broker delays acks because not enough data is received
     • => 5. The producer starts to take an insanely long time to complete sending out requests
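The net effect of the mismatch in step 3, worked through with an example window of 64 KiB:

```python
real_window = 65536              # bytes the broker actually has available (example)

advertised = real_window >> 7    # broker applies its real scale:       65536 >> 7 = 512
producer_view = advertised << 1  # producer applies the advertised one: 512 << 1 = 1024

print(real_window // producer_view)  # -> 64: producer sees a window 64x too small
```

A 64x-too-small window is exactly the "tiny packets, wait for every ack" pattern the tcpdump showed.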
  23. Solution
     • Disable TCP SYN cookies
       • Even if SYNs are dropped when the SYN queue is full, clients will reconnect soon (SYN retries)
       • Drawback: extra delay is added until the connection succeeds
     • (Additionally) Increase the listen socket's backlog size
       • As of Kafka 2.8.1, the backlog size is a hardcoded value (50), so we can't configure it
       • Reported and under discussion in KIP-764: Configurable backlog size for creating Acceptor
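The fix as a sysctl fragment (the drop-in file name below is hypothetical; needs root):

```shell
# Stop issuing SYN cookies entirely; overflowed SYNs are simply dropped and
# clients retransmit (SYN retries), at the cost of some connection-setup delay.
sysctl -w net.ipv4.tcp_syncookies=0

# Persist across reboots (hypothetical drop-in file name):
echo 'net.ipv4.tcp_syncookies = 0' > /etc/sysctl.d/99-disable-syncookies.conf
```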
  24. Conclusion
     • Depending on the Linux kernel version, TCP SYN cookies may cause significant network throughput degradation due to an inconsistent window scaling factor
     • In a large-scale Kafka platform that handles many connections, we may run into issues in the kernel's network stack
     • tcpdump, ss, and bcc (BPF) greatly helped us investigate the network problem