@ocadaruma) • Technical lead of the IMF team at LY Corporation • The team is responsible for providing the company-wide Kafka platform • Interests • Distributed Systems, Formal methods, …
Very fortunately, all our brokers have continuous profiling with async-profiler enabled! • Introduced as a countermeasure to a past performance issue… • Refs: “Time travel stack trace analysis with async-profiler” @JJUG CCC 2019 Fall
to query a partition’s offsets: • Earliest offset • Latest offset • Offset for a specific timestamp • Usually considered harmless • Doesn’t touch the actual data • Only requires DESCRIBE permission for the topic
is called with “maxTimestamp” specified, it internally queries the “largest timestamp” of ALL log segments to identify the log segment that contains the target offset
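The scan described above can be modelled in a few lines. This is only a sketch, not Kafka's code: `Segment`, `read_largest_timestamp`, and `max_timestamp_offset` are hypothetical names, and `index_loaded` merely stands in for "the time index of this segment had to be opened". Because record timestamps are not guaranteed to increase with offsets, the segment holding the largest timestamp is not necessarily the newest one, so every segment is consulted.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical in-memory stand-in for one log segment."""
    base_offset: int
    largest_timestamp: int            # value the segment's time index would report
    offset_of_largest_timestamp: int
    index_loaded: bool = False        # models a time index that has been opened

    def read_largest_timestamp(self) -> int:
        # Reading the value forces the (modelled) time index to be opened.
        self.index_loaded = True
        return self.largest_timestamp


def max_timestamp_offset(segments):
    """Return (timestamp, offset) of the record with the largest timestamp."""
    # max() calls the key function once per segment, so all of them are touched.
    best = max(segments, key=lambda s: s.read_largest_timestamp())
    return best.largest_timestamp, best.offset_of_largest_timestamp


segments = [
    Segment(base_offset=0, largest_timestamp=1_000, offset_of_largest_timestamp=40),
    Segment(base_offset=100, largest_timestamp=5_000, offset_of_largest_timestamp=150),
    Segment(base_offset=200, largest_timestamp=3_000, offset_of_largest_timestamp=210),
]
result = max_timestamp_offset(segments)
loaded = sum(s.index_loaded for s in segments)
print(result, loaded)
```

In this model the answer lives in the middle segment, yet all three segments end up with their index "opened", which is the cost that grows with the number of segments in the partition.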
In Kafka, time indexes are implemented as memory-mapped files • Since creating an mmap has a certain overhead and we usually don’t need to open it for inactive segments, they are “lazily” loaded on demand • Refs: KIP-263: Allow broker to skip sanity check of inactive segments on broker startup
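A minimal sketch of the lazy-loading idea, written in Python rather than taken from Kafka: `LazyTimeIndex` is a hypothetical class, and the 12-byte entry layout (8-byte timestamp followed by a 4-byte relative offset, big-endian) is an assumption made for this illustration. The point is only that constructing the object is cheap and the mmap is created on first access.

```python
import mmap
import os
import tempfile


class LazyTimeIndex:
    """Hypothetical time index that maps its file only on first use."""

    ENTRY_SIZE = 12  # assumed layout: 8-byte timestamp + 4-byte relative offset

    def __init__(self, path):
        self.path = path
        self._file = None
        self._map = None  # nothing is opened or mapped at construction time

    @property
    def is_loaded(self):
        return self._map is not None

    def _ensure_loaded(self):
        # The mmap (and its overhead) is paid here, on the first real access.
        if self._map is None:
            self._file = open(self.path, "rb")
            self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        return self._map

    def last_entry(self):
        """Return (timestamp, relative_offset) of the last index entry."""
        m = self._ensure_loaded()
        start = len(m) - self.ENTRY_SIZE
        timestamp = int.from_bytes(m[start:start + 8], "big")
        relative_offset = int.from_bytes(m[start + 8:start + 12], "big")
        return timestamp, relative_offset

    def close(self):
        if self._map is not None:
            self._map.close()
            self._file.close()
            self._map = None
            self._file = None


# Build a tiny two-entry index file to exercise the class.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.timeindex")
with open(path, "wb") as f:
    f.write((1_000).to_bytes(8, "big") + (10).to_bytes(4, "big"))
    f.write((2_000).to_bytes(8, "big") + (25).to_bytes(4, "big"))

index = LazyTimeIndex(path)
loaded_before = index.is_loaded
entry = index.last_entry()
loaded_after = index.is_loaded
index.close()
print(loaded_before, entry, loaded_after)
```

The trade-off is the one the slide names: startup stays fast because inactive segments are never mapped, but the first request that needs many indexes pays for all of those mappings at once.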
A large partition has many log segments • To query the offset corresponding to a timestamp, we may need to open many log segment files • Since the broker had just been restarted, none of them were open yet
ListOffsets request content into logs and found that a single client was sending ListOffsets requests with a “max-timestamp” query against many partitions, covering over 50K segments in total
the client and asked them to stop it • Permanent Solution: • Not trivial to address fundamentally • Prevent by Authorization => ListOffsets only requires Topic-Describe permission, which we can’t deny • ListOffsets optimization => We would have to store additional on-disk data only to record the “largest timestamp”, which is not optimal • Our current strategy (implementation underway): • Detect potentially risky ListOffsets calls early and contact the client manually
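One way the detection step could look is sketched below. It is not the team's implementation: the function names, the shape of the request tuples, the per-partition segment counts, and the threshold of 10,000 segments are all hypothetical. The sentinel `-3` is assumed here to be the timestamp value that marks a max-timestamp query in a ListOffsets request.

```python
MAX_TIMESTAMP = -3  # assumed sentinel for a "max-timestamp" ListOffsets query


def segments_touched(request, segment_counts):
    """Estimate how many segments a request would force the broker to scan.

    request: list of (topic, partition, timestamp) tuples (hypothetical shape)
    segment_counts: {(topic, partition): number of log segments}
    """
    return sum(
        segment_counts.get((topic, partition), 0)
        for topic, partition, timestamp in request
        if timestamp == MAX_TIMESTAMP
    )


def is_risky(request, segment_counts, threshold=10_000):
    # The threshold is an arbitrary illustration value, not a recommendation.
    return segments_touched(request, segment_counts) > threshold


segment_counts = {
    ("events", 0): 30_000,
    ("events", 1): 25_000,
    ("small", 0): 12,
}

# Max-timestamp queries against two large partitions.
request_a = [("events", 0, MAX_TIMESTAMP), ("events", 1, MAX_TIMESTAMP)]
# A latest-offset query (-1) plus a max-timestamp query on a small partition.
request_b = [("events", 0, -1), ("small", 0, MAX_TIMESTAMP)]

print(segments_touched(request_a, segment_counts), is_risky(request_a, segment_counts))
print(segments_touched(request_b, segment_counts), is_risky(request_b, segment_counts))
```

A check like this only flags requests for follow-up; it does not block them, which matches the strategy of contacting the client manually.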
but it can actually take down a broker easily, depending on the usage • Continuous profiling is very useful for troubleshooting • Without it, we would have suffered from this issue for much longer