


Investigating Orphaned Memory Mappings for Kafka's Index and TimeIndex Files

This slide deck was presented at Apache Kafka Meetup Japan #16 on June 4, 2025.
In this session, we explore a file-mapping issue in Apache Kafka and explain how we identified and resolved it.


Transcript

  1. © LY Corporation
     Masahiro Mori (森 雅弘)
     Investigating Orphaned Memory Mappings for Kafka’s Index and TimeIndex Files
  2. Masahiro Mori (森 雅弘)
     Data Pipeline Engineering Division, LY Corporation
     • 2023: Yahoo! JAPAN (as a new grad)
     • 2023: Yahoo! JAPAN (DataDelivery Dept.), Kafka DevOps member
     • 2024~: LY Corporation (DPE), Kafka DevOps member
  3. Agenda
     01 Background: Kafka Index & TimeIndex Files
     02 Issue Overview: Disk-I/O Spike & Broker Crash
     03 Root Cause Analysis: Orphaned Memory Mappings
     04 Proposed Solutions
     05 Validation: Experiments & Performance Evaluation
     06 Conclusion & Next Steps
  4. Segment & Indexes (Background)
     • A Kafka broker stores each partition’s data in separate log segments; the segment size is configured by parameters such as log.segment.bytes.
     • Since consumers can start consuming from an arbitrary offset, the broker needs to quickly determine where within a segment to begin reading.
     • ⇒ Each segment therefore has an index file and a timeindex file:
       • The index file maps record offsets to byte positions in the log segment.
       • The timeindex file maps timestamps to record offsets.
     [Diagram: a broker with segments covering offsets 0-681 and 682-1390 plus the active segment 1391-1997; each segment consists of *.log, *.index, and *.timeindex files. The index maps offset 128 → byte 262144 and offset 448 → byte 693118; the timeindex maps timestamp 1747964243976 → offset 128 and 1747964244506 → offset 448.]
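Because the offset index is a sorted sequence of (offset, byte position) entries, a lookup is a largest-lower-bound binary search: find the greatest indexed offset at or below the target, then scan the log forward from its byte position. A minimal sketch of that lookup, using hypothetical class and method names rather than Kafka's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how an offset index resolves a record offset to a byte position.
// Entries are appended in offset order; lookup returns the largest entry with
// offset <= target, i.e. where a consumer can start scanning the segment.
public class OffsetIndexSketch {
    record Entry(long offset, int position) {}

    private final List<Entry> entries = new ArrayList<>();

    void append(long offset, int position) {
        entries.add(new Entry(offset, position)); // kept sorted by construction
    }

    // Largest-lower-bound binary search over the sorted entries.
    Entry lookup(long targetOffset) {
        int lo = 0, hi = entries.size() - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (entries.get(mid).offset() <= targetOffset) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // No entry at or below the target: start from the beginning of the segment.
        return best >= 0 ? entries.get(best) : new Entry(0, 0);
    }

    public static void main(String[] args) {
        OffsetIndexSketch idx = new OffsetIndexSketch();
        idx.append(128, 262144);  // values from the diagram above
        idx.append(448, 693118);
        // A fetch for offset 300 starts scanning at the byte position of offset 128.
        System.out.println(idx.lookup(300).position()); // prints 262144
    }
}
```

The real TimeIndex lookup works the same way, with (timestamp, offset) entries instead.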
  5. Index & TimeIndex Files (Background)
     • In an actual Kafka directory layout, the log files for each topic-partition are stored under the directories configured by log.dirs.

     /kafka-logs/data/sandbox-1
     ├── 00000000002258326678.index
     ├── 00000000002258326678.log
     ├── 00000000002258326678.timeindex
     ├── 00000000002259017698.index
     ├── 00000000002259017698.log
     ├── 00000000002259017698.timeindex
     ...

     > bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /kafka-logs/data/sandbox-1/00000000002258326678.index
     Dumping /kafka-logs/data/sandbox-1/00000000002258326678.index
     offset: 2258326854 position: 172692
     offset: 2258326946 position: 284213
     offset: 2258327054 position: 433854
     offset: 2258327174 position: 588997
     offset: 2258327242 position: 778156
     offset: 2258327338 position: 876347
     offset: 2258327465 position: 1018407
     ...

     > bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files /kafka-logs/data/sandbox-1/00000000002258326678.timeindex
     Dumping /kafka-logs/data/sandbox-1/00000000002258326678.timeindex
     timestamp: 1747958427295 offset: 2258326854
     timestamp: 1747958427333 offset: 2258326946
     timestamp: 1747958427360 offset: 2258327174
     timestamp: 1747958427371 offset: 2258327242
     timestamp: 1747958427392 offset: 2258327338
     timestamp: 1747958427445 offset: 2258327465
     ...
  6. Issue 1: Disk-I/O Spike & File Descriptor Release (Issue Overview & Initial Observations)
     • During a Kafka upgrade, restarting certain brokers in the cluster took significantly longer than expected, and metrics showed a pronounced spike in disk I/O.
     • At the same time, the total number of file descriptors (FDs) in use on the node dropped off sharply.
     [Charts: disk I/O and allocated FDs around the restart window]
  7. Issue 1: Disk-I/O Spike - Persistent File Descriptor Growth (Issue Overview & Initial Observations)
     • In clusters that experienced a disk-I/O spike on restart, the number of open file descriptors had kept growing for over two months.
     • In contrast, healthy clusters released file descriptors within about one day.
  8. Issue 1: Disk-I/O Spike - Deleted Files (Issue Overview & Initial Observations)
     • Running lsof showed many file descriptors marked DEL in the FD column, indicating that the underlying files had been deleted.
     • On brokers with a large number of open file descriptors, we counted them and found that roughly 90% pointed to deleted files.

     > sudo cat /proc/$KAFKA_PID/maps
     ...
     7df218400000-7df218e00000 rw-s 00000000 08:11 1268777368 /kafka-logs/data/<topic>-<partition>/00000000011769274742.timeindex.deleted (deleted)
     7df219000000-7df219a00000 rw-s 00000000 08:11 1268777367 /kafka-logs/data/<topic>-<partition>/00000000011769274742.index.deleted (deleted)
     7df219c00000-7df21a600000 rw-s 00000000 08:11 1269108428 /kafka-logs/data/<topic>-<partition>/00000000213560781794.timeindex.deleted (deleted)
     7df21a800000-7df21b200000 rw-s 00000000 08:11 1269108427 /kafka-logs/data/<topic>-<partition>/00000000213560781794.index.deleted (deleted)
     7df21b400000-7df21be00000 rw-s 00000000 08:11 1270486651 /kafka-logs/data/<topic>-<partition>/00000000013427564689.timeindex.deleted (deleted)
     7df21c000000-7df21ca00000 rw-s 00000000 08:11 1270486650 /kafka-logs/data/<topic>-<partition>/00000000013427564689.index.deleted (deleted)
     7df21cc00000-7df21d600000 rw-s 00000000 08:11 1269727492 /kafka-logs/data/<topic>-<partition>/00000000047879681875.timeindex.deleted (deleted)
     7df21d800000-7df21e200000 rw-s 00000000 08:11 1269727484 /kafka-logs/data/<topic>-<partition>/00000000047879681875.index.deleted (deleted)

     > sudo lsof -p 2209787
     COMMAND PID     USER  FD  TYPE DEVICE SIZE/OFF NODE       NAME
     java    2209787 kafka cwd DIR  253,0  55       403408548  /usr/apache/kafka/3.5.1
     java    2209787 kafka rtd DIR  253,0  329      512        /
     java    2209787 kafka txt REG  253,0  18544    2080389093 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.352.b08-2.el9_1.x86_64/jre/bin/java
     java    2209787 kafka DEL REG  8,17            1268777368 /kafka-logs/data/<topic>-<partition>/00000000011769274742.timeindex.deleted
     java    2209787 kafka DEL REG  8,17            1268777367 /kafka-logs/data/<topic>-<partition>/00000000011769274742.index.deleted
     java    2209787 kafka DEL REG  8,17            1269108428 /kafka-logs/data/<topic>-<partition>/00000000213560781794.timeindex.deleted
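The manual inspection above can be scripted. A small Linux-only sketch (a hypothetical helper, not part of Kafka) that counts a process's mappings still pointing at deleted index files by scanning procfs the same way we did by hand:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Counts memory mappings of a process that point at deleted index/timeindex
// files, by scanning /proc/<pid>/maps for "(deleted)" entries.
public class DeletedMappingCounter {
    static long countDeletedIndexMappings(long pid) throws IOException {
        try (var lines = Files.lines(Path.of("/proc/" + pid + "/maps"))) {
            return lines
                .filter(l -> l.endsWith("(deleted)"))
                .filter(l -> l.contains(".index") || l.contains(".timeindex"))
                .count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to inspecting this JVM itself when no PID is given.
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        System.out.println(countDeletedIndexMappings(pid));
    }
}
```

On a healthy broker this count should quickly return to zero after segment deletion; on the affected brokers it kept growing.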
  9. Issue 2: Broker Crash (Issue Overview & Initial Observations)
     • The broker suddenly crashed with the following error log:

     [2025-03-29 09:37:03,218] ERROR Error while appending records to <topic>-<partition> in dir /kafka-logs/data ...
     java.io.IOException: Map failed
         at java.base/sun.nio.ch.FileChannelImpl.mapInternal(FileChannelImpl.java:1127)
         at java.base/sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:1032)
         at org.apache.kafka.storage.internals.log.AbstractIndex.createMappedBuffer(AbstractIndex.java:467)
         at org.apache.kafka.storage.internals.log.AbstractIndex.createAndAssignMmap(AbstractIndex.java:105)
         at org.apache.kafka.storage.internals.log.AbstractIndex.<init>(AbstractIndex.java:83)
         at org.apache.kafka.storage.internals.log.TimeIndex.<init>(TimeIndex.java:65)
         at org.apache.kafka.storage.internals.log.LazyIndex.loadIndex(LazyIndex.java:242)
         at org.apache.kafka.storage.internals.log.LazyIndex.get(LazyIndex.java:179)
         at org.apache.kafka.storage.internals.log.LogSegment.timeIndex(LogSegment.java:146)
         at org.apache.kafka.storage.internals.log.LogSegment.readMaxTimestampAndOffsetSoFar(LogSegment.java:201)
         at org.apache.kafka.storage.internals.log.LogSegment.maxTimestampSoFar(LogSegment.java:211)
         at org.apache.kafka.storage.internals.log.LogSegment.append(LogSegment.java:262)
         at kafka.log.LocalLog.append(LocalLog.scala:417)
         ...
     Caused by: java.lang.OutOfMemoryError: Map failed
         at java.base/sun.nio.ch.FileChannelImpl.map0(Native Method)
         at java.base/sun.nio.ch.FileChannelImpl.mapInternal(FileChannelImpl.java:1124)
         ... 33 more
  10. Issue 2: Broker Crash - vm.max_map_count (Issue Overview & Initial Observations)
      • The broker hit the Linux vm.max_map_count limit, causing it to crash.
      • ⇒ As a workaround, we increased this value.
      • Because deleted index files remained memory-mapped, the total number of mappings continued to grow.

      # before
      > sysctl vm.max_map_count
      vm.max_map_count = 262144
      # after
      > sysctl vm.max_map_count
      vm.max_map_count = 1048576

      https://stackoverflow.com/questions/43042144/kafka-server-failed-to-start-java-io-ioexception-map-failed
      https://kafka.apache.org/documentation/#os
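How close a process is to this limit can be checked by comparing its mapping count in procfs against the configured maximum. A Linux-only sketch (a hypothetical helper, not part of Kafka):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Compares this process's current mapping count against the kernel's
// vm.max_map_count limit; mmap starts failing (the "Map failed"
// OutOfMemoryError above) once the count reaches the limit.
public class MapCountHeadroom {
    // Returns { current mapping count, vm.max_map_count }.
    static long[] usage() throws IOException {
        long limit = Long.parseLong(
            Files.readString(Path.of("/proc/sys/vm/max_map_count")).trim());
        try (var lines = Files.lines(Path.of("/proc/self/maps"))) {
            return new long[] { lines.count(), limit };
        }
    }

    public static void main(String[] args) throws IOException {
        long[] u = usage();
        System.out.printf("mappings: %d / %d (%.2f%% of vm.max_map_count)%n",
                u[0], u[1], 100.0 * u[0] / u[1]);
    }
}
```

Raising the sysctl only buys headroom; with the leak in place the count still grows without bound.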
  11. Comparison with KAFKA-4614 (Root Cause Analysis)
      • We reviewed past JIRA issues related to deleted files and found KAFKA-4614 [1].
      • Delaying the unmap of a deleted file until GC can miss the filesystem’s inode cache, trigger disk I/O, and degrade performance.
      • The fix was to call AbstractIndex.forceUnmap() at file-deletion time, so the mapping is torn down while the cache entry still exists.
      • The symptom in KAFKA-4614 matches what we observed.
      • However, even though the KAFKA-4614 patch has been merged upstream, the deleted index/timeindex files were not being unmapped in our case.
      [Diagram: without the fix, the JVM unlinks the file (OffsetIndex.delete() → sys_unlink, marked deleted at the VFS level) but index.deleted stays mapped and its inode alive until a much later Mixed GC issues sys_munmap; by then the filesystem's inode cache entry is gone, the unmap needs slow disk access, and GC threads block application threads.]

      [1] https://issues.apache.org/jira/browse/KAFKA-4614
  12. Deleted-File Unmapping Behavior (Root Cause Analysis)
      • We suspected a bug in the index-file deletion logic, so we spun up a dev cluster with aggressive log-rotation settings to observe the behavior.
      • We lowered log.segment.bytes and log.retention.ms for rapid turnover.
      • While remote-debugging via the IDE, we used strace to capture system calls and concurrently inspected /proc/PID/maps.
      • We confirmed that Kafka appends the .deleted suffix, schedules the deletion, and then calls AbstractIndex.forceUnmap().

      Debug log from IDE:
      Breakpoint reached at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5(LocalLog.scala:927)
      deleteSegments: /kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted
      Breakpoint reached at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$5(LocalLog.scala:927)
      deleteSegments: /kafka-logs/data/sandbox-4/00000000000000461179.timeindex.deleted
      Breakpoint reached at org.apache.kafka.storage.internals.log.AbstractIndex.forceUnmap(AbstractIndex.java:404)
      unmap: /kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted
      Breakpoint reached at org.apache.kafka.storage.internals.log.AbstractIndex.forceUnmap(AbstractIndex.java:404)
      unmap: /kafka-logs/data/sandbox-4/00000000000000461179.timeindex.deleted

      strace:
      [pid 2126035] openat(AT_FDCWD, "/kafka-logs/data/sandbox-5/00000000000000460772.timeindex", O_RDWR|O_CREAT, 0666) = 305
      [pid 2126035] mmap(NULL, 12, PROT_READ|PROT_WRITE, MAP_SHARED, 305, 0) = 0x7f40b802a000
      ...
      [pid 2126035] munmap(0x7f40b802a000, 12) = 0
      [pid 2126035] unlink("/kafka-logs/data/sandbox-5/00000000000000460772.timeindex.deleted") = 0
  13. Duplicate Memory Mappings (Root Cause Analysis)
      • We observed that some index files, which had been unlinked and should have been unmapped, remained mapped in memory.
      • A deeper inspection of /proc/PID/maps revealed that the same index file was mapped twice at different sizes.
      • The first mapping was never unmapped, causing it to linger in memory.

      > sudo cat /proc/$KAFKA_PID/maps | grep deleted
      7f3fc1800000-7f3fc2200000 rw-s 00000000 fc:02 9440471 /kafka-logs/data/sandbox-4/00000000000000461285.timeindex.deleted (deleted)
      7f3fc2400000-7f3fc2e00000 rw-s 00000000 fc:02 9440459 /kafka-logs/data/sandbox-4/00000000000000461285.index.deleted (deleted)

      > sudo cat /proc/$KAFKA_PID/maps | grep 236433.timeindex
      7f6908c00000-7f6909600000 rw-s 00000000 fc:02 1619450 /kafka-logs/data/sandbox-5/00000000000000236433.timeindex
      7f6a2408c000-7f6a2408d000 rw-s 00000000 fc:02 1619450 /kafka-logs/data/sandbox-5/00000000000000236433.timeindex
  14. forceUnmap() (Root Cause Analysis)
      All methods on the following slides belong to the AbstractIndex class.
      • The forceUnmap() method unmaps the MappedByteBuffer instance held in the mmap field of AbstractIndex.
      • This mmap instance corresponds to the file mapping of the index file seen in procfs.
      • New mmap instances are created in two places:
        • in the constructor, via createAndAssignMmap();
        • when the index file is resized, via resize(), which changes the file size and recreates the mmap instance.

      https://github.com/apache/kafka/blob/70d6ff42debf7e17478beb899fb5756bfbdbfbb5/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java

      public abstract class AbstractIndex implements Closeable {
          ...
          private volatile MappedByteBuffer mmap;
          ...
          public void forceUnmap() throws IOException {
              try {
                  ByteBufferUnmapper.unmap(file.getAbsolutePath(), mmap);
              } finally {
                  mmap = null;
              }
          }
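For context, Java offers no public API to unmap a MappedByteBuffer; the mapping normally lives until the buffer is garbage-collected. Kafka's ByteBufferUnmapper reaches into JDK internals instead. A simplified sketch of the same idea on Java 9+, using sun.misc.Unsafe.invokeCleaner (a hypothetical demo class; Kafka's real implementation differs in detail):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class UnmapSketch {
    // Releases a MappedByteBuffer's mapping immediately instead of waiting for
    // GC. The buffer must never be touched again afterwards -- which is why
    // forceUnmap() above nulls the mmap field in a finally block.
    static void unmap(MappedByteBuffer buffer) throws Exception {
        Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        sun.misc.Unsafe unsafe = (sun.misc.Unsafe) f.get(null);
        unsafe.invokeCleaner(buffer); // munmap happens here, not at some later GC
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("demo", ".index");
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(4096);
            MappedByteBuffer mmap =
                raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            mmap.putInt(0, 42);
            unmap(mmap); // without this, the mapping lingers until the buffer is GC'd
        } finally {
            file.delete();
        }
    }
}
```

This is also why the KAFKA-4614 behavior exists at all: without an explicit unmap, the munmap syscall is deferred to whenever GC happens to collect the buffer.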
  15. Missing munmap on Index File Resize (Root Cause Analysis)
      • On index-file resize, the old memory mapping is never freed.
      • Kafka may resize an index file during segment rotation or when loading an active segment at broker startup.
      • In this process, the ftruncate and mmap syscalls are issued, but the old mapping is not freed.
      • As the code comment notes, the Windows and z/OS kernels cannot truncate a file while it is mapped, so those platforms unmap first.
      • On Linux, safeForceUnmap() is not executed.

      https://github.com/apache/kafka/blob/70d6ff42debf7e17478beb899fb5756bfbdbfbb5/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java

      public boolean resize(int newSize) throws IOException {
          int roundedNewSize = roundDownToExactMultiple(newSize, entrySize());
          RandomAccessFile raf = new RandomAccessFile(file, "rw");          /* openat */
          try {
              /* Windows or z/OS won't let us modify the file length while the file is mmapped :-( */
              if (OperatingSystem.IS_WINDOWS || OperatingSystem.IS_ZOS)
                  safeForceUnmap();                                         /* munmap (skipped on Linux) */
              raf.setLength(roundedNewSize);                                /* ftruncate */
              this.length = roundedNewSize;
              mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, roundedNewSize); /* mmap */
              return true;
          ...

      protected void safeForceUnmap() {
          if (mmap != null) {
              try {
                  forceUnmap();
              } catch (Throwable t) {
                  log.error("Error unmapping index {}", file, t);
              }
          }
      }
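The orphaned mapping can be reproduced outside Kafka by imitating what resize() does on Linux: truncate and remap the file without unmapping first. A Linux-only sketch (a hypothetical demo class that counts the leak via /proc/self/maps):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResizeLeakDemo {
    // Maps a file, then "resizes" it the way AbstractIndex.resize() does on
    // Linux (ftruncate + a second mmap, no munmap), and counts how many
    // mappings of the file are left in /proc/self/maps.
    static long mappingsAfterLeakyResize() throws Exception {
        File file = File.createTempFile("00000000000000000000", ".index");
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(10 * 1024 * 1024);
            MappedByteBuffer oldMmap =
                raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, raf.length());

            raf.setLength(4096); // ftruncate while still mapped: legal on Linux
            MappedByteBuffer newMmap =
                raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);

            long count;
            try (var lines = Files.lines(Path.of("/proc/self/maps"))) {
                count = lines.filter(l -> l.contains(file.getName())).count();
            }
            // Touch both buffers so neither can be reclaimed before we counted.
            newMmap.putInt(0, oldMmap.getInt(0));
            return count;
        } finally {
            file.delete();
        }
    }

    public static void main(String[] args) throws Exception {
        // Expect two mappings of the same file at different sizes, matching
        // the duplicate entries we saw on slide 13.
        System.out.println("mappings after leaky resize: " + mappingsAfterLeakyResize());
    }
}
```

Deleting the file afterwards only unlinks it; the stale mapping keeps the inode alive, which is exactly the "(deleted)" state seen in procfs.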
  16. Proposed Solutions
      1. Patch the Kafka code
         • Explicitly unmap during index-file resize on Linux as well, as is already done on Windows and z/OS.
      2. JVM parameter tuning
         • Not all clusters exhibit this issue: in healthy clusters, Mixed GC periodically reclaims the mappings for deleted files, while older mmaps are not collected by the young GC.
         • JVM tuning might alleviate the symptoms temporarily, but since we saw the problem without any JVM-side changes, it could recur under different workloads or configurations.
      ⇒ We chose to patch the Kafka code as the long-term solution.
  17. Applied Changes: AbstractIndex.resize(), Kafka v3.8.1 (Validation)
      • In resize(), call safeForceUnmap() before the ftruncate and the remapping.

      --- a/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
      +++ b/storage/src/main/java/org/apache/kafka/storage/internals/log/AbstractIndex.java
      @@ -198,9 +198,8 @@ public boolean resize(int newSize) throws IOException {
               try {
                   int position = mmap.position();

      -            /* Windows or z/OS won't let us modify the file length while the file is mmapped :-( */
      -            if (OperatingSystem.IS_WINDOWS || OperatingSystem.IS_ZOS)
      -                safeForceUnmap();
      +            safeForceUnmap();
                   raf.setLength(roundedNewSize);
                   this.length = roundedNewSize;
                   mmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, roundedNewSize);
  18. Applied Changes: Locking, Kafka v3.8.1 (Validation)
      • While researching how to safely guard mmap instances, we discovered KAFKA-7442 [1].
      • That issue reported the same problem we encountered, but the proposed PR was rejected on safety grounds.
      • ⇒ Acquire a lock at every point where the mmap instance is accessed.

      // storage/src/main/java/org/apache/kafka/storage/internals/log/OffsetIndex.java
      public OffsetPosition lookup(long targetOffset) {
          lock.lock();
          try {
              ByteBuffer idx = mmap().duplicate();
              int slot = largestLowerBoundSlotFor(idx, targetOffset, IndexSearchType.KEY);
              if (slot == -1)
                  return new OffsetPosition(baseOffset(), 0);
              else
                  return parseEntry(idx, slot);
          } finally {
              lock.unlock();
          }
      }

      [1] https://issues.apache.org/jira/browse/KAFKA-7442
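The same discipline can be shown in miniature: every read of the shared buffer and every buffer swap runs under one ReentrantLock, and readers work on a duplicate() so position/limit changes stay private to each caller. A simplified sketch with a heap ByteBuffer standing in for the real MappedByteBuffer (a hypothetical class, not Kafka code):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantLock;

public class GuardedIndex {
    private final ReentrantLock lock = new ReentrantLock();
    private ByteBuffer mmap = ByteBuffer.allocate(16); // stand-in for the MappedByteBuffer

    void writeInt(int pos, int value) {
        lock.lock();
        try {
            mmap.putInt(pos, value);
        } finally {
            lock.unlock();
        }
    }

    int readInt(int pos) {
        lock.lock();
        try {
            // duplicate() shares content but gives this reader its own position/limit
            return mmap.duplicate().getInt(pos);
        } finally {
            lock.unlock();
        }
    }

    // Analogue of resize(): swap in a new buffer while no reader holds the old one.
    void resize(int newSize) {
        lock.lock();
        try {
            ByteBuffer fresh = ByteBuffer.allocate(newSize);
            ByteBuffer dup = mmap.duplicate();
            dup.position(0);
            dup.limit(Math.min(mmap.capacity(), newSize));
            fresh.put(dup);
            mmap = fresh; // in the real fix: safeForceUnmap() the old mapping first
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        GuardedIndex idx = new GuardedIndex();
        idx.writeInt(0, 7);
        idx.resize(32);
        System.out.println(idx.readInt(0)); // prints 7: content survives the swap
    }
}
```

Because readers can no longer observe the mmap field mid-swap, the resize path is free to unmap the old buffer eagerly, which is what made the Linux safeForceUnmap() call safe to add.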
  19. Results (Validation)
      • Confirmed that no memory mappings remain for deleted index/timeindex files.
      • Verified from broker metrics and logs that there are no adverse side effects:
        • request-handler and network-processor thread idle percentages remain on par with other brokers;
        • no measurable impact on CPU, disk I/O, or memory utilization.

      > sudo cat /proc/775707/maps | awk '/deleted/ && /index/' | wc -l
      0
  20. Conclusion & Next Steps
      • Issues
        • Some Kafka clusters experienced high disk-I/O spikes on broker restart/shutdown.
        • Brokers occasionally crashed after hitting the vm.max_map_count limit.
      • Root cause
        • As in KAFKA-4614, deleted metadata (index) files were still held memory-mapped.
        • Variations in traffic and JVM tuning led to very infrequent Mixed GC runs, so mappings weren’t reclaimed in time.
        • ⇒ Fundamental flaw: the index-file resize logic never unmaps the old memory buffer.
      • Solution
        • Before resizing, invoke safeForceUnmap() to forcibly unmap the old buffer, and add locking around all mmap accesses.
      • Results
        • Confirmed the fix eliminates the issue with no measurable performance impact.
      • Next steps
        • Prepare and submit the patch for inclusion in the upstream Kafka project.