Slide 1

Slide 1 text

Buffer allocation and leak detection in Netty
이희승 (@trustin/イ·ヒスン)
Apr 2017

Slide 2

Slide 2 text

Agenda
▸ Buffer allocator
▸ Leak detector
Note: in the original slides, blue text is a hyperlink.

Slide 3

Slide 3 text

Buffer allocator

Slide 4

Slide 4 text

Buffer allocator
▸ Allocates a reference-counted ByteBuf:
    ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
    ByteBuf heapBuf = alloc.heapBuffer(1024);
    ByteBuf directBuf = alloc.directBuffer(1024);
    …
    heapBuf.release();
    directBuf.release();
▸ Why not just ByteBuffer.allocate() or allocateDirect()?
  ▸ Zeroing out a buffer takes time.
    ▸ This may change in Java 9.
  ▸ GC is neither free nor always cheap.

Slide 5

Slide 5 text

PooledByteBufAllocator
▸ A variant of jemalloc
  ▸ Buddy memory allocator
    ▸ Little external fragmentation
  ▸ Slab allocation for sub-page allocations
    ▸ … to reduce the internal fragmentation for small buffers
▸ Uses a binary heap instead of a red-black tree
  ▸ Can be represented as a byte[]
  ▸ Less overhead and better cache locality

Slide 6

Slide 6 text

Entry point
▸ An allocation request goes through PoolThreadCache.
  ▸ Get one from recently released buffers if possible (no locking).
  ▸ If not available, get one from an arena (granular locking).
▸ A released buffer goes to ‘recently released buffers’.
  ▸ To the cache of the thread it was allocated from, via an MPSC queue
▸ Thread-local caches may need to be disabled, depending on the usage pattern.
[Diagram: PooledByteBufAllocator → «thread-local» PoolThreadCache (HeapArena, DirectArena, recently released buffers) → «thread-safe» HeapArenas and DirectArenas. Each PoolThreadCache gets the least-used arenas.]
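The fast path and slow path above can be sketched with JDK classes only. This is a simplified illustration, not Netty's PoolThreadCache: the class and method names are hypothetical, a ConcurrentLinkedQueue stands in for the MPSC queue, and a single lock stands in for an arena.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch; none of these names are Netty classes.
final class ThreadCacheSketch {
    /** A pooled buffer that remembers the cache of the thread that allocated it. */
    static final class Buf {
        final byte[] data;
        final Queue<Buf> home;
        Buf(byte[] data, Queue<Buf> home) { this.data = data; this.home = home; }
    }

    private final int bufferSize;
    private final Object arenaLock = new Object();
    private int arenaAllocations; // guarded by arenaLock
    // ConcurrentLinkedQueue stands in for the MPSC queue mentioned on the slide.
    private final ThreadLocal<Queue<Buf>> cache =
            ThreadLocal.withInitial(ConcurrentLinkedQueue::new);

    ThreadCacheSketch(int bufferSize) { this.bufferSize = bufferSize; }

    Buf allocate() {
        Queue<Buf> local = cache.get();
        Buf recycled = local.poll();   // fast path: recently released, no lock taken
        if (recycled != null) {
            return recycled;
        }
        synchronized (arenaLock) {     // slow path: stand-in for an arena allocation
            arenaAllocations++;
            return new Buf(new byte[bufferSize], local);
        }
    }

    /** May be called from any thread; the buffer returns to its home cache. */
    void release(Buf buf) { buf.home.offer(buf); }

    int arenaAllocations() {
        synchronized (arenaLock) { return arenaAllocations; }
    }
}
```

Allocating, releasing and allocating again on the same thread reuses the same Buf and touches the arena lock only once. A workload that allocates on one thread and never allocates there again would leave buffers stranded in that thread's cache, which is one reason the caches may need to be disabled.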

Slide 7

Slide 7 text

Arena
[Diagram: the arena's chain of PoolChunkLists, each holding PoolChunks]
  «PoolChunkList» qInit [0%, 25%)[1]   ← A new PoolChunk goes here
  «PoolChunkList» q000  (0%, 50%)
  «PoolChunkList» q025  (25%, 75%)
  «PoolChunkList» q050  (50%, 100%)
  «PoolChunkList» q075  (75%, 100%)
  «PoolChunkList» q100  [100%, 100%]   ← A full PoolChunk goes here
● PoolChunk – a 16[2] MiB memory block
● Tries the list with higher chance first:
  ● q050 → q025 → q000 → qInit → q075
● Not pooled if > chunk.capacity
● Moved down when:
  ● chunk.usage >= list.maxUsage
● Moved up when:
  ● chunk.usage < list.minUsage
  ● But never to qInit
● Destroyed when:
  ● chunk.usage == 0 && !qInit.contains(chunk)
[1] Note the range expression
  ● [ or ] – inclusive, ( or ) – exclusive
[2] Configurable
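The move-down and move-up rules can be sketched as a small function over the usage ranges listed above. The names are hypothetical, usage is a whole percentage, and q100 is treated as having no list below it; this is an assumption-laden illustration, not Netty's PoolChunkList.

```java
// Hypothetical sketch of the chunk-list transitions; list bounds are taken from the slide.
final class ChunkListSketch {
    static final String[] NAMES = {"qInit", "q000", "q025", "q050", "q075", "q100"};
    static final int[] MIN_USAGE = {0, 0, 25, 50, 75, 100};
    static final int[] MAX_USAGE = {25, 50, 75, 100, 100, 100};

    /** Returns the index of the list a chunk settles in, given its current list and usage (%). */
    static int settle(int list, int usage) {
        int last = NAMES.length - 1;
        while (list < last && usage >= MAX_USAGE[list]) {
            list++;                      // moved down: chunk.usage >= list.maxUsage
        }
        while (list > 1 && usage < MIN_USAGE[list]) {
            list--;                      // moved up, but never back into qInit (index 0)
        }
        return list;
    }

    /** A chunk is destroyed once it is empty, unless it still sits in qInit. */
    static boolean destroyed(int list, int usage) {
        return usage == 0 && list != 0;
    }

    public static void main(String[] args) {
        System.out.println(NAMES[settle(0, 30)]); // q000
        System.out.println(NAMES[settle(3, 40)]); // q025
        System.out.println(NAMES[settle(2, 30)]); // q025
    }
}
```

Because the ranges overlap, a chunk at 30% usage stays in q025 if it is already there but lands in q000 when arriving from qInit. The overlap keeps a chunk from bouncing between lists on every small allocation or release.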

Slide 8

Slide 8 text

Chunk
▸ Consists of 3 main components
  ▸ A 16-MiB memory block (byte[] or NIO ByteBuffer)
  ▸ A binary heap (byte[] memoryMap)
    ▸ When memoryMap[id]=x, in the subtree rooted at id, the first node that is free to be allocated is at depth x (counted from 0)
    ▸ At depths [depthOf(id), x), there is no node that is free.
    ▸ memoryMap[id]=depthOf(id) → completely free
    ▸ memoryMap[id]>depthOf(id) → partially free
    ▸ memoryMap[id]=maxDepth+1 → fully allocated
    ▸ The root node ID is 1.
    ▸ memoryMap[0] is unused.

Slide 9

Slide 9 text

Chunk
▸ Consists of 3 main components (cont’d)
  ▸ A look-up table (byte[] depthMap)
    ▸ Translates an index of a binary heap node into its depth
    ▸ depthMap[0] is unused likewise.
    ▸ Useful when:
      ▸ Updating memoryMap (the depthOf function)
      ▸ Calculating the length of allocation in memory block from node ID
      ▸ Calculating the offset of allocation in memory block from node ID
▸ When a chunk is completely free:
  ▸ memoryMap is equal to depthMap.
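The length and offset calculations mentioned above follow from the tree shape alone: a node at depth d covers chunkSize / 2^d bytes, and nodes at the same depth sit side by side from left to right. A minimal sketch, assuming a 16-MiB chunk with 8192-byte leaves (so maxDepth is 11); the class and method names are hypothetical.

```java
// Hypothetical sketch: derives the size and position of a node's memory from its ID.
final class NodeGeometrySketch {
    static final int CHUNK_SIZE = 16 * 1024 * 1024; // 16 MiB
    static final int MAX_DEPTH = 11;                // 16 MiB / 2^11 = 8192-byte leaves
    static final byte[] DEPTH_MAP = new byte[1 << (MAX_DEPTH + 1)];

    static {
        // DEPTH_MAP[0] is unused; node 1 is the root at depth 0.
        for (int id = 1; id < DEPTH_MAP.length; id++) {
            DEPTH_MAP[id] = (byte) (31 - Integer.numberOfLeadingZeros(id));
        }
    }

    static int depthOf(int id) { return DEPTH_MAP[id]; }

    /** Number of bytes covered by the node. */
    static int lengthOf(int id) { return CHUNK_SIZE >>> depthOf(id); }

    /** Byte offset of the node's first byte within the chunk's memory block. */
    static int offsetOf(int id) {
        int indexWithinDepth = id - (1 << depthOf(id)); // 0 for the leftmost node at this depth
        return indexWithinDepth * lengthOf(id);
    }

    public static void main(String[] args) {
        System.out.println(lengthOf(3) + " @ " + offsetOf(3));       // 8388608 @ 8388608
        System.out.println(lengthOf(2049) + " @ " + offsetOf(2049)); // 8192 @ 8192
    }
}
```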

Slide 10

Slide 10 text

memoryMap visualized
▸ When empty:
    #1:    0
    #2-3:  1 1
    #4-7:  2 2 2 2
    #8-15: 3 3 3 3 3 3 3 3
▸ When #2 is allocated:
    #1:    1
    #2-3:  4 1
    #4-7:  2 2 2 2
    #8-15: 3 3 3 3 3 3 3 3
  ▸ The nodes below #2 need no update since we traverse down from the root when allocating.

Slide 11

Slide 11 text

memoryMap visualized (cont’d)
▸ When #14 is allocated:
    #1:    2
    #2-3:  4 2
    #4-7:  2 2 2 3
    #8-15: 3 3 3 3 3 3 4 3
▸ When #2 is freed:
    #1:    1
    #2-3:  1 2
    #4-7:  2 2 2 3
    #8-15: 3 3 3 3 3 3 4 3
▸ Annotations on the diagram:
  ▸ No need to update; just traverse up from #2 to the root
  ▸ Not updated since the right sub-tree is in use
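The three states shown above can be reproduced with a small array-backed tree. This is a sketch of the bookkeeping under the slide's rules, with hypothetical names, and not a copy of Netty's PoolChunk. With maxDepth = 3 the value 4 marks a fully allocated node.

```java
// Hypothetical sketch of the memoryMap bookkeeping; not Netty's PoolChunk.
final class MemoryMapSketch {
    final byte[] memoryMap; // memoryMap[0] is unused; the root is node 1
    final byte[] depthMap;  // depth of each node, never modified
    final byte unusable;    // maxDepth + 1 marks a fully allocated node

    MemoryMapSketch(int maxDepth) {
        int nodes = 1 << (maxDepth + 1);
        memoryMap = new byte[nodes];
        depthMap = new byte[nodes];
        unusable = (byte) (maxDepth + 1);
        for (int id = 1; id < nodes; id++) {
            byte depth = (byte) (31 - Integer.numberOfLeadingZeros(id));
            memoryMap[id] = depth; // an empty chunk: memoryMap equals depthMap
            depthMap[id] = depth;
        }
    }

    /** Finds and marks a free node at depth d (0..maxDepth); returns its ID, or -1 if none. */
    int allocate(int d) {
        int id = 1;
        if (d >= unusable || memoryMap[id] > d) {
            return -1;
        }
        while (depthMap[id] < d) {
            id <<= 1;               // try the left child first
            if (memoryMap[id] > d) {
                id ^= 1;            // the left subtree cannot serve depth d: go right
            }
        }
        markAllocated(id);
        return id;
    }

    /** Marks a node as allocated and refreshes its ancestors on the way up. */
    void markAllocated(int id) {
        memoryMap[id] = unusable;
        while (id > 1) {
            int parent = id >>> 1;
            memoryMap[parent] = (byte) Math.min(memoryMap[id], memoryMap[id ^ 1]);
            id = parent;
        }
    }

    /** Frees a node; a parent becomes completely free only when both children are. */
    void free(int id) {
        memoryMap[id] = depthMap[id];
        while (id > 1) {
            int parent = id >>> 1;
            byte self = memoryMap[id];
            byte sibling = memoryMap[id ^ 1];
            byte depth = depthMap[id];
            memoryMap[parent] = (self == depth && sibling == depth)
                    ? (byte) (depth - 1)
                    : (byte) Math.min(self, sibling);
            id = parent;
        }
    }
}
```

Calling markAllocated(2), markAllocated(14) and free(2) on a tree built with maxDepth = 3 walks through the same values as the slides. allocate(d) picks the leftmost suitable node, so on an empty tree allocate(1) returns #2.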

Slide 12

Slide 12 text

Chunk: a sub-page
▸ Acts as a slab allocator of ‘tiny’ or ‘small’ buffers
  ▸ … to reduce internal fragmentation
  ▸ Tiny – 32, 64, 96, 128, … 512 bytes
  ▸ Small – 1024, 2048, 4096 bytes
▸ Uses a leaf node of a chunk – i.e. an 8192-byte block
▸ Arena keeps the linked lists of PoolSubpages
  ▸ … so that it doesn’t need to traverse deep to the leaf level.
  ▸ Each size has its own linked list and lock
    ▸ 16 for tiny and 3 for small
[Diagram: PoolSubpage]
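Slab allocation inside one leaf can be sketched as a page cut into equally sized elements plus a bitmap of which elements are taken. The sketch below uses java.util.BitSet for the bitmap and hypothetical names; it is not Netty's PoolSubpage, and it ignores the per-size linked lists and locks.

```java
import java.util.BitSet;

// Hypothetical sketch of a slab carved out of one 8192-byte leaf.
final class SubpageSketch {
    private final int elemSize;
    private final int maxElems;
    private final BitSet used;

    SubpageSketch(int pageSize, int elemSize) {
        this.elemSize = elemSize;
        this.maxElems = pageSize / elemSize;
        this.used = new BitSet(maxElems);
    }

    /** Returns the byte offset of a free element within the page, or -1 if the page is full. */
    int allocate() {
        int index = used.nextClearBit(0);
        if (index >= maxElems) {
            return -1;
        }
        used.set(index);
        return index * elemSize;
    }

    /** Returns the element at the given offset to the slab. */
    void free(int offset) {
        used.clear(offset / elemSize);
    }

    int freeElems() {
        return maxElems - used.cardinality();
    }
}
```

An 8192-byte page serving 1024-byte buffers holds eight elements, so every element is used in full and no space is lost to rounding within the page.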

Slide 13

Slide 13 text

Granular locking revisited
▸ (No lock) If possible, get from the thread-local cache.
▸ If tiny or small[1]:
  ▸ (Sub-page lock[2]) If possible, get from the sub-page list.
  ▸ (Chunk lock) Otherwise, allocate a sub-page in the leaf.
▸ (Chunk lock) Perform buddy allocation.
[1] 0-4096 bytes
[2] Each size has its own sub-page list and lock. We have 16 tiny sizes and 3 small sizes by default, so each arena has 19 (16 + 3) fine-grained locks for sub-page allocations, yielding less chance of contention.

Slide 14

Slide 14 text

Future work
▸ Even more metrics
▸ Adopt the improvements made in jemalloc 4

Slide 15

Slide 15 text

Leak detector

Slide 16

Slide 16 text

Reference counting vs. GC
▸ Netty ByteBuf is reference-counted.
▸ What happens if a ByteBuf is GC’d?
  ▸ All is well if ByteBuf.refCnt == 0
  ▸ Leak if ByteBuf.refCnt != 0 because the allocator cannot reclaim it
▸ Object.finalize() or Weak/PhantomReferences?
  ▸ Nope. Too slow or not timely enough.
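The contract behind refCnt can be illustrated with a minimal reference-counted resource. This is a sketch with hypothetical names, assuming an initial count of 1 and a deallocation callback that runs exactly once when the count reaches 0; Netty's own ByteBuf implementation differs in detail.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a reference-counted resource; not Netty's implementation.
final class RefCountedSketch {
    private final AtomicInteger refCnt = new AtomicInteger(1); // the creator holds one reference
    private final Runnable deallocator;

    RefCountedSketch(Runnable deallocator) {
        this.deallocator = deallocator;
    }

    int refCnt() {
        return refCnt.get();
    }

    /** Adds a reference; fails if the resource has already been released. */
    RefCountedSketch retain() {
        for (;;) {
            int current = refCnt.get();
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return this;
            }
        }
    }

    /** Drops a reference; returns true when this call released the resource. */
    boolean release() {
        for (;;) {
            int current = refCnt.get();
            if (current <= 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    deallocator.run(); // e.g. hand the memory back to the allocator
                    return true;
                }
                return false;
            }
        }
    }
}
```

If such an object becomes unreachable while refCnt is still above 0, the deallocator never runs. That is the leak described above: the garbage collector reclaims the wrapper object, but the pooled memory is never returned.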

Slide 17

Slide 17 text

Sampled leak detection
▸ It’s fine if we can detect leaks from a canary.
  ▸ Instrument leak detection code for less than 1% of buffers.
  ▸ Run a canary for a sufficient amount of time.
  ▸ Most leaks are reported fairly soon.
▸ It’s fine if a user can control the overhead.
  ▸ Disabled
  ▸ Simple – 1% sample rate without access recording
  ▸ Advanced – 1% sample rate with access recording
  ▸ Paranoid – 100% sample rate with access recording
  ▸ System properties:
    ▸ -Dio.netty.leakDetection.level=advanced
    ▸ -Dio.netty.leakDetection.maxRecords=4
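The four levels boil down to two decisions per allocation: whether to track the buffer at all, and whether to record its accesses. A minimal sketch of those decisions follows; the class, the method names and the sampling interval of 128 (roughly 0.8%, in line with "less than 1%") are assumptions for illustration, not Netty's API.

```java
import java.util.Locale;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the sampling decision; names and interval are assumptions.
final class LeakSamplingSketch {
    enum Level { DISABLED, SIMPLE, ADVANCED, PARANOID }

    static final int SAMPLING_INTERVAL = 128; // assumed: about 1 in 128 buffers is tracked

    /** Parses a value such as the one passed via -Dio.netty.leakDetection.level. */
    static Level parse(String value) {
        return Level.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Decides whether to track a buffer, given a random draw in [0, SAMPLING_INTERVAL). */
    static boolean shouldTrack(Level level, int draw) {
        switch (level) {
            case DISABLED:
                return false;
            case PARANOID:
                return true;       // 100% sample rate
            default:
                return draw == 0;  // SIMPLE and ADVANCED share the sample rate
        }
    }

    static boolean shouldTrack(Level level) {
        return shouldTrack(level, ThreadLocalRandom.current().nextInt(SAMPLING_INTERVAL));
    }

    /** Access recording is what separates ADVANCED and PARANOID from SIMPLE. */
    static boolean recordsAccesses(Level level) {
        return level.ordinal() >= Level.ADVANCED.ordinal();
    }
}
```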

Slide 18

Slide 18 text

Sampled leak detection (cont’d)
▸ ResourceLeakDetector
  ▸ Maintains a ReferenceQueue
    public interface ResourceLeakTracker<T> {
        // Records an access (when level >= ADVANCED)
        void record();
        // Records an access with a hint
        void record(Object hint);
        // Closes the leak when refCnt == 0 (i.e. released correctly)
        boolean close(T trackedObject);
    }
    class DefaultResourceLeak extends PhantomReference<Object> implements ResourceLeakTracker<T>
▸ Allocator returns a wrapper of a sampled buffer:
  ▸ SimpleLeakAwareByteBuf
  ▸ AdvancedLeakAwareByteBuf
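The PhantomReference plus ReferenceQueue mechanism can be sketched with JDK classes alone. The names below are hypothetical and the sketch leaves out access records and sampling: a tracker that is closed on release is cleared and never enqueued, while a tracker whose buffer is garbage-collected first shows up in the queue and is reported.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the detection mechanism; not Netty's ResourceLeakDetector.
final class LeakDetectorSketch {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keeps trackers strongly reachable so that they can be enqueued after a leak.
    private final Set<Tracker> active = ConcurrentHashMap.newKeySet();

    final class Tracker extends PhantomReference<Object> {
        final String createdAt;

        Tracker(Object resource, String createdAt) {
            super(resource, queue);
            this.createdAt = createdAt;
        }

        /** Called when the resource is released correctly; returns false if already closed. */
        boolean close() {
            if (active.remove(this)) {
                clear(); // a cleared reference is never enqueued
                return true;
            }
            return false;
        }
    }

    Tracker track(Object resource, String createdAt) {
        Tracker tracker = new Tracker(resource, createdAt);
        active.add(tracker);
        return tracker;
    }

    /** Drains the queue and reports every tracker that was never closed. */
    int reportLeaks() {
        int leaks = 0;
        for (;;) {
            Tracker tracker = (Tracker) queue.poll();
            if (tracker == null) {
                return leaks;
            }
            if (active.remove(tracker)) {
                leaks++;
                System.err.println("LEAK: resource created at " + tracker.createdAt
                        + " was garbage-collected without being released");
            }
        }
    }

    int activeCount() {
        return active.size();
    }
}
```

A leak is only reported after the garbage collector has actually collected the buffer and a later call drains the queue, so reports arrive some time after the leak itself. This is why the canary has to run for a while.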

Slide 19

Slide 19 text

Sampled leak detection (cont’d)
12:05:24.374 [nioEventLoop-1-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected.
Recent access records: 2
#2:
    Hint: 'EchoServerHandler#0' will handle the message from this point.
    io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:329)
    io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:133)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
    java.lang.Thread.run(Thread.java:744)
#1:
    io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:589)
    io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:125)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
    java.lang.Thread.run(Thread.java:744)
Created at:
    io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:55)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
    io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
    java.lang.Thread.run(Thread.java:744)

Slide 20

Slide 20 text

Thank you