Buffer allocation and leak detection in Netty

Trustin Lee

April 06, 2017

Transcript

1. Buffer allocator

▸ Allocates a reference-counted ByteBuf

    ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
    ByteBuf heapBuf = alloc.heapBuffer(1024);
    ByteBuf directBuf = alloc.directBuffer(1024);
    …
    heapBuf.release();
    directBuf.release();

▸ Why not just ByteBuffer.allocate() or allocateDirect()?
  ▸ Zeroing out a buffer takes time.
    ▸ This may change in Java 9.
  ▸ GC is neither free nor always cheap.

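In practice, an allocated buffer is usually released in a try/finally block so the pool gets its memory back even if processing throws. A minimal, self-contained example using the allocator API shown above (the try/finally idiom itself is not on the slide):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.ByteBufAllocator;
    import io.netty.buffer.PooledByteBufAllocator;
    import java.nio.charset.StandardCharsets;

    public class AllocExample {
        public static void main(String[] args) {
            ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
            // Pooled, reference-counted buffer: no zeroing of a fresh region on
            // every allocation and no extra garbage for the GC to chase.
            ByteBuf buf = alloc.directBuffer(1024);
            try {
                buf.writeBytes("hello".getBytes(StandardCharsets.UTF_8));
                // ... hand the buffer to whoever needs it ...
            } finally {
                buf.release();  // returns the memory to the pool once refCnt reaches 0
            }
        }
    }
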
2. PooledByteBufAllocator

▸ A variant of jemalloc
  ▸ Buddy memory allocator
    ▸ Little external fragmentation
  ▸ Slab allocation for sub-page allocations
    ▸ … to reduce the internal fragmentation for small buffers
▸ Uses a binary heap instead of a red-black tree
  ▸ Can be represented as a byte[] (see the indexing sketch below)
  ▸ Less overhead and better cache coherence

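A minimal sketch of the "binary heap in a byte[]" idea (illustrative only, not Netty's code): node IDs index the array directly, so parent/child navigation and depth computation are plain bit operations, with no per-node objects to allocate or chase.

    // Node 1 is the root; index 0 is unused, exactly as on the later slides.
    final class HeapIndexing {
        static int parent(int id)     { return id >>> 1; }
        static int leftChild(int id)  { return id << 1; }
        static int rightChild(int id) { return (id << 1) + 1; }
        static int depthOf(int id)    { return 31 - Integer.numberOfLeadingZeros(id); }

        public static void main(String[] args) {
            // A tree with maxDepth = 3 needs 2^4 = 16 slots (ids 1..15, index 0 unused).
            byte[] memoryMap = new byte[16];
            for (int id = 1; id < memoryMap.length; id++) {
                memoryMap[id] = (byte) depthOf(id); // completely free: value == depth
            }
            System.out.println(depthOf(1));  // 0 (root)
            System.out.println(depthOf(14)); // 3 (leaf)
        }
    }
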
3. Entry point

▸ An allocation request goes through PoolThreadCache
  ▸ Get one from recently released buffers if possible (no locking)
  ▸ If not available, get one from an arena (granular locking)
▸ A released buffer goes to ‘recently released buffers’.
  ▸ To the cache of the thread it was allocated from, via an MPSC queue
▸ Need to disable thread-local caches depending on usage pattern

[Diagram: PooledByteBufAllocator → «thread-local» PoolThreadCache (recently released buffers, HeapArena, DirectArena) → «thread-safe» HeapArena / DirectArena; the allocator picks the least-used arenas]

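A simplified sketch of that flow (class and method names are made up for illustration; Netty uses an MPSC queue where a ConcurrentLinkedQueue stands in here, and real caches match by size class, which is omitted for brevity):

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // One instance belongs to one thread, mirroring the role of PoolThreadCache.
    final class ThreadCacheSketch {
        static final class Buffer {
            final int size;
            Buffer(int size) { this.size = size; }
        }

        // Releases may come from any thread; only the owner thread drains this queue.
        private final Queue<Buffer> released = new ConcurrentLinkedQueue<>();
        // Owner-thread-only cache of reusable buffers: no locking needed.
        private final ArrayDeque<Buffer> cache = new ArrayDeque<>();

        /** Called only by the owner thread. */
        Buffer allocate(int size) {
            for (Buffer b; (b = released.poll()) != null; ) {
                cache.offer(b);                    // move hand-offs into the local cache
            }
            Buffer cached = cache.poll();          // fast path: no lock at all
            return cached != null ? cached : allocateFromArena(size);
        }

        /** May be called from any thread, e.g. the thread that released the buffer. */
        void release(Buffer buf) {
            released.offer(buf);
        }

        /** Stand-in for the shared arena with its granular locking. */
        private synchronized Buffer allocateFromArena(int size) {
            return new Buffer(size);
        }
    }
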
4. Arena

[Diagram: an arena's chain of PoolChunkLists, each holding PoolChunks]
  «PoolChunkList» qInit [0%, 25%)[1]   ← a new PoolChunk goes here
  «PoolChunkList» q000  (0%, 50%)
  «PoolChunkList» q025  (25%, 75%)
  «PoolChunkList» q050  (50%, 100%)
  «PoolChunkList» q075  (75%, 100%)
  «PoolChunkList» q100  [100%, 100%]   ← a full PoolChunk goes here

• PoolChunk – a 16[2] MiB memory block
• Tries the list with the higher chance first:
  • q050 → q025 → q000 → qInit → q075
• Not pooled if > chunk.capacity
• Moved down when:
  • chunk.usage >= list.maxUsage
• Moved up when:
  • chunk.usage < list.minUsage
  • But never to qInit
• Destroyed when:
  • chunk.usage == 0 && !qInit.contains(chunk)

[1] Note the range expression: [ or ] – inclusive, ( or ) – exclusive
[2] Configurable

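The movement rules can be sketched like this (names and thresholds follow the slide; the code is an illustrative stand-in, not Netty's PoolChunkList). The overlapping usage windows give hysteresis, so a chunk does not bounce between lists on every allocate/release:

    final class PoolChunkListSketch {
        final String name;
        final int minUsage, maxUsage;
        PoolChunkListSketch prev, next;   // e.g. q000 <-> q025 <-> q050 <-> q075 <-> q100

        PoolChunkListSketch(String name, int minUsage, int maxUsage) {
            this.name = name; this.minUsage = minUsage; this.maxUsage = maxUsage;
        }

        /** List a chunk with the given usage should live in after an allocation. */
        PoolChunkListSketch afterAllocate(int usage) {
            PoolChunkListSketch list = this;
            while (usage >= list.maxUsage && list.next != null) {
                list = list.next;             // moved "down" towards fuller lists
            }
            return list;
        }

        /** List after a release; a fully free chunk is destroyed instead. */
        PoolChunkListSketch afterRelease(int usage) {
            if (usage == 0) return null;      // chunk.usage == 0 -> destroy (unless still in qInit)
            PoolChunkListSketch list = this;
            while (usage < list.minUsage && list.prev != null) {
                list = list.prev;             // moved "up" towards emptier lists, never to qInit
            }
            return list;
        }
    }
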
5. Chunk

▸ Consists of 3 main components
  ▸ A 16-MiB memory block (byte[] or NIO ByteBuffer)
  ▸ A binary heap (byte[] memoryMap)
    ▸ When memoryMap[id] = x, in the subtree rooted at id the first node that is free to be allocated is at depth x (counted from 0).
      ▸ At depths [depthOf(id), x), there is no node that is free.
    ▸ memoryMap[id] = depthOf(id) → completely free
    ▸ memoryMap[id] > depthOf(id) → partially free
    ▸ memoryMap[id] = maxDepth + 1 → fully allocated
    ▸ The root node ID is 1.
    ▸ memoryMap[0] is unused.

6. Chunk

▸ Consists of 3 main components (cont’d)
  ▸ A look-up table (byte[] depthMap)
    ▸ Translates an index of a binary heap node into its depth
    ▸ depthMap[0] is unused likewise.
    ▸ Useful when:
      ▸ Updating memoryMap (the depthOf function)
      ▸ Calculating the length of an allocation in the memory block from a node ID
      ▸ Calculating the offset of an allocation in the memory block from a node ID (see the arithmetic sketch below)
  ▸ When a chunk is completely free:
    ▸ memoryMap is equal to depthMap.

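The depth of a node is enough to recover both the length and the offset of its run within the memory block. A small sketch of that arithmetic, assuming the default 16 MiB chunk (illustrative only, not Netty's code):

    final class RunArithmetic {
        static final int CHUNK_SIZE = 16 * 1024 * 1024;

        static int depthOf(int id) { return 31 - Integer.numberOfLeadingZeros(id); }

        /** Length of the run owned by node `id`: the chunk halves at each level. */
        static int runLength(int id) { return CHUNK_SIZE >>> depthOf(id); }

        /** Offset of the run: the node's position within its level times the run length. */
        static int runOffset(int id) {
            int depth = depthOf(id);
            return (id - (1 << depth)) * (CHUNK_SIZE >>> depth);
        }

        public static void main(String[] args) {
            System.out.println(runLength(1));   // 16 MiB: the root owns the whole chunk
            System.out.println(runLength(14));  // CHUNK_SIZE / 8 (depth 3)
            System.out.println(runOffset(14));  // 6 * (CHUNK_SIZE / 8)
        }
    }
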
7. memoryMap visualized

▸ When empty:
    #1:     0
    #2-3:   1 1
    #4-7:   2 2 2 2
    #8-15:  3 3 3 3 3 3 3 3

▸ When #2 is allocated:
    #1:     1
    #2-3:   4 1
    #4-7:   2 2 2 2
    #8-15:  3 3 3 3 3 3 3 3
  No need to update the descendants of #2, since we traverse down from the root when allocating.

8. memoryMap visualized (cont’d)

▸ When #14 is allocated:
    #1:     2
    #2-3:   4 2
    #4-7:   2 2 2 3
    #8-15:  3 3 3 3 3 3 4 3

▸ When #2 is freed:
    #1:     1
    #2-3:   1 2
    #4-7:   2 2 2 3
    #8-15:  3 3 3 3 3 3 4 3
  No need to update the descendants; we just traverse up from #2 to the root.
  The entries under #3 stay as they are, since the right sub-tree is still in use.
  (The full allocate/free walk is sketched below.)

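Putting slides 5–8 together, the allocate/free walk over memoryMap can be written as a compact sketch (a simplified stand-in for PoolChunk's logic; the class and method names are mine):

    final class BuddySketch {
        final int maxDepth;
        final byte[] memoryMap;
        final byte[] depthMap;

        BuddySketch(int maxDepth) {
            this.maxDepth = maxDepth;
            int size = 1 << (maxDepth + 1);          // ids 1..(size-1); index 0 unused
            memoryMap = new byte[size];
            depthMap = new byte[size];
            for (int id = 1; id < size; id++) {
                depthMap[id] = (byte) (31 - Integer.numberOfLeadingZeros(id));
                memoryMap[id] = depthMap[id];        // completely free
            }
        }

        /** Allocate a node at depth d; returns its id, or -1 if nothing is free at that depth. */
        int allocate(int d) {
            if (memoryMap[1] > d) return -1;         // root says: no free node at depth d
            int id = 1;
            while (depthMap[id] < d) {               // walk down from the root
                id <<= 1;                            // try the left child first
                if (memoryMap[id] > d) id ^= 1;      // left child can't serve it -> take the right one
            }
            memoryMap[id] = (byte) (maxDepth + 1);   // mark fully allocated
            updateParents(id);
            return id;
        }

        void free(int id) {
            memoryMap[id] = depthMap[id];            // the node is completely free again
            updateParents(id);                       // walk up; descendants need no update
        }

        private void updateParents(int id) {
            while (id > 1) {
                id >>>= 1;
                memoryMap[id] = (byte) Math.min(memoryMap[id << 1], memoryMap[(id << 1) + 1]);
            }
        }
    }
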
9. Chunk: a sub-page

▸ Acts as a slab allocator of ‘tiny’ or ‘small’ buffers
  ▸ … to reduce internal fragmentation
  ▸ Tiny – 32, 64, 96, 128, … 512 bytes
  ▸ Small – 1024, 2048, 4096 bytes
▸ Uses a leaf node of a chunk – i.e. an 8192-byte block (a bitmap-style slab is sketched below)
▸ Arena keeps the linked lists of PoolSubpages
  ▸ … so that it doesn’t need to traverse deep to the leaf level.
  ▸ Each size has its own linked list and lock
    ▸ 16 for tiny and 3 for small

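A bitmap-based slab for one 8192-byte leaf page might look like the following sketch. PoolSubpage keeps a long[] bitmap in a similar spirit; this code is an illustration, not Netty's implementation:

    final class SubpageSketch {
        static final int PAGE_SIZE = 8192;

        final int elemSize;        // e.g. 512 -> 16 tiny buffers per page
        final int maxElems;
        final long[] bitmap;       // one bit per element: 1 = allocated
        int numAvail;

        SubpageSketch(int elemSize) {
            this.elemSize = elemSize;
            this.maxElems = PAGE_SIZE / elemSize;
            this.bitmap = new long[(maxElems + 63) / 64];
            this.numAvail = maxElems;
        }

        /** Returns the byte offset of an allocated element within the page, or -1 if full. */
        int allocate() {
            if (numAvail == 0) return -1;
            for (int i = 0; i < maxElems; i++) {
                if ((bitmap[i >>> 6] & (1L << (i & 63))) == 0) {
                    bitmap[i >>> 6] |= 1L << (i & 63);
                    numAvail--;
                    return i * elemSize;
                }
            }
            return -1;
        }

        void free(int offset) {
            int i = offset / elemSize;
            bitmap[i >>> 6] &= ~(1L << (i & 63));
            numAvail++;
        }
    }
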
10. Granular locking revisited

▸ (No lock) If possible, get from the thread-local cache.
▸ If tiny or small[1]:
  ▸ (Sub-page lock[2]) If possible, get from the sub-page list.
  ▸ (Chunk lock) Otherwise, allocate a sub-page in the leaf.
▸ (Chunk lock) Perform buddy allocation.

[1] 0–4096 bytes
[2] Each size has its own sub-page list and lock. We have 16 tiny sizes and 3 small sizes by default, so each arena has 19 (16 + 3) fine-grained locks for sub-page allocations, yielding less chance of contention.

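The striped-locking idea can be sketched as follows. The explicit ReentrantLock array and the lockFor mapping are assumptions for illustration; Netty itself uses synchronized blocks on the sub-page list heads and on the arena:

    import java.util.concurrent.locks.ReentrantLock;

    final class ArenaLockingSketch {
        // 16 tiny size classes (32..512 in steps of 32) plus 3 small ones (1024, 2048, 4096).
        private final ReentrantLock[] subpageLocks = new ReentrantLock[19];
        private final ReentrantLock chunkLock = new ReentrantLock();

        ArenaLockingSketch() {
            for (int i = 0; i < subpageLocks.length; i++) {
                subpageLocks[i] = new ReentrantLock();
            }
        }

        /** Picks the narrowest lock that covers the requested size. */
        private ReentrantLock lockFor(int size) {
            if (size <= 512)  return subpageLocks[(Math.max(size, 1) - 1) / 32]; // tiny: indices 0..15
            if (size <= 1024) return subpageLocks[16];                           // small
            if (size <= 2048) return subpageLocks[17];
            if (size <= 4096) return subpageLocks[18];
            return chunkLock;                                                    // normal: buddy allocation
        }

        void allocate(int size) {
            ReentrantLock lock = lockFor(size);
            lock.lock();
            try {
                // ... sub-page or buddy allocation happens here, under the narrowest lock ...
            } finally {
                lock.unlock();
            }
        }
    }
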
11. Reference counting vs. GC

▸ Netty ByteBuf is reference-counted.
▸ What happens if a ByteBuf is GC’d?
  ▸ All is well if ByteBuf.refCnt == 0.
  ▸ Leak if ByteBuf.refCnt != 0, because the allocator cannot reclaim it.
▸ Object.finalize() or Weak/PhantomReferences?
  ▸ Nope. Too slow or not timely enough.

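A small example of the reference-counting contract using the real ByteBuf API (the hand-off scenario itself is made up for illustration):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.PooledByteBufAllocator;

    public class RefCountExample {
        public static void main(String[] args) {
            ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(256);
            System.out.println(buf.refCnt());   // 1 right after allocation

            buf.retain();                       // another component keeps a reference
            System.out.println(buf.refCnt());   // 2

            buf.release();                      // the first owner is done
            buf.release();                      // last release returns the memory to the pool
            System.out.println(buf.refCnt());   // 0 -- reclaimed; a GC'd buffer whose
                                                // refCnt never reached 0 would be a leak

            // In channel handlers, ReferenceCountUtil.release(msg) is the usual way to
            // release, since it is a no-op for messages that are not reference-counted.
        }
    }
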
12. Sampled leak detection

▸ It’s fine if we can detect leaks from a canary.
  ▸ Instrument leak-detection code for less than 1% of buffers.
  ▸ Run a canary for a sufficient amount of time.
  ▸ Most leaks are reported very soon.
▸ It’s fine if a user can control the overhead.
  ▸ Disabled
  ▸ Simple – 1% sample rate without access recording
  ▸ Advanced – 1% sample rate with access recording
  ▸ Paranoid – 100% sample rate with access recording
  ▸ System properties:
    ▸ -Dio.netty.leakDetection.level=advanced
    ▸ -Dio.netty.leakDetection.maxRecords=4

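Besides the system property, the detection level can also be set programmatically with the ResourceLeakDetector API, for example:

    import io.netty.util.ResourceLeakDetector;
    import io.netty.util.ResourceLeakDetector.Level;

    public class LeakLevelExample {
        public static void main(String[] args) {
            // 100% sampling with access recording; useful for tests and canaries.
            ResourceLeakDetector.setLevel(Level.PARANOID);
            System.out.println(ResourceLeakDetector.getLevel());
        }
    }
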
13. Sampled leak detection (cont’d)

▸ ResourceLeakDetector
  ▸ Maintains a ReferenceQueue<DefaultResourceLeak>

    public interface ResourceLeakTracker<T> {
        // Records an access (when level >= ADVANCED)
        void record();

        // Records an access with a hint
        void record(Object hint);

        // Closes the leak when refCnt == 0 (i.e. released correctly)
        boolean close(T trackedObject);
    }

    class DefaultResourceLeak<T> extends PhantomReference<T> implements ResourceLeakTracker<T>

▸ Allocator returns a wrapper of a sampled buffer:
  ▸ SimpleLeakAwareByteBuf
  ▸ AdvancedLeakAwareByteBuf

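The mechanism underneath can be sketched with plain PhantomReferences and a ReferenceQueue (illustrative only; DefaultResourceLeak adds sampling, access records and report formatting on top of this idea):

    import java.lang.ref.PhantomReference;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // A tracker that is still "open" when its referent is collected is a leak.
    final class LeakDetectorSketch<T> {
        private final ReferenceQueue<T> refQueue = new ReferenceQueue<>();
        private final Set<Tracker> allLeaks = ConcurrentHashMap.newKeySet();

        final class Tracker extends PhantomReference<T> {
            Tracker(T obj) { super(obj, refQueue); }

            /** Called when the buffer was released correctly (refCnt reached 0). */
            boolean close() { return allLeaks.remove(this); }
        }

        Tracker track(T obj) {
            Tracker t = new Tracker(obj);
            allLeaks.add(t);
            return t;
        }

        /** Anything that reaches the queue while still registered was GC'd without release(). */
        void reportLeaks() {
            Reference<? extends T> ref;
            while ((ref = refQueue.poll()) != null) {
                if (allLeaks.remove(ref)) {
                    System.err.println("LEAK: buffer was garbage-collected before release()");
                }
            }
        }
    }
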
14. Sampled leak detection (cont’d)

    12:05:24.374 [nioEventLoop-1-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected.
    Recent access records: 2
    #2:
        Hint: 'EchoServerHandler#0' will handle the message from this point.
        io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:329)
        io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:133)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
        io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
        java.lang.Thread.run(Thread.java:744)
    #1:
        io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:589)
        io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:125)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
        io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
        java.lang.Thread.run(Thread.java:744)
    Created at:
        io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:55)
        io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
        io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
        io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
        io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
        java.lang.Thread.run(Thread.java:744)