Buffer allocation and leak detection in Netty

Trustin Lee

April 06, 2017

Transcript

  1. Buffer allocation and leak detection in Netty
    이희승 (@trustin/イ·ヒスン), Apr 2017
  2. Agenda
    ▸ Buffer allocator
    ▸ Leak detector
    Blue text is a hyperlink; please do click!
  3. Buffer allocator

  4. Buffer allocator
    ▸ Allocates a reference-counted ByteBuf

        ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
        ByteBuf heapBuf = alloc.heapBuffer(1024);
        ByteBuf directBuf = alloc.directBuffer(1024);
        …
        heapBuf.release();
        directBuf.release();

    ▸ Why not just ByteBuffer.allocate() or allocateDirect()?
      ▸ Zeroing out a buffer takes time.
        ▸ This may change in Java 9.
      ▸ GC is neither free nor always cheap.
  5. PooledByteBufAllocator
    ▸ A variant of jemalloc
      ▸ Buddy memory allocator
        ▸ Little external fragmentation
      ▸ Slab allocation for sub-page allocations
        ▸ … to reduce the internal fragmentation for small buffers
    ▸ Uses a binary heap instead of a red-black tree
      ▸ Can be represented as a byte[]
      ▸ Less overhead and better cache locality
  6. Entry point
    ▸ An allocation request goes through PoolThreadCache
      ▸ Get one from recently released buffers if possible (no locking)
      ▸ If not available, get one from an arena (granular locking)
      ▸ A released buffer goes to ‘recently released buffers’.
        ▸ To the cache of the thread it was allocated from, via an MPSC queue
      ▸ Need to disable thread-local caches depending on usage pattern
    [Diagram: PooledByteBufAllocator → «thread-local» PoolThreadCache (HeapArena, DirectArena, recently released buffers) → «thread-safe» HeapArena and «thread-safe» DirectArena; label: "Get least-used arenas"]
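The allocation path on this slide can be sketched in plain Java. This is a toy model, not Netty's PoolThreadCache: the names ToyAllocator and ToyArena are hypothetical, buffers are plain byte[] arrays, and a released buffer simply goes to the releasing thread's cache (the MPSC hand-off back to the allocating thread is left out).

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of "thread-local cache first, locked arena second". */
final class ToyAllocator {
    /** Stand-in for an arena: shared between threads, so guarded by a lock. */
    static final class ToyArena {
        private final ReentrantLock lock = new ReentrantLock();
        int allocations; // how many times the locked slow path ran

        byte[] allocate(int size) {
            lock.lock();
            try {
                allocations++;
                return new byte[size];
            } finally {
                lock.unlock();
            }
        }
    }

    final ToyArena arena = new ToyArena();

    // Recently released buffers; each deque is touched only by its owner thread.
    private final ThreadLocal<ArrayDeque<byte[]>> cache =
            ThreadLocal.withInitial(ArrayDeque::new);

    byte[] allocate(int size) {
        ArrayDeque<byte[]> local = cache.get();
        byte[] cached = local.peekFirst();
        if (cached != null && cached.length == size) {
            return local.pollFirst(); // fast path: no locking
        }
        return arena.allocate(size);  // slow path: takes the arena lock
    }

    void release(byte[] buf) {
        // Simplification: cache on the releasing thread.
        cache.get().addFirst(buf);
    }
}
```

Allocating, releasing, and allocating the same size again on one thread returns the cached array without touching the arena a second time.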
  7. Arena
    [Diagram: the arena's chain of PoolChunkLists, each holding PoolChunks]
      «PoolChunkList» qInit [0%, 25%)[1]   (a new PoolChunk goes here)
      «PoolChunkList» q000 (0%, 50%)
      «PoolChunkList» q025 (25%, 75%)
      «PoolChunkList» q050 (50%, 100%)
      «PoolChunkList» q075 (75%, 100%)
      «PoolChunkList» q100 [100%, 100%]    (a full PoolChunk goes here)
    • PoolChunk – a 16[2] MiB memory block
    • Tries the list with higher chance first:
      • q050 → q025 → q000 → qInit → q075
    • Not pooled if > chunk.capacity
    • Moved down when:
      • chunk.usage >= list.maxUsage
    • Moved up when:
      • chunk.usage < list.minUsage
      • But never to qInit
    • Destroyed when
      • chunk.usage == 0 && !qInit.contains(chunk)
    [1] Note the range expression
      • [ or ] – inclusive, ( or ) – exclusive
    [2] Configurable
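The movement rules above can be sketched as a small state machine. This is an illustrative model only: ToyChunkList and relocate are hypothetical names, usage is an integer percentage, and the exclusive lower bound of q000 is modelled as minUsage = 1 so that a usage of 0% falls off the chain and the chunk is destroyed.

```java
/** Toy model of the chunk lists on the slide; usage is an integer percentage. */
enum ToyChunkList {
    Q_INIT(Integer.MIN_VALUE, 25), // no lower bound: a chunk here never moves up
    Q000(1, 50),
    Q025(25, 75),
    Q050(50, 100),
    Q075(75, 100),
    Q100(100, Integer.MAX_VALUE);  // no upper bound: a chunk here never moves down

    final int minUsage;
    final int maxUsage;

    ToyChunkList(int minUsage, int maxUsage) {
        this.minUsage = minUsage;
        this.maxUsage = maxUsage;
    }

    /** The next list "down" the chain (toward q100), or null at the end. */
    ToyChunkList down() {
        return this == Q100 ? null : values()[ordinal() + 1];
    }

    /** The next list "up" the chain; q000 has none, so chunks never return to qInit. */
    ToyChunkList up() {
        return (this == Q_INIT || this == Q000) ? null : values()[ordinal() - 1];
    }

    /** Returns the list that holds the chunk after a usage change, or null if it is destroyed. */
    static ToyChunkList relocate(ToyChunkList current, int usage) {
        ToyChunkList list = current;
        while (usage >= list.maxUsage) {   // moved down
            list = list.down();
        }
        while (usage < list.minUsage) {    // moved up
            list = list.up();
            if (list == null) {
                return null;               // usage dropped to 0 outside qInit
            }
        }
        return list;
    }
}
```

For example, a chunk in qInit that reaches 30% usage moves to q000, and a chunk in q000 whose usage drops to 0% is destroyed, while an empty chunk still in qInit stays where it is.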
  8. Chunk
    ▸ Consists of 3 main components
      ▸ A 16-MiB memory block (byte[] or NIO ByteBuffer)
      ▸ A binary heap (byte[] memoryMap)
        ▸ When memoryMap[id]=x, in the subtree rooted at id, the first node that is free to be allocated is at depth x (counted from 0)
          ▸ At depths [depthOf(id), x), there is no node that is free.
        ▸ memoryMap[id]=depthOf(id) → completely free
        ▸ memoryMap[id]>depthOf(id) → partially free
        ▸ memoryMap[id]=maxDepth+1 → fully allocated
        ▸ The root node ID is 1.
          ▸ memoryMap[0] is unused.
  9. Chunk
    ▸ Consists of 3 main components (cont’d)
      ▸ A look-up table (byte[] depthMap)
        ▸ Translates an index of a binary heap node into its depth
          ▸ depthMap[0] is unused likewise.
        ▸ Useful when:
          ▸ Updating memoryMap (the depthOf function)
          ▸ Calculating the length of allocation in memory block from node ID
          ▸ Calculating the offset of allocation in memory block from node ID
        ▸ When a chunk is completely free:
          ▸ memoryMap is equal to depthMap.
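A minimal sketch of how a depthMap can turn a node ID into the length and offset of its allocation. The class ToyDepthMap and the method names runLength and runOffset are assumptions made for this illustration; the page size and tree depth are constructor parameters so that a small tree can be inspected by hand.

```java
/** Toy depth look-up table for a complete binary tree stored in an array. */
final class ToyDepthMap {
    final int chunkSize;
    final byte[] depthMap;

    ToyDepthMap(int pageSize, int maxDepth) {
        this.chunkSize = pageSize << maxDepth;      // each leaf covers one page
        depthMap = new byte[1 << (maxDepth + 1)];   // index 0 is unused
        int id = 1;                                 // the root node ID is 1
        for (int d = 0; d <= maxDepth; d++) {
            int nodesAtDepth = 1 << d;
            for (int i = 0; i < nodesAtDepth; i++) {
                depthMap[id++] = (byte) d;
            }
        }
    }

    int depthOf(int id) {
        return depthMap[id];
    }

    /** Bytes covered by the node: the chunk is halved at every level. */
    int runLength(int id) {
        return chunkSize >>> depthOf(id);
    }

    /** Byte offset of the node's block within the chunk. */
    int runOffset(int id) {
        int indexInLevel = id ^ (1 << depthOf(id)); // clear the leading bit of the ID
        return indexInLevel * runLength(id);
    }
}
```

With an 8192-byte page and a depth of 11, chunkSize comes out at 16 MiB; with a depth of 3, node #14 is the seventh leaf and starts at offset 6 × 8192.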
  10. memoryMap visualized
    ▸ When empty:
        #1:    0
        #2-3:  1 1
        #4-7:  2 2 2 2
        #8-15: 3 3 3 3 3 3 3 3
    ▸ When #2 is allocated:
        #1:    1
        #2-3:  4 1
        #4-7:  2 2 2 2
        #8-15: 3 3 3 3 3 3 3 3
      [Annotation] No need to update since we traverse down from the root when allocating
  11. memoryMap visualized (cont’d)
    ▸ When #14 is allocated:
        #1:    2
        #2-3:  4 2
        #4-7:  2 2 2 3
        #8-15: 3 3 3 3 3 3 4 3
    ▸ When #2 is freed:
        #1:    1
        #2-3:  1 2
        #4-7:  2 2 2 3
        #8-15: 3 3 3 3 3 3 4 3
      [Annotation] No need to update; just traverse up from #2 to the root
      [Annotation] Not updated since the right sub-tree is in use
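The memoryMap updates shown on these two slides can be reproduced with a small buddy-tree sketch. ToyBuddyTree is a hypothetical class written for this illustration, not Netty's PoolChunk. It always picks the leftmost free node, so after #2 is taken a depth-3 request yields #12 rather than the #14 used on the slide, but the values at #1 and #3 follow the same pattern.

```java
/** Toy buddy allocator over a byte[] memoryMap, as described on the slides. */
final class ToyBuddyTree {
    private final byte unusable;   // maxDepth + 1 marks "fully allocated"
    final byte[] memoryMap;
    private final byte[] depthMap;

    ToyBuddyTree(int maxDepth) {
        this.unusable = (byte) (maxDepth + 1);
        int nodes = 1 << (maxDepth + 1);
        memoryMap = new byte[nodes];   // index 0 is unused
        depthMap = new byte[nodes];
        for (int id = 1; id < nodes; id++) {
            byte depth = (byte) (31 - Integer.numberOfLeadingZeros(id)); // floor(log2(id))
            memoryMap[id] = depth;     // a free chunk has memoryMap equal to depthMap
            depthMap[id] = depth;
        }
    }

    /** Allocates the leftmost free node at the given depth; returns its ID, or -1 if none. */
    int allocate(int depth) {
        int id = 1;
        if (memoryMap[id] > depth) {
            return -1;                 // nothing free at this depth
        }
        while (depthMap[id] < depth) {
            id <<= 1;                  // descend to the left child
            if (memoryMap[id] > depth) {
                id ^= 1;               // left subtree cannot serve it: take the sibling
            }
        }
        memoryMap[id] = unusable;
        for (int node = id; node > 1; node >>>= 1) {
            // A parent advertises the shallowest free depth among its children.
            memoryMap[node >>> 1] = (byte) Math.min(memoryMap[node], memoryMap[node ^ 1]);
        }
        return id;
    }

    /** Frees a node previously returned by allocate() and updates its ancestors. */
    void free(int id) {
        memoryMap[id] = depthMap[id];
        for (int node = id; node > 1; node >>>= 1) {
            byte a = memoryMap[node];
            byte b = memoryMap[node ^ 1];
            byte childDepth = depthMap[node];
            memoryMap[node >>> 1] = (a == childDepth && b == childDepth)
                    ? (byte) (childDepth - 1)  // both buddies free: parent is completely free
                    : (byte) Math.min(a, b);
        }
    }
}
```

Descendants of an allocated node are never rewritten; allocation only walks down from the root and freeing only walks up, which is what the slide annotations point out.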
  12. Chunk: a sub-page
    ▸ Acts as a slab allocator of ‘tiny’ or ‘small’ buffers
      ▸ … to reduce internal fragmentation
      ▸ Tiny – 32, 64, 96, 128, … 512 bytes
      ▸ Small – 1024, 2048, 4096 bytes
    ▸ Uses a leaf node of a chunk – i.e. an 8192-byte block
    ▸ Arena keeps the linked lists of PoolSubpages
      ▸ … so that it doesn’t need to traverse deep to the leaf level.
      ▸ Each size has its own linked list and lock
        ▸ 16 for tiny and 3 for small
    [Diagram: PoolSubpage]
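As a rough illustration of the size classes listed above, the sketch below rounds a request up to its tiny or small class and computes how many elements fit in one 8192-byte sub-page. ToySizeClasses is a hypothetical helper that follows the sizes stated on this slide; the exact classes in a given Netty version may differ.

```java
/** Toy size-class rounding that follows the slide: 16 tiny classes and 3 small classes. */
final class ToySizeClasses {
    static final int PAGE_SIZE = 8192;  // a leaf node of the chunk
    static final int MAX_TINY = 512;
    static final int MAX_SMALL = 4096;

    /** True if the request is served from a sub-page rather than from whole pages. */
    static boolean isSubpage(int reqCapacity) {
        return reqCapacity > 0 && reqCapacity <= MAX_SMALL;
    }

    /** Rounds a sub-page request up to its size class. */
    static int normalize(int reqCapacity) {
        if (!isSubpage(reqCapacity)) {
            throw new IllegalArgumentException("not a sub-page request: " + reqCapacity);
        }
        if (reqCapacity <= MAX_TINY) {
            return (reqCapacity + 31) & ~31;                // tiny: next multiple of 32
        }
        return Integer.highestOneBit(reqCapacity - 1) << 1; // small: next power of two
    }

    /** Number of equally sized slots a sub-page is divided into. */
    static int elementsPerSubpage(int elemSize) {
        return PAGE_SIZE / elemSize;
    }
}
```

Because every slot in a sub-page has the same size, a 33-byte request wastes at most 31 bytes here instead of occupying a whole 8192-byte page.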
  13. Granular locking revisited
    ▸ (No lock) If possible, get from the thread local cache.
    ▸ If tiny or small[1]:
      ▸ (Sub-page lock[2]) If possible, get from the sub-page list.
      ▸ (Chunk lock) Otherwise, allocate a sub-page in the leaf.
    ▸ (Chunk lock) Perform buddy allocation.
    [1] 0-4096 bytes
    [2] Each size has its own sub-page list and lock. We have 16 tiny sizes and 3 small sizes by default, so each arena has 19 (16 + 3) fine-grained locks for sub-page allocations, yielding less chance of contention.
  14. Future work
    ▸ Even more metrics
    ▸ Adopt the improvements made in jemalloc 4
  15. Leak detector

  16. Reference counting vs. GC
    ▸ Netty ByteBuf is reference-counted.
    ▸ What happens if a ByteBuf is GC’d?
      ▸ All is well if ByteBuf.refCnt == 0
      ▸ Leak if ByteBuf.refCnt != 0 because the allocator cannot reclaim it
    ▸ Object.finalize() or Weak/PhantomReferences?
      ▸ Nope. Too slow or not timely enough.
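To make the reference-counting contract concrete, here is a minimal sketch of a reference-counted object in plain Java. ToyRefCounted is a hypothetical class, not Netty's ByteBuf; the deallocator callback stands in for returning memory to the allocator.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy reference-counted resource: starts at 1 and is deallocated when it reaches 0. */
final class ToyRefCounted {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private final Runnable deallocator;

    ToyRefCounted(Runnable deallocator) {
        this.deallocator = deallocator;
    }

    int refCnt() {
        return refCnt.get();
    }

    ToyRefCounted retain() {
        for (;;) {
            int cnt = refCnt.get();
            if (cnt == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(cnt, cnt + 1)) {
                return this;
            }
        }
    }

    /** Returns true only for the call that drops the count to 0 and deallocates. */
    boolean release() {
        for (;;) {
            int cnt = refCnt.get();
            if (cnt == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(cnt, cnt - 1)) {
                if (cnt == 1) {
                    deallocator.run();
                    return true;
                }
                return false;
            }
        }
    }
}
```

If such an object becomes unreachable while refCnt() is still greater than 0, the deallocator never runs, which is the leak case described above.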
  17. Sampled leak detection
    ▸ It’s fine if we can detect leaks from a canary.
      ▸ Instrument leak detection code for less than 1% of buffers.
      ▸ Run a canary for a sufficient amount of time.
      ▸ Most leaks are reported real soon.
    ▸ It’s fine if a user can control overhead
      ▸ Disabled
      ▸ Simple – 1% sample rate without access recording
      ▸ Advanced – 1% sample rate with access recording
      ▸ Paranoid – 100% sample rate with access recording
      ▸ System properties:
        ▸ -Dio.netty.leakDetection.level=advanced
        ▸ -Dio.netty.leakDetection.maxRecords=4
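The four levels can be modelled as an enum that decides whether a newly allocated buffer is sampled. This is a sketch under assumptions: ToyLeakLevel is hypothetical, and SAMPLING_INTERVAL = 100 is chosen only to approximate the 1% rate quoted on the slide.

```java
import java.util.Locale;
import java.util.concurrent.ThreadLocalRandom;

/** Toy model of the leak detection levels from the slide. */
enum ToyLeakLevel {
    DISABLED, SIMPLE, ADVANCED, PARANOID;

    static final int SAMPLING_INTERVAL = 100; // assumed: about 1 in 100 buffers

    /** Parses a level name such as "advanced", ignoring case. */
    static ToyLeakLevel parse(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Decides whether a newly allocated buffer gets wrapped with a tracker. */
    boolean shouldTrack() {
        switch (this) {
            case DISABLED:
                return false;
            case PARANOID:
                return true;
            default:
                return ThreadLocalRandom.current().nextInt(SAMPLING_INTERVAL) == 0;
        }
    }

    /** Whether accesses to a tracked buffer are recorded. */
    boolean recordsAccesses() {
        return compareTo(ADVANCED) >= 0;
    }
}
```

A caller could read the level with ToyLeakLevel.parse(System.getProperty("io.netty.leakDetection.level", "simple")); the "simple" fallback in that call is an assumption of this sketch.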
  18. Sampled leak detection (cont’d)
    ▸ ResourceLeakDetector
      ▸ Maintains a ReferenceQueue<DefaultResourceLeak>

        public interface ResourceLeakTracker<T> {
            // Records an access (when level >= ADVANCED)
            void record();
            // Records an access with a hint
            void record(Object hint);
            // Closes the leak when refCnt == 0 (i.e. released correctly)
            boolean close(T trackedObject);
        }

        class DefaultResourceLeak<T> extends PhantomReference<T>
                implements ResourceLeakTracker<T>

    ▸ Allocator returns a wrapper of a sampled buffer:
      ▸ SimpleLeakAwareByteBuf
      ▸ AdvancedLeakAwareByteBuf
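The PhantomReference mechanism can be sketched with the JDK alone. ToyLeakDetector and its Tracker are hypothetical and much simpler than Netty's ResourceLeakDetector: a tracker that is closed before its referent is collected is removed from the active set, so only trackers that show up on the ReferenceQueue while still active are reported.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Toy leak detector: reports tracked objects that were collected without close(). */
final class ToyLeakDetector {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Set<Tracker> active = ConcurrentHashMap.newKeySet();

    final class Tracker extends PhantomReference<Object> {
        final String createdAt;

        Tracker(Object referent, String createdAt) {
            super(referent, queue);
            this.createdAt = createdAt;
        }

        /** Called on correct release; returns true only for the first call. */
        boolean close() {
            clear();                     // a cleared reference is never enqueued
            return active.remove(this);
        }
    }

    Tracker track(Object obj, String createdAt) {
        Tracker tracker = new Tracker(obj, createdAt);
        active.add(tracker);             // keeps the tracker itself reachable
        return tracker;
    }

    /** Drains the queue; every tracker that is still active belongs to a leaked object. */
    int reportLeaks() {
        int leaks = 0;
        Tracker tracker;
        while ((tracker = (Tracker) queue.poll()) != null) {
            if (tracker.close()) {
                System.err.println("LEAK: release() was not called before GC; created at "
                        + tracker.createdAt);
                leaks++;
            }
        }
        return leaks;
    }
}
```

Polling the queue at allocation time keeps the cost off the release path; when a leak actually gets reported depends on when the garbage collector runs.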
  19. Sampled leak detection (cont’d)

    12:05:24.374 [nioEventLoop-1-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected.
    Recent access records: 2
    #2:
        Hint: 'EchoServerHandler#0' will handle the message from this point.
        io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:329)
        io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:133)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
        io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
        java.lang.Thread.run(Thread.java:744)
    #1:
        io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:589)
        io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:125)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
        io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
        java.lang.Thread.run(Thread.java:744)
    Created at:
        io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:55)
        io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
        io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
        io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
        io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
        java.lang.Thread.run(Thread.java:744)
  20. Thank you