Buffer allocation and leak detection in Netty

Trustin Lee

April 06, 2017

Transcript

  1. Buffer allocation and
    leak detection in Netty
    이희승 (@trustin/イ·ヒスン)
    Apr 2017

  2. Agenda
    ▸Buffer allocator
    ▸Leak detector
    Blue text is a hyperlink; please do click!

  3. Buffer allocator

  4. Buffer allocator
    ▸Allocates a reference-counted ByteBuf
    ByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;
    ByteBuf heapBuf = alloc.heapBuffer(1024);
    ByteBuf directBuf = alloc.directBuffer(1024);

    heapBuf.release();
    directBuf.release();
    ▸Why not just ByteBuffer.allocate() or
    allocateDirect()?
    ▸ Zeroing out a buffer takes time.
    ▸ This may change in Java 9.
    ▸ GC is neither free nor always cheap.

  5. PooledByteBufAllocator
    ▸A variant of jemalloc
    ▸ Buddy memory allocator
    ▸ Little external fragmentation
    ▸ Slab allocation for sub-page allocations
    ▸ … to reduce the internal fragmentation for small buffers
    ▸Uses a binary heap instead of a red-black tree
    ▸ Can be represented as a byte[]
    ▸ Less overhead and better cache locality
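
As a rough illustration of the buddy scheme above, the sketch below rounds a request up to a power of two and maps it to the tree depth whose nodes have exactly that size. The class and method names are hypothetical, and the 8192-byte page and 16 MiB chunk are the sizes quoted elsewhere in this deck; this is not Netty's actual code.

```java
final class BuddyDepthSketch {
    static final int PAGE_SIZE = 8192;              // leaf size quoted in the deck
    static final int CHUNK_SIZE = 16 * 1024 * 1024; // chunk size quoted in the deck
    static final int PAGE_SHIFTS = Integer.numberOfTrailingZeros(PAGE_SIZE);
    static final int MAX_DEPTH = Integer.numberOfTrailingZeros(CHUNK_SIZE) - PAGE_SHIFTS;

    /** Rounds a request (1..CHUNK_SIZE) up to a power of two, never below one page. */
    static int normalize(int size) {
        if (size <= PAGE_SIZE) {
            return PAGE_SIZE;
        }
        int highest = Integer.highestOneBit(size);
        return highest == size ? size : highest << 1;
    }

    /** Depth of the tree level whose nodes are exactly normalize(size) bytes long. */
    static int depthFor(int size) {
        int log2 = Integer.numberOfTrailingZeros(normalize(size));
        return MAX_DEPTH - (log2 - PAGE_SHIFTS);
    }

    public static void main(String[] args) {
        System.out.println("10000 bytes -> " + normalize(10_000)
                + " bytes at depth " + depthFor(10_000));
    }
}
```

The gap between the requested and the normalized size is the internal fragmentation that the slab allocator is meant to reduce for small buffers.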

  6. Entry point
    ▸An allocation request goes through
    PoolThreadCache
    ▸ Get one from recently released buffers if possible (no locking)
    ▸ If not available, get one from an arena (granular locking)
    ▸ A released buffer goes to ‘recently released buffers’.
    ▸ To the cache of the thread it was allocated from, via an MPSC queue
    ▸ Need to disable thread-local caches depending on usage pattern
    [Diagram: PooledByteBufAllocator → «thread-local» PoolThreadCache (recently released buffers, kept separately for HeapArena and DirectArena) → the least-used «thread-safe» HeapArena or DirectArena]
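
The two-level path described on this slide can be sketched as follows. This is a deliberately simplified model with hypothetical names: one buffer size, a single lock standing in for the arena, and releases that go to the releasing thread's own cache rather than being routed back through an MPSC queue.

```java
import java.util.ArrayDeque;

final class CachedAllocatorSketch {
    private final int bufferSize;
    private final Object arenaLock = new Object();
    // Per-thread stack of recently released buffers; needs no locking.
    private final ThreadLocal<ArrayDeque<byte[]>> cache =
            ThreadLocal.withInitial(ArrayDeque::new);
    int arenaAllocations; // number of slow-path allocations, guarded by arenaLock

    CachedAllocatorSketch(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    byte[] allocate() {
        byte[] cached = cache.get().pollFirst(); // fast path: thread-local, no lock
        if (cached != null) {
            return cached;
        }
        synchronized (arenaLock) {               // slow path: shared arena
            arenaAllocations++;
            return new byte[bufferSize];
        }
    }

    void release(byte[] buf) {
        cache.get().addFirst(buf);               // becomes a "recently released buffer"
    }

    public static void main(String[] args) {
        CachedAllocatorSketch alloc = new CachedAllocatorSketch(1024);
        byte[] first = alloc.allocate();
        alloc.release(first);
        System.out.println("reused: " + (alloc.allocate() == first));
    }
}
```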

  7. Arena
    PoolChunk – a 16 MiB[2] memory block
    PoolChunkLists in an arena, with their usage ranges[1]:
    ▸ qInit [0%, 25%) – a new PoolChunk goes here
    ▸ q000 (0%, 50%)
    ▸ q025 (25%, 75%)
    ▸ q050 (50%, 100%)
    ▸ q075 (75%, 100%)
    ▸ q100 [100%, 100%] – a full PoolChunk goes here
    ● Tries the list with higher chance first: q050 → q025 → q000 → qInit → q075
    ● Not pooled if > chunk.capacity
    ● Moved down (toward q100) when: chunk.usage >= list.maxUsage
    ● Moved up (toward q000) when: chunk.usage < list.minUsage
    ● But never to qInit
    ● Destroyed when: chunk.usage == 0 && !qInit.contains(chunk)
    [1] Note the range expression
    ● [ or ] – inclusive, ( or ) – exclusive
    [2] Configurable
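
The relocation rules on this slide can be sketched as a small function that returns the list a chunk ends up in after its usage changes. The names are hypothetical; usage is modelled as an integer percentage, and two bounds are assumptions made to encode the open ranges: a minimum of 1 for q000 and a maximum of 101 for q100.

```java
final class ChunkListSketch {
    static final String[] NAMES = {"qInit", "q000", "q025", "q050", "q075", "q100"};
    static final int[] MIN = {0, 1, 25, 50, 75, 100};     // q000 min of 1 is an assumption
    static final int[] MAX = {25, 50, 75, 100, 100, 101}; // q100 max of 101 is an assumption
    static final int DESTROYED = -1;

    /** Returns the index of the list the chunk belongs to, or DESTROYED. */
    static int relocate(int list, int usage) {
        // Move toward q100 while the chunk is too full for its list.
        while (list < NAMES.length - 1 && usage >= MAX[list]) {
            list++;
        }
        // Move toward q000 while the chunk is too empty, but never back to qInit.
        while (usage < MIN[list]) {
            if (list == 1) {
                return DESTROYED; // empty and not in qInit
            }
            list--;
        }
        return list;
    }

    public static void main(String[] args) {
        int list = relocate(0, 30); // a new chunk in qInit reaches 30% usage
        System.out.println("30% usage -> " + NAMES[list]);
    }
}
```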

  8. Chunk
    ▸Consists of 3 main components
    ▸ A 16-MiB memory block (byte[] or NIO ByteBuffer)
    ▸ A binary heap (byte[] memoryMap)
    ▸ When memoryMap[id]=x, in the subtree rooted at id, the first node
    that is free to be allocated is at depth x (counted from 0)
    ▸ At depths [depthOf(id), x), there is no node that is free.
    ▸ memoryMap[id]=depthOf(id)→ completely free
    ▸ memoryMap[id]>depthOf(id)→ partially free
    ▸ memoryMap[id]=maxDepth+1 → fully allocated
    ▸ The root node ID is 1.
    ▸ memoryMap[0] is unused.

  9. Chunk
    ▸Consists of 3 main components (cont’d)
    ▸ A look-up table (byte[] depthMap)
    ▸ Translates an index of a binary heap node into its depth
    ▸ depthMap[0] is unused likewise.
    ▸ Useful when:
    ▸ Updating memoryMap (the depthOf function)
    ▸ Calculating the length of allocation in memory block from node ID
    ▸ Calculating the offset of allocation in memory block from node ID
    ▸ When a chunk is completely free:
    ▸ memoryMap is equal to depthMap.
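
A minimal sketch of the two calculations listed above, assuming a 16 MiB chunk with 8192-byte leaves (so a maximum depth of 11). The class and method names are hypothetical and only illustrate the arithmetic.

```java
final class DepthMapSketch {
    static final int MAX_DEPTH = 11;                 // 16 MiB / 8192 B = 2^11 leaves
    static final int CHUNK_SIZE = 16 * 1024 * 1024;
    static final byte[] DEPTH_MAP = buildDepthMap();

    /** Look-up table from node ID to depth; index 0 is unused. */
    static byte[] buildDepthMap() {
        byte[] map = new byte[1 << (MAX_DEPTH + 1)];
        for (int id = 1; id < map.length; id++) {
            map[id] = (byte) (31 - Integer.numberOfLeadingZeros(id)); // floor(log2(id))
        }
        return map;
    }

    /** Length in bytes of the memory region covered by a node. */
    static int length(int id) {
        return CHUNK_SIZE >>> DEPTH_MAP[id];
    }

    /** Offset in bytes of the node's region from the start of the chunk. */
    static int offset(int id) {
        int indexInLevel = id - (1 << DEPTH_MAP[id]); // position among nodes of this depth
        return indexInLevel * length(id);
    }

    public static void main(String[] args) {
        System.out.println("node 2049: offset " + offset(2049) + ", length " + length(2049));
    }
}
```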

  10. memoryMap visualized
    ▸When empty:
    #1: 0
    #2-3: 1 1
    #4-7: 2 2 2 2
    #8-15: 3 3 3 3 3 3 3 3
    ▸When #2 is allocated:
    #1: 1
    #2-3: 4 1
    #4-7: 2 2 2 2
    #8-15: 3 3 3 3 3 3 3 3
    The descendants of #2 (#4-5 and #8-11) need no update, since we traverse down from the root when allocating

  11. memoryMap visualized (cont’d)
    ▸When #14 is allocated:
    #1: 2
    #2-3: 4 2
    #4-7: 2 2 2 3
    #8-15: 3 3 3 3 3 3 4 3
    ▸When #2 is freed:
    #1: 1
    #2-3: 1 2
    #4-7: 2 2 2 3
    #8-15: 3 3 3 3 3 3 4 3
    No need to update; just traverse up from #2 to the root
    Not updated since the right sub-tree is in use
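
The memoryMap states visualized on the last two slides can be reproduced with a small sketch. It uses a tree of maximum depth 3 and hypothetical names; `markAllocated` is exposed directly so that node #14 can be picked by hand as on the slide, since a search from the root would pick a different free leaf.

```java
final class MemoryMapSketch {
    final int maxDepth;
    final byte[] memoryMap; // depth of the first free node in each subtree
    final byte[] depthMap;  // depth of each node; index 0 is unused in both arrays

    MemoryMapSketch(int maxDepth) {
        this.maxDepth = maxDepth;
        int nodes = 1 << (maxDepth + 1);
        memoryMap = new byte[nodes];
        depthMap = new byte[nodes];
        for (int id = 1; id < nodes; id++) {
            byte depth = (byte) (31 - Integer.numberOfLeadingZeros(id));
            memoryMap[id] = depth; // a completely free chunk: memoryMap equals depthMap
            depthMap[id] = depth;
        }
    }

    /** Allocates a free node at the given depth (0..maxDepth); returns its ID or -1. */
    int allocate(int depth) {
        if (memoryMap[1] > depth) {
            return -1; // no free node at this depth anywhere in the chunk
        }
        int id = 1;
        while (depthMap[id] < depth) {
            id <<= 1;                    // try the left child first
            if (memoryMap[id] > depth) {
                id ^= 1;                 // left subtree cannot serve it; use the right one
            }
        }
        markAllocated(id);
        return id;
    }

    /** Marks a node as fully allocated and updates its ancestors. */
    void markAllocated(int id) {
        memoryMap[id] = (byte) (maxDepth + 1);
        while (id > 1) {
            int parent = id >>> 1;
            memoryMap[parent] = (byte) Math.min(memoryMap[id], memoryMap[id ^ 1]);
            id = parent;
        }
    }

    /** Frees a node and updates its ancestors; descendants are left untouched. */
    void free(int id) {
        memoryMap[id] = depthMap[id];
        while (id > 1) {
            int parent = id >>> 1;
            byte depth = depthMap[id];
            boolean bothFree = memoryMap[id] == depth && memoryMap[id ^ 1] == depth;
            memoryMap[parent] = bothFree
                    ? (byte) (depth - 1) // both buddies free: the parent is free again
                    : (byte) Math.min(memoryMap[id], memoryMap[id ^ 1]);
            id = parent;
        }
    }

    public static void main(String[] args) {
        MemoryMapSketch chunk = new MemoryMapSketch(3);
        System.out.println("allocated node #" + chunk.allocate(1));
        System.out.println(java.util.Arrays.toString(chunk.memoryMap));
    }
}
```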

  12. Chunk: a sub-page
    ▸Acts as a slab allocator of ‘tiny’ or ‘small’ buffers
    ▸ … to reduce internal fragmentation
    ▸ Tiny – 32, 64, 96, 128, … 512 bytes
    ▸ Small – 1024, 2048, 4096 bytes
    ▸Uses a leaf node of a chunk – i.e. an 8192-byte block
    ▸Arena keeps the linked lists of PoolSubpages
    ▸ … so that it doesn’t need to traverse deep to the leaf level.
    ▸ Each size has its own linked list and lock
    ▸ 16 for tiny and 3 for small
    PoolSubpage
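
A slab allocator over a single page can be sketched with a bitmap of equally sized slots. The class below is a hypothetical, much reduced stand-in for a sub-page: it uses `java.util.BitSet` for brevity and assumes the 8192-byte page mentioned above.

```java
import java.util.BitSet;

final class SubpageSketch {
    static final int PAGE_SIZE = 8192;
    private final int elemSize;  // every slot in this page has the same size
    private final int slots;
    private final BitSet used;   // bit i is set while slot i is allocated

    SubpageSketch(int elemSize) {
        this.elemSize = elemSize;
        this.slots = PAGE_SIZE / elemSize;
        this.used = new BitSet(slots);
    }

    /** Returns the byte offset of a free slot within the page, or -1 if the page is full. */
    int allocate() {
        int slot = used.nextClearBit(0);
        if (slot >= slots) {
            return -1;
        }
        used.set(slot);
        return slot * elemSize;
    }

    /** Frees the slot that starts at the given byte offset. */
    void free(int offset) {
        used.clear(offset / elemSize);
    }

    public static void main(String[] args) {
        SubpageSketch page = new SubpageSketch(1024);
        System.out.println("first slot at offset " + page.allocate());
    }
}
```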

  13. Granular locking revisited
    ▸(No lock) If possible, get from the thread local cache.
    ▸If tiny or small[1]:
    ▸ (Sub-page lock[2]) If possible, get from the sub-page list.
    ▸ (Chunk lock) Otherwise, allocate a sub-page in the leaf.
    ▸(Chunk lock) Perform buddy allocation.
    [1] 0-4096 bytes
    [2] Each size has its own sub-page list and lock. We have 16 tiny sizes and 3 small sizes by default,
    so each arena has 19 (16 + 3) fine-grained locks for sub-page allocations, yielding less chance of contention.
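
To make the "19 fine-grained locks" concrete, the sketch below maps a request size to the index of the lock it would contend on. It follows the size classes as listed in this deck (16 tiny sizes up to 512 bytes, then 1024, 2048 and 4096); the class, the lock array, and the index layout are assumptions for illustration only.

```java
final class SubpageLockSketch {
    static final int TINY_CLASSES = 16;  // 32, 64, ..., 512 bytes (per the deck)
    static final int SMALL_CLASSES = 3;  // 1024, 2048, 4096 bytes (per the deck)
    static final Object[] LOCKS = new Object[TINY_CLASSES + SMALL_CLASSES];

    static {
        for (int i = 0; i < LOCKS.length; i++) {
            LOCKS[i] = new Object();
        }
    }

    /** Index of the sub-page lock for a request, or -1 if it is not tiny or small. */
    static int lockIndex(int size) {
        if (size <= 0 || size > 4096) {
            return -1;
        }
        if (size <= 512) {
            return (size + 31) / 32 - 1;                  // round up to a multiple of 32
        }
        int pow2 = Integer.highestOneBit(size - 1) << 1;  // round up to a power of two
        return TINY_CLASSES + Integer.numberOfTrailingZeros(pow2) - 10;
    }

    public static void main(String[] args) {
        int index = lockIndex(100);
        synchronized (LOCKS[index]) { // only requests of the same size class contend here
            System.out.println("100-byte request uses lock " + index);
        }
    }
}
```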

  14. Future work
    ▸Even more metrics
    ▸Adopt the improvements made in jemalloc 4

  15. Leak detector

  16. Reference counting vs. GC
    ▸Netty ByteBuf is reference-counted.
    ▸What happens if a ByteBuf is GC’d?
    ▸ All is well if ByteBuf.refCnt == 0
    ▸ Leak if ByteBuf.refCnt != 0
    because the allocator cannot reclaim it
    ▸Object.finalize() or Weak/PhantomReferences?
    ▸ Nope. Too slow or not timely enough.
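
For readers new to reference counting, here is a minimal stand-alone model of the contract. It is not Netty's ByteBuf; the class name and the exception type are assumptions. An object that becomes unreachable while its count is still above zero is what the slide calls a leak.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RefCountedSketch {
    private final AtomicInteger refCnt = new AtomicInteger(1); // the creator holds one reference

    int refCnt() {
        return refCnt.get();
    }

    /** Adds a reference; fails if the object has already been released. */
    RefCountedSketch retain() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return this;
            }
        }
    }

    /** Drops a reference; returns true when the count reaches 0 and memory can be reclaimed. */
    boolean release() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                throw new IllegalStateException("already released");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                return current == 1;
            }
        }
    }

    public static void main(String[] args) {
        RefCountedSketch buf = new RefCountedSketch().retain();
        buf.release();
        System.out.println("reclaimable: " + buf.release());
    }
}
```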

  17. Sampled leak detection
    ▸It’s fine if we can detect leaks from a canary.
    ▸ Instrument leak detection code for less than 1% of buffers.
    ▸ Run a canary for a sufficient amount of time.
    ▸ Most leaks are reported quite soon.
    ▸It’s fine if a user can control overhead
    ▸ Disabled
    ▸ Simple – 1% sample rate without access recording
    ▸ Advanced – 1% sample rate with access recording
    ▸ Paranoid – 100% sample rate with access recording
    ▸ System properties:
    ▸ -Dio.netty.leakDetection.level=advanced
    ▸ -Dio.netty.leakDetection.maxRecords=4
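
The four levels can be read as a sampling decision made per allocated buffer. The sketch below is a hypothetical model: the sampling interval of 100 is an assumption standing in for "about 1%", and falling back to SIMPLE for unknown values is also an assumption. Only the system property name comes from the slide.

```java
import java.util.Locale;
import java.util.concurrent.ThreadLocalRandom;

final class LeakSamplingSketch {
    enum Level { DISABLED, SIMPLE, ADVANCED, PARANOID }

    static final int SAMPLING_INTERVAL = 100; // assumed: roughly a 1% sample rate

    /** Parses a level name; unknown values fall back to SIMPLE (an assumption). */
    static Level parse(String value) {
        try {
            return Level.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return Level.SIMPLE;
        }
    }

    /** Decides whether a newly allocated buffer gets a leak tracker, given a random draw. */
    static boolean shouldTrack(Level level, int draw) {
        switch (level) {
            case DISABLED:
                return false;
            case PARANOID:
                return true;                          // 100% sample rate
            default:
                return draw % SAMPLING_INTERVAL == 0; // SIMPLE and ADVANCED
        }
    }

    /** Access recording is only done at ADVANCED and above. */
    static boolean recordsAccess(Level level) {
        return level.ordinal() >= Level.ADVANCED.ordinal();
    }

    public static void main(String[] args) {
        Level level = parse(System.getProperty("io.netty.leakDetection.level", "simple"));
        int draw = ThreadLocalRandom.current().nextInt(SAMPLING_INTERVAL);
        System.out.println(level + " tracks this buffer: " + shouldTrack(level, draw));
    }
}
```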

  18. Sampled leak detection (cont’d)
    ▸ResourceLeakDetector
    ▸ Maintains a ReferenceQueue
    public interface ResourceLeakTracker<T> {
        // Records an access (when level >= ADVANCED)
        void record();
        // Records an access with a hint
        void record(Object hint);
        // Closes the leak when refCnt == 0 (i.e. released correctly)
        boolean close(T trackedObject);
    }
    class DefaultResourceLeak<T> extends PhantomReference<Object>
            implements ResourceLeakTracker<T>
    ▸Allocator returns a wrapper of a sampled buffer:
    ▸ SimpleLeakAwareByteBuf
    ▸ AdvancedLeakAwareByteBuf
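
The interplay of a ReferenceQueue and a tracker that extends PhantomReference can be sketched with the JDK alone. All names below are hypothetical and the model is far simpler than ResourceLeakDetector: a tracker that is still open when its referent is collected is counted as a leak. The demo calls `enqueue()` by hand to stand in for the garbage collector, which would otherwise enqueue the reference at an unpredictable time.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class LeakDetectorSketch {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Set<Tracker> open = ConcurrentHashMap.newKeySet();

    /** Enqueued by the GC once the tracked object becomes unreachable. */
    static final class Tracker extends PhantomReference<Object> {
        private final Set<Tracker> open;
        final String createdAt;

        Tracker(Object referent, ReferenceQueue<Object> queue, Set<Tracker> open, String createdAt) {
            super(referent, queue);
            this.open = open;
            this.createdAt = createdAt;
            open.add(this);
        }

        /** Called when the resource is released correctly; true only on the first call. */
        boolean close() {
            return open.remove(this);
        }
    }

    Tracker track(Object resource, String createdAt) {
        return new Tracker(resource, queue, open, createdAt);
    }

    /** Drains the queue; every tracker that was never closed is reported as a leak. */
    int reportLeaks() {
        int leaks = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Tracker tracker = (Tracker) ref;
            if (tracker.close()) {
                leaks++;
                System.err.println("LEAK: resource created at " + tracker.createdAt);
            }
        }
        return leaks;
    }

    public static void main(String[] args) {
        LeakDetectorSketch detector = new LeakDetectorSketch();
        Tracker tracker = detector.track(new byte[16], "main()");
        tracker.enqueue(); // simulate the GC collecting the buffer without a release
        System.out.println("leaks: " + detector.reportLeaks());
    }
}
```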

  19. Sampled leak detection (cont’d)
    12:05:24.374 [nioEventLoop-1-1] ERROR io.netty.util.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected.
    Recent access records: 2
    #2:
    Hint: 'EchoServerHandler#0' will handle the message from this point.
    io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:329)
    io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:133)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
    java.lang.Thread.run(Thread.java:744)
    #1:
    io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:589)
    io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:125)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
    java.lang.Thread.run(Thread.java:744)
    Created at:
    io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:55)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
    io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
    io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
    io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
    io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:794)
    java.lang.Thread.run(Thread.java:744)

  20. Thank you
