Netty Internals - Optimizations everywhere

Slide 1

Slide 1 text

NETTY INTERNALS OPTIMIZATIONS EVERYWHERE

Slide 2

Slide 2 text

ABOUT ME
NORMAN MAURER
▸ Netty Project Lead
▸ Java Champion
▸ Cassandra MVP
▸ Apache Software Foundation
▸ Working on large-scale Network Services / Frameworks

Slide 3

Slide 3 text

WHAT IS NETTY
THE ASYNCHRONOUS NETWORK FRAMEWORK FOR THE JVM
▸ General-purpose network framework
▸ Low-level
▸ Tries to hide many optimisations from the end user so they do not need to care about all of them
▸ “easy to use”
Optimize all the things

Slide 4

Slide 4 text

So which optimizations are important and included?!?

Slide 5

Slide 5 text

BUFFER POOLING

Slide 6

Slide 6 text

BUFFER POOLING VS NOT POOLING
[Chart: allocation time in nanoseconds (0 to 6000) against buffer size in bytes (0 to 65536) for Unpooled Heap, Pooled Heap, Unpooled Direct and Pooled Direct buffers]

Slide 7

Slide 7 text

NETTY'S POOLEDBYTEBUFALLOCATOR IN DETAIL
▸ Based on jemalloc paper
▸ ThreadLocal caches for lock-free allocations
▸ Locking per Arena still needed
▸ Size classes to serve different allocations
[Diagram: Thread 1 with ThreadLocal Cache 1, Thread 2 with ThreadLocal Cache 2; Arena 1, Arena 2 and Arena 3, each with its own size classes]
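The interplay of thread-local caches, a locked arena and size classes can be illustrated with a small sketch. This is not Netty's implementation: the class SketchPool, its single arena, the power-of-two size classes and the limits MIN_SHIFT, CLASSES and MAX_CACHED are all assumptions made for illustration, and it uses only JDK classes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

/** Hypothetical sketch: a per-thread cache in front of one shared, locked arena. */
final class SketchPool {
    private static final int MIN_SHIFT = 4;   // smallest size class: 16 bytes (assumed)
    private static final int CLASSES = 12;    // classes from 16 B up to 32 KiB (assumed)
    private static final int MAX_CACHED = 8;  // buffers kept per thread and class (assumed)

    // Shared arena: one free list per size class, guarded by a lock.
    private final ArrayDeque<ByteBuffer>[] arena = newLists();

    // Thread-local cache: same layout, but only touched by its owner, so no lock.
    private final ThreadLocal<ArrayDeque<ByteBuffer>[]> cache =
            ThreadLocal.withInitial(SketchPool::newLists);

    @SuppressWarnings("unchecked")
    private static ArrayDeque<ByteBuffer>[] newLists() {
        ArrayDeque<ByteBuffer>[] lists = new ArrayDeque[CLASSES];
        for (int i = 0; i < CLASSES; i++) {
            lists[i] = new ArrayDeque<>();
        }
        return lists;
    }

    /** Index of the smallest power-of-two size class that fits the request. */
    static int sizeClass(int size) {
        int normalized = Math.max(size, 1 << MIN_SHIFT);
        int shift = 32 - Integer.numberOfLeadingZeros(normalized - 1); // ceil(log2)
        return shift - MIN_SHIFT;
    }

    ByteBuffer allocate(int size) {
        int idx = sizeClass(size);
        if (idx >= CLASSES) {
            return ByteBuffer.allocateDirect(size);       // too large to pool
        }
        ByteBuffer buf = cache.get()[idx].pollFirst();    // fast path: no lock
        if (buf == null) {
            synchronized (arena) {                        // slow path: arena lock
                buf = arena[idx].pollFirst();
            }
        }
        if (buf == null) {
            buf = ByteBuffer.allocateDirect(1 << (idx + MIN_SHIFT));
        }
        buf.clear();
        return buf;
    }

    void release(ByteBuffer buf) {
        int idx = sizeClass(buf.capacity());
        if (idx >= CLASSES || buf.capacity() != 1 << (idx + MIN_SHIFT)) {
            return;                                       // unpooled buffer: leave it to the GC
        }
        ArrayDeque<ByteBuffer> local = cache.get()[idx];
        if (local.size() < MAX_CACHED) {
            local.addFirst(buf);                          // keep it close to this thread
        } else {
            synchronized (arena) {
                arena[idx].addFirst(buf);                 // overflow goes back to the arena
            }
        }
    }
}
```

A buffer released by a thread is normally handed straight back to the same thread on its next allocation of that size class, which is why the common case needs no locking.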

Slide 8

Slide 8 text

WHY IS BUFFER POOLING SO IMPORTANT?
▸ Allocating and deallocating direct buffers is expensive
▸ Memory fragmentation
▸ Applications often have a “constant” allocation pattern

Slide 9

Slide 9 text

ALLOCATIONS

Slide 10

Slide 10 text

ALLOCATIONS OF BYTEBUFFER
▸ Allocating a ByteBuffer is generally expensive because the storage is zeroed out
▸ This is true for direct buffers but also for heap buffers (and in general for byte[] allocations).

Slide 11

Slide 11 text

WHAT CAN WE DO ABOUT IT ? ALLOCATIONS OF BYTEBUFFER

Slide 12

Slide 12 text

SPEED UP DIRECT MEMORY ALLOCATIONS!
SUN.MISC.UNSAFE TO THE RESCUE
▸ Use JNI to allocate the direct ByteBuffer
▸ Unfortunately too slow, as calling JNI is “expensive”
▸ Use Unsafe to allocate direct memory and use reflection to create a ByteBuffer from the memory
▸ Works very well but needs Unsafe and reflection (which breaks on Java 9+)
▸ Direct memory must be released explicitly, as the GC will not take care of it!
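The allocate-and-explicitly-free part of this approach can be sketched with sun.misc.Unsafe alone. The class OffHeapSketch and its methods are hypothetical names. The step of wrapping the address in a ByteBuffer via reflection is deliberately left out because it depends on JDK-internal constructors. Recent JDKs also deprecate these Unsafe memory methods, so treat this as an illustration under those assumptions rather than a recommended pattern.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

/** Hypothetical sketch: raw off-heap memory that must be freed by hand. */
final class OffHeapSketch {
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            // The singleton is only reachable through its private static field.
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    /** Allocates uninitialized (not zeroed) off-heap memory and returns its address. */
    static long allocate(long bytes) {
        return UNSAFE.allocateMemory(bytes);
    }

    /** Releases memory obtained from allocate(); the GC never does this for us. */
    static void free(long address) {
        UNSAFE.freeMemory(address);
    }

    /** Writes the bytes 0..count-1, reads them back and returns their sum. */
    static int roundTrip(int count) {
        long address = allocate(count);
        try {
            for (int i = 0; i < count; i++) {
                UNSAFE.putByte(address + i, (byte) i);
            }
            int sum = 0;
            for (int i = 0; i < count; i++) {
                sum += UNSAFE.getByte(address + i);
            }
            return sum;
        } finally {
            free(address); // explicit release, even if something above throws
        }
    }
}
```

The try/finally mirrors the last bullet: every allocation needs a matching explicit release, otherwise the memory leaks.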

Slide 13

Slide 13 text

DIRECT ALLOCATION BENCHMARKS
[Chart: allocations / deallocations per second (0 to 6000000) for 1024-byte and 8192-byte buffers, comparing Unsafe / Reflection with ByteBuffer.allocateDirect]

Slide 14

Slide 14 text

SPEED UP HEAP MEMORY ALLOCATIONS!
JDK.INTERNAL.MISC.UNSAFE TO THE RESCUE
▸ Use jdk.internal.misc.Unsafe to allocate an uninitialized byte[] and use ByteBuffer.wrap(…) to create a heap-based ByteBuffer.
▸ Only works on Java 9+ and needs to be “allowed” with a JVM startup argument (--add-opens java.base/jdk.internal.misc=ALL-UNNAMED)

Slide 15

Slide 15 text

BYTE[] ALLOCATION BENCHMARK
[Chart: nanoseconds per allocation (0 to 12000) for arrays of 100, 1000, 10000 and 100000 bytes, comparing new byte[...] with Unsafe.allocateUninitializedArray(...)]

Slide 16

Slide 16 text

FASTTHREADLOCAL

Slide 17

Slide 17 text

WHY BUILD OUR OWN THREADLOCAL?
▸ ThreadLocal works in general, but sometimes we can do better
▸ Netty has tight control over the Threads that are used by the EventLoop
▸ Use our own Thread subclass that allows us to store “local” data within an array for faster access
▸ The JDK ThreadLocal offers no good way to ensure cleanup is done when a ThreadLocal for a thread is destroyed
▸ JDK ThreadLocal uses a hash map (ThreadLocalMap) internally

BENCHMARK        MODE   CNT  SCORE       ERROR       UNITS
FASTTHREADLOCAL  THRPT  20   115409.952  ± 3511.358  OPS/S
JDKTHREADLOCAL   THRPT  20    70654.729  ±  325.362  OPS/S

Slide 18

Slide 18 text

HOW TO USE FASTTHREADLOCAL?
▸ Ensure you use the DefaultThreadFactory provided by Netty or use FastThreadLocalThread
▸ Replace ThreadLocal with FastThreadLocal
▸ Win!
▸ Gotchas? There are always some :(
  ▸ Using FastThreadLocal from a thread that is not a FastThreadLocalThread costs about 20% in performance
Doh! Why are there always gotchas!?!?
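The array-slot idea from the previous slide, together with a slower fallback for ordinary threads, can be sketched with JDK classes only. The names SlotThread, SlotLocal and SlotLocalDemo are hypothetical stand-ins. This is not Netty's FastThreadLocal code, and the fallback through a plain ThreadLocal is an assumption chosen to show why a thread of the wrong type pays extra.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical Thread subclass that carries its "local" values in a plain array. */
class SlotThread extends Thread {
    Object[] slots = new Object[8];

    SlotThread(Runnable task) {
        super(task);
    }
}

/** Hypothetical thread-local whose value lives at a fixed array index. */
final class SlotLocal<T> {
    private static final AtomicInteger NEXT_INDEX = new AtomicInteger();

    // Fallback storage for threads that are not SlotThreads (the slower path).
    private static final ThreadLocal<Object[]> FALLBACK =
            ThreadLocal.withInitial(() -> new Object[8]);

    private final int index = NEXT_INDEX.getAndIncrement(); // unique slot per instance
    private final Supplier<T> initial;

    SlotLocal(Supplier<T> initial) {
        this.initial = initial;
    }

    @SuppressWarnings("unchecked")
    T get() {
        Object[] slots = slots();
        Object value = slots[index];
        if (value == null) {          // sketch only: null doubles as "not set yet"
            value = initial.get();
            slots[index] = value;
        }
        return (T) value;
    }

    void set(T value) {
        slots()[index] = value;
    }

    private Object[] slots() {
        Thread current = Thread.currentThread();
        if (current instanceof SlotThread) {
            // Fast path: a field read plus an array index, no hashing.
            SlotThread thread = (SlotThread) current;
            if (index >= thread.slots.length) {
                thread.slots = Arrays.copyOf(thread.slots,
                        Math.max(index + 1, thread.slots.length * 2));
            }
            return thread.slots;
        }
        // Slow path: one JDK ThreadLocal lookup before the array access.
        Object[] slots = FALLBACK.get();
        if (index >= slots.length) {
            slots = Arrays.copyOf(slots, Math.max(index + 1, slots.length * 2));
            FALLBACK.set(slots);
        }
        return slots;
    }
}

final class SlotLocalDemo {
    /** Reads the local twice on a fresh SlotThread and returns both results. */
    static Object[] readTwiceOnSlotThread(SlotLocal<?> local) {
        Object[] seen = new Object[2];
        SlotThread thread = new SlotThread(() -> {
            seen[0] = local.get();
            seen[1] = local.get();
        });
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

In this sketch a SlotThread reaches its value with one array index, while any other thread first goes through a regular ThreadLocal, which is the kind of extra hop the gotcha above refers to.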

Slide 19

Slide 19 text

SSLENGINE

Slide 20

Slide 20 text

SSL IN JAVA IS SLOW :(
▸ The SSL implementation shipped with Java is slow :(
▸ It's getting better and better, though
▸ Most people will tell you that you should never terminate SSL in Java :/
Why can’t we have nice things?!?

Slide 21

Slide 21 text

SPEED UP SSL IN JAVA
▸ Use JNI to call into OpenSSL/BoringSSL/LibreSSL
▸ Done as an SSLEngine implementation (OpenSslEngine) that is part of Netty
▸ Just use it and don't worry about writing JNI at all ;)
▸ Supports more advanced features like SessionTickets, ALPN etc.

OPENSSLENGINE AVERAGE TIME PER OP (HIGHER == WORSE)
BENCHMARK  (BUFFERTYPE)  (CIPHER)                               (SSLPROVIDER)  MODE  CNT  SCORE      ERROR      UNITS
HANDSHAKE  HEAP          TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256  JDK            AVGT  20    9830.999  ± 150.306  US/OP
HANDSHAKE  HEAP          TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256  OPENSSL        AVGT  20    2578.001  ±  23.596  US/OP
HANDSHAKE  DIRECT        TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256  JDK            AVGT  20   10022.284  ± 221.849  US/OP
HANDSHAKE  DIRECT        TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256  OPENSSL        AVGT  20    2597.168  ±  49.033  US/OP

Slide 22

Slide 22 text

NATIVE SSL
▸ You can use the SSLEngine implementation provided by Netty even without Netty
▸ …but you really should just use Netty ;)
▸ Alternative: Use Conscrypt as the SSL provider.

Slide 23

Slide 23 text

GC PRESSURE

Slide 24

Slide 24 text

ALLOCATIONS / GC
▸ Allocations are not free, but collecting the objects is even worse
▸ The worst == finalize()
▸ Possible solutions:
  ▸ ThreadLocal (FastThreadLocal)
  ▸ Object-Pooling (Recycler)
  ▸ Just don’t allocate ?!?

Slide 25

Slide 25 text

ALLOCATIONS / GC
RECYCLER
▸ A “low-overhead” object pool
▸ Thread-safe
▸ Optimized for offer and poll from within the same Thread
▸ Because it's the most likely thing to happen within Netty
▸ Used in multiple places in Netty:
  ▸ ByteBuf instances
  ▸ Tasks that are scheduled from outside the EventLoop
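The "same thread" optimization can be illustrated with a much simpler pool than Netty's Recycler. SimpleRecycler and its maxPerThread parameter are hypothetical. The sketch keeps one stack of free objects per thread, so it never needs a lock, but unlike a full implementation it does not route an object released on another thread back to the thread that created it.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

/** Hypothetical, simplified object pool: one stack of free objects per thread. */
final class SimpleRecycler<T> {
    private final Supplier<T> factory;
    private final int maxPerThread;

    // Each thread only ever touches its own stack, so no synchronization is needed.
    private final ThreadLocal<ArrayDeque<T>> stack =
            ThreadLocal.withInitial(ArrayDeque::new);

    SimpleRecycler(Supplier<T> factory, int maxPerThread) {
        this.factory = factory;
        this.maxPerThread = maxPerThread;
    }

    /** Returns a pooled object if this thread has one, otherwise a new one. */
    T get() {
        T pooled = stack.get().pollFirst();
        return pooled != null ? pooled : factory.get();
    }

    /**
     * Hands an object back to the calling thread's stack.
     * Returns false when the stack is full and the object is left to the GC.
     */
    boolean recycle(T object) {
        ArrayDeque<T> local = stack.get();
        if (local.size() >= maxPerThread) {
            return false;
        }
        local.addFirst(object);
        return true;
    }
}
```

Callers are expected to reset an object's state before or after recycling it; the sketch leaves that to the caller to keep the pool itself minimal.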

Slide 26

Slide 26 text

MEMORY OVERHEAD

Slide 27

Slide 27 text

MEMORY OVERHEAD
▸ Unfortunately Java is very memory heavy :(
▸ Object header overhead
▸ Everything (besides primitive types) is stored as a reference
▸ Possible workarounds:
  ▸ Extend objects when possible (for internal classes)
  ▸ Use alternatives that operate on primitives
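One possible reading of the two workarounds, sketched with JDK classes only: extending a class instead of holding a reference to it saves one object header and one reference, and a primitive-backed collection avoids one boxed object per element. The class names below are hypothetical and are not taken from Netty.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/** Composition: two objects, two headers, plus a reference between them. */
final class CounterByComposition {
    private final AtomicInteger count = new AtomicInteger();

    int increment() {
        return count.incrementAndGet();
    }
}

/** Extension: the counter state lives inside this single object. */
final class CounterByExtension extends AtomicInteger {
    int increment() {
        return incrementAndGet();
    }
}

/** Growable list of primitive ints; no Integer object is created per element. */
final class IntList {
    private int[] values = new int[8];
    private int size;

    void add(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2); // double the backing array
        }
        values[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    int size() {
        return size;
    }
}
```

The trade-off of the extension variant is that the superclass API leaks into the subclass, which is why the slide limits it to internal classes.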

Slide 28

Slide 28 text

Q & A