
Going Off Heap

αλεx π
January 19, 2015


The number of machines, users, and handled connections is rising every year. Processors get faster, memory grows larger, and it gets easier to build distributed systems. Despite all these major improvements, we still struggle with performance problems. Fortunately, there are many approaches and techniques for solving them.
In this talk, we’re going to cover the major principles of maximising single-box performance: building systems that yield predictable results on the machine, reduce Garbage Collection pressure, utilise the CPU better, and lend themselves to lock-free programs and data structures.
We’ll look at Compare-and-Swap (CAS), off-heap buffers, cache-line alignment, lightweight data structures, object pooling, linear-memory data structures, and many more techniques used in modern Java application development.

Licensed under Creative Commons Attribution-NonCommercial 3.0: http://creativecommons.org/licenses/by-nc/3.0/

When using, quoting, or rephrasing this work, in whole or in part, explicit attribution is mandatory and required. Any commercial usage by anyone except the author is prohibited.


Transcript

  1. CAS

  2. CAS
     • Without CAS, you need locks to ensure atomic changes
     • lock-free swapping/referencing
     • available via java.util.concurrent.atomic.Atomic*
     • sun.misc.Unsafe/compareAndSwap*
     • used in Java lock implementations
     • doesn’t mean that there will always be a conflict
     • but when there is a conflict, you know it’s resolved

  3. CAS
     • Choose between:
       • how many threads are waiting on the lock
       • how complicated it is to implement locking “right”
       • how error-prone locking may potentially be
     • And:
       • how many potential collisions you may have

  4. CAS
     • Read the (primitive) value
     • Save the old value for a future check
     • Perform the modification on the value
     • CAS will write successfully only if the referenced value is unchanged
     • If it was changed, retry
     • If it wasn’t, the new value will be held by the reference

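     A minimal sketch of that retry loop in Java, using AtomicLong (the class and method names here are illustrative, not from the slides):

       import java.util.concurrent.atomic.AtomicLong;

       public final class CasLoop {
           private final AtomicLong value = new AtomicLong();

           // Classic CAS retry loop: read, remember the old value, compute,
           // and attempt the swap; on a conflict, retry with a fresh read.
           long increment() {
               while (true) {
                   long old = value.get();               // read the value
                   long next = old + 1;                  // perform the modification
                   if (value.compareAndSet(old, next)) {
                       return next;                      // written only because `old` was unchanged
                   }
                   // another thread changed the value in between: retry
               }
           }
       }
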
  5. pooling: why?
     • eliminates the cost of object initialisation
     • actively used for, e.g., JDBC connections (small object counts)
     • here, though, we’re looking at rapidly/constantly allocated short-lived objects

  6. pooling: fixed size
     • + predictable size (thanks, cap)
     • + pre-allocated in one run
     • + when an object is available, it’s returned immediately
     • + no further allocations required
     • - easy to run into resource-starvation problems
     • - always has to allocate more than is “usually” used
     • - you have to know how many you actually need

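     A minimal fixed-size pool sketch backed by ArrayBlockingQueue (class and method names are illustrative):

       import java.util.concurrent.ArrayBlockingQueue;
       import java.util.concurrent.BlockingQueue;
       import java.util.function.Supplier;

       public final class FixedPool<T> {
           private final BlockingQueue<T> free;

           // Pre-allocate all `capacity` objects in one run; no further allocations.
           public FixedPool(int capacity, Supplier<T> factory) {
               this.free = new ArrayBlockingQueue<>(capacity);
               for (int i = 0; i < capacity; i++) {
                   free.add(factory.get());
               }
           }

           // Blocks when the pool is empty: this is where starvation can bite.
           public T borrow() throws InterruptedException {
               return free.take();
           }

           public void giveBack(T object) {
               free.offer(object);
           }
       }
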
  7. pooling: single-step growth
     • allocated with initial N objects
     • when the pool is empty, each lease results in an allocation
     • good when the initial size is “most likely” correct
     • size may be underestimated
     • if the pool size is unbounded, it may blow everything up
     • single-entry allocations are fast
     • will always carry the “high watermark” number of entries

  8. pooling: block growth
     • allocated with initial N objects
     • good for a “lightweight” startup
     • or when the pool size is unclear
     • when the pool is empty, the pool is grown by M entries
     • slower growth step, but no overhead during lease
     • balances latency against memory overhead
     • could be grown by exponential steps (for pessimists)

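     A sketch of the block-growth lease path (sizes and names are illustrative; synchronisation kept deliberately simple):

       import java.util.ArrayDeque;
       import java.util.Deque;
       import java.util.function.Supplier;

       public final class BlockGrowthPool<T> {
           private final Deque<T> free = new ArrayDeque<>();
           private final Supplier<T> factory;
           private final int growthStep;

           public BlockGrowthPool(int initialSize, int growthStep, Supplier<T> factory) {
               this.factory = factory;
               this.growthStep = growthStep;
               grow(initialSize); // lightweight startup: allocate only the initial block
           }

           // When the pool runs dry, grow by M entries in one slower step,
           // so subsequent leases are again allocation-free.
           public synchronized T borrow() {
               if (free.isEmpty()) {
                   grow(growthStep);
               }
               return free.pop();
           }

           public synchronized void giveBack(T object) {
               free.push(object);
           }

           private void grow(int n) {
               for (int i = 0; i < n; i++) {
                   free.push(factory.get());
               }
           }
       }
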
  9. pooling: choosing the right one
     • analyse how the objects are allocated
     • are there “sudden” large usage/allocation changes?
     • or does everything happen “gradually”?
     • if you reach the limit, will it only get worse?

  10. recycle strategies: reference counting
     • plays very well together with lambdas, e.g.:

         (pooledObject, pooledObjectConsumer) -> {
             pooledObject.retain();
             pooledObjectConsumer.accept(pooledObject);
             pooledObject.release();
         };

     • each block that receives an object increments the counter
     • whenever the flow leaves the block, the counter is decremented
     • when the counter reaches 0, the object returns to the pool

  11. recycle strategies: reference counting
     • good for:
       • multiple / concurrent consumers
       • pipelining / nested processing
     • pitfalls:
       • premature release
       • double increments/decrements
       • change of the object in place (if not immutable)
       • “implicit” object recycling

  12. recycle strategies: reference counting
     • pitfalls:
       • the pooled object has to be pool-aware (or wrapped)
       • wrapped = memory overhead
       • hard to track “who forgot to decrement”

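     A minimal sketch of a pool-aware, reference-counted wrapper (all names are illustrative; Netty’s ReferenceCounted interface is a production example of this pattern):

       import java.util.concurrent.atomic.AtomicInteger;
       import java.util.function.Consumer;

       public final class Pooled<T> {
           private final T value;
           private final Consumer<Pooled<T>> returnToPool; // invoked when the count hits 0
           private final AtomicInteger refCount = new AtomicInteger(1);

           public Pooled(T value, Consumer<Pooled<T>> returnToPool) {
               this.value = value;
               this.returnToPool = returnToPool;
           }

           public T get() {
               return value;
           }

           // Each block that receives the object increments the counter...
           public Pooled<T> retain() {
               refCount.incrementAndGet();
               return this;
           }

           // ...and decrements it on the way out; at 0 the object goes back to the pool.
           public void release() {
               if (refCount.decrementAndGet() == 0) {
                   returnToPool.accept(this);
               }
           }
       }
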
  13. recycle strategies: borrow/return
     • the object is “borrowed” from the pool (and re-init’ed)
     • when the object is no longer needed, you return it to the pool

  14. recycle strategies: borrow/return
     • good for:
       • a single consumer
       • a clear usage scope (block)
     • the object is pool-oblivious (real science term, srsly)
     • pitfalls:
       • premature release (always, with pooling)
       • double `borrow`

  15. recycle strategies: general pitfalls
     • make scopes very explicit
     • avoid complex conditionals; return unconditionally
     • always `finally` return the object (see the sketch below)
     • seriously, never forget to return the object
     • pool side: make sure never to hand an object out twice
     • “lost” objects (borrowed and never returned)

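     The `finally` discipline in code, reusing the FixedPool sketch from earlier (the buffer work is a stand-in):

       static void withPooledBuffer(FixedPool<StringBuilder> pool) throws InterruptedException {
           StringBuilder buffer = pool.borrow();
           try {
               buffer.setLength(0);       // re-initialise before use
               buffer.append("work");     // stand-in for real work
           } finally {
               pool.giveBack(buffer);     // returns on every path, including exceptions
           }
       }
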
  16. recycle strategies: general pitfalls
     • re-initialisation is hard to “get right”
     • beware of reference values “within” pooled objects
     • when resetting, don’t leave “garbage” behind

  17. false sharing

              CPU 1                      CPU 2
     +--------------------+    +--------------------+
     |                    |    |                    |
     +--------------------+    +--------------------+
         Cache Line XX             Cache Line YY
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+
     |XX|XX|XX|YY|YY|YY|YY|    |XX|XX|XX|YY|YY|YY|YY|
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+

                          RAM
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+
     |  |  |  |XX|XX|XX|YY|YY|YY|YY|  |  |  |  |  |
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+

  18. false sharing

              CPU 1                      CPU 2
     +--------------------+    +--------------------+
     |                    |    |                    |
     +--------------------+    +--------------------+
         Cache Line XX             Cache Line YY
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+
     |XX|XX|XX|YY|YY|YY|YY|    |XX|XX|XX|YY|YY|YY|YY|
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+

                          RAM
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+
     |  |  |  |XX|XX|XX|ZZ|YY|YY|YY|  |  |  |  |  |
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+

     (one YY slot is updated to ZZ: both cached copies of the line are now stale)

  19. false sharing

              CPU 1                      CPU 2
     +--------------------+    +--------------------+
     |                    |    |                    |
     +--------------------+    +--------------------+
         Cache Line XX             Cache Line YY
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+
     |XX|XX|XX|ZZ|YY|YY|YY|    |XX|XX|XX|ZZ|YY|YY|YY|
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+

                          RAM
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+
     |  |  |  |XX|XX|XX|ZZ|YY|YY|YY|  |  |  |  |  |
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+

     (both CPUs must refresh the whole line, even though CPU 1 only cares about XX)

  20. CPU cache alignment
     • avoid cache misses with padding
     • avoid cache-line contention by “spreading” data
     • false sharing:
       • needless cache updates when nothing (you care about) got changed
       • fixed by padding structures

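     A manual-padding sketch (assuming 64-byte cache lines; field names are arbitrary). Caveat: the JVM may reorder or strip unused fields, which is why production code uses subclassing tricks or the JDK-internal @Contended annotation instead:

       // Two counters updated by different threads. Without the padding they could
       // land on the same cache line and falsely share it; 56 bytes on each side
       // of the hot long push neighbours onto separate lines.
       public final class PaddedCounters {
           static final class PaddedLong {
               long p1, p2, p3, p4, p5, p6, p7; // padding before the hot field
               volatile long value;
               long q1, q2, q3, q4, q5, q6, q7; // padding after it as well
           }

           final PaddedLong producerCount = new PaddedLong();
           final PaddedLong consumerCount = new PaddedLong();
       }
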
  21. waiting strategies: yielding wait
     • + quite precise
     • + scheduler-friendly
     • - still spins quite hard

  22. waiting strategies: park/sleep
     • - very imprecise
     • - prone to timer drift
     • + very scheduler-friendly

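     The two strategies side by side, as a sketch (method names are illustrative):

       import java.util.concurrent.locks.LockSupport;
       import java.util.function.BooleanSupplier;

       public final class Waits {
           // Yielding wait: quite precise, but still spins hard on a free CPU.
           static void yieldingWait(BooleanSupplier condition) {
               while (!condition.getAsBoolean()) {
                   Thread.yield(); // give the scheduler a chance between probes
               }
           }

           // Park-based wait: very scheduler-friendly, but wake-up timing is imprecise.
           static void parkingWait(BooleanSupplier condition) {
               while (!condition.getAsBoolean()) {
                   LockSupport.parkNanos(1_000_000L); // ~1 ms; the actual sleep may drift
               }
           }
       }
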
  23. why off-heap?
     • not affected by GC
     • zero-copy IPC / networking
     • great for large datasets
     • compact data representation*

     * under certain conditions

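     In Java, the usual entry point to off-heap memory is a direct ByteBuffer; a minimal sketch:

       import java.nio.ByteBuffer;
       import java.nio.ByteOrder;

       public final class OffHeapExample {
           public static void main(String[] args) {
               // Allocated outside the Java heap: the contents are invisible to the GC
               // (only the small ByteBuffer wrapper object lives on-heap).
               ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024 * 1024)
                                             .order(ByteOrder.nativeOrder());

               buffer.putLong(0, 42L);                // absolute write at offset 0
               System.out.println(buffer.getLong(0)); // absolute read: 42
           }
       }
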
  24. data representation: primitives
     • byte / short / int32 / long
     • float (32-bit floating point number)
     • double (64-bit floating point number)
     • boolean (essentially a flag byte)

  25. data representation: pascal-style strings

       0            4                         X
       +------------+-------------------------+
       | length X   | content                 |
       | (int)      | (string)                |
       +------------+-------------------------+

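     A sketch of reading and writing this layout over a ByteBuffer (names illustrative):

       import java.nio.ByteBuffer;
       import java.nio.charset.StandardCharsets;

       public final class PascalStrings {
           // Writes: [int length][`length` bytes of UTF-8 content]
           static void writeString(ByteBuffer buffer, String value) {
               byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
               buffer.putInt(bytes.length); // length prefix first...
               buffer.put(bytes);           // ...then the content
           }

           // Reads the length prefix, then exactly that many content bytes.
           static String readString(ByteBuffer buffer) {
               int length = buffer.getInt();
               byte[] bytes = new byte[length];
               buffer.get(bytes);
               return new String(bytes, StandardCharsets.UTF_8);
           }
       }
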
  26. data representation: flexible structures

       0            1                         X
       +------------+-------------------------+
       | field type | field content           |
       | (byte)     |                         |
       +------------+-------------------------+

     • the field type determines how many following bytes should be consumed

  27. data representation: flexible structures
     • the field content can essentially have the same structure itself:

       0            1                                           X
       +------------+-------------------------------------------+
       |            |  1                                        |
       | field type |  +------------+---------------------------+
       | (byte)     |  | field type | field content             |
       |            |  +------------+---------------------------+
       +------------+-------------------------------------------+

  28. consumption strategies: c-style (template)
     • the structure is fixed and the offsets of its fields are known:

       struct foo_s {
           char   name[10]; // offset 0,  length 10
           int    type;     // offset 10, length 4
           double min;      // offset 14, length 8
           double max;      // offset 22, length 8
       };

       0          10         14        22        30
       +----------+----------+---------+---------+
       |   name   |   type   |   min   |   max   |
       +----------+----------+---------+---------+

  29. consumption strategies: c-style (template)
     • array access:

       0          10         14        22        30
       +----------+----------+---------+---------+
       |   name   |   type   |   min   |   max   |
       +----------+----------+---------+---------+
       30         40         44        52        60   ...
       +----------+----------+---------+---------+
       | name[1]  | type[1]  | min[1]  | max[1]  |    ...
       +----------+----------+---------+---------+

     • essentially, index * sizeof(struct) + field offset

  30. consumption strategies: c-style (template)
     • + random access
     • + no need to decode / go through the entire buffer
     • + compact data representation
     • + no meta-information overhead
     • - fixed-size structures (no variable-length data)

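     A flyweight-style sketch of the template approach over a ByteBuffer, using the `foo_s` layout from the slides (the class name is illustrative; the 10-byte name region is left out for brevity):

       import java.nio.ByteBuffer;

       public final class FooFlyweight {
           static final int TYPE_OFFSET = 10;
           static final int MIN_OFFSET  = 14;
           static final int MAX_OFFSET  = 22;
           static final int SIZEOF      = 30;

           private final ByteBuffer buffer;
           private int base; // start of the currently addressed element

           public FooFlyweight(ByteBuffer buffer) {
               this.buffer = buffer;
           }

           // Random access: element i lives at i * sizeof(struct).
           public FooFlyweight at(int index) {
               this.base = index * SIZEOF;
               return this;
           }

           public int    type() { return buffer.getInt(base + TYPE_OFFSET); }
           public double min()  { return buffer.getDouble(base + MIN_OFFSET); }
           public double max()  { return buffer.getDouble(base + MAX_OFFSET); }

           public void type(int v)   { buffer.putInt(base + TYPE_OFFSET, v); }
           public void min(double v) { buffer.putDouble(base + MIN_OFFSET, v); }
           public void max(double v) { buffer.putDouble(base + MAX_OFFSET, v); }
       }
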
  31. consumption strategies: sequential reading
     • hold a pointer to the current reading position (e.g. 0)

       0
       +------------+ ....
       | field type | ....
       | (byte)     | ....
       +------------+ ....

  32. consumption strategies: sequential reading
     • identify the type of the following structure (e.g. string)

       0
       +---------------------+ ....
       | field type = string | ....
       +---------------------+ ....

  33. consumption strategies: sequential reading
     • read the following <int>, which holds the string size

       0
       +---------------------+------------------+ ....
       | field type = string | string size = 10 | ....
       +---------------------+------------------+ ....

  34. consumption strategies: sequential reading
     • read the string contents:

       +---------------------+------------------+-------------+ ....
       | field type = string | string size = 10 | 10b content | ....
       +---------------------+------------------+-------------+ ....

  35. consumption strategies: sequential reading
     • rinse and repeat

       .... +------------------+ ....
       .... | field type = ??? | ....
       .... +------------------+ ....

  36. consumption strategies: sequential reading
     • + flexible data
     • + nested structures possible
     • + easy to compose
     • - no way to get at the data without reading the meta-info
     • - metadata overhead

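     A sketch of the sequential decode loop over a ByteBuffer (the type tags are made up for illustration; a real format would pin them down in a spec):

       import java.nio.ByteBuffer;
       import java.nio.charset.StandardCharsets;

       public final class SequentialReader {
           static final byte TYPE_STRING = 0x01; // hypothetical tags
           static final byte TYPE_INT32  = 0x02;

           // The buffer's position acts as the pointer to the current reading position.
           static void readAll(ByteBuffer buffer) {
               while (buffer.hasRemaining()) {
                   byte fieldType = buffer.get(); // identify the following structure
                   switch (fieldType) {
                       case TYPE_STRING:
                           int size = buffer.getInt();  // read the size meta-info...
                           byte[] content = new byte[size];
                           buffer.get(content);         // ...then exactly that many bytes
                           System.out.println(new String(content, StandardCharsets.UTF_8));
                           break;
                       case TYPE_INT32:
                           System.out.println(buffer.getInt());
                           break;
                       default:
                           throw new IllegalStateException("unknown field type: " + fieldType);
                   }
                   // rinse and repeat until the buffer is exhausted
               }
           }
       }
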
  37. consumption strategies: truth is in the middle
     • in reality, you will most likely end up with a mix of both:

       +--------------------+--------------------+--------------------+----------+ ....
       | field type = array | element type = foo | element count = 10 | elements | ....
       +--------------------+--------------------+--------------------+----------+ ....

  38. consumption strategies: truth is in the middle
     • encode arrays as <type> + <size> + <sequentially-encoded-content>
     • same with hash maps
     • compute relative positions of keys and values
     • read/access by index

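     A sketch of that mixed encoding for an int array (the tag value is illustrative); fixed-size elements are what make by-index access possible:

       import java.nio.ByteBuffer;

       public final class ArrayEncoding {
           static final byte TYPE_INT_ARRAY = 0x03; // made-up tag, as before

           // <type> + <size> + <sequentially encoded content>
           static void writeIntArray(ByteBuffer buffer, int[] values) {
               buffer.put(TYPE_INT_ARRAY);
               buffer.putInt(values.length);
               for (int v : values) {
                   buffer.putInt(v);
               }
           }

           // Compute the relative position of element i instead of scanning the payload.
           static int readIntAt(ByteBuffer buffer, int arrayStart, int index) {
               int payloadStart = arrayStart + 1 /* type */ + 4 /* count */;
               return buffer.getInt(payloadStart + index * 4);
           }
       }
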
  39. consumption strategies: random tips
     • Encode your data types with constants, e.g.:

         0x00001 User
         0x00101 User.name
         0x00201 User.age
         0x00301 User.createdAt
         ...
         0x00002 Car
         0x00102 Car.color
         0x00202 Car.mileage
         ...

     • Use bitmasks for flags
     • Use strict comparison to determine the type

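     One way this might look in code (the type constants are copied from the slide; the flag bits are purely illustrative):

       public final class TypeConstants {
           // Type constants as on the slide.
           static final int USER      = 0x00001;
           static final int USER_NAME = 0x00101;
           static final int USER_AGE  = 0x00201;
           static final int CAR       = 0x00002;
           static final int CAR_COLOR = 0x00102;

           // Illustrative flag bits, combined into one int with bitwise OR.
           static final int FLAG_NULLABLE = 1 << 0;
           static final int FLAG_INDEXED  = 1 << 1;

           static boolean isIndexed(int flags) {
               return (flags & FLAG_INDEXED) != 0; // bitmask test for a flag
           }

           static boolean isUserName(int type) {
               return type == USER_NAME; // strict comparison to determine the type
           }
       }
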
  40. consumption strategies: random tips
     • Avoid sharing state in buffers (reader/writer indexes)
     • Use buffer “slices” to hand over parts of the data
     • Break buffers down into the most atomic pieces
     • Recycle / pool as much as you can; GC is often an enemy

  41. consumption strategies: random tips
     • Pipeline writes in one thread
     • Do not modify buffers while they’re being read (immutable)
     • Read as concurrently as you can / want / need
     • Stick to fixed-size structs as much as you can
     • Avoid re-mapping / defragmenting as much as you can
     • Keep a `free memory map` index
     • Grab the first free fitting memory segment

  42. bitmasks

     left shift of 1 by 20: 1L << 20

     0000 0000 0000 0000 0000 0000 0000 0001   =  1
     0000 0000 0001 0000 0000 0000 0000 0000   =  1L << 20  (1048576)

  43. bitmasks

     right shift of 100 by 5: 100L >> 5

     0000 0000 0000 0000 0000 0000 0110 0100   =  100
     0000 0000 0000 0000 0000 0000 0000 0011   =  100L >> 5  (3)

  44. bitmasks

     int i = 250;            // 0000 0000 0000 0000 0000 0000 1111 1010  (32-bit integer 250)

     `invert` the 20th bit:
     int j = i ^ (1 << 20);  // 0000 0000 0001 0000 0000 0000 1111 1010  (1048826)

     `clear` (toggle back) the 20th bit:
     j ^ (1 << 20);          // 0000 0000 0000 0000 0000 0000 1111 1010  (250)

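     The standard idioms, for reference; note that XOR only toggles, so a guaranteed clear uses AND-NOT:

       public final class Bits {
           static int set(int value, int bit)      { return value |  (1 << bit); }
           static int clear(int value, int bit)    { return value & ~(1 << bit); } // clears even if already 0
           static int toggle(int value, int bit)   { return value ^  (1 << bit); } // XOR: flips the bit
           static boolean isSet(int value, int bit) { return (value & (1 << bit)) != 0; }

           public static void main(String[] args) {
               int i = 250;
               int j = toggle(i, 20);             // 1048826, as on the slide
               System.out.println(j);
               System.out.println(toggle(j, 20)); // back to 250
           }
       }
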
  45. bitmasks
     • in combination with shifts, bitmasks can be used to create very large (yet atomic)
       bit sets, for example backed by AtomicLongArray (for CAS) or direct memory

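     A sketch of such a large atomic bit set over AtomicLongArray (the class name is illustrative):

       import java.util.concurrent.atomic.AtomicLongArray;

       public final class AtomicBitmap {
           private final AtomicLongArray words;

           public AtomicBitmap(long bits) {
               this.words = new AtomicLongArray((int) ((bits + 63) / 64));
           }

           // Atomically set one bit via a CAS retry loop on the containing 64-bit word.
           public void set(long bit) {
               int word = (int) (bit >>> 6);   // bit / 64
               long mask = 1L << (bit & 63);   // bit % 64
               while (true) {
                   long old = words.get(word);
                   if ((old & mask) != 0 || words.compareAndSet(word, old, old | mask)) {
                       return; // already set, or our CAS won
                   }
               }
           }

           public boolean isSet(long bit) {
               return (words.get((int) (bit >>> 6)) & (1L << (bit & 63))) != 0;
           }
       }
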
  46. wrapping up
     • Remember these tricks when writing software
     • There are many ways to apply all of this
     • Every application has a performance-critical part
     • Understand how things work
     • Have more stuff in your toolbelt