
Going Off Heap

αλεx π
January 19, 2015


The number of machines, users, and handled connections is rising every year. Processors get faster, memory grows larger, and it gets easier to build distributed systems. Despite all these major improvements, we still struggle with performance problems. Fortunately, there are many approaches and techniques for solving them.
In this talk, we’re going to cover the major principles of maximising single-box performance: building systems that yield predictable results on the machine, reduce Garbage Collection pressure, utilise the CPU better, and lend themselves to lock-free programs and data structures.
We’ll look at Compare-and-Swap (CAS), off-heap buffers, cache-line alignment, lightweight data structures, object pooling, linear-memory data structures, and many more techniques used in modern Java application development.

Licensed under Creative Commons Attribution-NonCommercial 3.0: http://creativecommons.org/licenses/by-nc/3.0/

When using, quoting, or rephrasing this work, in whole or in part, explicit attribution is mandatory and required. Any commercial usage by anyone except the author is prohibited.


Transcript

  1. CAS

  2. CAS
     • Without CAS, you need locks to ensure atomic changes
     • lock-free swapping/referencing
     • available via java.util.concurrent.atomic.Atomic*
     • sun.misc.Unsafe/compareAndSwap*
     • used in Java lock implementations
     • doesn’t mean that there will always be a conflict
     • but when there is a conflict, you know it’s resolved

  3. CAS
     • Choose between:
       • how many threads are waiting on the lock
       • how complicated it is to implement locking “right”
       • how error-prone locking may potentially be
     • And:
       • how many potential collisions you may have

  4. CAS
     • Read the (primitive) value
     • Save the old value for a future check
     • Perform the modification on the value
     • CAS will write successfully only if the referenced value is unchanged
     • If it was changed, retry
     • If it wasn’t, the new value will be held by the reference

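     A minimal sketch of that retry loop in Java, using AtomicLong (the class and method names here are illustrative, not from the slides):

       import java.util.concurrent.atomic.AtomicLong;

       public final class CasLoop {
           private final AtomicLong value = new AtomicLong();

           // Classic CAS retry loop: read, remember the old value, compute,
           // and attempt the swap; on a conflict, retry with a fresh read.
           long increment() {
               while (true) {
                   long old = value.get();               // read the value
                   long next = old + 1;                  // perform the modification
                   if (value.compareAndSet(old, next)) {
                       return next;                      // written only because `old` was unchanged
                   }
                   // another thread changed the value in between: retry
               }
           }
       }
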
  5. pooling: why?
     • eliminates the cost of object initialisation
     • actively used for, e.g., JDBC connections (small object counts)
     • here, though, we’re looking at rapidly/constantly allocated short-lived objects

  6. pooling: fixed size
     • + predictable size (thanks, cap)
     • + pre-allocated in one run
     • + when an object is available, it’s returned immediately
     • + no further allocations required
     • - easy to run into resource-starvation problems
     • - always has to allocate more than is “usually” used
     • - you have to know how many you actually need

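     A minimal fixed-size pool sketch backed by ArrayBlockingQueue (class and method names are illustrative):

       import java.util.concurrent.ArrayBlockingQueue;
       import java.util.concurrent.BlockingQueue;
       import java.util.function.Supplier;

       public final class FixedPool<T> {
           private final BlockingQueue<T> free;

           // Pre-allocate all `capacity` objects in one run; no further allocations.
           public FixedPool(int capacity, Supplier<T> factory) {
               this.free = new ArrayBlockingQueue<>(capacity);
               for (int i = 0; i < capacity; i++) {
                   free.add(factory.get());
               }
           }

           // Blocks when the pool is empty: this is where starvation can bite.
           public T borrow() throws InterruptedException {
               return free.take();
           }

           public void giveBack(T object) {
               free.offer(object);
           }
       }
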
  7. pooling: single-step growth
     • allocated with initial N objects
     • when the pool is empty, each lease results in an allocation
     • good when the initial size is “most likely” correct
     • size may be underestimated
     • if the pool size is unbounded, it may blow everything up
     • single-entry allocations are fast
     • will always carry the “high watermark” number of entries

  8. pooling: block growth
     • allocated with initial N objects
     • good for a “lightweight” startup
     • or when the pool size is unclear
     • when the pool is empty, the pool is grown by M entries
     • slower growth step, but no overhead during lease
     • balances latency against memory overhead
     • could be grown by exponential steps (for pessimists)

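     A sketch of the block-growth lease path (sizes and names are illustrative; synchronisation kept deliberately simple):

       import java.util.ArrayDeque;
       import java.util.Deque;
       import java.util.function.Supplier;

       public final class BlockGrowthPool<T> {
           private final Deque<T> free = new ArrayDeque<>();
           private final Supplier<T> factory;
           private final int growthStep;

           public BlockGrowthPool(int initialSize, int growthStep, Supplier<T> factory) {
               this.factory = factory;
               this.growthStep = growthStep;
               grow(initialSize); // lightweight startup: allocate only the initial block
           }

           // When the pool runs dry, grow by M entries in one slower step,
           // so subsequent leases are again allocation-free.
           public synchronized T borrow() {
               if (free.isEmpty()) {
                   grow(growthStep);
               }
               return free.pop();
           }

           public synchronized void giveBack(T object) {
               free.push(object);
           }

           private void grow(int n) {
               for (int i = 0; i < n; i++) {
                   free.push(factory.get());
               }
           }
       }
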
  9. pooling: choosing the right one
     • analyse how the objects are allocated
     • are there “sudden” large usage/allocation changes?
     • or does everything happen “gradually”?
     • if you reach the limit, will it only get worse?

  10. recycle strategies: reference counting
     • plays very well together with lambdas, e.g.:

         (pooledObject, pooledObjectConsumer) -> {
             pooledObject.retain();
             pooledObjectConsumer.accept(pooledObject);
             pooledObject.release();
         };

     • each block that receives an object increments the counter
     • whenever the flow leaves the block, the counter is decremented
     • when the counter reaches 0, the object returns to the pool

  11. recycle strategies: reference counting
     • good for:
       • multiple / concurrent consumers
       • pipelining / nested processing
     • pitfalls:
       • premature release
       • double increments/decrements
       • change of the object in place (if not immutable)
       • “implicit” object recycling

  12. recycle strategies: reference counting
     • pitfalls:
       • the pooled object has to be pool-aware (or wrapped)
       • wrapped = memory overhead
       • hard to track “who forgot to decrement”

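     A minimal sketch of a pool-aware, reference-counted wrapper (all names are illustrative; Netty’s ReferenceCounted interface is a production example of this pattern):

       import java.util.concurrent.atomic.AtomicInteger;
       import java.util.function.Consumer;

       public final class Pooled<T> {
           private final T value;
           private final Consumer<Pooled<T>> returnToPool; // invoked when the count hits 0
           private final AtomicInteger refCount = new AtomicInteger(1);

           public Pooled(T value, Consumer<Pooled<T>> returnToPool) {
               this.value = value;
               this.returnToPool = returnToPool;
           }

           public T get() {
               return value;
           }

           // Each block that receives the object increments the counter...
           public Pooled<T> retain() {
               refCount.incrementAndGet();
               return this;
           }

           // ...and decrements it on the way out; at 0 the object goes back to the pool.
           public void release() {
               if (refCount.decrementAndGet() == 0) {
                   returnToPool.accept(this);
               }
           }
       }
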
  13. recycle strategies: borrow/return
     • the object is “borrowed” from the pool (and re-init’ed)
     • when the object is no longer needed, you return it to the pool

  14. recycle strategies: borrow/return
     • good for:
       • a single consumer
       • a clear usage scope (block)
     • the object is pool-oblivious (real science term, srsly)
     • pitfalls:
       • premature release (always, with pooling)
       • double `borrow`

  15. recycle strategies: general pitfalls
     • make scopes very explicit
     • avoid complex conditionals; return unconditionally
     • always `finally` return the object (see the sketch below)
     • seriously, never forget to return the object
     • pool side: make sure never to hand an object out twice
     • “lost” objects (borrowed and never returned)

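     The `finally` discipline in code, reusing the FixedPool sketch from earlier (the buffer work is a stand-in):

       static void withPooledBuffer(FixedPool<StringBuilder> pool) throws InterruptedException {
           StringBuilder buffer = pool.borrow();
           try {
               buffer.setLength(0);       // re-initialise before use
               buffer.append("work");     // stand-in for real work
           } finally {
               pool.giveBack(buffer);     // returns on every path, including exceptions
           }
       }
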
  16. recycle strategies: general pitfalls
     • re-initialisation is hard to “get right”
     • beware of reference values “within” pooled objects
     • when resetting, don’t leave “garbage” behind

  17. false sharing

              CPU 1                      CPU 2
     +--------------------+    +--------------------+
     |                    |    |                    |
     +--------------------+    +--------------------+
         Cache Line XX             Cache Line YY
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+
     |XX|XX|XX|YY|YY|YY|YY|    |XX|XX|XX|YY|YY|YY|YY|
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+

                          RAM
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+
     |  |  |  |XX|XX|XX|YY|YY|YY|YY|  |  |  |  |  |
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+

  18. false sharing

              CPU 1                      CPU 2
     +--------------------+    +--------------------+
     |                    |    |                    |
     +--------------------+    +--------------------+
         Cache Line XX             Cache Line YY
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+
     |XX|XX|XX|YY|YY|YY|YY|    |XX|XX|XX|YY|YY|YY|YY|
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+

                          RAM
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+
     |  |  |  |XX|XX|XX|ZZ|YY|YY|YY|  |  |  |  |  |
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+

     (one YY slot is updated to ZZ: both cached copies of the line are now stale)

  19. false sharing

              CPU 1                      CPU 2
     +--------------------+    +--------------------+
     |                    |    |                    |
     +--------------------+    +--------------------+
         Cache Line XX             Cache Line YY
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+
     |XX|XX|XX|ZZ|YY|YY|YY|    |XX|XX|XX|ZZ|YY|YY|YY|
     +--+--+--+--+--+--+--+    +--+--+--+--+--+--+--+

                          RAM
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+
     |  |  |  |XX|XX|XX|ZZ|YY|YY|YY|  |  |  |  |  |
     +--+--+--+==+==+==+==+==+==+==+--+--+--+--+--+

     (both CPUs must refresh the whole line, even though CPU 1 only cares about XX)

  20. CPU cache alignment
     • avoid cache misses with padding
     • avoid cache-line contention by “spreading” data
     • false sharing:
       • needless cache updates when nothing (you care about) got changed
       • fixed by padding structures

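     A manual-padding sketch (assuming 64-byte cache lines; field names are arbitrary). Caveat: the JVM may reorder or strip unused fields, which is why production code uses subclassing tricks or the JDK-internal @Contended annotation instead:

       // Two counters updated by different threads. Without the padding they could
       // land on the same cache line and falsely share it; 56 bytes on each side
       // of the hot long push neighbours onto separate lines.
       public final class PaddedCounters {
           static final class PaddedLong {
               long p1, p2, p3, p4, p5, p6, p7; // padding before the hot field
               volatile long value;
               long q1, q2, q3, q4, q5, q6, q7; // padding after it as well
           }

           final PaddedLong producerCount = new PaddedLong();
           final PaddedLong consumerCount = new PaddedLong();
       }
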
  21. waiting strategies: yielding wait
     • + quite precise
     • + scheduler-friendly
     • - still spins quite hard

  22. waiting strategies: park/sleep
     • - very imprecise
     • - prone to timer drift
     • + very scheduler-friendly

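     The two strategies side by side, as a sketch (method names are illustrative):

       import java.util.concurrent.locks.LockSupport;
       import java.util.function.BooleanSupplier;

       public final class Waits {
           // Yielding wait: quite precise, but still spins hard on a free CPU.
           static void yieldingWait(BooleanSupplier condition) {
               while (!condition.getAsBoolean()) {
                   Thread.yield(); // give the scheduler a chance between probes
               }
           }

           // Park-based wait: very scheduler-friendly, but wake-up timing is imprecise.
           static void parkingWait(BooleanSupplier condition) {
               while (!condition.getAsBoolean()) {
                   LockSupport.parkNanos(1_000_000L); // ~1 ms; the actual sleep may drift
               }
           }
       }
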
  23. why off-heap?
     • not affected by GC
     • zero-copy IPC / networking
     • great for large datasets
     • compact data representation*

     * under certain conditions

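     In Java, the usual entry point to off-heap memory is a direct ByteBuffer; a minimal sketch:

       import java.nio.ByteBuffer;
       import java.nio.ByteOrder;

       public final class OffHeapExample {
           public static void main(String[] args) {
               // Allocated outside the Java heap: the contents are invisible to the GC
               // (only the small ByteBuffer wrapper object lives on-heap).
               ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024 * 1024)
                                             .order(ByteOrder.nativeOrder());

               buffer.putLong(0, 42L);                // absolute write at offset 0
               System.out.println(buffer.getLong(0)); // absolute read: 42
           }
       }
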
  24. data representation: primitives
     • byte / short / int32 / long
     • float (32-bit floating point number)
     • double (64-bit floating point number)
     • boolean (essentially a flag byte)

  25. data representation: pascal-style strings

       0            4                         X
       +------------+-------------------------+
       | length X   | content                 |
       | (int)      | (string)                |
       +------------+-------------------------+

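     A sketch of reading and writing this layout over a ByteBuffer (names illustrative):

       import java.nio.ByteBuffer;
       import java.nio.charset.StandardCharsets;

       public final class PascalStrings {
           // Writes: [int length][`length` bytes of UTF-8 content]
           static void writeString(ByteBuffer buffer, String value) {
               byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
               buffer.putInt(bytes.length); // length prefix first...
               buffer.put(bytes);           // ...then the content
           }

           // Reads the length prefix, then exactly that many content bytes.
           static String readString(ByteBuffer buffer) {
               int length = buffer.getInt();
               byte[] bytes = new byte[length];
               buffer.get(bytes);
               return new String(bytes, StandardCharsets.UTF_8);
           }
       }
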
  26. data representation: flexible structures

       0            1                         X
       +------------+-------------------------+
       | field type | field content           |
       | (byte)     |                         |
       +------------+-------------------------+

     • the field type determines how many following bytes should be consumed

  27. data representation: flexible structures
     • the field content can essentially have the same structure itself:

       0            1                                           X
       +------------+-------------------------------------------+
       |            |  1                                        |
       | field type |  +------------+---------------------------+
       | (byte)     |  | field type | field content             |
       |            |  +------------+---------------------------+
       +------------+-------------------------------------------+

  28. consumption strategies: c-style (template)
     • the structure is fixed and the offsets of its fields are known:

       struct foo_s {
           char   name[10]; // offset 0,  length 10
           int    type;     // offset 10, length 4
           double min;      // offset 14, length 8
           double max;      // offset 22, length 8
       };

       0          10         14        22        30
       +----------+----------+---------+---------+
       |   name   |   type   |   min   |   max   |
       +----------+----------+---------+---------+

  29. consumption strategies: c-style (template)
     • array access:

       0          10         14        22        30
       +----------+----------+---------+---------+
       |   name   |   type   |   min   |   max   |
       +----------+----------+---------+---------+
       30         40         44        52        60   ...
       +----------+----------+---------+---------+
       | name[1]  | type[1]  | min[1]  | max[1]  |    ...
       +----------+----------+---------+---------+

     • essentially, index * sizeof(struct) + field offset

  30. consumption strategies: c-style (template)
     • + random access
     • + no need to decode / go through the entire buffer
     • + compact data representation
     • + no meta-information overhead
     • - fixed-size structures (no variable-length data)

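     A flyweight-style sketch of the template approach over a ByteBuffer, using the `foo_s` layout from the slides (the class name is illustrative; the 10-byte name region is left out for brevity):

       import java.nio.ByteBuffer;

       public final class FooFlyweight {
           static final int TYPE_OFFSET = 10;
           static final int MIN_OFFSET  = 14;
           static final int MAX_OFFSET  = 22;
           static final int SIZEOF      = 30;

           private final ByteBuffer buffer;
           private int base; // start of the currently addressed element

           public FooFlyweight(ByteBuffer buffer) {
               this.buffer = buffer;
           }

           // Random access: element i lives at i * sizeof(struct).
           public FooFlyweight at(int index) {
               this.base = index * SIZEOF;
               return this;
           }

           public int    type() { return buffer.getInt(base + TYPE_OFFSET); }
           public double min()  { return buffer.getDouble(base + MIN_OFFSET); }
           public double max()  { return buffer.getDouble(base + MAX_OFFSET); }

           public void type(int v)   { buffer.putInt(base + TYPE_OFFSET, v); }
           public void min(double v) { buffer.putDouble(base + MIN_OFFSET, v); }
           public void max(double v) { buffer.putDouble(base + MAX_OFFSET, v); }
       }
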
  31. consumption strategies: sequential reading
     • hold a pointer to the current reading position (e.g. 0)

       0
       +------------+ ....
       | field type | ....
       | (byte)     | ....
       +------------+ ....

  32. consumption strategies: sequential reading
     • identify the type of the following structure (e.g. string)

       0
       +---------------------+ ....
       | field type = string | ....
       +---------------------+ ....

  33. consumption strategies: sequential reading
     • read the following <int>, which holds the string size

       0
       +---------------------+------------------+ ....
       | field type = string | string size = 10 | ....
       +---------------------+------------------+ ....

  34. consumption strategies: sequential reading
     • read the string contents:

       +---------------------+------------------+-------------+ ....
       | field type = string | string size = 10 | 10b content | ....
       +---------------------+------------------+-------------+ ....

  35. consumption strategies: sequential reading
     • rinse and repeat

       .... +------------------+ ....
       .... | field type = ??? | ....
       .... +------------------+ ....

  36. consumption strategies: sequential reading
     • + flexible data
     • + nested structures possible
     • + easy to compose
     • - no way to get at the data without reading the meta-info
     • - metadata overhead

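     A sketch of the sequential decode loop over a ByteBuffer (the type tags are made up for illustration; a real format would pin them down in a spec):

       import java.nio.ByteBuffer;
       import java.nio.charset.StandardCharsets;

       public final class SequentialReader {
           static final byte TYPE_STRING = 0x01; // hypothetical tags
           static final byte TYPE_INT32  = 0x02;

           // The buffer's position acts as the pointer to the current reading position.
           static void readAll(ByteBuffer buffer) {
               while (buffer.hasRemaining()) {
                   byte fieldType = buffer.get(); // identify the following structure
                   switch (fieldType) {
                       case TYPE_STRING:
                           int size = buffer.getInt();  // read the size meta-info...
                           byte[] content = new byte[size];
                           buffer.get(content);         // ...then exactly that many bytes
                           System.out.println(new String(content, StandardCharsets.UTF_8));
                           break;
                       case TYPE_INT32:
                           System.out.println(buffer.getInt());
                           break;
                       default:
                           throw new IllegalStateException("unknown field type: " + fieldType);
                   }
                   // rinse and repeat until the buffer is exhausted
               }
           }
       }
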
  37. consumption strategies: truth is in the middle
     • in reality, you will most likely end up with a mix of both:

       +--------------------+--------------------+--------------------+----------+ ....
       | field type = array | element type = foo | element count = 10 | elements | ....
       +--------------------+--------------------+--------------------+----------+ ....

  38. consumption strategies: truth is in the middle
     • encode arrays as <type> + <size> + <sequentially-encoded-content>
     • same with hash maps
     • compute relative positions of keys and values
     • read/access by index

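     A sketch of that mixed encoding for an int array (the tag value is illustrative); fixed-size elements are what make by-index access possible:

       import java.nio.ByteBuffer;

       public final class ArrayEncoding {
           static final byte TYPE_INT_ARRAY = 0x03; // made-up tag, as before

           // <type> + <size> + <sequentially encoded content>
           static void writeIntArray(ByteBuffer buffer, int[] values) {
               buffer.put(TYPE_INT_ARRAY);
               buffer.putInt(values.length);
               for (int v : values) {
                   buffer.putInt(v);
               }
           }

           // Compute the relative position of element i instead of scanning the payload.
           static int readIntAt(ByteBuffer buffer, int arrayStart, int index) {
               int payloadStart = arrayStart + 1 /* type */ + 4 /* count */;
               return buffer.getInt(payloadStart + index * 4);
           }
       }
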
  39. consumption strategies: random tips
     • Encode your data types with constants, e.g.:

         0x00001 User
         0x00101 User.name
         0x00201 User.age
         0x00301 User.createdAt
         ...
         0x00002 Car
         0x00102 Car.color
         0x00202 Car.mileage
         ...

     • Use bitmasks for flags
     • Use strict comparison to determine the type

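     One way this might look in code (the type constants are copied from the slide; the flag bits are purely illustrative):

       public final class TypeConstants {
           // Type constants as on the slide.
           static final int USER      = 0x00001;
           static final int USER_NAME = 0x00101;
           static final int USER_AGE  = 0x00201;
           static final int CAR       = 0x00002;
           static final int CAR_COLOR = 0x00102;

           // Illustrative flag bits, combined into one int with bitwise OR.
           static final int FLAG_NULLABLE = 1 << 0;
           static final int FLAG_INDEXED  = 1 << 1;

           static boolean isIndexed(int flags) {
               return (flags & FLAG_INDEXED) != 0; // bitmask test for a flag
           }

           static boolean isUserName(int type) {
               return type == USER_NAME; // strict comparison to determine the type
           }
       }
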
  40. consumption strategies: random tips
     • Avoid sharing state in buffers (reader/writer indexes)
     • Use buffer “slices” to hand over parts of the data
     • Break buffers down into the most atomic pieces
     • Recycle / pool as much as you can; GC is often an enemy

  41. consumption strategies: random tips
     • Pipeline writes in one thread
     • Do not modify buffers while they’re being read (immutable)
     • Read as concurrently as you can / want / need
     • Stick to fixed-size structs as much as you can
     • Avoid re-mapping / defragmenting as much as you can
     • Keep a `free memory map` index
     • Grab the first free fitting memory segment

  42. bitmasks

     left shift of 1 by 20: 1L << 20

     0000 0000 0000 0000 0000 0000 0000 0001   =  1
     0000 0000 0001 0000 0000 0000 0000 0000   =  1L << 20  (1048576)

  43. bitmasks

     right shift of 100 by 5: 100L >> 5

     0000 0000 0000 0000 0000 0000 0110 0100   =  100
     0000 0000 0000 0000 0000 0000 0000 0011   =  100L >> 5  (3)

  44. bitmasks

     int i = 250;            // 0000 0000 0000 0000 0000 0000 1111 1010  (32-bit integer 250)

     `invert` the 20th bit:
     int j = i ^ (1 << 20);  // 0000 0000 0001 0000 0000 0000 1111 1010  (1048826)

     `clear` (toggle back) the 20th bit:
     j ^ (1 << 20);          // 0000 0000 0000 0000 0000 0000 1111 1010  (250)

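     The standard idioms, for reference; note that XOR only toggles, so a guaranteed clear uses AND-NOT:

       public final class Bits {
           static int set(int value, int bit)      { return value |  (1 << bit); }
           static int clear(int value, int bit)    { return value & ~(1 << bit); } // clears even if already 0
           static int toggle(int value, int bit)   { return value ^  (1 << bit); } // XOR: flips the bit
           static boolean isSet(int value, int bit) { return (value & (1 << bit)) != 0; }

           public static void main(String[] args) {
               int i = 250;
               int j = toggle(i, 20);             // 1048826, as on the slide
               System.out.println(j);
               System.out.println(toggle(j, 20)); // back to 250
           }
       }
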
  45. bitmasks
     • in combination with shifts, bitmasks can be used to create very large (yet atomic)
       bit sets, for example backed by AtomicLongArray (for CAS) or direct memory

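     A sketch of such a large atomic bit set over AtomicLongArray (the class name is illustrative):

       import java.util.concurrent.atomic.AtomicLongArray;

       public final class AtomicBitmap {
           private final AtomicLongArray words;

           public AtomicBitmap(long bits) {
               this.words = new AtomicLongArray((int) ((bits + 63) / 64));
           }

           // Atomically set one bit via a CAS retry loop on the containing 64-bit word.
           public void set(long bit) {
               int word = (int) (bit >>> 6);   // bit / 64
               long mask = 1L << (bit & 63);   // bit % 64
               while (true) {
                   long old = words.get(word);
                   if ((old & mask) != 0 || words.compareAndSet(word, old, old | mask)) {
                       return; // already set, or our CAS won
                   }
               }
           }

           public boolean isSet(long bit) {
               return (words.get((int) (bit >>> 6)) & (1L << (bit & 63))) != 0;
           }
       }
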
  46. wrapping up
     • Remember these tricks when writing software
     • There are many ways to apply all of this
     • Every application has a performance-critical part
     • Understand how things work
     • Have more stuff in your toolbelt