Garbage collection on JVM

October 18, 2014, Dimitry Ushakov

Thank you to our sponsors … (Gold, Bronze, Happy Hour sponsor)
Austin .Net User Group Codecamp 2014

JVM?
- Java Virtual Machine
- Can execute any Java bytecode
- The source can be anything that compiles to Java bytecode: Scala, Clojure, Java, Jython, JRuby, etc.

Garbage Collection?
- Automatic memory management
- Objects no longer in use are cleaned up by a “collector”
- Developers don’t need to worry about memory cleanup (sort of)

I like pictures

Glossary
- Stop-the-world pause (STW) – pause all application threads
- Copy collection – move reachable objects from one region to another
- Serial collection – a single thread does all GC work
- Parallel collection – multiple cores are used to reduce STW time
- Concurrent collection – GC algorithms running at the same time as application threads
- Compaction – defragment objects in memory
- Mark – tag all reachable objects
- Sweep – reclaim all unreachable objects

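Glossary, cont.
- Roots – references from thread stacks (plus static fields, JNI handles, etc.) that all reachable objects are traced from
- TLAB (thread-local allocation buffer) – each thread gets a block in Eden where it can allocate new objects without locking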
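A TLAB boils down to bump-pointer allocation in a private slice of Eden. The sketch below models that with plain `long` offsets; the class and method names are hypothetical, and HotSpot's real TLAB handling (refill sizing, waste limits, slow-path allocation) is considerably more involved.

```java
// Hypothetical model of a thread-local allocation buffer (TLAB): each thread
// owns a [top, end) slice of Eden and allocates by bumping `top`.
class TlabDemo {
    static final class Tlab {
        private long top;        // next free offset in this thread's slice
        private final long end;  // first offset past this thread's slice

        Tlab(long start, long size) {
            this.top = start;
            this.end = start + size;
        }

        /** Returns the offset of the new object, or -1 if this TLAB is exhausted. */
        long allocate(long bytes) {
            if (bytes > end - top) {
                return -1;       // a real JVM would retire this TLAB and get a new one (or use shared Eden)
            }
            long address = top;
            top += bytes;        // no lock: only the owning thread touches this TLAB
            return address;
        }

        long remaining() {
            return end - top;
        }
    }

    public static void main(String[] args) {
        Tlab tlab = new Tlab(0, 64);
        System.out.println(tlab.allocate(24)); // 0
        System.out.println(tlab.allocate(24)); // 24
        System.out.println(tlab.allocate(24)); // -1 (only 16 bytes left)
        System.out.println(tlab.remaining());  // 16
    }
}
```

The common case is just a comparison and an addition, which is why allocation in Eden is so cheap.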

What are we looking for?
- Maximum pause time goal
- Throughput goal
- Memory footprint goal

GC outputs
- Output GC info → -verbose:gc
- Output to a specific file → -Xloggc:<file>
- More details:
  - -XX:+PrintGCDetails
  - -XX:+PrintGCTimeStamps
  - -XX:+PrintHeapAtGC
  - -XX:+PrintTenuringDistribution

Output example
[GC 325407K->83000K(776768K), 0.2300771 secs]
[GC 325816K->83372K(776768K), 0.2454258 secs]
[Full GC 267628K->83769K(776768K), 1.8479984 secs]
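Each line reads as "heap used before -> heap used after (total heap), pause time". A small parser for exactly this simple format (not the richer -XX:+PrintGCDetails output) can turn a log into numbers; the class name and fields are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcLogLine {
    // Simple -verbose:gc format, e.g. [GC 325407K->83000K(776768K), 0.2300771 secs]
    private static final Pattern LINE = Pattern.compile(
        "\\[(Full GC|GC) (\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.]+) secs\\]");

    final boolean full;     // true for "Full GC" lines
    final long beforeKb;    // heap occupancy before the collection
    final long afterKb;     // heap occupancy after the collection
    final long heapKb;      // total heap size (the number in parentheses)
    final double seconds;   // pause duration

    private GcLogLine(boolean full, long beforeKb, long afterKb, long heapKb, double seconds) {
        this.full = full;
        this.beforeKb = beforeKb;
        this.afterKb = afterKb;
        this.heapKb = heapKb;
        this.seconds = seconds;
    }

    static GcLogLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized GC line: " + line);
        }
        return new GcLogLine(m.group(1).equals("Full GC"),
            Long.parseLong(m.group(2)), Long.parseLong(m.group(3)),
            Long.parseLong(m.group(4)), Double.parseDouble(m.group(5)));
    }

    long freedKb() {
        return beforeKb - afterKb;
    }

    public static void main(String[] args) {
        String[] lines = {
            "[GC 325407K->83000K(776768K), 0.2300771 secs]",
            "[GC 325816K->83372K(776768K), 0.2454258 secs]",
            "[Full GC 267628K->83769K(776768K), 1.8479984 secs]",
        };
        for (String s : lines) {
            GcLogLine gc = parse(s);
            System.out.printf("%-7s freed %dK in %.3f s%n",
                gc.full ? "Full GC" : "GC", gc.freedKb(), gc.seconds);
        }
    }
}
```

For the sample above, the two young collections free about 242MB each in roughly 0.24 s, while the full GC frees less (about 180MB) and takes 1.85 s.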

Output example - CMS GC
[GC [1 CMS-initial-mark: 13991K(20288K)] 14103K(22400K), 0.0023781 secs]
[GC [DefNew: 2112K->64K(2112K), 0.0837052 secs] 16103K->15476K(22400K), 0.0838519 secs]
...
[GC [DefNew: 2077K->63K(2112K), 0.0126205 secs] 17552K->15855K(22400K), 0.0127482 secs]
[CMS-concurrent-mark: 0.267/0.374 secs]
[GC [DefNew: 2111K->64K(2112K), 0.0190851 secs] 17903K->16154K(22400K), 0.0191903 secs]
[CMS-concurrent-preclean: 0.044/0.064 secs]
[GC [1 CMS-remark: 16090K(20288K)] 17242K(22400K), 0.0210460 secs]
[GC [DefNew: 2112K->63K(2112K), 0.0716116 secs] 18177K->17382K(22400K), 0.0718204 secs]
[GC [DefNew: 2111K->63K(2112K), 0.0830392 secs] 19363K->18757K(22400K), 0.0832943 secs]
...
[GC [DefNew: 2111K->0K(2112K), 0.0035190 secs] 17527K->15479K(22400K), 0.0036052 secs]
[CMS-concurrent-sweep: 0.291/0.662 secs]
[GC [DefNew: 2048K->0K(2112K), 0.0013347 secs] 17527K->15479K(27912K), 0.0014231 secs]
[CMS-concurrent-reset: 0.016/0.016 secs]
[GC [DefNew: 2048K->1K(2112K), 0.0013936 secs] 17527K->15479K(27912K), 0.0014814

Algorithms
- Copy
- Mark & Sweep

Copy
- Heap is divided into 2 spaces (active and inactive)
- STW when active gets full
- Live objects are moved to inactive
- Active is cleared
- Spaces swap roles (inactive becomes active)
- PRO:
  - Only live objects are visited
- CON:
  - Needs 2x the real heap size
  - Lots of copying overhead (every live object is moved on every collection)
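The copy algorithm above can be sketched as a Cheney-style semispace collector, with a `List` standing in for each half of the heap and integer indices standing in for pointers. This is a toy model with hypothetical names, not HotSpot code: live objects are copied breadth-first from the roots, a forwarding table prevents copying anything twice, and garbage is never touched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SemispaceDemo {
    /** A heap object: a name plus references to other objects (indices into its space). */
    static final class Obj {
        final String name;
        final int[] refs;
        Obj(String name, int... refs) { this.name = name; this.refs = refs; }
    }

    /**
     * Copies everything reachable from `roots` out of `from` into a fresh to-space,
     * rewriting references to the new indices. Roots are updated in place.
     */
    static List<Obj> collect(List<Obj> from, int[] roots) {
        List<Obj> to = new ArrayList<>();
        int[] forward = new int[from.size()];
        Arrays.fill(forward, -1);                 // -1 = not copied yet

        for (int i = 0; i < roots.length; i++) {
            roots[i] = copy(from, to, forward, roots[i]);
        }
        // The scan index chases the end of to-space until every copied object is scanned.
        for (int scan = 0; scan < to.size(); scan++) {
            Obj o = to.get(scan);
            for (int r = 0; r < o.refs.length; r++) {
                o.refs[r] = copy(from, to, forward, o.refs[r]);
            }
        }
        return to;                                // `from` can now be cleared wholesale
    }

    private static int copy(List<Obj> from, List<Obj> to, int[] forward, int index) {
        if (forward[index] == -1) {
            Obj old = from.get(index);
            to.add(new Obj(old.name, old.refs.clone())); // refs still point into from-space
            forward[index] = to.size() - 1;              // leave a forwarding address
        }
        return forward[index];
    }

    public static void main(String[] args) {
        // c (root) -> a -> b; "garbage" is unreachable
        List<Obj> from = List.of(new Obj("a", 1), new Obj("b"), new Obj("garbage"), new Obj("c", 0));
        int[] roots = {3};
        for (Obj o : collect(from, roots)) {
            System.out.println(o.name + " -> " + Arrays.toString(o.refs));
        }
        // c -> [1], a -> [2], b -> []
    }
}
```

The work done is proportional to the three live objects only, which is the PRO listed above; the CON is that a second, equally sized space must exist to copy into.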

Mark & Sweep
- Traverse objects
- Mark reachable objects
- Sweep memory and free up unmarked objects
- PROS:
  - Handles cyclic references
  - No effect on compiler/application
  - Pauses proportional to heap size
- CONS:
  - All threads are stopped
  - Fragmentation
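A toy mark & sweep over a hypothetical object graph shows why cycles are not a problem: marking only follows references from the roots, so two objects that point only at each other are never marked and get swept. The sweep walks the whole heap list, which mirrors the "pauses proportional to heap size" point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MarkSweepDemo {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        boolean marked;
        Obj(String name) { this.name = name; }
    }

    /** Mark phase: iterative traversal from the roots; the mark bit stops revisits (and cycles). */
    static void mark(List<Obj> roots) {
        Deque<Obj> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (o.marked) continue;
            o.marked = true;
            for (Obj ref : o.refs) {
                if (!ref.marked) stack.push(ref);
            }
        }
    }

    /** Sweep phase: visit every object; free unmarked ones, clear marks on survivors. */
    static List<Obj> sweep(List<Obj> heap) {
        List<Obj> freed = new ArrayList<>();
        heap.removeIf(o -> {
            if (o.marked) {
                o.marked = false;   // reset for the next cycle
                return false;
            }
            freed.add(o);
            return true;
        });
        return freed;
    }

    public static void main(String[] args) {
        Obj root = new Obj("root"), live = new Obj("live");
        Obj x = new Obj("x"), y = new Obj("y");
        root.refs.add(live);
        x.refs.add(y);              // x <-> y: a cycle nothing else points to
        y.refs.add(x);
        List<Obj> heap = new ArrayList<>(List.of(root, live, x, y));

        mark(List.of(root));
        for (Obj o : sweep(heap)) System.out.println("freed " + o.name);  // freed x, freed y
        System.out.println(heap.size() + " objects survive");              // 2 objects survive
    }
}
```

Freed objects leave holes wherever they happened to be, which is the fragmentation CON.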

Mark, Sweep & Compact
- Same as M&S, but at the end all marked objects are compacted
- Doesn’t use a “free list”
  - Just a reference to where the live objects end
- All objects are at the bottom of the heap
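Sliding compaction can be sketched on a list of (address, size, mark bit) records: live objects keep their order and slide to the bottom, and the next allocation address is simply the end of the last live object, so no free list is needed. This is a hypothetical model; a real compactor must also rewrite every reference to a moved object, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;

class CompactDemo {
    /** An object at `address` occupying `size` bytes; `live` is its mark bit. */
    record Obj(String name, int address, int size, boolean live) {}

    /** Packs live objects at the bottom of the heap, preserving their order. */
    static List<Obj> compact(List<Obj> heapInAddressOrder) {
        List<Obj> result = new ArrayList<>();
        int free = 0;                        // where the next live object will be moved to
        for (Obj o : heapInAddressOrder) {
            if (!o.live()) continue;         // garbage: its space is simply reclaimed
            result.add(new Obj(o.name(), free, o.size(), true));
            free += o.size();
        }
        return result;
    }

    /** After compaction, allocation just continues from the end of the last live object. */
    static int topOfHeap(List<Obj> compacted) {
        if (compacted.isEmpty()) return 0;
        Obj last = compacted.get(compacted.size() - 1);
        return last.address() + last.size();
    }

    public static void main(String[] args) {
        List<Obj> heap = List.of(
            new Obj("a", 0, 16, true),
            new Obj("dead1", 16, 32, false),
            new Obj("b", 48, 8, true),
            new Obj("dead2", 56, 24, false),
            new Obj("c", 80, 16, true));
        List<Obj> compacted = compact(heap);
        compacted.forEach(System.out::println);
        System.out.println("next allocation at " + topOfHeap(compacted)); // 40
    }
}
```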

Types of JVMs
- Hotspot (23.25-b01)
  - Open source
  - Maintained & distributed by Oracle
- JRockit (28.2.3) [won’t cover it today]
  - Proprietary
  - Free to use
  - Being integrated with Hotspot
- IBM J9 [won’t cover it today]
  - Proprietary
  - Licensed
  - WebSphere JVM

Introduce the app
- API management tool
- Used in a number of production applications (30+) servicing millions of requests per second
- Distributed (in-memory datastore backed by Ehcache)
- Connection pools, object pools, translation pools, pools of pools
- Short-lived class loaders abound
- Services thousands of requests per second at sub-10-millisecond overhead
- www.openrepose.org
- https://www.github.com/rackerlabs/repose

Test scenario
- Rate limiting + distributed datastore in a 2-node cluster
- 1000 requests per second
- 1 hour runtime

Introduce the tools
- VisualVM (with VisualGC plugin): http://visualvm.java.net/
- IBM Heap analyzer: https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=4544bafe-c7a2-455f-9d43-eb866ea60091
- GC analyzer: https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=22d56091-3a7b-4497-b36e-634b51838e11
- New Relic: http://newrelic.com/
- jHiccup: http://www.azulsystems.com/jHiccup

Hotspot
- Generational heap
- Eden
- Survivor 1
- Survivor 2
- Tenured
- Perm Gen (goes away in Java 8, replaced by Metaspace)
- Native area

Why generational?
- Most objects die young (stats say 98%; you should measure :))
- GC cycles are short and only cover part of the heap

Hotspot object allocation [diagram sequence: Eden, Survivor 1, Survivor 2, Tenured, Permanent]
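The allocation diagrams can be approximated by a toy model: objects start in Eden, each minor GC copies the survivors into the other survivor space and increments their age, and objects that reach the tenuring threshold are promoted to Tenured. Everything here is hypothetical and simplified (no survivor overflow, no Perm Gen, and liveness is supplied by the caller instead of being traced).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class GenerationalDemo {
    static final class Obj {
        final String name;
        int age;                          // young collections survived so far
        Obj(String name) { this.name = name; }
    }

    final List<Obj> eden = new ArrayList<>();
    List<Obj> fromSurvivor = new ArrayList<>();
    final List<Obj> tenured = new ArrayList<>();
    final int tenuringThreshold;          // stand-in for -XX:MaxTenuringThreshold

    GenerationalDemo(int tenuringThreshold) { this.tenuringThreshold = tenuringThreshold; }

    void allocate(String name) { eden.add(new Obj(name)); }

    /** Minor GC: live objects leave Eden and the "from" survivor; dead ones are left behind. */
    void minorGc(Set<String> live) {
        List<Obj> toSurvivor = new ArrayList<>();
        List<Obj> candidates = new ArrayList<>(eden);
        candidates.addAll(fromSurvivor);
        for (Obj o : candidates) {
            if (!live.contains(o.name)) continue;
            o.age++;
            if (o.age >= tenuringThreshold) tenured.add(o);  // promotion
            else toSurvivor.add(o);
        }
        eden.clear();
        fromSurvivor = toSurvivor;        // the two survivor spaces swap roles
    }

    static List<String> names(List<Obj> objs) {
        List<String> out = new ArrayList<>();
        for (Obj o : objs) out.add(o.name);
        return out;
    }

    public static void main(String[] args) {
        GenerationalDemo heap = new GenerationalDemo(2);
        heap.allocate("session");
        heap.allocate("tmp1");
        heap.minorGc(Set.of("session"));
        System.out.println("survivor: " + names(heap.fromSurvivor)); // [session]
        heap.allocate("tmp2");
        heap.minorGc(Set.of("session"));
        System.out.println("tenured: " + names(heap.tenured));       // [session]
    }
}
```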

Common metrics
- -Xmx – max heap size
  - Default is the smaller of physical memory / 4 and MaxRAM (32-bit = 2GB; 64-bit is a lot more)
- -Xms – initial (minimum) heap size
  - Default is physical memory / 64 → usually too low!
- -XX:NewRatio=<n> – ratio of old to young generation size (e.g. 2 means old is twice the size of young)
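Because NewRatio is old:young, the young generation gets 1/(n+1) of the heap. A quick sketch of that arithmetic (ignoring the alignment HotSpot applies to generation sizes), plus what the running JVM reports through the standard `Runtime` API; the helper names are hypothetical.

```java
class HeapSizingDemo {
    /** Young generation size for -XX:NewRatio=<n>, where old:young = n:1. */
    static long youngSize(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    static long oldSize(long heapBytes, int newRatio) {
        return heapBytes - youngSize(heapBytes, newRatio);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long heap = 1024 * mb;   // e.g. -Xms1g -Xmx1g
        System.out.printf("NewRatio=2: young=%dMB old=%dMB%n",
            youngSize(heap, 2) / mb, oldSize(heap, 2) / mb);   // young=341MB old=682MB

        // What this JVM actually chose (defaults or -Xms/-Xmx)
        Runtime rt = Runtime.getRuntime();
        System.out.printf("maxMemory=%dMB totalMemory=%dMB%n",
            rt.maxMemory() / mb, rt.totalMemory() / mb);
    }
}
```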

Demo – default heap
- 8.5-9ms JVM time
- 41,000 rpm
- 130% CPU
- 570MB footprint
- 2% GC overhead
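The deck does not say how "GC overhead" was measured. Assuming it means the share of wall-clock time spent in collections, a rough in-process equivalent can be computed from the standard `java.lang.management` beans; the helper names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcOverheadDemo {
    /** GC time as a percentage of elapsed time. */
    static double overheadPercent(long gcMillis, long uptimeMillis) {
        return uptimeMillis == 0 ? 0.0 : 100.0 * gcMillis / uptimeMillis;
    }

    /** Accumulated collection time over all collectors (a bean may report -1 if unavailable). */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead so far: %.2f%%%n", overheadPercent(totalGcMillis(), uptime));
    }
}
```

For concurrent collectors the reported collection time may include concurrent phases, so this is an approximation rather than pure stop-the-world time.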

Demo – with 1GB static heap and NewRatio=2
- 3.8-5ms JVM time (roughly half the default, i.e. ~2x faster)
- 41,000-45,000 rpm
- 85-110% CPU
- 1.1GB footprint
- 1% GC overhead

Hotspot collectors
- Serial
- Parallel
- Concurrent
- Concurrent Mark/Sweep
- G1

Hotspot collectors - details

Collector name                        | Young collector                      | Old collector                  | Settings
Serial (default)                      | Serial copy collector (DefNew)       | Serial mark/sweep/compact      | -XX:+UseSerialGC
Parallel scavenge / parallel old      | Parallel copy collector (PSYoungGen) | Parallel mark/sweep/compact    | -XX:+UseParallelGC
Concurrent mark/sweep                 | Serial copy                          | Concurrent mark/sweep          | -XX:+UseConcMarkSweepGC -XX:-UseParNewGC
Concurrent mark/sweep, young parallel | Parallel copy                        | Concurrent mark/sweep          | -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
G1                                    | Copy collector (region based)        | Incremental mark/sweep/compact | -XX:+UseG1GC
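To confirm which row of the table a running JVM is using, the names reported by `GarbageCollectorMXBean.getName()` are a quick check. The name strings matched below ("Copy", "PS Scavenge", "ParNew", "ConcurrentMarkSweep", "G1 Young Generation") are what HotSpot JVMs of this era commonly report; treat the mapping as an assumption and verify it on your own JVM.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class WhichCollector {
    /** Maps HotSpot's collector bean names to the rows of the table (assumed names). */
    static String describe(List<String> beanNames) {
        if (beanNames.contains("G1 Young Generation")) return "G1";
        if (beanNames.contains("ConcurrentMarkSweep")) {
            return beanNames.contains("ParNew")
                ? "Concurrent mark/sweep, young parallel"
                : "Concurrent mark/sweep";
        }
        if (beanNames.contains("PS Scavenge")) return "Parallel scavenge / parallel old";
        if (beanNames.contains("Copy")) return "Serial";
        return "unknown: " + beanNames;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        System.out.println(names + " -> " + describe(names));
    }
}
```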

Young collection
- Young space is mostly garbage
- Scavenge – when Eden is “full,” go through Eden and the current survivor space, leave dead objects behind, and copy reachable objects to the other survivor space or to old space

Internals of Young GC
- Write barrier
- Uses cards (in a card table)
  - Every 512 bytes of heap has 1 byte in the card table
  - Dirty cards act as collection roots for young GC (old-to-young references)
- Dirty cards are reset, and the JVM copies live objects from Eden & one of the survivor spaces to the other survivor space
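Card marking can be sketched as: one byte per 512-byte card, a write barrier that dirties the card containing the updated field, and a young GC that scans only dirty cards for old-to-young pointers before cleaning them. The class is hypothetical and the CLEAN/DIRTY byte values are arbitrary here; HotSpot uses its own encodings.

```java
class CardTableDemo {
    static final int CARD_SHIFT = 9;              // 2^9 = 512 bytes of heap per card
    static final byte CLEAN = 0, DIRTY = 1;       // arbitrary values for this sketch

    private final byte[] cards;                   // one byte per card

    CardTableDemo(long heapBytes) {
        cards = new byte[(int) ((heapBytes + (1 << CARD_SHIFT) - 1) >>> CARD_SHIFT)];
    }

    static int cardIndex(long address) {
        return (int) (address >>> CARD_SHIFT);    // address / 512
    }

    /** Write barrier: runs after a reference store into an old-space field at `fieldAddress`. */
    void postWriteBarrier(long fieldAddress) {
        cards[cardIndex(fieldAddress)] = DIRTY;
    }

    /** Young GC: only dirty cards need scanning for pointers into young space; then reset them. */
    int scanAndResetDirtyCards() {
        int scanned = 0;
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] == DIRTY) {
                // ...scan objects overlapping [i * 512, (i + 1) * 512) for young references...
                cards[i] = CLEAN;
                scanned++;
            }
        }
        return scanned;
    }

    public static void main(String[] args) {
        CardTableDemo table = new CardTableDemo(4096);      // 4KB "old space" -> 8 cards
        table.postWriteBarrier(100);                        // card 0
        table.postWriteBarrier(600);                        // card 1
        table.postWriteBarrier(700);                        // card 1 again
        System.out.println(table.scanAndResetDirtyCards()); // 2
        System.out.println(table.scanAndResetDirtyCards()); // 0
    }
}
```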

Promotion from Young to Old
- When the survivor space is full, all remaining live objects are promoted directly to old space
- When objects have survived a certain number of young collections
  - -XX:MaxTenuringThreshold
  - -XX:TargetSurvivorRatio
- Hotspot direct allocation to old space
  - -XX:PretenureSizeThreshold=<n> (objects larger than <n> bytes are allocated straight in old space)
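The two flags interact: at each young GC, HotSpot derives an effective tenuring threshold from how full the survivor space would get. A simplified sketch of that computation, as it is commonly described from HotSpot's age table logic: add up surviving bytes age by age until they exceed TargetSurvivorRatio% of one survivor space, and never go above MaxTenuringThreshold. The method name and inputs are hypothetical; 50 and 15 are HotSpot's usual defaults for the two flags.

```java
class TenuringThresholdDemo {
    /**
     * @param bytesByAge bytesByAge[age] = bytes of surviving objects of that age (index 0 unused)
     */
    static int tenuringThreshold(long[] bytesByAge, long survivorCapacity,
                                 int targetSurvivorRatio, int maxTenuringThreshold) {
        long desired = survivorCapacity * targetSurvivorRatio / 100;
        long total = 0;
        int age = 1;
        while (age < bytesByAge.length) {
            total += bytesByAge[age];
            if (total > desired) break;   // survivors up to this age would overfill the target
            age++;
        }
        return Math.min(age, maxTenuringThreshold);
    }

    public static void main(String[] args) {
        long[] bytesByAge = new long[16];
        bytesByAge[1] = 200;
        bytesByAge[2] = 200;
        bytesByAge[3] = 200;
        System.out.println(tenuringThreshold(bytesByAge, 1000, 50, 15)); // 3
        System.out.println(tenuringThreshold(bytesByAge, 1000, 50, 2));  // 2
    }
}
```

With a 1000-byte survivor space and a 50% target, ages 1 and 2 fit (400 bytes) but adding age 3 reaches 600, so objects of age 3 and older are promoted even though MaxTenuringThreshold is 15. This is why "objects survived a certain # of young collections" is often fewer than the flag suggests.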

Serial
- Good for "small" applications
- Single thread performs all GC work
- Single processor machines
- -XX:+UseSerialGC
- Mobile!
- For bigger apps → SLOWWWWWWW

Demo – -XX:+UseSerialGC
- 5.7-7.5ms JVM time (2ms improvement over default)
- 40,400-45,700 rpm
- 95-126% CPU
- 330MB footprint
- 5% GC overhead

Parallel
- Generational collector
- Intended for medium/large data sets on multiprocessor/multithreaded hardware
- Multiple threads are used to speed up GC
- GOAL: smaller pauses than Serial while optimizing for throughput (it is also known as the "throughput collector"; collections are still stop-the-world)
- -XX:+UseParallelGC
- Pre JDK 7u4 you also need -XX:+UseParallelOldGC

Parallel tips & tricks
- -XX:ParallelGCThreads=<N> – set the number of GC threads to use
- -XX:MaxGCPauseMillis=<n> – set a max pause time goal (you might be forced to run GC more frequently!)
- -XX:GCTimeRatio=<n> – throughput goal: spend at most 1/(1+n) of total time in GC
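The GCTimeRatio arithmetic in a few lines: the goal is that at most 1/(1+n) of total time goes to GC, so the Parallel collector's default of 99 means a 1% target. The class name is hypothetical.

```java
class GcTimeRatioDemo {
    /** Target fraction of total time spent in GC for -XX:GCTimeRatio=<n>. */
    static double gcTimeFraction(int gcTimeRatio) {
        return 1.0 / (1 + gcTimeRatio);
    }

    public static void main(String[] args) {
        for (int n : new int[] {99, 19, 9}) {
            System.out.printf("GCTimeRatio=%d -> at most %.0f%% of time in GC%n",
                n, 100 * gcTimeFraction(n));   // 1%, 5%, 10%
        }
    }
}
```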

Concurrent
- Intended for medium to large data sets run on multiprocessor/multithreaded hardware where response time is more important than throughput (it costs throughput)
- Trades processor resources for shorter MAJOR collection pause times
- No benefit on single core machines

Concurrent Mark/Sweep
- Meant for apps that require shorter GC pauses and have spare processors
- Good for stateful apps
- NOT for CPU-bound applications (uses extra CPU for the concurrent threads)
- -XX:+UseConcMarkSweepGC

CMS breakdown
- Initial mark – collect root references
  - Stop the world, BUT fast
- Concurrent mark – traverse objects in old space, marking reachable ones (not STW)
- Remark – accounts for references changed during concurrent mark
  - Stop the world, usually longer than initial mark but still short (compare CMS-initial-mark and CMS-remark in the log example)
- Concurrent sweep – scan through old space and reclaim unreachable objects (not STW)
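Why Remark is needed can be shown with a toy tracer: if the application stores a reference to an unmarked object into an object that concurrent mark has already scanned, that object would be swept unless something records the store. The class, the `dirty` set standing in for dirty cards, and the scenario are all hypothetical. This follows the incremental-update style CMS is generally described as using; real CMS works on the card table and also rescans roots during Remark.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CmsPhasesDemo {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    final Set<Obj> marked = new HashSet<>();
    final Set<Obj> dirty = new HashSet<>();     // stand-in for dirty cards

    /** Application store plus a simplified write barrier that records the mutated object. */
    void store(Obj holder, Obj target) {
        holder.refs.add(target);
        dirty.add(holder);
    }

    /** Initial mark (STW, short): only objects referenced directly by roots. */
    void initialMark(List<Obj> roots) { marked.addAll(roots); }

    /** Concurrent mark: trace transitively from the given objects. */
    void trace(Iterable<Obj> from) {
        Deque<Obj> work = new ArrayDeque<>();
        from.forEach(work::push);
        while (!work.isEmpty()) {
            for (Obj ref : work.pop().refs) {
                if (marked.add(ref)) work.push(ref);
            }
        }
    }

    /** Remark (STW): re-trace from marked objects that changed during concurrent mark. */
    void remark() {
        List<Obj> rescan = new ArrayList<>();
        for (Obj o : dirty) if (marked.contains(o)) rescan.add(o);
        dirty.clear();
        trace(rescan);
    }

    /** Concurrent sweep: everything unmarked is garbage. */
    List<Obj> sweep(List<Obj> heap) {
        List<Obj> garbage = new ArrayList<>();
        for (Obj o : heap) if (!marked.contains(o)) garbage.add(o);
        return garbage;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), d = new Obj("d"), junk = new Obj("junk");
        a.refs.add(b);
        List<Obj> heap = List.of(a, b, d, junk);

        CmsPhasesDemo cms = new CmsPhasesDemo();
        cms.initialMark(List.of(a));               // STW: roots only
        cms.trace(new ArrayList<>(cms.marked));    // concurrent: marks b
        cms.store(b, d);                           // app links d into already-scanned b
        cms.remark();                              // STW: the dirty record rescues d
        for (Obj o : cms.sweep(heap)) System.out.println("reclaim " + o.name); // reclaim junk
    }
}
```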

CMS Tips & Tricks
- -XX:+CMSClassUnloadingEnabled – lets CMS unload classes (clean the permanent space)
- Concurrent mode failure: the collection can’t complete concurrently (tenured space runs out) – this forces a full GC!
  - Start CMS before tenured is full: -XX:CMSInitiatingOccupancyFraction=<n>
  - Default is 92%
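The 92% default comes from a formula over two other flags when CMSInitiatingOccupancyFraction is not set: (100 - MinHeapFreeRatio) + CMSTriggerRatio * MinHeapFreeRatio / 100, which with the defaults MinHeapFreeRatio=40 and CMSTriggerRatio=80 gives 60 + 32 = 92. A small sketch with hypothetical helper names; note HotSpot also applies its own start heuristics unless -XX:+UseCMSInitiatingOccupancyOnly is set.

```java
class CmsTriggerDemo {
    /** Initiating occupancy (%) used when -XX:CMSInitiatingOccupancyFraction is not given. */
    static double defaultInitiatingOccupancy(int minHeapFreeRatio, int cmsTriggerRatio) {
        return (100 - minHeapFreeRatio) + cmsTriggerRatio * minHeapFreeRatio / 100.0;
    }

    /** Simplified trigger: start a concurrent cycle once tenured occupancy reaches the threshold. */
    static boolean shouldStartCms(long tenuredUsed, long tenuredCapacity, double initiatingPercent) {
        return tenuredUsed * 100.0 / tenuredCapacity >= initiatingPercent;
    }

    public static void main(String[] args) {
        double fraction = defaultInitiatingOccupancy(40, 80);
        System.out.println(fraction);                              // 92.0
        System.out.println(shouldStartCms(900, 1000, fraction));   // false (90% used)
        System.out.println(shouldStartCms(700, 1000, 70));         // true with =70
    }
}
```

Starting at 92% leaves only 8% of tenured as headroom for promotions while the concurrent cycle runs; lowering the fraction trades earlier (more frequent) cycles for less risk of a concurrent mode failure.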

Demo – -XX:+UseConcMarkSweepGC
- 5.7-7.9ms JVM time (1.5ms improvement over default)
- 41,500-47,600 rpm (10% improvement)
- 112-143% CPU
- 300MB footprint (roughly half of the default)
- 7% GC overhead

Garbage first
- Meant for multiprocessor machines with large memories
- Heap is partitioned into a set of equally sized heap regions
- Allows for a concurrent global marking phase
- Collects mostly empty regions first (hence the name)
- Lots of necessary heap overhead!
- Uses a snapshot-at-the-beginning (SATB) write barrier: everything reachable when marking starts is treated as live, so objects that become unreachable during marking (floating garbage) are simply left for the next cycle
- -XX:+UseG1GC
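"Collects mostly empty regions first" can be sketched as a greedy choice: sort regions by reclaimable bytes and keep adding them while the predicted copying cost stays within the pause budget. The cost model (a fixed number of milliseconds per MB of live data) and all names are hypothetical; real G1 uses a learned pause-prediction model and always includes the young regions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GarbageFirstDemo {
    record Region(int id, long liveBytes, long regionBytes) {
        long garbageBytes() { return regionBytes - liveBytes; }
    }

    /** Greedy collection-set choice: most garbage first, until the pause budget is used up. */
    static List<Region> chooseCollectionSet(List<Region> regions, double pauseBudgetMs, double msPerLiveMb) {
        List<Region> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparingLong(Region::garbageBytes).reversed());
        List<Region> cset = new ArrayList<>();
        double predictedMs = 0;
        for (Region r : sorted) {
            double costMs = r.liveBytes() / (1024.0 * 1024.0) * msPerLiveMb;  // live data must be copied
            if (predictedMs + costMs > pauseBudgetMs) break;
            predictedMs += costMs;
            cset.add(r);
        }
        return cset;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Region> regions = List.of(
            new Region(0, 896 * 1024, mb),   // mostly live: expensive, little to gain
            new Region(1, 128 * 1024, mb),   // mostly garbage: cheap, lots to gain
            new Region(2, 512 * 1024, mb),
            new Region(3, 256 * 1024, mb));
        // Hypothetical cost model: 10 ms per live MB, 5 ms pause budget
        for (Region r : chooseCollectionSet(regions, 5.0, 10.0)) {
            System.out.println("collect region " + r.id() + ", reclaiming " + r.garbageBytes() / 1024 + "K");
        }
        // collect region 1, reclaiming 896K
        // collect region 3, reclaiming 768K
    }
}
```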

G1 details
- Remembered sets → track pointers into a region from other regions (e.g. old generation pointers to young generation objects)
- Uses the same card table algorithm as young collection
- G1 only scans the remembered sets belonging to the regions being collected
- Concurrent marking phase
- Region sizes vary between 1MB and 32MB, with the goal of having ~2048 regions
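The region-size rule can be sketched as: divide the heap by 2048, round down to a power of two, and clamp to [1MB, 32MB]. This is a simplification under stated assumptions: HotSpot bases the calculation on the average of initial and maximum heap size, and -XX:G1HeapRegionSize=<n> overrides it. The class name is hypothetical.

```java
class G1RegionSizeDemo {
    static final long MB = 1024L * 1024L;
    static final long MIN_REGION = 1 * MB, MAX_REGION = 32 * MB;
    static final int TARGET_REGIONS = 2048;

    /** Largest power of two <= heap / 2048, clamped to [1MB, 32MB] (simplified). */
    static long regionSize(long heapBytes) {
        long size = Math.max(heapBytes / TARGET_REGIONS, MIN_REGION);
        size = Long.highestOneBit(size);                 // round down to a power of two
        return Math.min(Math.max(size, MIN_REGION), MAX_REGION);
    }

    public static void main(String[] args) {
        for (long heapMb : new long[] {1024, 4096, 16384, 102400}) {
            long region = regionSize(heapMb * MB);
            System.out.printf("heap %6dMB -> region %2dMB, %d regions%n",
                heapMb, region / MB, heapMb * MB / region);
        }
        // 1GB -> 1MB (1024 regions), 4GB -> 2MB (2048), 16GB -> 8MB (2048), 100GB -> 32MB (3200)
    }
}
```

Small heaps end up with fewer than 2048 regions because of the 1MB floor, and very large heaps with more because of the 32MB ceiling.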

G1 tips
- Highly adaptive: DON’T MESS WITH IT UNLESS YOU NEED TO
- Don’t set the young size – it will override the target pause-time goal
- Will. Add. Overhead.
- Humongous objects (half a region or larger) go into contiguous humongous regions (other regions aren’t contiguous)
  - May get copied back and forth
  - Increase the region size (-XX:G1HeapRegionSize=<n>) if you don’t want them spanning multiple regions
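The effect of region size on humongous objects is simple arithmetic: an object at least half a region in size is humongous and occupies ceil(size / regionSize) contiguous regions. The sketch below (hypothetical names, ignoring the array header) shows a 10MB array going from 10 regions, to 3, to a normal allocation as the region size grows. The exact boundary comparison in HotSpot may differ slightly from the `>=` used here.

```java
class HumongousDemo {
    /** G1 treats objects of at least half a region as humongous (boundary check simplified). */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    /** Number of contiguous regions a humongous object occupies. */
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;   // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long tenMbArray = 10 * mb;   // e.g. new byte[10 * 1024 * 1024], header ignored
        for (long region : new long[] {1 * mb, 4 * mb, 32 * mb}) {
            boolean humongous = isHumongous(tenMbArray, region);
            System.out.printf("region %2dMB: humongous=%b, regions=%d%n",
                region / mb, humongous, humongous ? regionsNeeded(tenMbArray, region) : 0);
        }
        // region 1MB: humongous=true, regions=10
        // region 4MB: humongous=true, regions=3
        // region 32MB: humongous=false, regions=0
    }
}
```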

Demo – -XX:+UseG1GC
- 12-14.5ms JVM time (5ms degradation over default)
- 40,000-45,000 rpm
- 116-136% CPU
- 390-530MB footprint
- 6% GC overhead

Heap considerations – Young space
- Bigger young space → fewer minor collections
- Could be bad, since that might mean more tenured GC
- Objects get early promotion
- Smaller young space → long-lived objects stay in young space longer
- More young GC
- Setting -XX:NewSize equal to -XX:MaxNewSize makes the nursery a fixed size
- IDEAL: big enough to hold more than 1 set of all concurrent request-response cycle objects

Young space - details
- Pause time = time to scan thread stacks + time to scan the card table for dirty cards + time to scan roots in old space + time to copy live objects
- Proportional to the size of old space and the number of live objects in the heap (INCREASE tenured → INCREASE nursery GC time)
- Period between young collections = Eden size / object allocation rate
- Rate of allocation of long-lived objects = approximate aging in survivor * approximate aging in tenured * object allocation rate
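Two of these formulas as arithmetic, with hypothetical numbers (270MB of Eden, an application allocating 90MB/s, and made-up pause components): collections come every Eden / allocation rate seconds, and the pause is the sum of its parts.

```java
class YoungGcMathDemo {
    /** Seconds between young collections = Eden size / allocation rate. */
    static double secondsBetweenYoungGcs(double edenMb, double allocationMbPerSec) {
        return edenMb / allocationMbPerSec;
    }

    /** Young pause = stack scan + card table scan + old-space root scan + copying (all in ms). */
    static double youngPauseMs(double stackScanMs, double cardScanMs, double oldRootScanMs, double copyMs) {
        return stackScanMs + cardScanMs + oldRootScanMs + copyMs;
    }

    public static void main(String[] args) {
        System.out.println(secondsBetweenYoungGcs(270, 90));   // 3.0
        System.out.println(secondsBetweenYoungGcs(540, 90));   // 6.0: doubling Eden halves GC frequency
        System.out.println(youngPauseMs(0.5, 1.5, 0.5, 2.5));  // 5.0
    }
}
```

Doubling Eden halves how often young GC runs, but the card-scan and root-scan terms do not shrink with it, and they grow as tenured grows.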

Heap considerations – Survivor space
- -XX:SurvivorRatio=<n> – ratio of Eden to (one) survivor space size (8 is default)
- Limits how many objects can stay in young space
- Good for stateful apps with lots of long-lived objects
- Too small → premature promotion
- Too big → wasted memory (underutilized)
  - Not enough objects get copied over from Eden
- IDEAL: big enough to hold active + tenuring request objects
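Since the young generation holds Eden plus two survivor spaces and Eden:survivor = n:1, each survivor space gets young / (n + 2). A sketch of that split (hypothetical names, alignment ignored) applied to the roughly 341MB young generation that a 1GB heap with NewRatio=2 yields:

```java
class SurvivorRatioDemo {
    /** Size of ONE survivor space for -XX:SurvivorRatio=<n> (Eden : survivor = n : 1, two survivors). */
    static long survivorSize(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    static long edenSize(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorSize(youngBytes, survivorRatio);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long young = 1024L * mb / 3;             // young gen of a 1GB heap with NewRatio=2
        for (int ratio : new int[] {8, 4, 16}) {
            System.out.printf("SurvivorRatio=%d: eden=%dMB, each survivor=%dMB%n",
                ratio, edenSize(young, ratio) / mb, survivorSize(young, ratio) / mb);
        }
        // SurvivorRatio=8: eden=273MB, each survivor=34MB
    }
}
```

A lower ratio gives bigger survivor spaces (less premature promotion) at the cost of a smaller Eden (more frequent young GC).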

Heap considerations – Tenured
- Tenured too small → objects are copied in young collections longer than needed
- IDEAL: long-lived objects tenure fast
- TIP: tenured space should be able to hold all live data + 10% for over-allocation
- TIP: if you have a small heap → balance the tenuring threshold and survivor space to keep objects in the limited young space
- TIP: if you have a large heap → limit the tenuring threshold (promote early and often)
  - Limit survivor space
  - Increase Eden size (make sure objects have enough space to live and die early)
  - -XX:MaxTenuringThreshold=<n>

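Permanent space considerations
- -XX:PermSize=<n> – sets the initial size of permanent space (-XX:MaxPermSize=<n> sets the maximum)
- TIP: increase it if you need to pre-allocate space for a lot of classes/class loaders
- TIP: check out -XX:+PrintClassHistogram (prints a per-class instance histogram on Ctrl-Break/SIGQUIT); -verbose:class logs each class as it is loaded, which gives a picture of the rate of class loading
- TIP: enable permgen sweep with CMS: -XX:+CMSPermGenSweepingEnabled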
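The rate of class loading can also be sampled from inside the application through the standard `ClassLoadingMXBean`; with short-lived class loaders (as in the app above), a steadily climbing total with no matching unloads is a hint of a class loader leak. The sampling interval and helper name are hypothetical.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

class ClassLoadingDemo {
    /** Classes loaded per second between two samples of getTotalLoadedClassCount(). */
    static double classesPerSecond(long totalBefore, long totalAfter, long elapsedMillis) {
        return elapsedMillis <= 0 ? 0.0 : (totalAfter - totalBefore) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        long t0 = System.nanoTime();
        long before = classes.getTotalLoadedClassCount();
        Thread.sleep(1000);                      // a real monitor would sample periodically
        long after = classes.getTotalLoadedClassCount();
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.printf("currently loaded=%d, total loaded=%d, unloaded=%d, rate=%.1f classes/s%n",
            classes.getLoadedClassCount(), after, classes.getUnloadedClassCount(),
            classesPerSecond(before, after, elapsedMs));
    }
}
```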

Hotspot vs JRockit
- Hotspot has fixed heap geometry
  - Young, tenured, permanent all have fixed addresses
- JRockit has a single heap space, with parts used for the nursery and others for tenured (not contiguous)

TIPS, TIPS, TIPS!
- Avoid thread locals
- Compaction is serial! Buyer, beware!
- Use this formula: compactness * throughput * responsiveness = goal
  - Optimize to increase goal OR
  - Tune to keep goal constant
- MEASURE then TUNE! Don’t prematurely optimize!

MORE TIPS
- Ratio of bytes freed over time
- If you have a stateless, request-centric app with lots of short-lived objects → give it more young space
- If you have a stateful workflow app with old objects → give it more tenured space
- IMMUTABILITY IS KING! (Cleans up nice and fast)

MORE MORE TIPS
- LARGE objects:
  - Are expensive to allocate
  - Are expensive to initialize
  - Can cause fragmentation
- AVOID THEM UNLESS:
  - They are required objects that are expensive to allocate and/or initialize
  - They are scarce resources

Leaks!
- OOM: PermGen space → permanent space too small, or a class leak
- OOM: unable to create new native thread → too many threads and not enough memory left to the OS outside the Java heap. Decrease heap
- OOM: Direct buffer memory → too much memory mapped outside the heap (direct buffers, JNI)
- OOM: GC overhead limit exceeded (too much time in GC) → heap is too small. Increase young space first
- -XX:+HeapDumpOnOutOfMemoryError

GC = best ever? Probably not: http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/

Thank you!
