Slide 1

Slide 1 text

Remedies for oomkills, warm-ups, and slowness in JVM containers

Slide 2

Slide 2 text

Speakers Brice Dutheil @BriceDutheil Jean-Philippe Bempel @jpbempel

Slide 3

Slide 3 text

Agenda ● My container gets oomkilled ● How does the memory actually work ● Some cases in hand ● Container gets respawned ● Things that slow down startup ● Break

Slide 4

Slide 4 text

The containers are restarting. What’s going on ? $ kubectl get pods NAME READY STATUS RESTARTS AGE my-pod-5759f56c55-cjv57 3/3 Running 7 3d1h

Slide 5

Slide 5 text

The containers are restarting. What’s going on ? On Kubernetes, one should inspect the suspicious pod $ kubectl describe pod my-pod-5759f56c55-cjv57 ... State: Running Started: Mon, 06 Jun 2020 13:39:40 +0200 Last State: Terminated Reason: OOMKilled Exit Code: 137 Started: Thu, 06 Jun 2020 09:20:21 +0200 Finished: Mon, 06 Jun 2020 13:39:38 +0200

Slide 6

Slide 6 text

My container gets oomkilled

Slide 7

Slide 7 text

🔥 🔥 🔥🔥 🔥 🔥 🔥 🚨 Crisis mode 🚨 If containers are oomkilled, just increase the container memory limits and investigate later

Slide 8

Slide 8 text

Monitor and setup alerting

Slide 9

Slide 9 text

Monitor the oomkills In a Kubernetes cluster, monitor terminations with metrics ● kube_pod_container_status_last_terminated_reason: if the exit code is 137, the attached reason label will be set to OOMKilled ● Trigger an alert by coupling it with kube_pod_container_status_restarts_total
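A minimal sketch of such an alert expression, assuming the usual kube-state-metrics labels (namespace, pod, container); the 15m window is illustrative:

(increase(kube_pod_container_status_restarts_total[15m]) > 0)
and on (namespace, pod, container)
(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)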

Slide 10

Slide 10 text

Monitor the resident memory of a process (chart: RSS, Heap Max, Heap Liveset, memory limit)

Slide 11

Slide 11 text

Monitor the resident memory of a process Depending on the telemetry library (e.g. Micrometer) you may have these ● Heap Max : jvm_memory_max_bytes ● Heap Live : jvm_memory_bytes_used ● Process RSS : process_memory_rss_bytes And system-level ones, e.g. Kubernetes metrics ● Container RSS : container_memory_rss ● Memory limit : kube_pod_container_resource_limits_memory_bytes

Slide 12

Slide 12 text

💡 Pay attention to the unit Difference between 1 MB and 1 MiB ?

Slide 13

Slide 13 text

💡 Pay attention to the unit The SI notation, decimal-based : 1 MB reads as megabyte and means 1000² bytes The IEC notation, binary-based : 1 MiB reads as mebibyte and means 1024² bytes ⚠ The JVM uses the binary notation, but with the legacy units KB, MB, etc. OS command-line tools generally use the binary notation. https://en.wikipedia.org/wiki/Binary_prefix#/media/File:Binaryvdecimal.svg At gigabyte scale the difference is almost 7%: 1 GB ≃ 0.93 GiB

Slide 14

Slide 14 text

Oomkilled ?

Slide 15

Slide 15 text

Oomkilled ? Is it a memory leak ? Is it misconfiguration ?

Slide 16

Slide 16 text

Linux Oomkiller ● Out Of Memory Killer: a Linux mechanism employed to kill processes when memory is critically low ● For regular processes, the oomkiller selects victims based on their badness score ● Within a constrained container, i.e. with memory limits, ○ if the available memory in the container reaches 0, the oomkiller terminates its processes ○ there is usually a single process per container

Slide 17

Slide 17 text

Linux oomkiller Oomkills can be reproduced synthetically docker run --memory-swap=100m --memory=100m \ --rm -it azul/zulu-openjdk:11 \ java -Xms100m -XX:+AlwaysPreTouch --version

Slide 18

Slide 18 text

Linux oomkiller And in the system logs $ tail -50 -f $HOME/Library/Containers/com.docker.docker/Data/log/vm/console.log ... [ 6744.445271] java invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 ... [ 6744.451951] Memory cgroup out of memory: Killed process 4379 (java) total-vm:3106656kB, anon-rss:100844kB, file-rss:15252kB, shmem-rss:0kB, UID:0 pgtables:432kB oom_score_adj:0 [ 6744.473995] oom_reaper: reaped process 4379 (java), now anon-rss:0kB, file-rss:32kB, shmem-rss:0kB ...

Slide 19

Slide 19 text

Oomkilled ? Is it a memory leak ? or … Is it misconfiguration ?

Slide 20

Slide 20 text

Memory of a process

Slide 21

Slide 21 text

Memory of a JVM process

Slide 22

Slide 22 text

Memory As JVM-based developers ● We are used to thinking about JVM Heap sizing, mostly -Xms, -Xmx, … ● Possibly some deployments use container-aware flags: -XX:MinRAMPercentage, -XX:MaxRAMPercentage, … JVM Heap Xmx or MaxRAMPercentage

Slide 23

Slide 23 text

Why should I be concerned by native? ● …but we tend to ignore the other memory zones 💡 Referred to as native memory, or on the JVM as off-heap memory

Slide 24

Slide 24 text

Why should I be concerned by native? JDK9 landed container support! You still need to do the cross-multiplication yourself.

Slide 25

Slide 25 text

Why should I be concerned by native? Still have no idea what is happening off-heap JVM Heap ❓

Slide 26

Slide 26 text

Why should I be concerned by native? Still have no idea what is happening off-heap JVM Heap https://giphy.com/gifs/bitcoin-crypto-blockchain-trN9ht5RlE3Dcwavg2

Slide 27

Slide 27 text

If you don’t know what’s there, … how can you properly size the heap or the container ? JVM Heap

Slide 28

Slide 28 text

JVM Memory Breakdown Running A JVM requires memory: ● The Java Heap

Slide 29

Slide 29 text

JVM Memory Breakdown Running A JVM requires memory: ● The Java Heap ● …

Slide 30

Slide 30 text

JVM Memory Breakdown Running A JVM requires memory: ● The Java Heap ● The Meta Space (pre-JDK 8 the Permanent Generation) ● …

Slide 31

Slide 31 text

JVM Memory Breakdown Running A JVM requires memory: ● The Java Heap ● The Meta Space (pre-JDK 8 the Permanent Generation) ● Direct byte buffers ● Code cache (compiled code) ● Garbage Collector (like card table) ● Compiler (C1/C2) ● Symbols ● etc.

Slide 32

Slide 32 text

JVM Memory Breakdown Running A JVM requires memory: ● The Java Heap ● The Meta Space (pre-JDK 8 the Permanent Generation) ● Direct byte buffers ● Code cache (compiled code) ● Garbage Collector (like card table) ● Compiler (C1/C2) ● Threads ● Symbols ● etc. JVM subsystems

Slide 33

Slide 33 text

JVM Memory Breakdown Except for a few flags for the metaspace, the code cache, or direct memory, there’s no control over the memory consumption of the other components. But it is possible to get their size at runtime.

Slide 34

Slide 34 text

Let’s try first to monitor

Slide 35

Slide 35 text

Let’s try first to monitor Eg with micrometer time series ● jvm_memory_used_bytes ● jvm_memory_committed_bytes ● jvm_memory_max_bytes Dimensions ● area : heap or nonheap ● id : memory zone, depends on GC and JVM jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'profiled nmethods'",} 8231168.0 jvm_memory_used_bytes{area="heap",id="G1 Survivor Space",} 5242880.0 jvm_memory_used_bytes{area="heap",id="G1 Old Gen",} 1.164288E7 jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 4.180964E7 jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-nmethods'",} 1233536.0 jvm_memory_used_bytes{area="heap",id="G1 Eden Space",} 1.2582912E7 jvm_memory_used_bytes{area="nonheap",id="Compressed Class Space",} 5207416.0 jvm_memory_used_bytes{area="nonheap",id="CodeHeap 'non-profiled nmethods'",} 1590528.0

Slide 36

Slide 36 text

Let’s try first to monitor Don’t forget the JVM native buffers ● jvm_buffer_total_capacity_bytes

Slide 37

Slide 37 text

Monitor them

Slide 38

Slide 38 text

Monitor them RSS k8s memory limit Heap max

Slide 39

Slide 39 text

Monitor them Eden Old gen

Slide 40

Slide 40 text

Monitor them JVM off-heap pools Not really practical to look at

Slide 41

Slide 41 text

Monitor them 💡Stack the pools (if supported by your observability tool) Missing Data RSS Stacked pools

Slide 42

Slide 42 text

Monitoring is only as good as the data that is there Observability metrics rely on MBeans to get the memory areas Most JVMs don’t export metrics for everything that uses memory

Slide 43

Slide 43 text

Time to investigate the footprint With diagnostic tools

Slide 44

Slide 44 text

RSS is the real footprint $ ps o pid,rss -p $(pidof java) PID RSS 6 4701120

Slide 45

Slide 45 text

jcmd – a swiss knife Who knows ? Who used it already ?

Slide 46

Slide 46 text

jcmd – a swiss knife Get the actual flag values $ jcmd $(pidof java) VM.flags | tr ' ' '\n' 6: ... -XX:InitialHeapSize=4563402752 -XX:InitialRAMPercentage=85.000000 -XX:MarkStackSize=4194304 -XX:MaxHeapSize=4563402752 -XX:MaxNewSize=2736783360 -XX:MaxRAMPercentage=85.000000 -XX:MinHeapDeltaBytes=2097152 -XX:NativeMemoryTracking=summary ... PID Xms Xmx

Slide 47

Slide 47 text

JVM’s Native Memory Tracking 1. Start the JVM with -XX:NativeMemoryTracking=summary 2. Later run jcmd $(pidof java) VM.native_memory Modes ● summary ● detail ● baseline/ diff
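For example, the baseline / diff mode looks like this (the PID lookup mirrors the other examples in this deck):

$ jcmd $(pidof java) VM.native_memory baseline
# ... let the application run for a while ...
$ jcmd $(pidof java) VM.native_memory summary.diff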

Slide 48

Slide 48 text

$ jcmd $(pidof java) VM.native_memory 6: Native Memory Tracking: Total: reserved=7168324KB, committed=5380868KB - Java Heap (reserved=4456448KB, committed=4456448KB) (mmap: reserved=4456448KB, committed=4456448KB) - Class (reserved=1195628KB, committed=165788KB) (classes #28431) ( instance classes #26792, array classes #1639) (malloc=5740KB #87822) (mmap: reserved=1189888KB, committed=160048KB) ( Metadata: ) ( reserved=141312KB, committed=139876KB) ( used=135945KB) ( free=3931KB) ( waste=0KB =0.00%) ( Class space:) ( reserved=1048576KB, committed=20172KB) ( used=17864KB) ( free=2308KB) ( waste=0KB =0.00%) - Thread (reserved=696395KB, committed=85455KB) (thread #674) (stack: reserved=692812KB, committed=81872KB) (malloc=2432KB #4046) (arena=1150KB #1347) - Code (reserved=251877KB, committed=105201KB) (malloc=4189KB #11718) (mmap: reserved=247688KB, committed=101012KB) - GC (reserved=230739KB, committed=230739KB) (malloc=32031KB #63631) (mmap: reserved=198708KB, committed=198708KB) - Compiler (reserved=5914KB, committed=5914KB) (malloc=6143KB #3281) (arena=180KB #5) - Internal (reserved=24460KB, committed=24460KB) (malloc=24460KB #13140) - Other (reserved=267034KB, committed=267034KB)

Slide 49

Slide 49 text

$ jcmd $(pidof java) VM.native_memory 6: Native Memory Tracking: Total: reserved=7168324KB, committed=5380868KB - Java Heap (reserved=4456448KB, committed=4456448KB) (mmap: reserved=4456448KB, committed=4456448KB) - Class (reserved=1195628KB, committed=165788KB) (classes #28431) ( instance classes #26792, array classes #1639) (malloc=5740KB #87822) (mmap: reserved=1189888KB, committed=160048KB) ( Metadata: ) ( reserved=141312KB, committed=139876KB) ( used=135945KB) ( free=3931KB) ( waste=0KB =0.00%) ( Class space:) ( reserved=1048576KB, committed=20172KB) ( used=17864KB) ( free=2308KB) ( waste=0KB =0.00%) - Thread (reserved=696395KB, committed=85455KB) (thread #674) (stack: reserved=692812KB, committed=81872KB) (malloc=2432KB #4046) (arena=1150KB #1347) - Code (reserved=251877KB, committed=105201KB) (malloc=4189KB #11718) (mmap: reserved=247688KB, committed=101012KB) - GC (reserved=230739KB, committed=230739KB) (malloc=32031KB #63631) (mmap: reserved=198708KB, committed=198708KB) - Compiler (reserved=5914KB, committed=5914KB) (malloc=6143KB #3281) (arena=180KB #5) - Internal (reserved=24460KB, committed=24460KB) (malloc=24460KB #13140) - Other (reserved=267034KB, committed=267034KB) (classes #28431) (thread #674) Java Heap (reserved=4456448KB, committed=4456448KB)

Slide 50

Slide 50 text

$ jcmd $(pidof java) VM.native_memory 6: Native Memory Tracking: Total: reserved=7168324KB, committed=5380868KB - Java Heap (reserved=4456448KB, committed=4456448KB) (mmap: reserved=4456448KB, committed=4456448KB) - Class (reserved=1195628KB, committed=165788KB) (classes #28431) ( instance classes #26792, array classes #1639) (malloc=5740KB #87822) (mmap: reserved=1189888KB, committed=160048KB) ( Metadata: ) ( reserved=141312KB, committed=139876KB) ( used=135945KB) ( free=3931KB) ( waste=0KB =0.00%) ( Class space:) ( reserved=1048576KB, committed=20172KB) ( used=17864KB) ( free=2308KB) ( waste=0KB =0.00%) - Thread (reserved=696395KB, committed=85455KB) (thread #674) (stack: reserved=692812KB, committed=81872KB) (malloc=2432KB #4046) (arena=1150KB #1347) - Code (reserved=251877KB, committed=105201KB) (malloc=4189KB #11718) (mmap: reserved=247688KB, committed=101012KB) - GC (reserved=230739KB, committed=230739KB) (malloc=32031KB #63631) (mmap: reserved=198708KB, committed=198708KB) - Compiler (reserved=5914KB, committed=5914KB) (malloc=6143KB #3281) (arena=180KB #5) - Internal (reserved=24460KB, committed=24460KB) (malloc=24460KB #13140) - Other (reserved=267034KB, committed=267034KB) (malloc=267034KB #631) - Symbol (reserved=28915KB, committed=28915KB) (malloc=25423KB #330973) (arena=3492KB #1) - Native Memory Tracking (reserved=8433KB, committed=8433KB) (malloc=117KB #1498) (tracking overhead=8316KB) - Arena Chunk (reserved=217KB, committed=217KB) (malloc=217KB) - Logging (reserved=7KB, committed=7KB) (malloc=7KB #266) - Arguments (reserved=19KB, committed=19KB) (malloc=19KB #521) Total: reserved=7168324KB, committed=5380868KB Class (reserved=1195628KB, committed=165788KB) Thread (reserved=696395KB, committed=85455KB) Code (reserved=251877KB, committed=105201KB) GC (reserved=230739KB, committed=230739KB) Compiler (reserved=5914KB, committed=5914KB) Internal (reserved=24460KB, committed=24460KB) Other (reserved=267034KB, committed=267034KB)

Slide 51

Slide 51 text

Direct byte buffers These are memory segments that are allocated outside the Java heap. Unused buffers are only freed upon GC. Netty, for example, uses them. ● < JDK 11, they are reported in the Internal section ● ≥ JDK 11, they are reported in the Other section Internal (reserved=24460KB, committed=24460KB) Other (reserved=267034KB, committed=267034KB)

Slide 52

Slide 52 text

Garbage Collection GC is actually more than only taking care of the garbage. It’s a full blown memory management for the Java Heap, and it requires memory for its internal data structures (E.g. for G1 regions, remembered sets, etc.) On small containers this might be a thing to consider GC (reserved=230739KB, committed=230739KB)

Slide 53

Slide 53 text

Threads Threads also appear to take some space Thread (reserved=696395KB, committed=85455KB) (thread #674) (stack: reserved=692812KB, committed=81872KB) (malloc=2432KB #4046) (arena=1150KB #1347)

Slide 54

Slide 54 text

Native Memory Tracking 👍 Good insights on the JVM sub-systems

Slide 55

Slide 55 text

Native Memory Tracking Good insights on the JVM sub-systems, but Does NMT show everything ? Is NMT data correct ? ⚠ Careful about the overhead! Measure if this is important for you !

Slide 56

Slide 56 text

Huh, what virtual, committed, reserved memory? virtual memory : memory management technique that provides an "idealized abstraction of the storage resources that are actually available on a given machine" which "creates the illusion to users of a very large memory". reserved memory : contiguous chunk of the virtual address space that the program requested from the OS. committed memory : writable subset of the reserved memory, which might be backed by physical storage

Slide 57

Slide 57 text

Native Memory Tracking Basically, what NMT shows is this

Slide 58

Slide 58 text

Native Memory Tracking Basically, what NMT shows is this

Slide 59

Slide 59 text

Native Memory Tracking Basically, what NMT shows is this

Slide 60

Slide 60 text

Huh, what virtual, committed, reserved memory? used heap : amount of memory occupied by live objects and, to a certain extent, objects that are unreachable but not yet collected by the GC committed heap : the size of the writable heap memory where the JVM can write objects; this value sits between the -Xms and -Xmx values heap max size : the limit of the heap (-Xmx) #JVM

Slide 61

Slide 61 text

Native Memory Tracking Basically, what NMT shows is how the JVM subsystems are using the available space

Slide 62

Slide 62 text

Native Memory Tracking Good insights on the JVM sub-systems, but Does NMT show everything ? Is NMT data correct ?

Slide 63

Slide 63 text

So virtual memory ?

Slide 64

Slide 64 text

Virtual memory ? Virtual memory implies memory management. It is an OS feature ● to maximize the utilization of the physical RAM ● to reduce the complexity of handling shared access to physical RAM by providing processes an abstraction of the available memory

Slide 65

Slide 65 text

Virtual memory On Linux, memory is split into pages (usually 4 KiB) Pages that have never been used remain virtual, that is, without physical storage Used pages are called resident memory

Slide 66

Slide 66 text

Virtual memory The numbers shown in NMT are actually about what the JVM asked for. Total: reserved=7168324KB, committed=5380868KB

Slide 67

Slide 67 text

Virtual memory Not the real memory usage Total: reserved=7168324KB, committed=5380868KB

Slide 68

Slide 68 text

Native Memory Tracking Good insights on the JVM sub-systems, but Does NMT show everything ? Nope Is NMT data correct ? Yes, but not for resident memory usage

Slide 69

Slide 69 text

What does it mean for JVM flags ? For the Java heap, -Xms / -Xmx ⇒ an indication of how much heap memory is reserved

Slide 70

Slide 70 text

What does it mean for JVM flags ? For the Java heap, -Xms / -Xmx ⇒ an indication of how much heap memory is reserved Also -XX:MaxPermSize, -XX:MaxMetaspaceSize, -Xss, -XX:MaxDirectMemorySize ⇒ an indication of how much memory is/can be reserved These flags do have a big impact on the JVM subsystems, as they may or may not trigger some behaviors, like : - GC if the metaspace is too small - Heap resizing if Xms ≠ Xmx - …

Slide 71

Slide 71 text

💡Memory mapped files They are not reported by Native Memory Tracking, yet they can be accounted for in the RSS.

Slide 72

Slide 72 text

💡Memory mapped files In Java, using FileChannel.read alone, ⇒ Rely on the native OS read method (in unistd.h) ⇒ Use the OS page cache

Slide 73

Slide 73 text

💡Memory mapped files In Java, using FileChannel.read alone, ⇒ Rely on the native OS read method (in unistd.h) ⇒ Use the OS page cache But using FileChannel.map(MapMode, pos, length) ⇒ Rely on the mmap OS method (in sys/mman.h) ⇒ Load the requested content into the addressable space of the process ❌
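A minimal Java sketch of the second case (file name and read size are illustrative); the mapped pages do not appear in NMT but become resident once touched:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedFile {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Path.of("large-file.tar.xz"), StandardOpenOption.READ)) {
            // mmap under the hood: the file content becomes part of the process address space
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] chunk = new byte[8192];
            buffer.get(chunk); // touching the mapping makes these pages resident (RSS)
        }
    }
}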

Slide 74

Slide 74 text

pmap To really deep dive you need to explore the memory mapping ● Via /proc/{pid}/smaps ● Or via pmap [-x|-X] {pid} Address Kbytes RSS Dirty Mode Mapping ... 00007fe51913b000 572180 20280 0 r--s- large-file.tar.xz ...

Slide 75

Slide 75 text

How to configure memory requirement preprod prod-asia prod-eu prod-us jvm-service jvm-service jvm-service jvm-service

Slide 76

Slide 76 text

How to configure memory requirement Is it possible to extract a formula ?

Slide 77

Slide 77 text

How to configure memory requirement Different environments ⇒ different load, different subsystem behavior E.g. preprod : 100 req/s 👉 40 java threads total 👉 mostly liveness endpoints 👉 low versatility in data 👉 low GC activity requirement prod-us : 1000 req/s 👉 200 java threads total 👉 mostly business endpoints 👉 variance in data 👉 higher GC requirements

Slide 78

Slide 78 text

How to configure memory requirement Is it possible to extract a formula ? Not that straightforward. Some might point to the -XX:*RAMPercentage flags, which set the Java heap size as a function of the available physical memory. It works.
⚠ -XX:InitialRAMPercentage ⟹ -Xms
mem < 96 MiB : -XX:MinRAMPercentage ⟹ -Xmx
mem > 96 MiB : -XX:MaxRAMPercentage ⟹ -Xmx
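One way to check what heap a given percentage actually yields, reusing the image from the earlier docker example (memory size and percentage are illustrative):

$ docker run --memory=1g --rm azul/zulu-openjdk:11 \
    java -XX:MaxRAMPercentage=85 -XX:+PrintFlagsFinal -version | grep -E 'MaxHeapSize|MaxRAMPercentage'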

Slide 79

Slide 79 text

How to configure memory requirement Is it possible to extract a formula ? 1 GiB 4 GiB prod-us preprod MaxRAMPercentage = 85 Java heap Java heap

Slide 80

Slide 80 text

How to configure memory requirement Is it possible to extract a formula ? 1 GiB 4 GiB prod-us preprod Java heap MaxRAMPercentage = 85 Java heap

Slide 81

Slide 81 text

How to configure memory requirement Is it possible to extract a formula ? 1 GiB 4 GiB prod-us preprod Java heap ≃ 850 MiB Java heap ≃ 3.40 GiB MaxRAMPercentage = 85

Slide 82

Slide 82 text

How to configure memory requirement Is it possible to extract a formula ? 1 GiB 4 GiB prod-us preprod Java heap ≃ 850 MiB Java heap ≃ 3.4 GiB MaxRAMPercentage = 85 ~150 MiB left for all the other subsystems: maybe OK for quiet workloads ~600 MiB left for all the other subsystems: likely not enough for loaded systems ⟹ leads to oomkills

Slide 83

Slide 83 text

How to configure memory requirement Traffic and load are not linear, and do not have linear effects ● MaxRAMPercentage is a linear function of the container’s available RAM ● Too low a MaxRAMPercentage ⟹ waste of space ● Too high a MaxRAMPercentage ⟹ risk of oomkills ● Requires finding the sweet spot for all deployments ● Requires adjusting if the load changes ● Need to convert a percentage back to a raw value

Slide 84

Slide 84 text

How to configure memory requirement The -XX:*RAMPercentage flags sort of work, but their drawbacks don’t make them quite compelling. ✅ Prefer -Xms / -Xmx

Slide 85

Slide 85 text

How to configure memory requirement Let’s have a look at the actual measures RSS

Slide 86

Slide 86 text

How to configure memory requirement If Xms and Xmx have the same size, heap is fixed, so focus on “native” memory RSS ● GC internals ● Threads ● Direct memory buffers ● Mapped file buffers ● Metaspace ● Code cache ● …

Slide 87

Slide 87 text

How to configure memory requirement It is very hard to predict the actual requirement for all of these. Can we add up the values of these zones? Yes, but it’s not really maintainable. Don’t mess with it until you actually need to! E.g. for each JVM subsystem, you’d need a deep understanding to predict the actual size. It’s hard, and requires deep knowledge of the JVM. Just don’t !

Slide 88

Slide 88 text

How to configure memory requirement In our experience it’s best to actually retrofit. What does it mean ? Give a larger memory limit to the container, much higher than the max heap size. Heap Container memory limit at 5 GiB

Slide 89

Slide 89 text

How to configure memory requirement In our experience it’s best to actually retrofit. What does it mean ? Give a larger memory limit to the container, much higher than the max heap size. 1. Observe the RSS evolution Heap RSS Container memory limit at 5 GiB

Slide 90

Slide 90 text

How to configure memory requirement In our experience it’s best to actually retrofit. What does it mean ? Give a larger memory limit to the container, much higher than the max heap size. 1. Observe the RSS evolution 2. If RSS stabilizes after some time Heap RSS RSS stabilizing Container memory limit at 5 GiB

Slide 91

Slide 91 text

How to configure memory requirement In our experience it’s best to actually retrofit. What does it mean ? Give a larger memory limit to the container, much higher than the max heap size. 1. Observe the RSS evolution 2. If RSS stabilizes after some time 3. Set the new memory limit with enough leeway (eg 200 MiB) Heap RSS New memory limit With some leeway for RSS increase Container memory limit at 5 GiB
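For instance, if the RSS stabilizes around 4.3 GiB, the retrofitted limit could look like this (numbers purely illustrative):

resources:
  limits:
    memory: 4608Mi   # ~4.3 GiB observed RSS + ~200 MiB of leeway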

Slide 92

Slide 92 text

How to configure memory requirement If the graphs show this RSS less than Heap size 🧐 RSS

Slide 93

Slide 93 text

How to configure memory requirement If the graphs show this RSS less than Heap size 🧐 Remember virtual memory! If a page has not been used, then it’s virtual RSS Java heap untouched RSS

Slide 94

Slide 94 text

How to configure memory requirement ⚠ If the Java heap is not fully used (and thus not fully in RSS), the RSS measurement of the max memory utilisation will be wrong To avoid the virtual memory pitfall use -XX:+AlwaysPreTouch All Java heap pages touched RSS
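For example (heap size and application are illustrative), a fixed, pre-touched heap makes the whole Java heap resident from the start:

$ java -Xms4g -Xmx4g -XX:+AlwaysPreTouch -jar app.jar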

Slide 95

Slide 95 text

Memory consumption still looks too big 😩

Slide 96

Slide 96 text

Memory consumption still looks too big 😩 Case in hand with Netty

Slide 97

Slide 97 text

Case in hand : Netty Buffers ● Handles pools of DirectByteBuffers (simplified) ● Allocates large chunks and subdivides them to satisfy allocations Problem: the more requests to handle, the more buffers it may allocate and the more direct (native) memory it consumes If not capped ⟹ OOMKill

Slide 98

Slide 98 text

Controlling Netty Buffers ● JVM options -XX:MaxDirectMemorySize Hard limit on direct ByteBuffers total size Throws OutOfMemoryError ● Property io.netty.maxDirectMemory to control only Netty buffers? ⇒ No, it’s more complicated
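A sketch of the JVM-level hard cap (the 512m value and app.jar are illustrative):

$ java -XX:MaxDirectMemorySize=512m -jar app.jar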

Slide 99

Slide 99 text

Controlling Netty Buffers Properties controlling the Netty buffer pool ThreadLocal caches! Depends on the number of threads in the EventLoops ● io.netty.allocator.cacheTrimInterval ● io.netty.allocator.useCacheForAllThreads ● io.netty.allocation.cacheTrimIntervalMillis ● io.netty.allocator.maxCachedBufferCapacity ● io.netty.allocator.numDirectArenas

Slide 100

Slide 100 text

Controlling Netty Buffers ThreadLocal caches! Depends on the number of threads in the EventLoops

private static EventLoopGroup getBossGroup(boolean useEpoll) {
    // The event-loop thread count drives how many thread-local buffer caches exist
    if (useEpoll) {
        return new EpollEventLoopGroup(NB_THREADS);
    } else {
        return new NioEventLoopGroup(NB_THREADS);
    }
}

Slide 101

Slide 101 text

Shaded Netty Buffers Beware of multiple shaded Netty libraries They share the same properties!

Slide 102

Slide 102 text

Controlling Netty Buffers

Slide 103

Slide 103 text

Case in hand : native allocator Small but steady RSS increase

Slide 104

Slide 104 text

Case in hand : native allocator If something doesn’t add up : check the native allocator. But why ? To get memory, any program must either ● call the OS asking for a memory mapping via the mmap function ● call the C standard library malloc function On Linux, the standard library = glibc

Slide 105

Slide 105 text

Case in hand : native allocator The glibc malloc manages memory via a technique called arena memory management Unfortunately there’s no serviceability tooling around glibc arena management (unless the program is modified to call the C API) It may be possible to extrapolate things using a tool like pmap

Slide 106

Slide 106 text

Case in hand : native allocator Analyzing memory mapping 00007fe164000000 2736 2736 2736 rw--- [ anon ] 00007fe1642ac000 62800 0 0 ----- [ anon ] Virtual 64 MiB RSS ~ 2.6 MiB

Slide 107

Slide 107 text

Case in hand : native allocator Analyzing memory mapping 00007fe164000000 2736 2736 2736 rw--- [ anon ] 00007fe1642ac000 62800 0 0 ----- [ anon ] Virtual 64 MiB x 257 ⟹ RSS ~1.2 GiB RSS ~ 2.6 MiB

Slide 108

Slide 108 text

Case in hand : native allocator ● Glibc reacts to the number of CPUs and application threads ● On each access there’s a lock ● A higher number of threads ⟹ higher contention on the arenas ⟹ leads glibc to create more arenas ● There are some tuning options, in particular MALLOC_ARENA_MAX, M_MMAP_THRESHOLD, … ⚠ Requires a significant understanding of how glibc’s malloc works, allocation sizes, etc.
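A minimal sketch of the classic mitigation via the glibc MALLOC_ARENA_MAX environment variable (the value is illustrative and should be validated under load):

$ export MALLOC_ARENA_MAX=2
$ java -jar app.jar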

Slide 109

Slide 109 text

Case in hand : native allocator Better solution, change the application’s native allocator ● tcmalloc from Google’s gperftools ● jemalloc from Facebook ● mimalloc from Microsoft LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so

Slide 110

Slide 110 text

Case in hand : native allocator

Slide 111

Slide 111 text

Case in hand : native allocator If using tcmalloc or jemalloc, you are one step away from native allocation profiling. Useful to narrow down a native memory leak.
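A sketch of native allocation profiling with jemalloc, assuming a jemalloc build with profiling enabled (library path, options, and output prefix are illustrative, taken from jemalloc's MALLOC_CONF knobs):

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so \
  MALLOC_CONF=prof:true,lg_prof_interval:30,prof_prefix:jeprof.out \
  java -jar app.jar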

Slide 112

Slide 112 text

My container gets re-spawned

Slide 113

Slide 113 text

Container Restarted

Slide 114

Slide 114 text

Demo minikube + Petclinic

Slide 115

Slide 115 text

Quick Fix Increase the liveness probe timeout (either the initial delay or the interval)

livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10

Slide 116

Slide 116 text

JIT Compilation

Slide 117

Slide 117 text

Troubleshooting Compile Time Use jstat -compiler to see the cumulative compilation time (in seconds) $ jstat -compiler 1 Compiled Failed Invalid Time FailedType FailedMethod 6002 0 0 101.16 0

Slide 118

Slide 118 text

Troubleshooting using JFR Use java -XX:StartFlightRecording, then dump and inspect the recording:
jcmd 1 JFR.dump name=1 filename=petclinic.jfr
jfr print --events jdk.CompilerStatistics petclinic.jfr

Slide 119

Slide 119 text

Troubleshooting using JFR

Slide 120

Slide 120 text

Measuring startup time docker run --cpus= -ti spring-petclinic

CPUs | JVM startup time (s) | Compile time (s)
4    | 8.402                | 17.36
2    | 8.458                | 10.17
1    | 15.797               | 20.22
0.8  | 20.731               | 21.71
0.4  | 41.55                | 46.51
0.2  | 86.279               | 92.93

Slide 121

Slide 121 text

C1 vs C2

                      | C1 + C2 | C1 only
# compiled methods    | 6,117   | 5,084
# C1 compiled methods | 5,254   | 5,084
# C2 compiled methods | 863     | 0
Total Time (ms)       | 21,678  | 1,234
Total Time in C1 (ms) | 2,071   | 1,234
Total Time in C2 (ms) | 19,607  | 0

Slide 122

Slide 122 text

TieredCompilation Interpreter (compilation level 0) ➟ C1 + Profiling (level 3) ➟ C2 (level 4)

Slide 123

Slide 123 text

TieredCompilation queues (diagram: methods M1–M5 waiting in the C1 and C2 compilation queues)

Slide 124

Slide 124 text

TieredCompilation Heuristics Level transitions: ● 0 ➟ 2 ➟ 3 ➟ 4 (C2 queue too long) ● 0 ➟ (3 ➟ 2) ➟ 4 (C1 queue too long, level changed in-queue) ● 0 ➟ (3 or 2) ➟ 1 (trivial method or cannot be compiled by C2) ● 0 ➟ 4 (cannot be compiled by C1) Note: level 3 is 30% slower than level 2 Levels: 0 Interpreter, 1 C1, 2 C1 + Limited Profiling, 3 C1 + Profiling, 4 C2

Slide 125

Slide 125 text

Compiler Settings To only use C1 JIT compiler: -XX:TieredStopAtLevel=1 To adjust C2 compiler threads: -XX:CICompilerCount=

Slide 126

Slide 126 text

Measuring startup time docker run --cpus= -ti spring-petclinic

CPUs | JVM startup time (s) | Compile time (s) | JVM startup time (s) with -XX:TieredStopAtLevel=1 | Compile time (s)
4    | 8.402  | 17.36 | 6.908 (-18%)  | 1.47
2    | 8.458  | 10.17 | 6.877 (-19%)  | 1.41
1    | 15.797 | 20.22 | 8.821 (-44%)  | 1.74
0.8  | 20.731 | 21.71 | 10.857 (-48%) | 2.08
0.4  | 41.55  | 46.51 | 22.225 (-47%) | 3.67
0.2  | 86.279 | 92.93 | 45.706 (-47%) | 6.95

Slide 127

Slide 127 text

GC

Slide 128

Slide 128 text

Troubleshooting GC Use -Xlog:gc / -XX:+PrintGCDetails
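For example (app.jar is a placeholder):

$ java -Xlog:gc -jar app.jar            # JDK 9+ unified logging, -Xlog:gc* for more detail
$ java -XX:+PrintGCDetails -jar app.jar # JDK 8 and earlier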

Slide 129

Slide 129 text

Troubleshooting GC with JFR/JMC

Slide 130

Slide 130 text

Setting the GC properly: Metadata Threshold To avoid Full GCs triggered by loading more classes and by Metaspace resizes: set the initial Metaspace size high enough to load all your required classes -XX:MetaspaceSize=512M

Slide 131

Slide 131 text

Setting the GC properly Use a fixed heap size : -Xms = -Xmx (-XX:InitialHeapSize = -XX:MaxHeapSize) Heap resizing is done during Full GC for SerialGC & ParallelGC. G1 is able to resize without a Full GC (the regions, not the metaspace)
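For instance (sizes and application are illustrative):

$ java -Xms2g -Xmx2g -jar app.jar
# equivalent: -XX:InitialHeapSize=2g -XX:MaxHeapSize=2g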

Slide 132

Slide 132 text

GC ergonomics: GC selection To verify in the GC log (-Xlog:gc):

CPU | Memory | GC
< 2 | < 2 GB | Serial
≥ 2 | < 2 GB | Serial
< 2 | ≥ 2 GB | Serial
≥ 2 | ≥ 2 GB | Parallel(

Slide 133

Slide 133 text

GC ergonomics: # threads selection -XX:ParallelGCThreads= Used for parallelizing work during STW phases

# physical cores | ParallelGCThreads
≤ 8              | # cores
> 8              | 8 + ⅝ * (# cores - 8)

Slide 134

Slide 134 text

GC ergonomics: # threads selection -XX:ConcGCThreads= Used for concurrent work while the application is running

G1         | Max((ParallelGCThreads + 2) / 4, 1)
Shenandoah | ¼ # cores
ZGC        | ¼ if dynamic, or ⅛ # cores

Slide 135

Slide 135 text

CPU resource tuning

Slide 136

Slide 136 text

CPU Resources shares, quotas ?

Slide 137

Slide 137 text

CPU shares Sharing CPU among the containers of a node Corresponds to Requests in Kubernetes Allows using all the CPUs if needed, sharing with all the other containers

resources:
  requests:
    cpu: 500m
$ cat /sys/fs/cgroup/cpu.weight
20

resources:
  requests:
    cpu: 250m
$ cat /sys/fs/cgroup/cpu.weight
10

Slide 138

Slide 138 text

CPU quotas Fixing the limit of CPU used by a container Corresponds to Limits in Kubernetes

resources:
  limits:
    cpu: 500m
$ cat /sys/fs/cgroup/cpu.max
50000 100000

resources:
  limits:
    cpu: 250m
$ cat /sys/fs/cgroup/cpu.max
25000 100000

Slide 139

Slide 139 text

Shares / Quotas CPU is shared among multiple processes. Ill-behaved processes could consume all the computing bandwidth. Cgroups help prevent that but require defining boundaries. (diagram: process A at 100% on every core while B and C wait to be scheduled 🚦)

Slide 140

Slide 140 text

Shares / Quotas The lower bound of a CPU request is called shares. A CPU core is divided into 1024 “slices”. A host with 4 CPUs will have 4096 shares.

Slide 141

Slide 141 text

Shares / Quotas Programs also have the notion of shares. The OS distributes these computing slices proportionally. Process asking for 1432 shares (~1.4 CPU) + process asking for 2048 shares (2 CPU) + process asking for 616 shares = 4096 Each is guaranteed to have what it asked for.

Slide 142

Slide 142 text

Shares / Quotas Programs also have the notion of shares. The OS distributes these computing slices proportionally. Process asking for 1432 shares (~1.4 CPU) + process asking for 2048 shares (2 CPU) + process asking for 616 shares = 4096 Each is guaranteed to have what it asked for. 💡 Upper bounds are not enforced: if there’s CPU available, a process can burst

Slide 143

Slide 143 text

Shares / Quotas Programs also have the notion of shares. The OS distributes these computing slices proportionally. Process asking for 1432 shares (~1.4 CPU) + process asking for 2048 shares (2 CPU) + process asking for 616 shares = 4096 Each is guaranteed to have what it asked for. 💡 Pod schedulers like Kubernetes use this mechanism to place a pod where enough computing capacity is available 💡 Upper bounds are not enforced: if there’s CPU available, a process can burst

Slide 144

Slide 144 text

Shares / Quotas Quotas are a different mechanism used to limit a process. CPU time is split into periods of 100 ms (by default) A fraction of a CPU is called a millicore: a thousandth of a CPU Example : 100 ms * ( 500 / 1000 ) = 50 ms (period * millicores = cpu fraction)

resources:
  limits:
    cpu: 500m
⟹ 50 ms per period of 100 ms

Slide 145

Slide 145 text

Shares / Quotas Indeed, it means the limit applies across all accounted cores : 4 CPUs ⟹ 4 x 100 ms = 400 ms per period

resources:
  limits:
    cpu: 2500m
⟹ 250 ms per period of 100 ms 🧐

Slide 146

Slide 146 text

Shares / Quotas Shares and quota have nothing to do with a hardware socket resources: limits: cpu: 1 limits: cpu: 1

Slide 147

Slide 147 text

Shares / Quotas Shares and quota have nothing to do with a hardware socket resources: limits: cpu: 1 limits: cpu: 1

Slide 148

Slide 148 text

Shares / Quotas If the process reaches its limit, it gets throttled, i.e. it has to wait for the next period. E.g. a process can consume a 200 ms budget on ● 2 cores with 100 ms on each ● 8 cores with 25 ms on each

Slide 149

Slide 149 text

CPU Throttling When you reach the limit with CPU quotas, throttling happens Throttling ⟹ STW pauses Monitor throttling: Cgroup v1: /sys/fs/cgroup/cpu,cpuacct//cpu.stat Cgroup v2: /sys/fs/cgroup/cpu.stat ● nr_periods – number of periods that any thread in the cgroup was runnable ● nr_throttled – number of runnable periods in which the application used its entire quota and was throttled ● throttled_time – total amount of time individual threads within the cgroup were throttled
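A quick check inside the container (a sketch; on cgroup v2 the time counter is named throttled_usec):

$ grep -E 'nr_periods|nr_throttled|throttled' /sys/fs/cgroup/cpu.stat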

Slide 150

Slide 150 text

CPU Throttling with JFR JFR Container event jdk.ContainerCPUThrottling

Slide 151

Slide 151 text

availableProcessors ergonomics Setting CPU shares/quotas has a direct impact on the Runtime.availableProcessors() API

Shares | Quotas  | Period  | availableProcessors()
4096   | -1      | 100 000 | 4 (Shares / 1024)
1024   | 300 000 | 100 000 | 3 (Quotas / Period)

Slide 152

Slide 152 text

availableProcessors ergonomics Runtime.availableProcessors() API is used to : ● size some concurrent structures ● ForkJoinPool, used for Parallel Streams, CompletableFuture, …
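A minimal sketch to see what the JVM derives inside a constrained container:

public class Cpus {
    public static void main(String[] args) {
        // Derived from the cgroup CPU shares / quotas
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
        // The common ForkJoinPool sizes itself from that value (availableProcessors - 1 by default)
        System.out.println("commonPool parallelism = " + java.util.concurrent.ForkJoinPool.commonPool().getParallelism());
    }
}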

Slide 153

Slide 153 text

Tuning CPU Trade off CPU needs for startup time vs. request time ● Adjust CPU shares / CPU quotas ● Adjust the liveness timeout ● Use readiness / startup probes (see the sketch below)
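A sketch of a startup probe covering a slow JIT/GC warm-up (path, port, and thresholds are illustrative):

startupProbe:
  httpGet:
    path: /
    port: 8080
  failureThreshold: 30
  periodSeconds: 5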

Slide 154

Slide 154 text

Conclusion

Slide 155

Slide 155 text

Memory ● JVM memory is not only the Java heap ● The native parts are less known, and difficult to monitor and estimate ● Yet they are important moving parts to account for in order to avoid OOMKills ● Bonus: revise virtual memory

Slide 156

Slide 156 text

Startup ● Containers with < 2 CPUs are a constrained environment for the JVM ● Keep in mind that JVM subsystems like the JIT or the GC need to be adjusted to the requirements ● Being aware of these subsystems helps find the balance between the resources and the requirements of your application

Slide 157

Slide 157 text

References

Slide 158

Slide 158 text

References ● Using JDK Flight Recorder and JDK Mission Control ● MaxRAMPercentage is not what I wished for ● Off-Heap reconnaissance ● Startup, Containers and TieredCompilation ● HotSpot JVM performance tuning guidelines ● Application Dynamic Class Data Sharing in HotSpot JVM ● JDK 18 G1 Parallel GC Changes ● Unthrottled: fixing CPU limits in the cloud ● Best practices: Java single-core containers ● Containerize your Java applications