JavaOne 2025: Advancing Java Profiling

Slide 1

Slide 1 text

Advancing Java Profiling Achieving precision and stability with JFR, eBPF and user context Jaroslav Bachorik Staff Software Engineer, Datadog

Slide 2

Slide 2 text

So, what do you do for a living?

Slide 3

Slide 3 text

Java Profiling

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Tooling to collect data points to explain application performance and behaviour …

Slide 6

Slide 6 text

???? ????

Slide 7

Slide 7 text

Helping to solve computer riddles … 🧌

Slide 8

Slide 8 text

… for last 20 years

Slide 9

Slide 9 text

What is hogging the CPU? How come the pod is OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles

Slide 10

Slide 10 text

JDK Flight Recorder (JFR) A lightweight JVM tool for profiling and diagnostics, capturing runtime events like method execution, GC, and thread activity with minimal overhead. Performance detective toolbox at your fingertips

Slide 11

Slide 11 text

Slide 12

Slide 12 text

What is hogging the CPU? How come the pod is OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles

Slide 13

Slide 13 text

What hogs my CPU? Also known as ‘CPU Profiling’ CPU usage at code locations - usually in form of stacktraces jdk.ExecutionSample - built-in JFR event to measure CPU usage

Slide 14

Slide 14 text

ExecutionSample event each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 0 Captured : 0 Errors: 0 ST 1 ST 2 ST 3 ST 5 ST 4

Slide 15

Slide 15 text

each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 0 Captured : 0 Errors: 0 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Slide 16

Slide 16 text

each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 1 Captured : 1 Errors: 0 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Slide 17

Slide 17 text

each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 2 Captured : 1 Errors: 0 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Slide 18

Slide 18 text

each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 2 Captured : 1 Errors: 1 X ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Slide 19

Slide 19 text

each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 3 Captured : 1 Errors: 1 X ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Slide 20

Slide 20 text

each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 3 Captured : 2 Errors: 1 X ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Slide 21

Slide 21 text

each N milliseconds capture five stacktraces pick next non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 6 Captured : 5 Total : 8 Errors : 1 Blocked: 3 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Slide 22

Slide 22 text

Problems with ExecutionSample event T1 T4 10 ms time wrap around threads list 50 ms

Slide 23

Slide 23 text

Problems with ExecutionSample event T1 T4 10 ms time wrap around threads list 50 ms

Slide 24

Slide 24 text

Problems with ExecutionSample event Effective Sampling Period

Slide 25

Slide 25 text

Problems with ExecutionSampleEvent Effective Sampling Period Scales linearly with thread count and stack-walk error rate

Slide 26

Slide 26 text

Problems with ExecutionSampleEvent Effective Sampling Period No clear relation between CPU and samples

Slide 27

Slide 27 text

[1] https://speakerdeck.com/parttimenerd/inner-workings-of-safepoints [2] https://psy-lob-saw.blogspot.com/2015/12/safepoints.html Safepoints, anybody?

Slide 28

Slide 28 text

Symbolication and safepoints 0x00000001155d8dc0 0x00000001155df1a4 0x00000001155d9de4 Stack Trace List of frames Frame == address

Slide 29

Slide 29 text

Mapping address to metadata Interpreted code - preserves BCI - simple lookup Compiled code - needs debug info by JIT - inlining complicates it

Slide 30

Slide 30 text

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ; {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … > java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -jar multi.jar

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Biased toward safepoint

Slide 33

Slide 33 text

Debug info only at safepoints

Slide 34

Slide 34 text

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x00000001130c15d4: ; {other} 0x00000001130c15d4: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 0x00000001130c15e0: ;*iload_1 {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@2 (line 15) 0x00000001130c15e0: ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 0x00000001130c15f0: ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@13 (line 17) 0x00000001130c15f0: 8a2b 42f9 | f303 0c2a 0x00000001130c15f8: ;*iadd {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@8 (line 16) … > java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PrintAssembly -jar multi.jar

Slide 35

Slide 35 text

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x00000001130c15d4: ; {other} 0x00000001130c15d4: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 0x00000001130c15e0: ;*iload_1 {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@2 (line 15) 0x00000001130c15e0: ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 0x00000001130c15f0: ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@13 (line 17) 0x00000001130c15f0: 8a2b 42f9 | f303 0c2a 0x00000001130c15f8: ;*iadd {...} ; - com.datadog.demo1.Multi::mul@8 (line 16) …

Slide 36

Slide 36 text

Unbiased line mapping

Slide 37

Slide 37 text

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

Slide 38

Slide 38 text

Summary: ExecutionSample event Approximation of CPU usage Unstable relation to real CPU usage - ‘effective’ sampling period - more visible in async/reactive frameworks Can it be improved?

Slide 39

Slide 39 text

CPUTimeExecutionSample Directly correlated to CPU usage Using kernel scheduler support - timer_create - Linux only - similar approach as async-profiler [1] JEP in-progress [2] [1] https://github.com/async-profiler/async-profiler [2] https://openjdk.org/jeps/8337789

Slide 40

Slide 40 text

thread has consumed N ms raise signal for that thread walk the stack enqueue the trace MPSC Queue symbolicate frames emit events CPUTimeExecutionSample Scheduler Signal Handler JFR Thread

Slide 41

Slide 41 text

Summary: CPUTimeExecutionSample event Truthful representation of CPU usage “Serialization” queue is a bottleneck - many producers, only single consumer - dropped samples are acknowledged Linux only

Slide 42

Slide 42 text

Improving CPUTimeExecutionSample event Unwind Java stack at a safepoint [1] - removes the queue bottleneck - improves stability, maintains precision [1] https://openjdk.org/jeps/8350338

Slide 43

Slide 43 text

Slide 44

Slide 44 text

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ; {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Sample (record PC) Safepoint Location

Slide 45

Slide 45 text

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ; {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Unwind from recorded PC Safepoint Location

Slide 46

Slide 46 text

Unwind stack at safepoint Stable unwinding - cooperating with JVM - unwinding only when it is safe Preserves exact location - samples not restricted to safepoints

Slide 47

Slide 47 text

What is hogging the CPU? How come the pod is OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles

Slide 48

Slide 48 text

Where is all my, what you call it, memory? Also known as ‘Memory Profiling’ Allocated memory by code locations - usually in form of stacktraces jdk.ObjectAllocationSample jdk.ObjectAllocationInNewTLAB jdk.ObjectAllocationOutsideTLAB - built-in JFR events to capture heap allocations

Slide 49

Slide 49 text

Java Heap Allocations Very fast pointer-bump Thread Local Allocation Buffers (TLAB) - avoid contention by allocating thread-locally first - when filled-up spill to the heap - fill TLAB -> spill to heap -> rince’n’repeat - you can read more in JVM Anatomy Quark #4 [1] [1] https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation

Slide 50

Slide 50 text

How TLAB works TLAB T1 Java Heap Small TLAB T2 Large Allocation S Allocation L Occupied Free Overflow

Slide 51

Slide 51 text

TLAB spill and JFR events Java Heap Small [JFR Event] ObjectAllocationInNewTLAB Large TLAB T1 TLAB T2 Allocation S Allocation L [JFR Event] ObjectAllocationOutsideTLAB Occupied Free Overflow

Slide 52

Slide 52 text

TLAB stored in heap Java Heap Small Large TLAB T1 TLAB T2 Occupied Free Overflow

Slide 53

Slide 53 text

TLAB events are very frequent and can easily flood recording

Slide 54

Slide 54 text

A (100B) B (211B) A (100B) A (100B) C[] (411B) B (813B) D (4MB) C[] (87B) Adaptive sampler [4 samples/T] Object Allocation Samples Time period T Random Samples B (2 * 1MB) C[] (3 * 1MB) D (4 * 1MB) C[] (2 * 1MB) TLAB events (TLAB = 1MB) == 11 TLABs == 11 TLABs

Slide 55

Slide 55 text

JFR Object Allocation Sample Based on TLAB events Employs adaptive sampler - modulates emission rate to achieve the target Allows upscaling from samples - records the ‘skipped’ allocations Available since JDK 16 [1] Gory details in my blog post [2] [1] https://bugs.openjdk.org/browse/JDK-8257602 [2] https://jbachorik.github.io/posts/jfr-allocation-profiling

Slide 56

Slide 56 text

Biased on TLAB size!

Slide 57

Slide 57 text

JVMTI Allocation Sampler TLAB “Soft-max” Sampler Allocation Trigger sample Set new soft-max (exponential distribution random generator)

Slide 58

Slide 58 text

JVMTI Allocation Sampler Using soft-max to avoid bias JEP 331: Low-overhead Heap Profiling in JDK 11 [1] JVMTI only - no JFR integration as of now - immutable sampling interval - single soft-max JFR integration would be interesting [1] https://openjdk.org/jeps/331

Slide 59

Slide 59 text

What is hogging the CPU? How come the pod is OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles

Slide 60

Slide 60 text

Heap profiling What occupies memory and why “Dead or Alive” Heap-dumps jdk.OldObjectSample

Slide 61

Slide 61 text

A (100B) B (211B) A (100B) A (100B) C[] (411B) B (813B) D (4MB) C[] (87B) Time period T TLAB events (TLAB = 1MB) Liveness Tracker Instance Removed A (100B) B (211B) A (100B) A (100B) C[] (411B) B (813B) D (4MB) C[] (87B) GC

Slide 62

Slide 62 text

Slide 63

Slide 63 text

OldObjectSample event Using TLAB events (kind-of) Keeps track of liveness (Optional) Reference chains to GC roots - fairly cheap alternative to heap dumps - helps answering ‘how is keeping the memory’ No upscaling to real used heap

Slide 64

Slide 64 text

Reference chains Collected at the end of chunk Must perform heap walk - stop-the world - time-bounded via config - may not get all chains due to time limit ArrayList String “Hello” String “world” byte[] byte[] Static field [GC Root] OldObjectSample Event

Slide 65

Slide 65 text

Biased on TLAB size!

Slide 66

Slide 66 text

Advancing OldObjectSample event Use non-biased sampler Speed up reference chains - ‘Parallel and incremental heap scanning’ [1] Improve data representativeness - do not favour ‘large’ samples - rather keep random samples [1] https://bugs.openjdk.org/browse/JDK-8214535

Slide 67

Slide 67 text

What is hogging the CPU? How comes the pod is OOM-killed? Are other processes influencing me? Can we pack more containers on a machine safely? The Riddles

Slide 68

Slide 68 text

Enter eBPF?!

Slide 69

Slide 69 text

eBPF is making the Linux Kernel programmable at native execution speed! https://ebpf.io/

Slide 70

Slide 70 text

eBPF runtime Courtesy of Mohammed Abboulaide

Slide 71

Slide 71 text

Courtesy of Mohammed Abboulaide

Slide 72

Slide 72 text

Courtesy of Mohammed Abboulaide

Slide 73

Slide 73 text

CPU Profiling: The eBPF Way Scheduler probe - watches the CPU time spent by thread - unwinds native user/kernel stack - unwinds runtime specific stack - JVM stack with the help of vmstructs [1] Probe is not ‘maskable’ by syscalls Probe code has limits due to eBPF [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/vmStructs.cpp

Slide 74

Slide 74 text

Memory Profiling: The eBPF Way Does not exist as of now - at least not for JVM There is a hope - JVM has ready-to-use USDT support [1][2][3] - we can “just” probe JVMTI Allocation Sampler [1] https://docs.oracle.com/javase/8/docs/technotes/guides/vm/dtrace.html [2] https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html [3] https://myaut.github.io/dtrace-stap-book/app/java.html

Slide 75

Slide 75 text

provider hotspot_jvmti { probe AllocObject__sample(char*, size_t, size_t); }; Create new provider in JVM Call the probe HOTSPOT_JVMTI_ALLOCOBJECT_SAMPLE(obj->klass()->name()->as_ C_string(), allocation_size, bytes_since_allocation); https://github.com/DataDog/openjdk-jdk/commit/686b8454e1f99b7d75ceb9af6c5b68dbb49e9ffd src/hotspot/os/posix/dtrace/hotspot_jvmti. d src/hotspot/share/runtime/threadHeapSampler.cp p

Slide 76

Slide 76 text

Attach probe sudo bpftrace -p -e ' usdt:*:hotspot_jvmti:AllocObject__sample { printf("Alloc: %s (%d, %d)\n", str(arg0), arg1, arg2); }' Probe trace Class: s/Some (16, 62673) Class: s/c/i/RedBlackTree$Tree (32, 71939) Class: s/c/i/RedBlackTree$Tree (32, 55760) Class: s/c/i/TreeMap (24, 537917) https://github.com/DataDog/openjdk-jdk/commit/686b8454e1f99b7d75ceb9af6c5b68dbb49e9ffd

Slide 77

Slide 77 text

OpenTelemetry (OTel) eBPF profiler OTel (Open Telemetry) [1] - CNCF (Cloud Native Computing Foundation) [2] incubating project OTel Profiling Signal [2] - SES1247:Continuous Profiling - the 4th Pillar of Observability, M.Hirt Continuous Profiling - the 4th Pillar of Observability OTel Profiling Agent [3] - donated to CNCF/OTel by Elastic - Datadog full-host profiler fork [4] - Multi-runtime (Go, Rust, JVM etc.) [1] https://opentelemetry.io/ [2] https://www.cncf.io/ [3] https://github.com/open-telemetry/opentelemetry-ebpf-profiler [4] https://github.com/DataDog/dd-otel-host-profiler/

Slide 78

Slide 78 text

Slide 79

Slide 79 text

What is hogging the CPU? How comes the pod is OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles

Slide 80

Slide 80 text

Labels everywhere!

Slide 81

Slide 81 text

Profiling data dimensions Sample value - CPU Time, Allocated memory … Originating REST API calls Distributed tracing ID Customer ID Anything else that makes sense for the user Allow slice’n’dice of profiling data Sometimes also referred to as ‘Profiling Context’

Slide 82

Slide 82 text

Quick poll: What is flamegraph? https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

Slide 83

Slide 83 text

Profiling labels T1 10 ms S1-2 S3-6 S7-8 Aggregated stacktraces [Flamegraph] 8 samples

Slide 84

Slide 84 text

Operations (distributed tracer) T1 S1-2 S3-6 S7-8 REST /api/showAll DB select * from … UI renderForm()

Slide 85

Slide 85 text

Customers (user defined context) T1 S1-2 S3-6 S7-8 Big Customer

Slide 86

Slide 86 text

Slice’n’dice All Samples

Slide 87

Slide 87 text

Slice’n’dice REST /api/showAll

Slide 88

Slide 88 text

Slice’n’dice REST /api/showAll Big Customer +

Slide 89

Slide 89 text

Profiling labels Improved understanding, navigation and analytics Rely on user specific context Proved valuable in Datadog Java profiler [1] OTel is looking at adding labels to the eBPF profiler [2] Experimental prototypes for profiling labels in JFR [3, 4] [1] https://github.com/DataDog/java-profiler [2] https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation [3] https://jbachorik.github.io/posts/seeing-in-context_1 [4] https://jbachorik.github.io/posts/different-shade-of-context

Slide 90

Slide 90 text

What is hogging the CPU? How comes the pod is OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles

Slide 91

Slide 91 text

Thank you! Jaroslav Bachorik @bachorikj https://www.linkedin.com/in/jbachorik/ https://github.com/jbachorik Resources and links Questions?