JavaOne 2025: Advancing Java Profiling

Advancing Java Profiling Achieving precision and stability with JFR, eBPF
and user context Jaroslav Bachorik Staff Software Engineer, Datadog

So, what do you do for a living?

Java Profiling

Tooling to collect data points to explain application performance and
behaviour …

???? ????

Helping to solve computer riddles … 🧌

… for last 20 years

What is hogging the CPU? How come the pod is
OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles

JDK Flight Recorder (JFR) A lightweight JVM tool for profiling
and diagnostics, capturing runtime events like method execution, GC, and thread activity with minimal overhead. Performance detective toolbox at your fingertips

What hogs my CPU? Also known as ‘CPU Profiling’ CPU
usage at code locations - usually in form of stacktraces jdk.ExecutionSample - built-in JFR event to measure CPU usage

ExecutionSample event each N milliseconds capture five stacktraces pick next
non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 0 Captured : 0 Errors: 0 ST 1 ST 2 ST 3 ST 5 ST 4

each N milliseconds capture five stacktraces pick next non-blocked Java
thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 0 Captured : 0 Errors: 0 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 2 Captured : 1 Errors: 1 X ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 6 Captured : 5 Total : 8 Errors : 1 Blocked: 3 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4

Problems with ExecutionSample event T1 T4 10 ms time wrap
around threads list 50 ms

Problems with ExecutionSample event Effective Sampling Period

Problems with ExecutionSampleEvent Effective Sampling Period Scales linearly with thread
count and stack-walk error rate

Problems with ExecutionSampleEvent Effective Sampling Period No clear relation between
CPU and samples

[1] https://speakerdeck.com/parttimenerd/inner-workings-of-safepoints [2] https://psy-lob-saw.blogspot.com/2015/12/safepoints.html Safepoints, anybody?

Symbolication and safepoints 0x00000001155d8dc0 0x00000001155df1a4 0x00000001155d9de4 Stack Trace List of
frames Frame == address

Mapping address to metadata Interpreted code - preserves BCI -
simple lookup Compiled code - needs debug info by JIT - inlining complicates it

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ;
{other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … > java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -jar multi.jar

{other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) …

Biased toward safepoint

Debug info only at safepoints

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x00000001130c15d4: ;
{other} 0x00000001130c15d4: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 0x00000001130c15e0: ;*iload_1 {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@2 (line 15) 0x00000001130c15e0: ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 0x00000001130c15f0: ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@13 (line 17) 0x00000001130c15f0: 8a2b 42f9 | f303 0c2a 0x00000001130c15f8: ;*iadd {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@8 (line 16) … > java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PrintAssembly -jar multi.jar

# {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x00000001130c15d4: ;
{other} 0x00000001130c15d4: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 0x00000001130c15e0: ;*iload_1 {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@2 (line 15) 0x00000001130c15e0: ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 0x00000001130c15f0: ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@13 (line 17) 0x00000001130c15f0: 8a2b 42f9 | f303 0c2a 0x00000001130c15f8: ;*iadd {...} ; - com.datadog.demo1.Multi::mul@8 (line 16) …

Unbiased line mapping

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

Summary: ExecutionSample event Approximation of CPU usage Unstable relation to
real CPU usage - ‘effective’ sampling period - more visible in async/reactive frameworks Can it be improved?

CPUTimeExecutionSample Directly correlated to CPU usage Using kernel scheduler support
- timer_create - Linux only - similar approach as async-profiler [1] JEP in-progress [2] [1] https://github.com/async-profiler/async-profiler [2] https://openjdk.org/jeps/8337789

thread has consumed N ms raise signal for that thread
walk the stack enqueue the trace MPSC Queue symbolicate frames emit events CPUTimeExecutionSample Scheduler Signal Handler JFR Thread

Summary: CPUTimeExecutionSample event Truthful representation of CPU usage “Serialization” queue
is a bottleneck - many producers, only single consumer - dropped samples are acknowledged Linux only

Improving CPUTimeExecutionSample event Unwind Java stack at a safepoint [1]
- removes the queue bottleneck - improves stability, maintains precision [1] https://openjdk.org/jeps/8350338

{other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Safepoint Location

{other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Sample (record PC) Safepoint Location

{other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Unwind from recorded PC Safepoint Location

Unwind stack at safepoint Stable unwinding - cooperating with JVM
- unwinding only when it is safe Preserves exact location - samples not restricted to safepoints

Where is all my, what you call it, memory? Also
known as ‘Memory Profiling’ Allocated memory by code locations - usually in form of stacktraces jdk.ObjectAllocationSample jdk.ObjectAllocationInNewTLAB jdk.ObjectAllocationOutsideTLAB - built-in JFR events to capture heap allocations

Java Heap Allocations Very fast pointer-bump Thread Local Allocation Buffers
(TLAB) - avoid contention by allocating thread-locally first - when filled-up spill to the heap - fill TLAB -> spill to heap -> rince’n’repeat - you can read more in JVM Anatomy Quark #4 [1] [1] https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation

How TLAB works TLAB T1 Java Heap Small TLAB T2
Large Allocation S Allocation L Occupied Free Overflow

TLAB spill and JFR events Java Heap Small [JFR Event]
ObjectAllocationInNewTLAB Large TLAB T1 TLAB T2 Allocation S Allocation L [JFR Event] ObjectAllocationOutsideTLAB Occupied Free Overflow

TLAB stored in heap Java Heap Small Large TLAB T1
TLAB T2 Occupied Free Overflow

TLAB events are very frequent and can easily flood recording

A (100B) B (211B) A (100B) A (100B) C[] (411B)
B (813B) D (4MB) C[] (87B) Adaptive sampler [4 samples/T] Object Allocation Samples Time period T Random Samples B (2 * 1MB) C[] (3 * 1MB) D (4 * 1MB) C[] (2 * 1MB) TLAB events (TLAB = 1MB) == 11 TLABs == 11 TLABs

JFR Object Allocation Sample Based on TLAB events Employs adaptive
sampler - modulates emission rate to achieve the target Allows upscaling from samples - records the ‘skipped’ allocations Available since JDK 16 [1] Gory details in my blog post [2] [1] https://bugs.openjdk.org/browse/JDK-8257602 [2] https://jbachorik.github.io/posts/jfr-allocation-profiling

Biased on TLAB size!

JVMTI Allocation Sampler TLAB “Soft-max” Sampler Allocation Trigger sample Set
new soft-max (exponential distribution random generator)

JVMTI Allocation Sampler Using soft-max to avoid bias JEP 331:
Low-overhead Heap Profiling in JDK 11 [1] JVMTI only - no JFR integration as of now - immutable sampling interval - single soft-max JFR integration would be interesting [1] https://openjdk.org/jeps/331

Heap profiling What occupies memory and why “Dead or Alive”
Heap-dumps jdk.OldObjectSample

A (100B) B (211B) A (100B) A (100B) C[] (411B)
B (813B) D (4MB) C[] (87B) Time period T TLAB events (TLAB = 1MB) Liveness Tracker Instance Removed A (100B) B (211B) A (100B) A (100B) C[] (411B) B (813B) D (4MB) C[] (87B) GC

OldObjectSample event Using TLAB events (kind-of) Keeps track of liveness
(Optional) Reference chains to GC roots - fairly cheap alternative to heap dumps - helps answering ‘how is keeping the memory’ No upscaling to real used heap

Reference chains Collected at the end of chunk Must perform
heap walk - stop-the world - time-bounded via config - may not get all chains due to time limit ArrayList String “Hello” String “world” byte[] byte[] Static field [GC Root] OldObjectSample Event

Biased on TLAB size!

Advancing OldObjectSample event Use non-biased sampler Speed up reference chains
- ‘Parallel and incremental heap scanning’ [1] Improve data representativeness - do not favour ‘large’ samples - rather keep random samples [1] https://bugs.openjdk.org/browse/JDK-8214535

What is hogging the CPU? How comes the pod is
OOM-killed? Are other processes influencing me? Can we pack more containers on a machine safely? The Riddles

Enter eBPF?!

eBPF is making the Linux Kernel programmable at native execution
speed! https://ebpf.io/

eBPF runtime Courtesy of Mohammed Abboulaide

Courtesy of Mohammed Abboulaide

CPU Profiling: The eBPF Way Scheduler probe - watches the
CPU time spent by thread - unwinds native user/kernel stack - unwinds runtime specific stack - JVM stack with the help of vmstructs [1] Probe is not ‘maskable’ by syscalls Probe code has limits due to eBPF [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/vmStructs.cpp

Memory Profiling: The eBPF Way Does not exist as of
now - at least not for JVM There is a hope - JVM has ready-to-use USDT support [1][2][3] - we can “just” probe JVMTI Allocation Sampler [1] https://docs.oracle.com/javase/8/docs/technotes/guides/vm/dtrace.html [2] https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html [3] https://myaut.github.io/dtrace-stap-book/app/java.html

provider hotspot_jvmti { probe AllocObject__sample(char*, size_t, size_t); }; Create new
provider in JVM Call the probe HOTSPOT_JVMTI_ALLOCOBJECT_SAMPLE(obj->klass()->name()->as_ C_string(), allocation_size, bytes_since_allocation); https://github.com/DataDog/openjdk-jdk/commit/686b8454e1f99b7d75ceb9af6c5b68dbb49e9ffd src/hotspot/os/posix/dtrace/hotspot_jvmti. d src/hotspot/share/runtime/threadHeapSampler.cp p

Attach probe sudo bpftrace -p <pid> -e ' usdt:*:hotspot_jvmti:AllocObject__sample {
printf("Alloc: %s (%d, %d)\n", str(arg0), arg1, arg2); }' Probe trace Class: s/Some (16, 62673) Class: s/c/i/RedBlackTree$Tree (32, 71939) Class: s/c/i/RedBlackTree$Tree (32, 55760) Class: s/c/i/TreeMap (24, 537917) https://github.com/DataDog/openjdk-jdk/commit/686b8454e1f99b7d75ceb9af6c5b68dbb49e9ffd

OpenTelemetry (OTel) eBPF profiler OTel (Open Telemetry) [1] - CNCF
(Cloud Native Computing Foundation) [2] incubating project OTel Profiling Signal [2] - SES1247:Continuous Profiling - the 4th Pillar of Observability, M.Hirt Continuous Profiling - the 4th Pillar of Observability OTel Profiling Agent [3] - donated to CNCF/OTel by Elastic - Datadog full-host profiler fork [4] - Multi-runtime (Go, Rust, JVM etc.) [1] https://opentelemetry.io/ [2] https://www.cncf.io/ [3] https://github.com/open-telemetry/opentelemetry-ebpf-profiler [4] https://github.com/DataDog/dd-otel-host-profiler/

OpenTelemetry (OTel) eBPF profiler OTel (Open Telemetry) [1] - CNCF
(Cloud Native Computing Foundation) [2] incubating project OTel Profiling Signal [2] - SES1247:Continuous Profiling - the 4th Pillar of Observability, M.Hirt Continuous Profiling - the 4th Pillar of Observability OTel Profiling Agent [3] - donated to CNCF/OTel by Elastic - Datadog full-host profiler fork [4] - Multi-runtime (Go, Rust, JVM etc.) [1] https://opentelemetry.io/ [2] https://www.cncf.io/ [3] https://github.com/open-telemetry/opentelemetry-ebpf-profiler [4] https://github.com/DataDog/dd-otel-host-profiler/ N JVM le behin !

Labels everywhere!

Profiling data dimensions Sample value - CPU Time, Allocated memory
… Originating REST API calls Distributed tracing ID Customer ID Anything else that makes sense for the user Allow slice’n’dice of profiling data Sometimes also referred to as ‘Profiling Context’

Quick poll: What is flamegraph? https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

Profiling labels T1 10 ms S1-2 S3-6 S7-8 Aggregated stacktraces
[Flamegraph] 8 samples

Operations (distributed tracer) T1 S1-2 S3-6 S7-8 REST /api/showAll DB
select * from … UI renderForm()

Customers (user defined context) T1 S1-2 S3-6 S7-8 Big Customer

Slice’n’dice All Samples

Slice’n’dice REST /api/showAll

Slice’n’dice REST /api/showAll Big Customer +

Profiling labels Improved understanding, navigation and analytics Rely on user
specific context Proved valuable in Datadog Java profiler [1] OTel is looking at adding labels to the eBPF profiler [2] Experimental prototypes for profiling labels in JFR [3, 4] [1] https://github.com/DataDog/java-profiler [2] https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation [3] https://jbachorik.github.io/posts/seeing-in-context_1 [4] https://jbachorik.github.io/posts/different-shade-of-context

Thank you! Jaroslav Bachorik @bachorikj https://www.linkedin.com/in/jbachorik/ https://github.com/jbachorik Resources and links
Questions?

JavaOne 2025: Advancing Java Profiling

JavaOne 2025: Advancing Java Profiling

Other Decks in Programming

Featured

Transcript