Upgrade to Pro — share decks privately, control downloads, hide ads and more …

JavaOne 2025: Advancing Java Profiling

JavaOne 2025: Advancing Java Profiling

Java developers struggle with the trade-off between precise profiling and application stability. Unofficial methods like AsyncGetCallTrace offer precision but risk crashes, while official APIs such as JVMTI, JMX, and StackWalker are stable but biased due to safepoint polling. Java Flight Recorder (JFR) reduces this bias but introduces interpolation errors. This talk addresses these challenges with three key topics: • Combining precise sampling with JFR’s stability to remove biases without sacrificing reliability. • Using eBPF probes and examining JVM internals for safer, detailed profiling. • Enhancing precision with user-supplied profiling contexts. We will review the history of Java profilers and discuss the future of JVM profiling, with a focus on CPU profiling. Attendees will gain insights into new methods aimed at achieving accurate, stable performance analysis in Java applications, comparing them with async-profiler and existing tools.

Jaroslav Bachorik

March 19, 2025
Tweet

Other Decks in Programming

Transcript

  1. Advancing Java Profiling Achieving precision and stability with JFR, eBPF

    and user context Jaroslav Bachorik Staff Software Engineer, Datadog
  2. ?

  3. What is hogging the CPU? How come the pod is

    OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles
  4. JDK Flight Recorder (JFR) A lightweight JVM tool for profiling

    and diagnostics, capturing runtime events like method execution, GC, and thread activity with minimal overhead. Performance detective toolbox at your fingertips
  5. JDK Flight Recorder (JFR) A lightweight JVM tool for profiling

    and diagnostics, capturing runtime events like method execution, GC, and thread activity with minimal overhead. Performance detective toolbox at your fingertips
  6. What is hogging the CPU? How come the pod is

    OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles
  7. What hogs my CPU? Also known as ‘CPU Profiling’ CPU

    usage at code locations - usually in form of stacktraces jdk.ExecutionSample - built-in JFR event to measure CPU usage
  8. ExecutionSample event each N milliseconds capture five stacktraces pick next

    non-blocked Java thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 0 Captured : 0 Errors: 0 ST 1 ST 2 ST 3 ST 5 ST 4
  9. each N milliseconds capture five stacktraces pick next non-blocked Java

    thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 0 Captured : 0 Errors: 0 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4
  10. each N milliseconds capture five stacktraces pick next non-blocked Java

    thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 1 Captured : 1 Errors: 0 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4
  11. each N milliseconds capture five stacktraces pick next non-blocked Java

    thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 2 Captured : 1 Errors: 0 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4
  12. each N milliseconds capture five stacktraces pick next non-blocked Java

    thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 2 Captured : 1 Errors: 1 X ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4
  13. each N milliseconds capture five stacktraces pick next non-blocked Java

    thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 3 Captured : 1 Errors: 1 X ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4
  14. each N milliseconds capture five stacktraces pick next non-blocked Java

    thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 3 Captured : 2 Errors: 1 X ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4
  15. each N milliseconds capture five stacktraces pick next non-blocked Java

    thread suspend thread capture the thread stack resume thread :: end Thread list Runnable Blocked Processed: 6 Captured : 5 Total : 8 Errors : 1 Blocked: 3 ExecutionSample event ST 1 ST 2 ST 3 ST 5 ST 4
  16. Mapping address to metadata Interpreted code - preserves BCI -

    simple lookup Compiled code - needs debug info by JIT - inlining complicates it
  17. # {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ;

    {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … > java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -jar multi.jar
  18. # {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ;

    {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) …
  19. # {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x00000001130c15d4: ;

    {other} 0x00000001130c15d4: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 0x00000001130c15e0: ;*iload_1 {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@2 (line 15) 0x00000001130c15e0: ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 0x00000001130c15f0: ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@13 (line 17) 0x00000001130c15f0: 8a2b 42f9 | f303 0c2a 0x00000001130c15f8: ;*iadd {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@8 (line 16) … > java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PrintAssembly -jar multi.jar
  20. # {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x00000001130c15d4: ;

    {other} 0x00000001130c15d4: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 0x00000001130c15e0: ;*iload_1 {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@2 (line 15) 0x00000001130c15e0: ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 0x00000001130c15f0: ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - com.datadog.demo1.Multi::mul@13 (line 17) 0x00000001130c15f0: 8a2b 42f9 | f303 0c2a 0x00000001130c15f8: ;*iadd {...} ; - com.datadog.demo1.Multi::mul@8 (line 16) …
  21. Summary: ExecutionSample event Approximation of CPU usage Unstable relation to

    real CPU usage - ‘effective’ sampling period - more visible in async/reactive frameworks Can it be improved?
  22. CPUTimeExecutionSample Directly correlated to CPU usage Using kernel scheduler support

    - timer_create - Linux only - similar approach as async-profiler [1] JEP in-progress [2] [1] https://github.com/async-profiler/async-profiler [2] https://openjdk.org/jeps/8337789
  23. thread has consumed N ms raise signal for that thread

    walk the stack enqueue the trace MPSC Queue symbolicate frames emit events CPUTimeExecutionSample Scheduler Signal Handler JFR Thread
  24. Summary: CPUTimeExecutionSample event Truthful representation of CPU usage “Serialization” queue

    is a bottleneck - many producers, only single consumer - dropped samples are acknowledged Linux only
  25. Improving CPUTimeExecutionSample event Unwind Java stack at a safepoint [1]

    - removes the queue bottleneck - improves stability, maintains precision [1] https://openjdk.org/jeps/8350338
  26. # {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ;

    {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Safepoint Location
  27. # {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ;

    {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Sample (record PC) Safepoint Location
  28. # {method} {0x00000001314004c8} 'mul' '(II)I' in 'com/datadog/demo1/Multi' … 0x000000011370d454: ;

    {other} 0x000000011370d454: 1f20 03d5 | 1f00 80f2 | 1f00 80f2 | ff43 0091 | 5301 0034 | 8a2b 42f9 | 0300 0014 | 8a2b 42f9 0x000000011370d474: f303 0c2a | b502 140b 0x000000011370d47c: ; ImmutableOopMap {} ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) com.datadog.demo1.Multi::mul@13 (line 17) … Unwind stack at safepoint Unwind from recorded PC Safepoint Location
  29. Unwind stack at safepoint Stable unwinding - cooperating with JVM

    - unwinding only when it is safe Preserves exact location - samples not restricted to safepoints
  30. What is hogging the CPU? How come the pod is

    OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles
  31. Where is all my, what you call it, memory? Also

    known as ‘Memory Profiling’ Allocated memory by code locations - usually in form of stacktraces jdk.ObjectAllocationSample jdk.ObjectAllocationInNewTLAB jdk.ObjectAllocationOutsideTLAB - built-in JFR events to capture heap allocations
  32. Java Heap Allocations Very fast pointer-bump Thread Local Allocation Buffers

    (TLAB) - avoid contention by allocating thread-locally first - when filled-up spill to the heap - fill TLAB -> spill to heap -> rince’n’repeat - you can read more in JVM Anatomy Quark #4 [1] [1] https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation
  33. How TLAB works TLAB T1 Java Heap Small TLAB T2

    Large Allocation S Allocation L Occupied Free Overflow
  34. TLAB spill and JFR events Java Heap Small [JFR Event]

    ObjectAllocationInNewTLAB Large TLAB T1 TLAB T2 Allocation S Allocation L [JFR Event] ObjectAllocationOutsideTLAB Occupied Free Overflow
  35. TLAB stored in heap Java Heap Small Large TLAB T1

    TLAB T2 Occupied Free Overflow
  36. A (100B) B (211B) A (100B) A (100B) C[] (411B)

    B (813B) D (4MB) C[] (87B) Adaptive sampler [4 samples/T] Object Allocation Samples Time period T Random Samples B (2 * 1MB) C[] (3 * 1MB) D (4 * 1MB) C[] (2 * 1MB) TLAB events (TLAB = 1MB) == 11 TLABs == 11 TLABs
  37. JFR Object Allocation Sample Based on TLAB events Employs adaptive

    sampler - modulates emission rate to achieve the target Allows upscaling from samples - records the ‘skipped’ allocations Available since JDK 16 [1] Gory details in my blog post [2] [1] https://bugs.openjdk.org/browse/JDK-8257602 [2] https://jbachorik.github.io/posts/jfr-allocation-profiling
  38. JVMTI Allocation Sampler TLAB “Soft-max” Sampler Allocation Trigger sample Set

    new soft-max (exponential distribution random generator)
  39. JVMTI Allocation Sampler Using soft-max to avoid bias JEP 331:

    Low-overhead Heap Profiling in JDK 11 [1] JVMTI only - no JFR integration as of now - immutable sampling interval - single soft-max JFR integration would be interesting [1] https://openjdk.org/jeps/331
  40. What is hogging the CPU? How come the pod is

    OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles
  41. A (100B) B (211B) A (100B) A (100B) C[] (411B)

    B (813B) D (4MB) C[] (87B) Time period T TLAB events (TLAB = 1MB) Liveness Tracker Instance Removed A (100B) B (211B) A (100B) A (100B) C[] (411B) B (813B) D (4MB) C[] (87B) GC
  42. A (100B) B (211B) A (100B) A (100B) C[] (411B)

    B (813B) D (4MB) C[] (87B) Time period T TLAB events (TLAB = 1MB) Liveness Tracker Instance Removed A (100B) B (211B) A (100B) A (100B) C[] (411B) B (813B) D (4MB) C[] (87B) GC
  43. OldObjectSample event Using TLAB events (kind-of) Keeps track of liveness

    (Optional) Reference chains to GC roots - fairly cheap alternative to heap dumps - helps answering ‘how is keeping the memory’ No upscaling to real used heap
  44. Reference chains Collected at the end of chunk Must perform

    heap walk - stop-the world - time-bounded via config - may not get all chains due to time limit ArrayList String “Hello” String “world” byte[] byte[] Static field [GC Root] OldObjectSample Event
  45. Advancing OldObjectSample event Use non-biased sampler Speed up reference chains

    - ‘Parallel and incremental heap scanning’ [1] Improve data representativeness - do not favour ‘large’ samples - rather keep random samples [1] https://bugs.openjdk.org/browse/JDK-8214535
  46. What is hogging the CPU? How comes the pod is

    OOM-killed? Are other processes influencing me? Can we pack more containers on a machine safely? The Riddles
  47. CPU Profiling: The eBPF Way Scheduler probe - watches the

    CPU time spent by thread - unwinds native user/kernel stack - unwinds runtime specific stack - JVM stack with the help of vmstructs [1] Probe is not ‘maskable’ by syscalls Probe code has limits due to eBPF [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/vmStructs.cpp
  48. Memory Profiling: The eBPF Way Does not exist as of

    now - at least not for JVM There is a hope - JVM has ready-to-use USDT support [1][2][3] - we can “just” probe JVMTI Allocation Sampler [1] https://docs.oracle.com/javase/8/docs/technotes/guides/vm/dtrace.html [2] https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html [3] https://myaut.github.io/dtrace-stap-book/app/java.html
  49. provider hotspot_jvmti { probe AllocObject__sample(char*, size_t, size_t); }; Create new

    provider in JVM Call the probe HOTSPOT_JVMTI_ALLOCOBJECT_SAMPLE(obj->klass()->name()->as_ C_string(), allocation_size, bytes_since_allocation); https://github.com/DataDog/openjdk-jdk/commit/686b8454e1f99b7d75ceb9af6c5b68dbb49e9ffd src/hotspot/os/posix/dtrace/hotspot_jvmti. d src/hotspot/share/runtime/threadHeapSampler.cp p
  50. Attach probe sudo bpftrace -p <pid> -e ' usdt:*:hotspot_jvmti:AllocObject__sample {

    printf("Alloc: %s (%d, %d)\n", str(arg0), arg1, arg2); }' Probe trace Class: s/Some (16, 62673) Class: s/c/i/RedBlackTree$Tree (32, 71939) Class: s/c/i/RedBlackTree$Tree (32, 55760) Class: s/c/i/TreeMap (24, 537917) https://github.com/DataDog/openjdk-jdk/commit/686b8454e1f99b7d75ceb9af6c5b68dbb49e9ffd
  51. OpenTelemetry (OTel) eBPF profiler OTel (Open Telemetry) [1] - CNCF

    (Cloud Native Computing Foundation) [2] incubating project OTel Profiling Signal [2] - SES1247:Continuous Profiling - the 4th Pillar of Observability, M.Hirt Continuous Profiling - the 4th Pillar of Observability OTel Profiling Agent [3] - donated to CNCF/OTel by Elastic - Datadog full-host profiler fork [4] - Multi-runtime (Go, Rust, JVM etc.) [1] https://opentelemetry.io/ [2] https://www.cncf.io/ [3] https://github.com/open-telemetry/opentelemetry-ebpf-profiler [4] https://github.com/DataDog/dd-otel-host-profiler/
  52. OpenTelemetry (OTel) eBPF profiler OTel (Open Telemetry) [1] - CNCF

    (Cloud Native Computing Foundation) [2] incubating project OTel Profiling Signal [2] - SES1247:Continuous Profiling - the 4th Pillar of Observability, M.Hirt Continuous Profiling - the 4th Pillar of Observability OTel Profiling Agent [3] - donated to CNCF/OTel by Elastic - Datadog full-host profiler fork [4] - Multi-runtime (Go, Rust, JVM etc.) [1] https://opentelemetry.io/ [2] https://www.cncf.io/ [3] https://github.com/open-telemetry/opentelemetry-ebpf-profiler [4] https://github.com/DataDog/dd-otel-host-profiler/ N JVM le behin !
  53. What is hogging the CPU? How comes the pod is

    OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles
  54. Profiling data dimensions Sample value - CPU Time, Allocated memory

    … Originating REST API calls Distributed tracing ID Customer ID Anything else that makes sense for the user Allow slice’n’dice of profiling data Sometimes also referred to as ‘Profiling Context’
  55. Profiling labels Improved understanding, navigation and analytics Rely on user

    specific context Proved valuable in Datadog Java profiler [1] OTel is looking at adding labels to the eBPF profiler [2] Experimental prototypes for profiling labels in JFR [3, 4] [1] https://github.com/DataDog/java-profiler [2] https://www.elastic.co/observability-labs/blog/continuous-profiling-distributed-tracing-correlation [3] https://jbachorik.github.io/posts/seeing-in-context_1 [4] https://jbachorik.github.io/posts/different-shade-of-context
  56. What is hogging the CPU? How comes the pod is

    OOM-killed? Are other processes influencing me? How to correlate performance metrics to my business The Riddles