Quick off the blocks!
rapid start options for
your Java application
NISHIKAWA, Akihiro (@logico_jp)
Cloud Solution Architect
Microsoft
Slide 2
Who am I?
{
"name": "Akihiro Nishikawa",
"country": "Japan",
"working-for": "Microsoft",
"favourites": [
"JVM",
"GraalVM",
"Azure"
],
"expertise": [
"Application integration",
"Container and Serverless"
]
}
Slide 3
Agenda
Why startup is becoming much more important
Options to improve startup time
Slide 4
Survey in JCConf Taiwan 2024 (as of 13:50)
[Bar chart] Which JDK version are you using now?
Categories: 23 or later, 21, 17, 11, 8, 6 or earlier (axis: # of users, 0-30)
[Bar chart] Which framework are you using mainly?
Categories: Spring Boot, Struts 2, Hibernate, Quarkus, Ktor, Armeria, React (axis: # of users, 0-50)
Slide 5
Survey at JJUG night seminar in Tokyo...
As of September 12 2024
[Bar chart: JDK version in use, # of users]
23 or later       1
22                4
LTS (11/17/21)   38
8                18
7                 0
6 or earlier      2
Slide 6
Why startup is becoming important
Slide 7
What does the word “performance” stand for?
Startup
CPU Usage
Throughput
Latency
Size
Memory footprint
Slide 8
In the serverless and container era, short-lived applications are favoured over resident ones.

                                                      Startup  Latency  Throughput  Footprint
Java apps running on application server               △        ◎[1]     ◎[1]        ○
Expectations from serverless container perspective    ◎        ◎        ◎           ◎

[1] This improves gradually over time.
Slide 9
Poor cold start performance is not favoured...
Slide 10
Boot sequence
Slide 11
Startup
What happens when a Java application starts?
JVM Startup (fast)
JVM
• Load and initialize
• Generate bytecode templates
Application Startup (quick)
JVM
• Load application classes
• Initialize application classes
• Application-specific initialization
Application Warmup (long time)
JVM
• Compile/deoptimize/recompile
Application
• Process specific workloads
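To make the phases above measurable, the time from JVM start to the point where the application considers itself ready can be read from inside the process. The following is a minimal sketch, assuming the standard java.lang.management API is available; the class name StartupProbe and the "ready" point are hypothetical, and the value covers roughly JVM startup plus application startup, not warmup.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

class StartupProbe {
    /** Milliseconds elapsed since the JVM started, as reported by the runtime. */
    static long millisSinceJvmStart() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        return runtime.getUptime();
    }

    public static void main(String[] args) {
        // Hypothetical "ready" point: in a real service this would be called
        // once the framework reports that it is accepting requests.
        long readyAfterMs = millisSinceJvmStart();
        System.out.println("Ready " + readyAfterMs + " ms after JVM start");
    }
}
```

Calling millisSinceJvmStart() at the same logical point in every configuration keeps the numbers comparable between runs.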
Slide 12
Life cycle (image)
CL: class loading
JIT: JIT compilation
GC: garbage collection
Slide 13
Tiered compilation
C1 (a.k.a. client compiler)
• Shorter compilation time
• Less aggressive optimization
• Lower throughput
C2 (a.k.a. server compiler)
• Longer compilation time
• Highly optimized
• Better throughput
Slide 14
Compilation levels
0: Interpreter
1: C1 full optimization (no profiling)
2: C1 with invocation and back-edge counters
3: C1 full profiling (level 2 + MDO: MethodDataOop)
4: C2
Slide 15
Compilation level transitions
[Diagram: paths between levels 0-4 (Interpreter, C1, C2): normal path, path delayed due to C2 capacity, and deoptimization]
Slide 16
Options to improve startup time
Slide 17
Custom JRE
CDS Archive
Native Image
Warm up in advance
Code caching
CRaC/CRIU
C1 only
JIT Centralization
Leyden
Slide 19
Benchmark environment
Allocated resources per container   vCore: 2, RAM: 4GiB
JDK                                 21 (21.0.4)
GC                                  G1
Max heap                            75% of allocated RAM
Application framework               Micronaut 4.6.2
Option                              -XX:+UseStringDeduplication
                                    Other options might be specified in each case.
Measurement                         Run 1000 times
                                    Average / Percentile (50, 90, 95, 99)
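The average and percentile figures used in the measurements can be derived from a list of startup samples as sketched below. This is only an illustration: the nearest-rank percentile definition is an assumption (the slides do not say which definition was used), and the sample values and the class name StartupStats are hypothetical.

```java
import java.util.Arrays;

class StartupStats {
    /** Arithmetic mean of the samples, in milliseconds. */
    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElseThrow();
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical startup times in milliseconds (a real run would have 1000 of them).
        long[] samplesMs = {931, 888, 905, 940, 899, 1012, 917, 893, 925, 960};
        System.out.printf("Average: %.1f ms%n", average(samplesMs));
        for (double p : new double[] {50, 90, 95, 99}) {
            System.out.printf("%.0fP: %d ms%n", p, percentile(samplesMs, p));
        }
    }
}
```

Reporting the high percentiles next to the average matters for startup, because a few slow cold starts can hide behind a good mean.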
W..., wait! Does just extracting the executable JAR file improve startup time?
Slide 22
Result: JAR file extraction
Average time: (JAR) 931ms, (Extracted) 888ms
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 80%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
Slide 23
Improve class loading
Custom JRE
CDS Archive
Slide 24
Custom JRE
Reduce the number of classes to be loaded.
jdeps
# Print the modules that the application and its libraries depend on
jdeps -R \
  -cp "target/lib/*" \
  --print-module-deps \
  --ignore-missing-deps \
  --multi-release 21 \
  target/App.jar
jlink
# Build a runtime image that contains only the listed modules
jlink --module-path $JAVA_HOME/jmods \
  --add-modules ${MODULE_LIST} \
  --no-header-files \
  --no-man-pages \
  --output linked
# --compress=0/1/2 is deprecated
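After building a custom runtime image, it can help to check from inside the application which modules were actually resolved. The sketch below uses the standard ModuleLayer API; the class name RuntimeModules is hypothetical. Note that it lists the modules in the boot layer of the running JVM, which is not necessarily every module contained in the image (java --list-modules shows that).

```java
import java.util.List;
import java.util.stream.Collectors;

class RuntimeModules {
    /** Names of the modules resolved into the boot layer of the running JVM, sorted. */
    static List<String> bootModuleNames() {
        return ModuleLayer.boot().modules().stream()
                .map(Module::getName)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = bootModuleNames();
        System.out.println(names.size() + " modules in the boot layer:");
        names.forEach(name -> System.out.println("  " + name));
    }
}
```

Running it once on the full JDK and once on the linked image gives a quick view of how much was left out.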
Slide 25
Result: Custom JRE
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 80%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
Slide 26
Benefits and cautions (Custom JRE)
Benefits
Startup time and memory footprint are improved since the number of classes to be loaded is decreased.
Cautions
Some effort is required to create a custom JRE (e.g., a multi-stage build to create the container image).
Note that jdeps sometimes does not find dependency modules such as jdk.crypto.ec.
From JDK 22, jdk.crypto.ec is included in java.base.
This reduces the JRE size only.
Slide 27
CDS Archive
Change the way classes are loaded.
App CDS (JEP 310 / JDK 10)
Application Class Data Sharing (AppCDS) stores classes used by your applications in an archive file. (The java Command (oracle.com))
Default CDS (JEP 341 / JDK 12)
Created at the JDK build time by running -Xshare:dump, using G1 GC and 128M Java heap (Oracle JDK / Class Data Sharing (oracle.com))
Dynamic CDS (JEP 350 / JDK 13)
Dynamic CDS archive extends application class-data sharing (AppCDS) to allow dynamic archiving of classes when a Java application exits. (Class Data Sharing (oracle.com))
Slide 28
CDS Archive
# Create a static CDS archive
$ java -Xshare:off \
  -XX:DumpLoadedClassList= -jar app.jar
$ java -Xshare:dump -XX:SharedArchiveFile= \
  -XX:SharedClassListFile=
# Create a dynamic CDS archive when the application exits
$ java -XX:ArchiveClassesAtExit= -jar app.jar
# Use the CDS archive with the application
$ java -XX:SharedArchiveFile= -jar app.jar
# Create the CDS archive automatically (since JDK 19)
$ java -XX:+AutoCreateSharedArchive \
  -XX:SharedArchiveFile= -jar app.jar
Other options are found in The java Command (oracle.com)
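When experimenting with the options above, it is easy to launch a run without the intended flags. The following sketch, based on the standard java.lang.management API, reports which CDS-related arguments the running JVM received and how many classes it has loaded so far. The class name CdsLaunchCheck is hypothetical, and the check only inspects the command line: it does not prove that the archive was actually mapped (launching with -Xshare:on makes the JVM fail if sharing cannot be enabled).

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

class CdsLaunchCheck {
    // Option names taken from the commands shown above.
    private static final List<String> CDS_OPTIONS = List.of(
            "-Xshare", "SharedArchiveFile", "SharedClassListFile",
            "DumpLoadedClassList", "ArchiveClassesAtExit", "AutoCreateSharedArchive");

    /** Returns the JVM arguments that mention one of the CDS-related options. */
    static List<String> cdsRelatedArguments(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> CDS_OPTIONS.stream().anyMatch(arg::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("CDS-related JVM arguments: " + cdsRelatedArguments(jvmArgs));

        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        System.out.println("Currently loaded classes: " + classLoading.getLoadedClassCount());
        System.out.println("Total loaded classes:     " + classLoading.getTotalLoadedClassCount());
    }
}
```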
Slide 29
Result: CDS Archive (Static CDS only)
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 40%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
CDS              43.52%
Slide 30
Result: CDS Archive (Static & Dynamic CDS w/ training)
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 40%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
CDS              43.52%
CDS Combined     42.55%
Slide 31
Benefits and cautions (CDS Archive)
Benefits
Class loading time is improved.
Available on any platform.
Dynamic CDS and static CDS can coexist.
CDS archives can also be used with a custom JRE.
Cautions
Whenever the application is updated, the CDS archive must be recreated.
Slide 32
What if CDS is used with the extracted JAR? (CDS Archive)
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 40%-100%; labelled values below]
JAR                       100.00%
CDS                        43.52%
CDS Combined               42.55%
CDS extracted              88.30%
CDS extracted combined     89.33%
Slide 33
C1 only
Use only C1 without profiling
-XX:TieredStopAtLevel=1
By default, the JVM uses C2 when the platform has multi-core processors or runs a 64-bit VM.
With just C1, there is no profiling overhead, so could we get better performance than with profiling enabled?
According to some cloud vendors’ documentation, C1 is one of the options to improve startup time.
Customize Java runtime startup behavior for Lambda functions - AWS Lambda (amazon.com)
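One way to see the effect of -XX:TieredStopAtLevel=1 is to compare how much time the JIT compiler has spent in runs with and without the flag. The sketch below reads that figure through the standard CompilationMXBean; the class name JitTimeProbe and the busyWork loop are hypothetical placeholders for a real workload, and no particular numbers are implied.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

class JitTimeProbe {
    /** Describes the JIT compiler and the compilation time accumulated so far. */
    static String describeJit() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            return "No JIT compiler (interpreter only)";
        }
        String time = jit.isCompilationTimeMonitoringSupported()
                ? jit.getTotalCompilationTime() + " ms"
                : "not monitored";
        return jit.getName() + ", total compilation time: " + time;
    }

    /** Hypothetical hot loop, standing in for the application's real work. */
    static long busyWork(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += (long) i * i % 7;
        }
        return acc;
    }

    public static void main(String[] args) {
        long result = busyWork(50_000_000);
        System.out.println("Work result: " + result);
        System.out.println(describeJit());
    }
}
```

Running the same class once with default settings and once with -XX:TieredStopAtLevel=1 shows how the compilation time changes for this particular workload.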
Slide 34
Result: C1 only
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 40%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
CDS              43.52%
CDS Combined     42.55%
C1               83.90%
Slide 35
Benefits and cautions (C1 only)
Benefits
Short-lived applications can benefit.
As no profiling occurs, startup time is reduced.
Custom JRE, CDS archive, and this option can be used together.
Cautions
This setting is not useful for long-running applications, since such applications should leverage the highly optimized code generated by C2.
AOT (Ahead-of-Time) compilation
Resolve dependencies and compile code at build time.
JDK 9-16: experimental (deprecated, then removed in JDK 17)
JDK Support
GraalVM (Native Image)
OpenJ9
OpenJDK (Project Leyden) etc.
Slide 40
Result: Native Image
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 0%-100%, with a zoomed view of Native on a 0.80%-1.05% axis; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
CDS              43.52%
CDS Combined     42.55%
C1               83.90%
Native            0.91%
Slide 41
Benefits and cautions (Native Image)
Benefits
Applications can start rapidly.
Lower memory footprint and other advantages.
Cautions
Hardware/Platform (CPU/OS) specific
Longer build time
Some effort is required to support reflection.
Slide 42
Shortcut to reach peak performance
Warm up in advance
Code cache
Checkpoint
Slide 43
Use checkpoints (CRaC/CRIU)
CRIU (Checkpoint/Restore in Userspace)
CRIU support - (eclipse.dev)
CRaC (Coordinated Restore at Checkpoint)
Java on CRaC - Optimize JVM Start-Up | Azul
Slide 44
CRaC (Coordinated Restore at Checkpoint)
[Diagram labels: Bypassed; Application start point]
Slide 45
Please note that...
"CRaC implementation creates the checkpoint only if the whole Java
instance state can be stored in the image. Resources like open files
or sockets are cannot, so it is required to release them when
checkpoint is made. CRaC emits notifications for an application to
prepare for the checkpoint and return to operating state after
restore."
https://github.com/CRaC/docs
CRaC/CRIU
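The release-and-reacquire pattern described in the quotation can be sketched as follows. This is a self-contained simulation, not the CRaC API itself: the CheckpointAware interface is a hypothetical stand-in for the beforeCheckpoint/afterRestore callback pair that CRaC defines in org.crac.Resource (whose real methods also take a context argument and may throw checked exceptions), and LogFileResource is an invented example of a resource holding an open file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class CracSketch {
    /** Hypothetical stand-in for the two callbacks of org.crac.Resource. */
    interface CheckpointAware {
        void beforeCheckpoint();
        void afterRestore();
    }

    /** Holds an open file, which must be released before a checkpoint is taken. */
    static class LogFileResource implements CheckpointAware {
        private final Path path;
        private BufferedWriter writer;

        LogFileResource(Path path) {
            this.path = path;
            this.writer = open();
        }

        static LogFileResource onTempFile() {
            try {
                return new LogFileResource(Files.createTempFile("crac-sketch", ".log"));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        private BufferedWriter open() {
            try {
                return Files.newBufferedWriter(path,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        boolean isOpen() {
            return writer != null;
        }

        void log(String line) {
            try {
                writer.write(line);
                writer.newLine();
                writer.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        List<String> readLines() {
            try {
                return Files.readAllLines(path);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void beforeCheckpoint() {
            // Release the open file so that no file handle is left in the process state.
            try {
                writer.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            writer = null;
        }

        @Override
        public void afterRestore() {
            // Reacquire the file and return to the operating state.
            writer = open();
        }
    }

    public static void main(String[] args) {
        LogFileResource resource = LogFileResource.onTempFile();
        resource.log("warm-up done");
        resource.beforeCheckpoint(); // a checkpoint would be taken at this point
        resource.afterRestore();     // a restored process would continue from here
        resource.log("serving again");
        System.out.println(resource.readLines());
        resource.beforeCheckpoint(); // reuse the callback to close the file on exit
    }
}
```

With the real library, such a resource would be registered with the CRaC runtime so that the callbacks are invoked around the actual checkpoint and restore; in this sketch they are simply called by hand.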
Slide 46
CRaC
# 1. Start the application in checkpoint mode.
$JAVA_HOME/bin/java \
  -XX:CRaCCheckpointTo= -jar App.jar
# 2. After warm-up, request a checkpoint.
jcmd App.jar JDK.checkpoint
# 3. Restore from the snapshot.
$JAVA_HOME/bin/java -XX:CRaCRestoreFrom=
Slide 47
Result: CRaC/CRIU
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 0%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
CDS              43.52%
CDS Combined     42.55%
C1               83.90%
Native            0.91%
CRaC              2.39%
Benefits and cautions (CRaC/CRIU)
Benefits
Works well for containers.
Startup time is quite short.
Cautions
Exactly the same dependencies and environment are required between executions.
The project is still in progress.
Privileged operation is required.
Some effort is needed to capture a checkpoint (automation is key...)
Slide 50
Startup time ratio (Base=100, smaller is better)
[Bar chart: series: Average, 50P, 90P, 95P, 99P; axis 0%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
CDS              43.52%
CDS Combined     42.55%
C1               83.90%
Native            0.91%
CRaC              2.39%
Slide 51
Future
Slide 52
Project Leyden
https://openjdk.org/projects/leyden/
Goal
Improve the startup time, time to peak performance, and footprint of Java programs.
Focus
Standardize AOT for the HotSpot JVM
Start native, but support and optimize dynamic stuff later
Resources
Project Leyden (openjdk.org)
Slide 53
Concept: Shifting computation (1/2)
from runtime to earlier experimental executions, known as training runs.

Unified Cache Data Store (CDS) Archive
  Store class metadata, heap objects, profiling data, and compiled code.
  -XX:CacheDataStore
Loaded Classes in CDS Archives
  Preload classes as soon as the application starts.
  -XX:+PreloadSharedClasses
Method Profiles in CDS Archives
  Store method profiles from training runs in the CDS archive, allowing the Just-In-Time (JIT) compiler to start compiling earlier during warm-up.
  -XX:+RecordTraining
  -XX:+ReplayTraining
AOT Resolution of Constant Pool Entries
  Resolve many constant pool entries during the training run, improving start-up times and enabling better code generation by the AOT compiler.
  -XX:+ArchiveFieldReferences
  -XX:+ArchiveMethodReferences
  -XX:+ArchiveInvokeDynamic
Slide 54
Concept: Shifting computation (2/2)
from runtime to earlier experimental executions, known as training runs.

AOT Compilation of Java Methods
  Identify frequently used methods during the training run, compile them, and store them with the CDS archive.
  -XX:+StoreCachedCode
  -XX:+LoadCachedCode
  -XX:CachedCodeFile
AOT Generation of Dynamic Proxies and Reflection Data
  Reduce start-up times by generating dynamic proxies and reflection data.
  -XX:+ArchiveDynamicProxies
  -XX:+ArchiveReflectionData
Class Loader Lookup Cache
  Speed up repeated class lookups, which are common in application frameworks, by caching them.
  -XX:+ArchiveLoaderLookupCache
Slide 55
Benchmark environment for Leyden
Allocated resources per container   vCore: 2, RAM: 4GiB
JDK                                 21 (21.0.4) Leyden Early-Access Builds (java.net)
GC                                  G1
Max heap                            75% of allocated RAM
Application framework               Micronaut 4.6.2
Option                              -XX:+UseStringDeduplication
                                    -XX:CacheDataStore=
Measurement                         Run 1000 times
                                    Average / Percentile (50, 90, 95, 99)
Slide 56
First, run the following command to train the application and generate AOT-compiled code.
java -XX:+UseG1GC \
-XX:MaxRAMPercentage=75 \
-XX:InitialRAMPercentage=75 \
-XX:+UseStringDeduplication \
-XX:CacheDataStore= \
-jar App.jar
Slide 57
(Currently) two files are generated.
The file contains classes, heap objects and profiling
data harvested from the training run.
.code The file contains AOT-compiled methods, optimized
for the execution behaviors observed during the
training run.
[NOTE] Data in this file will be merged
into in a future release.
Slide 58
Leyden tries to improve startup time in several ways.
Code cache store
AOT compiled code
Slide 59
Next, run the same command again to start the application with the generated CDS (Cache Data Store) file.
java -XX:+UseG1GC \
-XX:MaxRAMPercentage=75 \
-XX:InitialRAMPercentage=75 \
-XX:+UseStringDeduplication \
-XX:CacheDataStore= \
-jar App.jar
Slide 60
Result: Leyden
[Bar chart: startup time relative to JAR; series: Average, 50P, 90P, 95P, 99P; axis 0%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
CDS              43.52%
CDS Combined     42.55%
C1               83.90%
Native            0.91%
CRaC              2.39%
Leyden           33.10%
Startup time ratio (Base=100, smaller is better)
[Bar chart: series: Average, 50P, 90P, 95P, 99P; axis 0%-100%; labelled values below]
JAR             100.00%
Extracted JAR    92.86%
jlink            94.85%
CDS              43.52%
CDS Combined     42.55%
C1               83.90%
Native            0.91%
CRaC              2.39%
Leyden           33.10%
Slide 63
Takeaways
Slide 64
Takeaways
We have several options to improve startup time.
Updating the Java version is another option.
Several projects to improve startup time, such as CRaC and Leyden, are ongoing.
Please note that we should choose the most suitable technique based on the characteristics and requirements of each application.