Slide 1

Slide 1 text

Quick off the blocks! Rapid start options for your Java application
NISHIKAWA, Akihiro (@logico_jp), Cloud Solution Architect, Microsoft

Slide 2

Slide 2 text

Who am I? { "name": "Akihiro Nishikawa", "country": "Japan", "working-for": "Microsoft", "favourites": [ "JVM", "GraalVM", "Azure" ], "expertise": [ "Application integration", "Container and Serverless" ] }

Slide 3

Slide 3 text

Agenda
• Why startup is getting much more important
• Options to improve startup time

Slide 4

Slide 4 text

Survey in JCConf Taiwan 2024 (as of 13:50)
Which JDK version are you using now? [Bar chart, # of users; categories: 23 or later, 21, 17, 11, 8, 6 or earlier]
Which framework are you using mainly? [Bar chart, # of users; categories: Spring Boot, Struts 2, Hibernate, Quarkus, Ktor, Armeria, React]

Slide 5

Slide 5 text

Survey at JJUG night seminar in Tokyo... (as of September 12, 2024)
[Bar chart of JDK versions in use; categories: 23 or later, 22, LTS (11/17/21), 8, 7, 6 or earlier; data labels as extracted: 1, 4, 38, 18, 0, 2]

Slide 6

Slide 6 text

Why startup gets important

Slide 7

Slide 7 text

What does the word “performance” stand for?
• Startup
• CPU usage
• Throughput
• Latency
• Size
• Memory footprint

Slide 8

Slide 8 text

In the serverless and container era, short-lived applications are favoured over resident ones.
                                                      Startup  Latency  Throughput  Footprint
Java apps running on an application server            △        ◎[1]     ◎[1]        ○
Expectations from a serverless/container perspective  ◎        ◎        ◎           ◎
[1] Improves gradually over time.

Slide 9

Slide 9 text

Poor cold start performance is not favoured...

Slide 10

Slide 10 text

Boot sequence

Slide 11

Slide 11 text

Startup: What happens when a Java application starts?
1. JVM Startup (fast)
   JVM • Load and initialize • Generate bytecode templates
2. Application Startup (quick)
   JVM • Load application classes • Initialize application classes • Application-specific initialization
3. Application Warmup (long time)
   JVM • Compile/deoptimize/recompile
   Application • Process specific workloads
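To make the first two phases visible, a minimal sketch can report how long the JVM had been running when main() was reached and when initialization finished. It uses the standard java.lang.management API; the class name StartupProbe and the placeholder initialization loop are assumptions made for illustration, not part of the talk.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

final class StartupProbe {
    /** Milliseconds elapsed since the JVM process started, per the runtime MXBean. */
    static long millisSinceJvmStart() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        return System.currentTimeMillis() - runtime.getStartTime();
    }

    public static void main(String[] args) {
        // Reached once JVM startup and loading of the main class are done.
        long atMainEntry = millisSinceJvmStart();

        // Placeholder for application-specific initialization
        // (framework boot, connection setup, and so on).
        StringBuilder placeholder = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            placeholder.append(i % 10);
        }

        long afterInit = millisSinceJvmStart();
        System.out.printf("main() reached after %d ms, init done after %d ms%n",
                atMainEntry, afterInit);
    }
}
```

The warm-up phase is not captured here; it only shows up later as the JIT compiles hot methods.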

Slide 12

Slide 12 text

Life cycle (image) CL: class loading JIT: JIT compilation GC: garbage collection

Slide 13

Slide 13 text

Tiered compilation
C1 (a.k.a. client compiler)
• Shorter compilation time
• Less heavily optimized code
• Lower throughput
C2 (a.k.a. server compiler)
• Longer compilation time
• Highly optimized code
• Better throughput

Slide 14

Slide 14 text

Compilation levels
0: Interpreter
1: C1 full optimization (no profiling)
2: C1 with invocation and back-edge counters
3: C1 full profiling (level 2 + MDO: MethodDataOop)
4: C2

Slide 15

Slide 15 text

[Diagram: transitions between compilation levels 0 to 4 (Interpreter, C1, C2). Legend: normal path; delayed due to C2 capacity; deoptimization]
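One way to watch these levels in practice is to run a small program with a hot method and the HotSpot diagnostic flag -XX:+PrintCompilation, which logs each compilation together with its tier. The class TierDemo below is a hypothetical sketch written for that purpose; the iteration counts are arbitrary assumptions chosen only to make the method hot.

```java
final class TierDemo {
    /** A small method that is called often enough to be compiled at higher tiers. */
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        // Repeated calls raise the invocation and back-edge counters.
        for (int round = 0; round < 20_000; round++) {
            total += sumOfSquares(1_000);
        }
        // Print the result so the work cannot be optimized away entirely.
        System.out.println(total);
    }
}
```

Comparing the log of a default run with one started with -XX:TieredStopAtLevel=1 should show whether sumOfSquares ever reaches level 4.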

Slide 16

Slide 16 text

Options to improve startup time

Slide 17

Slide 17 text

• Custom JRE
• CDS Archive
• Native Image
• Warm up in advance
• Code caching
• CRaC/CRIU
• C1 only
• JIT Centralization
• Leyden

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Benchmark environment
Allocated resources per container
• vCore: 2
• RAM: 4 GiB
JDK 21 (21.0.4)
• GC: G1
• Max heap: 75% of allocation
Application framework
• Micronaut 4.6.2
Option
• -XX:+UseStringDeduplication
• Other options might be specified in each case.
Measurement
• Run 1000 times
• Average / percentiles (50, 90, 95, 99)
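The reported statistics can be derived from the raw startup times as in the sketch below. The slides do not say which percentile definition was used, so the nearest-rank method here is an assumption, and the class name StartupStats and the sample values are made up for illustration.

```java
import java.util.Arrays;

final class StartupStats {
    /** Arithmetic mean of the measured startup times in milliseconds. */
    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElseThrow();
    }

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMs, int p) {
        if (samplesMs.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need samples and 1 <= p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Integer ceiling of p * n / 100 avoids floating-point rounding issues.
        int rank = (p * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] samplesMs = {900, 1000, 950, 880, 1200}; // illustrative values only
        System.out.println("average = " + average(samplesMs));
        for (int p : new int[] {50, 90, 95, 99}) {
            System.out.println(p + "P = " + percentile(samplesMs, p));
        }
    }
}
```

With 1000 runs, 99P is the 990th smallest measurement under this definition.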

Slide 20

Slide 20 text

Baseline
java -XX:+UseG1GC \
  -XX:MaxRAMPercentage=75 \
  -XX:InitialRAMPercentage=75 \
  -XX:+UseStringDeduplication \
  -jar App.jar

Slide 21

Slide 21 text

W..., wait! Does just extracting the executable JAR file improve startup time?

Slide 22

Slide 22 text

Result: JAR file extraction
Average time: JAR 931 ms, extracted JAR 888 ms
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%]

Slide 23

Slide 23 text

Improve class loading Custom JRE CDS Archive

Slide 24

Slide 24 text

Custom JRE
Reduce the number of classes to be loaded.
jdeps
jdeps -R \
  -cp "target/lib/*" \
  --print-module-deps \
  --ignore-missing-deps \
  --multi-release 21 \
  target/App.jar
jlink
jlink --module-path $JAVA_HOME/jmods \
  --add-modules ${MODULE_LIST} \
  --no-header-files \
  --no-man-pages \
  --output linked
# --compress=0/1/2 is deprecated

Slide 25

Slide 25 text

Result: Custom JRE
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%]

Slide 26

Slide 26 text

Benefits and cautions (Custom JRE)
Benefits
• Startup time and memory footprint improve because fewer classes have to be loaded.
Cautions
• Some effort is required to create a custom JRE (e.g., a multi-stage build to create the container image).
• Note that jdeps sometimes does not find dependency modules such as jdk.crypto.ec.
• From JDK 22, jdk.crypto.ec is included in java.base.
• Reduces JRE size only.
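To check what a runtime image actually contains, the modules resolved in the boot layer can be listed from inside the application. This is a minimal sketch using the standard ModuleLayer API; the class name ModuleLister is hypothetical. Running it on the full JDK and on the jlink output should show the difference in module count.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ModuleLister {
    /** Sorted names of the modules resolved in the boot layer of the running JVM. */
    static List<String> bootModuleNames() {
        return ModuleLayer.boot().modules().stream()
                .map(Module::getName)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = bootModuleNames();
        names.forEach(System.out::println);
        System.out.println(names.size() + " modules in the boot layer");
    }
}
```

If a module that the application needs at runtime (jdk.crypto.ec on JDK 21, for example) is missing from the list, it has to be added to the jlink module list by hand.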

Slide 27

Slide 27 text

CDS Archive
Change the way to load classes
• App CDS (JEP 310 / JDK 10)
  Application Class Data Sharing (AppCDS) stores classes used by your applications in an archive file. (The java Command (oracle.com))
• Default CDS (JEP 341 / JDK 12)
  Created at the JDK build time by running -Xshare:dump, using G1 GC and 128M Java heap (Oracle JDK / Class Data Sharing (oracle.com))
• Dynamic CDS (JEP 350 / JDK 13)
  Dynamic CDS archive extends application class-data sharing (AppCDS) to allow dynamic archiving of classes when a Java application exits. (Class Data Sharing (oracle.com))

Slide 28

Slide 28 text

CDS Archive
# Create a static CDS archive
$ java -Xshare:off \
  -XX:DumpLoadedClassList= -jar app.jar
$ java -Xshare:dump -XX:SharedArchiveFile= \
  -XX:SharedClassListFile=
# Create a dynamic CDS archive when the application exits
$ java -XX:ArchiveClassesAtExit= -jar app.jar
# Use the CDS archive with the application
$ java -XX:SharedArchiveFile= -jar app.jar
# Create the CDS archive automatically (since JDK 19)
$ java -XX:+AutoCreateSharedArchive \
  -XX:SharedArchiveFile= -jar app.jar
Other options are found in The java Command (oracle.com)
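A quick sanity check that class data sharing is active is to look at the java.vm.info system property, which on HotSpot typically reads "mixed mode, sharing" when an archive is mapped (the same text that java -version prints). The sketch below relies on that assumption; the class name CdsCheck is hypothetical.

```java
final class CdsCheck {
    /** True if the VM info string reports that class data sharing is in use. */
    static boolean sharingEnabled(String vmInfo) {
        return vmInfo != null && vmInfo.contains("sharing");
    }

    public static void main(String[] args) {
        // On HotSpot this is usually "mixed mode, sharing" or "mixed mode".
        String vmInfo = System.getProperty("java.vm.info");
        System.out.println("java.vm.info = " + vmInfo);
        System.out.println("CDS in use: " + sharingEnabled(vmInfo));
    }
}
```

This only tells whether some archive is in use, not which classes were loaded from it.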

Slide 29

Slide 29 text

Result (Static CDS only): CDS Archive
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%]

Slide 30

Slide 30 text

Result (Static & Dynamic CDS w/ training): CDS Archive
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%]

Slide 31

Slide 31 text

Benefits and cautions (CDS Archive)
Benefits
• Reduces the time to load classes.
• Available on any platform.
• Dynamic CDS and static CDS can coexist.
• CDS archives can also be used with a custom JRE.
Cautions
• Whenever the application is updated, the CDS archive must be recreated.

Slide 32

Slide 32 text

What if CDS is used with the extracted JAR? (CDS Archive)
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, CDS 43.52%, CDS Combined 42.55%, CDS extracted 88.30%, CDS extracted combined 89.33%]

Slide 33

Slide 33 text

Use only C1 without profiling (C1 only)
-XX:TieredStopAtLevel=1
• The JVM selects C2 by default when the platform has a multi-core CPU or a 64-bit VM is used.
• With just C1, there is no profiling overhead, so could we get better performance than with profiling enabled?
• According to some cloud vendors' documentation, C1 is one of the options to improve startup time.
Customize Java runtime startup behavior for Lambda functions - AWS Lambda (amazon.com)
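When several launch scripts or container images are involved, it can help to confirm at runtime whether the flag was really passed. The following sketch scans the JVM input arguments for -XX:TieredStopAtLevel; the class name JitFlags is hypothetical, and the sketch only sees flags given to this JVM (an absent flag means the default applies).

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalInt;

final class JitFlags {
    private static final String PREFIX = "-XX:TieredStopAtLevel=";

    /** Level given via -XX:TieredStopAtLevel, or empty if the flag is absent. */
    static OptionalInt tieredStopAtLevel(List<String> jvmArgs) {
        OptionalInt level = OptionalInt.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                // If the flag is repeated, this sketch keeps the last occurrence.
                level = OptionalInt.of(Integer.parseInt(arg.substring(PREFIX.length())));
            }
        }
        return level;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        OptionalInt level = tieredStopAtLevel(jvmArgs);
        System.out.println(level.isPresent()
                ? "TieredStopAtLevel=" + level.getAsInt()
                : "TieredStopAtLevel not set (JVM default applies)");
    }
}
```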

Slide 34

Slide 34 text

Result: C1 only
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%]

Slide 35

Slide 35 text

Benefits and cautions (C1 only)
Benefits
• Short-lived applications benefit from this setting.
• As no profiling occurs, startup time is reduced.
• Custom JRE, CDS archive, and this setting can be used together.
Cautions
• This setting is not useful for long-running applications, since such applications should leverage the highly optimized code generated by C2.

Slide 36

Slide 36 text

Offloading JIT compilation
• AOT Compilation (Native Image)
• JIT Centralization

Slide 37

Slide 37 text

AOT (ahead-of-time) compilation
Resolve dependencies and compile code at build time.
JDK 9-17: experimental (deprecated and removed)
JDK Support
• GraalVM (Native Image)
• OpenJ9
• OpenJDK (Project Leyden)
• etc.

Slide 38

Slide 38 text

GraalVM Native Image
Generic
  $ native-image App
  $ native-image -jar App.jar
Micronaut
  $ mvn package -Dpackaging=native-image
  $ gradle nativeCompile
Spring Boot
  $ mvn -Pnative spring-boot:build-image
  $ gradle bootBuildImage
  # Using Native Build Tools
  $ mvn -Pnative native:compile
  $ gradle nativeCompile

Slide 39

Slide 39 text

Result: Native Image
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%]

Slide 40

Slide 40 text

Result: Native Image
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%]
[Zoomed panel: Native 0.91%]

Slide 41

Slide 41 text

Benefits and cautions (Native Image)
Benefits
• Applications can start rapidly.
• Lower memory footprint and other advantages.
Cautions
• Hardware/platform (CPU/OS) specific
• Longer build time
• Some effort is required to support reflection.
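The reflection caution can be illustrated with a class that is instantiated from a name known only at run time. On the JVM this works as is; in a native image the target class generally has to be registered for reflection in the build configuration, because the static analysis cannot see which class will be requested. The class name ReflectiveLoader is a hypothetical example.

```java
final class ReflectiveLoader {
    /** Instantiates a class by name through its no-argument constructor. */
    static Object newInstanceOf(String className) throws ReflectiveOperationException {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        // The name comes from the command line, so it is unknown at build time.
        String className = args.length > 0 ? args[0] : "java.util.ArrayList";
        System.out.println(newInstanceOf(className).getClass());
    }
}
```

Frameworks such as Micronaut and Spring Boot generate most of this metadata at build time, which is why the remaining effort is usually limited to third-party libraries.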

Slide 42

Slide 42 text

Shortcut to reach peak performance
• Warm up in advance
• Code cache
• Checkpoint

Slide 43

Slide 43 text

Use checkpoints (CRaC/CRIU)
• CRIU (Checkpoint/Restore in Userspace)
  CRIU support - (eclipse.dev)
• CRaC (Coordinated Restore at Checkpoint)
  Java on CRaC - Optimize JVM Start-Up | Azul

Slide 44

Slide 44 text

CRaC (Coordinated Restore at Checkpoint)
[Diagram labels: Bypassed; Application start point]

Slide 45

Slide 45 text

Please note that... "CRaC implementation creates the checkpoint only if the whole Java instance state can be stored in the image. Resources like open files or sockets are cannot, so it is required to release them when checkpoint is made. CRaC emits notifications for an application to prepare for the checkpoint and return to operating state after restore." https://github.com/CRaC/docs CRaC/CRIU
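The release-and-reacquire pattern described in the quote can be sketched as follows. To keep the example compilable without the CRaC library, it uses a hypothetical stand-in interface, CheckpointAware; in a real application the class would instead implement org.crac.Resource, whose beforeCheckpoint and afterRestore callbacks take a context argument, and would be registered with Core.getGlobalContext().register(...). The class name ReopenableLog is also an assumption.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical stand-in for org.crac.Resource (the real callbacks take a Context). */
interface CheckpointAware {
    void beforeCheckpoint();
    void afterRestore();
}

/** A log file that is closed before a checkpoint and reopened after restore. */
final class ReopenableLog implements CheckpointAware {
    private final Path path;
    private BufferedWriter writer;

    ReopenableLog(Path path) {
        this.path = path;
        open();
    }

    private void open() {
        try {
            writer = Files.newBufferedWriter(path,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void log(String line) {
        try {
            writer.write(line);
            writer.newLine();
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isOpen() {
        return writer != null;
    }

    @Override
    public void beforeCheckpoint() {
        // Release the open file so that no file handle remains in the image.
        try {
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        writer = null;
    }

    @Override
    public void afterRestore() {
        // Reacquire the resource once the process has been restored.
        open();
    }
}
```

Sockets, database connection pools, and similar resources need the same treatment; frameworks with CRaC support handle many of them on the application's behalf.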

Slide 46

Slide 46 text

CRaC
# 1. Start the application in checkpoint mode.
$JAVA_HOME/bin/java \
  -XX:CRaCCheckpointTo= -jar App.jar
# 2. After warm-up, request a checkpoint.
jcmd App.jar JDK.checkpoint
# 3. Restore from the snapshot.
$JAVA_HOME/bin/java -XX:CRaCRestoreFrom=

Slide 47

Slide 47 text

Result: CRaC/CRIU
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%, CRaC 2.39%]

Slide 48

Slide 48 text

Result: CRaC/CRIU
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%, CRaC 2.39%]
[Zoomed panel: Native 0.91%, CRaC 2.39%]

Slide 49

Slide 49 text

Benefits and cautions (CRaC/CRIU)
Benefits
• Works well for containers.
• Startup time is quite short.
Cautions
• Dependencies and environment must be strictly identical between executions.
• The project is still in progress.
• Privileged operation is required.
• Capturing a checkpoint takes some effort (automation is key...).

Slide 50

Slide 50 text

Startup time ratio (Base=100, smaller is better)
[Chart: series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%, CRaC 2.39%]

Slide 51

Slide 51 text

Future

Slide 52

Slide 52 text

Project Leyden
https://openjdk.org/projects/leyden/
• Goal
  Improve the startup time, time to peak performance, and footprint of Java programs.
• Focus
  Standardize AOT for the HotSpot JVM
  Start native, but support and optimize dynamic stuff later
• Resources
  Project Leyden (openjdk.org)

Slide 53

Slide 53 text

Concept: Shifting computation (1/2)
from runtime to earlier experimental executions, known as training runs.
• Unified Cache Data Store (CDS) Archive
  Store class metadata, heap objects, profiling data, and compiled code.
  -XX:CacheDataStore
• Loaded Classes in CDS Archives
  Preload classes as soon as the application starts.
  -XX:+PreloadSharedClasses
• Method Profiles in CDS Archives
  Store method profiles from training runs in the CDS archive, allowing the Just-In-Time (JIT) compiler to start compiling earlier during warm-up.
  -XX:+RecordTraining -XX:+ReplayTraining
• AOT Resolution of Constant Pool Entries
  Resolve many constant pool entries during the training run, improving start-up times and enabling better code generation by the AOT compiler.
  -XX:+ArchiveFieldReferences -XX:+ArchiveMethodReferences -XX:+ArchiveInvokeDynamic

Slide 54

Slide 54 text

Concept: Shifting computation (2/2)
from runtime to earlier experimental executions, known as training runs.
• AOT Compilation of Java Methods
  Identify frequently used methods during the training run, compile them, and store them with the CDS archive.
  -XX:+StoreCachedCode -XX:+LoadCachedCode -XX:CachedCodeFile
• AOT Generation of Dynamic Proxies and Reflection Data
  Reduce start-up times by generating dynamic proxies and reflection data.
  -XX:+ArchiveDynamicProxies -XX:+ArchiveReflectionData
• Class Loader Lookup Cache
  Speed up repeated class lookups, which are common in application frameworks, by caching them.
  -XX:+ArchiveLoaderLookupCache

Slide 55

Slide 55 text

Benchmark environment for Leyden
Allocated resources per container
• vCore: 2
• RAM: 4 GiB
JDK 21 (21.0.4) Leyden Early-Access Builds (java.net)
• GC: G1
• Max heap: 75% of allocation
Application framework
• Micronaut 4.6.2
Option
• -XX:+UseStringDeduplication
• -XX:CacheDataStore=
Measurement
• Run 1000 times
• Average / percentiles (50, 90, 95, 99)

Slide 56

Slide 56 text

First, run the following command to train the application and generate AOT-compiled code.
java -XX:+UseG1GC \
  -XX:MaxRAMPercentage=75 \
  -XX:InitialRAMPercentage=75 \
  -XX:+UseStringDeduplication \
  -XX:CacheDataStore= \
  -jar App.jar

Slide 57

Slide 57 text

(Currently) two files are generated.
• The first file contains classes, heap objects and profiling data harvested from the training run.
• The second file (suffix .code) contains AOT-compiled methods, optimized for the execution behaviors observed during the training run.
  [NOTE] Data in this file will be merged into the first file in a future release.

Slide 58

Slide 58 text

Leyden tries to improve startup time in several ways.
[Diagram label: the code cache stores AOT-compiled code]

Slide 59

Slide 59 text

Next, run the same command to start the application with the generated CDS (Cache Data Store) file.
java -XX:+UseG1GC \
  -XX:MaxRAMPercentage=75 \
  -XX:InitialRAMPercentage=75 \
  -XX:+UseStringDeduplication \
  -XX:CacheDataStore= \
  -jar App.jar

Slide 60

Slide 60 text

Result: Leyden
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%, CRaC 2.39%, Leyden 33.10%]

Slide 61

Slide 61 text

Result: Leyden
[Chart: startup time ratio per configuration; series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%, CRaC 2.39%, Leyden 33.10%]
[Zoomed panel: Native 0.91%, CRaC 2.39%, Leyden 33.10%]

Slide 62

Slide 62 text

Startup time ratio (Base=100, smaller is better)
[Chart: series: 99P, 95P, 90P, 50P, Average. Data labels: JAR 100.00%, Extracted JAR 92.86%, jlink 94.85%, CDS 43.52%, CDS Combined 42.55%, C1 83.90%, Native 0.91%, CRaC 2.39%, Leyden 33.10%]

Slide 63

Slide 63 text

Takeaways

Slide 64

Slide 64 text

Takeaways
• We have several options to improve startup time.
• Updating the Java version is another option.
• Several projects to improve startup time, such as CRaC and Leyden, are ongoing.
• Choose the most suitable technique based on the characteristics and requirements of your application.

Slide 65

Slide 65 text

Thank you!