Slide 1

Slide 1 text

Make your Java applications start faster - now and future - NISHIKAWA, Akihiro

Slide 2

Slide 2 text

Who am I?
{
  "name": "NISHIKAWA, Akihiro",
  "job": "Cloud Solution Architect@Microsoft",
  "love": [
    "JVM",
    "GraalVM",
    "Joining technical conferences (as speaker as well as audience)"
  ],
  "expertise": [
    "Application integration",
    "Container and Serverless solutions"
  ]
}

Slide 3

Slide 3 text

Agenda
• Importance of startup time
• Startup procedure
• Options to improve startup time
• Future

Slide 4

Slide 4 text

Importance of startup time

Slide 5

Slide 5 text

Performance... which is your concern?
• Startup
• CPU usage
• Throughput
• Latency
• Size
• Memory footprint

Slide 6

Slide 6 text

The situation has been changing
All aspects are expected, of course :)

Characteristics of typical Java applications:
| Startup | Latency | Throughput | Footprint | Remarks |
| △ | ○→◎ | ○→◎ | ○ | No need to worry about startup for applications residing on an application server. |

Requirements for serverless applications and autoscaling containers:
| Startup | Latency | Throughput | Footprint | Remarks |
| !!! | !! | !! | !! | For short-lived applications, startup time is much more important. |

Slide 7

Slide 7 text

Serverless adoption
1. Node.js (62.9%)
2. Python (20.8%)
3. Go (6.4%)
4. Java (6.1%)
5. C# (3.8%)
Source: "The Future of Java" by Mark Little, Devoxx UK 2022 (YouTube)

Slide 8

Slide 8 text

Startup procedure

Slide 9

Slide 9 text

Startup
What happens when starting a Java application?

1. JVM Startup (fast)
   • JVM: load and initialize; generate bytecode templates
2. Application Startup (quick)
   • JVM: load application classes; initialize application classes; application-specific initialization
3. Application Warmup (takes the longest)
   • JVM: compile / deoptimize / recompile
   • Application: process specific workloads (1st operation vs. 2nd operation or later)
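The split between JVM startup and application startup can be observed from inside the program: the standard RuntimeMXBean exposes the JVM's process start time, so the gap until the first line of main() approximates the JVM-startup and class-loading phases. A minimal sketch (class name and output format are mine, not from the slides):

```java
import java.lang.management.ManagementFactory;

// Prints roughly how long the JVM-startup phase took before application code ran.
public class StartupProbe {
    // Milliseconds elapsed between JVM process start and "now".
    static long millisSinceJvmStart() {
        long jvmStart = ManagementFactory.getRuntimeMXBean().getStartTime(); // epoch millis
        return System.currentTimeMillis() - jvmStart;
    }

    public static void main(String... args) {
        System.out.printf("JVM startup + class loading took ~%d ms before main()%n",
                millisSinceJvmStart());
    }
}
```

Running this with and without the options discussed later (custom JRE, CDS, and so on) gives a crude first look at how much each one shaves off.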

Slide 10

Slide 10 text

Life cycle image

Slide 11

Slide 11 text

Tiered compilation
Introduced in Java 7 and enabled by default since Java 8.

C1 (a.k.a. client compiler)
• Shorter compilation time
• Less aggressive optimization
• Lower peak throughput

C2 (a.k.a. server compiler)
• Longer compilation time
• Highly optimized
• Better peak throughput

Slide 12

Slide 12 text

Compilation levels
| Level | Compiler | Description |
| 0 | Interpreter | |
| 1 | C1 | Full optimization (no profiling) |
| 2 | C1 | With invocation and back-edge counters |
| 3 | C1 | Full profiling (level 2 + MDO: MethodDataOop) |
| 4 | C2 | |
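The effect of these levels can be seen with a tiny warmup loop: the first batch of calls runs interpreted (level 0), and once the invocation counters cross the tier thresholds the method is compiled and later batches typically run much faster. A minimal sketch (class and method names are mine; exact timings vary by machine, so no figures are asserted):

```java
// Repeatedly times a hot method; under the default tiered setup, later
// batches usually speed up once C1- and then C2-compiled code kicks in.
// Try running it with -XX:+PrintCompilation to watch the level transitions.
public class WarmupDemo {
    // A method called often enough to cross Tier3/Tier4 invocation thresholds.
    static long sumOfSquares(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += (long) i * i;
        return s;
    }

    public static void main(String... args) {
        for (int batch = 0; batch < 5; batch++) {
            long start = System.nanoTime();
            long result = 0;
            for (int i = 0; i < 20_000; i++) result = sumOfSquares(1_000);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("batch %d: %d us (result=%d)%n", batch, micros, result);
        }
    }
}
```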

Slide 13

Slide 13 text

Compilation level transitions (Interpreter = 0, C1 = 1-3, C2 = 4)
• Normal path: 0 → 3 → 4
• Delayed due to C2 capacity: 0 → 2 → 3 → 4
• Optimization is not valuable (trivial methods): 0 → 3 → 1

Slide 14

Slide 14 text

Default thresholds / JDK 17

$ java -XX:+PrintFlagsFinal -version | grep CompileThreshold
     intx CompileThreshold                     = 10000      {pd product} {default}
   double CompileThresholdScaling              = 1.000000   {product} {default}
    uintx IncreaseFirstTierCompileThresholdAt = 50         {product} {default}
     intx Tier2CompileThreshold               = 0          {product} {default}
     intx Tier3CompileThreshold               = 2000       {product} {default}
     intx Tier4CompileThreshold               = 15000      {product} {default}
java version "17.0.6" 2023-01-17 LTS
Java(TM) SE Runtime Environment (build 17.0.6+9-LTS-190)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.6+9-LTS-190, mixed mode, sharing)

Note: Tiered compilation is enabled by default since Java 8 (you can disable it, of course). When tiered compilation is enabled, the JVM does not use the CompileThreshold parameter.

Slide 15

Slide 15 text

Method compilation life cycle
• Interpreter: interpret and profile. When counters reach Tier3CompileThreshold, the method is compiled by C1.
• C1: compiled code is saved in the code cache; profiling continues. When counters reach Tier4CompileThreshold, the method is compiled by C2.
• C2: compiled code is saved in the code cache. On deoptimization, the compiled code is discarded and the method runs in the interpreter again.

Slide 16

Slide 16 text

Startup time and performance - Fibonacci numbers

// java Fib.java 40 --> 40th number is 102,334,155
// -XX:+UseG1GC -Xmx2g -Xms2g -XX:+UseStringDeduplication
public class Fib {
    public static void main(String... args) {
        if (args.length != 1) return;
        long num = Long.valueOf(args[0]);
        System.out.printf("%d(st/nd/rd/th) >> %d\n", num, fib(num));
    }

    static long fib(long n) {
        if (n < 2) return n;
        return fib(n - 2) + fib(n - 1);
    }
}

Slide 17

Slide 17 text

Results (run on Intel Core i7 2.80 GHz, 4 cores with Hyper-Threading, 16GB RAM)
| Configuration | Compilation time (sec) | Execution time (sec) |
| Interpreter only | N/A | 22.890 |
| C1 only (no profiling in interpreter) | C1: 1.912 | 2.073 |
| C2 only (no profiling in C1) | C2: 16.745 | 2.511 |
| Tiered compilation (Interpreter > C1) | C1: 0.248 | 0.756 |
| Tiered compilation (Interpreter > C1 > C2) | C1: 0.261, C2: 0.526 | 0.127 |

Slide 18

Slide 18 text

Options to improve startup time

Slide 19

Slide 19 text

Spec and configuration
• Hardware: Intel Core i7 2.80 GHz (4 cores, Hyper-Threading enabled), 16GB RAM
• OS: Ubuntu 22.04
• JDK: 17 (17.0.6); GC: G1; Heap: max/min 2g
• Application framework: Micronaut 3.8.5
• Option: +UseStringDeduplication (other options may be specified in each case)
• Measurement: run 100 times; average / percentiles (50, 90, 95, 99)

Slide 20

Slide 20 text

1. Baseline

java -XX:+UseG1GC \
     -Xmx2g -Xms2g \
     -XX:+UseStringDeduplication \
     -jar App.jar

Slide 21

Slide 21 text

Note: -Xverify:none and -noverify
• Deprecated since JDK 13 and will be removed in a future release. [JDK-8218003] Release Note: Deprecated Java Options -Xverify:none and -noverify - Java Bug System (openjdk.org)
• For users who need to run without startup verification, AppCDS allows you to archive classes. The classes are verified during archiving, which avoids verification at runtime.

Slide 22

Slide 22 text

🤔 So, which options can we take to improve application startup time?

Slide 23

Slide 23 text

Reduce time for class loading
Especially effective for:
• JVM Startup
• Application Startup

Slide 24

Slide 24 text

1) Custom JRE
Reduce the number of classes to be loaded: use jdeps to find the required modules, then jlink to build the runtime.

$ jdeps -R \
    -cp "target/dependency/*" \
    --print-module-deps \
    --ignore-missing-deps \
    --multi-release 17 \
    target/App.jar
# java.base,java.compiler,
# java.desktop,java.management,
# java.naming,java.sql,java.xml,
# jdk.unsupported

$ jlink --compress=2 \
    --module-path $JAVA_HOME/jmods \
    --add-modules \
      java.base,java.compiler,\
      java.desktop,java.management,\
      java.naming,java.sql,\
      java.xml,jdk.unsupported \
    --no-header-files \
    --no-man-pages \
    --output linked

Slide 25

Slide 25 text

1) Custom JRE
Slightly improved, but not by much.
[Chart: startup time — average and 50/90/95/99 percentiles — Baseline vs. Custom JRE]

Slide 26

Slide 26 text

1) Custom JRE: benefits and drawbacks
Benefits
• Startup time and memory footprint improve, since fewer classes are loaded.
Drawbacks
• Some effort is required to create the custom JRE (e.g., a multi-stage build when creating container images).
• Note that jdeps sometimes fails to find dependency modules such as jdk.crypto.ec.

Slide 27

Slide 27 text

2) CDS Archive
Change the way classes are loaded.
• CDS was introduced in 8u40.
• Default CDS (JEP 341 / JDK 12)
• AppCDS (JEP 310 / JDK 10)
• Dynamic CDS (JEP 350 / JDK 13)
• Some distributions don't include the default CDS archive, e.g. Microsoft Build of OpenJDK (see Release Notes for the Microsoft Build of OpenJDK | Microsoft Learn).

Slide 28

Slide 28 text

2) CDS Archive
Create and use an archive to run an application

# Create a Dynamic CDS archive when the application exits
$ java -XX:ArchiveClassesAtExit=<archive file> -jar App.jar

# Run the application with the CDS archive
$ java -XX:SharedArchiveFile=<archive file> -jar App.jar

A CDS archive and a custom JRE can be used together.

Slide 29

Slide 29 text

2) CDS Archive
Better than the custom JRE case.
[Chart: startup time — average and 50/90/95/99 percentiles — Baseline, Custom JRE, CDS, CDS + custom JRE]

Slide 30

Slide 30 text

2) CDS Archive: benefits and drawbacks
Benefits
• Improves class-loading time.
• Available on any platform.
• Dynamic CDS and a custom JRE can be used together.
Drawbacks
• Whenever the application is updated, the CDS archive must be re-created.

Slide 31

Slide 31 text

🤔 If not only application startup time but also throughput and latency are required, which option can we take?

Slide 32

Slide 32 text

Use only C1: -XX:TieredStopAtLevel=1
The JVM selects C2 by default on multi-core processors or 64-bit VMs.
If you choose to use only C1:
• There is no profiling overhead.
• You get better performance than when profiling is enabled.
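Whether stopping at C1 actually reduces JIT work can be checked from inside the application via the standard CompilationMXBean, which reports cumulative JIT compilation time. A small probe (class name is mine) to run once with the default tiered setup and once with -XX:TieredStopAtLevel=1 for comparison:

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Reports cumulative JIT compilation time; compare runs with
// -XX:TieredStopAtLevel=1 against the default tiered setup.
public class JitTimeProbe {
    // Total JIT time in ms, or -1 if unavailable (e.g. interpreter-only mode).
    static long totalJitMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) return -1;
        return jit.getTotalCompilationTime();
    }

    public static void main(String... args) {
        System.out.println("Total JIT compilation time so far: "
                + totalJitMillis() + " ms");
    }
}
```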

Slide 33

Slide 33 text

Use only C1: startup time
[Chart: startup time — average and 50/90/95/99 percentiles — Baseline, Custom JRE, CDS, CDS + custom JRE, C1, C1 + CDS + custom JRE]

Slide 34

Slide 34 text

Use only C1: benefits and drawbacks
Benefits
• Short-lived applications benefit most.
• Since no profiling occurs, startup time is reduced.
• Can be combined with a custom JRE and a CDS archive.
Drawbacks
• Not useful for long-running applications, which should leverage the highly optimized code generated by C2.

Slide 35

Slide 35 text

🤔 If resources such as CPU and memory are quite restricted, which option can we take?

Slide 36

Slide 36 text

Offloading JIT workload
Works for:
• Application Startup
• Application Warmup (in some cases)

Slide 37

Slide 37 text

1) AOT compilation
Resolve dependencies in advance and package into standalone executables.
Ahead-of-time (AOT) compilation:
• JDK 9-17: experimental
• GraalVM Native Image
• OpenJ9 AOT, etc.

Slide 38

Slide 38 text

1) AOT compilation
GraalVM Native Image

Generic:
$ native-image App.class
$ native-image -jar App.jar

Micronaut:
$ mvn package -Dpackaging=native-image
$ gradle nativeCompile

Spring:
$ mvn -Pnative spring-boot:build-image
$ gradle bootBuildImage
# Using Native Build Tools
$ mvn -Pnative native:compile
$ gradle nativeCompile

Slide 39

Slide 39 text

1) AOT compilation
Startup time (GraalVM Enterprise 22.3.1, JDK 17; PGO is not used)
[Chart: startup time — average and 50/90/95/99 percentiles — Baseline, Custom JRE, CDS, CDS + custom JRE, C1, C1 + CDS + custom JRE, AOT]

Slide 40

Slide 40 text

1) AOT compilation: benefits and drawbacks
Benefits
• Rapid startup
• Lower memory footprint
Drawbacks
• Hardware/platform (CPU/OS) specific
• Long compilation time
• The generated executable is bigger than the original jar file and not suitable for long-running workloads (as of now).
• Some effort is required for reflection support.

Slide 41

Slide 41 text

2) Distributed JIT
If JIT compilation is offloaded to another environment and the generated code is returned to the runtime environment, would performance improve?
Distributed JIT:
• OpenJ9 JITServer (Eclipse OpenJ9) — IBM Semeru Runtime, Resources and Tools (IBM Developer)
• Azul Cloud Native Compiler — Java Compilation in the Cloud | Cloud Native Compiler (azul.com)

Slide 42

Slide 42 text

2) Distributed JIT: concept (conventional)
JIT compilation runs in each JVM.
[Diagram: multiple VMs or containers, each running a Java application, a JVM, and its own JIT compilation]

Slide 43

Slide 43 text

2) Distributed JIT: concept
JIT compilation runs in a dedicated JVM instance.
• Each JVM instance communicates with the JIT JVM instance.
[Diagram: VMs or containers run the Java application and JVM; a JIT Server or Cloud Native Compiler Service performs JIT compilation. Each JVM requests compilation → the server returns the generated code]

Slide 44

Slide 44 text

2) Distributed JIT: benefits and drawbacks
Benefits
• Java applications can run with fewer resources; especially useful for apps running in containers.
• JIT compilation might be faster (depending on circumstances).
Drawbacks
• Network latency
• Maybe not suitable for very short-lived applications
• Not standardized yet (OpenJ9, Azul Cloud Native Compiler)

Slide 45

Slide 45 text

🤔 If you need high throughput from the very beginning, whatever the cost, which option can we take?

Slide 46

Slide 46 text

Reach peak performance faster
Works for:
• Application Startup (in some cases)
• Application Warmup

Slide 47

Slide 47 text

JIT Caching
Ordinarily, JIT compilation runs with profiling data collected during the interpreter (and/or C1) phase. If we can take snapshots, persist them to storage, and restore them later, the C2-generated hot code can provide high performance from the start.

Slide 48

Slide 48 text

JIT Caching
JWarmup (Alibaba Dragonwell)
• JEP draft: JWarmup precompile java hot methods at application startup (openjdk.org)
Azul ReadyNow! / Compile Stashing (Azul)
• ReadyNow!® - Azul | Better Java Performance, Superior Java Support
• Using Compile Stashing (azul.com)
CRaC (Coordinated Restore at Checkpoint)
• Based on CRIU (Checkpoint/Restore In Userspace).
• Azul Provides the CRaC in AWS SnapStart Builds | Foojay.io (Java 11 based)
Dynamic AOT / CRIU support (OpenJ9)
• Fast JVM startup with OpenJ9 CRIU Support – Eclipse OpenJ9 Blog

Slide 49

Slide 49 text

JIT Caching
How to take a snapshot (checkpoint)
"CRaC implementation creates the checkpoint only if the whole Java instance state can be stored in the image. Resources like open files or sockets cannot, so it is required to release them when the checkpoint is made. CRaC emits notifications for an application to prepare for the checkpoint and return to operating state after restore."
https://github.com/CRaC/docs
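The notification flow quoted above is exposed by the org.crac library as a Resource with beforeCheckpoint/afterRestore callbacks, registered via Core.getGlobalContext().register(...). Since that library is not part of the JDK, the sketch below defines a local stand-in interface purely to illustrate the pattern; the interface, class, and field names here are hypothetical, and the real API also passes a Context argument to the callbacks:

```java
// Illustrative stand-in for the org.crac Resource pattern: release
// non-checkpointable state (sockets, open files) before the snapshot
// and re-acquire it after restore. Not the real org.crac API.
interface CheckpointResource {
    void beforeCheckpoint() throws Exception; // close files/sockets here
    void afterRestore() throws Exception;     // reopen them here
}

public class CracSketch implements CheckpointResource {
    private boolean connectionOpen = true; // stands in for a real socket

    @Override public void beforeCheckpoint() { connectionOpen = false; }
    @Override public void afterRestore()     { connectionOpen = true; }

    boolean isOpen() { return connectionOpen; }

    public static void main(String... args) throws Exception {
        CracSketch app = new CracSketch();
        app.beforeCheckpoint();  // the runtime notifies before the snapshot
        System.out.println("at checkpoint, connection open = " + app.isOpen());
        app.afterRestore();      // the runtime notifies after restore
        System.out.println("after restore, connection open = " + app.isOpen());
    }
}
```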

Slide 50

Slide 50 text

JIT Caching
How to take a snapshot
Examples: CRaC/docs (github.com) — Tomcat / Spring Boot, Quarkus, Micronaut

# 1. Start the sample application in checkpoint mode.
$ $JAVA_HOME/bin/java -XX:CRaCCheckpointTo=<directory> -jar App.jar

# 2. After warmup, request a checkpoint (take a snapshot).
$ jcmd App.jar JDK.checkpoint

# 3. Restore from the snapshot.
$ $JAVA_HOME/bin/java -XX:CRaCRestoreFrom=<directory>

Slide 51

Slide 51 text

JIT Caching: performance
[Chart: startup time — average and 50/90/95/99 percentiles — Baseline, Custom JRE, CDS, CDS + custom JRE, C1 + CDS + custom JRE, AOT, CRaC]

Slide 52

Slide 52 text

JIT Caching: benefits and drawbacks
Benefits
• Well-warmed-up code is available whenever the application starts.
• Startup time is almost the same as in the AOT case.
Drawbacks
• Platform dependencies; not standardized yet
• Requires persistent storage
• Requires the same dependencies and environment between runs
• Some effort is needed to capture the checkpoint (development frameworks may cover this in the future)

Slide 53

Slide 53 text

Future

Slide 54

Slide 54 text

Project Leyden (openjdk.org/projects/leyden)
Goal
• Improve the startup time, time to peak performance, and footprint of Java programs.
Focus
• Standardize AOT for the HotSpot JVM
• Start native, but support and optimize dynamic features later

Slide 55

Slide 55 text

Project Galahad (proposed by Douglas Simon, Oracle Labs)
Goal
• Align Java-related GraalVM technologies and help prepare the JDK community for potential incubation into the main release in the future.
Focus
• Contribute the latest version of the GraalVM just-in-time (JIT) compiler and integrate it as an alternative to the existing HotSpot JIT compiler.
• Bring in the necessary ahead-of-time (AOT) compilation technology to make this new JIT compiler, written in Java, available instantly on JVM start.
• Galahad will pay close attention to Leyden and track the Leyden specification as it evolves.

Slide 56

Slide 56 text

Key takeaways

Slide 57

Slide 57 text

Key takeaways
• You have several options to make applications start faster and improve performance.
• Several JVM improvements help Java applications start faster.
• Several projects are ongoing or being proposed.

Slide 58

Slide 58 text

No content