[SnowOne 2023] Xie Junfeng: Overview of OpenJDK RISC-V port

jugnsk
March 17, 2023

We introduce our work on porting OpenJDK to RISC-V, in particular our support for instruction set extensions.

We also cover the application of Pointer Masking in ZGC.

Transcript

  1. Content (Huawei Proprietary - Restricted Distribution)

    1. Introduction
    2. No Flag Register
    3. Zifencei: FENCE.I
    4. Vector Extension
    5. Bitmanipulation Extension
    6. JIT Extension: Pointer Masking
    7. Summary
  3. Introduction

    • About Me
    > Xie Junfeng
    > Fudan University
    > Huawei Software Developer
    > Mainly works on RISC-V 32 Zero and TBI (Pointer Masking)
    > [email protected]
  4. Introduction

    • Our Team
    > the owner of the OpenJDK RISC-V port
    > https://github.com/openjdk/riscv-port
    • Key Members
    > [email protected]
    > [email protected] (resigned)
    > [email protected]
  5. Introduction

    • RISC-V
    > an open and free instruction set architecture
    > modular, collaborative
    • Huawei, OpenJDK and RISC-V
    > a lot of Java services
    > adapt to Huawei-developed hardware
    > RISC-V is free and open
    > further exploration based on RISC-V
  6. Introduction

    • Our Work on OpenJDK mainline
    > Instruction Sets: RV64IMAFDCV, Zba, Zbb, Zicsr, Zifencei
    > Zero, Template Interpreter, C1, C2
    > All GC algorithms (including ZGC and Shenandoah)
    > RV32 Zero Interpreter
    > JEP 425: Virtual Threads (JDK 20)
    > JEP 424: Foreign Function & Memory API (JDK 21)
    > Generational ZGC (ZGC repo)
    • Our Plan
    > backport to JDK 8/11/17 (in progress)
    > JEP 426: Vector API
  7. Introduction

    • Enable Extensions Manually

        product(bool, UseRVC, false, "Use RVC instructions") \
        product(bool, UseRVA22U64, false, EXPERIMENTAL, "Use RVA22U64 profile") \
        product(bool, UseRVV, false, EXPERIMENTAL, "Use RVV instructions") \
        product(bool, UseZba, false, EXPERIMENTAL, "Use Zba instructions") \
        product(bool, UseZbb, false, EXPERIMENTAL, "Use Zbb instructions") \
        product(bool, UseZbs, false, EXPERIMENTAL, "Use Zbs instructions") \
        product(bool, UseZfhmin, false, EXPERIMENTAL, "Use Zfhmin instructions") \
        product(bool, UseZic64b, false, EXPERIMENTAL, "Use Zic64b instructions") \
        product(bool, UseZicbom, false, EXPERIMENTAL, "Use Zicbom instructions") \
        product(bool, UseZicbop, false, EXPERIMENTAL, "Use Zicbop instructions") \
        product(bool, UseZicboz, false, EXPERIMENTAL, "Use Zicboz instructions") \

    • Check Extensions at Runtime

        if (UseRVV) {
          if (!(_features & CPU_V)) {
            warning("RVV is not supported on this CPU");
            FLAG_SET_DEFAULT(UseRVV, false);
          } else {
            ...
          }
        }
  8. Introduction

    [Chart: OpenJDK RISCV64 Server JDK vs Zero JDK on SPECjvm2008. Server JDK is 39x faster (geometric mean) than Zero JDK [1]]
    [Chart: OpenJDK SPECjbb2015 [2], bars for max-JOPS and critical-JOPS (data labels: 413, 53)]
    [1] Data source: https://twitter.com/shipilev/status/1479179438625595399, on HiFive Unmatched board
    [2] HiFive Unleashed board, data provided by PLCT
  10. No Flag Register

    • Compare & jump in one instruction
    > so there is no flag register
    > in RISC-V: beq(op1, op2, label)
    > on other architectures: cmp(op1, op2) & beq(label)
    • C1: save the operands at cmp and consume them at branch
    • C2: use t1 as the flag register

    Our solution on C1:

        void LIR_List::set_cmp_oprs(LIR_Op* op) {
          switch (op->code()) {
            case lir_cmp:
              // do nothing but save the two operands
              _cmp_opr1 = op->as_Op2()->in_opr1();
              _cmp_opr2 = op->as_Op2()->in_opr2();
              break;
            case lir_branch: // fall through
            case lir_cond_float_branch:
              // consume the two operands
              if (op->as_OpBranch()->cond() != lir_cond_always) {
                op->as_Op2()->set_in_opr1(_cmp_opr1);
                op->as_Op2()->set_in_opr2(_cmp_opr2);
              }
              break;
            case lir_cmove:
              op->as_Op4()->set_in_opr3(_cmp_opr1);
              op->as_Op4()->set_in_opr4(_cmp_opr2);
              break;
            case ...: ...;
          }
        }

    Our solution on C2:

        // On riscv, the physical flag register is missing, so we
        // use t1 instead, to bridge the RegFlag semantics in
        // share/opto.
        reg_def RFLAGS (SOC, SOC, Op_RegFlags, 6, x6->as_VMReg());
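
The C1 idea on this slide (a compare emits nothing and only records its operands, and the following branch consumes them) can be modeled outside HotSpot. The sketch below is a toy model, not HotSpot code: the class `CmpBranchFusion`, the string-based op format and the `lower` method are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CmpBranchFusion {
    // Lowers a toy op list. "cmp" only records its operands; a later
    // conditional branch is emitted as one fused compare-and-branch.
    public static List<String> lower(String[][] ops) {
        String opr1 = null;
        String opr2 = null;
        List<String> out = new ArrayList<>();
        for (String[] op : ops) {
            switch (op[0]) {
                case "cmp":
                    // Emit nothing; just remember the two operands.
                    opr1 = op[1];
                    opr2 = op[2];
                    break;
                case "beq":
                case "bne":
                    // Consume the saved operands; op[1] is the branch target.
                    out.add(op[0] + " " + opr1 + ", " + opr2 + ", " + op[1]);
                    break;
                default:
                    // Any other op passes through unchanged.
                    out.add(String.join(" ", op));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] ops = {
            {"cmp", "a0", "a1"},
            {"beq", "L_equal"},
        };
        System.out.println(lower(ops)); // prints [beq a0, a1, L_equal]
    }
}
```

The two-instruction sequence used on flag-register architectures collapses into a single RISC-V style branch that carries both operands.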
  12. Zifencei: FENCE.I

    • RISC-V Weak Memory Ordering (RVWMO)
    • Hart (short for hardware thread)
    > not a software-managed thread context
    > similar to the "logical core" on other ISAs
    > a Java thread is not bound to a specific hart
    • FENCE.I
    > ensures visibility of modified instructions
    > only for the current hart!

    Example:
        (thread A on hart 1) modify code
        (thread A on hart 1) fence.i
        (thread A rescheduled to hart 2) execute code (old code)
  13. Zifencei: FENCE.I

    • Syscall: flush_icache
    > do not use fence.i at user level
    > the syscall issues fence.i on all harts via inter-processor interrupt (IPI)
    > so no fence.i is needed before the relocatable code

        // Hart 1 (read hart)
        void MacroAssembler::emit_static_call_stub() {
          ...
          // ifence(); // <- not needed
          mov_metadata(xmethod, (Metadata*)NULL); // <- patchable code
          ...
        }

        // Hart 2 (write hart)
        void NativeMovConstReg::set_data(intptr_t x) {
          ...
          // Store x into the instruction stream.
          MacroAssembler::pd_patch_instruction_size(instruction_address(), (address)x); // <- write code
          ICache::invalidate_range(instruction_address(), movptr_instruction_size); // <- syscall here
          ...
        }

    Sequence: modify code -> (flush_icache syscall) fence -> fence.i -> IPI: all harts invoke fence.i -> continue
  15. Vector Extension

    • Superword Level Parallelism (SLP)
    > HotSpot supports SLP for auto-vectorization
    • Single Instruction Multiple Data (SIMD)
    > x86, ARM, MIPS...
    > incremental design
    > must be coded for data width
    • RISC-V Vector
    > generates fewer instructions than SIMD
    > hides implementation details
    > uses vsetvl/vsetivli/vsetvli instructions to set vector width for vector instructions

    AArch64:

        instruct vaddB(vReg dst, vReg src1, vReg src2) %{
          ...
          uint length_in_bytes = Matcher::vector_length_in_bytes(this);
          if (VM_Version::use_neon_for_vector(length_in_bytes)) {
            __ addv($dst$$FloatRegister, get_arrangement(this),
                    $src1$$FloatRegister, $src2$$FloatRegister);
          } else {
            assert(UseSVE > 0, "must be sve");
            __ sve_add($dst$$FloatRegister, __ B,
                       $src1$$FloatRegister, $src2$$FloatRegister);
          }
          ...
        %}

    RISC-V:

        instruct vaddB(vReg dst, vReg src1, vReg src2) %{
          ...
          __ rvv_vsetvli(T_BYTE, Matcher::vector_length_in_bytes(this));
          __ vadd_vv(as_VectorRegister($dst$$reg),
                     as_VectorRegister($src1$$reg),
                     as_VectorRegister($src2$$reg));
          ...
        %}
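
For context, the kind of Java source that SLP auto-vectorization targets is a simple element-wise loop. The sketch below is an illustrative example only (the class `VectorAddBytes` is a hypothetical name); whether C2 actually matches it to a `vaddB` rule depends on the JIT, the enabled flags and the hardware, which this example does not check.

```java
public class VectorAddBytes {
    // Element-wise byte addition: a loop shape that a superword (SLP)
    // optimizer may turn into vector adds. No vectorization is guaranteed.
    public static void add(byte[] a, byte[] b, byte[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = (byte) (a[i] + b[i]);
        }
    }

    public static void main(String[] args) {
        byte[] a = new byte[1024];
        byte[] b = new byte[1024];
        byte[] dst = new byte[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = (byte) i;
            b[i] = (byte) (2 * i);
        }
        // Repeat the call so the JIT has a chance to compile the loop.
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, dst);
        }
        System.out.println(dst[10]); // prints 30
    }
}
```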
  16. Vector Extension

    • Todo
    > eliminate the redundant "vsetvli" instruction emitted with each vector operation
      - peephole optimization
      - add a vector length node and analyze the control flow and data flow
    • Performance
    > no hardware, just QEMU
    > so no performance data yet

        instruct vaddB(vReg dst, vReg src1, vReg src2) %{
          ...
          __ rvv_vsetvli(T_BYTE, Matcher::vector_length_in_bytes(this));
          __ vadd_vv(as_VectorRegister($dst$$reg),
                     as_VectorRegister($src1$$reg),
                     as_VectorRegister($src2$$reg));
          ...
        %}
  18. Bitmanipulation Extension

    • Zba, Zbb, part of Zbs
    • Reduce code size

        public static long reverseBytes(long i) {
            i = (i & 0x00ff00ff00ff00ffL) << 8 | (i >>> 8) & 0x00ff00ff00ff00ffL;
            return (i << 48) | ((i & 0xffff0000L) << 16) |
                ((i >>> 16) & 0xffff0000L) | (i >>> 48);
        }

    Base instructions:

        lui   t2,0xff0
        addiw t2,t2,255
        slli  t2,t2,0x10
        addi  t2,t2,255    # 0xff00ff
        slli  t2,t2,0x10
        addi  t2,t2,255
        and   t3,a1,t2
        srli  t4,a1,0x8
        and   t2,t4,t2
        slli  t3,t3,0x8
        or    t2,t3,t2
        lui   t3,0x10
        addiw t3,t3,-1
        slli  t3,t3,0x10
        and   t4,t2,t3
        srli  t6,t2,0x10
        slli  t4,t4,0x10
        slli  t5,t2,0x30
        and   t3,t6,t3
        or    t4,t5,t4
        srli  t2,t2,0x30
        or    t3,t4,t3
        or    a0,t3,t2

    With Zbb:

        rev8
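
As a quick sanity check on the Java source shown on this slide, the shift-and-mask version can be compared against the library method `Long.reverseBytes`. The class name `ReverseBytesCheck` and the method name `reverseBytesManual` are hypothetical; the method body is copied from the slide.

```java
public class ReverseBytesCheck {
    // Shift-and-mask byte reversal, as shown on the slide.
    public static long reverseBytesManual(long i) {
        i = (i & 0x00ff00ff00ff00ffL) << 8 | (i >>> 8) & 0x00ff00ff00ff00ffL;
        return (i << 48) | ((i & 0xffff0000L) << 16) |
               ((i >>> 16) & 0xffff0000L) | (i >>> 48);
    }

    public static void main(String[] args) {
        long value = 0x0102030405060708L;
        long manual = reverseBytesManual(value);
        // Both lines print 807060504030201 (leading zero is not shown).
        System.out.println(Long.toHexString(manual));
        System.out.println(Long.toHexString(Long.reverseBytes(value)));
    }
}
```

A single byte-reverse instruction replaces the whole mask-and-shift sequence, which is where the code size saving comes from.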
  19. Bitmanipulation Extension

    • Zba, Zbb, part of Zbs
    • Reduce code size

        public static void main(String[] args) {
            int mylist3[] = {1,2,3,4,5,6,7,8};
            int mylist4[] = {8,7,6,5,4,3,2,1};
            int base1 = 2;
            int base2 = 3;
            for (int i = 0; i < 1000000; i++) {
                mylist3[base1] = i;
                mylist4[base2] = i;
            }
        }

    asm (without sh2add):

        addw t4,s2,zero
        addw t2,s4,zero
        slli t4,t4,0x2
        slli t2,t2,0x2
        add  t4,t4,a2     ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@103 (line 8)
        add  t2,t2,s3     ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@109 (line 9)
        sw   s1,16(t4)    ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@103 (line 8)
        sw   s1,16(t2)    ;*iload {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@92 (line 7)

    asm (with sh2add):

        sh2add a0,s2,t2   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@103 (line 8)
        sh2add t6,s4,s3   ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@109 (line 9)
        sw     s1,16(a0)  ;*iastore {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@103 (line 8)
        sw     s1,16(t6)  ;*iload {reexecute=0 rethrow=0 return_oop=0}
                          ; - Shadd::main@92 (line 7)
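
The Zba `sh2add rd, rs1, rs2` instruction computes `rs2 + (rs1 << 2)`, which is exactly the address arithmetic for an `int[]` element. The sketch below models that with plain `long` arithmetic; the class `Sh2addModel` and its methods are hypothetical names, and the 16-byte array header offset is taken from the `sw s1,16(...)` stores on the slide rather than from any general rule.

```java
public class Sh2addModel {
    // Semantics of Zba sh2add: rd = rs2 + (rs1 << 2).
    public static long sh2add(long rs1, long rs2) {
        return rs2 + (rs1 << 2);
    }

    // The two-instruction base sequence: slli followed by add.
    public static long slliThenAdd(long index, long base) {
        long scaled = index << 2; // slli t, index, 2
        return scaled + base;     // add  t, t, base
    }

    // Address of element `index` in an int[] at `base`, assuming the
    // 16-byte header offset seen in the slide's store instructions.
    public static long intElementAddress(long base, long index) {
        return sh2add(index, base) + 16;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(sh2add(3, 0x1000)));            // prints 100c
        System.out.println(Long.toHexString(intElementAddress(0x1000, 2))); // prints 1018
    }
}
```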
  20. Bitmanipulation Extension

    • Code Size Reduction on SPECjvm2008
    > bitmanip-v1.0.0-rc[1] for OpenJDK
    > geometric mean code size reduction: 2.9%
    > again, no hardware, just QEMU

    benchmark    code size reduction (%)
    compress     2.81%
    crypto       7.49%
    derby        1.98%
    mpegaudio    2.54%
    scimark      6.18%
    serial       1.85%
    sunflow      2.87%
    xml          1.46%

    [1] https://github.com/riscv/riscv-bitmanip/releases/tag/1.0.0
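
The quoted 2.9% can be reproduced as the geometric mean of the eight per-benchmark percentages in the table. The sketch below is a worked check, with `GeoMeanReduction` as a hypothetical class name; it assumes the mean was taken over the percentages themselves.

```java
import java.util.Locale;

public class GeoMeanReduction {
    // Geometric mean computed via logarithms: exp(mean(ln(v))).
    public static double geometricMean(double[] values) {
        double logSum = 0.0;
        for (double v : values) {
            logSum += Math.log(v);
        }
        return Math.exp(logSum / values.length);
    }

    public static void main(String[] args) {
        // compress, crypto, derby, mpegaudio, scimark, serial, sunflow, xml
        double[] reductions = {2.81, 7.49, 1.98, 2.54, 6.18, 1.85, 2.87, 1.46};
        System.out.println(String.format(Locale.ROOT, "%.1f%%", geometricMean(reductions))); // prints 2.9%
    }
}
```

For comparison, the arithmetic mean of the same values is about 3.4%, so the two large outliers (crypto and scimark) weigh less in the geometric mean.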
  22. JIT Extension: Pointer Masking

    • RISC-V Pointer Masking
    > causes the MMU to ignore the top N bits of the effective address
    > the proposal is not released yet
    • Tagged Pointer
    > folds a few bits of additional data into a pointer (GC state, data type, etc.)
    > saves memory
    > saves a load
    > adds overhead on dereference
    > examples: tagged pointers in Objective-C, value tagging in JavaScript V8, colored pointers in ZGC[1]

    [1] The max heap of ZGC can be 4TB, 8TB or 16TB. The colored pointer varies depending on the max heap. Here we only consider the 4TB case.
  23. JIT Extension: Pointer Masking

    • Implementations of Tagged Pointer
    > software: right shift, or AND a mask (cost: extra instruction)
    > multi mapping: Linux mmap(), Windows CreateFileMapping() (cost: dTLB load misses; used by ZGC)
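
The "AND a mask" software approach can be sketched with plain 64-bit arithmetic. The example below is illustrative only: `ColoredPointerMask` is a hypothetical class, and the bit positions (a 42-bit offset with color bits starting at bit 42) follow the 4TB ZGC layout described in this talk. The extra AND in `uncolor` is the "extra instruction" cost on every dereference.

```java
public class ColoredPointerMask {
    public static final int OFFSET_BITS = 42;
    public static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    // Color bits, placed directly above the 42-bit offset.
    public static final long MARKED0 = 0b0001L << OFFSET_BITS;
    public static final long MARKED1 = 0b0010L << OFFSET_BITS;
    public static final long REMAPPED = 0b0100L << OFFSET_BITS;

    // Folds a color bit into the pointer.
    public static long color(long offset, long colorBit) {
        return (offset & OFFSET_MASK) | colorBit;
    }

    // Strips the color with one AND before the pointer is dereferenced.
    public static long uncolor(long pointer) {
        return pointer & OFFSET_MASK;
    }

    public static boolean hasColor(long pointer, long colorBit) {
        return (pointer & colorBit) != 0;
    }

    public static void main(String[] args) {
        long p = color(0x1234L, REMAPPED);
        System.out.println(Long.toHexString(uncolor(p))); // prints 1234
        System.out.println(hasColor(p, REMAPPED));        // prints true
    }
}
```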
  24. JIT Extension: Pointer Masking

    • OpenJDK17 ZGC Colored Pointer [1]
    > 4 colored bits (42nd~45th)
    > only three states: 100, 010, 001 [2]
    > multi mapping (map to 3 locations)
    > dTLB load misses

    [1] http://cr.openjdk.java.net/~pliden/slides/ZGC-PLMeetup-2019.pdf
    [2] Only three of the four bits are used. The 'finalizable' bit is not considered in Multi Mapping.
  25. JIT Extension: Pointer Masking

    • Implementations of Tagged Pointer
    > software: right shift, or AND a mask (cost: extra instruction)
    > multi mapping: Linux mmap(), Windows CreateFileMapping() (cost: dTLB load misses; used by ZGC)
    > hardware: SPARC virtual address mask, ARMv8 Top Byte Ignore, RISC-V Pointer Masking (cost: not cross-platform; what we want to do)
  26. JIT Extension: Pointer Masking

    • ARMv8 Top Byte Ignore (TBI)
    > ignores the most significant 8 bits of the virtual address
    > similar to Pointer Masking
    • Layout of Colored Pointer with TBI
    > move colored bits from 42~45 to 59~62
    > (the 63rd bit is already occupied by StackWatermarkState)

        // Multi-Mapping
        +--------------------+----+-----------------------------------------------+
        |00000000 00000000 00|1111|11 11111111 11111111 11111111 11111111 11111111|
        +--------------------+----+-----------------------------------------------+
        |                    |    |
        |                    |    * 41-0 Object Offset (42-bits, 4TB address space)
        |                    |
        |                    * 45-42 Metadata Bits (4-bits)  0001 = Marked0
        |                                                    0010 = Marked1
        |                                                    0100 = Remapped
        |                                                    1000 = Finalizable
        |
        * 63-46 Fixed (18-bits, always zero)

        // TBI
        ++----+-------------------+-----------------------------------------------+
        0|1111|000 00000000 000000|11 11111111 11111111 11111111 11111111 11111111|
        ++----+-------------------+-----------------------------------------------+
         |    |                   |
         |    |                   * 41-0 Object Offset (42-bits, 4TB address space)
         |    |
         |    * 58-42 Fixed (17-bits, always zero)
         |
         * 62-59 Metadata Bits (4-bits)  0001 = Marked0
                                         0010 = Marked1
                                         0100 = Remapped
                                         1000 = Finalizable

        Bits 63-56 (the top byte) are ignored by the hardware.
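
The relayout described on this slide can be modeled as moving the four color bits from positions 45-42 to positions 62-59. The sketch below is a software simulation under stated assumptions: `TbiColoredPointer` and its methods are hypothetical names, and `hardwareEffectiveAddress` merely imitates what TBI does by clearing the top byte; real hardware performs that step during address translation with no extra instruction.

```java
public class TbiColoredPointer {
    public static final long OFFSET_MASK = (1L << 42) - 1;
    public static final int MULTI_MAP_COLOR_SHIFT = 42; // color bits 45-42
    public static final int TBI_COLOR_SHIFT = 59;       // color bits 62-59
    public static final long TOP_BYTE_CLEAR_MASK = (1L << 56) - 1;

    // Moves the 4 color bits from the multi-mapping position to the top byte.
    public static long toTbiLayout(long multiMapPointer) {
        long color = (multiMapPointer >>> MULTI_MAP_COLOR_SHIFT) & 0xF;
        return (multiMapPointer & OFFSET_MASK) | (color << TBI_COLOR_SHIFT);
    }

    // Simulates the address the MMU would see when the top byte is ignored.
    public static long hardwareEffectiveAddress(long pointer) {
        return pointer & TOP_BYTE_CLEAR_MASK;
    }

    public static void main(String[] args) {
        long multiMap = (0b0100L << MULTI_MAP_COLOR_SHIFT) | 0x1234L; // Remapped
        long tbi = toTbiLayout(multiMap);
        System.out.println(Long.toHexString(tbi));                           // prints 2000000000001234
        System.out.println(Long.toHexString(hardwareEffectiveAddress(tbi))); // prints 1234
    }
}
```

With the color in the ignored byte, every colored view of an object resolves to the same virtual address, so the three-way multi mapping and its dTLB pressure are no longer needed.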
  27. JIT Extension: Pointer Masking

    • Performance
    > BiShengJDK17, Kunpeng platform[1]
    > SPECjbb2015 ↗7~8%
    > dTLB-load-misses/(load+store) ↘23.66%
    • Open source in BiSheng JDK17[2]
    > -XX:+UseTBI[3]

    [Chart: SPECjbb2015 - Score (max jOPS, critical jOPS), Multi Mapping vs TBI]
    [Chart: SPECjbb2015 - dTLB-load-miss/(load+store), TBI vs Multi Mapping]

    [1] Experiment results may vary on different hardware. Our environment: BiShengJDK17, Kunpeng 920, 128 CPUs, 500GB memory, SPECjbb2015 multi-JVM mode
    [2] https://gitee.com/openeuler/bishengjdk-17/pulls/27
    [3] We later implemented hot patching of Generational ZGC in UseTBI, which further improved performance. The data above does not include the improvement from hot patching.
  28. JIT Extension: Pointer Masking

    • Generational ZGC
    > planned to be released in JDK 21 to adapt to applications with high allocation rates
    > pointer layout

        // A zpointer is a combination of the address bits (heap base bit + offset)
        // and two low-order metadata bytes, with the following layout:
        // |48 bits VA|RRRRMMmmFFrr0000
        //             ****               : Used by load barrier
        //             **********         : Used by mark barrier
        //             ************       : Used by store barrier
        //                         ****   : Reserved bits
        // The table below describes what each color does.
        // +-------------+-------------------+--------------------------+
        // | Bit pattern | Description       | Included colors          |
        // +-------------+-------------------+--------------------------+
        // | rr          | Remembered bits   | Remembered[0, 1]         |
        // +-------------+-------------------+--------------------------+
        // | FF          | Finalizable bits  | Finalizable[0, 1]        |
        // +-------------+-------------------+--------------------------+
        // | mm          | Marked young bits | MarkedYoung[0, 1]        |
        // +-------------+-------------------+--------------------------+
        // | MM          | Marked old bits   | MarkedOld[0, 1]          |
        // +-------------+-------------------+--------------------------+
        // | RRRR        | Remapped bits     | Remapped[00, 01, 10, 11] |
        // +-------------+-------------------+--------------------------+

        ZLoadBarrierStubC2* const stub = ZLoadBarrierStubC2::create(node, ref_addr, ref);
        Label good;
        __ relocate(barrier_Relocation::spec(), ZBarrierRelocationFormatLoadGoodBeforeTbz);
        __ tbz(ref, barrier_Relocation::unpatched, good);
        __ b(*stub->entry());
        __ bind(good);
        __ lsr(ref, ref, ZPointerLoadShift); // uncolor
        __ bind(*stub->continuation());

    ~3% in SPECjbb2015
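
The "right shift" software approach used by this layout can be sketched as follows. This is a simplified model, not the HotSpot implementation: `ZPointerShiftModel` is a hypothetical class, the 16 metadata bits are treated as an opaque value, and the shift amount is assumed to be a fixed 16, whereas the slide's code uses the symbolic `ZPointerLoadShift`.

```java
public class ZPointerShiftModel {
    // Assumption for illustration: two low-order metadata bytes, fixed shift.
    public static final int METADATA_BITS = 16;
    public static final long METADATA_MASK = (1L << METADATA_BITS) - 1;

    // Packs a (up to 48-bit) address above the 16 metadata bits.
    public static long color(long address, int metadata) {
        return (address << METADATA_BITS) | (metadata & METADATA_MASK);
    }

    // Uncolors with a single logical right shift, like the lsr on the slide.
    public static long uncolor(long zpointer) {
        return zpointer >>> METADATA_BITS;
    }

    public static int metadata(long zpointer) {
        return (int) (zpointer & METADATA_MASK);
    }

    public static void main(String[] args) {
        long z = color(0x7f1234567890L, 0x4050);
        System.out.println(Long.toHexString(z));          // prints 7f12345678904050
        System.out.println(Long.toHexString(uncolor(z))); // prints 7f1234567890
    }
}
```

Keeping the metadata in the low bits means one shift both removes the color and yields the address, which is the extra instruction that hardware pointer masking would make unnecessary.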
  29. JIT Extension: Pointer Masking

    • Implementations of Tagged Pointer
    > software: right shift, or AND a mask (cost: extra instruction; used by Generational ZGC)
    > multi mapping: Linux mmap(), Windows CreateFileMapping() (cost: dTLB load misses; used by ZGC)
    > hardware: SPARC virtual address mask, ARMv8 Top Byte Ignore, RISC-V Pointer Masking (cost: not cross-platform; what we want to do)
  31. Summary

    • Introduced the work we have done to port OpenJDK to RISC-V
    > Focused on RISC-V Pointer Masking and introduced an experiment based on TBI
    • Huawei BiSheng JDK
    > https://gitee.com/openeuler/bishengjdk-8
    > https://gitee.com/openeuler/bishengjdk-11
    > https://gitee.com/openeuler/bishengjdk-17
    • OpenJDK RISC-V Binary
    > https://builds.shipilev.net/openjdk-jdk/
  32. Thank you.

    Bring digital to every person, home and organization for a fully connected, intelligent world.

    Copyright©2018 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.