Slide 1

Slide 1 text

SIMD Parallel Programming with the Vector API Advanced parallel computing José Paumard Java Developer Advocate Java Platform Group

Slide 2

Slide 2 text

https://twitter.com/JosePaumard
https://github.com/JosePaumard
https://www.youtube.com/c/JosePaumard01
https://www.youtube.com/user/java
https://www.youtube.com/hashtag/jepcafe
https://fr.slideshare.net/jpaumard
https://www.pluralsight.com/authors/jose-paumard
https://dev.java

Slide 3

Slide 3 text

4/24/2024 Copyright © 2023, Oracle and/or its affiliates 3 https://dev.java/

Slide 4

Slide 4 text

Tune in!
Inside Java Newscast
JEP Café
Road To 21 series
Inside.java
Inside Java Podcast
Sip of Java
Cracking the Java coding interview

Slide 5

Slide 5 text

Vector API: 7th Incubator in JDK 22 (JEP 460), 8th Incubator with JEP 469, and…

Slide 6

Slide 6 text

What is SIMD computing?
Single Instruction Multiple Data

Slide 7

Slide 7 text

Computing Things in Parallel
Parallel streams
int[] aLotOfData = ...;
Arrays.stream(aLotOfData)
      .parallel()
      .map(Some::mapping)
      .filter(Some::filtering)
      .reduce(identityElement, Some::reduction);
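The pipeline on this slide can be made runnable as a minimal sketch. The class name, the `mapping` and `filtering` methods, and the choice of a sum as the reduction are assumptions made for illustration; the slide only names placeholders (`Some::mapping`, `Some::filtering`, `Some::reduction`).

```java
import java.util.Arrays;
import java.util.stream.IntStream;

class ParallelStreamSketch {

    // Hypothetical mapping step: square each element.
    static int mapping(int value) {
        return value * value;
    }

    // Hypothetical filtering step: keep even values only.
    static boolean filtering(int value) {
        return value % 2 == 0;
    }

    // Map, filter and reduce in parallel; 0 is the identity element of the sum.
    static int sumOfEvenSquares(int[] data) {
        return Arrays.stream(data)
                .parallel()
                .map(ParallelStreamSketch::mapping)
                .filter(ParallelStreamSketch::filtering)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        int[] aLotOfData = IntStream.rangeClosed(1, 10).toArray();
        System.out.println(sumOfEvenSquares(aLotOfData)); // prints 220
    }
}
```

The reduction is associative and has a true identity element, which is what makes the parallel result independent of how the data is split.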

Slide 8

Slide 8 text

Computing Things in Parallel
Parallel arrays
int[] aLotOfData = ...;
Arrays.parallelSort(aLotOfData);

Slide 9

Slide 9 text

Computing Things in Parallel
It relies on concurrency and the splitting of your data

Slide 15

Slide 15 text

Computing Things in Parallel
At some point your data elements are small enough to be processed

Slide 16

Slide 16 text

Computing Things in Parallel
Then these partial results are merged using a merge function
R1 R2 R3 R4 R5 R6 R7 R8

Slide 17

Slide 17 text

Computing Things in Parallel
Until the final result can be computed
M(R1, R2) M(R3, R4) M(R5, R6) M(R7, R8)

Slide 18

Slide 18 text

Computing Things in Parallel
Until the final result can be computed
M(M1, M2) M(M3, M4)

Slide 19

Slide 19 text

Computing Things in Parallel
Until the final result can be computed: Final Result
The merge function can be the same as the reduction operation (SUM, MIN, MAX)
But it can be different (merge sort vs sort)
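The split, compute and merge steps described on the previous slides can be sketched with the Fork/Join framework. This is an illustration under stated assumptions, not the internal implementation of parallel streams: the class name, the `THRESHOLD` value and the choice of a sum (where the merge function is the reduction itself) are all hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SplitAndMergeSketch {

    // Arbitrary threshold for this sketch: smaller chunks are summed sequentially.
    static final int THRESHOLD = 1_000;

    // The merge function M(R1, R2); for a sum it is the reduction itself.
    static long merge(long r1, long r2) {
        return r1 + r2;
    }

    static final class SumTask extends RecursiveTask<Long> {
        private final int[] data;
        private final int from;
        private final int to;

        SumTask(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            // Small enough: compute the partial result R directly.
            if (to - from <= THRESHOLD) {
                long partial = 0L;
                for (int i = from; i < to; i++) {
                    partial += data[i];
                }
                return partial;
            }
            // Otherwise split in two, process both halves, then merge.
            int middle = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, middle);
            SumTask right = new SumTask(data, middle, to);
            left.fork();                        // left half runs in another worker thread
            long rightResult = right.compute(); // right half runs in the current thread
            long leftResult = left.join();
            return merge(leftResult, rightResult);
        }
    }

    static long parallelSum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        int[] data = new int[10_000];
        Arrays.fill(data, 1);
        System.out.println(parallelSum(data)); // prints 10000
    }
}
```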

Slide 20

Slide 20 text

Computing Things in Parallel
All this is computed in different threads, from the Common Fork Join Pool
So you may come across visibility, race condition, or blocking issues
Don’t do I/O in a parallel stream!
Don’t use non-splittable data sources!
Don’t use some operations (skip, limit)

Slide 21

Slide 21 text

Computing Things in Parallel
What kind of gain can you expect?
It depends on your merge operation!
If it is the same: the number of cores, minus the overhead
If it is not: well, it depends! (measure, don’t guess!)

Slide 22

Slide 22 text

SIMD computing is completely different

Slide 23

Slide 23 text

Parallel Computing of Things
What is SIMD?
Short answer: an old concept

Slide 24

Slide 24 text

Parallel Computing of Things
Less short answer: it has to do with how CPUs work
[Diagram: Reg A and Reg B feed a PU, which produces a Result]

Slide 25

Slide 25 text

Parallel Computing of Things
The PU loads an instruction, which can load data
[Diagram: LOAD puts 10 and 20 in the registers of the PU]

Slide 26

Slide 26 text

Parallel Computing of Things
The PU loads an instruction, which can load data, and do something with it
[Diagram: LOAD puts 10 and 20 in the registers of the PU, ADD produces 30]

Slide 27

Slide 27 text

Parallel Computing of Things
SIMD works with several PUs and registers
[Diagram: memory, PGM]

Slide 28

Slide 28 text

Parallel Computing of Things
SIMD works with several PUs and registers, can load all of them on the same CPU cycle
[Diagram: LOAD fills the register pairs (10, 10), (21, 30), (5, 12), (8, 3) from memory]

Slide 29

Slide 29 text

Parallel Computing of Things
SIMD works with several PUs and registers, can load all of them on the same CPU cycle, and add them
[Diagram: after LOAD, ADD computes 10 + 10 = 20, 21 + 30 = 51, 5 + 12 = 17, 8 + 3 = 11]

Slide 30

Slide 30 text

Parallel Computing of Things
No threads, no splitting of your data, no partial reductions
Overhead?

Slide 31

Slide 31 text

Parallel Computing of Things
Two questions:
1) What is this data memory? How can you access it?
2) What are the available operations?

Slide 32

Slide 32 text

The SIMD Way of Computing
SIMD relies on:
1) specific hardware elements on each CPU core
2) specific assembly instructions (AVX-256, AVX-512)
No concurrency!
You need to check what your CPU is capable of!

Slide 33

Slide 33 text

Parallel Computing of Things
This memory is a specific array in your CPU
From 512 bits to 128, with different instruction sets
256 bits: 4 longs or doubles, 8 ints or floats, 16 shorts, 32 bytes
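The lane counts on this slide follow from dividing the register width by the element width. The sketch below is plain arithmetic; the class and method names are hypothetical and nothing here queries the actual CPU.

```java
class LaneCountSketch {

    // Number of elements of the given width that fit in one register.
    static int lanes(int registerBits, int elementBits) {
        return registerBits / elementBits;
    }

    public static void main(String[] args) {
        for (int registerBits : new int[] {128, 256, 512}) {
            System.out.println(registerBits + " bits: "
                    + lanes(registerBits, Long.SIZE) + " longs or doubles, "
                    + lanes(registerBits, Integer.SIZE) + " ints or floats, "
                    + lanes(registerBits, Short.SIZE) + " shorts, "
                    + lanes(registerBits, Byte.SIZE) + " bytes");
        }
    }
}
```

In the Vector API itself, the species object reports the same figure for the running machine through its `length()` method.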

Slide 34

Slide 34 text

The SIMD Way of Computing
Pros:
1) Does not rely on concurrent programming
2) No overhead for splitting and merging data
Cons:
1) Still an overhead to load and unload your vectors
2) Algorithms can be more complex
3) Efficient code depends on what your CPU can do

Slide 35

Slide 35 text

How Does the Vector API Work?
Let us add two vectors
int[] v1 = ...;
int[] v2 = ...;
int[] result = new int[v1.length];
var V1 = IntVector.fromArray(species, v1, 0);
var V2 = IntVector.fromArray(species, v2, 0);

Slide 36

Slide 36 text

How Does the Vector API Work?
Let us add two vectors
The species object carries info on the CPU and the vector type
int[] v1 = ...;
int[] v2 = ...;
int[] result = new int[v1.length];
var species = IntVector.SPECIES_PREFERRED;
var V1 = IntVector.fromArray(species, v1, 0);
var V2 = IntVector.fromArray(species, v2, 0);

Slide 37

Slide 37 text

How Does the Vector API Work?
Let us add two vectors
The species object carries info on the CPU and the vector type
private static final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
int[] v1 = ...;
int[] v2 = ...;
int[] result = new int[v1.length];
var V1 = IntVector.fromArray(species, v1, 0);
var V2 = IntVector.fromArray(species, v2, 0);

Slide 38

Slide 38 text

How Does the Vector API Work?
Let us add two vectors
int[] v1 = ...;
int[] v2 = ...;
int[] result = new int[v1.length];
var V1 = IntVector.fromArray(species, v1, 0);
var V2 = IntVector.fromArray(species, v2, 0);
var RESULT = V1.add(V2);
RESULT.intoArray(result, 0);

Slide 39

Slide 39 text

How Does the Vector API Work?
Let us add two vectors
var RESULT = V1.add(V2);
This add operation maps to an assembly instruction that adds all the components of V1 and V2 in 1 CPU cycle
The overhead is the copying of the arrays into the vectors, back and forth.

Slide 40

Slide 40 text

Parallel Operations
ADD, SUB, MUL, DIV
ABS, NEG, MIN, MAX
EQ, GT, LT
Bit-wise operations

Slide 41

Slide 41 text

Caveat!
The size of the array has to match the size of the vector your SIMD kernel can handle. For 256 bits: 8 ints.
How can you handle larger arrays? For loop FTW!

Slide 42

Slide 42 text

Large Arrays
VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
for (int index = 0; index < v1.length; index += 8) { // 8 ints in a 256-bit vector
    var V1 = IntVector.fromArray(species, v1, index);
    var V2 = IntVector.fromArray(species, v2, index);
    var RESULT = V1.add(V2);
    RESULT.intoArray(result, index);
}

Slide 43

Slide 43 text

Large Arrays
VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
for (int index = 0; index < v1.length; index += species.length()) { // lane count of this CPU
    var V1 = IntVector.fromArray(species, v1, index);
    var V2 = IntVector.fromArray(species, v2, index);
    var RESULT = V1.add(V2);
    RESULT.intoArray(result, index);
}

Slide 44

Slide 44 text

Are we Done?
Hmmm… not quite!
What if your array has 2222 elements?
2222 = 277 x 8 + 6
There will be 6 leftover elements…
Two solutions, depending on what your CPU can do
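The arithmetic on this slide can be checked with a few lines of plain Java. The class and method names are hypothetical; `loopBound` here is a scalar model of the idea (the largest multiple of the lane count that fits in the array), not a call into the Vector API.

```java
class LeftoverSketch {

    // Largest multiple of lanes that is less than or equal to length.
    static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Number of elements that do not fill a whole vector.
    static int leftover(int length, int lanes) {
        return length % lanes;
    }

    public static void main(String[] args) {
        int length = 2222;
        int lanes = 8; // 8 ints in a 256-bit vector
        System.out.println("full vectors: " + loopBound(length, lanes) / lanes); // 277
        System.out.println("leftover elements: " + leftover(length, lanes));     // 6
    }
}
```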

Slide 45

Slide 45 text

1st Solution
VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
for (int index = 0; index < v1.length; index += species.length()) {
    var mask = species.indexInRange(index, v1.length); // lanes past the end are switched off
    var V1 = IntVector.fromArray(species, v1, index, mask);
    var V2 = IntVector.fromArray(species, v2, index, mask);
    var RESULT = V1.add(V2, mask);
    RESULT.intoArray(result, index, mask);
}

Slide 46

Slide 46 text

1st Solution
It relies on the use of masks
… this is not supported by all SIMD kernels / CPUs
So there is still another pattern
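What the mask does in the first solution can be modelled in scalar Java. This is a sketch of the semantics only, assuming a fixed lane count passed as a parameter; the class and method names are hypothetical, and the inner loops stand in for what the hardware would do on all lanes at once.

```java
class RangeMaskSketch {

    // Scalar model of an in-range mask: lane i is set when index + i is a valid array index.
    static boolean[] indexInRange(int index, int length, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            int position = index + lane;
            mask[lane] = position >= 0 && position < length;
        }
        return mask;
    }

    // Adds v1 and v2 chunk by chunk; unset lanes are neither read nor written.
    static int[] add(int[] v1, int[] v2, int lanes) {
        int[] result = new int[v1.length];
        for (int index = 0; index < v1.length; index += lanes) {
            boolean[] mask = indexInRange(index, v1.length, lanes);
            for (int lane = 0; lane < lanes; lane++) {
                if (mask[lane]) {
                    result[index + lane] = v1[index + lane] + v2[index + lane];
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Last chunk of a 2222-element array with 8 lanes: 6 lanes set, 2 unset.
        System.out.println(java.util.Arrays.toString(indexInRange(2216, 2222, 8)));
    }
}
```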

Slide 47

Slide 47 text

2nd Solution
VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
int index = 0;
int bound = species.loopBound(v1.length); // largest multiple of species.length() <= v1.length
for (; index < bound; index += species.length()) {
    var V1 = IntVector.fromArray(species, v1, index);
    var V2 = IntVector.fromArray(species, v2, index);
    var RESULT = V1.add(V2);
    RESULT.intoArray(result, index);
}
// scalar loop for the leftover elements
for (int index2 = index; index2 < v1.length; index2++) {
    result[index2] = v1[index2] + v2[index2];
}

Slide 48

Slide 48 text

Which One is Better?
Measure, don’t guess!
If your CPU does not support masking, the 2nd pattern may be faster

Slide 49

Slide 49 text

Some Vocabulary
In the context of the Vector API, vectors have lanes and not components
Lanes and components are actually the same thing
Lane = the way your data flows through the SIMD machine of your computer

Slide 50

Slide 50 text

Operations on Lanes
Two kinds of operations
1) Lane-wise operations: operate on a given lane for two vectors
   ADD, SUB, etc… are lane-wise operations
2) Cross-lane operations: operate on the different lanes of a vector
   MAX, MIN (as reductions), SORT are cross-lane operations

Slide 51

Slide 51 text

Computing the Norm of a Vector – V1
VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
float sum = 0f;
for ( ) {
    var V = FloatVector.fromArray(...);
    var V2 = V.mul(V);                          // lane-wise square
    sum += V2.reduceLanes(VectorOperators.ADD); // cross-lane sum on every iteration
}
float norm = (float) Math.sqrt(sum);

Slide 52

Slide 52 text

Computing the Norm of a Vector – V2
VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
var SUM = FloatVector.zero(species); // can call broadcast with a value
for ( ) {
    var V = FloatVector.fromArray(...);
    var V2 = V.mul(V);  // lane-wise square
    SUM = SUM.add(V2);  // lane-wise accumulation, no reduction in the loop
}
float norm = (float) Math.sqrt(SUM.reduceLanes(VectorOperators.ADD)); // one cross-lane sum at the end
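The difference between the two versions is where the cross-lane reduction happens. The scalar model below follows the second version: one accumulator per lane, reduced once after the loop. It is a sketch under assumptions: `LANES` is fixed at 8 for illustration, the class name is hypothetical, and the leftover elements are handled with a plain scalar loop.

```java
class NormSketch {

    static final int LANES = 8; // assumed lane count, e.g. 8 floats in 256 bits

    static float norm(float[] v) {
        float[] sumLanes = new float[LANES]; // plays the role of the SUM vector
        int bound = v.length - (v.length % LANES);

        for (int index = 0; index < bound; index += LANES) {
            // Lane-wise: each lane squares its own element and accumulates it.
            for (int lane = 0; lane < LANES; lane++) {
                float x = v[index + lane];
                sumLanes[lane] += x * x;
            }
        }

        // Cross-lane: a single reduction after the loop.
        float sum = 0f;
        for (float laneSum : sumLanes) {
            sum += laneSum;
        }

        // Leftover elements that do not fill a whole vector.
        for (int i = bound; i < v.length; i++) {
            sum += v[i] * v[i];
        }
        return (float) Math.sqrt(sum);
    }

    public static void main(String[] args) {
        System.out.println(norm(new float[] {3f, 4f})); // prints 5.0
    }
}
```

Because floating-point addition is not associative, the lane-wise accumulation may give a slightly different result from a strictly sequential sum on large arrays.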

Slide 53

Slide 53 text

Vector Operators Java Day Copyright © 2024, Oracle and/or its affiliates 53 VectorOperator Unary Binary Ternary Test Conversion Comparison

Slide 54

Slide 54 text

Vector Operators
Unary: NOT ZOMO ABS NEG BIT_COUNT TRAILING_ZEROS_COUNT LEADING_ZEROS_COUNT REVERSE REVERSE_BYTES SIN COS TAN SINH COSH TANH ASIN ACOS ATAN EXP LOG LOG10 EXPM1 LOG1P SQRT CBRT

Slide 55

Slide 55 text

Vector Operators
Binary: SUB DIV AND_NOT LSHL ASHR LSHR ROL ROR COMPRESS_BITS EXPAND_BITS ATAN2 POW HYPOT
Associative: ADD MUL MIN MAX FIRST_NONZERO AND OR XOR OR_UNCHECKED

Slide 56

Slide 56 text

Vector Operators
Ternary: BITWISE_BLEND FMA

Slide 57

Slide 57 text

Vector Operators
Test: IS_DEFAULT IS_NEGATIVE IS_FINITE IS_NAN IS_INFINITE

Slide 58

Slide 58 text

Vector Operators
Comparison: EQ NE LT LE GT GE UNSIGNED_LT UNSIGNED_LE UNSIGNED_GT UNSIGNED_GE

Slide 59

Slide 59 text

Vector Operators
Conversion (rows: source type, columns: target type)
          Byte  Short  Integer  Long  Float  Double
Byte      -     B2S    B2I      B2L   B2F    B2D
Short     S2B   -      S2I      S2L   S2F    S2D
Integer   I2B   I2S    -        I2L   I2F    I2D
Long      L2B   L2S    L2I      -     L2F    L2D
Float     F2B   F2S    F2I      F2L   -      F2D
Double    D2B   D2S    D2I      D2L   D2F    -

Slide 60

Slide 60 text

Vector Operators
Conversion
Double -> Long: REINTERPRET_D2L
Long -> Double: REINTERPRET_L2D
Float -> Integer: REINTERPRET_F2I
Integer -> Float: REINTERPRET_I2F

Slide 61

Slide 61 text

Vector Operators
Conversion
Byte -> Short: ZERO_EXTEND_B2S
Byte -> Integer: ZERO_EXTEND_B2I
Byte -> Long: ZERO_EXTEND_B2L
Short -> Integer: ZERO_EXTEND_S2I
Short -> Long: ZERO_EXTEND_S2L
Integer -> Long: ZERO_EXTEND_I2L

Slide 62

Slide 62 text

Computing an Average
VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
var SUM = FloatVector.zero(species);
for ( ) {
    var V = FloatVector.fromArray(...);
    SUM = SUM.add(V); // lane-wise accumulation
}
float sum = SUM.reduceLanes(VectorOperators.ADD);
float average = sum / v1.length;

Slide 63

Slide 63 text

Why Can’t You Pass a Lambda?
Because the VectorOperators objects map to specific assembly instructions of your SIMD kernel

Slide 64

Slide 64 text

Filtering Lanes
Filtering is about selecting the lanes that match a criterion
You can’t avoid masking here!
vector:   3 8 5 1 6 9 7 5
test:     > 5

Slide 65

Slide 65 text

Filtering Lanes
Filtering is about selecting the lanes that match a criterion
You can’t avoid masking here!
vector:   3 8 5 1 6 9 7 5
test:     > 5
mask:     F T F F T T T F

Slide 66

Slide 66 text

Filtering Lanes
Filtering is about selecting the lanes that match a criterion
You can’t avoid masking here!
vector:   3 8 5 1 6 9 7 5
test:     > 5
mask:     F T F F T T T F
compress: 8 6 9 7 - - - -
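The compare and compress steps of this slide can be modelled in scalar Java with the same eight values. This is a sketch of the semantics only, with hypothetical class and method names; the lanes left empty after compression are filled with zero here, which is where the slide shows dashes.

```java
import java.util.Arrays;

class CompressSketch {

    // Scalar model of a lane-wise "greater than" comparison: one boolean per lane.
    static boolean[] greaterThan(int[] lanes, int threshold) {
        boolean[] mask = new boolean[lanes.length];
        for (int lane = 0; lane < lanes.length; lane++) {
            mask[lane] = lanes[lane] > threshold;
        }
        return mask;
    }

    // Scalar model of compression: selected lanes are packed at the front, the rest are zero.
    static int[] compress(int[] lanes, boolean[] mask) {
        int[] compressed = new int[lanes.length];
        int next = 0;
        for (int lane = 0; lane < lanes.length; lane++) {
            if (mask[lane]) {
                compressed[next++] = lanes[lane];
            }
        }
        return compressed;
    }

    public static void main(String[] args) {
        int[] vector = {3, 8, 5, 1, 6, 9, 7, 5};
        boolean[] mask = greaterThan(vector, 5);
        System.out.println(Arrays.toString(mask));
        System.out.println(Arrays.toString(compress(vector, mask))); // prints [8, 6, 9, 7, 0, 0, 0, 0]
    }
}
```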

Slide 67

Slide 67 text

Filtering Lanes V1
Compression + copy into the result array
VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
int maxIndex = 0;
for ( ) {
    var mask = V.compare(VectorOperators.GT, 5);
    V.compress(mask).intoArray(result, maxIndex);
    maxIndex += mask.trueCount(); // the number of compressed elements
}

Slide 68

Slide 68 text

Filtering Lanes V2
You can also copy with a mask into the result array
VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED;
int maxIndex = 0;
for ( ) {
    var mask = V.compare(VectorOperators.GT, 5);
    // a masked store keeps each lane at its own position, so compress both the vector and the mask
    V.compress(mask).intoArray(result, maxIndex, mask.compress());
    maxIndex += mask.trueCount(); // the number of compressed elements
}

Slide 69

Slide 69 text

Filtering Lanes + Reduction
No need to compress
VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
double sum = 0d;
double count = 0d;
for ( ) {
    var mask = V.compare(VectorOperators.GT, 5);
    sum += V.reduceLanes(VectorOperators.ADD, mask); // only the selected lanes are added
    count += mask.trueCount(); // the number of added elements
}
double average = sum / count;
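A scalar model of this masked reduction, using the eight values from the filtering slides, could look like the sketch below. The class name, the method name and the lane count of 4 are assumptions for illustration; the inner loop stands in for one masked vector operation.

```java
class MaskedAverageSketch {

    static final int LANES = 4; // assumed lane count, e.g. 4 doubles in 256 bits

    // Average of the values strictly greater than threshold; NaN if none is selected.
    static double averageAbove(double[] values, double threshold) {
        double sum = 0d;
        int count = 0;
        for (int index = 0; index < values.length; index += LANES) {
            // One chunk: the test plays the role of the mask, unselected lanes are skipped.
            for (int lane = 0; lane < LANES && index + lane < values.length; lane++) {
                double value = values[index + lane];
                if (value > threshold) {
                    sum += value;
                    count++;
                }
            }
        }
        return sum / count;
    }

    public static void main(String[] args) {
        double[] values = {3, 8, 5, 1, 6, 9, 7, 5};
        System.out.println(averageAbove(values, 5)); // prints 7.5
    }
}
```

The selected values are 8, 6, 9 and 7, so the sum is 30 and the count is 4.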

Slide 70

Slide 70 text

Performance
https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/jdk/incubator/vector
https://bit.ly/vector-api

Slide 71

Slide 71 text

Performance
String comparison!
String size   JDK Scalar    JDK Vector API
16            18ns ±2ns     17ns ±1ns
32            40ns ±4ns     8ns ±1ns
64            61ns ±1ns     10ns ±1ns
128           123ns ±3ns    16ns ±1ns
1024          940ns ±9ns    90ns ±1ns

Slide 72

Slide 72 text

Applications
The Vector API can be used in many places in the JDK to improve the performance of common operations
Array and String comparison
String compare ignore case
String charset conversion
Hash code computation for Strings and arrays
Array sorting (no merge sort!)
Arc Tangent computation

Slide 73

Slide 73 text

Applications
The Vector API can be used in many places in the JDK to improve the performance of common operations
Linear Algebra!
Neural Networks
Machine Learning
Artificial Intelligence

Slide 74

Slide 74 text

The Vector API rocks!

Slide 75

Slide 75 text

The Vector API rocks! Questions?