
SIMD Parallel Programming with the Vector API

José
April 24, 2024

The first version of the Vector API was published as an incubator feature in JDK 16. JDK 22 shipped the 7th incubator version, and the API is now stable enough for us to examine how it works. The Vector API can dramatically speed up your in-memory computations by leveraging the capabilities of the SIMD (Single Instruction, Multiple Data) kernel of your CPU cores. This SIMD architecture is anything but new, as it was developed for supercomputers in the 80s and the 90s. This presentation explains what SIMD programming is, how it differs from parallel streams, and how things work under the hood. You will learn how to use this API and what performance gains you can expect for your in-memory computations.

Transcript

  1. SIMD Parallel Programming with the Vector API Advanced parallel computing

    José Paumard Java Developer Advocate Java Platform Group
  2. Tune in! Copyright © 2021, Oracle and/or its affiliates |

    4 Inside Java Newscast JEP Café Road To 21 series Inside.java Inside Java Podcast Sip of Java Cracking the Java coding interview
  3. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | Confidential:

    Internal/Restricted/Highly Restricted 6 What is SIMD computing? Single Instruction Multiple Data
  4. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 7

    Parallel streams Computing Things in Parallel int[] aLotOfData = ...; Arrays.stream(aLotOfData) .parallel() .map(Some::mapping) .filter(Some::filtering) .reduce(identityElement, Some::reduction);
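A concrete version of the pipeline on this slide might look like the sketch below. The mapping (squaring), the filter (keeping even values) and the reduction (a sum) are illustrative choices standing in for `Some::mapping`, `Some::filtering` and `Some::reduction`; they are not taken from the talk.

```java
import java.util.Arrays;

final class ParallelStreamSketch {

    // Squares every element, keeps the even squares, and sums them.
    // The three lambdas are placeholders for the slide's Some::... methods.
    static int sumOfEvenSquares(int[] aLotOfData) {
        return Arrays.stream(aLotOfData)
                .parallel()
                .map(v -> v * v)
                .filter(v -> v % 2 == 0)
                .reduce(0, Integer::sum);
    }
}
```

The reduction is associative and `0` is its identity element, which is what lets the stream combine partial results in any grouping.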
  5. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 8

    Parallel arrays Computing Things in Parallel int[] aLotOfData = ...; Arrays.parallelSort(aLotOfData);
  6. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 9

    It relies on concurrency and the splitting of your data Computing Things in Parallel
  7. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 10

    It relies on concurrency and the splitting of your data Computing Things in Parallel
  8. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 11

    It relies on concurrency and the splitting of your data Computing Things in Parallel
  9. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 12

    It relies on concurrency and the splitting of your data Computing Things in Parallel
  10. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 13

    It relies on concurrency and the splitting of your data Computing Things in Parallel
  11. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 14

    It relies on concurrency and the splitting of your data Computing Things in Parallel
  12. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 15

    At some point your data elements are small enough to be processed Computing Things in Parallel
  13. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 16

    Then these partial results are merged using a merge function Computing Things in Parallel R1 R2 R3 R4 R5 R6 R7 R8
  14. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 17

    Until the final result can be computed Computing Things in Parallel M(R1 , R2 ) M(R3 , R4 ) M(R5 , R6 ) M(R7 , R8 )
  15. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 18

    Until the final result can be computed Computing Things in Parallel M(M1 , M2 ) M(M3 , M4 )
  16. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 19

    Until the final result can be computed The merge function can be the same as the reduction operation (SUM, MIN, MAX) But it can be different (merge sort vs sort) Computing Things in Parallel Final Result
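The split, process, then merge steps described on the last few slides can be sketched with a fork/join task. This is a minimal illustration for a SUM, where the merge function is the same as the reduction; the class name and the `THRESHOLD` value are assumptions made for the example, not something the talk specifies.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class SplitAndMergeSum extends RecursiveTask<Long> {

    // Hypothetical cut-off below which a chunk is "small enough to be processed".
    private static final int THRESHOLD = 1_000;

    private final int[] data;
    private final int from;
    private final int to;

    SplitAndMergeSum(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: compute the partial result sequentially.
            long sum = 0L;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        // Otherwise split the data in two halves...
        int mid = (from + to) >>> 1;
        var left = new SplitAndMergeSum(data, from, mid);
        var right = new SplitAndMergeSum(data, mid, to);
        left.fork();                      // left half runs in another worker thread
        long rightResult = right.compute();
        // ...and merge the two partial results (here the merge is the sum itself).
        return left.join() + rightResult;
    }

    static long sum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new SplitAndMergeSum(data, 0, data.length));
    }
}
```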
  17. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 20

    All this is computed in different threads from the common fork/join pool, so you may come across visibility, race condition, or blocking issues. Don’t do I/O in a parallel stream! Don’t use non-splittable data sources! Don’t use some operations (skip, limit). Computing Things in Parallel
  18. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 21

    What kind of gain can you expect? It depends on your merge operation! If it is the same: the number of cores, minus the overhead If it is not: well, it depends! (measure, don’t guess!) Computing Things in Parallel
  19. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 23

    What is SIMD? Short answer: an old concept Parallel Computing of Things
  20. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 24

    Less short answer: it has to do with how CPUs work Parallel Computing of Things Reg A PU Reg B Result
  21. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 25

    The PU loads an instruction, which can load data Parallel Computing of Things 10 PU 20 LOAD
  22. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 26

    The PU loads an instruction, which can load data and do something with it Parallel Computing of Things 10 PU 20 30 LOAD ADD
  23. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 27

    SIMD works with several PUs and registers Parallel Computing of Things memory PGM
  24. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 28

    SIMD works with several PUs and registers, and can load all of them in the same CPU cycle Parallel Computing of Things 10 10 LOAD 21 30 5 12 8 3 memory
  25. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 29

    SIMD works with several PUs and registers, can load all of them in the same CPU cycle, and add them Parallel Computing of Things 10 10 20 LOAD ADD 21 30 51 5 12 17 8 3 11 memory
  26. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 30

    No threads, no splitting of your data, no partial reductions Overhead? Parallel Computing of Things
  27. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 31

    Two questions: 1) What is this data memory? How can you access it? 2) What are the available operations? Parallel Computing of Things
  28. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 32

    SIMD relies on: 1) specific hardware elements on each CPU core 2) specific assembly instructions (AVX/AVX2 with 256-bit registers, AVX-512 with 512-bit registers). No concurrency! You need to check what your CPU is capable of! The SIMD Way of Computing
  29. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 33

    This memory is a specific array in your CPU, from 128 to 512 bits wide, with different instruction sets. Parallel Computing of Things 256 bits: 4 longs or doubles, 8 ints or floats, 16 shorts, 32 bytes
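The lane counts quoted for a 256-bit register follow from a simple division of the register width by the element width. The helper below is a hypothetical one-liner written only to make that arithmetic explicit.

```java
final class LaneCount {

    // Number of elements of a given bit width that fit in one SIMD register.
    static int lanes(int registerBits, int elementBits) {
        return registerBits / elementBits;
    }
}
```

For instance `LaneCount.lanes(256, Integer.SIZE)` gives 8 and `LaneCount.lanes(256, Long.SIZE)` gives 4. In the Vector API itself the same number is reported by the species, for example `IntVector.SPECIES_256.length()`.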
  30. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 34

    Pros: 1) Does not rely on concurrent programming 2) No overhead for splitting and merging data Cons: 1) Still an overhead to load and unload your vectors 2) Algorithms can be more complex 3) Efficient code depends on what your CPU can do The SIMD Way of Computing
  31. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 35

    Let us add two vectors How is the Vector API Working? int[] v1 = ...; int[] v2 = ...; int[] result = new int[v1.length]; var V1 = IntVector.fromArray(species, v1, 0); var V2 = IntVector.fromArray(species, v2, 0);
  32. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 36

    Let us add two vectors The species object carries info on the CPU and the vector type How is the Vector API Working? int[] v1 = ...; int[] v2 = ...; int[] result = new int[v1.length]; var species = IntVector.SPECIES_PREFERRED; var V1 = IntVector.fromArray(species, v1, 0); var V2 = IntVector.fromArray(species, v2, 0);
  33. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 37

    Let us add two vectors The species object carries info on the CPU and the vector type How is the Vector API Working? private static final VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED; int[] v1 = ...; int[] v2 = ...; int[] result = new int[v1.length]; var V1 = IntVector.fromArray(species, v1, 0); var V2 = IntVector.fromArray(species, v2, 0);
  34. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 38

    Let us add two vectors How is the Vector API Working? int[] v1 = ...; int[] v2 = ...; int[] result = new int[v1.length]; var V1 = IntVector.fromArray(species, v1, 0); var V2 = IntVector.fromArray(species, v2, 0); var RESULT = V1.add(V2); RESULT.intoArray(result, 0);
  35. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 39

    Let us add two vectors This add operation maps to an assembly instruction, that adds all the components of V1 and V2 in 1 CPU cycle The overhead is the copying of the arrays into the vectors, back and forth. How is the Vector API Working? var RESULT = V1.add(V2);
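As a self-contained illustration of this addition, the sketch below replays the four-lane figures shown earlier in the deck (10, 21, 5, 8 plus 10, 30, 12, 3). It pins the species to `IntVector.SPECIES_128` so that a vector holds exactly four ints; that choice is an assumption made to match the picture, whereas the slides use `SPECIES_PREFERRED`. The incubator module has to be enabled with `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

final class FourLaneAdd {

    // 128 bits = 4 int lanes, chosen here to match the four-element drawing.
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    static int[] add(int[] v1, int[] v2) {
        int[] result = new int[SPECIES.length()];
        var V1 = IntVector.fromArray(SPECIES, v1, 0);   // load 4 ints from v1
        var V2 = IntVector.fromArray(SPECIES, v2, 0);   // load 4 ints from v2
        var RESULT = V1.add(V2);                        // one lane-wise addition
        RESULT.intoArray(result, 0);                    // store the 4 sums
        return result;
    }
}
```

Calling `FourLaneAdd.add(new int[] {10, 21, 5, 8}, new int[] {10, 30, 12, 3})` yields `{20, 51, 17, 11}`, the values on the SIMD diagram.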
  36. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 40

    ADD, SUB, MULT, DIV ABS, NEG, MIN, MAX EQ, GT, LT Bit-wise operations Parallel Operations
  37. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 41

    The size of the array has to match the size of the vector your SIMD kernel can handle. For 256 bits: 8 ints. How can you handle larger arrays? For loop FTW! Caveat!
  38. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 42

    Large Arrays VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED; for (int index = 0; index < v1.length; index += 8) { var V1 = IntVector.fromArray(species, v1, index); var V2 = IntVector.fromArray(species, v2, index); var RESULT = V1.add(V2); RESULT.intoArray(result, index); }
  39. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 43

    Large Arrays VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED; for (int index = 0; index < v1.length; index += species.length()) { var V1 = IntVector.fromArray(species, v1, index); var V2 = IntVector.fromArray(species, v2, index); var RESULT = V1.add(V2); RESULT.intoArray(result, index); }
  40. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 44

    Hmmm… not quite! What if your array has 2222 elements? With 8 lanes, 2222 = 277 × 8 + 6, so there will be 6 leftover elements… Two solutions, depending on what your CPU can do. Are we Done?
  41. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 45

    1st Solution VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED; for (int index = 0; index < v1.length; index += species.length()) { var mask = species.indexInRange(index, v1.length); var V1 = IntVector.fromArray(species, v1, index, mask); var V2 = IntVector.fromArray(species, v2, index, mask); var RESULT = V1.add(V2, mask); RESULT.intoArray(result, index, mask); }
  42. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 46

    It relies on the use of masks … this is not supported by all SIMD kernels / CPUs. So there is still another pattern. 1st Solution
  43. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 47

    2nd Solution VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED; int index = 0; for (; index < species.loopBound(v1.length); index += species.length()) { var V1 = IntVector.fromArray(species, v1, index); var V2 = IntVector.fromArray(species, v2, index); var RESULT = V1.add(V2); RESULT.intoArray(result, index); } for (int index2 = index; index2 < v1.length; index2++) { result[index2] = v1[index2] + v2[index2]; }
  44. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 48

    Measure, don’t guess! If your CPU does not support masking, the 2nd pattern may be faster Which One is Better?
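To compare the two patterns on your own machine, it helps to have both as complete methods. The sketch below is one possible arrangement, assuming both input arrays have the same length; the class and method names are made up for the example, and the code needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

final class LeftoverPatterns {

    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    // 1st pattern: every load and store is masked; only the last mask has unset lanes.
    static int[] addWithMask(int[] v1, int[] v2) {
        int[] result = new int[v1.length];
        for (int index = 0; index < v1.length; index += SPECIES.length()) {
            var mask = SPECIES.indexInRange(index, v1.length);
            var V1 = IntVector.fromArray(SPECIES, v1, index, mask);
            var V2 = IntVector.fromArray(SPECIES, v2, index, mask);
            V1.add(V2).intoArray(result, index, mask);
        }
        return result;
    }

    // 2nd pattern: full vectors up to loopBound, then a plain scalar loop for the leftovers.
    static int[] addWithTail(int[] v1, int[] v2) {
        int[] result = new int[v1.length];
        int bound = SPECIES.loopBound(v1.length);
        int index = 0;
        for (; index < bound; index += SPECIES.length()) {
            var V1 = IntVector.fromArray(SPECIES, v1, index);
            var V2 = IntVector.fromArray(SPECIES, v2, index);
            V1.add(V2).intoArray(result, index);
        }
        for (; index < v1.length; index++) {
            result[index] = v1[index] + v2[index];
        }
        return result;
    }
}
```

Both methods return the same array; which one is faster depends on the CPU, so a JMH benchmark on the target hardware is the way to decide.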
  45. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 49

    In the context of the Vector API, vectors have lanes, not components. Lanes and components are actually the same thing. Lane = the way your data flows through the SIMD machine of your computer. Some Vocabulary
  46. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 50

    Two kinds of operations: 1) Lane-wise operations operate on a given lane of two vectors: ADD, SUB, etc. are lane-wise operations. 2) Cross-lane operations operate across the different lanes of a single vector: reductions such as MAX or MIN, and SORT, are cross-lane operations. Operations on Lanes
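The difference between the two kinds of operations can be seen with MAX, which exists in both flavours. The sketch below assumes 4-lane vectors (`IntVector.SPECIES_128`) and input arrays of at least four ints; names are illustrative, and the code needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class LaneKinds {

    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128; // 4 int lanes

    // Lane-wise: lane i of the result is max(a[i], b[i]); the result is still a vector.
    static int[] laneWiseMax(int[] a, int[] b) {
        var A = IntVector.fromArray(SPECIES, a, 0);
        var B = IntVector.fromArray(SPECIES, b, 0);
        return A.max(B).toArray();
    }

    // Cross-lane: all the lanes of one vector are combined into a single scalar.
    static int crossLaneMax(int[] a) {
        return IntVector.fromArray(SPECIES, a, 0).reduceLanes(VectorOperators.MAX);
    }
}
```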
  47. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 51

    Computing the Norm of a Vector – V1 VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED; float sum = 0f; for ( ) { var V = FloatVector.fromArray(...); var V2 = V.mul(V); sum += V2.reduceLanes(VectorOperators.ADD); } float norm = (float) Math.sqrt(sum);
  48. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 52

    Computing the Norm of a Vector – V2 VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED; var SUM = FloatVector.zero(species); // can call broadcast with a value for ( ) { var V = FloatVector.fromArray(...); var V2 = V.mul(V); SUM = SUM.add(V2); } float norm = (float) Math.sqrt(SUM.reduceLanes(VectorOperators.ADD));
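The two norm computations can be completed into runnable methods as sketched below. The loop bounds and the scalar tail loop are filled in following the loopBound pattern; that completion is an assumption, since the slides leave the loop header out. V1 reduces across lanes on every iteration, V2 keeps a vector of partial sums and reduces once at the end. The code needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorNorm {

    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // V1: one cross-lane reduction per iteration.
    static float normV1(float[] v) {
        float sum = 0f;
        int i = 0;
        int bound = SPECIES.loopBound(v.length);
        for (; i < bound; i += SPECIES.length()) {
            var V = FloatVector.fromArray(SPECIES, v, i);
            sum += V.mul(V).reduceLanes(VectorOperators.ADD);
        }
        for (; i < v.length; i++) {       // scalar tail for leftover elements
            sum += v[i] * v[i];
        }
        return (float) Math.sqrt(sum);
    }

    // V2: lane-wise accumulation, a single cross-lane reduction at the end.
    static float normV2(float[] v) {
        var SUM = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(v.length);
        for (; i < bound; i += SPECIES.length()) {
            var V = FloatVector.fromArray(SPECIES, v, i);
            SUM = SUM.add(V.mul(V));
        }
        float sum = SUM.reduceLanes(VectorOperators.ADD);
        for (; i < v.length; i++) {       // scalar tail for leftover elements
            sum += v[i] * v[i];
        }
        return (float) Math.sqrt(sum);
    }
}
```

Because floating-point addition is not associative, the two versions may differ in the last bits on general input.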
  49. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 53 VectorOperator <I> Unary <I> Binary <I> Ternary <I> Test <I> Conversion <I> Comparison <I>
  50. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 54 Unary <I> NOT ZOMO ABS NEG BIT_COUNT TRAILING_ZERO_COUNT LEADING_ZERO_COUNT REVERSE REVERSE_BYTES SIN COS TAN SINH COSH TANH ASIN ACOS ATAN EXP LOG LOG10 EXPM1 LOG1P SQRT CBRT
  51. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 55 Binary <I> SUB DIV AND_NOT LSHL ASHR LSHR ROL ROR COMPRESS_BITS EXPAND_BITS ATAN2 POW HYPOT Associative <I> ADD MUL MIN MAX FIRST_NON_ZERO AND OR XOR OR_UNCHECKED
  52. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 56 Ternary <I> BITWISE_BLEND FMA
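As an example of a ternary operator, FMA computes `a * b + c` in each lane. The sketch below passes the operator to `lanewise`; it assumes 4-lane float vectors (`FloatVector.SPECIES_128`) and arrays of at least four floats, and needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class FusedMultiplyAdd {

    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_128; // 4 float lanes

    // Lane i of the result is a[i] * b[i] + c[i].
    static float[] fma(float[] a, float[] b, float[] c) {
        var A = FloatVector.fromArray(SPECIES, a, 0);
        var B = FloatVector.fromArray(SPECIES, b, 0);
        var C = FloatVector.fromArray(SPECIES, c, 0);
        return A.lanewise(VectorOperators.FMA, B, C).toArray();
    }
}
```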
  53. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 57 Test <I> IS_DEFAULT IS_NEGATIVE IS_FINITE IS_NAN IS_INFINITE
  54. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 58 Comparison <I> EQ NE LT LE GT GE UNSIGNED_LT UNSIGNED_LE UNSIGNED_GT UNSIGNED_GE
  55. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 59 Conversion <I> Byte Short Integer Long Float Double Byte B2S B2I B2L B2F B2D Short S2B S2I S2L S2F S2D Integer I2B I2S I2L I2F I2D Long L2B L2S L2I L2F L2D Float F2B F2S F2I F2L F2D Double D2B D2S D2I D2L D2F
  56. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 60 Conversion <I> Double -> Long REINTERPRET_D2L Long -> Double REINTERPRET_L2D Float -> Integer REINTERPRET_F2I Integer -> Float REINTERPRET_I2F
  57. Vector Operators Java Day Copyright © 2024, Oracle and/or its

    affiliates 61 Conversion <I> Byte -> Short ZERO_EXTEND_B2S Byte -> Integer ZERO_EXTEND_B2I Byte -> Long ZERO_EXTEND_B2L Short -> Integer ZERO_EXTEND_S2I Short -> Long ZERO_EXTEND_S2L Integer -> Long ZERO_EXTEND_I2L
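The difference between the plain widening conversions (such as B2I) and their ZERO_EXTEND counterparts shows up on negative values. The sketch below widens 8 bytes into 8 ints with `convertShape`; the choice of `ByteVector.SPECIES_64` and `IntVector.SPECIES_256` (8 lanes each) is an assumption made so that a single part holds the whole result. It needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

final class ByteWidening {

    // B2I sign-extends: the byte -1 becomes the int -1.
    static int[] signExtend(byte[] eightBytes) {
        var V = ByteVector.fromArray(ByteVector.SPECIES_64, eightBytes, 0);
        return V.convertShape(VectorOperators.B2I, IntVector.SPECIES_256, 0).toIntArray();
    }

    // ZERO_EXTEND_B2I treats the byte as unsigned: the byte -1 becomes the int 255.
    static int[] zeroExtend(byte[] eightBytes) {
        var V = ByteVector.fromArray(ByteVector.SPECIES_64, eightBytes, 0);
        return V.convertShape(VectorOperators.ZERO_EXTEND_B2I, IntVector.SPECIES_256, 0).toIntArray();
    }
}
```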
  58. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 62

    Computing an Average VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED; float sum = 0f; for ( ) { var V = FloatVector.fromArray(...); sum += V.reduceLanes(VectorOperators.ADD); } float average = sum / v1.length;
  59. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 63

    Because the VectorOperators objects map to specific assembly instructions of your SIMD kernel Why Can’t You Pass a Lambda?
  60. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 64

    Filtering is about selecting the lanes that match a criterion. You can’t avoid masking here! Filtering Lanes 3 8 5 1 6 9 7 5 > 5
  61. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 65

    Filtering is about selecting the lanes that match a criterion. You can’t avoid masking here! Filtering Lanes 3 8 5 1 6 9 7 5 > 5 F T F F T T T F = mask
  62. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 66

    Filtering is about selecting the lanes that match a criterion. You can’t avoid masking here! Filtering Lanes 3 8 5 1 6 9 7 5 > 5 F T F F T T T F = mask 8 6 9 7 - - - - compress:
  63. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 67

    Compression + copy into the result array Filtering Lanes V1 VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED; int maxIndex = 0; for ( ) { var mask = V.compare(VectorOperators.GT, 5); V.compress(mask).intoArray(result, maxIndex); maxIndex += mask.trueCount(); // the number of compressed elements }
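A complete version of the compress-based filter could look like the sketch below, which replays the `3 8 5 1 6 9 7 5 > 5` example. It assumes 8-lane vectors (`IntVector.SPECIES_256`), adds a scalar tail loop, and trims the output with `Arrays.copyOf`; those parts are not on the slide. `compress` is available in the incubator versions shipped since JDK 19, and the code needs `--add-modules jdk.incubator.vector`.

```java
import java.util.Arrays;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class CompressFilter {

    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256; // 8 int lanes

    // Returns the elements of data that are strictly greater than threshold, in order.
    static int[] greaterThan(int[] data, int threshold) {
        // intoArray always writes a full vector, so keep one vector of spare room.
        int[] buffer = new int[data.length + SPECIES.length()];
        int maxIndex = 0;
        int i = 0;
        int bound = SPECIES.loopBound(data.length);
        for (; i < bound; i += SPECIES.length()) {
            var V = IntVector.fromArray(SPECIES, data, i);
            var mask = V.compare(VectorOperators.GT, threshold);
            V.compress(mask).intoArray(buffer, maxIndex);  // selected lanes packed first
            maxIndex += mask.trueCount();                  // the number of compressed elements
        }
        for (; i < data.length; i++) {                     // scalar tail
            if (data[i] > threshold) {
                buffer[maxIndex++] = data[i];
            }
        }
        return Arrays.copyOf(buffer, maxIndex);
    }
}
```

The lanes written after the compressed ones are zeros; they are either overwritten by the next iteration or cut off by the final `copyOf`.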
  64. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 68

    You can also copy with a mask into the result array. A masked store keeps each selected lane at its own position, so the result is not compacted. Filtering Lanes V2 VectorSpecies<Integer> species = IntVector.SPECIES_PREFERRED; int index = 0; for ( ) { var mask = V.compare(VectorOperators.GT, 5); V.intoArray(result, index, mask); index += species.length(); // unselected slots of result are left untouched }
  65. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 69

    No need to compress Filtering Lanes + Reduction VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED; double sum = 0d; double count = 0d; for ( ) { var mask = V.compare(VectorOperators.GT, 5); sum += V.reduceLanes(VectorOperators.ADD, mask); count += mask.trueCount(); // the number of added elements } double average = sum / count;
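Filled in with a loop and a scalar tail, the masked reduction might look like the sketch below. The method name and the tail handling are assumptions; the masked `reduceLanes` call is the one shown on the slide. It needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class FilteredAverage {

    private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Average of the elements strictly greater than threshold (NaN if there are none).
    static double averageAbove(double[] data, double threshold) {
        double sum = 0d;
        long count = 0L;
        int i = 0;
        int bound = SPECIES.loopBound(data.length);
        for (; i < bound; i += SPECIES.length()) {
            var V = DoubleVector.fromArray(SPECIES, data, i);
            var mask = V.compare(VectorOperators.GT, threshold);
            sum += V.reduceLanes(VectorOperators.ADD, mask);  // only the selected lanes are added
            count += mask.trueCount();                        // the number of added elements
        }
        for (; i < data.length; i++) {                        // scalar tail
            if (data[i] > threshold) {
                sum += data[i];
                count++;
            }
        }
        return sum / count;
    }
}
```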
  66. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 70

    https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/jdk/incubator/vector https://bit.ly/vector-api Performances
  67. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 71

    String comparison! Performances
    String size | JDK Scalar | JDK Vector API
    16 | 18ns ±2ns | 17ns ±1ns
    32 | 40ns ±4ns | 8ns ±1ns
    64 | 61ns ±1ns | 10ns ±1ns
    128 | 123ns ±3ns | 16ns ±1ns
    1024 | 940ns ±9ns | 90ns ±1ns
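To give an idea of what a vectorized comparison can look like, here is a sketch that finds the first differing byte of two arrays. It is not the code that was benchmarked and not how the JDK implements comparison; it only illustrates the technique, assumes both arrays have the same length, and needs `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorMismatch {

    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    // Index of the first byte that differs between a and b, or -1 if they are equal.
    // Illustrative only: assumes a.length == b.length.
    static int mismatch(byte[] a, byte[] b) {
        int index = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; index < bound; index += SPECIES.length()) {
            var A = ByteVector.fromArray(SPECIES, a, index);
            var B = ByteVector.fromArray(SPECIES, b, index);
            var different = A.compare(VectorOperators.NE, B);  // one comparison for all lanes
            if (different.anyTrue()) {
                return index + different.firstTrue();          // first lane that differs
            }
        }
        for (; index < a.length; index++) {                    // scalar tail
            if (a[index] != b[index]) {
                return index;
            }
        }
        return -1;
    }
}
```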
  68. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 72

    The Vector API can be used in many places in the JDK to improve the performance of common operations: array and String comparison; String compare ignore case; String charset conversion; hash code computation for Strings and arrays; array sorting (no merge sort!); arc tangent computation. Applications
  69. 4/24/2024 Copyright © 2021, Oracle and/or its affiliates | 73

    The Vector API can be used in many places in the JDK to improve the performance of common operations Linear Algebra! Neural Networks Machine Learning Artificial Intelligence Applications