Slide 1

Slide 1 text

Game of Loom: implementation patterns and performance implications playing with virtual threads by Mario Fusco @mariofusco

Slide 2

Slide 2 text

What virtual threads are … ❖ JDK native threads are thin wrappers around operating system threads ➢ Native threads are costly and memory consuming ➢ You cannot have too many of them ❖ Virtual threads make a better use of OS threads getting rid of the 1:1 relationship with operating system threads ➢ Many virtual threads can be multiplexed on the same OS thread (carrier) ➢ When a virtual thread calls a blocking operation the JDK performs a nonblocking OS call and automatically suspends the virtual thread until the operation finishes

Slide 3

Slide 3 text

… and why we need them - the WRONG reason ❖ Native threads: ➢ 2kB metadata ➢ Preallocated 1MB stack ➢ 1-10 µs for context switch ❖ Virtual threads: ➢ 200-300B metadata ➢ Pay-as-you-go stack (allocated on heap) ➢ Some ns (or below 1µs) for context switch

Slide 4

Slide 4 text

… and why we need them - the WRONG reason ❖ Native threads: ➢ 2kB metadata ➢ Preallocated 1MB stack ➢ 1-10 µs for context switch ❖ Virtual threads: ➢ 200-300B metadata ➢ Pay-as-you-go stack (allocated on heap) ➢ Some ns (or below 1µs) for context switch You can create MILLIONS of threads!!!

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Typical multithreading application How you designed it

Slide 7

Slide 7 text

Typical multithreading application How it actually works

Slide 8

Slide 8 text

Virtual threads aren't great because you can have gazillions of them. That's just a side effect of their design and behavior which are the actual great things.

Slide 9

Slide 9 text

… and why we need them - the RIGHT reason Virtual threads allow to write asynchronous code in a way that looks synchronous. The main goal of virtual threads is having scalable application without paying the cost in terms of more complex code. Program Complexity Program Scalability One Thread per task Reactive Programing Virtual Threads

Slide 10

Slide 10 text

… and why we need them - the RIGHT reason What virtual threads are really good for is… waiting Virtual threads allow to write asynchronous code in a way that looks synchronous. The main goal of virtual threads is having scalable application without paying the cost in terms of more complex code. Program Complexity Program Scalability One Thread per task Reactive Programing Virtual Threads

Slide 11

Slide 11 text

The Spiral of Innovation Concurrent Programming in Java

Slide 12

Slide 12 text

The Spiral of Innovation Concurrent Programming in Java Green threads

Slide 13

Slide 13 text

The Spiral of Innovation Concurrent Programming in Java Green threads Native threads

Slide 14

Slide 14 text

The Spiral of Innovation Concurrent Programming in Java Green threads Native threads Events ( actors / message passing )

Slide 15

Slide 15 text

The Spiral of Innovation Concurrent Programming in Java Green threads Native threads Events ( actors / message passing ) Reactive programming

Slide 16

Slide 16 text

The Spiral of Innovation Concurrent Programming in Java Green threads Native threads Events ( actors / message passing ) Reactive programming Virtual threads (Project Loom)

Slide 17

Slide 17 text

Virtual threads are syntactic sugar for reactive programming Virtual threads do not enable anything that was impossible before. They allow to achieve the same result of reactive programming in a much easier and less verbose way.

Slide 18

Slide 18 text

Virtual threads are syntactic sugar for reactive programming // sum ages of persons having name starting with A ... int sumAges = persons.stream() .filter( p -> p.getName().startsWith("A") ) .reduce( 0, (acc, p) -> acc + p.getAge(), Integer::sum); Virtual threads do not enable anything that was impossible before. They allow to achieve the same result of reactive programming in a much easier and less verbose way. Loom is for concurrent programming what Lambdas has been for functional programming.

Slide 19

Slide 19 text

Virtual threads are syntactic sugar for reactive programming // sum ages of persons having name starting with A ... int sumAges = persons.stream() .filter( p -> p.getName().startsWith("A") ) .reduce( 0, (acc, p) -> acc + p.getAge(), Integer::sum); // ... same without lambdas sumAges = persons.stream() .filter(new Predicate() { @Override public boolean test(Person person) { return person.getName().startsWith("A"); } }) .reduce(0, new BiFunction() { @Override public Integer apply(Integer acc, Person person) { return acc + person.getAge(); } }, new BinaryOperator() { @Override public Integer apply(Integer a, Integer b) { return a + b; } }); Virtual threads do not enable anything that was impossible before. They allow to achieve the same result of reactive programming in a much easier and less verbose way. Loom is for concurrent programming what Lambdas has been for functional programming. You could do FP even without λs but at what cost?

Slide 20

Slide 20 text

Virtual threads are syntactic sugar for reactive programming … but not only that

Slide 21

Slide 21 text

Virtual threads are syntactic sugar for reactive programming … but not only that

Slide 22

Slide 22 text

Virtual threads are syntactic sugar for reactive programming … but not only that for other conversations that should have been a blog post 😎

Slide 23

Slide 23 text

(Possible) issues with Virtual Threads ❖ Loom uses the Java heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC.

Slide 24

Slide 24 text

(Possible) issues with Virtual Threads ❖ Loom uses the Java heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier.

Slide 25

Slide 25 text

(Possible) issues with Virtual Threads ❖ Loom uses the Java heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses.

Slide 26

Slide 26 text

(Possible) issues with Virtual Threads ❖ Loom uses the Java heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses. ❖ Virtual Threads scheduler is not pluggable and currently hardcoded to the Fork/Join Pool scheduler.

Slide 27

Slide 27 text

(Possible) issues with Virtual Threads ❖ Loom uses the Java heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses. ❖ Virtual Threads scheduler is not pluggable and currently hardcoded to the Fork/Join Pool scheduler. ❖ Virtual Threads are not pre-emptive, they cannot be descheduled while running heavy calculations without ever calling a JDK’s blocking methods, so they are not a good fit for CPU-bound tasks when fairness is key.

Slide 28

Slide 28 text

(Possible) issues with Virtual Threads ❖ Loom uses the Java heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses. ❖ Virtual Threads scheduler is not pluggable and currently hardcoded to the Fork/Join Pool scheduler. ❖ Virtual Threads are not pre-emptive, they cannot be descheduled while running heavy calculations without ever calling a JDK’s blocking methods, so they are not a good fit for CPU-bound tasks when fairness is key. ❖ Virtual threads support thread locals, but since they can be very numerous, thread locals could have a huge memory footprint and should be used sparingly. They could be replaced by Extent-Local Variables (JEP-429).

Slide 29

Slide 29 text

Game of Loom – The original Loom based implementation https://github.com/ebarlas/game-of-life-csp

Slide 30

Slide 30 text

What is CSP?

Slide 31

Slide 31 text

What is CSP?

Slide 32

Slide 32 text

What is CSP? CSP is a concurrent programming model using communication as synchronization primitive

Slide 33

Slide 33 text

Game of Loom – The original Loom based implementation ❖ All communications between threads happen through a (monodirectional) blocking channel public class Channel { private final BlockingQueue queue = new LinkedBlockingQueue<>(); T take() { try { return queue.take(); } catch (InterruptedException e) { return null; } } void put(T value) { try { queue.put(value); } catch (InterruptedException e) { // abort } } }

Slide 34

Slide 34 text

Game of Loom – The original Loom based implementation ❖ Each grid cell is an independent process communicating with its neighbour via channels public class Cell { private boolean alive; private final Channel tickChannel; private final Channel resultChannel; private final List> inChannels; private final List> outChannels; Cell(CellOptions options) { this.alive = options.alive(); this.tickChannel = options.tickChannel(); this.resultChannel = options.resultChannel(); this.inChannels = options.inChannels(); this.outChannels = options.outChannels(); } void start() { Thread.startVirtualThread(this::run); } }

Slide 35

Slide 35 text

Game of Loom – The original Loom based implementation ❖ Each grid cell reads neighbours’ states and calculate and publish its own in an infinite loop public class Cell { private void run() { while (true) { tickChannel.take(); // wait for tick stimulus outChannels.forEach(ch -> ch.put(alive)); // announce liveness to neighbors // receive liveness from neighbors int neighbors = inChannels.stream().map(Channel::take).mapToInt(b -> b ? 1 : 0).sum(); alive = nextState(alive, neighbors); // calculate next state based on game of life rules resultChannel.put(alive); // announce resulting next state } } private static boolean nextState(boolean alive, int neighbors) { if (alive) { return neighbors == 2 || neighbors == 3; } return neighbors == 3; } }

Slide 36

Slide 36 text

Game of Loom – The original Loom based implementation ❖ The GameOfLife class coordinates all cells, gathering their states and publishing it for UI consumption public class GameOfLife { void start() { cells.forEach(Cell::start); Thread.startVirtualThread(this::run); } private void run() { while (true) { dims.forEachRowCol((r, c) -> tickChannels.get(r).get(c).put(true)); // emit tick for every cell boolean[][] grid = new boolean[dims.rows()][dims.cols()]; // prepare result boolean matrix dims.forEachRowCol((r, c) -> grid[r][c] = resultChannels.get(r).get(c).take()); // populate results gridChannel.put(grid); // emit aggregated liveness matrix sleep(); } } }

Slide 37

Slide 37 text

Game of Loom – Demo

Slide 38

Slide 38 text

Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations = 10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } }

Slide 39

Slide 39 text

Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations = 10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads

Slide 40

Slide 40 text

Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations = 10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads Having one thread per cell or per core

Slide 41

Slide 41 text

Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations = 10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads Having one thread per cell or per core Different channel implementations

Slide 42

Slide 42 text

Native vs. Virtual Threads public class GameOfLife { private final Consumer runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } }

Slide 43

Slide 43 text

Native vs. Virtual Threads public class GameOfLife { private final Consumer runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread

Slide 44

Slide 44 text

Native vs. Virtual Threads public class GameOfLife { private final Consumer runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread Called by UI to get next frame

Slide 45

Slide 45 text

Native vs. Virtual Threads public class GameOfLife { private final Consumer runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread Called by UI to get next frame Blocking headless version Used in benchmarks

Slide 46

Slide 46 text

Native vs. Virtual Threads – Benchmark Results Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark 100 true thrpt 20 4.376 ± 0.425 ops/s GameOfLifeBenchmark.benchmark 100 false thrpt 20 NaN ---

Slide 47

Slide 47 text

Native vs. Virtual Threads – Benchmark Results Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark 100 true thrpt 20 4.376 ± 0.425 ops/s GameOfLifeBenchmark.benchmark 100 false thrpt 20 NaN --- ~7x OOM

Slide 48

Slide 48 text

Mixed-Mode Flame Graphs ❖ X-axis alphabetical stack sort ie NOT A TIME SERIES! ❖ Width is proportional to samples presence ❖ Colors ➢ green - Java ➢ aqua - Inlined Java ➢ orange - Kernel code ➢ red - C libraries (eg JNI) ➢ yellow - C++ (eg JVM) ❖ Color intensity is randomized to differentiate frames ❖ Thread ID is added as a base frame

Slide 49

Slide 49 text

Native Threads Execution

Slide 50

Slide 50 text

Native Threads Execution Cell.run

Slide 51

Slide 51 text

Virtual Threads Execution

Slide 52

Slide 52 text

Work stealing (FJ pool) Virtual Threads Execution

Slide 53

Slide 53 text

Virtual Threads Execution Cell.run Work stealing (FJ pool)

Slide 54

Slide 54 text

Native vs. Virtual Threads – Context Switch jmh -prof “perfnorm:events=context-switches”

Slide 55

Slide 55 text

Native vs. Virtual Threads – Context Switch jmh -prof “perfnorm:events=context-switches” Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark:context-switches 25 true thrpt 20 10.923 #/op GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark:context-switches 25 false thrpt 20 15741.886 #/op

Slide 56

Slide 56 text

Native vs. Virtual Threads – Context Switch jmh -prof “perfnorm:events=context-switches” Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark:context-switches 25 true thrpt 20 10.923 #/op GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark:context-switches 25 false thrpt 20 15741.886 #/op

Slide 57

Slide 57 text

Native vs. Virtual Threads – Context Switch htop on native threads htop on virtual threads jmh -prof “perfnorm:events=context-switches” Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark:context-switches 25 true thrpt 20 10.923 #/op GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark:context-switches 25 false thrpt 20 15741.886 #/op Kernel space User space

Slide 58

Slide 58 text

Experimenting with different Channels types ❖ All communications between threads happen through a (monodirectional) blocking channel public class Channel { private final BlockingQueue queue = new LinkedBlockingQueue<>(); T take() { try { return queue.take(); } catch (InterruptedException e) { return null; } } void put(T value) { try { queue.put(value); } catch (InterruptedException e) { // abort } } }

Slide 59

Slide 59 text

Experimenting with different Channels types ❖ All communications between threads happen through a (monodirectional) blocking channel ❖ The channel contains (at most) a single value at time and it is used to synchronize the producer and consumer (CSP pattern) acting as a rendez-vous point. public class Channel { private final BlockingRendezVous rendezVous; public Channel(BlockingRendezVous.Type type) { rendezVous = BlockingRendezVous.of(type); } T take() { return rendezVous.take(); } void put(T value) { rendezVous.put(value); } }

Slide 60

Slide 60 text

Experimenting with different Channels types public interface BlockingRendezVous { void put(T x) throws InterruptedException; T take() throws InterruptedException; enum Type { BlockingQueue, // the original one based on a LinkedBlockingQueue BlockingTransfer, // same but based on a LinkedTransferQueue (variation of M&S lock-free queue algorithm) LockedSingleValue, // locked data structure similar to a BlockingQueue but holding one single value OneToOneParking // lock-free specialized one-producer/one-consumer channel } static BlockingRendezVous of(Type type) { return switch (type) { case BlockingQueue -> new BlockingQueue<>(); case BlockingTransfer -> new BlockingTransfer<>(); case LockedSingleValue -> new LockedSingleValue<>(); case OneToOneParking -> new OneToOneParkingSingleValue<>(); default -> throw new AssertionError("unknown type: " + type); }; } }

Slide 61

Slide 61 text

Channels types – LockedSingleValue final class LockedSingleValue implements BlockingRendezVous { private final Lock lock = new ReentrantLock(); private final Condition notFull = lock.newCondition(); private final Condition notEmpty = lock.newCondition(); private T value; @Override public void put(T x) throws InterruptedException { lock.lock(); try { if (value != null) { notFull.await(); } value = x; notEmpty.signal(); } finally { lock.unlock(); } } @Override public T take() throws InterruptedException { lock.lock(); try { if (value == null) { notEmpty.await(); } T x = value; value = null; notFull.signal(); return x; } finally { lock.unlock(); } } }

Slide 62

Slide 62 text

Channels types – OneToOneParkingSingleValue final class OneToOneParkingSingleValue implements BlockingRendezVous { private volatile Thread waitSomeValue = null; private volatile Thread waitEmpty = null; private volatile T value; @Override public void put(T x) throws InterruptedException { Objects.requireNonNull(x); while (value != null) { Thread cThread = Thread.currentThread(); waitEmpty = currentThread; try { if (value == null) break; LockSupport.park(); if (cThread.isInterrupted()) { throw new InterruptedException(); } } finally { waitEmpty = null; } } value = x; LockSupport.unpark(waitSomeValue); } @Override public T take() throws InterruptedException { T t; while ((t = value) == null) { Thread cThread = Thread.currentThread(); waitSomeValue = ct; try { t = value; if (t != null) break; LockSupport.park(); if (cThread.isInterrupted()) { throw new InterruptedException(); } } finally { waitSomeValue = null; } } assert t != null; value = null; LockSupport.unpark(waitEmpty); return t; }

Slide 63

Slide 63 text

Different Channels types – Benchmark results Native Threads Benchmark (channelType) (padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 16.652 ± 0.650 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 17.428 ± 0.683 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 18.440 ± 0.893 ops/s

Slide 64

Slide 64 text

Different Channels types – Benchmark results Native Threads Benchmark (channelType) (padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 16.652 ± 0.650 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 17.428 ± 0.683 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 18.440 ± 0.893 ops/s Virtual Threads Benchmark (channelType) (padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 154.734 ± 6.313 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 161.351 ± 7.030 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 284.477 ± 12.121 ops/s

Slide 65

Slide 65 text

Different Channels types – Benchmark results Native Threads Benchmark (channelType) (padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 16.652 ± 0.650 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 17.428 ± 0.683 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 18.440 ± 0.893 ops/s Virtual Threads Benchmark (channelType) (padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 154.734 ± 6.313 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 161.351 ± 7.030 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 284.477 ± 12.121 ops/s

Slide 66

Slide 66 text

LinkedBlockingQueue based channel analysis

Slide 67

Slide 67 text

LinkedBlockingQueue based channel analysis

Slide 68

Slide 68 text

OneToOneParking based channel analysis

Slide 69

Slide 69 text

OneToOneParking based channel analysis

Slide 70

Slide 70 text

Thread per cell vs. Thread per core public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); }

Slide 71

Slide 71 text

Thread per cell vs. Thread per core public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Abstracts the number of threads to be used … and how cells are evaluated

Slide 72

Slide 72 text

Thread per cell vs. Thread per core public class ThreadPerCellGameOfLife extends GameOfLife { public ThreadPerCellGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); } @Override protected void startCells() { cells.forEach(cell -> runner.accept(cell::run)); } @Override protected int getThreadPoolSize() { return cells.size() + 1; } } public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Abstracts the number of threads to be used … and how cells are evaluated

Slide 73

Slide 73 text

Thread per cell vs. Thread per core public class ThreadPerCellGameOfLife extends GameOfLife { public ThreadPerCellGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); } @Override protected void startCells() { cells.forEach(cell -> runner.accept(cell::run)); } @Override protected int getThreadPoolSize() { return cells.size() + 1; } } public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Run each cell in its own thread Abstracts the number of threads to be used … and how cells are evaluated One thread per cell + 1 for coordinator

Slide 74

Slide 74 text

Thread per cell vs. Thread per core public class ThreadPerCoreGameOfLife extends GameOfLife { private final List cellsGroups; public ThreadPerCoreGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); cellsGroups = createCellGroups(); } private List createCellGroups() { List cellsGroups = new ArrayList<>(); int nThread = Runtime.getRuntime().availableProcessors(); int cellsPerGroup = (int) Math.round( (double) cells.size() / (double) (nThread-1) ); int groupStart = 0; for (int i = 0; i <= nThread-1; i++) { cellsGroups.add( new CellsGroup( cells.subList(groupStart, groupStart + cellsPerGroup) ) ); groupStart += cellsPerGroup; } cellsGroups.add( new CellsGroup( cells.subList(groupStart, cells.size()) ) ); return cellsGroups; } @Override protected void startCells() { cellsGroups.forEach( cg -> runner.accept(cg::run) ); } @Override protected int getThreadPoolSize() { return Runtime.getRuntime().availableProcessors(); } }

Slide 75

Slide 75 text

Thread per cell vs. Thread per core public class ThreadPerCoreGameOfLife extends GameOfLife { private final List cellsGroups; public ThreadPerCoreGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); cellsGroups = createCellGroups(); } private List createCellGroups() { List cellsGroups = new ArrayList<>(); int nThread = Runtime.getRuntime().availableProcessors(); int cellsPerGroup = (int) Math.round( (double) cells.size() / (double) (nThread-1) ); int groupStart = 0; for (int i = 0; i <= nThread-1; i++) { cellsGroups.add( new CellsGroup( cells.subList(groupStart, groupStart + cellsPerGroup) ) ); groupStart += cellsPerGroup; } cellsGroups.add( new CellsGroup( cells.subList(groupStart, cells.size()) ) ); return cellsGroups; } @Override protected void startCells() { cellsGroups.forEach( cg -> runner.accept(cg::run) ); } @Override protected int getThreadPoolSize() { return Runtime.getRuntime().availableProcessors(); } } Thread per core Run each group in its own thread group cells in n-1 groups n-th is for coordinator

Slide 76

Slide 76 text

Cells group public class CellsGroup { private final List cells; public CellsGroup(List cells) { this.cells = cells; } public void run() { while (true) { cells.forEach( Cell::notifyLiveness ); cells.forEach( Cell::calculateNextState ); } } }

Slide 77

Slide 77 text

Thread per cell vs. Thread per core – Benchmark results Thread per cell Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s

Slide 78

Slide 78 text

Thread per cell vs. Thread per core – Benchmark results Thread per cell Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s Thread per core Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 1569.228 ± 91.740 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 1890.566 ± 139.547 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 158.609 ± 2.007 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 163.086 ± 1.434 ops/s

Slide 79

Slide 79 text

Thread per cell vs. Thread per core – Benchmark results Thread per cell Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s Thread per core Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 1569.228 ± 91.740 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 1890.566 ± 139.547 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 158.609 ± 2.007 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 163.086 ± 1.434 ops/s Change the rules of the game and you’ll have a different winner

Slide 80

Slide 80 text

What about native compilation? Swing/AWT is not supported

Slide 81

Slide 81 text

What about native compilation? Swing/AWT is not supported Necessary native compilation flag + Textual UI

Slide 82

Slide 82 text

Benchmarking native compilation vs. JDK BlockingQueue Native: Completed 5000 frames in 62586ms = 79.890071 frames per second JDK: Completed 5000 frames in 35618ms = 140.378460 frames per second 2000 frames warmup + 5000 frames benchmarking

Slide 83

Slide 83 text

Benchmarking native compilation vs. JDK BlockingQueue Native: Completed 5000 frames in 62586ms = 79.890071 frames per second JDK: Completed 5000 frames in 35618ms = 140.378460 frames per second OneToOneParking Native: Completed 5000 frames in 19681ms = 254.052131 frames per second JDK: Completed 5000 frames in 13398ms = 373.190028 frames per second 2000 frames warmup + 5000 frames benchmarking

Slide 84

Slide 84 text

Conclusions ❖ Virtual threads aren’t faster threads ➢ They don’t magically execute more instructions per second than native ones do. ➢ What they are really good for is waiting (but pay attention to pinning). ➢ In other words their goal is maximizing the utilization of external resources and then improving throughput. ➢ This is the same goal of reactive programming, but in a much more readable and developers friendly way.

Slide 85

Slide 85 text

Conclusions ❖ Virtual threads aren’t faster threads ➢ They don’t magically execute more instructions per second than native ones do. ➢ What they are really good for is waiting (but pay attention to pinning). ➢ In other words their goal is maximizing the utilization of external resources and then improving throughput. ➢ This is the same goal of reactive programming, but in a much more readable and developers friendly way. ❖ How good are virtual threads? ➢ They are really good in I/O bound scenarios. ➢ The work-stealing scheduler and lack of fairness due to the fact that they are not pre-emptive make them less suitable for CPU-bound scenarios. ➢ They are not drop in replacements for native threads in each and every situations.

Slide 86

Slide 86 text

Conclusions ❖ Virtual threads aren’t faster threads ➢ They don’t magically execute more instructions per second than native ones do. ➢ What they are really good for is waiting (but pay attention to pinning). ➢ In other words their goal is maximizing the utilization of external resources and then improving throughput. ➢ This is the same goal of reactive programming, but in a much more readable and developers friendly way. ❖ How good are virtual threads? ➢ They are really good in I/O bound scenarios. ➢ The work-stealing scheduler and lack of fairness due to the fact that they are not pre-emptive make them less suitable for CPU-bound scenarios. ➢ They are not drop in replacements for native threads in each and every situations. Domain knowledge still matters … a lot!

Slide 87

Slide 87 text

References ❖ Original Game of Life CSP implementation ➢ https://github.com/ebarlas/game-of-life-csp ❖ My fork used in this presentation ➢ https://github.com/mariofusco/game-of-life-csp ❖ Loom and Thread Fairness ➢ https://www.morling.dev/blog/loom-and-thread-fairness/ ❖ From Imperative to Pure-Functional and Back Again: Monads vs. Scoped Continuations ➢ https://www.javacodegeeks.com/2015/08/from-imperative-to-pure-functional-and-back-again-monads-vs-scoped-continuations.html ❖ Coming to Java 19: Virtual threads and platform threads ➢ https://blogs.oracle.com/javamagazine/post/java-loom-virtual-threads-platform-threads ❖ Going inside Java’s Project Loom and virtual threads ➢ https://blogs.oracle.com/javamagazine/post/going-inside-javas-project-loom-and-virtual-threads ❖ Webtide’s Loom Report ➢ https://webtide.com/do-looms-claims-stack-up-part-1/ ➢ https://webtide.com/do-looms-claims-stack-up-part-2/ ❖ Java Mixedmode Flame Graphs ➢ https://www.slideshare.net/brendangregg/javaone-2015-java-mixedmode-flame-graphs ❖ Single Writer Principle ➢ https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html ❖ Red/Blue functions problem ➢ http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/ ➢ https://www.tedinski.com/2018/11/13/function-coloring.html ❖ JEP 429: Extent-Local Variables (Incubator) ➢ https://openjdk.org/jeps/429