Game of Loom: implementation patterns and performance implications playing with virtual threads

Game of Loom: implementation patterns and performance implications playing with
virtual threads by Mario Fusco @mariofusco

What virtual threads are … ❖ JDK native threads are
thin wrappers around operating system threads ➢ Native threads are costly and memory consuming ➢ You cannot have too many of them ❖ Virtual threads make a better use of OS threads getting rid of the 1:1 relationship with operating system threads ➢ Many virtual threads can be multiplexed on the same OS thread (carrier) ➢ When a virtual thread calls a blocking operation the JDK performs a nonblocking OS call and automatically suspends the virtual thread until the operation finishes

… and why we need them - the WRONG reason
❖ Native threads: ➢ 2kB metadata ➢ Preallocated 1MB stack ➢ 1-10 µs for context switch ❖ Virtual threads: ➢ 200-300B metadata ➢ Pay-as-you-go stack (allocated on heap) ➢ Some ns (or below 1µs) for context switch

… and why we need them - the WRONG reason
❖ Native threads: ➢ 2kB metadata ➢ Preallocated 1MB stack ➢ 1-10 µs for context switch ❖ Virtual threads: ➢ 200-300B metadata ➢ Pay-as-you-go stack (allocated on heap) ➢ Some ns (or below 1µs) for context switch You can create MILLIONS of threads!!!

Typical multithreading application How you designed it

Typical multithreading application How it actually works

Virtual threads aren't great because you can have gazillions of
them. That's just a side effect of their design and behavior which are the actual great things.

… and why we need them - the RIGHT reason
Virtual threads allow to write asynchronous code in a way that looks synchronous. The main goal of virtual threads is having scalable application without paying the cost in terms of more complex code. Program Complexity Program Scalability One Thread per task Reactive Programing Virtual Threads

… and why we need them - the RIGHT reason
What virtual threads are really good for is… waiting Virtual threads allow to write asynchronous code in a way that looks synchronous. The main goal of virtual threads is having scalable application without paying the cost in terms of more complex code. Program Complexity Program Scalability One Thread per task Reactive Programing Virtual Threads

The Spiral of Innovation Concurrent Programming in Java

The Spiral of Innovation Concurrent Programming in Java Green threads

Native threads

Native threads Events ( actors / message passing )

Native threads Events ( actors / message passing ) Reactive programming

Native threads Events ( actors / message passing ) Reactive programming Virtual threads (Project Loom)

Virtual threads are syntactic sugar for reactive programming Virtual threads
do not enable anything that was impossible before. They allow to achieve the same result of reactive programming in a much easier and less verbose way.

Virtual threads are syntactic sugar for reactive programming // sum
ages of persons having name starting with A ... int sumAges = persons.stream() .filter( p -> p.getName().startsWith("A") ) .reduce( 0, (acc, p) -> acc + p.getAge(), Integer::sum); Virtual threads do not enable anything that was impossible before. They allow to achieve the same result of reactive programming in a much easier and less verbose way. Loom is for concurrent programming what Lambdas has been for functional programming.

Virtual threads are syntactic sugar for reactive programming // sum
ages of persons having name starting with A ... int sumAges = persons.stream() .filter( p -> p.getName().startsWith("A") ) .reduce( 0, (acc, p) -> acc + p.getAge(), Integer::sum); // ... same without lambdas sumAges = persons.stream() .filter(new Predicate<Person>() { @Override public boolean test(Person person) { return person.getName().startsWith("A"); } }) .reduce(0, new BiFunction<Integer, Person, Integer>() { @Override public Integer apply(Integer acc, Person person) { return acc + person.getAge(); } }, new BinaryOperator<Integer>() { @Override public Integer apply(Integer a, Integer b) { return a + b; } }); Virtual threads do not enable anything that was impossible before. They allow to achieve the same result of reactive programming in a much easier and less verbose way. Loom is for concurrent programming what Lambdas has been for functional programming. You could do FP even without λs but at what cost?

Virtual threads are syntactic sugar for reactive programming … but
not only that

Virtual threads are syntactic sugar for reactive programming … but
not only that for other conversations that should have been a blog post 😎

(Possible) issues with Virtual Threads ❖ Loom uses the Java
heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC.

heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier.

heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses.

heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses. ❖ Virtual Threads scheduler is not pluggable and currently hardcoded to the Fork/Join Pool scheduler.

heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses. ❖ Virtual Threads scheduler is not pluggable and currently hardcoded to the Fork/Join Pool scheduler. ❖ Virtual Threads are not pre-emptive, they cannot be descheduled while running heavy calculations without ever calling a JDK’s blocking methods, so they are not a good fit for CPU-bound tasks when fairness is key.

heap to store virtual threads stacks and other metadata. For large number of virtual threads with deep stacks this may have a relevant impact on the GC. ❖ Pinning: synchronized blocks and other specific constructs (like JNI functions) are not recognized as blocking operations, thus actually blocking not only the virtual thread but also its carrier. ❖ Massive context switches (in user space) of a huge number of virtual threads can potentially cause a larger amount of cache misses. ❖ Virtual Threads scheduler is not pluggable and currently hardcoded to the Fork/Join Pool scheduler. ❖ Virtual Threads are not pre-emptive, they cannot be descheduled while running heavy calculations without ever calling a JDK’s blocking methods, so they are not a good fit for CPU-bound tasks when fairness is key. ❖ Virtual threads support thread locals, but since they can be very numerous, thread locals could have a huge memory footprint and should be used sparingly. They could be replaced by Extent-Local Variables (JEP-429).

Game of Loom – The original Loom based implementation https://github.com/ebarlas/game-of-life-csp

What is CSP?

What is CSP? CSP is a concurrent programming model using
communication as synchronization primitive

Game of Loom – The original Loom based implementation ❖
All communications between threads happen through a (monodirectional) blocking channel public class Channel<T> { private final BlockingQueue<T> queue = new LinkedBlockingQueue<>(); T take() { try { return queue.take(); } catch (InterruptedException e) { return null; } } void put(T value) { try { queue.put(value); } catch (InterruptedException e) { // abort } } }

Each grid cell is an independent process communicating with its neighbour via channels public class Cell { private boolean alive; private final Channel<Boolean> tickChannel; private final Channel<Boolean> resultChannel; private final List<Channel<Boolean>> inChannels; private final List<Channel<Boolean>> outChannels; Cell(CellOptions options) { this.alive = options.alive(); this.tickChannel = options.tickChannel(); this.resultChannel = options.resultChannel(); this.inChannels = options.inChannels(); this.outChannels = options.outChannels(); } void start() { Thread.startVirtualThread(this::run); } }

Each grid cell reads neighbours’ states and calculate and publish its own in an infinite loop public class Cell { private void run() { while (true) { tickChannel.take(); // wait for tick stimulus outChannels.forEach(ch -> ch.put(alive)); // announce liveness to neighbors // receive liveness from neighbors int neighbors = inChannels.stream().map(Channel::take).mapToInt(b -> b ? 1 : 0).sum(); alive = nextState(alive, neighbors); // calculate next state based on game of life rules resultChannel.put(alive); // announce resulting next state } } private static boolean nextState(boolean alive, int neighbors) { if (alive) { return neighbors == 2 || neighbors == 3; } return neighbors == 3; } }

The GameOfLife class coordinates all cells, gathering their states and publishing it for UI consumption public class GameOfLife { void start() { cells.forEach(Cell::start); Thread.startVirtualThread(this::run); } private void run() { while (true) { dims.forEachRowCol((r, c) -> tickChannels.get(r).get(c).put(true)); // emit tick for every cell boolean[][] grid = new boolean[dims.rows()][dims.cols()]; // prepare result boolean matrix dims.forEachRowCol((r, c) -> grid[r][c] = resultChannels.get(r).get(c).take()); // populate results gridChannel.put(grid); // emit aggregated liveness matrix sleep(); } } }

Game of Loom – Demo

Benchmarking Game of Life @State(Scope.Benchmark) @Warmup(iterations = 5) @Measurement(iterations =
10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } }

10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads

10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads Having one thread per cell or per core

10) @Fork(2) public class GameOfLifeBenchmark { @Param({"5", "25", "100"}) // 874, 5074, 49324 cells private int padding; @Param({"true", "false"}) private boolean useVirtualThreads; @Param({"true", "false"}) private boolean threadPerCell; @Param({"BlockingQueue", "BlockingTransfer", "LockedSingleValue", "OneToOneParking"}) private String channelType; private GameOfLife gameOfLife; @Setup public void setup() { // setup GameOfLife according to benchmark parameters, omitted ... } @Benchmark @BenchmarkMode(Mode.Throughput) public boolean[][] benchmark() { return gameOfLife.calculateFrameBlocking(); } } Use either native or virtual threads Having one thread per cell or per core Different channel implementations

Native vs. Virtual Threads public class GameOfLife { private final
Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } }

Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread

Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread Called by UI to get next frame

Consumer<Runnable> runner; public GameOfLife(boolean useVirtualThreads) { if (useVirtualThreads) this.runner = Thread::startVirtualThread; else this.runner = Executors.newFixedThreadPool(getThreadPoolSize(), r -> { Thread t = new Thread(r); t.setDaemon(true); return t; })::execute; } public boolean[][] calculateFrame() { tick.tick(); // emit tick event for every cell boolean[][] grid = new boolean[dimensions.rows()][dimensions.cols()]; // prepare result boolean matrix resultChannels.forEachChannel( Channel::take, grid ); // populate matrix with results gridChannel.put(grid); // emit aggregated liveness matrix return grid; } public boolean[][] calculateFrameBlocking() { runner.accept(() -> calculateFrame()); return gridChannel.take(); // block until the result is ready } } Executes a Runnable on a virtual or a native thread Called by UI to get next frame Blocking headless version Used in benchmarks

Native vs. Virtual Threads – Benchmark Results Benchmark (padding) (useVirtualThreads)
Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark 100 true thrpt 20 4.376 ± 0.425 ops/s GameOfLifeBenchmark.benchmark 100 false thrpt 20 NaN ---

Native vs. Virtual Threads – Benchmark Results Benchmark (padding) (useVirtualThreads)
Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark 100 true thrpt 20 4.376 ± 0.425 ops/s GameOfLifeBenchmark.benchmark 100 false thrpt 20 NaN --- ~7x OOM

Mixed-Mode Flame Graphs ❖ X-axis alphabetical stack sort ie NOT
A TIME SERIES! ❖ Width is proportional to samples presence ❖ Colors ➢ green - Java ➢ aqua - Inlined Java ➢ orange - Kernel code ➢ red - C libraries (eg JNI) ➢ yellow - C++ (eg JVM) ❖ Color intensity is randomized to differentiate frames ❖ Thread ID is added as a base frame

Native Threads Execution

Native Threads Execution Cell.run

Virtual Threads Execution

Work stealing (FJ pool) Virtual Threads Execution

Virtual Threads Execution Cell.run Work stealing (FJ pool)

Native vs. Virtual Threads – Context Switch jmh -prof “perfnorm:events=context-switches”

Native vs. Virtual Threads – Context Switch jmh -prof “perfnorm:events=context-switches”
Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark:context-switches 25 true thrpt 20 10.923 #/op GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark:context-switches 25 false thrpt 20 15741.886 #/op

Native vs. Virtual Threads – Context Switch htop on native
threads htop on virtual threads jmh -prof “perfnorm:events=context-switches” Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark:context-switches 25 true thrpt 20 10.923 #/op GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark:context-switches 25 false thrpt 20 15741.886 #/op Kernel space User space

Experimenting with different Channels types ❖ All communications between threads
happen through a (monodirectional) blocking channel public class Channel<T> { private final BlockingQueue<T> queue = new LinkedBlockingQueue<>(); T take() { try { return queue.take(); } catch (InterruptedException e) { return null; } } void put(T value) { try { queue.put(value); } catch (InterruptedException e) { // abort } } }

Experimenting with different Channels types ❖ All communications between threads
happen through a (monodirectional) blocking channel ❖ The channel contains (at most) a single value at time and it is used to synchronize the producer and consumer (CSP pattern) acting as a rendez-vous point. public class Channel<T> { private final BlockingRendezVous<T> rendezVous; public Channel(BlockingRendezVous.Type type) { rendezVous = BlockingRendezVous.of(type); } T take() { return rendezVous.take(); } void put(T value) { rendezVous.put(value); } }

Experimenting with different Channels types public interface BlockingRendezVous<T> { void
put(T x) throws InterruptedException; T take() throws InterruptedException; enum Type { BlockingQueue, // the original one based on a LinkedBlockingQueue BlockingTransfer, // same but based on a LinkedTransferQueue (variation of M&S lock-free queue algorithm) LockedSingleValue, // locked data structure similar to a BlockingQueue but holding one single value OneToOneParking // lock-free specialized one-producer/one-consumer channel } static <T> BlockingRendezVous<T> of(Type type) { return switch (type) { case BlockingQueue -> new BlockingQueue<>(); case BlockingTransfer -> new BlockingTransfer<>(); case LockedSingleValue -> new LockedSingleValue<>(); case OneToOneParking -> new OneToOneParkingSingleValue<>(); default -> throw new AssertionError("unknown type: " + type); }; } }

Channels types – LockedSingleValue final class LockedSingleValue<T> implements BlockingRendezVous<T> {
private final Lock lock = new ReentrantLock(); private final Condition notFull = lock.newCondition(); private final Condition notEmpty = lock.newCondition(); private T value; @Override public void put(T x) throws InterruptedException { lock.lock(); try { if (value != null) { notFull.await(); } value = x; notEmpty.signal(); } finally { lock.unlock(); } } @Override public T take() throws InterruptedException { lock.lock(); try { if (value == null) { notEmpty.await(); } T x = value; value = null; notFull.signal(); return x; } finally { lock.unlock(); } } }

Channels types – OneToOneParkingSingleValue final class OneToOneParkingSingleValue<T> implements BlockingRendezVous<T> {
private volatile Thread waitSomeValue = null; private volatile Thread waitEmpty = null; private volatile T value; @Override public void put(T x) throws InterruptedException { Objects.requireNonNull(x); while (value != null) { Thread cThread = Thread.currentThread(); waitEmpty = currentThread; try { if (value == null) break; LockSupport.park(); if (cThread.isInterrupted()) { throw new InterruptedException(); } } finally { waitEmpty = null; } } value = x; LockSupport.unpark(waitSomeValue); } @Override public T take() throws InterruptedException { T t; while ((t = value) == null) { Thread cThread = Thread.currentThread(); waitSomeValue = ct; try { t = value; if (t != null) break; LockSupport.park(); if (cThread.isInterrupted()) { throw new InterruptedException(); } } finally { waitSomeValue = null; } } assert t != null; value = null; LockSupport.unpark(waitEmpty); return t; }

Different Channels types – Benchmark results Native Threads Benchmark (channelType)
(padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 16.652 ± 0.650 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 17.428 ± 0.683 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 18.440 ± 0.893 ops/s

Different Channels types – Benchmark results Native Threads Benchmark (channelType)
(padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 17.654 ± 0.579 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 16.652 ± 0.650 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 17.428 ± 0.683 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 18.440 ± 0.893 ops/s Virtual Threads Benchmark (channelType) (padding) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark BlockingQueue 25 thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark BlockingTransfer 25 thrpt 20 154.734 ± 6.313 ops/s GameOfLifeBenchmark.benchmark LockedSingleValue 25 thrpt 20 161.351 ± 7.030 ops/s GameOfLifeBenchmark.benchmark OneToOneParking 25 thrpt 20 284.477 ± 12.121 ops/s

LinkedBlockingQueue based channel analysis

OneToOneParking based channel analysis

Thread per cell vs. Thread per core public abstract class
GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); }

Thread per cell vs. Thread per core public abstract class
GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Abstracts the number of threads to be used … and how cells are evaluated

Thread per cell vs. Thread per core public class ThreadPerCellGameOfLife
extends GameOfLife { public ThreadPerCellGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); } @Override protected void startCells() { cells.forEach(cell -> runner.accept(cell::run)); } @Override protected int getThreadPoolSize() { return cells.size() + 1; } } public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Abstracts the number of threads to be used … and how cells are evaluated

Thread per cell vs. Thread per core public class ThreadPerCellGameOfLife
extends GameOfLife { public ThreadPerCellGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); } @Override protected void startCells() { cells.forEach(cell -> runner.accept(cell::run)); } @Override protected int getThreadPoolSize() { return cells.size() + 1; } } public abstract class GameOfLife { public void start() { startCells(); startGame(); } protected abstract int getThreadPoolSize(); protected abstract void startCells(); } Run each cell in its own thread Abstracts the number of threads to be used … and how cells are evaluated One thread per cell + 1 for coordinator

Thread per cell vs. Thread per core public class ThreadPerCoreGameOfLife
extends GameOfLife { private final List<CellsGroup> cellsGroups; public ThreadPerCoreGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); cellsGroups = createCellGroups(); } private List<CellsGroup> createCellGroups() { List<CellsGroup> cellsGroups = new ArrayList<>(); int nThread = Runtime.getRuntime().availableProcessors(); int cellsPerGroup = (int) Math.round( (double) cells.size() / (double) (nThread-1) ); int groupStart = 0; for (int i = 0; i <= nThread-1; i++) { cellsGroups.add( new CellsGroup( cells.subList(groupStart, groupStart + cellsPerGroup) ) ); groupStart += cellsPerGroup; } cellsGroups.add( new CellsGroup( cells.subList(groupStart, cells.size()) ) ); return cellsGroups; } @Override protected void startCells() { cellsGroups.forEach( cg -> runner.accept(cg::run) ); } @Override protected int getThreadPoolSize() { return Runtime.getRuntime().availableProcessors(); } }

Thread per cell vs. Thread per core public class ThreadPerCoreGameOfLife
extends GameOfLife { private final List<CellsGroup> cellsGroups; public ThreadPerCoreGameOfLife(Dimensions dim, boolean[][] seed, int period, Channel<boolean[][]> gridChannel, boolean logRate, boolean useVirtualThreads, BlockingRendezVous.Type type) { super(dim, seed, period, gridChannel, logRate, useVirtualThreads, type); cellsGroups = createCellGroups(); } private List<CellsGroup> createCellGroups() { List<CellsGroup> cellsGroups = new ArrayList<>(); int nThread = Runtime.getRuntime().availableProcessors(); int cellsPerGroup = (int) Math.round( (double) cells.size() / (double) (nThread-1) ); int groupStart = 0; for (int i = 0; i <= nThread-1; i++) { cellsGroups.add( new CellsGroup( cells.subList(groupStart, groupStart + cellsPerGroup) ) ); groupStart += cellsPerGroup; } cellsGroups.add( new CellsGroup( cells.subList(groupStart, cells.size()) ) ); return cellsGroups; } @Override protected void startCells() { cellsGroups.forEach( cg -> runner.accept(cg::run) ); } @Override protected int getThreadPoolSize() { return Runtime.getRuntime().availableProcessors(); } } Thread per core Run each group in its own thread group cells in n-1 groups n-th is for coordinator

Cells group public class CellsGroup { private final List<Cell> cells;
public CellsGroup(List<Cell> cells) { this.cells = cells; } public void run() { while (true) { cells.forEach( Cell::notifyLiveness ); cells.forEach( Cell::calculateNextState ); } } }

Thread per cell vs. Thread per core – Benchmark results
Thread per cell Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s

Thread per cell Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s Thread per core Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 1569.228 ± 91.740 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 1890.566 ± 139.547 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 158.609 ± 2.007 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 163.086 ± 1.434 ops/s

Thread per cell Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 972.997 ± 51.172 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 128.482 ± 5.816 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 118.315 ± 4.583 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 17.654 ± 0.579 ops/s Thread per core Benchmark (padding) (useVirtualThreads) Mode Cnt Score Error Units GameOfLifeBenchmark.benchmark 5 true thrpt 20 1569.228 ± 91.740 ops/s GameOfLifeBenchmark.benchmark 5 false thrpt 20 1890.566 ± 139.547 ops/s GameOfLifeBenchmark.benchmark 25 true thrpt 20 158.609 ± 2.007 ops/s GameOfLifeBenchmark.benchmark 25 false thrpt 20 163.086 ± 1.434 ops/s Change the rules of the game and you’ll have a different winner

What about native compilation? Swing/AWT is not supported

What about native compilation? Swing/AWT is not supported Necessary native
compilation flag + Textual UI

Benchmarking native compilation vs. JDK BlockingQueue Native: Completed 5000 frames
in 62586ms = 79.890071 frames per second JDK: Completed 5000 frames in 35618ms = 140.378460 frames per second 2000 frames warmup + 5000 frames benchmarking

Benchmarking native compilation vs. JDK BlockingQueue Native: Completed 5000 frames
in 62586ms = 79.890071 frames per second JDK: Completed 5000 frames in 35618ms = 140.378460 frames per second OneToOneParking Native: Completed 5000 frames in 19681ms = 254.052131 frames per second JDK: Completed 5000 frames in 13398ms = 373.190028 frames per second 2000 frames warmup + 5000 frames benchmarking

Conclusions ❖ Virtual threads aren’t faster threads ➢ They don’t
magically execute more instructions per second than native ones do. ➢ What they are really good for is waiting (but pay attention to pinning). ➢ In other words their goal is maximizing the utilization of external resources and then improving throughput. ➢ This is the same goal of reactive programming, but in a much more readable and developers friendly way.

magically execute more instructions per second than native ones do. ➢ What they are really good for is waiting (but pay attention to pinning). ➢ In other words their goal is maximizing the utilization of external resources and then improving throughput. ➢ This is the same goal of reactive programming, but in a much more readable and developers friendly way. ❖ How good are virtual threads? ➢ They are really good in I/O bound scenarios. ➢ The work-stealing scheduler and lack of fairness due to the fact that they are not pre-emptive make them less suitable for CPU-bound scenarios. ➢ They are not drop in replacements for native threads in each and every situations.

magically execute more instructions per second than native ones do. ➢ What they are really good for is waiting (but pay attention to pinning). ➢ In other words their goal is maximizing the utilization of external resources and then improving throughput. ➢ This is the same goal of reactive programming, but in a much more readable and developers friendly way. ❖ How good are virtual threads? ➢ They are really good in I/O bound scenarios. ➢ The work-stealing scheduler and lack of fairness due to the fact that they are not pre-emptive make them less suitable for CPU-bound scenarios. ➢ They are not drop in replacements for native threads in each and every situations. Domain knowledge still matters … a lot!

References ❖ Original Game of Life CSP implementation ➢ https://github.com/ebarlas/game-of-life-csp
❖ My fork used in this presentation ➢ https://github.com/mariofusco/game-of-life-csp ❖ Loom and Thread Fairness ➢ https://www.morling.dev/blog/loom-and-thread-fairness/ ❖ From Imperative to Pure-Functional and Back Again: Monads vs. Scoped Continuations ➢ https://www.javacodegeeks.com/2015/08/from-imperative-to-pure-functional-and-back-again-monads-vs-scoped-continuations.html ❖ Coming to Java 19: Virtual threads and platform threads ➢ https://blogs.oracle.com/javamagazine/post/java-loom-virtual-threads-platform-threads ❖ Going inside Java’s Project Loom and virtual threads ➢ https://blogs.oracle.com/javamagazine/post/going-inside-javas-project-loom-and-virtual-threads ❖ Webtide’s Loom Report ➢ https://webtide.com/do-looms-claims-stack-up-part-1/ ➢ https://webtide.com/do-looms-claims-stack-up-part-2/ ❖ Java Mixedmode Flame Graphs ➢ https://www.slideshare.net/brendangregg/javaone-2015-java-mixedmode-flame-graphs ❖ Single Writer Principle ➢ https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html ❖ Red/Blue functions problem ➢ http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/ ➢ https://www.tedinski.com/2018/11/13/function-coloring.html ❖ JEP 429: Extent-Local Variables (Incubator) ➢ https://openjdk.org/jeps/429

Game of Loom: implementation patterns and perfo...

Game of Loom: implementation patterns and performance implications playing with virtual threads

More Decks by Mario Fusco

Other Decks in Programming

Featured

Transcript