Slide 1

Slide 1 text

Towards Support for Heterogeneous Architectures on Managed Runtime Systems
Andy Nisbet, Research Fellow
Advanced Processor Technologies (APT) Group
School of Computer Science, University of Manchester, UK
[email protected]

Slide 2

Slide 2 text

Future Computing Challenges
• Greater performance demands
  – within lower power budgets
• Emergent diverse computational workloads
  – Computer vision, machine learning, big data
• Increasing architecture heterogeneity/diversity

Slide 3

Slide 3 text

Meeting Computing Challenges
• Need vertical full-stack optimisation to meet the demand for increased performance within lower power budgets
• Full-stack optimisation
  – provide abstractions and mechanisms for efficient exploitation of hardware diversity

Slide 4

Slide 4 text

Traditional VMs and Heterogeneity
• Write once, run anywhere (WORA) on heterogeneous platforms?
• CPUs: heterogeneity in power/performance
• GPUs: programmable compute accelerators
• FPGAs: fine-grained grid of logic
Programming optimisation & development effort, lowest to highest: CPUs, GPUs, FPGAs (long “compilation”)

Slide 5

Slide 5 text

Traditional VMs and Heterogeneity
• Write once, run anywhere (WORA) does not work at the moment with heterogeneous diversity!
• We need abstractions & mechanisms to enable efficient exploitation of diversity
• We need tools to help identify “vertical stack” optimisation opportunities
• We need codesign (modifying/developing sw/hw) to evaluate different sw/hw co-operations and optimise for these opportunities

Slide 6

Slide 6 text

Project BeeHive
• Provide abstractions & mechanisms for efficient vertical stack exploitation routes
• Tools to make it feasible to identify optimisation opportunities
  – simulation/profiling
• To prototype and implement solutions
  – Full stack solutions, from application down to hardware
  – BeeHive will reduce the level of expert knowledge required to perform full stack optimisation
• To investigate co-design tradeoffs between hardware versus software solutions
• https://github.com/beehive-lab

Slide 7

Slide 7 text

Java/OpenCL Acceleration Overview
• Enables Java to target architectures with OpenCL compute devices (James Clarkson) [1]
• JIT-generated OpenCL C from Java bytecode
[1] Heterogeneous Managed Runtime Systems: A Computer Vision Case Study, VEE16
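As a rough illustration of the kind of code a Java-to-OpenCL JIT could target, the sketch below shows a data-parallel Java loop. The class and method names are hypothetical, and the OpenCL C shown in the comment is hand-written for comparison only; it is not output from the system described in [1].

```java
public class VectorAddSketch {
    // Every loop iteration is independent of the others, so iteration i
    // could in principle be mapped to one OpenCL work-item.
    // A hand-written OpenCL C equivalent (illustrative assumption) might be:
    //   __kernel void add(__global const float* a,
    //                     __global const float* b,
    //                     __global float* c) {
    //       int i = get_global_id(0);
    //       c[i] = a[i] + b[i];
    //   }
    static float[] add(float[] a, float[] b) {
        float[] c = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    public static void main(String[] args) {
        float[] result = add(new float[] {1f, 2f}, new float[] {3f, 4f});
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

Loops with cross-iteration dependencies would not map this directly; the example is deliberately the simplest case.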

Slide 8

Slide 8 text

Java/OpenCL Acceleration Case Study: Kfusion
SLAM: 3D Simultaneous Localization & Mapping
Kfusion: SLAMBENCH

Slide 9

Slide 9 text

Java/OpenCL GPU SLAM
C++ and Java/OpenCL speedup over serial Java

Slide 10

Slide 10 text

APTsim: co-design Java/FPGAs
• MAST: a simple mechanism to use FPGAs [1]
  – Evaluate entire end-to-end usage of FPGAs in a system, not just trace simulation
  – Catch design errors as early as possible
  – 43x speedup on OpenJDK for pre-processing stages of Kfusion SLAM via JNI
• Fast event-driven simulation of Java hotpath
  – Instrumented execution generates events
  – FPGA models produce statistics/timing information
MAST: the scaling and bilateral filter pre-processing stages
[1] Heterogeneous Managed Runtime Systems: A Computer Vision Case Study, VEE16

Slide 11

Slide 11 text

APTsim: Hw/Sw Codesign (See [2])
MAST is the glue that enables easy exploitation of FPGA resources
• Acceleration
• Simulation
[2] The potential of dynamic binary modification and CPU/FPGA SoCs for simulation, FCCM17

Slide 12

Slide 12 text

Java/FPGA Accelerators Roadmap
• Earlier slides: manual use of JNI and MAST [1]
• Java/OpenCL: high-level synthesis to FPGA
  – Can already use the JNI & MAST mechanism
  – Need abstractions & mechanisms to use MAST FPGA accelerators directly & automatically
• Accelerators for SLAM computer vision
[1] Heterogeneous Managed Runtime Systems: A Computer Vision Case Study, VEE16

Slide 13

Slide 13 text

MaxSim/Co-Design
• Relates micro-architectural simulation events to VM and language features
• Enables research, development & evaluation of novel hardware extensions
  – Codesign for power & performance analysis of new VM features
[3] MaxSim: A Simulation Platform for Managed Applications, Rodchenko et al., ISPASS17 (best paper award)
[4] Type Information Elimination from Objects on Architectures with Tagged Pointers Support, Rodchenko et al., IEEE Trans. Comp. (in press)

Slide 14

Slide 14 text

MaxSim/Codesign
• Evaluate co-design benefits
  – VM/language aspects (max heap size)
  – Performance and power with McPAT
• Demonstrates use-cases of pointer tagging
• Lightweight micro-architectural profiling
  – Pointer tags used to store class type or object allocation site information (site estimation and class ID)
  – Fast simulation of microarchitecture performance and power estimation of all DaCapo benchmarks in < 1 day
• Profiling to identify co-design opportunities
  – Dynamic load elimination for array length retrieval
  – Store array length in pointer tag, use hardware support
• Supports object field layout optimisations
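The pointer-tagging idea can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the scheme from [4]: it assumes 48-bit addresses with the upper 16 bits of a 64-bit word free for a tag, and the class name, method names and bit widths are all hypothetical.

```java
public class PointerTagSketch {
    // Assumption: only the low 48 bits carry the address,
    // leaving the top 16 bits available for a tag.
    static final int ADDRESS_BITS = 48;
    static final long ADDRESS_MASK = (1L << ADDRESS_BITS) - 1;

    // Pack a 16-bit tag (e.g. a class ID or a short array length)
    // into the unused upper bits of the pointer.
    static long tag(long address, int tagValue) {
        return (address & ADDRESS_MASK)
                | ((long) (tagValue & 0xFFFF) << ADDRESS_BITS);
    }

    // Recover the tag without touching memory.
    static int tagOf(long tagged) {
        return (int) (tagged >>> ADDRESS_BITS);
    }

    // Strip the tag to obtain the address again.
    static long addressOf(long tagged) {
        return tagged & ADDRESS_MASK;
    }

    public static void main(String[] args) {
        long address = 0x7f00_1234_5678L;
        long tagged = tag(address, 0xBEEF);
        System.out.println(Long.toHexString(tagged));
        System.out.println(Integer.toHexString(tagOf(tagged)));
        System.out.println(Long.toHexString(addressOf(tagged)));
    }
}
```

Reading the tag is a shift on a value already in a register, which is why storing a class ID or array length there can remove a memory load. In software the tag must be masked off before dereferencing; hardware support for tagged pointers removes that step.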

Slide 15

Slide 15 text

MaxSim/CoDesign
• Micro-architecture characterization (statistics)
  – L2/L3 cache misses per kilo instructions
  – Energy spent in GC versus non-GC
  – Instructions per clock (IPC) …
• Fine-grained text output (see Rodchenko’s thesis)
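For readers unfamiliar with these statistics, the sketch below shows how they are conventionally derived from raw counters. The class, method names and input values are hypothetical and are not taken from MaxSim's output format.

```java
public class MicroarchStatsSketch {
    // Cache misses per kilo instructions: misses per 1000 retired instructions.
    static double mpki(long misses, long instructions) {
        return misses * 1000.0 / instructions;
    }

    // Instructions per clock: retired instructions divided by elapsed cycles.
    static double ipc(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    // Fraction of total energy that was spent in garbage collection.
    static double gcEnergyShare(double gcJoules, double nonGcJoules) {
        return gcJoules / (gcJoules + nonGcJoules);
    }

    public static void main(String[] args) {
        // Made-up counter values, for illustration only.
        System.out.println(mpki(5_000, 1_000_000));
        System.out.println(ipc(2_000_000, 1_000_000));
        System.out.println(gcEnergyShare(1.0, 3.0));
    }
}
```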

Slide 16

Slide 16 text

MaxSim
• HSS using class information pointer elimination using pointer tagging
• Array length elimination using tagging

Slide 17

Slide 17 text

Underlying MaxineVM Status/Stability
• MaxineVM simplicity/flexibility means it is feasible to prototype new ideas
  – APTsim hotpath instrumentation
  – Integration with ZSim to enable MaxSim
• x86-64, ARMv7 32-bit
• SPECjvm2008 and DaCapo-9.12-bach
• x86-64 inspector to examine runtime VM state
• ARMv7 low-level debugging support: method-id with gdb (“to appear in VMIL17”)
• Performance circa 2x slower than HotSpot

Slide 18

Slide 18 text

Beehive: https://github.com/beehive-lab
• MaxineVM
  – x86-64, ARMv7, ARMv8 in progress …
  – Updating to JDK8/Graal for performance boost
• OpenCL for heterogeneous execution
  – GPUs and FPGAs in progress
• MaxSim: MaxineVM + ZSim
  – Software simulation/codesign for managed workloads
• MAMBO (Pin-like tool for ARM)
  – Dynamic binary instrumentation for ARMv7 & ARMv8
• APTsim = MAST + instrumentation
  – FPGA accelerated simulation
  – MAST support for FPGA IP/accelerators

Slide 19

Slide 19 text

Compilers & Runtimes: Ioanna-Maria Alifieraki, James Clarkson, Timothy Hartley, Andy Nisbet, Andrey Rodchenko, Foivos Zakkak, Cosmin Gorgovan
Hardware & Simulators: Nikos Foutris, Will Toms, Oscar Palomar, John Mawer
Academic Staff: Javier Navaridas, Christos Kotselidis, Jim Garside, Antoniu Pop, John Goodacre, Mikel Lujan
FPGAs: John Mawer, Oscar Palomar, Athanasios Stratikopoulos

Slide 20

Slide 20 text

Questions?