Slide 1

Slide 1 text

Heterogeneous Parallel Computing: K20 & CUDA 5
MW13 Conference, 19 February 2013, Wellington
Thanks to Prof Manuel Ujaldon of the University of Malaga and Dr Michael Dinneen of the University of Auckland for providing illustrations for, and reviewing, the paper.

Slide 2

Slide 2 text

Content
1. HPC Introduction (2)
2. CUDA Hardware (3)
3. CUDA Software (4)
4. Advances in HPC (3)
5. High Level Compilation (1)
6. Digital Content Creation (2)

Slide 3

Slide 3 text

1- HPC Introduction, Top500
Source of data: www.top500.org for November 2012

Rank  Site                    System               Cores      Rmax (TFLOPS)  Power (kW)
1     Oak Ridge National Lab  Titan, Cray XK7      560,640    17,590         8,209
2     Lawrence Livermore NL   Sequoia, BlueGene/Q  1,572,864  16,324         7,890
3     RIKEN AICS              K computer, SPARC64  705,024    10,510         12,660
4     Argonne National Lab    Mira, BlueGene/Q     786,432    8,162          3,945
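One way to read the Top500 figures above is as energy efficiency, i.e. Rmax divided by power. The slide does not state this metric, so treating it as the intended comparison is an assumption; the numbers themselves are copied from the table. Titan is the only one of the four that pairs CPUs with GPU accelerators. A minimal sketch:

```python
# Top500 rows as quoted above (November 2012): Rmax in TFLOPS, power in kW.
systems = [
    ("Titan", 17590, 8209),
    ("Sequoia", 16324, 7890),
    ("K computer", 10510, 12660),
    ("Mira", 8162, 3945),
]


def gflops_per_watt(rmax_tflops, power_kw):
    # TFLOPS per kW has the same numeric value as GFLOPS per W
    # (numerator and denominator are both scaled by 1000).
    return rmax_tflops / power_kw


efficiency = {name: gflops_per_watt(rmax, kw) for name, rmax, kw in systems}

# Print the systems from most to least energy efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12} {value:.2f} GFLOPS/W")
```

Ranking by this ratio rather than by raw Rmax is what makes the case for heterogeneous designs: the question becomes how much sustained performance each watt buys.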

Slide 4

Slide 4 text

2- HPC Introduction, heterogeneous

Slide 5

Slide 5 text

3- CUDA Compute Capabilities
• Desirable Targets
  – Highest performance
  – Lowest consumption
  – Cheapest price
  – Easiest to program

Architecture  Date  CCC  Multiprocessors  Cores/MP  Total cores
G80           2006  1.0  16               8         128
GT200         2008  1.2  30               8         240
Fermi GF100   2010  2.0  16               32        512
Fermi GF104   2011  2.1  7                48        336
Kepler GK104  2012  3.0  8                192       1536
Kepler GK110  2013  3.5  15               192       2880
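The total core count for each generation above is the number of multiprocessors times the cores per multiprocessor. The short check below recomputes it from the slide's figures; the function and variable names are illustrative only.

```python
# (architecture, multiprocessors, cores per multiprocessor, total cores as listed)
generations = [
    ("G80", 16, 8, 128),
    ("GT200", 30, 8, 240),
    ("Fermi GF100", 16, 32, 512),
    ("Fermi GF104", 7, 48, 336),
    ("Kepler GK104", 8, 192, 1536),
    ("Kepler GK110", 15, 192, 2880),
]


def total_cores(multiprocessors, cores_per_mp):
    # Every multiprocessor carries the same number of cores.
    return multiprocessors * cores_per_mp


for name, mps, per_mp, listed in generations:
    computed = total_cores(mps, per_mp)
    status = "ok" if computed == listed else "mismatch"
    print(f"{name:<13} {mps:>2} x {per_mp:>3} = {computed:>4} ({status})")
```

The same product applies to a part with fewer multiprocessors enabled: `total_cores(13, 192)` gives 2496.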

Slide 6

Slide 6 text

4- CUDA Hardware, Memory
DRAM refers to Global Memory

Slide 7

Slide 7 text

5- CUDA Hardware, Cores
• K20
  – 13 multiprocessors, each with 192 processors
  – 3x of Fermi
  – TDP is 225 W, the same for 3 generations

Slide 8

Slide 8 text

6- CUDA Software Libraries & Compilers

Slide 9

Slide 9 text

7- CUDA Software
• For OpenCL, replace
  – NVCC ~ OpenCL
  – PTX Code ~ GPU Code
  – CUDA GPU ~ OpenCL GPU
• PTX ~ OpenCL
Expectations, Dinneen findings, PTX versus OpenCL, Standards

Slide 10

Slide 10 text

8- CUDA Software
• Dinneen suggested
  – from PTX to OpenMP
  – from OpenMP to OpenACC

Slide 11

Slide 11 text

9- CUDA Software
PTX code, OpenCL
Swan & Cat
http://gpuocelot.gatech.edu
http://multiscalelab.org/swan

Slide 12

Slide 12 text

10- Advances in HPC
Hyper-Q
- Remove bottleneck
- Improve utilisation

Slide 13

Slide 13 text

11- Advances in HPC
Dynamic Parallelism
- Coarse Grid: low performance
- Fine Grid: high power
- Dynamic Grid: highest performance for lowest power
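The trade-off between the three grids can be sketched by counting work units. The model below is a plain CPU illustration, not CUDA code: the workload function, grid sizes and threshold are all hypothetical. On a GPU with dynamic parallelism, the refinement step would correspond to a parent kernel launching child kernels only for the cells that need them.

```python
def activity(x):
    # Hypothetical workload: a sharp feature near x = 0.5, nearly flat elsewhere.
    return 1.0 / (1.0 + 1000.0 * (x - 0.5) ** 2)


def cell_centres(n):
    # Centres of n equal cells covering the interval [0, 1].
    return [(i + 0.5) / n for i in range(n)]


def work_uniform(n):
    # One unit of work per cell on a uniform grid.
    return n


def work_dynamic(coarse_n, refine_factor, threshold):
    # Start from the coarse grid and subdivide only the "interesting" cells.
    work = 0
    for x in cell_centres(coarse_n):
        if activity(x) > threshold:
            work += refine_factor  # cell is replaced by refine_factor sub-cells
        else:
            work += 1              # cell stays coarse
    return work


COARSE, FACTOR, THRESHOLD = 16, 16, 0.2  # illustrative values only

coarse = work_uniform(COARSE)            # cheap, but misses detail
fine = work_uniform(COARSE * FACTOR)     # full detail everywhere, most work
dynamic = work_dynamic(COARSE, FACTOR, THRESHOLD)

print(f"coarse={coarse} fine={fine} dynamic={dynamic}")
```

With these values the dynamic grid resolves the feature at fine resolution while doing far less work than the uniform fine grid, which is the sense of "highest for lowest" on the slide.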

Slide 14

Slide 14 text

12- Advances in HPC
Titan

Slide 15

Slide 15 text

13- High Level Compilation
• Skip complexity of lower level
  – Optimiser with GUI to hide GPU complexity
  – COTS examples and Nicolescu finding
Optimiser GUI, Compiler, GPU Hardware
Trade off between
• GPU Memory Hierarchy
• Kernel Allocation
• CPU-GPU Coordination
Efforts: CUDA Chill matched CUBLAS, hCUDA, PyUBLAS, etc.

Slide 16

Slide 16 text

14- Digital Content Creation
• Not Another Molecular Dynamics simulation (NAMD)
  – noted for its parallel efficiency
  – often used to simulate large systems (100 million atoms)
  – developed by the University of Illinois in 1995
  – since matured and scalable to thousands of processors
Latest stable version is 2.9.
Viruses are very small intra-cellular parasites that invade the cells of virtually all known organisms. They reproduce by utilizing the cell's machinery to replicate viral proteins and genomic material, generally damaging or killing the host cell in the process.
Source: University of Illinois NAMD

Slide 17

Slide 17 text

15- Digital Content Creation
• Another Dimension of parallelism
  – Visualisation by Quadro
  – Computation by Tesla
Maximus

Slide 18

Slide 18 text

Summary – Heterogeneous Parallel Computing
• Very Young & Not Proprietary
  – Fast architectural progress
• Knowledge is King
  – Least programming effort
  – Least cost
  – Max performance