TN Chan

Heterogeneous Parallel Computing
K20 & CUDA5
MW13 Conference (Multicore World 2013), 19 February 2013, Wellington

Thanks to Prof Manuel Ujaldon of the University of Malaga and Dr Michael Dinneen of the University of Auckland for providing illustrations for, and reviewing, the paper.

Content
1. HPC Introduction (2)
2. CUDA Hardware (3)
3. CUDA Software (4)
4. Advances in HPC (3)
5. High Level Compilation (1)
6. Digital Content Creation (2)

1- HPC Introduction, Top500
Source of data: www.top500.org for November 2012

Rank  Site                     System               Cores      Rmax (TFLOPS)  Power (kW)
1     Oak Ridge National Lab   Titan - Cray XK7     560,640    17,590         8,209
2     Lawrence Livermore NL    Sequoia - BlueGene/Q 1,572,864  16,324         7,890
3     RIKEN AICS               K computer - SPARC64 705,024    10,510         12,660
4     Argonne National Lab     Mira - BlueGene/Q    786,432    8,162          3,945
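The table invites a comparison of energy efficiency. The following is a minimal sketch that derives performance per watt and per core from the figures quoted on the slide; the names `TOP4`, `gflops_per_watt` and `gflops_per_core` are illustrative and not part of the original deck.

```python
# Figures copied from the slide (Top500, November 2012).
# Each entry: (system, Rmax in TFLOPS, cores, power in kW)
TOP4 = [
    ("Titan", 17_590, 560_640, 8_209),
    ("Sequoia", 16_324, 1_572_864, 7_890),
    ("K computer", 10_510, 705_024, 12_660),
    ("Mira", 8_162, 786_432, 3_945),
]


def gflops_per_watt(rmax_tflops, power_kw):
    """TFLOPS per kW equals GFLOPS per W, since both units scale by 1000."""
    return rmax_tflops / power_kw


def gflops_per_core(rmax_tflops, cores):
    """Convert TFLOPS to GFLOPS, then divide by the listed core count."""
    return rmax_tflops * 1000 / cores


efficiency = {name: gflops_per_watt(rmax, kw) for name, rmax, _cores, kw in TOP4}
per_core = {name: gflops_per_core(rmax, cores) for name, rmax, cores, _kw in TOP4}

for name in efficiency:
    print(f"{name:<12} {efficiency[name]:.2f} GFLOPS/W  {per_core[name]:.1f} GFLOPS/core")
```

On these figures Titan, the GPU-accelerated system, comes out highest per watt (about 2.14 GFLOPS/W) and the K computer lowest (about 0.83 GFLOPS/W).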

2- HPC Introduction, heterogeneous

3- CUDA Compute Capabilities
• Desirable Targets
– Highest performance
– Lowest consumption
– Cheapest price
– Easiest to program

Architecture   Date  CCC  Multi-Proc  Core/MP  Total Core
G80            2006  1.0  16          8        128
GT200          2008  1.2  30          8        240
Fermi GF100    2010  2.0  16          32       512
Fermi GF104    2011  2.1  7           48       336
Kepler GK104   2012  3.0  8           192      1536
Kepler GK110   2013  3.5  15          192      2880
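Each "Total Core" entry in the table is the product of the multiprocessor count and the cores per multiprocessor. A short sketch, using only the values from the slide, checks that relationship and the overall growth in core count; the names `GPUS` and `total_cores` are illustrative assumptions.

```python
# Values copied from the slide's table.
# Each entry: (architecture, year, compute capability, multiprocessors, cores per MP, total cores)
GPUS = [
    ("G80", 2006, "1.0", 16, 8, 128),
    ("GT200", 2008, "1.2", 30, 8, 240),
    ("Fermi GF100", 2010, "2.0", 16, 32, 512),
    ("Fermi GF104", 2011, "2.1", 7, 48, 336),
    ("Kepler GK104", 2012, "3.0", 8, 192, 1536),
    ("Kepler GK110", 2013, "3.5", 15, 192, 2880),
]


def total_cores(multiprocessors, cores_per_mp):
    """Total cores on a chip = number of multiprocessors x cores in each."""
    return multiprocessors * cores_per_mp


# Collect any row whose listed total disagrees with the product.
mismatches = [
    name
    for name, _year, _ccc, mps, per_mp, listed in GPUS
    if total_cores(mps, per_mp) != listed
]

# Growth in core count from the first to the last architecture listed.
growth = GPUS[-1][5] / GPUS[0][5]

print("rows that disagree with MP x cores/MP:", mismatches)
print(f"core count growth, G80 to GK110: {growth}x")
```

The same product gives the K20 figure on slide 5: 13 multiprocessors of 192 cores each is 2496 cores.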

4- CUDA Hardware, Memory
DRAM refers to Global Memory

5- CUDA Hardware, Cores
• K20
– 13 multiprocessors each with 192 processors
– 3x of Fermi
– TDP is 225W, same for 3 generations

6- CUDA Software
Libraries & Compilers

7- CUDA Software
• For OpenCL, replace
– NVCC ~ OpenCL
– PTX Code ~ GPU Code
– CUDA GPU ~ OpenCL GPU
• PTX ~ OpenCL Expectations Dinneen findings PTX versus OpenCL Standards

8- CUDA Software
• Dinneen suggested from PTX to OpenMP
• From OpenMP to OpenACC

9- CUDA Software
PTX code, OpenCL: Swan & Cat
http://gpuocelot.gatech.edu
http://multiscalelab.org/swan

10- Advances in HPC
Hyper-Q
– Remove bottleneck
– Improve utilisation

11- Advances in HPC
Dynamic Parallelism
– Coarse Grid: low performance
– Fine Grid: high power
– Dynamic Grid: highest for lowest
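The trade-off between the three grids can be illustrated on the CPU. The sketch below is only an analogy in plain Python, not CUDA code and not the Dynamic Parallelism API: it integrates a hypothetical sharply peaked function `f` on a coarse uniform grid, a fine uniform grid, and a grid refined only where the function changes quickly. On a Kepler GPU the refinement step would correspond to a parent kernel launching child grids for the cells that need more work. All function names, the test function, and the tolerance are assumptions made for the illustration.

```python
import math


def f(x):
    # Hypothetical workload: nearly flat everywhere except a sharp peak at x = 0.5.
    return math.exp(-((x - 0.5) / 0.01) ** 2)


def uniform_cells(n):
    """Split [0, 1] into n equal cells."""
    return [(i / n, (i + 1) / n) for i in range(n)]


def refine(lo, hi, tol, min_width):
    """Halve a cell recursively, but only while f still changes by more than tol across it."""
    mid = (lo + hi) / 2
    change = max(abs(f(lo) - f(mid)), abs(f(mid) - f(hi)))
    if change <= tol or (hi - lo) <= min_width:
        return [(lo, hi)]
    return refine(lo, mid, tol, min_width) + refine(mid, hi, tol, min_width)


def trapezoid(cells):
    """Integrate f with one trapezoid per cell."""
    return sum((hi - lo) * (f(lo) + f(hi)) / 2 for lo, hi in cells)


coarse = uniform_cells(8)
fine = uniform_cells(1024)
# Start from the coarse grid and refine each cell down to the fine-grid width at most.
dynamic = [
    cell
    for lo, hi in coarse
    for cell in refine(lo, hi, tol=0.05, min_width=1 / 1024)
]

# Closed-form integral of the peak; the tails outside [0, 1] are negligible.
exact = 0.01 * math.sqrt(math.pi)

for label, cells in (("coarse", coarse), ("fine", fine), ("dynamic", dynamic)):
    error = abs(trapezoid(cells) - exact)
    print(f"{label:<8} cells={len(cells):>5}  error={error:.2e}")
```

The dynamic grid uses far fewer cells than the fine grid while being much more accurate than the coarse one, which is one reading of "highest for lowest" on the slide.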

12- Advances in HPC
Titan

13- High Level Compilation
• Skip complexity of lower level
– Optimiser with GUI to hide GPU complexity
– COTS examples and Nicolescu finding
Optimiser GUI, Compiler, GPU Hardware
Trade off between
• GPU Memory Hierarchy
• Kernel Allocation
• CPU-GPU Coordination
Efforts: CUDA Chill matched CUBLAS, hCUDA, PyUBLAS, etc

14- Digital Content Creation
• Not Another Molecular Dynamics simulation (NAMD)
– noted for its parallel efficiency
– often used to simulate large systems (100 million atoms)
– developed by University of Illinois in 1995
– since matured and scalable to thousands of processors
– latest stable version is 2.9
Viruses are very small intra-cellular parasites that invade the cells of virtually all known organisms. They reproduce by utilizing the cell's machinery to replicate viral proteins and genomic material, generally damaging or killing the host cell in the process.
Source: University of Illinois NAMD

15- Digital Content Creation
• Another Dimension of parallelism
– Visualisation by Quadro
– Computation by Tesla
Maximus

Summary – Heterogeneous Parallel Computing
• Very Young & Not Proprietary
– Fast architectural progress
• Knowledge is King
– Least programming effort
– Least costs
– Max performance
