TN Chan

Heterogeneous Parallel Computing with K20 & CUDA5

Multicore World 2013

February 16, 2013

Transcript

  1. Heterogeneous Parallel Computing: K20 & CUDA5
    MW13 Conference, 19 February 2013, Wellington
    Thanks to Prof Manuel Ujaldon of the University of Malaga and Dr Michael Dinneen of the University of Auckland for providing illustrations and reviewing the paper.
  2. Content
    1. HPC Introduction (2)
    2. CUDA Hardware (3)
    3. CUDA Software (4)
    4. Advances in HPC (3)
    5. High Level Compilation (1)
    6. Digital Content Creation (2)
  3. 1- HPC Introduction, Top500
    Source of data: www.top500.org for November 2012

    Rank  Site                    System                Cores      Rmax (TFLOPS)  Power (kW)
    1     Oak Ridge National Lab  Titan - Cray XK7      560,640    17,590         8,209
    2     Lawrence Livermore NL   Sequoia - BlueGene/Q  1,572,864  16,324         7,890
    3     RIKEN AICS              K computer SPARC64    705,024    10,510         12,660
    4     Argonne National Lab    Mira - BlueGene/Q     786,432    8,162          3,945
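The Rmax and power columns above can be combined into an energy-efficiency figure. The following is a minimal Python sketch, not part of the original slides; the helper name `mflops_per_watt` is hypothetical, and the numbers are copied from the table as printed.

```python
# Top500 figures for November 2012, as listed on the slide.
# Each entry: (system, Rmax in TFLOPS, power in kW)
systems = [
    ("Titan", 17590, 8209),
    ("Sequoia", 16324, 7890),
    ("K computer", 10510, 12660),
    ("Mira", 8162, 3945),
]


def mflops_per_watt(rmax_tflops, power_kw):
    """Energy efficiency in MFLOPS/W.

    TFLOPS divided by kW equals GFLOPS/W, so multiply by 1000 for MFLOPS/W.
    (Hypothetical helper, introduced only for this illustration.)
    """
    return rmax_tflops / power_kw * 1000


efficiency = {name: mflops_per_watt(rmax, power) for name, rmax, power in systems}

# Print the systems from most to least efficient.
for name, value in sorted(efficiency.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {value:8.1f} MFLOPS/W")
```

On these figures the GPU-accelerated Titan comes out ahead of the CPU-only K computer in performance per watt, which is the heterogeneous-computing argument the deck goes on to make.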
  4. 3- CUDA Compute Capabilities
    • Desirable Targets
      – Highest performance
      – Lowest consumption
      – Cheapest price
      – Easiest to program

    Architecture  Date  CCC  Multiprocessors  Cores/MP  Total cores
    G80           2006  1.0  16               8         128
    GT200         2008  1.2  30               8         240
    Fermi GF100   2010  2.0  16               32        512
    Fermi GF104   2011  2.1  7                48        336
    Kepler GK104  2012  3.0  8                192       1536
    Kepler GK110  2013  3.5  15               192       2880
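The total-core column is simply the number of multiprocessors times the cores per multiprocessor. A short Python sketch, not part of the original slides, that checks the printed figures against that product; the helper `total_cores` is a hypothetical name used only here.

```python
# Rows from the slide: (architecture, year, compute capability,
# multiprocessors, cores per multiprocessor, total cores as printed)
gpus = [
    ("G80", 2006, "1.0", 16, 8, 128),
    ("GT200", 2008, "1.2", 30, 8, 240),
    ("Fermi GF100", 2010, "2.0", 16, 32, 512),
    ("Fermi GF104", 2011, "2.1", 7, 48, 336),
    ("Kepler GK104", 2012, "3.0", 8, 192, 1536),
    ("Kepler GK110", 2013, "3.5", 15, 192, 2880),
]


def total_cores(multiprocessors, cores_per_mp):
    """Total CUDA cores on a chip (hypothetical helper for this illustration)."""
    return multiprocessors * cores_per_mp


# Collect any architecture whose printed total disagrees with the product.
mismatches = [
    name
    for name, _year, _ccc, mps, per_mp, printed in gpus
    if total_cores(mps, per_mp) != printed
]

for name, _year, ccc, mps, per_mp, _printed in gpus:
    print(f"{name:13s} CCC {ccc}: {mps} x {per_mp} = {total_cores(mps, per_mp)}")
print("Rows that disagree with the slide:", mismatches)
```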
  5. 5- CUDA Hardware, Cores
    • K20
      – 13 multiprocessors, each with 192 processors (2,496 cores in total)
      – 3x of Fermi
      – TDP is 225 W, the same for 3 generations
  6. 7- CUDA Software
    • For OpenCL, replace
      – NVCC ~ OpenCL
      – PTX Code ~ GPU Code
      – CUDA GPU ~ OpenCL GPU
    • PTX ~ OpenCL
      – Expectations
      – Dinneen findings
    PTX versus OpenCL Standards
  7. 11- Advances in HPC: Dynamic Parallelism
    – Coarse grid: low performance
    – Fine grid: high power
    – Dynamic grid: highest performance for lowest power
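The trade-off between the three grids can be illustrated with a small host-side Python sketch. This is not CUDA code and not from the original slides: recursion stands in, loosely, for a parent kernel launching child grids on demand, and the refinement criterion, the region [0.4, 0.5], and all function names are assumptions made for illustration.

```python
def needs_refinement(x0, x1):
    """Hypothetical criterion: refine cells overlapping the region [0.4, 0.5]."""
    return x0 < 0.5 and x1 > 0.4


def dynamic_grid(x0, x1, depth, max_depth):
    """Return the leaf cells of an adaptively refined 1D grid over [x0, x1].

    A cell is halved only if it needs refinement and max_depth has not been
    reached, so fine cells appear only near the region of interest.
    """
    if depth == max_depth or not needs_refinement(x0, x1):
        return [(x0, x1)]
    mid = (x0 + x1) / 2
    return (dynamic_grid(x0, mid, depth + 1, max_depth)
            + dynamic_grid(mid, x1, depth + 1, max_depth))


def uniform_grid(n):
    """Return n equal cells covering [0, 1]."""
    return [(i / n, (i + 1) / n) for i in range(n)]


coarse = uniform_grid(4)                 # few cells, region of interest poorly resolved
fine = uniform_grid(64)                  # finest resolution everywhere, most work
dynamic = dynamic_grid(0.0, 1.0, 0, 6)   # finest resolution only where needed

print("coarse cells: ", len(coarse))
print("fine cells:   ", len(fine))
print("dynamic cells:", len(dynamic))
```

The dynamic grid reaches the same smallest cell width as the fine grid inside the region of interest while creating far fewer cells overall, which is the sense in which it aims for the highest performance at the lowest cost.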
  8. 13- High Level Compilation
    • Skip complexity of lower level
      – Optimiser with GUI to hide GPU complexity
      – COTS examples and Nicolescu finding
    Diagram labels: Optimiser GUI, Compiler, GPU Hardware
    Trade-off between
      • GPU Memory Hierarchy
      • Kernel Allocation
      • CPU-GPU Coordination
    Efforts: CUDA-CHiLL matched CUBLAS, hCUDA, PyUBLAS, etc.
  9. 14- Digital Content Creation
    • Not Another Molecular Dynamics simulation (NAMD)
      – noted for its parallel efficiency
      – often used to simulate large systems (100 million atoms)
      – developed by the University of Illinois in 1995
      – since matured and scalable to thousands of processors; the latest stable version is 2.9

    Viruses are very small intra-cellular parasites that invade the cells of virtually all known organisms. They reproduce by utilizing the cell's machinery to replicate viral proteins and genomic material, generally damaging or killing the host cell in the process.
    Source: University of Illinois NAMD
  10. 15- Digital Content Creation
    • Another dimension of parallelism: Maximus
      – Visualisation by Quadro
      – Computation by Tesla
  11. Summary – Heterogeneous Parallel Computing
    • Very Young & Not Proprietary
      – Fast architectural progress
    • Knowledge is King
      – Least programming effort
      – Least cost
      – Max performance