GPU Computing for Radio Astronomy

John Ashley, NVIDIA

July 09, 2012

Transcript

  1. Agenda
     - Introductions
     - NVIDIA enables YOU to do better science
     - GPU HW: past, present, future
     - New CUDA features for high-throughput science
     - NVIDIA and Radio Astronomy (so far)
     - Question & Answer
  2. NVIDIA Compute Business Model
     - Leverage the volume graphics market to serve HPC: HPC needs outstrip the HPC market's ability to fund the development
     - Computational graphics and compute are highly aligned
     - High-efficiency parallel computing is core to NVIDIA: GeForce, Quadro, Tegra
  3. Heterogeneous Computing is Mainstream
     [Crossing-the-chasm diagram, scale 100,000's to 1,000,000's of applications and developers:
      early adopters: research universities, supercomputing centers, oil & gas, CAE, CFD, finance, rendering, data analytics, life sciences, defense, weather, climate, plasma physics;
      across the HPC chasm: finance, defense, government, higher-ed research, oil and gas, life sciences, manufacturing;
      applications: seismic processing, reservoir simulation, astrophysics, molecular dynamics, weather/climate, signal processing, satellite imaging, video analytics, bio-chemistry, bio-informatics, material science, genomics, risk analytics, Monte Carlo options pricing, insurance, structural mechanics, computational fluid dynamics, electromagnetics]
     GPU applications span all major industries. The number of applications and developers has reached the early majority: we are crossing the chasm.
  4. Worldwide GPU Supercomputer Momentum
     [Chart: number of GPU-accelerated systems on the Top500, 2006-2011, annotated with the Tesla GPU launch, the first double-precision GPU, and the Tesla 20-series (Fermi) launch]
  5. Tesla Near-Term Roadmap
     - Continued intense focus on perf/W/$
     - Ease of programming, porting, and tuning
     - Breadth of applicability: finer-grained parallel work, more CPU/GPU overlap; GPUs will work well on any app with high parallelism
     - Tighter integration of CPUs and GPUs, shared hierarchical memory, system interconnect
     [Chart: sustained double-precision GFLOPS per watt, 2008-2014, for Tesla, Fermi, Kepler, and Maxwell]
  6. The Performance Gap Widens Further (and Further)
     [Two charts, 2007-2012: peak memory bandwidth in GB/s and peak double-precision GFLOPS, comparing NVIDIA GPUs with ECC off (M1060, M2070, M2090, Kepler) against x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz)]
  7. Two GPUs Based on the Kepler Architecture
     - Kepler I (April 2012): optimized for single precision; 2X perf/W vs. Fermi; "Gemini" configuration (2 GPUs per card); 4 GB per GPU, 250 W per card. Target markets: seismic, defense, signal and image processing.
     - Kepler II (Q4 2012): high double-precision performance; 3X perf/W vs. Fermi; 1 GPU per board; up to 6 GB per GPU at 225 W. Target markets: traditional HPC, finance, manufacturing, etc.
  8. K10: 3072 Power-Efficient Cores
     | Product name            | M2090         | K10 (per GPU)  | K10 (per board) |
     |-------------------------|---------------|----------------|-----------------|
     | GPU architecture        | Fermi         | Kepler GK104   | Kepler GK104    |
     | # of GPUs               | 1             | 1              | 2               |
     | Single-precision flops  | 1.3 TF        | 2.29 TF        | 4.58 TF         |
     | Double-precision flops  | 0.66 TF       | 0.095 TF       | 0.190 TF        |
     | CUDA cores              | 512           | 1536           | 3072            |
     | Memory size             | 6 GB          | 4 GB           | 8 GB            |
     | Memory BW (ECC off)     | 177.6 GB/s    | 160 GB/s       | 320 GB/s        |
     | PCI-Express             | Gen 2: 8 GB/s | Gen 3: 16 GB/s | Gen 3: 16 GB/s  |
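     The single-precision entries are consistent with a simple peak-flops check: cores times two flops per fused multiply-add times clock rate. Assuming a K10 core clock of about 745 MHz (an assumption; the clock is not stated on the slide):

     $$1536 \times 2 \times 0.745\,\mathrm{GHz} \approx 2.29\ \mathrm{TFLOPS\ per\ GPU}, \qquad 2 \times 2.29 \approx 4.58\ \mathrm{TFLOPS\ per\ board}.$$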
  9. DARPA Study Identifies Four Challenges for Exascale Computing
     Report published September 28, 2008. Four major challenges:
     - Energy and power
     - Memory and storage
     - Concurrency and locality
     - Resiliency
     The number-one issue is power: extrapolations of current architectures and technology indicate over 100 MW for an exaflop. Power also constrains what we can put on a chip.
     Available at www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf
  10. 1. Power is the problem
      2. Data movement dominates power
      3. Optimize the storage hierarchy: tailor memory to the application
  11. Echelon Architecture (1/2)
      [Block diagram of the Echelon research architecture: a lane of 4 DFMA units with main and operand registers, LSIs, and L0 instruction/data caches delivers 20 GFLOPS; an SM groups 8 lanes behind a switch and L1 cache for 160 GFLOPS; a chip carries 128 SMs plus 8 latency processors on a NoC for 20.48 TFLOPS, with 1024 SRAM banks of 256 KB each (256 MB total), 1.4 TB/s DRAM bandwidth, and 150 GB/s network bandwidth; an MCM node combines the GPU chip with stacked DRAM and NV memory for 20 TFLOPS and 256 GB]
  12. CUDA by the Numbers
      - CUDA-capable GPUs: >375,000,000
      - Toolkit downloads: >1,000,000
      - Active developers: >120,000
      - Universities teaching CUDA: >500
  13. The Soul of CUDA: Our Driving Philosophy
      - Heterogeneous computing is the future; CUDA is the right platform for heterogeneous computing
      - The future: homogeneous program, heterogeneous execution
      - GPUs can take advantage of all types of application parallelism
      - Standard OS and language support for fine-grained parallelism (e.g. C++11)
      - Unified memory architecture at the GPU, node, and cluster levels
      - Continued focus on productivity and performance
      - Continued platform of choice for research and innovation
  14. GPU Computing Developer Ecosystem
      - Debuggers & profilers: cuda-gdb, NVIDIA Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
      - Numerical packages: MATLAB, Mathematica, NI LabView, pyCUDA
      - GPU compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
      - Parallelizing compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
      - Libraries & examples: SDK, BLAS, FFT, LAPACK, NPP, Sparse, Imaging, RNG (a CUFFT sketch follows below)
      - OEM solution providers; GPGPU consultants & training: ANEO, GPU Tech
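      As a taste of the library layer, here is a minimal CUFFT sketch: a batch of 1D complex-to-complex transforms, the building block of an FX correlator's F-engine. The sizes nfft and batch are illustrative choices, not values from the slides.

      ```cuda
      #include <cufft.h>
      #include <cuda_runtime.h>

      // Minimal CUFFT sketch: a batch of 1D complex-to-complex FFTs.
      // nfft and batch are illustrative values, not from the slides.
      int main() {
          const int nfft = 1024, batch = 64;

          cufftComplex *d_data;
          cudaMalloc(&d_data, sizeof(cufftComplex) * nfft * batch);

          cufftHandle plan;
          cufftPlan1d(&plan, nfft, CUFFT_C2C, batch);         // plan `batch` transforms of length nfft
          cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward transform

          cufftDestroy(plan);
          cudaFree(d_data);
          return 0;
      }
      ```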
  15. Hyper-Q: CPU Processes Simultaneously Run MPI Tasks on Kepler
      - Fermi: 1 MPI task at a time
      - Kepler: 32 simultaneous MPI tasks (see the streams sketch below)
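      Hyper-Q gives Kepler 32 hardware work queues, so work submitted from independent CUDA streams (or from independent MPI ranks) can occupy the GPU concurrently instead of serializing through a single queue as on Fermi. A minimal single-process sketch using streams; the kernel and sizes are illustrative:

      ```cuda
      #include <cstdio>
      #include <cuda_runtime.h>

      // Toy kernel that keeps the GPU busy for a while.
      __global__ void busy(float *x, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              for (int k = 0; k < 1000; ++k)
                  x[i] = x[i] * 0.999f + 1.0f;
      }

      int main() {
          const int n = 1 << 16, nstreams = 8;  // illustrative sizes
          float *d[nstreams];
          cudaStream_t s[nstreams];

          // Each stream is an independent work queue; on Kepler, Hyper-Q
          // lets these queues feed the GPU concurrently.
          for (int i = 0; i < nstreams; ++i) {
              cudaMalloc(&d[i], n * sizeof(float));
              cudaStreamCreate(&s[i]);
              busy<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
          }
          cudaDeviceSynchronize();

          for (int i = 0; i < nstreams; ++i) {
              cudaFree(d[i]);
              cudaStreamDestroy(s[i]);
          }
          printf("launched %d streams\n", nstreams);
          return 0;
      }
      ```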
  16. NVIDIA GPUDirect™ Now Supports RDMA
      [Diagram: two servers, each with a CPU, system memory, a network card, and two GPUs with GDDR5 memory on PCIe; with GPUDirect RDMA, the network card reads and writes GPU memory directly across the network, bypassing system memory]
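      In practice this is exposed through CUDA-aware MPI libraries (for example, builds of MVAPICH2 or OpenMPI with GPUDirect support), which accept device pointers directly. A minimal sketch, assuming such a build; run with two ranks:

      ```cuda
      #include <mpi.h>
      #include <cuda_runtime.h>

      // Sketch of a CUDA-aware MPI exchange: device pointers are passed
      // straight to MPI_Send/MPI_Recv; with GPUDirect RDMA the NIC moves
      // the bytes to and from GPU memory without a host staging copy.
      // Assumes an MPI library built with CUDA support (an assumption,
      // not something the slide specifies).
      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int n = 1 << 20;
          float *d_buf;
          cudaMalloc(&d_buf, n * sizeof(float));

          if (rank == 0)
              MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
          else if (rank == 1)
              MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          cudaFree(d_buf);
          MPI_Finalize();
          return 0;
      }
      ```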
  17. Opening the CUDA Platform with LLVM
      - CUDA compiler source is now available with the open-source LLVM compiler
      - SDK includes specification documentation, examples, and a verifier
      - Provides the ability for anyone to add CUDA support for new languages and new processors
      - Learn more at http://developer.nvidia.com/cuda-source
      [Diagram: CUDA C, C++, and Fortran feed the LLVM compiler for CUDA, which targets NVIDIA GPUs and x86 CPUs, with hooks for new language support and new processor support]
  18. Where Can GPUs Be Applied?
      - Cross-correlation: GPU correlators are ideal for large antenna counts. A high-performance open-source library was developed by Clark (NVIDIA)*; estimated Kepler II performance is similar to SGEMM (a high percentage of peak). A toy correlation kernel is sketched below.
      - Calibration and imaging: gridding (coordinate mapping of input data onto a regular grid) is the dominant time sink in the compute pipeline; exascale is required for SKA2. Other image-processing steps use CUFFT, a highly optimized Fast Fourier Transform library, and coordinate transformations are natural for a graphics processor.
      - Signal processing: for example, pulsar detection.
      * https://github.com/GPU-correlators/xGPU
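      To make the cross-correlation step concrete, here is a toy X-engine sketch (illustrative only, not xGPU's actual implementation): for one frequency channel it accumulates the visibility a_i * conj(a_j) for every antenna pair over a block of time samples. All names and sizes are assumptions.

      ```cuda
      #include <cuComplex.h>
      #include <cuda_runtime.h>

      // Toy X-engine kernel (illustrative; NOT the xGPU implementation).
      // For one frequency channel, accumulate the visibility a_i * conj(a_j)
      // for every antenna pair (i <= j) over ntime samples.
      //   in : ntime x nant complex voltages, time-major
      //   out: nant*(nant+1)/2 visibilities, upper triangle packed row by row
      __global__ void xcorr(const cuFloatComplex *in, cuFloatComplex *out,
                            int nant, int ntime) {
          int i = blockIdx.y;                             // antenna i
          int j = blockIdx.x * blockDim.x + threadIdx.x;  // antenna j
          if (j >= nant || i > j) return;

          cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
          for (int t = 0; t < ntime; ++t) {
              cuFloatComplex a = in[t * nant + i];
              cuFloatComplex b = in[t * nant + j];
              acc = cuCaddf(acc, cuCmulf(a, cuConjf(b)));  // a * conj(b)
          }
          // Packed index of pair (i, j): rows 0..i-1 hold nant-k entries each.
          int idx = i * nant - i * (i - 1) / 2 + (j - i);
          out[idx] = acc;
      }

      int main() {
          const int nant = 64, ntime = 1024;  // illustrative sizes
          const int nbl = nant * (nant + 1) / 2;
          cuFloatComplex *d_in, *d_out;
          cudaMalloc(&d_in, sizeof(cuFloatComplex) * ntime * nant);
          cudaMalloc(&d_out, sizeof(cuFloatComplex) * nbl);
          cudaMemset(d_in, 0, sizeof(cuFloatComplex) * ntime * nant);

          dim3 block(64), grid((nant + 63) / 64, nant);  // one antenna row per blockIdx.y
          xcorr<<<grid, block>>>(d_in, d_out, nant, ntime);
          cudaDeviceSynchronize();

          cudaFree(d_in);
          cudaFree(d_out);
          return 0;
      }
      ```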
  19. GPUs in Radio Astronomy
      Already an essential tool in radio astronomy:
      - ASKAP Stage 1 (Western Australia)
      - LEDA (United States of America)
      - LOFAR (Europe)
      - MWA (Western Australia)
      - PAPER (South Africa)
      [World map marking the LOFAR, LEDA, MWA, ASKAP, and PAPER sites]
  20. Key Points plus Q&A
      NVIDIA:
      - is excited to enable new science
      - is here to help
      - is in compute for the long haul (exascale and beyond)
      - is already a part of the radio astronomy ecosystem
      Contacts:
      - Jeremy Purches, Business Development Manager: [email protected]
      - Tim Lanfear, Solution Architect: [email protected]
      - Mike Clark, DevTech Astronomy: [email protected]
      - John Ashley, Solution Architect: [email protected]