GPU Computing for Radio Astronomy

John Ashley, NVIDIA

July 09, 2012

Transcript

  1. Agenda
     - Introductions
     - NVIDIA enables YOU to do better science
     - GPU HW: past, present, future
     - New CUDA features for high-throughput science
     - NVIDIA and Radio Astronomy (so far)
     - Question & Answer
  2. NVIDIA Compute Business Model
     - Leverage the volume graphics market to serve HPC: HPC needs outstrip the HPC market's ability to fund the development
     - Computational graphics and compute are highly aligned
     - High-efficiency parallel computing is core to NVIDIA: GeForce, Quadro, Tegra
  3. Heterogeneous Computing is Mainstream
     [Crossing-the-chasm diagram, scale 100,000's to 1,000,000's of applications and developers:
      early adopters: research universities, supercomputing centers, oil & gas, CAE, CFD, finance, rendering, data analytics, life sciences, defense, weather, climate, plasma physics;
      across the HPC chasm: finance, defense, government, higher-ed research, oil and gas, life sciences, manufacturing;
      applications: seismic processing, reservoir simulation, astrophysics, molecular dynamics, weather/climate, signal processing, satellite imaging, video analytics, bio-chemistry, bio-informatics, material science, genomics, risk analytics, Monte Carlo options pricing, insurance, structural mechanics, computational fluid dynamics, electromagnetics]
     GPU applications span all major industries. The number of applications and developers has reached the early majority: we are crossing the chasm.
  4. Worldwide GPU Supercomputer Momentum
     [Chart: number of GPU-accelerated systems on the Top500, 2006-2011, annotated with the Tesla GPU launch, the first double-precision GPU, and the Tesla 20-series (Fermi) launch]
  5. Tesla Near-Term Roadmap
     - Continued intense focus on perf/W/$
     - Ease of programming, porting, and tuning
     - Breadth of applicability: finer-grained parallel work, more CPU/GPU overlap; GPUs will work well on any app with high parallelism
     - Tighter integration of CPUs and GPUs, shared hierarchical memory, system interconnect
     [Chart: sustained double-precision GFLOPS per watt, 2008-2014, for Tesla, Fermi, Kepler, and Maxwell]
  6. The Performance Gap Widens Further (and Further)
     [Two charts, 2007-2012: peak memory bandwidth in GB/s and peak double-precision GFLOPS, comparing NVIDIA GPUs with ECC off (M1060, M2070, M2090, Kepler) against x86 CPUs (Nehalem 3 GHz, Westmere 3 GHz, 8-core Sandy Bridge 3 GHz)]
  7. Two GPUs Based on the Kepler Architecture
     - Kepler I (April 2012): optimized for single precision; 2X perf/W vs. Fermi; "Gemini" configuration (2 GPUs per card); 4 GB per GPU, 250 W per card. Target markets: seismic, defense, signal and image processing.
     - Kepler II (Q4 2012): high double-precision performance; 3X perf/W vs. Fermi; 1 GPU per board; up to 6 GB per GPU at 225 W. Target markets: traditional HPC, finance, manufacturing, etc.
  8. K10: 3072 Power-Efficient Cores
     | Product name            | M2090         | K10 (per GPU)  | K10 (per board) |
     |-------------------------|---------------|----------------|-----------------|
     | GPU architecture        | Fermi         | Kepler GK104   | Kepler GK104    |
     | # of GPUs               | 1             | 1              | 2               |
     | Single-precision flops  | 1.3 TF        | 2.29 TF        | 4.58 TF         |
     | Double-precision flops  | 0.66 TF       | 0.095 TF       | 0.190 TF        |
     | CUDA cores              | 512           | 1536           | 3072            |
     | Memory size             | 6 GB          | 4 GB           | 8 GB            |
     | Memory BW (ECC off)     | 177.6 GB/s    | 160 GB/s       | 320 GB/s        |
     | PCI-Express             | Gen 2: 8 GB/s | Gen 3: 16 GB/s | Gen 3: 16 GB/s  |
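     The single-precision entries are consistent with a simple peak-flops check: cores times two flops per fused multiply-add times clock rate. Assuming a K10 core clock of about 745 MHz (an assumption; the clock is not stated on the slide):

     $$1536 \times 2 \times 0.745\,\mathrm{GHz} \approx 2.29\ \mathrm{TFLOPS\ per\ GPU}, \qquad 2 \times 2.29 \approx 4.58\ \mathrm{TFLOPS\ per\ board}.$$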
  9. DARPA Study Identifies Four Challenges for Exascale Computing
     Report published September 28, 2008. Four major challenges:
     - Energy and power
     - Memory and storage
     - Concurrency and locality
     - Resiliency
     The number-one issue is power: extrapolations of current architectures and technology indicate over 100 MW for an exaflop. Power also constrains what we can put on a chip.
     Available at www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf
  10. 1. Power is the problem
      2. Data movement dominates power
      3. Optimize the storage hierarchy: tailor memory to the application
  11. Echelon Architecture (1/2)
      [Block diagram of the Echelon research architecture: a lane of 4 DFMA units with main and operand registers, LSIs, and L0 instruction/data caches delivers 20 GFLOPS; an SM groups 8 lanes behind a switch and L1 cache for 160 GFLOPS; a chip carries 128 SMs plus 8 latency processors on a NoC for 20.48 TFLOPS, with 1024 SRAM banks of 256 KB each (256 MB total), 1.4 TB/s DRAM bandwidth, and 150 GB/s network bandwidth; an MCM node combines the GPU chip with stacked DRAM and NV memory for 20 TFLOPS and 256 GB]
  12. CUDA by the Numbers
      - CUDA-capable GPUs: >375,000,000
      - Toolkit downloads: >1,000,000
      - Active developers: >120,000
      - Universities teaching CUDA: >500
  13. The Soul of CUDA: Our Driving Philosophy
      - Heterogeneous computing is the future; CUDA is the right platform for heterogeneous computing
      - The future: homogeneous program, heterogeneous execution
      - GPUs can take advantage of all types of application parallelism
      - Standard OS and language support for fine-grained parallelism (e.g. C++11)
      - Unified memory architecture at the GPU, node, and cluster levels
      - Continued focus on productivity and performance
      - Continued platform of choice for research and innovation
  14. GPU Computing Developer Ecosystem
      - Debuggers & profilers: cuda-gdb, NVIDIA Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
      - Numerical packages: MATLAB, Mathematica, NI LabView, pyCUDA
      - GPU compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
      - Parallelizing compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
      - Libraries & examples: SDK, BLAS, FFT, LAPACK, NPP, Sparse, Imaging, RNG (a CUFFT sketch follows below)
      - OEM solution providers; GPGPU consultants & training: ANEO, GPU Tech
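      As a taste of the library layer, here is a minimal CUFFT sketch: a batch of 1D complex-to-complex transforms, the building block of an FX correlator's F-engine. The sizes nfft and batch are illustrative choices, not values from the slides.

      ```cuda
      #include <cufft.h>
      #include <cuda_runtime.h>

      // Minimal CUFFT sketch: a batch of 1D complex-to-complex FFTs.
      // nfft and batch are illustrative values, not from the slides.
      int main() {
          const int nfft = 1024, batch = 64;

          cufftComplex *d_data;
          cudaMalloc(&d_data, sizeof(cufftComplex) * nfft * batch);

          cufftHandle plan;
          cufftPlan1d(&plan, nfft, CUFFT_C2C, batch);         // plan `batch` transforms of length nfft
          cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place forward transform

          cufftDestroy(plan);
          cudaFree(d_data);
          return 0;
      }
      ```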
  15. Hyper-Q: CPU Processes Simultaneously Run MPI Tasks on Kepler
      - Fermi: 1 MPI task at a time
      - Kepler: 32 simultaneous MPI tasks (see the streams sketch below)
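      Hyper-Q gives Kepler 32 hardware work queues, so work submitted from independent CUDA streams (or from independent MPI ranks) can occupy the GPU concurrently instead of serializing through a single queue as on Fermi. A minimal single-process sketch using streams; the kernel and sizes are illustrative:

      ```cuda
      #include <cstdio>
      #include <cuda_runtime.h>

      // Toy kernel that keeps the GPU busy for a while.
      __global__ void busy(float *x, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              for (int k = 0; k < 1000; ++k)
                  x[i] = x[i] * 0.999f + 1.0f;
      }

      int main() {
          const int n = 1 << 16, nstreams = 8;  // illustrative sizes
          float *d[nstreams];
          cudaStream_t s[nstreams];

          // Each stream is an independent work queue; on Kepler, Hyper-Q
          // lets these queues feed the GPU concurrently.
          for (int i = 0; i < nstreams; ++i) {
              cudaMalloc(&d[i], n * sizeof(float));
              cudaStreamCreate(&s[i]);
              busy<<<(n + 255) / 256, 256, 0, s[i]>>>(d[i], n);
          }
          cudaDeviceSynchronize();

          for (int i = 0; i < nstreams; ++i) {
              cudaFree(d[i]);
              cudaStreamDestroy(s[i]);
          }
          printf("launched %d streams\n", nstreams);
          return 0;
      }
      ```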
  16. NVIDIA GPUDirect™ Now Supports RDMA
      [Diagram: two servers, each with a CPU, system memory, a network card, and two GPUs with GDDR5 memory on PCIe; with GPUDirect RDMA, the network card reads and writes GPU memory directly across the network, bypassing system memory]
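      In practice this is exposed through CUDA-aware MPI libraries (for example, builds of MVAPICH2 or OpenMPI with GPUDirect support), which accept device pointers directly. A minimal sketch, assuming such a build; run with two ranks:

      ```cuda
      #include <mpi.h>
      #include <cuda_runtime.h>

      // Sketch of a CUDA-aware MPI exchange: device pointers are passed
      // straight to MPI_Send/MPI_Recv; with GPUDirect RDMA the NIC moves
      // the bytes to and from GPU memory without a host staging copy.
      // Assumes an MPI library built with CUDA support (an assumption,
      // not something the slide specifies).
      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          const int n = 1 << 20;
          float *d_buf;
          cudaMalloc(&d_buf, n * sizeof(float));

          if (rank == 0)
              MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
          else if (rank == 1)
              MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          cudaFree(d_buf);
          MPI_Finalize();
          return 0;
      }
      ```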
  17. Opening the CUDA Platform with LLVM
      - CUDA compiler source is now available with the open-source LLVM compiler
      - SDK includes specification documentation, examples, and a verifier
      - Provides the ability for anyone to add CUDA support for new languages and new processors
      - Learn more at http://developer.nvidia.com/cuda-source
      [Diagram: CUDA C, C++, and Fortran feed the LLVM compiler for CUDA, which targets NVIDIA GPUs and x86 CPUs, with hooks for new language support and new processor support]
  18. Where Can GPUs Be Applied?
      - Cross-correlation: GPU correlators are ideal for large antenna counts. A high-performance open-source library was developed by Clark (NVIDIA)*; estimated Kepler II performance is similar to SGEMM (a high percentage of peak). A toy correlation kernel is sketched below.
      - Calibration and imaging: gridding (coordinate mapping of input data onto a regular grid) is the dominant time sink in the compute pipeline; exascale is required for SKA2. Other image-processing steps use CUFFT, a highly optimized Fast Fourier Transform library, and coordinate transformations are natural for a graphics processor.
      - Signal processing: for example, pulsar detection.
      * https://github.com/GPU-correlators/xGPU
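      To make the cross-correlation step concrete, here is a toy X-engine sketch (illustrative only, not xGPU's actual implementation): for one frequency channel it accumulates the visibility a_i * conj(a_j) for every antenna pair over a block of time samples. All names and sizes are assumptions.

      ```cuda
      #include <cuComplex.h>
      #include <cuda_runtime.h>

      // Toy X-engine kernel (illustrative; NOT the xGPU implementation).
      // For one frequency channel, accumulate the visibility a_i * conj(a_j)
      // for every antenna pair (i <= j) over ntime samples.
      //   in : ntime x nant complex voltages, time-major
      //   out: nant*(nant+1)/2 visibilities, upper triangle packed row by row
      __global__ void xcorr(const cuFloatComplex *in, cuFloatComplex *out,
                            int nant, int ntime) {
          int i = blockIdx.y;                             // antenna i
          int j = blockIdx.x * blockDim.x + threadIdx.x;  // antenna j
          if (j >= nant || i > j) return;

          cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
          for (int t = 0; t < ntime; ++t) {
              cuFloatComplex a = in[t * nant + i];
              cuFloatComplex b = in[t * nant + j];
              acc = cuCaddf(acc, cuCmulf(a, cuConjf(b)));  // a * conj(b)
          }
          // Packed index of pair (i, j): rows 0..i-1 hold nant-k entries each.
          int idx = i * nant - i * (i - 1) / 2 + (j - i);
          out[idx] = acc;
      }

      int main() {
          const int nant = 64, ntime = 1024;  // illustrative sizes
          const int nbl = nant * (nant + 1) / 2;
          cuFloatComplex *d_in, *d_out;
          cudaMalloc(&d_in, sizeof(cuFloatComplex) * ntime * nant);
          cudaMalloc(&d_out, sizeof(cuFloatComplex) * nbl);
          cudaMemset(d_in, 0, sizeof(cuFloatComplex) * ntime * nant);

          dim3 block(64), grid((nant + 63) / 64, nant);  // one antenna row per blockIdx.y
          xcorr<<<grid, block>>>(d_in, d_out, nant, ntime);
          cudaDeviceSynchronize();

          cudaFree(d_in);
          cudaFree(d_out);
          return 0;
      }
      ```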
  19. GPUs in Radio Astronomy
      Already an essential tool in radio astronomy:
      - ASKAP Stage 1 (Western Australia)
      - LEDA (United States of America)
      - LOFAR (Europe)
      - MWA (Western Australia)
      - PAPER (South Africa)
      [World map marking the LOFAR, LEDA, MWA, ASKAP, and PAPER sites]
  20. Key Points plus Q&A
      NVIDIA:
      - is excited to enable new science
      - is here to help
      - is in compute for the long haul (exascale and beyond)
      - is already a part of the radio astronomy ecosystem
      Contacts:
      - Jeremy Purches, Business Development Manager: [email protected]
      - Tim Lanfear, Solution Architect: [email protected]
      - Mike Clark, DevTech Astronomy: [email protected]
      - John Ashley, Solution Architect: [email protected]