Slide 1

A BRIEF OVERVIEW OF GPU COMPUTING
Christopher Knox
Dragonfly Science
Thursday, 7 February 13

Slide 2

OVERVIEW
• What is GPU Computing?
• What does the GPU computing ecosystem look like in 2013?
• How can you get started with GPU computing?

Slide 3

WHAT IS A GPU?
• (Originally) Hardware dedicated to turning triangles into colored pixels.
• (Now) A massively parallel, thread-based processor with fast memory and a large (fast) register file, separated from the main processor and memory by a slow bus.

Slide 4

THE GRAPHICS PIPELINE
• Geometry Stage: Lots of matrix multiplication
• Pixel Stage: Lots of array multiplication and subtraction with fast memory access

Slide 5

SOME GPGPU HISTORY
• The original GPUs were not programmable
• ~2003: the NV30 generation introduced limited programmability
• Required mapping computations onto triangles, textures, and pixels
• Easy to fall off the hardware's fast path; only fp16 registers were available
• Hard to use both the geometry and pixel blocks efficiently
• 2007: NVIDIA released CUDA

Slide 6

WHAT IS CUDA?
• Compute Unified Device Architecture ... (the expansion is not used now)
• Three different parts:
  • Hardware dedicated to compute (not graphics)
  • A driver layer to access the GPU as a compute device
  • C/C++ extensions and a compiler

Slide 7

CUDA HARDWARE
• Key limitations of graphics-pipeline threads:
  • Threads never talk to each other
  • Pixel-processing threads write to a pre-determined location
• CUDA hardware addresses these issues

Slide 8

GPU: GRAPHICS MODE

Slide 9

GPU: COMPUTE MODE

Slide 10

C/C++ EXTENSIONS AND THE CUDA COMPILER
• __global__, __device__, __host__ function specifiers
• Execution configuration: kernel<<<gridDim, blockDim>>>(...)
• Built-in variables in __global__ and __device__ functions:
  • dim3 gridDim
  • dim3 blockDim
  • uint3 blockIdx
  • uint3 threadIdx
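These extensions can be seen together in a few lines. Below is a minimal vector-add sketch, assuming an NVIDIA GPU and the CUDA toolkit (compiled with nvcc); all names (vecAdd, launchVecAdd) are illustrative:

```cuda
#include <cuda_runtime.h>

// __global__ marks a kernel: it runs on the device, launched from the host.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // The built-in variables locate this thread within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
}

void launchVecAdd(const float *a, const float *b, float *c, int n)
{
    dim3 block(256);                        // threads per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n
    // The execution configuration goes between <<< and >>>.
    vecAdd<<<grid, block>>>(a, b, c, n);
}
```

(The pointers passed in must refer to device memory, allocated with cudaMalloc and filled with cudaMemcpy.)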

Slide 11

SO WHY ISN’T IT EASY?
• grids, blocks, warps, and threads
• different classes of memory
• memory access patterns - memory coalescing
• keeping the GPU fed - register/shared memory pressure
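Coalescing, in particular, is about neighbouring threads in a warp touching neighbouring addresses. A sketch of the difference (illustrative names, assuming 1D float arrays in device memory):

```cuda
__global__ void gather(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Coalesced: consecutive threads read consecutive addresses, so a
    // warp's 32 loads combine into a few wide memory transactions.
    out[i] = in[i];

    // Strided (for contrast): consecutive threads would read addresses
    // `stride` elements apart, and each load can become its own
    // transaction - often several times slower.
    // out[i] = in[(i * stride) % n];
}
```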

Slide 12

FEEDING A STREAMING MULTIPROCESSOR

Slide 13

TWO STRATEGIES FOR AVOIDING IDLE PROCESSORS
CPU
• Context switching is costly
• Avoid context switching
• CPUs are very smart
  • Branch prediction, out-of-order execution, prefetch
GPU
• Context switching is cheap
• Context switching is everything
• GPUs (SMs) are dumb
  • None of that stuff - it's your job!

Slide 14

WARPS: THREADS DON’T LIVE ALONE
• A warp (32 threads) is the atomic unit of scheduling
• The 32 threads of a warp execute in lock-step
• All threads in a warp execute every branch taken by any thread in the warp
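Lock-step execution means a data-dependent branch that splits a warp is serialized. A sketch (illustrative names):

```cuda
__global__ void branchy(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // threadIdx.x % 2 splits every warp: the hardware runs the `if`
    // side with the odd lanes masked off, then the `else` side with
    // the even lanes masked off - the warp pays for both paths.
    if (threadIdx.x % 2 == 0)
        x[i] = 2.0f * x[i];
    else
        x[i] = x[i] + 1.0f;
}
```

When whole warps take the same branch (for example, branching on blockIdx.x), there is no divergence penalty.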

Slide 15

HOW TO SYNCHRONIZE 2688 CORES?
• You can’t! (Well, at least not with reasonable efficiency ...)
• You can only (efficiently) communicate between threads running on the same streaming multiprocessor
• Problems need to be decomposed into an SM-friendly structure
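Within a single SM, the threads of one block can communicate through shared memory and synchronize with __syncthreads(). The classic sketch is a block-level sum reduction (assumes a block size of 256, a power of two):

```cuda
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];       // per-block scratch memory on the SM
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                 // all loads visible to the whole block

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];    // one partial sum per block
}
```

Combining the per-block partial sums takes a second kernel launch (or atomics) - precisely because blocks on different SMs cannot synchronize with each other.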

Slide 16

STREAMING MULTIPROCESSORS

Slide 17

GRIDS, BLOCKS, WARPS, AND THREADS
• Grid: Entire 1D, 2D, or 3D block of threads to execute
• Block: Sub-domain of the grid that can be loaded into a single SM
• Thread: Kernel is executed once per thread
• Warp: A bundle of threads scheduled together
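The decomposition itself is only integer arithmetic. This Python sketch (names are illustrative) mirrors how a 1D launch covers a problem of size n, and the standard CUDA global-index idiom:

```python
def grid_size(n, block_dim):
    """Number of blocks needed so that every element gets a thread."""
    return (n + block_dim - 1) // block_dim   # ceiling division

def global_index(block_idx, block_dim, thread_idx):
    """The CUDA 1D idiom: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx

# 4 blocks of 256 threads cover n = 1000 (the last 24 threads are idle,
# which is why kernels guard with `if (i < n)`).
blocks = grid_size(1000, 256)
ids = [global_index(b, 256, t) for b in range(blocks) for t in range(256)]
```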

Slide 18

BALANCING THE BLOCK SIZE
• Block is too large:
  • Won’t fit onto an SM
  • Restricts the amount of shared memory and register space available to a given thread
  • Reduces thread occupancy
• Block is too small:
  • Not enough warps available, leading to idle time
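The trade-off shows up in back-of-envelope occupancy arithmetic. The limits below (1536 resident threads and 8 resident blocks per SM) are Fermi-era assumptions for illustration; real occupancy is further limited by register and shared-memory use:

```python
def occupancy(block_size, max_threads_per_sm=1536, max_blocks_per_sm=8):
    """Fraction of an SM's thread slots filled, from block size alone."""
    resident_blocks = min(max_threads_per_sm // block_size, max_blocks_per_sm)
    return resident_blocks * block_size / max_threads_per_sm

small = occupancy(64)    # too small: 8 blocks x 64 threads fill only 1/3
good = occupancy(192)    # 8 blocks x 192 threads fill the SM completely
big = occupancy(1024)    # too large: only one block fits, 1/3 of slots idle
```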

Slide 19

ALTERNATIVES TO CUDA
• OpenCL: Standardized, portable(ish)
• DirectCompute
• OpenGL (4.3) Compute Shaders

Slide 20

IF YOU DON’T HAVE TO WRITE CUDA THEN DON’T!
• Microsoft C++ Accelerated Massive Parallelism (C++ AMP)
  • http://msdn.microsoft.com/en-us/library/vstudio/hh265137.aspx
  • parallel_for_each function
• OpenACC Compiler Directives
  • http://www.openacc-standard.org/
  • #pragma acc ...
• PGI Compilers
  • http://www.pgroup.com/
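With OpenACC the loop stays ordinary C and a directive asks the compiler to offload it. A sketch (the pragma takes effect under an OpenACC-aware compiler such as PGI's; a plain C compiler ignores it and runs the loop serially):

```c
#include <stddef.h>

/* saxpy: y = a*x + y. Under an OpenACC compiler the pragma moves the
 * arrays to the GPU and parallelizes the loop; otherwise it is a no-op. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```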

Slide 21

GETTING CLEVERER
• Numba and NumbaPro
  • http://docs.continuum.io/numbapro/index.html
  • http://numba.pydata.org/
  • http://continuum.io/
• Run NumPy code on a GPU via a decorator
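The decorator style looks like this sketch (requires the numba package; in 2013 the GPU target came with NumbaPro, so the CPU target is shown here and the names are illustrative):

```python
import numpy as np
from numba import vectorize

# One scalar expression; the decorator compiles it into a NumPy ufunc.
# With NumbaPro the same code could target the GPU instead of the CPU.
@vectorize(['float32(float32, float32)'], target='cpu')
def saxpy(x, y):
    return 2.0 * x + y

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
result = saxpy(x, y)   # runs the compiled ufunc elementwise
```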

Slide 22

GETTING STARTED
• CUDA Zone
  • https://developer.nvidia.com/category/zone/cuda-zone
• Minimal requirement:
  • A GPU and the CUDA toolkit
• Amazon Cloud Instances