
A brief overview of GPU computing

Getting into GPU computing is often both promising and frustrating. I worked at NVIDIA in Finland and was an early user of CUDA. This talk gives a brief overview of the current ecosystem and of the strengths and weaknesses of the key technologies.


Christopher Knox

February 11, 2013

Transcript

  1. OVERVIEW • What is GPU Computing? • What does the GPU computing ecosystem look like in 2013? • How can you get started with GPU computing?
  2. WHAT IS A GPU? • (Originally) Hardware dedicated to turning triangles into colored pixels. • (Now) A massively parallel, thread-based processor with fast memory and a large (fast) register file, separated from the main processor and memory by a slow bus.
  3. THE GRAPHICS PIPELINE • Geometry Stage: Lots of matrix multiplication • Pixel Stage: Lots of array multiplication and subtraction with fast memory access
  4. SOME GPGPU HISTORY • The original GPUs were not programmable • ~2003: the NV30 generation introduced limited programmability • Required mapping problems onto triangles, textures, and pixels • Easy to fall off the hardware path; only fp16 registers were available • Hard to use both the geometry and pixel blocks efficiently • 2007: NVIDIA released CUDA
  5. WHAT IS CUDA? • Compute Unified Device Architecture ... (the expanded name is no longer used) • 3 different parts • Hardware dedicated to compute (not graphics) • A driver layer to access the GPU as a compute device • C/C++ extensions and a compiler
  6. CUDA HARDWARE • Key restrictions of graphics pipeline threads • Threads never talk to each other • Pixel processing threads write to a pre-determined location • CUDA hardware addresses these restrictions
  7. C/C++ EXTENSIONS AND THE CUDA COMPILER • __global__, __device__, __host__ function specifiers • execution configuration: kernel<<<dim3 grid, dim3 block>>>(...) • built-in variables in __global__ and __device__ functions: dim3 gridDim, dim3 blockDim, uint3 blockIdx, uint3 threadIdx
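
The snippet below is a minimal sketch of how those pieces fit together (the kernel and array names are mine, not from the slides): a __global__ kernel that uses the built-in index variables, plus the execution configuration used to launch it.

    // saxpy.cu - illustrative kernel, compiled with nvcc
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        // built-in variables locate this thread within the launch
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                        // guard: the grid is usually rounded up past n
            y[i] = a * x[i] + y[i];
    }

    // host side, execution configuration <<<grid, block>>>:
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
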
  8. SO WHY ISN’T IT EASY? • grids, blocks, warps, and threads • different classes of memory • memory access patterns - memory coalescing • keeping the GPU fed - register/shared memory pressure
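
To make the coalescing point concrete, here is an illustrative pair of kernels (names invented): when consecutive threads of a warp read consecutive addresses, the loads coalesce into a few wide transactions; a strided pattern turns the same work into many more transactions.

    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];       // neighbouring threads touch neighbouring words
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];       // each warp's accesses are scattered: more memory transactions
    }
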
  9. TWO STRATEGIES FOR AVOIDING IDLE PROCESSORS • CPU: Context switching is costly • Avoid context switching • CPUs are very smart • Branch prediction, out-of-order execution, prefetch • GPU: Context switching is cheap • Context switching is everything • GPUs (SMs) are dumb • None of that stuff - it's your job!
  10. WARPS: THREADS DON’T LIVE ALONE • A warp (32 threads) is the atomic unit of scheduling • 32 threads execute in lock-step • All threads in a warp execute all branches
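
A small sketch of what "all threads execute all branches" costs (the kernels are illustrative, not from the talk): in the first kernel the lanes of a warp split on the condition, so the warp runs both paths with the inactive lanes masked off; in the second the condition is uniform per warp and nothing is wasted.

    __global__ void divergent(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)     // even and odd lanes of the same warp split here
            data[i] *= 2.0f;          // warp runs this path with odd lanes masked off,
        else
            data[i] += 1.0f;          // then this path with even lanes masked off
    }

    __global__ void uniform(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((i / warpSize) % 2 == 0)  // the branch is uniform across each warp
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;
    }
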
  11. HOW TO SYNCHRONIZE 2688 CORES? • You can’t! (Well, at least not with reasonable efficiency ...) • You can only (efficiently) communicate between threads running on the same streaming multiprocessor • Problems need to be decomposed into an SM-friendly structure
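
The practical consequence is that cooperation happens at block level, on one SM, via shared memory and __syncthreads(); combining results across blocks takes a second kernel launch or atomics. The per-block reduction below is an illustrative sketch (it assumes blockDim.x is a power of two).

    __global__ void block_sum(const float *in, float *partial, int n)
    {
        extern __shared__ float s[];            // shared memory lives on one SM
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                        // barrier over this block only

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            partial[blockIdx.x] = s[0];         // one value per block; reduce again in a later launch
    }

    // block_sum<<<num_blocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
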
  12. GRIDS, BLOCKS, WARPS, AND THREADS • Grid: The entire 1D, 2D, or 3D domain of threads to execute • Block: A sub-domain of the grid that can be loaded onto a single SM • Thread: The kernel is executed once per thread • Warp: A bundle of threads scheduled together
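
As a sketch of that decomposition for a 2D problem (the sizes are illustrative): the grid covers the whole domain, each block is one SM-sized tile, and each thread handles one element.

    __global__ void scale2d(float *a, int nx, int ny, float s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index in the grid
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index in the grid
        if (x < nx && y < ny)
            a[y * nx + x] *= s;
    }

    // dim3 block(16, 16);                               // 256 threads = 8 warps per block
    // dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    // scale2d<<<grid, block>>>(d_a, nx, ny, 2.0f);
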
  13. BALANCING THE BLOCK SIZE • Block is too large • Won’t fit onto an SM • Restricts the amount of shared memory and register space available to a given thread • Thread occupancy • Block is too small • Not enough warps available, leading to idle time
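
A rough worked example of the trade-off, with numbers that are only illustrative (per-SM limits differ between GPU generations; the kernel name, grid size, and pointer below are placeholders): assume 48 KB of shared memory per SM and 256-thread blocks.

    // 12 KB of shared memory per block -> 48/12 = 4 resident blocks = 32 warps to schedule
    // 32 KB of shared memory per block -> 48/32 = 1 resident block  =  8 warps to schedule
    // Fewer resident warps means fewer candidates to run while other warps wait on memory.
    size_t shmem = 12 * 1024;                            // hypothetical per-block request
    my_kernel<<<num_blocks, 256, shmem>>>(d_data);       // 3rd launch parameter = dynamic shared memory
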
  14. ALTERNATIVES TO CUDA • OpenCL: Standardized, portable(ish) • DirectCompute • OpenGL (4.3) Compute Shaders
  15. IF YOU DON’T HAVE TO WRITE CUDA THEN DON’T! • Microsoft C++ Accelerated Massive Parallelism (C++ AMP) • http://msdn.microsoft.com/en-us/library/vstudio/hh265137.aspx • parallel_for_each function • OpenACC Compiler Directives • http://www.openacc-standard.org/ • #pragma acc ... • PGI Compilers • http://www.pgroup.com/
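
As a sketch of the directive style the slide points to (the function name is mine): the loop stays ordinary C/C++, and a single OpenACC pragma, built with an OpenACC-capable compiler such as PGI's, asks the compiler to generate the GPU code and the data movement.

    void saxpy_acc(int n, float a, const float *x, float *y)
    {
        // The directive, not the loop body, expresses the parallelism and the data movement.
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
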
  16. GETTING CLEVERER • Numba and NumbaPro • http://docs.continuum.io/numbapro/index.html • http://numba.pydata.org/ • http://continuum.io/ • Run NumPy code on a GPU via a decorator
  17. GETTING STARTED • CUDA Zone • https://developer.nvidia.com/category/zone/cuda-zone • Minimal requirement: a CUDA-capable GPU and the CUDA toolkit • Amazon Cloud Instances