A brief overview of GPU computing

Getting into GPU computing is often both promising and frustrating. I worked at NVIDIA in Finland and was an early user of CUDA. This talk provides a brief overview of the current ecosystem and the strengths and weaknesses of the key technologies.

Christopher Knox

February 11, 2013

Transcript

  1. A BRIEF OVERVIEW OF GPU COMPUTING
    Christopher Knox
    Dragonfly Science

  2. OVERVIEW
    • What is GPU Computing?
    • What does the GPU computing ecosystem look like in 2013?
    • How can you get started with GPU computing?

  3. WHAT IS A GPU?
    • (Originally) Hardware dedicated to turning triangles into colored pixels.
    • (Now) A massively parallel, thread-based processor with fast memory and a large (fast) register file, separated from the main processor and memory by a slow bus.

  4. THE GRAPHICS PIPELINE
    • Geometry Stage: Lots of matrix multiplication
    • Pixel Stage: Lots of array multiplication and subtraction with fast memory access

  5. SOME GPGPU HISTORY
    • The original GPUs were not programmable
    • ~2003 NV30 generation introduced limited programmability
    • Required mapping onto triangles, textures, and pixels
    • Easy to fall off the hardware fast path; only had fp16 registers
    • Hard to efficiently use both geometry and pixel blocks
    • 2007 NVIDIA released CUDA

  6. WHAT IS CUDA?
    • Compute Unified Device Architecture ... (the expanded name is no longer used)
    • 3 different parts
    • Hardware dedicated to compute (not graphics)
    • Driver layer to access GPU as compute device
    • C/C++ extensions and a compiler

  7. CUDA HARDWARE
    • Key features of graphics pipeline threads:
      • Threads never talk to each other
      • Pixel processing threads write to a pre-determined location
    • CUDA hardware addresses these issues

  8. GPU: GRAPHICS MODE

  9. GPU: COMPUTE MODE

  10. C/C++ EXTENSIONS AND THE CUDA COMPILER
    • __global__, __device__, __host__ function specifiers
    • execution configuration: kernel<<<...>>>(...)
    • built-in variables available in __global__ and __device__ functions:
    dim3 gridDim
    dim3 blockDim
    uint3 blockIdx
    uint3 threadIdx
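
    As a minimal sketch of how these pieces fit together (mine, not from the deck; the kernel and variable names are illustrative):

    // saxpy.cu -- uses the specifiers, built-in variables, and launch syntax listed above
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        // one thread per element: recover the global index from the built-in variables
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      // guard the last, partially filled block
            y[i] = a * x[i] + y[i];
    }

    // host-side launch; the execution configuration goes between <<< and >>>
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);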

  11. SO WHY ISN’T IT EASY?
    • grids, blocks, warps, and threads
    • different classes of memory
    • memory access patterns - memory coalescing
    • keeping the GPU fed - register/shared memory pressure
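
    A small illustration of coalescing (my sketch, not from the deck): when the threads of a warp read consecutive addresses the hardware can merge them into a few wide memory transactions; a strided pattern cannot be merged and wastes bandwidth.

    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];                        // neighbouring threads touch neighbouring floats
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * stride) % n];         // threads in a warp hit widely separated addresses
    }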

  12. FEEDING A STREAMING MULTIPROCESSOR

  13. TWO STRATEGIES FOR AVOIDING IDLE PROCESSORS
    CPU
    • Context switching is costly
    • Avoid context switching
    • CPUs are very smart
    • Branch prediction, out-of-order execution, prefetch
    GPU
    • Context switching is cheap
    • Context switching is everything
    • GPUs (SMs) are dumb
    • None of that stuff - it’s your job!

  14. WARPS: THREADS DON’T LIVE ALONE
    • A warp (32 threads) is the atomic unit of scheduling
    • 32 threads execute in lock-step
    • All threads in a warp execute all branches
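
    A short sketch of what “all threads execute all branches” means in practice (mine, not from the deck):

    __global__ void divergent(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Threads of the same warp disagree on this condition, so the warp runs
        // both bodies one after the other, with part of the warp masked off each time.
        if (threadIdx.x % 2 == 0)
            data[i] *= 2.0f;
        else
            data[i] += 1.0f;

        // Branching at warp granularity (e.g. on threadIdx.x / 32) avoids the penalty.
    }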

  15. HOW TO SYNCHRONIZE 2688 CORES?
    • You can’t! (Well, at least not with reasonable efficiency ...)
    • You can only (efficiently) communicate between threads running on the same streaming multiprocessor
    • Problems need to be decomposed into an SM-friendly structure
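
    For example (my sketch, assuming a block of 256 threads): threads within one block can cooperate through shared memory and __syncthreads(), but combining the per-block results still needs a second kernel launch or atomics.

    __global__ void block_sum(const float *in, float *block_results, int n)
    {
        __shared__ float partial[256];             // assumes blockDim.x == 256

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                           // everyone in the block has written

        // tree reduction, synchronizing only within the block
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            block_results[blockIdx.x] = partial[0];    // one value per block
    }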

  16. STREAMING MULTIPROCESSORS

  17. GRIDS, BLOCKS, WARPS, AND THREADS
    • Grid: Entire 1D, 2D, or 3D block of threads to execute
    • Block: Sub-domain of grid that can be loaded into a single SM
    • Thread: Kernel is executed once per thread
    • Warp: A bundle of threads scheduled together
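
    A sketch of how a 2D problem maps onto this hierarchy (mine; the names are illustrative):

    __global__ void scale_image(float *img, int width, int height, float s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the full grid
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the full grid
        if (x < width && y < height)
            img[y * width + x] *= s;
    }

    // dim3 block(16, 16);                               // 256 threads = 8 warps per block
    // dim3 grid((width + 15) / 16, (height + 15) / 16); // enough blocks to cover the image
    // scale_image<<<grid, block>>>(d_img, width, height, 2.0f);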

  18. BALANCING THE BLOCK SIZE
    • Block is too large
      • Won’t fit onto an SM
      • Restricts the amount of shared memory and register space available to a given thread
      • Thread occupancy
    • Block is too small
      • Not enough warps available, leading to idle time
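
    A rough illustration (the numbers are hypothetical): a block of 256 threads is 8 warps, and the block's total register and shared-memory use, not the thread count alone, determines how many blocks an SM can hold at once.

    // grid size from problem size and block size (ceiling division)
    static inline dim3 grid_for(int n, int block_size)
    {
        return dim3((n + block_size - 1) / block_size);
    }

    // kernel<<<grid_for(n, 256), 256>>>(...);   // 8 warps per block, a common starting point
    // kernel<<<grid_for(n, 32),  32>>>(...);    // only 1 warp per block: little latency hiding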

  19. ALTERNATIVES TO CUDA
    • OpenCL: Standardized, portable(ish)
    • DirectCompute
    • OpenGL (4.3) Compute Shaders

  20. IF YOU DON’T HAVE TO WRITE CUDA THEN DON’T!
    • Microsoft C++ Accelerated Massive Parallelism (C++ AMP)
    • http://msdn.microsoft.com/en-us/library/vstudio/hh265137.aspx
    • parallel_for_each function
    • OpenACC Compiler Directives
    • http://www.openacc-standard.org/
    • #pragma acc ...
    • PGI Compilers
    • http://www.pgroup.com/
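
    As an illustration of the directive approach (my sketch; the function is made up), OpenACC lets a compiler such as PGI's generate and launch the GPU code from an ordinary C loop:

    /* plain C plus a directive; built with an OpenACC compiler, e.g. pgcc -acc */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc kernels        /* the compiler offloads this region to the GPU */
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }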

  21. GETTING CLEVERER
    • Numba and NumbaPro
    • http://docs.continuum.io/numbapro/index.html
    • http://numba.pydata.org/
    • http://continuum.io/
    • Run NumPy code on a GPU via a decorator

  22. GETTING STARTED
    • CUDA Zone
    • https://developer.nvidia.com/category/zone/cuda-zone
    • Minimal requirement
    • GPU and the CUDA toolkit
    • Amazon Cloud Instances
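
    To make the minimal requirement concrete, here is a sketch of a complete first program (mine, not from the deck); with a CUDA-capable GPU and the toolkit installed it builds with nvcc:

    // hello.cu -- build with: nvcc hello.cu -o hello
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void add_one(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1;
    }

    int main(void)
    {
        const int n = 1024;
        int host[1024];
        for (int i = 0; i < n; ++i) host[i] = i;

        int *dev;
        cudaMalloc((void **)&dev, n * sizeof(int));                       // allocate device memory
        cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);   // copy across the slow bus

        add_one<<<(n + 255) / 256, 256>>>(dev, n);                        // 4 blocks of 256 threads

        cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);   // bring the results back
        cudaFree(dev);

        printf("host[0] = %d, host[%d] = %d\n", host[0], n - 1, host[n - 1]);
        return 0;
    }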