A Brief Introduction to GPGPU Computing with Nvidia CUDA

A very brief and (mostly) nontechnical introduction, for non-computer scientists, to the origins and concepts behind GPGPU computing with Nvidia CUDA.

Originally presented internally as a Learning Lunch for the IHMC Robot Lab group (http://ihmc.us). These Learning Lunches are highly interactive sessions and are unfortunately not usually recorded, so a great deal of information from this one-hour talk is not in the slide deck. But the general idea is still there.

Original Keynote slides available here: http://d.pr/f/xwoH

Doug Stephen

June 07, 2012

Transcript

  1. HETEROGENEOUS COMPUTING With Nvidia CUDA

     Notes: Heterogeneous computing is the use of several distinct specialized hardware tools in a single computing system in order to accelerate complex calculations. This talk will deal with how GPGPU programming integrates with this idea.

  2-5. MOORE'S LAW: "The number of transistors on consumer integrated circuits doubles around every two years." (Source: Wikipedia.) The Intel "tick-tock." Limited by reality (physics is kind of a jackass).

     Notes: On tick-tock: Intel creates a new chip one year, then improves that same chip the next year, then a new chip, back and forth. Tick, tock. Limited by reality: thermal constraints, electromagnetic interference as transistors get closer together, etc.

  6. PARALLEL PROGRAMMING IS HARD, YOU GUYS. Decades of algorithms and theory are based on serial, iterative computing (cf. graph theory, memoization, dynamic programming, numerical analysis, linear programming…). Parallelism introduces nondeterministic bottlenecks (process-forking or thread-spawning overhead, scheduling, resource contention, time-slicing…). Data access and I/O become an algorithmic concern on top of being a performance-overhead concern.

  7-12. *OKAY, MAYBE NOT THAT BAD AT MATH: General-purpose CPUs have to do things other than math, and their hardware reflects this. Jack of all trades, master of none. CPUs are great with integers; too bad there aren't a lot of integers in the real world. CPUs are good at math. They're just not good at fast math.

     Notes: CPUs have to deal with general-purpose instruction sets, like moving data between registers and interfacing with a multitude of daughter-card interfaces. Some modern CPUs are starting to incorporate hardware that improves their performance on this kind of work (e.g. SSE), but it is still not as powerful as dedicated hardware.

  13-14. I'M GOING FAST!

     float Q_rsqrt( float number )
     {
         long i;
         float x2, y;
         const float threehalfs = 1.5F;

         x2 = number * 0.5F;
         y  = number;
         i  = * ( long * ) &y;                       // evil floating point bit level hacking
         i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
         y  = * ( float * ) &i;
         y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
      // y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

         return y;
     }

     This is the original "Fast Inverse Square Root" as it appears in Quake III, including original comments.

     Notes: Fast Inverse Square Root -> Fast *RECIPROCAL* Square Root (poorly named, IMO). Useful for normalizing vectors, etc.

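     GPUs expose this operation directly in hardware. As a minimal CUDA sketch of where the talk is headed (the kernel name, data layout, and launch configuration are illustrative, not from the talk), the built-in rsqrtf intrinsic can normalize an array of 3-D vectors:

     // Minimal sketch: normalize n xyz-vectors stored flat as (x0, y0, z0, x1, ...).
     // rsqrtf() is CUDA's hardware reciprocal square root, the operation Q_rsqrt approximates.
     __global__ void normalize3(float *v, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;     // one thread per vector
         if (i < n) {
             float x = v[3*i], y = v[3*i + 1], z = v[3*i + 2];
             float r = rsqrtf(x*x + y*y + z*z);             // 1 / length
             v[3*i] = x*r; v[3*i + 1] = y*r; v[3*i + 2] = z*r;
         }
     }
     // Launched as, e.g.: normalize3<<<(n + 255) / 256, 256>>>(d_vectors, n);
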
  15-17. WHAT THE F**K INDEED. This is graphics code, and graphics programming is deeply rooted in linear algebra and matrix theory. Graphics programmers get tired of patting themselves on the back for being "clever." The GPU is born.

  18. GPUs ARE THE DR. PEPPER TO A CPU'S MR. PIBB*: GPUs are designed, at a hardware level, to work with floating-point vectors. This makes them very good at wrangling matrices. Not only can they do the math itself really fast, they are massively parallel in nature: they can do fast math to lots of things at once. (* Shoulda stayed in school, Pibb.)

  19. PHYSICS ENGINES: The math behind graphics and physics is fundamentally similar. Physics models are offloaded to GPUs, but they are coupled with animation/rendering, and the output is sent to the display.

  20-22. GPGPU COMPUTING IS BORN: Eventually, clever folk figure out a way to trick GPUs into sending their output back to program memory space instead of the render pipeline. GPU manufacturers don't fight this trend; they decide to play along: Nvidia CUDA, OpenCL.

     Notes: Nvidia CUDA is dedicated to using GPUs to aid in computation; the API and tools provide extremely low-level hardware access. OpenCL (Open Computing Language) targets general parallel architectures (CPU, GPU, etc.) and is less mature, and this shows in the performance of the libraries. OpenCL isn't *worse*, it's just different. If you need to be as close to the …

  23. NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE: The main program dispatches "kernels" to the GPU; the GPU works on the data and passes the result back.

     Notes: Kernels are basically maps in the sense of functional forms (the "apply-to-all" functional form). When a kernel is dispatched, it applies a single functional operation to all elements in a data set.

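     A minimal sketch of that dispatch pattern in CUDA C, using an illustrative "square every element" kernel (the kernel name, array size, and launch configuration are mine, not from the talk):

     #include <cuda_runtime.h>
     #include <stdio.h>
     #include <stdlib.h>

     // "Apply-to-all": every thread applies the same operation to one element.
     __global__ void square(float *x, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) x[i] = x[i] * x[i];
     }

     int main(void)
     {
         const int n = 1 << 20;
         float *h = (float *)malloc(n * sizeof(float));
         for (int i = 0; i < n; ++i) h[i] = (float)i;

         float *d;
         cudaMalloc(&d, n * sizeof(float));
         cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> GPU

         square<<<(n + 255) / 256, 256>>>(d, n);                       // dispatch the kernel

         cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // result back to host
         printf("%f\n", h[3]);                                         // prints 9.0
         cudaFree(d); free(h);
         return 0;
     }
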
  24. THREADS, BLOCKS, GRIDS: A CUDA launch is organized into blocks of threads and grids of blocks. Threads in the same block may communicate through fast on-chip shared memory; communication between blocks has to go through global device memory (slow). This enables "multidimensional" algorithm design.

     Notes: A thread is a single unit, blocks can be up to 3-D arrangements of threads, and grids can be up to 2-D arrangements of blocks. This allows the developer to set the topology of the data to be aligned with the memory and processor topology for efficiency.

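     A small sketch of that topology in CUDA, using an illustrative 2-D kernel over an image-like array (the kernel name, dimensions, and block size are mine, not from the talk):

     // 2-D "apply-to-all" over a width x height array: block and grid dimensions
     // map each thread to one (x, y) element.
     __global__ void scale2d(float *data, int width, int height, float s)
     {
         int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
         int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
         if (x < width && y < height)
             data[y * width + x] *= s;
     }

     // Host-side launch configuration: 16x16 threads per block, and enough
     // blocks in each dimension to cover the whole array.
     // dim3 block(16, 16);
     // dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
     // scale2d<<<grid, block>>>(d_data, width, height, 2.0f);
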
  25. ALGORITHM DESIGN STRATEGIES ARE CHANGED DRASTICALLY: The transfer of (usually very large) chunks of data back and forth between the GPU and main RAM is now an algorithmic and overhead concern. GPUs can only hold so much in their onboard RAM; you don't get to page on a GPU. Alignment to word boundaries is suddenly a really big deal. The dimensionality of the problem domain needs to be at least roughly known up front. All of this is useless on small data sets.

     Notes: When data sets align to a word boundary, performance increases; in many cases the data set will be strided or padded to enforce this. Even better is when the number of elements to be operated on is a power of 2. For example, in some of my old work, data sets in the ~300^2 to ~400^2 range would take longer to process than data that was exactly 512^2. This is an artifact of the hardware.

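     One common way to get the padding the notes describe is to let the CUDA runtime pad each row of a 2-D allocation to an aligned "pitch"; a minimal sketch (the array size and variable names are illustrative):

     #include <cuda_runtime.h>
     #include <stdio.h>

     int main(void)
     {
         // A 300-column row of floats is 1200 bytes; cudaMallocPitch pads each
         // row out to an alignment boundary so every row starts on an aligned address.
         const int width = 300, height = 300;
         float *d_img;
         size_t pitch;   // padded row size in bytes, chosen by the runtime

         cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);
         printf("requested row: %zu bytes, padded pitch: %zu bytes\n",
                width * sizeof(float), pitch);

         // Kernels then index element (x, y) as:
         //   float *row = (float *)((char *)d_img + y * pitch);
         //   float v = row[x];
         cudaFree(d_img);
         return 0;
     }
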
  26-30. SO HOW DO WE DESIGN GPGPU ALGORITHMS? Data set "vectorization." Memory padding. Minimize transfers. Revisit naïve algorithms.

     Notes: Vectorization -> transforming sets of independent "units" of information into an array or matrix representation. Padding -> stride arrays to word boundaries. Revisit naïve algorithms in the sense that many algorithms presented in an academic context as sub-optimal parallelize far better than their clever serial counterparts.

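     A sketch of what data set "vectorization" can look like in practice, using a hypothetical particle data set: the same independent units are laid out as flat arrays so one kernel pass sweeps contiguous memory (the type and field names are mine, not from the talk):

     // Array-of-structures: natural on the CPU, but neighbouring threads would
     // read memory strided by sizeof(Particle).
     struct Particle { float x, y, z, mass; };

     // Structure-of-arrays: each field is a flat device array of length n, so
     // neighbouring threads read neighbouring floats (coalesced access).
     struct ParticlesSoA { float *x, *y, *z, *mass; };

     __global__ void dropByGravity(ParticlesSoA p, int n, float g, float dt)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n)
             p.z[i] -= 0.5f * g * dt * dt;   // same displacement applied to every particle
     }
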
  31-33. EXAMPLES: Parallel map-reduce, useful for normalizations, averages, etc. Parallel sorting (!): an improvement over serial tree-sorting algorithms obtained by performing slower, naïve divide-and-conquer algorithms in a parallel manner.

     Notes: Map-reduce example: summation of a large quantity of numbers (such as a huge state vector) can be parallelized by performing the summation pairwise and iterating, with a thread dedicated to each pair; O(log n) steps vs. O(n). Parallel sorting: parallel radix sort, parallel tournament sort.

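     A minimal CUDA sketch of the pairwise summation described in the notes, covering one block's worth of the tree reduction (the block size, names, and the two-pass strategy are my assumptions):

     // Pairwise tree reduction within one block: each step halves the number of
     // active threads, so a block of 256 elements is summed in log2(256) = 8 steps.
     // One partial sum per block is written out; the partials can be summed by a
     // second, much smaller launch (or on the host).
     __global__ void blockSum(const float *in, float *blockSums, int n)
     {
         __shared__ float s[256];                     // one slot per thread in the block
         int tid = threadIdx.x;
         int i   = blockIdx.x * blockDim.x + tid;

         s[tid] = (i < n) ? in[i] : 0.0f;             // "map" step: load, padding with 0
         __syncthreads();

         for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
             if (tid < stride)
                 s[tid] += s[tid + stride];           // "reduce" step: add pairs
             __syncthreads();
         }
         if (tid == 0) blockSums[blockIdx.x] = s[0];  // one result per block
     }
     // Launched as, e.g.: blockSum<<<(n + 255) / 256, 256>>>(d_in, d_partialSums, n);
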
  34. BIG EXAMPLE: PARALLEL DOOLITTLE LU DECOMPOSITION WITH PIVOTING. LU decomposition refresher: the slide gives the formula for element (i, n) of the lower-triangular matrix L; the elements of U are created organically from the multiplications performed on A. For each column of A, calculate each of its elements l_i.

     Notes: The calculation step is the nested inner loop.

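     The formula itself did not survive the transcript; assuming the slide used the standard Doolittle form (unit diagonal in L), the entries produced at elimination step k are:

         u_{k,j} = a_{k,j} - \sum_{m=1}^{k-1} l_{k,m} \, u_{m,j},                          j = k, \dots, n

         l_{i,k} = \left( a_{i,k} - \sum_{m=1}^{k-1} l_{i,m} \, u_{m,k} \right) / u_{k,k},  i = k+1, \dots, n
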
  35. BUT LU DEPENDS ON THE PREVIOUS STEP… It relies on the results from the previous column; however, processing each *column* is an orthogonal problem. We can "vectorize" the computations to create each column in parallel, so only the outer loop over columns stays serial while all of the per-column work is done simultaneously.

     Notes: Because the only dependence on prior work in the algorithm lives in the outer loop, the inner loop can be parallelized so that all elements L(i, j) of a column are calculated simultaneously.

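     A sketch of that structure in CUDA: the serial outer loop over columns stays on the host, and each iteration launches one kernel that computes the column of multipliers and updates the trailing submatrix in parallel. Pivoting is omitted for brevity, and the row-major layout and names are my assumptions, so this is illustrative rather than the code from the talk.

     // In-place LU of a row-major n x n matrix: after the loop, L (unit diagonal
     // implied) sits below the diagonal and U on and above it.
     __global__ void eliminateColumn(float *A, int n, int k)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;  // one thread per row below the pivot
         if (i < n) {
             float l = A[i * n + k] / A[k * n + k];   // multiplier l(i, k)
             A[i * n + k] = l;                        // store L in place
             for (int j = k + 1; j < n; ++j)          // update this row of the trailing submatrix
                 A[i * n + j] -= l * A[k * n + j];
         }
     }

     // Host side: the only serial dependence is the loop over the n - 1 columns.
     // for (int k = 0; k < n - 1; ++k)
     //     eliminateColumn<<<(n - k - 1 + 255) / 256, 256>>>(d_A, n, k);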