A Brief Introduction to GPGPU Computing with Nvidia CUDA

A very brief and (mostly) nontechnical introduction, for non-computer scientists, to the origins and concepts behind GPGPU computing with Nvidia CUDA.

Originally presented internally as a Learning Lunch for the IHMC Robot Lab group (http://ihmc.us). These Learning Lunches are highly interactive sessions and are unfortunately not usually recorded, so a great deal of information from this one-hour talk is not in the slide deck. But the general idea is still there.

Original Keynote slides available here: http://d.pr/f/xwoH

Doug Stephen

June 07, 2012

Transcript

  1. HETEROGENEOUS COMPUTING With Nvidia CUDA

     Notes: Heterogeneous computing is the use of several distinct specialized hardware tools in a single computing system in order to accelerate complex calculations. This talk will deal with how GPGPU programming integrates with this idea.

  2-5. MOORE'S LAW: "The number of transistors on consumer integrated circuits doubles around every two years." (Source: Wikipedia.) The Intel "tick-tock." Limited by reality (physics is kind of a jackass).

     Notes: On tick-tock: Intel creates a new chip one year, then improves that same chip the next year, then a new chip, back and forth. Tick, tock. Limited by reality: thermal constraints, electromagnetic interference as transistors get closer together, etc.

  6. PARALLEL PROGRAMMING IS HARD, YOU GUYS. Decades of algorithms and theory are based on serial, iterative computing (cf. graph theory, memoization, dynamic programming, numerical analysis, linear programming…). Parallelism introduces nondeterministic bottlenecks (process-forking or thread-spawning overhead, scheduling, resource contention, time-slicing…). Data access and I/O become an algorithmic concern on top of being a performance-overhead concern.

  7-12. *OKAY, MAYBE NOT THAT BAD AT MATH: General-purpose CPUs have to do things other than math, and their hardware reflects this. Jack of all trades, master of none. CPUs are great with integers; too bad there aren't a lot of integers in the real world. CPUs are good at math. They're just not good at fast math.

     Notes: CPUs have to deal with general-purpose instruction sets, like moving data between registers and interfacing with a multitude of daughter-card interfaces. Some modern CPUs are starting to incorporate hardware that improves their performance on this kind of work (e.g. SSE), but it is still not as powerful as dedicated hardware.

  13-14. I'M GOING FAST!

     float Q_rsqrt( float number )
     {
         long i;
         float x2, y;
         const float threehalfs = 1.5F;

         x2 = number * 0.5F;
         y  = number;
         i  = * ( long * ) &y;                       // evil floating point bit level hacking
         i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
         y  = * ( float * ) &i;
         y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
      // y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

         return y;
     }

     This is the original "Fast Inverse Square Root" as it appears in Quake III, including original comments.

     Notes: Fast Inverse Square Root -> Fast *RECIPROCAL* Square Root (poorly named, IMO). Useful for normalizing vectors, etc.

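     GPUs expose this operation directly in hardware. As a minimal CUDA sketch of where the talk is headed (the kernel name, data layout, and launch configuration are illustrative, not from the talk), the built-in rsqrtf intrinsic can normalize an array of 3-D vectors:

     // Minimal sketch: normalize n xyz-vectors stored flat as (x0, y0, z0, x1, ...).
     // rsqrtf() is CUDA's hardware reciprocal square root, the operation Q_rsqrt approximates.
     __global__ void normalize3(float *v, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;     // one thread per vector
         if (i < n) {
             float x = v[3*i], y = v[3*i + 1], z = v[3*i + 2];
             float r = rsqrtf(x*x + y*y + z*z);             // 1 / length
             v[3*i] = x*r; v[3*i + 1] = y*r; v[3*i + 2] = z*r;
         }
     }
     // Launched as, e.g.: normalize3<<<(n + 255) / 256, 256>>>(d_vectors, n);
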
  15-17. WHAT THE F**K INDEED. This is graphics code, and graphics programming is deeply rooted in linear algebra and matrix theory. Graphics programmers get tired of patting themselves on the back for being "clever." The GPU is born.

  18. GPUs ARE THE DR. PEPPER TO A CPU'S MR. PIBB*: GPUs are designed, at a hardware level, to work with floating-point vectors. This makes them very good at wrangling matrices. Not only can they do the math itself really fast, they are massively parallel in nature: they can do fast math to lots of things at once. (* Shoulda stayed in school, Pibb.)

  19. PHYSICS ENGINES: The math behind graphics and physics is fundamentally similar. Physics models are offloaded to GPUs, but they are coupled with animation/rendering, and the output is sent to the display.

  20-22. GPGPU COMPUTING IS BORN: Eventually, clever folk figure out a way to trick GPUs into sending their output back to program memory space instead of the render pipeline. GPU manufacturers don't fight this trend; they decide to play along: Nvidia CUDA, OpenCL.

     Notes: Nvidia CUDA is dedicated to using GPUs to aid in computation; the API and tools provide extremely low-level hardware access. OpenCL (Open Computing Language) targets general parallel architectures (CPU, GPU, etc.) and is less mature, and this shows in the performance of the libraries. OpenCL isn't *worse*, it's just different. If you need to be as close to the …

  23. NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE: The main program dispatches "kernels" to the GPU; the GPU works on the data and passes the result back.

     Notes: Kernels are basically maps in the sense of functional forms (the "apply-to-all" functional form). When a kernel is dispatched, it applies a single functional operation to all elements in a data set.

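     A minimal sketch of that dispatch pattern in CUDA C, using an illustrative "square every element" kernel (the kernel name, array size, and launch configuration are mine, not from the talk):

     #include <cuda_runtime.h>
     #include <stdio.h>
     #include <stdlib.h>

     // "Apply-to-all": every thread applies the same operation to one element.
     __global__ void square(float *x, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) x[i] = x[i] * x[i];
     }

     int main(void)
     {
         const int n = 1 << 20;
         float *h = (float *)malloc(n * sizeof(float));
         for (int i = 0; i < n; ++i) h[i] = (float)i;

         float *d;
         cudaMalloc(&d, n * sizeof(float));
         cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> GPU

         square<<<(n + 255) / 256, 256>>>(d, n);                       // dispatch the kernel

         cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // result back to host
         printf("%f\n", h[3]);                                         // prints 9.0
         cudaFree(d); free(h);
         return 0;
     }
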
  24. THREADS, BLOCKS, GRIDS: A CUDA launch is organized into blocks of threads and grids of blocks. Threads in the same block may communicate through fast on-chip shared memory; communication between blocks has to go through global device memory (slow). This enables "multidimensional" algorithm design.

     Notes: A thread is a single unit, blocks can be up to 3-D arrangements of threads, and grids can be up to 2-D arrangements of blocks. This allows the developer to set the topology of the data to be aligned with the memory and processor topology for efficiency.

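     A small sketch of that topology in CUDA, using an illustrative 2-D kernel over an image-like array (the kernel name, dimensions, and block size are mine, not from the talk):

     // 2-D "apply-to-all" over a width x height array: block and grid dimensions
     // map each thread to one (x, y) element.
     __global__ void scale2d(float *data, int width, int height, float s)
     {
         int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
         int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
         if (x < width && y < height)
             data[y * width + x] *= s;
     }

     // Host-side launch configuration: 16x16 threads per block, and enough
     // blocks in each dimension to cover the whole array.
     // dim3 block(16, 16);
     // dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
     // scale2d<<<grid, block>>>(d_data, width, height, 2.0f);
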
  25. ALGORITHM DESIGN STRATEGIES ARE CHANGED DRASTICALLY: The transfer of (usually very large) chunks of data back and forth between the GPU and main RAM is now an algorithmic and overhead concern. GPUs can only hold so much in their onboard RAM; you don't get to page on a GPU. Alignment to word boundaries is suddenly a really big deal. The dimensionality of the problem domain needs to be at least roughly known up front. All of this is useless on small data sets.

     Notes: When data sets align to a word boundary, performance increases; in many cases the data set will be strided or padded to enforce this. Even better is when the number of elements to be operated on is a power of 2. For example, in some of my old work, data sets in the ~300^2 to ~400^2 range would take longer to process than data that was exactly 512^2. This is an artifact of the hardware.

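     One common way to get the padding the notes describe is to let the CUDA runtime pad each row of a 2-D allocation to an aligned "pitch"; a minimal sketch (the array size and variable names are illustrative):

     #include <cuda_runtime.h>
     #include <stdio.h>

     int main(void)
     {
         // A 300-column row of floats is 1200 bytes; cudaMallocPitch pads each
         // row out to an alignment boundary so every row starts on an aligned address.
         const int width = 300, height = 300;
         float *d_img;
         size_t pitch;   // padded row size in bytes, chosen by the runtime

         cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);
         printf("requested row: %zu bytes, padded pitch: %zu bytes\n",
                width * sizeof(float), pitch);

         // Kernels then index element (x, y) as:
         //   float *row = (float *)((char *)d_img + y * pitch);
         //   float v = row[x];
         cudaFree(d_img);
         return 0;
     }
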
  26-30. SO HOW DO WE DESIGN GPGPU ALGORITHMS? Data set "vectorization." Memory padding. Minimize transfers. Revisit naïve algorithms.

     Notes: Vectorization -> transforming sets of independent "units" of information into an array or matrix representation. Padding -> stride arrays to word boundaries. Revisit naïve algorithms in the sense that many algorithms presented in an academic context as sub-optimal parallelize far better than their clever serial counterparts.

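     A sketch of what data set "vectorization" can look like in practice, using a hypothetical particle data set: the same independent units are laid out as flat arrays so one kernel pass sweeps contiguous memory (the type and field names are mine, not from the talk):

     // Array-of-structures: natural on the CPU, but neighbouring threads would
     // read memory strided by sizeof(Particle).
     struct Particle { float x, y, z, mass; };

     // Structure-of-arrays: each field is a flat device array of length n, so
     // neighbouring threads read neighbouring floats (coalesced access).
     struct ParticlesSoA { float *x, *y, *z, *mass; };

     __global__ void dropByGravity(ParticlesSoA p, int n, float g, float dt)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n)
             p.z[i] -= 0.5f * g * dt * dt;   // same displacement applied to every particle
     }
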
  31-33. EXAMPLES: Parallel map-reduce, useful for normalizations, averages, etc. Parallel sorting (!): an improvement over serial tree-sorting algorithms obtained by performing slower, naïve divide-and-conquer algorithms in a parallel manner.

     Notes: Map-reduce example: summation of a large quantity of numbers (such as a huge state vector) can be parallelized by performing the summation pairwise and iterating, with a thread dedicated to each pair; O(log n) steps vs. O(n). Parallel sorting: parallel radix sort, parallel tournament sort.

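     A minimal CUDA sketch of the pairwise summation described in the notes, covering one block's worth of the tree reduction (the block size, names, and the two-pass strategy are my assumptions):

     // Pairwise tree reduction within one block: each step halves the number of
     // active threads, so a block of 256 elements is summed in log2(256) = 8 steps.
     // One partial sum per block is written out; the partials can be summed by a
     // second, much smaller launch (or on the host).
     __global__ void blockSum(const float *in, float *blockSums, int n)
     {
         __shared__ float s[256];                     // one slot per thread in the block
         int tid = threadIdx.x;
         int i   = blockIdx.x * blockDim.x + tid;

         s[tid] = (i < n) ? in[i] : 0.0f;             // "map" step: load, padding with 0
         __syncthreads();

         for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
             if (tid < stride)
                 s[tid] += s[tid + stride];           // "reduce" step: add pairs
             __syncthreads();
         }
         if (tid == 0) blockSums[blockIdx.x] = s[0];  // one result per block
     }
     // Launched as, e.g.: blockSum<<<(n + 255) / 256, 256>>>(d_in, d_partialSums, n);
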
  34. BIG EXAMPLE: PARALLEL DOOLITTLE LU DECOMPOSITION WITH PIVOTING. LU decomposition refresher: the slide gives the formula for element (i, n) of the lower-triangular matrix L; the elements of U are created organically from the multiplications performed on A. For each column of A, calculate each of its elements l_i.

     Notes: The calculation step is the nested inner loop.

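     The formula itself did not survive the transcript; assuming the slide used the standard Doolittle form (unit diagonal in L), the entries produced at elimination step k are:

         u_{k,j} = a_{k,j} - \sum_{m=1}^{k-1} l_{k,m} \, u_{m,j},                          j = k, \dots, n

         l_{i,k} = \left( a_{i,k} - \sum_{m=1}^{k-1} l_{i,m} \, u_{m,k} \right) / u_{k,k},  i = k+1, \dots, n
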
  35. BUT LU DEPENDS ON THE PREVIOUS STEP… It relies on the results from the previous column; however, processing each *column* is an orthogonal problem. We can "vectorize" the computations to create each column in parallel, so only the outer loop over columns stays serial while all of the per-column work is done simultaneously.

     Notes: Because the only dependence on prior work in the algorithm lives in the outer loop, the inner loop can be parallelized so that all elements L(i, j) of a column are calculated simultaneously.

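     A sketch of that structure in CUDA: the serial outer loop over columns stays on the host, and each iteration launches one kernel that computes the column of multipliers and updates the trailing submatrix in parallel. Pivoting is omitted for brevity, and the row-major layout and names are my assumptions, so this is illustrative rather than the code from the talk.

     // In-place LU of a row-major n x n matrix: after the loop, L (unit diagonal
     // implied) sits below the diagonal and U on and above it.
     __global__ void eliminateColumn(float *A, int n, int k)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;  // one thread per row below the pivot
         if (i < n) {
             float l = A[i * n + k] / A[k * n + k];   // multiplier l(i, k)
             A[i * n + k] = l;                        // store L in place
             for (int j = k + 1; j < n; ++j)          // update this row of the trailing submatrix
                 A[i * n + j] -= l * A[k * n + j];
         }
     }

     // Host side: the only serial dependence is the loop over the n - 1 columns.
     // for (int k = 0; k < n - 1; ++k)
     //     eliminateColumn<<<(n - k - 1 + 255) / 256, 256>>>(d_A, n, k);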